Comments
Hello Andre!
I found this article very useful. I like the graphs you made for the distribution of weights throughout the layers (testing many parameters).
Do you have reproducible code to share?
Thanks, Louis
Thanks Louis! I put the code used to generate the plots into a folder of our article materials GitHub repository.
Hello Andre! Thanks for sharing your knowledge, it was really helpful.
Hello Andre, Very good explanation with cool graphs :)
Hello Andre, I really appreciate your post, it has been very useful.
I have a question about the post, if you don't mind. Do you think this technique would be useful for initializing weights in Conv2D layers? Should I assume that the number of inputs into each neuron is the kernel size, so that, for example, a 3x3 kernel gives sigma = sqrt(2/9)?
Thanks! Things should work fine for Conv2D layers. It sounds like you'd be using n_i = 9 here, but do keep in mind that most libraries do this for you automatically depending on the initializer you use. Check out Keras' glorot_uniform for example.
I also posted some code you can use as a reference on our intoli-article-materials repo.
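As a rough sketch of the arithmetic in this exchange: Keras-style initializers count the fan-in of a convolution as kernel height x kernel width x input channels, so n_i = 9 holds for a 3x3 kernel only when the layer has a single input channel. The helper names below (`conv2d_fan_in`, `he_std`) are made up for illustration and are not part of any library.

```python
import math

def conv2d_fan_in(kernel_h, kernel_w, in_channels):
    """Inputs feeding one output unit of a 2D convolution (hypothetical helper)."""
    return kernel_h * kernel_w * in_channels

def he_std(fan_in):
    """Standard deviation sqrt(2 / fan_in), the formula from the question above."""
    return math.sqrt(2.0 / fan_in)

# 3x3 kernel on a single-channel input: n_i = 9, sigma = sqrt(2/9)
single = he_std(conv2d_fan_in(3, 3, 1))
# 3x3 kernel on a 3-channel (e.g. RGB) input: n_i = 27, sigma = sqrt(2/27)
rgb = he_std(conv2d_fan_in(3, 3, 3))
print(single, rgb)
```

In practice you would normally pass an initializer to the layer and let the library work out the fan-in from the kernel shape.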
Hello Andre! Thank you for this nice article. I was curious about the He initialization. In your article you write that we have to double the formula from the Xavier init (in short). This makes sense to me. But how come we also ignore the number of outgoing layers?
Greetings, Daniel
Hi Daniel, thanks for the question. This is actually explained a bit in the He initialization paper. Check out the comparison between equations (10) and (14) towards the end of page 4. The important part is:
"if the initialization properly scales the backward signal, then this is also the case for the forward signal"
So you could use either one.
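A small numerical check of the "either one" point, as a sketch under stated assumptions: dense ReLU layers stand in for the paper's convolutions, the layer widths are arbitrary, and `forward_scale` is a hypothetical helper. Scaling by fan-in should keep the forward signal roughly constant, while scaling by fan-out should change it only by a fixed factor (the ratio of two layer widths) that does not grow with depth.

```python
import numpy as np

def forward_scale(widths, mode, seed=0):
    """Mean squared pre-activation of the last layer divided by that of the first."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((256, widths[0]))
    first = last = None
    for n_in, n_out in zip(widths[:-1], widths[1:]):
        fan = n_in if mode == "fan_in" else n_out
        # He-style scaling: Var[w] = 2 / fan
        w = rng.standard_normal((n_in, n_out)) * np.sqrt(2.0 / fan)
        y = x @ w
        last = float(np.mean(y ** 2))
        if first is None:
            first = last
        x = np.maximum(y, 0.0)  # ReLU
    return last / first

widths = [256, 512, 256, 512, 256, 512, 256]
fan_in_ratio = forward_scale(widths, "fan_in")    # expected to stay near 1
fan_out_ratio = forward_scale(widths, "fan_out")  # expected near 512 / 256 = 2
print(fan_in_ratio, fan_out_ratio)
```

With fan-out scaling, the per-layer factors n_in / n_out telescope across the stack, which is why the result is a constant rather than something that vanishes or explodes as layers are added.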
Hello Andre, I found this article very useful, and thanks for sharing your knowledge.
I have a question: for forward part, why , not ?
Same for the backward part, why could we assume that the derivative of the activation function f is 0?
Cheers!
This is based on the simplifying assumption that the activation function behaves like f(X) = X in which case the derivative is 1. So definitely not applicable to all activation functions, but it still seems to give useful results regardless.
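To see what the linearity assumption buys, here is a minimal sketch, assuming equal layer widths so that the Xavier variance 2 / (n_i + n_{i+1}) reduces to 1 / n. The function name `final_mean_square` and the depth and width values are illustrative choices. With f(X) = X the signal scale should be preserved, while with ReLU it should shrink by roughly half per layer, which is the gap the factor of 2 in He initialization closes.

```python
import numpy as np

def final_mean_square(activation, depth=8, width=512, seed=0):
    """Mean squared activation after `depth` layers with Var[w] = 1 / width."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((256, width))
    for _ in range(depth):
        w = rng.standard_normal((width, width)) * np.sqrt(1.0 / width)
        x = activation(x @ w)
    return float(np.mean(x ** 2))

linear = final_mean_square(lambda y: y)                 # f(X) = X, stays near 1
relu = final_mean_square(lambda y: np.maximum(y, 0.0))  # roughly 2 ** -8
print(linear, relu)
```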
Hello Andre, it is not the harmonic mean, right? It is the reciprocal of the average of two consecutive layers.
It's actually the harmonic mean, but of 1/n_i and 1/n_{i+1} which are inverted in the denominator.
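Spelled out, assuming the variance under discussion is the usual Xavier value 2 / (n_i + n_{i+1}), with H denoting the harmonic mean of two numbers:

```latex
\[
H(a, b) = \frac{2}{\frac{1}{a} + \frac{1}{b}},
\qquad
H\!\left(\frac{1}{n_i}, \frac{1}{n_{i+1}}\right)
  = \frac{2}{n_i + n_{i+1}}
  = \frac{1}{\tfrac{1}{2}\,(n_i + n_{i+1})}
\]
```

So the two readings describe the same quantity: the harmonic mean of 1/n_i and 1/n_{i+1} is exactly the reciprocal of the average of n_i and n_{i+1}.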
great article
Hi, your article is quite helpful for my current research and the issues I ran into.
Have you by any chance read this paper - Dying ReLU and Initialization: Theory and Numerical Examples?
It seems to largely reduce the chance of getting a dying ReLU network. However, the procedure seems quite complicated.