Topic: https://intoli.com/blog/neural-network-initialization/

unverified 6y, 276d ago

Hello Andre!

I found this article very useful. I like the graphs you made for the distribution of weights throughout the layers (testing many parameters).

Do you have reproducible code to share?

Thanks, Louis

andre 6y, 267d ago

Thanks Louis! I've put the code used to generate the plots into a folder in our article materials GitHub repository.

unverified 6y, 64d ago

Hello Andre! Thanks for sharing your knowledge; it was really helpful.

unverified 6y, 62d ago

Hello Andre, very good explanation with cool graphs :)

unverified 6y, 41d ago

Hello Andre, I really appreciate your post; it has been very useful.

I have a question about the post if you don't mind. Do you think this technique would be useful for initializing weights in Conv2D layers? Should I assume that the number of inputs into each neuron is the kernel size, so that for a 3x3 kernel sigma = sqrt(2/9)?

andre 6y, 34d ago [edited]

Thanks! Things should work fine for Conv2D layers. It sounds like you'd be using n_i = 9 here, but do keep in mind that most libraries do this for you automatically depending on the initializer you use. Check out Keras' glorot_uniform, for example.

I also posted some code you can use as a reference on our intoli-article-materials repo.
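
For a concrete starting point, here's a minimal sketch using TensorFlow's Keras API (the filter count is arbitrary), showing both the built-in initializer route and a manual normal initializer with sigma = sqrt(2/9), which corresponds to a 3x3 kernel with a single input channel:

```python
import numpy as np
from tensorflow import keras

# Option 1: let Keras compute the fan-in from the layer's shape.
# "he_normal" sets the weight stddev to sqrt(2 / fan_in) for you.
conv = keras.layers.Conv2D(
    filters=32,
    kernel_size=3,
    activation="relu",
    kernel_initializer="he_normal",
)

# Option 2: set sigma by hand using n_i = 9 (a 3x3 kernel with one input channel).
sigma = np.sqrt(2 / 9)
manual_conv = keras.layers.Conv2D(
    filters=32,
    kernel_size=3,
    activation="relu",
    kernel_initializer=keras.initializers.RandomNormal(stddev=sigma),
)
```

With more than one input channel, Keras counts the fan-in as kernel_height * kernel_width * input_channels, so the built-in initializer is usually the easier route.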

unverified 6y, 27d ago

Hello Andre! Thank you for this nice article. I was curious about the He initialization. In your article you write that, in short, we have to double the formula from the Xavier init. This makes sense to me. But how come we can also ignore the number of outgoing connections?

Greetings, Daniel

andre 6y, 26d ago [edited]

Hi Daniel, thanks for the question. This is actually explained a bit in the He initialization paper. Check out the comparison between equations (10) and (14) towards the end of page 4. The important part is:

if the initialization properly scales the backward signal, then this is also the case for the forward signal

So you could use either one.
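
If it helps to see that numerically, here's a rough numpy sketch (the depth and layer widths are arbitrary) that pushes random inputs through a deep ReLU network initialized with sqrt(2/fan_in) versus sqrt(2/fan_out) and prints the scale of the final activations; it stays well-behaved in both cases:

```python
import numpy as np

rng = np.random.default_rng(0)

def final_activation_std(widths, mode):
    """Push random inputs through a deep ReLU net and return the final activation std."""
    x = rng.standard_normal((1000, widths[0]))
    for n_in, n_out in zip(widths[:-1], widths[1:]):
        fan = n_in if mode == "fan_in" else n_out
        w = rng.standard_normal((n_in, n_out)) * np.sqrt(2.0 / fan)
        x = np.maximum(x @ w, 0.0)  # ReLU
    return x.std()

# 50 layers with mildly varying widths; neither scheme makes the
# activations explode or vanish exponentially with depth.
widths = [100, 80, 120] * 17
for mode in ("fan_in", "fan_out"):
    print(mode, final_activation_std(widths, mode))
```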

3gtsqPKl 5y, 320d ago [edited]

Hello Andre, I found this article very useful; thanks for sharing your knowledge.

I have a question: for the forward part, why …, not …?

Similarly for the backward part, why can we assume that the derivative of the activation function f is 0?

Cheers!

andre 5y, 286d ago

This is based on the simplifying assumption that the activation function behaves like f(X) = X, in which case the derivative is 1. It's definitely not applicable to all activation functions, but it still seems to give useful results regardless.
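
As a quick sanity check on that assumption (just a numpy sketch with arbitrary sizes): if the activation really is f(X) = X, the measured output variance of a layer matches the n * Var(w) * Var(x) relation the derivation relies on.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 512                                   # fan-in of the layer
x = rng.standard_normal((10000, n))       # inputs with roughly unit variance
w = rng.normal(0.0, 0.05, size=(n, 256))  # weights with Var(w) = 0.05 ** 2

y = x @ w  # with f(X) = X the activation changes nothing, and f'(X) = 1

print("measured: ", y.var())
print("predicted:", n * 0.05 ** 2 * x.var())
```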

unverified 5y, 287d ago

Hello Andre, it is not the harmonic mean, right? It is the reciprocal of the average of the sizes of two consecutive layers.

andre 5y, 286d ago

It's actually the harmonic mean, but of 1/n_i and 1/n_{i+1}, which get inverted in the denominator.
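
Writing that out explicitly, with n_i and n_{i+1} the sizes of consecutive layers:

```latex
\mathrm{HM}\!\left(\frac{1}{n_i}, \frac{1}{n_{i+1}}\right)
  = \frac{2}{\frac{1}{1/n_i} + \frac{1}{1/n_{i+1}}}
  = \frac{2}{n_i + n_{i+1}}
```

which is the same number as the reciprocal of the average of n_i and n_{i+1}, i.e. 1 / ((n_i + n_{i+1}) / 2), so both descriptions pick out the same value.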

unverified 5y, 199d ago

great article

unverified 5y, 167d ago

Hi, your article is quite helpful for my current research and the issues I've run into.

Have you by any chance read this paper - Dying ReLU and Initialization: Theory and Numerical Examples?

It seems to largely reduce the chance of getting a dying ReLU network, though the procedure looks quite complicated.
