Topic: https://intoli.com/blog/neural-network-initialization/

unverified 6y, 276d ago

Hello Andre!

I found this article very useful. I like the graphs you made for the distribution of weights throughout the layers (testing many parameters).

Do you have reproducible code to share?

Thanks, Louis

andre 6y, 267d ago

Thanks Louis! I've put the code used to generate the plots into a folder in our article materials GitHub repository.

unverified 6y, 64d ago

Hello Andre! Thanks for sharing your knowledge; it was really helpful.

unverified 6y, 62d ago

Hello Andre, very good explanation with cool graphs :)

unverified 6y, 41d ago

Hello Andre, I really appreciate your post; it has been very useful.

I have a question about the post if you don't mind. Do you think this technique would be useful for initializing weights in Conv2D layers? Should I assume that the number of inputs into each neuron is the kernel size, so that for a 3x3 kernel sigma = sqrt(2/9)?

andre 6y, 34d ago [edited]

Thanks! Things should work fine for Conv2D layers. It sounds like you'd be using n_i = 9 here, but do keep in mind that most libraries do this for you automatically depending on the initializer you use. Check out Keras' glorot_uniform, for example.

I also posted some code you can use as a reference on our intoli-article-materials repo.
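
For a concrete starting point, here's a minimal sketch using TensorFlow's Keras API (the filter count is arbitrary), showing both the built-in initializer route and a manual normal initializer with sigma = sqrt(2/9), which corresponds to a 3x3 kernel with a single input channel:

```python
import numpy as np
from tensorflow import keras

# Option 1: let Keras compute the fan-in from the layer's shape.
# "he_normal" sets the weight stddev to sqrt(2 / fan_in) for you.
conv = keras.layers.Conv2D(
    filters=32,
    kernel_size=3,
    activation="relu",
    kernel_initializer="he_normal",
)

# Option 2: set sigma by hand using n_i = 9 (a 3x3 kernel with one input channel).
sigma = np.sqrt(2 / 9)
manual_conv = keras.layers.Conv2D(
    filters=32,
    kernel_size=3,
    activation="relu",
    kernel_initializer=keras.initializers.RandomNormal(stddev=sigma),
)
```

With more than one input channel, Keras counts the fan-in as kernel_height * kernel_width * input_channels, so the built-in initializer is usually the easier route.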

unverified 6y, 27d ago

Hello Andre! Thank you for this nice article. I was curious about the He initialization. In your article you write that, in short, we have to double the formula from the Xavier init. This makes sense to me. But how come we can also ignore the number of outgoing connections?

Greetings, Daniel

andre 6y, 26d ago [edited]

Hi Daniel, thanks for the question. This is actually explained a bit in the He initialization paper. Check out the comparison between equations (10) and (14) towards the end of page 4. The important part is:

if the initialization properly scales the backward signal, then this is also the case for the forward signal

So you could use either one.
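
If it helps to see that numerically, here's a rough numpy sketch (the depth and layer widths are arbitrary) that pushes random inputs through a deep ReLU network initialized with sqrt(2/fan_in) versus sqrt(2/fan_out) and prints the scale of the final activations; it stays well-behaved in both cases:

```python
import numpy as np

rng = np.random.default_rng(0)

def final_activation_std(widths, mode):
    """Push random inputs through a deep ReLU net and return the final activation std."""
    x = rng.standard_normal((1000, widths[0]))
    for n_in, n_out in zip(widths[:-1], widths[1:]):
        fan = n_in if mode == "fan_in" else n_out
        w = rng.standard_normal((n_in, n_out)) * np.sqrt(2.0 / fan)
        x = np.maximum(x @ w, 0.0)  # ReLU
    return x.std()

# 50 layers with mildly varying widths; neither scheme makes the
# activations explode or vanish exponentially with depth.
widths = [100, 80, 120] * 17
for mode in ("fan_in", "fan_out"):
    print(mode, final_activation_std(widths, mode))
```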

3gtsqPKl 5y, 320d ago [edited]

Hello Andre, I found this article very useful; thanks for sharing your knowledge.

I have a question: for the forward part, why …, not …?

Similarly for the backward part, why can we assume that the derivative of the activation function f is 0?

Cheers!

andre 5y, 286d ago

This is based on the simplifying assumption that the activation function behaves like f(X) = X, in which case the derivative is 1. It's definitely not applicable to all activation functions, but it still seems to give useful results regardless.
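
As a quick sanity check on that assumption (just a numpy sketch with arbitrary sizes): if the activation really is f(X) = X, the measured output variance of a layer matches the n * Var(w) * Var(x) relation the derivation relies on.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 512                                   # fan-in of the layer
x = rng.standard_normal((10000, n))       # inputs with roughly unit variance
w = rng.normal(0.0, 0.05, size=(n, 256))  # weights with Var(w) = 0.05 ** 2

y = x @ w  # with f(X) = X the activation changes nothing, and f'(X) = 1

print("measured: ", y.var())
print("predicted:", n * 0.05 ** 2 * x.var())
```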

unverified 5y, 287d ago

Hello Andre, it is not the harmonic mean, right? It is the reciprocal of the average of the sizes of two consecutive layers.

andre 5y, 286d ago

It's actually the harmonic mean, but of 1/n_i and 1/n_{i+1}, which get inverted in the denominator.
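
Writing that out explicitly, with n_i and n_{i+1} the sizes of consecutive layers:

```latex
\mathrm{HM}\!\left(\frac{1}{n_i}, \frac{1}{n_{i+1}}\right)
  = \frac{2}{\frac{1}{1/n_i} + \frac{1}{1/n_{i+1}}}
  = \frac{2}{n_i + n_{i+1}}
```

which is the same number as the reciprocal of the average of n_i and n_{i+1}, i.e. 1 / ((n_i + n_{i+1}) / 2), so both descriptions pick out the same value.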

unverified 5y, 199d ago

great article

unverified 5y, 167d ago

Hi, your article is quite helpful for my current research and the issues I've run into.

Have you by any chance read this paper - Dying ReLU and Initialization: Theory and Numerical Examples?

It seems to largely reduce the chance of getting a dying ReLU network, though the procedure looks quite complicated.
