Self-Normalizing Neural Networks

MrQuincle · on June 10, 2017

+ Problem: deep nets work fine if they are recurrent, but for feed-forward nets, depth doesn't seem to do the job.

+ Normalization is beneficial for learning (per-unit zero mean and unit variance). It can be batch normalization, layer normalization, or weight normalization (if trained layer for layer and previous layer normalized).

+ Perturbations from stochastic gradient descent and stochastic regularization (dropout) do not destroy the normalized properties for CNNs, but they do for feed-forward nets.

+ A self-normalizing net uses a mapping g: Ω -> Ω that maps the mean and variance of the activations from one layer to the next. Iteratively applying this mapping leads to a fixed point.

+ The activation function that does this is not a sigmoid, ReLU, etc. but a function that is linear for positive x and exponential in x for negative x: the scaled exponential linear unit (SELU).

+ Intuitively: for negative net inputs the variance is decreased, for positive net inputs the variance is increased.

+ For very negative values the variance decrease is stronger. For inputs close to zero the variance increase is stronger.

+ For large variance in one layer, the variance gets decreased more in the next layer, and vice versa.

+ Theorem 2 states that the variance can be bounded from above and hence there are no exploding gradients.

+ Theorem 3 states that the variance can be bounded from below and does not vanish.

+ Stochasticity is introduced by a variant on dropout called alpha dropout. This is a type of dropout that leaves mean and variance invariant.
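To make the mapping g and its fixed point concrete, here is a minimal sketch in plain Python. It assumes the pre-activation z is Gaussian with mean 0 and the same variance as the layer's inputs (that is, the weights of a unit sum to 0 and their squares sum to 1), which is a simplification of the setting in the paper. The function names `selu`, `g`, and `iterate` are made up for this illustration; the two constants are the SELU parameters.

```python
import math

# SELU constants (the paper quotes them rounded to 1.6733 and 1.0507).
ALPHA = 1.6732632423543772
LAMBDA = 1.0507009873554805


def selu(x):
    """Scaled exponential linear unit for a scalar input."""
    return LAMBDA * (x if x > 0 else ALPHA * math.expm1(x))


def std_normal_cdf(t):
    """CDF of the standard normal distribution."""
    return 0.5 * math.erfc(-t / math.sqrt(2.0))


def g(nu):
    """Return (mean, variance) of selu(z) for z ~ N(0, nu).

    Assumption: weights sum to 0 and their squares sum to 1, so the
    pre-activation has mean 0 and the same variance nu as the inputs.
    """
    s = math.sqrt(nu)
    # E[exp(z); z < 0] and E[exp(2z); z < 0] for z ~ N(0, s^2)
    e1 = math.exp(nu / 2.0) * std_normal_cdf(-s)
    e2 = math.exp(2.0 * nu) * std_normal_cdf(-2.0 * s)
    mean = LAMBDA * (s / math.sqrt(2.0 * math.pi) + ALPHA * (e1 - 0.5))
    second_moment = LAMBDA**2 * (nu / 2.0 + ALPHA**2 * (e2 - 2.0 * e1 + 0.5))
    return mean, second_moment - mean**2


def iterate(nu, layers):
    """Apply g repeatedly, as if passing through `layers` SELU layers."""
    mean = 0.0
    for _ in range(layers):
        mean, nu = g(nu)
    return mean, nu


if __name__ == "__main__":
    for start in (0.2, 3.0):
        print(start, iterate(start, layers=50))
```

Under these assumptions a variance that starts too high is pulled down and one that starts too low is pushed up, and both runs should settle near mean 0 and variance 1.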

I think the paper gives a nice view on handling gradients in deep nets.
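For the alpha dropout point above, a rough NumPy sketch of the idea: dropped units are set to the value SELU saturates at for very negative inputs rather than to zero, and an affine correction then restores the mean and variance. It assumes the incoming activations already have mean 0 and variance 1. The function name `alpha_dropout` and the parameter `keep_prob` are hypothetical, not taken from any library.

```python
import numpy as np

ALPHA = 1.6732632423543772
LAMBDA = 1.0507009873554805
# Value SELU approaches for very negative inputs.
ALPHA_PRIME = -LAMBDA * ALPHA


def alpha_dropout(x, keep_prob, rng):
    """Alpha dropout sketch; assumes x has mean 0 and variance 1."""
    q = keep_prob
    keep = rng.random(x.shape) < q
    # Dropped units are set to ALPHA_PRIME instead of 0.
    dropped = np.where(keep, x, ALPHA_PRIME)
    # Affine correction a * dropped + b that brings the result back to
    # mean 0 and variance 1 under the assumption above.
    a = (q + ALPHA_PRIME**2 * q * (1.0 - q)) ** -0.5
    b = -a * (1.0 - q) * ALPHA_PRIME
    return a * dropped + b


rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)  # stand-in for normalized activations
y = alpha_dropout(x, keep_prob=0.9, rng=rng)
print(y.mean(), y.var())
```

The printed mean and variance should stay close to 0 and 1, which is the invariance the summary refers to.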

visarga · on June 10, 2017

That's a great summary.

The promise of this work is that we can have fully connected nets 30 layers deep, or more. Up until now they didn't work for more than 2-3 layers in depth. The fully connected nets have been untamed and wild until now, but now they can be made to behave.

Now that it has been shown to be possible, in a few months we could see more solutions.

chuckbot · on June 10, 2017

In principle you're right, but at least for computer vision the number of layers you mention is a bit off. VGG16 worked well with 16 layers without any special handling. ResNet went to >150 layers by using shortcuts, which kind of cracked the problem already. This paper gives us more insight and maybe a more elegant solution.

edit: Just realized you said 2/3 _fully connected layers_, which is right. But for convolutions we needed skip connections, too, to get them to work. Any reason you single out fully connected layers?

jimfleming · on June 10, 2017

Regarding your edit, the authors of the paper in question focus on FNNs and note the reason in the paper:

> Both RNNs and CNNs can stabilize learning via weight sharing, therefore they are less prone to these perturbations. In contrast, FNNs trained with normalization techniques suffer from these perturbations and have high variance in the training error (see Figure 1).

Essentially FNNs stand to benefit more from this work than CNNs or RNNs.

guygurari · on June 10, 2017

That is the point of the paper: making deep fully-connected networks work.

gwern · on June 10, 2017

Reddit discussion: https://www.reddit.com/r/MachineLearning/comments/6g5tg1/r_s...

return0 · on June 10, 2017

They already have a tensorflow implementation of SELU https://github.com/bioinf-jku/SNNs

unixpickle · on June 13, 2017

I'm not sure I see why tanh couldn't be used to the same effect. If you use 1.6*tanh(x) as your activation function, it pushes small variances higher and high variances lower and gets you to a variance of ~1 after many layers. Obviously not as rigorous, just an observation.
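A quick way to check this observation numerically, as a sketch only: assume the pre-activations are Gaussian with mean 0 and the same variance as the previous layer's outputs, and iterate the variance of 1.6*tanh(z) using Gauss-Hermite quadrature. The function names are made up for this illustration.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

# Quadrature nodes/weights for expectations under a standard normal.
NODES, WEIGHTS = hermegauss(64)
WEIGHTS = WEIGHTS / WEIGHTS.sum()


def next_variance(var, scale=1.6):
    """Variance of scale*tanh(z) for z ~ N(0, var); the mean is 0 by symmetry."""
    z = np.sqrt(var) * NODES
    return float(np.sum(WEIGHTS * (scale * np.tanh(z)) ** 2))


def iterate_variance(var, layers, scale=1.6):
    """Apply the variance map once per layer."""
    for _ in range(layers):
        var = next_variance(var, scale)
    return var


if __name__ == "__main__":
    for start in (0.1, 5.0):
        print(start, iterate_variance(start, layers=30))
```

Under that Gaussian assumption, both a small and a large starting variance should end up close to 1, consistent with the comment. It says nothing about the mean, the bounds proven in the paper, or behaviour under dropout.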

daveguy · on June 10, 2017

Page 87 of the paper, Appendix A4.2 starts the comparison between problem sets.

Edits:

Looks impressive, best or near best on most, but I wish they had bolded best of set.

Still not sure how the regularization squares with the rapid precision fitting to the training set data in Figure 1.

nl · on June 11, 2017

That Appendix!

Next time someone claims people don't have a theoretical understanding of how NNs work point them at that.

sherjilozair · on June 11, 2017

And tell them to explain it to the rest of us too, since they're so cool and mathy.

posterboy · on June 11, 2017

As I understand it, the theory is not half of the issue; the amount of data processed and generated in training is too much to verify or comprehend manually. So research is currently adding printf debugging, to let the NN explain what it sees, instead of blindly trusting the results.