+ Problem: deep nets work fine when they are recurrent, but for feed-forward nets, depth doesn't seem to help.
+ Normalization is beneficial for learning (per-unit zero mean and unit variance). It can be batch normalization, layer normalization, or weight normalization (if trained layer by layer with the previous layer normalized).
+ Perturbations from stochastic gradient descent and stochastic regularization (dropout) do not destroy the normalized properties for CNNs, but they do for feed-forward nets.
+ Self-normalizing net uses a mapping g: O -> O that maps mean and variance to the next layer for each observation. Iteratively applying this mapping leads to a fixed point.
+ The activation function that does this is not a sigmoid, ReLU, etc., but a function that is linear for positive x and exponential in x for negative x: the scaled exponential linear unit (SELU).
+ Intuitively: for negative net inputs the variance is decreased, for positive net inputs the variance is increased.
+ For very negative values the variance decrease is stronger. For inputs close to zero the variance increase is stronger.
+ For large variance in one layer, the variance gets decreased more in the next layer, and vice versa.
+ Theorem 2 states that the variance can be bounded from above, and hence there are no exploding gradients.
+ Theorem 3 states that the variance can be bounded from below and does not vanish.
+ Stochasticity is introduced by a variant on dropout called alpha dropout. This is a type of dropout that leaves mean and variance invariant.
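To make the fixed-point bullets concrete, here is a rough NumPy sketch (mine, not the paper's code). It pushes a batch through 30 random dense layers with SELU and tracks the activation mean and variance. The constants are the paper's SELU values; the layer width, batch size, and the zero-mean weight init with variance 1/n are assumptions I picked for the demo.

```python
import numpy as np

# SELU constants as given in the paper.
ALPHA = 1.6732632423543772
LAMBDA = 1.0507009873554805

def selu(x):
    """lambda * x for x > 0, lambda * alpha * (exp(x) - 1) otherwise."""
    # np.minimum keeps exp() from overflowing on the (unused) positive branch.
    return LAMBDA * np.where(x > 0, x, ALPHA * (np.exp(np.minimum(x, 0.0)) - 1.0))

def propagate(x, depth, rng):
    """Apply `depth` random dense layers + SELU; return (mean, var) after each layer."""
    stats = []
    n = x.shape[1]
    for _ in range(depth):
        # Assumed init: zero-mean Gaussian weights with variance 1/n.
        w = rng.normal(0.0, np.sqrt(1.0 / n), size=(n, n))
        x = selu(x @ w)
        stats.append((float(x.mean()), float(x.var())))
    return stats

rng = np.random.default_rng(0)
final_stats = {}
for input_std in (0.5, 2.0):  # start with too little and too much variance
    x0 = rng.normal(0.0, input_std, size=(256, 512))
    stats = propagate(x0, depth=30, rng=rng)
    final_stats[input_std] = stats[-1]
    print(f"input std {input_std}: layer 1 -> {stats[0]}, layer 30 -> {stats[-1]}")
```

In both runs the statistics should drift toward mean 0 and variance 1 regardless of the starting variance, which is the behaviour the mapping g is supposed to have. This only checks the forward pass on random weights; it says nothing about trained networks.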
I think the paper gives a nice view on handling gradients in deep nets.
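For the alpha dropout bullet, a small sketch of how I read it: dropped units are set to SELU's negative saturation value instead of zero, and an affine correction restores zero mean and unit variance. The function name and the `rate` parameter are my own; the correction factors `a` and `b` follow from requiring mean 0 and variance 1 on standardized input.

```python
import numpy as np

ALPHA = 1.6732632423543772
LAMBDA = 1.0507009873554805
ALPHA_PRIME = -LAMBDA * ALPHA  # value SELU saturates to for very negative inputs

def alpha_dropout(x, rate, rng):
    """Drop units to ALPHA_PRIME with probability `rate`, then rescale and shift."""
    keep = 1.0 - rate
    mask = rng.random(x.shape) < keep
    # For zero-mean, unit-variance x the masked output has
    #   mean = rate * ALPHA_PRIME
    #   var  = keep + ALPHA_PRIME**2 * keep * rate
    # so a and b below map it back to mean 0, variance 1.
    a = (keep + ALPHA_PRIME**2 * keep * rate) ** -0.5
    b = -a * ALPHA_PRIME * rate
    return a * np.where(mask, x, ALPHA_PRIME) + b

rng = np.random.default_rng(0)
x = rng.normal(size=1_000_000)
y = alpha_dropout(x, rate=0.1, rng=rng)
print("mean:", y.mean(), "var:", y.var())
```

Ordinary inverted dropout (zero the unit, divide by the keep probability) also keeps the mean at zero here, but it inflates the variance to 1/keep, which is what would push activations away from the fixed point.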
The promise of this work is that we can have fully connected nets 30 layers deep, or more. Up until now they didn't work for more than 2-3 layers in depth. The fully connected nets have been untamed and wild until now, but now they can be made to behave.
Now that it has been shown to be possible, in a few months we could see more solutions.
In principle you're right, but at least for computer vision the number of layers you mention is a bit off. VGG16 worked well with 16 layers without any special handling. ResNet went to >150 layers by using shortcuts, which kind of cracked the problem already. This paper gives us more insight and maybe a more elegant solution.
edit: Just realized you said 2/3 _fully connected layers_, which is right. But for convolutions we needed skip connections, too, to get them to work. Any reason you single out fully connected layers?
Regarding your edit, the authors of the paper in question focus on FNNs and note the reason in the paper:
> Both RNNs and CNNs can stabilize learning via weight sharing, therefore they are less prone to these perturbations. In contrast, FNNs trained with normalization techniques suffer from these perturbations and have high variance in the training error (see Figure 1).
Essentially FNNs stand to benefit more from this work than CNNs or RNNs.
I'm not sure I see why tanh couldn't be used to the same effect. If you use 1.6*tanh(x) as your activation function, it pushes small variances higher and high variances lower and gets you to a variance of ~1 after many layers. Obviously not as rigorous, just an observation.
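The 1.6*tanh(x) observation is easy to check numerically. A quick sketch, with the same caveats as any toy experiment: random zero-mean weights with variance 1/width, and the width, depth, and batch size are arbitrary choices of mine.

```python
import numpy as np

def scaled_tanh(x):
    """The activation proposed in the comment above."""
    return 1.6 * np.tanh(x)

def final_variance(act, input_std, depth=50, width=512, batch=256, seed=0):
    """Variance of the activations after `depth` random dense layers."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, input_std, size=(batch, width))
    for _ in range(depth):
        # Assumed init: zero-mean Gaussian weights with variance 1/width.
        w = rng.normal(0.0, np.sqrt(1.0 / width), size=(width, width))
        x = act(x @ w)
    return float(x.var())

# Start from a small and a large input variance and see where each ends up.
results = {std: final_variance(scaled_tanh, std) for std in (0.3, 3.0)}
for std, var in results.items():
    print(f"input std {std}: variance after 50 layers = {var:.3f}")
```

Both starting points should end up with variance close to 1, so the forward-pass part of the observation holds in this setup. It does not test the parts the paper actually proves, such as the bounds in Theorems 2 and 3 or robustness under dropout.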
As I understand it, the theory is not half of the issue; the amount of data processed and generated in training is too much to verify or comprehend manually. So research is currently adding printf debugging, to let the NN explain what it sees, instead of blindly trusting the results.