TBD: I have not examined this stuff for a long time and it is probably out of date.

How do we get generalisation from neural networks? As in the rest of ML, it is probably about controlling overfitting to the training set by some kind of regularisation. Some weird stuff goes on, though.

## Implicit regularisation

There are interesting studies of implicit regularisation in infinite-width neural networks.

## Early stopping

e.g. (Prechelt 2012). Don’t keep training your model once the validation error stops improving. The rare regularisation method that actually makes learning go faster, because you don’t bother to do as much of it.
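The basic recipe is a patience counter on a held-out validation loss. A minimal sketch — the function names and the toy overshooting “training” loop are my own illustrations, not from Prechelt:

```python
import numpy as np

def train_with_early_stopping(step, val_loss, max_epochs=1000, patience=5):
    """Stop once validation loss has not improved for `patience` epochs."""
    best_loss, best_epoch, since_best = np.inf, 0, 0
    for epoch in range(max_epochs):
        step()                      # one epoch of training
        loss = val_loss()           # evaluate on held-out data
        if loss < best_loss:
            best_loss, best_epoch, since_best = loss, epoch, 0
        else:
            since_best += 1
            if since_best >= patience:
                break               # validation loss has plateaued
    return best_epoch, best_loss

# Toy stand-in for training: each "epoch" pushes a parameter 0.1 further,
# eventually overshooting the value (2.0) that minimises validation loss.
w = [0.0]
def step():
    w[0] += 0.1
def val_loss():
    return (w[0] - 2.0) ** 2

best_epoch, best_loss = train_with_early_stopping(step, val_loss)
```

In practice you would also checkpoint the weights at `best_epoch` and restore them after stopping.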

## Noise layers

See NN ensembles.

### Input perturbation

A parametric noise layer, perturbing the inputs during training. If you are hip you will take this further and do it by…
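As a sketch, a train-time-only Gaussian noise layer might look like this (the `sigma` value and all names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_noise_layer(x, sigma=0.1, train=True):
    """Add zero-mean Gaussian noise at train time; pass through at test time."""
    if not train:
        return x
    return x + rng.normal(scale=sigma, size=x.shape)

x = np.ones((4, 3))
noisy = gaussian_noise_layer(x, sigma=0.1)      # perturbed copy of x
clean = gaussian_noise_layer(x, train=False)    # identity at test time
```

The train/test switch matters: the noise is a regulariser during fitting, not part of the learned function.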

## Regularisation penalties

\(L_1\), \(L_2\), dropout… These are typically applied to the weights, but only rarely to the activations of the neurons themselves.

See Compressing neural networks for that latter use.

This is attractive but leaves an expensive hyperparameter — the penalty coefficient — to choose.
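For concreteness, a hedged sketch of an \(L_2\) penalty folded into a plain gradient-descent loop for a ridge-style linear regression, where `lam` is exactly that expensive hyperparameter (all names and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
true_w = np.array([1.0, -2.0, 0.0, 0.0, 3.0])
y = X @ true_w + 0.1 * rng.normal(size=100)

lam = 0.1              # penalty coefficient: the hyperparameter to tune
lr = 0.01
w = np.zeros(5)
for _ in range(2000):
    grad = X.T @ (X @ w - y) / len(y)    # gradient of the data-fit term
    grad += lam * w                      # gradient of (lam / 2) * ||w||^2
    w -= lr * grad
# The penalty shrinks the fitted weights toward zero relative to true_w.
```

The same pattern (add `lam * w` to the gradient, a.k.a. weight decay) carries over to neural network weights.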

### Adversarial training

See GANs for one type of this.
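Another common type perturbs the training inputs adversarially rather than training a discriminator — e.g. fast gradient-sign (FGSM-style) perturbations. A minimal sketch for a linear logistic model, with the names and the `eps` budget assumed for illustration:

```python
import numpy as np

def fgsm_perturb(x, y, w, eps=0.3):
    """Perturb x by eps in the sign of the input-gradient of the logistic
    loss log(1 + exp(-y * (w @ x))) for a linear model with weights w."""
    margin = y * (x @ w)
    grad_x = -y * w / (1.0 + np.exp(margin))   # d loss / d x
    return x + eps * np.sign(grad_x)

w = np.array([1.0, -1.0])
x = np.array([0.5, -0.5])          # confidently classified: w @ x = 1.0
x_adv = fgsm_perturb(x, y=1.0, w=w)
# The perturbed point has a strictly smaller margin: the eps budget is
# spent in the most damaging per-coordinate direction.
```

Adversarial training then mixes such perturbed examples into the training batches.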

### Bayesian optimisation

Choose your regularisation hyperparameters optimally, even without fancy reversible learning, by designing optimal experiments to find the optimal loss. See Bayesian optimisation.

## Normalization

### Weight Normalization

Pragmatically, controlling for variability in your data can be very hard in, e.g., deep learning, so you might normalise it by the batch variance. Salimans and Kingma (2016) have a more satisfying approach to this:

> We present weight normalization: a reparameterisation of the weight vectors in a neural network that decouples the length of those weight vectors from their direction. By reparameterizing the weights in this way we improve the conditioning of the optimization problem and we speed up convergence of stochastic gradient descent. Our reparameterisation is inspired by batch normalization but does not introduce any dependencies between the examples in a minibatch. This means that our method can also be applied successfully to recurrent models such as LSTMs and to noise-sensitive applications such as deep reinforcement learning or generative models, for which batch normalization is less well suited. Although our method is much simpler, it still provides much of the speed-up of full batch normalization. In addition, the computational overhead of our method is lower, permitting more optimization steps to be taken in the same amount of time.

They provide an open implementation for Keras, TensorFlow and Lasagne.
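The reparameterisation itself is just \(\mathbf{w} = \frac{g}{\|\mathbf{v}\|}\mathbf{v}\). A NumPy sketch of the forward pass and the induced gradients — the variable names are mine, though the gradient formulas are the ones from the paper:

```python
import numpy as np

def weight_norm_forward(v, g, x):
    """Compute w = g * v / ||v|| and the pre-activation w @ x."""
    w = g * v / np.linalg.norm(v)
    return w @ x, w

def weight_norm_grads(v, g, grad_w):
    """Gradients w.r.t. g and v, given the gradient grad_w w.r.t. w."""
    norm_v = np.linalg.norm(v)
    w_dir = v / norm_v
    grad_g = grad_w @ w_dir
    grad_v = (g / norm_v) * (grad_w - grad_g * w_dir)
    return grad_g, grad_v

v = np.array([3.0, 4.0])           # ||v|| = 5
g = 2.0
x = np.array([1.0, 1.0])
y, w = weight_norm_forward(v, g, x)             # ||w|| == g by construction
grad_g, grad_v = weight_norm_grads(v, g, grad_w=np.array([1.0, 0.0]))
# grad_v is orthogonal to v: gradient steps on v only rotate the direction
# of w, while g alone controls its length.
```

That orthogonality is the point: length and direction get separate, better-conditioned updates.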

## References

Belkin, Mikhail, Daniel Hsu, Siyuan Ma, and Soumik Mandal. 2019. “Reconciling Modern Machine-Learning Practice and the Classical Bias–Variance Trade-Off.” *Proceedings of the National Academy of Sciences* 116 (32): 15849–54. https://doi.org/10.1073/pnas.1903070116.

Belkin, Mikhail, Siyuan Ma, and Soumik Mandal. 2018. “To Understand Deep Learning We Need to Understand Kernel Learning.” In *International Conference on Machine Learning*, 541–49. http://arxiv.org/abs/1802.01396.

*Neural Computation* 12 (8): 1889–1900. https://doi.org/10.1162/089976600300015187.

*ICML*, 14.

Graves, Alex. 2011. “Practical Variational Inference for Neural Networks.” In *Proceedings of the 24th International Conference on Neural Information Processing Systems*, 2348–56. NIPS’11. USA: Curran Associates Inc. https://papers.nips.cc/paper/4329-practical-variational-inference-for-neural-networks.pdf.

Lee, Jaehoon, Lechao Xiao, Samuel S. Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, and Jeffrey Pennington. 2019. “Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent.” In *Advances in Neural Information Processing Systems*, 8570–81. http://arxiv.org/abs/1902.06720.

Lobacheva, Ekaterina, Nadezhda Chirkova, and Dmitry Vetrov. 2017. “Bayesian Sparsification of Recurrent Neural Networks.” In *Workshop on Learning to Generate Natural Language*. http://arxiv.org/abs/1708.00077.

Loog, Marco, Tom Viering, Alexander Mey, Jesse H. Krijthe, and David M. J. Tax. 2020. “A Brief Prehistory of Double Descent.” *Proceedings of the National Academy of Sciences* 117 (20): 10625–26. https://doi.org/10.1073/pnas.2001875117.

Maclaurin, Dougal, David Duvenaud, and Ryan Adams. 2015. “Gradient-Based Hyperparameter Optimization Through Reversible Learning.” In *ICML*, 2113–22. http://www.jmlr.org/proceedings/papers/v37/maclaurin15.pdf.

Molchanov, Dmitry, Arsenii Ashukha, and Dmitry Vetrov. 2017. “Variational Dropout Sparsifies Deep Neural Networks.” In *Proceedings of ICML*. http://arxiv.org/abs/1701.05369.

Nøkland, Arild. 2016. “Direct Feedback Alignment Provides Learning in Deep Neural Networks.” In *Advances in Neural Information Processing Systems*. http://arxiv.org/abs/1609.01596.

*Proceedings of the IEEE International Conference on Computer Vision*, 5296–5304.

Prechelt, Lutz. 2012. “Early Stopping – But When?” In *Neural Networks: Tricks of the Trade*, edited by Grégoire Montavon, Geneviève B. Orr, and Klaus-Robert Müller, 53–67. Lecture Notes in Computer Science 7700. Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-35289-8_5.

Salimans, Tim, and Diederik P. Kingma. 2016. “Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks.” In *Advances in Neural Information Processing Systems 29*, edited by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, 901–9. Curran Associates, Inc. http://papers.nips.cc/paper/6114-weight-normalization-a-simple-reparameterization-to-accelerate-training-of-deep-neural-networks.pdf.

Srivastava, Nitish, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. “Dropout: A Simple Way to Prevent Neural Networks from Overfitting.” *The Journal of Machine Learning Research* 15 (1): 1929–58. http://www.jmlr.org/papers/volume15/srivastava14a.old/source/srivastava14a.pdf.

Zhang, Chiyuan, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. 2017. “Understanding Deep Learning Requires Rethinking Generalization.” In *Proceedings of ICLR*. http://arxiv.org/abs/1611.03530.
