TBD: I have not examined this stuff for a long time and it is probably out of date.

How do we get generalisation from neural networks? As in all of ML, it is probably about controlling overfitting to the training set by some kind of regularisation.

## Early stopping

e.g. Prechelt (2012). Don't keep training your model. The one regularisation method that actually makes learning go faster, because you don't bother to do as much of it. There is an interesting connection to neural networks at scale.
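The mechanics are simple enough to sketch in a few lines of plain Python; `train_step` and `val_loss` here are hypothetical stand-ins for whatever your framework provides, not any library's actual API:

```python
def early_stopping_training(train_step, val_loss, max_epochs=100, patience=5):
    """Train until validation loss has not improved for `patience` epochs.

    `train_step(epoch)` runs one epoch of training (hypothetical callback);
    `val_loss(epoch)` returns the current validation loss.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch in range(max_epochs):
        train_step(epoch)
        loss = val_loss(epoch)
        if loss < best_loss:
            # validation loss improved; remember this checkpoint
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break  # validation loss has plateaued or risen; stop training
    return best_epoch, best_loss
```

In practice one would also snapshot the weights at `best_epoch` and restore them after stopping.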

## Stochastic weight averaging

Izmailov et al. (2018). PyTorch's introduction to Stochastic Weight Averaging has all the diagrams and references we could want. This also turns out to have an interesting connection to Bayesian posterior uncertainty.
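Stripped of framework bookkeeping (in practice one would reach for `torch.optim.swa_utils`), the core of SWA is just a running mean over the weight iterates that SGD visits late in training. A pure-Python sketch:

```python
class WeightAverager:
    """Running average of weight vectors, as in stochastic weight averaging."""

    def __init__(self):
        self.avg = None  # current averaged weights
        self.n = 0       # number of snapshots averaged so far

    def update(self, weights):
        """Fold one snapshot of the weights into the running mean."""
        self.n += 1
        if self.avg is None:
            self.avg = list(weights)
        else:
            # incremental mean: avg += (w - avg) / n
            self.avg = [a + (w - a) / self.n for a, w in zip(self.avg, weights)]
```

One would call `update` once per epoch (or per SWA cycle) during the tail of training, then swap `avg` in as the final weights.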

## Weight penalties

\(L_1\), \(L_2\), dropout… These seem to be applied to weights, but rarely to the neurons themselves.

See Compressing neural networks for the latter use.

This is attractive but introduces a hyperparameter that can be expensive to choose. Also, should we penalise each weight equally, or are there some expedient normalisation schemes? For that, see the next section.
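For concreteness, here is how an \(L_2\) penalty shows up in a plain SGD step: the penalty \(\lambda \|w\|^2\) adds \(2\lambda w_i\) to each gradient component, which is why it is also called "weight decay". A sketch, not any particular library's API:

```python
def sgd_step_l2(w, grad, lr=0.1, lam=0.01):
    """One SGD step on loss(w) + lam * ||w||^2.

    The penalty's gradient, 2 * lam * w_i, is added to each component
    of the loss gradient before the usual update.
    """
    return [wi - lr * (gi + 2 * lam * wi) for wi, gi in zip(w, grad)]
```

Note that even with a zero loss gradient the weights shrink geometrically towards zero, which is the "decay" in weight decay.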

## Normalization

Mario Lezcano, in the PyTorch Tutorials, mentions:

> Regularizing deep-learning models is a surprisingly challenging task. Classical techniques such as penalty methods often fall short when applied on deep models due to the complexity of the function being optimized. This is particularly problematic when working with ill-conditioned models. Examples of these are RNNs trained on long sequences and GANs. A number of techniques have been proposed in recent years to regularize these models and improve their convergence. On recurrent models, it has been proposed to control the singular values of the recurrent kernel for the RNN to be well-conditioned. This can be achieved, for example, by making the recurrent kernel orthogonal. Another way to regularize recurrent models is via "weight normalization". This approach proposes to decouple the learning of the parameters from the learning of their norms. To do so, the parameter is divided by its Frobenius norm and a separate parameter encoding its norm is learnt. A similar regularization was proposed for GANs under the name of "spectral normalization". This method controls the Lipschitz constant of the network by dividing its parameters by their spectral norm, rather than their Frobenius norm.
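The spectral normalization mentioned in that quote is easy to sketch: estimate the largest singular value of a weight matrix by power iteration, then divide the matrix by it so the corresponding linear map is (at most) 1-Lipschitz. A dependency-free toy version for small dense matrices (real implementations, e.g. `torch.nn.utils.spectral_norm`, cache the power-iteration vectors across steps):

```python
def _matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def _unit(v):
    """Normalise v to unit Euclidean length."""
    n = sum(x * x for x in v) ** 0.5
    return [x / n for x in v]

def spectral_norm(W, iters=100):
    """Estimate the largest singular value of W by power iteration."""
    v = [1.0] * len(W[0])
    Wt = [list(col) for col in zip(*W)]  # transpose
    for _ in range(iters):
        u = _unit(_matvec(W, v))   # left singular vector estimate
        v = _unit(_matvec(Wt, u))  # right singular vector estimate
    Wv = _matvec(W, v)
    return sum(x * x for x in Wv) ** 0.5

def spectrally_normalize(W):
    """Divide W by its spectral norm, making the layer 1-Lipschitz."""
    s = spectral_norm(W)
    return [[w / s for w in row] for row in W]
```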

### Weight Normalization

Pragmatically, controlling for variability in your data can be very hard in, e.g., deep learning, so you might normalise it by the batch variance. Salimans and Kingma (2016) have a more satisfying approach to this:

> We present weight normalization: a reparameterization of the weight vectors in a neural network that decouples the length of those weight vectors from their direction. By reparameterizing the weights in this way we improve the conditioning of the optimization problem and we speed up convergence of stochastic gradient descent. Our reparameterization is inspired by batch normalization but does not introduce any dependencies between the examples in a minibatch. This means that our method can also be applied successfully to recurrent models such as LSTMs and to noise-sensitive applications such as deep reinforcement learning or generative models, for which batch normalization is less well suited. Although our method is much simpler, it still provides much of the speed-up of full batch normalization. In addition, the computational overhead of our method is lower, permitting more optimization steps to be taken in the same amount of time.

They provide an open implementation for Keras, TensorFlow, and Lasagne.
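The reparameterisation itself is tiny: \(w = g \, v / \|v\|\), so the norm \(g\) and the direction \(v\) become separate learnable parameters. A minimal sketch for a single weight vector (in PyTorch this is what `torch.nn.utils.weight_norm` wires up for you):

```python
def weight_norm(v, g):
    """Return w = g * v / ||v||, so that ||w|| == g whatever the scale of v.

    `v` carries the direction, the scalar `g` carries the length; gradients
    flow to each separately under this reparameterisation.
    """
    nv = sum(x * x for x in v) ** 0.5
    return [g * x / nv for x in v]
```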

### Adversarial training

See GANs for one flavour of this.

## References

*arXiv:1412.8690 [cs, math, stat]*, December.

*arXiv:1702.02604 [cs, stat]*, February.

*arXiv:1612.02734 [cs]*, December.

*Acta Numerica* 30 (May): 87–201.

*arXiv:1404.7456 [cs, stat]*, April.

*Acta Numerica* 30 (May): 203–48.

*Proceedings of the National Academy of Sciences* 116 (32): 15849–54.

*International Conference on Machine Learning*, 541–49.

*Neural Computation* 12 (8): 1889–1900.

*arXiv:1610.01989 [cs, stat]*, September.

*arXiv:2012.00152 [cs, stat]*, November.

*ICML*, 14.

*arXiv:1512.05287 [stat]*.

*arXiv:1712.06541 [cs, stat]*, December.

*Proceedings of the 24th International Conference on Neural Information Processing Systems*, 2348–56. NIPS'11. USA: Curran Associates Inc.

*arXiv:1509.01240 [cs, math, stat]*, September.

*arXiv:1612.04010 [cs]*, December.

*Proceedings of the 38th International Conference on Machine Learning*, 4563–73. PMLR.

*arXiv:1710.05468 [cs, stat]*, October.

*Proceedings of the 31st International Conference on Neural Information Processing Systems*, 972–81. Red Hook, NY, USA: Curran Associates Inc.

*arXiv:1612.04468 [cs, stat]*, December.

*Advances in Neural Information Processing Systems*, 8570–81.

*Workshop on Learning to Generate Natural Language*.

*Proceedings of the National Academy of Sciences* 117 (20): 10625–26.

*Proceedings of the 32nd International Conference on Machine Learning*, 2113–22. PMLR.

*Proceedings of ICML*.

*Advances in Neural Information Processing Systems*.

*arXiv:1606.07326 [cs, stat]*, June.

*Proceedings of the IEEE International Conference on Computer Vision*, 5296–5304.

*Medium* (blog).

*Neural Networks: Tricks of the Trade*, edited by Grégoire Montavon, Geneviève B. Orr, and Klaus-Robert Müller, 53–67. Lecture Notes in Computer Science 7700. Springer Berlin Heidelberg.

*Advances in Neural Information Processing Systems 29*, edited by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, 901–9. Curran Associates, Inc.

*arXiv:1805.11604 [cs, stat]*, April.

*arXiv:1607.00485 [cs, stat]*, July.

*arXiv:1611.06791 [cs]*, November.

*The Journal of Machine Learning Research* 15 (1): 1929–58.

*arXiv:2006.00294 [cs, math, stat]*, May.

*arXiv:1611.03131 [cs, stat]*, November.

*arXiv:1805.08000 [cs]*, May.

*Proceedings of ICLR*.
