Regularising neural networks

Generalisation for street fighters



TBD: I have not examined this stuff for a long time and it is probably out of date.

How do we get generalisation from neural networks? As in the rest of ML, it is presumably about controlling overfitting to the training set by some kind of regularisation.

Early stopping

See e.g. Prechelt (2012). Don’t keep training your model. This is the regularisation method that actually makes learning go faster, because you don’t bother doing as much of it. There is an interesting connection to NNs at scale.
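A minimal sketch of patience-based early stopping, assuming hypothetical `train_one_epoch` and `validate` helpers and a generic PyTorch-style `model`:

```python
import copy

# Sketch only: `model`, `train_one_epoch` and `validate` are assumed to exist.
patience, bad_epochs = 10, 0
best_loss, best_state = float("inf"), None

for epoch in range(200):                      # generous upper bound on epochs
    train_one_epoch(model)
    val_loss = validate(model)                # held-out loss, not training loss
    if val_loss < best_loss:
        best_loss = val_loss
        best_state = copy.deepcopy(model.state_dict())  # remember the best weights
        bad_epochs = 0
    else:
        bad_epochs += 1
    if bad_epochs >= patience:                # no improvement for `patience` epochs
        break

model.load_state_dict(best_state)             # roll back to the best checkpoint
```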

Stochastic weight averaging

See Izmailov et al. (2018). PyTorch’s introduction to Stochastic Weight Averaging has all the diagrams and references we could want. This also turns out to have an interesting connection to Bayesian posterior uncertainty.
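A rough sketch of the recipe as exposed in `torch.optim.swa_utils`, assuming an existing `model`, `optimizer`, `train_loader` and a hypothetical `train_one_epoch` helper (check the current PyTorch docs for the exact API):

```python
from torch.optim.swa_utils import AveragedModel, SWALR, update_bn

# Sketch only: `model`, `optimizer`, `train_loader` and `train_one_epoch`
# are assumed to exist already; SWA is layered on top of the usual loop.
swa_model = AveragedModel(model)               # running average of the weights
swa_scheduler = SWALR(optimizer, swa_lr=0.05)  # learning rate schedule for the SWA phase
swa_start = 75                                 # epoch at which averaging begins

for epoch in range(100):
    train_one_epoch(model, optimizer, train_loader)
    if epoch >= swa_start:
        swa_model.update_parameters(model)     # fold current weights into the average
        swa_scheduler.step()

update_bn(train_loader, swa_model)             # refresh batch-norm statistics for the averaged model
```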

Noise layers

See NN ensembles.

Input perturbation

Parametric noise applied to the data.
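For example, additive Gaussian noise on the inputs at training time. A sketch; the noise scale `sigma` is a hyperparameter you would have to tune, and whether it helps depends on the data:

```python
import torch

def perturb(x: torch.Tensor, sigma: float = 0.1) -> torch.Tensor:
    """Add zero-mean Gaussian noise to a batch of inputs during training."""
    return x + sigma * torch.randn_like(x)

# Inside the training loop (sketch): train on the perturbed batch only.
#   loss = criterion(model(perturb(x)), y)
```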

Weight penalties

\(L_1\), \(L_2\), dropout… These seem to be applied to weights, but rarely to actual neurons.

See Compressing neural networks for that latter use.
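A concrete sketch of the usual PyTorch idiom, assuming a generic `model`, `criterion`, `inputs` and `targets`: the \(L_2\) penalty is folded into the optimiser as `weight_decay`, while an \(L_1\) penalty can be added to the loss by hand.

```python
import torch

# Sketch only: `model`, `criterion`, `inputs`, `targets` are assumed to exist.
# L2 penalty ("weight decay") folded into the optimiser:
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=1e-4)

# An explicit L1 penalty on all weights, added to the loss by hand:
l1_coef = 1e-5
optimizer.zero_grad()
loss = criterion(model(inputs), targets)
loss = loss + l1_coef * sum(p.abs().sum() for p in model.parameters())
loss.backward()
optimizer.step()
```

Note that with adaptive optimisers such as Adam, `weight_decay` is not quite the same thing as an \(L_2\) penalty on the loss; that distinction is what AdamW is about.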

Weight penalties are attractive, but the penalty coefficient is a potentially expensive hyperparameter to choose. Also, should we penalize each weight equally, or are there some expedient normalization schemes? For that, see the next section:

Normalization

Mario Lezcano, in the PyTorch Tutorials, mentions:

Regularizing deep-learning models is a surprisingly challenging task. Classical techniques such as penalty methods often fall short when applied on deep models due to the complexity of the function being optimized. This is particularly problematic when working with ill-conditioned models. Examples of these are RNNs trained on long sequences and GANs. A number of techniques have been proposed in recent years to regularize these models and improve their convergence. On recurrent models, it has been proposed to control the singular values of the recurrent kernel for the RNN to be well-conditioned. This can be achieved, for example, by making the recurrent kernel orthogonal. Another way to regularize recurrent models is via “weight normalization”. This approach proposes to decouple the learning of the parameters from the learning of their norms. To do so, the parameter is divided by its Frobenius norm and a separate parameter encoding its norm is learnt. A similar regularization was proposed for GANs under the name of “spectral normalization”. This method controls the Lipschitz constant of the network by dividing its parameters by their spectral norm, rather than their Frobenius norm.
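The machinery described in that quote lives in `torch.nn.utils.parametrizations` in recent PyTorch releases; a minimal sketch (version-dependent, so check the current API before relying on it):

```python
import torch
from torch import nn
from torch.nn.utils import parametrizations

# Keep an RNN's recurrent kernel orthogonal, so its singular values are all 1.
rnn = nn.RNN(input_size=32, hidden_size=64)
parametrizations.orthogonal(rnn, "weight_hh_l0")

# Spectral normalisation of a layer, dividing its weight by its spectral norm,
# which controls the layer's Lipschitz constant (as used in GAN discriminators).
disc_layer = parametrizations.spectral_norm(nn.Linear(64, 1), "weight")
```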

Weight Normalization

Pragmatically, controlling for variability in your data can be very hard in deep learning, so you might normalise it by the batch variance. Salimans and Kingma (2016) have a more satisfying approach to this.

We present weight normalization: a reparameterisation of the weight vectors in a neural network that decouples the length of those weight vectors from their direction. By reparameterizing the weights in this way we improve the conditioning of the optimization problem and we speed up convergence of stochastic gradient descent. Our reparameterisation is inspired by batch normalization but does not introduce any dependencies between the examples in a minibatch. This means that our method can also be applied successfully to recurrent models such as LSTMs and to noise-sensitive applications such as deep reinforcement learning or generative models, for which batch normalization is less well suited. Although our method is much simpler, it still provides much of the speed-up of full batch normalization. In addition, the computational overhead of our method is lower, permitting more optimization steps to be taken in the same amount of time.

They provide an open implementation for Keras, TensorFlow and Lasagne.
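PyTorch also ships this reparameterisation; a minimal sketch using the long-standing `torch.nn.utils.weight_norm` hook (newer releases prefer the equivalent `torch.nn.utils.parametrizations.weight_norm`):

```python
import torch
from torch import nn

# Reparameterise the layer's weight as w = g * v / ||v||, so the magnitude g
# and the direction v are learnt as separate parameters.
layer = nn.utils.weight_norm(nn.Linear(128, 64), name="weight")
print(layer.weight_g.shape, layer.weight_v.shape)  # norm and direction parameters
```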

Adversarial training

See GANs for one type of this.

References

Ba, Jimmy Lei, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. “Layer Normalization.” arXiv.
Bach, Francis. 2014. “Breaking the Curse of Dimensionality with Convex Neural Networks.” arXiv:1412.8690 [Cs, Math, Stat], December.
Bahadori, Mohammad Taha, Krzysztof Chalupka, Edward Choi, Robert Chen, Walter F. Stewart, and Jimeng Sun. 2017. “Neural Causal Regularization Under the Independence of Mechanisms Assumption.” arXiv:1702.02604 [Cs, Stat], February.
Baldi, Pierre, Peter Sadowski, and Zhiqin Lu. 2016. “Learning in the Machine: Random Backpropagation and the Learning Channel.” arXiv:1612.02734 [Cs], December.
Bardes, Adrien, Jean Ponce, and Yann LeCun. 2022. “VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning.” arXiv.
Bartlett, Peter L., Andrea Montanari, and Alexander Rakhlin. 2021. “Deep Learning: A Statistical Viewpoint.” Acta Numerica 30 (May): 87–201.
Baydin, Atilim Gunes, and Barak A. Pearlmutter. 2014. “Automatic Differentiation of Algorithms for Machine Learning.” arXiv:1404.7456 [Cs, Stat], April.
Belkin, Mikhail. 2021. “Fit Without Fear: Remarkable Mathematical Phenomena of Deep Learning Through the Prism of Interpolation.” Acta Numerica 30 (May): 203–48.
Belkin, Mikhail, Daniel Hsu, Siyuan Ma, and Soumik Mandal. 2019. “Reconciling Modern Machine-Learning Practice and the Classical Bias–Variance Trade-Off.” Proceedings of the National Academy of Sciences 116 (32): 15849–54.
Belkin, Mikhail, Siyuan Ma, and Soumik Mandal. 2018. “To Understand Deep Learning We Need to Understand Kernel Learning.” In International Conference on Machine Learning, 541–49.
Bengio, Yoshua. 2000. “Gradient-Based Optimization of Hyperparameters.” Neural Computation 12 (8): 1889–1900.
Dasgupta, Sakyasingha, Takayuki Yoshizumi, and Takayuki Osogami. 2016. “Regularized Dynamic Boltzmann Machine with Delay Pruning for Unsupervised Learning of Temporal Sequences.” arXiv:1610.01989 [Cs, Stat], September.
Domingos, Pedro. 2020. “Every Model Learned by Gradient Descent Is Approximately a Kernel Machine.” arXiv:2012.00152 [Cs, Stat], November.
Finlay, Chris, Jörn-Henrik Jacobsen, Levon Nurbekyan, and Adam M Oberman. n.d. “How to Train Your Neural ODE: The World of Jacobian and Kinetic Regularization.” In ICML, 14.
Gal, Yarin, and Zoubin Ghahramani. 2016. “A Theoretically Grounded Application of Dropout in Recurrent Neural Networks.” In arXiv:1512.05287 [Stat].
Golowich, Noah, Alexander Rakhlin, and Ohad Shamir. 2017. “Size-Independent Sample Complexity of Neural Networks.” arXiv:1712.06541 [Cs, Stat], December.
Graves, Alex. 2011. “Practical Variational Inference for Neural Networks.” In Proceedings of the 24th International Conference on Neural Information Processing Systems, 2348–56. NIPS’11. USA: Curran Associates Inc.
Hardt, Moritz, Benjamin Recht, and Yoram Singer. 2015. “Train Faster, Generalize Better: Stability of Stochastic Gradient Descent.” arXiv:1509.01240 [Cs, Math, Stat], September.
Im, Daniel Jiwoong, Michael Tao, and Kristin Branson. 2016. “An Empirical Analysis of the Optimization of Deep Network Loss Surfaces.” arXiv:1612.04010 [Cs], December.
Immer, Alexander, Matthias Bauer, Vincent Fortuin, Gunnar Rätsch, and Khan Mohammad Emtiyaz. 2021. “Scalable Marginal Likelihood Estimation for Model Selection in Deep Learning.” In Proceedings of the 38th International Conference on Machine Learning, 4563–73. PMLR.
Ioffe, Sergey, and Christian Szegedy. 2015. “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.” arXiv.
Izmailov, Pavel, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. 2018. “Averaging Weights Leads to Wider Optima and Better Generalization,” March.
Kawaguchi, Kenji, Leslie Pack Kaelbling, and Yoshua Bengio. 2017. “Generalization in Deep Learning.” arXiv:1710.05468 [Cs, Stat], October.
Kelly, Jacob, Jesse Bettencourt, Matthew James Johnson, and David Duvenaud. 2020. “Learning Differential Equations That Are Easy to Solve.” In.
Klambauer, Günter, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. 2017. “Self-Normalizing Neural Networks.” In Proceedings of the 31st International Conference on Neural Information Processing Systems, 972–81. Red Hook, NY, USA: Curran Associates Inc.
Koch, Parker, and Jason J. Corso. 2016. “Sparse Factorization Layers for Neural Networks with Limited Supervision.” arXiv:1612.04468 [Cs, Stat], December.
Lee, Jaehoon, Lechao Xiao, Samuel S. Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, and Jeffrey Pennington. 2019. “Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent.” In Advances in Neural Information Processing Systems, 8570–81.
Lobacheva, Ekaterina, Nadezhda Chirkova, and Dmitry Vetrov. 2017. “Bayesian Sparsification of Recurrent Neural Networks.” In Workshop on Learning to Generate Natural Language.
Loog, Marco, Tom Viering, Alexander Mey, Jesse H. Krijthe, and David M. J. Tax. 2020. “A Brief Prehistory of Double Descent.” Proceedings of the National Academy of Sciences 117 (20): 10625–26.
Maclaurin, Dougal, David Duvenaud, and Ryan Adams. 2015. “Gradient-Based Hyperparameter Optimization Through Reversible Learning.” In Proceedings of the 32nd International Conference on Machine Learning, 2113–22. PMLR.
Molchanov, Dmitry, Arsenii Ashukha, and Dmitry Vetrov. 2017. “Variational Dropout Sparsifies Deep Neural Networks.” In Proceedings of ICML.
Nguyen Xuan Vinh, Sarah Erfani, Sakrapee Paisitkriangkrai, James Bailey, Christopher Leckie, and Kotagiri Ramamohanarao. 2016. “Training Robust Models Using Random Projection.” In, 531–36. IEEE.
Nøkland, Arild. 2016. “Direct Feedback Alignment Provides Learning in Deep Neural Networks.” In Advances In Neural Information Processing Systems.
Pan, Wei, Hao Dong, and Yike Guo. 2016. “DropNeuron: Simplifying the Structure of Deep Neural Networks.” arXiv:1606.07326 [Cs, Stat], June.
Papyan, Vardan, Yaniv Romano, Jeremias Sulam, and Michael Elad. 2017. “Convolutional Dictionary Learning via Local Processing.” In Proceedings of the IEEE International Conference on Computer Vision, 5296–5304.
Perez, Carlos E. 2016. “Deep Learning: The Unreasonable Effectiveness of Randomness.” Medium (blog).
Prechelt, Lutz. 2012. “Early Stopping — But When?” In Neural Networks: Tricks of the Trade, edited by Grégoire Montavon, Geneviève B. Orr, and Klaus-Robert Müller, 53–67. Lecture Notes in Computer Science 7700. Springer Berlin Heidelberg.
Salimans, Tim, and Diederik P Kingma. 2016. “Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks.” In Advances in Neural Information Processing Systems 29, edited by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, 901–1. Curran Associates, Inc.
Santurkar, Shibani, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry. 2019. “How Does Batch Normalization Help Optimization?” arXiv:1805.11604 [Cs, Stat], April.
Scardapane, Simone, Danilo Comminiello, Amir Hussain, and Aurelio Uncini. 2016. “Group Sparse Regularization for Deep Neural Networks.” arXiv:1607.00485 [Cs, Stat], July.
Srinivas, Suraj, and R. Venkatesh Babu. 2016. “Generalized Dropout.” arXiv:1611.06791 [Cs], November.
Srivastava, Nitish, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. “Dropout: A Simple Way to Prevent Neural Networks from Overfitting.” The Journal of Machine Learning Research 15 (1): 1929–58.
Taheri, Mahsa, Fang Xie, and Johannes Lederer. 2020. “Statistical Guarantees for Regularized Neural Networks.” arXiv:2006.00294 [Cs, Math, Stat], May.
Xie, Bo, Yingyu Liang, and Le Song. 2016. “Diversity Leads to Generalization in Neural Networks.” arXiv:1611.03131 [Cs, Stat], November.
You, Zhonghui, Jinmian Ye, Kunming Li, and Ping Wang. 2018. “Adversarial Noise Layer: Regularize Neural Network By Adding Noise.” arXiv:1805.08000 [Cs], May.
Zhang, Chiyuan, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. 2017. “Understanding Deep Learning Requires Rethinking Generalization.” In Proceedings of ICLR.
