Regularising neural networks

Generalisation for street fighters

TBD: I have not examined this stuff for a long time and it is probably out of date.

How do we get generalisation from neural networks? As in the rest of ML, it is presumably about controlling overfitting to the training set by some kind of regularisation. Some weird stuff goes on, though.

Implicit regularisation

There is interesting study of this in infinite-width neural networks (e.g. Lee et al. 2019).

Early stopping

See e.g. Prechelt (2012). Don’t keep training your model indefinitely. This is the one regularisation method that actually makes learning go faster, because you simply do less of it.
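The idea fits in a few lines. A minimal sketch, where `step` and `val_loss` are hypothetical callbacks (one training epoch, and the current validation loss), and we stop once validation loss has not improved for `patience` epochs:

```python
import numpy as np

def train_with_early_stopping(step, val_loss, max_epochs=100, patience=5):
    """Run `step()` each epoch; stop when `val_loss()` has not
    improved for `patience` consecutive epochs."""
    best, best_epoch = np.inf, 0
    for epoch in range(max_epochs):
        step()
        loss = val_loss()
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break  # stopped early: we are presumably overfitting now
    return best, best_epoch
```

In practice you would also checkpoint the weights at `best_epoch` and restore them afterwards, which most frameworks’ early-stopping callbacks do for you.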

Noise layers

See NN ensembles.

Input perturbation

A parametric noise layer applied to the inputs. If you are hip you will take this further and do it by…
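The baseline version is simple: inject zero-mean noise into the inputs during training only. A minimal sketch, with a hypothetical noise scale `sigma`:

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy(x, sigma=0.1, training=True):
    """Gaussian input-perturbation layer: add zero-mean noise with
    standard deviation `sigma` at training time; pass inputs through
    unchanged at evaluation time."""
    if not training:
        return x
    return x + rng.normal(scale=sigma, size=x.shape)
```

Making `sigma` itself learnable (or input-dependent) is where the parametric versions begin.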

Regularisation penalties

\(L_1\), \(L_2\), dropout… These penalties seem usually to be applied to the weights, but rarely to the actual neurons.

See Compressing neural networks for that latter use.

This is attractive but introduces an expensive penalty-weight hyperparameter to choose.
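The classic penalties are one-liners added to the training loss. A sketch, assuming the network’s weights are available as a list of arrays, with the penalty weight `lam` being the expensive hyperparameter in question:

```python
import numpy as np

def l2_penalty(weights, lam=1e-4):
    """Weight decay: lam times the sum of squared weights."""
    return lam * sum(np.sum(w ** 2) for w in weights)

def l1_penalty(weights, lam=1e-4):
    """Sparsity-inducing: lam times the sum of absolute weights."""
    return lam * sum(np.sum(np.abs(w)) for w in weights)
```

Either is added to the data-fit loss before differentiating; the gradient of the \(L_2\) term is what optimisers implement directly as “weight decay”.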

Adversarial training

See GANs for one type of this.

Bayesian optimisation

Choose your regularisation hyperparameters optimally, even without fancy reversible learning, by designing optimal experiments to find the minimum loss. See Bayesian optimisation.


Weight Normalization

Pragmatically, controlling for variability in your data can be very hard in, e.g., deep learning, so you might normalise it by the batch variance. Salimans and Kingma (2016) have a more satisfying approach to this:

We present weight normalization: a reparameterisation of the weight vectors in a neural network that decouples the length of those weight vectors from their direction. By reparameterizing the weights in this way we improve the conditioning of the optimization problem and we speed up convergence of stochastic gradient descent. Our reparameterisation is inspired by batch normalization but does not introduce any dependencies between the examples in a minibatch. This means that our method can also be applied successfully to recurrent models such as LSTMs and to noise-sensitive applications such as deep reinforcement learning or generative models, for which batch normalization is less well suited. Although our method is much simpler, it still provides much of the speed-up of full batch normalization. In addition, the computational overhead of our method is lower, permitting more optimization steps to be taken in the same amount of time.
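Concretely, the reparameterisation writes each weight vector as \(\mathbf{w} = g\,\mathbf{v}/\lVert\mathbf{v}\rVert\), so the scalar \(g\) carries the length and \(\mathbf{v}\) only the direction, and gradient descent is run on \(g\) and \(\mathbf{v}\) instead of \(\mathbf{w}\). A minimal numpy sketch of the reparameterisation itself (not their implementation):

```python
import numpy as np

def weight_norm(v, g):
    """Weight normalization: w = g * v / ||v||.
    `g` (a scalar) sets the length of w; `v` sets only its direction."""
    return g * v / np.linalg.norm(v)
```

The point is that however `v` drifts during optimisation, the norm of the effective weight vector is always exactly `|g|`, which is the decoupling described in the abstract.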

They provide an open implementation for Keras, TensorFlow, and Lasagne.


Bach, Francis. 2014. “Breaking the Curse of Dimensionality with Convex Neural Networks.” December 30, 2014.
Bahadori, Mohammad Taha, Krzysztof Chalupka, Edward Choi, Robert Chen, Walter F. Stewart, and Jimeng Sun. 2017. “Neural Causal Regularization Under the Independence of Mechanisms Assumption.” February 8, 2017.
Baldi, Pierre, Peter Sadowski, and Zhiqin Lu. 2016. “Learning in the Machine: Random Backpropagation and the Learning Channel.” December 8, 2016.
Baydin, Atilim Gunes, and Barak A. Pearlmutter. 2014. “Automatic Differentiation of Algorithms for Machine Learning.” April 28, 2014.
Belkin, Mikhail, Daniel Hsu, Siyuan Ma, and Soumik Mandal. 2019. “Reconciling Modern Machine-Learning Practice and the Classical Bias–Variance Trade-Off.” Proceedings of the National Academy of Sciences 116 (32): 15849–54.
Belkin, Mikhail, Siyuan Ma, and Soumik Mandal. 2018. “To Understand Deep Learning We Need to Understand Kernel Learning.” In International Conference on Machine Learning, 541–49.
Bengio, Yoshua. 2000. “Gradient-Based Optimization of Hyperparameters.” Neural Computation 12 (8): 1889–1900.
Dasgupta, Sakyasingha, Takayuki Yoshizumi, and Takayuki Osogami. 2016. “Regularized Dynamic Boltzmann Machine with Delay Pruning for Unsupervised Learning of Temporal Sequences.” September 22, 2016.
Domingos, Pedro. 2020. “Every Model Learned by Gradient Descent Is Approximately a Kernel Machine.” November 30, 2020.
Finlay, Chris, Jörn-Henrik Jacobsen, Levon Nurbekyan, and Adam M Oberman. n.d. “How to Train Your Neural ODE: The World of Jacobian and Kinetic Regularization.” In ICML, 14.
Gal, Yarin, and Zoubin Ghahramani. 2016. “A Theoretically Grounded Application of Dropout in Recurrent Neural Networks.” In Advances in Neural Information Processing Systems 29.
Golowich, Noah, Alexander Rakhlin, and Ohad Shamir. 2017. “Size-Independent Sample Complexity of Neural Networks.” December 18, 2017.
Graves, Alex. 2011. “Practical Variational Inference for Neural Networks.” In Proceedings of the 24th International Conference on Neural Information Processing Systems, 2348–56. NIPS’11. USA: Curran Associates Inc.
Hardt, Moritz, Benjamin Recht, and Yoram Singer. 2015. “Train Faster, Generalize Better: Stability of Stochastic Gradient Descent.” September 3, 2015.
Im, Daniel Jiwoong, Michael Tao, and Kristin Branson. 2016. “An Empirical Analysis of the Optimization of Deep Network Loss Surfaces.” December 12, 2016.
Kawaguchi, Kenji, Leslie Pack Kaelbling, and Yoshua Bengio. 2017. “Generalization in Deep Learning.” October 15, 2017.
Kelly, Jacob, Jesse Bettencourt, Matthew James Johnson, and David Duvenaud. 2020. “Learning Differential Equations That Are Easy to Solve.” In Advances in Neural Information Processing Systems 33.
Klambauer, Günter, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. 2017. “Self-Normalizing Neural Networks.” June 8, 2017.
Koch, Parker, and Jason J. Corso. 2016. “Sparse Factorization Layers for Neural Networks with Limited Supervision.” December 13, 2016.
Lee, Jaehoon, Lechao Xiao, Samuel S. Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, and Jeffrey Pennington. 2019. “Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent.” In Advances in Neural Information Processing Systems, 8570–81.
Lobacheva, Ekaterina, Nadezhda Chirkova, and Dmitry Vetrov. 2017. “Bayesian Sparsification of Recurrent Neural Networks.” In Workshop on Learning to Generate Natural Language.
Loog, Marco, Tom Viering, Alexander Mey, Jesse H. Krijthe, and David M. J. Tax. 2020. “A Brief Prehistory of Double Descent.” Proceedings of the National Academy of Sciences 117 (20): 10625–26.
Maclaurin, Dougal, David K. Duvenaud, and Ryan P. Adams. 2015. “Gradient-Based Hyperparameter Optimization Through Reversible Learning.” In ICML, 2113–22.
Molchanov, Dmitry, Arsenii Ashukha, and Dmitry Vetrov. 2017. “Variational Dropout Sparsifies Deep Neural Networks.” In Proceedings of ICML.
Nguyen Xuan Vinh, Sarah Erfani, Sakrapee Paisitkriangkrai, James Bailey, Christopher Leckie, and Kotagiri Ramamohanarao. 2016. “Training Robust Models Using Random Projection.” In, 531–36. IEEE.
Nøkland, Arild. 2016. “Direct Feedback Alignment Provides Learning in Deep Neural Networks.” In Advances In Neural Information Processing Systems.
Pan, Wei, Hao Dong, and Yike Guo. 2016. “DropNeuron: Simplifying the Structure of Deep Neural Networks.” June 23, 2016.
Papyan, Vardan, Yaniv Romano, Jeremias Sulam, and Michael Elad. 2017. “Convolutional Dictionary Learning via Local Processing.” In Proceedings of the IEEE International Conference on Computer Vision, 5296–5304.
Perez, Carlos E. 2016. “Deep Learning: The Unreasonable Effectiveness of Randomness.” Medium. November 6, 2016.
Prechelt, Lutz. 2012. “Early Stopping – But When?” In Neural Networks: Tricks of the Trade, edited by Grégoire Montavon, Geneviève B. Orr, and Klaus-Robert Müller, 53–67. Lecture Notes in Computer Science 7700. Springer Berlin Heidelberg.
Salimans, Tim, and Diederik P Kingma. 2016. “Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks.” In Advances in Neural Information Processing Systems 29, edited by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, 901–909. Curran Associates, Inc.
Santurkar, Shibani, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry. 2019. “How Does Batch Normalization Help Optimization?” April 14, 2019.
Scardapane, Simone, Danilo Comminiello, Amir Hussain, and Aurelio Uncini. 2016. “Group Sparse Regularization for Deep Neural Networks.” July 2, 2016.
Srinivas, Suraj, and R. Venkatesh Babu. 2016. “Generalized Dropout.” November 21, 2016.
Srivastava, Nitish, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. “Dropout: A Simple Way to Prevent Neural Networks from Overfitting.” The Journal of Machine Learning Research 15 (1): 1929–58.
Taheri, Mahsa, Fang Xie, and Johannes Lederer. 2020. “Statistical Guarantees for Regularized Neural Networks.” May 30, 2020.
Xie, Bo, Yingyu Liang, and Le Song. 2016. “Diversity Leads to Generalization in Neural Networks.” November 9, 2016.
Zhang, Chiyuan, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. 2017. “Understanding Deep Learning Requires Rethinking Generalization.” In Proceedings of ICLR.
