Why does deep learning work?

Are we in the pocket of Big VRAM?

No time to frame this well, but there are a lot of versions of the question, so… pick one. The essential idea is that we say: Oh my, that deep learning model I just trained had terribly good performance compared with some simpler thing I tried. Can I make my model simpler and still get good results? Or is the overparameterization essential? Can I know a decent error bound? Can I learn anything about the underlying system by looking at the parameters I learned?

And the answer is not “yes” in any satisfying general sense. Pfft.

Synthetic tutorials

Magic of (stochastic) gradient descent

Going deep stochastically

The SGD fitting process looks like processes from statistical mechanics.

Proceed with caution, since there is a lot of messy thinking here. Here are some things I’d like to read, but whose inclusion should not be taken as a recommendation. The common theme is using ideas from physics to understand deep learning and other directed graph learning methods.

There are also arguments that SGD is doing some kind of MCMC simulation from the problem posterior (Mandt, Hoffman, and Blei 2017), as in NN ensembles, or that it is learning a kernel machine (Domingos 2020).
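To make the "SGD as a sampler" intuition concrete, here is a minimal sketch in plain Python. It is a toy of my own choosing, not the model analysed by Mandt, Hoffman, and Blei: constant-step SGD on the loss 0.5 * (theta - x)^2 with one fresh Gaussian sample per step. The iterates never settle; they hover around the optimum in a cloud whose variance can be worked out exactly for this recursion, namely step * sigma^2 / (2 - step). The function names and the chosen constants are assumptions made for illustration.

```python
import random
import statistics


def constant_step_sgd(data_sigma=2.0, step=0.1, n_steps=200_000, burn_in=1_000, seed=0):
    """Constant-step SGD on 0.5 * (theta - x)^2, one fresh sample x per step.

    Data are x ~ Normal(0, data_sigma^2), so the expected loss is minimised at theta = 0.
    Returns the iterates recorded after the burn-in period.
    """
    rng = random.Random(seed)
    theta = 5.0  # deliberately start far from the optimum
    trace = []
    for t in range(n_steps):
        x = rng.gauss(0.0, data_sigma)
        grad = theta - x          # stochastic gradient from a single sample
        theta -= step * grad      # constant step size, never decayed
        if t >= burn_in:
            trace.append(theta)
    return trace


def predicted_variance(data_sigma, step):
    """Stationary variance of the recursion theta <- (1 - step) * theta + step * x."""
    return step * data_sigma ** 2 / (2.0 - step)


trace = constant_step_sgd()
empirical = statistics.pvariance(trace)
predicted = predicted_variance(2.0, 0.1)
print(f"empirical variance {empirical:.4f}, predicted {predicted:.4f}")
```

In this toy the step size plays the role of a temperature: halving it roughly halves the spread of the iterates. Whether that cloud is a useful approximation to a posterior in a real network is exactly what the cited papers argue about.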

… with saddle points

tl;dr it looks like you need to worry about saddle points but you probably do not (Lee et al. 2017, 2016).
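A small sketch of why saddle points tend not to trap gradient descent, on a toy function I picked for illustration (it does not come from the cited papers): f(x, y) = 0.5*x^2 + 0.25*y^4 - 0.5*y^2 has a saddle at (0, 0) and minima at (0, 1) and (0, -1). Plain gradient descent from a random start slides past the saddle; only starts with y exactly 0, a measure-zero set, get stuck there.

```python
import random


def grad(x, y):
    """Gradient of f(x, y) = 0.5*x**2 + 0.25*y**4 - 0.5*y**2."""
    return x, y ** 3 - y


def gradient_descent(x, y, step=0.1, n_steps=1000):
    """Plain gradient descent with a fixed step size; returns the final point."""
    for _ in range(n_steps):
        gx, gy = grad(x, y)
        x, y = x - step * gx, y - step * gy
    return x, y


rng = random.Random(0)
x0, y0 = rng.uniform(-1, 1), rng.uniform(-1, 1)

# Random start: ends near one of the minima (0, 1) or (0, -1), by the sign of y0.
print(gradient_descent(x0, y0))

# Start exactly on the line y = 0: the y-gradient is zero forever, so the
# iterate converges to the saddle at (0, 0) instead of a minimum.
print(gradient_descent(0.7, 0.0))
```

The second call is the pathological case: it needs an initialisation that a continuous random draw hits with probability zero, which is the flavour of the "almost always" in those results.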

Magic of SGD+overparameterization

Looking at a different part of the problem, the combination of overparameterization and SGD is argued to be the secret (Allen-Zhu, Li, and Song 2018b):

Our main finding demonstrates that, for state-of-the-art network architectures such as fully-connected neural networks, convolutional networks (CNN), or residual networks (Resnet), assuming there are n training samples without duplication, as long as the number of parameters is polynomial in \(n\), first-order methods such as SGD can find global optima of the training objective efficiently, that is, with running time only polynomially dependent on the total number of parameters of the network.
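A toy version of that claim can be checked numerically. The sketch below is a simplified setting of my own, not the construction or proof in the quoted paper: a two-layer ReLU network with 100 hidden units (800 trainable weights) is fitted to 5 randomly labelled points by full-batch gradient descent. Only the hidden layer is trained and the output weights are fixed random signs; that choice, the sizes, the step size and the function name are all assumptions made to keep the example short.

```python
import math
import random


def train_wide_relu_net(n=5, d=8, width=100, steps=400, lr=0.3, seed=0):
    """Full-batch gradient descent on the hidden layer of a two-layer ReLU net.

    Returns (initial_loss, final_loss) for the loss 0.5 * sum_i (f(x_i) - y_i)^2.
    """
    rng = random.Random(seed)
    # n unit-norm inputs in d dimensions, with arbitrary random labels.
    xs = []
    for _ in range(n):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(c * c for c in v))
        xs.append([c / norm for c in v])
    ys = [rng.choice([-1.0, 1.0]) for _ in range(n)]
    # Hidden weights W are trained; output weights a stay fixed at +-1/sqrt(width).
    W = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(width)]
    a = [rng.choice([-1.0, 1.0]) / math.sqrt(width) for _ in range(width)]

    def forward(x):
        pre = [sum(wj * xj for wj, xj in zip(w, x)) for w in W]
        out = sum(ak * max(p, 0.0) for ak, p in zip(a, pre))
        return pre, out

    def loss():
        return 0.5 * sum((forward(x)[1] - y) ** 2 for x, y in zip(xs, ys))

    initial = loss()
    for _ in range(steps):
        grad = [[0.0] * d for _ in range(width)]
        for x, y in zip(xs, ys):
            pre, out = forward(x)
            residual = out - y
            for k in range(width):
                if pre[k] > 0.0:  # ReLU passes gradient only for active units
                    coef = residual * a[k]
                    for j in range(d):
                        grad[k][j] += coef * x[j]
        for k in range(width):
            for j in range(d):
                W[k][j] -= lr * grad[k][j]
    return initial, loss()


initial, final = train_wide_relu_net()
print(f"training loss: {initial:.4f} -> {final:.3e}")
```

With far more weights than data points, the training loss should collapse towards zero even though the labels are noise. That says nothing about generalization, which is the harder half of the question.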

Function approximation theory

Ignoring learnability, the pure function-approximation results are an interesting literature. If you can ignore that troublesome optimisation step, how general a thing can your neural network approximate as its depth and width and sparsity change? The most recent thing I looked at is (Elbrächter et al. 2021), which also has a survey of that literature. See also (Bölcskei et al. 2019; Wiatowski and Bölcskei 2015). They derive some suggestive results, for example, that scaling with the depth of the network is vastly more favourable than scaling with width for a fixed weight budget.
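One standard way to see the depth-versus-width gap is the sawtooth construction, sketched below. I am using it as a generic illustration rather than as the specific argument of the papers above. A "hat" function on [0, 1] needs two ReLU units; composing it `depth` times uses 2 * depth units but produces 2^depth linear pieces. A single-hidden-layer ReLU network on a scalar input with m units has at most m + 1 linear pieces, so matching the deep version would take on the order of 2^depth units. Function names here are illustrative.

```python
def relu(z):
    return max(z, 0.0)


def hat(x):
    """Tent map on [0, 1], written as one layer of two ReLU units."""
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)


def deep_sawtooth(x, depth):
    """Compose the two-unit hat layer `depth` times."""
    for _ in range(depth):
        x = hat(x)
    return x


def count_linear_pieces(depth, grid_bits=12):
    """Count linear pieces on [0, 1] via slope changes on a dyadic grid.

    All breakpoints are multiples of 2**-depth, so they land exactly on the
    grid as long as depth <= grid_bits, and the arithmetic is exact in floats.
    """
    n = 2 ** grid_bits
    ys = [deep_sawtooth(i / n, depth) for i in range(n + 1)]
    slopes = [b - a for a, b in zip(ys, ys[1:])]
    changes = sum(1 for s, t in zip(slopes, slopes[1:]) if s != t)
    return changes + 1


# Columns: depth, ReLU units used, linear pieces produced.
for depth in range(1, 7):
    print(depth, 2 * depth, count_linear_pieces(depth))
```

The unit count grows linearly with depth while the piece count doubles at every layer, which is the sense in which depth buys more expressive power per weight than width in this example.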

Crazy physics stuff I have not read

Wiatowski et al. (Wiatowski, Grohs, and Bölcskei 2018; Shwartz-Ziv and Tishby 2017) argue that looking at neural networks as random fields with energy propagation dynamics provides some insight into how they work. Haber and Ruthotto leverage some similar insights to argue you can improve NNs by looking at them as ODEs.

Lin and Tegmark argue that statistical mechanics provides insight into deep learning and neuroscience (Lin and Tegmark 2016b, 2016a). Maybe on a similar tip, Natalie Wolchover summarises (Mehta and Schwab 2014). See also Charles H Martin, Why Deep Learning Works II: the Renormalization Group.

There is also a bunch more filed under statistical mechanics of statistics.

There is nothing to see here

There is another school again, which argues that much of deep learning is not so interesting after all when you blur out the more hyperbolic claims with a publication bias filter, e.g. Piekniewski, Autopsy of a deep learning paper:

Machine learning sits somewhere in between [science and engineering]. There are examples of clear scientific papers (such as e.g. the paper that introduced the backprop itself) and there are examples of clearly engineering papers where a solution to a very particular practical problem is described. But the majority of them appear to be engineering, only they engineer for a synthetic measure on a more or less academic dataset. In order to show superiority some ad-hoc trick is being pulled out of nowhere (typically of extremely limited universality) and after some statistically non significant testing a victory is announced.

There is also the fourth kind of papers, which indeed contain an idea. The idea may even be useful, but it happens to be trivial. In order to cover up that embarrassing fact a heavy artillery of “academic engineering” is loaded again, such that overall the paper looks impressive.


  • Simon J.D. Prince’s new book Understanding Deep Learning (Prince 2022)
  • Gradient Dissent, a list of reasons that large backpropagation-trained networks might be worrisome. There are some interesting points in there, and some hyperbole. Also: If it were true that there are externalities from backprop networks (i.e. that they are a kind of methodological pollution that produces private benefits but public costs) then what kind of mechanisms should be applied to disincentivize them?
  • C&C Against Predictive Optimization.


Allen-Zhu, Zeyuan, Yuanzhi Li, and Zhao Song. 2018a. “On the Convergence Rate of Training Recurrent Neural Networks,” October.
Allen-Zhu, Zeyuan, Yuanzhi Li, and Zhao Song. 2018b. “A Convergence Theory for Deep Learning via Over-Parameterization,” November.
Anderson, Alexander G., and Cory P. Berg. 2017. “The High-Dimensional Geometry of Binary Neural Networks.” arXiv:1705.07199 [Cs], May.
Arora, Sanjeev, Nadav Cohen, and Elad Hazan. 2018. “On the Optimization of Deep Networks: Implicit Acceleration by Overparameterization.” arXiv:1802.06509 [Cs], February.
Baldassi, Carlo, Christian Borgs, Jennifer T. Chayes, Alessandro Ingrosso, Carlo Lucibello, Luca Saglietti, and Riccardo Zecchina. 2016. “Unreasonable Effectiveness of Learning Neural Networks: From Accessible States and Robust Ensembles to Basic Algorithmic Schemes.” Proceedings of the National Academy of Sciences 113 (48): E7655–62.
Barron, A.R. 1993. “Universal Approximation Bounds for Superpositions of a Sigmoidal Function.” IEEE Transactions on Information Theory 39 (3): 930–45.
Bartlett, Peter L., Andrea Montanari, and Alexander Rakhlin. 2021. “Deep Learning: A Statistical Viewpoint.” Acta Numerica 30 (May): 87–201.
Belilovsky, Eugene, Michael Eickenberg, and Edouard Oyallon. 2019. “Greedy Layerwise Learning Can Scale To ImageNet.” In International Conference on Machine Learning, 583–93. PMLR.
Belkin, Mikhail. 2021. “Fit Without Fear: Remarkable Mathematical Phenomena of Deep Learning Through the Prism of Interpolation.” Acta Numerica 30 (May): 203–48.
Berner, Julius, Philipp Grohs, Gitta Kutyniok, and Philipp Petersen. 2021. “The Modern Mathematics of Deep Learning.”
Bölcskei, Helmut, Philipp Grohs, Gitta Kutyniok, and Philipp Petersen. 2019. “Optimal Approximation with Sparsely Connected Deep Neural Networks.” SIAM Journal on Mathematics of Data Science 1 (1): 8–45.
Chang, Bo, Lili Meng, Eldad Haber, Lars Ruthotto, David Begert, and Elliot Holtham. 2018. “Reversible Architectures for Arbitrarily Deep Residual Neural Networks.” In arXiv:1709.03698 [Cs, Stat].
Choromanska, Anna, Mikael Henaff, Michael Mathieu, Gerard Ben Arous, and Yann LeCun. 2015. “The Loss Surfaces of Multilayer Networks.” In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, 192–204.
Chou, Hung-Hsu, Holger Rauhut, and Rachel Ward. 2023. “Robust Implicit Regularization via Weight Normalization.” arXiv.
Dalalyan, Arnak S. 2017. “Further and Stronger Analogy Between Sampling and Optimization: Langevin Monte Carlo and Gradient Descent.” arXiv:1704.04752 [Math, Stat], April.
Dauphin, Yann, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. 2014. “Identifying and Attacking the Saddle Point Problem in High-Dimensional Non-Convex Optimization.” In Advances in Neural Information Processing Systems 27, 2933–41. Curran Associates, Inc.
Domingos, Pedro. 2020. “Every Model Learned by Gradient Descent Is Approximately a Kernel Machine.” arXiv:2012.00152 [Cs, Stat], November.
Elbrächter, Dennis, Dmytro Perekrestenko, Philipp Grohs, and Helmut Bölcskei. 2021. “Deep Neural Network Approximation Theory.” IEEE Transactions on Information Theory 67 (5): 2581–2623.
Gilbert, Anna C., Yi Zhang, Kibok Lee, Yuting Zhang, and Honglak Lee. 2017. “Towards Understanding the Invertibility of Convolutional Neural Networks.” arXiv:1705.08664 [Cs, Stat], May.
Golowich, Noah, Alexander Rakhlin, and Ohad Shamir. 2017. “Size-Independent Sample Complexity of Neural Networks.” arXiv:1712.06541 [Cs, Stat], December.
Haber, Eldad, and Lars Ruthotto. 2018. “Stable Architectures for Deep Neural Networks.” Inverse Problems 34 (1): 014004.
Haber, Eldad, Lars Ruthotto, Elliot Holtham, and Seong-Hwan Jun. 2017. “Learning Across Scales - A Multiscale Method for Convolution Neural Networks.” arXiv:1703.02009 [Cs], March.
Halko, Nathan, Per-Gunnar Martinsson, and Joel A. Tropp. 2010. “Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions.” arXiv.
Hu, Hang, Zhao Song, Omri Weinstein, and Danyang Zhuo. 2022. “Training Overparametrized Neural Networks in Sublinear Time.” arXiv.
Im, Daniel Jiwoong, Michael Tao, and Kristin Branson. 2016. “An Empirical Analysis of the Optimization of Deep Network Loss Surfaces.” arXiv:1612.04010 [Cs], December.
Jentzen, Arnulf, Benno Kuckuck, and Philippe von Wurstemberger. 2023. “Mathematical Introduction to Deep Learning: Methods, Implementations, and Theory.” arXiv.
Jin, Chi, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, and Michael I. Jordan. 2017. “How to Escape Saddle Points Efficiently.” In PMLR, 1724–32.
Kawaguchi, Kenji, Leslie Pack Kaelbling, and Yoshua Bengio. 2017. “Generalization in Deep Learning.” arXiv:1710.05468 [Cs, Stat], October.
Khan, Mohammad Emtiyaz, and Håvard Rue. 2022. “The Bayesian Learning Rule.” arXiv.
Lee, Jason D., Ioannis Panageas, Georgios Piliouras, Max Simchowitz, Michael I. Jordan, and Benjamin Recht. 2017. “First-Order Methods Almost Always Avoid Saddle Points.” arXiv:1710.07406 [Cs, Math, Stat], October.
Lee, Jason D., Max Simchowitz, Michael I. Jordan, and Benjamin Recht. 2016. “Gradient Descent Converges to Minimizers.” arXiv:1602.04915 [Cs, Math, Stat], March.
Levy, Kfir Y. 2016. “The Power of Normalization: Faster Evasion of Saddle Points.” arXiv:1611.04831 [Cs, Math, Stat], November.
Lin, Henry W., and Max Tegmark. 2016a. “Critical Behavior from Deep Dynamics: A Hidden Dimension in Natural Language.” arXiv:1606.06737 [Cond-Mat], June.
Lin, Henry W., and Max Tegmark. 2016b. “Why Does Deep and Cheap Learning Work so Well?” arXiv:1608.08225 [Cond-Mat, Stat], August.
Lipton, Zachary C. 2016. “Stuck in a What? Adventures in Weight Space.” arXiv:1602.07320 [Cs], February.
Mandt, Stephan, Matthew D. Hoffman, and David M. Blei. 2017. “Stochastic Gradient Descent as Approximate Bayesian Inference.” JMLR, April.
Mehta, Pankaj, and David J. Schwab. 2014. “An Exact Mapping Between the Variational Renormalization Group and Deep Learning.” arXiv.
Neu, Gergely, Gintare Karolina Dziugaite, Mahdi Haghifam, and Daniel M. Roy. 2021. “Information-Theoretic Generalization Bounds for Stochastic Gradient Descent.” arXiv:2102.00931 [Cs, Stat], August.
Olah, Chris, Alexander Mordvintsev, and Ludwig Schubert. 2017. “Feature Visualization.” Distill 2 (11): e7.
Pascanu, Razvan, Yann N. Dauphin, Surya Ganguli, and Yoshua Bengio. 2014. “On the Saddle Point Problem for Non-Convex Optimization.” arXiv:1405.4604 [Cs], May.
Perez, Carlos E. 2016. “Deep Learning: The Unreasonable Effectiveness of Randomness.” Medium (blog).
Philipp, George, Dawn Song, and Jaime G. Carbonell. 2017. “Gradients Explode - Deep Networks Are Shallow - ResNet Explained.” arXiv:1712.05577 [Cs], December.
Prince, Simon J.D. 2022. Understanding Deep Learning. MIT Press.
Roberts, Daniel A. 2021a. “Why Is AI Hard and Physics Simple?” arXiv.
Roberts, Daniel A. 2021b. “SGD Implicitly Regularizes Generalization Error.” arXiv.
Roberts, Daniel A., Sho Yaida, and Boris Hanin. 2021. “The Principles of Deep Learning Theory.” arXiv:2106.10165 [Hep-Th, Stat], August.
Rolnick, David, and Max Tegmark. 2017. “The Power of Deeper Networks for Expressing Natural Functions.” arXiv:1705.05502 [Cs, Stat], May.
Rosenfeld, Amir, and John K. Tsotsos. 2018. “Intriguing Properties of Randomly Weighted Networks: Generalizing While Learning Next to Nothing.”
Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams. 1986. “Learning Representations by Back-Propagating Errors.” Nature 323 (6088): 533–36.
Ruthotto, Lars, and Eldad Haber. 2020. “Deep Neural Networks Motivated by Partial Differential Equations.” Journal of Mathematical Imaging and Vision 62 (3): 352–64.
Sagun, Levent, V. Ugur Guney, Gerard Ben Arous, and Yann LeCun. 2014. “Explorations on High Dimensional Landscapes.” arXiv:1412.6615 [Cs, Stat], December.
Shwartz-Ziv, Ravid, and Naftali Tishby. 2017. “Opening the Black Box of Deep Neural Networks via Information.” arXiv:1703.00810 [Cs], March.
Song, Le, Santosh Vempala, John Wilmes, and Bo Xie. 2017. “On the Complexity of Learning Neural Networks.” arXiv:1707.04615 [Cs], July.
Unser, Michael. 2019. “A Representer Theorem for Deep Neural Networks.” Journal of Machine Learning Research 20 (110): 30.
Wiatowski, Thomas, and Helmut Bölcskei. 2015. “A Mathematical Theory of Deep Convolutional Neural Networks for Feature Extraction.” In Proceedings of IEEE International Symposium on Information Theory.
Wiatowski, Thomas, Philipp Grohs, and Helmut Bölcskei. 2018. “Energy Propagation in Deep Convolutional Neural Networks.” IEEE Transactions on Information Theory 64 (7): 1–1.
Xie, Bo, Yingyu Liang, and Le Song. 2016. “Diversity Leads to Generalization in Neural Networks.” arXiv:1611.03131 [Cs, Stat], November.
Zhang, Chiyuan, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. 2017. “Understanding Deep Learning Requires Rethinking Generalization.” In Proceedings of ICLR.
Zhang, Chiyuan, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. 2021. “Understanding Deep Learning (Still) Requires Rethinking Generalization.” Communications of the ACM 64 (3): 107–15.
