Why does deep learning work?

Are we in the pocket of Big VRAM?

No time to frame this well, but there are a lot of versions of the question, so… pick one. The essential idea is that we say: Oh my, that deep learning model I just trained had terribly good performance compared with some simpler thing I tried. Can I make my model simpler and still get good results? Or is the overparameterization essential? Can I get a decent error bound? Can I learn anything about the underlying system by looking at the parameters I learned?

And the answer is not “yes” in any satisfying general sense. Pfft.

The SGD fitting process looks a lot like simulated annealing, and it feels like there should be some nice explanation from the statistical mechanics of simulated annealing. There are other connections to physics-driven annealing methods, physics-inspired Boltzmann machines etc. TBC. C&C the statistical mechanics of statistics. But it’s not the same, so fire up the paper mill!
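To make the annealing analogy concrete, here is a minimal toy sketch of my own (not taken from any of the cited papers): plain gradient descent on an asymmetric double-well loss, with additive Gaussian noise whose scale decays on a temperature schedule, standing in for SGD’s minibatch noise. The loss function and all the constants are illustrative assumptions, chosen only so the iterate has a chance to hop out of the shallower basin while the noise is hot.

```python
import numpy as np

def loss(x):
    # Asymmetric double well: minima near x ≈ -1.04 (global) and x ≈ 0.96 (local).
    return (x**2 - 1) ** 2 + 0.3 * x

def grad(x):
    return 4 * x**3 - 4 * x + 0.3

rng = np.random.default_rng(0)
x = 0.96  # start at the shallower (local) minimum
for t in range(1, 5001):
    temperature = 1.0 / np.sqrt(t)  # annealing schedule: noise decays over time
    x -= 0.01 * grad(x) + temperature * rng.normal(scale=0.1)

# Once the noise has cooled, the iterate has settled into one of the two wells.
print(round(x, 2))
```

Whether it lands in the deeper well depends on the seed and the schedule, which is rather the point: the analogy is suggestive, not a convergence proof.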

Proceed with caution, since there is a lot of messy thinking here. Here are some things I’d like to read, but whose inclusion here should not be taken as a recommendation. The common theme is using ideas from physics to understand deep learning and other directed graph learning methods.

Lin and Tegmark argue that statistical mechanics provides insight into deep learning, and into neuroscience (Lin and Tegmark 2016b, 2016a). Maybe on a similar tip, Natalie Wolchover summarises (Mehta and Schwab 2014). See also Charles H. Martin, Why Deep Learning Works II: The Renormalization Group.

Ignoring learnability, the pure function-approximation results are an interesting literature. The most recent thing I looked at here is (Grohs et al. 2019), which also surveys that literature. They have some suggestive results, for example, that for a fixed weight budget, scaling a network in depth is vastly more favourable than scaling it in width. The invariances they invoke seem to include the Lin and Tegmark stuff as a special case, or at least a goodly chunk of the invariances. I need to return to these results and work out who said what.
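The depth-beats-width intuition has a classic one-dimensional illustration, the sawtooth construction (due to Telgarsky, not to any paper in the list above): composing a width-2 ReLU “tent” layer with itself L times yields 2^L linear pieces from roughly 4L weights, whereas a single hidden layer of width w can produce at most w + 1 pieces, so the same weight budget spent on width gets you only O(L) pieces. A toy numerical check, with the grid size chosen so the dyadic breakpoints land exactly on grid points:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def hat(x):
    # One ReLU layer of width 2: the "tent" map on [0, 1].
    return 2 * relu(x) - 4 * relu(x - 0.5)

def deep_net(x, depth):
    # Composing the tent map `depth` times costs ~4*depth weights.
    for _ in range(depth):
        x = hat(x)
    return x

def count_linear_pieces(f, n=2**16 + 1):
    # Count maximal intervals on which f is affine, by detecting slope changes.
    # n - 1 is a power of two, so breakpoints at k/2^depth hit grid points exactly.
    x = np.linspace(0.0, 1.0, n)
    slopes = np.diff(f(x)) / np.diff(x)
    changes = np.abs(np.diff(slopes)) > 1e-6
    return int(changes.sum()) + 1

for depth in range(1, 6):
    print(depth, count_linear_pieces(lambda x: deep_net(x, depth)))
```

The piece count doubles with every extra layer, while a shallow network must pay for each new piece with fresh neurons. This is only a crude complexity proxy, not the invariance machinery of (Grohs et al. 2019), but it shows the flavour of why depth is cheap.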

Wiatowski et al. (Wiatowski, Grohs, and Bölcskei 2018; see also Shwartz-Ziv and Tishby 2017) argue that looking at neural networks as random fields with energy propagation dynamics provides some insight into how they work. Haber and Ruthotto (2018) leverage some similar insights to argue that you can improve NNs by looking at them as Hamiltonian ODEs.

Looking at a different part of the problem, Zeyuan Allen-Zhu, Yuanzhi Li and Zhao Song argue that the combination of overparameterization and SGD is the secret to training the net well (Allen-Zhu, Li, and Song 2018a, 2018b).

There is another strand again, which argues that much of deep learning is not so interesting after all, and that many claims must be read through a publication-bias filter — e.g. Piekniewski, Autopsy of a deep learning paper:

Machine learning sits somewhere in between [science and engineering]. There are examples of clear scientific papers (such as e.g. the paper that introduced the backprop itself) and there are examples of clearly engineering papers where a solution to a very particular practical problem is described. But the majority of them appear to be engineering, only they engineer for a synthetic measure on a more or less academic dataset. In order to show superiority some ad-hoc trick is being pulled out of nowhere (typically of extremely limited universality) and after some statistically non significant testing a victory is announced.

There is also the fourth kind of papers, which indeed contain an idea. The idea may even be useful, but it happens to be trivial. In order to cover up that embarrassing fact a heavy artillery of "academic engineering" is loaded again, such that overall the paper looks impressive.

Allen-Zhu, Zeyuan, Yuanzhi Li, and Zhao Song. 2018a. “On the Convergence Rate of Training Recurrent Neural Networks,” October. https://arxiv.org/abs/1810.12065.

———. 2018b. “A Convergence Theory for Deep Learning via over-Parameterization,” November. https://arxiv.org/abs/1811.03962.

Anderson, Alexander G., and Cory P. Berg. 2017. “The High-Dimensional Geometry of Binary Neural Networks,” May. http://arxiv.org/abs/1705.07199.

Arora, Sanjeev, Nadav Cohen, and Elad Hazan. 2018. “On the Optimization of Deep Networks: Implicit Acceleration by Overparameterization,” February. http://arxiv.org/abs/1802.06509.

Baldassi, Carlo, Christian Borgs, Jennifer T. Chayes, Alessandro Ingrosso, Carlo Lucibello, Luca Saglietti, and Riccardo Zecchina. 2016. “Unreasonable Effectiveness of Learning Neural Networks: From Accessible States and Robust Ensembles to Basic Algorithmic Schemes.” Proceedings of the National Academy of Sciences 113 (48): E7655–E7662. https://doi.org/10.1073/pnas.1608103113.

Barron, A. R. 1993. “Universal Approximation Bounds for Superpositions of a Sigmoidal Function.” IEEE Transactions on Information Theory 39 (3): 930–45. https://doi.org/10.1109/18.256500.

Belilovsky, Eugene, Michael Eickenberg, and Edouard Oyallon. 2019. “Greedy Layerwise Learning Can Scale to ImageNet.” In International Conference on Machine Learning, 583–93. PMLR. http://proceedings.mlr.press/v97/belilovsky19a.html.

Bölcskei, Helmut, Philipp Grohs, Gitta Kutyniok, and Philipp Petersen. 2019. “Optimal Approximation with Sparsely Connected Deep Neural Networks.” SIAM Journal on Mathematics of Data Science, February. https://doi.org/10.1137/18M118709X.

Chang, Bo, Lili Meng, Eldad Haber, Lars Ruthotto, David Begert, and Elliot Holtham. 2018. “Reversible Architectures for Arbitrarily Deep Residual Neural Networks.” In. http://arxiv.org/abs/1709.03698.

Choromanska, Anna, Mikael Henaff, Michael Mathieu, Gerard Ben Arous, and Yann LeCun. 2015. “The Loss Surfaces of Multilayer Networks.” In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, 192–204. http://proceedings.mlr.press/v38/choromanska15.html.

Dalalyan, Arnak S. 2017. “Further and Stronger Analogy Between Sampling and Optimization: Langevin Monte Carlo and Gradient Descent,” April. http://arxiv.org/abs/1704.04752.

Gilbert, Anna C., Yi Zhang, Kibok Lee, Yuting Zhang, and Honglak Lee. 2017. “Towards Understanding the Invertibility of Convolutional Neural Networks,” May. http://arxiv.org/abs/1705.08664.

Golowich, Noah, Alexander Rakhlin, and Ohad Shamir. 2017. “Size-Independent Sample Complexity of Neural Networks,” December. http://arxiv.org/abs/1712.06541.

Grohs, Philipp, Dmytro Perekrestenko, Dennis Elbrächter, and Helmut Bölcskei. 2019. “Deep Neural Network Approximation Theory,” January. https://arxiv.org/abs/1901.02220v1.

Haber, Eldad, and Lars Ruthotto. 2018. “Stable Architectures for Deep Neural Networks.” Inverse Problems 34 (1): 014004. https://doi.org/10.1088/1361-6420/aa9a90.

Haber, Eldad, Lars Ruthotto, Elliot Holtham, and Seong-Hwan Jun. 2017. “Learning Across Scales - A Multiscale Method for Convolution Neural Networks,” March. http://arxiv.org/abs/1703.02009.

Kawaguchi, Kenji, Leslie Pack Kaelbling, and Yoshua Bengio. 2017. “Generalization in Deep Learning,” October. http://arxiv.org/abs/1710.05468.

Lin, Henry W., and Max Tegmark. 2016a. “Critical Behavior from Deep Dynamics: A Hidden Dimension in Natural Language,” June. http://arxiv.org/abs/1606.06737.

———. 2016b. “Why Does Deep and Cheap Learning Work so Well?” August. http://arxiv.org/abs/1608.08225.

Lipton, Zachary C. 2016. “Stuck in a What? Adventures in Weight Space,” February. http://arxiv.org/abs/1602.07320.

Mandt, Stephan, Matthew D. Hoffman, and David M. Blei. 2017. “Stochastic Gradient Descent as Approximate Bayesian Inference.” JMLR, April. http://arxiv.org/abs/1704.04289.

Mehta, Pankaj, and David J. Schwab. 2014. “An Exact Mapping Between the Variational Renormalization Group and Deep Learning,” October. http://arxiv.org/abs/1410.3831.

Olah, Chris, Alexander Mordvintsev, and Ludwig Schubert. 2017. “Feature Visualization.” Distill 2 (11): e7. https://doi.org/10.23915/distill.00007.

Perez, Carlos E. n.d. “Deep Learning: The Unreasonable Effectiveness of Randomness.” Medium. https://medium.com/intuitionmachine/deep-learning-the-unreasonable-effectiveness-of-randomness-14d5aef13f87#.g5sjhxjrn.

Philipp, George, Dawn Song, and Jaime G. Carbonell. 2017. “Gradients Explode - Deep Networks Are Shallow - ResNet Explained,” December. http://arxiv.org/abs/1712.05577.

Rolnick, David, and Max Tegmark. 2017. “The Power of Deeper Networks for Expressing Natural Functions,” May. http://arxiv.org/abs/1705.05502.

Rosenfeld, Amir, and John K. Tsotsos. 2018. “Intriguing Properties of Randomly Weighted Networks: Generalizing While Learning Next to Nothing.”

Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams. 1986. “Learning Representations by Back-Propagating Errors.” Nature 323 (6088): 533–36. https://doi.org/10.1038/323533a0.

Ruthotto, Lars, and Eldad Haber. 2018. “Deep Neural Networks Motivated by Partial Differential Equations,” April. http://arxiv.org/abs/1804.04272.

Sagun, Levent, V. Ugur Guney, Gerard Ben Arous, and Yann LeCun. 2014. “Explorations on High Dimensional Landscapes,” December. http://arxiv.org/abs/1412.6615.

Shwartz-Ziv, Ravid, and Naftali Tishby. 2017. “Opening the Black Box of Deep Neural Networks via Information,” March. http://arxiv.org/abs/1703.00810.

Song, Le, Santosh Vempala, John Wilmes, and Bo Xie. 2017. “On the Complexity of Learning Neural Networks,” July. http://arxiv.org/abs/1707.04615.

Wiatowski, Thomas, and Helmut Bölcskei. 2015. “A Mathematical Theory of Deep Convolutional Neural Networks for Feature Extraction.” In Proceedings of IEEE International Symposium on Information Theory. http://arxiv.org/abs/1512.06293.

Wiatowski, Thomas, Philipp Grohs, and Helmut Bölcskei. 2018. “Energy Propagation in Deep Convolutional Neural Networks.” IEEE Transactions on Information Theory 64 (7): 1–1. https://doi.org/10.1109/TIT.2017.2756880.

Xie, Bo, Yingyu Liang, and Le Song. 2016. “Diversity Leads to Generalization in Neural Networks,” November. http://arxiv.org/abs/1611.03131.

Zhang, Chiyuan, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. 2017. “Understanding Deep Learning Requires Rethinking Generalization.” In Proceedings of ICLR. http://arxiv.org/abs/1611.03530.