Why does deep learning work?

Are we in the pocket of Big VRAM?

No time to frame this well, but there are a lot of versions of the question, so… pick one. The essential idea is that we say: Oh my, that deep learning model I just trained had terribly good performance compared with some simpler thing I tried. Can I make my model simpler and still get good results? Or is the overparameterization essential? Can I know a decent error bound? Can I learn anything about the underlying system by looking at the parameters I learned?

And the answer is not “yes” in any satisfying general sense. Pfft.

Synthetic tutorials

Going deep stochastically

The SGD fitting process looks like processes from statistical mechanics.

Proceed with caution, since there is a lot of messy thinking here. Here are some things I’d like to read, but whose inclusion should not be taken as a recommendation. The common theme is using ideas from physics to understand deep learning and other directed graph learning methods.

There are also arguments that SGD is doing some kind of MCMC simulation from the problem posterior, as in NN ensembles, or that it is learning a kernel machine.
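To get a feel for the SGD-as-sampler argument, here is a minimal sketch under heavy assumptions of my own: a one-dimensional quadratic loss with curvature `curvature`, and gradient noise that is i.i.d. Gaussian with standard deviation `noise_std`. None of these names come from the papers listed here. In that toy setting, constant-step SGD is an AR(1) process, so it does not converge to the minimum; it settles into a stationary distribution whose variance is `lr * noise_std**2 / (curvature * (2 - lr * curvature))`.

```python
import numpy as np


def sgd_stationary_variance(curvature=1.0, noise_std=1.0, lr=0.1,
                            steps=200_000, burn_in=1_000, seed=0):
    """Run constant-step SGD on 0.5 * curvature * x**2 with noisy gradients.

    Returns the empirical variance of the iterates after burn-in.
    The Gaussian gradient noise is an assumption of this toy model.
    """
    rng = np.random.default_rng(seed)
    noise = noise_std * rng.standard_normal(steps)
    x = 0.0
    samples = np.empty(steps)
    for t in range(steps):
        # Noisy gradient = true gradient + zero-mean noise.
        x -= lr * (curvature * x + noise[t])
        samples[t] = x
    return samples[burn_in:].var()


def predicted_variance(curvature=1.0, noise_std=1.0, lr=0.1):
    """Stationary variance of the AR(1) recursion x <- (1 - lr*a) x - lr*noise."""
    return lr * noise_std**2 / (curvature * (2.0 - lr * curvature))


if __name__ == "__main__":
    print("empirical:", sgd_stationary_variance())
    print("predicted:", predicted_variance())
```

The learning rate plays the role of a temperature here: halve it and the iterates concentrate more tightly around the minimum. Whether real networks behave like this is exactly what the literature argues about.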

tl;dr: it looks like you need to worry about saddle points, but you probably do not.
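A toy illustration of why saddle points might matter less than feared, assuming a hand-picked two-dimensional surface `f(x, y) = 0.5*x**2 + 0.25*y**4 - 0.5*y**2` (my choice, not taken from any cited paper). It has a saddle at the origin and minima at `(0, ±1)`. Noise-free gradient descent started exactly on the ridge `y = 0` walks straight into the saddle, while adding a little gradient noise is enough to fall off it.

```python
import numpy as np


def grad(p):
    """Gradient of f(x, y) = 0.5*x**2 + 0.25*y**4 - 0.5*y**2."""
    x, y = p
    return np.array([x, y**3 - y])


def descend(p0, lr=0.1, steps=500, noise=0.0, seed=0):
    """Gradient descent with optional isotropic Gaussian gradient noise."""
    rng = np.random.default_rng(seed)
    p = np.array(p0, dtype=float)
    for _ in range(steps):
        p = p - lr * (grad(p) + noise * rng.standard_normal(2))
    return p


if __name__ == "__main__":
    # Start on the ridge: the y-gradient is exactly zero there.
    print("plain GD :", descend((1.0, 0.0)))              # ends at the saddle (0, 0)
    print("noisy GD :", descend((1.0, 0.0), noise=0.01))  # ends near (0, +1) or (0, -1)
```

The noise here stands in for minibatch noise in SGD. That substitution is the assumption doing all the work.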

Magic of SGD+overparameterization

Looking at a different part of the problem, the combination of overparameterization and SGD is argued to be the secret:

Our main finding demonstrates that, for state-of-the-art network architectures such as fully-connected neural networks, convolutional networks (CNN), or residual networks (Resnet), assuming there are n training samples without duplication, as long as the number of parameters is polynomial in $$n$$, first-order methods such as SGD can find global optima of the training objective efficiently, that is, with running time only polynomially dependent on the total number of parameters of the network.
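A much smaller cousin of that claim can be poked at numerically. The sketch below is not the setting of the quoted theorem: it assumes a two-layer ReLU network, full-batch gradient descent rather than SGD, fixed random output weights, unit-norm inputs, and arbitrary ±1 labels. All names (`fit_wide_relu_net`, `width`, and so on) are mine. With 40,000 trainable weights and 10 training points, plain gradient descent drives the training loss to essentially zero.

```python
import numpy as np


def fit_wide_relu_net(n=10, d=20, width=2000, lr=0.5, steps=1000, seed=0):
    """Train f(x) = a . relu(W x) / sqrt(width) on n random points.

    Only the hidden weights W are trained; the output signs a are fixed.
    Returns the list of squared-error training losses, one per step.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)  # unit-norm inputs
    y = rng.choice([-1.0, 1.0], size=n)            # arbitrary labels
    W = rng.standard_normal((width, d))            # trainable hidden weights
    a = rng.choice([-1.0, 1.0], size=width)        # fixed output weights

    losses = []
    for _ in range(steps):
        pre = X @ W.T                                    # (n, width) pre-activations
        out = np.maximum(pre, 0.0) @ a / np.sqrt(width)  # network outputs
        resid = out - y
        losses.append(0.5 * np.sum(resid**2))
        # Gradient of the loss with respect to W via the chain rule.
        active = (pre > 0).astype(float)
        grad_W = ((resid[:, None] * active) * a[None, :]).T @ X / np.sqrt(width)
        W -= lr * grad_W
    return losses


if __name__ == "__main__":
    losses = fit_wide_relu_net()
    print("initial loss:", losses[0])
    print("final loss  :", losses[-1])
```

Shrinking `width` towards `n` is a quick way to watch the easy-optimisation story start to fray.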

Function approximation theory

Ignoring learnability, the pure function-approximation results are an interesting literature. If you can ignore that troublesome optimisation step, how general a thing can your neural network approximate as its depth and width and sparsity change? The most recent thing I looked at also has a survey of that literature. There are some suggestive results, for example, that for a fixed weight budget, scaling the depth of a network is vastly more favourable than scaling its width.
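The classic toy example of depth beating width is the sawtooth construction, which is a standard textbook illustration rather than something taken from the papers listed here. A single "tent" layer needs two ReLU units. Composing it `depth` times gives a piecewise-linear function on [0, 1] with `2**depth` linear pieces from only `2 * depth` ReLU units, whereas a one-hidden-layer ReLU network on a scalar input with N units has at most N + 1 pieces. The function names are mine.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def tent_layer(x):
    """Two ReLU units: maps [0, 1] onto [0, 1] as a tent peaking at x = 0.5."""
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)


def deep_sawtooth(x, depth):
    """Compose the tent layer `depth` times (2 * depth ReLU units in total)."""
    for _ in range(depth):
        x = tent_layer(x)
    return x


def count_linear_pieces(depth, grid_bits=12):
    """Count linear pieces on [0, 1] by counting slope sign changes.

    The kinks sit at multiples of 2**-depth, so a dyadic grid with
    grid_bits > depth never straddles a kink.
    """
    x = np.linspace(0.0, 1.0, 2**grid_bits + 1)
    slopes = np.sign(np.diff(deep_sawtooth(x, depth)))
    return int(np.sum(slopes[1:] != slopes[:-1])) + 1


if __name__ == "__main__":
    for depth in range(1, 8):
        print(depth, "layers,", 2 * depth, "units,",
              count_linear_pieces(depth), "linear pieces")
```

So the piece count grows exponentially in depth but only linearly in width, at least for this hand-built function.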

Crazy physics stuff I have not read

Wiatowski et al. argue that looking at neural networks as random fields with energy propagation dynamics provides some insight into how they work. Haber and Ruthotto leverage some similar insights to argue you can improve NNs by looking at them as ODEs.
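The NN-as-ODE view is easiest to see in a residual network, where a layer `x + h * f(x)` is one forward Euler step for `dx/dt = f(x)`. The sketch below assumes the simplest possible case, chosen by me so that the ODE has a closed form: a linear layer function `f(x) = A @ x` with a fixed antisymmetric matrix `A` shared across layers. As the depth grows with the total "time" held fixed, the network output approaches the exact ODE flow.

```python
import numpy as np

# Antisymmetric layer matrix; the exact flow of dx/dt = A x is a rotation.
A = np.array([[0.0, 1.0],
              [-1.0, 0.0]])


def residual_network(x0, depth, total_time=1.0):
    """Apply `depth` residual layers x <- x + h * A x with h = total_time / depth."""
    h = total_time / depth
    x = np.array(x0, dtype=float)
    for _ in range(depth):
        x = x + h * (A @ x)  # one forward Euler step
    return x


def ode_solution(x0, total_time=1.0):
    """Exact solution expm(A * total_time) @ x0 for the matrix A above."""
    c, s = np.cos(total_time), np.sin(total_time)
    rotation = np.array([[c, s],
                         [-s, c]])
    return rotation @ np.array(x0, dtype=float)


def discretisation_error(depth, x0=(1.0, 0.0)):
    return float(np.linalg.norm(residual_network(x0, depth) - ode_solution(x0)))


if __name__ == "__main__":
    for depth in (10, 100, 1000):
        print(depth, "layers, error vs ODE flow:", discretisation_error(depth))
```

The error shrinks roughly in proportion to `1 / depth`, as expected for a first-order scheme. Real architectures have learned, layer-dependent, nonlinear `f`, so this is only the skeleton of the argument.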

Lin and Tegmark argue that statistical mechanics provides insight into deep learning and neuroscience. Maybe on a similar tip, there is a summary by Natalie Wolchover. Charles H Martin. Why Deep Learning Works II: the Renormalization Group.

There is also a bunch more filed under statistical mechanics of statistics.

There is nothing to see here

There is another school again, which argues that much of deep learning is not so interesting after all when you blur out the more hyperbolic claims with a publication bias filter. See e.g. Piekniewski, Autopsy of a deep learning paper:

Machine learning sits somewhere in between [science and engineering]. There are examples of clear scientific papers (such as e.g. the paper that introduced the backprop itself) and there are examples of clearly engineering papers where a solution to a very particular practical problem is described. But the majority of them appear to be engineering, only they engineer for a synthetic measure on a more or less academic dataset. In order to show superiority some ad-hoc trick is being pulled out of nowhere (typically of extremely limited universality) and after some statistically non significant testing a victory is announced.

There is also the fourth kind of papers, which indeed contain an idea. The idea may even be useful, but it happens to be trivial. In order to cover up that embarrassing fact a heavy artillery of “academic engineering” is loaded again, such that overall the paper looks impressive.

References

Allen-Zhu, Zeyuan, Yuanzhi Li, and Zhao Song. 2018a. October.
———. 2018b. November.
Anderson, Alexander G., and Cory P. Berg. 2017. arXiv:1705.07199 [Cs], May.
Baldassi, Carlo, Christian Borgs, Jennifer T. Chayes, Alessandro Ingrosso, Carlo Lucibello, Luca Saglietti, and Riccardo Zecchina. 2016. Proceedings of the National Academy of Sciences 113 (48): E7655–62.
Barron, A.R. 1993. IEEE Transactions on Information Theory 39 (3): 930–45.
Bartlett, Peter L., Andrea Montanari, and Alexander Rakhlin. 2021. Acta Numerica 30 (May): 87–201.
Belilovsky, Eugene, Michael Eickenberg, and Edouard Oyallon. 2019. In International Conference on Machine Learning, 583–93. PMLR.
Belkin, Mikhail. 2021. Acta Numerica 30 (May): 203–48.
Bölcskei, Helmut, Philipp Grohs, Gitta Kutyniok, and Philipp Petersen. 2019. SIAM Journal on Mathematics of Data Science, February.
Chang, Bo, Lili Meng, Eldad Haber, Lars Ruthotto, David Begert, and Elliot Holtham. 2018. In arXiv:1709.03698 [Cs, Stat].
Choromanska, Anna, Mikael Henaff, Michael Mathieu, Gerard Ben Arous, and Yann LeCun. 2015. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, 192–204.
Dalalyan, Arnak S. 2017. arXiv:1704.04752 [Math, Stat], April.
Dauphin, Yann, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. 2014. In Advances in Neural Information Processing Systems 27, 2933–41. Curran Associates, Inc.
Domingos, Pedro. 2020. arXiv:2012.00152 [Cs, Stat], November.
Gilbert, Anna C., Yi Zhang, Kibok Lee, Yuting Zhang, and Honglak Lee. 2017. arXiv:1705.08664 [Cs, Stat], May.
Golowich, Noah, Alexander Rakhlin, and Ohad Shamir. 2017. arXiv:1712.06541 [Cs, Stat], December.
Grohs, Philipp, Dmytro Perekrestenko, Dennis Elbrächter, and Helmut Bölcskei. 2019. arXiv:1901.02220 [Cs, Math, Stat], January.
Haber, Eldad, and Lars Ruthotto. 2018. Inverse Problems 34 (1): 014004.
Haber, Eldad, Lars Ruthotto, Elliot Holtham, and Seong-Hwan Jun. 2017. arXiv:1703.02009 [Cs], March.
Halko, Nathan, Per-Gunnar Martinsson, and Joel A. Tropp. 2010. arXiv.
Hu, Hang, Zhao Song, Omri Weinstein, and Danyang Zhuo. 2022. arXiv.
Im, Daniel Jiwoong, Michael Tao, and Kristin Branson. 2016. arXiv:1612.04010 [Cs], December.
Jin, Chi, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, and Michael I. Jordan. 2017. In PMLR, 1724–32.
Kawaguchi, Kenji, Leslie Pack Kaelbling, and Yoshua Bengio. 2017. arXiv:1710.05468 [Cs, Stat], October.
Lee, Jason D., Ioannis Panageas, Georgios Piliouras, Max Simchowitz, Michael I. Jordan, and Benjamin Recht. 2017. arXiv:1710.07406 [Cs, Math, Stat], October.
Lee, Jason D., Max Simchowitz, Michael I. Jordan, and Benjamin Recht. 2016. arXiv:1602.04915 [Cs, Math, Stat], March.
Levy, Kfir Y. 2016. arXiv:1611.04831 [Cs, Math, Stat], November.
Lin, Henry W., and Max Tegmark. 2016a. arXiv:1606.06737 [Cond-Mat], June.
———. 2016b. arXiv:1608.08225 [Cond-Mat, Stat], August.
Lipton, Zachary C. 2016. arXiv:1602.07320 [Cs], February.
Mandt, Stephan, Matthew D. Hoffman, and David M. Blei. 2017. JMLR, April.
Mehta, Pankaj, and David J. Schwab. 2014. arXiv:1410.3831 [Cond-Mat, Stat], October.
Neu, Gergely, Gintare Karolina Dziugaite, Mahdi Haghifam, and Daniel M. Roy. 2021. arXiv:2102.00931 [Cs, Stat], August.
Olah, Chris, Alexander Mordvintsev, and Ludwig Schubert. 2017. Distill 2 (11): e7.
Pascanu, Razvan, Yann N. Dauphin, Surya Ganguli, and Yoshua Bengio. 2014. arXiv:1405.4604 [Cs], May.
Perez, Carlos E. 2016. Medium (blog).
Philipp, George, Dawn Song, and Jaime G. Carbonell. 2017. arXiv:1712.05577 [Cs], December.
Prince, Simon J.D. 2022. Understanding Deep Learning. MIT Press.
Roberts, Daniel A., Sho Yaida, and Boris Hanin. 2021. arXiv:2106.10165 [Hep-Th, Stat], August.
Rolnick, David, and Max Tegmark. 2017. arXiv:1705.05502 [Cs, Stat], May.
Rosenfeld, Amir, and John K. Tsotsos. 2018. “Intriguing Properties of Randomly Weighted Networks: Generalizing While Learning Next to Nothing.”
Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams. 1986. Nature 323 (6088): 533–36.
Ruthotto, Lars, and Eldad Haber. 2018. arXiv:1804.04272 [Cs, Math, Stat], April.
Sagun, Levent, V. Ugur Guney, Gerard Ben Arous, and Yann LeCun. 2014. arXiv:1412.6615 [Cs, Stat], December.
Shwartz-Ziv, Ravid, and Naftali Tishby. 2017. arXiv:1703.00810 [Cs], March.
Song, Le, Santosh Vempala, John Wilmes, and Bo Xie. 2017. arXiv:1707.04615 [Cs], July.
Unser, Michael. 2019. Journal of Machine Learning Research 20 (110): 30.
Wiatowski, Thomas, and Helmut Bölcskei. 2015. In Proceedings of IEEE International Symposium on Information Theory.
Wiatowski, Thomas, Philipp Grohs, and Helmut Bölcskei. 2018. IEEE Transactions on Information Theory 64 (7): 1–1.
Xie, Bo, Yingyu Liang, and Le Song. 2016. arXiv:1611.03131 [Cs, Stat], November.
Zhang, Chiyuan, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. 2017. In Proceedings of ICLR.
———. 2021. Communications of the ACM 64 (3): 107–15.
