Statistical mechanics of statistics



Boaz Barak has a miniature dictionary for statisticians:

I’ve always been curious about the statistical physics approach to problems from computer science. The physics-inspired algorithm survey propagation is the current champion for random 3SAT instances, statistical-physics phase transitions have been suggested as explaining computational difficulty, and statistical physics has even been invoked to explain why deep learning algorithms seem to often converge to useful local minima.

Unfortunately, I have always found the terminology of statistical physics, “spin glasses”, “quenched averages”, “annealing”, “replica symmetry breaking”, “metastable states” etc… to be rather daunting.

Jaan Altosaar’s guided translation is great.

Phase transitions in statistical inference

Cristopher Moore says

There is a deep analogy between statistical inference and statistical physics; I will give a friendly introduction to both of these fields. I will then discuss phase transitions in two problems of interest to a broad range of data sciences: community detection in social and biological networks, and clustering of sparse high-dimensional data. In both cases, if our data becomes too sparse or too noisy, it suddenly becomes impossible to find the underlying pattern, or even tell if there is one. Physics both helps us locate these phase transitions, and design optimal algorithms that succeed all the way up to this point. Along the way, I will visit ideas from computational complexity, random graphs, random matrices, and spin glass theory.

There is an overview lecture by Thomas Orton, which cites lots of the good stuff:

Last week, we saw how certain computational problems like 3SAT exhibit a thresholding behavior, similar to a phase transition in a physical system. In this post, we’ll continue to look at this phenomenon by exploring a heuristic method, belief propagation (and the cavity method), which has been used to make hardness conjectures, and also has thresholding properties. In particular, we’ll start by looking at belief propagation for approximate inference on sparse graphs as a purely computational problem. After doing this, we’ll switch perspectives and see belief propagation motivated in terms of Gibbs free energy minimization for physical systems. With these two perspectives in mind, we’ll then try to use belief propagation to do inference on the stochastic block model. We’ll see some heuristic techniques for determining when BP succeeds and fails in inference, as well as some numerical simulation results of belief propagation for this problem. Lastly, we’ll talk about where this all fits into what is currently known about efficient algorithms and information theoretic barriers for the stochastic block model.
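To make the threshold talk concrete: for the symmetric two-community stochastic block model with within-community mean degree a and between-community mean degree b, the Kesten–Stigum condition (a − b)² > 2(a + b) marks where belief propagation is expected to start succeeding. A minimal sketch of that condition (my own illustration, not Orton's code):

```python
# Kesten–Stigum (KS) detectability check for the symmetric 2-community
# stochastic block model: vertices in the same community connect with
# probability a/n, across communities with probability b/n.
# Detection is conjectured feasible iff (a - b)**2 > 2 * (a + b).

def ks_detectable(a: float, b: float) -> bool:
    """True if (a, b) lies above the Kesten–Stigum threshold."""
    return (a - b) ** 2 > 2 * (a + b)

if __name__ == "__main__":
    for a, b in [(5.0, 1.0), (4.0, 2.0), (3.0, 3.0)]:
        status = "detectable" if ks_detectable(a, b) else "undetectable"
        print(f"a={a}, b={b}: mean degree {(a + b) / 2:.1f}, {status}")
```

For two communities this algorithmic threshold is also the information-theoretic one (the sparse SBM detectability results of Massoulié and of Mossel, Neeman and Sly), which is part of why the physics heuristics here are taken seriously.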

See Igor Carron’s “phase diagram” list, and results such as (Oymak and Tropp 2015). There are likely connections to Erdős–Rényi giant components and other complex-network phenomena in probabilistic graph learning. Read (Barbier 2015; Poole et al. 2016).
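The Erdős–Rényi giant-component transition at mean degree 1 is the simplest such phase diagram, and it is easy to reproduce numerically. A self-contained sketch (union-find, no graph library; it samples roughly cn/2 random edges, i.e. G(n, m) rather than literally G(n, c/n), which is close enough to show the transition):

```python
import random

def largest_component_fraction(n: int, c: float, seed: int = 0) -> float:
    """Sample an Erdős–Rényi-style graph with mean degree c and return
    |largest connected component| / n, computed with union-find."""
    rng = random.Random(seed)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Expected edge count of G(n, c/n) is about c * n / 2; sample that many
    # random endpoint pairs (self-loops and repeats are harmless here).
    m = int(c * (n - 1) / 2)
    for _ in range(m):
        u, v = rng.randrange(n), rng.randrange(n)
        parent[find(u)] = find(v)

    sizes = {}
    for x in range(n):
        r = find(x)
        sizes[r] = sizes.get(r, 0) + 1
    return max(sizes.values()) / n

if __name__ == "__main__":
    n = 20000
    for c in [0.5, 0.9, 1.0, 1.1, 1.5, 2.0]:
        frac = largest_component_fraction(n, c)
        print(f"mean degree c = {c:.1f}: largest component ≈ {frac:.3f} of n")
```

Below c = 1 the largest component stays vanishingly small; just above it a macroscopic giant component appears.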

Replicator equations and evolutionary processes

See also evolution, game theory.

Gentle intro lecture by John Baez, Biology as Information Dynamics.

See (Baez 2011; Harper 2009; Shalizi 2009; Sinervo and Lively 1996).
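Harper’s point has a one-line demonstration: the discrete-time replicator update, with fitnesses standing in for likelihoods, is exactly Bayes’ rule. A minimal sketch (my own illustration, not code from any of those papers):

```python
import numpy as np

# Discrete replicator update: x_i' = x_i * f_i / sum_j x_j * f_j.
# If x is a prior over hypotheses and f_i = p(data | hypothesis i),
# this is exactly Bayes' rule, which is Harper's (2009) observation.

def replicator_step(x: np.ndarray, fitness: np.ndarray) -> np.ndarray:
    weighted = x * fitness
    return weighted / weighted.sum()

prior = np.array([0.5, 0.3, 0.2])        # population shares / prior beliefs
likelihood = np.array([0.1, 0.4, 0.8])   # per-type fitness / p(data | type)

posterior = replicator_step(prior, likelihood)
bayes = prior * likelihood / (prior * likelihood).sum()
print(posterior)  # approximately [0.152, 0.364, 0.485]
assert np.allclose(posterior, bayes)
```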

Grokking

Neel Nanda, Tom Lieberum, A Mechanistic Interpretability Analysis of Grokking

Grokking (Power et al. 2022) is a recent phenomenon discovered by OpenAI researchers that, in my opinion, is one of the most fascinating mysteries in deep learning: models trained on small algorithmic tasks like modular addition will initially memorise the training data, but after a long time will suddenly learn to generalise to unseen data.

This is a write-up of an independent research project I did into understanding grokking through the lens of mechanistic interpretability. My most important claim is that grokking has a deep relationship to phase changes. Phase changes, i.e. a sudden change in the model’s performance for some capability during training, are a general phenomenon that occurs when training models and that has also been observed in large models trained on non-toy tasks, for example the sudden change in a transformer’s capacity to do in-context learning when it forms induction heads. In this work I examine several toy settings where a model trained to solve them exhibits a phase change in test loss, regardless of how much data it is trained on. I show that if a model is trained on limited data with high regularisation, then the model exhibits grokking.
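For reference, the datasets these grokking experiments run on are tiny: all pairs under modular addition, with only a fraction of them shown to the model. A sketch of that setup (the modulus p = 113 and the 30% train split are illustrative choices, not claims about the exact settings in Power et al. or Nanda’s write-up):

```python
import numpy as np

# Modular addition dataset of the kind used in grokking experiments:
# every pair (a, b) with label (a + b) mod p, split into a small train set
# and a large held-out test set.

def modular_addition_split(p: int = 113, train_frac: float = 0.3, seed: int = 0):
    pairs = np.array([(a, b) for a in range(p) for b in range(p)])
    labels = (pairs[:, 0] + pairs[:, 1]) % p
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(pairs))
    n_train = int(train_frac * len(pairs))
    train, test = idx[:n_train], idx[n_train:]
    return (pairs[train], labels[train]), (pairs[test], labels[test])

(train_x, train_y), (test_x, test_y) = modular_addition_split()
print(train_x.shape, test_x.shape)  # (3830, 2) (8939, 2) for p=113, frac=0.3
```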

Annealing

See annealing.

References

Achlioptas, Dimitris, and Amin Coja-Oghlan. 2008. “Algorithmic Barriers from Phase Transitions.” arXiv:0803.2122 [Math], October, 793–802.
Baez, John C. 2011. “Renyi Entropy and Free Energy,” February.
Bahri, Yasaman, Jonathan Kadmon, Jeffrey Pennington, Sam S. Schoenholz, Jascha Sohl-Dickstein, and Surya Ganguli. 2020. “Statistical Mechanics of Deep Learning.” Annual Review of Condensed Matter Physics 11 (1): 501–28.
Baldassi, Carlo, Christian Borgs, Jennifer T. Chayes, Alessandro Ingrosso, Carlo Lucibello, Luca Saglietti, and Riccardo Zecchina. 2016. “Unreasonable Effectiveness of Learning Neural Networks: From Accessible States and Robust Ensembles to Basic Algorithmic Schemes.” Proceedings of the National Academy of Sciences 113 (48): E7655–62.
Barbier, Jean. 2015. “Statistical Physics and Approximate Message-Passing Algorithms for Sparse Linear Estimation Problems in Signal Processing and Coding Theory.” arXiv:1511.01650 [Cs, Math], November.
Barbier, Jean, Florent Krzakala, Nicolas Macris, Léo Miolane, and Lenka Zdeborová. 2017. “Phase Transitions, Optimal Errors and Optimality of Message-Passing in Generalized Linear Models.” arXiv:1708.03395 [Cond-Mat, Physics:math-Ph], August.
Braunstein, A., M. Mezard, and R. Zecchina. 2002. “Survey Propagation: An Algorithm for Satisfiability.” arXiv:cs/0212002, December.
Castellani, Tommaso, and Andrea Cavagna. 2005. “Spin-Glass Theory for Pedestrians.” Journal of Statistical Mechanics: Theory and Experiment 2005 (05): P05012.
Catoni, Olivier. 2007. “PAC-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning.” IMS Lecture Notes Monograph Series 56: 1–163.
Chang, Bo, Lili Meng, Eldad Haber, Lars Ruthotto, David Begert, and Elliot Holtham. 2018. “Reversible Architectures for Arbitrarily Deep Residual Neural Networks.” In arXiv:1709.03698 [Cs, Stat].
Chaudhari, Pratik, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. 2017. “Entropy-SGD: Biasing Gradient Descent Into Wide Valleys.” arXiv.
Choromanska, Anna, Mikael Henaff, Michael Mathieu, Gerard Ben Arous, and Yann LeCun. 2015. “The Loss Surfaces of Multilayer Networks.” In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, 192–204.
“Complexity, Entropy and the Physics of Information: The Proceedings of the 1988 Workshop on Complexity, Entropy, and the Physics of Information.” 1990. Addison-Wesley Pub. Co.
Feldman, David P. 2002. “A Brief Introduction to: Information Theory, Excess Entropy and Computational Mechanics.”
Feng, Yu, and Yuhai Tu. 2021. “The Inverse Variance–Flatness Relation in Stochastic Gradient Descent Is Critical for Finding Flat Minima.” Proceedings of the National Academy of Sciences 118 (9): e2015617118.
Goldt, Sebastian, and Udo Seifert. 2017. “Stochastic Thermodynamics of Learning.” Physical Review Letters 118 (1): 010601.
Haber, Eldad, and Lars Ruthotto. 2018. “Stable Architectures for Deep Neural Networks.” Inverse Problems 34 (1): 014004.
Harper, Marc. 2009. “The Replicator Equation as an Inference Dynamic,” November.
Hasegawa, Yoshihiko, and Tan Van Vu. 2019. “Uncertainty Relations in Stochastic Processes: An Information Inequality Approach.” Physical Review E 99 (6): 062126.
Hayou, Soufiane, Arnaud Doucet, and Judith Rousseau. 2019. “On the Impact of the Activation Function on Deep Neural Networks Training.” In Proceedings of the 36th International Conference on Machine Learning, 2672–80. PMLR.
Krzakala, Florent, Lenka Zdeborová, Maria Chiara Angelini, and Francesco Caltagirone. n.d. “Statistical Physics of Inference and Bayesian Estimation,” 44.
Lang, Alex H., Charles K. Fisher, Thierry Mora, and Pankaj Mehta. 2014. “Thermodynamics of Statistical Inference by Cells.” Physical Review Letters 113 (14).
Lin, Henry W., and Max Tegmark. 2016a. “Critical Behavior from Deep Dynamics: A Hidden Dimension in Natural Language.” arXiv:1606.06737 [Cond-Mat], June.
———. 2016b. “Why Does Deep and Cheap Learning Work so Well?” arXiv:1608.08225 [Cond-Mat, Stat], August.
Mandelbrot, Benoit. 1962. “The Role of Sufficiency and of Estimation in Thermodynamics.” The Annals of Mathematical Statistics 33 (3): 1021–38.
Marsland, Robert, and Jeremy England. 2018. “Limits of Predictions in Thermodynamic Systems: A Review.” Reports on Progress in Physics 81 (1): 016601.
Mehta, Pankaj, and David J. Schwab. 2014. “An Exact Mapping Between the Variational Renormalization Group and Deep Learning.” arXiv.
Moore, Cristopher. 2017. “The Computer Science and Physics of Community Detection: Landscapes, Phase Transitions, and Hardness.” Bulletin of the EATCS, February.
Nanda, Neel, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. 2023. “Progress Measures for Grokking via Mechanistic Interpretability.” arXiv.
Oymak, Samet, and Joel A. Tropp. 2015. “Universality Laws for Randomized Dimension Reduction, with Applications.” arXiv:1511.09433 [Cs, Math, Stat], November.
Pavon, Michele. 1989. “Stochastic Control and Nonequilibrium Thermodynamical Systems.” Applied Mathematics and Optimization 19 (1): 187–202.
Poole, Ben, Subhaneil Lahiri, Maithreyi Raghu, Jascha Sohl-Dickstein, and Surya Ganguli. 2016. “Exponential Expressivity in Deep Neural Networks Through Transient Chaos.” In Advances in Neural Information Processing Systems 29, edited by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, 3360–68. Curran Associates, Inc.
Power, Alethea, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra. 2022. “Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets.” arXiv.
Roberts, Daniel A., Sho Yaida, and Boris Hanin. 2021. “The Principles of Deep Learning Theory.” arXiv:2106.10165 [Hep-Th, Stat], August.
Ruthotto, Lars, and Eldad Haber. 2020. “Deep Neural Networks Motivated by Partial Differential Equations.” Journal of Mathematical Imaging and Vision 62 (3): 352–64.
Schoenholz, Samuel S., Justin Gilmer, Surya Ganguli, and Jascha Sohl-Dickstein. 2017. “Deep Information Propagation.”
Shalizi, Cosma Rohilla. 2009. “Dynamics of Bayesian Updating with Dependent Data and Misspecified Models.” Electronic Journal of Statistics 3: 1039–74.
Shalizi, Cosma Rohilla, and Cristopher Moore. 2003. “What Is a Macrostate? Subjective Observations and Objective Dynamics.” arXiv.
Shwartz-Ziv, Ravid, and Naftali Tishby. 2017. “Opening the Black Box of Deep Neural Networks via Information.” arXiv:1703.00810 [Cs], March.
Sinervo, B., and C. M. Lively. 1996. “The Rock–Paper–Scissors Game and the Evolution of Alternative Male Strategies.” Nature 380 (6571): 240.
Sun, Jianhui, Ying Yang, Guangxu Xun, and Aidong Zhang. 2023. “Scheduling Hyperparameters to Improve Generalization: From Centralized SGD to Asynchronous SGD.” ACM Transactions on Knowledge Discovery from Data 17 (2): 29:1–37.
Székely, Gábor J., and Maria L. Rizzo. 2017. “The Energy of Data.” Annual Review of Statistics and Its Application 4 (1): 447–79.
Wolpert, David. 2017. “Constraints on Physical Reality Arising from a Formalization of Knowledge,” November.
Wolpert, David H. 2006. “Information Theory — The Bridge Connecting Bounded Rational Game Theory and Statistical Physics.” In Complex Engineered Systems, 262–90. Understanding Complex Systems. Springer Berlin Heidelberg.
Wolpert, David H. 2008. “Physical Limits of Inference.” Physica D: Nonlinear Phenomena, Novel Computing Paradigms: Quo Vadis?, 237 (9): 1257–81.
———. 2018. “Theories of Knowledge and Theories of Everything.” In The Map and the Territory: Exploring the Foundations of Science, Thought and Reality, 165. Cham: Springer.
———. 2019. “Stochastic Thermodynamics of Computation,” May.
Wolpert, David H., Artemy Kolchinsky, and Jeremy A. Owen. 2017. “A Space-Time Tradeoff for Implementing a Function with Master Equation Dynamics,” August.
Yarotsky, Dmitry, and Anton Zhevnerchuk. 2020. “The Phase Diagram of Approximation Rates for Deep Neural Networks.” In Proceedings of the 34th International Conference on Neural Information Processing Systems, 33:13005–15. NIPS’20. Red Hook, NY, USA: Curran Associates Inc.
Zdeborová, Lenka, and Florent Krzakala. 2016. “Statistical Physics of Inference: Thresholds and Algorithms.” Advances in Physics 65 (5): 453–552.
Zhang, Yao, Andrew M. Saxe, Madhu S. Advani, and Alpha A. Lee. 2018. “Energy-Entropy Competition and the Effectiveness of Stochastic Gradient Descent in Machine Learning.” Molecular Physics 116 (21-22): 3214–23.
