# Statistical mechanics of statistics

December 2, 2016 — June 2, 2023

dynamical systems
machine learning
neural nets
physics
pseudorandomness
sciml
statistics
statmech
stochastic processes

I’ve always been curious about the statistical physics approach to problems from computer science. The physics-inspired algorithm survey propagation is the current champion for random 3SAT instances, statistical-physics phase transitions have been suggested as explaining computational difficulty, and statistical physics has even been invoked to explain why deep learning algorithms seem to often converge to useful local minima.

Unfortunately, I have always found the terminology of statistical physics, “spin glasses”, “quenched averages”, “annealing”, “replica symmetry breaking”, “metastable states” etc… to be rather daunting

Jaan Altosaar’s guided translation is great.

## 1 Phase transitions in statistical inference

Cristopher Moore says

There is a deep analogy between statistical inference and statistical physics; I will give a friendly introduction to both of these fields. I will then discuss phase transitions in two problems of interest to a broad range of data sciences: community detection in social and biological networks, and clustering of sparse high-dimensional data. In both cases, if our data becomes too sparse or too noisy, it suddenly becomes impossible to find the underlying pattern, or even tell if there is one. Physics both helps us locate these phase transitions, and design optimal algorithms that succeed all the way up to this point. Along the way, I will visit ideas from computational complexity, random graphs, random matrices, and spin glass theory.

There is an overview lecture by Thomas Orton, which cites lots of the good stuff

Last week, we saw how certain computational problems like 3SAT exhibit a thresholding behavior, similar to a phase transition in a physical system. In this post, we’ll continue to look at this phenomenon by exploring a heuristic method, belief propagation (and the cavity method), which has been used to make hardness conjectures, and also has thresholding properties. In particular, we’ll start by looking at belief propagation for approximate inference on sparse graphs as a purely computational problem. After doing this, we’ll switch perspectives and see belief propagation motivated in terms of Gibbs free energy minimization for physical systems. With these two perspectives in mind, we’ll then try to use belief propagation to do inference on the stochastic block model. We’ll see some heuristic techniques for determining when BP succeeds and fails in inference, as well as some numerical simulation results of belief propagation for this problem. Lastly, we’ll talk about where this all fits into what is currently known about efficient algorithms and information theoretic barriers for the stochastic block model.

See Igor Carron’s “phase diagram” list, and stuff like . Likely there are connections to Erdős-Renyi giant components and other complex network things in probabilisitic graph learning. Read .

## 2 Replicator equations and evolutionary processes

Gentle intro lecture by John Baez, Biology as Information Dynamics.

## 3 Grokking

Neel Nanda, Tom Lieberum, A Mechanistic Interpretability Analysis of Grokking

Grokking is a recent phenomena discovered by OpenAI researchers, that in my opinion is one of the most fascinating mysteries in deep learning. That models trained on small algorithmic tasks like modular addition will initially memorise the training data, but after a long time will suddenly learn to generalise to unseen data.

This is a write-up of an independent research project I did into understanding grokking through the lens of mechanistic interpretability. My most important claim is that grokking has a deep relationship to phase changes. Phase changes, ie a sudden change in the model’s performance for some capability during training, are a general phenomena that occur when training models, that have also been observed in large models trained on non-toy tasks. For example, the sudden change in a transformer’s capacity to do in-context learning when it forms induction heads. In this work examine several toy settings where a model trained to solve them exhibits a phase change in test loss, regardless of how much data it is trained on. I show that if a model is trained on these limited data with high regularisation, then that the model shows grokking.

See annealing.

## 6 References

Achlioptas, and Coja-Oghlan. 2008. arXiv:0803.2122 [Math].
Alexiadis, A., Simmons, Stamatopoulos, et al. 2020. Scientific Reports.
Ashton, Bernstein, Buchner, et al. 2022. Nature Reviews Methods Primers.
Baez. 2011.
Bahri, Kadmon, Pennington, et al. 2020. Annual Review of Condensed Matter Physics.
Baldassi, Borgs, Chayes, et al. 2016. Proceedings of the National Academy of Sciences.
Barbier. 2015. arXiv:1511.01650 [Cs, Math].
Barbier, Krzakala, Macris, et al. 2017. arXiv:1708.03395 [Cond-Mat, Physics:math-Ph].
Barnum, Barrett, Clark, et al. 2010. New Journal of Physics.
Braunstein, Mezard, and Zecchina. 2002. arXiv:cs/0212002.
Castellani, and Cavagna. 2005. Journal of Statistical Mechanics: Theory and Experiment.
Caticha. 2008.
Catoni. 2007. IMS Lecture Notes Monograph Series.
Chang, Meng, Haber, et al. 2018. In arXiv:1709.03698 [Cs, Stat].
Chaudhari, Choromanska, Soatto, et al. 2017.
Choromanska, Henaff, Mathieu, et al. 2015. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics.
“Complexity, Entropy and the Physics of Information: The Proceedings of the 1988 Workshop on Complexity, Entropy, and the Physics of Information.” 1990. In.
Feng, and Tu. 2021. Proceedings of the National Academy of Sciences.
Gershenfeld, N. 1996. IBM Systems Journal.
Gershenfeld, Neil. 2011. Computing.
Goldt, and Seifert. 2017. Physical Review Letters.
Haber, and Ruthotto. 2018. Inverse Problems.
Harper. 2009.
Hasegawa, and Van Vu. 2019. Physical Review E.
Hayou, Doucet, and Rousseau. 2019. In Proceedings of the 36th International Conference on Machine Learning.
Krishnamurthy, Can, and Schwab. 2022. Physical Review. X.
Krzakala, Zdeborova, Angelini, et al. n.d.
Lang, Fisher, Mora, et al. 2014. Physical Review Letters.
Laurent, and von Brecht. 2016. arXiv:1612.06212 [Cs].
Lin, and Tegmark. 2016a. arXiv:1606.06737 [Cond-Mat].
———. 2016b. arXiv:1608.08225 [Cond-Mat, Stat].
Machta. 1999. American Journal of Physics.
Mandelbrot. 1962. The Annals of Mathematical Statistics.
Marsland, and England. 2018. Reports on Progress in Physics.
Mehta, and Schwab. 2014.
Moore. 2017. Bulletin of the EATCS.
Nanda, Chan, Lieberum, et al. 2023.
Natal, Ávila, Tsukahara, et al. 2021. Entropy.
Oymak, and Tropp. 2015. arXiv:1511.09433 [Cs, Math, Stat].
Pavon. 1989. Applied Mathematics and Optimization.
Poole, Lahiri, Raghu, et al. 2016. In Advances in Neural Information Processing Systems 29.
Power, Burda, Edwards, et al. 2022.
Roberts, Yaida, and Hanin. 2021. arXiv:2106.10165 [Hep-Th, Stat].
Ruthotto, and Haber. 2020. Journal of Mathematical Imaging and Vision.
Schoenholz, Gilmer, Ganguli, et al. 2017. In.
Shalizi. 2009. Electronic Journal of Statistics.
Shalizi, and Moore. 2003.
Shwartz-Ziv, and Tishby. 2017. arXiv:1703.00810 [Cs].
Sinervo, and Lively. 1996. Nature.
Still. 2020. Physical Review Letters.
Sun, Yang, Xun, et al. 2023. ACM Transactions on Knowledge Discovery from Data.
Székely, and Rizzo. 2017. Annual Review of Statistics and Its Application.
Ventola, Braun, Yu, et al. 2023. arXiv.org.
Wolpert, David H. 2006. In Complex Engineered Systems. Understanding Complex Systems.
Wolpert, David H. 2008. Physica D: Nonlinear Phenomena, Novel Computing Paradigms: Quo Vadis?,.
Wolpert, David. 2017.
Wolpert, David H. 2018. In The Map and the Territory: Exploring the Foundations of Science, Thought and Reality.
———. 2019.
Wolpert, David H., Kolchinsky, and Owen. 2017.
Yarotsky, and Zhevnerchuk. 2020. In Proceedings of the 34th International Conference on Neural Information Processing Systems. NIPS’20.
Zdeborová, and Krzakala. 2016. Advances in Physics.
Zhang, Saxe, Advani, et al. 2018. Molecular Physics.