Statistical mechanics of statistics and NNs

2016-12-01 — 2025-01-07

Boaz Barak has a miniature dictionary for statisticians:

I’ve always been curious about the statistical physics approach to problems from computer science. The physics-inspired algorithm survey propagation is the current champion for random 3SAT instances, statistical-physics phase transitions have been suggested as explaining computational difficulty, and statistical physics has even been invoked to explain why deep learning algorithms seem to often converge to useful local minima.

Unfortunately, I have always found the terminology of statistical physics, “spin glasses”, “quenched averages”, “annealing”, “replica symmetry breaking”, “metastable states” etc… to be rather daunting.

Jaan Altosaar’s guided translation is great.

This area is booming in neural network research, in particular through observed phase transitions, singular learning theory (SLT) analyses and scaling laws.

1 Phase transitions in statistical inference

Cristopher Moore says:

There is a deep analogy between statistical inference and statistical physics; I will give a friendly introduction to both of these fields. I will then discuss phase transitions in two problems of interest to a broad range of data sciences: community detection in social and biological networks, and clustering of sparse high-dimensional data. In both cases, if our data becomes too sparse or too noisy, it suddenly becomes impossible to find the underlying pattern, or even tell if there is one. Physics both helps us locate these phase transitions, and design optimal algorithms that succeed all the way up to this point. Along the way, I will visit ideas from computational complexity, random graphs, random matrices, and spin glass theory.
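To make "suddenly becomes impossible" concrete: in the simplest sparse community-detection model, two equal-sized groups with within-group edge probability a/n and between-group edge probability b/n, the known detectability threshold is the Kesten–Stigum condition (a − b)² > 2(a + b). The quote does not state this formula; I am adding it as an illustration, and the function names below are my own.

```python
def sbm_snr(a, b):
    """Signal-to-noise ratio for the symmetric two-group sparse SBM.

    Assumes edge probabilities a/n (within group) and b/n (between groups).
    """
    return (a - b) ** 2 / (2 * (a + b))


def sbm_detectable(a, b):
    """True when (a - b)^2 > 2 (a + b), i.e. above the Kesten-Stigum threshold."""
    return sbm_snr(a, b) > 1.0


# Same average degree budget, very different regimes.
for a, b in [(5, 1), (4, 2), (3, 2)]:
    print(a, b, round(sbm_snr(a, b), 3), sbm_detectable(a, b))
```

Below the threshold no algorithm does better than chance at recovering the groups in this two-group case; with more groups the picture is more complicated, and this sketch does not cover it.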

There is an overview lecture by Thomas Orton, which cites lots of the good stuff:

Last week, we saw how certain computational problems like 3SAT exhibit a thresholding behaviour, similar to a phase transition in a physical system. In this post, we’ll continue to look at this phenomenon by exploring a heuristic method, belief propagation (and the cavity method), which has been used to make hardness conjectures, and also has thresholding properties. In particular, we’ll start by looking at belief propagation for approximate inference on sparse graphs as a purely computational problem. After doing this, we’ll switch perspectives and see belief propagation motivated in terms of Gibbs free energy minimisation for physical systems. With these two perspectives in mind, we’ll then try to use belief propagation to do inference on the stochastic block model. We’ll see some heuristic techniques for determining when BP succeeds and fails in inference, as well as some numerical simulation results of belief propagation for this problem. Lastly, we’ll talk about where this all fits into what is currently known about efficient algorithms and information theoretic barriers for the stochastic block model.

See Igor Carron’s “phase diagram” list, and stuff like Oymak and Tropp (2015). There are likely connections to Erdős–Rényi giant components and other complex-network phenomena in probabilistic graph learning. Read Barbier (2015) and Poole et al. (2016).
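The Erdős–Rényi giant component is the easiest of these transitions to see numerically. Here is a minimal sketch, assuming the G(n, p) model with p = c/n: below mean degree c = 1 the largest connected component is a vanishing fraction of the graph, and above it a component of macroscopic size appears. The function name, the union-find implementation and the choices of n and c are mine, purely for illustration.

```python
import random


def largest_component_fraction(n, c, seed=0):
    """Fraction of nodes in the largest component of G(n, p) with p = c / n."""
    rng = random.Random(seed)
    parent = list(range(n))
    size = [1] * n  # component sizes, valid at root nodes only

    def find(i):
        # Find the root of i, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    p = c / n
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:  # include edge (i, j) with probability p
                ri, rj = find(i), find(j)
                if ri != rj:
                    # Union by size: attach the smaller tree to the larger.
                    if size[ri] < size[rj]:
                        ri, rj = rj, ri
                    parent[rj] = ri
                    size[ri] += size[rj]
    return max(size[find(i)] for i in range(n)) / n


for c in (0.5, 1.0, 2.0):
    print(c, largest_component_fraction(1000, c))
```

Sweeping c on a finer grid and increasing n should sharpen the kink at c = 1, which is the finite-size shadow of the phase transition.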

2 Grokking

See Grokking.

3 Singular learning theory

See singular learning theory, which also produces an analysis of Grokking-like behaviour in terms of degeneracies in the loss landscape.

Via SLT colleagues, I have been referred to many papers on the subject, none of which I have read yet (Brill 2024; Cagnetta et al. 2023; Carroll 2021; Frenkel 1999; LaMont and Wiggins 2019; Z. Liu et al. 2025; Y. Liu, Liu, and Gore 2025; Yao, Liu, and Tegmark, n.d.; Yerrababu, Majumdar, and Sadhu 2024; Yue et al. 2025; X. Zhang et al. 2025; Zhong et al. 2018; Ziyin, Xu, and Chuang 2025).

4 Annealing

See annealing.

5 Entropy vs information

See Entropy vs information.

6 Neural tangent kernel

The neural tangent kernel has been argued to fit in this category, e.g. by Cagnetta et al. (2023).

7 Replicator equations and evolutionary processes

Can we think of statistical inference as an evolutionary process as in biology? See also evolution, game theory.

Gentle intro lecture by John Baez, Biology as Information Dynamics.

See (Baez 2011; Harper 2009; Shalizi 2009; Sinervo and Lively 1996).
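The usual way to state the correspondence is that a Bayesian update is one step of a discrete-time replicator dynamic: hypotheses play the role of types, the prior is the population share, and the likelihood of the new observation is the fitness. Below is a minimal sketch of that reading, under my own toy assumptions (three candidate coin biases and a made-up sequence of flips); it is not code from any of the cited papers.

```python
def replicator_step(shares, fitness):
    """One discrete replicator step: shares grow in proportion to fitness."""
    mean_fitness = sum(p * f for p, f in zip(shares, fitness))
    return [p * f / mean_fitness for p, f in zip(shares, fitness)]


def coin_likelihood(theta, flip):
    """Likelihood of one flip (1 = heads, 0 = tails) under heads-probability theta."""
    return theta if flip == 1 else 1.0 - theta


# Hypothetical setup: three candidate biases, uniform prior, toy data.
thetas = [0.2, 0.5, 0.8]
shares = [1 / 3, 1 / 3, 1 / 3]
flips = [1, 1, 0, 1, 1, 1, 0, 1]

for flip in flips:
    # Fitness of each hypothesis is how well it predicted this observation.
    fitness = [coin_likelihood(t, flip) for t in thetas]
    shares = replicator_step(shares, fitness)

print(shares)
```

The mean fitness is the marginal likelihood of the observation, so the normalisation in the replicator step is exactly the denominator in Bayes' rule, and the final shares equal the posterior computed in one batch.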

8 Incoming

David Donoho recommended Mezard and Montanari (2009) to me. Online.
Towards a theory for typical-case algorithmic hardness
Workshop: OSL 2023

9 References

Achlioptas, and Coja-Oghlan. 2008. “Algorithmic Barriers from Phase Transitions.” arXiv:0803.2122 [Math].

Alexiadis, Alessio. 2019. “Deep Multiphysics and Particle–Neuron Duality: A Computational Framework Coupling (Discrete) Multiphysics and Deep Learning.” Applied Sciences.

Alexiadis, A., Simmons, Stamatopoulos, et al. 2020. “The Duality Between Particle Methods and Artificial Neural Networks.” Scientific Reports.

Ashton, Bernstein, Buchner, et al. 2022. “Nested Sampling for Physical Scientists.” Nature Reviews Methods Primers.

Azizian, Iutzeler, Malick, et al. 2025. “The Global Convergence Time of Stochastic Gradient Descent in Non-Convex Landscapes: Sharp Estimates via Large Deviations.”

Baez. 2011. “Renyi Entropy and Free Energy.”

Bahri, Kadmon, Pennington, et al. 2020. “Statistical Mechanics of Deep Learning.” Annual Review of Condensed Matter Physics.

Baldassi, Borgs, Chayes, et al. 2016. “Unreasonable Effectiveness of Learning Neural Networks: From Accessible States and Robust Ensembles to Basic Algorithmic Schemes.” Proceedings of the National Academy of Sciences.

Barbaresco. 2022. “Symplectic Theory of Heat and Information Geometry.” In Handbook of Statistics. Geometry and Statistics.

Barbier. 2015. “Statistical Physics and Approximate Message-Passing Algorithms for Sparse Linear Estimation Problems in Signal Processing and Coding Theory.” arXiv:1511.01650 [Cs, Math].

Barbier, Krzakala, Macris, et al. 2017. “Phase Transitions, Optimal Errors and Optimality of Message-Passing in Generalized Linear Models.” arXiv:1708.03395 [Cond-Mat, Physics:math-Ph].

Barnum, Barrett, Clark, et al. 2010. “Entropy and Information Causality in General Probabilistic Theories.” New Journal of Physics.

Braunstein, Mezard, and Zecchina. 2002. “Survey Propagation: An Algorithm for Satisfiability.” arXiv:cs/0212002.

Brill. 2024. “Neural Scaling Laws Rooted in the Data Distribution.”

Cagnetta, Oliveira, Sabanayagam, et al. 2023. “Kernels, Data & Physics.”

Carroll. 2021. “Phase Transitions in Neural Networks.”

Castellani, and Cavagna. 2005. “Spin-Glass Theory for Pedestrians.” Journal of Statistical Mechanics: Theory and Experiment.

Caticha. 2008. “Lectures on Probability, Entropy, and Statistical Physics.”

Catoni. 2007. “PAC-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning.” IMS Lecture Notes Monograph Series.

Chang, Meng, Haber, et al. 2018. “Reversible Architectures for Arbitrarily Deep Residual Neural Networks.” arXiv:1709.03698 [Cs, Stat].

Chaudhari, Choromanska, Soatto, et al. 2017. “Entropy-SGD: Biasing Gradient Descent Into Wide Valleys.”

Choromanska, Henaff, Mathieu, et al. 2015. “The Loss Surfaces of Multilayer Networks.” In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics.

“Complexity, Entropy and the Physics of Information: The Proceedings of the 1988 Workshop on Complexity, Entropy, and the Physics of Information.” 1990.

Donoho, and Tanner. 2009. “Observed Universality of Phase Transitions in High-Dimensional Geometry, with Implications for Modern Data Analysis and Signal Processing.” Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences.

Feldman. 2002. “A Brief Introduction to: Information Theory, Excess Entropy and Computational Mechanics.”

Feng, and Tu. 2021. “The Inverse Variance–Flatness Relation in Stochastic Gradient Descent Is Critical for Finding Flat Minima.” Proceedings of the National Academy of Sciences.

Frenkel. 1999. “Entropy-Driven Phase Transitions.” Physica A: Statistical Mechanics and Its Applications, Proceedings of the 20th IUPAP International Conference on Statistical Physics.

Gershenfeld, N. 1996. “Signal Entropy and the Thermodynamics of Computation.” IBM Systems Journal.

Gershenfeld, Neil. 2011. “Aligning the Representation and Reality of Computation with Asynchronous Logic Automata.” Computing.

Goldt, and Seifert. 2017. “Stochastic Thermodynamics of Learning.” Physical Review Letters.

Haber, and Ruthotto. 2018. “Stable Architectures for Deep Neural Networks.” Inverse Problems.

Harper. 2009. “The Replicator Equation as an Inference Dynamic.”

Hasegawa, and Van Vu. 2019. “Uncertainty Relations in Stochastic Processes: An Information Inequality Approach.” Physical Review E.

Hayou, Doucet, and Rousseau. 2019. “On the Impact of the Activation Function on Deep Neural Networks Training.” In Proceedings of the 36th International Conference on Machine Learning.

Kolchinsky, Marvian, Gokler, et al. 2025. “Maximizing Free Energy Gain.” Entropy.

Krishnamurthy, Can, and Schwab. 2022. “Theory of Gating in Recurrent Neural Networks.” Physical Review. X.

Krzakala, Zdeborova, Angelini, et al. n.d. “Statistical Physics of Inference and Bayesian Estimation.”

LaMont, and Wiggins. 2019. “On the Correspondence Between Thermodynamics and Inference.” Physical Review E.

Lang, Fisher, Mora, et al. 2014. “Thermodynamics of Statistical Inference by Cells.” Physical Review Letters.

Laurent, and von Brecht. 2016. “A Recurrent Neural Network Without Chaos.” arXiv:1612.06212 [Cs].

Lavis, and Frigg. 2025. The Fundamentals of Thermodynamics. Fundamental Theories of Physics.

Lin, and Tegmark. 2016a. “Critical Behavior from Deep Dynamics: A Hidden Dimension in Natural Language.” arXiv:1606.06737 [Cond-Mat].

———. 2016b. “Why Does Deep and Cheap Learning Work so Well?” arXiv:1608.08225 [Cond-Mat, Stat].

Liu, Ziming, Liu, Gore, et al. 2025. “Neural Thermodynamic Laws for Large Language Model Training.”

Liu, Yizhou, Liu, and Gore. 2025. “Superposition Yields Robust Neural Scaling.”

Machta. 1999. “Entropy, Information, and Computation.” American Journal of Physics.

Mandelbrot. 1962. “The Role of Sufficiency and of Estimation in Thermodynamics.” The Annals of Mathematical Statistics.

Marsland, and England. 2018. “Limits of Predictions in Thermodynamic Systems: A Review.” Reports on Progress in Physics.

Mehta, and Schwab. 2014. “An Exact Mapping Between the Variational Renormalization Group and Deep Learning.”

Mezard, and Montanari. 2009. Information, Physics, and Computation. Oxford Graduate Texts.

Moore. 2017. “The Computer Science and Physics of Community Detection: Landscapes, Phase Transitions, and Hardness.” Bulletin of the EATCS.

Nanda, Chan, Lieberum, et al. 2023. “Progress Measures for Grokking via Mechanistic Interpretability.”

Natal, Ávila, Tsukahara, et al. 2021. “Entropy: From Thermodynamics to Information Processing.” Entropy.

Oymak, and Tropp. 2015. “Universality Laws for Randomized Dimension Reduction, with Applications.” arXiv:1511.09433 [Cs, Math, Stat].

Pavon. 1989. “Stochastic Control and Nonequilibrium Thermodynamical Systems.” Applied Mathematics and Optimization.

Poole, Lahiri, Raghu, et al. 2016. “Exponential Expressivity in Deep Neural Networks Through Transient Chaos.” In Advances in Neural Information Processing Systems 29.

Power, Burda, Edwards, et al. 2022. “Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets.”

Roberts, Yaida, and Hanin. 2021. “The Principles of Deep Learning Theory.” arXiv:2106.10165 [Hep-Th, Stat].

Ruthotto, and Haber. 2020. “Deep Neural Networks Motivated by Partial Differential Equations.” Journal of Mathematical Imaging and Vision.

Schoenholz, Gilmer, Ganguli, et al. 2017. “Deep Information Propagation.”

Shalizi. 2009. “Dynamics of Bayesian Updating with Dependent Data and Misspecified Models.” Electronic Journal of Statistics.

Shalizi, and Moore. 2003. “What Is a Macrostate? Subjective Observations and Objective Dynamics.”

Shwartz-Ziv, and Tishby. 2017. “Opening the Black Box of Deep Neural Networks via Information.” arXiv:1703.00810 [Cs].

Sinervo, and Lively. 1996. “The Rock–Paper–Scissors Game and the Evolution of Alternative Male Strategies.” Nature.

Still. 2020. “Thermodynamic Cost and Benefit of Memory.” Physical Review Letters.

Sun, Yang, Xun, et al. 2023. “Scheduling Hyperparameters to Improve Generalization: From Centralized SGD to Asynchronous SGD.” ACM Transactions on Knowledge Discovery from Data.

Székely, and Rizzo. 2017. “The Energy of Data.” Annual Review of Statistics and Its Application.

Ventola, Braun, Yu, et al. 2023. “Probabilistic Circuits That Know What They Don’t Know.” arXiv.org.

Wolpert, David H. 2006. “Information Theory — The Bridge Connecting Bounded Rational Game Theory and Statistical Physics.” In Complex Engineered Systems. Understanding Complex Systems.

Wolpert, David H. 2008. “Physical Limits of Inference.” Physica D: Nonlinear Phenomena, Novel Computing Paradigms: Quo Vadis?

Wolpert, David. 2017. “Constraints on Physical Reality Arising from a Formalization of Knowledge.”

Wolpert, David H. 2018. “Theories of Knowledge and Theories of Everything.” In The Map and the Territory: Exploring the Foundations of Science, Thought and Reality.

———. 2019. “Stochastic Thermodynamics of Computation.”

Wolpert, David H., Kolchinsky, and Owen. 2017. “A Space-Time Tradeoff for Implementing a Function with Master Equation Dynamics.”

Wolpert, David, and Scharnhorst. 2024. “Stochastic Process Turing Machines.”

Yao, Liu, and Tegmark. n.d. “Variational Loss Landscapes for Periodic Orbits.”

Yarotsky, and Zhevnerchuk. 2020. “The Phase Diagram of Approximation Rates for Deep Neural Networks.” In Proceedings of the 34th International Conference on Neural Information Processing Systems. NIPS’20.

Yerrababu, Majumdar, and Sadhu. 2024. “Dynamical Phase Transitions in Certain Non-Ergodic Stochastic Processes.”

Yue, Chen, Lu, et al. 2025. “Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?”

Zdeborová, and Krzakala. 2016. “Statistical Physics of Inference: Thresholds and Algorithms.” Advances in Physics.

Zhang, Yao, Saxe, Advani, et al. 2018. “Energy-Entropy Competition and the Effectiveness of Stochastic Gradient Descent in Machine Learning.” Molecular Physics.

Zhang, Xiaotian, Shang, Yang, et al. 2025. “Is Grokking a Computational Glass Relaxation?”

Zhong, Panja, Barkema, et al. 2018. “Generalized Langevin Equation Formulation for Anomalous Diffusion in the Ising Model at the Critical Temperature.” Physical Review E.

Ziyin, Xu, and Chuang. 2025. “Neural Thermodynamics I: Entropic Forces in Deep and Universal Representation Learning.”