Singular Learning Theory
2024-10-29 — 2025-09-01
Wherein algebraic geometry is applied to characterise singularities in the loss surfaces of overparameterized neural networks, and the local learning coefficient is introduced as an effective dimension.
As far as I can tell, a first-order approximation to (the bits I vaguely understand of) Singular Learning Theory is something like:
Classical Bayesian statistics has a good theory of well-posed models with a small number of interpretable parameters. Singular Learning Theory extends Bayesian statistics to ill-posed ("singular") models with a large number of uninterpretable parameters, using results from algebraic geometry to characterise the singularities of the loss surface.
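One concrete anchor for why singularities matter, as I understand it, is Watanabe's free energy asymptotics. For a model with $d$ parameters, empirical loss $L_n$, and most-favoured parameter $w_0$, the Bayesian free energy (negative log marginal likelihood) expands as

$$
F_n = n L_n(w_0) + \lambda \log n + O_p(\log \log n),
$$

where $\lambda$ is the learning coefficient, the real log canonical threshold of the loss at its minimum. For regular models $\lambda = d/2$, recovering the BIC; for singular models $\lambda \le d/2$, so the posterior favours degenerate regions and the model behaves as if it had fewer parameters than it nominally does.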
Why might we care about this? For the moment I am taking it on faith. Since attending ILIAD2 I am relatively more optimistic about the potential for this program of research to go somewhere.
Jesse Hoogland, Neural networks generalize because of this one weird trick:
Statistical learning theory is lying to you: “overparametrized” models actually aren’t overparametrized, and generalisation is not just a question of broad basins.
1 Local Learning Coefficient
Resources recommended to me by Rohan Hitchcock:
- The upshot is that Jesse Hoogland and Stan van Wingerden argue we should care about model complexity, and that the local learning coefficient (Lau et al. 2024) is arguably the correct measure of it.
- The RLCT Measures the Effective Dimension of Neural Networks
I probably have some alpha in estimating this value.
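To make "estimating this value" concrete, here is a minimal toy sketch of the SGLD-based LLC estimator in the spirit of Lau et al. (2024): sample from a tempered posterior localized around a trained parameter $w^*$, then read off $\hat\lambda = n\beta\,(\mathbb{E}_\beta[L_n(w)] - L_n(w^*))$. All function and variable names are mine (this is not the devinterp implementation), and the loss is a toy regular quadratic, for which the theoretical value is $d/2$.

```python
# Hypothetical sketch of an SGLD-based local learning coefficient (LLC)
# estimator, following the shape of the Lau et al. (2024) estimator.
# Names and defaults are my own choices, not a reference implementation.
import numpy as np

def llc_estimate(loss, grad, w_star, n=1000, beta=None, gamma=1.0,
                 eps=1e-4, steps=50000, burn_in=2000, seed=0):
    """Estimate the LLC at w_star:

        lambda_hat = n * beta * (E_beta[loss(w)] - loss(w_star)),

    where the expectation is over the tempered, localized posterior
    exp(-n*beta*loss(w) - gamma/2 * ||w - w_star||^2), sampled by SGLD.
    """
    rng = np.random.default_rng(seed)
    if beta is None:
        beta = 1.0 / np.log(n)  # the "critical" inverse temperature
    w = w_star.copy()
    losses = []
    for t in range(steps):
        # Full-batch SGLD step: gradient drift plus Gaussian noise.
        drift = -0.5 * eps * (n * beta * grad(w) + gamma * (w - w_star))
        w = w + drift + np.sqrt(eps) * rng.standard_normal(w.shape)
        if t >= burn_in:
            losses.append(loss(w))
    return n * beta * (np.mean(losses) - loss(w_star))

# Toy regular loss L(w) = ||w||^2 / 2 in d dimensions: theory predicts
# an LLC of d/2, the classical BIC dimension count.
d = 4
lam = llc_estimate(lambda w: 0.5 * w @ w, lambda w: w, np.zeros(d))
print(f"LLC estimate: {lam:.2f} (regular-model prediction: {d / 2})")
```

On a genuinely singular loss, e.g. $L(w) = w_1^2 w_2^2$, the same estimator should return something below $d/2$, which is the whole point: the LLC is an effective dimension, not a parameter count.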
2 Use in developmental interpretability
I don’t understand this step of the argument, but see Developmental Interpretability by Jesse Hoogland and Stan van Wingerden, and maybe read Lehalleur et al. (2025).
3 Fractal loss landscapes
Notionally there might be a connection to fractal loss landscapes. See Fractal dimension of loss landscapes and self similar behaviour in neural networks.
4 Questions I would like to know how to answer
- Can we use LLC as an optimal design objective, crafting desired optima directly in loss space by altering loss functions and/or architectures dynamically?
- Can we filter to estimate the LLC online? It looks a lot like sparse Kalman filtering — just sayin’.
- The Bayesian formalism. How does the notional “Bayesian” update work in practical neural networks that are not trained by posterior updates? Does it matter? What happens when our training process really pushes the analogy, e.g. when we have synthetic data generation in a neural distillation?
- How biased is LLC estimation for nontrivial networks? In some quick sims of mine, the LLC estimates from SGHMC and SGLD diverged substantially. Is that worrisome? [TODO clarify]
- How would belief propagation work in a Bayesian setting? What even is the natural space of priors generating local landscape geometries?
5 Incoming
From SLT to AIT: NN generalisation out-of-distribution — LessWrong
This post derives an upper bound on the prediction error of Bayesian learning on neural networks. Unlike the bound from vanilla Singular Learning Theory (SLT), this bound also holds for out-of-distribution generalization, not just for in-distribution generalization. Along the way, it shows some connections between SLT and Algorithmic Information Theory (AIT).
The Developmental Interpretability site is mostly about SLT, and it runs a thriving Discord server on the topic.
Alexander Gietelink Oldenziel — Singular Learning Theory
metauni’s Singular Learning Theory seminar
Timaeus is an AI safety research organisation working on applications of Singular Learning Theory (SLT) to alignment.