Singular Learning Theory
2024-10-29 — 2025-09-01
Wherein algebraic geometry is applied to characterise singularities in the loss surfaces of overparameterized neural networks, and the local learning coefficient is introduced as an effective dimension.
As far as I can tell, a first-order approximation to (the bits I vaguely understand of) Singular Learning Theory is something like:
Classical Bayesian statistics has a good theory of well-posed models with a small number of interpretable parameters. Singular Learning Theory extends Bayesian statistics to ill-posed ("singular") models with a large number of uninterpretable parameters, using results from algebraic geometry to characterise the singularities of the loss surface.
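One concrete anchor for why singularities matter, as I understand it, is Watanabe's free energy asymptotics. For a model with $d$ parameters, empirical loss $L_n$, and most-favoured parameter $w_0$, the Bayesian free energy (negative log marginal likelihood) expands as

$$
F_n = n L_n(w_0) + \lambda \log n + O_p(\log \log n),
$$

where $\lambda$ is the learning coefficient, the real log canonical threshold of the loss at its minimum. For regular models $\lambda = d/2$, recovering the BIC; for singular models $\lambda \le d/2$, so the posterior favours degenerate regions and the model behaves as if it had fewer parameters than it nominally does.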
Why might we care about this? For the moment I am taking it on faith. Since attending ILIAD2 I am relatively more optimistic about the potential for this program of research to go somewhere.
Jesse Hoogland, Neural networks generalize because of this one weird trick:
Statistical learning theory is lying to you: “overparametrized” models actually aren’t overparametrized, and generalisation is not just a question of broad basins.
1 Local Learning Coefficient
Resources recommended to me by Rohan Hitchcock:
- The upshot is that Jesse Hoogland and Stan van Wingerden argue we should care about model complexity, and that the local learning coefficient (Lau et al. 2024) is arguably the correct measure of it.
- The RLCT Measures the Effective Dimension of Neural Networks
I probably have some alpha in estimating this value.
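To make "estimating this value" concrete, here is a minimal toy sketch of the SGLD-based LLC estimator in the spirit of Lau et al. (2024): sample from a tempered posterior localized around a trained parameter $w^*$, then read off $\hat\lambda = n\beta\,(\mathbb{E}_\beta[L_n(w)] - L_n(w^*))$. All function and variable names are mine (this is not the devinterp implementation), and the loss is a toy regular quadratic, for which the theoretical value is $d/2$.

```python
# Hypothetical sketch of an SGLD-based local learning coefficient (LLC)
# estimator, following the shape of the Lau et al. (2024) estimator.
# Names and defaults are my own choices, not a reference implementation.
import numpy as np

def llc_estimate(loss, grad, w_star, n=1000, beta=None, gamma=1.0,
                 eps=1e-4, steps=50000, burn_in=2000, seed=0):
    """Estimate the LLC at w_star:

        lambda_hat = n * beta * (E_beta[loss(w)] - loss(w_star)),

    where the expectation is over the tempered, localized posterior
    exp(-n*beta*loss(w) - gamma/2 * ||w - w_star||^2), sampled by SGLD.
    """
    rng = np.random.default_rng(seed)
    if beta is None:
        beta = 1.0 / np.log(n)  # the "critical" inverse temperature
    w = w_star.copy()
    losses = []
    for t in range(steps):
        # Full-batch SGLD step: gradient drift plus Gaussian noise.
        drift = -0.5 * eps * (n * beta * grad(w) + gamma * (w - w_star))
        w = w + drift + np.sqrt(eps) * rng.standard_normal(w.shape)
        if t >= burn_in:
            losses.append(loss(w))
    return n * beta * (np.mean(losses) - loss(w_star))

# Toy regular loss L(w) = ||w||^2 / 2 in d dimensions: theory predicts
# an LLC of d/2, the classical BIC dimension count.
d = 4
lam = llc_estimate(lambda w: 0.5 * w @ w, lambda w: w, np.zeros(d))
print(f"LLC estimate: {lam:.2f} (regular-model prediction: {d / 2})")
```

On a genuinely singular loss, e.g. $L(w) = w_1^2 w_2^2$, the same estimator should return something below $d/2$, which is the whole point: the LLC is an effective dimension, not a parameter count.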
2 Use in developmental interpretability
I don’t understand this step of the argument, but see Developmental Interpretability by Jesse Hoogland and Stan van Wingerden, and maybe read Lehalleur et al. (2025).
3 Fractal loss landscapes
Notionally there might be a connection to fractal loss landscapes. See Fractal dimension of loss landscapes and self similar behaviour in neural networks.
4 Questions I would like to know how to answer
- Can we use LLC as an optimal design objective, crafting desired optima directly in loss space by altering loss functions and/or architectures dynamically?
- Can we filter to estimate the LLC online? It looks a lot like sparse Kalman filtering — just sayin’.
- The Bayesian formalism. How does the notional “Bayesian” update work in practical neural networks that are not trained by posterior updates? Does it matter? What happens when our training process really pushes the analogy, e.g. when we have synthetic data generation in a neural distillation?
- How biased is LLC estimation for nontrivial networks? In some quick sims of mine, the LLC estimates from SGHMC and SGLD diverged substantially. Is that worrisome? [TODO clarify]
- How would belief propagation work in a Bayesian setting? What even is the natural space of priors generating local landscape geometries?
5 Incoming
From SLT to AIT: NN generalisation out-of-distribution — LessWrong
This post derives an upper bound on the prediction error of Bayesian learning on neural networks. Unlike the bound from vanilla Singular Learning Theory (SLT), this bound also holds for out-of-distribution generalization, not just for in-distribution generalization. Along the way, it shows some connections between SLT and Algorithmic Information Theory (AIT).
The Developmental Interpretability site is mostly about SLT, and it runs a thriving Discord server on the topic.
Alexander Gietelink Oldenziel — Singular Learning Theory
metauni’s Singular Learning Theory seminar
Timaeus is an AI safety research organisation working on applications of Singular Learning Theory (SLT) to alignment.