Estimating the Local Learning Coefficient
Singular Learning Theory’s prodigy
2025-05-29 — 2025-09-23
Wherein the Local Learning Coefficient is probed via a hot, tethered posterior, and a preconditioned SGLD sampler is employed with inverse temperature \(\beta^*=1/\log n\) and a Gaussian localizer \(\gamma\) to measure basin curvature.
I’m trying to estimate the Local Learning Coefficient (\(\lambda(w^*)\)), a helpful measure of effective degrees of freedom from singular learning theory. Canonical works in this domain are Hitchcock and Hoogland (2025) and Lau et al. (2024).
The promise is that the LLC captures something we’ve never had a good handle on: how “complex” or “singular” a solution really is in the geometry of the solution’s loss basin.
- If \(\lambda(w^*)\) is large: the basin is sharp and “thin”; small perturbations push the parameters out of the basin, which suggests the model is effectively more complex and might generalize worse.
- If \(\lambda(w^*)\) is small: the basin is broad and degenerate; the model is simpler in an information-theoretic sense, and SLT predicts better generalization.
There are arguments that suggest LLC is the “right” way of characterizing this singularity, at least with regard to the traditional setting of neural networks viewed as Bayesian posteriors. In practice, that could mean:
- Comparing different minima from the same training run to understand which are “simpler.”1
- Tracking \(\lambda(w^*)\) across architectures, optimizers, or regularizers to see which design choices actually reduce effective complexity.
- Using it as a local diagnostic to interpret why two models with the same training loss generalize differently.
In particular, we suspect that if we can estimate LLC for a given model, it might be useful for gracefully analyzing overparameterized models.
1 Needful formulae
Let \(L_n(w)\) be the empirical negative log‑likelihood (averaged over the \(n\) samples), and \(w^*\) a trained solution. Two closely related expressions drive everything, as far as I can tell:
- Local free energy (around \(w^*\))
\[ F_n(w^*,\gamma)\;=\;-\log\!\int \exp\big(-n\,L_n(w)\big)\;\underbrace{\exp\!\big(-\tfrac{\gamma}{2}\|w-w^*\|^2\big)}_{\text{localizer}}\;dw \;\;=\;\;n\,L_n(w^*)\;+\;\lambda(w^*)\,\log n\;+\;o(\log n). \]
- Operational (WBIC‑style) estimator
We choose the “hot” inverse temperature \(\beta^*=\tfrac{1}{\log n}\). I originally took this choice on faith; the reason is Watanabe’s WBIC: at \(\beta=1/\log n\) the tempered posterior average of \(nL_n\) exceeds \(nL_n(w^*)\) by \(\lambda(w^*)\log n + o(\log n)\), so \(\lambda\) can be read off directly.
\[ p_{\beta,\gamma}(w\mid w^*)\ \propto\ \exp\!\Big(-n\beta\,L_n(w)-\tfrac{\gamma}{2}\|w-w^*\|^2\Big),\qquad \widehat{\lambda}(w^*)\;=\;n\beta^*\Big(\mathbb{E}_{p_{\beta^*,\gamma}}[L_n(w)]-L_n(w^*)\Big). \]
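Plugging Watanabe’s expansion into the estimator makes the bookkeeping explicit (my gloss, not a quote from the papers):
\[ \widehat{\lambda}(w^*)\;=\;n\beta^*\Big(\mathbb{E}_{p_{\beta^*,\gamma}}[L_n(w)]-L_n(w^*)\Big)\;=\;\frac{\lambda(w^*)\log n+o(\log n)}{\log n}\;=\;\lambda(w^*)+o(1). \]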
That’s the whole game: accurately approximate a single local expectation under a tempered, Gaussian‑tethered posterior.
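In code, the bookkeeping is a one‑liner once we have draws of \(L_n(w)\) from the tempered, localized posterior. A minimal sketch (my own names, not from any library):

```python
import numpy as np

def llc_estimate(losses, loss_at_wstar, n):
    """lambda-hat = n * beta* * (mean of sampled L_n(w) - L_n(w*)), with beta* = 1/log n.

    `losses` holds values of L_n(w) recorded along the chain after burn-in;
    `loss_at_wstar` is L_n(w*) at the trained solution.
    """
    beta_star = 1.0 / np.log(n)
    return n * beta_star * (np.mean(losses) - loss_at_wstar)
```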
2 HMC/MALA is hard at scale
I love HMC in moderate (say \(10^4\)) dimensions, but for LLC estimation it runs into difficulties:
- Full‑batch accepts. Any Metropolis‑corrected method needs accurate energy differences. In deep learning, that’s a full pass over the dataset (or huge batches) per proposal. With millions of parameters, even a modest trajectory is prohibitively expensive.
- Local restriction. We don’t want the global posterior; we need a local one (the Gaussian “tether” around \(w^*\)). I first worried that enforcing locality inside standard HMC would be finicky: reject‑heavy, or requiring hand‑crafted reflective or soft constraints that complicate tuning and implementation. On reflection, that worry was misplaced: the Gaussian localizer is part of the log‑density, so it’s just another term in the gradient and needs no special treatment.
- Singular geometry. Around minima we expect the directions to be wildly anisotropic: many nearly flat, some very stiff. Intuitively, this should mix poorly, since the sampler spends a long time exploring the “wrong” directions. Note, however, that the same is true of SGLD, which seems to work surprisingly well, so my intuition here may be off.
3 SGLD
In practice, the workhorse is Stochastic Gradient Langevin Dynamics (SGLD). It looks almost like SGD, but with cleverly chosen Gaussian noise:
\[ w_{t+1} = w_t - \tfrac{\epsilon}{2}\,n\,\widehat{\nabla} L_n(w_t)\;+\;\sqrt{\epsilon}\,\eta_t,\quad \eta_t \sim \mathcal N(0,I). \]
- The first term is the usual gradient step, computed on a minibatch, so \(\widehat{\nabla} L_n\) is noisy; the factor of \(n\) appears because the target involves \(nL_n\) while \(L_n\) is an average.
- The second term is Gaussian noise, scaled to match the discretization of Langevin diffusion.
- Over many steps, the iterates approximate samples from a posterior distribution proportional to \(\exp(-nL_n(w))\).
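For concreteness, here is a minimal sketch of the plain kernel on a toy least‑squares problem (the toy data, names, and constants are mine, purely for illustration); note the factor of \(n\) on the minibatch gradient, which is what makes the chain target \(\exp(-nL_n(w))\) when \(L_n\) is an average:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5_000, 20
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def minibatch_grad(w, batch_size=256):
    """Unbiased estimate of grad L_n(w), where L_n is the mean squared error."""
    idx = rng.integers(0, n, size=batch_size)
    Xb, yb = X[idx], y[idx]
    return 2.0 * Xb.T @ (Xb @ w - yb) / batch_size

def sgld_step(w, eps):
    """One unadjusted Langevin step targeting (approximately) exp(-n * L_n(w))."""
    drift = 0.5 * eps * n * minibatch_grad(w)
    noise = np.sqrt(eps) * rng.normal(size=w.shape)
    return w - drift + noise
```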
Recall that for LLC estimation, we tweak two things:
- Tempering: we don’t want the true posterior; we want a hot version with inverse temperature \(\beta^*=1/\log n\). That just means multiplying the gradient term by \(\beta^*\).
- Localization: we add a quadratic regularizer pulling the chain towards the trained solution \(w^*\). That introduces an extra drift term \(-\tfrac{\epsilon}{2}\gamma(w_t-w^*)\). It looks like a Gaussian prior centred at \(w^*\), but it isn’t part of the model: it’s an artificial term we add so that the chain (and the free‑energy expansion) stays local.
So the update is:
\[ w_{t+1} = w_t - \tfrac{\epsilon}{2}\,\big(n\beta^*\,\widehat{\nabla} L_n(w_t) + \gamma(w_t-w^*)\big)\;+\;\sqrt{\epsilon}\,\eta_t. \]
That’s the SGLD kernel used in the LLC papers.
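Putting the pieces together, here is a runnable sketch of the whole estimator on the same kind of toy problem (everything here is my own stand‑in: `full_loss` and `minibatch_grad` would wrap the real model’s loss, and the hyperparameters are illustrative, not recommendations):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5_000, 20
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def full_loss(w):
    return np.mean((X @ w - y) ** 2)            # L_n(w), the average loss

def minibatch_grad(w, batch_size=256):
    idx = rng.integers(0, n, size=batch_size)
    Xb, yb = X[idx], y[idx]
    return 2.0 * Xb.T @ (Xb @ w - yb) / batch_size

def estimate_llc(w_star, eps=1e-5, gamma=100.0, steps=5_000, burn_in=1_000):
    beta_star = 1.0 / np.log(n)
    w = w_star.copy()
    losses = []
    for t in range(steps):
        drift = n * beta_star * minibatch_grad(w) + gamma * (w - w_star)
        w = w - 0.5 * eps * drift + np.sqrt(eps) * rng.normal(size=d)
        if t >= burn_in:
            losses.append(full_loss(w))          # in practice a minibatch loss is usually recorded
    return n * beta_star * (np.mean(losses) - full_loss(w_star))

w_star = np.linalg.lstsq(X, y, rcond=None)[0]    # stand-in for a trained solution
# This toy model is regular, so lambda should come out near d/2 = 10
# (the localizer gamma pulls the estimate down a little).
print(estimate_llc(w_star))
```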
3.1 Preconditioned SGLD
One limitation is that standard SGLD makes isotropic moves: the Gaussian noise is spherical, and the gradient step doesn’t account for very different curvature across directions. In singular models, this means we take tiny steps in stiff directions and explore slowly in flat ones.
A simple, affordable fix is preconditioning. Introduce a positive-definite matrix \(A\); think of Adam-style second-moment (quasi-curvature) estimates, which provide a diagonal approximation to \(A\). The update becomes:
\[ w_{t+1} = w_t - \tfrac{\epsilon}{2}\,A\big(n\beta^*\,\widehat{\nabla} L_n(w_t) + \gamma(w_t-w^*)\big)\;+\;\xi_t,\quad \xi_t \sim \mathcal N(0,\,\epsilon A). \]
Now both the drift and the noise are scaled by \(A\). Intuitively, steps are larger in flat directions, smaller in sharp ones, and the noise is stretched accordingly. This respects the desired invariant distribution (for constant \(A\)), makes exploration more efficient, and preserves invariances like weight rescaling in ReLU nets.
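A sketch of what that step can look like with an RMSProp/Adam-style diagonal preconditioner (my own illustration; `minibatch_grad`, `w_star`, and the constants are assumed stand-ins, and real implementations differ in how they estimate, damp, and freeze \(A\)):

```python
import numpy as np

def precond_sgld_step(w, w_star, minibatch_grad, v, rng,
                      eps=1e-5, gamma=100.0, n_beta=500.0,
                      rho=0.999, delta=1e-8):
    """One preconditioned SGLD step with diagonal A = 1 / (sqrt(v) + delta).

    v is an exponential moving average of squared minibatch gradients
    (Adam-style quasi-curvature); n_beta stands for n * beta*. Both the
    drift and the injected noise are scaled by A, so that for a (roughly)
    constant A the tempered local posterior remains the invariant law.
    """
    g = minibatch_grad(w)
    v = rho * v + (1.0 - rho) * g ** 2                    # quasi-curvature estimate
    A = 1.0 / (np.sqrt(v) + delta)                        # diagonal preconditioner
    drift = n_beta * g + gamma * (w - w_star)
    noise = np.sqrt(eps * A) * rng.normal(size=w.shape)   # N(0, eps * A), diagonal A
    # In practice A is often estimated during a warm-up and then frozen;
    # letting it vary every step adds extra bias beyond the discretization error.
    return w - 0.5 * eps * A * drift + noise, v
```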
On paper, unadjusted Langevin with minibatches should make us nervous. In practice, SGLD tends to work because the LLC setup stacks the deck in our favour:
- It’s local. The quadratic \(\tfrac{\gamma}{2}\|w-w^*\|^2\) keeps the chain near \(w^*\). We don’t need to cross modes or map the whole landscape.
- It’s hot. \(\beta^*=1/\log n\) is small for big \(n\). Energy barriers are softened; local exploration is easier.
- It’s cheap. Minibatch gradients mean each step costs about an SGD step, so long chains are feasible.
- It’s tolerant. We only need \(\mathbb{E}[L_n(w)]\) to be accurate under the local law. We don’t require immaculate global mixing.
Intuitively, I think of the estimator as “poke the basin around \(w^*\) and see how quickly the loss rises.” If the basin is broad (singular/degenerate), the average loss near \(w^*\) barely increases and \(\widehat{\lambda}\) is small; if it’s sharp, it rises faster and \(\widehat{\lambda}\) is larger. SGLD seems smart at doing the poking, at least in the models I’ve tried.
It can be shown to work in practice for at least one model family, deep linear networks, where we have analytic ground truth to test the estimates against. And indeed the SGLD-based estimator matches the ground truth beautifully, even at large scale (hundreds of millions of parameters, datasets we can’t fully load into memory).
That’s surprising to me, and impressive. Still, I think we have to be careful not to oversell it. DLNs are special: their singularities are structured in a way we can analyze, and the loss surface has symmetries that play nicely with the Gaussian driving noise.
In contrast, in nonlinear nets—with ReLU, GeLU, and friends—we don’t have an analytic ground truth for \(\lambda(w^*)\). All we can do is compare different estimators (SGLD vs MALA vs diagnostics), but those estimators may share biases. It could be that SGLD is “working great” in DLNs precisely because they are unusually well-behaved, and that in truly nonlinear models the same algorithm is only approximately right.
So while the DLN results give me confidence that the estimator isn’t nonsense, they shouldn’t be read as a general proof of correctness in deep nonlinear nets. This gap—we can’t validate LLC estimates in the models we actually care about—is, to me, concerning.
3.2 SGLD shortcomings
Note there is unintuitive (to me) stuff going on here, even in tractable deep linear networks. The “landscape” has millions of dimensions, so “poking” in the right directions is hard, as it always is in high-dimensional models. There are various other things we might worry about:
- No universal guarantees in singular models. The standard SGLD convergence conditions (vanishing step sizes, Lyapunov tails) often fail for deep linear and modern nets. There are even clean examples where unadjusted Langevin diverges on light‑tailed targets.
- Discretization bias. With constant step sizes (what we actually use), we need strong diagnostics to keep the bias small. In practice we rely on a MALA‑style acceptance‑probability diagnostic (not a real accept step; see the sketch after this list) to tune the step size, and on long burn‑in until the minibatch‑loss trace flattens.
- Anisotropy hurts. Flat/stiff directions force tiny steps. Preconditioning (a fixed, Fisher‑like diagonal) helps, but is it enough? In the models we care about the local curvature matrix is degenerate rather than positive definite, so even a good diagonal preconditioner won’t fully tame the geometry.
- Hyperparameters matter. The localizer strength \(\gamma\) is a knob: too big and you drown the likelihood geometry; too small and the chain can drift, even producing negative \(\widehat{\lambda}\).
- Finite‑\(n\) wrinkles. We evaluate at \(w^*\), which was trained on the same data we use inside the estimator. It works well empirically, but it’s not a theorem‑grade guarantee.
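Here is what I mean by the acceptance-probability diagnostic: run the usual Langevin proposal, compute the probability a MALA step would have accepted it, log that number, and tune \(\epsilon\) so it stays high, without ever actually rejecting. A minimal sketch (my own illustration, assuming access to the log-density of the tempered local posterior and its gradient, possibly estimated on a large batch):

```python
import numpy as np

def mala_accept_prob(w, eps, log_post, grad_log_post, rng):
    """Acceptance probability a MALA step *would* have used, as a step-size diagnostic.

    Here log_post(w) = -n * beta* * L_n(w) - (gamma / 2) * ||w - w*||^2,
    up to an additive constant, and grad_log_post is its gradient.
    """
    def log_q(to, frm):
        # Log-density (up to a constant) of the Langevin proposal
        # N(frm + (eps/2) * grad_log_post(frm), eps * I), evaluated at `to`.
        mean = frm + 0.5 * eps * grad_log_post(frm)
        return -np.sum((to - mean) ** 2) / (2.0 * eps)

    w_prop = w + 0.5 * eps * grad_log_post(w) + np.sqrt(eps) * rng.normal(size=w.shape)
    log_ratio = (log_post(w_prop) + log_q(w, w_prop)) - (log_post(w) + log_q(w_prop, w))
    return float(np.exp(min(0.0, log_ratio)))
```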
4 Tempting alternatives I tried that don’t seem to work
I’ve kicked the tires on several ideas that sounded like they should help. They did not.
4.1 Momentum SGMCMC (SGHMC / SGNHT)
Why I hoped it’d help. Momentum should cruise along flat manifolds and mix faster.
What goes wrong. To be correct with minibatch gradients, SGHMC needs a friction/noise correction matched to the covariance of the gradient noise. In deep nets that noise can be heavy‑tailed and state‑dependent, so the correction is ill‑defined or expensive to estimate, and mis‑tuning yields a different stationary law. Net effect: we risk biasing the expectation that defines \(\widehat{\lambda}\), while adding tricky hyperparameters.
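For reference, here is the shape of the SGHMC update (in the style of Chen, Fox, and Guestrin, 2014) and the place where the correction bites; all names are mine and this is a sketch, not a recommendation:

```python
import numpy as np

def sghmc_step(w, v, grad_hat, C, B_hat, eps, rng):
    """One SGHMC step with diagonal friction C and noise estimate B_hat.

    grad_hat(w) is the stochastic gradient of the tempered, localized
    potential (n * beta* * L_n plus the localizer term); B_hat estimates
    the contribution of minibatch gradient noise. The injected noise must
    have covariance 2 * (C - B_hat) * eps: if B_hat is misestimated (easy
    when the gradient noise is heavy-tailed and state-dependent), the chain
    has a different stationary law, biasing the expectation behind lambda-hat.
    """
    noise_var = np.maximum(2.0 * (C - B_hat) * eps, 0.0)   # must stay non-negative
    v = v - eps * grad_hat(w) - eps * C * v + np.sqrt(noise_var) * rng.normal(size=w.shape)
    w = w + eps * v
    return w, v
```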
4.2 Twisted / optimally‑guided SMC
Why I hoped it’d help. SMC gives ESS diagnostics, tempering ladders, and MH‑corrected move steps (MALA/HMC) that avoid unadjusted‑Langevin pathologies.
What goes wrong. In very high dimension, we need excellent “twists” to avoid particle collapse. Twisted SMC shines when the problem structure hands us good twists, and nothing obvious does that for arbitrary SLT models. Further, the moment we put accept/reject MALA/HMC moves inside SMC, we are back to near full‑batch costs per particle. We can do it for small models (and it’s a good sanity check), but it’s not a scalable replacement.
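The collapse is visible in the effective sample size of the particle weights, which is the diagnostic I watch; a trivial sketch (my notation):

```python
import numpy as np

def effective_sample_size(log_weights):
    """ESS = (sum w)^2 / sum(w^2); it collapses toward 1 when one particle dominates."""
    w = np.exp(log_weights - np.max(log_weights))   # stabilize before exponentiating
    return float((w.sum() ** 2) / (w ** 2).sum())
```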
4.3 Non‑Gaussian / Lévy noise (“spikier” SGLD)
Why I hoped it’d help. Heavy‑tailed jumps might traverse singular valleys faster.
What goes wrong. Swap Gaussian noise for α‑stable jumps and, unless we also change the drift to the fractional counterpart and/or add MH correction, the invariant law is not the tempered local posterior we need. Add MH to fix it and we pay full‑data accepts again.
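To make the failure concrete, the naive swap looks like this (a sketch, assuming SciPy’s `levy_stable` for the α‑stable draws; `grad_hat` is again the stochastic gradient of the tempered, localized potential). With `alpha < 2` the chain no longer has the tempered local posterior as its invariant law, which is exactly the problem:

```python
import numpy as np
from scipy.stats import levy_stable

def spiky_sgld_step(w, grad_hat, eps, alpha, rng):
    """SGLD-like step with symmetric alpha-stable jumps instead of Gaussian noise.

    For alpha < 2 this is *not* a sampler for the tempered local posterior:
    the heavy-tailed driving noise corresponds to a nonlocal (fractional)
    generator, so the drift below no longer matches the target, and fixing
    that needs either a fractional drift or a Metropolis correction.
    """
    jump = levy_stable.rvs(alpha, 0.0, scale=eps ** (1.0 / alpha),
                           size=w.shape, random_state=rng)
    return w - 0.5 * eps * grad_hat(w) + jump
```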
5 Where next
- The LLC estimator is local, hot, and tethered. Any method that respects those three features and keeps per‑step cost ≈ SGD can work. Right now, (preconditioned) SGLD seems to have a chance.
- Metropolis‑corrected moves are great, but at scale they aren’t affordable. If we reintroduce full‑batch accepts (via HMC, heavy‑tailed MH, or SMC move steps), we lose the cheap per‑step cost that makes LLC estimation feasible.
- Changing the noise law doesn’t buy correctness. Lévy‑driven tricks either target the wrong distribution or demand accepts; neither is attractive here.
- Real open problems: provable control of discretization bias for constant‑stepsize, preconditioned SGLD in singular models; principled selection of the localizer strength \(\gamma\); and geometry‑aware preconditioners that preserve the estimator’s invariances without expensive curvature estimation.
6 Potential angles of attack
TODO: write these out so they make sense to other people.
- Transdimensional Hironaka moves
- Basin-friendly preconditioning estimators
- Online updates
7 Incoming
Connection to Muon and other normaliser approaches which give us “free” preconditioning and other nice things.
Is Microcanonical LMC going to do better?
honglu2875/hironaka: A utility package for Hironaka game of local resolution of singularities
Eleuther’s local volume measurement looks connected: Research Update: Applications of Local Volume Measurement | EleutherAI Blog
8 References
Footnotes
1. We need to be careful about what actually counts as a minimum in this setting: did we even converge, and what would that mean?