Estimating the Local Learning Coefficient
2025-05-29 — 2025-09-09
I’m trying to estimate the Local Learning Coefficient (\(\lambda(w^*)\)), a helpful degrees-of-freedom metric from singular learning theory. Canonical works in this domain are Hitchcock and Hoogland (2025) and Lau et al. (2024).
The promise is that the LLC captures something we’ve never had a good handle on: how “complex” or “singular” a solution really is, in the geometry of its loss basin.
- If \(\lambda(w^*)\) is large: the basin is sharp and “thin”; small perturbations push you out, which suggests the model is effectively more complex and might generalize worse.
- If \(\lambda(w^*)\) is small: the basin is broad and degenerate; the model is simpler in an information-theoretic sense, and SLT predicts better generalization.
There are arguments that suggest the LLC is the “right” way of characterising this singular-ness, at least with regard to the traditional setting of neural networks regarded as Bayesian posteriors. In practice, that could mean:
- Comparing different minima from the same training run to understand which are “simpler.”1
- Tracking \(\lambda(w^*)\) across architectures, optimizers, or regularizers, as a way to see which design choices really reduce effective complexity.
- Using it as a local diagnostic to interpret why two models with the same training loss generalize differently.
In particular, we suspect that if we can estimate the LLC for a given model, it might do something useful with regard to gracefully analyzing overparameterized models.
1 Needful formulae
Let \(L_n(w)\) be the empirical negative log‑likelihood and \(w^*\) a trained solution. Two equivalent forms drive everything AFAICT:
- Local free energy (around \(w^*\))
\[ F_n(w^*,\gamma)\;=\;-\log\!\int \exp\big(-n\,L_n(w)\big)\;\underbrace{\exp\!\big(-\tfrac{\gamma}{2}\|w-w^*\|^2\big)}_{\text{localizer}}\;dw \;\;=\;\;n\,L_n(w^*)\;+\;\lambda(w^*)\,\log n\;+\;o(\log n). \]
- Operational (WBIC‑style) estimator
We choose the “hot” inverse temperature \(\beta^*=\tfrac{1}{\log n}\) (TODO: I’m not sure why that temperature exactly; I take this part on faith).
\[ p_{\beta,\gamma}(w\mid w^*)\ \propto\ \exp\!\Big(-n\beta\,L_n(w)-\tfrac{\gamma}{2}\|w-w^*\|^2\Big),\qquad \widehat{\lambda}(w^*)\;=\;n\beta^*\Big(\mathbb{E}_{p_{\beta^*,\gamma}}[L_n(w)]-L_n(w^*)\Big). \]
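For what it’s worth, the heuristic I’ve seen for that choice (and have not verified carefully) goes like this: differentiating the local free energy with respect to \(\beta\) gives the expected energy under \(p_{\beta,\gamma}\), and the \(\lambda\log n\) expansion above, run at inverse temperature \(\beta\), then says
\[ \mathbb{E}_{p_{\beta,\gamma}}\big[n\,L_n(w)\big]\;\approx\;n\,L_n(w^*)\;+\;\frac{\lambda(w^*)}{\beta}, \]
so at \(\beta^*=1/\log n\) the excess expected energy is \(\lambda(w^*)\log n\), and the estimator is just solving for \(\lambda(w^*)\).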
That’s the whole game: accurately approximate one local expectation under a tempered, Gaussian‑tethered posterior.
2 HMC/MALA is hard at scale
I love HMC in moderate dimension (say \(10^4\)), but for LLC it runs into difficulties:
- Full‑batch accepts. Any Metropolis‑corrected method needs accurate energy differences. In deep learning that’s a full pass over the dataset (or huge batches) per proposal. With millions of parameters, even a modest trajectory is prohibitively expensive.
- Local restriction. We don’t want the global posterior; we need a local one (that Gaussian “tether” around \(w^*\)). Enforcing locality inside standard HMC is finicky: reject‑heavy, or we have to hand‑craft reflective/soft constraints that complicate tuning and implementation.
- Singular geometry. Around true minima, directions are wildly anisotropic—many nearly flat, some very stiff. You either shrink HMC steps until mixing dies, or accept nasty discretization bias. Preconditioned, accept/reject HMC can work on small problems, but the per‑step cost kills it at the data scales we care about.
3 SGLD
The workhorse in practice is Stochastic Gradient Langevin Dynamics (SGLD). It looks almost like SGD, but with cunningly designed Gaussian noise:
\[ w_{t+1} = w_t - \tfrac{\epsilon}{2}\,n\,\widehat{\nabla} L_n(w_t)\;+\;\sqrt{\epsilon}\,\eta_t,\quad \eta_t \sim \mathcal N(0,I). \]
- The first term is essentially the usual gradient step, scaled by \(n\) so that the target is \(\exp(-nL_n)\) (and computed on a minibatch, so \(\widehat{\nabla} L_n\) is noisy).
- The second term is Gaussian noise, scaled to match the discretization of Langevin diffusion.
- Over many steps, the iterates approximate samples from a posterior distribution proportional to \(\exp(-nL_n(w))\).
For LLC estimation, we tweak two things:
- Tempering: we don’t want the true posterior, we want a hot version at inverse temperature \(\beta^*=1/\log n\). That just means multiplying the gradient term by \(\beta^*\).
- Localization: we add a quadratic regularization pulling the chain toward the trained solution \(w^*\). That introduces an extra drift term \(-\tfrac{\epsilon}{2}\gamma(w_t-w^*)\). It looks a lot like a Gaussian prior, but it isn’t one: it’s a localizer we impose by hand on the tempered posterior, not part of the model.
So the update is:
\[ w_{t+1} = w_t - \tfrac{\epsilon}{2}\,\big(n\beta^*\,\widehat{\nabla} L_n(w_t) + \gamma(w_t-w^*)\big)\;+\;\sqrt{\epsilon}\,\eta_t. \]
That’s the SGLD kernel used in the LLC papers.
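To make that concrete, here is a minimal NumPy sketch of the whole recipe. The function names, default hyperparameters, and the minibatch plumbing are placeholders of mine, not the settings from the papers:

```python
import numpy as np

def estimate_llc_sgld(w_star, minibatch_grad, minibatch_loss, n,
                      eps=1e-4, gamma=100.0, n_steps=5_000, n_burnin=1_000,
                      rng=None):
    """Sketch of the SGLD-based LLC estimator described above.

    w_star         : trained parameters as a 1-D array (the tether point).
    minibatch_grad : callable w -> noisy estimate of grad L_n(w) (mean loss).
    minibatch_loss : callable w -> noisy estimate of L_n(w) (mean loss).
    n              : training-set size, which sets beta* = 1/log(n).
    """
    rng = np.random.default_rng() if rng is None else rng
    beta_star = 1.0 / np.log(n)                    # the "hot" inverse temperature
    w = w_star.copy()
    losses = []

    for t in range(n_steps):
        # Drift of the tempered, localized posterior: n*beta* grad L_n + gamma*(w - w*)
        drift = n * beta_star * minibatch_grad(w) + gamma * (w - w_star)
        w = w - 0.5 * eps * drift + np.sqrt(eps) * rng.standard_normal(w.shape)
        if t >= n_burnin:
            losses.append(minibatch_loss(w))       # samples of L_n under the local law

    L_star = minibatch_loss(w_star)                # ideally a full-batch evaluation
    return n * beta_star * (np.mean(losses) - L_star)
```

In practice you would evaluate \(L_n(w^*)\) on the full dataset (or a large batch), watch the loss trace for burn-in, and average over several chains, but the skeleton really is this small.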
3.1 Preconditioned SGLD
One limitation is that standard SGLD makes isotropic moves: the Gaussian noise is spherical, and the gradient step doesn’t account for wildly different curvature across directions. In singular models, this means tiny steps in stiff directions and slow exploration in flat ones.
An affordable fix is preconditioning. Introduce a positive-definite matrix \(A\) (think of Adam’s quasi-curvature estimates, which give us a diagonal approximation to \(A\)). The update becomes:
\[ w_{t+1} = w_t - \tfrac{\epsilon}{2}\,A\big(n\beta^*\,\widehat{\nabla} L_n(w_t) + \gamma(w_t-w^*)\big)\;+\;\xi_t,\quad \xi_t \sim \mathcal N(0,\,\epsilon A). \]
Now both the drift and the noise are scaled by \(A\). Intuitively: steps are larger in flat directions, smaller in sharp ones, and the noise is stretched accordingly. This respects the desired invariant distribution (for constant \(A\)), while making exploration more efficient and respecting invariances like weight rescaling in ReLU nets.
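Here is a minimal sketch of one such step with a diagonal \(A\), using an RMSProp-style running mean of squared gradients as the curvature proxy; that particular choice (and all the names and defaults) is mine, for illustration only:

```python
import numpy as np

def preconditioned_sgld_step(w, w_star, grad_hat, v, n, beta_star,
                             eps=1e-4, gamma=100.0, rho=0.99, delta=1e-8,
                             rng=None):
    """One preconditioned SGLD step with a diagonal A (sketch only).

    v is an RMSProp-style running mean of squared gradients; A = 1/(sqrt(v)+delta)
    scales both the drift and the injected noise, as in the update above.
    Strictly, the invariant-distribution argument assumes A is held fixed;
    re-estimating v on the fly adds a bias we are choosing to live with.
    """
    rng = np.random.default_rng() if rng is None else rng
    v = rho * v + (1.0 - rho) * grad_hat ** 2      # cheap diagonal curvature proxy
    A = 1.0 / (np.sqrt(v) + delta)                 # preconditioner (elementwise)
    drift = n * beta_star * grad_hat + gamma * (w - w_star)
    noise = np.sqrt(eps * A) * rng.standard_normal(w.shape)   # ~ N(0, eps*A), A diagonal
    return w - 0.5 * eps * A * drift + noise, v
```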
On paper, unadjusted Langevin with minibatches should make us nervous. In practice, SGLD tends to work because the LLC setup stacks the deck in our favour:
- It’s local. The quadratic \(\tfrac{\gamma}{2}\|w-w^*\|^2\) keeps the chain near \(w^*\). We don’t need to cross modes or map the whole landscape.
- It’s hot. \(\beta^*=1/\log n\) is small for big \(n\). Energy barriers are softened; local exploration is easier.
- It’s cheap. Minibatch gradients mean each step costs about an SGD step, so long chains are feasible.
- It’s tolerant. We only need \(\mathbb{E}[L_n(w)]\) accurately under that local law. We don’t require immaculate global mixing.
Intuitively: I think of the estimator as “poke the basin around \(w^*\) and see how quickly loss rises.” If the basin is broad (singular/degenerate), the average loss near \(w^*\) barely increases and \(\widehat{\lambda}\) is small; if it’s sharp, it rises faster and \(\widehat{\lambda}\) is larger. SGLD is a convenient way to do that poking—locally, cheaply.
It seems to work in practice for at least one model family, the deep linear network, where we can test the estimates against ground truth; and indeed the SGLD-based estimator matches it beautifully, even at crazy scales (hundreds of millions of parameters).
That’s surprising, to me at least. And impressive. I think we have to be careful not to over-sell it. DLNs are special: their singularities are structured in a way we can analyze, and the loss surface has symmetries that play nicely with the Gaussian driving noise.
In contrast, in nonlinear nets—with ReLU, GeLU, and friends—we don’t have an analytic ground truth for \(\lambda(w^*)\). All we can do is compare different estimators (SGLD vs MALA vs diagnostics), but those estimators may share biases. It could be that SGLD is “working great” in DLNs precisely because they are unusually well-behaved, and that in truly nonlinear models the same algorithm is only approximately right.
So while the DLN results give me confidence that the estimator isn’t nonsense, they shouldn’t be read as a general proof of correctness in deep nonlinear nets. This gap—we can’t validate LLC estimates in the models we actually care about—is, to me, concerning.
Watch this space.
3.2 SGLD shortcomings
Note there is unintuitive (to me) stuff going on here, even in tractable deep linear networks. The “landscape” has millions of dimensions, so “poking” in the right directions is hard, as it always is in high-dimensional models. There are various other things we might worry about:
- No universal guarantees in singular models. The standard SGLD convergence conditions (vanishing step sizes, Lyapunov tails) often fail for deep linear and modern nets. There are even clean examples where unadjusted Langevin diverges on light‑tailed targets.
- Discretization bias. With constant step sizes (what we actually use), you need strong diagnostics to keep bias small. In practice I rely on a MALA‑style acceptance‑probability diagnostic (not a real accept step; see the sketch after this list) to tune the step size, and on long burn‑in until the minibatch‑loss trace flattens.
- Anisotropy hurts. Flat/stiff directions force tiny steps. Preconditioning (fixed diagonal, Fisher‑like) helps, but there’s no magic.
- Hyperparameters matter. The localizer strength \(\gamma\) is a knob: too big and you drown the likelihood geometry; too small and the chain can drift, even producing negative \(\widehat{\lambda}\).
- Finite‑\(n\) wrinkles. We evaluate at \(w^*\) trained on the same data we use inside the estimator. It works well empirically, but it’s not a theorem‑grade guarantee.
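For concreteness, here is roughly what I mean by that MALA-style diagnostic: compute the Metropolis acceptance probability the Langevin move would have had, log it, and shrink \(\epsilon\) if it sags, but never actually reject. The function below is a sketch with hypothetical names, written against the tempered, localized target defined earlier:

```python
import numpy as np

def mala_acceptance_diag(w, w_prop, grad_w, grad_prop, loss_w, loss_prop,
                         w_star, n, beta_star, gamma, eps):
    """MALA-style acceptance probability for one SGLD move (diagnostic only).

    We never reject; we just track this number over the chain and shrink eps
    if it sags well below 1. Everything here uses minibatch losses/gradients,
    so the diagnostic is itself noisy -- a heuristic, not an exact MH test.
    """
    def U(loss, w_):
        # Negative log of the tempered, localized target (up to a constant).
        return n * beta_star * loss + 0.5 * gamma * np.sum((w_ - w_star) ** 2)

    def log_q(w_to, w_from, grad_from):
        # Log density (up to a constant) of the Langevin proposal
        # N(w_from - (eps/2) * grad_U(w_from), eps * I).
        grad_U = n * beta_star * grad_from + gamma * (w_from - w_star)
        mean = w_from - 0.5 * eps * grad_U
        return -np.sum((w_to - mean) ** 2) / (2.0 * eps)

    log_alpha = (U(loss_w, w) - U(loss_prop, w_prop)
                 + log_q(w, w_prop, grad_prop)
                 - log_q(w_prop, w, grad_w))
    return float(np.exp(min(0.0, log_alpha)))
```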
4 Tempting alternatives I tried that don’t seem to work
I’ve kicked the tires on several ideas that sounded like they should help. They did not.
4.1 Momentum SGMCMC (SGHMC / SGNHT)
Why I hoped it’d help. Momentum should cruise along flat manifolds and mix faster.
What goes wrong. To be correct with minibatch gradients, SGHMC needs a friction/noise correction matched to the covariance of the gradient noise. In deep nets that noise can be heavy‑tailed and state‑dependent, so the correction is ill‑defined or expensive to estimate, and mis‑tuning yields a different stationary law. Net effect: we risk biasing the expectation that defines \(\widehat{\lambda}\), while adding tricky hyperparameters.
4.2 Twisted / optimally‑guided SMC
Why I hoped it’d help. SMC gives ESS diagnostics, tempering ladders, and MH‑corrected move steps (MALA/HMC) that avoid unadjusted‑Langevin pathologies.
What goes wrong. In very high dimension, we need excellent “twists” to avoid particle collapse. Twisted SMC works when there is a trick that gives us good twists. Good twists do not seem obvious for arbitrary SLT models. Further, the moment we put accept/reject MALA/HMC inside SMC, we are back to near full‑batch costs per particle. We can do it for small models (and it’s a good sanity check), but it’s not a scalable replacement.
4.3 Non‑Gaussian / Lévy noise (“spikier” SGLD)
Why I hoped it’d help. Heavy‑tailed jumps might traverse singular valleys faster.
What goes wrong. Swap Gaussian noise for α‑stable jumps and, unless we also change the drift to the fractional counterpart and/or add MH correction, the invariant law is not the tempered local posterior we need. Add MH to fix it and you pay full‑data accepts again. Net: either wrong target or wrong computational scale.
5 Where next
- The LLC estimator is local, hot, and tethered. Any method that respects those three features and keeps per‑step cost ≈ SGD can work. Right now, (preconditioned) SGLD seems to have a chance.
- Metropolis‑corrected moves are great…but at scale they’re a non‑starter. If we reintroduce full‑batch accepts (via HMC, heavy‑tailed MH, or SMC move steps), we lose the scalability that makes LLC estimation feasible.
- Changing the noise law doesn’t buy correctness. Lévy‑driven tricks either target the wrong distribution or demand accepts; neither is attractive here.
- Momentum isn’t a free lunch. The friction/noise calibration for SGHMC is the killer in singular, heavy‑tailed gradient regimes; without it we bias \(\widehat{\lambda}\).
- Real open problems: provable control of discretization bias for constant‑stepsize, preconditioned SGLD in singular models; principled selection of the localizer strength \(\gamma\); and geometry‑aware preconditioners that preserve the estimator’s invariances without expensive curvature estimation.
6 Potential angles of attack
TODO: write these out so they make sense to other people.
- Transdimensional Hironaka moves
- Basin-friendly preconditioning estimators
- Online updates?
7 Incoming
Connection to Muon and other normaliser approaches which get us “free” preconditioning and other nice things.
Is Microcanonical LMC going to do better?
timaeus-research/devinterp: Tools for studying developmental interpretability in neural networks.
honglu2875/hironaka: A utility package for Hironaka game of local resolution of singularities
Eleuther’s local volume measurement looks connected: Research Update: Applications of Local Volume Measurement | EleutherAI Blog
8 References
Footnotes
We need to be careful about what counts as a minimum in this setting. Did we even converge? What does that even mean?