Degrees of freedom in NNs

Information criteria at scale

2025-06-25 — 2025-08-19

estimator distribution
information
likelihood
model selection
statistics

In classical statistics there are families of model complexity estimates, loosely and collectively referred to as the “degrees of freedom” of a model. Neither computationally nor practically do these scale up to overparameterized NNs, so other tools are used instead.
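
For orientation, the most elementary member of that family is the effective degrees of freedom of a linear smoother; a textbook recap in my own notation, not taken from any of the references here:

```latex
% Effective degrees of freedom of a linear smoother \hat{y} = H y:
% the trace of the hat matrix. For OLS with an n x p design matrix X
% this recovers the raw parameter count p.
\mathrm{df} = \sum_{i=1}^{n} \frac{\partial \hat{y}_i}{\partial y_i}
            = \operatorname{tr}(H),
\qquad
\mathrm{df}_{\mathrm{OLS}} = \operatorname{tr}\!\bigl(X (X^\top X)^{-1} X^\top\bigr) = p .
```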

Exception: Shoham, Mor-Yosef, and Avron (2025) argue for a connection to the Takeuchi Information Criterion.
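
Loosely, the TIC swaps the AIC's raw parameter count for a trace term that behaves like an effective number of parameters under model misspecification; the standard statement (my gloss, not that paper's notation):

```latex
% Takeuchi Information Criterion: AIC with the parameter count k
% replaced by tr(J^{-1} I), where J is the expected Hessian of the
% negative log-likelihood and I the covariance of the score.
% If the model is correctly specified, I = J and the penalty is k.
\mathrm{TIC} = -2\,\ell_n(\hat\theta) + 2\operatorname{tr}\!\bigl(J(\hat\theta)^{-1} I(\hat\theta)\bigr)
```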

These alternative complexity measures end up being popular in developmental interpretability.

1 Learning coefficient

AFAICT the major output of singular learning theory is one particular estimate of model effective dimensionality, the (local) learning coefficient (Lau et al. 2024; Hitchcock and Hoogland 2025).
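
Roughly, and with notation that may not match those papers: the learning coefficient λ is the coefficient on the log n term in Watanabe's asymptotic expansion of the Bayesian free energy, and the local variant is estimated from tempered-posterior samples localized around a trained parameter w*, e.g. via SGLD:

```latex
% Free-energy asymptotics: lambda plays the role that (number of
% parameters)/2 plays for regular models, but can be much smaller
% for singular models such as NNs.
F_n \approx n L_n(w_0) + \lambda \log n

% Local learning coefficient at a trained parameter w^*, estimated
% from samples of a tempered posterior localized around w^*, with
% inverse temperature beta^* = 1 / \log n:
\hat\lambda(w^*) = n \beta^* \Bigl( \mathbb{E}^{\beta^*}_{w \mid w^*}\bigl[L_n(w)\bigr] - L_n(w^*) \Bigr)
```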

2 Minimum description length

MDL seems to be an interesting way to think about NNs (Geoffrey E. Hinton and Zemel 1993; Geoffrey E. Hinton and van Camp 1993; Perez, Kiela, and Cho 2021).
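
One way to operationalize this, in the spirit of the prequential coding that Perez, Kiela, and Cho (2021) build on, is to charge each data block its negative log-likelihood under a model fitted only on earlier blocks. A minimal sketch, assuming a hypothetical `fit`/`log_prob` model API rather than anything from those papers:

```python
import math

def prequential_description_length(model, blocks):
    """Prequential (online) MDL estimate, in nats.

    `blocks` is a sequence of data chunks presented in order. Each
    chunk is coded under the model fitted on all *previous* chunks,
    then the model is refitted to include it. A shorter total code
    length means the model class compresses the data better.
    The `model.fit` / `model.log_prob` API is assumed for illustration.
    """
    total_nats = 0.0
    seen = []
    for block in blocks:
        if seen:
            # Code the new block under the model trained so far.
            total_nats += sum(-model.log_prob(x) for x in block)
        else:
            # First block: charge a crude uniform baseline code (placeholder).
            total_nats += len(block) * math.log(2.0)
        seen.extend(block)
        model.fit(seen)  # refit (or fine-tune) on everything seen so far
    return total_nats
```

Comparing such code lengths across model classes, or across ablated versions of the dataset, is roughly the move that Rissanen data analysis makes.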

3 SANE

Sharpness Adjusted Number of Effective Parameters (L. Wang and Roberts 2023).

Spruiked by Stephen Roberts at “Instability is All You Need: The Surprising Dynamics of Learning in Deep Models”.
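
I have not reproduced their estimator; as a stand-in for the family it belongs to, here is the classic MacKay-style count of well-determined parameters from the Hessian spectrum, which (as I understand it) SANE adjusts for sharpness along the training trajectory:

```python
import numpy as np

def effective_parameter_count(hessian_eigvals, alpha=1.0):
    """MacKay-style effective number of well-determined parameters.

    Directions whose loss curvature (Hessian eigenvalue) is large
    relative to the prior / weight-decay scale `alpha` count as ~1
    effective parameter; flat directions count as ~0. This is a
    stand-in illustration, not the SANE estimator itself.
    """
    lam = np.asarray(hessian_eigvals, dtype=float)
    return float(np.sum(lam / (lam + alpha)))

# A spectrum with a few sharp directions and many nearly flat ones:
spectrum = np.concatenate([np.full(5, 100.0), np.full(995, 1e-3)])
print(effective_parameter_count(spectrum, alpha=1.0))  # ~ 5.9
```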

4 References

Hinton, Geoffrey E., and van Camp. 1993. “Keeping the Neural Networks Simple by Minimizing the Description Length of the Weights.” In Proceedings of the Sixth Annual Conference on Computational Learning Theory. COLT ’93.
Hinton, Geoffrey E., and Zemel. 1993. “Autoencoders, Minimum Description Length and Helmholtz Free Energy.” In Advances in Neural Information Processing Systems.
Hitchcock, and Hoogland. 2025. “From Global to Local: A Scalable Benchmark for Local Posterior Sampling.”
Lau, Furman, Wang, et al. 2024. “The Local Learning Coefficient: A Singularity-Aware Complexity Measure.”
Murata, Yoshizawa, and Amari. 1994. “Network Information Criterion-Determining the Number of Hidden Units for an Artificial Neural Network Model.” IEEE Transactions on Neural Networks.
Perez, Kiela, and Cho. 2021. “Rissanen Data Analysis: Examining Dataset Characteristics via Description Length.” In Proceedings of the 38th International Conference on Machine Learning.
Shoham, Mor-Yosef, and Avron. 2025. “Flatness After All?”
Wang, Lawrence, and Roberts. 2023. “SANE: The Phases of Gradient Descent Through Sharpness Adjusted Number of Effective Parameters.”
Wang, Pengfei, Zhang, Lei, et al. 2023. “Sharpness-Aware Gradient Matching for Domain Generalization.”