Degrees of freedom in NNs

Information criteria at scale

2025-06-25 — 2025-08-19

estimator distribution
information
likelihood
model selection
statistics

In classical statistics there are families of model complexity estimates, loosely and collectively referred to as the “degrees of freedom” of a model. Neither computationally nor practically do these scale up to overparameterized NNs, so other tools are used instead.
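
For orientation, the most elementary member of that family is the effective degrees of freedom of a linear smoother; a textbook recap in my own notation, not taken from any of the references here:

```latex
% Effective degrees of freedom of a linear smoother \hat{y} = H y:
% the trace of the hat matrix. For OLS with an n x p design matrix X
% this recovers the raw parameter count p.
\mathrm{df} = \sum_{i=1}^{n} \frac{\partial \hat{y}_i}{\partial y_i}
            = \operatorname{tr}(H),
\qquad
\mathrm{df}_{\mathrm{OLS}} = \operatorname{tr}\!\bigl(X (X^\top X)^{-1} X^\top\bigr) = p .
```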

Exception: Shoham, Mor-Yosef, and Avron (2025) argue for a connection to the Takeuchi Information Criterion.
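
Loosely, the TIC swaps the AIC's raw parameter count for a trace term that behaves like an effective number of parameters under model misspecification; the standard statement (my gloss, not that paper's notation):

```latex
% Takeuchi Information Criterion: AIC with the parameter count k
% replaced by tr(J^{-1} I), where J is the expected Hessian of the
% negative log-likelihood and I the covariance of the score.
% If the model is correctly specified, I = J and the penalty is k.
\mathrm{TIC} = -2\,\ell_n(\hat\theta) + 2\operatorname{tr}\!\bigl(J(\hat\theta)^{-1} I(\hat\theta)\bigr)
```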

These alternative complexity measures end up being popular in developmental interpretability.

1 Learning coefficient

AFAICT the major output of singular learning theory is one particular estimate of model effective dimensionality, the (local) learning coefficient (Lau et al. 2024; Hitchcock and Hoogland 2025).
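
Roughly, and with notation that may not match those papers: the learning coefficient λ is the coefficient on the log n term in Watanabe's asymptotic expansion of the Bayesian free energy, and the local variant is estimated from tempered-posterior samples localized around a trained parameter w*, e.g. via SGLD:

```latex
% Free-energy asymptotics: lambda plays the role that (number of
% parameters)/2 plays for regular models, but can be much smaller
% for singular models such as NNs.
F_n \approx n L_n(w_0) + \lambda \log n

% Local learning coefficient at a trained parameter w^*, estimated
% from samples of a tempered posterior localized around w^*, with
% inverse temperature beta^* = 1 / \log n:
\hat\lambda(w^*) = n \beta^* \Bigl( \mathbb{E}^{\beta^*}_{w \mid w^*}\bigl[L_n(w)\bigr] - L_n(w^*) \Bigr)
```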

2 Minimum description length

MDL seems to be an interesting way to think about NNs (Geoffrey E. Hinton and Zemel 1993; Geoffrey E. Hinton and van Camp 1993; Perez, Kiela, and Cho 2021).
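
One way to operationalize this, in the spirit of the prequential coding that Perez, Kiela, and Cho (2021) build on, is to charge each data block its negative log-likelihood under a model fitted only on earlier blocks. A minimal sketch, assuming a hypothetical `fit`/`log_prob` model API rather than anything from those papers:

```python
import math

def prequential_description_length(model, blocks):
    """Prequential (online) MDL estimate, in nats.

    `blocks` is a sequence of data chunks presented in order. Each
    chunk is coded under the model fitted on all *previous* chunks,
    then the model is refitted to include it. A shorter total code
    length means the model class compresses the data better.
    The `model.fit` / `model.log_prob` API is assumed for illustration.
    """
    total_nats = 0.0
    seen = []
    for block in blocks:
        if seen:
            # Code the new block under the model trained so far.
            total_nats += sum(-model.log_prob(x) for x in block)
        else:
            # First block: charge a crude uniform baseline code (placeholder).
            total_nats += len(block) * math.log(2.0)
        seen.extend(block)
        model.fit(seen)  # refit (or fine-tune) on everything seen so far
    return total_nats
```

Comparing such code lengths across model classes, or across ablated versions of the dataset, is roughly the move that Rissanen data analysis makes.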

3 SANE

Sharpness Adjusted Number of Effective Parameters (L. Wang and Roberts 2023).

Spruiked by Stephen Roberts at “Instability is All You Need: The Surprising Dynamics of Learning in Deep Models”.
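
I have not reproduced their estimator; as a stand-in for the family it belongs to, here is the classic MacKay-style count of well-determined parameters from the Hessian spectrum, which (as I understand it) SANE adjusts for sharpness along the training trajectory:

```python
import numpy as np

def effective_parameter_count(hessian_eigvals, alpha=1.0):
    """MacKay-style effective number of well-determined parameters.

    Directions whose loss curvature (Hessian eigenvalue) is large
    relative to the prior / weight-decay scale `alpha` count as ~1
    effective parameter; flat directions count as ~0. This is a
    stand-in illustration, not the SANE estimator itself.
    """
    lam = np.asarray(hessian_eigvals, dtype=float)
    return float(np.sum(lam / (lam + alpha)))

# A spectrum with a few sharp directions and many nearly flat ones:
spectrum = np.concatenate([np.full(5, 100.0), np.full(995, 1e-3)])
print(effective_parameter_count(spectrum, alpha=1.0))  # ~ 5.9
```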

4 References

Hinton, Geoffrey E., and van Camp. 1993. “Keeping the Neural Networks Simple by Minimizing the Description Length of the Weights.” In Proceedings of the Sixth Annual Conference on Computational Learning Theory. COLT ’93.
Hinton, Geoffrey E., and Zemel. 1993. “Autoencoders, Minimum Description Length and Helmholtz Free Energy.” In Advances in Neural Information Processing Systems.
Hitchcock, and Hoogland. 2025. “From Global to Local: A Scalable Benchmark for Local Posterior Sampling.”
Lau, Furman, Wang, et al. 2024. “The Local Learning Coefficient: A Singularity-Aware Complexity Measure.”
Murata, Yoshizawa, and Amari. 1994. “Network Information Criterion-Determining the Number of Hidden Units for an Artificial Neural Network Model.” IEEE Transactions on Neural Networks.
Perez, Kiela, and Cho. 2021. “Rissanen Data Analysis: Examining Dataset Characteristics via Description Length.” In Proceedings of the 38th International Conference on Machine Learning.
Shoham, Mor-Yosef, and Avron. 2025. “Flatness After All?”
Wang, Lawrence, and Roberts. 2023. “SANE: The Phases of Gradient Descent Through Sharpness Adjusted Number of Effective Parameters.”
Wang, Pengfei, Zhang, Lei, et al. 2023. “Sharpness-Aware Gradient Matching for Domain Generalization.”