Minimum description length
August 6, 2020
Bayes
information
model selection
statistics
stringology
A formalisation of Occam’s razor: prefer the hypothesis that permits the shortest description of the data, counting the cost of describing the hypothesis itself (Rissanen 1984; Grünwald 2007). I see it invoked mostly in model selection.
In the Bayesian context I think this typically means model selection by optimising the marginal likelihood, since the negative log marginal likelihood is the code length that an idealised Bayes mixture code assigns to the data.
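To make that concrete, here is a minimal sketch of my own (not taken from any of the references below) of MDL-flavoured model selection in numpy: polynomial regression models of varying degree are scored by a two-part code length, using the classic (k/2) log n parameter penalty, which is the standard asymptotic approximation to both the stochastic complexity and the negative log marginal likelihood. The synthetic data and the `description_length` helper are invented purely for illustration.

```python
import numpy as np

# Hypothetical sketch: choose a polynomial degree by a two-part code length.
# The (k/2) log n penalty is the usual asymptotic stand-in for the model part
# of the code, and for the negative log marginal likelihood.

rng = np.random.default_rng(0)
n = 100
x = np.linspace(-1, 1, n)
y = 1.0 + 0.5 * x - 2.0 * x**2 + rng.normal(scale=0.3, size=n)  # true model is quadratic

def description_length(x, y, degree):
    """Two-part code length (in nats) for a degree-`degree` polynomial fit."""
    X = np.vander(x, degree + 1)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sigma2 = resid @ resid / len(y)
    # L(data | model): negative Gaussian log-likelihood at the MLE.
    nll = 0.5 * len(y) * (np.log(2 * np.pi * sigma2) + 1)
    # L(model): (k/2) log n for k = degree + 2 fitted parameters
    # (polynomial coefficients plus the noise variance).
    k = degree + 2
    return nll + 0.5 * k * np.log(len(y))

degrees = range(6)
dl = [description_length(x, y, d) for d in degrees]
best = min(degrees, key=lambda d: dl[d])
print({d: round(dl[d], 1) for d in degrees}, "-> selected degree", best)
```

Running this should recover the true quadratic degree; replacing the penalised likelihood with an exact Gaussian-prior marginal likelihood would give the Bayes-coded version of the same comparison.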
References
Arora, and Zhang. 2021. “Rip van Winkle’s Razor: A Simple Estimate of Overfit to Test Data.” arXiv:2102.13189 [Cs, Stat].
Barron, Andrew R. 1991. “Complexity Regularization with Application to Artificial Neural Networks.” In Nonparametric Functional Estimation and Related Topics. NATO ASI Series 335.
Barron, A. R., and Cover. 1991. “Minimum Complexity Density Estimation.” IEEE Transactions on Information Theory.
Barron, A., Rissanen, and Yu. 1998. “The Minimum Description Length Principle in Coding and Modeling.” IEEE Transactions on Information Theory.
Collins, Dasgupta, and Schapire. 2001. “A Generalization of Principal Components Analysis to the Exponential Family.” In Advances in Neural Information Processing Systems.
Grünwald, Peter. 1996. “A Minimum Description Length Approach to Grammar Inference.” In Connectionist, Statistical, and Symbolic Approaches to Learning for Natural Language Processing. Lecture Notes in Computer Science.
Grünwald, Peter D. 2004. “A Tutorial Introduction to the Minimum Description Length Principle.” Advances in Minimum Description Length: Theory and Applications.
———. 2007. The Minimum Description Length Principle.
Hansen, and Yu. 2001. “Model Selection and the Principle of Minimum Description Length.” Journal of the American Statistical Association.
Li, Jun, and Dacheng Tao. 2013. “Simple Exponential Family PCA.” IEEE Transactions on Neural Networks and Learning Systems.
Legg. 2006. “Is There an Elegant Universal Theory of Prediction?” In Algorithmic Learning Theory. Lecture Notes in Computer Science 4264.
Mavromatis. 2009. “Minimum Description Length Modelling of Musical Structure.” Journal of Mathematics and Music.
Mohamed, Ghahramani, and Heller. 2008. “Bayesian Exponential Family PCA.” In Advances in Neural Information Processing Systems.
Rissanen. 1984. “Universal Coding, Information, Prediction, and Estimation.” IEEE Transactions on Information Theory.
Roy, Gordon, and Thrun. 2005. “Finding Approximate POMDP Solutions Through Belief Compression.” Journal of Artificial Intelligence Research.
Solomonoff. 1964a. “A Formal Theory of Inductive Inference. Part I.” Information and Control.
———. 1964b. “A Formal Theory of Inductive Inference. Part II.” Information and Control.
Sterkenburg. 2016. “Solomonoff Prediction and Occam’s Razor.” Philosophy of Science.
Ullrich. 2020. “A Coding Perspective on Deep Latent Variable Models.”
Vitányi. 2006. “Meaningful Information.” IEEE Transactions on Information Theory.
Wojtowicz, and DeDeo. 2020. “From Probability to Consilience: How Explanatory Values Implement Bayesian Reasoning.” Trends in Cognitive Sciences.