MaxEnt inference

Looks annoyingly like Bayesian inference, but I’m not convinced

2010-12-01 — 2026-01-27

Wherein constraints encoding observations and prior knowledge are imposed on a probability distribution, and the distribution of maximum entropy subject to them is selected; connections to predictive coding and optimal transport are noted.

Bayes
dynamical systems
machine learning
neural nets
optimization
physics
pseudorandomness
sciml
statistics
statmech
stochastic processes

If we think about entropy versus information for long enough, we invent MaxEnt inference.


How about we throw out classical Bayes and do something different? We encode our observations and prior knowledge as constraints on a probability distribution, then select the maximum-entropy distribution that satisfies them.

The rationale is that the maximum-entropy distribution is the ‘least biased’ estimate possible while still satisfying those constraints.
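
To fix notation, here is the standard constrained-optimization derivation (nothing specific to this note): when the constraints are expectations of known functions $f_k$, the Lagrange-multiplier calculation lands us in an exponential family. The problem

$$
\max_{p}\; -\int p(x)\log p(x)\,\mathrm{d}x
\quad\text{subject to}\quad
\int p(x)\,\mathrm{d}x = 1,\qquad
\int f_k(x)\,p(x)\,\mathrm{d}x = F_k,\quad k=1,\dots,K,
$$

has the stationary point

$$
p(x) = \frac{1}{Z(\lambda)}\exp\!\Big(-\sum_{k}\lambda_k f_k(x)\Big),
\qquad
Z(\lambda) = \int \exp\!\Big(-\sum_{k}\lambda_k f_k(x)\Big)\,\mathrm{d}x,
$$

with the multipliers $\lambda_k$ chosen so that $-\partial_{\lambda_k}\log Z(\lambda) = F_k$, i.e. so that the constraints actually hold.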

Jaynes and Bretthorst (2003) founded this school of thought. Shore and Johnson (1980) axiomatized it (although I’m told there are significant errors: their conditions don’t suffice to restrict us to the Shannon entropy form).

However, from context cues (the fact that people bothered to coin a CamelCase acronym) I deduce there is more going on here. I first looked into it 10+ years ago, and after ignoring it for ages my interest has been piqued again. Bert de Vries claimed to have put MaxEnt to work as a particularly useful phenomenological idea within the predictive-coding theory of mind, which inclined me to return to the original MaxEnt work by Caticha. Caticha’s treatment now appears in textbooks (Caticha 2015, 2008) and review articles (Caticha 2021).

There are suggestive connections to optimal transport via Lagrange duality.
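
To make the duality angle concrete, here is a minimal numerical sketch (a toy of my own, assuming the sample space is a six-sided die and the only constraint is a mean of 4.5): the Lagrange dual of the MaxEnt problem is a convex function of a single multiplier, namely the log-partition function plus the constraint term, and minimizing it recovers the exponential-family solution above.

```python
# Toy MaxEnt-by-duality: maximum-entropy distribution on a die
# with E[face] constrained to 4.5. The dual
#   g(lam) = log Z(lam) + lam * target_mean,  Z(lam) = sum_i exp(-lam * i),
# is convex in the single multiplier lam; minimising it gives the
# MaxEnt solution p_i proportional to exp(-lam * i).
import numpy as np
from scipy.optimize import minimize_scalar

faces = np.arange(1, 7)
target_mean = 4.5  # an assumed "observation"; any value in (1, 6) works

def dual(lam):
    return np.log(np.exp(-lam * faces).sum()) + lam * target_mean

lam = minimize_scalar(dual).x   # unconstrained scalar convex problem
p = np.exp(-lam * faces)
p /= p.sum()                    # normalise: this is the MaxEnt distribution

print(f"lambda = {lam:.4f}")
print("p =", p.round(4), "mean =", round((p * faces).sum(), 4))
```

The same pattern (concave objective, linear constraints, a tractable dual over multipliers) is, I take it, what makes the comparison with Kantorovich duality in optimal transport tempting.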

1 Incoming

Belghazi et al. (2021):

We argue that the estimation of mutual information between high dimensional continuous random variables can be achieved by gradient descent over neural networks. We present a Mutual Information Neural Estimator (MINE) that is linearly scalable in dimensionality as well as in sample size, trainable through back-prop, and strongly consistent. We present a handful of applications on which MINE can be used to minimize or maximize mutual information. We apply MINE to improve adversarially trained generative models. We also use MINE to implement Information Bottleneck, applying it to supervised classification; our results demonstrate substantial improvement in flexibility and performance in these settings.
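
For my own reference, a minimal sketch of the Donsker-Varadhan lower bound that MINE maximizes. This is a toy re-implementation under my own choices of network, optimizer, and data (correlated Gaussians, for which the true mutual information is known in closed form), and it omits the paper’s bias-corrected gradient.

```python
# Donsker-Varadhan / MINE-style mutual information estimate (toy sketch;
# omits MINE's moving-average gradient bias correction).
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
n, rho = 5000, 0.8
x = torch.randn(n, 1)
z = rho * x + math.sqrt(1 - rho**2) * torch.randn(n, 1)   # correlated Gaussians

T = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))  # statistics network
opt = torch.optim.Adam(T.parameters(), lr=1e-3)

for step in range(2000):
    opt.zero_grad()
    joint = T(torch.cat([x, z], dim=1)).squeeze()                    # samples from p(x, z)
    marg = T(torch.cat([x, z[torch.randperm(n)]], dim=1)).squeeze()  # shuffled z: p(x)p(z)
    # Donsker-Varadhan bound: E_p[T] - log E_{p x p}[exp T] <= I(X; Z)
    mi_lb = joint.mean() - (torch.logsumexp(marg, dim=0) - math.log(n))
    (-mi_lb).backward()   # gradient ascent on the lower bound
    opt.step()

print(f"MI estimate {mi_lb.item():.3f} vs closed form {-0.5 * math.log(1 - rho**2):.3f}")
```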

2 References

Belghazi, Baratin, Rajeswar, et al. 2021. “MINE: Mutual Information Neural Estimation.”
Caticha. 2007. “Information and Entropy.” In AIP Conference Proceedings.
———. 2008. “Lectures on Probability, Entropy, and Statistical Physics.”
———. 2009. “Quantifying Rational Belief.” In.
———. 2011. “Entropic Inference.” In.
———. 2015. “The Basics of Information Geometry.” In.
———. 2021. “Entropy, Information, and the Updating of Probabilities.” Entropy.
Caticha, and Giffin. 2006. “Updating Probabilities.” In AIP Conference Proceedings.
Dewar. 2003. “Information Theory Explanation of the Fluctuation Theorem, Maximum Entropy Production and Self-Organized Criticality in Non-Equilibrium Stationary States.” Journal of Physics A: Mathematical and General.
Gevers, De Marez, Van Nooten, et al. 2025. “In Benchmarks We Trust … Or Not?” In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing.
Gottwald, and Braun. 2020. “The Two Kinds of Free Energy and the Bayesian Revolution.” PLoS Computational Biology.
Jaynes. 1990. “Probability in Quantum Theory.”
Jaynes, and Bretthorst. 2003. Probability Theory: The Logic of Science.
Salaudeen, Reuel, Ahmed, et al. 2025. “Measurement to Meaning: A Validity-Centered Framework for AI Evaluation.”
Shore, and Johnson. 1980. “Axiomatic Derivation of the Principle of Maximum Entropy and the Principle of Minimum Cross-Entropy.” IEEE Transactions on Information Theory.
Tseng, and Caticha. 2002. “Yet Another Resolution of the Gibbs Paradox: An Information Theory Approach.” In AIP Conference Proceedings.