MaxEnt inference
Looks annoyingly like Bayesian inference, but I’m not convinced
2010-12-01 — 2026-01-27
Wherein constraints encoding observations and prior knowledge are imposed on a probability distribution, and the distribution of maximum entropy subject to them is selected; connections to predictive coding and optimal transport are noted.
If we think about entropy versus information for long enough, we invent MaxEnt inference.
How about we throw out classical Bayes and do something different? We encode our observations and prior knowledge as constraints on a probability distribution, then select the maximum-entropy distribution that satisfies them.
The rationale is that the maximum-entropy distribution is the ‘least biased’ estimate possible while still satisfying those constraints.
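A toy worked example might make that concrete. This is my own sketch of Jaynes' Brandeis dice problem, not code from any of the sources cited here: suppose all we know about a die is that its long-run average roll is 4.5. MaxEnt says to pick, among all distributions over the six faces with that mean, the one with the largest Shannon entropy. Numerically, with scipy:

```python
import numpy as np
from scipy.optimize import minimize

faces = np.arange(1, 7)
target_mean = 4.5  # the single "observation" we encode as a constraint

def neg_entropy(p):
    # Negative Shannon entropy; the small epsilon guards against log(0).
    return np.sum(p * np.log(p + 1e-12))

constraints = [
    {"type": "eq", "fun": lambda p: np.sum(p) - 1.0},          # normalisation
    {"type": "eq", "fun": lambda p: faces @ p - target_mean},  # mean constraint
]

result = minimize(
    neg_entropy,
    x0=np.full(6, 1.0 / 6.0),   # start from the unconstrained MaxEnt answer: uniform
    bounds=[(0.0, 1.0)] * 6,
    constraints=constraints,
)
p_maxent = result.x
print(np.round(p_maxent, 4))
# Lagrange duality says the solution is exponential in the constrained statistic,
# p_i ∝ exp(λ i) for some λ, which the numerical answer reproduces.
```

Everything here apart from the constraint list is boilerplate; swapping in different constraint functions is the whole game.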
Jaynes and Bretthorst (2003) founded this school of thought. Shore and Johnson (1980) axiomatized it (although I’m told there are significant errors: their conditions don’t suffice to restrict us to the Shannon entropy form).
However, from context cues (the fact that people introduced a camelCase acronym) I deduce there’s more going on here. I first looked into it 10+ years ago, but my interest has been piqued again after ignoring it for ages. Bert de Vries claims to have put MaxEnt to work as a particularly useful phenomenological idea within the predictive-coding theory of mind, which inclined me to return to the original MaxEnt work by Caticha. Caticha’s treatment now appears in textbooks (Caticha 2008, 2015) and review articles (Caticha 2014, 2021).
There are suggestive connections to optimal transport via Lagrange duality.
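To unpack the duality remark, here is the standard textbook setup (generic notation of my own, with constraint functions $f_k$ and observed values $F_k$): maximising entropy subject to moment constraints is a convex problem whose stationarity condition forces an exponential-family form, and the multipliers are found by solving a dual problem.

$$
\max_{p}\; -\int p(x)\log p(x)\,\mathrm{d}x
\quad\text{subject to}\quad
\int p(x)\,\mathrm{d}x = 1,
\qquad
\int f_k(x)\,p(x)\,\mathrm{d}x = F_k .
$$

Stationarity of the Lagrangian gives

$$
p(x) = \frac{1}{Z(\lambda)}\exp\!\Bigl(-\sum_k \lambda_k f_k(x)\Bigr),
\qquad
Z(\lambda) = \int \exp\!\Bigl(-\sum_k \lambda_k f_k(x)\Bigr)\,\mathrm{d}x,
$$

and the multipliers $\lambda_k$ solve the convex dual problem

$$
\min_{\lambda}\; \log Z(\lambda) + \sum_k \lambda_k F_k .
$$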
1 Incoming
Belghazi et al. (2021):
We argue that the estimation of mutual information between high dimensional continuous random variables can be achieved by gradient descent over neural networks. We present a Mutual Information Neural Estimator (MINE) that is linearly scalable in dimensionality as well as in sample size, trainable through back-prop, and strongly consistent. We present a handful of applications on which MINE can be used to minimize or maximize mutual information. We apply MINE to improve adversarially trained generative models. We also use MINE to implement Information Bottleneck, applying it to supervised classification; our results demonstrate substantial improvement in flexibility and performance in these settings.
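For my own reference, a minimal sketch of the MINE idea (my toy reconstruction, not the authors' code): parameterise a statistics network $T_\theta(x,y)$ and ascend the Donsker–Varadhan lower bound on mutual information, comparing joint samples against shuffled (product-of-marginals) samples.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy data: a correlated Gaussian pair, so we know the true MI analytically.
rho, n = 0.8, 2000
x = torch.randn(n, 1)
y = rho * x + (1 - rho**2) ** 0.5 * torch.randn(n, 1)

# Statistics network T(x, y); any flexible function approximator will do.
T = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(T.parameters(), lr=1e-3)

for step in range(2000):
    t_joint = T(torch.cat([x, y], dim=1)).squeeze(-1)  # samples from p(x, y)
    t_marg = T(torch.cat([x, y[torch.randperm(n)]], dim=1)).squeeze(-1)  # approx. p(x)p(y)
    # Donsker–Varadhan bound: MI >= E_joint[T] - log E_marginal[exp(T)]
    mi_lower_bound = t_joint.mean() - (
        torch.logsumexp(t_marg, dim=0) - torch.log(torch.tensor(float(n)))
    )
    loss = -mi_lower_bound
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"MINE-style estimate : {mi_lower_bound.item():.3f} nats")
print(f"analytic MI         : {(-0.5 * torch.log(torch.tensor(1 - rho**2))).item():.3f} nats")
```

The published estimator adds a bias correction to the gradient of the log-partition term; the naive version above is just the bound itself, optimised directly.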
