Entropies and other measures of surprise

November 25, 2011 — September 4, 2023

Figure 1

Not: what you hope to get from the newspaper. (Although…) Rather: Different types of (formally defined) entropy/information and their disambiguation. The seductive power of the logarithm, and convex functions rather like it.

A proven path to publication is to find or reinvent a derived measure based on Shannon information, and apply it to something provocative-sounding. (Qualia! Stock markets! Evolution! Language! The qualia of evolving stock market languages!)

This is purely about the analytic definition given random variables, not the estimation theory, which is a different problem

Connected also to functional equations and yes, statistical mechanics, and quantum information physics.

1 Connection to statistics

Figure 2: TODO: explain this diagram which I ripped off Wikipedia. Informations as measure, Venn diagram

See Shalizi, Note: Information Theory, Axiomatic Foundations, Connections to Statistics.

Many connections, not always clear because the continuous probability spaces of statistics are far from the discrete spaces of coding theory. Nonetheless there are contacts, most glaringly via the K-L divergence.

2 Shannon Information

“I have are given a discrete random process, how many bits of information do I need to reconstruct it?”

Vanilla information, thanks be to Claude Shannon. A thing related to coding of random processes.

Given a random variable \(X\) taking values \(x \in \mathcal{X}\) from some discrete alphabet \(\mathcal{X}\), with probability mass function \(p(x)\).

\[ \begin{array}{ccc} H(X) & := & -\sum_{x \in \mathcal{X}} p(x) \log p(x) \\ & \equiv & E( \log 1/p(x) ) \end{array} \]

More generally if we have a measure \(P\) over some Borel space \[ H(X)=-\int _{X}\log {\frac {\mathrm {d} P}{\mathrm {d} \mu }}\,dP \]

Over at the functional equations page is a link to Tom Leinster’s clever proof of the optimality of Shannon information via functional equations.

One interesting aspect of the proof is where the difficulty lies. Let \(I:\Delta_n \to \mathbb{R}^+\) be continuous functions satisfying the chain rule; we have to show that \(I\) is proportional to \(H\). All the effort and ingenuity goes into showing that \(I\) is proportional to \(H\) when restricted to the uniform distributions. In other words, the hard part is to show that there exists a constant \(c\) such that

\[ I(1/n, \ldots, 1/n) = c H(1/n, \ldots, 1/n) \]

for all \(n \geq 1\).

Venkatesan Guruswami, Atri Rudra and Madhu Sudan, Essential Coding Theory.

3 K-L divergence

Because “Kullback-Leibler divergence” is a lot of syllables for something you use so often. Or you could call it the “relative entropy”, if you want to sound fancy and/or mysterious.

KL divergence is defined between the probability mass functions of two discrete random variables, \(X,Y\) over the same space, where those probability mass functions are given \(p(x)\) and \(q(x)\) respectively.

\[ \begin{array}{cccc} D(P \parallel Q) & := & -\sum_{x \in \mathcal{X}} p(x) \log p(x) \frac{p(x)}{q(x)} \\ & \equiv & E \log p(x) \frac{p(x)}{q(x)} \end{array} \]

More generally, if the random variables have laws, respectively \(P\) and \(Q\): \[ {\displaystyle D_{\operatorname {KL} }(P\|Q)=\int _{\operatorname {supp} P}{\frac {\mathrm {d} P}{\mathrm {d} Q}}\log {\frac {\mathrm {d} P}{\mathrm {d} Q}}\,dQ=\int _{\operatorname {supp} P}\log {\frac {\mathrm {d} P}{\mathrm {d} Q}}\,dP,} \]

4 Jenson-Shannon divergence

Symmetrized version. Have never used.

Figure 3

5 Mutual information

The “informativeness” of one variable given another… Most simply, the K-L divergence between the product distribution and the joint distribution of two random variables. (That is, it vanishes if the two variables are independent).

Now, take \(X\) and \(Y\) with joint probability mass distribution \(p_{XY}(x,y)\) and, for clarity, marginal distributions \(p_X\) and \(p_Y\).

Then the mutual information \(I\) is given

\[ I(X; Y) = H(X) - H(X|Y) \]

Estimating this one has been giving me grief lately, so I’ll be happy when I get to this section and solve it forever. See nonparametric mutual information.

Getting an intuition of what this measure does is handy, so I’ll expound some equivalent definitions that emphasises different characteristics:

\[ \begin{array}{cccc} I(X; Y) & := & \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p_{XY}(x, y) \log p(x, y) \frac{p_{XY}(x,y)}{p_X(x)p_Y(y)} \\ & = & D( p_{XY} \parallel p_X p_Y) \\ & = & E \log \frac{p_{XY}(x,y)}{p_X(x)p_Y(y)} \end{array} \]

More usually we want the Conditional Mutual information.

\[I(X;Y|Z)=\int _{\mathcal {Z}}D_{\mathrm {KL} }(P_{(X,Y)|Z}\|P_{X|Z}\otimes P_{Y|Z})dP_{Z}\]

See Christopher Olah’s excellent visual explanation.

6 Kolmogorov-Sinai entropy

Schreiber says:

If \(I\) is obtained by coarse graining a continuous system \(X\) at resolution \(\epsilon\), the entropy \(HX(\epsilon)\) and entropy rate \(hX(\epsilon)\) will depend on the partitioning and in general diverge like \(\log(\epsilon)\) when \(\epsilon \to 0\). However, for the special case of a deterministic dynamical system, \(\lim_{\epsilon\to 0} hX (\epsilon) = hKS\) may exist and is then called the Kolmogorov-Sinai entropy. (For non-Markov systems, also the limit \(k \to \infty\) needs to be taken.)

That is, it is a special case of the entropy rate for a dynamical system — Cue connection to algorithmic complexity. Also metric entropy?

7 Relatives

7.1 Rényi Information

Also, the Hartley measure.

You don’t need to use a logarithm in your information summation. Free energy, something something. (?)

The observation that many of the attractive features of information measures are simply due to the concavity of the logarithm term in the function. So, why not whack another concave function with even more handy features in there? Bam, you are now working on Rényi information. How do you feel?

7.2 Tsallis statistics

Attempting to make information measures “non-extensive”. “q-entropy”. Seems to have made a big splash in Brazil, but less in other countries. Non-extensive measures are an intriguing idea, though. I wonder if it’s parochialism that keeps everyone off Tsallis statistics, or a lack of demonstrated usefulness?

7.3 Fisher information

See maximum likelihood and information criteria.

8 Estimating information

Wait, you don’t know the exact parameters of your generating process a priori? You need to estimate it from data.

9 Connection to statistical mechanics

See Entropy vs information.

10 References

Akaike. 1973. Information Theory and an Extension of the Maximum Likelihood Principle.” In Proceeding of the Second International Symposium on Information Theory.
Amari. 2001. Information Geometry on Hierarchy of Probability Distributions.” IEEE Transactions on Information Theory.
Ay, Bertschinger, Der, et al. 2008. Predictive Information and Explorative Behavior of Autonomous Robots.” The European Physical Journal B - Condensed Matter and Complex Systems.
Bach. 2023. Information Theory With Kernel Methods.” IEEE Transactions on Information Theory.
———. 2024. Sum-of-Squares Relaxations for Information Theory and Variational Inference.” Foundations of Computational Mathematics.
Baez, Fritz, and Leinster. 2011. A Characterization of Entropy in Terms of Information Loss.” Entropy.
Barnett, Barrett, and Seth. 2009. Granger Causality and Transfer Entropy Are Equivalent for Gaussian Variables.” Physical Review Letters.
Barnum, Barrett, Clark, et al. 2010. Entropy and Information Causality in General Probabilistic Theories.” New Journal of Physics.
Barrett, Barnett, and Seth. 2010. Multivariate Granger Causality and Generalized Variance.” Phys. Rev. E.
Bialek, Nemenman, and Tishby. 2001. Complexity Through Nonextensivity.” Physica A: Statistical and Theoretical Physics.
———. 2006. Predictability, Complexity, and Learning.” Neural Computation.
Bieniawski, and Wolpert. 2004. Adaptive, Distributed Control of Constrained Multi-Agent Systems.” In Proceedings of the Third International Joint Conference on Autonomous Agents and Multiagent Systems-Volume 3.
Calude. 2002. Information and Randomness : An Algorithmic Perspective.
Cappé, Moulines, and Ryden. 2007. Inference in Hidden Markov Models.
Cerf, and Adami. 1997. “Entropic Bell Inequalities.” Physical Review A.
Cerf, and Adami. 1998. “Information Theory of Quantum Entanglement and Measurement.” Physica D: Nonlinear Phenomena.
Chaitin. 1977. “Algorithmic Information Theory.” IBM Journal of Research and Development.
———. 2002. “The Intelligibility of the Universe and the Notions of Simplicity, Complexity and Irreducibility.”
Cherief-Abdellatif, and Alquier. 2020. MMD-Bayes: Robust Bayesian Estimation via Maximum Mean Discrepancy.” In Proceedings of The 2nd Symposium on Advances in Approximate Bayesian Inference.
Chiribella, d’Ariano, and Perinotti. 2011. Informational Derivation of Quantum Theory.” Physical Review A.
Chow, and Liu. 1968. Approximating Discrete Probability Distributions with Dependence Trees.” IEEE Transactions on Information Theory.
Cohen. 1962. Information Theory and Music.” Behavioral Science.
Cover, Thomas M., Gács, and Gray. 1989. Kolmogorov’s Contributions to Information Theory and Algorithmic Complexity.” The Annals of Probability.
Cover, Thomas M, and Thomas. 2006. Elements of Information Theory.
Crooks. 2007. Measuring Thermodynamic Length.” Physical Review Letters.
Csiszár, and Shields. 2004. Information Theory and Statistics: A Tutorial. Foundations and Trends in Communications and Information Theory.
Dahlhaus. 1996. On the Kullback-Leibler Information Divergence of Locally Stationary Processes.” Stochastic Processes and Their Applications.
Dewar. 2003. Information Theory Explanation of the Fluctuation Theorem, Maximum Entropy Production and Self-Organized Criticality in Non-Equilibrium Stationary States.” Journal of Physics A: Mathematical and General.
Earman, and Norton. 1998. Exorcist XIV: The Wrath of Maxwell’s Demon. Part I. From Maxwell to Szilard.” Studies in History and Philosophy of Modern Physics.
———. 1999. Exorcist XIV: The Wrath of Maxwell’s Demon. Part II. From Szilard to Landauer and Beyond.” Studies in History and Philosophy of Modern Physics.
Eichler. 2001. Granger-Causality Graphs for Multivariate Time Series.” Granger-Causality Graphs for Multivariate Time Series.
Elidan, and Friedman. 2005. “Learning Hidden Variable Networks: The Information Bottleneck Approach.” Journal of Machine Learning Research.
Ellerman. 2017. Logical Information Theory: New Foundations for Information Theory.” arXiv:1707.04728 [Quant-Ph].
Feldman. 2002. A Brief Introduction to: Information Theory, Excess Entropy and Computational Mechanics.”
Feldman, and Crutchfield. 2004. Synchronizing to Periodicity: The Transient Information and Synchronization Time of Periodic Sequences.” Advances in Complex Systems.
Frazier, and Powell. 2010. Paradoxes in Learning and the Marginal Value of Information.” Decision Analysis.
Friedman, Mosenzon, Slonim, et al. 2001. Multivariate Information Bottleneck.” In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence. UAI’01.
Friston. 2010. The Free-Energy Principle: A Unified Brain Theory? Nature Reviews Neuroscience.
Gács, Tromp, and Vitányi. 2001. Algorithmic Statistics.” IEEE Transactions on Information Theory.
Gao, Ver Steeg, and Galstyan. 2015. Efficient Estimation of Mutual Information for Strongly Dependent Variables.” In Journal of Machine Learning Research.
Goldfeld, and Polyanskiy. 2020. The Information Bottleneck Problem and Its Applications in Machine Learning.” arXiv:2004.14941 [Cs, Stat].
Granger. 1963. Economic Processes Involving Feedback.” Information and Control.
Grassberger. 1988. Finite Sample Corrections to Entropy and Dimension Estimates.” Physics Letters A.
Gray. 1991. Entropy and Information Theory.
Hausser, and Strimmer. 2009. “Entropy Inference and the James-Stein Estimator, with Application to Nonlinear Gene Association Networks.” Journal of Machine Learning Research.
Haussler, and Opper. 1997. Mutual Information, Metric Entropy and Cumulative Relative Entropy Risk.” The Annals of Statistics.
Hirata, and Ulanowicz. 1985. Information Theoretical Analysis of the Aggregation and Hierarchical Structure of Ecological Networks.” Journal of Theoretical Biology.
Ho, Kastner, and Wong. 1978. Teams, Signaling, and Information Theory.” IEEE Transactions on Automatic Control.
Jaynes. 1963. “Information Theory and Statistical Mechanics.” In Statistical Physics. Brandeis University Summer Institute Lectures in Theoretical Physics.
———. 1965. Gibbs Vs Boltzmann Entropies.” American Journal of Physics.
Kandasamy, Krishnamurthy, Poczos, et al. 2014. Influence Functions for Machine Learning: Nonparametric Estimators for Entropies, Divergences and Mutual Informations.” arXiv:1411.4342 [Stat].
Keane, and O’Brien. 1994. A Bernoulli Factory.” ACM Trans. Model. Comput. Simul.
Kelly Jr. 1956. A New Interpretation of Information Rate.” Bell System Technical Journal.
Klir. 1998. Uncertainty and Information: Foundations of Generalized Information Theory.
Kolmogorov. 1968. “Three Approaches to the Quantitative Definition of Information.” International Journal of Computer Mathematics.
Kraskov, Stögbauer, and Grassberger. 2004. Estimating Mutual Information.” Physical Review E.
Krause, and Guestrin. 2009. “Optimal Value of Information in Graphical Models.” J. Artif. Int. Res.
Kullback, and Leibler. 1951. On Information and Sufficiency.” The Annals of Mathematical Statistics.
Langton. 1990. Computation at the Edge of Chaos: Phase Transitions and Emergent Computation.” Physica D: Nonlinear Phenomena.
Lecomte, Appert-Rolland, and Wijland. 2007. Thermodynamic Formalism for Systems with Markov Dynamics.” Journal of Statistical Physics.
Leonenko. 2008. A Class of Rényi Information Estimators for Multidimensional Densities.” The Annals of Statistics.
Leskovec. 2012. Information Diffusion and External Influence in Networks.” Eprint arXiv:1206.1331.
Liese, and Vajda. 2006. On Divergences and Informations in Statistics and Information Theory.” IEEE Transactions on Information Theory.
Lin. 1991. Divergence Measures Based on the Shannon Entropy.” IEEE Transactions on Information Theory.
Lizier, and Prokopenko. 2010. Differentiating Information Transfer and Causal Effect.” The European Physical Journal B - Condensed Matter and Complex Systems.
Lizier, Prokopenko, and Zomaya. 2008. Local Information Transfer as a Spatiotemporal Filter for Complex Systems.” Physical Review E.
Marton. 2015. Logarithmic Sobolev Inequalities in Discrete Product Spaces: A Proof by a Transportation Cost Distance.” arXiv:1507.02803 [Math].
Maynard Smith. 2000. The Concept of Information in Biology.” Philosophy of Science.
Moon, and Hero III. 2014. Multivariate f-Divergence Estimation With Confidence.” In NIPS 2014.
Nemenman, Shafee, and Bialek. 2001. Entropy and Inference, Revisited.” In arXiv:physics/0108025.
Nielsen. 2021. On a Variational Definition for the Jensen-Shannon Symmetrization of Distances Based on the Information Radius.” Entropy.
Odum. 1988. “Self-Organization, Transformity, and Information.” Science.
Palomar, and Verdu. 2008. Lautum Information.” IEEE Transactions on Information Theory.
Paninski. 2003. Estimation of Entropy and Mutual Information.” Neural Computation.
Parry. 1964. Intrinsic Markov Chains.” Transactions of the American Mathematical Society.
Pawlowski, Paterek, Kaszlikowski, et al. 2009. Information Causality as a Physical Principle.” Nature.
Phat, Thanh, and Trinh. 2014. Full-Order Observer Design for Nonlinear Complex Large-Scale Systems with Unknown Time-Varying Delayed Interactions.” Complexity.
Pinkerton. 1956. Information Theory and Melody.” Scientific American.
Plotkin, and Nowak. 2000. Language Evolution and Information Theory.” Journal of Theoretical Biology.
Raginsky, M. 2011. Directed Information and Pearl’s Causal Calculus.” In 2011 49th Annual Allerton Conference on Communication, Control, and Computing (Allerton).
Raginsky, Maxim, and Sason. 2012. Concentration of Measure Inequalities in Information Theory, Communications and Coding.” Foundations and Trends in Communications and Information Theory.
Riegler, Bölcskei, and Koliander. 2023. Lossy Compression of General Random Variables.”
Rissanen. 2007. Information and Complexity in Statistical Modeling. Information Science and Statistics.
Roulston. 1999. Estimating the Errors on Measured Entropy and Mutual Information.” Physica D: Nonlinear Phenomena.
Ryabko, and Ryabko. 2010. Nonparametric Statistical Inference for Ergodic Processes.” IEEE Transactions on Information Theory.
Schreiber. 2000. “Measuring Information Transfer.” Physical Review Letters.
Schürmann. 2015. A Note on Entropy Estimation.” Neural Computation.
Sethna. 2006. Statistical Mechanics: Entropy, Order Parameters, and Complexity.
Shalizi. n.d.a. “Information Theory.”
———. n.d.b. “Optimal Prediction.”
Shalizi, and Crutchfield. 2002. Information Bottlenecks, Causal States, and Statistical Relevance Bases: How to Represent Relevant Information in Memoryless Transduction.” Advances in Complex Systems.
Shalizi, Haslinger, Rouquier, et al. 2006. Automatic Filters for the Detection of Coherent Structure in Spatiotemporal Systems.” Physical Review E.
Shalizi, and Moore. 2003. What Is a Macrostate? Subjective Observations and Objective Dynamics.”
Shannon. 1948. “A Mathematical Theory of Communication.” The Bell Syst Tech J.
Shibata. 1997. “Bootstrap Estimate of Kullback-Leibler Information for Model Selection.” Statistica Sinica.
Shields. 1998. The Interactions Between Ergodic Theory and Information Theory.” IEEE Transactions on Information Theory.
Singh. 1985. Monitoring and Hierarchies: The Marginal Value of Information in a Principal-Agent Model.” Journal of Political Economy.
Slonim, Atwal, Tkačik, et al. 2005. Information-Based Clustering.” Proceedings of the National Academy of Sciences of the United States of America.
Slonim, Friedman, and Tishby. 2006. Multivariate Information Bottleneck.” Neural Computation.
Smith. 2008a. Thermodynamics of Natural Selection I: Energy Flow and the Limits on Organization.” Journal of Theoretical Biology.
———. 2008b. Thermodynamics of Natural Selection II: Chemical Carnot Cycles.” Journal of Theoretical Biology.
Spence. 2002. Signaling in Retrospect and the Informational Structure of Markets.” American Economic Review.
Strong, Koberle, de Ruyter van Steveninck, et al. 1998. Entropy and Information in Neural Spike Trains.” Phys. Rev. Lett.
Studený. 2016. Basic Facts Concerning Supermodular Functions.” arXiv:1612.06599 [Math, Stat].
Studený, and Vejnarová. 1998. “On Multiinformation Function as a Tool for Measuring Stochastic Dependence.” In Learning in Graphical Models.
Taylor, Tishby, and Bialek. 2007. “Information and Fitness.” Arxiv Preprint arXiv:0712.4382.
Tishby, Pereira, and Bialek. 2000. The Information Bottleneck Method.” arXiv:physics/0004057.
Tishby, and Polani. 2011. “Information Theory of Decisions and Actions.” In PERCEPTION-ACTION CYCLE.
Tumer, and Wolpert. 2004. “Coordination in Large Collectives- Chapter 1.” In.
ver Steeg, and Galstyan. 2015. The Information Sieve.” arXiv:1507.02284 [Cs, Math, Stat].
Vereshchagin, and Vitányi. 2004. Kolmogorov’s Structure Functions and Model Selection.” IEEE Transactions on Information Theory.
———. 2010. Rate Distortion and Denoising of Individual Data Using Kolmogorov Complexity.” IEEE Transactions on Information Theory.
Vitányi. 2006. Meaningful Information.” IEEE Transactions on Information Theory.
Weidmann, and Vetterli. 2012. Rate Distortion Behavior of Sparse Sources.” IEEE Transactions on Information Theory.
Weissman, Kim, and Permuter. 2013. Directed Information, Causal Estimation, and Communication in Continuous Time.” IEEE Transactions on Information Theory.
Wolf, and Wolpert. 1994. Estimating Functions of Distributions from A Finite Set of Samples, Part 2: Bayes Estimators for Mutual Information, Chi-Squared, Covariance and Other Statistics.” arXiv:comp-Gas/9403002.
Wolpert, David H. 2006a. Information Theory — The Bridge Connecting Bounded Rational Game Theory and Statistical Physics.” In Complex Engineered Systems. Understanding Complex Systems.
———. 2006b. What Information Theory Says about Bounded Rational Best Response.” In The Complex Networks of Economic Interactions. Lecture Notes in Economics and Mathematical Systems 567.
Wolpert, David H, and Lawson. 2002. Designing Agent Collectives for Systems with Markovian Dynamics.” In.
Wolpert, David H, Wheeler, and Tumer. 1999. General Principles of Learning-Based Multi-Agent Systems.” In.
———. 2000. Collective Intelligence for Control of Distributed Dynamical Systems.” EPL (Europhysics Letters).
Wolpert, David H., and Wolf. 1994. Estimating Functions of Probability Distributions from a Finite Set of Samples, Part 1: Bayes Estimators and the Shannon Entropy.” arXiv:comp-Gas/9403001.
Xu, and Raginsky. 2017. Information-Theoretic Analysis of Generalization Capability of Learning Algorithms.” In Advances In Neural Information Processing Systems.
Zellner. 1988. Optimal Information Processing and Bayes’s Theorem.” The American Statistician.
Zhang, and Grabchak. 2014. Nonparametric Estimation of Küllback-Leibler Divergence.” Neural Computation.