# Informations

Entropies and other measures of surprise

November 25, 2011 — September 4, 2023

classification
compsci
information
metrics
statistics
statmech

TODO: explain this diagram which I ripped off from Wikipedia.

Not: what you hope to get from the newspaper. (Although…) Rather: Different types of (formally defined) entropy/information and their disambiguation. The seductive power of the logarithm, and convex functions rather like it.

A proven path to publication is to find or reinvent a derived measure based on Shannon information, and apply it to something provocative-sounding. (Qualia! Stock markets! Evolution! Language! The qualia of evolving stock market languages!)

This is purely about the analytic definition given random variables, not the estimation theory, which is a different problem

Connected also to functional equations and yes, statistical mechanics, and quantum information physics.

## 1 Shannon Information

“I have are given a discrete random process, how many bits of information do I need to reconstruct it?”

Vanilla information, thanks be to Claude Shannon. A thing related to coding of random processes.

Given a random variable $$X$$ taking values $$x \in \mathcal{X}$$ from some discrete alphabet $$\mathcal{X}$$, with probability mass function $$p(x)$$.

$\begin{array}{ccc} H(X) & := & -\sum_{x \in \mathcal{X}} p(x) \log p(x) \\ & \equiv & E( \log 1/p(x) ) \end{array}$

More generally if we have a measure $$P$$ over some Borel space $H(X)=-\int _{X}\log {\frac {\mathrm {d} P}{\mathrm {d} \mu }}\,dP$

Over at the functional equations page is a link to Tom Leinster’s clever proof of the optimality of Shannon information via functional equations.

One interesting aspect of the proof is where the difficulty lies. Let $$I:\Delta_n \to \mathbb{R}^+$$ be continuous functions satisfying the chain rule; we have to show that $$I$$ is proportional to $$H$$. All the effort and ingenuity goes into showing that $$I$$ is proportional to $$H$$ when restricted to the uniform distributions. In other words, the hard part is to show that there exists a constant $$c$$ such that

$I(1/n, \ldots, 1/n) = c H(1/n, \ldots, 1/n)$

for all $$n \geq 1$$.

Venkatesan Guruswami, Atri Rudra and Madhu Sudan, Essential Coding Theory.

## 2 K-L divergence

Because “Kullback-Leibler divergence” is a lot of syllables for something you use so often. Or you could call it the “relative entropy”, if you want to sound fancy and/or mysterious.

KL divergence is defined between the probability mass functions of two discrete random variables, $$X,Y$$ over the same space, where those probability mass functions are given $$p(x)$$ and $$q(x)$$ respectively.

$\begin{array}{cccc} D(P \parallel Q) & := & -\sum_{x \in \mathcal{X}} p(x) \log p(x) \frac{p(x)}{q(x)} \\ & \equiv & E \log p(x) \frac{p(x)}{q(x)} \end{array}$

More generally, if the random variables have laws, respectively $$P$$ and $$Q$$: ${\displaystyle D_{\operatorname {KL} }(P\|Q)=\int _{\operatorname {supp} P}{\frac {\mathrm {d} P}{\mathrm {d} Q}}\log {\frac {\mathrm {d} P}{\mathrm {d} Q}}\,dQ=\int _{\operatorname {supp} P}\log {\frac {\mathrm {d} P}{\mathrm {d} Q}}\,dP,}$

## 3 Jenson-Shannon divergence

Symmetrized version. Have never used.

## 4 Mutual information

The “informativeness” of one variable given another… Most simply, the K-L divergence between the product distribution and the joint distribution of two random variables. (That is, it vanishes if the two variables are independent).

Now, take $$X$$ and $$Y$$ with joint probability mass distribution $$p_{XY}(x,y)$$ and, for clarity, marginal distributions $$p_X$$ and $$p_Y$$.

Then the mutual information $$I$$ is given

$I(X; Y) = H(X) - H(X|Y)$

Estimating this one has been giving me grief lately, so I’ll be happy when I get to this section and solve it forever. See nonparametric mutual information.

Getting an intuition of what this measure does is handy, so I’ll expound some equivalent definitions that emphasises different characteristics:

$\begin{array}{cccc} I(X; Y) & := & \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p_{XY}(x, y) \log p(x, y) \frac{p_{XY}(x,y)}{p_X(x)p_Y(y)} \\ & = & D( p_{XY} \parallel p_X p_Y) \\ & = & E \log \frac{p_{XY}(x,y)}{p_X(x)p_Y(y)} \end{array}$

More usually we want the Conditional Mutual information.

$I(X;Y|Z)=\int _{\mathcal {Z}}D_{\mathrm {KL} }(P_{(X,Y)|Z}\|P_{X|Z}\otimes P_{Y|Z})dP_{Z}$

See Christopher Olah’s excellent visual explanation.

## 5 Kolmogorov-Sinai entropy

Schreiber says:

If $$I$$ is obtained by coarse graining a continuous system $$X$$ at resolution $$\epsilon$$, the entropy $$HX(\epsilon)$$ and entropy rate $$hX(\epsilon)$$ will depend on the partitioning and in general diverge like $$\log(\epsilon)$$ when $$\epsilon \to 0$$. However, for the special case of a deterministic dynamical system, $$\lim_{\epsilon\to 0} hX (\epsilon) = hKS$$ may exist and is then called the Kolmogorov-Sinai entropy. (For non-Markov systems, also the limit $$k \to \infty$$ needs to be taken.)

That is, it is a special case of the entropy rate for a dynamical system - Cue connection to algorithmic complexity. Also metric entropy?

## 6 Relatives

### 6.1 Rényi Information

Also, the Hartley measure.

You don’t need to use a logarithm in your information summation. Free energy, something something. (?)

The observation that many of the attractive features of information measures are simply due to the concavity of the logarithm term in the function. So, why not whack another concave function with even more handy features in there? Bam, you are now working on Rényi information. How do you feel?

### 6.2 Tsallis statistics

Attempting to make information measures “non-extensive”. “q-entropy”. Seems to have made a big splash in Brazil, but less in other countries. Non-extensive measures are an intriguing idea, though. I wonder if it’s parochialism that keeps everyone off Tsallis statistics, or a lack of demonstrated usefulness?

## 7 Estimating information

Wait, you don’t know the exact parameters of your generating process a priori? You need to estimate it from data.

## 8 Connection to statistical mechanics

Informational entropy versus thermodynamics entropy.

Shalizi and Moore (2003):

We consider the question of whether thermodynamic macrostates are objective consequences of dynamics, or subjective reflections of our ignorance of a physical system. We argue that they are both; more specifically, that the set of macrostates forms the unique maximal partition of phase space which 1) is consistent with our observations (a subjective fact about our ability to observe the system) and 2) obeys a Markov process (an objective fact about the system’s dynamics). We review the ideas of computational mechanics, an information-theoretic method for finding optimal causal models of stochastic processes, and argue that macrostates coincide with the “causal states” of computational mechanics. Defining a set of macrostates thus consists of an inductive process where we start with a given set of observables, and then refine our partition of phase space until we reach a set of states which predict their own future, i.e. which are Markovian. Macrostates arrived at in this way are provably optimal statistical predictors of the future values of our observables.

## 10 References

Akaike. 1973. In Proceeding of the Second International Symposium on Information Theory.
Amari. 2001. IEEE Transactions on Information Theory.
Ay, Bertschinger, Der, et al. 2008. The European Physical Journal B - Condensed Matter and Complex Systems.
Baez, Fritz, and Leinster. 2011. Entropy.
Barnett, Barrett, and Seth. 2009. Physical Review Letters.
Barrett, Barnett, and Seth. 2010. Phys. Rev. E.
Bialek, Nemenman, and Tishby. 2001. Physica A: Statistical and Theoretical Physics.
———. 2006. Neural Computation.
Bieniawski, and Wolpert. 2004. In Proceedings of the Third International Joint Conference on Autonomous Agents and Multiagent Systems-Volume 3.
Calude. 2002. Information and Randomness : An Algorithmic Perspective.
Cappé, Moulines, and Ryden. 2007. Inference in Hidden Markov Models.
Cerf, and Adami. 1997. “Entropic Bell Inequalities.” Physical Review A.
Cerf, and Adami. 1998. “Information Theory of Quantum Entanglement and Measurement.” Physica D: Nonlinear Phenomena.
Chaitin. 1977. “Algorithmic Information Theory.” IBM Journal of Research and Development.
———. 2002. “The Intelligibility of the Universe and the Notions of Simplicity, Complexity and Irreducibility.”
Chiribella, d’Ariano, and Perinotti. 2011. Physical Review A.
Chow, and Liu. 1968. IEEE Transactions on Information Theory.
Cohen. 1962. Behavioral Science.
Cover, Thomas M., Gács, and Gray. 1989. The Annals of Probability.
Cover, Thomas M, and Thomas. 2006. Elements of Information Theory.
Crooks. 2007. Physical Review Letters.
Csiszár, and Shields. 2004. Information Theory and Statistics: A Tutorial. Foundations and Trends in Communications and Information Theory.
Dahlhaus. 1996. Stochastic Processes and Their Applications.
Dewar. 2003. Journal of Physics A: Mathematical and General.
Earman, and Norton. 1998. Studies in History and Philosophy of Modern Physics.
———. 1999. Studies in History and Philosophy of Modern Physics.
Eichler. 2001. Granger-Causality Graphs for Multivariate Time Series.
Elidan, and Friedman. 2005. “Learning Hidden Variable Networks: The Information Bottleneck Approach.” Journal of Machine Learning Research.
Ellerman. 2017. arXiv:1707.04728 [Quant-Ph].
Feldman, and Crutchfield. 2004. Advances in Complex Systems.
Frazier, and Powell. 2010. Decision Analysis.
Friedman, Mosenzon, Slonim, et al. 2001. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence. UAI’01.
Friston. 2010. Nature Reviews Neuroscience.
Gács, Tromp, and Vitányi. 2001. IEEE Transactions on Information Theory.
Gao, Ver Steeg, and Galstyan. 2015. In Journal of Machine Learning Research.
Goldfeld, and Polyanskiy. 2020. arXiv:2004.14941 [Cs, Stat].
Granger. 1963. Information and Control.
Grassberger. 1988. Physics Letters A.
Gray. 1991. Entropy and Information Theory.
Hausser, and Strimmer. 2009. “Entropy Inference and the James-Stein Estimator, with Application to Nonlinear Gene Association Networks.” Journal of Machine Learning Research.
Haussler, and Opper. 1997. The Annals of Statistics.
Hirata, and Ulanowicz. 1985. Journal of Theoretical Biology.
Ho, Kastner, and Wong. 1978. IEEE Transactions on Automatic Control.
Jaynes. 1963. “Information Theory and Statistical Mechanics.” In Statistical Physics. Brandeis University Summer Institute Lectures in Theoretical Physics.
———. 1965. American Journal of Physics.
Kandasamy, Krishnamurthy, Poczos, et al. 2014. arXiv:1411.4342 [Stat].
Keane, and O’Brien. 1994. ACM Trans. Model. Comput. Simul.
Kelly Jr. 1956. Bell System Technical Journal.
Klir. 1998. Uncertainty and Information: Foundations of Generalized Information Theory.
Kolmogorov. 1968. “Three Approaches to the Quantitative Definition of Information.” International Journal of Computer Mathematics.
Kraskov, Stögbauer, and Grassberger. 2004. Physical Review E.
Krause, and Guestrin. 2009. “Optimal Value of Information in Graphical Models.” J. Artif. Int. Res.
Kullback, and Leibler. 1951. The Annals of Mathematical Statistics.
Langton. 1990. Physica D: Nonlinear Phenomena.
Lecomte, Appert-Rolland, and Wijland. 2007. Journal of Statistical Physics.
Leonenko. 2008. The Annals of Statistics.
Leskovec. 2012. Eprint arXiv:1206.1331.
Liese, and Vajda. 2006. IEEE Transactions on Information Theory.
Lin. 1991. IEEE Transactions on Information Theory.
Lizier, and Prokopenko. 2010. The European Physical Journal B - Condensed Matter and Complex Systems.
Lizier, Prokopenko, and Zomaya. 2008. Physical Review E.
Marton. 2015. arXiv:1507.02803 [Math].
Maynard Smith. 2000. Philosophy of Science.
Moon, and Hero III. 2014. In NIPS 2014.
Nemenman, Shafee, and Bialek. 2001. In arXiv:physics/0108025.
Odum. 1988. “Self-Organization, Transformity, and Information.” Science.
Palomar, and Verdu. 2008. IEEE Transactions on Information Theory.
Paninski. 2003. Neural Computation.
Parry. 1964. Transactions of the American Mathematical Society.
Phat, Thanh, and Trinh. 2014. Complexity.
Pinkerton. 1956. Scientific American.
Plotkin, and Nowak. 2000. Journal of Theoretical Biology.
Raginsky, M. 2011. In 2011 49th Annual Allerton Conference on Communication, Control, and Computing (Allerton).
Raginsky, Maxim, and Sason. 2012. Foundations and Trends in Communications and Information Theory.
Riegler, Bölcskei, and Koliander. 2023.
Rissanen. 2007. Information and Complexity in Statistical Modeling. Information Science and Statistics.
Roulston. 1999. Physica D: Nonlinear Phenomena.
Ryabko, and Ryabko. 2010. IEEE Transactions on Information Theory.
Schreiber. 2000. “Measuring Information Transfer.” Physical Review Letters.
Schürmann. 2015. Neural Computation.
Sethna. 2006. Statistical Mechanics: Entropy, Order Parameters, and Complexity.
Shalizi. n.d.a. “Information Theory.”
———. n.d.b. “Optimal Prediction.”
Shalizi, and Crutchfield. 2002. Advances in Complex Systems.
Shalizi, Haslinger, Rouquier, et al. 2006. Physical Review E.
Shalizi, and Moore. 2003.
Shannon. 1948. “A Mathematical Theory of Communication.” The Bell Syst Tech J.
Shibata. 1997. “Bootstrap Estimate of Kullback-Leibler Information for Model Selection.” Statistica Sinica.
Shields. 1998. IEEE Transactions on Information Theory.
Singh. 1985. Journal of Political Economy.
Slonim, Atwal, Tkačik, et al. 2005. Proceedings of the National Academy of Sciences of the United States of America.
Slonim, Friedman, and Tishby. 2006. Neural Computation.
Smith. 2008a. Journal of Theoretical Biology.
———. 2008b. Journal of Theoretical Biology.
Spence. 2002. American Economic Review.
Strong, Koberle, de Ruyter van Steveninck, et al. 1998. Phys. Rev. Lett.
Studený. 2016. arXiv:1612.06599 [Math, Stat].
Studený, and Vejnarová. 1998. “On Multiinformation Function as a Tool for Measuring Stochastic Dependence.” In Learning in Graphical Models.
Taylor, Tishby, and Bialek. 2007. “Information and Fitness.” Arxiv Preprint arXiv:0712.4382.
Tishby, Pereira, and Bialek. 2000. arXiv:physics/0004057.
Tishby, and Polani. 2011. “Information Theory of Decisions and Actions.” In PERCEPTION-ACTION CYCLE.
Tumer, and Wolpert. 2004. “Coordination in Large Collectives- Chapter 1.” In.
ver Steeg, and Galstyan. 2015. arXiv:1507.02284 [Cs, Math, Stat].
Vereshchagin, and Vitányi. 2004. IEEE Transactions on Information Theory.
———. 2010. IEEE Transactions on Information Theory.
Vitányi. 2006. IEEE Transactions on Information Theory.
Weidmann, and Vetterli. 2012. IEEE Transactions on Information Theory.
Weissman, Kim, and Permuter. 2013. IEEE Transactions on Information Theory.
Wolf, and Wolpert. 1994. arXiv:comp-Gas/9403002.
Wolpert, David H. 2006a. In Complex Engineered Systems. Understanding Complex Systems.
———. 2006b. In The Complex Networks of Economic Interactions. Lecture Notes in Economics and Mathematical Systems 567.
Wolpert, David H, and Lawson. 2002. In.
Wolpert, David H, Wheeler, and Tumer. 1999. In.
———. 2000. EPL (Europhysics Letters).
Wolpert, David H., and Wolf. 1994. arXiv:comp-Gas/9403001.
Xu, and Raginsky. 2017. In Advances In Neural Information Processing Systems.
Zellner. 1988. The American Statistician.
Zhang, and Grabchak. 2014. Neural Computation.