# Informations

## Entropies and other measures of surprise

TODO: explain this diagram, which I ripped off Wikipedia.

Informations as measure, Venn diagram

Not: what you hope to get from the newspaper. (Although…) Rather: different types of (formally defined) entropy/information and their disambiguation. The seductive power of the logarithm, and of convex functions rather like it.

A proven path to publication is to find or reinvent a derived measure based on Shannon information, and apply it to something provocative-sounding. (Qualia! Stock markets! Evolution! Language! The qualia of evolving stock market languages!)

This is purely about the analytic definition given random variables. If you wish to estimate such a quantity empirically, from your experiment, that’s a different problem.

Connected also to functional equations and yes, statistical mechanics, and quantum information physics.

## Shannon Information

Vanilla information, thanks be to Claude Shannon. You are given a discrete random process with specified parameters. How much can you compress it down to a more parsimonious process? (Leaving coding theory aside for the moment.)

Given a random variable $$X$$ taking values $$x \in \mathcal{X}$$ in some discrete alphabet $$\mathcal{X}$$, with probability mass function $$p(x)$$, the entropy is

$\begin{array}{rcl} H(X) & := & -\sum_{x \in \mathcal{X}} p(x) \log p(x) \\ & \equiv & \operatorname{E}\left( \log \frac{1}{p(X)} \right) \end{array}$
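For concreteness, a minimal numpy sketch of the discrete definition (the function name is mine, not any library's API):

```python
# Minimal sketch: Shannon entropy of a known pmf, in nats.
import numpy as np

def shannon_entropy(p):
    """H(X) = -sum_x p(x) log p(x), with the convention 0 log 0 = 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # drop zero-probability symbols
    return -np.sum(p * np.log(p))

# A fair coin: log(2) ≈ 0.693 nats, i.e. exactly 1 bit.
print(shannon_entropy([0.5, 0.5]))                # ≈ 0.6931
print(shannon_entropy([0.5, 0.5]) / np.log(2))    # = 1.0 bit
```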

More generally, if $$X$$ has law $$P$$ which is absolutely continuous with respect to a reference measure $$\mu$$ on some Borel space,

$H(X)=-\int_{\mathcal{X}}\log \frac{\mathrm{d}P}{\mathrm{d}\mu}\,\mathrm{d}P$

Over at the Functional equations page I note that Tom Leinster has a clever proof of the optimality of Shannon information via functional equations.

One interesting aspect of the proof is where the difficulty lies. Let $$I:\Delta_n \to \mathbb{R}^+$$ be a sequence of continuous functions satisfying the chain rule; we have to show that $$I$$ is proportional to $$H$$. All the effort and ingenuity goes into showing that $$I$$ is proportional to $$H$$ when restricted to the uniform distributions. In other words, the hard part is to show that there exists a constant $$c$$ such that

$I(1/n, \ldots, 1/n) = c H(1/n, \ldots, 1/n)$

for all $$n \geq 1$$.

Venkatesan Guruswami, Atri Rudra and Madhu Sudan, Essential Coding Theory.

## K-L divergence

Because “Kullback-Leibler divergence” is a lot of syllables for something you use so often, even if usually in sentences like “unlike the K-L divergences”. Or you could call it the “relative entropy”, but that sounds like something to do with my uncle after the seventh round of Christmas drinks.

It is defined between the distributions of two discrete random variables $$X, Y$$ over the same space, with probability mass functions $$p(x)$$ and $$q(x)$$ respectively.

$\begin{array}{rcl} D(P \parallel Q) & := & \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)} \\ & \equiv & \operatorname{E}\left( \log \frac{p(X)}{q(X)} \right) \end{array}$
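As a sanity check, a minimal sketch for pmfs on a common alphabet (again the names are mine, not a library API); note the divergence is infinite when $$Q$$ assigns zero mass somewhere $$P$$ does not:

```python
# Minimal sketch: K-L divergence between two pmfs on the same alphabet, in nats.
import numpy as np

def kl_divergence(p, q):
    """D(P || Q) = sum_x p(x) log(p(x)/q(x))."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    if np.any(q[mask] == 0):
        return np.inf             # P is not absolutely continuous w.r.t. Q
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

print(kl_divergence([0.5, 0.5], [0.9, 0.1]))   # ≈ 0.511
print(kl_divergence([0.9, 0.1], [0.5, 0.5]))   # ≈ 0.368, so D is not symmetric
```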

More generally, if the random variables have laws, respectively $$P$$ and $$Q$$:

${\displaystyle D_{\operatorname {KL} }(P\|Q)=\int _{\operatorname {supp} P}{\frac {\mathrm {d} P}{\mathrm {d} Q}}\log {\frac {\mathrm {d} P}{\mathrm {d} Q}}\,dQ=\int _{\operatorname {supp} P}\log {\frac {\mathrm {d} P}{\mathrm {d} Q}}\,dP,}$

## Mutual information

The “informativeness” of one variable given another… Most simply, the K-L divergence between the joint distribution of two random variables and the product of their marginals. (In particular, it vanishes precisely when the two variables are independent.)

Now, take $$X$$ and $$Y$$ with joint probability mass function $$p_{XY}(x,y)$$ and, for clarity, marginal distributions $$p_X$$ and $$p_Y$$.

Then the mutual information $$I$$ is given by

$I(X; Y) = H(X) - H(X|Y)$

Estimating this one has been giving me grief lately, so I’ll be happy when I get to this section and solve it forever. See nonparametric mutual information.

Getting an intuition for what this measure does is handy, so I’ll expound some equivalent definitions that emphasise different characteristics:

$\begin{array}{rcl} I(X; Y) & := & \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p_{XY}(x, y) \log \frac{p_{XY}(x,y)}{p_X(x)p_Y(y)} \\ & = & D( p_{XY} \parallel p_X p_Y) \\ & = & \operatorname{E} \log \frac{p_{XY}(X,Y)}{p_X(X)p_Y(Y)} \end{array}$
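A minimal sketch for a known discrete joint pmf, cross-checked against the entropy identity $$I(X;Y) = H(X) + H(Y) - H(X,Y)$$ (which follows from the definition above); function names are mine:

```python
# Minimal sketch: mutual information of a discrete joint pmf, checked two ways.
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def mutual_information(p_xy):
    """I(X;Y) = D(p_XY || p_X p_Y), joint pmf given as a 2-D array."""
    p_xy = np.asarray(p_xy, dtype=float)
    p_x = p_xy.sum(axis=1, keepdims=True)        # marginal of X (column vector)
    p_y = p_xy.sum(axis=0, keepdims=True)        # marginal of Y (row vector)
    mask = p_xy > 0
    return np.sum(p_xy[mask] * np.log(p_xy[mask] / (p_x * p_y)[mask]))

p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
i_kl = mutual_information(p_xy)
# Cross-check against I(X;Y) = H(X) + H(Y) - H(X,Y).
i_h = entropy(p_xy.sum(1)) + entropy(p_xy.sum(0)) - entropy(p_xy.ravel())
print(i_kl, i_h)   # both ≈ 0.193 nats; an independent joint would give 0
```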

More usually we want the conditional mutual information:

$I(X;Y|Z)=\int _{\mathcal {Z}}D_{\mathrm {KL} }(P_{(X,Y)|Z}\|P_{X|Z}\otimes P_{Y|Z})dP_{Z}$
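A sketch of the discrete case, averaging the conditional K-L divergence over $$Z$$; the names and example are mine, chosen so that $$X$$ and $$Y$$ are conditionally independent given $$Z$$ (so the answer should be zero) while being marginally dependent:

```python
# Minimal sketch: conditional mutual information I(X;Y|Z) for a discrete joint
# pmf p(x,y,z) stored as a 3-D array, via the K-L form averaged over Z.
import numpy as np

def conditional_mutual_information(p_xyz):
    p_xyz = np.asarray(p_xyz, dtype=float)
    p_z = p_xyz.sum(axis=(0, 1))                      # marginal pmf of Z
    cmi = 0.0
    for z in range(p_xyz.shape[2]):
        if p_z[z] == 0:
            continue
        p_xy_z = p_xyz[:, :, z] / p_z[z]              # p(x, y | z)
        p_x_z = p_xy_z.sum(axis=1, keepdims=True)
        p_y_z = p_xy_z.sum(axis=0, keepdims=True)
        mask = p_xy_z > 0
        kl = np.sum(p_xy_z[mask] * np.log(p_xy_z[mask] / (p_x_z * p_y_z)[mask]))
        cmi += p_z[z] * kl                            # E_Z of the conditional K-L
    return cmi

# X and Y are conditionally independent given Z, so I(X;Y|Z) = 0,
# even though their marginal mutual information is positive.
p = np.zeros((2, 2, 2))
for z in range(2):
    p_x_given_z = np.array([0.8, 0.2]) if z == 0 else np.array([0.2, 0.8])
    p[:, :, z] = 0.5 * np.outer(p_x_given_z, p_x_given_z)
print(conditional_mutual_information(p))   # ≈ 0.0
```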

Here is Christopher Olah’s excellent visual explanation of it.

## Kolmogorov-Sinai entropy

Schreiber says:

If $$I$$ is obtained by coarse graining a continuous system $$X$$ at resolution $$\epsilon$$, the entropy $$H_X(\epsilon)$$ and entropy rate $$h_X(\epsilon)$$ will depend on the partitioning and in general diverge like $$\log(\epsilon)$$ when $$\epsilon \to 0$$. However, for the special case of a deterministic dynamical system, $$\lim_{\epsilon\to 0} h_X(\epsilon) = h_{KS}$$ may exist and is then called the Kolmogorov-Sinai entropy. (For non-Markov systems, also the limit $$k \to \infty$$ needs to be taken.)

That is, it is a special case of the entropy rate for a dynamical system. Cue connection to algorithmic complexity. Also metric entropy?
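A tiny numerical illustration of the coarse-graining part (my own, not from Schreiber): bin a continuous variable at resolution $$\epsilon$$ and watch the plug-in entropy grow like $$-\log \epsilon$$, offset by the differential entropy:

```python
# Minimal sketch: entropy of a coarse-grained continuous variable diverges
# like -log(eps) as the bin width eps shrinks.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)
h_diff = 0.5 * np.log(2 * np.pi * np.e)          # differential entropy of N(0,1), in nats

for eps in (1.0, 0.1, 0.01):
    counts = np.unique(np.floor(x / eps), return_counts=True)[1]
    p = counts / counts.sum()
    h_eps = -np.sum(p * np.log(p))               # plug-in entropy of the binned variable
    print(eps, round(h_eps, 3), round(h_diff - np.log(eps), 3))
# The last two columns roughly agree, and both blow up as eps -> 0.
```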

## Relatives

### Rényi Information

Also, the Hartley measure.

You don’t need to use a logarithm in your information summation. Free energy, something something. (?)

Many of the attractive features of information measures are simply due to the concavity of the logarithm in the summand. So, why not whack another concave function with even more handy features in there? Bam, you are now working on Rényi information. How do you feel?
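A minimal sketch of the usual parametric family, the Rényi entropy of order $$\alpha$$, which recovers Shannon entropy as $$\alpha \to 1$$ and the Hartley measure at $$\alpha = 0$$ (names are mine):

```python
# Minimal sketch: Rényi entropy of order alpha, in nats.
import numpy as np

def renyi_entropy(p, alpha):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    if np.isclose(alpha, 1.0):
        return -np.sum(p * np.log(p))        # Shannon entropy as the alpha -> 1 limit
    return np.log(np.sum(p ** alpha)) / (1.0 - alpha)

p = [0.7, 0.2, 0.1]
for alpha in (0.0, 0.5, 1.0, 2.0):
    print(alpha, renyi_entropy(p, alpha))
# alpha = 0 gives log(3), the Hartley measure; values are non-increasing in alpha.
```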

### Tsallis statistics

Attempting to make information measures “non-extensive”. “q-entropy”. Seems to have made a big splash in Brazil, but less in other countries. Non-extensive measures are an intriguing idea, though. I wonder if it’s parochialism that keeps everyone off Tsallis statistics, or a lack of demonstrated usefulness?
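A minimal sketch, assuming the standard definition $$S_q = (1 - \sum_x p(x)^q)/(q-1)$$; the last two lines illustrate the non-extensive (pseudo-additive) composition rule for independent variables:

```python
# Minimal sketch: Tsallis q-entropy, S_q = (1 - sum_x p(x)^q) / (q - 1).
import numpy as np

def tsallis_entropy(p, q):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    if np.isclose(q, 1.0):
        return -np.sum(p * np.log(p))        # Shannon entropy as q -> 1
    return (1.0 - np.sum(p ** q)) / (q - 1.0)

# Non-extensivity: for independent X, Y the q-entropies do not simply add.
# Instead S_q(X,Y) = S_q(X) + S_q(Y) + (1 - q) S_q(X) S_q(Y).
p, r, q = np.array([0.7, 0.3]), np.array([0.5, 0.5]), 2.0
joint = np.outer(p, r).ravel()               # pmf of the independent pair
print(tsallis_entropy(joint, q))
print(tsallis_entropy(p, q) + tsallis_entropy(r, q)
      + (1 - q) * tsallis_entropy(p, q) * tsallis_entropy(r, q))
# Both lines print 0.71.
```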

## Estimating information

Wait, you don’t know the exact parameters of your generating process a priori? Then you need to estimate these quantities from data.
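Even the simplest route has pitfalls: the plug-in estimator is biased downward in small samples. A minimal sketch of the plug-in estimate plus the Miller-Madow bias correction (see e.g. Paninski 2003, or Nemenman, Shafee, and Bialek 2001, for more careful estimators):

```python
# Minimal sketch: plug-in (maximum likelihood) entropy estimate from samples,
# plus the Miller-Madow bias correction (K - 1) / (2N).
import numpy as np

def plugin_entropy(samples):
    counts = np.unique(samples, return_counts=True)[1]
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def miller_madow_entropy(samples):
    counts = np.unique(samples, return_counts=True)[1]
    n, k = counts.sum(), len(counts)
    return plugin_entropy(samples) + (k - 1) / (2 * n)

rng = np.random.default_rng(0)
true_p = np.array([0.5, 0.25, 0.125, 0.125])
x = rng.choice(4, size=200, p=true_p)
print(-np.sum(true_p * np.log(true_p)))            # true entropy ≈ 1.213 nats
print(plugin_entropy(x), miller_madow_entropy(x))  # plug-in typically undershoots
```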

## References

Akaike, Hirotugu. 1973. In Proceedings of the Second International Symposium on Information Theory, edited by B N Petrov and F Csáki, 199–213. Budapest: Akademiai Kiado.
Amari, Shunʼichi. 2001. IEEE Transactions on Information Theory 47: 1701–11.
Ay, N., N. Bertschinger, R. Der, F. Güttler, and E. Olbrich. 2008. The European Physical Journal B - Condensed Matter and Complex Systems 63 (3): 329–39.
Baez, John C., Tobias Fritz, and Tom Leinster. 2011. Entropy 13 (11): 1945–57.
Barnett, Lionel, Adam B. Barrett, and Anil K. Seth. 2009. Physical Review Letters 103 (23): 238701.
Barrett, Adam B, Lionel Barnett, and Anil K Seth. 2010. Phys. Rev. E 81 (4): 041907.
Bialek, William, Ilya Nemenman, and Naftali Tishby. 2001. Physica A: Statistical and Theoretical Physics 302 (1-4): 89–99.
———. 2006. Neural Computation 13 (11): 2409–63.
Bieniawski, Stefan, and David H. Wolpert. 2004. In Proceedings of the Third International Joint Conference on Autonomous Agents and Multiagent Systems-Volume 3, 4:1230–31. IEEE Computer Society.
Calude, Cristian S. 2002. Information and Randomness : An Algorithmic Perspective. Springer.
Cappé, Olivier, Eric Moulines, and Tobias Ryden. 2007. Inference in Hidden Markov Models. 1st ed. 2005. Corr. 2nd printing 2007 edition. New York ; London: Springer.
Cerf, Nicolas J., and Chris Adami. 1998. “Information Theory of Quantum Entanglement and Measurement.” Physica D: Nonlinear Phenomena 120 (1-2): 62–81.
Cerf, Nicolas J., and Christoph Adami. 1997. “Entropic Bell Inequalities.” Physical Review A 55 (5): 3371.
Chaitin, Gregory J. 1977. “Algorithmic Information Theory.” IBM Journal of Research and Development.
———. 2002. “The Intelligibility of the Universe and the Notions of Simplicity, Complexity and Irreducibility.”
Chiribella, G, G M D’Ariano, and P Perinotti. 2011. Physical Review A 84 (1): 012311.
Chow, C K, and C N Liu. 1968. IEEE Transactions on Information Theory 14: 462–67.
Cohen, Joel E. 1962. Behavioral Science 7 (2): 137–63.
Cover, Thomas M., Péter Gács, and Robert M. Gray. 1989. The Annals of Probability 17 (3): 840–65.
Cover, Thomas M, and Joy A Thomas. 2006. Elements of Information Theory. Wiley-Interscience.
Crooks, Gavin E. 2007. Physical Review Letters 99 (10): 100602.
Csiszár, Imre, and Paul C Shields. 2004. Information Theory and Statistics: A Tutorial. Vol. 1. Foundations and Trends in Communications and Information Theory.
Dahlhaus, R. 1996. Stochastic Processes and Their Applications 62 (1): 139–68.
Dewar, Roderick C. 2003. Journal of Physics A: Mathematical and General 36: 631–41.
Earman, John, and John D Norton. 1998. Studies in History and Philosophy of Modern Physics 29 (4): 435–71.
———. 1999. Studies in History and Philosophy of Modern Physics 30 (1): 1–40.
Eichler, Michael. 2001. Granger-Causality Graphs for Multivariate Time Series.
Elidan, Gal, and Nir Friedman. 2005. “Learning Hidden Variable Networks: The Information Bottleneck Approach.” Journal of Machine Learning Research 6: 81–127.
Ellerman, David. 2017. arXiv:1707.04728 [Quant-Ph], May.
Feldman, David P, and James P Crutchfield. 2004. Advances in Complex Systems 7 (03): 329–55.
Frazier, Peter I, and Warren B Powell. 2010. Decision Analysis 7 (4): 378–403.
Friedman, Nir, Ori Mosenzon, Noam Slonim, and Naftali Tishby. 2001. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, 152–61. UAI’01. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.
Friston, Karl. 2010. Nature Reviews Neuroscience 11 (2): 127.
Gács, Péter, John T. Tromp, and Paul M.B. Vitányi. 2001. IEEE Transactions on Information Theory 47 (6): 2443–63.
Gao, Shuyang, Greg Ver Steeg, and Aram Galstyan. 2015. In Journal of Machine Learning Research, 277–86.
Goldfeld, Ziv, and Yury Polyanskiy. 2020. arXiv:2004.14941 [Cs, Stat], May.
Granger, Clive W J. 1963. Information and Control 6 (1): 28–48.
Grassberger, Peter. 1988. Physics Letters A 128 (6–7): 369–73.
Gray, Robert M. 1991. Entropy and Information Theory. New York: Springer-Verlag.
Hausser, Jean, and Korbinian Strimmer. 2009. “Entropy Inference and the James-Stein Estimator, with Application to Nonlinear Gene Association Networks.” Journal of Machine Learning Research 10: 1469.
Haussler, David, and Manfred Opper. 1997. The Annals of Statistics 25 (6): 2451–92.
Hirata, Hironori, and Robert E Ulanowicz. 1985. Journal of Theoretical Biology 116 (3): 321–41.
Jaynes, Edwin Thompson. 1963. “Information Theory and Statistical Mechanics.” In Statistical Physics. Vol. 3. Brandeis University Summer Institute Lectures in Theoretical Physics.
———. 1965. American Journal of Physics 33: 391–98.
Kandasamy, Kirthevasan, Akshay Krishnamurthy, Barnabas Poczos, Larry Wasserman, and James M. Robins. 2014. arXiv:1411.4342 [Stat], November.
Keane, M. S., and George L. O’Brien. 1994. ACM Trans. Model. Comput. Simul. 4 (2): 213–19.
Kelly Jr, J L. 1956. “A New Interpretation of Information Rate.” Bell System Technical Journal 35 (3): 917–26.
Klir, George J. 1998. Uncertainty and Information: Foundations of Generalized Information Theory. Wiley-IEEE Press.
Kolmogorov, A N. 1968. “Three Approaches to the Quantitative Definition of Information.” International Journal of Computer Mathematics 2 (1): 157–68.
Kraskov, Alexander, Harald Stögbauer, and Peter Grassberger. 2004. Physical Review E 69: 066138.
Krause, Andreas, and Carlos Guestrin. 2009. “Optimal Value of Information in Graphical Models.” J. Artif. Int. Res. 35 (1): 557–91.
Kullback, S, and R A Leibler. 1951. The Annals of Mathematical Statistics 22 (1): 79–86.
Langton, Chris G. 1990. Physica D: Nonlinear Phenomena 42 (1–3): 12–37.
Lecomte, V., C. Appert-Rolland, and F. van Wijland. 2007. Journal of Statistical Physics 127 (1): 51–106.
Leonenko, Nikolai. 2008. The Annals of Statistics 36 (5): 2153–82.
Leskovec, Jure. 2012. Eprint arXiv:1206.1331, June.
Liese, F, and I Vajda. 2006. IEEE Transactions on Information Theory 52 (10): 4394–4412.
Lin, Jianhua. 1991. IEEE Transactions on Information Theory 37 (1): 145–51.
Lizier, Joseph T, and Mikhail Prokopenko. 2010. The European Physical Journal B - Condensed Matter and Complex Systems 73 (4): 605–15.
Lizier, Joseph T, Mikhail Prokopenko, and Albert Y Zomaya. 2008. Physical Review E 77: 026110.
Marton, Katalin. 2015. arXiv:1507.02803 [Math], July.
Maynard Smith, John. 2000. Philosophy of Science 67 (2): 177–94.
Moon, Kevin R., and Alfred O. Hero III. 2014. In NIPS 2014.
Nemenman, Ilya, Fariel Shafee, and William Bialek. 2001. In arXiv:physics/0108025.
Odum, Howard T. 1988. “Self-Organization, Transformity, and Information.” Science, 1988.
Palomar, D.P., and S. Verdu. 2008. IEEE Transactions on Information Theory 54 (3): 964–75.
Paninski, Liam. 2003. Neural Computation 15 (6): 1191–1253.
Parry, William. 1964. Transactions of the American Mathematical Society 112 (1): 55–66.
Phat, Vu N., Nguyen T. Thanh, and Hieu Trinh. 2014. Complexity, August, n/a–.
Pinkerton, Richard C. 1956. Scientific American 194 (2): 77–86.
Plotkin, Joshua B, and Martin A Nowak. 2000. Journal of Theoretical Biology 205: 147–59.
Raginsky, M. 2011. In 2011 49th Annual Allerton Conference on Communication, Control, and Computing (Allerton), 958–65.
Raginsky, Maxim, and Igal Sason. 2012. Foundations and Trends in Communications and Information Theory, December.
Rissanen, Jorma. 2007. Information and Complexity in Statistical Modeling. Information Science and Statistics. New York: Springer.
Roulston, Mark S. 1999. Physica D: Nonlinear Phenomena 125 (3-4): 285–94.
Ryabko, Daniil, and Boris Ryabko. 2010. IEEE Transactions on Information Theory 56 (3): 1430–35.
Schreiber, Thomas. 2000. “Measuring Information Transfer.” Physical Review Letters 85 (2): 461–64.
Schürmann, Thomas. 2015. Neural Computation 27 (10): 2097–2106.
Sethna, James P. 2006. Statistical Mechanics: Entropy, Order Parameters, and Complexity. Oxford University Press, USA.
Shalizi, Cosma Rohilla. n.d.a. “Information Theory.”
———. n.d.b. “Optimal Prediction.”
Shalizi, Cosma Rohilla, and James P. Crutchfield. 2002. Advances in Complex Systems 05 (01): 91–95.
Shalizi, Cosma Rohilla, Robert Haslinger, Jean-Baptiste Rouquier, Kristina L Klinkner, and Cristopher Moore. 2006. Physical Review E 73 (3).
Shalizi, Cosma Rohilla, and Cristopher Moore. 2003. Eprint arXiv:cond-Mat/0303625.
Shannon, Claude E. 1948. “A Mathematical Theory of Communication.” The Bell Syst Tech J 27: 379–423.
Shibata, Ritei. 1997. “Bootstrap Estimate of Kullback-Leibler Information for Model Selection.” Statistica Sinica 7: 375–94.
Shields, P C. 1998. IEEE Transactions on Information Theory 44 (6): 2079–93.
Singh, Nirvikar. 1985. Journal of Political Economy 93 (3): 599–609.
Slonim, Noam, Gurinder S Atwal, Gašper Tkačik, and William Bialek. 2005. Proceedings of the National Academy of Sciences of the United States of America 102: 18297–302.
Slonim, Noam, Nir Friedman, and Naftali Tishby. 2006. Neural Computation 18 (8): 1739–89.
Smith, D Eric. 2008a. Journal of Theoretical Biology 252: 185–97.
———. 2008b. Journal of Theoretical Biology 252: 198–212.
Spence, Michael. 2002. American Economic Review 92: 434–59.
Steeg, Greg Ver, and Aram Galstyan. 2015. arXiv:1507.02284 [Cs, Math, Stat], July.
Strong, Steven P, Roland Koberle, Rob R de Ruyter van Steveninck, and William Bialek. 1998. Phys. Rev. Lett. 80 (1): 197–200.
Studený, Milan. 2016. arXiv:1612.06599 [Math, Stat], December.
Studený, Milan, and Jiřina Vejnarová. 1998. “On Multiinformation Function as a Tool for Measuring Stochastic Dependence.” In Learning in Graphical Models, 261–97. Cambridge, Mass.: MIT Press.
Taylor, Samuel F, Naftali Tishby, and William Bialek. 2007. “Information and Fitness.” Arxiv Preprint arXiv:0712.4382.
Tishby, Naftali, Fernando C Pereira, and William Bialek. 2000. arXiv:physics/0004057, April.
Tishby, Naftali, and Daniel Polani. 2011. “Information Theory of Decisions and Actions.” In PERCEPTION-ACTION CYCLE, 601–36. Springer.
Tumer, Kagan, and David H Wolpert. 2004. “Coordination in Large Collectives- Chapter 1.” In.
Vereshchagin, N.K., and Paul MB Vitányi. 2004. IEEE Transactions on Information Theory 50 (12): 3265–90.
———. 2010. IEEE Transactions on Information Theory 56 (7): 3438–54.
Vitányi, Paul M. 2006. IEEE Transactions on Information Theory 52 (10): 4617–26.
Weidmann, Claudio, and Martin Vetterli. 2012. IEEE Transactions on Information Theory 58 (8): 4969–92.
Weissman, T., Y. H. Kim, and H. H. Permuter. 2013. IEEE Transactions on Information Theory 59 (3): 1271–87.
Wolf, David R., and David H. Wolpert. 1994. arXiv:comp-Gas/9403002, March.
Wolpert, David H. 2006a. In Complex Engineered Systems, 262–90. Understanding Complex Systems. Springer Berlin Heidelberg.
———. 2006b. In The Complex Networks of Economic Interactions, 293–306. Lecture Notes in Economics and Mathematical Systems 567. Springer.
Wolpert, David H, and John W Lawson. 2002. In, 1066–73.
Wolpert, David H, Kevin R Wheeler, and Kagan Tumer. 1999. In, 77–83.
———. 2000. EPL (Europhysics Letters) 49: 708.
Wolpert, David H., and David R. Wolf. 1994. arXiv:comp-Gas/9403001, March.
Xu, Aolin, and Maxim Raginsky. 2017. In Advances In Neural Information Processing Systems.
Zhang, Zhiyi, and Michael Grabchak. 2014. Neural Computation 26 (11): 2570–93.
