Estimation of mutual information functional from data

Informing yourself from your data how informative your data was

Say I would like to know the mutual information between the laws of the processes generating two streams \(X,Y\) of observations, under weak assumptions on those laws. If they have a continuous state space, joint density \(p_{X,Y}\) and marginal densities \(p_{X},p_{Y}\), then

\[ \operatorname{I}(X;Y)=\int_{\mathcal{Y}}\int_{\mathcal{X}} p_{X,Y}(x,y)\log\left(\frac{p_{X,Y}(x,y)}{p_{X}(x)\,p_{Y}(y)}\right)\,dx\,dy.\]

On its face this is a normal sort of problem: estimating a functional of an unknown probability distribution from samples.

Information is harder to estimate than a mean, because low-frequency observations have a large influence on its value but are, by definition, rarely observed. It is easy to get a uselessly biased – or even inconsistent – estimator, especially in the nonparametric case.

A typical technique is to construct a joint histogram from my (finite, discrete state space) sample, and then plug the empirical histogram into the information formula. This is already dangerous if there are rare symbols in my alphabet. If I want to do this with a continuous state variable by quantizing it to a finite alphabet, it is much more so: now I have to choose the bin size, or equivalently, the alphabet size. What is the natural one? Moreover, this plug-in method is highly sensitive to that choice and can be badly biased or even inconsistent if I don't do it right (Paninski 2003).
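
To make the bin-size complaint concrete, here is a minimal sketch of that plug-in estimator (assuming numpy; the name plugin_mi and the toy Gaussian data are my own):

```python
import numpy as np

def plugin_mi(x, y, bins=10):
    """Naive plug-in estimate of I(X;Y) in nats from a joint histogram.

    Quantises the samples into a bins-by-bins grid and plugs the empirical
    cell frequencies into the discrete mutual information formula.  The
    answer depends strongly on `bins` and is biased upward for small samples.
    """
    counts, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = counts / counts.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal over y
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal over x
    nz = p_xy > 0                           # empty cells contribute nothing
    return float(np.sum(p_xy[nz] * np.log(p_xy[nz] / (p_x @ p_y)[nz])))

# Correlated Gaussian toy data; the true MI is -0.5*log(1 - 0.5**2) ≈ 0.14 nats.
rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
y = 0.5 * x + np.sqrt(1 - 0.25) * rng.standard_normal(1000)
print(plugin_mi(x, y, bins=8), plugin_mi(x, y, bins=64))  # the answer moves with the bin count
```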

So, better alternatives?

One obvious one is to ask whether one really needs to estimate mutual information as such. Do I really want to know the information? Or do I merely wish to show that the mutual information between two processes is low, i.e. to establish some degree of independence? Independence is related to mutual information, but there are many ways to test for it that never require estimating the information itself.
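
To pick one concrete example of such an approach (my own choice of illustration): distance correlation vanishes exactly when the variables are independent (given finite first moments), and the sample version is only a few lines. A sketch assuming numpy, with dcor an arbitrary name:

```python
import numpy as np

def dcor(x, y):
    """Sample distance correlation of two scalar samples.

    Illustrative only: in the population this is zero iff X and Y are
    independent, so a permutation test on it checks independence without
    estimating mutual information at all.
    """
    x = np.asarray(x, float).reshape(-1, 1)
    y = np.asarray(y, float).reshape(-1, 1)
    a = np.abs(x - x.T)                                # pairwise distance matrices
    b = np.abs(y - y.T)
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()  # double centring
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    dcov2 = (A * B).mean()
    return np.sqrt(dcov2 / np.sqrt((A * A).mean() * (B * B).mean()))
```

A permutation test on such a statistic gives an independence test without ever touching \(\operatorname{I}(X;Y)\).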

If I have a parametric model for my processes I am on firmer ground; there might be an analytic estimator already. If not, I could instead estimate the joint densities and work it out from there.
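
For instance, under a jointly Gaussian model the mutual information has the closed form \(\operatorname{I}(X;Y) = -\tfrac{1}{2}\log(1-\rho^{2})\), so the whole problem reduces to estimating the correlation \(\rho\). A sketch (assuming numpy; gaussian_mi is my own name for it):

```python
import numpy as np

def gaussian_mi(x, y):
    """MI estimate in nats, assuming (X, Y) are jointly Gaussian.

    Under that model I(X;Y) = -0.5*log(1 - rho**2), so estimating the
    information reduces to estimating the correlation coefficient rho.
    """
    rho = np.corrcoef(x, y)[0, 1]
    return -0.5 * np.log1p(-rho ** 2)
```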

To consider:

  • judging purely by the authors, (Kandasamy et al. 2014) could be an interesting place to start.
  • Kraskov's nearest-neighbour method (Kraskov, Stögbauer, and Grassberger 2004) looks nice, but I don't know of any guarantees for it (see the sketch after this list)
  • those occasional mentions of calculating mutual information from recurrence plots: how do they work?
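
Since the Kraskov et al. method comes up so often, here is a rough sketch of their "estimator 1" for scalar X and Y (assuming numpy and scipy; ksg_mi is my own name, and this is a sketch rather than a reference implementation):

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def ksg_mi(x, y, k=4):
    """Sketch of the Kraskov-Stögbauer-Grassberger (2004) k-nearest-neighbour
    estimator ("estimator 1") of I(X;Y) in nats, for scalar samples."""
    x = np.asarray(x, float).reshape(-1, 1)
    y = np.asarray(y, float).reshape(-1, 1)
    n = len(x)
    joint = np.hstack([x, y])
    tree_x, tree_y = cKDTree(x), cKDTree(y)
    # Max-norm distance to the k-th nearest neighbour in the joint space
    # (k + 1 because each point is its own nearest neighbour).
    eps = cKDTree(joint).query(joint, k=k + 1, p=np.inf)[0][:, -1]
    # Count marginal neighbours strictly inside that radius, excluding self.
    nx = np.array([len(tree_x.query_ball_point(xi, e * (1 - 1e-10), p=np.inf)) - 1
                   for xi, e in zip(x, eps)])
    ny = np.array([len(tree_y.query_ball_point(yi, e * (1 - 1e-10), p=np.inf)) - 1
                   for yi, e in zip(y, eps)])
    return digamma(k) + digamma(n) - np.mean(digamma(nx + 1) + digamma(ny + 1))
```

On the correlated-Gaussian toy data above this should land near the true \(\approx 0.14\) nats, although, as noted, I do not know of finite-sample guarantees.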

However, I no longer need mutual information estimation, so I can safely forget all this for now.

Akaike, Hirotugu. 1973. “Information Theory and an Extension of the Maximum Likelihood Principle.” In Proceedings of the Second International Symposium on Information Theory, edited by B. N. Petrov and F. Csáki, 199–213. Budapest: Akadémiai Kiadó. http://link.springer.com/chapter/10.1007/978-1-4612-1694-0_15.

Gao, Shuyang, Greg Ver Steeg, and Aram Galstyan. 2015. “Efficient Estimation of Mutual Information for Strongly Dependent Variables.” In Journal of Machine Learning Research, 277–86. http://www.jmlr.org/proceedings/papers/v38/gao15.html.

Grassberger, Peter. 1988. “Finite Sample Corrections to Entropy and Dimension Estimates.” Physics Letters A 128 (6–7): 369–73. https://doi.org/10.1016/0375-9601(88)90193-4.

Hausser, Jean, and Korbinian Strimmer. 2009. “Entropy Inference and the James-Stein Estimator, with Application to Nonlinear Gene Association Networks.” Journal of Machine Learning Research 10: 1469.

Kandasamy, Kirthevasan, Akshay Krishnamurthy, Barnabas Poczos, Larry Wasserman, and James M. Robins. 2014. “Influence Functions for Machine Learning: Nonparametric Estimators for Entropies, Divergences and Mutual Informations,” November. http://arxiv.org/abs/1411.4342.

Kraskov, Alexander, Harald Stögbauer, and Peter Grassberger. 2004. “Estimating Mutual Information.” Physical Review E 69: 066138. https://doi.org/10.1103/PhysRevE.69.066138.

Moon, Kevin R., and Alfred O. Hero III. 2014. “Multivariate F-Divergence Estimation with Confidence.” In NIPS 2014. http://arxiv.org/abs/1411.2045.

Nemenman, Ilya, Fariel Shafee, and William Bialek. 2001. “Entropy and Inference, Revisited.” http://arxiv.org/abs/physics/0108025.

Paninski, Liam. 2003. “Estimation of Entropy and Mutual Information.” Neural Computation 15 (6): 1191–1253. https://doi.org/10.1162/089976603321780272.

Roulston, Mark S. 1999. “Estimating the Errors on Measured Entropy and Mutual Information.” Physica D: Nonlinear Phenomena 125 (3-4): 285–94. https://doi.org/10.1016/S0167-2789(98)00269-3.

Schürmann, Thomas. 2015. “A Note on Entropy Estimation.” Neural Computation 27 (10): 2097–2106. https://doi.org/10.1162/NECO_a_00775.

Shibata, Ritei. 1997. “Bootstrap Estimate of Kullback-Leibler Information for Model Selection.” Statistica Sinica 7: 375–94.

Taylor, Samuel F., Naftali Tishby, and William Bialek. 2007. “Information and Fitness.” arXiv preprint arXiv:0712.4382.

Wolf, David R., and David H. Wolpert. 1994. “Estimating Functions of Distributions from A Finite Set of Samples, Part 2: Bayes Estimators for Mutual Information, Chi-Squared, Covariance and Other Statistics,” March. http://arxiv.org/abs/comp-gas/9403002.

Wolpert, David H., and David R. Wolf. 1994. “Estimating Functions of Probability Distributions from a Finite Set of Samples, Part 1: Bayes Estimators and the Shannon Entropy,” March. http://arxiv.org/abs/comp-gas/9403001.

Zhang, Zhiyi, and Michael Grabchak. 2014. “Nonparametric Estimation of Kullback-Leibler Divergence.” Neural Computation 26 (11): 2570–93. https://doi.org/10.1162/NECO_a_00646.