Empirical estimation of information

Informing yourself from your data how informative your data was

April 19, 2011 — April 28, 2020

Figure 1

This is an empirical probability metric estimation problem, with especially cruel error properties. There are a few different versions of this problem corresponding to various different information: Mutual information between two variables, KL divergence between two distributions, information of one variable; discrete variables, continuous variable… In the mutual information case this is an independence test.

Say I would like to know the mutual information of the laws of processes generating two streams \(X,Y\) of observations, with weak assumptions on the laws of the generation process. Better, suppose further that each observation from each process is i.i.d. In the case that they have a continuous state space and joint densities \(p_{X,Y}\), marginal densities \(p_{X},p_{Y},\)

\[ \operatorname {I} (X;Y)=\int _{\mathcal {Y}}\int _{\mathcal {X}}{p_{X,Y}X,Y\log {\left({\frac {p_{X,Y}X,Y}{p_{X}(x)\,p_{Y}(y)}}\right)}}\;dx\,dy\]

Information is harder than the many metrics, because observations with low frequency have high influence on that value but are by definition rarely observed. It is easy to get a uselessly biased — or even inconsistent — estimator, especially in the nonparametric case.

1 Histogram estimator

The obvious one for discrete data. For continuous data where the histogram bins must also be learned, this method is highly sensitive and can be inconsistent if you don’t do it right (Paninski 2003).

2 Parametric


3 Monte Carlo parametric

One case you might want to estimate this value which is one where there is no nonparametric estimation problem per se but the integral to solve it is inconvenient. In which case, we might use a Monte Carlo method.

John Schulmann explicates a good trick for estimating KL divergence in the case that you can simulate from \(x_i\sim q\) and calculate \(p(x)\) and \(q(x_i),\) The following estimator is good despite looking unrelated:

\[\begin{aligned} KL[q, p] &= \int_x q(x) \log \frac{q(x)}{p(x)} \mathrm{d}x\\ &= E_{ x \sim q}\left[\log \frac{q(x)}{p(x)} \right]\\ &\approx \frac1N \sum_{i=1}^N \frac12(\log ⁡p(x)−\log ⁡q(x))^2 \end{aligned}\]

He also introduced a simple debiased one that does even better. The mechanics are interesting. (If you actually want a mutual information this notionally calculates it if we find the KL divergence between joint and product densities; But that is not totally trivial I shall concede.)

4 Incoming

Inform: A C library for information analysis of complex systems

Inform is a cross-platform C library designed for performing information analysis of complex system.

  • The inform_dist struct provides discrete, emperical probability distributions. These form the basis for all of the information-theoretic measures.
  • A collection of information measures built upon the distribution struct provide the core algorithms for the library, and are provided through the shannon.h header.
  • A host of measures of the information dynamics on time series are built upon the core information measures. Each measure is housed in its own header, e.g. active_info.h.

See also ITE toolbox (estimators).

Information Theoretical Estimators (ITE) in Python


  • is the redesigned, Python implementation of the Matlab/Octave ITE toolbox.
  • can estimate numerous entropy, mutual information, divergence, association measures, cross quantities, and kernels on distributions.
  • can be used to solve information theoretical optimization problems in a high-level way.
  • comes with several demos.
  • is free and open source: GNU GPLv3(>=).

Estimated quantities:

  • entropy (H): Shannon entropy, Rényi entropy, Tsallis entropy (Havrda and Charvát entropy), Sharma-Mittal entropy, Phi-entropy (f-entropy).
  • mutual information (I): Shannon mutual information (total correlation, multi-information), Rényi mutual information, Tsallis mutual information, chi-square mutual information (squared-loss mutual information, mean square contingency), L2 mutual information, copula-based kernel dependency, kernel canonical correlation analysis, kernel generalized variance, multivariate version of Hoeffding’s Phi, Hilbert-Schmidt independence criterion, distance covariance, distance correlation, Lancaster three-variable interaction.
  • divergence (D): Kullback-Leibler divergence (relative entropy, I directed divergence), Rényi divergence, Tsallis divergence, Sharma-Mittal divergence, Pearson chi-square divergence (chi-square distance), Hellinger distance, L2 divergence, f-divergence (Csiszár-Morimoto divergence, Ali-Silvey distance), maximum mean discrepancy (kernel distance, current distance), energy distance (N-distance; specifically the Cramer-Von Mises distance), Bhattacharyya distance, non-symmetric Bregman distance (Bregman divergence), symmetric Bregman distance, J-distance (symmetrised Kullback-Leibler divergence, J divergence), K divergence, L divergence, Jensen-Shannon divergence, Jensen-Rényi divergence, Jensen-Tsallis divergence.
  • association measures (A): multivariate extensions of Spearman’s rho (Spearman’s rank correlation coefficient, grade correlation coefficient), multivariate conditional version of Spearman’s rho, lower and upper tail dependence via conditional Spearman’s rho.
  • cross quantities (C): cross-entropy,
  • kernels on distributions (K): expected kernel (summation kernel, mean map kernel, set kernel, multi-instance kernel, ensemble kernel; specific convolution kernel), probability product kernel, Bhattacharyya kernel (Bhattacharyya coefficient, Hellinger affinity), Jensen-Shannon kernel, Jensen-Tsallis kernel, exponentiated Jensen-Shannon kernel, exponentiated Jensen-Rényi kernels, exponentiated Jensen-Tsallis kernels.
  • conditional entropy (condH): conditional Shannon entropy.
  • conditional mutual information (condI): conditional Shannon mutual information.

5 References

Akaike. 1973. Information Theory and an Extension of the Maximum Likelihood Principle.” In Proceeding of the Second International Symposium on Information Theory.
Amigó, Szczepański, Wajnryb, et al. 2004. Estimating the Entropy Rate of Spike Trains via Lempel-Ziv Complexity.” Neural Computation.
Crumiller, Knight, Yu, et al. 2011. Estimating the Amount of Information Conveyed by a Population of Neurons.” Frontiers in Neuroscience.
Elith, Phillips, Hastie, et al. 2011. A Statistical Explanation of MaxEnt for Ecologists.” Diversity and Distributions.
Gao, Ver Steeg, and Galstyan. 2015. Efficient Estimation of Mutual Information for Strongly Dependent Variables.” In Journal of Machine Learning Research.
Goodman. 2002. Extended Comment on Language Trees and Zipping.” arXiv:cond-Mat/0202383.
Grassberger. 1988. Finite Sample Corrections to Entropy and Dimension Estimates.” Physics Letters A.
Haslinger, Klinkner, and Shalizi. 2010. The Computational Structure of Spike Trains.” Neural Computation.
Hausser, and Strimmer. 2009. “Entropy Inference and the James-Stein Estimator, with Application to Nonlinear Gene Association Networks.” Journal of Machine Learning Research.
Kandasamy, Krishnamurthy, Poczos, et al. 2014. Influence Functions for Machine Learning: Nonparametric Estimators for Entropies, Divergences and Mutual Informations.” arXiv:1411.4342 [Stat].
Kraskov, Stögbauer, and Grassberger. 2004. Estimating Mutual Information.” Physical Review E.
Kulhavý. 1990. Recursive Nonlinear Estimation: A Geometric Approach.” Automatica.
Leinster. 2021. Entropy and Diversity: The Axiomatic Approach.”
Marzen, and Crutchfield. 2020. Inference, Prediction, and Entropy-Rate Estimation of Continuous-Time, Discrete-Event Processes.” arXiv:2005.03750 [Cond-Mat, Physics:nlin, Stat].
Moon, and Hero III. 2014. Multivariate f-Divergence Estimation With Confidence.” In NIPS 2014.
Nemenman, Bialek, and de Ruyter van Steveninck. 2004. Entropy and Information in Neural Spike Trains: Progress on the Sampling Problem.” Physical Review E.
Nemenman, Shafee, and Bialek. 2001. Entropy and Inference, Revisited.” In arXiv:physics/0108025.
Paninski. 2003. Estimation of Entropy and Mutual Information.” Neural Computation.
Roulston. 1999. Estimating the Errors on Measured Entropy and Mutual Information.” Physica D: Nonlinear Phenomena.
Rubinfeld. 2012. Taming Big Probability Distributions.” XRDS: Crossroads, The ACM Magazine for Students.
Schürmann. 2015. A Note on Entropy Estimation.” Neural Computation.
Shibata. 1997. “Bootstrap Estimate of Kullback-Leibler Information for Model Selection.” Statistica Sinica.
Song, and Ermon. 2020. Understanding the Limitations of Variational Mutual Information Estimators.” In.
Taylor, Tishby, and Bialek. 2007. “Information and Fitness.” Arxiv Preprint arXiv:0712.4382.
Wolf, and Wolpert. 1994. Estimating Functions of Distributions from A Finite Set of Samples, Part 2: Bayes Estimators for Mutual Information, Chi-Squared, Covariance and Other Statistics.” arXiv:comp-Gas/9403002.
Wolpert, and Wolf. 1994. Estimating Functions of Probability Distributions from a Finite Set of Samples, Part 1: Bayes Estimators and the Shannon Entropy.” arXiv:comp-Gas/9403001.
Zhang, and Grabchak. 2014. Nonparametric Estimation of Küllback-Leibler Divergence.” Neural Computation.