# Nearly sufficient statistics

How about “Sufficient sufficiency”? — is that taken?

March 13, 2018 — January 14, 2019

approximation
estimator distribution
functional analysis
information
linear algebra
metrics
model selection
optimization
probabilistic algorithms
probability
sparser than thou
statistics
uncertainty

🏗

I’m working through a small realisation, for my own interest, which has been helpful in my understanding of variational Bayes; specifically, relating it to non-Bayesian variational inference, and also to sequential Monte Carlo.

NB: I have shelved this for now. Worked through my ideas, discovered that I was more-or-less reinventing rate-distortion theory and decided not to chase it. I may return later for pedagogic purposes.

By starting from the idea of sufficient statistics, we come to the idea of variational inference in a natural way, via some other interesting stopovers.

Consider the Bayes filtering setup. We have some system whose parameters we wish to estimate, and many experiments performed upon that system over time. We would like to update our understanding of that system using the experiments’ data. Possibly the system is changing, possibly it is not. Either way (or both ways), there is a computational challenge.

(Or it’s not changing, but our model is increasing in size as we gain more data about it. How to handle this case?)

We would like the hypothesis updating to happen in a compact way, where we can do inference without keeping the whole data set around, but rather use some summary statistics. With exponential family models this is trivial: just use conjugate updates and you are done. With the models where compactness would be more useful, say big data and big models without conjugate updates, it is not clear how to do this exactly.
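For concreteness, here is conjugate updating in the simplest case, a Beta–Bernoulli model: the posterior depends on the whole stream of observations only through two running counts. (A minimal sketch, assuming only NumPy; the model and numbers are illustrative.)

```python
import numpy as np

rng = np.random.default_rng(0)
stream = rng.binomial(1, 0.3, size=10_000)  # a long stream of coin flips

# Beta(a, b) prior on the success probability theta.
# Conjugacy means we never store the data, only two running counts,
# which together form a sufficient statistic for theta.
a, b = 1.0, 1.0
for x in stream:
    a += x       # count of successes
    b += 1 - x   # count of failures

posterior_mean = a / (a + b)  # close to the true 0.3 for a long stream
```

The memory cost is constant in the stream length, which is exactly the compactness we are after.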

Related, but not quite the same: learning summary statistics, and data summarisation via inducing sets or coresets (the latter reduce not the dimension of the data, but the size of the data set). Bounded memory learning? To mention: the Pitman–Koopman–Darmois theorem. Other fun keywords: rate distortion, sufficient statistic paradox.

## 1 Sufficient statistics in exponential families

Let’s start with sufficient statistics in exponential families, which, for reasons of historical pedagogy, are the Garden of Eden of Inference (the Garden of Edenference, for short). I suspect that deep in their hearts, all statisticians regard themselves as prodigal exiles from the exponential family, and long for the innocence of that Garden of Edenference.

Informally, here’s what’s going on with inference problems involving sufficient statistics. We are interested in estimating some parameter $$\theta$$ using realisations $$x$$ of some random process $$X\sim \mathbb{P}(x|\theta).$$

Then $$T(x)$$ is a sufficient statistic for $$\theta$$ iff $\mathbb{P}(x|T(x),\theta)= \mathbb{P}(x|T(x)),$ i.e. the conditional distribution of the data given $$T(x)$$ does not depend on $$\theta$$. That is, our inference about $$\theta$$ depends on the data only through the sufficient statistic.

(mention size of sufficient statistic)

The Fisher–Neyman factorization theorem characterises sufficiency: $$T$$ is sufficient for $$\theta$$ iff the density factorises as $\mathbb{P}(x;\theta)=h(x)\,g(T(x);\theta).$

Famously, maximum likelihood estimation in exponential family models is highly compressible, in that these models have sufficient statistics: low-dimensional functions of the data which capture all the information in the complete data, with respect to the parameter estimates. Many models, data sets and estimation methods do not have this feature, even parametric models with very few parameters.
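To see the compressibility concretely: for a Gaussian sample, the triple $$(n, \sum_i x_i, \sum_i x_i^2)$$ is sufficient, so the maximum likelihood estimate computed from those three numbers matches the one computed from the full sample. (A sketch, assuming NumPy; the sample is synthetic.)

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.5, size=1_000)

# Full-data MLE for a Gaussian:
mu_full = x.mean()
sigma2_full = ((x - mu_full) ** 2).mean()

# The same MLE from the three-number sufficient statistic:
n, s1, s2 = len(x), x.sum(), (x ** 2).sum()
mu_suff = s1 / n
sigma2_suff = s2 / n - mu_suff ** 2

# Both routes give the same answer; only (n, s1, s2) needed storing.
```

A thousand observations compress to three numbers with no loss at all, as far as these parameter estimates are concerned.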

This can be a PITA when your data is very big and you wish to benefit from that bigness, yet cannot fit the data in memory. The question then arises: when can I do better? Can I find a “nearly sufficient” statistic, smaller than my data, which does not worsen my error substantially? Can I quantify its nearness to sufficiency?
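As a toy illustration of what “nearly sufficient” might mean: the Cauchy location family has no fixed-dimensional sufficient statistic, but a handful of empirical quantiles makes a compact summary from which the usual location estimate, the sample median, is recovered essentially exactly, and much of the likelihood shape approximately. (An illustrative sketch assuming NumPy; the quantile summary is my own choice here, not a canonical construction.)

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_cauchy(size=100_000)  # heavy tails; no low-dim sufficient statistic

# A crude "nearly sufficient" summary: nine empirical quantiles.
# 100,000 numbers shrink to 9, yet the sample median survives.
probs = np.arange(1, 10) / 10  # 0.1, 0.2, ..., 0.9
summary = np.quantile(x, probs)

median_full = np.median(x)
median_from_summary = summary[4]  # the 0.5 quantile
```

Quantifying how much error such a summary introduces for estimates *other* than the median is exactly the kind of question rate-distortion-flavoured theory is supposed to answer.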
