# Expectation propagation

## Generalized moment matching

A classic generalised message-passing inference method that approximates belief propagation. Other names: assumed density filtering, which is a special case.

## What?

The foundational paper, Thomas P. Minka (2001), is readable. Tutorial texts which include EP are… hmm. A shortcoming of the texts I can find is that they say “This is how we generalize Assumed Density Filtering”, which is about as pedagogically useful as trying to introduce IEEE 754 float operations by saying “This is how we generalize an abacus”… Skip the abacus step, people.

Anyway, let us talk about the abacus.

## Assumed Density Filtering

Approximating posteriors by moment matching. Is there any distinction between this and plain moment matching? I cannot see one, is all. Murphy (2012) explains ADF. We assume that there is some family $$\mathcal{Q}$$ of density functions that we use to approximate the posteriors. We are in a filtering setting here, so we assume the posterior updates we are concerned with come from sequential updates over some time series of observations.

Suppose we are interested in some unknown parameters $$\boldsymbol{\theta}_{t}$$. Starting from an approximate prior $$q_{t-1} \in \mathcal{Q}$$ with $$q_{t-1}\left(\boldsymbol{\theta}_{t-1}\right) \approx p\left(\boldsymbol{\theta}_{t-1} \mid \mathbf{y}_{1: t-1}\right)$$, we can update this with the new measurement to get the approximate posterior $$\hat{p}\left(\boldsymbol{\theta}_{t}\right)=\frac{1}{Z_{t}} p\left(\mathbf{y}_{t} \mid \boldsymbol{\theta}_{t}\right) q_{t \mid t-1}\left(\boldsymbol{\theta}_{t}\right),$$ where $$Z_{t}=\int p\left(\mathbf{y}_{t} \mid \boldsymbol{\theta}_{t}\right) q_{t \mid t-1}\left(\boldsymbol{\theta}_{t}\right) \mathrm{d} \boldsymbol{\theta}_{t}$$ is the normalization constant and $$q_{t \mid t-1}\left(\boldsymbol{\theta}_{t}\right)=\int p\left(\boldsymbol{\theta}_{t} \mid \boldsymbol{\theta}_{t-1}\right) q_{t-1}\left(\boldsymbol{\theta}_{t-1}\right) \mathrm{d} \boldsymbol{\theta}_{t-1}$$ is the next-step predictive distribution. For most interesting dynamics $$\hat{p}\left(\boldsymbol{\theta}_{t}\right) \notin \mathcal{Q}$$, so we seek an approximation by computing $$q\left(\boldsymbol{\theta}_{t}\right)=\underset{q \in \mathcal{Q}}{\operatorname{arg\,min}}\ \operatorname{KL}\left(\hat{p}\left(\boldsymbol{\theta}_{t}\right) \| q\left(\boldsymbol{\theta}_{t}\right)\right).$$ If $$\mathcal{Q}$$ forms an exponential family, this minimisation is equivalent to moment matching. For the common case of Gaussians this is apparently called *weak marginalisation*.
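The projection step above can be sketched numerically. A minimal sketch, assuming a scalar parameter, a Gaussian approximating family, and a grid fine enough for quadrature (the helper name `adf_update` is my own, not from any library): form the unnormalised exact posterior on a grid, then project it back onto the Gaussian family by matching the first two moments, which is the $$\operatorname{KL}(\hat{p} \| q)$$ minimiser within an exponential family.

```python
import numpy as np
from scipy.stats import norm

def adf_update(prior_mean, prior_var, loglik, half_width=10.0, n=2001):
    """One ADF step on a grid: form the unnormalised exact posterior
    p_hat(theta) ∝ likelihood(theta) × Gaussian prior(theta), then
    project back onto the Gaussian family by moment matching."""
    sd = np.sqrt(prior_var)
    theta = np.linspace(prior_mean - half_width * sd,
                        prior_mean + half_width * sd, n)
    dtheta = theta[1] - theta[0]
    # unnormalised exact posterior on the grid
    w = np.exp(-0.5 * (theta - prior_mean) ** 2 / prior_var + loglik(theta))
    Z = w.sum() * dtheta            # normalisation constant (up to a prior constant)
    w /= Z
    mean = (theta * w).sum() * dtheta                # matched first moment
    var = ((theta - mean) ** 2 * w).sum() * dtheta   # matched second central moment
    return mean, var

# Example: N(0, 1) prior, probit likelihood Φ(θ) for a single observation y = +1.
m, v = adf_update(0.0, 1.0, norm.logcdf)
```

For this probit example the matched moments are also available in closed form, which makes a handy sanity check on the quadrature.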

Practical example, you say? Murphy lists e.g. Zoeter (2007).

## Generalising to Expectation propagation

In EP we would like the ADF operations to be applicable on diverse belief propagation topologies, and in particular to admit loopy BP. The key move is to keep an explicit approximate factor (a “site”) for each likelihood term, so that sites can be removed and refined repeatedly, rather than absorbed once in a fixed sequential order as in ADF.
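To see the site-refinement loop concretely, here is a hedged sketch (model and function name are my own illustration, not Minka's code) of EP for one-dimensional probit regression: cycle over sites, and for each one form the cavity, compute tilted moments, moment-match, and update the site. The tilted moments for a probit factor against a Gaussian cavity happen to be available in closed form.

```python
import numpy as np
from scipy.stats import norm

def ep_probit_1d(x, y, prior_var=1.0, n_sweeps=10):
    """EP sketch for p(w) ∝ N(w; 0, prior_var) · ∏_i Φ(y_i x_i w).
    Each likelihood factor gets a Gaussian "site", stored in natural
    parameters (precision tau_i, precision-times-mean nu_i)."""
    N = len(x)
    tau, nu = np.zeros(N), np.zeros(N)           # site parameters
    post_tau, post_nu = 1.0 / prior_var, 0.0     # global Gaussian approximation
    for _ in range(n_sweeps):
        for i in range(N):
            # 1. Cavity: divide site i out of the global approximation.
            cav_tau, cav_nu = post_tau - tau[i], post_nu - nu[i]
            cav_var, cav_mean = 1.0 / cav_tau, cav_nu / cav_tau
            # 2. Tilted moments: cavity × exact probit factor (closed form).
            s = np.sqrt(1.0 + x[i] ** 2 * cav_var)
            z = y[i] * x[i] * cav_mean / s
            r = norm.pdf(z) / norm.cdf(z)
            tilt_mean = cav_mean + y[i] * x[i] * cav_var * r / s
            tilt_var = cav_var - cav_var ** 2 * x[i] ** 2 * r * (z + r) / s ** 2
            # 3. Moment match: new global approximation takes the tilted moments.
            post_tau, post_nu = 1.0 / tilt_var, tilt_mean / tilt_var
            # 4. Update site i so that cavity × site = new global.
            tau[i], nu[i] = post_tau - cav_tau, post_nu - cav_nu
    return post_nu / post_tau, 1.0 / post_tau    # posterior mean, variance
```

With a single data point this collapses back to the ADF moment-matching update; a production implementation would additionally use damping and guard against negative cavity precisions.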

## Incoming

Andrew Gelman on Vehtari et al. (2019):

> We revisit expectation propagation (EP) as a prototype for scalable algorithms that partition big datasets into many parts and analyze each part in parallel to perform inference of shared parameters. The algorithm should be particularly efficient for hierarchical models, for which the EP algorithm works on the shared parameters (hyperparameters) of the model.
>
> The central idea of EP is to work at each step with a “tilted distribution” that combines the likelihood for a part of the data with the “cavity distribution,” which is the approximate model for the prior and all other parts of the data. EP iteratively approximates the moments of the tilted distributions and incorporates those approximations into a global posterior approximation. As such, EP can be used to divide the computation for large models into manageable sizes. The computation for each partition can be made parallel with occasional exchanging of information between processes through the global posterior approximation. Moments of multivariate tilted distributions can be approximated in various ways, including MCMC, Laplace approximations, and importance sampling.
>
> I love love love love love this. The idea is to forget about the usual derivation of EP (the Kullback-Leibler discrepancy, etc.) and to instead start at the other end, with Bayesian data-splitting algorithms, with the idea of taking a big problem and dividing it into K little pieces, performing inference on each of the K pieces, and then putting them together to get an approximate posterior inference.
>
> The difficulty with such algorithms, as usually constructed, is that each of the K pieces has only partial information; as a result, for any of these pieces, you’re wasting a lot of computation in places that are contradicted by the other K-1 pieces.
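The cavity/tilted data-splitting view can be illustrated in a conjugate toy case where every step is exact (the function name and model are my own illustration, not from the paper): EP over K data shards for the mean of a Gaussian. Because the shard likelihoods are conjugate, the tilted moments are available in closed form and one pass recovers the exact posterior; in the non-conjugate regime the paper targets, each shard's tilted moments would instead be estimated, in parallel, by MCMC, Laplace, or importance sampling.

```python
import numpy as np

def partitioned_ep_gaussian(shards, prior_var=10.0, noise_var=1.0):
    """EP over K data shards for an unknown Gaussian mean with known
    noise variance: one Gaussian site per shard, stored as natural
    parameters (precision tau_k, precision-times-mean nu_k)."""
    K = len(shards)
    tau, nu = np.zeros(K), np.zeros(K)           # site natural parameters
    post_tau, post_nu = 1.0 / prior_var, 0.0     # global approximation
    for k, y in enumerate(shards):
        # cavity: global approximation with shard k's site divided out
        cav_tau, cav_nu = post_tau - tau[k], post_nu - nu[k]
        # tilted distribution: cavity × shard-k likelihood (conjugate, so exact)
        post_tau = cav_tau + len(y) / noise_var
        post_nu = cav_nu + y.sum() / noise_var
        # site update so that cavity × site = new global
        tau[k], nu[k] = post_tau - cav_tau, post_nu - cav_nu
    return post_nu / post_tau, 1.0 / post_tau    # posterior mean, variance

# Two shards of a split dataset; the result matches the exact conjugate posterior.
m, v = partitioned_ep_gaussian([np.array([1.0]), np.array([2.0, 3.0])])
```

The point of the sketch is the communication pattern: each shard only ever touches its own site and the shared global natural parameters, which is what makes the parallel, occasional-synchronisation scheme possible.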

Christian Robert’s rejoinder is interesting too.

I would like to read the new Helsinki group work on EP in state filters, which looks interesting.

Guillaume Dehaene’s talk, “Expectation Propagation is exact in the large data limit”, sounds interesting:

> Expectation Propagation (EP) is a popular algorithm for approximating posterior distributions. While it is known empirically to give good approximations at a low computational cost it’s also very poorly understood. In this talk, I will present some new results on EP in the large-data limit. I will show that Gaussian-EP iterations asymptotically behave like the Newton algorithm, from which it follows that Gaussian-EP is asymptotically exact. I will then give an upper bound on the errors of the EP approximation and show that their decay speed compares favorably to the errors of the Laplace approximation of the posterior.
>
> These theoretical results can be used to inform practical use of EP, and I will give some advice for implementation of the method.

## References

Barthelmé, Simon, and Nicolas Chopin. 2011. “ABC-EP: Expectation Propagation for Likelihood-Free Bayesian Computation.” In Proceedings of the 28th International Conference on International Conference on Machine Learning, 289–96. ICML’11. Madison, WI, USA: Omnipress.
———. 2012. arXiv.
Boukouvalas, Alexis, Remi Barillec, and Dan Cornford. 2012. In ICML 2012.
Cressie, Noel, and Christopher K. Wikle. 2014. In Wiley StatsRef: Statistics Reference Online. American Cancer Society.
Csató, Lehel, and Manfred Opper. 2002. Neural Computation 14 (3): 641–68.
Dehaene, Guillaume, and Simon Barthelmé. 2015. arXiv:1503.08060 [Math, Stat], March.
Freitas, J. F. G. de, Mahesan Niranjan, A. H. Gee, and Arnaud Doucet. 1998. “Sequential Monte Carlo Methods for Optimisation of Neural Network Models.” Cambridge University Engineering Department, Cambridge, England, Technical Report TR-328.
Gustafsson, Fredrik, and Gustaf Hendeby. 2012. IEEE Transactions on Signal Processing 60 (2): 545–55.
Hernandez-Lobato, Daniel, and Jose Miguel Hernandez-Lobato. 2016. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, 168–76. PMLR.
Huang, Zaijing, and Andrew Gelman. 2005. SSRN Electronic Journal.
Kirkley, Alec, George T. Cantwell, and M. E. J. Newman. 2020. arXiv:2009.12246 [Cond-Mat], September.
Liu, Qiang, and Alexander Ihler. 2011. In Proceedings of the 28th International Conference on International Conference on Machine Learning, 849–56. ICML’11. Madison, WI, USA: Omnipress.
Minka, Thomas. 2001.
Minka, Thomas P. 2018.
Minka, Thomas P. 2001. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, 362–69. UAI’01. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.
Minka, Tom P. 2008. Technical Report.
Murphy, Kevin P. 2012. Machine learning: a probabilistic perspective. 1 edition. Adaptive computation and machine learning series. Cambridge, MA: MIT Press.
Nickisch, Hannes, and Carl Edward Rasmussen. 2008. “Approximations for Binary Gaussian Process Classiﬁcation.” Journal of Machine Learning Research 9: 44.
Nodelman, Uri, Daphne Koller, and Christian R. Shelton. 2012. arXiv.
Peltola, Tomi, Pasi Jylänki, and Aki Vehtari. 2014. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, 769–77. PMLR.
Raymond, Jack, Andre Manoel, and Manfred Opper. 2014. arXiv.
Rochoux, M. C., C. Emery, S. Ricci, B. Cuenot, and A. Trouvé. 2015. Natural Hazards and Earth System Sciences 15 (8): 1721–39.
Rochoux, M. C., S. Ricci, D. Lucor, B. Cuenot, and A. Trouvé. 2014. Natural Hazards and Earth System Sciences 14 (11): 2951.
Seeger, Matthias, ed. 2005.
Seeger, Matthias, and Hannes Nickisch. 2011. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 652–60. JMLR Workshop and Conference Proceedings.
Taghvaei, Amirhossein, and Prashant G. Mehta. 2021. IEEE Transactions on Automatic Control 66 (7): 3052–67.
Vehtari, Aki, Andrew Gelman, Tuomas Sivula, Pasi Jylänki, Dustin Tran, Swupnil Sahai, Paul Blomstedt, John P. Cunningham, David Schiminovich, and Christian Robert. 2019. arXiv:1412.4869 [Stat], November.
Wainwright, Martin J., and Michael I. Jordan. 2008. Graphical Models, Exponential Families, and Variational Inference. Vol. 1. Foundations and Trends® in Machine Learning. Now Publishers.
Wang, Xiangyu, and David B. Dunson. 2013. December.
Wilkinson, William J., Paul E. Chang, Michael Riis Andersen, and Arno Solin. 2020. In ICML.
Wilkinson, William J., Simo Särkkä, and Arno Solin. 2021. arXiv.
Zhang, Rui, Christian Walder, Edwin V. Bonilla, Marian-Andrei Rizoiu, and Lexing Xie. 2020. In Proceedings of NeurIPS 2020.
Zhou, Hai-Jun. 2017. 15.
Zoeter, Onno. 2007. In 2007 5th International Symposium on Image and Signal Processing and Analysis, 435–40. Istanbul, Turkey: IEEE.
