# Expectation propagation

Generalized moment matching

October 26, 2015 — April 3, 2023

approximation
Bayes
concurrency hell
graphical models
probabilistic algorithms
statistics

A classic generalised message-passing inference method, approximating belief propagation. It also appears under the name assumed density filtering, which is a special case.

## 1 What?

The foundational paper, Thomas P. Minka (2001), is readable, but lacks the right context to make sense in the 2020s.

Minka maintains an EP roadmap which discusses various facets of the field and is current to ca 2014, and a really useful cheat sheet.

Tutorial texts which include EP are… hmm… A shortcoming of the texts I can find is that they say “This is how we generalize Assumed Density Filtering”, which is as pedagogically useful as trying to introduce IEEE 754 float operations by saying “This is how we generalize an abacus”… Skip the abacus step, people.

Anyway, let us talk about the abacus.

## 2 Assumed Density Filtering

Approximating posteriors by moment matching of individual terms in a product approximation.

Murphy (2012) explains ADF. We assume that there is some family $$\mathcal{Q}$$ of density functions that we use to approximate the posteriors. We are in a filtering setting here, so we assume the posterior updates arrive sequentially, from a time series of observations.

Suppose we are interested in some unknown parameters $$\boldsymbol{\theta}_{t}$$. Starting from an approximate prior $$q_{t-1} \in \mathcal{Q}$$ such that $$q_{t-1}\left(\boldsymbol{\theta}_{t-1}\right) \approx p\left(\boldsymbol{\theta}_{t-1} \mid \mathbf{y}_{1: t-1}\right)$$, we can update this with the new measurement $$\mathbf{y}_{t}$$ to get the approximate posterior

$$\hat{p}\left(\boldsymbol{\theta}_{t}\right)=\frac{1}{Z_{t}} p\left(\mathbf{y}_{t} \mid \boldsymbol{\theta}_{t}\right) q_{t \mid t-1}\left(\boldsymbol{\theta}_{t}\right)$$

where

$$Z_{t}=\int p\left(\mathbf{y}_{t} \mid \boldsymbol{\theta}_{t}\right) q_{t \mid t-1}\left(\boldsymbol{\theta}_{t}\right) d \boldsymbol{\theta}_{t}$$

is the normalization constant and

$$q_{t \mid t-1}\left(\boldsymbol{\theta}_{t}\right)=\int p\left(\boldsymbol{\theta}_{t} \mid \boldsymbol{\theta}_{t-1}\right) q_{t-1}\left(\boldsymbol{\theta}_{t-1}\right) d \boldsymbol{\theta}_{t-1}$$

is the next-step predictive distribution. For most interesting models $$\hat{p}\left(\boldsymbol{\theta}_{t}\right) \notin \mathcal{Q}$$. So we seek an approximation by computing

$$q_{t}\left(\boldsymbol{\theta}_{t}\right)=\underset{q \in \mathcal{Q}}{\operatorname{arg\,min}}\ \mathrm{KL}\left(\hat{p}\left(\boldsymbol{\theta}_{t}\right) \| q\left(\boldsymbol{\theta}_{t}\right)\right).$$

If $$\mathcal{Q}$$ forms an exponential family, this minimisation is equivalent to moment matching, i.e. choosing the $$q_{t}$$ whose expected sufficient statistics equal those of $$\hat{p}$$. For the common case of Gaussians we call this weak marginalisation, apparently.
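To make the predict, tilt, project cycle concrete, here is a minimal sketch of ADF for a toy problem of my own choosing, not one from the references: a scalar $$\theta_t$$ with assumed random-walk dynamics, binary observations with an assumed Bernoulli-logit likelihood, and Gaussian $$\mathcal{Q}$$. The moments of $$\hat{p}$$ are computed by brute-force grid quadrature, which only makes sense in one dimension. The names `adf_step` and `adf_filter` are hypothetical.

```python
import numpy as np


def adf_step(mean, var, y, drift_var=0.0, n_grid=2001):
    """One ADF update of a scalar Gaussian belief N(mean, var) given y in {0, 1}."""
    # Predict: under the assumed random-walk dynamics, q_{t|t-1} is the old
    # Gaussian with its variance inflated by drift_var.
    pred_var = var + drift_var
    sd = np.sqrt(pred_var)
    theta = np.linspace(mean - 10.0 * sd, mean + 10.0 * sd, n_grid)
    log_pred = -0.5 * (theta - mean) ** 2 / pred_var
    # Assumed likelihood: log sigmoid(theta) if y == 1, else log sigmoid(-theta).
    sign = 1.0 if y == 1 else -1.0
    log_lik = -np.logaddexp(0.0, -sign * theta)
    # Unnormalised tilted density on the grid; Z_t cancels when we normalise.
    log_tilted = log_pred + log_lik
    w = np.exp(log_tilted - log_tilted.max())
    w /= w.sum()
    # Project back onto the Gaussian family by matching mean and variance.
    new_mean = float(np.sum(w * theta))
    new_var = float(np.sum(w * (theta - new_mean) ** 2))
    return new_mean, new_var


def adf_filter(ys, prior_mean=0.0, prior_var=4.0, drift_var=0.0):
    """Run ADF once through the observations, in order."""
    mean, var = prior_mean, prior_var
    for y in ys:
        mean, var = adf_step(mean, var, y, drift_var)
    return mean, var


ys = [1, 1, 1, 0] * 50  # three successes in every four trials
mean, var = adf_filter(ys)
print(f"ADF posterior over the log-odds: mean={mean:.3f}, var={var:.4f}")
```

Each observation is visited exactly once and then forgotten, so the answer can depend on the order of the data. That single-pass limitation is what EP removes.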

Practical example, you say? Murphy lists e.g. Zoeter (2007).

## 3 Generalising ADF to Expectation propagation

In EP we would like the ADF operations to be applicable to diverse belief propagation topologies, and to more general divergences. TBC

## 4 Relation to message passing on factor graphs

This has been confusing me. There is typically a factorisation given in the expectation propagation algorithm, but it is not necessarily the factor graph we would expect to see in other variational message passing, i.e. the factorisation over the factor graph. Sometimes it is; sometimes it is the somewhat-related-but-distinct product posterior approximation. When the factorisation of the posterior approximation corresponds to the factorisation of the probability, in the sense of inducing the same graphical structure, then it is natural to think of expectation propagation as a factor graph message passing algorithm. That is to say, we can write the true joint as $p(\mathbf{x})=s \prod_a f_a\left(x\right)$ and the variational approximation as $q(\mathbf{x})=s \prod_a q_a\left(x\right)$ where the $$a$$s index the same sets. Then everything will work out as factor message passing. If not, we still have a notion of projecting onto the approximate family, but over some different graph.

Not everyone makes this distinction. Here are some refs that do: Sutton (2005), Tom P. Minka (2005), Khashabi (n.d.), Thomas P. Minka (2018) (all pedagogic), Wainwright and Jordan (2008), Welling, Minka, and Teh (2012) (advanced). Sutton (2005) is the simplest and most pragmatic.

Recall Tom P. Minka (2005):

The recipe to make a message-passing algorithm has four steps:

1. Pick an approximating family for $$q$$ to be chosen from. For example, the set of fully-factorized distributions, the set of Gaussians, the set of $$k$$-component mixtures, etc.
2. Pick a divergence measure to minimize. For example, mean-field methods minimize the Kullback-Leibler divergence $$\mathrm{KL}(q \| p)$$, expectation propagation minimizes $$\mathrm{KL}(p \| q)$$, and power EP minimizes $$\alpha$$-divergence $$D_\alpha(p \| q)$$.
3. Construct an optimization algorithm for the chosen divergence measure and approximating family. Usually this is a fixed-point iteration obtained by setting the gradients to zero.
4. Distribute the optimization across the network, by dividing the network $$p$$ into factors, and minimizing local divergence at each factor.

We can apply this to construct EP in particular.

The next step is to write the original distribution $$p$$ as a product of nonnegative factors:

$$p(\mathbf{x})=\prod_a f_a(\mathbf{x})$$

This defines the specific way in which we want to divide the network, and is not unique. Each factor can depend on several, perhaps all, of the variables of $$p$$. By approximating each factor $$f_a$$ by $$\tilde{f}_a \in \mathcal{F}$$, we get an approximation divided in the same way:

$$\begin{aligned} \tilde{f}_a(\mathbf{x}) & =\exp \left(\sum_j g_j(\mathbf{x}) \tau_{a j}\right) \\ q(\mathbf{x}) & =\prod_a \tilde{f}_a(\mathbf{x}) \end{aligned}$$

Now we look at the problem from the perspective of a given approximate factor $$\tilde{f}_a$$. Define $$q^{\backslash a}(\mathbf{x})$$ to be the product of all other approximate factors:

$$q^{\backslash a}(\mathbf{x})=q(\mathbf{x}) / \tilde{f}_a(\mathbf{x})=\prod_{b \neq a} \tilde{f}_b(\mathbf{x})$$

Similarly, define $$p^{\backslash a}(\mathbf{x})=\prod_{b \neq a} f_b(\mathbf{x})$$. Then factor $$\tilde{f}_a$$ seeks to minimize $$D\left(f_a p^{\backslash a} \| \tilde{f}_a q^{\backslash a}\right)$$. To make this tractable, assume that the approximations we’ve already made, $$q^{\backslash a}(\mathbf{x})$$, are a good approximation to the rest of the network, i.e. $$p^{\backslash a} \approx q^{\backslash a}$$, at least for the purposes of solving for $$\tilde{f}_a$$. Then the problem becomes

$$\tilde{f}_a(\mathbf{x})=\operatorname{argmin} D\left(f_a(\mathbf{x}) q^{\backslash a}(\mathbf{x}) \| \tilde{f}_a(\mathbf{x}) q^{\backslash a}(\mathbf{x})\right)$$

This problem is tractable, provided we’ve made a sensible choice of factors.

The resulting algorithm he summarizes as

1. Initialize $$\tilde{f}_a(\mathbf{x})$$ for all $$a$$.
2. Repeat until all $$\tilde{f}_a$$ converge:
    1. Pick a factor $$a$$.
    2. Compute $$q^{\backslash a}$$.
    3. Minimise $$\tilde{f}_a(\mathbf{x})^{\text {new }}= \operatorname{argmin} D\left(f_a(\mathbf{x}) q^{\backslash a}(\mathbf{x}) \| \tilde{f}_a(\mathbf{x}) q^{\backslash a}(\mathbf{x})\right)$$.

This algorithm can be interpreted as message passing between the factors $$f_a$$. The approximation $$\tilde{f}_a$$ is the message that factor $$a$$ sends to the rest of the network, and $$q^{\backslash a}$$ is the collection of messages that factor $$a$$ receives (its “inbox”). The inbox summarizes the behavior of the rest of the network.
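The loop above can be sketched for the simplest case I can think of: one scalar unknown, a Gaussian prior, and one factor per observation, with $$D=\mathrm{KL}$$ so that each local minimisation is moment matching. Everything specific here is my assumption rather than Minka's: the Bernoulli-logit factors, the grid quadrature for the tilted moments, and the names `tilted_moments` and `ep_logit`. Approximate factors are Gaussian and stored in natural parameters (precision and precision-times-mean), so that dividing one out of $$q$$ is a subtraction.

```python
import numpy as np


def tilted_moments(log_factor, cav_mean, cav_var, n_grid=2001):
    """Mean and variance of f_a(theta) * q^{\\a}(theta), by grid quadrature."""
    sd = np.sqrt(cav_var)
    theta = np.linspace(cav_mean - 10.0 * sd, cav_mean + 10.0 * sd, n_grid)
    log_t = log_factor(theta) - 0.5 * (theta - cav_mean) ** 2 / cav_var
    w = np.exp(log_t - log_t.max())
    w /= w.sum()
    m = float(np.sum(w * theta))
    v = float(np.sum(w * (theta - m) ** 2))
    return m, v


def ep_logit(ys, prior_var=4.0, n_sweeps=10):
    """Gaussian EP for theta ~ N(0, prior_var), y_i ~ Bernoulli(sigmoid(theta))."""
    n = len(ys)
    site_r = np.zeros(n)  # precision of each approximate factor
    site_b = np.zeros(n)  # precision * mean of each approximate factor
    # Global q = prior * product of sites. The prior is already Gaussian,
    # so its approximate factor is exact and never needs updating.
    r, b = 1.0 / prior_var, 0.0
    for _ in range(n_sweeps):
        for i, y in enumerate(ys):
            # The "inbox" q^{\a}: divide factor i out of q.
            r_cav, b_cav = r - site_r[i], b - site_b[i]
            sign = 1.0 if y == 1 else -1.0

            def log_f(th, s=sign):
                return -np.logaddexp(0.0, -s * th)  # log sigmoid(s * th)

            # Moment-match the tilted distribution to get the new q ...
            m, v = tilted_moments(log_f, b_cav / r_cav, 1.0 / r_cav)
            r, b = 1.0 / v, m / v
            # ... and the new message is whatever turns the inbox into that q.
            site_r[i], site_b[i] = r - r_cav, b - b_cav
    return b / r, 1.0 / r


ys = [1, 1, 1, 0] * 10
mean, var = ep_logit(ys)
print(f"EP posterior over the log-odds: mean={mean:.3f}, var={var:.4f}")
```

The first sweep coincides with ADF; later sweeps revisit each factor with a better inbox. In one dimension an easy sanity check is to compare against the posterior moments obtained by integrating the exact posterior on a grid.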

To enable the use of un-normalized distributions, Tom P. Minka (2005) introduces the un-normalized KL divergence:

$$\mathrm{KL}(p \| q)=\int_x p(x) \log \frac{p(x)}{q(x)} d x+\int(q(x)-p(x)) d x$$

and un-normalized $$\alpha$$-divergences:

$$D_\alpha(p \| q)=\frac{\int_x \left(\alpha p(x)+(1-\alpha) q(x)-p(x)^\alpha q(x)^{1-\alpha}\right) d x}{\alpha(1-\alpha)}$$

Apparently everything works out if you use these.
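A quick numerical check helps me believe these formulas. The sketch below evaluates both divergences by a Riemann sum on a grid for two un-normalized Gaussian bumps (the particular masses, means and widths are arbitrary choices of mine, and the function names are hypothetical) and confirms that $$D_\alpha(p \| q)$$ approaches $$\mathrm{KL}(p \| q)$$ as $$\alpha \to 1$$ and $$\mathrm{KL}(q \| p)$$ as $$\alpha \to 0$$.

```python
import numpy as np


def gaussian_pdf(x, mean, sd):
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))


def unnormalized_kl(p, q, dx):
    """KL(p || q) for nonnegative functions on a uniform grid with spacing dx."""
    return float(np.sum(p * np.log(p / q) + q - p) * dx)


def alpha_divergence(p, q, dx, alpha):
    """D_alpha(p || q) for nonnegative functions on a uniform grid, alpha not in {0, 1}."""
    integrand = alpha * p + (1.0 - alpha) * q - p ** alpha * q ** (1.0 - alpha)
    return float(np.sum(integrand) * dx / (alpha * (1.0 - alpha)))


x = np.linspace(-12.0, 12.0, 24001)
dx = x[1] - x[0]
p = 2.0 * gaussian_pdf(x, 0.0, 1.0)  # total mass 2
q = 0.5 * gaussian_pdf(x, 1.0, 1.5)  # total mass 0.5

print("KL(p||q)        ", unnormalized_kl(p, q, dx))
print("D_alpha, a=0.999", alpha_divergence(p, q, dx, 0.999))
print("KL(q||p)        ", unnormalized_kl(q, p, dx))
print("D_alpha, a=0.001", alpha_divergence(p, q, dx, 0.001))
```

The extra $$\int(q-p)$$ term is what keeps the divergence nonnegative when the two arguments have different total mass.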

TBC

## 5 Higher order approximations

Kikuchi, Bethe etc

## 6 Incoming

Andrew Gelman on Vehtari et al. (2019):

We revisit expectation propagation (EP) as a prototype for scalable algorithms that partition big datasets into many parts and analyze each part in parallel to perform inference of shared parameters. The algorithm should be particularly efficient for hierarchical models, for which the EP algorithm works on the shared parameters (hyperparameters) of the model.

The central idea of EP is to work at each step with a “tilted distribution” that combines the likelihood for a part of the data with the “cavity distribution,” which is the approximate model for the prior and all other parts of the data. EP iteratively approximates the moments of the tilted distributions and incorporates those approximations into a global posterior approximation. As such, EP can be used to divide the computation for large models into manageable sizes. The computation for each partition can be made parallel with occasional exchanging of information between processes through the global posterior approximation. Moments of multivariate tilted distributions can be approximated in various ways, including, MCMC, Laplace approximations, and importance sampling.

I love love love love love this. The idea is to forget about the usual derivation of EP (the Kullback-Leibler discrepancy, etc.) and to instead start at the other end, with Bayesian data-splitting algorithms, with the idea of taking a big problem and dividing it into K little pieces, performing inference on each of the K pieces, and then putting them together to get an approximate posterior inference.

The difficulty with such algorithms, as usually constructed, is that each of the K pieces has only partial information; as a result, for any of these pieces, you’re wasting a lot of computation in places that are contradicted by the other K-1 pieces.
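Here is how I read the data-splitting idea, as a toy sketch rather than the algorithm of Vehtari et al. (2019): split the observations into $$K$$ shards, give each shard one Gaussian site, and on every round let each shard moment-match its own tilted distribution against its cavity, independently of the others. The shard results are then merged into the global approximation with damping. The Bernoulli-logit model, the grid quadrature standing in for MCMC or Laplace, the damping value and the name `parallel_ep` are all my assumptions.

```python
import numpy as np


def tilted_moments(cav_mean, cav_var, n_ones, n_zeros, n_grid=2001):
    """Moments of cavity * shard likelihood for Bernoulli-logit data, on a grid."""
    sd = np.sqrt(cav_var)
    theta = np.linspace(cav_mean - 10.0 * sd, cav_mean + 10.0 * sd, n_grid)
    log_lik = -n_ones * np.logaddexp(0.0, -theta) - n_zeros * np.logaddexp(0.0, theta)
    log_t = log_lik - 0.5 * (theta - cav_mean) ** 2 / cav_var
    w = np.exp(log_t - log_t.max())
    w /= w.sum()
    m = float(np.sum(w * theta))
    v = float(np.sum(w * (theta - m) ** 2))
    return m, v


def parallel_ep(shards, prior_var=4.0, n_iter=30, damping=0.5):
    """Damped parallel EP with one Gaussian site per data shard."""
    counts = [(int(np.sum(s)), len(s) - int(np.sum(s))) for s in shards]
    n_shards = len(counts)
    site_r = np.zeros(n_shards)  # site precisions
    site_b = np.zeros(n_shards)  # site precision * mean
    for _ in range(n_iter):
        # Global approximation shared with every worker this round.
        r = 1.0 / prior_var + site_r.sum()
        b = site_b.sum()
        new_r = np.empty(n_shards)
        new_b = np.empty(n_shards)
        # Each pass of this loop touches only one shard's data, so the
        # passes could run on separate workers.
        for k, (n_ones, n_zeros) in enumerate(counts):
            r_cav, b_cav = r - site_r[k], b - site_b[k]
            m, v = tilted_moments(b_cav / r_cav, 1.0 / r_cav, n_ones, n_zeros)
            new_r[k] = 1.0 / v - r_cav
            new_b[k] = m / v - b_cav
        # Damped merge in natural parameters.
        site_r = (1.0 - damping) * site_r + damping * new_r
        site_b = (1.0 - damping) * site_b + damping * new_b
    r = 1.0 / prior_var + site_r.sum()
    b = site_b.sum()
    return b / r, 1.0 / r


ys = np.array([1, 1, 1, 0] * 10)
mean, var = parallel_ep(np.array_split(ys, 4))
print(f"parallel EP: mean={mean:.3f}, var={var:.4f}")
```

The cavity is what addresses the wasted-computation complaint above: each shard works under a summary of what the other $$K-1$$ shards have said so far, rather than under the prior alone.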

Christian Robert’s rejoinder is interesting too.

I would like to read the new Helsinki group work on EP in state filters, which looks interesting.

Guillaume Dehaene, “Expectation Propagation is exact in the large data limit”, sounds interesting:

Expectation Propagation (EP) is a popular algorithm for approximating posterior distributions. While it is known empirically to give good approximations at a low computational cost it’s also very poorly understood. In this talk, I will present some new results on EP in the large-data limit. I will show that Gaussian-EP iterations asymptotically behave like the Newton algorithm, from which it follows that Gaussian-EP is asymptotically exact. I will then give an upper bound on the errors of the EP approximation and show that their decay speed compares favorably to the errors of the Laplace approximation of the posterior.

These theoretical results can be used to inform practical use of EP, and I will give some advice for implementation of the method.

## 7 References

Barthelmé, and Chopin. 2011. “ABC-EP: Expectation Propagation for Likelihood-Free Bayesian Computation.” In Proceedings of the 28th International Conference on International Conference on Machine Learning. ICML’11.
Boukouvalas, Barillec, and Cornford. 2012. In ICML 2012.
Çakmak, Opper, Fleury, et al. 2016.
Chen, and Wand. 2020. Journal of the Korean Statistical Society.
Cressie, and Wikle. 2014. In Wiley StatsRef: Statistics Reference Online.
Csató, and Opper. 2002. Neural Computation.
Csató, Opper, and Winther. 2001. In Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic. NIPS’01.
de Freitas, Niranjan, Gee, et al. 1998. “Sequential Monte Carlo Methods for Optimisation of Neural Network Models.” Cambridge University Engineering Department, Cambridge, England, Technical Report TR-328.
Dehaene, and Barthelmé. 2015. arXiv:1503.08060 [Math, Stat].
Deisenroth, and Mohamed. 2012. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 2. NIPS’12.
Gustafsson, and Hendeby. 2012. IEEE Transactions on Signal Processing.
Hasenclever, Webb, Lienart, et al. 2017. In The Journal of Machine Learning Research.
Heess, Tarlow, and Winn. 2013. In Advances in Neural Information Processing Systems.
Hernandez-Lobato, and Hernandez-Lobato. 2016. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics.
Huang, and Gelman. 2005. SSRN Electronic Journal.
Kamthe, Takao, Mohamed, et al. 2022. Transactions on Machine Learning Research.
Khashabi. n.d. “Expectation Propagation for Bayesian Inference.”
Kirkley, Cantwell, and Newman. 2020. arXiv:2009.12246 [Cond-Mat].
Li, Hernandez-Lobato, and Turner. 2015. In Advances In Neural Information Processing Systems.
Liu, and Ihler. 2011. In Proceedings of the 28th International Conference on International Conference on Machine Learning. ICML’11.
Minka, Thomas. 2001.
Minka, Thomas P. 2001. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence. UAI’01.
Minka, Tom P. 2005.
———. 2008. Technical Report.
Minka, Thomas P. 2018.
Murphy. 2012. Machine learning: a probabilistic perspective. Adaptive computation and machine learning series.
Nickisch, and Rasmussen. 2008. “Approximations for Binary Gaussian Process Classification.” Journal of Machine Learning Research.
Nodelman, Koller, and Shelton. 2012.
Opper, and Winther. 2000. Neural Computation.
———. 2005. Journal of Machine Learning Research.
Peltola, Jylänki, and Vehtari. 2014. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics.
Qi, Yuan. 2004. “Extending Expectation Propagation for Graphical Models.”
Qi, Yuan Alan, Abdel-Gawad, and Minka. 2010. In Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence. UAI’10.
Raymond, Manoel, and Opper. 2014. In Statistical Physics, Optimization, Inference, and Message-Passing Algorithms.
Rochoux, Emery, Ricci, et al. 2015. Natural Hazards and Earth System Sciences.
Rochoux, Ricci, Lucor, et al. 2014. Natural Hazards and Earth System Sciences.
Seeger, ed. 2005.
Seeger, and Nickisch. 2011. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics.
Sutton. 2005.
Taghvaei, and Mehta. 2021. IEEE Transactions on Automatic Control.
Vehtari, Gelman, Sivula, et al. 2019. arXiv:1412.4869 [Stat].
Wainwright, and Jordan. 2008. Graphical Models, Exponential Families, and Variational Inference. Foundations and Trends® in Machine Learning.
Wang, and Dunson. 2013.
Welling, Minka, and Teh. 2012. In Uncertainty in Artificial Intelligence.
Wilkinson, Chang, Andersen, et al. 2020. In ICML.
Wilkinson, Särkkä, and Solin. 2021.
Xu, Lakshminarayanan, Teh, et al. 2014. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2. NIPS’14.
Ypma, and Heskes. 2003. In 2003 IEEE XIII Workshop on Neural Networks for Signal Processing (IEEE Cat. No.03TH8718).
Zhang, Walder, Bonilla, et al. 2020. In Proceedings of NeurIPS 2020.
Zhou. 2017.
Zoeter. 2007. In 2007 5th International Symposium on Image and Signal Processing and Analysis.