Variational message-passing algorithms in graphical models

Cleaving reality at the joint, then summing it at the marginal

November 25, 2014 — March 25, 2024

algebra

approximation

Bayes

distributed

dynamical systems

generative

graphical models

machine learning

networks

optimization

probability

signal processing

state space models

statistics

stochastic processes

swarm

time series

Assumed audience:

People who want to learn the tools of modern approximate graphical inference with the minimum of diversions

Even more wrong than normal this notebook is a mere scratch pad where I am sketching out an alternative explanation for myself and it is definitively not correct right now. It may at some future point become correct, but do not hold your breath. Right now I am copy-pasting and trying stuff on to see if I can find a quick path to success.

See also graph computations for a more general sense of “Message Passing”, and graph NNs for a related idea but with more magic neural dust sprinkled on it.

Variational message passing refers to inference wherein we can calculate some approximation to some marginal probability density by passing around “messages” that summarise (in some sense) some conditional probability updates between nodes on a graph. The nodes arise when the probability density factorizes over some graphical independence structure. If we can find such a cool factorisation, and an appropriate form for the messages, we might get cheap and/or distributed inference, for some values of cheap and distributed.

I am currently particularly interested in message passing for latent GP models where it generates Gaussian belief propagation.

The manner in which variational message passing is usually introduced I find confusing. We do not usually lead with the intuition. What are these damned messages? In this notebook I am trying to shuffle about the graphical models curriculum until I find a minimum viable set of concepts to explain to myself what I actually need to do in my own approximate variational inference practice when I fire up modern probabilistic programming tools.

This document is live and subject to revisions and definitely includes errors I have introduced recently in addition to the earlier errors I have not yet found.

For motivation as to why we would try to solve statistical questions by the passing of messages, see the specific example of belief propagation, which is the original setting for message passing, where it turns out there are some nice relations between graphs and integrals of factorised functions. It is worth learning once. Exact Belief Propagation cannot be implemented except in special cases, so you can likely forget it soon after. The key take-home is that some integrals arising in marginalising and conditioning random variables can be solved, at least conceptually, by passing messages around on the graph corresponding to the factorisation of the joint density. It is, IMO, worth not learning too much about because traditionally there are lots of other side-quests they send you at this point, all of which are useless practically, even if they are pedagogic.

There are two common ways to build up message passing ideas; one is to start from belief propagation and try to generalise it in various ways, worrying at each step how valid the extension is. Alternatively, we can present a specific VMP algorithm in its own right, which is pretty simple and always does something, even though sometimes it might produce nonsense. This approach is what of a lot of application papers actually do. Later we can wonder under what circumstances it might do marginal inference, and what it might approximate when it does not. The latter approach is my preference for now, since it leads to hands-on intuition-building with easy implementation, and defers the (to my mind) horrible, grinding stage of theorem-proving to later on when we are already solving useful problems and the pay-off is better.

Advice from Tom P. Minka (2005) tells us we have too many choices when devising a message passing algorism:

The recipe to make a message-passing algorithm has four steps:

Pick an approximating family for q to be chosen from. For example, the set of fully-factorized distributions, the set of Gaussians, the set of k-component mixtures, etc.

Pick a divergence measure to minimize. For example, mean-field methods minimize the Kullback-Leibler divergence \(KL(q \| p)\), expectation propagation minimizes \(KL(p \| q)\), and power EP minimizes α-divergence, \(D\alpha(p \| q)\).

Construct an optimization algorithm for the chosen divergence measure and approximating family. Usually this is a fixed-point iteration obtained by setting the gradients to zero.

Distribute the optimization across the network, by dividing the network p into factors, and minimizing local divergence at each factor.

Take home message: there are a lot of pluggable parts in this family of algorithms. This can be confusing. I think it is easier to start from a concrete set of assumptions, derive a specific (ideally maximally simple) algorithm and then generalise outwards to more fiddly/complicated ones. Here I keep a somewhat concrete example in mind.

Figure 2: I think the landscape of possible message-passing algorithms looks something like this. I am working on the assumption that starting from the red star is the easiest thing to do.

There are many variational message-passing algorithms, including Gaussian Belief Propagation, particle message passing, expectation propagation, gradient message passing, Stein variational gradient descent, kernel message passing (Song et al. 2011), Sigma-point belief propagation (Meyer, Hlinka, and Hlawatsch 2014) and probably other stuff I forgot.

1 It is conceptually easier to seek forgiveness for approximate inference than permission for exact inference

A classic way to introduce message-passing inference is to spend time wondering when we are permitted do exact inference with local calculations (“exact belief propagation”), and then introduce ever more caveats and complications.

After this complicated machinery breaks down we give up, throw most of it out, and introduce a simpler approximation, i.e. variational message-passing (VMP), which can be applied to for more diverse distributions and graph structures. There is slightly more complex set up in terms of distributions (what does it mean to variationally approximate a distribution?) but it the machinery there is something we can use over and over again and is simpler for me at least then complicated graph algorithms.

2 Which graph?

We can pass messages on factor graphs or on DAGs or what-have-you, and the update rules are slightly different in each case. It comes out simpler on factor graphs (although John M. Winn and Bishop (2005) does a good job of being clear about the structure of the method on DAGs by analogy with expectation maximisation, which produces some lovely symmetries). Regardless, factor graphs are easiest and that is what we pursue here; they come out connecting closely to some free-energy results from physics, e.g. ELBO if we use KL divergences.

3 Which divergence?

Kullback-Leibler divergence is a great starting point. In combination with the other assumptions we have introduced, this makes everything come out nice.

Hold that thought.

4 Which random variates?

We will assume exponential family RVs, or specifically, that each factorized term in a factor is an exponential condition on the incoming message. For now we will, at least. There are cool generalisations.

5 What is mean-field?

I think it means that we choose a factorisation of the joint (typically posterior) density which is identical to the product of variable node marginals:

\[ q(\mathbf{Z})=\prod_i^M q_i\left(\mathbf{Z}_i\right) \]

It is remarkably hard to extract this detail assumption explicitly from the literature. Note that this factorisation pairs well with the KL-divergence, although they are separate design choices.

6 KL variational message-passing for exponential families on factor graphs

Putting this together, a reasonable choice for our basic starting algorithm is Variational message passing on factor graphs with respect to KL divergence (Tom P. Minka 2005; Wand 2017). I will also assume that all the graphs are tree graphs, which is to say, there are no loops. Later on, it will turn out that some things break if we have loops, but often not very badly and the badness can be quantified and controlled using various tools. We can take a short leap of faith where we seem to being doing arbitrary or weird things, then reassure ourselves that we are solving something like a real inference problem.

TODO: compare this to an exact algorithm like Gaussian BP side by side.

Let us now look at the factor graph from Wand (2017).

Figure 3: A factor graph. The *factors* are represented by square nodes, and the random variates or *stochastic nodes* by circular ones.

Consider a Bayesian statistical model with observed data \(\boldsymbol{D}\) and parameter vector \(\boldsymbol{\theta}\). In a factor graph we can write the joint density of these random nodes as \(p(\boldsymbol{\theta}, \boldsymbol{D})=\prod_{j=1}^{N} f_{j}\left(\boldsymbol{\theta}_{\text {neighbors }(j)}\right) .\) Our goal in inference is to calculate the posterior density function \(p(\boldsymbol{\theta} \mid \boldsymbol{D}).\)¹

We do that with updates that look like this:

Figure 4: Exact belief propagation updates steps from Ortiz et al’s interactive GBP demo.

In the exact case Ortiz et al’s interpretation of these messages in the belief propagation setting makes sense:

Belief Update: The variable node beliefs are updated by taking a product of the incoming messages from all adjacent factors, each of which represents that factor’s belief on the receiving node’s variables.

Factor-to-variable message: To send a message to an adjacent variable node, a factor aggregates messages from all other adjacent variable nodes and marginalizes over all the other nodes’variables to produce a message that expresses the factor’s belief over the receiving node’s variables.

Variable-to-factor message: A variable-to-factor message tells the factor what the belief of the variable would be if the receiving factor node did not exist. This is computed by taking the product of the messages the variable node has received from all other factor nodes.

We are working in an approximated setting, though, so this is more of a heuristic.

A fully-factorized approximation to the posterior density function \(p(\boldsymbol{\theta} \mid \boldsymbol{D})\) is \[ p(\boldsymbol{\theta} \mid \boldsymbol{D}) \approx q^{*}(\boldsymbol{\theta}) \] where \(q^{*}(\boldsymbol{\theta})\) is the minimizer of the Kullback-Leibler divergence \(\int q(\boldsymbol{\theta}) \log \left\{\frac{q(\boldsymbol{\theta})}{p(\boldsymbol{\theta} \mid \boldsymbol{D})}\right\} d \boldsymbol{\theta}\). We suppress the dependence on data here because we are holding it fixed when we conduct a posterior update.

For each \(1 \leq i \leq M\) and \(1 \leq j \leq N\), the VMP stochastic node to factor message updates are \[ m_{\boldsymbol{\theta}_{i} \rightarrow f_{j}}\left(\boldsymbol{\theta}_{i}\right) \longleftarrow \propto \prod_{j^{\prime} \neq j: i \in \text { neighbors }\left(j^{\prime}\right)} m_{f_{j^{\prime}} \rightarrow \boldsymbol{\theta}_{i}}\left(\boldsymbol{\theta}_{i}\right) \] and the factor to stochastic node message updates are \[ m_{f_{j} \rightarrow \boldsymbol{\theta}_{i}}\left(\boldsymbol{\theta}_{i}\right) \longleftarrow \propto \exp \left[E_{f_{j} \rightarrow \boldsymbol{\theta}_{i}}\left\{\log f_{j}\left(\boldsymbol{\theta}_{\text {neighbors }(j)}\right)\right\}\right] \] where \(E_{f_{j} \rightarrow \boldsymbol{\theta}_{i}}\) denotes expectation with respect to the density function \[ \frac{\prod_{i^{\prime} \in \text { neighbors }(j) \backslash\{i\}} m_{f_{j} \rightarrow \boldsymbol{\theta}_{i^{\prime}}}\left(\boldsymbol{\theta}_{i^{\prime}}\right) m_{\boldsymbol{\theta}_{i^{\prime}} \rightarrow f_{j}}\left(\boldsymbol{\theta}_{i^{\prime}}\right)}{\prod_{i^{\prime} \in \text { neighbors }(j) \backslash\{i\}} \int m_{f_{j} \rightarrow \boldsymbol{\theta}_{i^{\prime}}}\left(\boldsymbol{\theta}_{i^{\prime}}\right) m_{\boldsymbol{\theta}_{i^{\prime}} \rightarrow f_{j}}\left(\boldsymbol{\theta}_{i^{\prime}}\right) d \boldsymbol{\theta}_{i^{\prime}}} \] That looks like we have made a ghastly mistake, because this latter message is horrible in general. it does simplify somewhat for conditionally-exponential families where we can exploit conditional conjugacy.

After passing messages around all the nodes in the graph until it seems to have converged, the Kullback-Leibler optimal \(q\)-densities are obtained via \[ q^{*}\left(\boldsymbol{\theta}_{i}\right) \propto \prod_{j: i \in \text { neighbors }(j)} m_{f_{j} \rightarrow \boldsymbol{\theta}_{i}}\left(\boldsymbol{\theta}_{i}\right). \]

6.1 What just happened?

What is this chaos?

Firstly, what is the message actually? It is a probability distribution, or, more generally a measure, over only some of the axes of the joint distribution.

Some texts mention describe the message as a vector. The fact that we can represent some target distribution, or more often some approximation to the probability distribution, as a finite-dimensional vector, is because some distributions can be represented by finite-dimensional parameter vectors (e.g. I can pack the means and variances of a multivariate Gaussian into a vector, which means that in Gaussian BP we are simply passing around Gaussian distributions).

But what do we actually do with these message? That depends.

What ends up being feasible most often is to introduce that lucky combination of assumptions (KL-divergence, fully-factorized, exponential-family) which causes the factors to decompose nicely such that the messages can be reasonably be considered over each edge separately.

If those assumptions do no hold, inference can be hard. In that case we can try various things. One option is to expand the model into a larger more flexible one, by adding auxiliary latent variables, or adding semiparametric models (Wand 2017) or possibly other clever tricks. Thus, this idea is not as hopeless as it first seemed.

Another weird thing is that the relationship between nodes and observations is often assumed away. You can kind of get away with this if observations are i.i.d, or even exponential family i.i.d. but even then it is confusing to me, because it relies on magic. These are important cases, but they treated as if their relationship to the methodology at its most general was obvious. IMO it is not obvious, but the fact that it comes out so nicely in the special cases means that I feel like an idiot for not “getting” what problem I am solving generally.

tl;dr Part of magic is that update rules for (conditionally) exponential variates under a mean field assumption and KL divergence are surprisingly simple.

Another part is that each observation in general introduces a new variable and a new factor, and we cannot always collapse them down using an i.i.d. assumption.

7 Factoring within nodes

🏗️

7.1 Fully-factored assumption

Here we assume that the density at each factor is well-approximated by a product of independent terms, one for each edge in the factor graph. (TODO: example equation).

Anyway, it not always a terrible idea to make this assumptions, but nor is it the only way to do things, at least conceptually. In practice, it is much more effort if we cannot make this assumption.

Nonetheless, papers that do not make it explicit are very confusing. Why would we ever expect inference to look anything like that?²

AFAICT this is a problem of history. Not all variational inference must make the mean-field, or the fully-factored assumptions etc, but methods that relax these restrictions start to look different and need rather different technology to go, such as local Monte Carlo estimation. For a long while this set of assumptions were the main game in town, (or even more onerous assumptions, like assuming everything is discrete and treating inference exactly) so tutorials of a certain vintage take mean-field variational inference as a synonym for variational inference. If I have just learnt some non-mean-field SVI methods from a recent NeurIPS paper before I encounter VMP, I might well be confused by this.

7.2 Structured mean field

Directly generalises mean field (Hoffman and Blei 2015). I am not sure how useful this is in practice.

7.3 Without mean field assumptions

What do relaxations of the mean-field assumption look like, and do they ever arise in message-passing models? Variational autoencoder do not use such assumption in general, but they also do not use message passing inference. Re-imposing a graphical structure on VAE models is gruelling and AFAIK only done on a case-by-case basis. Perhaps there are some neural methods that do something else?

8 Conditional conjugacy/ exponential families

Under construction.

TODO: note that “natural parameter” form is extra nice here because it summarises the “right” updates.

There are a several foundational texts to read about this (Attias 1999; Ghahramani and Beal 2001; John M. Winn and Bishop 2005).

Dustin Tran and David Blei mention the importance of the conditional conjugacy relations when we are working in (conditionally) exponential families.

Justin Domke summarises exponential families for graphical model usage.

9 Conditionally independent observations

TBD

10 Which free energy?

Helmholtz Free Energy is a classic flavour. On graphs we are concerned with a related, more general (?) free energy called Bethe free energy (J. S. Yedidia, Freeman, and Weiss 2005; Jonathan S. Yedidia, Freeman, and Weiss 2000; M. J. Wainwright and Jordan 2008).

11 Deterministic nodes

TBD

12 In MCMC sampling

See Huang and Gelman (2005), Vehtari et al. (2019) and X. Wang and Dunson (2013).

13 Expectation propagation

See Expectation propagation.

14 Semiparametric

This is the main innovation I think, in the synthesis of Wand (2017).

15 Loopy belief propagation

OK, we have assumed tree graphs for everything thus far, but this is an onerous assumption. We can allow loopy graphs, which often works, sort-of. This choice does introduce extra approximations into the solution, but also allows us to handle more plausible systems within the same setup.

From Murphy, Weiss, and Jordan (1999) and analysed in Weiss and Freeman (2001); M. J. Wainwright and Jordan (2008);J. S. Yedidia, Freeman, and Weiss (2005). One day I might be interested in Bethe free energy too?

TBD.

16 Gaussian BP in particular

See Gaussian belief propagation.

17 Phase transitions

Thomas Orton’s overview lecture connects message-passing with statistical mechanics of statistics.

Last week, we saw how certain computational problems like 3SAT exhibit a thresholding behavior, similar to a phase transition in a physical system. In this post, we’ll continue to look at this phenomenon by exploring a heuristic method, belief propagation (and the cavity method), which has been used to make hardness conjectures, and also has thresholding properties. In particular, we’ll start by looking at belief propagation for approximate inference on sparse graphs as a purely computational problem. After doing this, we’ll switch perspectives and see belief propagation motivated in terms of Gibbs free energy minimization for physical systems.

More possibly in Barbier (2015).

18 Gradient

See gradient message passing.

19 Neural

TBD. See (Kuck et al. 2020; Satorras and Welling 2021).

20 Further reading

A good introduction is M. J. Wainwright and Jordan (2008), which does some deep stuff with setting up graphical models in exponential family context. Bishop (2006) provides extra application context and a different theoretical development. Dauwels (2007) explains why the factor graph setting is the simplest way to develop the message passing algorithms. I agree with him about this. It is simplest to learn message-passing inference for factor graphs and then translate to and from classic DAGs and MRFs as needed.

Dustin Tran and David Blei advocate for Wand (2017) as a kind of modern milestone in this field. Wingate and Weber (2013) connects us to modern stochastic VI.

Next, see the recommended texts about general graphical model theory in the graphical models notebook.

From a signal processing/coding perspective H.-A. Loeliger (2004) is interesting. Read this one if you know what a turbo code is.

Other overviews I have found useful for various reasons are Tom P. Minka (2005),J. S. Yedidia, Freeman, and Weiss (2003), Sutton and Minka (2006) and Cox, van de Laar, and de Vries (2019).

Recommended: Wee Sun Lee’s NeurIPS lecture about how to integrate neural function approximation into distributed inference: Message Passing In Machine Learning.

From automatic differentiation to message passing

21 Tools

GAMP.
ForneyLab (Akbayrak, Bocharov, and de Vries 2021) looks especially useful for me:

The message passing paradigm offers a convenient method for leveraging model-specific structures, while remaining generally applicable. Message passing can be conveniently formulated on a Forney-style factor graph (FFG) representation of the model. Inference tasks on the model can then be decomposed in local computations, represented by messages that flow across the graph. This locality allows for storing pre-computed message updates in a look-up table that can be re-used across models. Automated algorithm construction then amounts to scheduling these messages in the order required by the inference task (see also this conference paper at JuliaCon).

ForneyLab (GitHub) is introduced in Cox, van de Laar, and de Vries (2019) as a novel Julia package that allows the user to specify a probabilistic model as an FFG and pose inference problems on this FFG. In return, ForneyLab automatically constructs a Julia program that executes a message passing-based (approximate) inference procedure. ForneyLab is designed with a focus on flexibility, extensibility and applicability to biologically plausible models for perception and decision making, such as the hierarchical Gaussian filter (HGF). With ForneyLab, the search for better models for perception and action can be accelerated

This idea has been baked into various other probabilistic programming frameworks by now, in that many of them seem to infer a reasonable message-passing structure for models.

22 References

Aji, and McEliece. 2000. “The Generalized Distributive Law.” IEEE Transactions on Information Theory.

Akbayrak, Bocharov, and de Vries. 2021. “Extended Variational Message Passing for Automated Approximate Bayesian Inference.” Entropy.

Ambrogioni, Lin, Fertig, et al. 2021. “Automatic Structured Variational Inference.” In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics.

Arnold, Castillo, and Sarabia. 1999. Conditional Specification of Statistical Models.

Attias. 1999. “Inferring Parameters and Structure of Latent Variable Models by Variational Bayes.” In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence. UAI’99.

Bapst, and Coja-Oghlan. 2016. “Harnessing the Bethe Free Energy.” Random Structures & Algorithms.

Barber. 2012. Bayesian Reasoning and Machine Learning.

Barbier. 2015. “Statistical Physics and Approximate Message-Passing Algorithms for Sparse Linear Estimation Problems in Signal Processing and Coding Theory.” arXiv:1511.01650 [Cs, Math].

Barbier, Krzakala, Macris, et al. 2017. “Phase Transitions, Optimal Errors and Optimality of Message-Passing in Generalized Linear Models.” arXiv:1708.03395 [Cond-Mat, Physics:math-Ph].

Barfoot. 2020. “Fundamental Linear Algebra Problem of Gaussian Inference.”

Bayati, and Montanari. 2011. “The Dynamics of Message Passing on Dense Graphs, with Applications to Compressed Sensing.” IEEE Transactions on Information Theory.

Bishop. 2006. Pattern Recognition and Machine Learning. Information Science and Statistics.

Blake, Kohli, and Rother, eds. 2011. Markov Random Fields for Vision and Image Processing.

Blei, Kucukelbir, and McAuliffe. 2017. “Variational Inference: A Review for Statisticians.” Journal of the American Statistical Association.

Borgerding, and Schniter. 2016. “Onsager-Corrected Deep Networks for Sparse Linear Inverse Problems.” arXiv:1612.01183 [Cs, Math].

Bouchard, and Zoeter. 2009. “Split Variational Inference.” In Proceedings of the 26th Annual International Conference on Machine Learning. ICML ’09.

Bouttier, Jardri, and Deneve. 2024. “Circular Belief Propagation for Approximate Probabilistic Inference.”

Boyd. 2010. Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers.

Braunstein, Mézard, and Zecchina. 2005. “Survey Propagation: An Algorithm for Satisfiability.” Random Structures & Algorithms.

Cevher, Duarte, Hegde, et al. 2009. “Sparse Signal Recovery Using Markov Random Fields.” In Advances in Neural Information Processing Systems.

Cox, van de Laar, and de Vries. 2019. “A Factor Graph Approach to Automated Design of Bayesian Signal Processing Algorithms.” International Journal of Approximate Reasoning.

Dauwels. 2007. “On Variational Message Passing on Factor Graphs.” In 2007 IEEE International Symposium on Information Theory.

Dehaene. 2016. “Expectation Propagation Performs a Smoothed Gradient Descent.” arXiv:1612.05053 [Stat].

Donoho, Maleki, and Montanari. 2009a. “Message-Passing Algorithms for Compressed Sensing.” Proceedings of the National Academy of Sciences.

———. 2009b. “Message Passing Algorithms for Compressed Sensing: II. Analysis and Validation.” In 2010 IEEE Information Theory Workshop (ITW).

Donoho, Maleki, and Montanari. 2010. “Message Passing Algorithms for Compressed Sensing: I. Motivation and Construction.” In 2010 IEEE Information Theory Workshop (ITW).

Donoho, and Montanari. 2013. “High Dimensional Robust M-Estimation: Asymptotic Variance via Approximate Message Passing.” arXiv:1310.7320 [Cs, Math, Stat].

Eaton, and Ghahramani. 2009. “Choosing a Variable to Clamp.” In Proceedings of the Twelth International Conference on Artificial Intelligence and Statistics.

Forney. 2001. “Codes on Graphs: Normal Realizations.” IEEE Transactions on Information Theory.

Frey, and Jojic. 2005. “A Comparison of Algorithms for Inference and Learning in Probabilistic Graphical Models.” IEEE Transactions on Pattern Analysis and Machine Intelligence.

Geier, Richter, and Biundo. 2014. “Conditioned Belief Propagation Revisited (Extended Version).”

Gemp, McWilliams, Vernade, et al. 2020. “EigenGame: PCA as a Nash Equilibrium.” In.

Ghahramani, and Beal. 2001. “Propagation Algorithms for Variational Bayesian Learning.” In Advances in Neural Information Processing Systems.

Hoffman, and Blei. 2015. “Structured Stochastic Variational Inference.” In Artificial Intelligence and Statistics.

Huang, and Gelman. 2005. “Sampling for Bayesian Computation with Large Datasets.” SSRN Electronic Journal.

Jaggi, Smith, Takac, et al. 2014. “Communication-Efficient Distributed Dual Coordinate Ascent.” In Advances in Neural Information Processing Systems 27.

Jordan, Michael Irwin. 1999. Learning in Graphical Models.

Jordan, Michael I. 2004. “Graphical Models.” Statistical Science.

Jordan, Michael I., Ghahramani, Jaakkola, et al. 1999. “An Introduction to Variational Methods for Graphical Models.” Machine Learning.

Kindermann, and Snell. 1980. Markov Random Fields and Their Applications. Contemporary Mathematics.

Kirkley, Cantwell, and Newman. 2020. “Message Passing for Probabilistic Models on Networks with Loops.” arXiv:2009.12246 [Cond-Mat].

Kjærulff, and Madsen. 2008. Bayesian Networks and Influence Diagrams. Information Science and Statistics.

Koller, and Friedman. 2009. Probabilistic Graphical Models : Principles and Techniques.

Kschischang, Frey, and Loeliger. 2001. “Factor Graphs and the Sum-Product Algorithm.” IEEE Transactions on Information Theory.

Kuck, Chakraborty, Tang, et al. 2020. “Belief Propagation Neural Networks.” arXiv:2007.00295 [Cs, Stat].

Lauritzen. 1996. Graphical Models. Oxford Statistical Science Series.

Li, and Turner. 2016. “Rényi Divergence Variational Inference.” In Advances in Neural Information Processing Systems.

Liu. 2020. “Perspectives on Probabilistic Graphical Models.”

Loeliger, H.-A. 2004. “An Introduction to Factor Graphs.” IEEE Signal Processing Magazine.

Loeliger, Hans-Andrea, Dauwels, Hu, et al. 2007. “The Factor Graph Approach to Model-Based Signal Processing.” Proceedings of the IEEE.

Malioutov, Johnson, and Willsky. 2006. “Walk-Sums and Belief Propagation in Gaussian Graphical Models.” Journal of Machine Learning Research.

Ma, Smith, Jaggi, et al. 2015. “Adding Vs. Averaging in Distributed Primal-Dual Optimization.” arXiv:1502.03508 [Cs].

Meyer, Hlinka, and Hlawatsch. 2014. “Sigma Point Belief Propagation.” IEEE Signal Processing Letters.

Minka, Thomas P. 2001. “Expectation Propagation for Approximate Bayesian Inference.” In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence. UAI’01.

Minka, Tom P. 2005. “Divergence Measures and Message Passing.”

———. 2008. “EP: A Quick Reference.” Techincal Report.

Montanari. 2012. “Graphical Models Concepts in Compressed Sensing.” Compressed Sensing: Theory and Applications.

Montanari, and Sen. 2022. “A Short Tutorial on Mean-Field Spin Glass Techniques for Non-Physicists.”

Murphy. 2012. Machine learning: a probabilistic perspective. Adaptive computation and machine learning series.

———. 2022. Probabilistic Machine Learning: An Introduction. Adaptive Computation and Machine Learning Series.

Murphy, Weiss, and Jordan. 1999. “Loopy Belief Propagation for Approximate Inference: An Empirical Study.” In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence. UAI’99.

Naesseth, Lindsten, and Schön. 2014. “Sequential Monte Carlo for Graphical Models.” In Advances in Neural Information Processing Systems.

Nguyen, and Bonilla. 2014. “Automated Variational Inference for Gaussian Process Models.” In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1. NIPS’14.

Noorshams, and Wainwright. 2013. “Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees.” The Journal of Machine Learning Research.

Ortiz, Evans, and Davison. 2021. “A Visual Introduction to Gaussian Belief Propagation.” arXiv:2107.02308 [Cs].

Parr, Markovic, Kiebel, et al. 2019. “Neuronal Message Passing Using Mean-Field, Bethe, and Marginal Approximations.” Scientific Reports.

Pearl. 1982. “Reverend Bayes on Inference Engines: A Distributed Hierarchical Approach.” In Proceedings of the Second AAAI Conference on Artificial Intelligence. AAAI’82.

———. 1986. “Fusion, Propagation, and Structuring in Belief Networks.” Artificial Intelligence.

———. 2008. Probabilistic reasoning in intelligent systems: networks of plausible inference. The Morgan Kaufmann series in representation and reasoning.

———. 2009. Causality: Models, Reasoning and Inference.

Peleg, Eldar, and Elad. 2010. “Exploiting Statistical Dependencies in Sparse Representations for Signal Recovery.” IEEE Transactions on Signal Processing.

Rajaei, Gigan, Krzakala, et al. 2017. “Robust Phase Retrieval with the Swept Approximate Message Passing (prSAMP) Algorithm.” Image Processing On Line.

Riegler, Kirkelund, Manchón, et al. 2012. “Merging Belief Propagation and the Mean Field Approximation: A Free Energy Approach.” arXiv:1112.0467 [Cs, Math, Stat].

Roychowdhury, and Kulis. 2015. “Gamma Processes, Stick-Breaking, and Variational Inference.” In Artificial Intelligence and Statistics.

Satorras, and Welling. 2021. “Neural Enhanced Belief Propagation on Factor Graphs.” In.

Schniter, and Rangan. 2012. “Compressive Phase Retrieval via Generalized Approximate Message Passing.” In 2012 50th Annual Allerton Conference on Communication, Control, and Computing (Allerton).

Smith, and Eisner. 2008. “Dependency Parsing by Belief Propagation.” In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Song, Gretton, Bickson, et al. 2011. “Kernel Belief Propagation.” In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics.

Spirtes, Glymour, and Scheines. 2001. Causation, Prediction, and Search. Adaptive Computation and Machine Learning.

Studený. 2005. Probabilistic Conditional Independence Structures. Information Science and Statistics.

Sutton, and Minka. 2006. “Local Training and Belief Propagation.”

van de Laar, Cox, Senoz, et al. 2018. “ForneyLab: A Toolbox for Biologically Plausible Free Energy Minimization in Dynamic Neural Models.” In Conference on Complex Systems.

Vehtari, Gelman, Sivula, et al. 2019. “Expectation Propagation as a Way of Life: A Framework for Bayesian Inference on Partitioned Data.” arXiv:1412.4869 [Stat].

Wainwright, Martin, and Jordan. 2005. “A Variational Principle for Graphical Models.” In New Directions in Statistical Signal Processing.

Wainwright, Martin J., and Jordan. 2008. Graphical Models, Exponential Families, and Variational Inference. Foundations and Trends® in Machine Learning.

Wand. 2017. “Fast Approximate Inference for Arbitrarily Large Semiparametric Regression Models via Message Passing.” Journal of the American Statistical Association.

Wang, Xiangyu, and Dunson. 2013. “Parallelizing MCMC via Weierstrass Sampler.”

Wang, Dilin, Zeng, and Liu. 2018. “Stein Variational Message Passing for Continuous Graphical Models.”

Weiss, and Freeman. 2001. “Correctness of Belief Propagation in Gaussian Graphical Models of Arbitrary Topology.” Neural Computation.

Welling, Minka, and Teh. 2012. “Structured Region Graphs: Morphing EP into GBP.” In Uncertainty in Artificial Intelligence.

Wiegerinck, and Heskes. 2002. “Fractional Belief Propagation.” In Advances in Neural Information Processing Systems.

Wingate, and Weber. 2013. “Automated Variational Inference in Probabilistic Programming.” arXiv:1301.1299 [Cs, Stat].

Winn, John M. 2004. “Variational Message Passing and Its Applications.”

Winn, John M., and Bishop. 2005. “Variational Message Passing.” In Journal of Machine Learning Research.

Xing, Jordan, and Russell. 2003. “A Generalized Mean Field Algorithm for Variational Inference in Exponential Families.” In Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence. UAI’03.

Yedidia, Jonathan S., Freeman, and Weiss. 2000. “Generalized Belief Propagation.” In Proceedings of the 13th International Conference on Neural Information Processing Systems. NIPS’00.

Yedidia, J.S., Freeman, and Weiss. 2003. “Understanding Belief Propagation and Its Generalizations.” In Exploring Artificial Intelligence in the New Millennium.

———. 2005. “Constructing Free-Energy Approximations and Generalized Belief Propagation Algorithms.” IEEE Transactions on Information Theory.

Yoshida, and West. 2010. “Bayesian Learning in Sparse Graphical Factor Models via Variational Mean-Field Annealing.” Journal of Machine Learning Research.

Yuille. 2011. “Loopy Belief Propagation, Mean Field Theory and Bethe Approximations.” In Markov Random Fields for Vision and Image Processing.

Zhang, Schwing, and Urtasun. 2014. “Message Passing Inference for Large Scale Graphical Models with High Order Potentials.” In Advances in Neural Information Processing Systems.

Zhou, Hai-Jun. 2017. “Message Passing for Graphical Models.”

Zhou, Jiankui, and Qiu. 2023. “Augmented Message Passing Stein Variational Gradient Descent.”

Zhuo, Liu, Shi, et al. 2018. “Message Passing Stein Variational Gradient Descent.” In Proceedings of the 35th International Conference on Machine Learning.

Footnotes

or some function of that posterior density, but we don’t gain anything by doing that in the current context. If you want to calculate, e.g. the mode, consider looking up “max-product” inference.↩︎
I suspect that is a fairly deep question.↩︎