# Message-passing algorithms in graphical models

## Cleaving reality at the joint, then summing it at the marginal

### Assumed audience:

People who want to learn the tools of modern approximate graphical inference with the minimum of diversions See also graph computations for a more general sense of Message Passing, and graph NNs for the same idea but with more neural dust sprinkled on it.

Variational message passing refers to inference where the probability density factorizes over some graphical independence structure, which means we get cheap and distributed inference, for some values of cheap and distributed. I am currently particularly interested in this for latent GP models where it generates Gaussian belief propagation, and also the union of message passing with variational inference.

I find the manner in which variational message passing is usually introduced to be confusing. It detours via a lot of ideas that turn out to be impractical to use in inference problems that I care about. In this notebook I am trying to shuffle about the graphical models curriculum until I find a minimum viable set of concepts to explain to myself what I actually need to do in my own approximate variational inference practice when I fire up modern probabilistic programming tools. This document is live and subject to revisions and definitely includes errors I have introduced recently in addition to the earlier errors I have not yet found.

For motivation as to why we would try to solve statistical questions see belief propagation, which is the original setting for message passing, where it turns out there are some nice relations between graphs and integrals of factorised functions. It is worth learning once. Exact Belief Propagation cannot be implemented except in special cases, so you can likely forget it soon after. The key take-home is that some integrals can be solved, at least conceptually, by passing messages around on a graph corresponding to terms in a a graph. It is, IMO, worth not learning too much about because traditionally there are lots of other side-quests they send you at this point, all of which are useless (Junction trees? No thank you).

There are two common ways to build up message passing ideas; one is to start from belief propagation and try to generalise it in various ways, worrying at each step how valid the extension is. Alternatively, we can present a specific VMP algorithm in its own right, which is pretty simple and always does something, even though sometimes it might produce nonsense. This approach is what of a lot of application papers actually do. Later we can wonder under what circumstances it might do marginal inference, and what it might approximate when it does not. The latter approach is my preference for now, since it leads to hands-on intuition-building with easy implementation, and defers the (to my mind) horrible, grinding stage of theorem-proving to later on when we are already solving useful problems and the pay-off is better.

Advice from Tom P. Minka (2005) tells us we have too many choices here:

The recipe to make a message-passing algorithm has four steps:

• Pick an approximating family for q to be chosen from. For example, the set of fully-factorized distributions, the set of Gaussians, the set of k-component mixtures, etc.
• Pick a divergence measure to minimize. For example, mean-field methods minimize the Kullback-Leibler divergence $$KL(q \| p)$$, expectation propagation minimizes $$KL(p \| q)$$, and power EP minimizes α-divergence, $$D\alpha(p \| q)$$.
• Construct an optimization algorithm for the chosen divergence measure and approximating family. Usually this is a fixed-point iteration obtained by setting the gradients to zero.
• Distribute the optimization across the network, by dividing the network p into factors, and minimizing local divergence at each factor.

Take home message: there are a lot of pluggable parts in this family of algorithms. This can be confusing. I think it is easier to start from a concrete set of assumptions, derive a specific (ideally maximally simple) algorithm and then generalise outwards to more fiddly/complicated ones. So here I will mention a few pluggable choices, but I am focussing on some specific choices. I think the landscape of possible message-passing algorithms looks something like this. I am working on the assumption that starting from the red star is the easiest thing to do.

## It is conceptually easier to seek forgiveness for bad approximate inference than permission for exact inference

A classic way to introduce message-passing inference is to spend time wondering when we are permitted do exact inference with local calculations (“exact belief propagation”), and then introduce ever more caveats and complications.

After this complicated machinery breaks down we give up, throw most of it out, and introduce a simpler approximation, i.e. variational message-passing (VMP), which can be applied to for more diverse distributions and graph structures. There is slightly more complex set up in terms of distributions (what does it mean to variationally approximate a distribution?) but it the machinery there is something we can use over and over again and is simpler for me at least then complicated graph algorithms.

Although VMP might not actually work, when it does, it is simple. And in many special cases the approximation and the exact version end up being the same thing, or at least very close. Certainly, comprehension-wise we should start from VMP.

NB Wand (2017) asserts that there is a difference between Mean Field Variational Bayes and Variational Message Passing. I have no idea what it is; I thought MFVB was a special case of VMP. So, take everything I say with a grain of salt because I am clearly missing at least one distinction of note.

## Which graph?

We can pass messages on factor graphs or on DAGs or what-have-you. It comes out simpler on factor graphs (although John M. Winn and Bishop (2005) does a good job of being clear about the structure of the method on DAGs by analogy with expectation maximisation, which produces some lovely symmetries). The intuition here is that the factor graph then corresponds exactly to the variational approximation.

## Which divergence?

Kullback-Leibler divergence. In combination with the other assumptions we have introduced, this makes everything come out nice. This is the confusingly-named “mean-field” assumption. I assume the name comes from physicists; it has that air about it.

Hold that thought.

## Which random variates?

We will assume exponential family RVs, or specifically, that each factorized term in a factor is an exponential condition on the incoming message. Otherwise stuff is weird.

## KL variational message-passing for exponential families on factor graphs

Putting this together, I assert that a reasonable choice for our basic starting algorithm is Variational message passing on factor graphs with respect to KL divergence . I will also assume that all the graphs are tree graphs, which is to say, there are no loops. Later on, it will turn out that some things break if we have loops, but generally not very badly and the badness can be quantified and controlled using various tools so we can often not worry about that.

TODO: compare to an exact algorithm like Gaussian BP.

At first glance these together look like insane restrictions, but we can make it work. We can take a short leap of faith where we seem to being doing arbitrary or weird things, then reassure ourselves that we are solving something like a real inference problem.

Let us now look at the factor graph from Wand (2017). A factor graph. The factors are represented by square nodes, and the random variates or stochastic nodes by circular ones.

Consider a Bayesian statistical model with observed data $$\boldsymbol{D}$$ and parameter vector $$\boldsymbol{\theta}$$. In a factor graph we can write the joint density of these random nodes as $$p(\boldsymbol{\theta}, \boldsymbol{D})=\prod_{j=1}^{N} f_{j}\left(\boldsymbol{\theta}_{\text {neighbors }(j)}\right) .$$ Our goal in inference is to calculate the the posterior density function $$p(\boldsymbol{\theta} \mid \boldsymbol{D}).$$1

We do that with updates that look like this: Exact belief propagation updates steps from Ortiz et al’s interactive GBP demo.

In the exact case Ortiz et al’s interpretation of these messages in the belief propagation setting makes sense:

Belief Update: The variable node beliefs are updated by taking a product of the incoming messages from all adjacent factors, each of which represents that factor’s belief on the receiving node’s variables.

Factor-to-variable message: To send a message to an adjacent variable node, a factor aggregates messages from all other adjacent variable nodes and marginalizes over all the other nodes’variables to produce a message that expresses the factor’s belief over the receiving node’s variables.

Variable-to-factor message: A variable-to-factor message tells the factor what the belief of the variable would be if the receiving factor node did not exist. This is computed by taking the product of the messages the variable node has received from all other factor nodes.

We are working in an approximated setting, though, so this is more of a heuristic.

A fully-factorized approximation to the posterior density function $$p(\boldsymbol{\theta} \mid \boldsymbol{D})$$ is $p(\boldsymbol{\theta} \mid \boldsymbol{D}) \approx q^{*}(\boldsymbol{\theta})$ where $$q^{*}(\boldsymbol{\theta})$$ is the minimizer of the Kullback-Leibler divergence $$\int q(\boldsymbol{\theta}) \log \left\{\frac{q(\boldsymbol{\theta})}{p(\boldsymbol{\theta} \mid \boldsymbol{D})}\right\} d \boldsymbol{\theta}$$. We suppress the dependence on data here because we are holding it fixed when we conduct a posterior update.

For each $$1 \leq i \leq M$$ and $$1 \leq j \leq N$$, the VMP stochastic node to factor message updates are $m_{\boldsymbol{\theta}_{i} \rightarrow f_{j}}\left(\boldsymbol{\theta}_{i}\right) \longleftarrow \propto \prod_{j^{\prime} \neq j: i \in \text { neighbors }\left(j^{\prime}\right)} m_{f_{j^{\prime}} \rightarrow \boldsymbol{\theta}_{i}}\left(\boldsymbol{\theta}_{i}\right)$ and the factor to stochastic node message updates are $m_{f_{j} \rightarrow \boldsymbol{\theta}_{i}}\left(\boldsymbol{\theta}_{i}\right) \longleftarrow \propto \exp \left[E_{f_{j} \rightarrow \boldsymbol{\theta}_{i}}\left\{\log f_{j}\left(\boldsymbol{\theta}_{\text {neighbors }(j)}\right)\right\}\right]$ where $$E_{f_{j} \rightarrow \boldsymbol{\theta}_{i}}$$ denotes expectation with respect to the density function $\frac{\prod_{i^{\prime} \in \text { neighbors }(j) \backslash\{i\}} m_{f_{j} \rightarrow \boldsymbol{\theta}_{i^{\prime}}}\left(\boldsymbol{\theta}_{i^{\prime}}\right) m_{\boldsymbol{\theta}_{i^{\prime}} \rightarrow f_{j}}\left(\boldsymbol{\theta}_{i^{\prime}}\right)}{\prod_{i^{\prime} \in \text { neighbors }(j) \backslash\{i\}} \int m_{f_{j} \rightarrow \boldsymbol{\theta}_{i^{\prime}}}\left(\boldsymbol{\theta}_{i^{\prime}}\right) m_{\boldsymbol{\theta}_{i^{\prime}} \rightarrow f_{j}}\left(\boldsymbol{\theta}_{i^{\prime}}\right) d \boldsymbol{\theta}_{i^{\prime}}}$ That looks like we have made a ghastly mistake, because this latter message is horrible in general. it does simplify somewhat for conditionally-exponential families where we can exploit conditional conjugacy.

After passing messages around all the nodes in the graph until it seems to have converged, the Kullback-Leibler optimal $$q$$-densities are obtained via $q^{*}\left(\boldsymbol{\theta}_{i}\right) \propto \prod_{j: i \in \text { neighbors }(j)} m_{f_{j} \rightarrow \boldsymbol{\theta}_{i}}\left(\boldsymbol{\theta}_{i}\right).$

### What just happened?

I found several things confusing about these algorithms when first I heard of them.

Firstly, what is the message actually? It is a probability distribution, or, more generally a measure, possibly over only some of the axes. This can be obscured by the presentation in some texts, IMO, because of a couple of pedagogic choices seem to be common but determined more by historical inertia rather than ease of learning.

The fact that we can represent some target distribution, or more often some approximation to the probability distribution, as a finite-dimensional vector, is because some distributions can be represented by finite-dimensional parameter vectors (e.g. I can pack the means and variances of a multivariate Gaussian into a vector).

But what do we actually do with these message? This turns out to be subtle, tl;dr: it is a combination of multiplying and re-normalising other distributions, and sometimes optimising with respect to those. Longer story: There are a several things that we might hope we could do in a general case.

What ends up being feasible most often is to introduce that lucky combination of assumptions (KL-divergence, fully-factorized, exponential-family) which causes the factors to decompose nicely such that the messages can be reasonably be considered over each edge separately. (and if that is infeasible there are other things that may be done, such as using semiparametric models which are still fully-factored and exponential, or breaking variables out into observations and auxiliary variables and possibly other things… so this idea is not as terrible as it first seemed to me).

Also, we often see the update rules for multiple i.i.d. exponential family observations, which is a special case, or for a single observation, which is another kind of special case. These are important cases, but they treated as if their relationship to the methodology at its most general was obvious. IMO it is not obvious, but the fact that it comes out so nicely in the special cases means that I feel like an idiot for not “getting” what problem I am solving generally.

tl;dr The magic here is that update rules for (conditionally) exponential variates under a mean field assumption and KL Divergence are surprisingly simple in this application.

## Fully-factored assumption

Here we assume that the density at each factor is well-approximated by a product of independent terms, one for each edge in the factor graph. (TODO: example equation).

Anyway, it not always a terrible idea to make this assumptions, but nor is it the only way to do things, at least conceptually. In practice, it is much more effort if we cannot make this assumption.

Nonetheless, papers that do not make it explicit are very confusing. Why would we ever expect inference to look anything like that?2

AFAICT this is a problem of history. Not all variational inference must make the mean-field, or the fully-factored assumptions etc, but methods that relax these restrictions start to look different and need rather different technology to go, such as local Monte Carlo estimation. For a long while this set of assumptions were the main game in town, (or even more onerous assumptions, like assuming everything is discrete and treating inference exactly) so tutorials of a certain vintage take mean-field variational inference as a synonym for variational inference. If I have just learnt some non-mean-field SVI methods from a recent NeurIPS paper before I encounter VMP, I might well be confused by this.

## Conditional conjugacy/ exponential families

Under construction.

TODO: note that “natural parameter” form is extra nice here because it summarises the “right” updates.

Attias (1999);Ghahramani and Beal (2001);WinnVariational2005

Dustin Tran and David Blei mention the importance of the conditional conjugacy relations when we are working in (conditionally) exponential families.

Justin Domke summarises exponential families for graphical model usage.

TBD

## Loopiness

OK, we have assumed tree graphs for everything thus far, but this is an onerous assumption. We can allow loopy graphs, which often works, sort-of. This choice does introduce extra approximations into the solution, but also allows us to handle more plausible systems within the same setup.

TBD

## Which free energy?

Helmholtz Free Energy is the classic flavour. On graphs we are concerned with a related, more general (?) free energy called Bethe free energy [Jonathan S. Yedidia, Freeman, and Weiss (2005); Jonathan S. Yedidia, Freeman, and Weiss (2000); WainwrightGraphical2008].

TBD

## In MCMC sampling

See Huang and Gelman (2005); Vehtari et al. (2019); Wang and Dunson (2013). ## Semiparametric

This is the main innovation I think, in the synthesis of Wand (2017).

## Without Mean-field assumptions

What do relaxations of the mean-field assumption look like, and do they ever arise in message-passing models? There is structured mean-field inference, below, but I think that often we throw out any hope of message passing methods when we relax the mean-field assumption. Variational autoencoder do not use this assumption in general, but they also do not use message passing inference. Re-imposing a graphical structure on VAE models is gruelling and AFAIK only done on a case-by-case basis. Perhaps there are some neural methods that do something else?

TBD.

## Loopy belief propagation From Murphy, Weiss, and Jordan (1999) and analysed in Weiss and Freeman (2001); M. J. Wainwright and Jordan (2008); Jonathan S. Yedidia, Freeman, and Weiss (2005). One day I might be interested in Bethe free energy too?

## Structured mean field

Directly generalises mean field . I am not sure how useful this is in practice.

## Phase transitions

Thomas Orton’s overview lecture connects message-passing with statistical mechanics of statistics.

Last week, we saw how certain computational problems like 3SAT exhibit a thresholding behavior, similar to a phase transition in a physical system. In this post, we’ll continue to look at this phenomenon by exploring a heuristic method, belief propagation (and the cavity method), which has been used to make hardness conjectures, and also has thresholding properties. In particular, we’ll start by looking at belief propagation for approximate inference on sparse graphs as a purely computational problem. After doing this, we’ll switch perspectives and see belief propagation motivated in terms of Gibbs free energy minimization for physical systems.

More possibly in Barbier (2015). TBD. See .

## Automatic differentiation as message-passing

TBD

A good introduction is M. J. Wainwright and Jordan (2008), which does some deep stuff with setting up graphical models in exponential family context. Bishop (2006) provides extra application context and a different theoretical development. Dauwels (2007) explains why the factor graph setting is the simplest way to develop the message passing algorithms. I agree with him about this. It is simplest to learn message-passing inference for factor graphs and then translate to and from classic DAGs and MRFs as needed.

Dustin Tran and David Blei advocate for Wand (2017) as a kind of modern milestone in this field. Wingate and Weber (2013) connects us to modern stochastic VI.

Next, see the recommended texts about general graphical model theory in the graphical models notebook.

From a signal processing/coding perspective H.-A. Loeliger (2004) is interesting. Read this one if you know what a turbo code is.

Other overviews I have found useful for various reasons are Tom P. Minka (2005),J. S. Yedidia, Freeman, and Weiss (2003), Sutton and Minka (2006) and Cox, van de Laar, and de Vries (2019).

Recommended: Wee Sun Lee’s NeurIPS lecture about how to integrate neural function approximation into distributed inference: Message Passing In Machine Learning.

## Tools

• GAMP.

• ForneyLab looks especially useful for me:

The message passing paradigm offers a convenient method for leveraging model-specific structures, while remaining generally applicable. Message passing can be conveniently formulated on a Forney-style factor graph (FFG) representation of the model. Inference tasks on the model can then be decomposed in local computations, represented by messages that flow across the graph. This locality allows for storing pre-computed message updates in a look-up table that can be re-used across models. Automated algorithm construction then amounts to scheduling these messages in the order required by the inference task (see also this conference paper at JuliaCon).

ForneyLab (GitHub) is introduced in Cox, van de Laar, and de Vries (2019) as a novel Julia package that allows the user to specify a probabilistic model as an FFG and pose inference problems on this FFG. In return, ForneyLab automatically constructs a Julia program that executes a message passing-based (approximate) inference procedure. ForneyLab is designed with a focus on flexibility, extensibility and applicability to biologically plausible models for perception and decision making, such as the hierarchical Gaussian filter (HGF). With ForneyLab, the search for better models for perception and action can be accelerated

This idea has been baked into various other probabilistic programming frameworks by now, in that many of them seem to infer a reasonable message-passing structure for models.

1. or some function of that posterior density, but we don’t gain anything by doing that in the current context. If you want to calculate, e.g. the mode, consider looking up “max-product” inference.↩︎

2. I suspect that is a fairly deep question.↩︎

### No comments yet. Why not leave one?

GitHub-flavored Markdown & a sane subset of HTML is supported.