# Belief propagation and related algorithms

November 25, 2014 — May 17, 2023

algebra
approximation
Bayes
distributed
dynamical systems
Gaussian
generative
graphical models
machine learning
networks
optimization
probability
signal processing
state space models
statistics
stochastic processes
swarm
time series

### Assumed audience:

People who know basic probability but have not applied it to graphs

$\renewcommand{\var}{\operatorname{Var}} \renewcommand{\corr}{\operatorname{Corr}} \renewcommand{\dd}{\mathrm{d}} \renewcommand{\bb}[1]{\mathbb{#1}} \renewcommand{\vv}[1]{\boldsymbol{#1}} \renewcommand{\rv}[1]{\mathsf{#1}} \renewcommand{\vrv}[1]{\vv{\rv{#1}}} \renewcommand{\disteq}{\stackrel{d}{=}} \renewcommand{\dif}{\backslash} \renewcommand{\gvn}{\mid} \renewcommand{\Ex}{\mathbb{E}} \renewcommand{\Pr}{\mathbb{P}}$

The basic inspiration of message-passing inference which turns out to be not always implementable, but gives us some basic inspiration. A concrete, implementable version of use to me is Gaussian Belief Propagation. See also graph computations for a more general sense of Message Passing, and graph NNs for the same idea but with more neural dust sprinkled on it.

## 1 History

The grandparent of Belief Propagation is Pearl (1982) for DAGs and then generalised to various other graphical models later. This definition subsumes such diverse methods as the Viterbi and Baum-Welch algorithms and state filters.

## 2 Let’s go

Bishop (2006)’s notation is increasingly standard. We can set this up a few ways but a nice general one is factor graphs. Below is a factor graph; if it makes no sense maybe dash over to that page and read about ’em. We assume that the factor graphs are trees, which means that the variables are all connected, without any loops.

If we are dealing with continuous RVs this factor graph corresponds to a density function which factorises $p(\vv{x})=f_a(x_1)f_b(x_1,x_2)f_d(x_2)f_c(x_2,x_3)f_g(x_3,x_4,x_5)f_k(x_5).$ For reasons of tradition we often assume the variables are discrete at this point and work with sums. I find densities more intuitive.

For the moment we use the mnemonic that $$p$$s are densities and thus normalised, whereas $$f$$s are factors, which are not necessarily normalised (but are non-negative and generate finite Lebesgue measures over their arguments, so they can be normalized.)

OK, what can we do with this kind of factorisation? I follow Davison and Ortiz (2019)’s explanation, which is the shortest one I can find that is still comprehensible.

Suppose we want to find the marginal density at some node $p(x_n)=\int p(\vv{x})\dd(\vv{x}\dif x).$ Read $$\dd (\vv{x}\dif x)$$ as “an integral over all axes except $$x$$”, e.g. $$\dd (\vv{x}\dif x_1)=\dd x_2 \dd x_3 \dd x_4\dots$$

We can write the density at $$x$$ in terms of a product over the density at each of the factors connected to it. $p(\vv{x})=\prod_{s\in n(s)}F_s(x,\vv{x}_s)$ Here $$n(x)$$ is the set of factor nodes that are neighbours of $$x$$. $$F_{s}$$ is the product of all factors in the group associated with $$f_{s}.$$ $$\vv{x}_{s}$$ is the vector of all variables in the subtree connected to $$x$$ via $$f_{s}$$. Combining these, \begin{aligned} p(x) &=\int \left[\prod_{s \in n(x)} F_{s}\left(x, \vv{x}_{s}\right)\right] \dd(\vv{x}\dif x)\\ &=\prod_{s \in n(x)}\left[\int F_{s}\left(x, \vv{x}_{s}\right)\dd(\vv{x}_s)\right]. \end{aligned} That second step, reordering the integral and product, is pretty much our whole justification. How can we express that in terms of operations on a factor graph?

A clue is: These inner integrals $$\int F_{s}\left(x, \vv{x}_{s}\right)\dd(\vv{x}_s)$$ are marginal densities over an individual $$x,$$ and we have removed the dependence on the other arguments in this marginalisation; Looking ‘through’ each integral we see only one argument, so we know that those other arguments are in some sense irrelevant from where we sit with regard to calculating this marginal. Next we represent this decomposition of integrals and products into messages between axes that correspond to marginalising out dependencies, and thus removing variables from our consideration from our current position.

In doing so we assume that these marginal integrals have convenient and analytic solutions, which is great for proving interesting theories. In practice… If that assumption looks dangerous, that is because it should. These integrals cannot, in general, be so solved, and later on we make all kinds of approximations to make this go.1

Anyway, let us ignore tedious calculus problems for now and assume that works. Describing a recipe which actually does this message passing looks confusing because it involves some notation challenges, but it is not awful if we dedicate some time to staring and swearing at it.

There are two types of message needed: factor to variable messages, which will be multiplied to give the marginal distribution of that variable, and variable to factor messages, which are harder to explain punchily but eventually make sense if we stare long enough.

Belief Update: The variable node beliefs are updated by taking a product of the incoming messages from all adjacent factors, each of which represents that factor’s belief on the receiving node’s variables.

Factor-to-variable message: To send a message to an adjacent variable node, a factor aggregates messages from all other adjacent variable nodes and marginalizes over all the other nodes’variables to produce a message that expresses the factor’s belief over the receiving node’s variables.

Variable-to-factor message: A variable-to-factor message tells the factor what the belief of the variable would be if the receiving factor node did not exist. This is computed by taking the product of the messages the variable node has received from all other factor nodes.

The factor to variable message looks like this: $\mu_{f_{s} \rightarrow x}(x)=\int F_{s}\left(x, \vv{x}_{s}\right)\dd(\vv{x}_s).$ This term $$\mu_{f_{s} \rightarrow x}(x)$$ we think of as a message from factor $$f_{s}$$ to variable $$x$$ The message has the form of a function over variable $$x$$ only, which corresponds to marginalised probability over $$x$$ as the result of considering all factors in one branch of the tree. Thus, $p(x) = \prod_{s\in n(x)}\mu_{f_{s} \rightarrow x}(x).$ That is the easy bit. Next is the slightly weirder variable-to-factor message. \begin{aligned} \mu_{x_{m} \rightarrow f_{s}}\left(x_{m}\right) &=\int \prod_{l \in n\left(x_{m}\right) \backslash f_{s}} F_{l}\left(x_{m}, \vv{x}_{m_{l}}\right) \dd \vv{x}_{sm}\\ &=\prod_{l \in n\left(x_{m}\right) \backslash f_{s}} \int F_{l}\left(x_{m}, \vv{x}_{m_{l}}\right) \dd \vv{x}_{m_{l}}\\ &=\prod_{l \in n\left(x_{m}\right) \backslash f_{s}} \mu_{f_{l} \rightarrow x_{m}}\left(x_{m}\right). \end{aligned}

TBC.

We can invent a lot of extra machinery here to do belief propagation in ever more sophisticated ways as regards the message-passing schedule and marginals. A classic example is the junction tree, which is a particular non-unique alternative decomposition of the graphical models in a kind of clustered graph. AFAICT most of these extended belief-y algorithms are interesting but complicated, hard to implement in general, and have been useless to me thus far for practical purposes. Exceptions:

1. Quantifying independence with these tools leads us to causal inference, which has indeed been useful for me.
2. Gaussian Belief Propagation turns out to be what I am writing a paper on right now.

For purely discrete RVs it is OK, but that is a setting I almost never care about, except for generating neat graphs for causal inference thought experiments.

Having expressed all those negative sentiments about practical application of exact belief passing, nonetheless it is pedagogically useful to learn at least this much, because it provides a heuristic motivation and structure for variational message passing, which does approximately work, approximately generally.

For the dedicated there are many treatises on this topic in the literature, going deep into the most tedious recesses of the subject and they make virtuous and improving class exercises, but offer diminishing returns in understanding. I would to prefer texts which build up this machinery towards modern application or for profound reinterpretation if I wanted to put this into a wider perspective.

## 5 References

Bishop. 2006. Pattern Recognition and Machine Learning. Information Science and Statistics.
Buntine. 1994. Journal of Artificial Intelligence Research.
Davison, and Ortiz. 2019. arXiv:1910.14139 [Cs].
Erdogdu, Deshpande, and Montanari. 2017. arXiv:1709.06525 [Cs, Stat].
Huang, and Jojic. 2010. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics.
Kirkley, Cantwell, and Newman. 2021. Science Advances.
Koller, and Friedman. 2009. Probabilistic Graphical Models : Principles and Techniques.
Liu, and Ihler. 2011. In Proceedings of the 28th International Conference on International Conference on Machine Learning. ICML’11.
Noorshams, and Wainwright. 2013. “Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees.” The Journal of Machine Learning Research.
Ortiz, Evans, and Davison. 2021. arXiv:2107.02308 [Cs].
Pearl. 1982. In Proceedings of the Second AAAI Conference on Artificial Intelligence. AAAI’82.
———. 1986. Artificial Intelligence.
Song, Gretton, Bickson, et al. 2011. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics.
Wiegerinck, and Heskes. 2002. In Advances in Neural Information Processing Systems.
Zhou, Dedieu, Kumar, et al. 2023.

## Footnotes

1. Question: does anyone ever use a large numerical solver for this? What breaks in that case, apart from the messages growing embarrassingly large?↩︎