Belief propagation and related algorithms

2014-11-25 — 2023-05-17

Suspiciously similar content

Assumed audience:

People who know basic probability but have not applied it to graphs

Figure 1: Belief propagation works incredibly well if I have configured all my assumptions *juuuuuuust* so.

The basic inspiration of message-passing inference turns out to be not always implementable but gives us some basic inspiration. A concrete, implementable version useful to me is Gaussian Belief Propagation. See also graph computations for a more general sense of Message Passing, and graph NNs for the same idea but with more neural dust sprinkled on it.

1 History

The grandparent of Belief Propagation is Pearl (1982) for DAGs and then generalised to various other graphical models later. This definition subsumes such diverse methods as the Viterbi and Baum-Welch algorithms and state filters.

2 Let’s go

Bishop (2006)’s notation is increasingly standard. We can set this up a few ways but a nice general one is factor graphs. Below is a factor graph; if it makes no sense maybe dash over to that page and read about ’em. We assume that the factor graphs are trees, which means that the variables are all connected, without any loops.

Figure 2: A tree factor graph. The *factors* are represented by square nodes, and the random variates or *stochastic nodes* by circular ones.

If we are dealing with continuous RVs this factor graph corresponds to a density function which factorises $p (x) = f_{a} (x_{1}) f_{b} (x_{1}, x_{2}) f_{d} (x_{2}) f_{c} (x_{2}, x_{3}) f_{g} (x_{3}, x_{4}, x_{5}) f_{k} (x_{5}) .$ For reasons of tradition, we often assume the variables are discrete at this point and work with sums. I find densities more intuitive.

For the moment we use the mnemonic that $p$ s are densities and thus normalised, whereas $f$ s are factors, which are not necessarily normalised (but are non-negative and generate finite Lebesgue measures over their arguments, so they can be normalised.)

OK, what can we do with this kind of factorization? I follow Davison and Ortiz (2019)’s explanation, which is the shortest one I can find that is still comprehensible.

Suppose we want to find the marginal density at some node $p (x_{n}) = \int p (x) d (x ∖ x) .$ Read $d (x ∖ x)$ as “an integral over all axes except $x$ ,” e.g. $(x_1)=x_2 x_3 x_4”

We can write the density at $x$ in terms of a product over the density at each of the factors connected to it. $p (x) = \prod_{s \in n (s)} F_{s} (x, x_{s})$ Here $n (x)$ is the set of factor nodes that are neighbours of $x$ . $F_{s}$ is the product of all factors in the group associated with $f_{s} .$ $x_{s}$ is the vector of all variables in the subtree connected to $x$ via $f_{s}$ . Combining these, $\begin{aligned} p (x) & = \int [\prod_{s \in n (x)} F_{s} (x, x_{s})] d (x ∖ x) \\ = \prod_{s \in n (x)} [\int F_{s} (x, x_{s}) d (x_{s})] . \end{aligned}$ That second step, reordering the integral and product, is pretty much our whole justification. How can we express that in terms of operations on a factor graph?

A clue is: These inner integrals $\int F_{s} (x, x_{s}) d (x_{s})$ are marginal densities over an individual $x,$ and we have removed the dependence on the other arguments in this marginalization; Looking ‘through’ each integral we see only one argument, so we know that those other arguments are in some sense irrelevant from where we sit with regard to calculating this marginal. Next, we represent this decomposition of integrals and products into messages between axes that correspond to marginalizing out dependencies, and thus removing variables from our consideration from our current position.

In doing so we assume that these marginal integrals have convenient and analytic solutions, which is great for proving interesting theories. In practice… If that assumption looks dangerous, that is because it should. These integrals cannot, in general, be so solved, and later on we make all kinds of approximations to make this go.¹

Anyway, let us ignore tedious calculus problems for now and assume that works. Describing a recipe which actually does this message passing looks confusing because it involves some notation challenges, but it is not awful if we dedicate some time to staring and swearing at it.

There are two types of messages needed: factor to variable messages, which will be multiplied to give the marginal distribution of that variable, and variable to factor messages, which are harder to explain punchily but eventually make sense if we stare long enough.

Ortiz et al explain it thusly:

Belief Update: The variable node beliefs are updated by taking a product of the incoming messages from all adjacent factors, each of which represents that factor’s belief on the receiving node’s variables.

Factor-to-variable message: To send a message to an adjacent variable node, a factor aggregates messages from all other adjacent variable nodes and marginalises over all the other nodes’ variables to produce a message that expresses the factor’s belief over the receiving node’s variables.

Variable-to-factor message: A variable-to-factor message tells the factor what the belief of the variable would be if the receiving factor node did not exist. This is computed by taking the product of the messages the variable node has received from all other factor nodes.

Figure 3: Exact belief propagation updates steps from Ortiz et al’s interactive GBP demo.

The factor to variable message looks like this: $μ_{f_{s} \to x} (x) = \int F_{s} (x, x_{s}) d (x_{s}) .$ This term $μ_{f_{s} \to x} (x)$ we think of as a message from factor $f_{s}$ to variable $x$ The message has the form of a function over variable $x$ only, which corresponds to marginalised probability over $x$ as the result of considering all factors in one branch of the tree. Thus, $p (x) = \prod_{s \in n (x)} μ_{f_{s} \to x} (x) .$ That is the easy bit. Next is the slightly weirder variable-to-factor message. $\begin{aligned} μ_{x_{m} \to f_{s}} (x_{m}) & = \int \prod_{l \in n (x_{m}) ∖ f_{s}} F_{l} (x_{m}, x_{m_{l}}) d x_{s m} \\ = \prod_{l \in n (x_{m}) ∖ f_{s}} \int F_{l} (x_{m}, x_{m_{l}}) d x_{m_{l}} \\ = \prod_{l \in n (x_{m}) ∖ f_{s}} μ_{f_{l} \to x_{m}} (x_{m}) . \end{aligned}$

Figure 4: Using slightly different notation, here is Ortiz et al’s labelling of the important messages.

TBC.

3 Further reading

We can invent a lot of extra machinery here to do belief propagation in ever more sophisticated ways as regards the message-passing schedule and marginals. A classic example is the junction tree, which is a particular non-unique alternative decomposition of the graphical models in a kind of clustered graph. AFAICT most of these extended belief-y algorithms are interesting but complicated, hard to implement in general, and have been useless to me thus far for practical purposes. Exceptions:

Quantifying independence with these tools leads us to causal inference, which has indeed been useful for me.
Gaussian Belief Propagation turns out to be what I am writing a paper on right now.

For purely discrete RVs it is OK, but that is a setting I almost never care about, except for generating neat graphs for causal inference thought experiments.

Having expressed all those negative sentiments about practical application of exact belief passing, nonetheless it is pedagogically useful to learn at least this much, because it provides a heuristic motivation and structure for variational message passing, which does approximately work, approximately generally.

For the dedicated there are many treatises on this topic in the literature, going deep into the most tedious recesses of the subject and they make virtuous and improving class exercises, but offer diminishing returns in understanding. I would prefer texts which build up this machinery towards modern application (Bishop 2006) or for profound reinterpretation (Koller and Friedman 2009) if I wanted to put this into a wider perspective.

4 Roll your own

Pedagogically useful, although probably not industrial-grade, David Barber’s discrete graphical model code (Julia) can do queries over graphical models.
similar domain, more hype: deepmind/PGMax: Loopy belief propagation for factor graphs on discrete variables in JAX (G. Zhou et al. 2023)
danbar/fglib: factor graph library
The factor graph library (fglib) is a Python package to simulate message passing on factor graphs. It supports the
- sum-product algorithm (belief propagation)
- max-product algorithm
- max-sum algorithm
- mean-field algorithm - in development
with discrete and Gaussian random variables.
Pure numpy
Prasad’s example implementation of Linear Gaussian networks
Krashkov’s summary notebooks
Joe Ortiz’ GBP implementations are elegant and easy to understand. Here are two python ones:
- gbp/gbp.py (numpy)
- Gaussian Belief Propagation Library (pytorch)
Both these are quite pedagogic in code style, but not really so robust that I would do industrial work with them.
PGMPY:

pgmpy is a pure python implementation for Bayesian Networks with a focus on modularity and extensibility. Implementations of various algorithms for Structure Learning, Parameter Estimation, Approximate (Sampling Based) and Exact inference, and Causal Inference are available.

See the Belief Propagation for GBP.

5 References

Bishop. 2006. Pattern Recognition and Machine Learning. Information Science and Statistics.

Buntine. 1994. “Operations for Learning with Graphical Models.” Journal of Artificial Intelligence Research.

Davison, and Ortiz. 2019. “FutureMapping 2: Gaussian Belief Propagation for Spatial AI.” arXiv:1910.14139 [Cs].

Erdogdu, Deshpande, and Montanari. 2017. “Inference in Graphical Models via Semidefinite Programming Hierarchies.” arXiv:1709.06525 [Cs, Stat].

Huang, and Jojic. 2010. “Maximum-Likelihood Learning of Cumulative Distribution Functions on Graphs.” In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics.

Kirkley, Cantwell, and Newman. 2021. “Belief Propagation for Networks with Loops.” Science Advances.

Koller, and Friedman. 2009. Probabilistic Graphical Models : Principles and Techniques.

Liu, and Ihler. 2011. “Bounding the Partition Function Using Hölder’s Inequality.” In Proceedings of the 28th International Conference on International Conference on Machine Learning. ICML’11.

Noorshams, and Wainwright. 2013. “Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees.” The Journal of Machine Learning Research.

Ortiz, Evans, and Davison. 2021. “A Visual Introduction to Gaussian Belief Propagation.” arXiv:2107.02308 [Cs].

Pavlasek, Mah, Xu, et al. 2024. “Stein Variational Belief Propagation for Multi-Robot Coordination.”

Pearl. 1982. “Reverend Bayes on Inference Engines: A Distributed Hierarchical Approach.” In Proceedings of the Second AAAI Conference on Artificial Intelligence. AAAI’82.

———. 1986. “Fusion, Propagation, and Structuring in Belief Networks.” Artificial Intelligence.

Song, Gretton, Bickson, et al. 2011. “Kernel Belief Propagation.” In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics.

Wang, Zeng, and Liu. 2018. “Stein Variational Message Passing for Continuous Graphical Models.”

Wiegerinck, and Heskes. 2002. “Fractional Belief Propagation.” In Advances in Neural Information Processing Systems.

Zhou, Guangyao, Dedieu, Kumar, et al. 2023. “PGMax: Factor Graphs for Discrete Probabilistic Graphical Models and Loopy Belief Propagation in JAX.”

Zhou, Jiankui, and Qiu. 2023. “Augmented Message Passing Stein Variational Gradient Descent.”

Zhuo, Liu, Shi, et al. 2018. “Message Passing Stein Variational Gradient Descent.” In Proceedings of the 35th International Conference on Machine Learning.

Footnotes

Question: does anyone ever use a large numerical solver for this? What breaks in that case, apart from the messages growing embarrassingly large?↩︎