Factor graphs

December 16, 2019 — February 27, 2024

algebra
graphical models
machine learning
networks
probability
statistics
Figure 1: For as the sufferings of dimensions abound in us, so our consolation aboundeth through localisation.

Many statistical models are naturally treated as directed Bayesian networks — in causal inference, for example, I might naturally think about the generating processes in my work as a graphical model. Factor graphs decompose the model differently from the balls-and-arrows DAG. Specifically, a factor graph structures the model according to how we would implement message-passing inference, rather than how we would statistically interpret the model. The result is harder to interpret intuitively but easier to apply practically. It is also more general than a DAG: factor graphs can represent undirected graphical models too.

There are at least two different factor graph formalisms, and I will try to explain both. The first, “classic” factor graphs, are the ones that I first encountered in the literature, and the ones that I think are most commonly used. The second, “Forney-style” factor graphs, are even less intuitive and even more practical AFAICT.

Figure 2: Loeliger’s 2004 zoo of graphical models, including two flavours of factor graphs.

1 Classic factor graphs

A long history of multiple invention, probably. A recognisable modern form appears in (Frey et al. 1997; Frey 2003; Kschischang, Frey, and Loeliger 2001). This classic version is explained in lots of places, but for various reasons I needed the FFG version first, so my explanation of that one is longer.

The representation that we need to “make all the inference steps easy” ends up being a bipartite graph with two types of nodes: variables and factors. Given a factorization of a function \(g\left(X_1, X_2, \ldots, X_n\right)\), \[ g\left(X_1, X_2, \ldots, X_n\right)=\prod_{j=1}^m f_j\left(S_j\right) \] where \(S_j \subseteq\left\{X_1, X_2, \ldots, X_n\right\}\), the corresponding factor graph \(G=(X, F, E)\) consists of variable vertices \(X=\left\{X_1, X_2, \ldots, X_n\right\}\), factor vertices \(F=\left\{f_1, f_2, \ldots, f_m\right\}\), and edges \(E\). The edges depend on the factorization as follows: there is an undirected edge between factor vertex \(f_j\) and variable vertex \(X_k\) if \(X_k \in S_j\).
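
To make the bipartite structure concrete, here is a minimal sketch of a discrete factor graph in Python. Everything in it (variable names, cardinalities, factor tables) is invented for illustration; the point is only that the graph is fully determined by which variables appear in which factors.

```python
import itertools
import numpy as np

# A tiny discrete factor graph: variable vertices, factor vertices, and an
# undirected edge between factor f_j and variable X_k whenever X_k is in S_j.
variables = {"X1": 2, "X2": 2, "X3": 3}            # name -> cardinality
factors = {                                        # name -> (scope S_j, table)
    "f1": (("X1",), np.array([0.4, 0.6])),
    "f2": (("X1", "X2"), np.array([[0.9, 0.1],
                                   [0.2, 0.8]])),
    "f3": (("X2", "X3"), np.full((2, 3), 1 / 3)),
}
edges = {(f, v) for f, (scope, _) in factors.items() for v in scope}

def g(assignment):
    """Evaluate g(X1, ..., Xn) = prod_j f_j(S_j) at a full assignment."""
    return np.prod([table[tuple(assignment[v] for v in scope)]
                    for scope, table in factors.values()])

# Brute-force sum over all assignments, to spell out what the factorisation means.
names = list(variables)
total = sum(g(dict(zip(names, combo)))
            for combo in itertools.product(*(range(variables[n]) for n in names)))
print(sorted(edges), total)   # total == 1.0 because these particular tables are normalised
```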

Quibble: We often assume the function to be factored is a probability density function; more generally we wish to factorise measures.

Once the factor graph is written out, computations on the graph are essentially the same for every node. By contrast, in a classic I-graph DAG there are several different rules to remember for leaf nodes and branches, for colliders and forks etc.

I may return to classic factor graphs.

For now, a pretty good introduction is included in Ortiz, Evans and Davison’s tutorial on Gaussian belief propagation, or in Minka (2005).

2 Forney-style factor graphs (FFGs)

A tweaked formalism. Citations tell me this was introduced in a recondite article (Forney 2001) which I did not remotely understand because it was way too far into coding theory for me. It is explained and exploited much better for someone of my background in subsequent articles and used in several computational toolkits (Korl 2005; Cox, van de Laar, and de Vries 2019; H.-A. Loeliger et al. 2007; H.-A. Loeliger 2004; van de Laar et al. 2018; Akbayrak, Bocharov, and de Vries 2021).

Relative to classic factor graphs, H.-A. Loeliger (2004) advocates FFGs for the following advantages:

  • suited for hierarchical modeling (“boxes within boxes”)
  • compatible with standard block diagrams
  • simplest formulation of the summary-product message update rule
  • natural setting for Forney’s results on Fourier transforms and duality.

Mao and Kschischang (2005) argue:

Forney graphs possess a strikingly elegant duality property: by a local dualization operation, a Forney graph for a linear code may be transformed into another graph, called the dual Forney graph, which represents the dual code.

The explanation in Cox, van de Laar, and de Vries (2019) gives the flavour of how Forney-style graphs work:

Figure 1 from Cox, van de Laar, and de Vries (2019): In an FFG, edges correspond to variables and nodes represent factors that encode constraints among variables. A node connects to all edges that correspond to variables that occur in its factor function. For example, node \(f_{b}\) connects to edges \(x_{1}\) and \(x_{2}\) since those variables occur in \(f_{b}\left(x_{1}, x_{2}\right)\). Variables that occur in just one factor (\(x_{3}\) and \(x_{5}\) in this case) are represented by half-edges. While an FFG is principally an undirected graph, we usually specify a direction for the (half-)edges to indicate the generative direction of the model and to anchor the direction of messages flowing on the graph.

A Forney-style factor graph (FFG) offers a graphical representation of a factorized probabilistic model. In an FFG, edges represent variables and nodes specify relations between variables. As a simple example, consider a generative model (joint probability distribution) over variables \(x_{1}, \ldots, x_{5}\) that factors as \[ f\left(x_{1}, \ldots, x_{5}\right)=f_{a}\left(x_{1}\right) f_{b}\left(x_{1}, x_{2}\right) f_{c}\left(x_{2}, x_{3}, x_{4}\right) f_{d}\left(x_{4}, x_{5}\right), \] where \(f_{\bullet}(\cdot)\) denotes a probability density function. This factorized model can be represented graphically as an FFG, as shown in Fig. 1. Note that although an FFG is principally an undirected graph, in the case of generative models we specify a direction for the edges to indicate the “generative direction”. The edge direction simply anchors the direction of messages flowing on the graph (we speak of forward and backward messages that flow with or against the edge direction, respectively). In other words, the edge directionality is purely a notational issue and has no computational consequences.…

Figure 2 from Cox, van de Laar, and de Vries (2019): Visualization of the message passing schedule corresponding with observed variable \(x_{5}=\hat{x}_{5}\). The observation is indicated by terminating edge \(x_{5}\) by a small solid node that technically represents the factor \(\delta\left(x_{5}-\hat{x}_{5}\right)\). Messages are represented by numbered arrows, and the message sequence is chosen such that there are only backward dependencies. Dashed boxes mark the parts of the graph that are covered by the respective messages coming out of those boxes. The marginal posterior distribution \(f\left(x_{2} \mid x_{5}=\hat{x}_{5}\right)\) is obtained by taking the product of the messages that flow on edge \(x_{2}\) and normalising.

The FFG representation of a probabilistic model helps to automate probabilistic inference tasks. As an example, consider we observe \(x_{5}=\hat{x}_{5}\) and are interested in calculating the marginal posterior probability distribution of \(x_{2}\) given this observation.

In the FFG context, observing the realization of a variable leads to the introduction of an extra factor in the model which “clamps” the variable to its observed value. In our example where \(x_{5}\) is observed at value \(\hat{x}_{5}\), we extend the generative model to \(f\left(x_{1}, \ldots, x_{5}\right) \cdot \delta\left(x_{5}-\hat{x}_{5}\right).\) Following the notation introduced in Reller (2013), we denote such “clamping” factors in the FFG by solid black nodes. The FFG of the extended model is illustrated in Fig. 2.

Clamping-style delta factors also arise for another reason: in an FFG an edge (i.e. a variable) can connect at most two factor nodes, so if we want a variable to appear in more than two factors, we add extra equality-constraint factors, each of which forces the variables attached to it to be equal to one another. This seems weird, but it kinda-sorta corresponds to what we would do in a classic factor graph anyway, in the sense that each extra factor adds an extra message-passing step, so this contrivance amounts to writing out our plan in full.

Computing the marginal posterior distribution of \(x_{2}\) under the observation \(x_{5}=\hat{x}_{5}\) involves integrating the extended model over all variables except \(x_{2}\), and renormalising: \[ f\left(x_{2} \mid x_{5}=\hat{x}_{5}\right) \propto \int \ldots \int f\left(x_{1}, \ldots, x_{5}\right) \cdot \delta\left(x_{5}-\hat{x}_{5}\right) \,\mathrm{d} x_{1} \,\mathrm{d} x_{3} \,\mathrm{d} x_{4} \,\mathrm{d} x_{5} \] \[ =\overbrace{\int \underbrace{f_{a}\left(x_{1}\right)}_{(1)} f_{b}\left(x_{1}, x_{2}\right) \,\mathrm{d} x_{1}}^{(2)} \overbrace{\iint f_{c}\left(x_{2}, x_{3}, x_{4}\right) \underbrace{\left(\int f_{d}\left(x_{4}, x_{5}\right) \cdot \delta\left(x_{5}-\hat{x}_{5}\right) \,\mathrm{d} x_{5}\right)}_{(3)} \,\mathrm{d} x_{3} \,\mathrm{d} x_{4}}^{(4)} . \] The nested integrals result from substituting the [original] factorization and rearranging the integrals according to the distributive law. Rearranging large integrals of this type as a product of nested sub-integrals can be automated by exploiting the FFG representation of the corresponding model. The sub-integrals indicated by circled numbers correspond to integrals over parts of the model (indicated by dashed boxes in Fig. 2), and their solutions can be interpreted as messages flowing on the FFG. Therefore, this procedure is known as message passing (or summary propagation). The messages are ordered (“scheduled”) in such a way that there are only backward dependencies, i.e., each message can be calculated from preceding messages in the schedule. Crucially, these schedules can be generated automatically, for example by performing a depth-first search on the FFG.
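
To see that schedule in action, here is a minimal discrete sketch of the same example in Python (the factor tables and binary cardinalities are invented; sums stand in for integrals). The four messages mirror the numbered sub-integrals above, and the result is checked against brute-force marginalisation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented discrete analogue of f(x1,...,x5) = fa(x1) fb(x1,x2) fc(x2,x3,x4) fd(x4,x5),
# with every variable binary and each factor a (conditional) probability table.
fa = rng.dirichlet(np.ones(2))                            # fa(x1)
fb = rng.dirichlet(np.ones(2), size=2)                    # fb[x1, x2]
fc = rng.dirichlet(np.ones(4), size=2).reshape(2, 2, 2)   # fc[x2, x3, x4]
fd = rng.dirichlet(np.ones(2), size=2)                    # fd[x4, x5]
x5_hat = 1                                                # observation, i.e. delta(x5 - x5_hat)

m1 = fa                                  # (1): summary of fa, forward along x1
m2 = m1 @ fb                             # (2): sum over x1 of fa*fb, forward along x2
m3 = fd[:, x5_hat]                       # (3): clamp x5 and sum it out of fd, backward along x4
m4 = np.einsum("abc,c->a", fc, m3)       # (4): sum over x3, x4 of fc*m3, backward along x2

posterior = m2 * m4                      # product of the two messages meeting on edge x2
posterior /= posterior.sum()             # normalise: p(x2 | x5 = x5_hat)

# Brute-force check: sum the full joint over everything except x2.
joint = np.einsum("a,ab,bcd,de->abcde", fa, fb, fc, fd)
brute = joint[:, :, :, :, x5_hat].sum(axis=(0, 2, 3))
assert np.allclose(posterior, brute / brute.sum())
print(posterior)
```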

What “local” means in the context of graphical models was not intuitive to me at first. Local means that, in this high-dimensional integral, only some dimensions are involved in each sub-step; the graph shows how the integrand factorises, and hence which integrals can be pulled outside the others. It has nothing to do with locality over the domain of the functions. This needs a graphical illustration.

3 Plated

Obermeyer et al. (2019):

A plated factor graph is a labelled bipartite graph \((V, F, E, P)\) whose vertices are labelled by the plates on which they are replicated \(P: V \cup F \rightarrow \mathcal{P}(B)\), where \(B\) is a set of plates. We require that each factor is in each of the plates of its variables: \(\forall(v, f) \in E, P(v) \subseteq P(f).\)

What now?

The key operation for plated factor graphs will be “unrolling” to standard factor graphs. First define the following plate notation for either a factor or variable \(z\): \(\mathcal{M}_z(b)=\{1, \ldots, M(b)\}\) if \(b \in P(z)\) and \(\{1\}\) otherwise. This is the set of indices that index into the replicated variable or factor. Now define a function to unroll a plate, \[ \left(V^{\prime}, F^{\prime}, E^{\prime}, P^{\prime}\right)=\operatorname{unroll}((V, F, E, P), M, b) \] where \(v_i\) denotes the \(i\)th unrolled copy of \(v\), and \[ \begin{aligned} V^{\prime} & =\left\{v_i \mid v \in V, i \in \mathcal{M}_v(b)\right\} \\ F^{\prime} & =\left\{f_i \mid f \in F, i \in \mathcal{M}_f(b)\right\} \\ E^{\prime} & =\left\{\left(v_i, f_j\right) \mid(v, f) \in E, i \in \mathcal{M}_v(b), j \in \mathcal{M}_f(b),\right. \\ & \qquad\left.(i=j) \vee b \notin(P(v) \cap P(f))\right\} \\ P^{\prime}(z) & =P(z) \backslash\{b\} \end{aligned} \]
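
Here is a sketch of that unroll operation transcribed into Python; the data structures and the toy plate assignment at the end are mine, not taken from Obermeyer et al. (2019).

```python
def unroll(graph, M, b):
    """Unroll plate b of a plated factor graph (V, F, E, P).

    V, F : sets of variable / factor vertices.
    E    : set of (variable, factor) edges.
    P    : dict mapping each vertex to its set of plates.
    M    : dict mapping each plate b to its size M(b).
    """
    V, F, E, P = graph

    def indices(z):                      # M_z(b) in the notation above
        return range(1, M[b] + 1) if b in P[z] else (1,)

    V2 = {(v, i) for v in V for i in indices(v)}
    F2 = {(f, j) for f in F for j in indices(f)}
    E2 = {((v, i), (f, j))
          for (v, f) in E
          for i in indices(v)
          for j in indices(f)
          if i == j or b not in (P[v] & P[f])}
    P2 = {z: P[z[0]] - {b} for z in V2 | F2}
    return V2, F2, E2, P2

# A small nested-plate example of my own: X lives on plate i, Y on plates {i, j};
# every factor carries at least the plates of its variables, as required.
V, F = {"X", "Y"}, {"f", "g", "h"}
E = {("X", "f"), ("X", "g"), ("Y", "g"), ("Y", "h")}
P = {"X": {"i"}, "Y": {"i", "j"}, "f": {"i"}, "g": {"i", "j"}, "h": {"i", "j"}}
sizes = {"i": 2, "j": 3}

graph = (V, F, E, P)
for plate in ("i", "j"):
    graph = unroll(graph, sizes, plate)
print(len(graph[0]), len(graph[1]), len(graph[2]))  # unrolled variables, factors, edges
```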

An illustration might make that clearer:

Consider a plated factor graph with two variables \(X, Y\), three factors \(F, G, H\), and two nested plates (Figure 3). Assuming sizes \(I=2\) and \(J=3\), this plated factor graph unrolls to the factor graph in Figure 4.

Figure 3
Figure 4

Or:

Consider a plated factor graph with two variables, one factor, and two overlapping non-nested plates, denoting a Restricted Boltzmann Machine (RBM) (Figure 5). Assuming sizes \(I=2\) and \(J=2\), this plated factor graph unrolls to the factor graph in Figure 6.

Figure 5
Figure 6

I think that gives the vibe.

4 Generic messages

Question: Introductory texts discuss sum-product message passing, which is essentially about solving integrals. What if I want to pass some other kind of update, e.g. a maximum-likelihood one? Does the factor graph still help? Yes: we can calculate max-product messages instead. What else can we calculate locally? I am not sure how wide the class is; the standard answer is phrased in terms of commutative semirings (the generalised distributive law).
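
As a concrete illustration (a toy chain with invented tables, not tied to any particular library): the only thing that changes between a marginalisation message and a max-product message is the summary operation; the combination step is the same product in both cases.

```python
import numpy as np

# Chain x1 -- f12 -- x2 -- f23 -- x3. The forward pass below is run twice:
# with np.sum it yields the marginal of x3; with np.max it yields the
# max-product ("Viterbi") message, i.e. the best attainable score per value of x3.
f1 = np.array([0.6, 0.4])                   # f1(x1)
f12 = np.array([[0.7, 0.3],                 # f12(x1, x2)
                [0.1, 0.9]])
f23 = np.array([[0.8, 0.2],                 # f23(x2, x3)
                [0.5, 0.5]])

def forward(summarise):
    m12 = summarise(f1[:, None] * f12, axis=0)    # message x1 -> x2
    m23 = summarise(m12[:, None] * f23, axis=0)   # message x2 -> x3
    return m23

print(forward(np.sum))   # [0.638, 0.362]: the marginal of x3
print(forward(np.max))   # [0.336, 0.18]: max over (x1, x2) of the joint, per value of x3
```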

5 Fourier transforms in

What does it benefit us to take a Fourier transform on a graph? What does that even mean? (Kschischang, Frey, and Loeliger 2001; Forney 2001; Mao and Kschischang 2005)

6 In Bayesian brain models

See de Vries and Friston (2017) and van de Laar et al. (2018) for a connection to predictive coding.

7 Causal inference in

How does do-calculus work in factor graphs?

8 Tooling

8.1 GTSAM & OpenSAM

The goals for these two projects are:

  • GTSAM: Advance state-of-the-art research on factor graphs for large-scale optimization and sensor fusion in robotics and computer vision. GTSAM is popular among both academic and industry researchers, with algorithms that are adaptable and easy to extend.
  • OpenSAM: Provide an industry-standard factor-graph sensor-fusion reference implementation optimized for embedded systems. OpenSAM provides a reference for product development in industry and pays careful attention to embedded optimization, security, and certification.

GTSAM and OpenSAM are compatible with each other for easy migration. Successful exploration in GTSAM will lead to new features in OpenSAM.
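
For flavour, here is roughly what a factor graph looks like in GTSAM’s Python wrapper, modelled on its standard odometry example: pose variables, a prior factor, and between-factors, optimised by nonlinear least squares. The keys and noise values are illustrative, and the API names assume a recent gtsam release.

```python
import numpy as np
import gtsam

# Build a factor graph over three robot poses: one prior factor anchors the
# first pose, and two between-factors encode odometry measurements.
graph = gtsam.NonlinearFactorGraph()
prior_noise = gtsam.noiseModel.Diagonal.Sigmas(np.array([0.3, 0.3, 0.1]))
odometry_noise = gtsam.noiseModel.Diagonal.Sigmas(np.array([0.2, 0.2, 0.1]))

graph.add(gtsam.PriorFactorPose2(1, gtsam.Pose2(0.0, 0.0, 0.0), prior_noise))
graph.add(gtsam.BetweenFactorPose2(1, 2, gtsam.Pose2(2.0, 0.0, 0.0), odometry_noise))
graph.add(gtsam.BetweenFactorPose2(2, 3, gtsam.Pose2(2.0, 0.0, 0.0), odometry_noise))

# Initial guesses for the pose variables, then optimise the whole graph at once.
initial = gtsam.Values()
initial.insert(1, gtsam.Pose2(0.5, 0.0, 0.2))
initial.insert(2, gtsam.Pose2(2.3, 0.1, -0.2))
initial.insert(3, gtsam.Pose2(4.1, 0.1, 0.1))

result = gtsam.LevenbergMarquardtOptimizer(graph, initial).optimize()
print(result)
```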

9 References

Abbeel, Koller, and Ng. 2006. “Learning Factor Graphs in Polynomial Time and Sample Complexity.” Journal of Machine Learning Research.
Akbayrak, Bocharov, and de Vries. 2021. “Extended Variational Message Passing for Automated Approximate Bayesian Inference.” Entropy.
Bickson. 2009. “Gaussian Belief Propagation: Theory and Application.”
Cox, and De Vries. 2018. “Robust Expectation Propagation in Factor Graphs Involving Both Continuous and Binary Variables.” In 2018 26th European Signal Processing Conference (EUSIPCO).
Cox, van de Laar, and de Vries. 2019. “A Factor Graph Approach to Automated Design of Bayesian Signal Processing Algorithms.” International Journal of Approximate Reasoning.
Dauwels. 2007. “On Variational Message Passing on Factor Graphs.” In 2007 IEEE International Symposium on Information Theory.
de Vries, and Friston. 2017. “A Factor Graph Description of Deep Temporal Active Inference.” Frontiers in Computational Neuroscience.
Dellaert, and Kaess. 2017. “Factor Graphs for Robot Perception.” Foundations and Trends® in Robotics.
El-Kurdi. 2014. “Parallel Finite Element Processing Using Gaussian Belief Propagation Inference on Probabilistic Graphical Models.”
Forney. 2001. “Codes on Graphs: Normal Realizations.” IEEE Transactions on Information Theory.
Frey. 2003. “Extending Factor Graphs so as to Unify Directed and Undirected Graphical Models.” In Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence. UAI’03.
Frey, Kschischang, Loeliger, et al. 1997. “Factor Graphs and Algorithms.” In Proceedings of the Annual Allerton Conference on Communication Control and Computing.
Grimmett. 1973. “A Theorem about Random Fields.” Bulletin of the London Mathematical Society.
Hasenclever, Webb, Lienart, et al. 2017. “Distributed Bayesian Learning with Stochastic Natural Gradient Expectation Propagation and the Posterior Server.” Journal of Machine Learning Research.
Korl. 2005. “A Factor Graph Approach to Signal Modelling, System Identification and Filtering.”
Kschischang, Frey, and Loeliger. 2001. “Factor Graphs and the Sum-Product Algorithm.” IEEE Transactions on Information Theory.
Loeliger, H.-A. 2004. “An Introduction to Factor Graphs.” IEEE Signal Processing Magazine.
Loeliger, Hans-Andrea, Dauwels, Hu, et al. 2007. “The Factor Graph Approach to Model-Based Signal Processing.” Proceedings of the IEEE.
Mao, and Kschischang. 2005. “On Factor Graphs and the Fourier Transform.” IEEE Transactions on Information Theory.
Mao, Kschischang, and Frey. 2004. “Convolutional Factor Graphs As Probabilistic Models.” In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence. UAI ’04.
Ma, and Zhang. 2019. “Incorporating Biological Knowledge with Factor Graph Neural Network for Interpretable Deep Learning.”
Minka. 2005. “Divergence Measures and Message Passing.”
Obermeyer, Bingham, Jankowiak, et al. 2019. “Tensor Variable Elimination for Plated Factor Graphs.”
Reller. 2013. “State-Space Methods in Statistical Signal Processing: New Ideas and Applications.”
Satorras, and Welling. 2021. “Neural Enhanced Belief Propagation on Factor Graphs.”
Sherman. 1973. “Markov Random Fields and Gibbs Random Fields.” Israel Journal of Mathematics.
van de Laar, Cox, Senoz, et al. 2018. “ForneyLab: A Toolbox for Biologically Plausible Free Energy Minimization in Dynamic Neural Models.” In Conference on Complex Systems.
Vehtari, Gelman, Sivula, et al. 2019. “Expectation Propagation as a Way of Life: A Framework for Bayesian Inference on Partitioned Data.” arXiv:1412.4869 [stat].
Wiegerinck, and Heskes. 2002. “Fractional Belief Propagation.” In Advances in Neural Information Processing Systems.
Xu, Lakshminarayanan, Teh, et al. 2014. “Distributed Bayesian Posterior Sampling via Moment Sharing.” In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2. NIPS’14.
Zhang, Wu, and Lee. 2020. “Factor Graph Neural Networks.” In Advances in Neural Information Processing Systems.