Gaussian belief propagation

Least squares at maximal elaboration

2014-11-25 — 2022-09-01

Wherein Gaussian belief propagation is presented as a message‑passing algorithm on jointly Gaussian variables, and updates are executed as simple linear‑algebra operations in the information (precision) parameterisation.

algebra

approximation

Bayes

distributed

dynamical systems

Gaussian

generative

graphical models

Hilbert space

linear algebra

machine learning

networks

optimization

probability

signal processing

state space models

statistics

stochastic processes

time series

\[\renewcommand{\argmin}{\operatorname{arg min}} \renewcommand{\KL}{\operatorname{KL}} \renewcommand{\var}{\operatorname{Var}} \renewcommand{\cov}{\operatorname{Cov}} \renewcommand{\dd}{\mathrm{d}} \renewcommand{\vv}[1]{\boldsymbol{#1}} \renewcommand{\rv}[1]{\mathsf{#1}} \renewcommand{\vrv}[1]{\vv{\rv{#1}}} \renewcommand{\disteq}{\stackrel{d}{=}} \renewcommand{\gvn}{\mid} \renewcommand{\Ex}{\mathbb{E}} \renewcommand{\Pr}{\mathbb{P}} \renewcommand{\one}{\unicode{x1D7D9}}\]

A particularly tractable model assumption for message-passing inference generalizes classic Gaussian Kalman filters and Bayesian linear regression with a Gaussian prior, or, in a frequentist setting, least squares regression. Essentially, we regard the various nodes in our system as jointly Gaussian RVs with given prior mean and covariance (i.e. we do not allow the variances themselves to be random, as a Gaussian is not a valid prior for a variance). This works well because the Gaussian distribution is an exponential family that is closed under linear transformations, marginalisation and conditioning, and is conjugate to itself as a prior on the mean of a Gaussian likelihood. The message-passing updates are thus simple linear algebra operations.

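How does this relate to least squares? Bickson (2009) puts it elegantly: \[\begin{array}{c} \mathbf{A} \mathbf{x}=\mathbf{b}\\ \Downarrow\\ \mathbf{A x}-\mathbf{b}=0\\ \Downarrow\\ \min _{\mathbf{x}}\left(\frac{1}{2} \mathbf{x}^{T} \mathbf{A} \mathbf{x}-\mathbf{b}^{T} \mathbf{x}\right)\\ \Downarrow\\ \max _{\mathbf{x}}\left(-\frac{1}{2} \mathbf{x}^{T} \mathbf{A} \mathbf{x}+\mathbf{b}^{T} \mathbf{x}\right)\\ \Downarrow\\ \max _{\mathbf{x}} \exp \left(-\frac{1}{2} \mathbf{x}^{T} \mathbf{A} \mathbf{x}+\mathbf{b}^{T} \mathbf{x}\right) \end{array}\] The chain assumes that \(\mathbf{A}\) is symmetric positive definite, so that the quadratic has a unique minimiser. The final expression is an unnormalised Gaussian density in \(\mathbf{x}\) with precision matrix \(\mathbf{A}\) and mean \(\mathbf{A}^{-1}\mathbf{b}\), so inferring the mean of that Gaussian is the same problem as solving the linear system.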
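To make the connection concrete, here is a minimal sketch of scalar GBP used as a linear solver, in the spirit of Shental et al. (2008). It assumes a symmetric \(\mathbf{A}\) with positive diagonal, dense storage and a synchronous schedule; the function name `gabp_solve` and its arguments are illustrative choices, not an API from any of the cited works. Messages are stored in information form, as a precision and an information value per directed edge.

```python
import numpy as np


def gabp_solve(A, b, n_iters=50):
    """Approximate the solution of A x = b by scalar Gaussian belief propagation.

    Assumes A is symmetric with positive diagonal (e.g. diagonally dominant).
    Returns the marginal means and marginal variances of the beliefs.
    """
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    n = b.shape[0]
    # An edge i -> j exists wherever A[i, j] is a non-zero off-diagonal entry.
    edges = (A != 0) & ~np.eye(n, dtype=bool)
    lam = np.zeros((n, n))  # lam[i, j]: precision of the message i -> j
    eta = np.zeros((n, n))  # eta[i, j]: information value of the message i -> j

    def beliefs(lam, eta):
        # Node factor plus all incoming messages (column sums).
        return np.diag(A) + lam.sum(axis=0), b + eta.sum(axis=0)

    for _ in range(n_iters):
        lam_tot, eta_tot = beliefs(lam, eta)
        # Cavity at node i with respect to j: drop the message j -> i.
        lam_cav = lam_tot[:, None] - lam.T
        eta_cav = eta_tot[:, None] - eta.T
        # Marginalise x_i out of the pairwise factor exp(-x_i A_ij x_j).
        lam = np.where(edges, -A * A.T / lam_cav, 0.0)
        eta = np.where(edges, -A * eta_cav / lam_cav, 0.0)

    lam_tot, eta_tot = beliefs(lam, eta)
    return eta_tot / lam_tot, 1.0 / lam_tot


# A chain-structured (tree) system, for which GBP is exact.
A_chain = np.array([[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]])
b_chain = np.array([1.0, 2.0, 3.0])
x_gbp, var_gbp = gabp_solve(A_chain, b_chain)
```

On a tree-structured \(\mathbf{A}\) both the means and the variances agree with the direct solution. On loopy graphs, if the iteration converges, the means are exact but the variances are in general only approximate (Weiss and Freeman 2001).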

GBP comes with various conveniences. For example, we can implement it mechanically without knowing the structure of the graphical model in advance. We can execute it concurrently, since the algorithm tolerates random execution schedules. This seems to be a happy coincidence of several features of multivariate Gaussians, e.g. that the Gaussian is self-conjugate and thus passes through the factor and variable nodes in commensurable form. Also, all the required operations can be represented by simple linear algebra.

NB: notation varies across the literature. The original paper (Weiss and Freeman 2001) presents the variables as pairwise linear models rather than as a classic graphical model, which is a good insight but can be confusing at first.

A famous interactive tutorial is gaussianbp.github.io, reprising Davison and Ortiz (2019). Loeliger et al. (2007) is also nice; it includes explicit accounting for Fourier transforms and arbitrary matrix products.

1 Parameterization

Figure 2: Ortiz et al summarize Gaussian parameterisations

A Gaussian is often written in terms of its mean and precision, in which case a particular node prior \(m\) may be written as: \[ p_{m}\left(\mathbf{x}_{m}\right)\propto e^{-\frac{1}{2}\left[\left(\mathbf{x}_{m}-\boldsymbol{\mu}_{m}\right)^{\top} \Lambda_{m}\left(\mathbf{x}_{m}-\boldsymbol{\mu}_{m}\right)\right]} \] where \(\boldsymbol{\mu}_{m}\) is the mean of the distribution and \(\Lambda_{m}\) is its precision (i.e. inverse covariance).

Davison and Ortiz (2019) recommend the “information parameterization” of Eustice, Singh, and Leonard (2006): \[ p_{m}\left(\mathbf{x}_{m}\right)\propto e^{\left[-\frac{1}{2} \mathbf{x}_{m}^{\top} \Lambda_{m} \mathbf{x}_{m}+\boldsymbol{\eta}_{m}^{\top} \mathbf{x}_{m}\right]}. \] The information vector \(\boldsymbol{\eta}_{m}\) relates to the mean vector as \[ \boldsymbol{\eta}_{m}=\Lambda_{m} \boldsymbol{\mu}_{m}. \] The information form is convenient because multiplication of distributions (when we update) is handled simply by adding the information vectors and adding the precision matrices.
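A small sketch of that bookkeeping, assuming NumPy and dense matrices. The helper names `to_information`, `to_moments` and `multiply` are made up for illustration.

```python
import numpy as np


def to_information(mu, Sigma):
    """Convert mean/covariance to information vector and precision matrix."""
    Lam = np.linalg.inv(Sigma)
    return Lam @ mu, Lam


def to_moments(eta, Lam):
    """Convert information vector and precision matrix back to mean/covariance."""
    Sigma = np.linalg.inv(Lam)
    return Sigma @ eta, Sigma


def multiply(*factors):
    """Multiply Gaussian factors over the same variable, given as (eta, Lam) pairs."""
    etas, Lams = zip(*factors)
    return sum(etas), sum(Lams)


# Two beliefs about the same 2-d variable (illustrative numbers).
prior = to_information(np.array([0.0, 0.0]), np.eye(2))
message = to_information(np.array([2.0, 1.0]), np.diag([1.0, 4.0]))
post_mean, post_cov = to_moments(*multiply(prior, message))
```

The product needs no matrix inversion until we want the mean or covariance back, which is why messages are usually kept in this form for as long as possible.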

See Gaussian tricks for a recap.

2 Linearization

See GP noise transforms for a typical way to do it. In the information parameterization the rules look different. Suppose the expected value of the measurement is given by a non-linear map \(\mathbf{h}_{s},\) in the sense that the observation \(\mathbf{z}_{s}\) is Gaussian noise with precision \(\Lambda_{s}\) centred on \(\mathbf{h}_{s}(\mathbf{x}_s)\). We define the associated Gaussian factor as follows: \[ f_{s}\left(\mathbf{x}_{s}\right)\propto e^{-\frac{1}{2}\left[\left(\mathbf{z}_{s}-\mathbf{h}_{s}\left(\mathbf{x}_{s}\right)\right)^{\top} \Lambda_{s}\left(\mathbf{z}_{s}-\mathbf{h}_{s}\left(\mathbf{x}_{s}\right)\right)\right]} \] We apply the first-order Taylor series expansion of the non-linear measurement function \(\mathbf{h}_{s}\) to find its approximate value for state values \(\mathbf{x}_{s}\) close to a linearisation point \(\mathbf{x}_{0}\): \[ \mathbf{h}_{s}\left(\mathbf{x}_{s}\right) \approx \mathbf{h}_{s}\left(\mathbf{x}_{0}\right)+\mathrm{J}_{s}\left(\mathbf{x}_{s}-\mathbf{x}_{0}\right). \] Here \(\mathrm{J}_{s}\) is the Jacobian \(\left.\frac{\partial \mathbf{h}_{s}}{\partial \mathbf{x}_{s}}\right|_{\mathbf{x}_{s}=\mathbf{x}_{0}} .\) Substituting the expansion into the factor gives the residual \(\mathbf{z}_{s}-\mathbf{h}_{s}(\mathbf{x}_{s})\approx\left[\mathrm{J}_{s} \mathbf{x}_{0}+\mathbf{z}_{s}-\mathbf{h}_{s}(\mathbf{x}_{0})\right]-\mathrm{J}_{s}\mathbf{x}_{s}\), and expanding the quadratic form yields a factor that is Gaussian in \(\mathbf{x}_{s}\). In information form, this linearised factor has information vector \(\boldsymbol{\eta}_{s}\) and precision matrix \(\Lambda_{s}^{\prime}\) given by \[ \begin{aligned} \boldsymbol{\eta}_{s} &=\mathrm{J}_{s}^{\top} \Lambda_{s}\left[\mathrm{J}_{s} \mathbf{x}_{0}+\mathbf{z}_{s}-\mathbf{h}_{s}\left(\mathbf{x}_{0}\right)\right] \\ \Lambda_{s}^{\prime} &=\mathrm{J}_{s}^{\top} \Lambda_{s} \mathrm{J}_{s}. \end{aligned} \]
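The two formulas translate directly into code. The following is a sketch under stated assumptions: the Jacobian is supplied by hand, and the range-measurement model (`h_range`, `jac_range`) and all numbers are hypothetical examples rather than anything taken from the cited papers.

```python
import numpy as np


def linearise_factor(h, jac, x0, z, Lam_z):
    """Information-form Gaussian factor from linearising h around x0.

    h: measurement function, jac: its Jacobian, z: observation,
    Lam_z: measurement precision. Returns (eta, Lam) over the state.
    """
    x0 = np.asarray(x0, dtype=float)
    J = np.atleast_2d(jac(x0))
    residual = np.atleast_1d(z) - np.atleast_1d(h(x0))
    eta = J.T @ Lam_z @ (J @ x0 + residual)
    Lam = J.T @ Lam_z @ J
    return eta, Lam


def h_range(x):
    """Hypothetical sensor: distance of a 2-d position from the origin."""
    return np.array([np.linalg.norm(x)])


def jac_range(x):
    """Jacobian of h_range, a 1 x 2 matrix."""
    return (x / np.linalg.norm(x))[None, :]


x0 = np.array([3.0, 4.0])      # linearisation point
z = np.array([5.5])            # observed range
Lam_z = np.array([[4.0]])      # measurement precision (standard deviation 0.5)
eta_s, Lam_s = linearise_factor(h_range, jac_range, x0, z, Lam_z)

# Combine with a unit-precision prior centred on x0 by adding information.
Lam_prior = np.eye(2)
eta_prior = Lam_prior @ x0
x_new = np.linalg.solve(Lam_prior + Lam_s, eta_prior + eta_s)
```

Note that \(\Lambda_{s}^{\prime}\) here has rank one, since a single range measurement constrains only one direction. The factor becomes a proper distribution only once it is combined with other information, such as the prior in the last lines. In an iterative scheme we would relinearise around `x_new` and repeat.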

3 Variational inference

Intuition building. The KL divergence between two \(k\)-dimensional Gaussians has the closed form \[ \KL(\mathcal{N}(\mu_1,\Sigma_1)\parallel \mathcal{N}(\mu_2,\Sigma_2)) ={\frac {1}{2}}\left(\operatorname {tr} \left(\Sigma _{2}^{-1}\Sigma _{1}\right)+(\mu_{2}-\mu_{1})^{\mathsf {T}}\Sigma _{2}^{-1}(\mu_{2}-\mu_{1})-k+\ln \left({\frac {\det \Sigma _{2}}{\det \Sigma _{1}}}\right)\right),\] so we can think about belief propagation in terms of KL-similar updates. Note that the KL divergence is not symmetric in its arguments, so it is not a distance in the metric sense.
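For numerical experiments, the closed form can be evaluated as below. This is a sketch assuming NumPy and full-rank covariances; `gaussian_kl` is an illustrative name.

```python
import numpy as np


def gaussian_kl(mu1, Sigma1, mu2, Sigma2):
    """KL( N(mu1, Sigma1) || N(mu2, Sigma2) ) for full-rank covariances."""
    k = mu1.shape[0]
    diff = mu2 - mu1
    # Use linear solves rather than forming Sigma2^{-1} explicitly.
    trace_term = np.trace(np.linalg.solve(Sigma2, Sigma1))
    maha_term = diff @ np.linalg.solve(Sigma2, diff)
    # slogdet is more stable than log(det(.)) for badly scaled matrices.
    _, logdet1 = np.linalg.slogdet(Sigma1)
    _, logdet2 = np.linalg.slogdet(Sigma2)
    return 0.5 * (trace_term + maha_term - k + logdet2 - logdet1)


mu_a, Sigma_a = np.zeros(2), np.eye(2)
mu_b, Sigma_b = np.array([1.0, 0.0]), np.diag([1.0, 4.0])
kl_ab = gaussian_kl(mu_a, Sigma_a, mu_b, Sigma_b)
kl_ba = gaussian_kl(mu_b, Sigma_b, mu_a, Sigma_a)
```

Comparing `kl_ab` with `kl_ba` shows the asymmetry: swapping the arguments changes the value.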

Suppose I have a Gaussian RV with some prior distribution \(\vrv{x}\sim \mathcal{N}(\vv{m}_{\vrv{x}}, \Sigma_{\vrv{x}})\) and I observe some datum \(\vrv{y}\sim f(\vrv{x} + \epsilon)\) where \(\epsilon \sim \mathcal{N}(0, \Sigma_{\vrv{y}})\). Define \(\hat{\vrv{x}}:=\vrv{x}\gvn \vrv{y}.\) Assume that we can approximate \(\hat{\vrv{x}}\sim \mathcal{N}(\vv{m}_{\hat{\vrv{x}}}, \Sigma_{\hat{\vrv{x}}})\). What is the best such approximation to the posterior distribution of \(\vrv{x}\) given the observation? I.e. how do we choose \(\vv{m}_{\hat{\vrv{x}}}, \Sigma_{\hat{\vrv{x}}}\)? One way is to choose them to minimise a KL divergence, \[\begin{aligned} \vv{m}_{\hat{\vrv{x}}}, \Sigma_{\hat{\vrv{x}}} &=\argmin_{\vv{m}, \Sigma} \KL\left(\mathcal{N}(\vv{m}, \Sigma)\parallel\vrv{x}\gvn \vrv{y}\right)\\ &=\argmin_{\vv{m}, \Sigma} \KL\left(\mathcal{N}(\vv{m}, \Sigma)\parallel \mathcal{N}(\vv{m}_{\vrv{x}}, \Sigma_{\vrv{x}})\right) - \Ex_{\vv{x}\sim\mathcal{N}(\vv{m}, \Sigma)}\left[\log p(\vrv{y}\gvn \vv{x})\right]. \end{aligned}\] The second line follows from Bayes' rule after dropping the evidence term \(\log p(\vrv{y})\), which does not depend on \(\vv{m}\) or \(\Sigma\). We still need to choose a way of estimating this posterior approximation, which could exploit stochastic gradient estimation or GP noise transforms. TBC.

4 Parallelisation

El-Kurdi’s thesis (Y. M. El-Kurdi 2014) exploits an assortment of tricks to attain high-speed convergence in PDE-type models.

5 Robust

There are generalised belief propagations based on elliptical laws (Agarwal et al. 2013; Davison and Ortiz 2019), which amount to a kind of elliptical belief propagation. In the literature the device is usually referred to as a robust factor or as dynamic covariance scaling. In practice it is usually implemented with robust losses; I’m not sure why other elliptical distributions are unpopular.
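To illustrate the robust-loss idea, here is one common construction. It is sketched as an assumption, not as the exact recipe of the cited papers. The factor stays Gaussian, but its precision is rescaled so that the quadratic energy at the current residual equals a Huber loss. Matching \(\tfrac{1}{2}kM^2 = N_\sigma M - \tfrac{1}{2}N_\sigma^2\) for Mahalanobis distance \(M > N_\sigma\) gives the scale \(k = 2N_\sigma/M - N_\sigma^2/M^2\). The names `huber_precision_scale` and `n_sigma` are illustrative.

```python
import numpy as np


def huber_precision_scale(residual, Lam, n_sigma=1.5):
    """Scale factor k for the precision of a Gaussian factor.

    Inside n_sigma Mahalanobis units the factor is left unchanged (k = 1).
    Beyond that, k is chosen so that 0.5 * k * M**2 equals the Huber
    energy n_sigma * M - 0.5 * n_sigma**2, which grows only linearly in M.
    """
    M = np.sqrt(residual @ Lam @ residual)  # Mahalanobis distance
    if M <= n_sigma:
        return 1.0
    return 2.0 * n_sigma / M - (n_sigma / M) ** 2


Lam = np.array([[1.0]])
k_inlier = huber_precision_scale(np.array([0.5]), Lam)
k_outlier = huber_precision_scale(np.array([3.0]), Lam)

# The down-weighted factor in information form uses k * Lam and k * eta.
Lam_robust = k_outlier * Lam
```

Because the scale depends on the current residual, it has to be recomputed whenever the factor is relinearised or the adjacent beliefs change.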

6 Use in PDEs

Finite Element Models can be expressed as locally-linear constraints and thus expressed using GBP (Y. El-Kurdi et al. 2016; Y. M. El-Kurdi 2014; Y. El-Kurdi et al. 2015).

7 Tools

7.1 jaxfg

brentyi/jaxfg: Factor graphs and nonlinear optimization for JAX

7.2 gradslam

Differentiable GBP solver (Jatavallabhula, Iyer, and Paull 2020).

About gradslam

7.3 ceres solver

ceres-solver (C++), Google's nonlinear least squares solver, seems to solve this kind of problem. I am not sure where the covariance matrices go in. I occasionally see mention of “CUDA” in the source repo, so maybe it exploits GPUs these days.

7.4 Incoming

8 References

Agarwal, Tipaldi, Spinello, et al. 2013. “Robust Map Optimization Using Dynamic Covariance Scaling.” In 2013 IEEE International Conference on Robotics and Automation.

Barfoot. 2020. “Fundamental Linear Algebra Problem of Gaussian Inference.”

Bickson. 2009. “Gaussian Belief Propagation: Theory and Application.”

Bickson, Baron, Ihler, et al. 2011. “Fault Identification via Non-Parametric Belief Propagation.”

Bickson, Shental, Siegel, et al. 2007. “Linear Detection via Belief Propagation.” In Proc. 45th Allerton Conf. On Communications, Control and Computing.

Bickson, Tock, Zymnis, et al. 2009. “Distributed Large Scale Network Utility Maximization.” In Proceedings of the 2009 IEEE International Conference on Symposium on Information Theory - Volume 2. ISIT’09.

Bishop, and Doucet. 2014. “Distributed Nonlinear Consensus in the Space of Probability Measures.” IFAC Proceedings Volumes, 19th IFAC World Congress.

Cattivelli, Lopes, and Sayed. 2008. “Diffusion Recursive Least-Squares for Distributed Estimation over Adaptive Networks.” IEEE Transactions on Signal Processing.

Cattivelli, and Sayed. 2009. “Diffusion LMS Strategies for Distributed Estimation.” IEEE Transactions on Signal Processing.

———. 2010. “Diffusion Strategies for Distributed Kalman Filtering and Smoothing.” IEEE Transactions on Automatic Control.

Chen, Yan, and Oliver. 2013. “Levenberg–Marquardt Forms of the Iterative Ensemble Smoother for Efficient History Matching and Uncertainty Quantification.” Computational Geosciences.

Chen, Wilson Y., and Wand. 2020. “Factor Graph Fragmentization of Expectation Propagation.” Journal of the Korean Statistical Society.

Cseke, and Heskes. 2011. “Properties of Bethe Free Energies and Message Passing in Gaussian Models.” Journal of Artificial Intelligence Research.

Davison, and Ortiz. 2019. “FutureMapping 2: Gaussian Belief Propagation for Spatial AI.” arXiv:1910.14139 [Cs].

Dean, Corrado, Monga, et al. 2012. “Large Scale Distributed Deep Networks.” In Advances in Neural Information Processing Systems.

Deisenroth, and Mohamed. 2012. “Expectation Propagation in Gaussian Process Dynamical Systems.” In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 2. NIPS’12.

Dellaert, and Kaess. 2017. “Factor Graphs for Robot Perception.” Foundations and Trends® in Robotics.

Donoho, and Montanari. 2013. “High Dimensional Robust M-Estimation: Asymptotic Variance via Approximate Message Passing.” arXiv:1310.7320 [Cs, Math, Stat].

Dubrule. 2018. “Kriging, Splines, Conditional Simulation, Bayesian Inversion and Ensemble Kalman Filtering.” In Handbook of Mathematical Geosciences: Fifty Years of IAMG.

Du, Ma, Wu, et al. 2018. “Convergence Analysis of Belief Propagation on Gaussian Graphical Models.” arXiv:1801.06430 [Cs, Math].

El-Kurdi, Yousef Malek. 2014. “Parallel Finite Element Processing Using Gaussian Belief Propagation Inference on Probabilistic Graphical Models.”

El-Kurdi, Yousef, Dehnavi, Gross, et al. 2015. “Parallel Finite Element Technique Using Gaussian Belief Propagation.” Computer Physics Communications.

El-Kurdi, Yousef, Fernandez, Gross, et al. 2016. “Acceleration of the Finite-Element Gaussian Belief Propagation Solver Using Minimum Residual Techniques.” IEEE Transactions on Magnetics.

Eustice, Singh, and Leonard. 2006. “Exactly Sparse Delayed-State Filters for View-Based SLAM.” IEEE Transactions on Robotics.

Gao, Sitharam, and Roitberg. 2020. “Bounds on the Jensen Gap, and Implications for Mean-Concentrated Distributions.”

Gurevich, and Stuke. 2019. “Gradient Conjugate Priors and Multi-Layer Neural Networks.”

Jatavallabhula, Iyer, and Paull. 2020. “∇SLAM: Dense SLAM Meets Automatic Differentiation.” In 2020 IEEE International Conference on Robotics and Automation (ICRA).

Kamper, and Steel. 2020. “Regularized Gaussian Belief Propagation with Nodes of Arbitrary Size.”

Liao, and Sun. 2019. “Gaussian Belief Propagation for Solving Network Utility Maximization with Delivery Contracts.” Entropy.

Loeliger, Dauwels, Hu, et al. 2007. “The Factor Graph Approach to Model-Based Signal Processing.” Proceedings of the IEEE.

Malioutov, Johnson, and Willsky. 2006. “Walk-Sums and Belief Propagation in Gaussian Graphical Models.” Journal of Machine Learning Research.

Minka. 2008. “EP: A Quick Reference.” Technical Report.

Murphy. 2012. Machine Learning: A Probabilistic Perspective. Adaptive Computation and Machine Learning Series.

Nguyen, and Bonilla. 2014. “Automated Variational Inference for Gaussian Process Models.” In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1. NIPS’14.

Opper, and Archambeau. 2009. “The Variational Gaussian Approximation Revisited.” Neural Computation.

Ortiz, Evans, and Davison. 2021. “A Visual Introduction to Gaussian Belief Propagation.” arXiv:2107.02308 [Cs].

Qi. 2004. “Extending Expectation Propagation for Graphical Models.”

Ranganathan, Kaess, and Dellaert. 2007. “Loopy SAM.” In Proceedings of the 20th International Joint Conference on Artificial Intelligence. IJCAI’07.

Rezende, Mohamed, and Wierstra. 2015. “Stochastic Backpropagation and Approximate Inference in Deep Generative Models.” In Proceedings of ICML.

Sayed. 2014. “Adaptation, Learning, and Optimization over Networks.” Foundations and Trends® in Machine Learning.

Shental, Siegel, Wolf, et al. 2008. “Gaussian Belief Propagation Solver for Systems of Linear Equations.” In 2008 IEEE International Symposium on Information Theory.

Simic. 2008. “On a Global Upper Bound for Jensen’s Inequality.” Journal of Mathematical Analysis and Applications.

Song, Gretton, Bickson, et al. 2011. “Kernel Belief Propagation.” In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics.

Su, and Wu. 2015. “On Convergence Conditions of Gaussian Belief Propagation.” IEEE Transactions on Signal Processing.

Wang, and Dekorsy. 2020. “A Factor Graph-Based Distributed Consensus Kalman Filter.” IEEE Signal Processing Letters.

Weiss, and Freeman. 2001. “Correctness of Belief Propagation in Gaussian Graphical Models of Arbitrary Topology.” Neural Computation.

Yang, Fang, Duan, et al. 2018. “Fast Low-Rank Bayesian Matrix Completion with Hierarchical Gaussian Prior Models.” IEEE Transactions on Signal Processing.

Yedidia, Freeman, and Weiss. 2003. “Understanding Belief Propagation and Its Generalizations.” In Exploring Artificial Intelligence in the New Millennium.

———. 2005. “Constructing Free-Energy Approximations and Generalized Belief Propagation Algorithms.” IEEE Transactions on Information Theory.

Zivojevic, Delalic, Raca, et al. 2021. “Distributed Weighted Least-Squares and Gaussian Belief Propagation: An Integrated Approach.” Preprint.