\[\renewcommand{\argmin}{\operatorname{arg min}} \renewcommand{\KL}{\operatorname{KL}} \renewcommand{\var}{\operatorname{Var}} \renewcommand{\cov}{\operatorname{Cov}} \renewcommand{\dd}{\mathrm{d}} \renewcommand{\bb}[1]{\mathbb{#1}} \renewcommand{\vv}[1]{\boldsymbol{#1}} \renewcommand{\rv}[1]{\mathsf{#1}} \renewcommand{\vrv}[1]{\vv{\rv{#1}}} \renewcommand{\disteq}{\stackrel{d}{=}} \renewcommand{\gvn}{\mid} \renewcommand{\Ex}{\mathbb{E}} \renewcommand{\Pr}{\mathbb{P}} \renewcommand{\one}{\unicode{x1D7D9}}\]

A particularly tractable model assumption for message-passing inference which generalises classic Gaussian Kalman filters and Bayesian linear regression with Gaussian prior, or, in a frequentist setting, [least squares regression](./least_squares.html. Essentially, we regard the various nodes in our system as jointly Gaussian RVs with given prior mean and covariance (i.e. we do not allow the variances themselves to be random, a Gaussian is not a valid prior for a variance.)

How does this relate to least squares? Bickson (2009) puts it elegantly: \[\begin{array}{c} \mathbf{A} \mathbf{x}=\mathbf{b}\\ \mathbb{\Downarrow}\\ \mathbf{A x}-\mathbf{b}=0\\ \mathbb{\Downarrow}\\ \min _{\mathbf{x}}\left(\frac{1}{2} \mathbf{x}^{T} \mathbf{A} \mathbf{x}-\mathbf{b}^{T} \mathbf{x}\right)\\ \mathbb{\Downarrow}\\ \max _{\mathbf{x}}\left(-\frac{1}{2} \mathbf{x}^{T} \mathbf{A} \mathbf{x}+\mathbf{b}^{T} \mathbf{x}\right)\\ \mathbb{\Downarrow}\\ \max _{\mathbf{x}} \exp \left(-\frac{1}{2} \mathbf{x}^{T} \mathbf{A} \mathbf{x}+\mathbf{b}^{T} \mathbf{x}\right) \end{array}\]

GBP comes with various conveniences. For example, we can implement it mechanically without knowing the structure of the graphical model in advance. We can execute it concurrently since the model is robust against random execution schedules. This seems to be a happy coincidence of several features of multivariate Gaussians, e.g. that the Gaussian is self conjugate and thus passes through the factor and variable nodes in commensurable form. Also, all the required operations can be represented by simple linear algebra.

NB: notation is varied and thus possibly confusing.
In the original paper (Weiss and Freeman 2001) variables are presented as *pairwise* linear models rather than a classic graphical model, which is a good insight but possibly confusing.

A famous interactive tutorial is gaussianbp.github.io, reprising Davison and Ortiz (2019). Loeliger et al. (2007) is also nice; it includes explicit accounting for Fourier transforms and arbitrary matrix products.

## Parameterization

Gaussian BP is often conducted using the canonical parameterization, where a particular node prior \(m\) may be written as:
\[
p_{m}\left(\mathbf{x}_{m}\right)\propto e^{-\frac{1}{2}\left[\left(\mathbf{x}_{m}-\boldsymbol{\mu}_{m}\right)^{\top} \boldsymbol{\Lambda}_{m}\left(\mathbf{x}_{m}-\boldsymbol{\mu}_{m}\right)\right]}
\]
where \(\boldsymbol{\mu}_{m}\) is the mean of the distribution and \(\Lambda_{m}\) is its *precision* (i.e. inverse covariance.)

Davison and Ortiz (2019) recommend the “information parameterization” of Eustice, Singh, and Leonard (2006): \[ p_{m}\left(\mathbf{x}_{m}\right)\propto e^{\left[-\frac{1}{2} \mathbf{x}_{m}^{\top} \Lambda_{m} \mathbf{x}_{m}+\boldsymbol{\eta}_{m}^{\top} \mathbf{x}_{m}\right]} \] the value of either. The information vector \(\boldsymbol{\eta}_{m}\) relates to the mean vector as \[ \boldsymbol{\eta}_{m}=\Lambda_{m} \boldsymbol{\mu}_{m}. \] The information form is convenient as multiplication of distributions (when we update) is handled simply by adding the information vectors and precision matrices.

## Linearization

See GP noise transforms for a the typical way to do it. In the information parameterization the rules look different. Suppose the expected value of the measurement is given by non-linear map \(h,\) in the sense we that we observe Gaussian noise centered on \(h(\mathbf{x}_s)\). We define the associated Gaussian factor as follows: \[ f_{s}\left(\mathbf{x}_{s}\right)\propto e^{-\frac{1}{2}\left[\left(\mathbf{z}_{s}-\mathbf{h}_{s}\left(\mathbf{x}_{s}\right)\right)^{\top} \Lambda_{s}\left(\mathbf{z}_{s}-\mathbf{h}_{s}\left(\mathbf{x}_{s}\right)\right)\right]} \] We apply the first order Taylor series expansion of non-linear measurement function \(\mathbf{h}_{s}\) to find its approximate value for state values \(\mathbf{x}_{s}\) close to \(\mathbf{x}_{0}\) : \[ \mathbf{h}_{s}\left(\mathbf{x}_{s}\right) \approx \mathbf{h}_{s}\left(\mathbf{x}_{0}\right)+\mathrm{J}_{s}\left(\mathbf{x}_{s}-\mathbf{x}_{0}\right). \] Here \(\mathrm{J}_{s}\) is the Jacobian \(\left.\frac{\partial \mathbf{h}_{s}}{\partial \mathbf{x}_{s}}\right|_{\mathbf{x}_{s}=\mathbf{x}_{0}} .\) With a bit of rearranging quadratic forms, we can massage \(\mathbf{h}_{s}\left(\mathbf{x}_{s}\right)\) around state variables \(\mathbf{x}_{0}\), turning it into a Gaussian factor expressed in terms of \(\mathbf{x}_{s}\). We use the linear factor represented in information form by information vector \(\boldsymbol{\eta}_{s}\) and precision matrix \(\Lambda_{s}^{\prime}\) calculated as follows: \[ \begin{aligned} \boldsymbol{\eta}_{s} &=\mathrm{J}_{s}^{\top} \Lambda_{s}\left[\mathrm{~J}_{s} \mathbf{x}_{0}+\mathbf{z}_{s}-\mathbf{h}_{s}\left(\mathbf{x}_{0}\right)\right] \\ \Lambda_{s}^{\prime} &=\mathrm{J}_{s}^{\top} \Lambda_{s} \mathrm{~J}_{s}. \end{aligned} \]

## Variational inference

Intuition building. Noting that the distance between two Gaussians in KL is given by \[ \KL(\mathcal{N}(\mu_1,\Sigma_1)\parallel \mathcal{N}(\mu_2,\Sigma_2)) ={\frac {1}{2}}\left(\operatorname {tr} \left(\Sigma _{2}^{-1}\Sigma _{1}\right)+(\mu_{2}-\mu_{1})^{\mathsf {T}}\Sigma _{2}^{-1}(\mu_{2}-\mu_{1})-k+\ln \left({\frac {\det \Sigma _{2}}{\det \Sigma _{1}}}\right)\right)\] We can think about belief propagation in terms of KL-similar updates.

Suppose I have a Gaussian RV with some prior distribution \(\vrv{x}\sim \mathcal{N}(\vv{m}_{\vrv{x}}, \Sigma_{\vrv{x}})\) and I observe some datum \(\vrv{y}\sim f(\vrv{x} + \epsilon)\) where \(\epsilon \sim \mathcal{N}(0, \Sigma_{\vrv{y}})\). Define \(\hat{\vrv{x}}:=\vrv{x}\gvn \vrv{y}.\) Assert that we can approximate \(\vrv{x}\sim \mathcal{N}(\vv{m}_{\hat{\vrv{x}}}, \Sigma_{\hat{\vrv{x}}})\). What is the best approximation to the posterior distribution of \(\vrv{x}\) given the observation? I.e. how do we choose \(\vv{m}_{\hat{\vrv{x}}}, \Sigma_{\hat{\vrv{x}}}\)? One way is to choose them to minimise a KL divergence, \[\begin{aligned} \vv{m}_{\hat{\vrv{x}}}, \Sigma_{\hat{\vrv{x}}} &=\argmin_{\vv{m}, \Sigma} \KL\left(\mathcal{N}(\vv{m}, \Sigma)\parallel\vrv{x}\gvn \vrv{y}\right)\\ &=\argmin_{\vv{m}, \Sigma} \KL\left(\mathcal{N}(\vv{m}, \Sigma)\parallel \mathcal{N}(\vv{m}_{\vrv{x}}, \Sigma_{\vrv{x}}\right). \end{aligned}\] We still need to choose a way of estimating this posterior which could exploit stochastic gradient estimation or a GP noise transforms. TBC.

## Parallelisation

El-Kurdi’s thesis (Y. M. El-Kurdi 2014) exploits some assortment of tricks to attain high speed convergence in pde-type models.

## Robust

There are generalised belief propagations based on elliptical laws (Agarwal et al. 2013; Davison and Ortiz 2019), making this into a kind of elliptical belief propagation.
It is usually referred to as a *robust factor* or as *dynamic covariance scaling*.
Usually this is done with robust losses; I’m not sure why other elliptical distributions are unpopular.

## Use in PDEs

Finite Element Models can be expressed as locally-linear constraints and thus expressed using GBP (Y. El-Kurdi et al. 2016; Y. M. El-Kurdi 2014; Y. El-Kurdi et al. 2015).

## Tools

### ceres solver

ceres-solver, (C++), the google least squares solver, seems to solve this kind of problem. I am not sure where the covariance matrices go in. I occasionally see mention of “CUDA” in the source repo so maybe it exploits GPUs these days.

## References

*2013 IEEE International Conference on Robotics and Automation*, 62–69.

*arXiv:2010.08022 [Cs, Math, Stat]*, October.

*IFAC Proceedings Volumes*, 19th IFAC World Congress, 47 (3): 8662–68.

*IEEE Transactions on Signal Processing*56 (5): 1865–77.

*IEEE Transactions on Signal Processing*58 (3): 1035–48.

*IEEE Transactions on Automatic Control*55 (9): 2069–84.

*Computational Geosciences*17 (4): 689–703.

*arXiv:1910.14139 [Cs]*, October.

*arXiv:1310.7320 [Cs, Math, Stat]*, October.

*arXiv:1801.06430 [Cs, Math]*, January.

*Handbook of Mathematical Geosciences: Fifty Years of IAMG*, edited by B.S. Daya Sagar, Qiuming Cheng, and Frits Agterberg, 3–24. Cham: Springer International Publishing.

*Computer Physics Communications*193 (August): 38–48.

*IEEE Transactions on Magnetics*52 (3): 1–4.

*IEEE Transactions on Robotics*22 (6): 1100–1114.

*2020 IEEE International Conference on Robotics and Automation (ICRA)*, 2130–37. Paris, France: IEEE.

*Proceedings of the IEEE*95 (6): 1295–1322.

*Journal of Machine Learning Research*7 (October): 2031–64.

*Machine learning: a probabilistic perspective*. 1 edition. Adaptive computation and machine learning series. Cambridge, MA: MIT Press.

*Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1*, 1404–12. NIPS’14. Cambridge, MA, USA: MIT Press.

*arXiv:2107.02308 [Cs]*, July.

*Foundations and Trends® in Machine Learning*7 (4-5): 311–801.

*IEEE Transactions on Signal Processing*63 (5): 1144–55.

*IEEE Signal Processing Letters*27: 2039–43.

*Neural Computation*13 (10): 2173–2200.

*IEEE Transactions on Signal Processing*66 (11): 2804–17.

## No comments yet. Why not leave one?