# Variational inference

On fitting the best model one can be bothered to

March 22, 2016 — May 24, 2020

approximation
metrics
optimization
probabilistic algorithms
probability
statistics

Inference where we approximate the density of the posterior variationally. That is, we use cunning tricks to solve an inference problem by optimising over some parameter set, usually one that allows us to trade off difficulty for fidelity in some useful way.

This idea is not intrinsically Bayesian (i.e. the density we are approximating need not be a posterior density or the marginal likelihood of the evidence), but much of the hot literature on it is from Bayesians doing something fashionable in probabilistic deep learning, so for concreteness I will assume Bayesian uses here.

This is usually mentioned in contrast to the other main method of approximating such densities: sampling from them, usually using Markov Chain Monte Carlo. In practice the two are related and nowadays even used together.

Once we have decided we are happy to use variational approximations, we are left with the question of … how? There are, AFAICT, two main schools of thought here: methods which leverage the graphical structure of the problem and maintain structural hygiene, which use variational message passing

## 1 Introduction

The classic intro seems to be , which considers diverse types of variational calculus applications and inference. Typical ML uses these days are more specific; an archetypal example would be the variational auto-encoder.

## 2 Inference via KL divergence

The most common version uses KL loss to construct the famous Evidence Lower Bound Objective. This is mathematically convenient and highly recommended if you can get away with it.
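
To make the bound concrete, here is a minimal sketch on a toy model where everything is available in closed form. The model (prior $$\theta \sim \mathcal{N}(0, 1)$$, observations $$x_i \mid \theta \sim \mathcal{N}(\theta, 1)$$), the Gaussian variational family $$q(\theta) = \mathcal{N}(m, s^2)$$, the data and all function names are assumptions made for illustration, not anything from the literature cited here. It checks the standard identity $$\log p(x) = \mathrm{ELBO}(q) + \mathrm{KL}(q \,\|\, p(\theta \mid x))$$ numerically.

```python
import math


def elbo(m, s, xs):
    """Closed-form ELBO for q(theta) = N(m, s^2) under the toy model
    theta ~ N(0, 1), x_i | theta ~ N(theta, 1)."""
    # E_q[log p(x | theta)], using E_q[(x_i - theta)^2] = (x_i - m)^2 + s^2.
    exp_loglik = sum(
        -0.5 * math.log(2 * math.pi) - 0.5 * ((x - m) ** 2 + s ** 2) for x in xs
    )
    # E_q[log p(theta)] under the standard normal prior.
    exp_logprior = -0.5 * math.log(2 * math.pi) - 0.5 * (m ** 2 + s ** 2)
    # Differential entropy of the Gaussian q.
    entropy = 0.5 * math.log(2 * math.pi * math.e * s ** 2)
    return exp_loglik + exp_logprior + entropy


def log_evidence(xs):
    """Exact log marginal likelihood: x ~ N(0, I + 1 1^T) for this model."""
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    quad = sxx - sx ** 2 / (n + 1)  # x^T (I + 1 1^T)^{-1} x
    return -0.5 * n * math.log(2 * math.pi) - 0.5 * math.log(n + 1) - 0.5 * quad


def kl_gauss(m, s, mu, sigma):
    """KL( N(m, s^2) || N(mu, sigma^2) )."""
    return math.log(sigma / s) + (s ** 2 + (m - mu) ** 2) / (2 * sigma ** 2) - 0.5


xs = [0.8, 1.3, -0.2, 2.1, 0.9]  # made-up observations
n = len(xs)
post_mean = sum(xs) / (n + 1)      # exact posterior mean
post_sd = math.sqrt(1 / (n + 1))   # exact posterior standard deviation

# The bound is tight at the exact posterior and loose elsewhere;
# the slack is exactly the KL divergence from q to the posterior.
gap_at_posterior = log_evidence(xs) - elbo(post_mean, post_sd, xs)
gap_at_prior = log_evidence(xs) - elbo(0.0, 1.0, xs)
kl_prior_to_posterior = kl_gauss(0.0, 1.0, post_mean, post_sd)
```

In this conjugate case we could simply write down the posterior, so the optimisation is pointless; the sketch only illustrates why maximising the ELBO over $$(m, s)$$ amounts to minimising the KL divergence to the posterior when the evidence is unavailable.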

### 2.1 Implicit

Implicit VI is a special case of variational loss apparently.

## 3 Other loss functions

In which probability metric should one approximate the target density? For tradition and convenience, we usually use KL-loss, but this is not ideal, and alternatives are hot topics. There are simple ones, such as “reverse KL”, which is sometimes how we justify expectation propagation, and also the modest generalisation to Rényi-divergence inference.
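
For reference, the Rényi divergence of order $$\alpha$$ between two densities is usually defined as follows. The symbols $$p$$, $$q$$ and $$\alpha$$ are generic notation introduced here, and which of the two densities plays the role of the approximation varies between papers.

$$
D_{\alpha}(p \,\|\, q) = \frac{1}{\alpha - 1} \log \int p(x)^{\alpha}\, q(x)^{1 - \alpha}\, \mathrm{d}x, \qquad \alpha > 0,\ \alpha \neq 1.
$$

Taking the limit $$\alpha \to 1$$ recovers the KL divergence,

$$
\lim_{\alpha \to 1} D_{\alpha}(p \,\|\, q) = \int p(x) \log \frac{p(x)}{q(x)}\, \mathrm{d}x,
$$

which is the sense in which this is a modest generalisation.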

Ingmar Schuster’s critique of black box loss raises some issues:

> It’s called Operator VI as a fancy way to say that one is flexible in constructing how exactly the objective function uses $$\pi, q$$ and test functions from some family $$\mathcal{F}$$. I completely agree with the motivation: KL-Divergence in the form $$\int q(x) \log \frac{q(x)}{\pi(x)} \mathrm{d}x$$ indeed underestimates the variance of $$\pi$$ and approximates only one mode. Using KL the other way around, $$\int \pi(x) \log \frac{\pi(x)}{q(x)} \mathrm{d}x$$ takes all modes into account, but still tends to underestimate variance.
>
> […] the authors suggest an objective using what they call the Langevin-Stein Operator which does not make use of the proposal density $$q$$ at all but uses test functions exclusively.
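
The difference between the two directions of KL can be seen in a small numerical experiment. This is a rough sketch under assumptions of my own choosing: the target $$\pi$$ is a made-up equal mixture of $$\mathcal{N}(-3, 1)$$ and $$\mathcal{N}(3, 1)$$, the approximation $$q$$ is a single Gaussian, integrals are Riemann sums on a fixed grid, and the minimiser of $$\int q \log (q/\pi)$$ is found by brute-force search over a hypothetical candidate grid. For a Gaussian $$q$$, minimising $$\int \pi \log (\pi/q)$$ reduces to matching the mean and variance of $$\pi$$, so that direction needs no search.

```python
import math


def log_normal(x, mu, sigma):
    """Log density of N(mu, sigma^2) at x."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)


def target(x):
    """Hypothetical bimodal target: equal mixture of N(-3, 1) and N(3, 1)."""
    return 0.5 * math.exp(log_normal(x, -3.0, 1.0)) + 0.5 * math.exp(log_normal(x, 3.0, 1.0))


DX = 0.1
GRID = [-12.0 + DX * i for i in range(241)]      # integration grid on [-12, 12]
LOG_PI = [math.log(target(x)) for x in GRID]     # target log density on the grid


def kl_q_to_pi(mu, sigma):
    """Riemann-sum estimate of int q(x) log(q(x) / pi(x)) dx for q = N(mu, sigma^2)."""
    total = 0.0
    for x, log_pi in zip(GRID, LOG_PI):
        log_q = log_normal(x, mu, sigma)
        total += math.exp(log_q) * (log_q - log_pi)
    return total * DX


# Brute-force search over candidate (mean, standard deviation) pairs.
candidates = [(-5.0 + 0.25 * i, 0.5 + 0.1 * j) for i in range(41) for j in range(36)]
mu_rev, sigma_rev = min(candidates, key=lambda c: kl_q_to_pi(*c))

# Minimising int pi log(pi / q) over Gaussians q is moment matching.
mean_fwd = sum(x * math.exp(lp) for x, lp in zip(GRID, LOG_PI)) * DX
var_fwd = sum((x - mean_fwd) ** 2 * math.exp(lp) for x, lp in zip(GRID, LOG_PI)) * DX

print("q-to-pi fit:", mu_rev, sigma_rev)   # sits on one of the two modes
print("pi-to-q fit:", mean_fwd, var_fwd)   # straddles both modes
```

With these settings the first fit lands on a single mode with roughly unit standard deviation, while the moment-matched fit is centred between the modes with variance close to 10, the variance of the mixture itself.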

## 4 “Generalized”

Variational inference, but where the variational loss is not KL.

## 5 Philosophical interpretations

John Schulman’s Sending Samples Without Bits-Back is a nifty interpretation of KL variational bounds in terms of coding theory/message sending.

Not grandiose enough? See Karl Friston’s interpretation of variational inference as a principle of cognition.

## 7 Mean-field assumption

TODO: mention the importance of this for classic-flavoured variational inference (Mean Field Variational Bayes). This confused me for aaaaages. AFAICT this is a problem of history. Not all variational inference makes the confusingly-named “mean-field” assumption, but for a long while that was the only game in town, so tutorials of a certain vintage take mean-field variational inference as a synonym for variational inference. If I have just learnt some non-mean-field SVI methods from a recent NeurIPS paper, and then I run into this, I might well be confused.
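
For orientation, the mean-field assumption is that the approximating density factorises across blocks of variables, $$q(z) = \prod_i q_i(z_i)$$, and the classic algorithm updates one factor at a time while holding the others fixed. Below is a minimal sketch of that coordinate-ascent scheme for the textbook case of a correlated bivariate Gaussian target, where each optimal factor is Gaussian with variance $$1/\Lambda_{ii}$$ (with $$\Lambda$$ the target precision matrix) and only the means need iterating. The target mean, the correlation of 0.9 and the variable names are all made up for illustration.

```python
# Hypothetical target: bivariate Gaussian with unit marginal variances
# and correlation rho, i.e. covariance [[1, rho], [rho, 1]].
rho = 0.9
mu = [1.0, -1.0]

# Precision matrix Lambda = Sigma^{-1} of the target.
det = 1.0 - rho ** 2
lam = [[1.0 / det, -rho / det],
       [-rho / det, 1.0 / det]]

# Mean-field family: q(z1, z2) = q1(z1) * q2(z2), each factor Gaussian.
m = [0.0, 0.0]  # initial factor means
for _ in range(200):
    # Coordinate update for q1 with q2 held fixed, then vice versa.
    m[0] = mu[0] - (lam[0][1] / lam[0][0]) * (m[1] - mu[1])
    m[1] = mu[1] - (lam[1][0] / lam[1][1]) * (m[0] - mu[0])

mf_var = [1.0 / lam[0][0], 1.0 / lam[1][1]]  # variances of the optimal factors
true_var = [1.0, 1.0]                        # exact marginal variances

print("mean-field means:", m)
print("mean-field variances:", mf_var, "vs true marginal variances:", true_var)
```

The factor means converge to the true mean, but each factor variance is $$1 - \rho^2 = 0.19$$ rather than the true marginal variance of 1: the factorised approximation cannot represent the correlation and compensates by being too narrow.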

## 8 Mixture models

Mixture models are classic and, for ages, seemed to be the default choice for variational approximation. They are an interesting trick to make a graphical model conditionally conjugate by use of auxiliary variables.

## 12 References

Abbasnejad, Dick, and Hengel. 2016. In Advances in Neural Information Processing Systems 29.
Archer, Park, Buesing, et al. 2015. arXiv:1511.07367 [Stat].
Attias. 1999. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence. UAI’99.
Bamler, and Mandt. 2017. arXiv:1707.01069 [Cs, Stat].
Bishop. 1994. Microsoft Research.
Blei, Kucukelbir, and McAuliffe. 2017. Journal of the American Statistical Association.
Burt, Rasmussen, and Wilk. 2020. Journal of Machine Learning Research.
Caterini, Doucet, and Sejdinovic. 2018. In Advances in Neural Information Processing Systems.
Chen, Rubanova, Bettencourt, et al. 2018. In Advances in Neural Information Processing Systems 31.
Chung, Kastner, Dinh, et al. 2015. In Advances in Neural Information Processing Systems 28.
Cutajar, Bonilla, Michiardi, et al. 2017. In PMLR.
Dhaka, and Catalina. 2020. “Robust, Accurate Stochastic Optimization for Variational Inference.”
Dhaka, Catalina, Welandawe, et al. 2021. arXiv:2103.01085 [Cs, Stat].
Doerr, Daniel, Schiegg, et al. 2018. arXiv:1801.10395 [Stat].
Fabius, and van Amersfoort. 2014. In Proceedings of ICLR.
Flunkert, Salinas, and Gasthaus. 2017. arXiv:1704.04110 [Cs, Stat].
Fortunato, Blundell, and Vinyals. 2017. arXiv:1704.02798 [Cs, Stat].
Fraccaro, Sønderby, Paquet, et al. 2016. In Advances in Neural Information Processing Systems 29.
Frey, and Jojic. 2005. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Futami, Sato, and Sugiyama. 2017. arXiv:1710.06595 [Stat].
Gagen, and Nemoto. 2006. “Variational Optimization of Probability Measure Spaces Resolves the Chain Store Paradox.”
Gal, and van der Wilk. 2014. arXiv:1402.1412 [Stat].
Galy-Fajou, Perrone, and Opper. 2021. Entropy.
Giordano, Broderick, and Jordan. 2017. arXiv:1709.02536 [Stat].
Grathwohl, Chen, Bettencourt, et al. 2018. arXiv:1810.01367 [Cs, Stat].
Graves. 2011. In Proceedings of the 24th International Conference on Neural Information Processing Systems. NIPS’11.
Gu, Ghahramani, and Turner. 2015. In Advances in Neural Information Processing Systems 28.
Gulrajani, Ahmed, Arjovsky, et al. 2017. arXiv:1704.00028 [Cs, Stat].
He, Spokoyny, Neubig, et al. 2019. In Proceedings of ICLR.
Hinton. 1995. Science.
Hoffman, Matthew, and Blei. 2015. In PMLR.
Hoffman, Matt, Blei, Wang, et al. 2013. arXiv:1206.7051 [Cs, Stat].
Huang, Krueger, Lacoste, et al. 2018. arXiv:1804.00779 [Cs, Stat].
Huggins, Kasprzak, Campbell, et al. 2019. arXiv:1910.04102 [Cs, Math, Stat].
Jaakkola, and Jordan. 1998. In Learning in Graphical Models. NATO ASI Series.
Jordan, Ghahramani, Jaakkola, et al. 1999. Machine Learning.
Karl, Soelch, Bayer, et al. 2016. In Proceedings of ICLR.
Khan, and Lin. 2017. In Artificial Intelligence and Statistics.
Kingma, Diederik P. 2017.
Kingma, Durk P, and Dhariwal. 2018. In Advances in Neural Information Processing Systems 31.
Kingma, Diederik P., Salimans, Jozefowicz, et al. 2016. In Advances in Neural Information Processing Systems 29.
Kingma, Diederik P., Salimans, and Welling. 2015. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2. NIPS’15.
Kingma, Diederik P., and Welling. 2014. In ICLR 2014 Conference.
Knoblauch, Jewson, and Damoulas. 2022. “An Optimization-Centric View on Bayes’ Rule: Reviewing and Generalizing Variational Inference.” Journal of Machine Learning Research.
Larsen, Sønderby, Larochelle, et al. 2015. arXiv:1512.09300 [Cs, Stat].
Leibfried, Dutordoir, John, et al. 2022.
Li, and Turner. 2016. In Advances in Neural Information Processing Systems.
Liu, Huidong, Gu, and Samaras. 2018. In International Conference on Machine Learning.
Liu, Qiang, and Wang. 2019. In Advances In Neural Information Processing Systems.
Louizos, Shalit, Mooij, et al. 2017. In Advances in Neural Information Processing Systems 30.
Louizos, and Welling. 2016. In arXiv Preprint arXiv:1603.04733.
———. 2017. In PMLR.
Luts, Jan. 2015. IEEE Transactions on Knowledge and Data Engineering.
Luts, J., Broderick, and Wand. 2014. Journal of Computational and Graphical Statistics.
MacKay. 2002a. In Information Theory, Inference & Learning Algorithms.
———. 2002b. Information Theory, Inference & Learning Algorithms.
Maddison, Lawson, Tucker, et al. 2017. arXiv Preprint arXiv:1705.09279.
Mahdian, Blanchet, and Glynn. 2019. arXiv:1906.03317 [Cs, Math, Stat].
Marzouk, Moselhy, Parno, et al. 2016. In Handbook of Uncertainty Quantification.
Matthews. 2017.
Minka. 2001. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence. UAI’01.
Molchanov, Ashukha, and Vetrov. 2017. In Proceedings of ICML.
Ng, Zhu, Chen, et al. 2019. In Advances In Neural Information Processing Systems.
Nolan, Menictas, and Wand. 2020. Journal of Machine Learning Research.
Ormerod, and Wand. 2010. The American Statistician.
Papamakarios, Murray, and Pavlakou. 2017. In Advances in Neural Information Processing Systems 30.
Pereyra, Schniter, Chouzenoux, et al. 2016. IEEE Journal of Selected Topics in Signal Processing.
Plötz, Wannenwetsch, and Roth. 2018. In CVPR.
Ranganath, Tran, Altosaar, et al. 2016. In Advances in Neural Information Processing Systems 29.
Ranganath, Tran, and Blei. 2016. In PMLR.
Rezende, and Mohamed. 2015. In International Conference on Machine Learning. ICML’15.
Rezende, Mohamed, and Wierstra. 2015. In Proceedings of ICML.
Roychowdhury, and Kulis. 2015. In Artificial Intelligence and Statistics.
Ruiz, Titsias, and Blei. 2016. In Advances In Neural Information Processing Systems.
Ryder, Golightly, McGough, et al. 2018. arXiv:1802.03335 [Stat].
Salimans, Kingma, and Welling. 2015. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15). ICML’15.
Schervish. 2012. Theory of Statistics. Springer Series in Statistics.
Spantini, Bigoni, and Marzouk. 2017. Journal of Machine Learning Research.
Staines, and Barber. 2012.
Titsias, and Lázaro-Gredilla. 2014. In Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32. ICML’14.
Ullrich. 2020.
van de Meent, Paige, Yang, et al. 2021. arXiv:1809.10756 [Cs, Stat].
van den Berg, Hasenclever, Tomczak, et al. 2018. In UAI18.
Wainwright, Martin, and Jordan. 2005. “A Variational Principle for Graphical Models.” In New Directions in Statistical Signal Processing.
Wainwright, Martin J., and Jordan. 2008. Graphical Models, Exponential Families, and Variational Inference. Foundations and Trends® in Machine Learning.
Wand. 2017. Journal of the American Statistical Association.
Wang, and Blei. 2017. arXiv:1705.03439 [Cs, Math, Stat].
Wiegerinck. 2000. In Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence. UAI ’00.
Wingate, and Weber. 2013. arXiv:1301.1299 [Cs, Stat].
Winn, and Bishop. 2005. In Journal of Machine Learning Research.
Xing, Jordan, and Russell. 2003. In Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence. UAI’03.
Yao, Vehtari, Simpson, et al. n.d. “Yes, but Did It Work?: Evaluating Variational Inference.”
Yoshida, and West. 2010. Journal of Machine Learning Research.
Zahm, Constantine, Prieur, et al. 2018. arXiv:1801.07922 [Math].
Zhang, Liu, Chen, et al. 2022.