# Variational inference

On fitting the best model one can be bothered to

March 22, 2016 — May 24, 2020

Inference where we approximate the density of the posterior variationally. That is, we use cunning tricks to solve an inference problem by optimising over some parameter set, usually one that allows us to trade off difficulty for fidelity in some useful way.

This idea is not intrinsically Bayesian (i.e. the density we are approximating need not be a posterior density or the marginal likelihood of the evidence), but much of the hot literature on it is from Bayesians doing something fashionable in probabilistic deep learning, so for concreteness I will assume Bayesian uses here.

This is usually mentioned in contrast to the other main method of approximating such densities: sampling from them, usually using Markov Chain Monte Carlo. In practice the two are related (Salimans, Kingma, and Welling 2015) and nowadays even used together (Rezende and Mohamed 2015; Caterini, Doucet, and Sejdinovic 2018).

Once we have decided we are happy to use variational approximations, we are left with the question of … how? There are, AFAICT, two main schools of thought here. One is methods which leverage the graphical structure of the problem and maintain structural hygiene, which use variational message passing.

## 1 Introduction

The classic intro seems to be Jordan et al. (1999), which considers diverse types of variational calculus applications and inference. Typical ML uses these days are more specific; an archetypal example would be the variational auto-encoder (Kingma and Welling 2014).

## 2 Inference via KL divergence

The most common version uses KL loss to construct the famous Evidence Lower Bound Objective. This is mathematically convenient and highly recommended if you can get away with it.
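For concreteness, here is a sketch of where that bound comes from. The notation is my assumption rather than anything fixed above: \(y\) is the observed data, \(z\) the latent variables, \(p(y, z)\) the joint density and \(q(z)\) the variational approximation.

\[
\begin{aligned}
\log p(y)
&= \mathbb{E}_{q(z)}\left[\log \frac{p(y, z)}{q(z)}\right]
 + \mathbb{E}_{q(z)}\left[\log \frac{q(z)}{p(z \mid y)}\right] \\
&= \underbrace{\mathbb{E}_{q(z)}\left[\log p(y, z) - \log q(z)\right]}_{\mathrm{ELBO}(q)}
 + \operatorname{KL}\left(q(z) \,\|\, p(z \mid y)\right) \\
&\geq \mathrm{ELBO}(q).
\end{aligned}
\]

Since \(\log p(y)\) does not depend on \(q\), maximising the ELBO over \(q\) is the same as minimising the KL divergence from \(q\) to the posterior, and we never need to evaluate the intractable normalising constant.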

## 3 Other loss functions

In which probability metric should one approximate the target density? For tradition and convenience, we usually use KL-loss, but this is not ideal, and alternatives are hot topics. There are simple ones, such as “reverse KL”, which is sometimes how we justify expectation propagation and also the modest generalisation to Rényi-divergence inference (Li and Turner 2016).
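To fix ideas, these are the divergences in question, written for a target density \(\pi\) and an approximation \(q\) over a variable \(x\) (notation assumed here for illustration):

\[
\begin{aligned}
\operatorname{KL}(q \,\|\, \pi) &= \int q(x) \log \frac{q(x)}{\pi(x)} \,\mathrm{d}x, \\
\operatorname{KL}(\pi \,\|\, q) &= \int \pi(x) \log \frac{\pi(x)}{q(x)} \,\mathrm{d}x, \\
D_{\alpha}(q \,\|\, \pi) &= \frac{1}{\alpha - 1} \log \int q(x)^{\alpha} \, \pi(x)^{1 - \alpha} \,\mathrm{d}x,
\qquad \alpha > 0,\ \alpha \neq 1.
\end{aligned}
\]

The Rényi divergence \(D_{\alpha}\) recovers \(\operatorname{KL}(q \,\|\, \pi)\) in the limit \(\alpha \to 1\), which is the sense in which it is a modest generalisation.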

Ingmar Schuster’s critique of black box loss (Ranganath et al. 2016) raises some issues:

It’s called Operator VI as a fancy way to say that one is flexible in constructing how exactly the objective function uses \(\pi, q\) and test functions from some family \(\mathcal{F}\). I completely agree with the motivation: KL-Divergence in the form \(\int q(x) \log \frac{q(x)}{\pi(x)} \mathrm{d}x\) indeed underestimates the variance of \(\pi\) and approximates only one mode. Using KL the other way around, \(\int \pi(x) \log \frac{\pi(x)}{q(x)} \mathrm{d}x\) takes all modes into account, but still tends to underestimate variance.

[…] the authors suggest an objective using what they call the Langevin-Stein Operator which does not make use of the proposal density \(q\) at all but uses test functions exclusively.

## 4 Philosophical interpretations

John Schulman’s Sending Samples Without Bits-Back is a nifty interpretation of KL variational bounds in terms of coding theory/message sending.

Not grandiose enough? See Karl Friston’s interpretation of variational inference as a principle of cognition.

## 5 In graphical models

## 6 Mean-field assumption

TODO: mention the importance of this for classic-flavoured variational inference (Mean Field Variational Bayes). This confused me for aaaaages. AFAICT this is a problem of history. Not all variational inference makes the confusingly-named “mean-field” assumption, but for a long while that was the only game in town, so tutorials of a certain vintage take mean-field variational inference as a synonym for variational inference. If I have just learnt some non-mean-field SVI methods from a recent NeurIPS paper and then run into this, I might well be confused.
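For reference, a sketch of what the assumption says, in notation I am assuming for illustration (\(y\) observed, \(z = (z_1, \dots, z_J)\) latent). The approximating family is restricted to densities that factorise over blocks of latent variables:

\[
q(z) = \prod_{j=1}^{J} q_j(z_j).
\]

Under this restriction, the coordinate-wise optimum for each factor, holding the others fixed, has the well-known form

\[
\log q_j^{*}(z_j) = \mathbb{E}_{q_{-j}}\left[\log p(y, z)\right] + \text{const},
\]

where \(\mathbb{E}_{q_{-j}}\) is the expectation over all factors except the \(j\)-th. Cycling through these updates is the classic coordinate-ascent scheme; nothing about variational inference in general requires the factorisation.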

## 7 Mixture models

Mixture models are classic and, for ages, seemed to be the default choice for variational approximation. They are an interesting trick to make a graphical model conditionally conjugate by use of auxiliary variables.

## 8 Reparameterization trick

See reparameterisation.
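In the meantime, here is a minimal sketch of the idea on a toy problem, not a definitive implementation. Everything in it is an assumption chosen for illustration: the target is an unnormalised Gaussian with mean 3 and standard deviation 2, the approximation is \(q = \mathcal{N}(m, e^{2\rho})\), and the names `grad_log_target` and `fit_gaussian_vi` are hypothetical. We write a draw from \(q\) as \(z = m + e^{\rho}\varepsilon\) with \(\varepsilon \sim \mathcal{N}(0, 1)\), so that gradients of the ELBO with respect to \(m\) and \(\rho\) pass through the sample.

```python
import math
import random


def grad_log_target(z, mu=3.0, sigma=2.0):
    """Derivative in z of the unnormalised log target, here log N(z; mu, sigma^2)."""
    return -(z - mu) / sigma**2


def fit_gaussian_vi(n_steps=2000, n_samples=64, lr=0.05, seed=0):
    """Stochastic gradient ascent on the ELBO for q = N(m, exp(rho)^2)."""
    rng = random.Random(seed)
    m, rho = 0.0, 0.0  # start from a standard normal
    for _ in range(n_steps):
        s = math.exp(rho)
        grad_m, grad_rho = 0.0, 0.0
        for _ in range(n_samples):
            eps = rng.gauss(0.0, 1.0)  # noise that does not depend on (m, rho)
            z = m + s * eps            # reparameterised draw from q
            g = grad_log_target(z)
            grad_m += g / n_samples              # dz/dm = 1
            grad_rho += g * s * eps / n_samples  # dz/drho = s * eps
        grad_rho += 1.0  # gradient of the Gaussian entropy, rho + const
        m += lr * grad_m
        rho += lr * grad_rho
    return m, math.exp(rho)


m_hat, s_hat = fit_gaussian_vi()
print(f"fitted mean {m_hat:.2f}, fitted sd {s_hat:.2f}")
```

Because the target is itself Gaussian, the fitted mean and standard deviation should land close to 3 and 2, up to Monte Carlo noise. Swapping in a different `grad_log_target` is the only change needed to try another (differentiable) target.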

## 9 Autoencoders

## 10 Stochastic

## 11 References

*Advances in Neural Information Processing Systems 29*.

*arXiv:1511.07367 [Stat]*.

*Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence*. UAI’99.

*arXiv:1707.01069 [Cs, Stat]*.

*Microsoft Research*.

*Journal of the American Statistical Association*.

*Journal of Machine Learning Research*.

*Advances in Neural Information Processing Systems*.

*Advances in Neural Information Processing Systems 31*.

*Advances in Neural Information Processing Systems 28*.

*PMLR*.

*arXiv:2103.01085 [Cs, Stat]*.

*arXiv:1801.10395 [Stat]*.

*Proceedings of ICLR*.

*arXiv:1704.04110 [Cs, Stat]*.

*arXiv:1704.02798 [Cs, Stat]*.

*Advances in Neural Information Processing Systems 29*.

*IEEE Transactions on Pattern Analysis and Machine Intelligence*.

*arXiv:1710.06595 [Stat]*.

*arXiv:1402.1412 [Stat]*.

*Entropy*.

*arXiv:1709.02536 [Stat]*.

*arXiv:1810.01367 [Cs, Stat]*.

*Proceedings of the 24th International Conference on Neural Information Processing Systems*. NIPS’11.

*Advances in Neural Information Processing Systems 28*.

*arXiv:1704.00028 [Cs, Stat]*.

*Proceedings of ICLR*.

*Science*.

*PMLR*.

*arXiv:1206.7051 [Cs, Stat]*.

*arXiv:1804.00779 [Cs, Stat]*.

*arXiv:1910.04102 [Cs, Math, Stat]*.

*Learning in Graphical Models*. NATO ASI Series.

*Machine Learning*.

*Proceedings of ICLR*.

*Artificial Intelligence and Statistics*.

*Advances in Neural Information Processing Systems 31*.

*Advances in Neural Information Processing Systems 29*.

*Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2*. NIPS’15.

*ICLR 2014 Conference*.

*Journal of Machine Learning Research*.

*arXiv:1512.09300 [Cs, Stat]*.

*Advances in Neural Information Processing Systems*.

*International Conference on Machine Learning*.

*Advances in Neural Information Processing Systems*.

*Advances in Neural Information Processing Systems 30*.

*arXiv Preprint arXiv:1603.04733*.

*PMLR*.

*IEEE Transactions on Knowledge and Data Engineering*.

*Journal of Computational and Graphical Statistics*.

*Information Theory, Inference & Learning Algorithms*.

*Information Theory, Inference & Learning Algorithms*.

*arXiv Preprint arXiv:1705.09279*.

*arXiv:1906.03317 [Cs, Math, Stat]*.

*Handbook of Uncertainty Quantification*.

*Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence*. UAI’01.

*Proceedings of ICML*.

*Advances in Neural Information Processing Systems*.

*Journal of Machine Learning Research*.

*The American Statistician*.

*Advances in Neural Information Processing Systems 30*.

*IEEE Journal of Selected Topics in Signal Processing*.

*CVPR*.

*Advances in Neural Information Processing Systems 29*.

*PMLR*.

*International Conference on Machine Learning*. ICML’15.

*Proceedings of ICML*.

*Artificial Intelligence and Statistics*.

*Advances in Neural Information Processing Systems*.

*arXiv:1802.03335 [Stat]*.

*Proceedings of the 32nd International Conference on Machine Learning (ICML-15)*. ICML’15.

*Theory of Statistics*. Springer Series in Statistics.

*Journal of Machine Learning Research*.

*arXiv:1212.4507 [Cs, Stat]*.

*Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32*. ICML’14.

*arXiv:1809.10756 [Cs, Stat]*.

*UAI18*.

*New Directions in Statistical Signal Processing*.

*Graphical Models, Exponential Families, and Variational Inference*. Foundations and Trends® in Machine Learning.

*Journal of the American Statistical Association*.

*arXiv:1705.03439 [Cs, Math, Stat]*.

*Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence*. UAI ’00.

*arXiv:1301.1299 [Cs, Stat]*.

*Journal of Machine Learning Research*.

*Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence*. UAI’03.

*Journal of Machine Learning Research*.

*arXiv:1801.07922 [Math]*.