Inference where we approximate the density of the posterior variationally. That is, we use cunning tricks to solve an inference problem by optimising over some parameter set, usually one that lets us trade off difficulty against fidelity in some useful way.

This idea is not intrinsically Bayesian (i.e. the density we are approximating need not be a posterior density or the marginal likelihood of the evidence), but much of the hot literature on it is from Bayesians doing probabilistic deep learning, so for concreteness I will assume Bayesian usage here.

This is usually mentioned in contrast to the other main method of approximating such densities: sampling from them, usually via Markov Chain Monte Carlo. In practice the two are related (Salimans, Kingma, and Welling 2015) and nowadays frequently used together (Rezende and Mohamed 2015; Caterini, Doucet, and Sejdinovic 2018).

See also mixture models, probabilistic deep learning, directed graphical models, reparameterization tricks.

## Introduction

The classic introduction seems to be Jordan et al. (1999), which surveys diverse applications of variational calculus to inference. Typical ML uses these days are more specific; an archetypal example would be the variational auto-encoder (Diederik P. Kingma and Welling 2014).

## Philosophical interpretations

John Schulman’s Sending Samples Without Bits-Back is a nifty interpretation of KL variational bounds in terms of coding theory/message sending.

Not grandiose enough? See Karl Friston’s interpretation of variational inference as a principle of cognition.

## In graphical models

## Inference via KL divergence

The most common version uses the KL divergence as its loss. This is mathematically convenient and highly recommended if you can get away with it. See ELBO.
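As a concrete sketch of KL-based variational inference (a toy of my own, not drawn from any of the cited papers): for a conjugate model with prior \(z \sim \mathcal{N}(0,1)\), likelihood \(x \mid z \sim \mathcal{N}(z,1)\) and a Gaussian variational family \(q(z) = \mathcal{N}(\mu, \sigma^2)\), the ELBO has a closed form, and gradient ascent on it recovers the exact posterior \(\mathcal{N}(x/2, 1/2)\):

```python
import math

# Toy conjugate model: z ~ N(0, 1), x | z ~ N(z, 1), one observation x.
# Exact posterior: N(x/2, 1/2), so we can check the variational answer.
# Variational family: q(z) = N(mu, sigma^2); for Gaussians the ELBO
# E_q[log p(x|z)] + E_q[log p(z)] + H[q] is available in closed form.

def elbo(mu, sigma, x):
    loglik = -0.5 * math.log(2 * math.pi) - 0.5 * ((x - mu) ** 2 + sigma ** 2)
    logprior = -0.5 * math.log(2 * math.pi) - 0.5 * (mu ** 2 + sigma ** 2)
    entropy = 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)
    return loglik + logprior + entropy

def fit(x, lr=0.05, steps=2000):
    # Gradient ascent on the ELBO, using its analytic gradients
    # for this toy model (optimising log(sigma) keeps sigma positive).
    mu, log_sigma = 0.0, 0.0
    for _ in range(steps):
        sigma = math.exp(log_sigma)
        mu += lr * (x - 2 * mu)               # dELBO/dmu
        log_sigma += lr * (1 - 2 * sigma**2)  # dELBO/dlog(sigma)
    return mu, math.exp(log_sigma)

mu, sigma = fit(x=1.0)  # approaches the exact posterior N(0.5, sqrt(0.5))
```

In non-conjugate models these expectations have no closed form, and one falls back on Monte Carlo estimates of the ELBO gradient, typically via the reparameterization trick.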

## Mixture models

Mixture models are classic, and for ages seemed to be the default choice of variational approximation. I do not have much use for these.

## Reparameterization trick

See reparameterisation.
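For a minimal sketch of what the trick means in the Gaussian case (my own illustration): sampling \(z = \mu + \sigma \epsilon\) with \(\epsilon \sim \mathcal{N}(0,1)\) makes the sample a deterministic, differentiable function of the variational parameters, so Monte Carlo gradients can pass through the sampling step:

```python
import numpy as np

rng = np.random.default_rng(0)

def reparam_sample(mu, sigma, n, rng):
    eps = rng.standard_normal(n)  # noise from a fixed, parameter-free base
    return mu + sigma * eps       # deterministic, differentiable in mu, sigma

z = reparam_sample(mu=2.0, sigma=0.5, n=200_000, rng=rng)

# Pathwise gradient of E[f(z)] for f(z) = z^2 (a stand-in for an ELBO
# integrand): d/dmu E[z^2] = E[2 z] = 2 mu, estimated from the same draws.
g_mu = (2 * z).mean()  # close to 2 * mu = 4.0
```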

## Autoencoders

## Loss functions

In which probability metric should one approximate the target density? By tradition, and for convenience, we usually use the KL divergence, but it is not ideal, and alternatives are a hot topic.

Ingmar Schuster’s critique of black box loss raises some issues (Ranganath et al. 2016):

It’s called Operator VI as a fancy way to say that one is flexible in constructing how exactly the objective function uses \(\pi, q\) and test functions from some family \(\mathcal{F}\). I completely agree with the motivation: KL-Divergence in the form \(\int q(x) \log \frac{q(x)}{\pi(x)} \mathrm{d}x\) indeed underestimates the variance of \(\pi\) and approximates only one mode. Using KL the other way around, \(\int \pi(x) \log \frac{\pi(x)}{q(x)} \mathrm{d}x\) takes all modes into account, but still tends to underestimate variance.

[…] the authors suggest an objective using what they call the Langevin-Stein Operator which does not make use of the proposal density \(q\) at all but uses test functions exclusively.
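The mode-seeking versus mass-covering contrast is easy to exhibit numerically. In this sketch (a toy of my own, not from the paper) we fit a single Gaussian \(q = \mathcal{N}(m, s^2)\) to a well-separated bimodal target by grid search, once under each direction of the KL divergence:

```python
import numpy as np

# Target: a well-separated bimodal mixture, 0.5 N(-4, 1) + 0.5 N(4, 1).
# Fit a single Gaussian q = N(m, s^2) by grid search in each KL direction.

x = np.linspace(-16, 16, 4001)
dx = x[1] - x[0]

def normal_pdf(x, m, s):
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

target = 0.5 * normal_pdf(x, -4, 1) + 0.5 * normal_pdf(x, 4, 1)

def kl(p, q):
    # KL(p || q) by Riemann sum; clip densities to dodge log(0) underflow.
    p, q = np.maximum(p, 1e-300), np.maximum(q, 1e-300)
    return float(np.sum(p * (np.log(p) - np.log(q))) * dx)

ms, ss = np.linspace(-6, 6, 121), np.linspace(0.5, 6.0, 56)

def best_fit(reverse):
    scores = np.array([[kl(normal_pdf(x, m, s), target) if reverse
                        else kl(target, normal_pdf(x, m, s))
                        for s in ss] for m in ms])
    i, j = np.unravel_index(scores.argmin(), scores.shape)
    return ms[i], ss[j]

m_rev, s_rev = best_fit(reverse=True)    # reverse KL: locks onto one mode
m_fwd, s_fwd = best_fit(reverse=False)   # forward KL: covers both modes
```

Reverse KL parks \(q\) on one of the modes with a narrow scale, while forward KL straddles both modes with a scale near the moment-matched \(\sqrt{17} \approx 4.1\).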

## References

*Advances in Neural Information Processing Systems 29*. http://arxiv.org/abs/1611.07800.

*UAI18*. http://arxiv.org/abs/1803.05649.

*Microsoft Research*, January. https://www.microsoft.com/en-us/research/publication/mixture-density-networks/.

*Journal of the American Statistical Association* 112 (518): 859–77. https://doi.org/10.1080/01621459.2017.1285773.

*Journal of Machine Learning Research* 21 (131): 1–63. http://jmlr.org/papers/v21/19-1015.html.

*Advances in Neural Information Processing Systems*. http://arxiv.org/abs/1805.11328.

*Advances in Neural Information Processing Systems 31*, edited by S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, 6572–83. Curran Associates, Inc. http://papers.nips.cc/paper/7892-neural-ordinary-differential-equations.pdf.

*Advances in Neural Information Processing Systems 28*, edited by C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, 2980–88. Curran Associates, Inc. http://papers.nips.cc/paper/5653-a-recurrent-latent-variable-model-for-sequential-data.pdf.

*PMLR*. http://proceedings.mlr.press/v70/cutajar17a.html.

*Proceedings of ICLR*. http://arxiv.org/abs/1412.6581.

*Advances in Neural Information Processing Systems 29*, edited by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, 2199–2207. Curran Associates, Inc. http://papers.nips.cc/paper/6039-sequential-neural-models-with-stochastic-layers.pdf.

*IEEE Transactions on Pattern Analysis and Machine Intelligence* 27 (9): 1392–1416. https://doi.org/10.1109/TPAMI.2005.169.

*Proceedings of the 24th International Conference on Neural Information Processing Systems*, 2348–56. NIPS’11. USA: Curran Associates Inc. https://papers.nips.cc/paper/4329-practical-variational-inference-for-neural-networks.pdf.

*Advances in Neural Information Processing Systems 28*, edited by C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, 2629–37. Curran Associates, Inc. http://papers.nips.cc/paper/5961-neural-adaptive-sequential-monte-carlo.pdf.

*Proceedings of ICLR*. http://arxiv.org/abs/1901.05534.

*Science* 268 (5214): 1158–61. https://doi.org/10.1126/science.7761831.

*PMLR*, 361–69. http://proceedings.mlr.press/v38/hoffman15.html.

*Learning in Graphical Models*, 163–73. NATO ASI Series. Springer, Dordrecht. https://doi.org/10.1007/978-94-011-5014-9_6.

*Machine Learning* 37 (2): 183–233. https://doi.org/10.1023/A:1007665907178.

*Proceedings of ICLR*. http://arxiv.org/abs/1605.06432.

*Artificial Intelligence and Statistics*, 878–87. PMLR. http://arxiv.org/abs/1703.04265.

*Advances in Neural Information Processing Systems 29*. Curran Associates, Inc. http://arxiv.org/abs/1606.04934.

*ICLR 2014 Conference*. http://arxiv.org/abs/1312.6114.

*Advances in Neural Information Processing Systems 31*, edited by S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, 10236–45. Curran Associates, Inc. http://papers.nips.cc/paper/8224-glow-generative-flow-with-invertible-1x1-convolutions.pdf.

*International Conference on Machine Learning*, 3159–68. http://proceedings.mlr.press/v80/liu18d.html.

*Advances In Neural Information Processing Systems*. http://arxiv.org/abs/1608.04471.

*Advances in Neural Information Processing Systems 30*, edited by I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, 6446–56. Curran Associates, Inc. http://papers.nips.cc/paper/7223-causal-effect-inference-with-deep-latent-variable-models.pdf.

*PMLR*, 2218–27. http://proceedings.mlr.press/v70/louizos17a.html.

*Journal of Computational and Graphical Statistics* 23 (3): 589–615. https://doi.org/10.1080/10618600.2013.810150.

*IEEE Transactions on Knowledge and Data Engineering* 27 (2): 545–57. https://doi.org/10.1109/TKDE.2014.2334326.

*Information Theory, Inference & Learning Algorithms*, Chapter 45. Cambridge University Press. http://www.inference.phy.cam.ac.uk/mackay/itprnn/ps/534.548.pdf.

*Information Theory, Inference & Learning Algorithms*. Cambridge University Press.

*Handbook of Uncertainty Quantification*, edited by Roger Ghanem, David Higdon, and Houman Owhadi, 1–41. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-319-11259-6_23-1.

*Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence*, 362–69. UAI’01. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc. https://dslpitt.org/uai/papers/01/p362-minka.pdf.

*Proceedings of ICML*. http://arxiv.org/abs/1701.05369.

*Advances In Neural Information Processing Systems*. http://arxiv.org/abs/1911.07420.

*Journal of Machine Learning Research* 21 (157): 1–62. http://jmlr.org/papers/v21/19-222.html.

*The American Statistician* 64 (2): 140–53. https://doi.org/10.1198/tast.2010.09058.

*Advances in Neural Information Processing Systems 30*, edited by I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, 2338–47. Curran Associates, Inc. http://papers.nips.cc/paper/6828-masked-autoregressive-flow-for-density-estimation.pdf.

*IEEE Journal of Selected Topics in Signal Processing* 10 (2): 224–41. https://doi.org/10.1109/JSTSP.2015.2496908.

*CVPR*. http://arxiv.org/abs/1803.10586.

*Advances in Neural Information Processing Systems 29*, edited by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, 496–504. Curran Associates, Inc. http://papers.nips.cc/paper/6091-operator-variational-inference.pdf.

*PMLR*, 324–33. http://proceedings.mlr.press/v48/ranganath16.html.

*International Conference on Machine Learning*, 1530–38. ICML’15. Lille, France: JMLR.org. http://arxiv.org/abs/1505.05770.

*Proceedings of ICML*. http://arxiv.org/abs/1401.4082.

*Artificial Intelligence and Statistics*, 800–808. PMLR. http://proceedings.mlr.press/v38/roychowdhury15.html.

*Advances In Neural Information Processing Systems*. http://arxiv.org/abs/1610.02287.

*Proceedings of the 32nd International Conference on Machine Learning (ICML-15)*, 1218–26. ICML’15. Lille, France: JMLR.org. http://proceedings.mlr.press/v37/salimans15.html.

*Theory of Statistics*. Springer Series in Statistics. New York, NY: Springer Science & Business Media. https://doi.org/10.1007/978-1-4612-4250-5_1.

*Journal of Machine Learning Research* 19 (66): 2639–709. http://arxiv.org/abs/1703.06131.

*Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32*, II-1971-II-1980. ICML’14. Beijing, China: JMLR.org. http://proceedings.mlr.press/v32/titsias14.html.

*Graphical Models, Exponential Families, and Variational Inference*. Vol. 1. Foundations and Trends® in Machine Learning. Now Publishers. https://doi.org/10.1561/2200000001.

*New Directions in Statistical Signal Processing*. Vol. 155. MIT Press.

*Journal of the American Statistical Association* 112 (517): 137–68. https://doi.org/10.1080/01621459.2016.1197833.

*Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence*, 626–33. UAI ’00. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc. http://arxiv.org/abs/1301.3901.

*Journal of Machine Learning Research*, 661–94. http://johnwinn.org/Publications/papers/VMP2005.pdf.

*Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence*, 583–91. UAI’03. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc. http://arxiv.org/abs/1212.2512.

*Journal of Machine Learning Research* 11: 1771–98. http://www.jmlr.org/papers/v11/yoshida10a.html.