Variational inference

On fitting the best model one can be bothered to

Approximating the intractable measure (right) with a transformation of a tractable one (left)

Inference where we approximate the density of the posterior variationally. That is, we use cunning tricks to solve an inference problem by optimising over some parameter set, usually one that allows us to trade off difficulty for fidelity in some useful way.

This idea is not intrinsically Bayesian (i.e. the density we are approximating need not be a posterior density or the marginal likelihood of the evidence), but much of the hot literature on it is from Bayesians doing probabilistic deep learning, so for concreteness I will assume Bayesian uses here.

This is usually mentioned in contrast to the other main method of approximating such densities: sampling from them, usually via Markov chain Monte Carlo. In practice the two are related (Salimans, Kingma, and Welling 2015) and nowadays frequently used together (Rezende and Mohamed 2015; Caterini, Doucet, and Sejdinovic 2018).

See also mixture models, probabilistic deep learning, directed graphical models, reparameterization tricks.


The classic intro seems to be (Jordan et al. 1999), which considers diverse types of variational calculus applications and inference. Typical ML uses these days are more specific; an archetypal example would be the variational auto-encoder (Kingma and Welling 2014).

Philosophical interpretations

John Schulman’s Sending Samples Without Bits-Back is a nifty interpretation of KL variational bounds in terms of coding theory/message sending.

Not grandiose enough? See Karl Friston’s interpretation of variational inference as a principle of cognition.

Inference via KL divergence

Practically we often want a variational approximation of the marginal (log-)likelihood \(\log p_{\theta}(\mathbf{x})\) for some probabilistic model with observations \(\mathbf{x},\) unobserved latent factors \(\mathbf{z}\) and model parameters \(\theta.\)

\[\begin{aligned} \log p_{\theta}(\mathbf{x}) &=\log \int p_{\theta}(\mathbf{x} | \mathbf{z}) p(\mathbf{z}) d \mathbf{z} \\ &=\log \int \frac{q_{\phi}(\mathbf{z} | \mathbf{x})}{q_{\phi}(\mathbf{z} | \mathbf{x})} p_{\theta}(\mathbf{x} | \mathbf{z}) p(\mathbf{z}) d \mathbf{z} \\ &\geq-\mathbb{D}_{KL}\left[q_{\phi}(\mathbf{z} | \mathbf{x}) \| p(\mathbf{z})\right]+\mathbb{E}_{q}\left[\log p_{\theta}(\mathbf{x} | \mathbf{z})\right]\\ &=-\mathcal{F}(\mathbf{x}) \end{aligned}\]

\(\mathcal{F}\) is called the free energy; the inequality above is Jensen’s. Its negation \(-\mathcal{F}(\mathbf{x})\) is the evidence lower bound (ELBO), which we maximise over the variational parameters \(\phi\) (and typically also the model parameters \(\theta\)).
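To make the bound concrete, here is a minimal numerical sketch with a toy model of my own choosing (not from any of the cited papers): prior \(z \sim \mathcal{N}(0,1)\), likelihood \(x \mid z \sim \mathcal{N}(z,1)\), and a Gaussian variational posterior \(q_{\phi}(z \mid x) = \mathcal{N}(\mu, \sigma^2)\). The KL term is available in closed form; the reconstruction term is estimated by Monte Carlo.

```python
import numpy as np

def elbo(x, mu, log_sigma, n_samples=1000, seed=0):
    """Monte Carlo estimate of the ELBO -F(x) for the toy model
    z ~ N(0, 1), x | z ~ N(z, 1), with q(z | x) = N(mu, sigma^2)."""
    rng = np.random.default_rng(seed)
    sigma = np.exp(log_sigma)
    # KL[N(mu, sigma^2) || N(0, 1)] in closed form
    kl = 0.5 * (sigma**2 + mu**2 - 1.0) - log_sigma
    # E_q[log p(x | z)], sampled via z = mu + sigma * eps
    z = mu + sigma * rng.standard_normal(n_samples)
    loglik = -0.5 * np.log(2 * np.pi) - 0.5 * (x - z) ** 2
    return loglik.mean() - kl

# The exact posterior here is N(x/2, 1/2); at that setting the bound
# is tight and recovers log p(x) = log N(x; 0, 2) up to MC noise.
tight = elbo(1.0, mu=0.5, log_sigma=0.5 * np.log(0.5))
loose = elbo(1.0, mu=0.0, log_sigma=0.0)  # a worse q gives a smaller bound
```

In real VI code the two terms are of course differentiated by autodiff and optimised, rather than evaluated at hand-picked parameters as here.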

Mixture models

Mixture models are classic and for ages seemed to be the default choice of variational approximating family. I do not have much use for these.

Reparameterization trick

See reparameterisation.
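In brief, the trick writes a sample from \(q_{\phi}\) as a deterministic transform of parameter-free noise, so that gradients with respect to \(\phi\) can pass through the sampling step. A sketch with an example of my own, chosen so the answer is checkable by hand (\(f(z)=z^2\), whose expectation under \(\mathcal{N}(\mu,\sigma^2)\) is \(\mu^2+\sigma^2\)):

```python
import numpy as np

def pathwise_grad(f_prime, mu, sigma, n_samples=100_000, seed=0):
    """Reparameterised ("pathwise") estimate of the gradient of
    E_{z ~ N(mu, sigma^2)}[f(z)] with respect to mu: sample
    eps ~ N(0, 1), set z = mu + sigma * eps, and average f'(z).
    The noise no longer depends on mu, so differentiation and
    expectation commute."""
    rng = np.random.default_rng(seed)
    z = mu + sigma * rng.standard_normal(n_samples)
    return f_prime(z).mean()

# For f(z) = z^2, E[f(z)] = mu^2 + sigma^2, so d/dmu is exactly 2*mu.
g = pathwise_grad(lambda z: 2 * z, mu=1.5, sigma=0.7)
```

An autodiff framework applies the same idea wholesale to the ELBO estimator, which is what makes stochastic-gradient VI practical.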


See variational autoencoders?

Loss functions

In which probability metric should one approximate the target density? For tradition and convenience we usually use the KL divergence, but it is not ideal, and alternatives are a hot topic.

Ingmar Schuster’s critique of black box loss raises some issues (Ranganath et al. 2016):

It’s called Operator VI as a fancy way to say that one is flexible in constructing how exactly the objective function uses \(\pi, q\) and test functions from some family \(\mathcal{F}\). I completely agree with the motivation: KL-Divergence in the form \(\int q(x) \log \frac{q(x)}{\pi(x)} \mathrm{d}x\) indeed underestimates the variance of \(\pi\) and approximates only one mode. Using KL the other way around, \(\int \pi(x) \log \frac{\pi(x)}{q(x)} \mathrm{d}x\) takes all modes into account, but still tends to underestimate variance.

[…] the authors suggest an objective using what they call the Langevin-Stein Operator which does not make use of the proposal density \(q\) at all but uses test functions exclusively.
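The mode-seeking versus mass-covering contrast Schuster describes is easy to verify numerically. A toy illustration of my own (a single Gaussian \(q\) against a bimodal target \(\pi\), both divergences computed on a grid):

```python
import numpy as np

x = np.linspace(-10, 10, 4001)
dx = x[1] - x[0]

def normal(x, mu, s):
    return np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))

# Bimodal target: two well-separated Gaussian modes
pi = 0.5 * normal(x, -3, 1) + 0.5 * normal(x, 3, 1)

def kl(p, q):
    """KL[p || q] on the grid, skipping points where p is negligible."""
    mask = p > 1e-12
    return np.sum(p[mask] * np.log(p[mask] / q[mask])) * dx

q_mode = normal(x, 3, 1)    # hugs one mode, ignores the other
q_wide = normal(x, 0, 3.2)  # covers both modes with inflated variance

# Reverse KL (the usual variational objective) prefers the single-mode
# fit; forward KL prefers the mass-covering fit.
rev = kl(q_mode, pi), kl(q_wide, pi)
fwd = kl(pi, q_mode), kl(pi, q_wide)
```

The asymmetry is stark because \(q_{\text{mode}}\) puts essentially no mass on the left mode: reverse KL barely notices, while forward KL punishes it severely.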

Abbasnejad, Ehsan, Anthony Dick, and Anton van den Hengel. 2016. “Infinite Variational Autoencoder for Semi-Supervised Learning.” In Advances in Neural Information Processing Systems 29.

Archer, Evan, Il Memming Park, Lars Buesing, John Cunningham, and Liam Paninski. 2015. “Black Box Variational Inference for State Space Models,” November.

Bamler, Robert, and Stephan Mandt. 2017. “Structured Black Box Variational Inference for Latent Time Series Models,” July.

Berg, Rianne van den, Leonard Hasenclever, Jakub M. Tomczak, and Max Welling. 2018. “Sylvester Normalizing Flows for Variational Inference.” In UAI18.

Bishop, Christopher. 1994. “Mixture Density Networks.” Microsoft Research, January.

Blei, David M., Alp Kucukelbir, and Jon D. McAuliffe. 2017. “Variational Inference: A Review for Statisticians.” Journal of the American Statistical Association 112 (518): 859–77.

Caterini, Anthony L., Arnaud Doucet, and Dino Sejdinovic. 2018. “Hamiltonian Variational Auto-Encoder.” In Advances in Neural Information Processing Systems.

Chen, Tian Qi, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. 2018. “Neural Ordinary Differential Equations.” In Advances in Neural Information Processing Systems 31, edited by S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, 6572–83. Curran Associates, Inc.

Chung, Junyoung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C Courville, and Yoshua Bengio. 2015. “A Recurrent Latent Variable Model for Sequential Data.” In Advances in Neural Information Processing Systems 28, edited by C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, 2980–8. Curran Associates, Inc.

Cutajar, Kurt, Edwin V. Bonilla, Pietro Michiardi, and Maurizio Filippone. 2017. “Random Feature Expansions for Deep Gaussian Processes.” In PMLR.

Doerr, Andreas, Christian Daniel, Martin Schiegg, Duy Nguyen-Tuong, Stefan Schaal, Marc Toussaint, and Sebastian Trimpe. 2018. “Probabilistic Recurrent State-Space Models,” January.

Fabius, Otto, and Joost R. van Amersfoort. 2014. “Variational Recurrent Auto-Encoders.” In Proceedings of ICLR.

Flunkert, Valentin, David Salinas, and Jan Gasthaus. 2017. “DeepAR: Probabilistic Forecasting with Autoregressive Recurrent Networks,” April.

Fortunato, Meire, Charles Blundell, and Oriol Vinyals. 2017. “Bayesian Recurrent Neural Networks,” April.

Fraccaro, Marco, Søren Kaae Sønderby, Ulrich Paquet, and Ole Winther. 2016. “Sequential Neural Models with Stochastic Layers.” In Advances in Neural Information Processing Systems 29, edited by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, 2199–2207. Curran Associates, Inc.

Frey, B. J., and Nebojsa Jojic. 2005. “A Comparison of Algorithms for Inference and Learning in Probabilistic Graphical Models.” IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (9): 1392–1416.

Gagen, Michael J, and Kae Nemoto. 2006. “Variational Optimization of Probability Measure Spaces Resolves the Chain Store Paradox.”

Gal, Yarin, and Mark van der Wilk. 2014. “Variational Inference in Sparse Gaussian Process Regression and Latent Variable Models - a Gentle Tutorial,” February.

Giordano, Ryan, Tamara Broderick, and Michael I. Jordan. 2017. “Covariances, Robustness, and Variational Bayes,” September.

Grathwohl, Will, Ricky T. Q. Chen, Jesse Bettencourt, Ilya Sutskever, and David Duvenaud. 2018. “FFJORD: Free-Form Continuous Dynamics for Scalable Reversible Generative Models,” October.

Graves, Alex. 2011. “Practical Variational Inference for Neural Networks.” In Proceedings of the 24th International Conference on Neural Information Processing Systems, 2348–56. NIPS’11. USA: Curran Associates Inc.

Gu, Shixiang, Zoubin Ghahramani, and Richard E Turner. 2015. “Neural Adaptive Sequential Monte Carlo.” In Advances in Neural Information Processing Systems 28, edited by C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, 2629–37. Curran Associates, Inc.

Gulrajani, Ishaan, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. 2017. “Improved Training of Wasserstein GANs,” March.

He, Junxian, Daniel Spokoyny, Graham Neubig, and Taylor Berg-Kirkpatrick. 2019. “Lagging Inference Networks and Posterior Collapse in Variational Autoencoders.” In Proceedings of ICLR.

Hinton, G. E. 1995. “The Wake-Sleep Algorithm for Unsupervised Neural Networks.” Science 268 (5214): 1158–61.

Hoffman, Matt, David M. Blei, Chong Wang, and John Paisley. 2013. “Stochastic Variational Inference.” Journal of Machine Learning Research 14 (1).

Hoffman, Matthew, and David Blei. 2015. “Stochastic Structured Variational Inference.” In PMLR, 361–69.

Huang, Chin-Wei, David Krueger, Alexandre Lacoste, and Aaron Courville. 2018. “Neural Autoregressive Flows,” April.

Huggins, Jonathan H., Mikołaj Kasprzak, Trevor Campbell, and Tamara Broderick. 2019. “Practical Posterior Error Bounds from Variational Objectives,” October.

Jaakkola, Tommi S., and Michael I. Jordan. 1998. “Improving the Mean Field Approximation via the Use of Mixture Distributions.” In Learning in Graphical Models, 163–73. NATO ASI Series. Springer, Dordrecht.

Johnson, Matthew J., David Duvenaud, Alexander B. Wiltschko, Sandeep R. Datta, and Ryan P. Adams. 2016. “Composing Graphical Models with Neural Networks for Structured Representations and Fast Inference,” March.

Jordan, Michael I., Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence K. Saul. 1999. “An Introduction to Variational Methods for Graphical Models.” Machine Learning 37 (2): 183–233.

Karl, Maximilian, Maximilian Soelch, Justin Bayer, and Patrick van der Smagt. 2016. “Deep Variational Bayes Filters: Unsupervised Learning of State Space Models from Raw Data.” In Proceedings of ICLR.

Kingma, Diederik P., Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. 2016. “Improving Variational Inference with Inverse Autoregressive Flow.” In Advances in Neural Information Processing Systems 29. Curran Associates, Inc.

Kingma, Diederik P., Tim Salimans, and Max Welling. 2015. “Variational Dropout and the Local Reparameterization Trick,” June.

Kingma, Diederik P., and Max Welling. 2014. “Auto-Encoding Variational Bayes.” In ICLR 2014 Conference.

Kingma, Durk P, and Prafulla Dhariwal. 2018. “Glow: Generative Flow with Invertible 1x1 Convolutions.” In Advances in Neural Information Processing Systems 31, edited by S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, 10236–45. Curran Associates, Inc.

Larsen, Anders Boesen Lindbo, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther. 2015. “Autoencoding Beyond Pixels Using a Learned Similarity Metric,” December.

Liu, Huidong, Xianfeng Gu, and Dimitris Samaras. 2018. “A Two-Step Computation of the Exact GAN Wasserstein Distance.” In International Conference on Machine Learning, 3159–68.

Liu, Qiang, and Dilin Wang. 2019. “Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm.” In Advances in Neural Information Processing Systems.

Louizos, Christos, Uri Shalit, Joris M Mooij, David Sontag, Richard Zemel, and Max Welling. 2017. “Causal Effect Inference with Deep Latent-Variable Models.” In Advances in Neural Information Processing Systems 30, edited by I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, 6446–56. Curran Associates, Inc.

Louizos, Christos, and Max Welling. 2016. “Structured and Efficient Variational Deep Learning with Matrix Gaussian Posteriors.” In arXiv Preprint arXiv:1603.04733, 1708–16.

———. 2017. “Multiplicative Normalizing Flows for Variational Bayesian Neural Networks.” In PMLR, 2218–27.

Luts, Jan. 2015. “Real-Time Semiparametric Regression for Distributed Data Sets.” IEEE Transactions on Knowledge and Data Engineering 27 (2): 545–57.

Luts, J., T. Broderick, and M. P. Wand. 2014. “Real-Time Semiparametric Regression.” Journal of Computational and Graphical Statistics 23 (3): 589–615.

MacKay, David J C. 2002a. “Gaussian Processes.” In Information Theory, Inference & Learning Algorithms, Chapter 45. Cambridge University Press.

———. 2002b. Information Theory, Inference & Learning Algorithms. Cambridge University Press.

Maddison, Chris J., Dieterich Lawson, George Tucker, Nicolas Heess, Mohammad Norouzi, Andriy Mnih, Arnaud Doucet, and Yee Whye Teh. 2017. “Filtering Variational Objectives.” arXiv Preprint arXiv:1705.09279.

Mahdian, Saied, Jose Blanchet, and Peter Glynn. 2019. “Optimal Transport Relaxations with Application to Wasserstein GANs,” June.

Marzouk, Youssef, Tarek Moselhy, Matthew Parno, and Alessio Spantini. 2016. “Sampling via Measure Transport: An Introduction.” In Handbook of Uncertainty Quantification, edited by Roger Ghanem, David Higdon, and Houman Owhadi, 1–41. Cham: Springer International Publishing.

Minka, Thomas P. 2001. “Expectation Propagation for Approximate Bayesian Inference.” In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, 362–69. UAI’01. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.

Molchanov, Dmitry, Arsenii Ashukha, and Dmitry Vetrov. 2017. “Variational Dropout Sparsifies Deep Neural Networks.” In Proceedings of ICML.

Ormerod, J. T., and M. P. Wand. 2010. “Explaining Variational Approximations.” The American Statistician 64 (2): 140–53.

Papamakarios, George, Iain Murray, and Theo Pavlakou. 2017. “Masked Autoregressive Flow for Density Estimation.” In Advances in Neural Information Processing Systems 30, edited by I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, 2338–47. Curran Associates, Inc.

Pereyra, M., P. Schniter, É Chouzenoux, J. C. Pesquet, J. Y. Tourneret, A. O. Hero, and S. McLaughlin. 2016. “A Survey of Stochastic Simulation and Optimization Methods in Signal Processing.” IEEE Journal of Selected Topics in Signal Processing 10 (2): 224–41.

Ranganath, Rajesh, Dustin Tran, Jaan Altosaar, and David Blei. 2016. “Operator Variational Inference.” In Advances in Neural Information Processing Systems 29, edited by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, 496–504. Curran Associates, Inc.

Ranganath, Rajesh, Dustin Tran, and David Blei. 2016. “Hierarchical Variational Models.” In PMLR, 324–33.

Rezende, Danilo Jimenez, and Shakir Mohamed. 2015. “Variational Inference with Normalizing Flows.” In International Conference on Machine Learning, 1530–8. ICML’15. Lille, France.

Rezende, Danilo Jimenez, Shakir Mohamed, and Daan Wierstra. 2015. “Stochastic Backpropagation and Approximate Inference in Deep Generative Models.” In Proceedings of ICML.

Ruiz, Francisco J. R., Michalis K. Titsias, and David M. Blei. 2016. “The Generalized Reparameterization Gradient.” In Advances in Neural Information Processing Systems.

Ryder, Thomas, Andrew Golightly, A. Stephen McGough, and Dennis Prangle. 2018. “Black-Box Variational Inference for Stochastic Differential Equations,” February.

Salimans, Tim, Diederik Kingma, and Max Welling. 2015. “Markov Chain Monte Carlo and Variational Inference: Bridging the Gap.” In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), 1218–26. ICML’15. Lille, France.

Spantini, Alessio, Daniele Bigoni, and Youssef Marzouk. 2017. “Inference via Low-Dimensional Couplings.” Journal of Machine Learning Research 19 (66): 2639–2709.

Staines, Joe, and David Barber. 2012. “Variational Optimization,” December.

Titsias, Michalis K., and Miguel Lázaro-Gredilla. 2014. “Doubly Stochastic Variational Bayes for Non-Conjugate Inference.” In Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32, II–1971–II–1980. ICML’14. Beijing, China.

Wainwright, Martin J., and Michael I. Jordan. 2008. Graphical Models, Exponential Families, and Variational Inference. Vol. 1. Foundations and Trends® in Machine Learning.

Wainwright, M., and M. Jordan. 2005. “A Variational Principle for Graphical Models.” In New Directions in Statistical Signal Processing. Vol. 155. MIT Press.

Wand, M. P. 2016. “Fast Approximate Inference for Arbitrarily Large Semiparametric Regression Models via Message Passing.” arXiv Preprint arXiv:1602.07412.

Wang, Yixin, and David M. Blei. 2017. “Frequentist Consistency of Variational Bayes,” May.

Wiegerinck, Wim. 2000. “Variational Approximations Between Mean Field Theory and the Junction Tree Algorithm.” In Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence, 626–33. UAI ’00. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.

Winn, John M., and Christopher M. Bishop. 2005. “Variational Message Passing.” In Journal of Machine Learning Research, 661–94.

Xing, Eric P., Michael I. Jordan, and Stuart Russell. 2003. “A Generalized Mean Field Algorithm for Variational Inference in Exponential Families.” In Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence, 583–91. UAI’03. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.

Yao, Yuling, Aki Vehtari, Daniel Simpson, and Andrew Gelman. 2018. “Yes, but Did It Work?: Evaluating Variational Inference.” In Proceedings of ICML.

Yoshida, Ryo, and Mike West. 2010. “Bayesian Learning in Sparse Graphical Factor Models via Variational Mean-Field Annealing.” Journal of Machine Learning Research 11 (May): 1771–98.

Zahm, Olivier, Paul Constantine, Clémentine Prieur, and Youssef Marzouk. 2018. “Gradient-Based Dimension Reduction of Multivariate Vector-Valued Functions,” January.