Variational inference

On fitting the best model one can be bothered to

2016-03-22 — 2020-05-24



Figure 1: Approximating the intractable measure (right) with a transformation of a tractable one (left)

Inference where we approximate the density of the posterior variationally. That is, we use cunning tricks to solve an inference problem by optimising over some parameter set, usually one that allows us to trade off difficulty for fidelity in some useful way.

This idea is not intrinsically Bayesian (i.e. the density we are approximating need not be a posterior density or the marginal likelihood of the evidence), but much of the hot literature on it is from Bayesians doing something fashionable in probabilistic deep learning, so for concreteness I will assume Bayesian uses here.

This is usually mentioned in contrast to the other main method of approximating such densities: sampling from them, usually using Markov Chain Monte Carlo. In practice, the two are related (Salimans, Kingma, and Welling 2015) and nowadays even used together (Rezende and Mohamed 2015; Caterini, Doucet, and Sejdinovic 2018).

Once we have decided we are happy to use variational approximations, we are left with the question of … how? There are, AFAICT, two main schools of thought here. One comprises methods which leverage the graphical structure of the problem and maintain structural hygiene, which use variational message passing.

1 “Variational”

The term “variational” comes from the calculus of variations, where one seeks the function that minimises a functional. In the context of variational inference, the functional is usually taken to be the KL divergence between the variational approximation and the true posterior.

This usage really annoys me; there are other functionals we could minimise than KL, and the term “variational” is not helpful in distinguishing these.

This name is IMO bad. I want to change it. I will lose this battle; it is too late.

2 Introduction

The classic intro seems to be (Jordan et al. 1999), which considers diverse types of variational calculus applications and inference. Typical ML uses these days are more specific; an archetypal example would be the variational auto-encoder (Diederik P. Kingma and Welling 2014).

3 Inference via KL divergence

The most common version uses KL loss to construct the famous Evidence Lower Bound (ELBO) objective. This is mathematically convenient and highly recommended if you can get away with it.
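For reference, the standard identity behind this bound can be sketched as follows. The notation is an assumption of this sketch rather than something fixed above: $x$ is the observed data, $z$ the latent variables, $p(x, z)$ the joint density and $q(z)$ the variational density.

$$
\log p(x) = \underbrace{\mathbb{E}_{q(z)}\left[\log p(x, z) - \log q(z)\right]}_{\mathrm{ELBO}(q)} + \mathrm{KL}\left(q(z) \,\|\, p(z \mid x)\right)
$$

Since the KL term is non-negative, $\mathrm{ELBO}(q) \le \log p(x)$. Since $\log p(x)$ does not depend on $q$, maximising the ELBO over $q$ is equivalent to minimising the KL divergence to the posterior, without ever needing to evaluate the intractable normaliser $p(x)$.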

3.1 Implicit

Implicit VI is a special case of variational loss, it sounds like? TBD.

4 Other loss functions

In which probability metric should one approximate the target density? For tradition and convenience, we usually use KL loss, but this is not ideal, and alternatives are hot topics. There are simple ones, such as the KL divergence taken in the other direction (“reverse KL” relative to the usual objective), which is sometimes how we justify expectation propagation, and also the modest generalisation to Rényi-divergence inference (Li and Turner 2016).
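To make the difference between the two directions of KL divergence concrete, here is a minimal numerical sketch. Everything in it is an assumption chosen for illustration: the target `target` is a hypothetical equal mixture of $N(-3, 1)$ and $N(3, 1)$, the approximating family is a single Gaussian, densities are discretised on a grid, and the fit is a brute-force grid search rather than anything one would use in practice.

```python
import numpy as np

# Grid over which all densities are discretised (an assumption of this sketch).
x = np.linspace(-10.0, 10.0, 2001)
dx = x[1] - x[0]

def normal_pdf(x, mean, std):
    """Density of N(mean, std^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

# Hypothetical bimodal target: equal mixture of N(-3, 1) and N(3, 1).
target = 0.5 * normal_pdf(x, -3.0, 1.0) + 0.5 * normal_pdf(x, 3.0, 1.0)

def kl(p, q):
    """Riemann-sum approximation of KL(p || q) on the grid."""
    eps = 1e-300  # guards against log(0)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))) * dx)

def fit_gaussian(direction):
    """Brute-force search for the Gaussian q minimising the chosen KL direction."""
    best_loss, best_mean, best_std = np.inf, None, None
    for mean in np.linspace(-5.0, 5.0, 101):
        for std in np.linspace(0.5, 5.0, 91):
            q = normal_pdf(x, mean, std)
            loss = kl(q, target) if direction == "q||pi" else kl(target, q)
            if loss < best_loss:
                best_loss, best_mean, best_std = loss, mean, std
    return best_mean, best_std

# KL(q || pi): the direction used in the usual variational objective.
mean_excl, std_excl = fit_gaussian("q||pi")
# KL(pi || q): the other direction.
mean_incl, std_incl = fit_gaussian("pi||q")

print("KL(q || pi) fit: mean", mean_excl, "std", std_excl)
print("KL(pi || q) fit: mean", mean_incl, "std", std_incl)
```

In this setup the $\mathrm{KL}(q \| π)$ fit should sit on one of the two modes with a standard deviation near 1, while the $\mathrm{KL}(π \| q)$ fit should be centred near 0 with a standard deviation near $\sqrt{10} \approx 3.16$, which is the moment-matched value for this mixture.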

Ingmar Schuster’s critique of black box loss (Ranganath et al. 2016) raises some issues:

It’s called Operator VI as a fancy way to say that one is flexible in constructing how exactly the objective function uses $π, q$ and test functions from some family $F$. I completely agree with the motivation: KL-Divergence in the form $\int q(x) \log \frac{q(x)}{π(x)} \, dx$ indeed underestimates the variance of $π$ and approximates only one mode. Using KL the other way around, $\int π(x) \log \frac{π(x)}{q(x)} \, dx$ takes all modes into account, but still tends to underestimate variance.

[…] the authors suggest an objective using what they call the Langevin-Stein Operator which does not make use of the proposal density $q$ at all but uses test functions exclusively.

5 “Generalized”

Variational inference but where the variational loss is not KL. See Generalized Variational Inference.

6 Philosophical interpretations

John Schulman’s Sending Samples Without Bits-Back is a nifty interpretation of KL variational bounds in terms of coding theory/message sending.

Not grandiose enough? See Karl Friston’s interpretation of variational inference as a principle of cognition.

An information maximization view on the $β$-VAE objective

7 In graphical models

See variational inference in graphical models.

8 Mean-field assumption

TODO: mention the importance of this for classic-flavoured variational inference (Mean Field Variational Bayes). This confused me for aaaaages. AFAICT this is a problem of history. Not all variational inference makes the confusingly-named “mean-field” assumption, but for a long while that was the only game in town, so tutorials of a certain vintage take mean-field variational inference as a synonym for variational inference. If I have just learnt some non-mean-field SVI methods from a recent NeurIPS paper, then I run into this, I might well be confused.
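As a minimal sketch of what the mean-field assumption does, assume a hypothetical target that is a correlated bivariate Gaussian with correlation `rho`, and approximate it with a fully factorised $q(z) = q_1(z_1) q_2(z_2)$. For a Gaussian target the coordinate-ascent updates have a well-known closed form: each factor is Gaussian with variance $1 / Λ_{ii}$, where $Λ$ is the precision matrix of the target. The particular numbers below are made up for illustration.

```python
import numpy as np

# Hypothetical target: a correlated bivariate Gaussian "posterior".
mu = np.array([1.0, -1.0])
rho = 0.9
Sigma = np.array([[1.0, rho], [rho, 1.0]])
Lambda = np.linalg.inv(Sigma)  # precision matrix of the target

# Mean-field family: q(z) = q1(z1) * q2(z2), each factor a univariate Gaussian.
# Coordinate ascent: update each factor mean given the current mean of the other.
m = np.zeros(2)  # arbitrary initialisation of the factor means
for _ in range(200):
    m[0] = mu[0] - Lambda[0, 1] / Lambda[0, 0] * (m[1] - mu[1])
    m[1] = mu[1] - Lambda[1, 0] / Lambda[1, 1] * (m[0] - mu[0])

# The factor variances do not depend on the means: they are 1 / Lambda[i, i].
mean_field_var = 1.0 / np.diag(Lambda)
true_marginal_var = np.diag(Sigma)

print("mean-field means:", m)
print("mean-field variances:", mean_field_var)
print("true marginal variances:", true_marginal_var)
```

The factor means converge to the true means, but each factor variance is $1 - ρ^2 = 0.19$ rather than the true marginal variance of 1. The factorised approximation cannot represent the correlation, and under the KL objective it compensates by shrinking.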

9 Mixture models

Mixture models are classic and, for ages, seemed to be the default choice for variational approximation. They are an interesting trick to make a graphical model conditionally conjugate by use of auxiliary variables.

10 Reparameterization trick

See reparameterisation.
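As a quick illustration of the idea, here is a minimal sketch assuming a toy integrand $f(z) = z^2$ and a Gaussian $q = N(μ, σ^2)$, neither of which comes from the text above. Writing $z = μ + σ ε$ with $ε \sim N(0, 1)$ moves the parameters out of the sampling distribution, so the gradient of $\mathbb{E}_q[f(z)]$ can be estimated by differentiating through the samples. The function name `reparam_grad` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def reparam_grad(mu, sigma, n_samples=200_000):
    """Monte Carlo estimate of the gradient of E_{z ~ N(mu, sigma^2)}[z^2]
    with respect to mu and sigma, using z = mu + sigma * eps."""
    eps = rng.standard_normal(n_samples)  # noise that does not depend on (mu, sigma)
    z = mu + sigma * eps                  # reparameterised samples
    df_dz = 2.0 * z                       # derivative of f(z) = z^2
    grad_mu = np.mean(df_dz)              # chain rule with dz/dmu = 1
    grad_sigma = np.mean(df_dz * eps)     # chain rule with dz/dsigma = eps
    return grad_mu, grad_sigma

grad_mu, grad_sigma = reparam_grad(mu=1.5, sigma=0.7)
print("estimated gradients:", grad_mu, grad_sigma)
```

Here the exact answer is available for comparison: $\mathbb{E}_q[z^2] = μ^2 + σ^2$, so the true gradients are $2μ = 3.0$ and $2σ = 1.4$, and the estimates should land close to those values.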

11 Autoencoders

See variational autoencoders.

12 Stochastic

See Stochastic variational inference.

13 Incoming

VIABEL: “Efficient, lightweight variational inference and approximation bounds” (Huggins et al. 2020; Welandawe et al. 2024) (source jhuggins/viabel)

14 References

Abbasnejad, Dick, and Hengel. 2016. “Infinite Variational Autoencoder for Semi-Supervised Learning.” In Advances in Neural Information Processing Systems 29.

Archer, Park, Buesing, et al. 2015. “Black Box Variational Inference for State Space Models.” arXiv:1511.07367 [Stat].

Attias. 1999. “Inferring Parameters and Structure of Latent Variable Models by Variational Bayes.” In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence. UAI’99.

Bamler, and Mandt. 2017. “Structured Black Box Variational Inference for Latent Time Series Models.” arXiv:1707.01069 [Cs, Stat].

Bishop. 1994. “Mixture Density Networks.” Microsoft Research.

Blei, Kucukelbir, and McAuliffe. 2017. “Variational Inference: A Review for Statisticians.” Journal of the American Statistical Association.

Burt, Rasmussen, and Wilk. 2020. “Convergence of Sparse Variational Inference in Gaussian Processes Regression.” Journal of Machine Learning Research.

Caterini, Doucet, and Sejdinovic. 2018. “Hamiltonian Variational Auto-Encoder.” In Advances in Neural Information Processing Systems.

Chen, Rubanova, Bettencourt, et al. 2018. “Neural Ordinary Differential Equations.” In Advances in Neural Information Processing Systems 31.

Chung, Kastner, Dinh, et al. 2015. “A Recurrent Latent Variable Model for Sequential Data.” In Advances in Neural Information Processing Systems 28.

Cutajar, Bonilla, Michiardi, et al. 2017. “Random Feature Expansions for Deep Gaussian Processes.” In PMLR.

Dhaka, and Catalina. 2020. “Robust, Accurate Stochastic Optimization for Variational Inference.”

Dhaka, Catalina, Welandawe, et al. 2021. “Challenges and Opportunities in High-Dimensional Variational Inference.” arXiv:2103.01085 [Cs, Stat].

Doerr, Daniel, Schiegg, et al. 2018. “Probabilistic Recurrent State-Space Models.” arXiv:1801.10395 [Stat].

Fabius, and van Amersfoort. 2014. “Variational Recurrent Auto-Encoders.” In Proceedings of ICLR.

Flunkert, Salinas, and Gasthaus. 2017. “DeepAR: Probabilistic Forecasting with Autoregressive Recurrent Networks.” arXiv:1704.04110 [Cs, Stat].

Fortunato, Blundell, and Vinyals. 2017. “Bayesian Recurrent Neural Networks.” arXiv:1704.02798 [Cs, Stat].

Fraccaro, Sønderby, Paquet, et al. 2016. “Sequential Neural Models with Stochastic Layers.” In Advances in Neural Information Processing Systems 29.

Frey, and Jojic. 2005. “A Comparison of Algorithms for Inference and Learning in Probabilistic Graphical Models.” IEEE Transactions on Pattern Analysis and Machine Intelligence.

Futami, Sato, and Sugiyama. 2017. “Variational Inference Based on Robust Divergences.” arXiv:1710.06595 [Stat].

Gagen, and Nemoto. 2006. “Variational Optimization of Probability Measure Spaces Resolves the Chain Store Paradox.”

Gal, and van der Wilk. 2014. “Variational Inference in Sparse Gaussian Process Regression and Latent Variable Models - a Gentle Tutorial.” arXiv:1402.1412 [Stat].

Galy-Fajou, Perrone, and Opper. 2021. “Flexible and Efficient Inference with Particles for the Variational Gaussian Approximation.” Entropy.

Giordano, Broderick, and Jordan. 2017. “Covariances, Robustness, and Variational Bayes.” arXiv:1709.02536 [Stat].

Grathwohl, Chen, Bettencourt, et al. 2018. “FFJORD: Free-Form Continuous Dynamics for Scalable Reversible Generative Models.” arXiv:1810.01367 [Cs, Stat].

Graves. 2011. “Practical Variational Inference for Neural Networks.” In Proceedings of the 24th International Conference on Neural Information Processing Systems. NIPS’11.

Gu, Ghahramani, and Turner. 2015. “Neural Adaptive Sequential Monte Carlo.” In Advances in Neural Information Processing Systems 28.

Gulrajani, Ahmed, Arjovsky, et al. 2017. “Improved Training of Wasserstein GANs.” arXiv:1704.00028 [Cs, Stat].

He, Spokoyny, Neubig, et al. 2019. “Lagging Inference Networks and Posterior Collapse in Variational Autoencoders.” In Proceedings of ICLR.

Hinton. 1995. “The Wake-Sleep Algorithm for Unsupervised Neural Networks.” Science.

Hoffman, Matthew, and Blei. 2015. “Stochastic Structured Variational Inference.” In PMLR.

Hoffman, Matt, Blei, Wang, et al. 2013. “Stochastic Variational Inference.” arXiv:1206.7051 [Cs, Stat].

Huang, Krueger, Lacoste, et al. 2018. “Neural Autoregressive Flows.” arXiv:1804.00779 [Cs, Stat].

Huggins, Kasprzak, Campbell, et al. 2019. “Practical Posterior Error Bounds from Variational Objectives.” arXiv:1910.04102 [Cs, Math, Stat].

———, et al. 2020. “Validated Variational Inference via Practical Posterior Error Bounds.” In Proc. of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS).

Jaakkola, and Jordan. 1998. “Improving the Mean Field Approximation Via the Use of Mixture Distributions.” In Learning in Graphical Models. NATO ASI Series.

Jordan, Ghahramani, Jaakkola, et al. 1999. “An Introduction to Variational Methods for Graphical Models.” Machine Learning.

Karl, Soelch, Bayer, et al. 2016. “Deep Variational Bayes Filters: Unsupervised Learning of State Space Models from Raw Data.” In Proceedings of ICLR.

Khan, and Lin. 2017. “Conjugate-Computation Variational Inference: Converting Variational Inference in Non-Conjugate Models to Inferences in Conjugate Models.” In Artificial Intelligence and Statistics.

Kingma, Diederik P. 2017. “Variational Inference & Deep Learning: A New Synthesis.”

Kingma, Durk P, and Dhariwal. 2018. “Glow: Generative Flow with Invertible 1x1 Convolutions.” In Advances in Neural Information Processing Systems 31.

Kingma, Diederik P., Salimans, Jozefowicz, et al. 2016. “Improving Variational Inference with Inverse Autoregressive Flow.” In Advances in Neural Information Processing Systems 29.

Kingma, Diederik P., Salimans, and Welling. 2015. “Variational Dropout and the Local Reparameterization Trick.” In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2. NIPS’15.

Kingma, Diederik P., and Welling. 2014. “Auto-Encoding Variational Bayes.” In ICLR 2014 Conference.

Knoblauch, Jewson, and Damoulas. 2022. “An Optimization-Centric View on Bayes’ Rule: Reviewing and Generalizing Variational Inference.” Journal of Machine Learning Research.

Larsen, Sønderby, Larochelle, et al. 2015. “Autoencoding Beyond Pixels Using a Learned Similarity Metric.” arXiv:1512.09300 [Cs, Stat].

Leibfried, Dutordoir, John, et al. 2022. “A Tutorial on Sparse Gaussian Processes and Variational Inference.”

Li, and Turner. 2016. “Rényi Divergence Variational Inference.” In Advances in Neural Information Processing Systems.

Liu, Huidong, Gu, and Samaras. 2018. “A Two-Step Computation of the Exact GAN Wasserstein Distance.” In International Conference on Machine Learning.

Liu, Qiang, and Wang. 2019. “Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm.” In Advances In Neural Information Processing Systems.

Louizos, Shalit, Mooij, et al. 2017. “Causal Effect Inference with Deep Latent-Variable Models.” In Advances in Neural Information Processing Systems 30.

Louizos, and Welling. 2016. “Structured and Efficient Variational Deep Learning with Matrix Gaussian Posteriors.” arXiv Preprint arXiv:1603.04733.

———. 2017. “Multiplicative Normalizing Flows for Variational Bayesian Neural Networks.” In PMLR.

Luts, Jan. 2015. “Real-Time Semiparametric Regression for Distributed Data Sets.” IEEE Transactions on Knowledge and Data Engineering.

Luts, J., Broderick, and Wand. 2014. “Real-Time Semiparametric Regression.” Journal of Computational and Graphical Statistics.

MacKay. 2002a. “Gaussian Processes.” In Information Theory, Inference & Learning Algorithms.

———. 2002b. Information Theory, Inference & Learning Algorithms.

Maddison, Lawson, Tucker, et al. 2017. “Filtering Variational Objectives.” arXiv Preprint arXiv:1705.09279.

Mahdian, Blanchet, and Glynn. 2019. “Optimal Transport Relaxations with Application to Wasserstein GANs.” arXiv:1906.03317 [Cs, Math, Stat].

Marzouk, Moselhy, Parno, et al. 2016. “Sampling via Measure Transport: An Introduction.” In Handbook of Uncertainty Quantification.

Matthews. 2017. “Scalable Gaussian Process Inference Using Variational Methods.”

Minka. 2001. “Expectation Propagation for Approximate Bayesian Inference.” In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence. UAI’01.

Molchanov, Ashukha, and Vetrov. 2017. “Variational Dropout Sparsifies Deep Neural Networks.” In Proceedings of ICML.

Ng, Zhu, Chen, et al. 2019. “A Graph Autoencoder Approach to Causal Structure Learning.” In Advances In Neural Information Processing Systems.

Nolan, Menictas, and Wand. 2020. “Streamlined Variational Inference with Higher Level Random Effects.” Journal of Machine Learning Research.

Ormerod, and Wand. 2010. “Explaining Variational Approximations.” The American Statistician.

Papamakarios, Murray, and Pavlakou. 2017. “Masked Autoregressive Flow for Density Estimation.” In Advances in Neural Information Processing Systems 30.

Pereyra, Schniter, Chouzenoux, et al. 2016. “A Survey of Stochastic Simulation and Optimization Methods in Signal Processing.” IEEE Journal of Selected Topics in Signal Processing.

Plötz, Wannenwetsch, and Roth. 2018. “Stochastic Variational Inference with Gradient Linearization.” In CVPR.

Ranganath, Tran, Altosaar, et al. 2016. “Operator Variational Inference.” In Advances in Neural Information Processing Systems 29.

Ranganath, Tran, and Blei. 2016. “Hierarchical Variational Models.” In PMLR.

Rezende, and Mohamed. 2015. “Variational Inference with Normalizing Flows.” In International Conference on Machine Learning. ICML’15.

Rezende, Mohamed, and Wierstra. 2015. “Stochastic Backpropagation and Approximate Inference in Deep Generative Models.” In Proceedings of ICML.

Roychowdhury, and Kulis. 2015. “Gamma Processes, Stick-Breaking, and Variational Inference.” In Artificial Intelligence and Statistics.

Ruiz, Titsias, and Blei. 2016. “The Generalized Reparameterization Gradient.” In Advances In Neural Information Processing Systems.

Ryder, Golightly, McGough, et al. 2018. “Black-Box Variational Inference for Stochastic Differential Equations.” arXiv:1802.03335 [Stat].

Salimans, Kingma, and Welling. 2015. “Markov Chain Monte Carlo and Variational Inference: Bridging the Gap.” In Proceedings of the 32nd International Conference on Machine Learning (ICML-15). ICML’15.

Schervish. 2012. Theory of Statistics. Springer Series in Statistics.

Spantini, Bigoni, and Marzouk. 2017. “Inference via Low-Dimensional Couplings.” Journal of Machine Learning Research.

Staines, and Barber. 2012. “Variational Optimization.”

Titsias, and Lázaro-Gredilla. 2014. “Doubly Stochastic Variational Bayes for Non-Conjugate Inference.” In Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32. ICML’14.

Ullrich. 2020. “A Coding Perspective on Deep Latent Variable Models.”

van de Meent, Paige, Yang, et al. 2021. “An Introduction to Probabilistic Programming.” arXiv:1809.10756 [Cs, Stat].

van den Berg, Hasenclever, Tomczak, et al. 2018. “Sylvester Normalizing Flows for Variational Inference.” In UAI18.

Wainwright, Martin, and Jordan. 2005. “A Variational Principle for Graphical Models.” In New Directions in Statistical Signal Processing.

Wainwright, Martin J., and Jordan. 2008. Graphical Models, Exponential Families, and Variational Inference. Foundations and Trends® in Machine Learning.

Wand. 2017. “Fast Approximate Inference for Arbitrarily Large Semiparametric Regression Models via Message Passing.” Journal of the American Statistical Association.

Wang, and Blei. 2017. “Frequentist Consistency of Variational Bayes.” arXiv:1705.03439 [Cs, Math, Stat].

Welandawe, Andersen, Vehtari, et al. 2024. “A Framework for Improving the Reliability of Black-Box Variational Inference.”

Wiegerinck. 2000. “Variational Approximations Between Mean Field Theory and the Junction Tree Algorithm.” In Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence. UAI ’00.

Wingate, and Weber. 2013. “Automated Variational Inference in Probabilistic Programming.” arXiv:1301.1299 [Cs, Stat].

Winn, and Bishop. 2005. “Variational Message Passing.” Journal of Machine Learning Research.

Xing, Jordan, and Russell. 2003. “A Generalized Mean Field Algorithm for Variational Inference in Exponential Families.” In Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence. UAI’03.

Yao, Vehtari, Simpson, et al. n.d. “Yes, but Did It Work?: Evaluating Variational Inference.”

Yoshida, and West. 2010. “Bayesian Learning in Sparse Graphical Factor Models via Variational Mean-Field Annealing.” Journal of Machine Learning Research.

Zahm, Constantine, Prieur, et al. 2018. “Gradient-Based Dimension Reduction of Multivariate Vector-Valued Functions.” arXiv:1801.07922 [Math].

Zhang, Liu, Chen, et al. 2022. “On the Properties of Kullback-Leibler Divergence Between Multivariate Gaussian Distributions.”