Evidence lower bound, variational free energy etc

\(\renewcommand{\Ex}{\mathbb{E}} \renewcommand{\dd}{\mathrm{d}} \renewcommand{\kl}{\operatorname{KL}} \renewcommand{\H}{\mathbb{H}} \renewcommand{\vv}[1]{\boldsymbol{#1}} \renewcommand{\pd}{\partial}\)

On using the most convenient probability metric (i.e. KL divergence) to do variational inference.

There is nothing novel here. But everyone who is doing variational inference has to work through this just once, and I’m doing so here.

Yuge Shi’s introduction is the best short intro that gets to state-of-the-art. The canonical intro is de Garis Matthews (2017) who did a thesis on it. Murphy (2012) sec 21.2 is also pretty good.

We often want a variational approximation to the marginal (log-)likelihood \(\log p_{\theta}(\vv{x})\) (a.k.a. “evidence”) for some probabilistic model with observations \(\vv{x},\) unobserved latent factors \(\vv{z}\), model parameters \(\theta\) and variational parameters \(\phi\).

Here’s one. The steps are all elementary, although realizing you would want to take them is not, IMO.

For convenience, we will assume everything has a density with respect to some unspecified dominating measure over \(\vv{x}\) and \(\vv{z}\), which is usually an OK assumption.\(^{1}\)

\[\begin{aligned} \log p_{\theta}(\vv{x}) &=\log (p_{\theta}(\vv{x})) \Ex_{\vv{z}\sim q_{\phi}(\vv{z}|\vv{x})}\left[ 1 \right] \\ &=\Ex_{\vv{z}\sim q_{\phi}(\vv{z}|\vv{x})}\left[ \log (p_{\theta}(\vv{x})) \right]\\ &=\Ex_{\vv{z}\sim q_{\phi}(\vv{z}|\vv{x})}\left[ \log \left( \frac{ p_{\theta}(\vv{x},\vv{z}) }{ p_{\theta}(\vv{z}|\vv{x}) } \right) \right]\\ &=\Ex_{\vv{z}\sim q_{\phi}(\vv{z}|\vv{x})}\left[ \log\left( \frac{ q_{\phi}(\vv{z}|\vv{x})p_{\theta}(\vv{x},\vv{z}) }{ p_{\theta}(\vv{z}|\vv{x})q_{\phi}(\vv{z}|\vv{x}) }\right) \right]\\ &=\Ex_{\vv{z}\sim q_{\phi}(\vv{z}|\vv{x})}\left[ \log \left(\frac{ q_{\phi}(\vv{z}|\vv{x}) }{ p_{\theta}(\vv{z}|\vv{x}) } \cdot \frac{ p_{\theta}(\vv{x},\vv{z}) }{ q_{\phi}(\vv{z}|\vv{x}) }\right) \right]\\ &=\Ex_{\vv{z}\sim q_{\phi}(\vv{z}|\vv{x})}\left[\log\left(\frac{ q_{\phi}(\vv{z}|\vv{x}) }{ p_{\theta}(\vv{z}|\vv{x}) }\right) \right] \\ &\qquad+\Ex_{\vv{z}\sim q_{\phi}(\vv{z}|\vv{x})}\left[ \log\left(\frac{ p_{\theta}(\vv{x},\vv{z}) }{ q_{\phi}(\vv{z}|\vv{x}) }\right) \right]\\ &=\kl_{\vv{z}}( q_{\phi}(\vv{z}|\vv{x}) \| p_{\theta}(\vv{z}|\vv{x}) ) +\mathcal{L}(\vv{x},\theta,\phi) \end{aligned}\]

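\(\mathcal{L}(\vv{x},\theta,\phi):=\Ex_{\vv{z}\sim q_{\phi}(\vv{z}|\vv{x})}\left[\log\left(\frac{ p_{\theta}(\vv{x},\vv{z}) }{ q_{\phi}(\vv{z}|\vv{x})}\right) \right]\) is called the Evidence Lower Bound (ELBO); its negative is the variational free energy. It is a lower bound because the KL divergence is non-negative, so \(\mathcal{L}(\vv{x},\theta,\phi)\le\log p_{\theta}(\vv{x})\), with equality when \(q_{\phi}(\vv{z}|\vv{x})\) matches the exact posterior \(p_{\theta}(\vv{z}|\vv{x})\) almost everywhere.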

We can understand this as a decomposition of the log marginal evidence into two parts which are sort-of-interpretable. \(\kl_{\vv{z}}( q_{\phi}(\vv{z}|\vv{x}) \| p_{\theta}(\vv{z}|\vv{x}))\) represents the cost of approximating the exact posterior \(p_{\theta}(\vv{z}|\vv{x})\) over the latents with the variational distribution \(q_{\phi}(\vv{z}|\vv{x})\). We can’t actually evaluate this in general, since it involves the exact posterior, which is usually intractable, but for fancy enough \(q_{\phi}\) it could be small.
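As a sanity check, here is a minimal numerical sketch of the decomposition for a toy model where everything can be enumerated. The three-state latent, the probability tables and the choice of \(q\) are all made-up values for illustration, not anything from the references.

```python
import numpy as np

# Hypothetical toy model: latent z takes values in {0, 1, 2}; x is one fixed observation.
p_z = np.array([0.5, 0.3, 0.2])          # prior p(z)
p_x_given_z = np.array([0.9, 0.2, 0.4])  # likelihood p(x | z) evaluated at the observed x
q = np.array([0.6, 0.1, 0.3])            # an arbitrary variational distribution q(z | x)

p_xz = p_x_given_z * p_z                 # joint p(x, z) as a function of z
log_evidence = np.log(p_xz.sum())        # log p(x), tractable here by enumeration
posterior = p_xz / p_xz.sum()            # exact posterior p(z | x)

# ELBO: E_q[log p(x, z) - log q(z | x)]
elbo = np.sum(q * (np.log(p_xz) - np.log(q)))
# KL(q(z | x) || p(z | x))
kl = np.sum(q * (np.log(q) - np.log(posterior)))

print("log evidence:", log_evidence)
print("KL + ELBO:   ", kl + elbo)
```

The two printed numbers should agree up to floating-point error, and the ELBO alone should sit below the log evidence by exactly the KL term.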

The other bit, \(\mathcal{L}(\vv{x},\theta,\phi)\), represents an objective we can actually maximise, which motivates a whole bunch of technology. We can stop there and think of it as just that, or we can break it up further. The traditional next step if we want to decompose further is to observe that

\[\begin{aligned} \mathcal{L}(\vv{x},\theta,\phi) &=\Ex_{\vv{z}\sim q_{\phi}(\vv{z}|\vv{x})}\left[\log\left(\frac{ p_{\theta}(\vv{x},\vv{z}) }{ q_{\phi}(\vv{z}|\vv{x}) } \right) \right]\\ &=\Ex_{\vv{z}\sim q_{\phi}(\vv{z}|\vv{x})}\left[\log\left(\frac{ p_{\theta}(\vv{x}|\vv{z})p_{\theta}(\vv{z}) }{ q_{\phi}(\vv{z}|\vv{x}) } \right) \right]\\ &=\Ex_{\vv{z}\sim q_{\phi}(\vv{z}|\vv{x})}\left[\log(p_{\theta}(\vv{x}|\vv{z})) \right] -\Ex_{\vv{z}\sim q_{\phi}(\vv{z}|\vv{x})}\left[\log\left(\frac{ q_{\phi}(\vv{z}|\vv{x}) }{ p_{\theta}(\vv{z}) } \right) \right]\\ &=\underbrace{\Ex_{\vv{z}\sim q_{\phi}(\vv{z}|\vv{x})}\left[\log(p_{\theta}(\vv{x}|\vv{z})) \right]}_{\text{Expected log likelihood}} - \underbrace{\kl_{\vv{z}}( q_{\phi}(\vv{z}|\vv{x}) \| p_{\theta}(\vv{z}))}_{\text{KL of approx posterior update}} \end{aligned}\]

This suggests we might intuit maximising the ELBO as maximising the expected log-likelihood of the data given the latents, whilst penalising the approximate posterior for diverging too far from the prior on those latents. I find a marginal prior on the latents to be a weird concept here and this particular formulation makes my head hurt.
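To make the two terms concrete, here is a small sketch for a conjugate Gaussian model where both have closed forms. The model (prior \(z\sim\mathcal{N}(0,1)\), likelihood \(x\mid z\sim\mathcal{N}(z,1)\), Gaussian \(q\)) and the function names are assumptions chosen for illustration. In this model the exact posterior is \(\mathcal{N}(x/2, 1/2)\) and the evidence is \(\mathcal{N}(x; 0, 2)\), so we can check that the bound is tight at the exact posterior and loose elsewhere.

```python
import numpy as np

def elbo_gaussian(x, m, s2):
    """ELBO for prior z ~ N(0, 1), likelihood x | z ~ N(z, 1), q(z | x) = N(m, s2)."""
    # Expected log likelihood E_q[log N(x; z, 1)], using E_q[(x - z)^2] = (x - m)^2 + s2.
    expected_loglik = -0.5 * np.log(2 * np.pi) - 0.5 * ((x - m) ** 2 + s2)
    # KL(N(m, s2) || N(0, 1)) in closed form.
    kl_to_prior = 0.5 * (m ** 2 + s2 - 1.0 - np.log(s2))
    return expected_loglik - kl_to_prior

def log_evidence_gaussian(x):
    """Log marginal likelihood: marginally x ~ N(0, 2) in this model."""
    return -0.5 * np.log(2 * np.pi * 2.0) - x ** 2 / 4.0

x = 1.3
tight = elbo_gaussian(x, m=x / 2, s2=0.5)  # q set to the exact posterior
loose = elbo_gaussian(x, m=0.0, s2=1.0)    # q set to the prior, so the KL term is zero

print("log evidence:      ", log_evidence_gaussian(x))
print("ELBO, q = posterior:", tight)
print("ELBO, q = prior:    ", loose)
```

Setting \(q\) to the prior zeroes the KL penalty but pays for it in expected log likelihood, so the bound is strictly worse than at the exact posterior.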

Yuge Shi, summarising Hoffman and Johnson (2016) and Mathieu et al. (2019), observes that we can break this down in a way which makes the per-observation latents into a coding problem. Suppose we index our \(N\) observations by \(n\), and they are independent. Then we can write this bad boy using the marginal \(q(z):=\frac{1}{N}\sum_{n=1}^N q(z\mid x_n)\), i.e. the approximate posterior averaged over the observations.

\[\begin{aligned} \mathcal{L}(\vv{x},\theta,\phi) &= \underbrace{\left[ \frac{1}{N} \sum^N_{n=1} \Ex_{q(z_n\mid x_n)} [\log p(x_n \mid z_n)] \right]}_{\color{#4059AD}{\text{(1) Average reconstruction}}} - \underbrace{(\log N - \Ex_{q(z)}[\H[q(n\mid z)]])}_{\color{#EE6C4D}{\text{(2) Index-code mutual info}}} \\ &\quad - \underbrace{\kl(q(z)\| p(z))}_{\color{#86CD82}{\text{(3) KL between q and p}}} \end{aligned}\] Here \(\H\left[p(\vv{z}) \right] \triangleq-\Ex_{p(\vv{z})} \left[\log p(\vv{z})\right]\) is entropy, and \(q(n\mid z)\) is the distribution over the observation index \(n\) given the code \(z\).
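The three-way split can be checked numerically. The sketch below uses a discrete latent so that every term is a finite sum; the sizes, the random tables and the variable names are invented for illustration. It compares the average per-observation ELBO against (1) average reconstruction minus (2) index-code mutual information minus (3) the KL from the averaged \(q(z)\) to the prior, with the index distribution given the code obtained by Bayes' rule from a uniform distribution over indices.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 5, 3  # hypothetical sizes: N observations, K discrete latent states

q_z_given_n = rng.dirichlet(np.ones(K), size=N)  # row n is q(z | x_n)
p_z = rng.dirichlet(np.ones(K))                  # prior p(z)
log_lik = np.log(rng.uniform(0.1, 1.0, (N, K)))  # stand-in values for log p(x_n | z)

# Average per-observation ELBO: reconstruction minus KL(q(z | x_n) || p(z)).
recon = np.mean(np.sum(q_z_given_n * log_lik, axis=1))
avg_kl = np.mean(np.sum(q_z_given_n * np.log(q_z_given_n / p_z), axis=1))
elbo = recon - avg_kl

# The same quantity via the three-term decomposition.
q_z = q_z_given_n.mean(axis=0)                   # averaged (aggregate) q(z)
q_n_given_z = q_z_given_n / (N * q_z)            # q(n | z) by Bayes' rule; columns sum to 1
cond_entropy = -np.sum(q_z * np.sum(q_n_given_z * np.log(q_n_given_z), axis=0))
mutual_info = np.log(N) - cond_entropy           # term (2)
marginal_kl = np.sum(q_z * np.log(q_z / p_z))    # term (3)

print("average ELBO:     ", elbo)
print("(1) - (2) - (3):  ", recon - mutual_info - marginal_kl)
```

The check works because the averaged KL to the prior splits exactly into the mutual information between index and code plus the KL from the averaged \(q(z)\) to the prior.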

Another way to rewrite the ELBO is as \[\mathcal{L}(\vv{x},\theta,\phi)=\Ex_{q_{\phi}(\vv{z} \mid \vv{x})}\left[\log p_{\theta}(\vv{z}, \vv{x})\right]+\H\left[q_{\phi}(\vv{z} \mid \vv{x})\right].\] The log joint \(\log p_{\theta}(\vv{z}, \vv{x})\) is what physics people call the “negative energy”. This version highlights that a good posterior approximation \(q_{\phi}(\vv{z} \mid \vv{x})\) must assign most of its probability mass to regions of low energy (i.e. high joint probability). At the same time the entropy term in the ELBO prevents \(q_{\phi}(\vv{z} \mid \vv{x})\) from collapsing to an atom, unlike in, say, an MAP estimate.
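For completeness, this form is one step from the definition of the ELBO: split the log of the ratio and recognise the second expectation as the entropy of \(q_{\phi}\).

\[\begin{aligned} \mathcal{L}(\vv{x},\theta,\phi) &=\Ex_{q_{\phi}(\vv{z} \mid \vv{x})}\left[\log p_{\theta}(\vv{z}, \vv{x})-\log q_{\phi}(\vv{z} \mid \vv{x})\right]\\ &=\Ex_{q_{\phi}(\vv{z} \mid \vv{x})}\left[\log p_{\theta}(\vv{z}, \vv{x})\right]-\Ex_{q_{\phi}(\vv{z} \mid \vv{x})}\left[\log q_{\phi}(\vv{z} \mid \vv{x})\right]\\ &=\Ex_{q_{\phi}(\vv{z} \mid \vv{x})}\left[\log p_{\theta}(\vv{z}, \vv{x})\right]+\H\left[q_{\phi}(\vv{z} \mid \vv{x})\right] \end{aligned}\]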

Next thing: importance-weighted sampling in variational inference. Another recommendation from Yuge Shi: see Adam Kosiorek’s What is wrong with VAEs?, which gets to autoencoders via importance sampling.

Blei, David M., Alp Kucukelbir, and Jon D. McAuliffe. 2017. “Variational Inference: A Review for Statisticians.” Journal of the American Statistical Association 112 (518): 859–77. https://doi.org/10.1080/01621459.2017.1285773.
Burda, Yuri, Roger Grosse, and Ruslan Salakhutdinov. 2016. “Importance Weighted Autoencoders.” In. http://arxiv.org/abs/1509.00519.
Cremer, Chris, Quaid Morris, and David Duvenaud. 2017. “Reinterpreting Importance-Weighted Autoencoders.” In ICLR 2017. http://arxiv.org/abs/1704.02916.
Garis Matthews, Alexander Graeme de. 2017. “Scalable Gaussian Process Inference Using Variational Methods.” Thesis, University of Cambridge. https://doi.org/10.17863/CAM.25348.
Gelman, Andrew, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. 2013. Bayesian Data Analysis. 3 edition. Chapman & Hall/CRC Texts in Statistical Science. Boca Raton: Chapman and Hall/CRC.
Hoffman, Matthew D, and Matthew J Johnson. 2016. “ELBO Surgery: Yet Another Way to Carve up the Variational Evidence Lower Bound.” In Advances In Neural Information Processing Systems, 4. http://approximateinference.org/accepted/HoffmanJohnson2016.pdf.
Huggins, Jonathan H., Mikołaj Kasprzak, Trevor Campbell, and Tamara Broderick. 2019. “Practical Posterior Error Bounds from Variational Objectives.” October 9, 2019. http://arxiv.org/abs/1910.04102.
Jordan, Michael Irwin. 1999. Learning in Graphical Models. Cambridge, Mass.: MIT Press.
Kingma, Diederik P. 2017. “Variational Inference & Deep Learning: A New Synthesis.” https://www.dropbox.com/s/v6ua3d9yt44vgb3/cover_and_thesis.pdf?dl=0.
Kingma, Diederik P., and Max Welling. 2019. An Introduction to Variational Autoencoders. Vol. 12. Foundations and Trends in Machine Learning. Now Publishers, Inc. https://doi.org/10.1561/2200000056.
MacKay, David J C. 2002. Information Theory, Inference & Learning Algorithms. Cambridge University Press.
Mathieu, Emile, Tom Rainforth, N. Siddharth, and Yee Whye Teh. 2019. “Disentangling Disentanglement in Variational Autoencoders.” In International Conference on Machine Learning, 4402–12. PMLR. http://arxiv.org/abs/1812.02833.
Murphy, Kevin P. 2012. Machine Learning: A Probabilistic Perspective. 1 edition. Adaptive Computation and Machine Learning Series. Cambridge, MA: MIT Press.
Rainforth, Tom, Adam R. Kosiorek, Tuan Anh Le, Chris J. Maddison, Maximilian Igl, Frank Wood, and Yee Whye Teh. 2019. “Tighter Variational Bounds Are Not Necessarily Better.” March 5, 2019. http://arxiv.org/abs/1802.04537.
Roeder, Geoffrey, Yuhuai Wu, and David Duvenaud. 2017. “Sticking the Landing: Simple, Lower-Variance Gradient Estimators for Variational Inference.” May 28, 2017. http://arxiv.org/abs/1703.09194.
Tucker, George, Dieterich Lawson, Shixiang Gu, and Chris J. Maddison. 2018. “Doubly Reparameterized Gradient Estimators for Monte Carlo Objectives.” November 19, 2018. http://arxiv.org/abs/1810.04152.
Wainwright, Martin J., and Michael I. Jordan. 2008. Graphical Models, Exponential Families, and Variational Inference. Vol. 1. Foundations and Trends® in Machine Learning. Now Publishers. https://doi.org/10.1561/2200000001.
Wainwright, Martin, and Michael I Jordan. 2005. “A Variational Principle for Graphical Models.” In New Directions in Statistical Signal Processing. Vol. 155. MIT Press.

  1. Although in machine learning we tend to assume the dominating measure is Lebesgue, which, as de Garis Matthews (2017) shows, can get you into trouble.