A trick in e.g. variational inference, especially autoencoders, for density estimation in probabilistic deep learning, best summarised as “fancy change of variables”. Connections to optimal transport

NB all variables here are assumed to be *vectors* over \(\mathbb{R}\) unless
otherwise stated.
If I am writing about a random *variable* I write it \(\mathsf{x}\) and if I am
writing about the *state space* I write it \(x.\)
\(\mathsf{x}\sim p(x)\) should be read “The random variable \(\mathsf{x}\) has
density \(p(x)\) (over some implied state space \(\mathbb{R}^D\)).”

## For variational autoencoders

In variational autoencoders
we call it the *normalizing flow*.

Ingmar Shuster’s summary of the foundational (Rezende and Mohamed 2015) has the obvious rant about terminology:

The paper adopts the term

normalizing flowfor referring to the plain old change of variables formula for integrals.

Some of the literature is tidily summarized in the Sylvester flow paper (Berg et al. 2018). I have had the disseration (Papamakarios 2019) recommended to me a lot as a summary for this field, although I have not read it.

The setup here is that we are doing variational inference, particularly in a deep learning context. It would be convenient for us while doing inference to have a way of handling an approximate posterior density over the latent variables \(\mathsf{z}\) given the data \(\mathsf{x}\). The real posterior density \(p(z|x)\) is intractable, so we construct an approximate density \(q_{\phi}(z|x)\) parameterised by some variational parameters \(\phi\). We have certain desiderata. Specifically, we would to efficiently…

- calculate the density \(q_{\phi}(z|x)\), such that…
- we can estimate expectations/integrals with respect to \(q_{\phi}(\cdot|z)\) in the sense that we can estimate \(\int q_{\phi}(\cdot|z)f(z) \mathrm{d} z,\) which we will be satisfied to do via Monte Carlo, and so it will suffice if we can simulate from this density, and
- we can differentiate through this density with respect to the variational parameters to find \({\textstyle \frac{\partial }{\partial \phi }q_{\phi}(z|x)}.\)

Additionally we would like our method to be flexible enough that we can persuade ourselves that it is capable of approximating a complicated, messy posterior density such as might arise in the course of our inference; that is, we would like to buy all these convenient characteristics with the lowest possible cost in verisimilitude.

This kind of challenge arises naturally in the variational autoencoder problems (Kingma and Welling 2014) where those properties are enough to get us some affordable approximate inference. The whole problem in such a case would entail solving for the following fairly typical kind of approximate variational objective:

\[\begin{aligned} \log p_{ \theta }\left( x\right) &\geq \mathcal{L}\left( \theta , \phi ; x\right)\\ &=\mathbb{E}_{q_{\phi}( z | x )}\left[-\log q_{\phi}( z | x )+ \log p_{ \theta }( x , z )\right]\\ &=-\operatorname{KL}\left(q_{\phi}\left( z | x\right) \| p_{ \theta }( z )\right)+ \mathbb{E}_{q_{\phi}\left( z | x\right)}\left[ \log p_{ \theta }\left( x | z \right)\right]\\ &=-\mathcal{F}(\theta, \phi) \end{aligned}\]

Here \(p_{\theta}(x)\) is the marginal likelihood of the generative model for the data, \(p_{\theta}(x|z)\) the density of observations given the latent factors and \(\theta\) parameterises the density of the generative model. We pronounce \(\mathcal{F}(\theta, \phi)\) as free energy for reasons of tradition.

(For the next bit, we temporarily suppress the dependence on \(\mathsf{x}\) to avoid repetition, and the dependence of the transforms and density upon \({\phi}\).)

The reparameterization trick is an answer to those desiderata. We get the magical \(q\) of our dreams by requiring it to have a particular form, then using basic multivariate calculus to get the required properties.

Specifically, we assume that for some function (not density!) \(f:\mathbb{R}^D\to\mathbb{R}^D,\) that

\[\mathsf{z}=f(\mathsf{z}_0)\]

and that \(\mathsf{z}_0\sim q_{0}=\mathcal{N}(0,\mathrm{I} )\) (or some similarly easy distribution). It will turn out that by imposing some extra restrictions, we can do most of the heavy lifting in this algorithm by through this simple \(\mathsf{z}_0\sim q_0\) and still get the power of a fancy posterior \(\mathsf{z}\sim q.\)

Now we can calculate (sufficiently nice) expectations with respect to \(\mathsf{z}\) using the law of the unconscious statistician

\[\mathbb{E} f(\mathsf{z})=\int f(z) q_0(z) \mathrm{d} z.\]

However, we need to impose additional conditions to guarantee a tractable form for the densities; in particular we will impose the restriction that \(f\) is invertible, so that we can use the change of variables formula to find the density

\[q\left( z \right) =q_0( f^{-1}(z) )\left| \operatorname{det} \frac{\partial f^{-1}(z)}{\partial z }\right| =q_0( f^{-1}(z) )\left| \operatorname{det} \frac{\partial f(z)}{\partial z }\right|^{-1}.\]

We can economise on function inversions since we will be evaluating always via simulating \(\mathsf{z}\) from \(f(\mathsf{z}_0)\), i.e. \(\mathsf{z}:=f( \mathsf{z}_0)\), so we can write

\[q\left( \mathsf{z} \right) =q_0(\mathsf{z}_0 )\left| \operatorname{det} \frac{\partial f(\mathsf{z})}{\partial \mathsf{z} }\right|^{-1}.\]

We do not need, that is to say, \(f\) inversions (in this VAE context) just the Jacobian determinant, which might be easier. Spoiler: later on we construct some \(f\) transforms for which it is substantially easier to find that Jacobian determinant without inverting the function.

In our application, which may as well be the Variational autoencoder (VAE) we want this mapping to depend upon \(\mathsf{x}\) and the parameters \(\phi\) so we reintroduce that dependence now. We parameterize these functions \(f_{\phi}(\mathsf{z}):=f(\mathsf{z},\phi(\mathsf{x}));\) that is, our \(\phi\) parameters are a learned mapping from observation \(\mathsf{z}_0\) to a posterior sample from the latent variable \(\mathsf{z}\) with density \(q_{K}\left( z _{K} | x \right) \simeq p_\theta(z|x)\) such that, if we configure everything just right, \(q_{K}\left( z _{K} | x \right) \simeq p_\theta(z|x)\).

And indeed it is not too hard to find a recipe for configuring these parameters so that we have the best attainable approximation.

Noting that we may estimate the following derivatives \(\nabla_{\theta} \mathcal{F}( x )\) and \(\nabla_{\phi} \mathcal{F}( x )\) which is sufficient to minimise \(\mathcal{F}(\theta, \phi)\) by gradient descent.

In practice we are doing this in a big data context, so we will use stochastic subsamples from \(x\) to estimate the gradient and we will also use Monte Carlo simulations from \(\mathsf{z}_0\) to estimate the necessary integrals with respect to \(q_K(\cdot|\mathsf{x}),\) but you can read the paper for the details of that.

So far so good. But it turns out that this is not in general *very* tractable,
because determinants are notoriously expensive to calculate, scaling poorly –
\(\mathcal{O}(D^3)\) – in the
number of dimensions, which we will expect to be large in trendy neural network
type problems.

We look for restrictions on the form of \(f_{\phi}\) such that \(\operatorname{det} \frac{\partial f_{\phi}}{\partial z }\) is cheap and yet the approximate \(q\) they induce is still “flexible enough”.

The normalizing flow solution is to choose compositions of some class of *cheap*
functions

\[ \mathsf{z}_{K}=f_{K} \circ \ldots \circ f_{2} \circ f_{1}\left( \mathsf{z} _{0}\right)\sim q_K.\]

By induction on the change of variables formula (and using the logarithm for tidiness), we can find the density of the variable mapped through this flow as

\[\log q_{K}\left( \mathsf{z} _{K} | x \right) =\log q_{0}\left( \mathsf{z} _{0} | x \right) - \sum_{k=1}^{K} \log \left|\operatorname{det}\left(\frac{\partial f_{k}\left( \mathsf{z} _{k-1} \right)}{\partial \mathsf{z} _{k-1}}\right)\right| \]

Compositions of such cheap functions should also be a cheap-ish way of buying flexibility.

But which \(f_k\) mappings are cheap?

The archetypal one from (Rezende and Mohamed 2015) is the “planar flow”:

\[ f(\mathsf{z})= \mathsf{z} + u h\left( w ^{\top} \mathsf{z} +b\right)\]

where \(u , w \in \mathbb{R}^{D}, b \in \mathbb{R}\) and \(h:\mathbb{R}\to\mathbb{R}\) is some monotonic differentiable activation function.

There is a standard identity, the *matrix determinant lemma*
\(\operatorname{det}\left( \mathrm{I} + u v^{\top}\right)=1+ u ^{\top} v,\)
from which it follows

\[\begin{aligned} \operatorname{det} \frac{\partial f}{\partial z } &=\operatorname{det}\left( \mathrm{I} + u h^{\prime}\left( w ^{\top} z +b\right) w ^{\top}\right) \\ &=1+ u ^{\top} h^{\prime}\left( w ^{\top} z +b\right) w. \end{aligned}\] We can often simply parameterise acceptable domains for functions like these so that they remain invertible; For example if \(h\equiv \tanh\) a sufficient condition is \(u^\top w\geq 1.\) This means that we know it should be simple to parameterise these weights \(u,w,b\). Or, as in our application, that it is easy to construct functions \(\phi:x\mapsto u,w,b\) which are guaranteed to remain in an invertible domain.

For functions like this the determinant calculation is cheap, and does not depend at all on the ambient dimension of \(\mathsf{x}\). However, we might find it hard to persuade ourselves that this mapping is flexible enough to represent \(q\) well, at least not without letting \(K\) be large, as the mapping must pass through a univariate “bottleneck” \(w ^{\top} \mathsf{z}.\) Indeed, empirically this does not in fact perform well and a lot of time has been spent trying to do better.

A popular solution to this problem is given by the *Sylvester flow*
(Berg et al. 2018) which, instead of the *matrix determinant lemma* uses a
generalisation, *Sylvester’s determinant identity*. This tells us that for all
\(\mathrm{A} \in \mathbb{R}^{D \times M}, \mathrm{B} \in \mathbb{R}^{M \times R}\),

\[ \operatorname{det}\left( \mathrm{I}_{D}+ \mathrm{A} \mathrm{B} \right) =\operatorname{det}\left( \mathrm{I}_{M}+ \mathrm{B} \mathrm{A} \right)\] where \(\mathrm{I}_{M}\) and \(\mathrm{I}_{D}\) are respectively \(M\) and \(D\) dimensional identity matrices.

This suggests we might look at a generalized planar flow,

\[ f(\mathsf{z}):= \mathsf{z} + \mathrm{A} h\left(\mathrm{B} \mathsf{z} +b\right)\]

with \(\mathrm{A} \in \mathbb{R}^{D \times M}, \mathrm{B} \in \mathbb{R}^{M \times D}, b \in \mathbb{R}^{M},\) and \(M \leq D.\) The determinant calculation here scales as (\(\mathcal{O}(M^3)\ll \mathcal{O}(D^3),\) which is at least cheaper than the general case, and (we hope) gives us enough additional scope to design sufficiently flexible flows, since we have a bottleneck of size \(M \geq 1.\)

The price is an additional parameterisation problem. How do we select \(\mathrm{A}\) and \(\mathrm{B}\) such that they are still invertible for a given \(h\)? The solution in (Berg et al. 2018) is to break this into two simpler parameterization problems.

They choose \(f(\mathsf{z}):=\mathsf{z}+\mathrm{Q} \mathrm{R} h\left(\tilde{\mathrm{R}} \mathrm{Q}^{T} \mathsf{z}+\mathrm{b}\right)\) where \(\mathrm{R}\) and \(\tilde{\mathrm{R}}\) are upper triangular \(M \times M\) matrices, and \(Q\) is a \(D\times M\) matrix with orthonormal columns \(\mathrm{Q}=\left(\mathrm{q}_{1} \ldots \mathrm{q}_{M}\right).\) Using the Sylvester identity on this \(f\) we find

\[\operatorname{det} \frac{\partial f}{\partial z } =\operatorname{det}\left( \mathrm{I} _{M}+\operatorname{diag}\left(h^{\prime}\left(\tilde{\mathrm{R} } \mathrm{Q} ^{T} z + b \right)\right) \tilde{\mathrm{R} } \mathrm{R} \right)\]

They also show that if, in addition,

- \(h: \mathbb{R} \longrightarrow \mathbb{R}\) is a smooth function with bounded, positive derivative and
- if the diagonal entries of \(\mathrm{R}\) and \(\tilde{\mathrm{R}}\) satisfy \(r_{i i} \tilde{r}_{i i}>-1 /\left\|h^{\prime}\right\|_{\infty}\) and
- \(\tilde{\mathrm{R}}\) is invertible,

then \(f\) is invertible as required.

Now all we need to to action this do is choose a differentiable parameterisation of the upper-triangular \(\mathrm{R},\) the upper-triangular invertible \(\tilde{\mathrm{R} }\) with appropriate diagonal entries and the orthonormal matrix, \(Q\). That is a whole other story though.

The parallel with the problem of finding covariance kernels is interesting. In both cases we have some large function class we wish to parameterise so we can search over it, but we restrict it to a subset with computationally convenient properties and a simple domain. This probably arises in other nonparametrics?

## Normalized flows

In the special case where the flows are designed to be pre-normalized (by which I mean to always have a determinant of 1), this becomes a slightly more constrained problem as seen in, e.g. Hamiltonian MCMC and other learning-stable-systems models.

🏗

## Tutorials

Rui Shu explains change of variables in probability and shows how it induces the normalizing flow idea. PyMC3 example of a non-trivial example. Adam Kosiorek summarises some fancy variants of normalizing flow. Eric Jang did a tutorial which explains how this works in tensorflow. Praveen on Ruiz, Titsias, and Blei (2016).

Ambrosio, Luigi, Nicola Gigli, and Giuseppe Savare. 2008. *Gradient Flows: In Metric Spaces and in the Space of Probability Measures*. 2nd ed. Lectures in Mathematics. ETH Zürich. Birkhäuser Basel. https://www.springer.com/gp/book/9783764387211.

Bamler, Robert, and Stephan Mandt. 2017. “Structured Black Box Variational Inference for Latent Time Series Models,” July. http://arxiv.org/abs/1707.01069.

Berg, Rianne van den, Leonard Hasenclever, Jakub M. Tomczak, and Max Welling. 2018. “Sylvester Normalizing Flows for Variational Inference.” In *UAI18*. http://arxiv.org/abs/1803.05649.

Caterini, Anthony L., Arnaud Doucet, and Dino Sejdinovic. 2018. “Hamiltonian Variational Auto-Encoder.” In *Advances in Neural Information Processing Systems*. http://arxiv.org/abs/1805.11328.

Chen, Changyou, Chunyuan Li, Liqun Chen, Wenlin Wang, Yunchen Pu, and Lawrence Carin. 2017. “Continuous-Time Flows for Efficient Inference and Density Estimation,” September. http://arxiv.org/abs/1709.01179.

Chen, Tian Qi, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. 2018. “Neural Ordinary Differential Equations.” In *Advances in Neural Information Processing Systems 31*, edited by S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, 6572–83. Curran Associates, Inc. http://papers.nips.cc/paper/7892-neural-ordinary-differential-equations.pdf.

Grathwohl, Will, Ricky T. Q. Chen, Jesse Bettencourt, Ilya Sutskever, and David Duvenaud. 2018. “FFJORD: Free-Form Continuous Dynamics for Scalable Reversible Generative Models,” October. http://arxiv.org/abs/1810.01367.

Huang, Chin-Wei, David Krueger, Alexandre Lacoste, and Aaron Courville. 2018. “Neural Autoregressive Flows,” April. http://arxiv.org/abs/1804.00779.

Kingma, Diederik P., Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. 2016. “Improving Variational Inference with Inverse Autoregressive Flow.” In *Advances in Neural Information Processing Systems 29*. Curran Associates, Inc. http://arxiv.org/abs/1606.04934.

Kingma, Diederik P., Tim Salimans, and Max Welling. 2015. “Variational Dropout and the Local Reparameterization Trick,” June. http://arxiv.org/abs/1506.02557.

Kingma, Diederik P., and Max Welling. 2014. “Auto-Encoding Variational Bayes.” In *ICLR 2014 Conference*. http://arxiv.org/abs/1312.6114.

Kingma, Durk P, and Prafulla Dhariwal. 2018. “Glow: Generative Flow with Invertible 1x1 Convolutions.” In *Advances in Neural Information Processing Systems 31*, edited by S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, 10236–45. Curran Associates, Inc. http://papers.nips.cc/paper/8224-glow-generative-flow-with-invertible-1x1-convolutions.pdf.

Louizos, Christos, and Max Welling. 2017. “Multiplicative Normalizing Flows for Variational Bayesian Neural Networks.” In *PMLR*, 2218–27. http://proceedings.mlr.press/v70/louizos17a.html.

Marzouk, Youssef, Tarek Moselhy, Matthew Parno, and Alessio Spantini. 2016. “Sampling via Measure Transport: An Introduction.” In *Handbook of Uncertainty Quantification*, edited by Roger Ghanem, David Higdon, and Houman Owhadi, 1–41. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-319-11259-6_23-1.

Papamakarios, George. 2019. “Neural Density Estimation and Likelihood-Free Inference.” http://arxiv.org/abs/1910.13233.

Papamakarios, George, Iain Murray, and Theo Pavlakou. 2017. “Masked Autoregressive Flow for Density Estimation.” In *Advances in Neural Information Processing Systems 30*, edited by I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, 2338–47. Curran Associates, Inc. http://papers.nips.cc/paper/6828-masked-autoregressive-flow-for-density-estimation.pdf.

Rezende, Danilo Jimenez, and Shakir Mohamed. 2015. “Variational Inference with Normalizing Flows.” In *International Conference on Machine Learning*, 1530–8. ICML’15. Lille, France: JMLR.org. http://arxiv.org/abs/1505.05770.

Rezende, Danilo Jimenez, Shakir Mohamed, and Daan Wierstra. 2015. “Stochastic Backpropagation and Approximate Inference in Deep Generative Models.” In *Proceedings of ICML*. http://arxiv.org/abs/1401.4082.

Rippel, Oren, and Ryan Prescott Adams. 2013. “High-Dimensional Probability Estimation with Deep Density Models,” February. http://arxiv.org/abs/1302.5125.

Ruiz, Francisco J. R., Michalis K. Titsias, and David M. Blei. 2016. “The Generalized Reparameterization Gradient.” In *Advances in Neural Information Processing Systems*. http://arxiv.org/abs/1610.02287.

Spantini, Alessio, Daniele Bigoni, and Youssef Marzouk. 2017. “Inference via Low-Dimensional Couplings.” *Journal of Machine Learning Research* 19 (66): 2639–2709. http://arxiv.org/abs/1703.06131.

Tabak, E. G., and Cristina V. Turner. 2013. “A Family of Nonparametric Density Estimation Algorithms.” *Communications on Pure and Applied Mathematics* 66 (2): 145–64. https://doi.org/10.1002/cpa.21423.

Tabak, Esteban G., and Eric Vanden-Eijnden. 2010. “Density Estimation by Dual Ascent of the Log-Likelihood.” *Communications in Mathematical Sciences* 8 (1): 217–33. https://projecteuclid.org/euclid.cms/1266935020.

Wang, Prince Zizhuang, and William Yang Wang. 2019. “Riemannian Normalizing Flow on Variational Wasserstein Autoencoder for Text Modeling.” In *Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)*, 284–94. Minneapolis, Minnesota: Association for Computational Linguistics. https://doi.org/10.18653/v1/N19-1025.

Xu, Ming, Matias Quiroz, Robert Kohn, and Scott A. Sisson. 2018. “Variance Reduction Properties of the Reparameterization Trick,” September. http://arxiv.org/abs/1809.10330.

Zahm, Olivier, Paul Constantine, Clémentine Prieur, and Youssef Marzouk. 2018. “Gradient-Based Dimension Reduction of Multivariate Vector-Valued Functions,” January. http://arxiv.org/abs/1801.07922.