A trick in e.g. variational inference, especially autoencoders, for density estimation in probabilistic deep learning, best summarised as “fancy change of variables to that I can differentiate through the parameters of a distribution”. Connections to optimal transport and likelihood free inference in that this trick can enable some clever approximate-likelihood approaches.

Terminology note:

All variables here are assumed to be *vectors* over the field \(\mathbb{R}\) unless
otherwise stated.
If I am writing about a random *variable* I write it \(\mathsf{x}\) and if I am
writing about the *state space* I write it \(x.\)
\(\mathsf{x}\sim p(x)\) should be read “The random variable \(\mathsf{x}\) has
density \(p(x)\) (over some implied state space \(\mathbb{R}^D\)).”

## For variational autoencoders

In variational autoencoders
we call it the *normalizing flow*.

Ingmar Shuster’s summary of the foundational (Rezende and Mohamed 2015) has the obvious rant about terminology:

The paper adopts the term

normalizing flowfor referring to the plain old change of variables formula for integrals.

Some of the literature is tidily summarized in the Sylvester flow paper (Berg et al. 2018). I have had the disseration (Papamakarios 2019) recommended to me a lot as a summary for this field, although I have not read it.

The setup here is that we are doing variational inference, particularly in a deep learning context. It would be convenient for us while doing inference to have a way of handling an approximate posterior density over the latent variables \(\mathsf{z}\) given the data \(\mathsf{x}\). The real posterior density \(p(z|x)\) is intractable, so we construct an approximate density \(q_{\phi}(z|x)\) parameterised by some variational parameters \(\phi\). We have certain desiderata. Specifically, we would to efficiently…

- calculate the density \(q_{\phi}(z|x)\), such that…
- we can estimate expectations/integrals with respect to \(q_{\phi}(\cdot|z)\) in the sense that we can estimate \(\int q_{\phi}(\cdot|z)f(z) \mathrm{d} z,\) which we will be satisfied to do via Monte Carlo, and so it will suffice if we can simulate from this density, and
- we can differentiate through this density with respect to the variational parameters to find \({\textstyle \frac{\partial }{\partial \phi }q_{\phi}(z|x)}.\)

Additionally we would like our method to be flexible enough that we can persuade ourselves that it is capable of approximating a complicated, messy posterior density such as might arise in the course of our inference; that is, we would like to buy all these convenient characteristics with the lowest possible cost in verisimilitude.

This kind of challenge arises naturally in the variational autoencoder problems (Diederik P. Kingma and Welling 2014) where those properties are enough to get us some affordable approximate inference. The whole problem in such a case would entail solving for the following fairly typical kind of approximate variational objective:

\[\begin{aligned} \log p_{ \theta }\left( x\right) &\geq \mathcal{L}\left( \theta , \phi ; x\right)\\ &=\mathbb{E}_{q_{\phi}( z | x )}\left[-\log q_{\phi}( z | x )+ \log p_{ \theta }( x , z )\right]\\ &=-\operatorname{KL}\left(q_{\phi}\left( z | x\right) \| p_{ \theta }( z )\right)+ \mathbb{E}_{q_{\phi}\left( z | x\right)}\left[ \log p_{ \theta }\left( x | z \right)\right]\\ &=-\mathcal{F}(\theta, \phi) \end{aligned}\]

Here \(p_{\theta}(x)\) is the marginal likelihood of the generative model for the data, \(p_{\theta}(x|z)\) the density of observations given the latent factors and \(\theta\) parameterises the density of the generative model. We pronounce \(\mathcal{F}(\theta, \phi)\) as free energy for reasons of tradition.

(For the next bit, we temporarily suppress the dependence on \(\mathsf{x}\) to avoid repetition, and the dependence of the transforms and density upon \({\phi}\).)

The reparameterization trick is an answer to those desiderata. We get the magical \(q\) of our dreams by requiring it to have a particular form, then using basic multivariate calculus to get the required properties.

Specifically, we assume that for some function (not density!) \(f:\mathbb{R}^D\to\mathbb{R}^D,\) that

\[\mathsf{z}=f(\mathsf{z}_0)\]

and that \(\mathsf{z}_0\sim q_{0}=\mathcal{N}(0,\mathrm{I} )\) (or some similarly easy distribution). It will turn out that by imposing some extra restrictions, we can do most of the heavy lifting in this algorithm by through this simple \(\mathsf{z}_0\sim q_0\) and still get the power of a fancy posterior \(\mathsf{z}\sim q.\)

Now we can calculate (sufficiently nice) expectations with respect to \(\mathsf{z}\) using the law of the unconscious statistician

\[\mathbb{E} f(\mathsf{z})=\int f(z) q_0(z) \mathrm{d} z.\]

However, we need to impose additional conditions to guarantee a tractable form for the densities; in particular we will impose the restriction that \(f\) is invertible, so that we can use the change of variables formula to find the density

\[q\left( z \right) =q_0( f^{-1}(z) )\left| \operatorname{det} \frac{\partial f^{-1}(z)}{\partial z }\right| =q_0( f^{-1}(z) )\left| \operatorname{det} \frac{\partial f(z)}{\partial z }\right|^{-1}.\]

We can economise on function inversions since we will be evaluating always via simulating \(\mathsf{z}\) from \(f(\mathsf{z}_0)\), i.e. \(\mathsf{z}:=f( \mathsf{z}_0)\), so we can write

\[q\left( \mathsf{z} \right) =q_0(\mathsf{z}_0 )\left| \operatorname{det} \frac{\partial f(\mathsf{z})}{\partial \mathsf{z} }\right|^{-1}.\]

We do not need, that is to say, \(f\) inversions (in this VAE context) just the Jacobian determinant, which might be easier. Spoiler: later on we construct some \(f\) transforms for which it is substantially easier to find that Jacobian determinant without inverting the function.

In our application, which may as well be the Variational Autoencoder (VAE), why not, we want this mapping to depend upon \(\mathsf{x}\) and the parameters \(\phi\) so we reintroduce that dependence now. We parameterize these functions \(f_{\phi}(\mathsf{z}):=f(\mathsf{z},\phi(\mathsf{x}));\) that is, our \(\phi\) parameters are a learned mapping from observation \(\mathsf{z}_0\) to a posterior sample from the latent variable \(\mathsf{z}\) with density \(q_{K}\left( z _{K} | x \right) \simeq p_\theta(z|x)\) such that, if we configure everything just right, \(q_{K}\left( z _{K} | x \right) \simeq p_\theta(z|x)\).

And indeed it is not too hard to find a recipe for configuring these parameters so that we have the best attainable approximation.

Noting that, we may estimate the following derivatives \(\nabla_{\theta} \mathcal{F}( x )\) and \(\nabla_{\phi} \mathcal{F}( x )\) which is sufficient to minimise \(\mathcal{F}(\theta, \phi)\) by gradient descent.

In practice we are doing this in a big data context, so we will use stochastic subsamples from \(x\) to estimate the gradient and we will also use Monte Carlo simulations from \(\mathsf{z}_0\) to estimate the necessary integrals with respect to \(q_K(\cdot|\mathsf{x}),\) but you can read the paper for the details of that.

So far so good. But it turns out that this is not in general *very* tractable,
because determinants are notoriously expensive to calculate, scaling poorly —
\(\mathcal{O}(D^3)\) — in the
number of dimensions, which we will expect to be large in trendy neural network
type problems.

We look for restrictions on the form of \(f_{\phi}\) such that \(\operatorname{det} \frac{\partial f_{\phi}}{\partial z }\) is cheap and yet the approximate \(q\) they induce is still “flexible enough”.

The normalizing flow solution is to choose compositions of some class of *cheap*
functions

\[ \mathsf{z}_{K}=f_{K} \circ \ldots \circ f_{2} \circ f_{1}\left( \mathsf{z} _{0}\right)\sim q_K.\]

By induction on the change of variables formula (and using the logarithm for tidiness), we can find the density of the variable mapped through this flow as

\[\log q_{K}\left( \mathsf{z} _{K} | x \right) =\log q_{0}\left( \mathsf{z} _{0} | x \right) - \sum_{k=1}^{K} \log \left|\operatorname{det}\left(\frac{\partial f_{k}\left( \mathsf{z} _{k-1} \right)}{\partial \mathsf{z} _{k-1}}\right)\right| \]

Compositions of such cheap functions should also be a cheap-ish way of buying flexibility.

But which \(f_k\) mappings are cheap?

The archetypal one from (Rezende and Mohamed 2015) is the “planar flow”:

\[ f(\mathsf{z})= \mathsf{z} + u h\left( w ^{\top} \mathsf{z} +b\right)\]

where \(u , w \in \mathbb{R}^{D}, b \in \mathbb{R}\) and \(h:\mathbb{R}\to\mathbb{R}\) is some monotonic differentiable activation function.

There is a standard identity, the *matrix determinant lemma*
\(\operatorname{det}\left( \mathrm{I} + u v^{\top}\right)=1+ u ^{\top} v,\)
from which it follows

\[\begin{aligned} \operatorname{det} \frac{\partial f}{\partial z } &=\operatorname{det}\left( \mathrm{I} + u h^{\prime}\left( w ^{\top} z +b\right) w ^{\top}\right) \\ &=1+ u ^{\top} h^{\prime}\left( w ^{\top} z +b\right) w. \end{aligned}\] We can often simply parameterise acceptable domains for functions like these so that they remain invertible; For example if \(h\equiv \tanh\) a sufficient condition is \(u^\top w\geq 1.\) This means that we know it should be simple to parameterise these weights \(u,w,b\). Or, as in our application, that it is easy to construct functions \(\phi:x\mapsto u,w,b\) which are guaranteed to remain in an invertible domain.

For functions like this the determinant calculation is cheap, and does not depend at all on the ambient dimension of \(\mathsf{x}\). However, we might find it hard to persuade ourselves that this mapping is flexible enough to represent \(q\) well, at least not without letting \(K\) be large, as the mapping must pass through a univariate “bottleneck” \(w ^{\top} \mathsf{z}.\) Indeed, empirically this does not in fact perform well and a lot of time has been spent trying to do better.

A popular solution to this problem is given by the *Sylvester flow*
(Berg et al. 2018) which, instead of the *matrix determinant lemma* uses a
generalisation, *Sylvester’s determinant identity*. This tells us that for all
\(\mathrm{A} \in \mathbb{R}^{D \times M}, \mathrm{B} \in \mathbb{R}^{M \times R}\),

\[ \operatorname{det}\left( \mathrm{I}_{D}+ \mathrm{A} \mathrm{B} \right) =\operatorname{det}\left( \mathrm{I}_{M}+ \mathrm{B} \mathrm{A} \right)\] where \(\mathrm{I}_{M}\) and \(\mathrm{I}_{D}\) are respectively \(M\) and \(D\) dimensional identity matrices.

This suggests we might look at a generalized planar flow,

\[ f(\mathsf{z}):= \mathsf{z} + \mathrm{A} h\left(\mathrm{B} \mathsf{z} +b\right)\]

with \(\mathrm{A} \in \mathbb{R}^{D \times M}, \mathrm{B} \in \mathbb{R}^{M \times D}, b \in \mathbb{R}^{M},\) and \(M \leq D.\) The determinant calculation here scales as (\(\mathcal{O}(M^3)\ll \mathcal{O}(D^3),\) which is at least cheaper than the general case, and (we hope) gives us enough additional scope to design sufficiently flexible flows, since we have a bottleneck of size \(M \geq 1.\)

The price is an additional parameterisation problem. How do we select \(\mathrm{A}\) and \(\mathrm{B}\) such that they are still invertible for a given \(h\)? The solution in (Berg et al. 2018) is to break this into two simpler parameterization problems.

They choose \(f(\mathsf{z}):=\mathsf{z}+\mathrm{Q} \mathrm{R} h\left(\tilde{\mathrm{R}} \mathrm{Q}^{T} \mathsf{z}+\mathrm{b}\right)\) where \(\mathrm{R}\) and \(\tilde{\mathrm{R}}\) are upper triangular \(M \times M\) matrices, and \(Q\) is a \(D\times M\) matrix with orthonormal columns \(\mathrm{Q}=\left(\mathrm{q}_{1} \ldots \mathrm{q}_{M}\right).\) Using the Sylvester identity on this \(f\) we find

\[\operatorname{det} \frac{\partial f}{\partial z } =\operatorname{det}\left( \mathrm{I} _{M}+\operatorname{diag}\left(h^{\prime}\left(\tilde{\mathrm{R} } \mathrm{Q} ^{T} z + b \right)\right) \tilde{\mathrm{R} } \mathrm{R} \right)\]

They also show that if, in addition,

- \(h: \mathbb{R} \longrightarrow \mathbb{R}\) is a smooth function with bounded, positive derivative and
- if the diagonal entries of \(\mathrm{R}\) and \(\tilde{\mathrm{R}}\) satisfy \(r_{i i} \tilde{r}_{i i}>-1 /\left\|h^{\prime}\right\|_{\infty}\) and
- \(\tilde{\mathrm{R}}\) is invertible,

then \(f\) is invertible as required.

Now all we need to to action this do is choose a differentiable parameterisation of the upper-triangular \(\mathrm{R},\) the upper-triangular invertible \(\tilde{\mathrm{R} }\) with appropriate diagonal entries and the orthonormal matrix, \(Q\). That is a whole other story though.

The parallel with the problem of finding covariance kernels is interesting. In both cases we have some large function class we wish to parameterise so we can search over it, but we restrict it to a subset with computationally convenient properties and a simple domain. This probably arises in other nonparametrics?

## “Normalized” flows

In the special case where the flows are designed to be pre-normalized (by which I mean to always have a determinant of 1), this becomes a slightly more constrained problem as seen in, learning-stable-systems models.

🏗

## For vanilla density estimation

(Esteban G. Tabak and Vanden-Eijnden 2010; E. G. Tabak and Turner 2013)

🏗

## Representational power of

The most fundamental restriction of the normalizing flow paradigm is that each layer needs to be invertible. We ask whether this restriction has any ‘cost’ in terms of the size, and in particular the depth, of the model. Here we’re counting depth in terms of the number of the invertible transformations that make up the flow. A requirement for large depth would explain training difficulties due to exploding (or vanishing) gradients. Since the Jacobian of a composition of functions is the product of the Jacobians of the functions being composed, the min (max) singular value of the Jacobian of the composition is the product of the min (max) singular value of the Jacobians of the functions. This implies that the smallest (largest) singular value of the Jacobian will get exponentially smaller (larger) with the number of compositions.

A natural way of formalizing this question is by exhibiting a distribution which is easy to model for an unconstrained generator network but hard for a shallow normalizing flow. Precisely, we ask: is there a probability distribution that can be represented by a shallow generator with a small number of parameters that could not be approximately represented by a shallow composition of invertible transformations?

We demonstrate that such a distribution exists.

To reiterate the takeaway: a GLOW-style linear layer in between affine couplings could in theory make your network between 5 and 47 times smaller while representing the same function. We now have a precise understanding of the value of that architectural choice!

## Tutorials

Eric Jang on normalizing flows

Rui Shu explains change of variables in probability and shows how it induces the normalizing flow idea. PyMC3 example of a non-trivial example. Adam Kosiorek summarises some fancy variants of normalizing flow. Eric Jang did a tutorial which explains how this works in Tensorflow. Praveen on Ruiz, Titsias, and Blei (2016). Yuge Shi’s variational inference tutorial is a tour of cunning reparameterisation gradient tricks. Written for her paper Shi et al. (2019). She punts some details to Mohamed et al. (2020) which in turn tells me I need to read Figurnov, Mohamed, and Mnih (2018), Devroye (2006) and Jankowiak and Obermeyer (2018).

## References

*Gradient Flows: In Metric Spaces and in the Space of Probability Measures*. 2nd ed. Lectures in Mathematics. ETH Zürich. Birkhäuser Basel. https://www.springer.com/gp/book/9783764387211.

*Uai18*. http://arxiv.org/abs/1803.05649.

*Advances in Neural Information Processing Systems*. http://arxiv.org/abs/1805.11328.

*Advances in Neural Information Processing Systems 31*, edited by S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, 6572–83. Curran Associates, Inc. http://papers.nips.cc/paper/7892-neural-ordinary-differential-equations.pdf.

*Simulation*, edited by Shane G. Henderson and Barry L. Nelson, 13:83–121. Handbooks in Operations Research and Management Science. Elsevier. https://doi.org/10.1016/S0927-0507(06)13004-2.

*Advances In Neural Information Processing Systems*. https://openreview.net/forum?id=SyPNSAW5.

*Advances in Neural Information Processing Systems 31*, edited by S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, 441–52. Curran Associates, Inc. http://papers.nips.cc/paper/7326-implicit-reparameterization-gradients.pdf.

*International Conference on Machine Learning*, 2235–44. http://proceedings.mlr.press/v80/jankowiak18a.html.

*Advances in Neural Information Processing Systems 29*. Curran Associates, Inc. http://arxiv.org/abs/1606.04934.

*ICLR 2014 Conference*. http://arxiv.org/abs/1312.6114.

*Advances in Neural Information Processing Systems 31*, edited by S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, 10236–45. Curran Associates, Inc. http://papers.nips.cc/paper/8224-glow-generative-flow-with-invertible-1x1-convolutions.pdf.

*PMLR*, 2218–27. http://proceedings.mlr.press/v70/louizos17a.html.

*Advances in Neural Information Processing Systems*. Vol. 33. https://proceedings.neurips.cc//paper_files/paper/2020/hash/3fb04953d95a94367bb133f862402bce-Abstract.html.

*Handbook of Uncertainty Quantification*, edited by Roger Ghanem, David Higdon, and Houman Owhadi, 1–41. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-319-11259-6_23-1.

*Journal of Machine Learning Research*21 (132): 1–62. http://jmlr.org/papers/v21/19-346.html.

*Advances in Neural Information Processing Systems 30*, edited by I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, 2338–47. Curran Associates, Inc. http://papers.nips.cc/paper/6828-masked-autoregressive-flow-for-density-estimation.pdf.

*Neural Computation*29 (5): 1151–1203. https://doi.org/10.1162/NECO_a_00947.

*International Conference on Machine Learning*, 1530–38. ICML’15. Lille, France: JMLR.org. http://arxiv.org/abs/1505.05770.

*Proceedings of ICML*. http://arxiv.org/abs/1401.4082.

*Advances In Neural Information Processing Systems*. http://arxiv.org/abs/1610.02287.

*Journal of Machine Learning Research*19 (66): 2639–709. http://arxiv.org/abs/1703.06131.

*Communications on Pure and Applied Mathematics*66 (2): 145–64. https://doi.org/10.1002/cpa.21423.

*Communications in Mathematical Sciences*8 (1): 217–33. https://projecteuclid.org/euclid.cms/1266935020.

*Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)*, 284–94. Minneapolis, Minnesota: Association for Computational Linguistics. https://doi.org/10.18653/v1/N19-1025.