# Reparameterization tricks in inference

## Normalizing flows, invertible density models, inference by measure transport, low-dimensional coupling… Approximating the desired distribution by perturbation of the available distribution

A trick in e.g. variational inference, especially autoencoders, for density estimation in probabilistic deep learning, best summarised as “fancy change of variables so that I can differentiate through the parameters of a distribution”. Connections to optimal transport and likelihood-free inference, in that this trick can enable some clever approximate-likelihood approaches.

Terminology note:

All variables here are assumed to take values in $$\mathbb{R}^D$$. If I am writing about a random *variable* I write it $$\mathsf{x}$$ and if I am writing about the realized values, I write it $$x.$$ $$\mathsf{x}\sim p(x)$$ should be read “The random variable $$\mathsf{x}$$ has density $$p(x)$$.”1

## For variational autoencoders

In variational autoencoders we call it the normalizing flow.

Ingmar Shuster’s summary of the foundational paper has the obvious rant about terminology:

The paper adopts the term normalizing flow for referring to the plain old change of variables formula for integrals.

Some of the literature is tidily summarized in the Sylvester flow paper. I have had the dissertation recommended to me a lot as a summary for this field, although I have not read it.

The setup here is that we are doing variational inference, particularly in a deep learning context. It would be convenient for us while doing inference to have a way of handling an approximate posterior density over the latent variables $$\mathsf{z}$$ given the data $$\mathsf{x}$$. The real posterior density $$p(z|x)$$ is intractable, so we construct an approximate density $$q_{\phi}(z|x)$$ parameterised by some variational parameters $$\phi$$. We have certain desiderata. Specifically, we would like to efficiently…

1. calculate the density $$q_{\phi}(z|x)$$, such that…
2. we can estimate expectations/integrals with respect to $$q_{\phi}(\cdot|x)$$ in the sense that we can estimate $$\int q_{\phi}(z|x)g(z) \mathrm{d} z$$ for some function $$g$$ of interest, which we will be satisfied to do via Monte Carlo, and so it will suffice if we can simulate from this density, and
3. we can differentiate through this density with respect to the variational parameters to find $${\textstyle \frac{\partial }{\partial \phi }q_{\phi}(z|x)}.$$

Additionally we would like our method to be flexible enough that we can persuade ourselves that it is capable of approximating a complicated, messy posterior density such as might arise in the course of our inference; that is, we would like to buy all these convenient characteristics with the lowest possible cost in verisimilitude.

This kind of challenge arises naturally in the variational autoencoder problems where those properties are enough to get us some affordable approximate inference. The whole problem in such a case would entail solving for the following fairly typical kind of approximate variational objective:

$\begin{aligned} \log p_{ \theta }\left( x\right) &\geq \mathcal{L}\left( \theta , \phi ; x\right)\\ &=\mathbb{E}_{q_{\phi}( z | x )}\left[-\log q_{\phi}( z | x )+ \log p_{ \theta }( x , z )\right]\\ &=-\operatorname{KL}\left(q_{\phi}\left( z | x\right) \| p_{ \theta }( z )\right)+ \mathbb{E}_{q_{\phi}\left( z | x\right)}\left[ \log p_{ \theta }\left( x | z \right)\right]\\ &=-\mathcal{F}(\theta, \phi) \end{aligned}$

Here $$p_{\theta}(x)$$ is the marginal likelihood of the generative model for the data, $$p_{\theta}(x|z)$$ the density of observations given the latent factors and $$\theta$$ parameterises the density of the generative model. We pronounce $$\mathcal{F}(\theta, \phi)$$ as free energy for reasons of tradition.
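To make the bound concrete, here is a minimal sketch of a Monte Carlo estimate of $$\mathcal{L}$$ for a toy model of my own choosing (not from any of the papers): $$\mathsf{z}\sim\mathcal{N}(0,1)$$ and $$\mathsf{x}|\mathsf{z}\sim\mathcal{N}(z,1)$$, so that the exact posterior is $$\mathcal{N}(x/2, 1/2)$$ and the evidence is $$\mathcal{N}(0,2)$$. The function names are hypothetical. When $$q$$ is the exact posterior the bound should be tight; when $$q$$ is something worse, it should be strictly below $$\log p(x)$$.

```python
import math
import random


def log_normal_pdf(v, mean, var):
    """Log density of a univariate Gaussian N(mean, var) at v."""
    return -0.5 * (math.log(2 * math.pi * var) + (v - mean) ** 2 / var)


def elbo_estimate(x, q_mean, q_var, n_samples=2000, seed=0):
    """Monte Carlo estimate of E_q[-log q(z|x) + log p(x, z)].

    Toy model (an assumption for illustration): z ~ N(0, 1), x | z ~ N(z, 1),
    with a Gaussian approximate posterior q(z|x) = N(q_mean, q_var).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        z = rng.gauss(q_mean, math.sqrt(q_var))
        log_joint = log_normal_pdf(z, 0.0, 1.0) + log_normal_pdf(x, z, 1.0)
        total += log_joint - log_normal_pdf(z, q_mean, q_var)
    return total / n_samples


x = 1.3
log_evidence = log_normal_pdf(x, 0.0, 2.0)  # exact log p(x) for this toy model
tight = elbo_estimate(x, x / 2, 0.5)         # q is the exact posterior
loose = elbo_estimate(x, 0.0, 1.0)           # q is the prior, a worse approximation
```

With the exact posterior the integrand is constant in $$z$$, so `tight` matches `log_evidence` up to floating point error; `loose` falls short by (an estimate of) the KL divergence from the prior to the posterior.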

(For the next bit, we temporarily suppress the dependence on $$\mathsf{x}$$ to avoid repetition, and the dependence of the transforms and density upon $${\phi}$$.)

The reparameterization trick is an answer to those desiderata. We get the magical $$q$$ of our dreams by requiring it to have a particular form, then using basic multivariate calculus to approximately enforce the required properties.

Specifically, we assume that, for some function (not density!) $$f:\mathbb{R}^D\to\mathbb{R}^D,$$

$\mathsf{z}=f(\mathsf{z}_0)$

and that $$\mathsf{z}_0\sim q_{0}=\mathcal{N}(0,\mathrm{I} )$$ (or some similarly easy distribution). It will turn out that by imposing some extra restrictions, we can do most of the heavy lifting in this algorithm through this simple $$\mathsf{z}_0\sim q_0$$ and still get the power of a fancy posterior $$\mathsf{z}\sim q.$$

Now we can calculate (sufficiently nice) expectations with respect to $$\mathsf{z}$$ using the law of the unconscious statistician: for a test function $$g$$,

$\mathbb{E}\, g(\mathsf{z})=\int g(f(z_0)) q_0(z_0) \mathrm{d} z_0.$
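In code, this amounts to pushing base samples through the transform and averaging. A minimal one-dimensional sketch, assuming (purely for illustration) the transform $$f=\exp$$, so that $$\mathsf{z}$$ is log-normal with known mean $$e^{1/2}$$, and the test function $$g(z)=z$$; all names are hypothetical.

```python
import math
import random


def f(z0):
    """Illustrative transform (an assumption): z = exp(z0), so z is log-normal."""
    return math.exp(z0)


def g(z):
    """Illustrative test function whose expectation we want."""
    return z


def expectation_under_q(g, f, n_samples=200_000, seed=0):
    """Estimate E[g(z)] for z = f(z0), z0 ~ N(0, 1), without ever evaluating q."""
    rng = random.Random(seed)
    return sum(g(f(rng.gauss(0.0, 1.0))) for _ in range(n_samples)) / n_samples


estimate = expectation_under_q(g, f)
exact = math.exp(0.5)  # mean of a standard log-normal, for comparison
```

Note that the density of $$\mathsf{z}$$ never appears; we only need to simulate from the base distribution and evaluate $$f$$.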

However, we need to impose additional conditions to guarantee a tractable form for the densities; in particular we will impose the restriction that $$f$$ is invertible, so that we can use the change of variables formula to find the density

$q\left( z \right) =q_0( f^{-1}(z) )\left| \operatorname{det} \frac{\partial f^{-1}(z)}{\partial z }\right| =q_0( f^{-1}(z) )\left| \operatorname{det} \frac{\partial f(z_0)}{\partial z_0 }\right|^{-1}$ where, in the last expression, the Jacobian of $$f$$ is evaluated at $$z_0=f^{-1}(z).$$

We can economise on function inversions, since we always evaluate the density at points simulated as $$\mathsf{z}:=f( \mathsf{z}_0)$$, so we can write

$q\left( \mathsf{z} \right) =q_0(\mathsf{z}_0 )\left| \operatorname{det} \frac{\partial f(\mathsf{z}_0)}{\partial \mathsf{z}_0 }\right|^{-1}.$

That is to say, in this VAE context we do not need to invert $$f$$, just to find its Jacobian determinant, which might be easier. Spoiler: later on we construct some $$f$$ transforms for which it is substantially easier to find that Jacobian determinant without inverting the function.
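A minimal NumPy sketch of this inversion-free evaluation, assuming an affine transform $$f(z_0)=\mu+\mathrm{L}z_0$$ (my choice, because the answer is then a Gaussian density we can check directly). The names `mu`, `L` and `log_q0` are hypothetical.

```python
import numpy as np


def log_q0(z0):
    """Log density of the base distribution N(0, I) at z0."""
    return -0.5 * (z0.size * np.log(2 * np.pi) + z0 @ z0)


rng = np.random.default_rng(0)
D = 3
mu = rng.normal(size=D)
# Lower-triangular L with a positive diagonal, so f(z0) = mu + L z0 is invertible.
L = np.tril(rng.normal(size=(D, D)), k=-1) + np.diag(np.exp(rng.normal(size=D)))

z0 = rng.normal(size=D)                  # simulate from the base density
z = mu + L @ z0                          # z := f(z0); the Jacobian of f is L
_, log_abs_det = np.linalg.slogdet(L)
log_q_flow = log_q0(z0) - log_abs_det    # no call to f^{-1} was needed

# Independent reference: z ~ N(mu, L L^T), evaluated from the Gaussian formula.
Sigma = L @ L.T
resid = z - mu
log_q_direct = -0.5 * (
    D * np.log(2 * np.pi)
    + np.linalg.slogdet(Sigma)[1]
    + resid @ np.linalg.solve(Sigma, resid)
)
```

The two log densities agree, but only the reference route had to solve a linear system in terms of $$z$$; the flow route reused the $$z_0$$ we simulated.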

In our application, which may as well be the Variational Autoencoder (VAE), why not, we want this mapping to depend upon $$\mathsf{x}$$ and the parameters $$\phi$$, so we reintroduce that dependence now. We parameterize these functions $$f_{\phi}(\mathsf{z}):=f(\mathsf{z},\phi(\mathsf{x}));$$ that is, $$\phi$$ is a learned mapping from an observation $$\mathsf{x}$$ to the parameters of a transform that takes the base sample $$\mathsf{z}_0$$ to a posterior sample of the latent variable, with density $$q_{K}\left( z _{K} | x \right)$$ such that, if we configure everything just right, $$q_{K}\left( z _{K} | x \right) \simeq p_\theta(z|x)$$.

Figure: The VAE (modified from Rezende and Mohamed).

And indeed it is not too hard to find a recipe for configuring these parameters so that we have the best attainable approximation.

With that in hand, we may estimate the derivatives $$\nabla_{\theta} \mathcal{F}( x )$$ and $$\nabla_{\phi} \mathcal{F}( x )$$, which is sufficient to minimise $$\mathcal{F}(\theta, \phi)$$ by gradient descent.

In practice we are doing this in a big data context, so we will use stochastic subsamples from $$x$$ to estimate the gradient and we will also use Monte Carlo simulations from $$\mathsf{z}_0$$ to estimate the necessary integrals with respect to $$q_K(\cdot|\mathsf{x}),$$ but you can read the paper for the details of that.
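The gradient estimate this relies on can be sketched for a toy objective. This is not the VAE free energy, just an illustrative stand-in of my own: we estimate the gradient of $$\mathbb{E}[(\mathsf{z}-2)^2]$$ with respect to the parameters of $$\mathsf{z}=m+e^{\ell}\mathsf{z}_0$$, $$\mathsf{z}_0\sim\mathcal{N}(0,1)$$, by differentiating through the sampled path. The exact answers, $$2(m-2)$$ and $$2e^{2\ell}$$, are available here for comparison.

```python
import math
import random


def pathwise_gradients(m, log_s, n_samples=100_000, seed=0):
    """Estimate gradients of E[(z - 2)^2] w.r.t. m and log_s.

    z = m + exp(log_s) * eps with eps ~ N(0, 1); the randomness lives in eps,
    so we can differentiate each sample of the loss with the chain rule.
    """
    rng = random.Random(seed)
    s = math.exp(log_s)
    grad_m = 0.0
    grad_log_s = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)
        z = m + s * eps
        dloss_dz = 2.0 * (z - 2.0)
        grad_m += dloss_dz                 # dz/dm = 1
        grad_log_s += dloss_dz * s * eps   # dz/dlog_s = s * eps
    return grad_m / n_samples, grad_log_s / n_samples


m, log_s = 0.5, 0.0
grad_m, grad_log_s = pathwise_gradients(m, log_s)
exact_m = 2.0 * (m - 2.0)
exact_log_s = 2.0 * math.exp(2.0 * log_s)
```

In a real model an autodiff framework does the chain rule for us; the point is only that the sampling step has been moved outside the parameters.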

So far so good. But it turns out that this is not in general very tractable, because determinants are notoriously expensive to calculate, scaling poorly — $$\mathcal{O}(D^3)$$ — in the number of dimensions, which we will expect to be large in trendy neural network type problems.

We look for restrictions on the form of $$f_{\phi}$$ such that $$\operatorname{det} \frac{\partial f_{\phi}}{\partial z }$$ is cheap and yet the approximate $$q$$ they induce is still “flexible enough”.

The normalizing flow solution is to choose compositions of some class of cheap functions $\mathsf{z}_{K}=f_{K} \circ \ldots \circ f_{2} \circ f_{1}\left( \mathsf{z} _{0}\right)\sim q_K.$

By induction on the change of variables formula (and using the logarithm for tidiness), we can find the density of the variable mapped through this flow as

$\log q_{K}\left( \mathsf{z} _{K} | x \right) =\log q_{0}\left( \mathsf{z} _{0} | x \right) - \sum_{k=1}^{K} \log \left|\operatorname{det}\left(\frac{\partial f_{k}\left( \mathsf{z} _{k-1} \right)}{\partial \mathsf{z} _{k-1}}\right)\right|$
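The bookkeeping in this formula is just a running sum of log-determinants. A minimal NumPy sketch, assuming two illustrative elementwise layer types of my own choosing (an affine map and $$z\mapsto z+c\tanh z$$), each of which returns its output together with the log absolute Jacobian determinant at its input:

```python
import numpy as np


def affine(a, b):
    """Elementwise affine layer z -> a * z + b with its log|det Jacobian|."""
    def layer(z):
        return a * z + b, np.sum(np.log(np.abs(a)))
    return layer


def tanh_residual(c):
    """Elementwise layer z -> z + c * tanh(z); invertible for c > -1."""
    def layer(z):
        return z + c * np.tanh(z), np.sum(np.log1p(c * (1.0 - np.tanh(z) ** 2)))
    return layer


def flow_log_density(z0, layers):
    """Return z_K and log q_K(z_K) = log q_0(z_0) - sum_k log|det J_k(z_{k-1})|."""
    log_q = -0.5 * (z0.size * np.log(2 * np.pi) + z0 @ z0)  # base N(0, I)
    z = z0
    for layer in layers:
        z, log_abs_det = layer(z)  # Jacobian is evaluated at the layer input
        log_q -= log_abs_det
    return z, log_q


layers = [
    affine(np.array([2.0, 0.5, -1.5]), np.array([0.1, -0.2, 0.3])),
    tanh_residual(0.5),
    affine(np.array([1.0, -0.7, 3.0]), np.array([0.0, 1.0, -1.0])),
]
z0 = np.array([0.3, -1.2, 0.7])
zK, log_qK = flow_log_density(z0, layers)
```

Each layer only needs to know its own Jacobian; the composition never forms the Jacobian of the whole flow.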

Compositions of such cheap functions should also be a cheap-ish way of buying flexibility.

But which $$f_k$$ mappings are cheap?

The archetypal one, from Rezende and Mohamed (2015), is the “planar flow”:

$f(\mathsf{z})= \mathsf{z} + u h\left( w ^{\top} \mathsf{z} +b\right)$

where $$u , w \in \mathbb{R}^{D}, b \in \mathbb{R}$$ and $$h:\mathbb{R}\to\mathbb{R}$$ is some monotonic differentiable activation function.

Figure: Rezende and Mohamed’s (2015) planar flows applied to some latent distributions.

There is a standard identity, the matrix determinant lemma $$\operatorname{det}\left( \mathrm{I} + u v^{\top}\right)=1+ u ^{\top} v,$$ from which it follows

$\begin{aligned} \operatorname{det} \frac{\partial f}{\partial z } &=\operatorname{det}\left( \mathrm{I} + u h^{\prime}\left( w ^{\top} z +b\right) w ^{\top}\right) \\ &=1+ u ^{\top} h^{\prime}\left( w ^{\top} z +b\right) w. \end{aligned}$

We can often simply parameterise acceptable domains for functions like these so that they remain invertible; for example, if $$h\equiv \tanh$$ a sufficient condition is $$u^\top w\geq -1.$$ This means that it should be simple to parameterise these weights $$u,w,b$$ or, as in our application, to construct functions $$\phi:x\mapsto u,w,b$$ which are guaranteed to remain in an invertible domain.
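A minimal NumPy sketch of one planar flow step with $$h=\tanh$$. The determinant uses the matrix determinant lemma, and `constrain_u` is one common way (as far as I know, along the lines of what Rezende and Mohamed suggest) to nudge $$u$$ so that $$w^\top u>-1$$; treat the function names and that attribution as my assumptions.

```python
import numpy as np


def constrain_u(u, w):
    """Shift u along w so that w^T u > -1 (sufficient for invertibility if h = tanh)."""
    wu = w @ u
    m = -1.0 + np.log1p(np.exp(wu))  # m(x) = -1 + softplus(x), always > -1
    return u + (m - wu) * w / (w @ w)


def planar_flow(z, u, w, b):
    """Apply f(z) = z + u tanh(w^T z + b); return f(z) and log|det df/dz|."""
    a = w @ z + b
    h_prime = 1.0 - np.tanh(a) ** 2
    # Matrix determinant lemma: det(I + h' u w^T) = 1 + h' u^T w, an O(D) computation.
    log_abs_det = np.log(np.abs(1.0 + h_prime * (u @ w)))
    return z + u * np.tanh(a), log_abs_det


rng = np.random.default_rng(0)
D = 5
u, w, b = rng.normal(size=D), rng.normal(size=D), 0.3
u_hat = constrain_u(u, w)
z = rng.normal(size=D)
fz, log_abs_det = planar_flow(z, u_hat, w, b)
```

Comparing `log_abs_det` against `np.linalg.slogdet` of the full $$D\times D$$ Jacobian is a cheap sanity check when implementing this for real.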

For functions like this the determinant calculation is cheap: it needs only inner products of $$D$$-vectors, so it costs $$\mathcal{O}(D)$$ rather than the $$\mathcal{O}(D^3)$$ of a general determinant. However, we might find it hard to persuade ourselves that this mapping is flexible enough to represent $$q$$ well, at least not without letting $$K$$ be large, as the mapping must pass through a univariate “bottleneck” $$w ^{\top} \mathsf{z}.$$ Indeed, empirically this does not in fact perform well and a lot of time has been spent trying to do better.

A popular solution to this problem is given by the Sylvester flow which, instead of the matrix determinant lemma, uses a generalisation, Sylvester’s determinant identity. This tells us that for all $$\mathrm{A} \in \mathbb{R}^{D \times M}, \mathrm{B} \in \mathbb{R}^{M \times D}$$,

$\operatorname{det}\left( \mathrm{I}_{D}+ \mathrm{A} \mathrm{B} \right) =\operatorname{det}\left( \mathrm{I}_{M}+ \mathrm{B} \mathrm{A} \right)$ where $$\mathrm{I}_{M}$$ and $$\mathrm{I}_{D}$$ are respectively $$M$$ and $$D$$ dimensional identity matrices.

This suggests we might look at a generalized planar flow,

$f(\mathsf{z}):= \mathsf{z} + \mathrm{A} h\left(\mathrm{B} \mathsf{z} +b\right)$

with $$\mathrm{A} \in \mathbb{R}^{D \times M}, \mathrm{B} \in \mathbb{R}^{M \times D}, b \in \mathbb{R}^{M},$$ and $$M \leq D.$$ The determinant calculation here scales as $$\mathcal{O}(M^3)\ll \mathcal{O}(D^3),$$ which is at least cheaper than the general case, and (we hope) gives us enough additional scope to design sufficiently flexible flows, since we have a bottleneck of size $$M \geq 1.$$
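A quick numerical check of that saving, as a NumPy sketch with $$h=\tanh$$ and randomly drawn $$\mathrm{A},\mathrm{B},b$$ (all values here are arbitrary assumptions for illustration). The Jacobian of the generalized planar flow is $$\mathrm{I}_D+\mathrm{A}\operatorname{diag}(h')\mathrm{B}$$, and Sylvester’s identity lets us take the determinant of an $$M\times M$$ matrix instead.

```python
import numpy as np

rng = np.random.default_rng(1)
D, M = 50, 3
A = 0.1 * rng.normal(size=(D, M))
B = 0.1 * rng.normal(size=(M, D))
b = rng.normal(size=M)
z = rng.normal(size=D)

# h = tanh, so h' = 1 - tanh^2, evaluated at the pre-activation B z + b.
h_prime = 1.0 - np.tanh(B @ z + b) ** 2

# Full D x D Jacobian of f(z) = z + A tanh(B z + b): I_D + A diag(h') B.
jac_full = np.eye(D) + A @ (h_prime[:, None] * B)
# Sylvester's identity moves the determinant to M x M: I_M + diag(h') B A.
jac_small = np.eye(M) + h_prime[:, None] * (B @ A)

sign_full, logdet_full = np.linalg.slogdet(jac_full)
sign_small, logdet_small = np.linalg.slogdet(jac_small)
```

Forming $$\mathrm{B}\mathrm{A}$$ still costs $$\mathcal{O}(M^2 D)$$, so the total is not free of $$D$$, but it is linear in $$D$$ rather than cubic.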

The price is an additional parameterisation problem. How do we select $$\mathrm{A}$$ and $$\mathrm{B}$$ such that $$f$$ is still invertible for a given $$h$$? The solution in the Sylvester flow paper is to break this into two simpler parameterization problems.

They choose $$f(\mathsf{z}):=\mathsf{z}+\mathrm{Q} \mathrm{R} h\left(\tilde{\mathrm{R}} \mathrm{Q}^{T} \mathsf{z}+\mathrm{b}\right)$$ where $$\mathrm{R}$$ and $$\tilde{\mathrm{R}}$$ are upper triangular $$M \times M$$ matrices, and $$Q$$ is a $$D\times M$$ matrix with orthonormal columns $$\mathrm{Q}=\left(\mathrm{q}_{1} \ldots \mathrm{q}_{M}\right).$$ Using the Sylvester identity on this $$f$$ we find

$\operatorname{det} \frac{\partial f}{\partial z } =\operatorname{det}\left( \mathrm{I} _{M}+\operatorname{diag}\left(h^{\prime}\left(\tilde{\mathrm{R} } \mathrm{Q} ^{T} z + b \right)\right) \tilde{\mathrm{R} } \mathrm{R} \right)$

They also show that if, in addition,

1. $$h: \mathbb{R} \longrightarrow \mathbb{R}$$ is a smooth function with bounded, positive derivative and
2. the diagonal entries of $$\mathrm{R}$$ and $$\tilde{\mathrm{R}}$$ satisfy $$r_{i i} \tilde{r}_{i i}>-1 /\left\|h^{\prime}\right\|_{\infty}$$ and
3. $$\tilde{\mathrm{R}}$$ is invertible,

then $$f$$ is invertible as required.
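Because $$\tilde{\mathrm{R}}\mathrm{R}$$ is upper triangular, the $$M\times M$$ determinant above collapses to a product over diagonal entries, $$\prod_i (1+h'_i \tilde{r}_{ii} r_{ii})$$. A minimal NumPy sketch checking this against the full Jacobian, with $$h=\tanh$$ (so $$\|h'\|_\infty=1$$) and randomly drawn matrices whose diagonals I force positive to satisfy the conditions; all names and values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
D, M = 20, 4
Q, _ = np.linalg.qr(rng.normal(size=(D, M)))  # D x M with orthonormal columns
R = np.triu(rng.normal(size=(M, M)))
R_tilde = np.triu(rng.normal(size=(M, M)))
# Positive diagonals give r_ii * r~_ii > 0 > -1 and make R~ invertible.
np.fill_diagonal(R, np.abs(np.diag(R)) + 0.1)
np.fill_diagonal(R_tilde, np.abs(np.diag(R_tilde)) + 0.1)
b = rng.normal(size=M)
z = rng.normal(size=D)

h_prime = 1.0 - np.tanh(R_tilde @ Q.T @ z + b) ** 2

# Triangular structure: the log-determinant is a sum over diagonal entries, O(M).
log_det_cheap = np.sum(np.log1p(h_prime * np.diag(R_tilde) * np.diag(R)))

# Reference: full D x D Jacobian I + Q R diag(h') R~ Q^T.
jac_full = np.eye(D) + Q @ R @ (h_prime[:, None] * R_tilde) @ Q.T
sign, log_det_full = np.linalg.slogdet(jac_full)
```

The sign of the full determinant comes out positive here, as it should when every factor $$1+h'_i\tilde{r}_{ii}r_{ii}$$ is positive.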

Now all we need to do to action this is choose a differentiable parameterisation of the upper-triangular $$\mathrm{R},$$ the upper-triangular invertible $$\tilde{\mathrm{R} }$$ with appropriate diagonal entries and the orthonormal matrix, $$Q$$. That is a whole other story though.
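For a flavour of that story, here is one possible sketch, not necessarily what the paper does: build $$Q$$ from a product of Householder reflections, and squash the diagonals with $$\tanh$$ and a sigmoid so that $$r_{ii}\in(-1,1)$$ and $$\tilde{r}_{ii}\in(0,1)$$, which satisfies the conditions above when $$\|h'\|_\infty=1$$. Every operation is smooth in the unconstrained inputs, so the same recipe should carry over to an autodiff framework. Function names are hypothetical.

```python
import numpy as np


def householder_q(V, M):
    """First M columns of a product of Householder reflections, one per row of V."""
    D = V.shape[1]
    Q = np.eye(D)
    for v in V:
        # Q <- Q (I - 2 v v^T / ||v||^2), which keeps Q orthogonal.
        Q = Q - 2.0 * np.outer(Q @ v, v) / (v @ v)
    return Q[:, :M]


def constrained_triangles(raw_R, raw_R_tilde):
    """Map unconstrained M x M arrays to upper-triangular R, R~ with admissible diagonals."""
    diag_R = np.tanh(np.diag(raw_R))                             # r_ii in (-1, 1)
    diag_R_tilde = 1.0 / (1.0 + np.exp(-np.diag(raw_R_tilde)))   # r~_ii in (0, 1)
    R = np.triu(raw_R, k=1) + np.diag(diag_R)
    R_tilde = np.triu(raw_R_tilde, k=1) + np.diag(diag_R_tilde)
    return R, R_tilde


rng = np.random.default_rng(3)
D, M = 8, 3
Q = householder_q(rng.normal(size=(M, D)), M)
R, R_tilde = constrained_triangles(rng.normal(size=(M, M)), rng.normal(size=(M, M)))
```

The diagonal squashing here is more restrictive than the stated condition requires; it is chosen for simplicity, not generality.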

The parallel with the problem of finding covariance kernels is interesting. In both cases we have some large function class we wish to parameterise so we can search over it, but we restrict it to a subset with computationally convenient properties and a simple domain. This probably arises in other nonparametrics?

## “Normalized” flows

In the special case where the flows are designed to be pre-normalized (by which I mean to always have a Jacobian determinant of 1), this becomes a slightly more constrained problem, as seen in learning-stable-systems models.

🏗

## Representational power of

A CMU team argues

The most fundamental restriction of the normalizing flow paradigm is that each layer needs to be invertible. We ask whether this restriction has any ‘cost’ in terms of the size, and in particular the depth, of the model. Here we’re counting depth in terms of the number of the invertible transformations that make up the flow. A requirement for large depth would explain training difficulties due to exploding (or vanishing) gradients. Since the Jacobian of a composition of functions is the product of the Jacobians of the functions being composed, the min (max) singular value of the Jacobian of the composition is the product of the min (max) singular value of the Jacobians of the functions. This implies that the smallest (largest) singular value of the Jacobian will get exponentially smaller (larger) with the number of compositions.

A natural way of formalizing this question is by exhibiting a distribution which is easy to model for an unconstrained generator network but hard for a shallow normalizing flow. Precisely, we ask: is there a probability distribution that can be represented by a shallow generator with a small number of parameters that could not be approximately represented by a shallow composition of invertible transformations?

We demonstrate that such a distribution exists.

…To reiterate the takeaway: a GLOW-style linear layer in between affine couplings could in theory make your network between 5 and 47 times smaller while representing the same function. We now have a precise understanding of the value of that architectural choice!

## Tutorials

1. I am aware from painful experience that using a font to denote a random variable is painful for some people, but this is usually the same people who have strong opinions about “which” versus “that” and other such bullshit, and such people are welcome to their opinions but they can get their own blog to express them on. In this context, things come out much clearer if we can distinguish RVs, their realisations, densities and observations using some appropriate notation.↩︎
