Normalising flows
Invertible density models, sounding clever by using the word diffeomorphism like a real mathematician
April 3, 2018 — June 12, 2024
Cleverly structured “nice” transforms of RVs to sample from tricky target distributions via an easy source distribution. Useful in e.g. variational inference (especially variational autoencoders) and for density estimation in probabilistic deep learning; best summarised as “a fancy change of variables such that I can differentiate through the parameters of a distribution”. Storchastic credits this trick to Glasserman and Ho (1991) under the name perturbation analysis.
Connections to optimal transport and likelihood-free inference, in that this trick can enable some clever approximate-likelihood approaches. Uses transport maps.
Terminology: All variables here are assumed to take values in $\mathbb{R}^D$ unless otherwise noted.
1 Basic
AFAICT the terminology is chosen to imply that the reparameterization should be differentiable, bijective and computationally convenient. Ingmar Shuster’s summary of the foundational paper (Rezende and Mohamed 2015) is a little snarky:
The paper adopts the term normalising flow for referring to the plain old change of variables formula for integrals.
I have had the dissertation (Papamakarios 2019) recommended to me a lot as a summary for this field, although I have not read it. The shorter summary paper looks good though (Papamakarios et al. 2021). See Kobyzev, Prince, and Brubaker (2021) for a recent review.
Normalizing flows are a method of variational inference. As with typical variational inference, it is convenient for us while doing inference to have a way of handling an approximate posterior density $q_\phi(z\mid x)$ over the latent variables $z$, such that…

- we can calculate the density $q_\phi(z\mid x)$,
- we can estimate expectations/integrals with respect to $q_\phi$, in the sense that we can estimate $\mathbb{E}_{q_\phi(z\mid x)}[g(z)]$, which we are satisfied to do via Monte Carlo, and so it will suffice if we can simulate from this density,
- we can differentiate through this density with respect to the variational parameters $\phi$ to find $\nabla_\phi\,\mathbb{E}_{q_\phi(z\mid x)}[g(z)]$, and
- our method is ‘flexible enough’ to approximate a complicated, messy posterior density.

The whole problem in such a case would entail solving for the variational objective (the ELBO):

$$\mathcal{L}(\theta,\phi;x)=\mathbb{E}_{q_\phi(z\mid x)}\left[\log p_\theta(x,z)-\log q_\phi(z\mid x)\right].$$

Here $\theta$ are the model parameters and $\phi$ the variational parameters. (For the next bit, we temporarily suppress the dependence on $x$ and write simply $q_\phi(z)$.)
Normalizing flows use the reparameterization trick as an answer to those desiderata. Specifically, we assume that for some function (not density!) $f_\phi$, a draw from the approximate posterior may be written $z=f_\phi(\varepsilon)$, where $\varepsilon\sim q_0$ comes from a fixed, easy source distribution such as a standard Gaussian.

Now we can calculate (sufficiently nice) expectations with respect to $q_\phi$ by pushing source samples through the map,

$$\mathbb{E}_{q_\phi(z)}[g(z)]=\mathbb{E}_{q_0(\varepsilon)}[g(f_\phi(\varepsilon))].$$

However, we need to impose additional conditions to guarantee a tractable form for the densities; in particular we impose the restriction that $f_\phi$ be a differentiable bijection, so that the change-of-variables formula gives

$$q_\phi(z)=q_0(\varepsilon)\left|\det\frac{\partial f_\phi(\varepsilon)}{\partial\varepsilon}\right|^{-1},\qquad z=f_\phi(\varepsilon).$$

We can economise on function inversions since we are evaluating $q_\phi$ always via simulating $\varepsilon$ first and mapping it forward through $f_\phi$. We do not need, that is to say, a closed form for $f_\phi^{-1}$ in order to evaluate the density at our own samples. We want this mapping to depend upon the variational parameters, which is why we write $f_\phi$.

And indeed it is not too hard to find a recipe for optimising these parameters. We may estimate the gradients $\nabla_\phi\mathcal{L}(x)$ and $\nabla_\theta\mathcal{L}(x)$ by Monte Carlo, since both are now expectations over the fixed source distribution $q_0$. In practice we are doing this in a big data context, so we use stochastic subsamples from the data set.
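To make the reparameterization step concrete, here is a minimal PyTorch sketch (my own illustration, not taken from any of the cited papers): we differentiate a Monte Carlo expectation with respect to the parameters of a diagonal Gaussian by writing each sample as a deterministic transform of standard normal noise.

```python
import torch

# Variational parameters phi = (mu, log_sigma) of a diagonal Gaussian q_phi.
mu = torch.zeros(2, requires_grad=True)
log_sigma = torch.zeros(2, requires_grad=True)

def g(z):
    # An arbitrary integrand whose expectation E_{q_phi}[g(z)] we care about.
    return (z ** 2).sum(-1)

# Reparameterization: z = f_phi(eps) = mu + sigma * eps, with eps ~ N(0, I).
eps = torch.randn(10_000, 2)
z = mu + log_sigma.exp() * eps

# Monte Carlo estimate of the expectation; gradients flow back to (mu, log_sigma)
# through the samples themselves.
loss = g(z).mean()
loss.backward()

print(mu.grad, log_sigma.grad)  # pathwise gradient estimates
```

A normalizing flow replaces the affine map $\mu+\sigma\varepsilon$ with a much more flexible (but still invertible and differentiable) transform.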
So far so good. But it turns out that this is not in general very tractable, because determinants are notoriously expensive to calculate, scaling poorly — $\mathcal{O}(D^3)$ for a general $D\times D$ Jacobian.

We look for restrictions on the form of $f_\phi$ such that the Jacobian determinant is cheap to evaluate, and recover flexibility by composing many such maps, $f_\phi=f_K\circ\cdots\circ f_2\circ f_1$.

By induction on the change of variables formula (and using the logarithm for tidiness), we can find the density of the variable mapped through this flow as

$$\log q_K(z_K)=\log q_0(z_0)-\sum_{k=1}^{K}\log\left|\det\frac{\partial f_k(z_{k-1})}{\partial z_{k-1}}\right|,\qquad z_k=f_k(z_{k-1}).$$
Compositions of such cheap functions should also be cheap-ish and each iteration would add extra flexibility.
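In code, that accumulation is just a loop; here is an illustrative fragment (the layer objects are hypothetical, assumed to return both their output and their per-sample log-Jacobian-determinant):

```python
import torch

def flow_log_prob(z0, layers, base_log_prob):
    """Push z0 through a composition of flow layers, tracking log-density.

    Each layer is assumed to be a callable z -> (z_next, log_det_jacobian),
    with log_det_jacobian of shape (batch,).
    """
    log_q = base_log_prob(z0)      # log q_0(z_0)
    z = z0
    for layer in layers:
        z, log_det = layer(z)
        log_q = log_q - log_det    # subtract log|det J_k| at each step
    return z, log_q
```

Every flow library is, at heart, this loop plus a zoo of layer types with cheap `log_det` terms.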
But which $f_k$ should we use? The classic example from Rezende and Mohamed (2015) is the planar flow,

$$f(z)=z+u\,h(w^{\top}z+b),$$

with parameters $u,w\in\mathbb{R}^{D}$, $b\in\mathbb{R}$ and a smooth elementwise nonlinearity $h$.

There is a standard identity, the matrix determinant lemma, $\det(A+uv^{\top})=(1+v^{\top}A^{-1}u)\det A$, which applied to the Jacobian of the planar flow gives

$$\left|\det\frac{\partial f(z)}{\partial z}\right|=\left|\det\left(I+h'(w^{\top}z+b)\,u\,w^{\top}\right)\right|=\left|1+h'(w^{\top}z+b)\,w^{\top}u\right|.$$

For functions like this, the determinant calculation is cheap, and does not depend at all on the ambient dimension of $z$: it costs only an inner product.
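A minimal planar-flow layer in PyTorch might look like this (my own sketch, using $h=\tanh$ and skipping the extra reparameterisation of $u$ that Rezende and Mohamed use to guarantee $u^{\top}w\ge-1$ and hence invertibility):

```python
import torch
from torch import nn

class PlanarFlow(nn.Module):
    """f(z) = z + u * tanh(w^T z + b); log|det J| = log|1 + h'(w^T z + b) w^T u|,
    via the matrix determinant lemma."""

    def __init__(self, dim):
        super().__init__()
        self.u = nn.Parameter(0.01 * torch.randn(dim))
        self.w = nn.Parameter(0.01 * torch.randn(dim))
        self.b = nn.Parameter(torch.zeros(1))

    def forward(self, z):                       # z: (batch, dim)
        lin = z @ self.w + self.b               # (batch,)
        f = z + self.u * torch.tanh(lin).unsqueeze(-1)
        h_prime = 1.0 - torch.tanh(lin) ** 2    # tanh'(lin)
        log_det = torch.log(torch.abs(1.0 + h_prime * (self.w @ self.u)) + 1e-8)
        return f, log_det
```

Note that the cost of `log_det` here is a couple of inner products, with no $D\times D$ determinant in sight.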
The parallel with the problem of finding covariance kernels is interesting. In both cases we have some large function class we wish to parameterise so we can search over it, but we restrict it to a subset with computationally convenient properties and a simple domain.
2 A fancier flow
There are various solutions to this problem; an illustrative one is the Sylvester flow (van den Berg et al. 2018) which, instead of the matrix determinant lemma, uses a generalisation, Sylvester’s determinant identity. This tells us that for all $A\in\mathbb{R}^{D\times M}$ and $B\in\mathbb{R}^{M\times D}$,

$$\det(I_D+AB)=\det(I_M+BA).$$

This suggests we might exploit a generalized planar flow,

$$f(z)=z+A\,h(Bz+b),$$

with $A\in\mathbb{R}^{D\times M}$, $B\in\mathbb{R}^{M\times D}$, $b\in\mathbb{R}^{M}$ and $h$ applied elementwise, whose Jacobian determinant collapses from a $D\times D$ one to an $M\times M$ one, $\det\bigl(I_M+\operatorname{diag}(h'(Bz+b))\,BA\bigr)$.
The price is an additional parameterisation problem. How do we select $A$ and $B$ so that the flow stays invertible and the reduced $M\times M$ determinant is itself cheap?

They choose $A=QR$ and $B=\tilde{R}Q^{\top}$, where $Q$ is a $D\times M$ matrix with orthonormal columns and $R$, $\tilde{R}$ are upper-triangular $M\times M$ matrices. Then $BA=\tilde{R}R$ is itself upper triangular, and the Jacobian determinant reduces to the product $\prod_{m}\bigl(1+h'_m\,r_{mm}\tilde{r}_{mm}\bigr)$.
They also show that if, in addition,

- $h$ is a smooth function with bounded, positive derivative, and
- the diagonal entries of $R$ and $\tilde{R}$ satisfy $r_{mm}\tilde{r}_{mm}>-1/\|h'\|_{\infty}$ and $\tilde{R}$ is invertible,

then the flow is invertible.
Now all we need to do is choose a differentiable parameterisation of the upper-triangular matrices $R$ and $\tilde{R}$ and of the column-orthonormal $Q$; van den Berg et al. (2018) discuss a few options for the latter.
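To see how the pieces fit together, here is a rough, assumption-laden sketch of a single Sylvester layer (my own code, not the authors’; I get a column-orthonormal $Q$ lazily via a QR decomposition of an unconstrained parameter matrix, and I do not enforce the invertibility conditions above):

```python
import torch
from torch import nn

class SylvesterFlow(nn.Module):
    """f(z) = z + Q R1 tanh(R2 Q^T z + b).

    By Sylvester's identity, log|det J| = sum_m log|1 + h'_m * R1_mm * R2_mm|,
    an O(M) computation regardless of the ambient dimension."""

    def __init__(self, dim, m):
        super().__init__()
        assert m <= dim
        self.q_raw = nn.Parameter(torch.randn(dim, m))    # will be orthonormalised
        self.r1 = nn.Parameter(0.01 * torch.randn(m, m))  # upper-triangular R
        self.r2 = nn.Parameter(0.01 * torch.randn(m, m))  # upper-triangular R~
        self.b = nn.Parameter(torch.zeros(m))

    def forward(self, z):                                 # z: (batch, dim)
        q, _ = torch.linalg.qr(self.q_raw)                # (dim, m), orthonormal columns
        r1, r2 = torch.triu(self.r1), torch.triu(self.r2)
        lin = z @ q @ r2.T + self.b                       # R~ Q^T z, shape (batch, m)
        f = z + torch.tanh(lin) @ (q @ r1).T              # z + Q R h(.)
        h_prime = 1.0 - torch.tanh(lin) ** 2              # (batch, m)
        diag = 1.0 + h_prime * torch.diagonal(r1) * torch.diagonal(r2)
        log_det = torch.log(torch.abs(diag) + 1e-8).sum(-1)
        return f, log_det
```

The actual paper parameterises $Q$ more carefully (e.g. via Householder reflections) rather than differentiating through a QR decomposition.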
3 Spline flows
The type of flow that everyone seems to use in practice for moderate dimensionality is the spline flow (Durkan et al. 2019).
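For a feel of what using one looks like in practice, here is a sketch of fitting a 2-D toy density with a coupling spline flow in Pyro, loosely following the pattern of Pyro’s normalizing-flows tutorial (the helper names and signatures here are from memory, so treat this as indicative rather than copy-paste-able):

```python
import torch
import pyro.distributions as dist
import pyro.distributions.transforms as T

# Toy data: a noisy ring, which a Gaussian certainly cannot fit.
theta = 2 * torch.pi * torch.rand(2048)
X = torch.stack([theta.cos(), theta.sin()], dim=-1) + 0.1 * torch.randn(2048, 2)

base_dist = dist.Normal(torch.zeros(2), torch.ones(2))
spline_transform = T.spline_coupling(2, count_bins=16)   # rational-quadratic spline coupling
flow_dist = dist.TransformedDistribution(base_dist, [spline_transform])

optimizer = torch.optim.Adam(spline_transform.parameters(), lr=5e-3)
for step in range(2000):
    optimizer.zero_grad()
    loss = -flow_dist.log_prob(X).mean()   # maximum likelihood on the transformed density
    loss.backward()
    optimizer.step()
    flow_dist.clear_cache()                # transforms cache (x, y) pairs; stale after a step

# flow_dist.sample() now draws from, and flow_dist.log_prob() evaluates, the fitted density.
```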
4 Glow
Generative flow for images and similar stuff (Durk P. Kingma and Dhariwal 2018).
5 For density estimation
I am no longer sure what distinction I wished to draw with this heading, but see (Esteban G. Tabak and Vanden-Eijnden 2010; E. G. Tabak and Turner 2013) and maybe also check out transport maps.
🏗
6 Tutorials
- Rui Shu explains change of variables in probability and shows how it induces the normalizing flow idea, which the reparameterisation trick forms part of.
- PyMC3 example of a non-trivial usage.
- Adam Kosiorek summarises some fancy variants of normalising flow.
- Eric Jang did a tutorial which explains how this works in Tensorflow.
- Praveen on Ruiz, Titsias, and Blei (2016).
- Dustin Tran, Denoising Criterion for Variational Auto-Encoding Framework
- Shakir Mohamed, Machine Learning Trick of the Day (4): Reparameterisation Tricks
- Papamakarios et al. (2021)
- Eric Jang on normalising flows
7 General measure transport
See transport maps.
8 Representational power of normalising flows
The most fundamental restriction of the normalising flow paradigm is that each layer needs to be invertible. We ask whether this restriction has any ‘cost’ in terms of the size, and in particular the depth, of the model. Here we’re counting depth in terms of the number of the invertible transformations that make up the flow. A requirement for large depth would explain training difficulties due to exploding (or vanishing) gradients. Since the Jacobian of a composition of functions is the product of the Jacobians of the functions being composed, the min (max) singular value of the Jacobian of the composition is the product of the min (max) singular value of the Jacobians of the functions. This implies that the smallest (largest) singular value of the Jacobian will get exponentially smaller (larger) with the number of compositions.
A natural way of formalising this question is by exhibiting a distribution which is easy to model for an unconstrained generator network but hard for a shallow normalising flow. Precisely, we ask: is there a probability distribution that can be represented by a shallow generator with a small number of parameters that could not be approximately represented by a shallow composition of invertible transformations?
We demonstrate that such a distribution exists.
…To reiterate the takeaway: a GLOW-style linear layer in between affine couplings could in theory make your network between 5 and 47 times smaller while representing the same function. We now have a precise understanding of the value of that architectural choice!
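As a quick numerical illustration of the conditioning point above (the Jacobian of a composition is a product of per-layer Jacobians, so its extreme singular values drift apart exponentially with depth), here is a toy sketch with random linear maps standing in for per-layer Jacobians:

```python
import torch

torch.manual_seed(0)
dim, depth = 16, 50

# Jacobian of a composition = product of the per-layer Jacobians.
jac = torch.eye(dim)
for _ in range(depth):
    layer_jac = torch.eye(dim) + 0.3 * torch.randn(dim, dim)  # one layer's Jacobian
    jac = layer_jac @ jac

svals = torch.linalg.svdvals(jac)
print(f"max singular value: {svals.max():.3e}, min: {svals.min():.3e}")
print(f"condition number:   {(svals.max() / svals.min()):.3e}")  # explodes with depth
```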
9 Conditioned
(Koehler, Mehta, and Risteski 2020; Winkler et al. 2023) TBC
10 Implementations
probabilists/zuko: Normalising flows in PyTorch — Seems active
Numpyro includes normalising flows
Pyro includes normalising flows, albeit not especially fancy ones.
flowtorch / Source (Formerly Meta-backed, I think now independent?)
11 References
Footnotes
I am aware from painful experience that using a font to denote a random variable will irritate some people, but these are usually the same people who have strong opinions about “which” versus “that” and other such bullshit, and such people are welcome to their opinions but they can get their own blog to express those opinions on, in notations which suit their purposes. In my context, things come out much clearer if we can distinguish RVs, their realisations, densities and observations using some appropriate notation which is not capitalisation or italics, which I already use for other purposes.↩︎