Figure 1: Approximating the desired distribution by perturbation of the available distribution

Cleverly structured “nice” transforms of RVs to sample from tricky target distributions via an easy source distribution. Useful in e.g. variational inference, especially autoencoders, for density estimation in probabilistic deep learning, best summarised as “fancy change of variables such that I can differentiate through the parameters of a distribution”. Storchastic credits this to () as perturbation analysis.

Connections to optimal transport and likelihood-free inference, in that this trick can enable some clever approximate-likelihood approaches. Uses transport maps.

Terminology: All variables here are assumed to take values in $\mathbb{R}^D$. If I am writing about a random variable I write it $\mathsf{x}$, and if I am writing about the realised values, I write it $x$. $\mathsf{x}\sim p(x)$ should be read “The random variable $\mathsf{x}$ has law with density $p(x)$.”

1 Basic

AFAICT the terminology is chosen to imply that the reparameterization should be differentiable, bijective and computationally convenient. Ingmar Shuster’s summary of the foundational paper () is a little snarky:

The paper adopts the term normalising flow for referring to the plain old change of variables formula for integrals.

I have had the dissertation () recommended to me a lot as a summary for this field, although I have not read it. The shorter summary paper looks good though (). See Kobyzev, Prince, and Brubaker () for a recent review.

Normalizing flows are a method of variational inference. As with typical variational inference, it is convenient for us while doing inference to have a way of handling an approximate posterior density over the latent variables z given the data x. The real posterior density p(z|x) is intractable, so we construct an approximate density qϕ(z|x) parameterised by some variational parameters ϕ. We have certain desiderata. Specifically, we would like to efficiently…

  1. calculate the density $q_\phi(z|x)$, such that…
  2. we can estimate expectations/integrals with respect to $q_\phi(z|x)$, in the sense that we can estimate $\int q_\phi(z|x)f(z)\,\mathrm{d}z$, which we are satisfied to do via Monte Carlo, and so it will suffice if we can simulate from this density, and
  3. we can differentiate through this density with respect to the variational parameters to find $\nabla_\phi q_\phi(z|x)$.
  4. our method is ‘flexible enough’ to approximate a complicated, messy posterior density.

The whole problem in such a case would entail solving for the variational objective:

$$\log p_\theta(x) \geq \mathcal{L}(\theta,\phi;x) = \mathbb{E}_{q_\phi(z|x)}\bigl[-\log q_\phi(z|x) + \log p_\theta(x,z)\bigr] = -\operatorname{KL}\bigl(q_\phi(z|x)\,\|\,p_\theta(z)\bigr) + \mathbb{E}_{q_\phi(z|x)}\bigl[\log p_\theta(x|z)\bigr] = -\mathcal{F}(\theta,\phi)$$

Here pθ(x) is the marginal likelihood of the generative model for the data, pθ(x|z) the density of observations given the latent factors and θ parameterises the density of the generative model.

(For the next bit, we temporarily suppress the dependence on x to avoid repetition, and the dependence of the transforms and density upon ϕ.)

Normalizing flows use the reparameterization trick as an answer to those desiderata. Specifically, we assume that for some function (not density!) $f:\mathbb{R}^D\to\mathbb{R}^D$, we have $\mathsf{z}=f(\mathsf{z}_0)$ where $\mathsf{z}_0\sim q_0=\mathcal{N}(0,I)$ (or some similarly easy distribution). It turns out that by imposing some extra restrictions, we can do most of the heavy lifting in this algorithm through this simple $\mathsf{z}_0\sim q_0$ and still get the power of a fancy posterior $\mathsf{z}\sim q$.

Now we can calculate (sufficiently nice) expectations with respect to $\mathsf{z}$ using the law of the unconscious statistician: for a test function $g$, $\mathbb{E}\,g(\mathsf{z}) = \int g(f(z_0))\,q_0(z_0)\,\mathrm{d}z_0$.
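As a concrete (if trivial) illustration, here is that identity in Monte Carlo form; the affine transform $f$ and test function $g$ are stand-ins of my own invention, not anything from the flow literature:

```python
import numpy as np

rng = np.random.default_rng(42)
D = 2

def f(z0):
    """An example invertible transform: an affine map z = A z0 + c."""
    A = np.array([[2.0, 0.0], [0.5, 1.0]])
    c = np.array([1.0, -1.0])
    return z0 @ A.T + c

# Law of the unconscious statistician, Monte Carlo style:
# E[g(z)] = ∫ g(f(z0)) q0(z0) dz0, estimated by averaging over draws of z0.
z0 = rng.standard_normal((100_000, D))  # z0 ~ N(0, I)
z = f(z0)
g = lambda z: z.sum(axis=1)             # any (sufficiently nice) test function

estimate = g(z).mean()
# For this affine f, E[z] = c since E[z0] = 0, so the true value of
# E[g(z)] is 1.0 + (-1.0) = 0.0, and the estimate should be close to that.
```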

However, we need to impose additional conditions to guarantee a tractable form for the densities; in particular we impose the restriction that f is invertible, so that we can use the change of variables formula to find the density

$$q(z) = q_0\bigl(f^{-1}(z)\bigr)\left|\det\frac{\partial f^{-1}(z)}{\partial z}\right| = q_0\bigl(f^{-1}(z)\bigr)\left|\det\frac{\partial f(z)}{\partial z}\right|^{-1}.$$

We can economise on function inversions since we are always evaluating via simulating $\mathsf{z}$ from $f(\mathsf{z}_0)$, i.e. $z := f(z_0)$, so we can write

$$q(z) = q_0(z_0)\left|\det\frac{\partial f(z_0)}{\partial z_0}\right|^{-1}.$$

We do not need, that is to say, f inversions, just the Jacobian determinant, which might be easier. Spoiler: later on we construct some f transforms for which it is easy-ish to find that Jacobian determinant without inverting the function.
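A minimal numpy sketch of this inversion-free evaluation, using an affine $f$ of my own choosing (real flows use fancier maps) so that the result can be checked against the analytic Gaussian density:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 2
A = np.array([[2.0, 0.0], [0.5, 1.0]])   # invertible, so f is a bijection
c = np.array([1.0, -1.0])

def f(z0):
    return A @ z0 + c

def log_q0(z0):
    """Log-density of the standard normal base distribution."""
    return -0.5 * (z0 @ z0 + D * np.log(2 * np.pi))

# Simulate forward: we know z0 already, so no inversion of f is needed.
z0 = rng.standard_normal(D)
z = f(z0)

# Change of variables: log q(z) = log q0(z0) - log |det ∂f/∂z0|.
_, log_abs_det = np.linalg.slogdet(A)     # Jacobian of an affine map is A
log_q = log_q0(z0) - log_abs_det

# Sanity check against the analytic density of z ~ N(c, A A^T):
cov = A @ A.T
diff = z - c
log_q_true = -0.5 * (diff @ np.linalg.solve(cov, diff)
                     + np.log(np.linalg.det(cov)) + D * np.log(2 * np.pi))
assert np.isclose(log_q, log_q_true)
```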

We want this mapping to depend upon $x$ and the parameters $\phi$, so we reintroduce that dependence now. We parameterize these functions $f_\phi(z) := f(z, \phi(x))$; that is, $\phi$ is a learned mapping from the observation $x$ to the transform parameters, carrying a base draw $\mathsf{z}_0$ to a posterior sample of the latent variable $\mathsf{z}_K$ with density $q_K(z_K|x)$ such that, if we configure everything just right, $q_K(z_K|x) \approx p_\theta(z|x)$.

And indeed it is not too hard to find a recipe for optimising these parameters. We may estimate the derivatives $\nabla_\theta \mathcal{F}(x)$ and $\nabla_\phi \mathcal{F}(x)$, which is sufficient to minimise $\mathcal{F}(\theta,\phi)$ by gradient descent.

In practice we are doing this in a big data context, so we use stochastic subsamples from $x$ to estimate the gradient and Monte Carlo simulations from $\mathsf{z}_0$ to estimate the necessary integrals with respect to $q_K(\cdot|x)$.
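A toy numpy example of the pathwise gradient estimator underlying all of this, for a scalar Gaussian $q_\phi$ where the true gradient is known in closed form (the target functional $z\mapsto z^2$ is an arbitrary choice for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
mu, sigma = 1.5, 0.8
eps = rng.standard_normal(200_000)

# Reparameterize: z = mu + sigma * eps with eps ~ N(0, 1), so z ~ N(mu, sigma^2)
# and the sampling path is differentiable in the parameters (mu, sigma).
z = mu + sigma * eps

# Pathwise (reparameterization) gradient of E[z^2] w.r.t. mu:
# d/dmu E[(mu + sigma*eps)^2] = E[2 (mu + sigma*eps)] = 2 * mu.
grad_mu = (2 * z).mean()
# grad_mu should be close to 2 * mu = 3.0.
```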

So far so good. But it turns out that this is not in general very tractable, because determinants are notoriously expensive to calculate, scaling poorly — $\mathcal{O}(D^3)$ — in the number of dimensions, which we expect to be large in trendy neural-network-type problems.

We look for restrictions on the form of $f_\phi$ such that $\det\frac{\partial f_\phi}{\partial z}$ is cheap and yet the approximate $q$ they induce is still “flexible enough”. The normalizing flow solution is to choose compositions of some class of cheap functions $\mathsf{z}_K = f_K \circ \cdots \circ f_2 \circ f_1(\mathsf{z}_0) \sim q_K$.

By induction on the change of variables formula (and using the logarithm for tidiness), we can find the density of the variable mapped through this flow as

$$\log q_K(z_K|x) = \log q_0(z_0|x) - \sum_{k=1}^K \log\left|\det\left(\frac{\partial f_k(z_{k-1})}{\partial z_{k-1}}\right)\right|$$

Compositions of such cheap functions should also be cheap-ish and each iteration would add extra flexibility.
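The bookkeeping for such a composition looks like this in numpy; the random invertible affine layers here are stand-ins of mine for the $f_k$, not any flow from the literature:

```python
import numpy as np

rng = np.random.default_rng(1)
D, K = 3, 4

# A stack of K random invertible affine layers (scaled orthogonal matrices,
# hence invertible) stands in for the f_k; real flows use cleverer layers,
# but the log-density bookkeeping is identical.
layers = [(np.linalg.qr(rng.standard_normal((D, D)))[0] * 1.5,
           rng.standard_normal(D)) for _ in range(K)]

z0 = rng.standard_normal(D)                        # z_0 ~ q_0 = N(0, I)
z = z0.copy()
log_q = -0.5 * (z0 @ z0 + D * np.log(2 * np.pi))   # log q_0(z_0)

# Push z_0 through the flow, subtracting each log|det Jacobian| as we go.
for A, b in layers:
    log_q -= np.linalg.slogdet(A)[1]               # Jacobian of z -> A z + b is A
    z = A @ z + b                                  # z_k = f_k(z_{k-1})
# log_q now holds log q_K(z_K) at the simulated point z = z_K.
```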

But which fk mappings are cheap? The archetypal one from () is the “planar flow”:

$f(z) = z + u\,h(w^\top z + b)$ where $u, w \in \mathbb{R}^D$, $b \in \mathbb{R}$ and $h:\mathbb{R}\to\mathbb{R}$ is some monotonic differentiable activation function.

Figure 2: Rezende and Mohamed ()’s planar flows applied to some latent distributions

There is a standard identity, the matrix determinant lemma $\det(I + uv^\top) = 1 + v^\top u$, from which it follows that $\det\frac{\partial f}{\partial z} = \det\bigl(I + u\,h'(w^\top z + b)\,w^\top\bigr) = 1 + h'(w^\top z + b)\,w^\top u$. We can often simply parameterise acceptable domains for functions like these so that they remain invertible; for example, if $h \equiv \tanh$ a sufficient condition is $w^\top u \geq -1$. This means that we know it should be simple to parameterise these weights $u, w, b$. Or, as in our application, that it is easy to construct functions $\phi: x \mapsto (u, w, b)$ which are guaranteed to remain in an invertible domain.

For functions like this, the determinant calculation is cheap, and does not depend at all on the ambient dimension of $x$. However, we might find it hard to persuade ourselves that this mapping is flexible enough to represent $q$ well, at least without letting $K$ be large, as the mapping must pass through a univariate “bottleneck” $w^\top z$. Indeed, empirically this mapping does not in fact perform well and a lot of time has been spent trying to do better.
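For the record, a numpy sketch of the planar-flow determinant shortcut, checked against the dense Jacobian; the rejection-sampling step for $u$ is a crude stand-in for the projection Rezende and Mohamed actually use to enforce invertibility:

```python
import numpy as np

rng = np.random.default_rng(2)
D = 5
w = rng.standard_normal(D)
b = 0.3
u = rng.standard_normal(D)
while u @ w < -1:          # sufficient condition for invertibility with tanh
    u = rng.standard_normal(D)

h = np.tanh
h_prime = lambda a: 1 - np.tanh(a) ** 2

z = rng.standard_normal(D)
f_z = z + u * h(w @ z + b)  # the planar flow itself

# Matrix determinant lemma: det(I + u v^T) = 1 + v^T u, so the Jacobian
# determinant of the planar flow is 1 + h'(w.z + b) * (w.u) -- O(D), not O(D^3).
det_cheap = 1 + h_prime(w @ z + b) * (w @ u)

# Check against the dense Jacobian, built explicitly:
J = np.eye(D) + np.outer(u * h_prime(w @ z + b), w)
assert np.isclose(det_cheap, np.linalg.det(J))
```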

The parallel with the problem of finding covariance kernels is interesting. In both cases we have some large function class we wish to parameterise so we can search over it, but we restrict it to a subset with computationally convenient properties and a simple domain.

2 A fancier flow

There are various solutions to this problem; an illustrative one is Sylvester flow () which, instead of the matrix determinant lemma, uses a generalisation, Sylvester’s determinant identity. This tells us that for all $A \in \mathbb{R}^{D\times M}, B \in \mathbb{R}^{M\times D}$, $\det(I_D + AB) = \det(I_M + BA)$, where $I_M$ and $I_D$ are respectively the $M$- and $D$-dimensional identity matrices.

This suggests we might exploit a generalized planar flow, $f(z) := z + A\,h(Bz + b)$ with $A \in \mathbb{R}^{D\times M}$, $B \in \mathbb{R}^{M\times D}$, $b \in \mathbb{R}^M$, and $M \leq D$. The determinant calculation scales as $\mathcal{O}(M^3) \ll \mathcal{O}(D^3)$, which is cheap-ish and, we hope, gives us enough additional scope to design sufficiently flexible flows. $M \geq 1$ is a bottleneck parameter.
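The identity, and the cheap determinant it buys for the generalized planar flow, are easy to verify numerically; a numpy sketch (dimensions chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(3)
D, M = 20, 3
A = rng.standard_normal((D, M))
B = rng.standard_normal((M, D))

# Sylvester's determinant identity: det(I_D + A B) = det(I_M + B A),
# trading a D x D determinant (O(D^3)) for an M x M one (O(M^3)).
lhs = np.linalg.det(np.eye(D) + A @ B)
rhs = np.linalg.det(np.eye(M) + B @ A)
assert np.isclose(lhs, rhs)

# For the generalized planar flow f(z) = z + A h(B z + b) the Jacobian is
# I_D + A diag(h'(B z + b)) B, so its determinant reduces to an M x M one:
b = rng.standard_normal(M)
z = rng.standard_normal(D)
h_prime = lambda a: 1 - np.tanh(a) ** 2
Dh = np.diag(h_prime(B @ z + b))
det_cheap = np.linalg.det(np.eye(M) + Dh @ B @ A)
det_full = np.linalg.det(np.eye(D) + A @ Dh @ B)
assert np.isclose(det_cheap, det_full)
```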

The price is an additional parameterisation problem. How do we select A and B such that they are still invertible for a given h? The solution in () is to break this into two simpler parameterisation problems.

They choose $f(z) := z + QR\,h(\tilde{R}Q^\top z + b)$ where $R$ and $\tilde{R}$ are upper triangular $M \times M$ matrices, and $Q = (q_1\ \cdots\ q_M)$ is a $D \times M$ matrix with orthonormal columns. Using the Sylvester identity on this $f$ we find

$$\det\frac{\partial f}{\partial z} = \det\Bigl(I_M + \operatorname{diag}\bigl(h'(\tilde{R}Q^\top z + b)\bigr)\,\tilde{R}R\Bigr)$$

They also show that if, in addition,

  1. $h:\mathbb{R}\to\mathbb{R}$ is a smooth function with bounded, positive derivative and
  2. the diagonal entries of $R$ and $\tilde{R}$ satisfy $r_{ii}\tilde{r}_{ii} > -1/\|h'\|_\infty$ and
  3. $\tilde{R}$ is invertible,

then f is invertible as required.

Now all we need to do is choose a differentiable parameterisation of the upper-triangular R, the upper-triangular invertible R~ with appropriate diagonal entries and the orthonormal matrix, Q. That is a whole other story though.
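For a flavour of what such a parameterisation might look like, here is a simple-minded numpy stand-in of my own (not the Householder-product construction van den Berg et al. actually use): orthonormal columns via QR of a free matrix, and positive diagonals via exponentiation so the invertibility conditions hold automatically:

```python
import numpy as np

rng = np.random.default_rng(4)
D, M = 6, 3

# Unconstrained parameters, as a gradient-based optimiser would hold them:
W_free = rng.standard_normal((D, M))   # will become Q
R_free = rng.standard_normal((M, M))   # will become R
S_free = rng.standard_normal((M, M))   # will become R~

# Orthonormal columns via QR of an unconstrained matrix (differentiable
# almost everywhere); a stand-in for the paper's Householder products.
Q, _ = np.linalg.qr(W_free)

# Upper-triangular factors with positive diagonals (via exp), so that
# r_ii * r~_ii > 0 > -1/sup|h'| and R~ is invertible, as required.
R = np.triu(R_free, k=1) + np.diag(np.exp(np.diag(R_free)))
R_tilde = np.triu(S_free, k=1) + np.diag(np.exp(np.diag(S_free)))

# The resulting Sylvester-flow Jacobian determinant, via the M x M formula:
b = rng.standard_normal(M)
z = rng.standard_normal(D)
h_prime = lambda a: 1 - np.tanh(a) ** 2
det_J = np.linalg.det(
    np.eye(M) + np.diag(h_prime(R_tilde @ Q.T @ z + b)) @ R_tilde @ R)
# Under these constraints the matrix inside det is upper triangular with a
# positive diagonal, so det_J > 0 and the flow is invertible at z.
```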

3 Spline

The type of flow that everyone seems to use in practice for moderate dimensionality is the spline flow ().

4 Glow

Generative flow for images and similar stuff ().

5 For density estimation

I am no longer sure what distinction I wished to draw with this heading, but see (; ) and maybe also check out transport maps.

🏗

6 Tutorials

7 General measure transport

See transport maps.

8 Representational power of

A CMU team argues

The most fundamental restriction of the normalising flow paradigm is that each layer needs to be invertible. We ask whether this restriction has any ‘cost’ in terms of the size, and in particular the depth, of the model. Here we’re counting depth in terms of the number of the invertible transformations that make up the flow. A requirement for large depth would explain training difficulties due to exploding (or vanishing) gradients. Since the Jacobian of a composition of functions is the product of the Jacobians of the functions being composed, the min (max) singular value of the Jacobian of the composition is the product of the min (max) singular value of the Jacobians of the functions. This implies that the smallest (largest) singular value of the Jacobian will get exponentially smaller (larger) with the number of compositions.

A natural way of formalising this question is by exhibiting a distribution which is easy to model for an unconstrained generator network but hard for a shallow normalising flow. Precisely, we ask: is there a probability distribution that can be represented by a shallow generator with a small number of parameters that could not be approximately represented by a shallow composition of invertible transformations?

We demonstrate that such a distribution exists.

…To reiterate the takeaway: a GLOW-style linear layer in between affine couplings could in theory make your network between 5 and 47 times smaller while representing the same function. We now have a precise understanding of the value of that architectural choice!

9 Conditioned

(; ) TBC

10 Implementations

11 References

Ambrosio, Gigli, and Savare. 2008. Gradient Flows: In Metric Spaces and in the Space of Probability Measures. Lectures in Mathematics. ETH Zürich.
Charpentier, Borchert, Zügner, et al. 2022. Natural Posterior Network: Deep Bayesian Uncertainty for Exponential Family Distributions.” arXiv:2105.04471 [Cs, Stat].
Chen, Changyou, Li, Chen, et al. 2017. Continuous-Time Flows for Efficient Inference and Density Estimation.” arXiv:1709.01179 [Stat].
Chen, Tian Qi, Rubanova, Bettencourt, et al. 2018. Neural Ordinary Differential Equations.” In Advances in Neural Information Processing Systems 31.
Dinh, Sohl-Dickstein, and Bengio. 2016. Density Estimation Using Real NVP.” In Advances In Neural Information Processing Systems.
Durkan, Bekasov, Murray, et al. 2019. Neural Spline Flows.” In Advances in Neural Information Processing Systems.
Figurnov, Mohamed, and Mnih. 2018. Implicit Reparameterization Gradients.” In Advances in Neural Information Processing Systems 31.
Glasserman, and Ho. 1991. Gradient Estimation Via Perturbation Analysis.
Grumitt, Karamanis, and Seljak. 2023. Flow Annealed Kalman Inversion for Gradient-Free Inference in Bayesian Inverse Problems.”
Huang, Krueger, Lacoste, et al. 2018. Neural Autoregressive Flows.” arXiv:1804.00779 [Cs, Stat].
Jankowiak, and Obermeyer. 2018. Pathwise Derivatives Beyond the Reparameterization Trick.” In International Conference on Machine Learning.
Kingma, Durk P, and Dhariwal. 2018. Glow: Generative Flow with Invertible 1x1 Convolutions.” In Advances in Neural Information Processing Systems 31.
Kingma, Diederik P., Salimans, Jozefowicz, et al. 2016. Improving Variational Inference with Inverse Autoregressive Flow.” In Advances in Neural Information Processing Systems 29.
Kingma, Diederik P., Salimans, and Welling. 2015. Variational Dropout and the Local Reparameterization Trick.” In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2. NIPS’15.
Kingma, Diederik P., and Welling. 2014. Auto-Encoding Variational Bayes.” In ICLR 2014 Conference.
Kobyzev, Prince, and Brubaker. 2021. Normalizing Flows: An Introduction and Review of Current Methods.” IEEE Transactions on Pattern Analysis and Machine Intelligence.
Koehler, Mehta, and Risteski. 2020. Representational Aspects of Depth and Conditioning in Normalizing Flows.” arXiv:2010.01155 [Cs, Stat].
Louizos, and Welling. 2017. Multiplicative Normalizing Flows for Variational Bayesian Neural Networks.” In PMLR.
Lu, and Huang. 2020. Woodbury Transformations for Deep Generative Flows.” In Advances in Neural Information Processing Systems.
Massaroli, Poli, Bin, et al. 2020. Stable Neural Flows.” arXiv:2003.08063 [Cs, Math, Stat].
Mohamed, Rosca, Figurnov, et al. 2020. Monte Carlo Gradient Estimation in Machine Learning.” Journal of Machine Learning Research.
Nash, and Durkan. 2019. Autoregressive Energy Machines.” In Proceedings of the 36th International Conference on Machine Learning.
Papamakarios. 2019. Neural Density Estimation and Likelihood-Free Inference.”
Papamakarios, Murray, and Pavlakou. 2017. Masked Autoregressive Flow for Density Estimation.” In Advances in Neural Information Processing Systems 30.
Papamakarios, Nalisnick, Rezende, et al. 2021. Normalizing Flows for Probabilistic Modeling and Inference.” Journal of Machine Learning Research.
Pfau, and Rezende. 2020. “Integrable Nonparametric Flows.” In.
Rezende, and Mohamed. 2015. Variational Inference with Normalizing Flows.” In International Conference on Machine Learning. ICML’15.
Ruiz, Titsias, and Blei. 2016. The Generalized Reparameterization Gradient.” In Advances In Neural Information Processing Systems.
Shi, Siddharth, Paige, et al. 2019. Variational Mixture-of-Experts Autoencoders for Multi-Modal Deep Generative Models.” arXiv:1911.03393 [Cs, Stat].
Tabak, E. G., and Turner. 2013. A Family of Nonparametric Density Estimation Algorithms.” Communications on Pure and Applied Mathematics.
Tabak, Esteban G., and Vanden-Eijnden. 2010. Density Estimation by Dual Ascent of the Log-Likelihood.” Communications in Mathematical Sciences.
van den Berg, Hasenclever, Tomczak, et al. 2018. Sylvester Normalizing Flows for Variational Inference.” In UAI18.
Wang, and Wang. 2019. Riemannian Normalizing Flow on Variational Wasserstein Autoencoder for Text Modeling.” In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers).
Winkler, Worrall, Hoogeboom, et al. 2023. Learning Likelihoods with Conditional Normalizing Flows.”

Footnotes

  1. I am aware from painful experience that using a font to denote a random variable will irritate some people, but this is usually the same people who have strong opinions about “which” versus “that” and other such bullshit, and such people are welcome to their opinions but they can get their own blog to express those opinions on in notations which suit their purposes. In my context, things come out much clearer if we can distinguish RVs, their realisations, densities and observations using some appropriate notation which is not capitalisation or italics, which I already use for other purposes.↩︎