Reparameterization methods for MC gradient estimation

Pathwise gradient estimation,

April 4, 2018 — May 2, 2023

likelihood free
Monte Carlo
probabilistic algorithms
Figure 1

Reparameterization trick. A trick where we cleverly transform RVs to sample from tricky target distributions, and their jacobians, via a “nice” nice source distribution. Useful in e.g. variational inference, especially autoencoders, for density estimation in probabilistic deep learning. Pairs well with normalizing flows to get powerful target distributions. Storchastic credits pathwise gradients to Glasserman and Ho (1991) as perturbation analysis. According to Bloem-Reddy and Teh (2020) the reparameterisation trick is an application of noise outsourcing.

1 Tutorials

Suppose we want the gradient of an expectation of a smooth function \(f\): \[ \nabla_\theta \mathbb {E}_{p(z; \theta)}[f (z)]=\nabla_\theta \int p(z; \theta) f (z) d z \] […] This gradient is often difficult to compute because the integral is typically unknown and the parameters \(\theta\), with respect to which we are computing the gradient, are of the distribution \(p(z; \theta)\).

Now we suppose that we know some function \(g\) such that for some easy distribution \(p(\epsilon)\), \(z | \theta=g(\epsilon, \theta)\). Now we can try to estimate the gradient of the expectation by Monte Carlo:

\[ \nabla_\theta \mathbb {E}_{p(z; \theta)}[f (z)]=\mathbb {E}_{p (c)}\left[\nabla_\theta f(g(\epsilon, \theta))\right] \] Let’s derive this expression and explore the implications of it for our optimisation problem. One-liners give us a transformation from a distribution \(p(\epsilon)\) to another \(p (z)\), thus the differential area (mass of the distribution) is invariant under the change of variables. This property implies that: \[ p (z)=\left|\frac{d \epsilon}{d z}\right|-p(\epsilon) \Longrightarrow-p (z) d z|=|p(\epsilon) d \epsilon| \] Re-expressing the troublesome stochastic optimisation problem using random variate reparameterisation, we find: \[ \begin{aligned} \nabla_\theta \mathbb {E}_{p(z; \theta)}[f (z)] &=\nabla_\theta \int p(z; \theta) f (z) d z \\ &= \nabla_\theta \int p(\epsilon) f (z) d \epsilon\\ &=\nabla_\theta \int p(\epsilon) f(g(\epsilon, \theta)) d \epsilon \\ &=\nabla_\theta \mathbb {E}_{p (c)}[f(g(\epsilon, \theta))]\\ &=\mathbb {E}_{p (e)}\left[\nabla_\theta f(g(\epsilon, \theta))\right] \end{aligned} \]

Yuge Shi’s variational inference tutorial is a tour of cunning reparameterisation gradient tricks written for her paper Shi et al. (2019). She punts some details to Mohamed et al. (2020) which in turn tells me that this adventure continues at Monte Carlo gradient estimation, Figurnov, Mohamed, and Mnih (2018), Devroye (2006) and Jankowiak and Obermeyer (2018).

2 Normalizing flows

Cunning reparameterization maps with desirable properties for nonparametric density inference. See normalizing flows.

3 General measure transport

See transport maps.

4 Tooling


5 Incoming

Universal representation theorems? Probably many, here are some I saw: Perekrestenko, Müller, and Bölcskei (2020); Perekrestenko, Eberhard, and Bölcskei (2021).

6 References

Albergo, Goldstein, Boffi, et al. 2023. Stochastic Interpolants with Data-Dependent Couplings.”
Albergo, and Vanden-Eijnden. 2023. Building Normalizing Flows with Stochastic Interpolants.” In.
Ambrosio, Gigli, and Savare. 2008. Gradient Flows: In Metric Spaces and in the Space of Probability Measures. Lectures in Mathematics. ETH Zürich.
Bamler, and Mandt. 2017. Structured Black Box Variational Inference for Latent Time Series Models.” arXiv:1707.01069 [Cs, Stat].
Bloem-Reddy, and Teh. 2020. Probabilistic Symmetries and Invariant Neural Networks.”
Caterini, Doucet, and Sejdinovic. 2018. Hamiltonian Variational Auto-Encoder.” In Advances in Neural Information Processing Systems.
Charpentier, Borchert, Zügner, et al. 2022. Natural Posterior Network: Deep Bayesian Uncertainty for Exponential Family Distributions.” arXiv:2105.04471 [Cs, Stat].
Chen, Changyou, Li, Chen, et al. 2017. Continuous-Time Flows for Efficient Inference and Density Estimation.” arXiv:1709.01179 [Stat].
Chen, Tian Qi, Rubanova, Bettencourt, et al. 2018. Neural Ordinary Differential Equations.” In Advances in Neural Information Processing Systems 31.
Cunningham, Zabounidis, Agrawal, et al. 2020. Normalizing Flows Across Dimensions.”
Devroye. 2006. Chapter 4 Nonuniform Random Variate Generation.” In Simulation. Handbooks in Operations Research and Management Science.
Dinh, Sohl-Dickstein, and Bengio. 2016. Density Estimation Using Real NVP.” In Advances In Neural Information Processing Systems.
Figurnov, Mohamed, and Mnih. 2018. Implicit Reparameterization Gradients.” In Advances in Neural Information Processing Systems 31.
Germain, Gregor, Murray, et al. 2015. MADE: Masked Autoencoder for Distribution Estimation.”
Glasserman, and Ho. 1991. Gradient Estimation Via Perturbation Analysis.
Grathwohl, Chen, Bettencourt, et al. 2018. FFJORD: Free-Form Continuous Dynamics for Scalable Reversible Generative Models.” arXiv:1810.01367 [Cs, Stat].
Huang, Krueger, Lacoste, et al. 2018. Neural Autoregressive Flows.” arXiv:1804.00779 [Cs, Stat].
Jankowiak, and Obermeyer. 2018. Pathwise Derivatives Beyond the Reparameterization Trick.” In International Conference on Machine Learning.
Kingma, Durk P, and Dhariwal. 2018. Glow: Generative Flow with Invertible 1x1 Convolutions.” In Advances in Neural Information Processing Systems 31.
Kingma, Diederik P., Salimans, Jozefowicz, et al. 2016. Improving Variational Inference with Inverse Autoregressive Flow.” In Advances in Neural Information Processing Systems 29.
Kingma, Diederik P., Salimans, and Welling. 2015. Variational Dropout and the Local Reparameterization Trick.” In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2. NIPS’15.
Kingma, Diederik P., and Welling. 2014. Auto-Encoding Variational Bayes.” In ICLR 2014 Conference.
Koehler, Mehta, and Risteski. 2020. Representational Aspects of Depth and Conditioning in Normalizing Flows.” arXiv:2010.01155 [Cs, Stat].
Lin, Khan, and Schmidt. 2019. Stein’s Lemma for the Reparameterization Trick with Exponential Family Mixtures.” arXiv:1910.13398 [Cs, Stat].
Lipman, Chen, Ben-Hamu, et al. 2023. Flow Matching for Generative Modeling.”
Louizos, and Welling. 2017. Multiplicative Normalizing Flows for Variational Bayesian Neural Networks.” In PMLR.
Lu, and Huang. 2020. Woodbury Transformations for Deep Generative Flows.” In Advances in Neural Information Processing Systems.
Marzouk, Moselhy, Parno, et al. 2016. Sampling via Measure Transport: An Introduction.” In Handbook of Uncertainty Quantification.
Massaroli, Poli, Bin, et al. 2020. Stable Neural Flows.” arXiv:2003.08063 [Cs, Math, Stat].
Mohamed, Rosca, Figurnov, et al. 2020. Monte Carlo Gradient Estimation in Machine Learning.” Journal of Machine Learning Research.
Ng, and Zammit-Mangion. 2020. Non-Homogeneous Poisson Process Intensity Modeling and Estimation Using Measure Transport.” arXiv:2007.00248 [Stat].
Papamakarios. 2019. Neural Density Estimation and Likelihood-Free Inference.”
Papamakarios, Murray, and Pavlakou. 2017. Masked Autoregressive Flow for Density Estimation.” In Advances in Neural Information Processing Systems 30.
Papamakarios, Nalisnick, Rezende, et al. 2021. Normalizing Flows for Probabilistic Modeling and Inference.” Journal of Machine Learning Research.
Perekrestenko, Eberhard, and Bölcskei. 2021. High-Dimensional Distribution Generation Through Deep Neural Networks.” Partial Differential Equations and Applications.
Perekrestenko, Müller, and Bölcskei. 2020. Constructive Universal High-Dimensional Distribution Generation Through Deep ReLU Networks.”
Pfau, and Rezende. 2020. “Integrable Nonparametric Flows.” In.
Ran, and Hu. 2017. Parameter Identifiability in Statistical Machine Learning: A Review.” Neural Computation.
Rezende, and Mohamed. 2015. Variational Inference with Normalizing Flows.” In International Conference on Machine Learning. ICML’15.
Rezende, Mohamed, and Wierstra. 2015. Stochastic Backpropagation and Approximate Inference in Deep Generative Models.” In Proceedings of ICML.
Rippel, and Adams. 2013. High-Dimensional Probability Estimation with Deep Density Models.” arXiv:1302.5125 [Cs, Stat].
Ruiz, Titsias, and Blei. 2016. The Generalized Reparameterization Gradient.” In Advances In Neural Information Processing Systems.
Shi, Siddharth, Paige, et al. 2019. Variational Mixture-of-Experts Autoencoders for Multi-Modal Deep Generative Models.” arXiv:1911.03393 [Cs, Stat].
Spantini, Baptista, and Marzouk. 2022. Coupling Techniques for Nonlinear Ensemble Filtering.” SIAM Review.
Spantini, Bigoni, and Marzouk. 2017. Inference via Low-Dimensional Couplings.” Journal of Machine Learning Research.
Tabak, E. G., and Turner. 2013. A Family of Nonparametric Density Estimation Algorithms.” Communications on Pure and Applied Mathematics.
Tabak, Esteban G., and Vanden-Eijnden. 2010. Density Estimation by Dual Ascent of the Log-Likelihood.” Communications in Mathematical Sciences.
van den Berg, Hasenclever, Tomczak, et al. 2018. Sylvester Normalizing Flows for Variational Inference.” In UAI18.
Wang, and Wang. 2019. Riemannian Normalizing Flow on Variational Wasserstein Autoencoder for Text Modeling.” In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers).
Wehenkel, and Louppe. 2021. Graphical Normalizing Flows.” In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics.
Xu, Zuheng, Chen, and Campbell. 2023. MixFlows: Principled Variational Inference via Mixed Flows.”
Xu, Ming, Quiroz, Kohn, et al. 2018. Variance Reduction Properties of the Reparameterization Trick.” arXiv:1809.10330 [Cs, Stat].
Yang, Li, and Wang. 2021. On the Capacity of Deep Generative Networks for Approximating Distributions.” arXiv:2101.12353 [Cs, Math, Stat].
Zahm, Constantine, Prieur, et al. 2018. Gradient-Based Dimension Reduction of Multivariate Vector-Valued Functions.” arXiv:1801.07922 [Math].
Zhang, and Curtis. 2021. Bayesian Geophysical Inversion Using Invertible Neural Networks.” Journal of Geophysical Research: Solid Earth.