Monte Carlo gradient estimation
September 30, 2020 — July 26, 2024
Taking gradients through integrals/expectations using randomness, i.e. can I estimate this?
\[ g[f] := \frac{\partial}{\partial \theta} \mathbb{E}_{\mathsf{x}\sim p(\mathsf{x};\theta)}[f(\mathsf{x})] \]
A concept with a similar name but which is not the same is Stochastic Gradient MCMC, which uses stochastic gradients to sample from a target posterior distribution. Some similar tools and concepts pop up in both applications.
1 Score function estimator
A.k.a. REINFORCE (all-caps, for some reason?) A generic method that works on lots of things, including discrete variables; notoriously high variance if done naïvely. Credited to (Williams 1992), but surely it must be older than that?
\[ \begin{aligned} g(f) &= \frac{\partial}{\partial \theta} \mathbb{E}_{\mathsf{x}\sim p(\mathsf{x};\theta)} f(\mathsf{x}) \\ &= \frac{\partial}{\partial \theta} \int f(x) p(x;\theta) \mathrm{d} x\\ &= \int f(x) \frac{\partial}{\partial \theta} p(x;\theta) \mathrm{d} x\\ &= \mathbb{E}_{\mathsf{x}\sim p(\mathsf{x};\theta)} f(\mathsf{x}) \frac{\partial}{\partial \theta}\log p(\mathsf{x};\theta) \end{aligned} \] The use of this is that there is a simple and obvious Monte Carlo estimate of the latter, choosing sample \(x_i\sim p(x;\theta)\) \[ \begin{aligned} \hat{g}_{\text{REINFORCE}}(f) &= \sum_i f(x_i) \frac{\partial}{\partial \theta}\log p(x_i;\theta) \end{aligned} \]
See score function estimators for more.
2 Reparameterization trick
Define some base distribution \(\mathsf{e}\sim p(\mathsf{e})\) such that \(f(\mathsf{x};\theta) \simeq f(T(\mathsf{e});\theta)\) for some transform \(T\). Then \[\begin{aligned} \hat{g}_{\text{reparam}}(f) &= \frac{\partial}{\partial \theta} \mathbb{E}_{\mathsf{e}\sim p(\mathsf{e})}[f(\mathsf{x})] \\ &= \mathbb{E}_{\mathsf{e}\sim p(\mathsf{e})}\left[\frac{\partial f}{\partial T}\frac{\partial T}{\partial \theta}(\mathsf{e};\theta)\right]. \end{aligned}\]
Less general but better-behaved than the score-function/REINFORCE estimator.
See reparameterization trick for more about that.
3 Gumbel-softmax
For categorical variates’ gradients in particular. See Gumbel-softmax.
4 Parametric
I can imagine that our observed rv \({\mathsf{x}}\in \mathbb{R}\) is generated via lookups from its iCDF $F(;) $ with parameter \(\theta\): \[\mathsf{x} = F^{-1}(\mathsf{u};\theta) \] where \(\mathsf{u}\sim\operatorname{Uniform}(0,1)\). Each realization corresponds to a choice of \(u_i\sim \mathsf{u}\) independently. How can I get the derivative of such a map?
Maybe I generated my original variable not by the icdf method but by simulating some variable \({\mathsf{z}}\sim F(\cdot; \theta).\) In which case I may as well have generated those \(\mathsf{u}_i\) by taking \(\mathsf{u}_i=F(\mathsf{z}_i;\theta)\) for some \(\mathsf{z} \sim F(\cdot;\theta)\) and I am conceptually generating my RV by fixing \(z_i\sim\mathsf{z}_i\) and taking \(\phi := F^{-1}(F(z_i;\theta);\tau).\) So to find the effect of my perturbation what I actually need is
\[\begin{aligned} \left.\frac{\partial}{\partial \tau} F^{-1}(F(z;\theta);\tau)\right|_{\tau=\theta}\\ \end{aligned}\]
Does this do what we want? Kinda. So suppose that the parameters in question are something boring, such as the location parameter of a location-scale distribution, i.e. \(F(\cdot;\theta)=F(\cdot-\theta;0).\) Then \(F^{-1}(\cdot;\theta)=F^{-1}(\cdot;0)+\theta\) and thus
\[\begin{aligned} \left.\frac{\partial}{\partial \tau} F^{-1}(F(z;\theta);\tau)\right|_{\tau=\theta} &=\left.\frac{\partial}{\partial \tau} F^{-1}(F(z-\theta;0);0)+\tau\right|_{\tau=\theta}\\ &=\left.\frac{\partial}{\partial \tau}\left(z-\theta+\tau\right)\right|_{\tau=\theta}\\ &=1\\ \end{aligned}\]
OK grand that came out simple enough.
TBC
5 “Measure-valued”
TBD (Mohamed et al. 2020; Rosca et al. 2019).
6 Tooling
van Krieken, Tomczak, and Teije (2021) supplies us with a large library of pytorch tools for stochastic gradient estimation purposes, under the rubric Storchastic. (Source.). See also Deepmind’s mc_gradients.
7 Optimising Monte Carlo
Let us say I need to differentiate through a monte carlo algorithm to alter its parameters while holding the PRNG fixed. See Tuning MC.