Monte Carlo gradient estimation

Taking gradients through integrals.

See Mohamed et al. (2020) for a roundup.

A common activity for me at the moment is differentiating through an integral, for example through an inverse-CDF lookup.

You see, what I would really like is the derivative of the mass-preserving continuous map \(\phi_{\theta, \tau}\) such that

\[\mathsf{z}\sim F(\cdot;\theta) \Rightarrow \phi_{\theta, \tau}(\mathsf{z})\sim F(\cdot;\tau). \]

Now suppose I wish to optimise or otherwise perturb \(\theta\). This gives me a way of continuously parameterising a change of measure with respect to a realisation, and I can differentiate with respect to the parameterisation at

\[\left.\frac{\partial}{\partial \tau} \phi_{\theta, \tau}(\mathsf{z})\right|_{\tau=\theta}.\]

Let us say I need to differentiate through a Monte Carlo algorithm to alter its parameters while holding the PRNG fixed.
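To make "holding the PRNG fixed" concrete, here is a minimal sketch under assumptions of my own choosing: a toy target \(\mathbb{E}[\mathsf{x}^2]\) with \(\mathsf{x}\sim\mathcal{N}(\theta,1)\), and a hypothetical helper `mc_estimate` that re-seeds its generator on every call, so the same noise is reused for every value of \(\theta\). It is an illustration of the idea, not a general recipe.

```python
import random

def mc_estimate(theta, n=1000, seed=0):
    """Monte Carlo estimate of E[x^2] for x ~ Normal(theta, 1), with a fixed seed."""
    rng = random.Random(seed)  # re-seeding means every call sees the same noise
    return sum((theta + rng.gauss(0.0, 1.0)) ** 2 for _ in range(n)) / n

theta = 1.5
eps = 1e-4

# Finite difference through the whole sampler with the random stream held fixed.
fd_grad = (mc_estimate(theta + eps) - mc_estimate(theta - eps)) / (2 * eps)

# Exact answer for this toy problem: E[x^2] = theta^2 + 1, so the gradient is 2 * theta.
print(fd_grad, 2 * theta)
```

Because both evaluations share the same noise, the difference quotient is a smooth function of \(\theta\). With independent seeds for the two evaluations, the sampling noise divided by `2 * eps` would swamp the signal.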

See the reparameterization trick for a way of making this more or less feasible for a class of suitably smooth nonparametric neural-net problems, by transforming it into a slightly different problem. But what if I am not doing some fluffy nonparametric thing and am instead using real, specific parametric distributions? What to do?

How can I get the derivative of such a map? I can look for candidates for the map. Here I can imagine that our observed rv \({\mathsf{x}}\in \mathbb{R}\) is generated via lookups from its iCDF \(F^{-1}(\cdot;\theta)\) with parameter \(\theta\):

\[\mathsf{x} = F^{-1}(\mathsf{u};\theta) \]

where \(\mathsf{u}\sim\operatorname{Uniform}(0,1)\). Each realization corresponds to an independent draw \(u_i\sim \mathsf{u}\).
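As a small illustration of the iCDF construction, here is a sketch that assumes an exponential distribution with rate \(\theta\), chosen only because its inverse CDF has a closed form. The helper name `exponential_icdf` is hypothetical. The uniforms \(u_i\) are drawn once and then held fixed while the parameter is perturbed.

```python
import math
import random

def exponential_icdf(u, rate):
    """Inverse CDF of Exponential(rate): F^{-1}(u; rate) = -log(1 - u) / rate."""
    return -math.log1p(-u) / rate

# Draw the uniforms once; they are reused unchanged for every parameter value.
rng = random.Random(0)
us = [rng.random() for _ in range(5)]

theta = 2.0
eps = 1e-6

xs = [exponential_icdf(u, theta) for u in us]

# Central finite difference of each sample with respect to the rate, u_i held fixed.
fd_grads = [
    (exponential_icdf(u, theta + eps) - exponential_icdf(u, theta - eps)) / (2 * eps)
    for u in us
]

# Closed form for this family: d/d(rate) of -log(1 - u) / rate equals -x / rate.
exact_grads = [-x / theta for x in xs]

for x, g_fd, g_exact in zip(xs, fd_grads, exact_grads):
    print(f"x = {x:.4f}  finite-diff = {g_fd:.6f}  closed form = {g_exact:.6f}")
```

Each sample is a deterministic, differentiable function of \(\theta\) once its \(u_i\) is fixed, which is what makes the per-sample derivative well defined.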

But maybe I generated my original variable not by the iCDF method but by simulating some variable \({\mathsf{z}}\sim F(\cdot; \theta)\) directly. In that case I may as well have generated those \(\mathsf{u}_i\) by taking \(\mathsf{u}_i=F(\mathsf{z}_i;\theta)\) for \(\mathsf{z}_i \sim F(\cdot;\theta)\), and I am conceptually generating my RV by fixing a realization \(z_i\) of \(\mathsf{z}_i\) and taking \(\phi_{\theta,\tau}(z_i) := F^{-1}(F(z_i;\theta);\tau).\) So to find the effect of my perturbation, what I actually need is

\[\left.\frac{\partial}{\partial \tau} F^{-1}(F(z;\theta);\tau)\right|_{\tau=\theta}.\]
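One way to unpack this quantity is implicit differentiation. This is a sketch under assumptions not stated above: that \(F(x;\tau)\) is differentiable in both \(x\) and \(\tau\), and that the density \(f(x;\tau)=\partial F(x;\tau)/\partial x\) is strictly positive at the point of interest. Writing \(\phi=\phi_{\theta,\tau}(z)\), the map is defined implicitly by \(F(\phi;\tau)=F(z;\theta)\), and the right-hand side does not depend on \(\tau\). Differentiating both sides with respect to \(\tau\) gives

\[f(\phi;\tau)\,\frac{\partial \phi}{\partial \tau} + \frac{\partial F}{\partial \tau}(\phi;\tau) = 0.\]

At \(\tau=\theta\) we have \(\phi=z\), so

\[\left.\frac{\partial}{\partial \tau} \phi_{\theta,\tau}(z)\right|_{\tau=\theta} = -\frac{1}{f(z;\theta)}\left.\frac{\partial F(z;\tau)}{\partial \tau}\right|_{\tau=\theta}.\]

In this form the inverse CDF never needs to be evaluated. Only the CDF's parameter derivative and the density at the realised \(z\) are required.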

Does this do what we want? Kinda. Suppose that the parameter in question is something boring, such as the location parameter of a location-scale family, i.e. \(F(\cdot;\theta)=F(\cdot-\theta;0).\) Then \(F^{-1}(\cdot;\theta)=F^{-1}(\cdot;0)+\theta\) and thus

\[\begin{aligned} \left.\frac{\partial}{\partial \tau} F^{-1}(F(z;\theta);\tau)\right|_{\tau=\theta} &=\left.\frac{\partial}{\partial \tau}\left(F^{-1}(F(z-\theta;0);0)+\tau\right)\right|_{\tau=\theta}\\ &=\left.\frac{\partial}{\partial \tau}\left(z-\theta+\tau\right)\right|_{\tau=\theta}\\ &=1. \end{aligned}\]

OK, grand, that came out simple enough.
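The location-parameter result can be checked numerically. The sketch below assumes a normal family with unit scale, picked purely for convenience, and uses the standard library's `statistics.NormalDist` for the CDF and inverse CDF. The helper names `transport` and `dtransport_dtau` are hypothetical.

```python
from statistics import NormalDist  # available in Python 3.8+

def transport(z, theta, tau, sigma=1.0):
    """phi_{theta,tau}(z) = F^{-1}(F(z; theta); tau) for a normal location family."""
    u = NormalDist(mu=theta, sigma=sigma).cdf(z)
    return NormalDist(mu=tau, sigma=sigma).inv_cdf(u)

def dtransport_dtau(z, theta, eps=1e-4):
    """Central finite difference of the transport map in tau, evaluated at tau = theta."""
    return (transport(z, theta, theta + eps) - transport(z, theta, theta - eps)) / (2 * eps)

theta = 0.7
grads = [dtransport_dtau(z, theta) for z in (-1.0, 0.2, 0.7, 2.5)]

# For a location parameter, each entry should be close to 1 regardless of z.
print(grads)
```

The derivative does not depend on \(z\) here, which matches the algebra: shifting the location shifts every sample by the same amount.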

Hyvarinen, Aapo. n.d. “Estimation of Non-Normalized Statistical Models by Score Matching,” 15.

Mohamed, Shakir, Mihaela Rosca, Michael Figurnov, and Andriy Mnih. 2020. “Monte Carlo Gradient Estimation in Machine Learning.” Journal of Machine Learning Research 21 (132): 1–62.

Stoker, Thomas M. 1986. “Consistent Estimation of Scaled Coefficients.” Econometrica 54 (6): 1461–81.

Walder, Christian J., Paul Roussel, Richard Nock, Cheng Soon Ong, and Masashi Sugiyama. 2019. “New Tricks for Estimating Gradients of Expectations.” June 24, 2019.