Taking gradients through integrals.

See Mohamed et al. (2020) for a roundup.

https://github.com/deepmind/mc_gradients

A common activity for me at the moment is differentiating through an integral, for example through the inverse-CDF lookup used to sample from a distribution.

You see, what I would really like is the derivative of the mass-preserving continuous map \(\phi_{\theta, \tau}\) such that

\[\mathsf{z}\sim F(\cdot;\theta) \Rightarrow \phi_{\theta, \tau}(\mathsf{z})\sim F(\cdot;\tau). \]

Now suppose I wish to optimise or otherwise perturb \(\theta\). This map gives me a way of continuously parameterising a change of measure with respect to a realisation, and I can differentiate with respect to the parameterisation at \(\tau=\theta\):

\[\left.\frac{\partial}{\partial \tau} \phi_{\theta, \tau}(\mathsf{z})\right|_{\tau=\theta}\]

Let us say I need to differentiate through a Monte Carlo algorithm to alter its parameters while holding the PRNG fixed.

See the reparameterization trick for a way of making this more or less feasible for a class of suitably smooth nonparametric neural net problems, by transforming it into a slightly different problem. But what if I am not doing some fluffy nonparametric thing but am instead using real and specific parametric distributions? What to do?

How can I get the derivative of such a map? I can look for candidates for the map. Here I can imagine that our observed rv \({\mathsf{x}}\in \mathbb{R}\) is generated via lookups from the iCDF \(F^{-1}(\cdot;\theta)\) of its distribution, with parameter \(\theta\):

\[\mathsf{x} = F^{-1}(\mathsf{u};\theta) \]

where \(\mathsf{u}\sim\operatorname{Uniform}(0,1)\). Each realization corresponds to a choice of \(u_i\sim \mathsf{u}\) independently.
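As a minimal sketch of what "holding the PRNG fixed" means for iCDF sampling, here is a toy check using only the Python standard library. The choice of an exponential distribution with rate \(\theta\), and the helper names `exp_icdf` and `sample`, are my own assumptions for illustration. Re-using the same seed fixes the uniforms \(u_i\), so each realisation becomes a smooth function of \(\theta\) that can be differentiated, here by central finite differences.

```python
import math
import random


def exp_icdf(u, theta):
    """Inverse CDF of an Exponential(rate=theta) distribution (illustrative choice)."""
    return -math.log1p(-u) / theta


def sample(theta, n, seed=0):
    """Draw n variates by iCDF lookup; a fixed seed fixes the uniforms u_i."""
    rng = random.Random(seed)
    return [exp_icdf(rng.random(), theta) for _ in range(n)]


theta, h = 2.0, 1e-6
x = sample(theta, 5)
x_plus = sample(theta + h, 5)
x_minus = sample(theta - h, 5)

# Central finite difference of each realisation with respect to theta,
# with the underlying uniforms u_i held fixed by the shared seed.
fd = [(a - b) / (2 * h) for a, b in zip(x_plus, x_minus)]

# For this family x = -log(1 - u) / theta, so dx/dtheta = -x / theta.
exact = [-xi / theta for xi in x]

for f, e in zip(fd, exact):
    print(f"finite difference {f:+.6f}   closed form {e:+.6f}")
```

In practice one would probably reach for automatic differentiation rather than finite differences; the point here is only that the samples move smoothly with \(\theta\) once the uniforms are pinned.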

But maybe I generated my original variable not by the iCDF method but by simulating some variable \({\mathsf{z}}\sim F(\cdot; \theta)\) directly. In that case I may as well have generated those \(\mathsf{u}_i\) by taking \(\mathsf{u}_i=F(\mathsf{z}_i;\theta)\) for \(\mathsf{z}_i \sim F(\cdot;\theta)\), so I am conceptually generating my RV by fixing \(z_i\sim\mathsf{z}_i\) and taking \(\phi_{\theta,\tau}(z_i) := F^{-1}(F(z_i;\theta);\tau).\) So to find the effect of my perturbation, what I actually need is

\[\left.\frac{\partial}{\partial \tau} F^{-1}(F(z;\theta);\tau)\right|_{\tau=\theta}.\]
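One way to evaluate this derivative without an explicit iCDF is implicit differentiation. Since \(F(\phi_{\theta,\tau}(z);\tau)=F(z;\theta)\) does not depend on \(\tau\), differentiating both sides in \(\tau\) and setting \(\tau=\theta\) gives \(-\partial_\theta F(z;\theta)/f(z;\theta)\), where \(f\) is the density. This assumes that \(f(z;\theta)>0\) and that \(F\) is differentiable in \(\theta\). The sketch below checks that identity numerically for an exponential distribution with rate \(\theta\); the distribution and all function names are my own illustrative choices.

```python
import math


def exp_cdf(z, theta):
    """CDF of Exponential(rate=theta): 1 - exp(-theta * z)."""
    return -math.expm1(-theta * z)


def exp_icdf(u, tau):
    """Inverse CDF of Exponential(rate=tau)."""
    return -math.log1p(-u) / tau


def phi(z, theta, tau):
    """Transport a draw z from F(.; theta) to a draw from F(.; tau)."""
    return exp_icdf(exp_cdf(z, theta), tau)


def dphi_dtau_fd(z, theta, h=1e-6):
    """Central finite difference of phi in tau, evaluated at tau = theta."""
    return (phi(z, theta, theta + h) - phi(z, theta, theta - h)) / (2 * h)


def dphi_dtau_implicit(z, theta):
    """Implicit-differentiation formula: -(dF/dtheta) / density."""
    dF_dtheta = z * math.exp(-theta * z)    # partial derivative of the CDF in theta
    density = theta * math.exp(-theta * z)  # f(z; theta)
    return -dF_dtheta / density


z, theta = 1.3, 2.0
fd = dphi_dtau_fd(z, theta)
implicit = dphi_dtau_implicit(z, theta)
print(fd, implicit)  # both should be close to -z / theta
```

For this particular family the map is \(\phi_{\theta,\tau}(z)=\theta z/\tau\), so both estimates should agree with \(-z/\theta\).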

Does this do what we want? Kinda. Suppose that the parameter in question is something boring, such as the location parameter of a location-scale distribution, i.e. \(F(\cdot;\theta)=F(\cdot-\theta;0).\) Then \(F^{-1}(\cdot;\theta)=F^{-1}(\cdot;0)+\theta\) and thus

\[\begin{aligned} \left.\frac{\partial}{\partial \tau} F^{-1}(F(z;\theta);\tau)\right|_{\tau=\theta} &=\left.\frac{\partial}{\partial \tau}\left(F^{-1}(F(z-\theta;0);0)+\tau\right)\right|_{\tau=\theta}\\ &=\left.\frac{\partial}{\partial \tau}\left(z-\theta+\tau\right)\right|_{\tau=\theta}\\ &=1. \end{aligned}\]

OK grand, that came out simple enough.
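The location-family result can be sanity-checked numerically. The following is a rough sketch assuming a Gaussian with unit scale as the location family, using `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the function names `phi` and `dphi_dtau` are hypothetical helpers of mine.

```python
from statistics import NormalDist


def phi(z, theta, tau, scale=1.0):
    """Map a draw z from N(theta, scale^2) to a draw from N(tau, scale^2)."""
    u = NormalDist(mu=theta, sigma=scale).cdf(z)       # u = F(z; theta)
    return NormalDist(mu=tau, sigma=scale).inv_cdf(u)  # F^{-1}(u; tau)


def dphi_dtau(z, theta, h=1e-5):
    """Central finite difference of phi in tau, evaluated at tau = theta."""
    return (phi(z, theta, theta + h) - phi(z, theta, theta - h)) / (2 * h)


z, theta = 0.7, 0.3
slope = dphi_dtau(z, theta)
print(slope)  # should be close to 1 for a pure location shift
```

The slope should come out near 1 whatever values of `z` and `theta` are used, which matches the derivation: shifting the location parameter shifts every realisation by the same amount.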

Hyvärinen, Aapo. n.d. “Estimation of Non-Normalized Statistical Models by Score Matching,” 15.

Mohamed, Shakir, Mihaela Rosca, Michael Figurnov, and Andriy Mnih. 2020. “Monte Carlo Gradient Estimation in Machine Learning.” *Journal of Machine Learning Research* 21 (132): 1–62. http://jmlr.org/papers/v21/19-346.html.

Stoker, Thomas M. 1986. “Consistent Estimation of Scaled Coefficients.” *Econometrica* 54 (6): 1461–81. https://doi.org/10.2307/1914309.

Walder, Christian J., Paul Roussel, Richard Nock, Cheng Soon Ong, and Masashi Sugiyama. 2019. “New Tricks for Estimating Gradients of Expectations.” June 24, 2019. http://arxiv.org/abs/1901.11311.