Stein’s method

His eyes are like angels but his heart is cold / No need to ask / He’s a Stein operator

March 12, 2021 — March 7, 2024

approximation
functional analysis
measure
metrics
model selection
optimization
probability
spheres
statistics
Figure 1: The importance of possessive apostrophes

A famous generic method for approximating distributions and quantifying discrepancy and manufacturing concentration bounds and limit theorems is Stein’s method, typically in the form of the method of exchangeable pairs (Stein 1986, 1972).

MAny diverse uses of this method have been made and no single reference seems to summarise them all.

Wikipedia is clear but outdated. Different bits are introduced in Barbour and Chen (2005);Chatterjee (2014);Meckes (2012);Ross (2011). An interesting jumping-off point is Lily Li’s Whirlwind Tour

In these notes we summarize Cindy Zhang’s survey (Zhang 2016) of Nathan Ross’ survey (Ross 2011) on the Fundamentals of Stein’s Method with particular emphasis on the proof of the Central Limit Theorem.

See also Fraser Daly’s video lecture: “The Stein-Chen method”

Elizabeth Meckes, Stein’s Method: - The last gadget under the hood finally crystallizes it for me:

the Big Idea: The Stein Equation We need to solve the Stein equation: given a function \(g\), find \(f\) such that \[ T_o f(x)=g(x)-\mathbb{E} g(X) . \] We use \(U_0\) to denote the operator that gives the solution of the Stein equation: \[ f(x)=U_o g(x) . \] If \(f=U_o g\), observe that \[ \mathbb{E} T_o f(Y)=\mathbb{E} g(Y)-\mathbb{E} g(X) . \] Thus if \(\mathbb{E} T_0 f(Y)\) is small, then \(\mathbb{E} g(Y)-\mathbb{E} g(X)\) is small.

This leads naturally to notions of distance between the random variables \(X\) and \(Y\) which can be expressed in the form \[ d(X, Y)=\sup _{\mathcal{F}}|\mathbb{E} g(X)-\mathbb{E} g(Y)|, \] where the supremum is over some class \(\mathcal{F}\) of test functions \(g\). Examples: \[ \begin{aligned} & \mathcal{F}=\left\{f:\left\|f^{\prime}\right\|_{\infty} \leq 1\right\} \quad \longleftrightarrow \text { Wasserstein distance. } \\ & \mathcal{F}=\left\{f:\|f\|_{\infty}+\left\|f^{\prime}\right\|_{\infty} \leq 1\right\} \quad \longleftrightarrow \text { bounded } \\ & \text { Lipschitz distance. } \\ & \end{aligned} \]

Instead of trying to estimate the distance between \(X\) and \(Y\) directly, the problem has been reduced to trying to estimate \(\mathbb {E} T_o f (Y)\) for some large class of functions \(f\). Why is this any better?

Various techniques are in use for trying to estimate \(\mathbb {E} T_o f (Y)\). Among them: - The method of exchangeable pairs (e.g. Stein’s book) - The dependency graph method (e.g. Arratia, Goldstein, and Gordon or Barbour, Karoński, and Ruciński) - Size-bias coupling (e.g. Goldstein and Rinott) - Zero-bias coupling (e.g. Goldstein and Reinert) - The generator method (Barbour)

1 Stein operators

1.1 Gaussian

The original form, Stein’s lemma (Stein 1972) gives use the Stein operator for the Gaussian distribution in particular. Meckes (2009) explains:

The normal distribution is the unique probability measure \(\mu\) for which \[ \int\left[f^{\prime}(x)-x f(x)\right] \mu(d x)=0 \] for all \(f\) for which the left-hand side exists and is finite. It is useful to think of this in terms of operators, specifically, the operator \(\mathcal{A}_{o}\) defined on \(C^{1}\) functions by \[ \mathcal{A}_{o} f(x)=f^{\prime}(x)-x f(x) \] is called the characterizing operator of the standard normal distribution.
This is incredibly useful in probability approximation by Gaussians where it justifies Stein’s method, below. It has apparently been extended to elliptical distributions and exponential families.

Multivariate? Why, yes please. The following lemma of Meckes (2006) gives a second-order characterizing operator for the Gaussian distribution on \(\mathbb{R}^{k}\):

For \(f \in C^{1}\left(\mathbb{R}^{k}\right)\), define the gradient of \(f\) by \(\nabla f(x)=\left(\frac{\partial f}{\partial x_{1}}(x), \ldots, \frac{\partial f}{\partial x_{k}}(x)\right)^{t}\). Define the Laplacian of \(f\) by \(\Delta f(x)=\sum_{i=1}^{k} \frac{\partial^{2} f}{\partial x_{i}^{2}}(x)\). Now, let \(Z \sim \mathcal{N}(0_k, \mathrm{I}_k)\).

  1. If \(f: \mathbb{R}^{k} \rightarrow \mathbb{R}\) is two times continuously differentiable and compactly supported, then \[ \mathbb{E}[\Delta f(Z)-Z \cdot \nabla f(Z)]=0 \]
  2. If \(Y \in \mathbb{R}^{k}\) is a random vector such that \[ \mathbb{E}[\Delta f(Y)-Y \cdot \nabla f(Y)]=0 \] for every \(f \in C^{2}\left(\mathbb{R}^{k}\right)\), then \(\mathcal{L}(Y)=\mathcal{L}(Z) .\)
  3. If \(g \in C_{o}^{\infty}\left(\mathbb{R}^{k}\right)\), then the function \[ U_{o} g(x):=\int_{0}^{1} \frac{1}{2 t}[\mathbb{E} g(\sqrt{t} x+\sqrt{1-t} Z)-\mathbb{E} g(Z)] d t \] is a solution to the differential equation \[ \Delta h(x)-x \cdot \nabla h(x)=g(x)-\mathbb{E} g(Z) \]

1.2 Poisson

a.k.a. Stein-Chen (e.g. Chen 1998) \[\mathcal{A}_{o} f(k)=\lambda f(k+1)-k f(k)\]

1.3 Exponential

Ross (2011) mentions

\[ \mathcal{A}_{o} f(x)=f^{\prime}(x)-f(x)+f(0) \]

See also Chatterjee, Fulman, and Röllin (2011) who credits Stein et al. (2004) with > a random variable \(Z\) on \([0, \infty)\) is \(\operatorname{Exp}(1)\) if and only if \(\mathbb{E}\left[f^{\prime}(Z)-f(Z)\right]=-f\left(0^{+}\right)\) for all functions \(f\) in a large class of functions (whose precise definition we do not need).

1.4 Via the method of generators

Barbour (n.d.) gives us a method of creating Stein operators via infinitesimal generators on the way to a Poisson process Stein operator.

Summarised by Gesine Reinert in Barbour and Chen (2005):

The basic idea is to choose the operator \(\mathcal{A}\) to be the generator of a Markov process with stationary distribution \(\mu\). That is, for a homogeneous Markov process \(\left(X_t\right)_{t \geq 0}\), put \(T_t f(x)=\mathbb{E}\left(f\left(X_t\right) \mid X(0)=x\right)\). The generator of the Markov process is defined as \(\mathcal{A} f(x)=\lim _{t \downarrow 0} \frac{1}{t}\left(T_t f(x)-f(x)\right)\). Standard results for generators […]] yield

  1. If \(\mu\) is the stationary distribution of the Markov process then \(X \sim \mu\) if and only if \(\mathbb{E} \mathcal{A} f(X)=0\) for all real-valued functions \(f\) for which \(\mathcal{A} f\) is defined.
  2. \(T_t h-h=\mathcal{A}\left(\int_0^t T_u h d u\right)\), and, formally taking limits, \[ \int h d \mu-h=\mathcal{A}\left(\int_0^{\infty} T_u h d u\right) \] if the right-hand side exists. Thus the generator approach gives both a Stein equation and a candidate for its solution. One could hence view this approach as a Markov process perturbation technique.

Reassuring example:

The operator \[ \mathcal{A} h(x)=h^{\prime \prime}(x)-x h^{\prime}(x) \] is the generator of the Ornstein-Uhlenbeck process with stationary distribution \(\mathcal{N}(0,1)\). Putting \(f=h^{\prime}\) gives the classical Stein characterization for \(\mathcal{N}(0,1)\)

2 Stein’s method for Gaussians via exchangeable pairs

Figure 2

Meckes (2009) summarises:

Heuristically, the univariate method of exchangeable pairs goes as follows. Let \(W\) be a random variable conjectured to be approximately Gaussian; assume that \(\mathbb{E} W=0\) and \(\mathbb{E} W^{2}=1 .\) From \(W,\) construct a new random variable \(W^{\prime}\) such that the pair \(\left(W, W^{\prime}\right)\) has the same distribution as \(\left(W^{\prime}, W\right) .\) This is usually done by making a “small random change” in \(W\), so that \(W\) and \(W^{\prime}\) are close. Let \(\Delta=W^{\prime}-W\). If it can be verified that there is a \(\lambda>0\) such that \[ \begin{aligned} \mathbb{E}[\Delta \mid W]=-\lambda W+E_{1} \\ \mathbb{E}\left[\Delta^{2} \mid W\right]=2 \lambda+E_{2} \\ \mathbb{E}|\Delta|^{3}=E_{3} \end{aligned} \] with the random quantities \(E_{1}, E_{2}\) and the deterministic quantity \(E_{3}\) being small compared to \(\lambda,\) then \(W\) is indeed approximately Gaussian, and its distance to Gaussian (in some metric) can be bounded in terms of the \(E_{i}\) and \(\lambda\).

This comes out very nicely where there are natural symmetries to exploit, e.g. in low-d projections.

2.1 Multivariate Gaussian Stein method

The work of Elizabeth Meckes (1980—2020) is incredibly useful in this area. What a cruelly early death.

Two foundational ones are Chatterjee and Meckes (2008) and Meckes (2009) and there is a kind of introductory user guide in Meckes (2012); The examples are mostly about random projections although the method is much more general. The exchangeable pairs are natural in projections though, you can just switch off your brain and turn the handle to produce results, or easier yet, a computer algebra system that can handle noncommutative algebra can do it for you.

If the papers are too dense, try this friendly lecture, Stein’s Method — The last gadget under the hood.

3 Stein discrepancy

A probability metric based on something like “how well this distribution satisfies Stein’s lemma”, I think?

See (Kernelized) Stein gradient descent.

4 Incoming

5 References

Agueh, and Carlier. 2011. Barycenters in the Wasserstein Space.” SIAM Journal on Mathematical Analysis.
Ambrogioni, Güçlü, Güçlütürk, et al. 2018. Wasserstein Variational Inference.” In Proceedings of the 32Nd International Conference on Neural Information Processing Systems. NIPS’18.
Anastasiou, Barp, Briol, et al. 2021. Stein’s Method Meets Statistics: A Review of Some Recent Developments.” arXiv:2105.03481 [Math, Stat].
Arras, and Houdré. 2019. On Stein’s Method for Multivariate Self-Decomposable Laws.” Electronic Journal of Probability.
Barbour. n.d. Stein’s Method and Poisson Process Convergence.” Journal of Applied Probability.
Barbour, and Chen, eds. 2005. An Introduction to Stein’s Method. Lecture Notes Series / Institute for Mathematical Sciences, National University of Singapore, v. 4.
Chatterjee. 2014. A Short Survey of Stein’s Method.” arXiv:1404.1392 [Math].
Chatterjee, Fulman, and Röllin. 2011. Exponential Approximation by Stein’s Method and Spectral Graph Theory.” ALEA Revista Latino-Americana de Probabilidade e Estatística.
Chatterjee, and Meckes. 2008. Multivariate Normal Approximation Using Exchangeable Pairs.” arXiv:math/0701464.
Chen. 1998. Stein’s Method: Some Perspectives with Applications.” In Probability Towards 2000. Lecture Notes in Statistics.
Chu, Minami, and Fukumizu. 2022. The Equivalence Between Stein Variational Gradient Descent and Black-Box Variational Inference.” In.
Chwialkowski, Strathmann, and Gretton. 2016. A Kernel Test of Goodness of Fit.” In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48. ICML’16.
Cuturi, and Doucet. 2014. Fast Computation of Wasserstein Barycenters.” In International Conference on Machine Learning.
Detommaso, Cui, Spantini, et al. 2018. A Stein Variational Newton Method.” In Proceedings of the 32nd International Conference on Neural Information Processing Systems. NIPS’18.
Ethier, and Kurtz. 2009. Markov Processes: Characterization and Convergence.
Gong, Peng, and Liu. 2019. Quantile Stein Variational Gradient Descent for Batch Bayesian Optimization.” In Proceedings of the 36th International Conference on Machine Learning.
Gorham, and Mackey. 2015. Measuring Sample Quality with Stein’s Method.” In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1. NIPS’15.
Gorham, Raj, and Mackey. 2020. Stochastic Stein Discrepancies.” arXiv:2007.02857 [Cs, Math, Stat].
Han, and Liu. 2018. Stein Variational Gradient Descent Without Gradient.” In Proceedings of the 35th International Conference on Machine Learning.
Landsman, and Nešlehová. 2008. Stein’s Lemma for Elliptical Random Vectors.” Journal of Multivariate Analysis.
Landsman, Vanduffel, and Yao. 2013. A Note on Stein’s Lemma for Multivariate Elliptical Distributions.” Journal of Statistical Planning and Inference.
Ley, Reinert, and Swan. 2017. Stein’s Method for Comparison of Univariate Distributions.” Probability Surveys.
Li. 2021. “Stein’s Method: A Whirlwind Tour.”
Liu, Qiang. 2016. Stein Variational Gradient Descent: Theory and Applications.”
———. 2017. Stein Variational Gradient Descent as Gradient Flow.”
Liu, Qiang, Lee, and Jordan. 2016. A Kernelized Stein Discrepancy for Goodness-of-Fit Tests.” In Proceedings of The 33rd International Conference on Machine Learning.
Liu, Qiang, and Wang. 2018. Stein Variational Gradient Descent as Moment Matching.” In Proceedings of the 32nd International Conference on Neural Information Processing Systems. NIPS’18.
———. 2019. Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm.” In Advances In Neural Information Processing Systems.
Liu, Chang, and Zhu. 2018. Riemannian Stein Variational Gradient Descent for Bayesian Inference.” Proceedings of the AAAI Conference on Artificial Intelligence.
Liu, Xing, Zhu, Ton, et al. 2022. Grassmann Stein Variational Gradient Descent.” In Proceedings of The 25th International Conference on Artificial Intelligence and Statistics.
Loh. n.d. Stein’s Method and Multinomial Approximation.”
Meckes. 2006. “An Infinitesimal Version of Stein’s Method of Exchangeable Pairs.”
———. 2009. On Stein’s Method for Multivariate Normal Approximation.” In High Dimensional Probability V: The Luminy Volume.
———. 2012. Approximation of Projections of Random Vectors.” Journal of Theoretical Probability.
Paulin, Mackey, and Tropp. 2016. Efron–Stein Inequalities for Random Matrices.” The Annals of Probability.
Piccoli, and Rossi. 2016. On Properties of the Generalized Wasserstein Distance.” Archive for Rational Mechanics and Analysis.
Reinert. 2000. Stein’s Method for Epidemic Processes.” In Complex Stochastic Systems.
———. 2005. Three General Approaches to Stein’s Method.” In An Introduction to Stein’s Method. Lecture Notes Series, Institute for Mathematical Sciences, National University of Singapore, Volume 4.
Reinert, and Röllin. 2007. Multivariate Normal Approximation with Stein’s Method of Exchangeable Pairs Under a General Linearity Condition.”
Ross. 2011. Fundamentals of Stein’s Method.” Probability Surveys.
Schoutens. 2001. Orthogonal Polynomials in Stein’s Method.” Journal of Mathematical Analysis and Applications.
Shi, Sun, and Zhu. 2018. A Spectral Approach to Gradient Estimation for Implicit Distributions.” In.
Stein. 1972. A Bound for the Error in the Normal Approximation to the Distribution of a Sum of Dependent Random Variables.” Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Volume 2: Probability Theory.
———. 1986. Approximate Computation of Expectations.
Stein, Diaconis, Holmes, et al. 2004. Use of Exchangeable Pairs in the Analysis of Simulations.” In Institute of Mathematical Statistics Lecture Notes - Monograph Series.
Tan, Lu, and Xia. 2018. Relative Error of Scaled Poisson Approximation via Stein’s Method.” arXiv:1810.04300 [Math].
Upadhye, and Barman. 2020. A Unified Approach to Stein’s Method for Stable Distributions.” arXiv:2004.07593 [Math].
Wang, and Liu. 2019. Nonlinear Stein Variational Gradient Descent for Learning Diversified Mixture Models.” In Proceedings of the 36th International Conference on Machine Learning.
Wang, Tang, Bajaj, et al. 2019. Stein Variational Gradient Descent with Matrix-Valued Kernels.” In Proceedings of the 33rd International Conference on Neural Information Processing Systems.
Wang, Zeng, and Liu. 2018. Stein Variational Message Passing for Continuous Graphical Models.”
Xu, and Matsuda. 2020. A Stein Goodness-of-Fit Test for Directional Distributions.” In International Conference on Artificial Intelligence and Statistics.
———. 2021. Interpretable Stein Goodness-of-Fit Tests on Riemannian Manifolds.” arXiv:2103.00895 [Stat].
Xu, and Reinert. 2021. A Stein Goodness of Fit Test for Exponential Random Graph Models.” arXiv:2103.00580 [Stat].
Zhang. 2016. Fundamentals of Stein’s Method and Its Application in Proving Central Limit Theorem.”
Zhuo, Liu, Shi, et al. 2018. Message Passing Stein Variational Gradient Descent.” In Proceedings of the 35th International Conference on Machine Learning.