Large sample theory
2015-02-15 — 2025-05-28
Wherein asymptotic behaviour is surveyed, and the posterior is shown to be asymptotically Gaussian under Bernstein–von Mises conditions while MMD and Fisher information govern rates and variances.
under construction ⚠️: I merged two notebooks here. The seams are showing.
We use asymptotic approximations all the time in statistics, most frequently in asymptotic pivots that motivate classical tests, e.g. in classical hypothesis tests or an information penalty. We use the asymptotic delta method to motivate robust statistics or infinite neural networks. There are various specialised mechanisms; I am fond of the Stein methods. Also fun, Feynman-Kac formulae give us central limit theorems for all manner of weird processes.
There is much to be said on the various central limit theorems, but I will not be the one to say it right this minute, because this is a placeholder.
1 Frequentist asymptotics
A convenient feature of M-estimation, and especially maximum likelihood estimation, is the simple behaviour of estimators in the asymptotic large-sample-size limit, which can give us, e.g., variance estimates, or motivate information criteria, or robust statistics, optimisation, etc.
In the most celebrated and convenient cases, asymptotic bounds are about normally-distributed errors, and these are typically derived through Local Asymptotic Normality theorems. A simple and general introduction is given in Andersen et al. (1997), page 594, which applies to both i.i.d. data and dependent_data in the form of point processes. For all that it is applied, it is still stringent.
In many nice distributions, central limit theorems lead (asymptotically) to Gaussian distributions, and we can treat uncertainty in terms of transformations of Gaussians.
2 Bayesian posteriors are kinda Gaussian
The Bayesian large sample result of note is the Bernstein–von Mises theorem, which provides some conditions under which the posterior distribution is asymptotically Gaussian and the Laplace approximation is true-ish.
3 Particle filters are kinda Gaussian
Long story: Bishop and Del Moral (2023); Bishop and Del Moral (2016); Cérou et al. (2005); Pierre Del Moral, Hu, and Wu (2011); Pierre Del Moral (2004); Pierre Del Moral and Doucet (2009); P. Del Moral, Kurtzmann, and Tugaut (2017).
4 Fisher Information
Used in ML theory and kinda-sorta in robust estimation and natural gradient methods. A matrix that tells us how much a new datum affects our parameter estimates. (It’s related, I’m told, to garden-variety Shannon information, and when that non-obvious fact is clearer to me I shall expand how precisely this could be so.) 🏗
5 In Gibbs posteriors
What if the loss function is not a likelihood but arises from a generalized loss as in Gibbs posteriors? See Gibbs posterior asymptotics.
6 Limit theorems for MMD-inference
By which I mean maximum mean discrepancy (MMD) flavours of the so-called “generalized” variational Bayesian inference methods, which are a way of doing Bayesian inference by minimizing some distance between the posterior and a variational family.
Chérief-Abdellatif, Alquier, and Khan (2019) define a pseudo-posterior proportional to \(\exp\{ - n\,{\rm MMD}^2(P_\theta,\widehat P_n)/\eta\}\). They prove consistency and give oracle-type bounds on the expected MMD risk; rates match the minimax optimal \(n^{-1/2}\) up to log factors under characteristic kernels. They also show that any mean-field variational approximation with bounded KL gap inherits the same contraction rate. So this is an implicit limit theorem for other metrics too, I guess? See also Cherief-Abdellatif and Alquier (2020).
Briol et al. (2019) study minimum-MMD estimators for Gaussian location/scale families; they show the estimators are \(\sqrt{n}\)-consistent and asymptotically normal after a kernel-dependent efficiency loss. It looks like there might be some annoying conditions about the model being correctly specified.
7 Convolution Theorem
The unhelpfully-named convolution theorem of Hájek (1970) — is that related?
Suppose \(\hat{\theta}\) is an efficient estimator of \(\theta\) and \(\tilde{\theta}\) is another, not fully efficient, estimator. The convolution theorem says that, if you rule out stupid exceptions, asymptotically \(\tilde{\theta} = \hat{\theta} + \varepsilon\) where \(\varepsilon\) is pure noise, independent of \(\hat{\theta}.\)
The reason that’s almost obvious is that if it weren’t true, there would be some information about \(\theta\) in \(\tilde{\theta}-\hat{\theta}\), and you could use this information to get a better estimator than \(\hat{\theta}\), which (by assumption) can’t happen. The stupid exceptions are things like the Hodges superefficient estimator that do better at a few values of \(\hat{\theta}\) but much worse at neighbouring values.