Large sample theory

2015-02-15 — 2025-05-28

Wherein asymptotic behaviour is surveyed, and the posterior is shown to be asymptotically Gaussian under Bernstein–von Mises conditions while MMD and Fisher information govern rates and variances.

Gaussian

likelihood

optimization

probability

statistics

Figure 1: Many things become similar if you wait long enough.

under construction ⚠️: I merged two notebooks here. The seams are showing.

We use asymptotic approximations all the time in statistics, most frequently in asymptotic pivots that motivate classical tests, e.g. in classical hypothesis tests or an information penalty. We use the asymptotic delta method to motivate robust statistics or infinite neural networks. There are various specialised mechanisms; I am fond of the Stein methods. Also fun, Feynman-Kac formulae give us central limit theorems for all manner of weird processes.

There is much to be said on the various central limit theorems, but I will not be the one to say it right this minute, because this is a placeholder.

1 Frequentist asymptotics

A convenient feature of M-estimation, and especially maximum likelihood estimation, is the simple behaviour of estimators in the asymptotic large-sample-size limit, which can give us, e.g., variance estimates, or motivate information criteria, or robust statistics, optimisation, etc.

In the most celebrated and convenient cases, asymptotic bounds are about normally-distributed errors, and these are typically derived through Local Asymptotic Normality theorems. A simple and general introduction is given in Andersen et al. (1997), page 594, which applies to both i.i.d. data and dependent_data in the form of point processes. For all that it is applied, it is still stringent.

In many nice distributions, central limit theorems lead (asymptotically) to Gaussian distributions, and we can treat uncertainty in terms of transformations of Gaussians.

2 Bayesian posteriors are kinda Gaussian

The Bayesian large sample result of note is the Bernstein–von Mises theorem, which provides some conditions under which the posterior distribution is asymptotically Gaussian and the Laplace approximation is true-ish.

3 Particle filters are kinda Gaussian

Long story: Bishop and Del Moral (2023); Bishop and Del Moral (2016); Cérou et al. (2005); Pierre Del Moral, Hu, and Wu (2011); Pierre Del Moral (2004); Pierre Del Moral and Doucet (2009); P. Del Moral, Kurtzmann, and Tugaut (2017).

4 Fisher Information

Used in ML theory and kinda-sorta in robust estimation and natural gradient methods. A matrix that tells us how much a new datum affects our parameter estimates. (It’s related, I’m told, to garden-variety Shannon information, and when that non-obvious fact is clearer to me I shall expand how precisely this could be so.) 🏗

5 In Gibbs posteriors

What if the loss function is not a likelihood but arises from a generalized loss as in Gibbs posteriors? See Gibbs posterior asymptotics.

6 Limit theorems for MMD-inference

By which I mean maximum mean discrepancy (MMD) flavours of the so-called “generalized” variational Bayesian inference methods, which are a way of doing Bayesian inference by minimizing some distance between the posterior and a variational family.

Chérief-Abdellatif, Alquier, and Khan (2019) define a pseudo-posterior proportional to \(\exp\{ - n\,{\rm MMD}^2(P_\theta,\widehat P_n)/\eta\}\). They prove consistency and give oracle-type bounds on the expected MMD risk; rates match the minimax optimal \(n^{-1/2}\) up to log factors under characteristic kernels. They also show that any mean-field variational approximation with bounded KL gap inherits the same contraction rate. So this is an implicit limit theorem for other metrics too, I guess? See also Cherief-Abdellatif and Alquier (2020).

Briol et al. (2019) study minimum-MMD estimators for Gaussian location/scale families; they show the estimators are \(\sqrt{n}\)-consistent and asymptotically normal after a kernel-dependent efficiency loss. It looks like there might be some annoying conditions about the model being correctly specified.

7 Convolution Theorem

The unhelpfully-named convolution theorem of Hájek (1970) — is that related?

Suppose \(\hat{\theta}\) is an efficient estimator of \(\theta\) and \(\tilde{\theta}\) is another, not fully efficient, estimator. The convolution theorem says that, if you rule out stupid exceptions, asymptotically \(\tilde{\theta} = \hat{\theta} + \varepsilon\) where \(\varepsilon\) is pure noise, independent of \(\hat{\theta}.\)

The reason that’s almost obvious is that if it weren’t true, there would be some information about \(\theta\) in \(\tilde{\theta}-\hat{\theta}\), and you could use this information to get a better estimator than \(\hat{\theta}\), which (by assumption) can’t happen. The stupid exceptions are things like the Hodges superefficient estimator that do better at a few values of \(\hat{\theta}\) but much worse at neighbouring values.

8 References

Akaike, Hirotogu. 1973. “Information Theory and an Extension of the Maximum Likelihood Principle.” In Proceeding of the Second International Symposium on Information Theory.

Akaike, Htrotugu. 1973. “Maximum Likelihood Identification of Gaussian Autoregressive Moving Average Models.” Biometrika.

Alquier, Ridgway, and Chopin. 2016. “On the Properties of Variational Approximations of Gibbs Posteriors.” Journal of Machine Learning Research.

Andersen, Borgan, Gill, et al. 1997. Statistical models based on counting processes. Springer series in statistics.

Athreya, K. B., and Keiding. 1977. “Estimation Theory for Continuous-Time Branching Processes.” Sankhyā: The Indian Journal of Statistics, Series A (1961-2002).

Athreya, Krishna B, and Lahiri. 2006. Measure theory and probability theory.

Bacry, Delattre, Hoffmann, et al. 2013. “Some Limit Theorems for Hawkes Processes and Application to Financial Statistics.” Stochastic Processes and Their Applications, A Special Issue on the Occasion of the 2013 International Year of Statistics,.

Barbour, and Chen, eds. 2005. An Introduction to Stein’s Method. Lecture Notes Series / Institute for Mathematical Sciences, National University of Singapore, v. 4.

Barndorff-Nielsen, and Sørensen. 1994. “A Review of Some Aspects of Asymptotic Likelihood Theory for Stochastic Processes.” International Statistical Review / Revue Internationale de Statistique.

Barrio, Deheuvels, and van de Geer. 2006. Lectures on Empirical Processes: Theory and Statistical Applications.

Barron. 1986. “Entropy and the Central Limit Theorem.” The Annals of Probability.

Becker-Kern, Meerschaert, and Scheffler. 2004. “Limit Theorems for Coupled Continuous Time Random Walks.” The Annals of Probability.

Bibby, and Sørensen. 1995. “Martingale Estimation Functions for Discretely Observed Diffusion Processes.” Bernoulli.

Bishop, and Del Moral. 2016. “On the Stability of Kalman-Bucy Diffusion Processes.” SIAM Journal on Control and Optimization.

———. 2023. “On the Mathematical Theory of Ensemble (Linear-Gaussian) Kalman-Bucy Filtering.” Mathematics of Control, Signals, and Systems.

Bréhier, Goudenège, and Tudela. 2016. “Central Limit Theorem for Adaptive Multilevel Splitting Estimators in an Idealized Setting.” In Monte Carlo and Quasi-Monte Carlo Methods. Springer Proceedings in Mathematics & Statistics.

Briol, Barp, Duncan, et al. 2019. “Statistical Inference for Generative Models with Maximum Mean Discrepancy.”

Buhmann, Dumazert, Gronskiy, et al. 2018. “Posterior Agreement for Large Parameter-Rich Optimization Problems.” Theoretical Computer Science.

Cantoni, and Ronchetti. 2001. “Robust Inference for Generalized Linear Models.” Journal of the American Statistical Association.

Cérou, Le Gland, François, Del Moral, et al. 2005. “Limit Theorems for the Multilevel Splitting Algorithm in the Simulation of Rare Events.” In Proceedings of the Winter Simulation Conference, 2005.

Cherief-Abdellatif, and Alquier. 2020. “MMD-Bayes: Robust Bayesian Estimation via Maximum Mean Discrepancy.” In Proceedings of The 2nd Symposium on Advances in Approximate Bayesian Inference.

Chérief-Abdellatif, Alquier, and Khan. 2019. “A Generalization Bound for Online Variational Inference.”

Chikuse. 2003. “High Dimensional Asymptotic Theorems.” In Statistics on Special Manifolds.

Claeskens, Krivobokova, and Opsomer. 2009. “Asymptotic Properties of Penalized Spline Estimators.” Biometrika.

DasGupta. 2008. Asymptotic Theory of Statistics and Probability. Springer Texts in Statistics.

Dehaene. 2019. “A Deterministic and Computable Bernstein-von Mises Theorem.”

Del Moral, Pierre. 2004. Feynman-Kac Formulae: Genealogical and Interacting Particle Systems with Applications.

Del Moral, Pierre, and Doucet. 2009. “Particle Methods: An Introduction with Applications.”

Del Moral, Pierre, Hu, and Wu. 2011. On the Concentration Properties of Interacting Particle Processes.

Del Moral, P., Kurtzmann, and Tugaut. 2017. “On the Stability and the Uniform Propagation of Chaos of a Class of Extended Ensemble Kalman-Bucy Filters.” SIAM Journal on Control and Optimization.

Duembgen, and Podolskij. 2015. “High-Frequency Asymptotics for Path-Dependent Functionals of Itô Semimartingales.” Stochastic Processes and Their Applications.

Feigin. 1976. “Maximum Likelihood Estimation for Continuous-Time Stochastic Processes.” Advances in Applied Probability.

Feller. 1951. “The Asymptotic Distribution of the Range of Sums of Independent Random Variables.” The Annals of Mathematical Statistics.

Fernholz. 1983. von Mises calculus for statistical functionals. Lecture Notes in Statistics 19.

Gine, and Zinn. 1990. “Bootstrapping General Empirical Measures.” Annals of Probability.

Giraitis, and Surgailis. 1999. “Central Limit Theorem for the Empirical Process of a Linear Sequence with Long Memory.” Journal of Statistical Planning and Inference.

Gribonval, Blanchard, Keriven, et al. 2017. “Compressive Statistical Learning with Random Feature Moments.” arXiv:1706.07180 [Cs, Math, Stat].

Hájek. 1970. “A Characterization of Limiting Distributions of Regular Estimates.” Zeitschrift Für Wahrscheinlichkeitstheorie Und Verwandte Gebiete.

———. 1972. “Local Asymptotic Minimax and Admissibility in Estimation.” In.

Heyde, and Seneta. 2010. “Estimation Theory for Growth and Immigration Rates in a Multiplicative Process.” In Selected Works of C.C. Heyde. Selected Works in Probability and Statistics.

Huber. 1964. “Robust Estimation of a Location Parameter.” The Annals of Mathematical Statistics.

Jacob, O’Leary, and Atchadé. 2019. “Unbiased Markov Chain Monte Carlo with Couplings.” arXiv:1708.03625 [Stat].

Jacod, Podolskij, and Vetter. 2010. “Limit Theorems for Moving Averages of Discretized Processes Plus Noise.” The Annals of Statistics.

Jacod, and Shiryaev. 1987a. Limit Theorems for Stochastic Processes. Grundlehren Der Mathematischen Wissenschaften.

———. 1987b. “The General Theory of Stochastic Processes, Semimartingales and Stochastic Integrals.” In Limit Theorems for Stochastic Processes. Grundlehren Der Mathematischen Wissenschaften.

Janková, and van de Geer. 2016. “Confidence Regions for High-Dimensional Generalized Linear Models Under Sparsity.” arXiv:1610.01353 [Math, Stat].

Karabash, and Zhu. 2012. “Limit Theorems for Marked Hawkes Processes with Application to a Risk Model.” arXiv:1211.4039 [Math].

Kasprzak, Giordano, and Broderick. 2023. “How Good Is Your Laplace Approximation of the Bayesian Posterior? Finite-Sample Computable Error Bounds for a Variety of Useful Divergences.”

Konishi, and Kitagawa. 1996. “Generalised Information Criteria in Model Selection.” Biometrika.

———. 2003. “Asymptotic Theory for Information Criteria in Model Selection—Functional Approach.” Journal of Statistical Planning and Inference, C.R. Rao 80th Birthday Felicitation Volume, Part IV,.

Kraus, and Panaretos. 2014. “Frequentist Estimation of an Epidemic’s Spreading Potential When Observations Are Scarce.” Biometrika.

Le Gland, Monbet, and Tran. 2009. “Large Sample Asymptotics for the Ensemble Kalman Filter.” Report.

LeCam. 1970. “On the Assumptions Used to Prove Asymptotic Normality of Maximum Likelihood Estimates.” The Annals of Mathematical Statistics.

———. 1972. “Limits of Experiments.” In.

Lederer, and van de Geer. 2014. “New Concentration Inequalities for Suprema of Empirical Processes.” Bernoulli.

Lorsung. 2021. “Understanding Uncertainty in Bayesian Deep Learning.”

Maronna. 1976. “Robust M-Estimators of Multivariate Location and Scatter.” The Annals of Statistics.

McGoff, Mukherjee, and Nobel. 2022. “Gibbs Posterior Convergence and the Thermodynamic Formalism.” The Annals of Applied Probability.

Miller. 2021. “Asymptotic Normality, Concentration, and Coverage of Generalized Posteriors.” Journal of Machine Learning Research.

Mueller. 2018. “Refining the Central Limit Theorem Approximation via Extreme Value Theory.” arXiv:1802.00762 [Math].

Oertel. 2020. “Grothendieck’s Inequality and Completely Correlation Preserving Functions – a Summary of Recent Results and an Indication of Related Research Problems.”

Ogata. 1978. “The Asymptotic Behaviour of Maximum Likelihood Estimators for Stationary Point Processes.” Annals of the Institute of Statistical Mathematics.

Pollard. 1990. Empirical Processes: Theory and Applications.

Prause, and Steland. 2018. “Estimation of the Asymptotic Variance of Univariate and Multivariate Random Fields and Statistical Inference.” Electronic Journal of Statistics.

Puri, and Tuan. 1986. “Maximum Likelihood Estimation for Stationary Point Processes.” Proceedings of the National Academy of Sciences of the United States of America.

Raginsky, and Sason. 2014. Concentration of Measure Inequalities in Information Theory, Communications, and Coding: Second Edition.

Ross. 2011. “Fundamentals of Stein’s Method.” Probability Surveys.

Scornet. 2014. “On the Asymptotics of Random Forests.” arXiv:1409.2090 [Math, Stat].

Shiga, and Tanaka. 1985. “Central Limit Theorem for a System of Markovian Particles with Mean Field Interactions.” Zeitschrift Für Wahrscheinlichkeitstheorie Und Verwandte Gebiete.

Sørensen. 2000. “Prediction-Based Estimating Functions.” The Econometrics Journal.

Stam. 1982. “Limit Theorems for Uniform Distributions on Spheres in High-Dimensional Euclidean Spaces.” Journal of Applied Probability.

Stein. 1986. Approximate Computation of Expectations.

Syring, and Martin. 2023. “Gibbs Posterior Concentration Rates Under Sub-Exponential Type Losses.” Bernoulli.

Tropp. 2015. An Introduction to Matrix Concentration Inequalities.

van der Vaart. 2007. Asymptotic statistics. Cambridge series in statistical and probabilistic mathematics.

Winter, Melikechi, and Dunson. 2023. “Sequential Gibbs Posteriors with Applications to Principal Component Analysis.”

Yamazaki, View Profile, Kawanabe, et al. 2007. “Asymptotic Bayesian Generalization Error When Training and Test Distributions Are Different.” Proceedings of the 24th International Conference on Machine Learning, ACM Other conferences,.