Measure concentration inequalities
The fancy name for probability inequalities
November 25, 2014 — March 4, 2021
Welcome to the probability inequality mines!
Bounds that formalize the situation where something about your process (measurement, estimation) means you can be pretty sure that most of your probability mass sits somewhere in particular, e.g. being 80% sure that I am only 20% wrong.
As undergraduates we run into central limit theorems, but there are many more diverse ways of keeping track of our probability, or at least most of it. This idea is a basic workhorse in univariate probability and turns out to be even more essential in multivariate and matrix-valued problems, as seen in matrix factorization, compressive sensing, PAC-bounds and suchlike.
1 Background
Overviews include
- Roman Vershynin’s High-Dimensional Probability: An Introduction with Applications in Data Science (Vershynin 2018)
- Thomas Lumley’s super simple intro to chaining and controlling maxima.
- Dasgupta, Asymptotic Theory of Statistics and Probability (DasGupta 2008) is practical, and despite its name, introduces some basic non-asymptotic inequalities
- Raginsky and Sason, Concentration of Measure Inequalities in Information Theory, Communications and Coding (Raginsky and Sason 2012)
- Tropp, An Introduction to Matrix Concentration Inequalities (Tropp 2015). High-dimensional data! Free!
- Boucheron, Bousquet & Lugosi, Concentration inequalities (Boucheron, Bousquet, and Lugosi 2004). Clear and brisk, but missing some newer stuff.
- Boucheron, Lugosi & Massart, Concentration inequalities: a nonasymptotic theory of independence (Boucheron, Lugosi, and Massart 2013). Haven’t read it yet.
- Massart, Concentration inequalities and model selection (Massart 2007). Clear, focussed, but brusque. Depressingly, by being applied it also demonstrates the limitations of its chosen techniques, which seem sweet in application but bitter in the required assumptions. (Massart 2000) is an earlier draft.
- Lugosi’s Concentration-of-measure lecture notes are good, especially for the treatment of Efron-Stein inequalities. This taxonomy is used in his Combinatorial statistics notes.
- Terry Tao’s lecture notes
- Divergence in everything: erasure divergence and concentration inequalities
- Talagrand’s opus, commonly credited with kicking off the modern fad, particularly the part due to the chaining method (Talagrand 1995).
- Luca Trevisan wrote an example-driven explanation of Talagrand’s generic chaining.
2 Markov
For any nonnegative random variable \(X,\) and \(t>0\) \[ \mathbb{P}\{X \geq t\} \leq \frac{\mathbb{E} X}{t} \] Corollary: if \(\phi\) is a strictly monotonically increasing non-negative-valued function then for any random variable \(X\) and real number \(t\) \[ \mathbb{P}\{X \geq t\}=\mathbb{P}\{\phi(X) \geq \phi(t)\} \leq \frac{\mathbb{E} \phi(X)}{\phi(t)} \]
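As a quick sanity check, here is a minimal simulation (my own toy example, using numpy and exponential variates, neither of which is prescribed by anything above) comparing the empirical tail of a nonnegative variable to the Markov bound:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=100_000)  # nonnegative RV with E[X] = 1

for t in [1.0, 2.0, 5.0]:
    empirical = (x >= t).mean()       # P(X >= t), estimated
    markov = x.mean() / t             # E[X] / t
    print(f"t={t}: P(X >= t) ≈ {empirical:.4f} <= Markov bound {markov:.4f}")
```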
3 Chebyshev
A corollary of Markov’s bound with \(\phi(x)=x^{2}\) is Chebyshev’s inequality: if \(X\) is an arbitrary random variable and \(t>0,\) then \[ \mathbb{P}\{|X-\mathbb{E} X| \geq t\}=\mathbb{P}\left\{|X-\mathbb{E} X|^{2} \geq t^{2}\right\} \leq \frac{\mathbb{E}\left[|X-\mathbb{E} X|^{2}\right]}{t^{2}}=\frac{\operatorname{Var}\{X\}}{t^{2}} \] More generally, taking \(\phi(x)=x^{q}\ (x \geq 0)\) for any \(q>0\) we have \[ \mathbb{P}\{|X-\mathbb{E} X| \geq t\} \leq \frac{\mathbb{E}\left[|X-\mathbb{E} X|^{q}\right]}{t^{q}} \] We can choose \(q\) to optimize the resulting upper bound for the problem at hand.
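To illustrate that last point, a small sketch (my own choice of a Gaussian example) comparing the empirical tail to the moment bound \(\mathbb{E}\left[|X-\mathbb{E} X|^{q}\right]/t^{q}\) over a few values of \(q\), keeping the smallest:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(1_000_000)
t = 3.0

dev = np.abs(x - x.mean())
empirical = (dev >= t).mean()
# Moment bounds E|X - EX|^q / t^q for several q; which q is best depends on t.
bounds = {q: np.mean(dev**q) / t**q for q in [2, 4, 6, 8]}

print(f"empirical tail  {empirical:.2e}")
for q, b in bounds.items():
    print(f"q={q} bound      {b:.2e}")
print(f"best bound      {min(bounds.values()):.2e}")
```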
4 Hoeffding
Another one about sums of RVs, in particular for bounded RVs with potentially different bounds.
Let \(X_1, X_2, \ldots, X_n\) be independent bounded random variables, in the sense that \(a_i \leq X_i \leq b_i\) for all \(i.\) Define the sum of these variables \(S_n = \sum_{i=1}^n X_i.\) Let \(\mu = \mathbb{E}[S_n]\) be the expected value of \(S_n.\)
Hoeffding’s inequality states that for any \(t > 0\),
\[ \mathbb{P}(S_n - \mu \geq t) \leq \exp \left( -\frac{2t^2}{\sum_{i=1}^n (b_i - a_i)^2} \right) \]
and
\[ \mathbb{P}(S_n - \mu \leq -t) \leq \exp \left( -\frac{2t^2}{\sum_{i=1}^n (b_i - a_i)^2} \right). \]
For example, consider \(X_1, X_2, \ldots, X_n\) independent random variables taking values in \([0, 1]\). For \(S_n = \sum_{i=1}^n X_i\) and \(\mu = \mathbb{E}[S_n],\) Hoeffding’s inequality becomes:
\[ \mathbb{P}(S_n - \mu \geq t) \leq \exp \left( -\frac{2t^2}{n} \right). \]
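A quick numerical check of that \([0,1]\) special case (uniform variates are my own arbitrary choice here):

```python
import numpy as np

rng = np.random.default_rng(2)
n, trials = 100, 50_000
x = rng.uniform(0.0, 1.0, size=(trials, n))   # X_i in [0, 1]
s = x.sum(axis=1)                             # S_n
mu = n * 0.5                                  # E[S_n]

for t in [5.0, 10.0, 15.0]:
    empirical = (s - mu >= t).mean()
    hoeffding = np.exp(-2 * t**2 / n)
    print(f"t={t}: P(S_n - mu >= t) ≈ {empirical:.4f} <= {hoeffding:.4f}")
```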
Cool, except my variates are rarely bounded. What do I do then? Probably Chernoff or Bernstein bounds.
5 Chernoff
Taking \(\phi(x)=e^{sx}\) where \(s>0\), for any random variable \(X\), and any \(t>0\), we have \[ \mathbb{P}\{X \geq t\}=\mathbb{P}\left\{e^{sX} \geq e^{st}\right\} \leq \frac{\mathbb{E}[e^{sX}]}{e^{st}}. \] \(s\) is a free parameter we choose to make the bound as tight as possible.
That’s what I was taught in class. The lazy (e.g. me) might not notice that a useful bound for sums of RVs follows from this result: it is a statement about the moment-generating function (i.e. the Laplace transform), and moment-generating functions factorize over sums of independent variables.
For independent random variables \(X_1, X_2, \ldots, X_n\), let \(S_n = \sum_{i=1}^n X_i\). We have \[ \mathbb{P}(S_n \geq t) \leq \inf_{s > 0} \frac{\mathbb{E}[e^{sS_n}]}{e^{st}}. \] If \(X_i\) are i.i.d., then \[ \mathbb{P}(S_n \geq t) \leq \inf_{s > 0} \frac{(\mathbb{E}[e^{sX_1}])^n}{e^{st}}. \]
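As a sketch of how the infimum gets used in practice (the Bernoulli example and the crude grid search over \(s\) are my own choices): for i.i.d. Bernoulli(\(p\)) summands the moment-generating function is \(1-p+pe^{s}\) in closed form, so we can just minimize numerically.

```python
import numpy as np

n, p, t = 100, 0.3, 45.0            # bound P(S_n >= t) for S_n a sum of n Bernoulli(p)
s_grid = np.linspace(1e-3, 5.0, 2_000)

# Chernoff: inf_s (E[e^{s X_1}])^n / e^{s t}, with E[e^{s X_1}] = 1 - p + p e^s
mgf = 1 - p + p * np.exp(s_grid)
bounds = mgf**n / np.exp(s_grid * t)
best = bounds.argmin()
print(f"Chernoff bound on P(S_n >= {t}): {bounds[best]:.3e} at s = {s_grid[best]:.3f}")
```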
6 Bernstein inequality
Another one about bounded RVs. For independent zero-mean random variables \(X_1, X_2, \ldots, X_n\) with \(\mathbb{P}(|X_i| \leq M)=1,\)
\[ \mathbb{P}\left( \sum_{i=1}^n X_i \geq t \right) \leq \exp \left( -\frac{t^2/2}{\sum_{i=1}^n \mathbb{E}[X_i^2] + Mt/3} \right) \]
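For intuition about when Bernstein beats Hoeffding, a comparison of the two exponential bounds (my own numbers) for centred uniform variates on \([-1/2, 1/2]\), whose variance \(1/12\) is much smaller than the worst case their range allows:

```python
import numpy as np

n, M, var = 100, 0.5, 1 / 12   # X_i uniform on [-1/2, 1/2]: |X_i| <= M, E[X_i^2] = 1/12

for t in [2.0, 5.0, 10.0]:
    bernstein = np.exp(-(t**2 / 2) / (n * var + M * t / 3))
    hoeffding = np.exp(-2 * t**2 / n)   # Hoeffding uses only the range b_i - a_i = 1
    print(f"t={t}: Bernstein {bernstein:.3e}  vs  Hoeffding {hoeffding:.3e}")
```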
7 Efron-Stein
Do these results follow from Stein’s method? They have the right form, but the derivation is not clear. In fact, where did I get these results from? I forgot to provide a citation for what I was cribbing from.
Let \(g: \mathcal{X}^{n} \rightarrow \mathbb{R}\) be a real-valued measurable function of \(n\) variables. Efron-Stein inequalities concern the difference between the random variable \(Z=g\left(X_{1}, \ldots, X_{n}\right)\) and its expected value \(\mathbb{E}Z\) when \(X_{1}, \ldots, X_{n}\) are arbitrary independent random variables.
Write \(\mathbb{E}_{i}\) for the expectation with respect to the variable \(X_{i}\) alone, holding the others fixed, that is, \[\mathbb{E}_{i} Z=\mathbb{E}\left[Z \mid X_{1}, \ldots, X_{i-1}, X_{i+1}, \ldots, X_{n}\right].\] Then \[ \operatorname{Var}(Z) \leq \sum_{i=1}^{n} \mathbb{E}\left[\left(Z-\mathbb{E}_{i} Z\right)^{2}\right]. \]
Now, let \(X_{1}^{\prime}, \ldots, X_{n}^{\prime}\) be an independent copy of \(X_{1}, \ldots, X_{n}\) and define \[ Z_{i}^{\prime}=g\left(X_{1}, \ldots, X_{i}^{\prime}, \ldots, X_{n}\right). \] Then, alternatively, \[ \operatorname{Var}(Z) \leq \frac{1}{2} \sum_{i=1}^{n} \mathbb{E}\left[\left(Z-Z_{i}^{\prime}\right)^{2}\right]. \] Nothing here seems to constrain the variables to be real-valued, merely the function \(g\); but apparently these inequalities do not work for matrix-valued variables as written, and we need matrix Efron-Stein results for that.
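Back in the real-valued case, a small Monte Carlo check of the independent-copy form (the choice of \(g=\max\) and of uniform variates is mine, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n, trials = 10, 200_000

x = rng.uniform(size=(trials, n))
x_prime = rng.uniform(size=(trials, n))   # independent copy X'
z = x.max(axis=1)                         # Z = g(X_1, ..., X_n) with g = max

# Efron-Stein, second form: Var(Z) <= (1/2) sum_i E[(Z - Z_i')^2]
es_bound = 0.0
for i in range(n):
    xi = x.copy()
    xi[:, i] = x_prime[:, i]              # swap coordinate i for its independent copy
    z_i = xi.max(axis=1)                  # Z_i'
    es_bound += 0.5 * np.mean((z - z_i) ** 2)

print(f"Var(Z) ≈ {z.var():.5f} <= Efron-Stein bound ≈ {es_bound:.5f}")
```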
8 Kolmogorov
🏗
9 Gaussian
For the Gaussian distribution. Filed there, perhaps?
10 Sub-Gaussian
🏗
E.g. Hanson-Wright.
11 Martingale bounds
🏗
12 Khintchine
Let us copy from Wikipedia:
Heuristically: if we pick \(N\) complex numbers \(x_1,\dots,x_N \in\mathbb{C}\), and add them together, each multiplied by jointly independent random signs \(\pm 1\), then the expected value of the sum’s magnitude is close to \(\sqrt{|x_1|^{2}+ \cdots + |x_N|^{2}}\).
Let \(\{\varepsilon_n\}_{n=1}^N\) be i.i.d. random variables with \(P(\varepsilon_n=\pm1)=\frac12\) for \(n=1,\ldots, N\), i.e., a sequence with Rademacher distribution. Let \(0<p<\infty\) and let \(x_1,\ldots,x_N \in \mathbb{C}\). Then
\[ A_p \left( \sum_{n=1}^N |x_n|^2 \right)^{1/2} \leq \left(\operatorname{E} \left|\sum_{n=1}^N \varepsilon_n x_n\right|^p \right)^{1/p} \leq B_p \left(\sum_{n=1}^N |x_n|^2\right)^{1/2} \]
for some constants \(A_p,B_p>0\). It is a simple matter to see that \(A_p = 1\) when \(p \ge 2\), and \(B_p = 1\) when \(0 < p \le 2\).
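A rough empirical check (the particular complex vector and the values of \(p\) are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.standard_normal(20) + 1j * rng.standard_normal(20)  # fixed complex vector
l2 = np.sqrt(np.sum(np.abs(x) ** 2))

signs = rng.choice([-1, 1], size=(100_000, x.size))         # Rademacher signs
sums = signs @ x                                            # sum_n eps_n x_n, per trial

for p in [1, 2, 4]:
    lp = np.mean(np.abs(sums) ** p) ** (1 / p)
    print(f"p={p}: (E|sum|^p)^(1/p) ≈ {lp:.3f}  vs  (sum |x_n|^2)^(1/2) = {l2:.3f}")
```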
13 Empirical process theory
🏗
14 Matrix concentration
If we restrict our attention to matrices in particular, some fun things arise. See Matrix concentration inequalities.
15 Jensen gap
See Jensen gaps.