Welcome to the probability inequality mines!

When something in your process (measurement, estimation) means that you can be pretty sure that a whole bunch of your stuff is particularly likely to be somewhere in particular.

As undergraduates we run into central limit theorems, but there are many more diverse ways we can keep track of our probability, or at least most of it. This idea is a basic workhorse in univariate probability, and turns out to be yet more essential in multivariate matrix probability, as seen in matrix factorisation, compressive sensing, PAC-bounds and suchlike.

## Background

Overviews include

- Roman Vershynin’s High-Dimensional Probability: An Introduction with Applications in Data Science (Vershynin 2018)
- Thomas Lumley’s super simple intro to chaining and controlling maxima.
- Dasgupta, Asymptotic Theory of Statistics and Probability (DasGupta 2008) is practical, and despite its name, introduces some basic non-asymptotic inequalities
- Raginsky and Sason, Concentration of Measure Inequalities in Information Theory, Communications and Coding (Raginsky and Sason 2012)
- Tropp, An Introduction to Matrix Concentration Inequalities (Tropp 2015) high-dimensional data! free!
- Boucheron, Bousquet & Lugosi, Concentration inequalities (Boucheron, Bousquet, and Lugosi 2004) (Clear and brisk but missing some newer stuff)
- Boucheron, Lugosi & Massart, Concentration inequalities: a nonasymptotic theory of independence (Boucheron, Lugosi, and Massart 2013). Haven’t read it yet.
- Massart, Concentration inequalities and model section (Massart 2007). Clear, focussed, but brusque. Depressingly, by being applied it also demonstrates the limitations of its chosen techniques, which seem sweet in application but bitter in the required assumptions assumptions. (Massart 2000) is an earlier draft.
- Lugosi’s Concentration-of-measure Lecture notes as good, especially for treatment of Efron-Stein inequalities. This taxonomy is used in his Combinatorial statistics notes.
- Terry Tao’s lecture notes
- Divergence in everything: erasure divergence and concentration inequalities
- Talagrand’s opus that is commonly credited with kicking off the modern fad, particularly the part due to the chaining method. (Talagrand 1995)
- Luca Trevisan wrote an example-driven explanation of Talagrand generic chaining.

## Markov

For any nonnegative random variable \(X,\) and \(t>0\) \[ \mathbb{P}\{X \geq t\} \leq \frac{\mathbb{E} X}{t} \] Corollary: if \(\phi\) is a strictly monotonically increasing nonnegative-valued function then for any random variable \(X\) and real number \(t\) \[ \mathbb{P}\{X \geq t\}=\mathbb{P}\{\phi(X) \geq \phi(t)\} \leq \frac{\mathbb{E} \phi(X)}{\phi(t)} \]

## Chebychev

- Another Corollary of MArkov’s bound is with \(\phi(x)=x^{2}\) is Chebyshev’s
- if \(X\) is an arbitrary random variable and \(t>0,\) then \[ \mathbb{P}\{|X-\mathbb{E} X| \geq t\}=\mathbb{P}\left\{|X-\mathbb{E} X|^{2} \geq t^{2}\right\} \leq \frac{\mathbb{E}\left[|X-\mathbb{E} X|^{2}\right]}{t^{2}}=\frac{\operatorname{Var}\{X\}}{t^{2}} \] More generally taking \(\phi(x)=x^{q}(x \geq 0),\) for any \(q>0\) we have \[ \mathbb{P}\{|X-\mathbb{E} X| \geq t\} \leq \frac{\mathbb{E}\left[|X-\mathbb{E} X|^{q}\right]}{t^{q}} \] We can choose \(q\) to optimize the obtained upper bound for the problem in hand.

## Chernoff

Taking \(\phi(x)=e^{s x}\) where \(s>0,\) for any random variable \(X,\) and any \(t>0,\) we have \[ \mathbb{P}\{X \geq t\}=\mathbb{P}\left\{e^{s X} \geq e^{s t}\right\} \leq \frac{\mathbb{E} e^{s X}}{e^{s t}} \] Once again, we choose \(s\) to make the bound as tight as possible.

## Hoeffding

🏗

## Efron-Stein

Are these precisely results arising from Stein’s method

Let \(g: \mathcal{X}^{n} \rightarrow \mathbb{R}\) be a real-valued measurable function of n variables. Efron-Stein inequalities concern the difference between the random variable \(Z=g\left(X_{1}, \ldots, X_{n}\right)\) and its expected value \(\mathbb{E Z}\) when \(X_{1}, \ldots, X_{n}\) are arbitrary independent random variables.

Define \(\mathbb{E}_{i}\) for the expected value with respect to the variable \(X_{i}\), that is, \(\mathbb{E}_{i} Z=\mathbb{E}\left[Z \mid X_{1}, \ldots, X_{i-1}, X_{i+1}, \ldots, X_{n}\right]\) Then \[ \operatorname{Var}(Z) \leq \sum_{i=1}^{n} \mathbb{E}\left[\left(Z-\mathbb{E}_{i} Z\right)^{2}\right] \]

Now, let \(X_{1}^{\prime}, \ldots, X_{n}^{\prime}\) be an independent copy of \(X_{1}, \ldots, X_{n}\). \[ Z_{i}^{\prime}=g\left(X_{1}, \ldots, X_{i}^{\prime}, \ldots, X_{n}\right) \] Alternatively, \[ \operatorname{Var}(Z) \leq \frac{1}{2} \sum_{i=1}^{n} \mathbb{E}\left[\left(Z-Z_{i}^{\prime}\right)^{2}\right] \] Nothing here seems to constrain the variables here to be real-valued, merely the function \(g\), but apparently they do not work for matrix variables as written — you need to see Matrix efron stein results for that.

## Kolmogorov

🏗

## Gaussian

For the Gaussian distribution. Filed there, perhaps?

## Martingale bounds

🏗

## Khintchine

Let us copy from wikipedia:

Heuristically: if we pick \(N\) complex numbers \(x_1,\dots,x_N \in\mathbb{C}\), and add them together, each multiplied by jointly independent random signs \(\pm 1\), then the expected value of the sum’s magnitude is close to \(\sqrt{|x_1|^{2}+ \cdots + |x_N|^{2}}\).

Let \({\varepsilon_n}_{n=1}^N\) i.i.d. random variables with \(P(\varepsilon_n=\pm1)=\frac12\) for \(n=1,\ldots, N\), i.e., a sequence with Rademacher distribution. Let \(0<p<\infty\) and let \(x_1,\ldots,x_N \in \mathbb{C}\). Then

\[ A_p \left( \sum_{n=1}^N |x_n|^2 \right)^{1/2} \leq \left(\operatorname{E} \left|\sum_{n=1}^N \varepsilon_n x_n\right|^p \right)^{1/p} \leq B_p \left(\sum_{n=1}^N |x_n|^2\right)^{1/2} \]

for some constants \(A_p,B_p>0\). It is a simple matter to see that \(A_p = 1\) when \(p \ge 2\), and \(B_p = 1\) when \(0 < p \le 2\).

## Empirical process theory

🏗

## Matrix concentration

If we fix our interest to matrices in particular, some fun things arise. See Matrix concentration inequalities

## References

*Confluentes Mathematici*03 (04): 637–47. https://doi.org/10.1142/S1793744211000485.

*COLT*, 30:185–209. http://www.jmlr.org/proceedings/papers/v30/Bach13.pdf.

*Probability Theory and Related Fields*113 (3): 301–413. http://link.springer.com/article/10.1007/s004400050210.

*SIAM Journal on Optimization*15 (3): 780–804. https://doi.org/10.1137/S1052623401399903.

*Advanced Lectures in Machine Learning*, edited by Olivier Bousquet, Ulrike von Luxburg, and G Rätsch. http://www.econ.upf.edu/~lugosi/mlss_conc.pdf.

*Concentration Inequalities: A Nonasymptotic Theory of Independence*. 1st ed. Oxford: Oxford University Press.

*Advanced Lectures on Machine Learning*, edited by Olivier Bousquet, Ulrike von Luxburg, and Gunnar Rätsch, 169–207. Lecture Notes in Computer Science 3176. Springer Berlin Heidelberg. http://www.econ.upf.edu/~lugosi/mlss_slt.pdf.

*Statistics for High-Dimensional Data: Methods, Theory and Applications*. 2011 edition. Heidelberg ; New York: Springer.

*Foundations of Computational Mathematics*9 (6): 717–72. https://doi.org/10.1007/s10208-009-9045-5.

*IEEE Transactions on Information Theory*52 (2): 489–509. https://doi.org/10.1109/TIT.2005.862083.

*Asymptotic Theory of Statistics and Probability*. Springer Texts in Statistics. New York: Springer New York. http://link.springer.com/10.1007/978-0-387-75971-5.

*Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence*, 143–51. UAI’00. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc. http://arxiv.org/abs/1301.3849.

*Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence*, 114–21. UAI’06. Arlington, Virginia, USA: AUAI Press. http://arxiv.org/abs/1206.6813.

*On the Concentration Properties of Interacting Particle Processes*. Vol. 3. Now Publishers. https://doi.org/10.1561/2200000026.

*Concentration of Measure for the Analysis of Randomized Algorithms*. New York: Cambridge University Press.

*New Journal of Physics*14 (9): 095022. https://doi.org/10.1088/1367-2630/14/9/095022.

*The Annals of Statistics*23 (5): 1779–1801. https://doi.org/10.1214/aos/1176324323.

*Empirical Process Techniques for Dependent Data*. Birkhhäuser.

*The Annals of Probability*37 (4): 1605–46. https://doi.org/10.1214/08-AOP447.

*IEEE Transactions on Information Theory*57 (3): 1548–66. https://doi.org/10.1109/TIT.2011.2104999.

*Physical Review Letters*105 (15). https://doi.org/10.1103/PhysRevLett.105.150401.

*Convexity and Concentration*, edited by Eric Carlen, Mokshay Madiman, and Elisabeth M. Werner, 107–56. The IMA Volumes in Mathematics and Its Applications. New York, NY: Springer. https://doi.org/10.1007/978-1-4939-7005-6_4.

*Bernoulli*21 (1): 83–143. https://doi.org/10.3150/13-BEJ562.

*Annals of Mathematical Statistics*42 (3): 1079–83. https://doi.org/10.1214/aoms/1177693335.

*Biometrical Journal*21 (3): 243–45. https://doi.org/10.1002/bimj.4710210305.

*The Annals of Probability*30 (3): 1223–37. https://doi.org/10.1214/aop/1029867126.

*Bernoulli*8 (6, 6): 697–720. http://projecteuclid.org/euclid.bj/1076364802.

*Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems*. Lecture Notes in Mathematics École d’Été de Probabilités de Saint-Flour 2033. Heidelberg: Springer. https://doi.org/10.1007/978-3-642-22147-7_1.

*Algorithmic Learning Theory*, edited by Peter Auer, Alexander Clark, Thomas Zeugmann, and Sandra Zilles, 260–74. Lecture Notes in Computer Science. Bled, Slovenia: Springer International Publishing. https://doi.org/10.1007/978-3-319-11662-4_19.

*Advances in Neural Information Processing Systems*, 541–49. Curran Associates, Inc. http://papers.nips.cc/paper/5836-learning-theory-and-algorithms-for-forecasting-non-stationary-time-series.

*Machine Learning Journal*. http://www.cs.nyu.edu/~mohri/pub/nonstatj.pdf.

*Bernoulli*20 (4): 2020–38. https://doi.org/10.3150/13-BEJ549.

*Proceedings of the National Academy of Sciences of the United States of America*107 (38): 16413–19. https://doi.org/10.1073/pnas.1011270107.

*Concentration-of-Measure Inequalities*. http://www.econ.upf.edu/~lugosi/anu.pdf.

*Annales de La Faculté Des Sciences de Toulouse: Mathématiques*, 9:245–303. http://archive.numdam.org/article/AFST_2000_6_9_2_245_0.pdf.

*Concentration Inequalities and Model Selection: Ecole d’Eté de Probabilités de Saint-Flour XXXIII - 2003*. Lecture Notes in Mathematics 1896. Berlin ; New York: Springer-Verlag. http://www.cmap.polytechnique.fr/~merlet/articles/probas_massart_stf03.pdf.

*ESAIM: Probability and Statistics*15: 41–68. https://doi.org/10.1051/ps/2009004.

*Foundations and Trends in Communications and Information Theory*, December. http://arxiv.org/abs/1212.4663.

*Advances in Neural Information Processing Systems*, 1313–20. Curran Associates, Inc. http://papers.nips.cc/paper/3495-weighted-sums-of-random-kitchen-sinks-replacing-minimization-with-randomization-in-learning.

*Bulletin of the Belgian Mathematical Society - Simon Stevin*13 (5): 883–96.

*Ecole d’été de Probabilités de Saint-Flour XIX, 1989*. Berlin ; New York : Springer-Verlag. https://trove.nla.gov.au/work/20069699.

*Approximate Computation of Expectations*. Vol. 7. IMS. https://www.jstor.org/stable/4355512.

*The Annals of Probability*24 (1): 1–34. http://projecteuclid.org/DPubS/Repository/1.0/Disseminate?view=body&id=pdf_1&handle=euclid.aop/1042644705.

*An Introduction to Matrix Concentration Inequalities*. http://arxiv.org/abs/1501.01571.

*High-Dimensional Probability: An Introduction with Applications in Data Science*. 1st ed. Cambridge University Press. https://doi.org/10.1017/9781108231596.

*Annals of Probability*1 (6): 1068–70. https://doi.org/10.1214/aop/1176996815.

## No comments yet!