Welcome to the probability inequality mines!

Concentration inequalities are for when something in your process (measurement, estimation) means that you can be pretty sure that a whole bunch of your stuff is damn likely to be somewhere in particular.

This is basic workhorse stuff in univariate probability, and turns out to be yet more essential in multivariate matrix probability, as seen in matrix factorisation, compressive sensing, PAC-bounds and suchlike.

## Background

Overviews include

Touchette (2011):

The theory of large deviations deals with the probabilities of rare events (or fluctuations) that are exponentially small as a function of some parameter, e.g., the number of random components of a system, the time over which a stochastic system is observed, the amplitude of the noise perturbing a dynamical system or the temperature of a chemical reaction. The theory has applications in many different scientific fields, ranging from queuing theory to statistics and from finance to engineering. It is also increasingly used in statistical physics for studying both equilibrium and nonequilibrium systems. In this context, deep analogies can be made between familiar concepts of statistical physics, such as the entropy and the free energy, and concepts of large deviation theory having more technical names, such as the rate function and the scaled cumulant generating function. The first part of these notes introduces the basic elements of large deviation theory at a level appropriate for advanced undergraduate and graduate students in physics, engineering, chemistry, and mathematics. The focus there is on the simple but powerful ideas behind large deviation theory, stated in non-technical terms, and on the application of these ideas in simple stochastic processes, such as sums of independent and identically distributed random variables and Markov processes. Some physical applications of these processes are covered in exercises contained at the end of each section. In the second part, the problem of numerically evaluating large deviation probabilities is treated at a basic level. The fundamental idea of importance sampling is introduced there together with its sister idea, the exponential change of measure. Other numerical methods based on sample means and generating functions, with applications to Markov processes, are also covered.

Roman Vershynin’s High-Dimensional Probability: An Introduction with Applications in Data Science (Vershynin 2018)

Thomas Lumley’s super simple intro to chaining and controlling maxima.

DasGupta, Asymptotic Theory of Statistics and Probability (DasGupta 2008) is practical and, despite its name, introduces some basic non-asymptotic inequalities.

Raginsky and Sason, Concentration of Measure Inequalities in Information Theory, Communications and Coding (Raginsky and Sason 2012)

Tropp, An Introduction to Matrix Concentration Inequalities (Tropp 2015) high-dimensional data! free!

Boucheron, Bousquet & Lugosi, Concentration inequalities (Boucheron, Bousquet, and Lugosi 2004) (Clear and brisk but missing some newer stuff)

Boucheron, Lugosi & Massart, Concentration inequalities: a nonasymptotic theory of independence (Boucheron, Lugosi, and Massart 2013). Haven’t read it yet.

Massart, Concentration inequalities and model selection (Massart 2007). Clear, focussed, but brusque. Depressingly, by being applied it also demonstrates the limitations of its chosen techniques, which seem sweet in application but bitter in the required assumptions. (Massart 2000) is an earlier draft.

Lugosi’s Concentration-of-measure Lecture notes:

The inequalities discussed in these notes bound tail probabilities of general functions of independent random variables.

The taxonomy is interesting:

Several methods have been known to prove such inequalities, including martingale methods (see Milman and Schechtman [1] and the surveys of McDiarmid [2, 3]), information-theoretic methods (see Alhswede, Gács, and Körner [4], Marton [5, 6, 7], Dembo [8], Massart [9] and Rio [10]), Talagrand’s induction method [11, 12, 13] (see also Luczak and McDiarmid [14], McDiarmid [15] and Panchenko [16, 17, 18]), the decoupling method surveyed by de la Penã and Giné [19], and the so-called “entropy method”, based on logarithmic Sobolev inequalities, developed by Ledoux [20, 21], see also Bobkov and Ledoux [22], Massart [23], Rio [10], Klein [24], Boucheron, Lugosi, and Massart [25, 26], Bousquet [27, 28], and Boucheron, Bousquet, Lugosi, and Massart [29].

This taxonomy is used in his Combinatorial statistics notes.

Divergence in everything: erasure divergence and concentration inequalities

Talagrand’s opus (Talagrand 1995) is commonly credited with kicking off the modern fad, especially the part due to the chaining method.

Luca Trevisan wrote an excellent example-driven explanation of Talagrand’s generic chaining.

## Use in finite sample bounds

Asymptotic normality is so last season.
These days we care about the finite-sample distribution of performance,
and asymptotic results don’t help us there.
Apparently I can construct useful bounds using concentration inequalities?
One suggested keyword to disambiguate: *Ahlswede–Winter bounds*?
🏗
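For concreteness, here is the kind of thing I mean: a minimal sketch (names and numbers are my own toy choices) of a distribution-free, finite-sample confidence interval for a bounded mean, obtained by inverting Hoeffding's inequality.

```python
import numpy as np

def hoeffding_radius(n: int, delta: float) -> float:
    """Half-width t such that P(|sample mean - mu| >= t) <= delta
    for n i.i.d. samples bounded in [0, 1], via Hoeffding's inequality:
    2 exp(-2 n t^2) = delta  =>  t = sqrt(log(2 / delta) / (2 n))."""
    return np.sqrt(np.log(2.0 / delta) / (2.0 * n))

rng = np.random.default_rng(3)
n, delta = 1000, 0.05
samples = rng.beta(2.0, 5.0, size=n)   # bounded in [0, 1]; true mean is 2/7

radius = hoeffding_radius(n, delta)
lo, hi = samples.mean() - radius, samples.mean() + radius
# With probability at least 1 - delta the true mean lies in [lo, hi],
# for this finite n, using nothing about the distribution but boundedness.
```

No asymptotics anywhere: the coverage guarantee holds at \(n=1000\), not merely in the limit.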

## Classic inequalities

### Markov

🏗
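For a nonnegative random variable \(X\) and \(a>0\), Markov gives \(P(X \geq a) \leq \mathbb{E}[X]/a\). A quick numerical sanity check (my own toy example):

```python
import numpy as np

rng = np.random.default_rng(0)

# Nonnegative random variable: Exponential(1), so E[X] = 1
# and the true tail is P(X >= a) = exp(-a).
x = rng.exponential(scale=1.0, size=100_000)

a = 3.0
empirical_tail = np.mean(x >= a)   # approximately exp(-3) ~ 0.05
markov_bound = x.mean() / a        # E[X] / a ~ 0.33

assert empirical_tail <= markov_bound
```

Famously loose (roughly 0.05 versus 0.33 here), but it asks only for nonnegativity and a first moment, and it is the seed from which Chebyshev and Chernoff grow.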

### Chebyshev

🏗
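Applying Markov to \((X-\mu)^2\) gives \(P(|X - \mu| \geq k\sigma) \leq 1/k^2\). A sanity check with a skewed but finite-variance distribution (my own toy example):

```python
import numpy as np

rng = np.random.default_rng(1)

# Skewed, heavy-ish tailed, but finite-variance: lognormal.
x = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)
mu, sigma = x.mean(), x.std()

k = 3.0
empirical = np.mean(np.abs(x - mu) >= k * sigma)
chebyshev = 1.0 / k**2   # distribution-free: needs only a finite variance

assert empirical <= chebyshev
```

Polynomial decay in \(k\) is the price of such weak assumptions; the exponential rates below need more structure.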

### Hoeffding

🏗
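For i.i.d. summands bounded in \([0,1]\), Hoeffding gives the one-sided tail \(P(\bar{X}_n - \mu \geq t) \leq e^{-2nt^2}\). A simulation check (my own toy example):

```python
import numpy as np

rng = np.random.default_rng(2)

n, reps = 50, 100_000
p, t = 0.5, 0.15

# Sample means of n Bernoulli(1/2) draws, each bounded in [0, 1].
means = rng.binomial(n, p, size=reps) / n
empirical = np.mean(means - p >= t)

# One-sided Hoeffding bound for [0, 1]-valued summands.
hoeffding = np.exp(-2 * n * t**2)

assert empirical <= hoeffding
```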

### Chernoff

🏗
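One standard multiplicative form, for a sum \(S\) of independent Bernoullis with mean \(\mu\) and \(0 < \delta \leq 1\): \(P(S \geq (1+\delta)\mu) \leq e^{-\mu\delta^2/3}\). Checking it numerically (my own toy example):

```python
import numpy as np

rng = np.random.default_rng(4)

n, p = 200, 0.1
mu = n * p                 # mean of S ~ Binomial(n, p)
delta = 0.5                # relative deviation, 0 < delta <= 1

trials = rng.binomial(n, p, size=200_000)
empirical = np.mean(trials >= (1 + delta) * mu)

# Multiplicative Chernoff bound for sums of independent Bernoullis:
# P(S >= (1 + delta) mu) <= exp(-mu delta^2 / 3)
chernoff = np.exp(-mu * delta**2 / 3.0)

assert empirical <= chernoff
```

Note the exponential decay in \(\mu\), versus Chebyshev's polynomial decay; this comes from controlling the whole moment generating function rather than two moments.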

### Kolmogorov

🏗

### Gaussian

For the Gaussian distribution. Filed there.

### Martingale bounds

🏗

### Khintchine

Let us copy from wikipedia:

Heuristically: if we pick \(N\) complex numbers \(x_1,\dots,x_N \in\mathbb{C}\), and add them together, each multiplied by jointly independent random signs \(\pm 1\), then the expected value of the sum’s magnitude is close to \(\sqrt{|x_1|^{2}+ \cdots + |x_N|^{2}}\).

Let \(\{\varepsilon_n\}_{n=1}^N\) be i.i.d. random variables with \(P(\varepsilon_n=\pm1)=\frac12\) for \(n=1,\ldots, N\), i.e., a sequence with Rademacher distribution. Let \(0<p<\infty\) and let \(x_1,\ldots,x_N \in \mathbb{C}\). Then

\[ A_p \left( \sum_{n=1}^N |x_n|^2 \right)^{1/2} \leq \left(\operatorname{E} \left|\sum_{n=1}^N \varepsilon_n x_n\right|^p \right)^{1/p} \leq B_p \left(\sum_{n=1}^N |x_n|^2\right)^{1/2} \]

for some constants \(A_p,B_p>0\). It is a simple matter to see that \(A_p = 1\) when \(p \ge 2\), and \(B_p = 1\) when \(0 < p \le 2\).
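A quick simulation of the \(p=1\) case with real coefficients (my own toy example; \(B_1 = 1\) as noted above, and the sharp lower constant \(A_1 = 1/\sqrt{2}\) is due to Szarek):

```python
import numpy as np

rng = np.random.default_rng(5)

x = np.array([3.0, -1.0, 2.0, 0.5])    # fixed coefficients
l2 = np.linalg.norm(x)                  # (sum |x_n|^2)^{1/2}

# Many independent draws of the Rademacher sign pattern.
eps = rng.choice([-1.0, 1.0], size=(200_000, x.size))
sums = eps @ x                          # realisations of sum_n eps_n x_n

p = 1.0
moment = np.mean(np.abs(sums) ** p) ** (1.0 / p)   # (E |sum|^p)^{1/p}

# Khintchine sandwich: A_1 ||x||_2 <= E|sum| <= B_1 ||x||_2.
assert l2 / np.sqrt(2.0) <= moment <= l2
```

For these four coefficients the expectation can be computed exactly over the 16 sign patterns (it is 3.125), comfortably inside the sandwich \([2.67, 3.78]\).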

## Empirical process theory

🏗

## Matrix Chernoff bounds

Tropp (2015) summarises:

In recent years, random matrices have come to play a major role in computational mathematics, but most of the classical areas of random matrix theory remain the province of experts. Over the last decade, with the advent of matrix concentration inequalities, research has advanced to the point where we can conquer many (formerly) challenging problems with a page or two of arithmetic.

Are these related?

Nikhil Srivastava’s Discrepancy, Graphs, and the Kadison-Singer Problem has an interesting example of bounds via discrepancy theory (and only indirectly probability). Gross (2011) is also readable, and gives results for matrices over the complex field.
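As a toy illustration of what such "a page or two of arithmetic" buys (my own sketch, using the matrix Chernoff upper tail in the form given by Tropp 2015, with summand norms at most \(R=1\)): for a sum \(Y\) of independent rank-one projections \(u_i u_i^{\top}\) with \(u_i\) uniform on the sphere, \(\mathbb{E}[u u^{\top}] = I/d\), so \(\mu_{\max} = \lambda_{\max}(\mathbb{E} Y) = n/d\).

```python
import numpy as np

rng = np.random.default_rng(6)

d, n = 10, 1000
delta = 0.5
mu_max = n / d   # lambda_max(E[Y]) = n/d, since E[u u^T] = I/d

# Matrix Chernoff upper tail (summand norms at most R = 1):
# P(lambda_max(Y) >= (1 + delta) mu_max)
#     <= d * (e^delta / (1 + delta)^(1 + delta))^mu_max
bound = d * (np.exp(delta) / (1 + delta) ** (1 + delta)) ** mu_max

trials, exceed = 200, 0
for _ in range(trials):
    u = rng.normal(size=(n, d))
    u /= np.linalg.norm(u, axis=1, keepdims=True)  # rows are unit vectors
    y = u.T @ u                                    # Y = sum_i u_i u_i^T, PSD
    if np.linalg.eigvalsh(y)[-1] >= (1 + delta) * mu_max:
        exceed += 1

empirical = exceed / trials
assert empirical <= bound + 1e-12
```

The dimensional factor \(d\) out front is the price of uniform control over all directions at once; it is what distinguishes these results from their scalar ancestors.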

Aubrun, Guillaume, and Ion Nechita. 2011. “The Multiplicative Property Characterizes $\ell_p$ and $L_p$ Norms.” *Confluentes Mathematici* 03 (04): 637–47. https://doi.org/10.1142/S1793744211000485.

Bach, Francis R. 2013. “Sharp Analysis of Low-Rank Kernel Matrix Approximations.” In *COLT*, 30:185–209. http://www.jmlr.org/proceedings/papers/v30/Bach13.pdf.

Barron, Andrew, Lucien Birgé, and Pascal Massart. 1999. “Risk Bounds for Model Selection via Penalization.” *Probability Theory and Related Fields* 113 (3): 301–413. http://link.springer.com/article/10.1007/s004400050210.

Bellec, Pierre C., Guillaume Lecué, and Alexandre B. Tsybakov. 2017. “Towards the Study of Least Squares Estimators with Convex Penalty.” January 31, 2017. http://arxiv.org/abs/1701.09120.

Berend, Daniel, Peter Harremoës, and Aryeh Kontorovich. 2012. “Minimum KL-Divergence on Complements of L₁ Balls.” June 27, 2012. http://arxiv.org/abs/1206.6544.

Bertsimas, D., and I. Popescu. 2005. “Optimal Inequalities in Probability Theory: A Convex Optimization Approach.” *SIAM Journal on Optimization* 15 (3): 780–804. https://doi.org/10.1137/S1052623401399903.

Boucheron, Stéphane, Olivier Bousquet, and Gábor Lugosi. 2004. “Concentration Inequalities.” In *Advanced Lectures in Machine Learning*, edited by Olivier Bousquet, Ulrike von Luxburg, and G Rätsch. http://www.econ.upf.edu/~lugosi/mlss_conc.pdf.

Boucheron, Stéphane, Gábor Lugosi, and Pascal Massart. 2013. *Concentration Inequalities: A Nonasymptotic Theory of Independence*. 1st ed. Oxford: Oxford University Press.

———. 2003. “Concentration Inequalities Using the Entropy Method.” *The Annals of Probability* 31 (3): 1583–1614. https://doi.org/10.1214/aop/1055425791.

Bousquet, Olivier, Stéphane Boucheron, and Gábor Lugosi. 2004. “Introduction to Statistical Learning Theory.” In *Advanced Lectures on Machine Learning*, edited by Olivier Bousquet, Ulrike von Luxburg, and Gunnar Rätsch, 169–207. Lecture Notes in Computer Science 3176. Springer Berlin Heidelberg. http://www.econ.upf.edu/~lugosi/mlss_slt.pdf.

Bühlmann, Peter, and Sara van de Geer. 2011. *Statistics for High-Dimensional Data: Methods, Theory and Applications*. 2011 edition. Heidelberg ; New York: Springer.

Candès, Emmanuel J., and Benjamin Recht. 2009. “Exact Matrix Completion via Convex Optimization.” *Foundations of Computational Mathematics* 9 (6): 717–72. https://doi.org/10.1007/s10208-009-9045-5.

Candès, Emmanuel J., J. Romberg, and T. Tao. 2006. “Robust Uncertainty Principles: Exact Signal Reconstruction from Highly Incomplete Frequency Information.” *IEEE Transactions on Information Theory* 52 (2): 489–509. https://doi.org/10.1109/TIT.2005.862083.

Dasarathy, Gautam, Parikshit Shah, Badri Narayan Bhaskar, and Robert Nowak. 2013. “Sketching Sparse Matrices.” March 26, 2013. http://arxiv.org/abs/1303.6544.

DasGupta, Anirban. 2008. *Asymptotic Theory of Statistics and Probability*. Springer Texts in Statistics. New York: Springer New York. http://link.springer.com/10.1007/978-0-387-75971-5.

Dasgupta, Sanjoy. 2000. “Experiments with Random Projection.” In *Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence*, 143–51. UAI’00. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc. http://arxiv.org/abs/1301.3849.

Dasgupta, Sanjoy, Daniel Hsu, and Nakul Verma. 2012. “A Concentration Theorem for Projections.” 2012. http://arxiv.org/abs/1206.6813.

Del Moral, Pierre, Peng Hu, and Liming Wu. 2011. *On the Concentration Properties of Interacting Particle Processes*. Vol. 3. Now Publishers. https://doi.org/10.1561/2200000026.

Dubhashi, Devdatt, and Alessandro Panconesi. 2009. *Concentration of Measure for the Analysis of Randomized Algorithms*. New York: Cambridge University Press.

Flammia, Steven T., David Gross, Yi-Kai Liu, and Jens Eisert. 2012. “Quantum Tomography via Compressed Sensing: Error Bounds, Sample Complexity, and Efficient Estimators.” *New Journal of Physics* 14 (9): 095022. https://doi.org/10.1088/1367-2630/14/9/095022.

Geer, Sara van de. 2002. “On Hoeffding’s Inequality for Dependent Random Variables.” In *Empirical Process Techniques for Dependent Data*. Birkhäuser.

———. 1995. “Exponential Inequalities for Martingales, with Application to Maximum Likelihood Estimation for Counting Processes.” *The Annals of Statistics* 23 (5): 1779–1801. https://doi.org/10.1214/aos/1176324323.

———. 2014. “Statistical Theory for High-Dimensional Models.” September 30, 2014. http://arxiv.org/abs/1409.8557.

Geer, Sara van de, and Johannes Lederer. 2011. “The Lasso, Correlated Design, and Improved Oracle Inequalities.” July 1, 2011. http://arxiv.org/abs/1107.0189.

Giné, Evarist, and Richard Nickl. 2009. “Uniform Limit Theorems for Wavelet Density Estimators.” *The Annals of Probability* 37 (4): 1605–46. https://doi.org/10.1214/08-AOP447.

Gross, D. 2011. “Recovering Low-Rank Matrices from Few Coefficients in Any Basis.” *IEEE Transactions on Information Theory* 57 (3): 1548–66. https://doi.org/10.1109/TIT.2011.2104999.

Gross, David, Yi-Kai Liu, Steven T. Flammia, Stephen Becker, and Jens Eisert. 2010. “Quantum State Tomography via Compressed Sensing.” *Physical Review Letters* 105 (15). https://doi.org/10.1103/PhysRevLett.105.150401.

Hansen, Niels Richard, Patricia Reynaud-Bouret, and Vincent Rivoirard. 2015. “Lasso and Probabilistic Inequalities for Multivariate Point Processes.” *Bernoulli* 21 (1): 83–143. https://doi.org/10.3150/13-BEJ562.

Hanson, D. L., and F. T. Wright. 1971. “A Bound on Tail Probabilities for Quadratic Forms in Independent Random Variables.” *Annals of Mathematical Statistics* 42 (3): 1079–83. https://doi.org/10.1214/aoms/1177693335.

Horn, M. 1979. “Some Inequalities for the Expectation of a Product of Functions of a Random Variable and for the Multivariate Distribution Function at a Random Point.” *Biometrical Journal* 21 (3): 243–45. https://doi.org/10.1002/bimj.4710210305.

Houdré, Christian. 2002. “Remarks on Deviation Inequalities for Functions of Infinitely Divisible Random Vectors.” *The Annals of Probability* 30 (3): 1223–37. https://doi.org/10.1214/aop/1029867126.

Houdré, Christian, and Nicolas Privault. 2002. “Concentration and Deviation Inequalities in Infinite Dimensions via Covariance Representations.” *Bernoulli* 8 (6, 6): 697–720. http://projecteuclid.org/euclid.bj/1076364802.

Kennedy, Edward H. 2015. “Semiparametric Theory and Empirical Processes in Causal Inference.” 2015. http://arxiv.org/abs/1510.04740.

Koltchinskii, Vladimir. 2011. *Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems*. Lecture Notes in Mathematics École d’Été de Probabilités de Saint-Flour 2033. Heidelberg: Springer. https://doi.org/10.1007/978-3-642-22147-7_1.

Kontorovich, Aryeh, and Maxim Raginsky. 2016. “Concentration of Measure Without Independence: A Unified Approach via the Martingale Method.” February 1, 2016. http://arxiv.org/abs/1602.00721.

Kroll, Martin. 2016. “Concentration Inequalities for Poisson Point Processes with Application to Adaptive Intensity Estimation.” December 23, 2016. http://arxiv.org/abs/1612.07901.

Kuznetsov, Vitaly, and Mehryar Mohri. 2015. “Learning Theory and Algorithms for Forecasting Non-Stationary Time Series.” In *Advances in Neural Information Processing Systems*, 541–49. Curran Associates, Inc. http://papers.nips.cc/paper/5836-learning-theory-and-algorithms-for-forecasting-non-stationary-time-series.

———. 2016. “Generalization Bounds for Non-Stationary Mixing Processes.” In *Machine Learning Journal*. http://www.cs.nyu.edu/~mohri/pub/nonstatj.pdf.

———. 2014. “Generalization Bounds for Time Series Prediction with Non-Stationary Processes.” In *Algorithmic Learning Theory*, edited by Peter Auer, Alexander Clark, Thomas Zeugmann, and Sandra Zilles, 260–74. Lecture Notes in Computer Science. Bled, Slovenia: Springer International Publishing. https://doi.org/10.1007/978-3-319-11662-4_19.

Lahiri, Subhaneil, Peiran Gao, and Surya Ganguli. 2016. “Random Projections of Random Manifolds.” July 14, 2016. http://arxiv.org/abs/1607.04331.

Lederer, Johannes, and Sara van de Geer. 2014. “New Concentration Inequalities for Suprema of Empirical Processes.” *Bernoulli* 20 (4): 2020–38. https://doi.org/10.3150/13-BEJ549.

Liggett, Thomas M. 2010. “Stochastic Models for Large Interacting Systems and Related Correlation Inequalities.” *Proceedings of the National Academy of Sciences of the United States of America* 107 (38): 16413–9. https://doi.org/10.1073/pnas.1011270107.

Lugosi, Gábor. 2003. *Concentration-of-Measure Inequalities*. http://www.econ.upf.edu/~lugosi/anu.pdf.

Massart, Pascal. 2000. “Some Applications of Concentration Inequalities to Statistics.” In *Annales de La Faculté Des Sciences de Toulouse: Mathématiques*, 9:245–303. http://archive.numdam.org/article/AFST_2000_6_9_2_245_0.pdf.

———. 2007. *Concentration Inequalities and Model Selection: Ecole d’Eté de Probabilités de Saint-Flour XXXIII - 2003*. Lecture Notes in Mathematics 1896. Berlin ; New York: Springer-Verlag. http://www.cmap.polytechnique.fr/~merlet/articles/probas_massart_stf03.pdf.

Maugis, Cathy, and Bertrand Michel. 2011. “A Non Asymptotic Penalized Criterion for Gaussian Mixture Model Selection.” *ESAIM: Probability and Statistics* 15: 41–68. https://doi.org/10.1051/ps/2009004.

Raginsky, Maxim, and Igal Sason. 2012. “Concentration of Measure Inequalities in Information Theory, Communications and Coding.” *Foundations and Trends in Communications and Information Theory*, December. http://arxiv.org/abs/1212.4663.

Rahimi, Ali, and Benjamin Recht. 2009. “Weighted Sums of Random Kitchen Sinks: Replacing Minimization with Randomization in Learning.” In *Advances in Neural Information Processing Systems*, 1313–20. Curran Associates, Inc. http://papers.nips.cc/paper/3495-weighted-sums-of-random-kitchen-sinks-replacing-minimization-with-randomization-in-learning.

Reynaud-Bouret, Patricia, and Emmanuel Roy. 2007. “Some Non Asymptotic Tail Estimates for Hawkes Processes.” *Bulletin of the Belgian Mathematical Society - Simon Stevin* 13 (5): 883–96.

Hennequin, P. L., Etienne Pardoux, D. L. Burkholder, and Alain-Sol Sznitman. 1991. *Ecole d’été de Probabilités de Saint-Flour XIX, 1989*. Berlin; New York: Springer-Verlag. https://trove.nla.gov.au/work/20069699.

Talagrand, Michel. 1995. “Concentration of Measure and Isoperimetric Inequalities in Product Spaces.” https://doi.org/10.1007/BF02699376.

———. 1996. “A New Look at Independence.” *The Annals of Probability* 24 (1): 1–34. http://projecteuclid.org/DPubS/Repository/1.0/Disseminate?view=body&id=pdf_1&handle=euclid.aop/1042644705.

Touchette, Hugo. 2011. “A Basic Introduction to Large Deviations: Theory, Applications, Simulations,” June. https://arxiv.org/abs/1106.4146v3.

Tropp, Joel A. 2015. *An Introduction to Matrix Concentration Inequalities*. http://arxiv.org/abs/1501.01571.

Vershynin, Roman. 2018. *High-Dimensional Probability: An Introduction with Applications in Data Science*. 1st ed. Cambridge University Press. https://doi.org/10.1017/9781108231596.

Wright, F. T. 1973. “A Bound on Tail Probabilities for Quadratic Forms in Independent Random Variables Whose Distributions Are Not Necessarily Symmetric.” *Annals of Probability* 1 (6): 1068–70. https://doi.org/10.1214/aop/1176996815.

Wu, Xinxing, and Junping Zhang. 2016. “Distribution-Dependent Concentration Inequalities for Tighter Generalization Bounds.” July 19, 2016. http://arxiv.org/abs/1607.05506.