Welcome to the probability inequality mines!
Concentration inequalities are for when something in your process (measurement, estimation) means that you can be pretty sure that a whole bunch of your stuff is particularly likely to be somewhere in particular.
As undergraduates we run into central limit theorems, but there are many more diverse ways to keep track of our probability mass, or at least most of it. This idea is a basic workhorse in univariate probability, and turns out to be yet more essential in multivariate matrix probability, as seen in matrix factorisation, compressive sensing, PAC bounds and suchlike.
Background
Overviews include

The theory of large deviations deals with the probabilities of rare events (or fluctuations) that are exponentially small as a function of some parameter, e.g., the number of random components of a system, the time over which a stochastic system is observed, the amplitude of the noise perturbing a dynamical system or the temperature of a chemical reaction. The theory has applications in many different scientific fields, ranging from queuing theory to statistics and from finance to engineering. It is also increasingly used in statistical physics for studying both equilibrium and nonequilibrium systems. In this context, deep analogies can be made between familiar concepts of statistical physics, such as the entropy and the free energy, and concepts of large deviation theory having more technical names, such as the rate function and the scaled cumulant generating function. The first part of these notes introduces the basic elements of large deviation theory at a level appropriate for advanced undergraduate and graduate students in physics, engineering, chemistry, and mathematics. The focus there is on the simple but powerful ideas behind large deviation theory, stated in nontechnical terms, and on the application of these ideas in simple stochastic processes, such as sums of independent and identically distributed random variables and Markov processes. Some physical applications of these processes are covered in exercises contained at the end of each section. In the second part, the problem of numerically evaluating large deviation probabilities is treated at a basic level. The fundamental idea of importance sampling is introduced there together with its sister idea, the exponential change of measure. Other numerical methods based on sample means and generating functions, with applications to Markov processes, are also covered.
Roman Vershynin’s High-Dimensional Probability: An Introduction with Applications in Data Science (Vershynin 2018)
Thomas Lumley’s super simple intro to chaining and controlling maxima.
DasGupta, Asymptotic Theory of Statistics and Probability (DasGupta 2008) is practical, and despite its name, introduces some basic non-asymptotic inequalities
Raginsky and Sason, Concentration of Measure Inequalities in Information Theory, Communications and Coding (Raginsky and Sason 2012)
Tropp, An Introduction to Matrix Concentration Inequalities (Tropp 2015). High-dimensional data! Free!
Boucheron, Bousquet & Lugosi, Concentration inequalities (Boucheron, Bousquet, and Lugosi 2004) (Clear and brisk but missing some newer stuff)
Boucheron, Lugosi & Massart, Concentration inequalities: a nonasymptotic theory of independence (Boucheron, Lugosi, and Massart 2013). Haven’t read it yet.
Massart, Concentration inequalities and model selection (Massart 2007). Clear, focussed, but brusque. Depressingly, by being applied it also demonstrates the limitations of its chosen techniques, which seem sweet in application but bitter in the required assumptions. (Massart 2000) is an earlier draft.
Lugosi’s Concentration-of-measure lecture notes:
The inequalities discussed in these notes bound tail probabilities of general functions of independent random variables.
The taxonomy is interesting:
Several methods have been known to prove such inequalities, including martingale methods (see Milman and Schechtman [1] and the surveys of McDiarmid [2, 3]), information-theoretic methods (see Ahlswede, Gács, and Körner [4], Marton [5, 6, 7], Dembo [8], Massart [9] and Rio [10]), Talagrand’s induction method [11, 12, 13] (see also Luczak and McDiarmid [14], McDiarmid [15] and Panchenko [16, 17, 18]), the decoupling method surveyed by de la Peña and Giné [19], and the so-called “entropy method”, based on logarithmic Sobolev inequalities, developed by Ledoux [20, 21], see also Bobkov and Ledoux [22], Massart [23], Rio [10], Klein [24], Boucheron, Lugosi, and Massart [25, 26], Bousquet [27, 28], and Boucheron, Bousquet, Lugosi, and Massart [29].
This taxonomy is used in his Combinatorial statistics notes.
Divergence in everything: erasure divergence and concentration inequalities
Talagrand’s opus (Talagrand 1995) is commonly credited with kicking off the modern fad, especially the part due to the chaining method.
Luca Trevisan wrote an excellent example-driven explanation of Talagrand’s generic chaining.
Use in finite sample bounds
Asymptotic normality is so last season. These days we care about the finite-sample performance distribution, and asymptotic results don’t help us there. Apparently I can construct useful bounds using concentration inequalities? One suggested keyword to disambiguate: Ahlswede-Winter bounds? 🏗
Classic inequalities
Markov
🏗
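Markov’s inequality: for a nonnegative random variable \(X\) and \(a>0\), \(P(X \geq a) \leq \operatorname{E}[X]/a\). A minimal Monte Carlo sanity check, assuming numpy, using an exponential variate for concreteness:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.exponential(scale=1.0, size=100_000)  # nonnegative, E[X] = 1

a = 3.0
empirical = (x >= a).mean()          # estimate of P(X >= a)
markov_bound = x.mean() / a          # E[X] / a

# The bound must hold, but it is distribution-free and hence loose here:
# the exponential tail is exp(-3), roughly 0.05, versus a bound near 1/3.
print(empirical, markov_bound)
assert empirical <= markov_bound
```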
Chebyshev
🏗
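Chebyshev’s inequality: for any \(X\) with mean \(\mu\) and standard deviation \(\sigma\), \(P(|X-\mu| \geq k\sigma) \leq 1/k^2\). Again distribution-free and again loose on well-behaved data; a quick numerical illustration, assuming numpy:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(200_000)     # mu = 0, sigma = 1

k = 2.0
empirical = (np.abs(x - x.mean()) >= k * x.std()).mean()
chebyshev_bound = 1.0 / k**2

# For Gaussian data the 2-sigma tail is about 0.0455, versus a bound of 0.25.
print(empirical, chebyshev_bound)
assert empirical <= chebyshev_bound
```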
Hoeffding
🏗
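Hoeffding’s inequality for the mean of \(n\) i.i.d. variables bounded in \([a,b]\): \(P(\bar{X} - \operatorname{E}\bar{X} \geq t) \leq \exp(-2nt^2/(b-a)^2)\). A sanity check with uniform variates, assuming numpy:

```python
import numpy as np

rng = np.random.default_rng(7)
n, trials, t = 100, 50_000, 0.1
samples = rng.uniform(0.0, 1.0, size=(trials, n))   # bounded in [a, b] = [0, 1]

deviations = samples.mean(axis=1) - 0.5             # Xbar - E[Xbar]
empirical = (deviations >= t).mean()
hoeffding_bound = np.exp(-2 * n * t**2)             # (b - a) = 1

print(empirical, hoeffding_bound)
assert empirical <= hoeffding_bound
```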
Chernoff
🏗
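The generic Chernoff bound optimises the exponential Markov bound over the tilt: \(P(X \geq a) \leq \inf_{s>0} e^{-sa}\operatorname{E}[e^{sX}]\). A sketch for a binomial tail, minimising over a grid of \(s\) rather than solving for the optimal tilt in closed form, assuming numpy:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, a = 100, 0.5, 60          # X ~ Binomial(100, 0.5), tail at a = 60

# Chernoff: P(X >= a) <= inf_{s>0} exp(-s a) M(s),
# with MGF M(s) = (1 - p + p e^s)^n for the binomial.
s = np.linspace(1e-3, 2.0, 2000)
bounds = np.exp(-s * a) * (1 - p + p * np.exp(s))**n
chernoff_bound = bounds.min()

x = rng.binomial(n, p, size=200_000)
empirical = (x >= a).mean()

print(empirical, chernoff_bound)
assert empirical <= chernoff_bound
```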
Kolmogorov
🏗
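Kolmogorov’s maximal inequality: for independent mean-zero summands with finite variance, \(P(\max_{k\leq n}|S_k| \geq \lambda) \leq \operatorname{Var}(S_n)/\lambda^2\). This is the same bound Chebyshev gives for \(S_n\) alone, but it controls the whole partial-sum path. A simulation sketch, assuming numpy:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials, lam = 50, 50_000, 15.0
steps = rng.standard_normal((trials, n))        # independent, mean zero, var 1
partial_sums = steps.cumsum(axis=1)             # S_1, ..., S_n per trial

max_abs = np.abs(partial_sums).max(axis=1)      # max_k |S_k|
empirical = (max_abs >= lam).mean()
kolmogorov_bound = n / lam**2                   # Var(S_n) = n for unit-variance steps

print(empirical, kolmogorov_bound)
assert empirical <= kolmogorov_bound
```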
Gaussian
For the Gaussian distribution. Filed there.
Martingale bounds
🏗
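A representative example is the Azuma-Hoeffding inequality: for a martingale with bounded increments \(|M_k - M_{k-1}| \leq c_k\), \(P(M_n - M_0 \geq t) \leq \exp(-t^2/(2\sum_k c_k^2))\). A sanity check on the simplest case, a simple random walk with \(c_k = 1\), assuming numpy:

```python
import numpy as np

rng = np.random.default_rng(5)
n, trials, t = 200, 50_000, 25.0

# Simple-random-walk martingale: increments are +-1, so c_k = 1.
increments = rng.choice([-1.0, 1.0], size=(trials, n))
m_n = increments.sum(axis=1)                    # M_n - M_0

empirical = (m_n >= t).mean()
azuma_bound = np.exp(-t**2 / (2 * n))           # exp(-t^2 / (2 sum c_k^2))

print(empirical, azuma_bound)
assert empirical <= azuma_bound
```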
Khintchine
Let us copy from Wikipedia:
Heuristically: if we pick \(N\) complex numbers \(x_1,\dots,x_N \in\mathbb{C}\), and add them together, each multiplied by jointly independent random signs \(\pm 1\), then the expected magnitude of the sum is close to \(\sqrt{|x_1|^{2}+ \cdots + |x_N|^{2}}\).
Let \(\{\varepsilon_n\}_{n=1}^N\) be i.i.d. random variables with \(P(\varepsilon_n=\pm1)=\frac12\) for \(n=1,\ldots, N\), i.e., a sequence with Rademacher distribution. Let \(0<p<\infty\) and let \(x_1,\ldots,x_N \in \mathbb{C}\). Then
\[ A_p \left( \sum_{n=1}^N |x_n|^2 \right)^{1/2} \leq \left(\operatorname{E} \left|\sum_{n=1}^N \varepsilon_n x_n\right|^p \right)^{1/p} \leq B_p \left(\sum_{n=1}^N |x_n|^2\right)^{1/2} \]
for some constants \(A_p,B_p>0\). It is a simple matter to see that \(A_p = 1\) when \(p \ge 2\), and \(B_p = 1\) when \(0 < p \le 2\).
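For \(p=2\) the two-sided inequality collapses to an identity, by independence of the signs: \(\operatorname{E}\left|\sum_n \varepsilon_n x_n\right|^2 = \sum_n |x_n|^2\). A quick numerical check, assuming numpy:

```python
import numpy as np

rng = np.random.default_rng(11)
x = np.array([1.0 + 2.0j, -0.5j, 3.0, 0.25 + 0.25j])     # arbitrary complex coefficients

signs = rng.choice([-1.0, 1.0], size=(500_000, x.size))  # Rademacher signs
sums = signs @ x                                          # sum_n eps_n x_n, one per draw

lhs = np.mean(np.abs(sums)**2)               # E |sum_n eps_n x_n|^2, estimated
rhs = np.sum(np.abs(x)**2)                   # sum_n |x_n|^2

print(lhs, rhs)
assert abs(lhs - rhs) / rhs < 0.02           # equal up to Monte Carlo error
```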
Empirical process theory
🏗
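One taste of the subject in a single line is the Dvoretzky-Kiefer-Wolfowitz inequality, which concentrates the empirical CDF about the truth uniformly over the whole real line: \(P(\sup_x |F_n(x)-F(x)| > \varepsilon) \leq 2e^{-2n\varepsilon^2}\), with Massart’s sharp constant 2. A simulation sketch, assuming numpy:

```python
import numpy as np

rng = np.random.default_rng(2)
n, trials, eps = 100, 50_000, 0.15

# Uniform(0, 1) samples, so the true CDF is F(x) = x.
samples = np.sort(rng.uniform(size=(trials, n)), axis=1)
upper = np.arange(1, n + 1) / n                  # F_n just after each jump
lower = np.arange(0, n) / n                      # F_n just before each jump

# sup_x |F_n(x) - F(x)| is attained at the jump points of F_n.
ks_stat = np.maximum(np.abs(upper - samples),
                     np.abs(lower - samples)).max(axis=1)

empirical = (ks_stat > eps).mean()
dkw_bound = 2 * np.exp(-2 * n * eps**2)

print(empirical, dkw_bound)
assert empirical <= dkw_bound
```

Note that the DKW bound is nearly attained here: it is the leading term of the exact Kolmogorov-Smirnov tail, so unlike Markov or Chebyshev there is not much slack.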
Matrix Chernoff bounds
In recent years, random matrices have come to play a major role in computational mathematics, but most of the classical areas of random matrix theory remain the province of experts. Over the last decade, with the advent of matrix concentration inequalities, research has advanced to the point where we can conquer many (formerly) challenging problems with a page or two of arithmetic.
Are these related?
Nikhil Srivastava’s Discrepancy, Graphs, and the Kadison-Singer Problem has an interesting example of bounds via discrepancy theory (and only indirectly probability). D. Gross (2011) is also readable, and gives results for matrices over the complex field.
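A sketch of the flavour of these results, assuming numpy and the matrix Chernoff tail bound in the form Tropp (2015) gives it: for independent PSD matrices \(X_i\) with \(\lambda_{\max}(X_i) \leq R\) and \(\mu_{\max} = \lambda_{\max}\left(\sum_i \operatorname{E} X_i\right)\), we have \(P\left(\lambda_{\max}\left(\sum_i X_i\right) \geq (1+\delta)\mu_{\max}\right) \leq d\,\left[e^{\delta}/(1+\delta)^{1+\delta}\right]^{\mu_{\max}/R}\). Here we check it on sums of random rank-one projections:

```python
import numpy as np

rng = np.random.default_rng(8)
d, n, trials, delta = 5, 100, 2000, 0.5

# X_i = u_i u_i^T for random unit vectors u_i: PSD, lambda_max(X_i) <= R = 1,
# and E[X_i] = I/d, so mu_max = lambda_max(sum_i E[X_i]) = n/d.
u = rng.standard_normal((trials, n, d))
u /= np.linalg.norm(u, axis=2, keepdims=True)
sums = np.einsum('tni,tnj->tij', u, u)           # sum_i u_i u_i^T, per trial
lam_max = np.linalg.eigvalsh(sums)[:, -1]        # largest eigenvalue, per trial

mu_max, R = n / d, 1.0
empirical = (lam_max >= (1 + delta) * mu_max).mean()
chernoff_bound = d * (np.exp(delta) / (1 + delta)**(1 + delta))**(mu_max / R)

print(empirical, chernoff_bound)
assert empirical <= chernoff_bound
```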