# High dimensional statistics

Placeholder to think about the many weird problems arising in very high dimensional statistical inference. There are many approaches to these problems: throwing out dimensions/predictors as in model selection, considering low dimensional projections, or exploiting matrix structure (or even tensor structure) in the objects of interest, for concentration or factorisation.

## Soap bubbles

High dimensional distributions are extremely odd, and concentrate in weird ways. For example, for some natural definitions of typical, typical items are not average items in high dimensions. See Sander Dieleman’s musings on typicality for an introduction to this plus some motivating examples.

For another example, consider this summary result of Vershynin (2015):

Let $$K$$ be an isotropic convex body (e.g. an $$L_2$$ ball) in $$\mathbb{R}^{n},$$ and let $$X$$ be a random vector uniformly distributed in $$K$$, with $$\mathbb{E}X=0$$ and $$\mathbb{E}XX^{\top}=I_n.$$ Then the following is true for some positive constants $$C,c$$:

1. (Concentration of volume) For every $$t \geq 1$$, one has $\mathbb{P}\left\{\|X\|_{2}>t \sqrt{n}\right\} \leq \exp (-c t \sqrt{n})$
2. (Thin shell) For every $$\varepsilon \in(0,1),$$ one has $\mathbb{P}\left\{\left|\|X\|_{2}-\sqrt{n}\right|>\varepsilon \sqrt{n}\right\} \leq C \exp \left(-c \varepsilon^{3} n^{1 / 2}\right)$
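
The thin-shell statement can be checked numerically in the simplest case. The following is a minimal sketch, not taken from Vershynin (2015): it assumes $$K$$ is the Euclidean ball of radius $$\sqrt{n+2}$$, which is the radius at which the uniform distribution on a ball has identity covariance. The function names `sample_isotropic_ball` and `shell_fraction` are made up for this illustration.

```python
import math
import random


def sample_isotropic_ball(n, rng):
    """Draw one point uniformly from the ball of radius sqrt(n + 2) in R^n.

    Assumption of this sketch: K is a Euclidean ball. With this radius the
    covariance of the uniform distribution is the identity.
    """
    # A normalised Gaussian vector gives a uniformly random direction.
    g = [rng.gauss(0.0, 1.0) for _ in range(n)]
    g_norm = math.sqrt(sum(v * v for v in g))
    # The radius of a uniform point in the unit ball is distributed as U^(1/n).
    r = math.sqrt(n + 2) * rng.random() ** (1.0 / n)
    return [r * v / g_norm for v in g]


def shell_fraction(n, eps, n_samples=500, seed=0):
    """Fraction of samples with | ||X|| - sqrt(n) | <= eps * sqrt(n)."""
    rng = random.Random(seed)
    root_n = math.sqrt(n)
    hits = 0
    for _ in range(n_samples):
        x = sample_isotropic_ball(n, rng)
        norm = math.sqrt(sum(v * v for v in x))
        if abs(norm - root_n) <= eps * root_n:
            hits += 1
    return hits / n_samples


if __name__ == "__main__":
    for n in (2, 10, 100, 1000):
        print(n, shell_fraction(n, eps=0.1))
```

In low dimension only a modest fraction of the samples falls inside the shell of relative width $$\varepsilon = 0.1$$; by $$n = 1000$$ essentially all of them do.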

That is, even with the mass uniformly distributed over the body, as the dimension grows almost all of it ends up in a thin shell of radius about $$\sqrt{n}$$, because volume grows exponentially in dimension. This is popularly known as the soap bubble phenomenon, and it is one of the phenomena that lead to interesting behaviour in low dimensional projections.

The Gaussian version has a more formal name, the Gaussian Annulus Theorem: for a $$d$$-dimensional spherical Gaussian with unit variance in each direction, for any $$\beta \leq \sqrt{d}$$, all but at most $$3 e^{-c \beta^{2}}$$ of the probability mass lies within the annulus $$\sqrt{d}-\beta \leq|\mathbf{x}| \leq \sqrt{d}+\beta,$$ where $$c$$ is a fixed positive constant.
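
The Gaussian annulus statement is easy to see in simulation. Here is a rough sketch using only the Python standard library; the helper names `gaussian_norms` and `summarize` and the choice $$\beta = 3$$ are assumptions made for this illustration. It draws standard Gaussian vectors and reports the mean and standard deviation of their norms, plus the fraction landing in the annulus $$\sqrt{d} \pm \beta$$.

```python
import math
import random


def gaussian_norms(d, n_samples=300, seed=0):
    """Euclidean norms of n_samples draws from N(0, I_d)."""
    rng = random.Random(seed)
    return [
        math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d)))
        for _ in range(n_samples)
    ]


def summarize(d, beta=3.0, n_samples=300, seed=0):
    """Return (mean norm, sd of norm, fraction within sqrt(d) +/- beta)."""
    norms = gaussian_norms(d, n_samples=n_samples, seed=seed)
    mean = sum(norms) / len(norms)
    sd = math.sqrt(sum((r - mean) ** 2 for r in norms) / len(norms))
    root_d = math.sqrt(d)
    inside = sum(abs(r - root_d) <= beta for r in norms) / len(norms)
    return mean, sd, inside


if __name__ == "__main__":
    # beta = 3 satisfies beta <= sqrt(d) for every d used here.
    for d in (16, 256, 2048):
        mean, sd, inside = summarize(d)
        print(f"d={d:5d} sqrt(d)={math.sqrt(d):7.2f} "
              f"mean={mean:7.2f} sd={sd:5.2f} inside={inside:.3f}")
```

The mean norm tracks $$\sqrt{d}$$ while its standard deviation stays roughly constant (near $$1/\sqrt{2}$$) as $$d$$ grows, so the shell has constant absolute width and becomes ever thinner relative to its radius.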

## Convex hulls

Balestriero, Pesenti, and LeCun (2021) cite Bárány and Füredi (1988):

Given a $$d$$-dimensional dataset $$\boldsymbol{X} \triangleq\left\{\boldsymbol{x}_{1}, \ldots, \boldsymbol{x}_{N}\right\}$$ with i.i.d. samples $$\boldsymbol{x}_{n} \sim \mathcal{N}\left(0, I_{d}\right), \forall n$$, the probability that a new sample $$\boldsymbol{x} \sim \mathcal{N}\left(0, I_{d}\right)$$ is in interpolation regime (recall Def. 1 ) has the following limiting behavior $\lim _{d \rightarrow \infty} p(\underbrace{\boldsymbol{x} \in \operatorname{Hull}(\boldsymbol{X})}_{\text {interpolation }})= \begin{cases}1 & \Longleftrightarrow N>d^{-1} 2^{d / 2} \\ 0 & \Longleftrightarrow N<d^{-1} 2^{d / 2}\end{cases}$

They observe that this implies high dimensional statistics rarely interpolates between data points, which is not surprising, but only in retrospect. Despite some expertise in high-dimensional problems, I had never noticed this fact myself. Interestingly, they collect evidence suggesting that low-dimensional projections and latent spaces are also rarely interpolating.
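
To get a feel for how demanding the threshold $$d^{-1} 2^{d/2}$$ is, it helps to evaluate it for a few dimensions. The snippet below is plain arithmetic on the quoted formula; the function name `hull_threshold` is made up for this illustration. The result is a limit as $$d \to \infty$$, so the finite-$$d$$ numbers are indicative only.

```python
def hull_threshold(d):
    """Sample-size threshold 2^(d/2) / d from the quoted limiting result."""
    return 2.0 ** (d / 2) / d


if __name__ == "__main__":
    for d in (10, 20, 50, 100, 200):
        print(f"d={d:>3}  threshold N ~ {hull_threshold(d):.3g}")
```

The threshold is about 3.2 at $$d=10$$ but already exceeds $$10^{13}$$ at $$d=100$$, which is why a new sample almost never lies inside the convex hull of a realistically sized dataset in high dimension.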

## Empirical processes in high dimensions

Combining empirical process theory with high dimensional statistics gets us to some interesting models. See, e.g. van de Geer (2014b).

TBD

## References

Balestriero, Randall, Jerome Pesenti, and Yann LeCun. 2021. arXiv:2110.09485 [Cs], October.
Bárány, Imre, and Zoltán Füredi. 1988. Probability Theory and Related Fields 77 (2): 231–40.
Borgs, Christian, Jennifer T. Chayes, Henry Cohn, and Yufei Zhao. 2014. arXiv:1401.2906 [Math], January.
Bühlmann, Peter, and Sara van de Geer. 2011. Statistics for High-Dimensional Data: Methods, Theory and Applications. 2011 edition. Heidelberg ; New York: Springer.
———. 2015. arXiv:1503.06426 [Stat] 9 (1): 1449–73.
Candès, Emmanuel J., J. Romberg, and T. Tao. 2006. IEEE Transactions on Information Theory 52 (2): 489–509.
Chen, Yen-Chi, and Yu-Xiang Wang. n.d.
Chernozhukov, Victor, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. 2016. arXiv:1608.00060 [Econ, Stat], July.
Chernozhukov, Victor, Christian Hansen, Yuan Liao, and Yinchu Zhu. 2018. arXiv:1812.08089 [Math, Stat], December.
Chernozhukov, Victor, Whitney K. Newey, and Rahul Singh. 2018. arXiv:1809.05224 [Econ, Math, Stat], September.
Geer, Sara van de. 2014a. In arXiv:1403.7023 [Math, Stat]. Vol. 131.
———. 2014b. arXiv:1409.8557 [Math, Stat], September.
Geer, Sara van de, Peter Bühlmann, Ya’acov Ritov, and Ruben Dezeure. 2014. The Annals of Statistics 42 (3): 1166–1202.
Georgi, Howard. 2022. arXiv:2203.09485 [Hep-Ph, Physics:hep-Th, Physics:physics], March.
Gorban, Alexander N., Ivan Yu Tyukin, and Ilya Romanenko. 2016. arXiv:1610.00494 [Cs, Stat], October.
Gribonval, Rémi, Gilles Blanchard, Nicolas Keriven, and Yann Traonmilin. 2017. arXiv:1706.07180 [Cs, Math, Stat], June.
Gui, Jiang, and Hongzhe Li. 2005. Bioinformatics 21 (13): 3001–8.
Hall, Peter, and Ker-Chau Li. 1993. The Annals of Statistics 21 (2): 867–89.
Javanmard, Adel, and Andrea Montanari. 2014. Journal of Machine Learning Research 15 (1): 2869–909.
Müller, Patric, and Sara van de Geer. 2015. TEST, April.
Tang, Yanbo, and Nancy Reid. 2021. arXiv:2107.10885 [Math, Stat], July.
Uematsu, Yoshimasa. 2015. arXiv:1504.06706 [Math, Stat], April.
Veitch, Victor, and Daniel M. Roy. 2015. arXiv:1512.03099 [Cs, Math, Stat], December.
Vershynin, Roman. 2015. In Sampling Theory, a Renaissance: Compressive Sensing and Other Developments, edited by Götz E. Pfander, 3–66. Applied and Numerical Harmonic Analysis. Cham: Springer International Publishing.
———. 2018. High-Dimensional Probability: An Introduction with Applications in Data Science. 1st ed. Cambridge University Press.
Wainwright, Martin J. 2014. Annual Review of Statistics and Its Application 1 (1): 233–53.
Wright, John, and Yi Ma. 2022. High-dimensional data analysis with low-dimensional models: Principles, computation, and applications. S.l.: Cambridge University Press.
Zhang, Cun-Hui, and Stephanie S. Zhang. 2014. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 76 (1): 217–42.
