Cross validation

September 5, 2016 — May 13, 2021

estimator distribution
linear algebra
model selection
probability
statistics

On substituting simulation for analysis in model selection, in e.g. choosing the “right” regularisation parameter for sparse regression.

Cross validation is the computationally expensive default option when your model offers no obvious shortcut for complexity regularisation, for example when AIC cannot be shown to work.
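To fix ideas, here is a minimal sketch of the "simulation instead of analysis" move: choosing a ridge penalty by held-out error over K folds. Plain numpy, no sklearn; the closed-form ridge solve, fold scheme, and grid are all illustrative choices, not canonical.

```python
import numpy as np

def kfold_cv_mse(X, y, lam, k=5, seed=0):
    """Mean held-out squared error of ridge regression at penalty lam."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    idx = rng.permutation(n)
    errs = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        Xt, yt = X[train], y[train]
        # closed-form ridge: beta = (X'X + lam I)^{-1} X'y
        beta = np.linalg.solve(Xt.T @ Xt + lam * np.eye(p), Xt.T @ yt)
        errs.append(np.mean((y[fold] - X[fold] @ beta) ** 2))
    return float(np.mean(errs))

# pick lambda by grid search over held-out error (synthetic data)
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, 0.0]) + rng.normal(scale=0.5, size=200)
grid = [0.01, 0.1, 1.0, 10.0, 100.0]
best = min(grid, key=lambda lam: kfold_cv_mse(X, y, lam))
```

Nothing here is analytic; we pay for K model refits per candidate value of the penalty, which is exactly the cost that shortcuts like AIC or GCV try to avoid.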

To learn: how this interacts with Bayesian inference.

🏗

2 Generalised Cross Validation

Why the name? It’s specialised cross-validation, AFAICS.

🏗 Hat matrix, smoother matrix. Note comparative computational efficiency. Define hat matrix.
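As a placeholder until that note gets written: for a linear smoother with hat matrix \(H_\lambda\) (so that \(\hat y = H_\lambda y\); for ridge, \(H_\lambda = X(X^\top X + \lambda I)^{-1}X^\top\)), the GCV score is \(n\,\mathrm{RSS}(\lambda)/(n - \operatorname{tr} H_\lambda)^2\), with \(\operatorname{tr} H_\lambda\) playing the role of effective degrees of freedom. One matrix solve replaces \(n\) leave-one-out refits, which is the computational-efficiency point. A hedged numpy sketch:

```python
import numpy as np

def gcv_score(X, y, lam):
    """Generalised cross-validation score for ridge at penalty lam.

    GCV(lam) = n * RSS / (n - tr(H))^2, where H = X (X'X + lam I)^{-1} X'
    is the hat matrix mapping y to fitted values.
    """
    n, p = X.shape
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    resid = y - H @ y
    rss = float(resid @ resid)
    return n * rss / (n - np.trace(H)) ** 2

# synthetic usage: score a few candidate penalties
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X[:, 0] + rng.normal(scale=0.5, size=100)
scores = {lam: gcv_score(X, y, lam) for lam in (0.1, 1.0, 10.0)}
```

Forming H explicitly is wasteful at scale; in practice one computes tr(H) from a decomposition of X, but the explicit version shows the structure.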

🏗️

4 What even is cross validation?

I always thought the answer here was simple: it is asymptotically equivalent to generalised Akaike information criteria (e.g. Stone (1977)), and related to the bootstrap in various ways.

But there is other stuff going on. Here is an interesting sampling of opinions: Rob Tibshirani, Yuling Yao, and Aki Vehtari on cross validation

4.1 Testing leakage

The vtreat introduction mentions their why you need hold-out article, and also:

Cross-methods such as cross-validation, and cross-prediction are effective tools for many machine learning, statistics, and data science related applications. They are useful for parameter selection, model selection, impact/target encoding of high cardinality variables, stacking models, and super learning. As cross-methods simulate access to an out of sample data set the same size as the original data, they are more statistically efficient (lower variance) than partitioning training data into calibration/training/holdout sets. However, cross-methods do not satisfy the full exchangeability conditions that full hold-out methods have. This introduces some additional statistical trade-offs when using cross-methods, beyond the obvious increases in computational cost.

Specifically, cross-methods can introduce an information leak into the modeling process.
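The target-encoding case is easy to demonstrate. Below, a high-cardinality categorical variable carries no signal at all, yet naive in-sample target encoding (each level replaced by its mean response, including the row's own y) looks spuriously predictive, while out-of-fold "cross" encoding does not. A minimal sketch with made-up data; the helper names are mine, not vtreat's API.

```python
import numpy as np

def naive_target_encode(cat, y):
    """In-sample target encoding: each level replaced by its mean y.
    Leaks, because every row's own y enters its own encoding."""
    means = {c: y[cat == c].mean() for c in np.unique(cat)}
    return np.array([means[c] for c in cat])

def crossfit_target_encode(cat, y, k=5, seed=0):
    """Out-of-fold target encoding: each row is encoded using level
    means computed on the other folds only (falling back to the
    overall mean for levels unseen in the training folds)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    enc = np.empty(len(y))
    overall = y.mean()
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        means = {c: y[train][cat[train] == c].mean()
                 for c in np.unique(cat[train])}
        enc[fold] = [means.get(c, overall) for c in cat[fold]]
    return enc

# pure-noise data: a 50-level id variable, independent of the response
rng = np.random.default_rng(2)
cat = rng.integers(0, 50, size=200)
y = rng.normal(size=200)
r_naive = np.corrcoef(naive_target_encode(cat, y), y)[0, 1]
r_cross = np.corrcoef(crossfit_target_encode(cat, y), y)[0, 1]
```

With roughly four rows per level, the naive encoding correlates substantially with y despite there being no signal; the cross-fit encoding stays near zero. This is the information leak the quote warns about, and also why cross-fit encodings are not quite as clean as a true hold-out.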

5 References

Andrews. 1991. Journal of Econometrics.
Bates, Hastie, and Tibshirani. n.d. “Cross-Validation: What Does It Estimate and How Well Does It Do It?”
Bürkner, Gabry, and Vehtari. 2020. Journal of Statistical Computation and Simulation.
———. 2021. Computational Statistics.
Giordano, Jordan, and Broderick. 2019. arXiv:1907.12116 [Cs, Math, Stat].
Giordano, Stephenson, Liu, et al. 2019. In AISTATS.
Golub, Heath, and Wahba. 1979. Technometrics.
Hall, Racine, and Li. 2004. Journal of the American Statistical Association.
Li. 1987. The Annals of Statistics.
Perlich, and Świrszcz. 2011. ACM SIGKDD Explorations Newsletter.
Polley. 2010. U.C. Berkeley Division of Biostatistics Working Paper Series.
Sivula, Magnusson, and Vehtari. 2020a. arXiv:2008.10859 [Stat].
———. 2020b. arXiv:2008.10296 [Stat].
Stone. 1977. Journal of the Royal Statistical Society. Series B (Methodological).
van der Laan, Polley, and Hubbard. 2007. Statistical Applications in Genetics and Molecular Biology.
Wood. 1994. SIAM Journal on Scientific Computing.
Yao, Vehtari, Simpson, et al. 2018. Bayesian Analysis.