Leakage in predictive models
September 5, 2016 — April 12, 2023
Self-deluding cheating in predictive models.
If you’d like to make a good prediction, your best bet is to invent a time machine, visit the future, observe the value, and return to the past. Those of us without access to time-travel technology need to avoid including information about the future in our training data when building machine learning models.
1 Cross-validation leakage
The vtreat introduction cites the authors’ “why you need hold-out” article and Perlich and Świrszcz (2011) in discussing leakage via cross-validation.
Cross-methods such as cross-validation and cross-prediction are effective tools for many machine learning, statistics, and data science-related applications. They are useful for parameter selection, model selection, impact/target encoding of high cardinality variables, stacking models, and super learning. As cross-methods simulate access to an out-of-sample data set the same size as the original data, they are more statistically efficient, with lower variance, than partitioning training data into calibration/training/holdout sets. However, cross-methods do not satisfy the full exchangeability conditions that full hold-out methods do. This introduces some additional statistical trade-offs when using cross-methods, beyond the obvious increases in computational cost.
Specifically, cross-methods can introduce an information leak into the modeling process.
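One familiar way such a leak can show up is when the same cross-validation scores are used both to pick a model and to report how good it is. The following is a minimal sketch of that effect, not an example taken from the articles cited above. Everything in it is an assumption made for illustration: the data are pure noise, each candidate "model" is a one-feature sign classifier, and the helper names `fit_signs` and `accuracy` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_features, n_folds = 100, 1000, 5

# Pure noise: no feature carries any information about the labels.
X = rng.normal(size=(n_rows, n_features))
y = rng.choice([-1.0, 1.0], size=n_rows)

# Rows are already i.i.d., so a round-robin fold assignment is enough here.
fold = np.arange(n_rows) % n_folds


def fit_signs(X_train, y_train):
    """Fit one tiny model per feature: the sign of its inner product with y."""
    return np.sign(y_train @ X_train)


def accuracy(signs, X_eval, y_eval):
    """Per-feature accuracy when predicting y as sign(sign_j * x_j)."""
    predictions = np.sign(X_eval * signs)
    return (predictions == y_eval[:, None]).mean(axis=0)


# Cross-validated accuracy of every candidate feature.
cv_correct = np.zeros(n_features)
for k in range(n_folds):
    held_out = fold == k
    signs = fit_signs(X[~held_out], y[~held_out])
    cv_correct += accuracy(signs, X[held_out], y[held_out]) * held_out.sum()
cv_accuracy = cv_correct / n_rows

# Model selection: keep the feature with the best cross-validated accuracy.
best = int(np.argmax(cv_accuracy))
cv_accuracy_of_winner = cv_accuracy[best]

# Honest check: refit on all the data, then score on data never used above.
X_new = rng.normal(size=(2000, n_features))
y_new = rng.choice([-1.0, 1.0], size=2000)
holdout_accuracy_of_winner = accuracy(fit_signs(X, y), X_new, y_new)[best]

print(f"CV accuracy of the selected feature:       {cv_accuracy_of_winner:.2f}")
print(f"Hold-out accuracy of the selected feature: {holdout_accuracy_of_winner:.2f}")
```

Because the winner was chosen by maximising the cross-validated score over many candidates, that score is optimistically biased even though each individual fold was scored out-of-sample. The fresh hold-out set, which played no role in the selection, should show roughly chance-level accuracy.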