Leakage in predictive models

September 5, 2016 — April 12, 2023

estimator distribution
linear algebra
model selection
probability
statistics
Figure 1: Self-deluding cheating in predictive models.

Kyle Polich:

If you’d like to make a good prediction, your best bet is to invent a time machine, visit the future, observe the value, and return to the past. For those without access to time travel technology, we need to avoid including information about the future in our training data when building machine learning models.

1 Cross-validation leakage

The vtreat introduction mentions their “why you need hold-out” article, and also cites (Perlich and Świrszcz 2011), in discussing leakage via cross-validation.

Cross-methods such as cross-validation and cross-prediction are effective tools for many machine learning, statistics, and data science applications. They are useful for parameter selection, model selection, impact/target encoding of high-cardinality variables, stacking models, and super learning. Because cross-methods simulate access to an out-of-sample data set the same size as the original data, they are more statistically efficient (lower variance) than partitioning the training data into calibration/training/holdout sets. However, cross-methods do not satisfy the full exchangeability conditions that true hold-out methods do, which introduces additional statistical trade-offs beyond the obvious increase in computational cost.
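
For concreteness, here is a minimal sketch of cross-prediction in Python; the data set and model are arbitrary placeholders, and scikit-learn’s cross_val_predict is one off-the-shelf implementation. Every row receives a prediction from a model that never saw it, so the full-length vector of out-of-fold predictions behaves like predictions on a hold-out set the same size as the training data; these out-of-fold predictions are what stacking and super learning pass to the next-level model.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

# Placeholder regression problem; any supervised data set would do.
X, y = make_regression(n_samples=500, n_features=10, noise=1.0, random_state=0)

# Out-of-fold ("cross") predictions: the rows are split into 5 folds and each
# fold is predicted by a model fit on the other 4, so no row is predicted by
# a model that was trained on it.
oof_pred = cross_val_predict(Ridge(), X, y, cv=5)
print(oof_pred.shape)  # (500,): one out-of-sample-style prediction per row
```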

Specifically, cross-methods can introduce an information leak into the modeling process.
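
Here is a minimal sketch of one such leak, in the spirit of the target-encoding use case above (the data are synthetic noise; nothing here is taken from the cited papers). If the per-level target encoding of a high-cardinality categorical variable is computed on the full data set, each row’s encoded value partially contains its own label, and cross-validation reports apparent skill on labels that are pure noise. Computing the encoding out-of-fold (cross-fitting) closes the leak.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
n = 2000
cat = rng.integers(0, 1000, size=n)  # high-cardinality category, ~2 rows/level
y = rng.integers(0, 2, size=n)       # labels are pure noise, independent of cat

# Leaky target encoding: per-level mean of y computed over ALL rows, so each
# row's encoded value partially contains that row's own label.
level_means = {c: y[cat == c].mean() for c in np.unique(cat)}
x_leaky = np.array([level_means[c] for c in cat]).reshape(-1, 1)
print(cross_val_score(LogisticRegression(), x_leaky, y, cv=5).mean())
# well above the 0.5 chance level (roughly 0.75 here), despite zero signal

# Cross-fit encoding: each row is encoded using only the labels of *other*
# rows, which removes the self-leak.
x_cf = np.zeros(n)
for tr, te in KFold(n_splits=5, shuffle=True, random_state=0).split(cat):
    mu = y[tr].mean()  # fallback for levels unseen in the training folds
    means = {c: y[tr][cat[tr] == c].mean() for c in np.unique(cat[tr])}
    x_cf[te] = [means.get(c, mu) for c in cat[te]]
print(cross_val_score(LogisticRegression(), x_cf.reshape(-1, 1), y, cv=5).mean())
# approximately 0.5: no apparent skill once the leak is closed
```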

2 Incoming

3 References

Kaufman, Rosset, and Perlich. 2011. “Leakage in Data Mining: Formulation, Detection, and Avoidance.”
Perlich, and Świrszcz. 2011. “On Cross-Validation and Stacking: Building Seemingly Predictive Models on Random Data.” ACM SIGKDD Explorations Newsletter.