Post-selection inference

Adaptive data analysis without cheating

2017-08-20 — 2017-08-20

Wherein Post-Selection Inference Is Considered, and the Reusable Holdout via Differential Privacy Is Introduced as a Way to Preserve Validity Across Adaptive Analyses, With LASSO Methods Noted.

information

model selection

statistics

After you have interfered with the purity of your data by model selection, how do you do inference? 🚧TODO🚧

Tricky in general. There is an overview by Cosma Shalizi which mostly comes down in favour of data-splitting, whose complications are least extravagant. But this requires a new data holdout for each successive inference, which is still not ideal if you have limited data.
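A minimal sketch of the data-splitting idea, under assumptions that are mine rather than the overview's: a linear model, and a crude top-k marginal-correlation screen as the selection step. We select on one half of the data, then run ordinary least squares on the other half. Since the second half never saw the selection step, the classical standard errors there should be usable for the chosen submodel, given the usual OLS assumptions. The function name `split_select_infer` and its parameters are hypothetical.

```python
import numpy as np


def split_select_infer(X, y, k=3, seed=0):
    """Select k features on one half of the data, then fit OLS on the other half.

    Returns (chosen feature indices, coefficient estimates, standard errors).
    Illustrative only: the screening rule is an assumption, not a recommendation.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    perm = rng.permutation(n)
    sel, inf = perm[: n // 2], perm[n // 2:]

    # Selection step: top-k absolute marginal correlation, selection half only.
    Xs = X[sel] - X[sel].mean(axis=0)
    ys = y[sel] - y[sel].mean()
    score = np.abs(Xs.T @ ys) / (np.linalg.norm(Xs, axis=0) * np.linalg.norm(ys))
    chosen = np.sort(np.argsort(score)[-k:])

    # Inference step: classical OLS with intercept on the untouched half.
    A = np.column_stack([np.ones(len(inf)), X[inf][:, chosen]])
    beta, *_ = np.linalg.lstsq(A, y[inf], rcond=None)
    resid = y[inf] - A @ beta
    dof = len(inf) - A.shape[1]
    sigma2 = resid @ resid / dof
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(A.T @ A)))
    return chosen, beta[1:], se[1:]
```

The cost is visible in the code: every further round of selection would need its own fresh slice of `perm`, which is the limited-data problem mentioned above.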

Here’s an approach for more extended chains of inference, from the school known as adaptive data analysis: “The reusable holdout: Preserving validity in adaptive data analysis”, which, like everything these days, uses differential privacy methods. Aaron Roth’s explanation is pretty clear. Soon I will analyse fruit smoothies as differential privacy for bananas.
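To make the reusable-holdout mechanism concrete, here is a simplified sketch in the spirit of the Thresholdout algorithm from that paper. The analyst only sees the holdout through a noisy gate: if the training and holdout estimates of a statistic agree to within a noisy threshold, the training estimate is returned and the holdout leaks nothing; otherwise a Laplace-noised holdout estimate is returned. The class name, default parameter values and noise scales are my assumptions, and I have omitted the budget accounting that the published algorithm uses to limit how many times the holdout may be consulted.

```python
import numpy as np


class Thresholdout:
    """Simplified reusable-holdout gate (illustrative, no budget accounting)."""

    def __init__(self, threshold=0.04, sigma=0.01, seed=0):
        self.threshold = threshold  # tolerated train/holdout gap
        self.sigma = sigma          # Laplace noise scale
        self.rng = np.random.default_rng(seed)

    def query(self, train_vals, holdout_vals):
        """Answer one query given per-sample values of the statistic on each set."""
        train_mean = float(np.mean(train_vals))
        holdout_mean = float(np.mean(holdout_vals))
        # Perturb the threshold so the comparison itself leaks little.
        noisy_threshold = self.threshold + self.rng.laplace(scale=2 * self.sigma)
        if abs(holdout_mean - train_mean) > noisy_threshold:
            # Estimates disagree: the analyst has probably overfit the training
            # set, so release a noised holdout value instead.
            return holdout_mean + self.rng.laplace(scale=self.sigma)
        # Estimates agree: answer from the training set alone.
        return train_mean
```

A typical query would pass, say, the per-sample 0/1 correctness of a candidate classifier on each set, so that the answer is a validated accuracy estimate.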

Some models have special powers in this regard, e.g. LASSO-style approaches. Much to do here, but for now there is a simple, relaxed walk-through by Peter Ellis on post-regression inference using the LASSO for COVID-19 and hydroxychloroquine, with some side glances at Hastie, Tibshirani, and Wainwright (2015).

1 References

Bassily, Nissim, Smith, et al. 2015. “Algorithmic Stability for Adaptive Data Analysis.” arXiv:1511.02513 [Cs].

Berk, Brown, Buja, et al. 2013. “Valid Post-Selection Inference.” The Annals of Statistics.

Bunea. 2004. “Consistent Covariate Selection and Post Model Selection Inference in Semiparametric Regression.” The Annals of Statistics.

Chernozhukov, Hansen, and Spindler. 2015. “Valid Post-Selection and Post-Regularization Inference: An Elementary, General Approach.” Annual Review of Economics.

Dwork, Feldman, Hardt, et al. 2015. “Preserving Statistical Validity in Adaptive Data Analysis.” In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing. STOC ’15.

Dwork, Feldman, Hardt, et al. 2017. “Guilt-Free Data Reuse.” Communications of the ACM.

Hardt and Ullman. 2014. “Preventing False Discovery in Interactive Data Analysis Is Hard.” In Proceedings of the 2014 IEEE 55th Annual Symposium on Foundations of Computer Science. FOCS ’14.

Hastie, Tibshirani, and Wainwright. 2015. Statistical Learning with Sparsity: The Lasso and Generalizations.

Hastie, Tibshirani, and Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference and Prediction.

Jung, Ligett, Neel, et al. 2019. “A New Analysis of Differential Privacy’s Generalization Guarantees.” arXiv:1909.03577 [Cs, Stat].

Lee, Sun, Sun, et al. 2013. “Exact Post-Selection Inference, with Application to the Lasso.” arXiv:1311.6238 [Math, Stat].

Taylor, Lockhart, Tibshirani, et al. 2014. “Exact Post-Selection Inference for Forward Stepwise and Least Angle Regression.” arXiv:1401.3889 [Stat].