After you have interfered with the purity of your data by model selection, how do you do inference? 🏗
This is tricky in general. There is an overview by Cosma Shalizi which mostly comes down in favour of data splitting, as the approach whose complications are least extravagant. But data splitting requires a fresh holdout set for each successive inference, which is still not ideal if you have limited data.
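To make the data-splitting idea concrete, here is a minimal sketch in plain Python. It assumes a toy setting that is not from any of the sources: "model selection" means picking the feature most correlated with the response, and "inference" means estimating that correlation. The helper names `pearson` and `split_select_infer` are hypothetical.

```python
import random


def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def split_select_infer(columns, y, seed=0):
    """Select a feature on one half of the rows, estimate on the other half.

    columns: list of feature columns, each the same length as y.
    Returns (index of selected feature, correlation on the inference half).
    """
    n = len(y)
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    select_idx, infer_idx = idx[: n // 2], idx[n // 2:]

    def take(values, rows):
        return [values[i] for i in rows]

    # Selection step: only the first half of the rows is consulted.
    y_sel = take(y, select_idx)
    best = max(
        range(len(columns)),
        key=lambda j: abs(pearson(take(columns[j], select_idx), y_sel)),
    )

    # Inference step: these rows played no part in choosing `best`,
    # so the estimate is not contaminated by the selection.
    r = pearson(take(columns[best], infer_idx), take(y, infer_idx))
    return best, r
```

The cost is visible in the code: each half is used for one purpose only, so a second round of selection would need yet another untouched split.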
Here’s an approach for more extended chains of inference, from the school known as adaptive data analysis: “The reusable holdout: Preserving validity in adaptive data analysis”, which, like everything these days, uses differential privacy methods. Aaron Roth’s explanation is pretty clear. Soon I will analyse fruit smoothies as differential privacy for bananas.
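The core trick of the reusable holdout can be sketched roughly as follows: answer a query from the training set unless the training and holdout answers disagree by more than a noisy threshold, and only then reveal a noised holdout answer. This is a simplified, single-query illustration and not the paper's exact procedure. The function names, the default `threshold` and `sigma`, and the particular noise scales are assumptions, and the bookkeeping that limits how many times the holdout may be consulted is omitted.

```python
import random


def laplace(rng, scale):
    """Draw Laplace(0, scale) noise as a difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def thresholdout_query(train_vals, holdout_vals, threshold=0.04, sigma=0.01, rng=None):
    """Answer one mean-valued query in the spirit of the reusable holdout.

    train_vals / holdout_vals: the query function evaluated on each point
    of the training set and of the holdout set respectively.
    """
    rng = rng or random.Random()
    train_mean = sum(train_vals) / len(train_vals)
    holdout_mean = sum(holdout_vals) / len(holdout_vals)

    # Perturb the threshold so the analyst cannot learn exactly where it is.
    noisy_threshold = threshold + laplace(rng, 2 * sigma)

    if abs(train_mean - holdout_mean) > noisy_threshold + laplace(rng, 4 * sigma):
        # The training answer looks overfit: reveal the holdout value, with noise.
        return holdout_mean + laplace(rng, sigma)

    # The two sets agree: the training answer leaks nothing new about the holdout.
    return train_mean
```

The point of the design is that most queries never touch the holdout in an informative way, which is what allows the same holdout to serve many adaptive queries.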
Bassily, Raef, Kobbi Nissim, Adam Smith, Thomas Steinke, Uri Stemmer, and Jonathan Ullman. 2015. “Algorithmic Stability for Adaptive Data Analysis,” November. http://arxiv.org/abs/1511.02513.
Berk, Richard, Lawrence Brown, Andreas Buja, Kai Zhang, and Linda Zhao. 2013. “Valid Post-Selection Inference.” The Annals of Statistics 41 (2): 802–37. https://doi.org/10.1214/12-AOS1077.
Bunea, Florentina. 2004. “Consistent Covariate Selection and Post Model Selection Inference in Semiparametric Regression.” The Annals of Statistics 32 (3): 898–927. https://doi.org/10.1214/009053604000000247.
Chernozhukov, Victor, Christian Hansen, and Martin Spindler. 2015. “Valid Post-Selection and Post-Regularization Inference: An Elementary, General Approach.” Annual Review of Economics 7 (1): 649–88. https://doi.org/10.1146/annurev-economics-012315-015826.
Dwork, Cynthia, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Leon Roth. 2015. “Preserving Statistical Validity in Adaptive Data Analysis.” In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing - STOC ’15, 117–26. Portland, Oregon, USA: ACM Press. https://doi.org/10.1145/2746539.2746580.
Hardt, Moritz, and Jonathan Ullman. 2014. “Preventing False Discovery in Interactive Data Analysis Is Hard.” In Proceedings of the 2014 IEEE 55th Annual Symposium on Foundations of Computer Science, 454–63. FOCS ’14. Washington, DC, USA: IEEE Computer Society. https://doi.org/10.1109/FOCS.2014.55.
Jung, Christopher, Katrina Ligett, Seth Neel, Aaron Roth, Saeed Sharifi-Malvajerdi, and Moshe Shenfeld. 2019. “A New Analysis of Differential Privacy’s Generalization Guarantees,” September. http://arxiv.org/abs/1909.03577.
Lee, Jason D., Dennis L. Sun, Yuekai Sun, and Jonathan E. Taylor. 2013. “Exact Post-Selection Inference, with Application to the Lasso,” November. http://arxiv.org/abs/1311.6238.
Taylor, Jonathan, Richard Lockhart, Ryan J. Tibshirani, and Robert Tibshirani. 2014. “Exact Post-Selection Inference for Forward Stepwise and Least Angle Regression,” January. http://arxiv.org/abs/1401.3889.