Multiple testing

April 22, 2015 — November 5, 2018

decision theory
high d
how do science
linear algebra
machine learning
model selection

How to go data mining for models without “dredging” for them, accidentally or otherwise. If you keep testing models until you find some that fit (which you usually will), how do you know that the fit is in some sense interesting? How sharp will your conclusions be? How does it work when you are testing against a possibly uncountable continuum of hypotheses? (One perspective on sparsity penalties is precisely this, I am told.)
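To put a number on “which you usually will”: if each of m independent tests of a true null hypothesis is run at level α, the chance that at least one comes out “significant” is 1 − (1 − α)^m. The following is a minimal sketch, assuming independent tests and valid (uniform under the null) p-values; the function names are made up for this illustration.

```python
import random


def prob_any_false_positive(m, alpha=0.05):
    """Chance that at least one of m independent true-null tests has p < alpha."""
    return 1.0 - (1.0 - alpha) ** m


def simulate_any_false_positive(m, alpha=0.05, n_experiments=5000, seed=0):
    """Monte Carlo check of the formula above."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_experiments):
        # Under a true null hypothesis a valid p-value is uniform on [0, 1].
        if any(rng.random() < alpha for _ in range(m)):
            hits += 1
    return hits / n_experiments


if __name__ == "__main__":
    for m in (1, 5, 20, 100):
        exact = prob_any_false_positive(m)
        approx = simulate_any_false_positive(m)
        print(f"m={m:4d}  exact={exact:.3f}  simulated={approx:.3f}")
```

With α = 0.05 the exact probability is already about 0.64 at 20 tests, so “finding something” by persistence alone is the expected outcome rather than a surprise.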

Model selection is multiple testing writ small: you are testing how many variables to include in your model.

In modern high-dimensional models, where you have potentially many explanatory variables, handling the combinatorial explosion of possible variables to include can also be considered a multiple testing problem. We tend to regard it as a smoothing and model selection problem, though.
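A toy illustration of why variable selection behaves like multiple testing: generate a response and a pile of candidate covariates that are all pure noise, then keep the covariate that correlates best with the response. This is a sketch under made-up assumptions (Gaussian noise, 50 observations, hypothetical function names), not a model selection procedure anyone should use.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def best_noise_correlation(n_covariates, n_obs=50, seed=0):
    """Largest |correlation| between a noise response and noise covariates."""
    rng = random.Random(seed)
    y = [rng.gauss(0, 1) for _ in range(n_obs)]
    best = 0.0
    for _ in range(n_covariates):
        # Each candidate covariate is independent of y by construction.
        x = [rng.gauss(0, 1) for _ in range(n_obs)]
        best = max(best, abs(pearson(x, y)))
    return best


if __name__ == "__main__":
    for p in (1, 10, 100, 1000):
        print(f"{p:5d} candidates: best |r| = {best_noise_correlation(p):.2f}")
```

The more candidates you search over, the more impressive the best one looks, even though none of them carries any signal.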

This all gets more complicated when you think about many people testing many hypotheses in many different experiments. Then you run into many more issues than just these, such as publication bias.

Suggestive connection:

Moritz Hardt, The machine learning leaderboard problem:

In this post, I will describe a method to climb the public leaderboard without even looking at the data. The algorithm is so simple and natural that an unwitting analyst might just run it. We will see that in Kaggle’s famous Heritage Health Prize competition this might have propelled a participant from rank around 150 into the top 10 on the public leaderboard without making progress on the actual problem. […]

I get super excited. I keep climbing the leaderboard! Who would’ve thought that this machine learning thing was so easy? So, I go write a blog post on Medium about Big Data and score a job at, the latest data science startup in the city. Life is pretty sweet. I pick up indoor rock climbing, sign up for wood working classes; I read Proust and books about espresso. Two months later the competition closes and Kaggle releases the final score. What an embarrassment! Wacky boosting did nothing whatsoever on the final test set. I get fired from days before the buyout. My spouse dumps me. The lease expires. I get evicted from my apartment in the Mission. Inevitably, I hike the Pacific Crest Trail and write a novel about it.

See (Blum and Hardt 2015; Dwork et al. 2015a) for more of that.
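A toy simulation in the spirit of the quoted post, as I understand the idea: submit random label vectors, keep the ones that happen to beat chance on the public leaderboard, and submit their majority vote. This is a sketch under simplifying assumptions (binary labels, accuracy as the score, made-up sizes and function names), not a reproduction of the Kaggle setup or of the analysis in the papers above.

```python
import random


def accuracy(pred, truth):
    """Fraction of positions where pred agrees with truth."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)


def climb_leaderboard(n_public=400, n_private=400, n_submissions=400, seed=0):
    """Return (public accuracy, private accuracy) of the aggregated guess."""
    rng = random.Random(seed)
    n = n_public + n_private
    labels = [rng.randrange(2) for _ in range(n)]  # hidden ground truth
    public, private = slice(0, n_public), slice(n_public, n)

    kept = []
    for _ in range(n_submissions):
        guess = [rng.randrange(2) for _ in range(n)]  # pure noise, no data used
        # The only feedback consulted is the public leaderboard score.
        if accuracy(guess[public], labels[public]) > 0.5:
            kept.append(guess)

    # Majority vote over the submissions that happened to score well.
    votes = [sum(column) for column in zip(*kept)]
    final = [1 if 2 * v > len(kept) else 0 for v in votes]
    return (accuracy(final[public], labels[public]),
            accuracy(final[private], labels[private]))


if __name__ == "__main__":
    pub, priv = climb_leaderboard()
    print(f"public leaderboard accuracy: {pub:.2f}")
    print(f"private (final) accuracy:    {priv:.2f}")
```

The public score should come out well above 0.5 while the private score stays near 0.5, because the only thing the procedure has learned is the noise in the public holdout.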

1 P-value hacking

2 False discovery rate

FDR control…
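For concreteness, here is a minimal sketch of the Benjamini-Hochberg step-up procedure (Benjamini and Hochberg 1995) in plain Python: sort the m p-values, find the largest rank k with p_(k) ≤ kq/m, and reject the k smallest. The function name and the example p-values are made up for illustration, and the FDR guarantee at level q assumes independent (or suitably positively dependent) tests.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans, True where the null hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])

    # Step-up rule: find the largest 1-based rank k with p_(k) <= k * q / m.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k_max = rank

    # Reject every hypothesis whose p-value is among the k_max smallest.
    reject = [False] * m
    for idx in order[:k_max]:
        reject[idx] = True
    return reject


if __name__ == "__main__":
    p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
    print(benjamini_hochberg(p, q=0.05))
```

In this example only the two smallest p-values are rejected. Note the step-up behaviour: a p-value that misses its own threshold is still rejected if some larger p-value meets its threshold.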

3 Familywise error rate

Šidák correction, Bonferroni correction…
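Both corrections replace the per-test level α with a smaller threshold so that the chance of any false rejection among m tests stays at or below α. A minimal sketch with made-up function names: Bonferroni uses α/m and holds under arbitrary dependence, while Šidák uses 1 − (1 − α)^(1/m) and is exact when the tests are independent.

```python
def bonferroni_threshold(alpha, m):
    """Per-test level alpha / m; valid under any dependence between tests."""
    return alpha / m


def sidak_threshold(alpha, m):
    """Per-test level 1 - (1 - alpha)**(1/m); exact for independent tests."""
    return 1.0 - (1.0 - alpha) ** (1.0 / m)


def familywise_error(per_test_level, m):
    """P(at least one false rejection) for m independent true-null tests."""
    return 1.0 - (1.0 - per_test_level) ** m


if __name__ == "__main__":
    alpha, m = 0.05, 10
    for name, rule in (("Bonferroni", bonferroni_threshold),
                       ("Sidak", sidak_threshold)):
        t = rule(alpha, m)
        print(f"{name:10s} threshold={t:.6f}  FWER={familywise_error(t, m):.6f}")
```

The Šidák threshold is always slightly larger than the Bonferroni one, so it is marginally less conservative, at the price of the independence assumption.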

4 Post selection inference

See post selection inference.

5 Incoming

David Kadavy’s classic, grumpy essay A/A Testing: How I increased conversions 300% by doing absolutely nothing.

Multiple testing in Python: multipy

6 References

Abramovich, Benjamini, Donoho, et al. 2006. “Adapting to Unknown Sparsity by Controlling the False Discovery Rate.” The Annals of Statistics.
Aickin, and Gensler. 1996. “Adjusting for Multiple Testing When Reporting Research Results: The Bonferroni Vs Holm Methods.” American Journal of Public Health.
Ansley, and Kohn. 1985. “Estimation, Filtering, and Smoothing in State Space Models with Incompletely Specified Initial Conditions.” The Annals of Statistics.
Arnold, and Emerson. 2011. “Nonparametric Goodness-of-Fit Tests for Discrete Null Distributions.” The R Journal.
Bach. 2009. “Model-Consistent Sparse Estimation Through the Bootstrap.” arXiv:0901.3202 [Cs, Stat].
Barber, and Candès. 2015. “Controlling the False Discovery Rate via Knockoffs.” The Annals of Statistics.
Bashtannyk, and Hyndman. 2001. “Bandwidth Selection for Kernel Conditional Density Estimation.” Computational Statistics & Data Analysis.
Bassily, Nissim, Smith, et al. 2015. “Algorithmic Stability for Adaptive Data Analysis.” arXiv:1511.02513 [Cs].
Benjamini. 2010. “Simultaneous and Selective Inference: Current Successes and Future Challenges.” Biometrical Journal.
Benjamini, and Gavrilov. 2009. “A Simple Forward Selection Procedure Based on False Discovery Rate Control.” The Annals of Applied Statistics.
Benjamini, and Hochberg. 1995. “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing.” Journal of the Royal Statistical Society: Series B (Methodological).
Benjamini, and Yekutieli. 2005. “False Discovery Rate–Adjusted Multiple Confidence Intervals for Selected Parameters.” Journal of the American Statistical Association.
Berk, Brown, Buja, et al. 2013. “Valid Post-Selection Inference.” The Annals of Statistics.
Blum, and Hardt. 2015. “The Ladder: A Reliable Leaderboard for Machine Learning Competitions.” arXiv:1502.04585 [Cs].
Buckland, Burnham, and Augustin. 1997. “Model Selection: An Integral Part of Inference.” Biometrics.
Bühlmann, and van de Geer. 2015. “High-Dimensional Inference in Misspecified Linear Models.” arXiv:1503.06426 [Stat].
Bunea. 2004. “Consistent Covariate Selection and Post Model Selection Inference in Semiparametric Regression.” The Annals of Statistics.
Burnham, and Anderson. 2004. “Multimodel Inference Understanding AIC and BIC in Model Selection.” Sociological Methods & Research.
Cai, and Sun. 2017. “Large-Scale Global and Simultaneous Inference: Estimation and Testing in Very High Dimensions.” Annual Review of Economics.
Candès, Fan, Janson, et al. 2016. “Panning for Gold: Model-Free Knockoffs for High-Dimensional Controlled Variable Selection.” arXiv Preprint arXiv:1610.02351.
Candès, Romberg, and Tao. 2006. “Robust Uncertainty Principles: Exact Signal Reconstruction from Highly Incomplete Frequency Information.” IEEE Transactions on Information Theory.
Candès, Wakin, and Boyd. 2008. “Enhancing Sparsity by Reweighted ℓ1 Minimization.” Journal of Fourier Analysis and Applications.
Cavanaugh. 1997. “Unifying the Derivations for the Akaike and Corrected Akaike Information Criteria.” Statistics & Probability Letters.
Cavanaugh, and Shumway. 1998. “An Akaike Information Criterion for Model Selection in the Presence of Incomplete Data.” Journal of Statistical Planning and Inference.
Chernozhukov, Hansen, and Spindler. 2015. “Valid Post-Selection and Post-Regularization Inference: An Elementary, General Approach.” Annual Review of Economics.
Chung. 2020. “Introduction to Random Fields.”
Claeskens, Krivobokova, and Opsomer. 2009. “Asymptotic Properties of Penalized Spline Estimators.” Biometrika.
Clevenson, and Zidek. 1975. “Simultaneous Estimation of the Means of Independent Poisson Laws.” Journal of the American Statistical Association.
Collings, and Margolin. 1985. “Testing Goodness of Fit for the Poisson Assumption When Observations Are Not Identically Distributed.” Journal of the American Statistical Association.
Cox, D. R., and Battey. 2017. “Large Numbers of Explanatory Variables, a Semi-Descriptive Analysis.” Proceedings of the National Academy of Sciences.
Cox, Christopher R., and Rogers. 2021. “Finding Distributed Needles in Neural Haystacks.” Journal of Neuroscience.
Cule, Vineis, and De Iorio. 2011. “Significance Testing in Ridge Regression for Genetic Data.” BMC Bioinformatics.
Dai, and Barber. 2016. “The Knockoff Filter for FDR Control in Group-Sparse and Multitask Regression.” arXiv Preprint arXiv:1602.03589.
DasGupta. 2008. Asymptotic Theory of Statistics and Probability. Springer Texts in Statistics.
Delaigle, Hall, and Meister. 2008. “On Deconvolution with Repeated Measurements.” The Annals of Statistics.
Dezeure, Bühlmann, Meier, et al. 2014. “High-Dimensional Inference: Confidence Intervals, p-Values and R-Software Hdi.” arXiv:1408.4026 [Stat].
Donoho, and Johnstone. 1995. “Adapting to Unknown Smoothness via Wavelet Shrinkage.” Journal of the American Statistical Association.
Donoho, Johnstone, Kerkyacharian, et al. 1995. “Wavelet Shrinkage: Asymptopia?” Journal of the Royal Statistical Society: Series B (Methodological).
Dwork, Feldman, Hardt, et al. 2015a. “Preserving Statistical Validity in Adaptive Data Analysis.” In Proceedings of the Forty-Seventh Annual ACM on Symposium on Theory of Computing - STOC ’15.
Dwork, Feldman, Hardt, et al. 2015b. “The Reusable Holdout: Preserving Validity in Adaptive Data Analysis.” Science.
Efird, and Nielsen. 2008. “A Method to Compute Multiplicity Corrected Confidence Intervals for Odds Ratios and Other Relative Effect Estimates.” International Journal of Environmental Research and Public Health.
Efron, B. 1979. “Bootstrap Methods: Another Look at the Jackknife.” The Annals of Statistics.
Efron, Bradley. 1986. “How Biased Is the Apparent Error Rate of a Prediction Rule?” Journal of the American Statistical Association.
———. 2004a. “Selection and Estimation for Large-Scale Simultaneous Inference.”
———. 2004b. “The Estimation of Prediction Error.” Journal of the American Statistical Association.
———. 2007. “Doing Thousands of Hypothesis Tests at the Same Time.” Metron - International Journal of Statistics.
———. 2008. “Simultaneous Inference: When Should Hypothesis Testing Problems Be Combined?” The Annals of Applied Statistics.
———. 2009. “Empirical Bayes Estimates for Large-Scale Prediction Problems.” Journal of the American Statistical Association.
———. 2010a. “The Future of Indirect Evidence.” Statistical Science.
———. 2010b. “Correlated z-Values and the Accuracy of Large-Scale Statistical Estimates.” Journal of the American Statistical Association.
———. 2013. Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction.
Evans, and Didelez. n.d. “Recovering from Selection Bias Using Marginal Structure in Discrete Models.”
Ewald, and Schneider. 2015. “Confidence Sets Based on the Lasso Estimator.” arXiv:1507.05315 [Math, Stat].
Fan, and Li. 2001. “Variable Selection via Nonconcave Penalized Likelihood and Its Oracle Properties.” Journal of the American Statistical Association.
Fan, and Lv. 2010. “A Selective Overview of Variable Selection in High Dimensional Feature Space.” Statistica Sinica.
Franz, and von Luxburg. 2014. “Unconscious Lie Detection as an Example of a Widespread Fallacy in the Neurosciences.” arXiv:1407.4240 [q-Bio, Stat].
Friedman, Hastie, and Tibshirani. 2010. “Regularization Paths for Generalized Linear Models via Coordinate Descent.” Journal of Statistical Software.
Garreau, Lajugie, Arlot, et al. 2014. “Metric Learning for Temporal Sequence Alignment.” In Advances in Neural Information Processing Systems 27.
Gelman, and Loken. 2014. “The Statistical Crisis in Science.” American Scientist.
Genovese, and Wasserman. 2008. “Adaptive Confidence Bands.” The Annals of Statistics.
Gonçalves, and White. 2004. “Maximum Likelihood and the Bootstrap for Nonlinear Dynamic Models.” Journal of Econometrics.
Hardt, and Ullman. 2014. “Preventing False Discovery in Interactive Data Analysis Is Hard.” In Proceedings of the 2014 IEEE 55th Annual Symposium on Foundations of Computer Science. FOCS ’14.
Hesterberg, Choi, Meier, et al. 2008. “Least Angle and ℓ1 Penalized Regression: A Review.” Statistics Surveys.
Hjort, Nils Lid. 1992. “On Inference in Parametric Survival Data Models.” International Statistical Review / Revue Internationale de Statistique.
Hjort, N. L., and Jones. 1996. “Locally Parametric Nonparametric Density Estimation.” The Annals of Statistics.
Hjort, Nils Lid, West, and Leurgans. 1992. “Semiparametric Estimation of Parametric Hazard Rates.” In Survival Analysis: State of the Art. Nato Science 211.
Hurvich, and Tsai. 1989. “Regression and Time Series Model Selection in Small Samples.” Biometrika.
Ichimura. 1993. “Semiparametric Least Squares (SLS) and Weighted SLS Estimation of Single-Index Models.” Journal of Econometrics.
Ioannidis. 2005. “Why Most Published Research Findings Are False.” PLoS Medicine.
Iyengar, and Greenhouse. 1988. “Selection Models and the File Drawer Problem.” Statistical Science.
Jamieson, and Jain. n.d. “A Bandit Approach to Multiple Testing with False Discovery Control.”
Janková, and van de Geer. 2015. “Honest Confidence Regions and Optimality in High-Dimensional Precision Matrix Estimation.” arXiv:1507.02061 [Math, Stat].
Janson, Fithian, and Hastie. 2015. “Effective Degrees of Freedom: A Flawed Metaphor.” Biometrika.
Kaufman, and Rosset. 2014. “When Does More Regularization Imply Fewer Degrees of Freedom? Sufficient Conditions and Counterexamples.” Biometrika.
Konishi, and Kitagawa. 1996. “Generalised Information Criteria in Model Selection.” Biometrika.
Korattikara, Chen, and Welling. 2015. “Sequential Tests for Large-Scale Learning.” Neural Computation.
Korthauer, Kimes, Duvallet, et al. 2019. “A Practical Guide to Methods Controlling False Discoveries in Computational Biology.” Genome Biology.
Künsch. 1986. “Discrimination Between Monotonic Trends and Long-Range Dependence.” Journal of Applied Probability.
Lancichinetti, Sirer, Wang, et al. 2015. “High-Reproducibility and High-Accuracy Method for Automated Topic Classification.” Physical Review X.
Lavergne, Maistre, and Patilea. 2015. “A Significance Test for Covariates in Nonparametric Regression.” Electronic Journal of Statistics.
Lazzeroni, and Ray. 2012. “The Cost of Large Numbers of Hypothesis Tests on Power, Effect Size and Sample Size.” Molecular Psychiatry.
Lee, Sun, Sun, et al. 2013. “Exact Post-Selection Inference, with Application to the Lasso.” arXiv:1311.6238 [Math, Stat].
Li, and Liang. 2008. “Variable Selection in Semiparametric Regression Modeling.” The Annals of Statistics.
Lockhart, Taylor, Tibshirani, et al. 2014. “A Significance Test for the Lasso.” The Annals of Statistics.
Meinshausen. 2006. “False Discovery Control for Multiple Tests of Association Under General Dependence.” Scandinavian Journal of Statistics.
———. 2007. “Relaxed Lasso.” Computational Statistics & Data Analysis.
———. 2014. “Group Bound: Confidence Intervals for Groups of Variables in Sparse High Dimensional Regression Without Assumptions on the Design.” Journal of the Royal Statistical Society: Series B (Statistical Methodology).
Meinshausen, and Bühlmann. 2005. “Lower Bounds for the Number of False Null Hypotheses for Multiple Testing of Associations Under General Dependence Structures.” Biometrika.
———. 2006. “High-Dimensional Graphs and Variable Selection with the Lasso.” The Annals of Statistics.
———. 2010. “Stability Selection.” Journal of the Royal Statistical Society: Series B (Statistical Methodology).
Meinshausen, Meier, and Bühlmann. 2009. “P-Values for High-Dimensional Regression.” Journal of the American Statistical Association.
Meinshausen, and Rice. 2006. “Estimating the Proportion of False Null Hypotheses Among a Large Number of Independently Tested Hypotheses.” The Annals of Statistics.
Meinshausen, and Yu. 2009. “Lasso-Type Recovery of Sparse Representations for High-Dimensional Data.” The Annals of Statistics.
Müller, and Behnke. 2014. “Pystruct - Learning Structured Prediction in Python.” Journal of Machine Learning Research.
Nickl, and van de Geer. 2013. “Confidence Sets in Sparse Regression.” The Annals of Statistics.
Noble. 2009. “How Does Multiple Testing Correction Work?” Nature Biotechnology.
Ramsey, Glymour, Sanchez-Romero, et al. 2017. “A Million Variables and More: The Fast Greedy Equivalence Search Algorithm for Learning High-Dimensional Graphical Causal Models, with an Application to Functional Magnetic Resonance Images.” International Journal of Data Science and Analytics.
Rosset, and Zhu. 2007. “Piecewise Linear Regularized Solution Paths.” The Annals of Statistics.
Rothman. 1990. “No Adjustments Are Needed for Multiple Comparisons.” Epidemiology (Cambridge, Mass.).
Rzhetsky, Foster, Foster, et al. 2015. “Choosing Experiments to Accelerate Collective Discovery.” Proceedings of the National Academy of Sciences.
Siegmund, and Li. 2014. “Higher Criticism: P-Values and Criticism.” arXiv:1411.1437 [Math, Stat].
Stone. 1977. “An Asymptotic Equivalence of Choice of Model by Cross-Validation and Akaike’s Criterion.” Journal of the Royal Statistical Society: Series B (Methodological).
Storey. 2002. “A Direct Approach to False Discovery Rates.” Journal of the Royal Statistical Society: Series B (Statistical Methodology).
Su, Bogdan, and Candès. 2015. “False Discoveries Occur Early on the Lasso Path.” arXiv:1511.01957 [Cs, Math, Stat].
Taddy. 2013. “One-Step Estimator Paths for Concave Regularization.” arXiv:1308.5623 [Stat].
Tansey, Koyejo, Poldrack, et al. 2014. “False Discovery Rate Smoothing.” arXiv:1411.6144 [Stat].
Tansey, Padilla, Suggala, et al. 2015. “Vector-Space Markov Random Fields via Exponential Families.” In Journal of Machine Learning Research.
Taylor, Lockhart, Tibshirani, et al. 2014. “Exact Post-Selection Inference for Forward Stepwise and Least Angle Regression.” arXiv:1401.3889 [Stat].
Tibshirani. 2014. “A General Framework for Fast Stagewise Algorithms.” arXiv:1408.5801 [Stat].
Tibshirani, Rinaldo, Tibshirani, et al. 2015. “Uniform Asymptotic Inference and the Bootstrap After Model Selection.” arXiv:1506.06266 [Math, Stat].
van de Geer, Bühlmann, Ritov, et al. 2014. “On Asymptotically Optimal Confidence Regions and Tests for High-Dimensional Models.” The Annals of Statistics.
van de Geer, and Lederer. 2011. “The Lasso, Correlated Design, and Improved Oracle Inequalities.” arXiv:1107.0189 [Stat].
Wasserman, and Roeder. 2009. “High-Dimensional Variable Selection.” The Annals of Statistics.
Zhang, and Zhang. 2014. “Confidence Intervals for Low Dimensional Parameters in High Dimensional Linear Models.” Journal of the Royal Statistical Society: Series B (Statistical Methodology).
Zou, Hastie, and Tibshirani. 2007. “On the ‘Degrees of Freedom’ of the Lasso.” The Annals of Statistics.