Model/hyperparameter selection



Choosing which of an ensemble of models to use or, what amounts to more or less the same thing, choosing the number of predictors, or the degree of regularisation. This is a kind of complement to statistical learning theory, where you hope to quantify how complicated a model you should bother fitting to a given amount of data.

If your predictors are discrete and few in number, you can do this in the traditional fashion, by stepwise model selection, and you might discuss the degrees of freedom of the model and the data. If you are in the luxurious position of having a small, tractable number of parameters and the ability to perform controlled trials, then you do ANOVA.
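For concreteness, here is a minimal sketch of greedy forward stepwise selection scored by AIC, in plain numpy. The synthetic data, the Gaussian-likelihood AIC formula, and the stopping rule are all illustrative assumptions on my part, not a recommendation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.standard_normal((n, p))
y = X[:, 0] - 2 * X[:, 3] + rng.standard_normal(n)  # only predictors 0 and 3 matter

def gaussian_aic(y, X_sub):
    """AIC for OLS with unknown noise variance: n*log(RSS/n) + 2k, up to constants."""
    beta, *_ = np.linalg.lstsq(X_sub, y, rcond=None)
    rss = np.sum((y - X_sub @ beta) ** 2)
    k = X_sub.shape[1] + 1  # regression coefficients plus the variance parameter
    return len(y) * np.log(rss / len(y)) + 2 * k

selected, remaining = [], list(range(p))
best_aic = np.inf
while remaining:
    scores = {j: gaussian_aic(y, X[:, selected + [j]]) for j in remaining}
    j_best = min(scores, key=scores.get)
    if scores[j_best] >= best_aic:
        break  # no candidate improves the criterion; stop
    best_aic = scores[j_best]
    selected.append(j_best)
    remaining.remove(j_best)

print(selected)  # should include 0 and 3, possibly plus spurious extras (AIC overselects)
```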

When the model has penalisation parameters, we sometimes phrase this as regularisation and talk about regularisation parameter selection, or hyperparameter selection, which we can do in various ways. Methods include degrees-of-freedom penalties, cross-validation, and so on. However, I am not yet sure how to make these work in sparse regression.
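As a sketch of the cross-validation option, here is k-fold selection of a ridge penalty Ξ» by held-out squared error. The data, the penalty grid, and the contiguous fold splits are placeholder assumptions; for i.i.d. data they are harmless, but they would not do for time series.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 120, 30
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) * 0.3 + rng.standard_normal(n)

def ridge_fit(X, y, lam):
    """Closed-form ridge solution (X'X + lam*I)^{-1} X'y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def cv_error(X, y, lam, n_folds=5):
    """Mean held-out squared error over contiguous folds."""
    folds = np.array_split(np.arange(len(y)), n_folds)
    errs = []
    for test in folds:
        train = np.setdiff1d(np.arange(len(y)), test)
        beta = ridge_fit(X[train], y[train], lam)
        errs.append(np.mean((y[test] - X[test] @ beta) ** 2))
    return np.mean(errs)

lambdas = np.logspace(-3, 3, 13)
best_lam = min(lambdas, key=lambda lam: cv_error(X, y, lam))
print(best_lam)
```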

Multiple testing is model selection writ large: you consider many hypothesis tests, possibly effectively infinitely many, or you face a combinatorial explosion of possible predictors to include.
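To make that connection concrete, here is a bare-bones Benjamini-Hochberg step-up procedure for FDR control over a vector of p-values, in the spirit of the FDR-controlled selection literature cited below. The synthetic mix of null and non-null p-values is my own toy example.

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.1):
    """Return indices of hypotheses rejected at FDR level q (BH step-up)."""
    pvals = np.asarray(pvals)
    m = len(pvals)
    order = np.argsort(pvals)
    thresholds = q * np.arange(1, m + 1) / m
    below = np.flatnonzero(pvals[order] <= thresholds)
    if below.size == 0:
        return np.array([], dtype=int)
    k = below.max()          # largest i with p_(i) <= q*i/m
    return order[: k + 1]    # reject the k smallest p-values

# Illustrative mix of null (uniform) and non-null (tiny) p-values.
rng = np.random.default_rng(2)
pvals = np.concatenate([rng.uniform(size=90), rng.uniform(0, 0.001, size=10)])
print(benjamini_hochberg(pvals, q=0.1))
```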

πŸ— document connection with graphical models and thus conditional independence tests.

Bayesian

Bayesian model selection is also a thing, although the framing is a little different. In the classic Bayesian method I keep all my models around, although some might become very unlikely. But apparently I can also throw some out entirely? Presumably for reasons of computational tractability or what-have-you.
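For a flavour of the keep-everything approach, here is a toy posterior-over-models calculation using the BIC approximation to the log marginal likelihood, a crude but standard shortcut. The candidate models, uniform model prior, and data are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
x = rng.standard_normal(n)
y = 1.0 + 2.0 * x + rng.standard_normal(n)  # truth is the linear model

def bic(y, X):
    """BIC for OLS with unknown variance; -BIC/2 approximates log marginal likelihood."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    k = X.shape[1] + 1
    return len(y) * np.log(rss / len(y)) + k * np.log(len(y))

designs = {  # candidate models: intercept-only, linear, quadratic
    "constant": np.ones((n, 1)),
    "linear": np.column_stack([np.ones(n), x]),
    "quadratic": np.column_stack([np.ones(n), x, x**2]),
}
log_evidence = {name: -bic(y, X) / 2 for name, X in designs.items()}
# Posterior over models under a uniform prior: keep all of them around,
# even the ones that become very unlikely.
lse = np.logaddexp.reduce(list(log_evidence.values()))
posterior = {name: np.exp(le - lse) for name, le in log_evidence.items()}
print(posterior)
```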

Consistency

If the model order itself is the parameter of interest, how do you do consistent inference on that? AIC, for example, is derived to optimise prediction loss, not to recover the true model order. (Doesn’t BIC do better? Its log-n penalty is order-consistent under some regularity conditions.)
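Here is a small simulation sketch of that distinction: selecting polynomial order by AIC versus BIC on samples of increasing size. The Gaussian-regression AIC/BIC formulas are standard; everything else (the data-generating process, the order grid) is made up for illustration. BIC’s heavier log-n complexity penalty is what buys order consistency, at some cost in prediction.

```python
import numpy as np

rng = np.random.default_rng(4)

def ic_order(y, x, max_order=6, penalty="bic"):
    """Pick polynomial order by an information criterion, assuming Gaussian OLS."""
    n = len(y)
    best, best_score = None, np.inf
    for d in range(max_order + 1):
        X = np.vander(x, d + 1)  # columns x^d, ..., x^0
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        rss = np.sum((y - X @ beta) ** 2)
        k = d + 2  # coefficients plus the noise variance
        pen = k * np.log(n) if penalty == "bic" else 2 * k
        score = n * np.log(rss / n) + pen
        if score < best_score:
            best, best_score = d, score
    return best

for n in (50, 500, 5000):
    x = rng.uniform(-1, 1, n)
    y = 1 + x - x**2 + 0.5 * rng.standard_normal(n)  # true order is 2
    print(n, "AIC:", ic_order(y, x, penalty="aic"), "BIC:", ic_order(y, x, penalty="bic"))
```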

An exhausting, exhaustive review of various model selection procedures, with an eye to consistency, is given in C. R. Rao and Wu (2001).

Cross validation

See cross validation.

For mixture models

See mixture models.

Under sparsity

See sparse model selection.

For time series

See model selection in time series.

References

Aghasi, Alireza, Nam Nguyen, and Justin Romberg. 2016. β€œNet-Trim: A Layer-Wise Convex Pruning of Deep Neural Networks.” arXiv:1611.05162 [Cs, Stat], November.
Alquier, Pierre, and Olivier Wintenberger. 2012. β€œModel Selection for Weakly Dependent Time Series Forecasting.” Bernoulli.
Andersen, Per Kragh, Ornulf Borgan, Richard D. Gill, and Niels Keiding. 1997. Statistical Models Based on Counting Processes. Corrected 2nd printing. Springer Series in Statistics. New York, NY: Springer.
Andrews, Donald W. K. 1991. β€œAsymptotic Optimality of Generalized \(C_L\), Cross-Validation, and Generalized Cross-Validation in Regression with Heteroskedastic Errors.” Journal of Econometrics 47 (2): 359–77.
Ansley, Craig F., and Robert Kohn. 1985. β€œEstimation, Filtering, and Smoothing in State Space Models with Incompletely Specified Initial Conditions.” The Annals of Statistics 13 (4): 1286–316.
Barber, Rina Foygel, and Emmanuel J. CandΓ¨s. 2015. β€œControlling the False Discovery Rate via Knockoffs.” The Annals of Statistics 43 (5): 2055–85.
Benjamini, Yoav, and Yulia Gavrilov. 2009. β€œA Simple Forward Selection Procedure Based on False Discovery Rate Control.” The Annals of Applied Statistics 3 (1): 179–98.
Bickel, Peter J., Bo Li, Alexandre B. Tsybakov, Sara A. van de Geer, Bin Yu, TeΓ³filo ValdΓ©s, Carlos Rivero, Jianqing Fan, and Aad van der Vaart. 2006. β€œRegularization in Statistics.” Test 15 (2): 271–344.
BirgΓ©, Lucien. 2008. β€œModel Selection for Density Estimation with L2-Loss.” arXiv:0808.1416 [Math, Stat], August.
BirgΓ©, Lucien, and Pascal Massart. 2006. β€œMinimal Penalties for Gaussian Model Selection.” Probability Theory and Related Fields 138 (1-2): 33–73.
Bloniarz, Adam, Hanzhong Liu, Cun-Hui Zhang, Jasjeet Sekhon, and Bin Yu. 2015. β€œLasso Adjustments of Treatment Effect Estimates in Randomized Experiments.” arXiv:1507.03652 [Math, Stat], July.
Broersen, Petrus MT. 2006. Automatic Autocorrelation and Spectral Analysis. Secaucus, NJ, USA: Springer.
BΓΌhlmann, Peter, and Hans R KΓΌnsch. 1999. β€œBlock Length Selection in the Bootstrap for Time Series.” Computational Statistics & Data Analysis 31 (3): 295–310.
Burman, P., and D. Nolan. 1995. β€œA General Akaike-Type Criterion for Model Selection in Robust Regression.” Biometrika 82 (4): 877–86.
Burnham, Kenneth P., and David Raymond Anderson. 2002. Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. 2nd ed. New York: Springer.
Cai, T. Tony, and Wenguang Sun. 2017. β€œLarge-Scale Global and Simultaneous Inference: Estimation and Testing in Very High Dimensions.” Annual Review of Economics 9 (1): 411–39.
CandΓ¨s, Emmanuel J., Yingying Fan, Lucas Janson, and Jinchi Lv. 2016. β€œPanning for Gold: Model-Free Knockoffs for High-Dimensional Controlled Variable Selection.” arXiv Preprint arXiv:1610.02351.
CandΓ¨s, Emmanuel J., Michael B. Wakin, and Stephen P. Boyd. 2008. β€œEnhancing Sparsity by Reweighted β„“1 Minimization.” Journal of Fourier Analysis and Applications 14 (5-6): 877–905.
Cawley, Gavin C., and Nicola L. C. Talbot. 2010. β€œOn Over-Fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation.” Journal of Machine Learning Research 11 (July): 2079–2107.
Chan, Ngai Hang, Ye Lu, and Chun Yip Yau. 2016. β€œFactor Modelling for High-Dimensional Time Series: Inference and Model Selection.” Journal of Time Series Analysis, January.
Chernozhukov, Victor, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. 2016. β€œDouble/Debiased Machine Learning for Treatment and Causal Parameters.” arXiv:1608.00060 [Econ, Stat], July.
Chernozhukov, Victor, Christian Hansen, Yuan Liao, and Yinchu Zhu. 2018. β€œInference For Heterogeneous Effects Using Low-Rank Estimations.” arXiv:1812.08089 [Math, Stat], December.
Chernozhukov, Victor, Whitney K. Newey, and Rahul Singh. 2018. β€œLearning L2 Continuous Regression Functionals via Regularized Riesz Representers.” arXiv:1809.05224 [Econ, Math, Stat], September.
Claeskens, Gerda, and Nils Lid Hjort. 2008. Model Selection and Model Averaging. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge ; New York: Cambridge University Press.
Cox, D. R., and H. S. Battey. 2017. β€œLarge Numbers of Explanatory Variables, a Semi-Descriptive Analysis.” Proceedings of the National Academy of Sciences 114 (32): 8592–95.
Dai, Ran, and Rina Foygel Barber. 2016. β€œThe Knockoff Filter for FDR Control in Group-Sparse and Multitask Regression.” arXiv Preprint arXiv:1602.03589.
Ding, J., V. Tarokh, and Y. Yang. 2018. β€œModel Selection Techniques: An Overview.” IEEE Signal Processing Magazine 35 (6): 16–34.
Efron, Bradley. 1986. β€œHow Biased Is the Apparent Error Rate of a Prediction Rule?” Journal of the American Statistical Association 81 (394): 461–70.
Elhamifar, E., and R. Vidal. 2013. β€œSparse Subspace Clustering: Algorithm, Theory, and Applications.” IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (11): 2765–81.
Fan, Jianqing, and Runze Li. 2001. β€œVariable Selection via Nonconcave Penalized Likelihood and Its Oracle Properties.” Journal of the American Statistical Association 96 (456): 1348–60.
Fan, Jianqing, and Jinchi Lv. 2008. β€œSure Independence Screening for Ultrahigh Dimensional Feature Space.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70 (5): 849–911.
Geman, Stuart, and Chii-Ruey Hwang. 1982. β€œNonparametric Maximum Likelihood Estimation by the Method of Sieves.” The Annals of Statistics 10 (2): 401–14.
Guyon, Isabelle, and AndrΓ© Elisseeff. 2003. β€œAn Introduction to Variable and Feature Selection.” Journal of Machine Learning Research 3 (Mar): 1157–82.
Hong, X., R. J. Mitchell, S. Chen, C. J. Harris, K. Li, and G. W. Irwin. 2008. β€œModel Selection Approaches for Non-Linear System Identification: A Review.” International Journal of Systems Science 39 (10): 925–46.
Ishwaran, Hemant, and J. Sunil Rao. 2005. β€œSpike and Slab Variable Selection: Frequentist and Bayesian Strategies.” The Annals of Statistics 33 (2): 730–73.
Jamieson, Kevin, and Ameet Talwalkar. 2015. β€œNon-Stochastic Best Arm Identification and Hyperparameter Optimization.” arXiv:1502.07943 [Cs, Stat], February.
Janson, Lucas, William Fithian, and Trevor J. Hastie. 2015. β€œEffective Degrees of Freedom: A Flawed Metaphor.” Biometrika 102 (2): 479–85.
Johnson, Jerald B., and Kristian S. Omland. 2004. β€œModel Selection in Ecology and Evolution.” Trends in Ecology & Evolution 19 (2): 101–8.
Kloft, Marius, Ulrich RΓΌckert, and Peter L. Bartlett. 2010. β€œA Unifying View of Multiple Kernel Learning.” In Machine Learning and Knowledge Discovery in Databases, edited by JosΓ© Luis BalcΓ‘zar, Francesco Bonchi, Aristides Gionis, and MichΓ¨le Sebag, 66–81. Lecture Notes in Computer Science. Springer Berlin Heidelberg.
Konishi, Sadanori, and G. Kitagawa. 2008. Information Criteria and Statistical Modeling. Springer Series in Statistics. New York: Springer.
Konishi, Sadanori, and Genshiro Kitagawa. 1996. β€œGeneralised Information Criteria in Model Selection.” Biometrika 83 (4): 875–90.
Li, Ker-Chau. 1987. β€œAsymptotic Optimality for \(C_p, C_L\), Cross-Validation and Generalized Cross-Validation: Discrete Index Set.” The Annals of Statistics 15 (3): 958–75.
Li, Lisha, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. 2016. β€œEfficient Hyperparameter Optimization and Infinitely Many Armed Bandits.” arXiv:1603.06560 [Cs, Stat], March.
Lundberg, Scott M, and Su-In Lee. 2017. β€œA Unified Approach to Interpreting Model Predictions.” In Advances in Neural Information Processing Systems. Vol. 30. Curran Associates, Inc.
Machado, JosΓ© A.F. 1993. β€œRobust Model Selection and M-Estimation.” Econometric Theory 9 (03): 478–93.
Massart, Pascal. 2007. Concentration Inequalities and Model Selection: Ecole d’EtΓ© de ProbabilitΓ©s de Saint-Flour XXXIII - 2003. Lecture Notes in Mathematics 1896. Berlin ; New York: Springer-Verlag.
Meinshausen, Nicolai, and Bin Yu. 2009. β€œLasso-Type Recovery of Sparse Representations for High-Dimensional Data.” The Annals of Statistics 37 (1): 246–70.
Navarro, Danielle J. 2019. β€œBetween the Devil and the Deep Blue Sea: Tensions Between Scientific Judgement and Statistical Model Selection.” Computational Brain & Behavior 2 (1): 28–34.
Paparoditis, Efstathios, and Theofanis Sapatinas. 2014. β€œBootstrap-Based Testing for Functional Data.” arXiv:1409.4317 [Math, Stat], September.
Qian, Guoqi. 1996. β€œOn Model Selection in Robust Linear Regression.”
Qian, Guoqi, and Hans R. KΓΌnsch. 1996. β€œSome Notes on Rissanen’s Stochastic Complexity.”
Qian, Guoqi, and Hans R. KΓΌnsch. 1998. β€œOn Model Selection via Stochastic Complexity in Robust Linear Regression.” Journal of Statistical Planning and Inference 75 (1): 91–116.
Rao, C. R., and Y. Wu. 2001. β€œOn Model Selection.” In Institute of Mathematical Statistics Lecture Notes - Monograph Series, 38:1–57. Beachwood, OH: Institute of Mathematical Statistics.
Rao, Radhakrishna, and Yuehua Wu. 1989. β€œA Strongly Consistent Procedure for Model Selection in a Regression Problem.” Biometrika 76 (2): 369–74.
RočkovΓ‘, Veronika, and Edward I. George. 2018. β€œThe Spike-and-Slab LASSO.” Journal of the American Statistical Association 113 (521): 431–44.
Ronchetti, E. 2000. β€œRobust Regression Methods and Model Selection.” In Data Segmentation and Model Selection for Computer Vision, edited by Alireza Bab-Hadiashar and David Suter, 31–40. Springer New York.
Royall, Richard M. 1986. β€œModel Robust Confidence Intervals Using Maximum Likelihood Estimators.” International Statistical Review / Revue Internationale de Statistique 54 (2): 221–26.
Shao, Jun. 1996. β€œBootstrap Model Selection.” Journal of the American Statistical Association 91 (434): 655–65.
Shen, Xiaotong, and Hsin-Cheng Huang. 2006. β€œOptimal Model Assessment, Selection, and Combination.” Journal of the American Statistical Association 101 (474): 554–68.
Shen, Xiaotong, Hsin-Cheng Huang, and Jimmy Ye. 2004. β€œAdaptive Model Selection and Assessment for Exponential Family Distributions.” Technometrics 46 (3): 306–17.
Shen, Xiaotong, and Jianming Ye. 2002. β€œAdaptive Model Selection.” Journal of the American Statistical Association 97 (457): 210–21.
Shibata, Ritei. 1989. β€œStatistical Aspects of Model Selection.” In From Data to Model, edited by Professor Jan C. Willems, 215–40. Springer Berlin Heidelberg.
Stein, Charles M. 1981. β€œEstimation of the Mean of a Multivariate Normal Distribution.” The Annals of Statistics 9 (6): 1135–51.
Stone, M. 1977. β€œAn Asymptotic Equivalence of Choice of Model by Cross-Validation and Akaike’s Criterion.” Journal of the Royal Statistical Society. Series B (Methodological) 39 (1): 44–47.
Takeuchi, Kei. 1976. β€œDistribution of Informational Statistics and a Criterion of Model Fitting.” Suri-Kagaku (Mathematical Sciences) 153 (1): 12–18.
Taylor, Jonathan, Richard Lockhart, Ryan J. Tibshirani, and Robert Tibshirani. 2014. β€œExact Post-Selection Inference for Forward Stepwise and Least Angle Regression.” arXiv:1401.3889 [Stat], January.
Tharmaratnam, Kukatharmini, and Gerda Claeskens. 2013. β€œA Comparison of Robust Versions of the AIC Based on M-, S- and MM-Estimators.” Statistics 47 (1): 216–35.
Tibshirani, Ryan J., Alessandro Rinaldo, Robert Tibshirani, and Larry Wasserman. 2015. β€œUniform Asymptotic Inference and the Bootstrap After Model Selection.” arXiv:1506.06266 [Math, Stat], June.
Tibshirani, Ryan J., and Jonathan Taylor. 2012. β€œDegrees of Freedom in Lasso Problems.” The Annals of Statistics 40 (2): 1198–1232.
Vansteelandt, Stijn, Maarten Bekaert, and Gerda Claeskens. 2012. β€œOn Model Selection and Model Misspecification in Causal Inference.” Statistical Methods in Medical Research 21 (1): 7–30.
Wahba, Grace. 1985. β€œA Comparison of GCV and GML for Choosing the Smoothing Parameter in the Generalized Spline Smoothing Problem.” The Annals of Statistics 13 (4): 1378–1402.
Zhao, Peng, Guilherme Rocha, and Bin Yu. 2006. β€œGrouped and Hierarchical Model Selection Through Composite Absolute Penalties.”
β€”β€”β€”. 2009. β€œThe Composite Absolute Penalties Family for Grouped and Hierarchical Variable Selection.” The Annals of Statistics 37 (6A): 3468–97.
Zhao, Peng, and Bin Yu. 2006. β€œOn Model Selection Consistency of Lasso.” Journal of Machine Learning Research 7 (Nov): 2541–63.
Zou, Hui, and Runze Li. 2008. β€œOne-Step Sparse Estimates in Nonconcave Penalized Likelihood Models.” The Annals of Statistics 36 (4): 1509–33.
