# Model complexity penalties

## Information criteria, degrees of freedom etc

Understanding the “degrees of freedom” of a model, and estimating that trace-of-a-penalty-matrix quantity, as seen in robust estimation and (possibly) AIC/BIC.

🏗 Explain AIC, $$C_p$$, SURE, and BIC-type degrees of freedom, and whatever variants there are out there.

Complexity penalties crop up in model selection. (i.e. choosing the complexity of model appropriate to your data.) Efron (2004) is an excellent introduction, compressing 30 years of theory into 2 pages. Massart (2000) seems more fashionable in flavour:

The reader who is not familiar with model selection via [complexity] penalization can legitimately ask the question: where does the idea of penalization come from? It is possible to answer this question at two different levels:

• at some intuitive level by presenting the heuristics of one of the first criterion of this kind which has been introduced by Akaike (1973);

• at some technical level by explaining why such a strategy of model selection has some chances to succeed.

This is an example of the kind of argumentation I need in order to use linear model approximations for the general application of DOF in sparse model selection.

Degrees of freedom is a familiar phrase for many statisticians. In linear regression the degrees of freedom is the number of estimated predictors. Degrees of freedom is often used to quantify the model complexity of a statistical modeling procedure. However, generally speaking, there is no exact correspondence between the degrees of freedom and the number of parameters in the model (Ye 1998). […] Stein’s unbiased risk estimation (SURE) theory gives a rigorous definition of the degrees of freedom for any fitting procedure. […] Efron showed that $$C_p$$ is an unbiased estimator of the true prediction error, and in some settings it offers substantially better accuracy than cross-validation and related nonparametric methods. Thus degrees of freedom plays an important role in model assessment and selection. Donoho and Johnstone used the SURE theory to derive the degrees of freedom of soft thresholding and showed that it leads to an adaptive wavelet shrinkage procedure called SureShrink. Ye (1998) showed that the degrees of freedom can capture the inherent uncertainty in modeling and frequentist model selection. Shen and Ye (2002) further proved that the degrees of freedom provides an adaptive model selection criterion that performs better than the fixed-penalty model selection criteria.
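A minimal numeric sketch of this trace definition of effective degrees of freedom, for ridge regression, where the smoother (“hat”) matrix is available in closed form. The data dimensions and penalty values here are arbitrary assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 10
X = rng.standard_normal((n, p))

def ridge_dof(X, lam):
    """Effective degrees of freedom of ridge regression:
    the trace of the smoother matrix H = X (X'X + lam*I)^{-1} X'."""
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T)
    return np.trace(H)

# With no penalty the trace recovers the raw parameter count p;
# increasing the penalty shrinks the effective dimension below p,
# which is exactly the sense in which DOF need not equal the
# number of parameters.
print(ridge_dof(X, 0.0))    # = 10 (up to floating point)
print(ridge_dof(X, 10.0))   # strictly between 0 and 10
```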

## Information Criteria

Akaike and friends. With M-estimation (e.g. maximum likelihood estimation and robust estimation) these are marvelous and general shortcuts for model selection (i.e. choosing the complexity of model appropriate to your data) without resorting to computationally expensive cross-validation.

For all of these, a thing called the number of effective degrees of freedom is important. There are several different definitions for that, and they only sometimes coincide, so I leave that for a different notebook. Claeskens and Hjort (2008) and Konishi and Kitagawa (2008) are probably canonical.

Information criteria can ideally do the same thing as cross-validation (i.e. select the ideal regularisation given possible models and data) at a small fraction of the computational cost. Indeed, they are asymptotically equivalent; see below.
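A toy sketch of the shortcut: selecting a polynomial order by AIC from a single fit per candidate, instead of cross-validating. The data-generating process and the Gaussian profiled-likelihood form of AIC are my assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.uniform(-2, 2, n)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.standard_normal(n)  # truth: quadratic

def aic_gaussian(y, yhat, k):
    """AIC for a Gaussian model with k fitted coefficients and the
    noise variance profiled out: n*log(RSS/n) + 2k."""
    rss = np.sum((y - yhat) ** 2)
    return len(y) * np.log(rss / len(y)) + 2 * k

scores = {}
for degree in range(6):
    coefs = np.polyfit(x, y, degree)
    scores[degree] = aic_gaussian(y, np.polyval(coefs, x), degree + 1)

# One fit per candidate, no held-out folds needed.
best = min(scores, key=scores.get)
print(best)  # the quadratic (degree 2) typically wins here
```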

To learn:

• How this interacts with robust estimators

• How to use AIC with nonparametric or high dimensional methods (GIC)

• How it relates to minimum description length (e.g. Andrew R. Barron et al. (2008))

Influential current English-language texts in this area are Burnham and Anderson (2002), Claeskens and Hjort (2008) and Konishi and Kitagawa (2008). The first of these is highly cited and brought the AIC method into the mainstream in the West, where it had previously been on the specialised fringes. The latter two focus on extensions such as TIC and GIC.

🏗 general description.

🏗 clarify relationship to Minimum Description Length, Rissanen-style.

Bondell, Krishna, and Ghosh (2010):

In the literature, selection criteria are usually classified into two categories: consistent (e.g., the Bayesian information criterion BIC, Schwarz, 1978) and efficient (e.g., the Akaike information criterion AIC, Akaike, 1974; the generalized cross-validation GCV, Craven and Wahba, 1979). A consistent criterion identifies the true model with a probability that approaches 1 in large samples when a set of candidate models contains the true model. An efficient criterion selects the model so that its average squared error is asymptotically equivalent to the minimum offered by the candidate models when the true model is approximated by a family of candidate models. Detailed discussions on efficiency and consistency can be found in Shibata (1981, 1984), Li (1987), Shao (1997) and McQuarrie and Tsai (1998).

The classic. 🏗

### Takeuchi Information Criterion (TIC)

Apparently this one was influential in Japan but long untranslated into English, so it was only belatedly common in the West. Good explanations are in Claeskens and Hjort (2008) and Konishi and Kitagawa (2008). It relaxes the assumption that the model is Fisher efficient (i.e. that the true generating process is included in your model, and that with enough data you would discover it).
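As a sanity check on the TIC idea, a rough numeric sketch: the TIC penalty is $$\operatorname{tr}(J^{-1}K)$$, where $$J$$ is minus the average Hessian of the log-likelihood and $$K$$ the empirical score covariance. When the model is well specified these coincide asymptotically, so the penalty should come out near the plain parameter count, recovering AIC. The Gaussian example and sample size are arbitrary choices of mine:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000
x = rng.standard_normal(n) * 2.0 + 1.0   # N(1, 4): truth is inside the model class

# ML estimates for a Gaussian parameterised by (mu, sigma^2)
mu, s2 = x.mean(), x.var()
r = x - mu

# Per-observation scores of the Gaussian log-likelihood
scores = np.column_stack([r / s2, -0.5 / s2 + r**2 / (2 * s2**2)])
K = scores.T @ scores / n                 # empirical score covariance

# J = minus the average per-observation Hessian, evaluated at the MLE
J = np.array([[1 / s2,            r.mean() / s2**2],
              [r.mean() / s2**2, -0.5 / s2**2 + (r**2).mean() / s2**3]])

tic_penalty = np.trace(np.linalg.solve(J, K))
print(tic_penalty)   # ≈ 2, the parameter count, since the model is well specified
```

Under misspecification (say, heavy-tailed data fed to this Gaussian model) the trace drifts away from 2, which is the correction TIC charges for.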

### Konishi and Kitagawa’s Generalised Information Criterion (GIC)

Taking information criteria to general (e.g. robust, penalised) M-estimation instead of purely ML estimation, and also relaxing the assumption that we even have the “true” model in our class (Konishi and Kitagawa (1996)); C&C Burman and Nolan (1995), probably others. In particular, you are no longer trying to fit the model by minimising least-squares errors, for example. Claeskens and Hjort (2008) mention the “Robustified Information Criterion” in passing, which may be related.

🏗 Explain my laborious reasoning that generalised Akaike information criteria for penalised regression don’t seem to work when the penalty term is not differentiable (cross-validation works fine though, and possibly also BIC), and the issues that therefore arise in model selection for such models in the sparse case.

### Focussed information criterion (FIC)

Claeskens and Hjort define this (Claeskens and Hjort (2008), chapter 6):

The model selection methods presented earlier (such as AIC and the BIC) have one thing in common: they select one single ‘best model’, which should then be used to explain all aspects of the mechanisms underlying the data and predict all future data points. The tolerance discussion in chapter 5 showed that sometimes one model is best for estimating one type of estimand, whereas another model is best for another estimand. The point of view expressed via the [FIC] is that a ‘best model’ should depend on the parameter under focus, such as the mean, or the variance, or the particular covariate values, etc. Thus the FIC allows and encourages different models to be selected for different parameters of interest.

This sounds very logical; of course, then one must do more work to make it go.

### Network information criterion

Murata, Yoshizawa, and Amari (1994): “an estimator of the expected loss of a loss function $$\ell(\theta)+\lambda H(\theta)$$ where $$H(\theta)$$ is a regularisation term”.

### Regularization information criterion

Shibata (1989) - is this distinct from GIC?

### Bootstrap information criterion

A compromise between the computational cheapness of information criteria and the practical simplicity of cross-validation.

Konishi and Kitagawa (2008) ch 8. See Claeskens and Hjort (2008) 6.3 for a bootstrap-FIC.

### Consistency of model order selected - AIC

Akaike information criteria are not asymptotically consistent (see Konishi and Kitagawa (2008)), in the sense that if there is a true model in the candidate set, you do not select it in the large-sample limit with probability 1. However, the distribution of selected model orders does not get worse as n increases. Burnham and Anderson (2002) 6.3 and Konishi and Kitagawa (2008) 3.5.2 discuss this. In a sense it would be surprising if AIC did do especially well at selecting model order, since the criterion is designed to minimise prediction error, not model selection error. Model order is more or less a nuisance parameter in this framework.
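A crude Monte Carlo illustration of the inconsistency (the simulation design is my own, not taken from the references): the probability that AIC admits a pure-noise covariate alongside the true one does not shrink as $$n$$ grows; it tends to P(chi2_1 > 2) ≈ 0.157.

```python
import numpy as np

rng = np.random.default_rng(2)

def aic(y, yhat, k):
    rss = np.sum((y - yhat) ** 2)
    return len(y) * np.log(rss / len(y)) + 2 * k

def overfit_rate(n, reps=300):
    """How often AIC prefers a model with a spurious covariate."""
    hits = 0
    for _ in range(reps):
        x1 = rng.standard_normal(n)
        x2 = rng.standard_normal(n)          # irrelevant noise covariate
        y = x1 + rng.standard_normal(n)      # true model uses x1 only
        X1 = np.column_stack([np.ones(n), x1])
        X2 = np.column_stack([np.ones(n), x1, x2])
        f1 = X1 @ np.linalg.lstsq(X1, y, rcond=None)[0]
        f2 = X2 @ np.linalg.lstsq(X2, y, rcond=None)[0]
        hits += aic(y, f2, 3) < aic(y, f1, 2)
    return hits / reps

# The over-selection rate stays bounded away from zero as n grows.
print(overfit_rate(50), overfit_rate(500))
```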

TBC.

### Cross-validation equivalence

Konishi and Kitagawa (2008), 10.1.4 discuss the asymptotic equivalence of AIC/TIC/GIC and cross-validation under various circumstances, attributing the equivalence results to Stone (1977) and Shibata (1989). Claeskens and Hjort (2008) prove a similar result.

### Automatic GIC

🏗; I know that Konishi and Kitagawa (1996) give formulae for loss functions for ANY M-estimation and penalisation procedure, but in general the degrees of freedom matrix trace calculation is nasty, and only in-principle estimable from the data, requiring a matrix product of the hessian at every data point. This is not necessarily computationally tractable - I know of formulae only for GLMs and robust regression with $$\ell_2$$ penalties. Can we get such penalties for more general ML fits?

### GIC & the LASSO

I thought this didn’t work because we needed the second derivative of the penalty; but see Bondell, Krishna, and Ghosh (2010).

### Information criteria at scale

Big-data information criteria. AIC is already computationally cheaper than cross-validation. What about when my data is so large that I would like to select my model before looking at all of it, with such-and-such a guarantee of goodness? Can I do AIC at scale? If I am fitting a model using SGD, can I estimate my model order using partial data? How? I’m interested in doing this in a way that preserves the property of being computationally cheaper than cross-validating.

Here’s an example… Bondell, Krishna, and Ghosh (2010):

In order to avoid complete enumeration of all possible $$2^{p+q}$$ models, Wolfinger (1993) and Diggle, Liang and Zeger (1994) recommended the Restricted Information Criterion (denoted by REML.IC), in that, by using the most complex mean structure, selection is first performed on the variance-covariance structure by computing the AIC and/or BIC. Given the best covariance structure, selection is then performed on the fixed effects. Alternatively, Pu and Niu (2006) proposed the EGIC (Extended GIC), where using the BIC, selection is first performed on the fixed effects by including all of the random effects into the model. Once the fixed effect structure is chosen, selection is then performed on the random effects.

In general I’d like to avoid enumerating the models as much as possible and simply select relevant predictors with high probability, compressive-sensing style.

## Consistent: Bayesian Information Criteria

a.k.a. Schwarz Information Criterion. Also co-invented by the unstoppable Akaike.

This is a different family to the original AIC. This has a justification in terms of MDL and of Bayes risk? Different regularity conditions, something something…
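For concreteness, a small sketch of how the two criteria differ, using the Gaussian profiled-likelihood forms; the per-parameter prices (2 for AIC, log n for BIC) are the standard ones, but the numbers plugged in are arbitrary:

```python
import numpy as np

def aic(rss, n, k):
    return n * np.log(rss / n) + 2 * k

def bic(rss, n, k):
    return n * np.log(rss / n) + k * np.log(n)

# Same goodness-of-fit term; only the price per parameter differs.
# BIC charges log(n) per parameter, so for n > e^2 ≈ 7.4 it is the
# stricter criterion, and its penalty grows without bound in n —-
# this unbounded penalty is what buys consistency: a spurious
# parameter's improvement in fit is O_p(1) and eventually rejected.
for n in (8, 100, 10_000):
    print(n, aic(1.0, n, 3), bic(1.0, n, 3))
```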

How would this work with regularisation? Apparently Machado (1993) extends the setup to robust inference, much as the GIC extends the AIC. Claeskens and Hjort (2008) give an easy summary and more general settings.

## Consistent and/or efficient: Nishii’s Generalised Information Criterion

Nishii (1984), commended by Zhang, Li, and Tsai (2010) as a unifying formalism for these efficient/consistent criteria, includes efficient-type and consistent-type information penalties as special cases. I don’t know much about this.

See QIC.

## References

Akaike, Hirotugu. 1973. In Proceedings of the Second International Symposium on Information Theory, edited by B. N. Petrov and F. Csáki, 199–213. Budapest: Akadémiai Kiadó.
Akaike, Hirotugu. 1978. Biometrika 65 (1): 53–59.
———. 1981. Journal of Econometrics 16 (1): 3–14.
Akaike, Hirotugu. 1973. Biometrika 60 (2): 255–65.
Ando, Tomohiro, Sadanori Konishi, and Seiya Imoto. 2008. Journal of Statistical Planning and Inference, Special Issue in Honor of Junjiro Ogawa (1915 - 2000): Design of Experiments, Multivariate Analysis and Statistical Inference, 138 (11): 3616–33.
Barron, Andrew R. 1986. The Annals of Probability 14 (1): 336–42.
Barron, Andrew R., Cong Huang, Jonathan Q. Li, and Xi Luo. 2008. In Information Theory Workshop, 2008. ITW’08. IEEE, 247–57. IEEE.
Barron, A., J. Rissanen, and Bin Yu. 1998. IEEE Transactions on Information Theory 44 (6): 2743–60.
Bashtannyk, David M., and Rob J. Hyndman. 2001. Computational Statistics & Data Analysis 36 (3): 279–98.
Bickel, Peter J., Bo Li, Alexandre B. Tsybakov, Sara A. van de Geer, Bin Yu, Teófilo Valdés, Carlos Rivero, Jianqing Fan, and Aad van der Vaart. 2006. Test 15 (2): 271–344.
Birgé, Lucien, and Pascal Massart. 2006. Probability Theory and Related Fields 138 (1-2): 33–73.
Bondell, Howard D., Arun Krishna, and Sujit K. Ghosh. 2010. Biometrics 66 (4): 1069–77.
Buckland, S. T., K. P. Burnham, and N. H. Augustin. 1997. Biometrics 53 (2): 603–18.
Bunea, Florentina. 2004. The Annals of Statistics 32 (3): 898–927.
Burman, P., and D. Nolan. 1995. Biometrika 82 (4): 877–86.
Burnham, Kenneth P., and David R. Anderson. 2004. Sociological Methods & Research 33 (2): 261–304.
Burnham, Kenneth P., and David Raymond Anderson. 2002. Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. 2nd ed. New York: Springer.
Cavanaugh, Joseph E. 1997. Statistics & Probability Letters 33 (2): 201–8.
Cavanaugh, Joseph E., and Robert H. Shumway. 1998. Journal of Statistical Planning and Inference 67 (1): 45–65.
Chen, Jiahua, and Zehua Chen. 2008. Biometrika 95 (3): 759–71.
Chichignoud, Michaël, Johannes Lederer, and Martin Wainwright. 2014. arXiv:1410.0247 [Math, Stat], October.
Claeskens, Gerda, and Nils Lid Hjort. 2008. Model Selection and Model Averaging. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge ; New York: Cambridge University Press.
Claeskens, Gerda, Tatyana Krivobokova, and Jean D. Opsomer. 2009. Biometrika 96 (3): 529–44.
Donoho, David L., and Iain M. Johnstone. 1995. Journal of the American Statistical Association 90 (432): 1200–1224.
Dossal, Charles, Maher Kachour, Jalal M. Fadili, Gabriel Peyré, and Christophe Chesneau. 2011. arXiv:1111.1162 [Cs, Math, Stat], November.
Efron, Bradley. 1986. Journal of the American Statistical Association 81 (394): 461–70.
———. 2004. Journal of the American Statistical Association 99 (467): 619–32.
———. 2021. Stats 4 (4): 1091–1115.
Fan, Jianqing, and Runze Li. 2001. Journal of the American Statistical Association 96 (456): 1348–60.
Firth, David. 1993. Biometrika 80 (1): 27–38.
Hastie, Trevor J., and Robert J. Tibshirani. 1990. Generalized Additive Models. Vol. 43. CRC Press.
Hu, Feifang, and James V. Zidek. 2002. The Canadian Journal of Statistics / La Revue Canadienne de Statistique 30 (3): 347–71.
Huang, Cong, G. L. H. Cheang, and Andrew R. Barron. 2008.
Huang, Jian, Shuange Ma, Huiliang Xie, and Cun-Hui Zhang. 2009. Biometrika 96 (2): 339–55.
Hurvich, Clifford M., Jeffrey S. Simonoff, and Chih-Ling Tsai. 1998. Journal of the Royal Statistical Society. Series B (Statistical Methodology) 60 (2): 271–93.
Hurvich, Clifford M., and Chih-Ling Tsai. 1989. Biometrika 76 (2): 297–307.
Imoto, Seiya, and Sadanori Konishi. 1999.
Janson, Lucas, William Fithian, and Trevor J. Hastie. 2015. Biometrika 102 (2): 479–85.
Kato, Kengo. 2009. Journal of Multivariate Analysis 100 (7): 1338–52.
Kaufman, S., and S. Rosset. 2014. Biometrika 101 (4): 771–84.
Konishi, Sadanori, and G. Kitagawa. 2008. Information Criteria and Statistical Modeling. Springer Series in Statistics. New York: Springer.
Konishi, Sadanori, and Genshiro Kitagawa. 1996. Biometrika 83 (4): 875–90.
———. 2003. Journal of Statistical Planning and Inference, C.R. Rao 80th Birthday Felicitation Volume, Part IV, 114 (1–2): 45–61.
Kosmidis, Ioannis. 2014. WIREs Computational Statistics 6 (3): 185–96.
Kosmidis, Ioannis, and Nicola Lunardon. 2020. arXiv:2001.03786 [Math, Stat], January.
Le, Tri, and Bertrand Clarke. 2017. Bayesian Analysis 12 (3): 807–29.
Leung, G., and A.R. Barron. 2006. IEEE Transactions on Information Theory 52 (8): 3396–3410.
Li, Jonathan Q., and Andrew R. Barron. 2000. In Advances in Neural Information Processing Systems 12, edited by S. A. Solla, T. K. Leen, and K. Müller, 279–85. MIT Press.
Li, Ker-Chau. 1987. The Annals of Statistics 15 (3): 958–75.
Li, Runze, and Hua Liang. 2008. The Annals of Statistics 36 (1): 261–86.
Lim, Néhémy, and Johannes Lederer. 2016. arXiv:1609.07195 [Stat], September.
Machado, José A.F. 1993. Econometric Theory 9 (03): 478–93.
Massart, Pascal. 2000. In Annales de La Faculté Des Sciences de Toulouse: Mathématiques, 9:245–303.
———. 2007. Concentration Inequalities and Model Selection: Ecole d’Eté de Probabilités de Saint-Flour XXXIII - 2003. Lecture Notes in Mathematics 1896. Berlin ; New York: Springer-Verlag.
Murata, N., S. Yoshizawa, and S. Amari. 1994. IEEE Transactions on Neural Networks 5 (6): 865–72.
Nishii, Ryuei. 1984. The Annals of Statistics 12 (2): 758–65.
Qian, Guoqi, and Hans R. Künsch. 1996.
Qian, Guoqi, and Hans R. Künsch. 1998. Journal of Statistical Planning and Inference 75 (1): 91–116.
Rao, C. R., and Y. Wu. 2001. In Institute of Mathematical Statistics Lecture Notes - Monograph Series, 38:1–57. Beachwood, OH: Institute of Mathematical Statistics.
Rao, Radhakrishna, and Yuehua Wu. 1989. Biometrika 76 (2): 369–74.
Rissanen, J. 1978. Automatica 14 (5): 465–71.
Saefken, Benjamin, Thomas Kneib, Clara-Sophie van Waveren, and Sonja Greven. 2014. Electronic Journal of Statistics 8 (1): 201–25.
Schwarz, Gideon. 1978. The Annals of Statistics 6 (2): 461–64.
Shen, Xiaotong, and Hsin-Cheng Huang. 2006. Journal of the American Statistical Association 101 (474): 554–68.
Shen, Xiaotong, Hsin-Cheng Huang, and Jimmy Ye. 2004. Technometrics 46 (3): 306–17.
Shen, Xiaotong, and Jianming Ye. 2002. Journal of the American Statistical Association 97 (457): 210–21.
Shibata, Ritei. 1989. In From Data to Model, edited by Professor Jan C. Willems, 215–40. Springer Berlin Heidelberg.
Stein, Charles M. 1981. The Annals of Statistics 9 (6): 1135–51.
Stone, M. 1977. Journal of the Royal Statistical Society. Series B (Methodological) 39 (1): 44–47.
———. 1979. Journal of the Royal Statistical Society. Series B (Methodological) 41 (2): 276–78.
Sugiura, Nariaki. 1978. Communications in Statistics - Theory and Methods 7 (1): 13–26.
Taddy, Matt. 2013. arXiv:1308.5623 [Stat], August.
Tharmaratnam, Kukatharmini, and Gerda Claeskens. 2013. Statistics 47 (1): 216–35.
Tibshirani, Robert. 1996. Journal of the Royal Statistical Society. Series B (Methodological) 58 (1): 267–88.
Tibshirani, Ryan J. 2015. Statistica Sinica 25 (3): 1265–96.
Ye, Jianming. 1998. Journal of the American Statistical Association 93 (441): 120–31.
Yuan, Ming, and Yi Lin. 2006. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68 (1): 49–67.
Zhang, Yiyun, Runze Li, and Chih-Ling Tsai. 2010. Journal of the American Statistical Association 105 (489): 312–23.
Zou, Hui, Trevor Hastie, and Robert Tibshirani. 2007. The Annals of Statistics 35 (5): 2173–92.
