# Penalised/regularised regression

June 23, 2016 — September 19, 2022

Bayes
functional analysis
linear algebra
model selection
optimization
probability
signal processing
sparser than thou
statistics

Regression estimation with penalties on the model parameters. I am especially interested in sparsifying penalties, and I have more notes on sparse regression.

Here I consider general penalties: ridge etc. At least in principle — I have no active projects using penalties without sparsifying them at the moment.

Why might I use such penalties? One reason is that $$L_2$$ penalties have simple closed forms for their information criteria, as shown by Konishi and Kitagawa (1996).
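To make that concrete: under a ridge penalty the fitted values are a linear map of $$y$$, and the effective degrees of freedom entering such criteria is simply the trace of that map. A minimal numpy sketch on synthetic data (the setup and names here are mine, not from the cited paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 10
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.0, 0.5]
y = X @ beta_true + rng.standard_normal(n)

def ridge_fit(X, y, lam):
    """Ridge estimate: (X'X + lam I)^{-1} X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def ridge_df(X, lam):
    """Effective degrees of freedom: trace of the hat matrix
    H = X (X'X + lam I)^{-1} X'."""
    p = X.shape[1]
    return np.trace(X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T))

for lam in [0.0, 1.0, 10.0, 100.0]:
    b = ridge_fit(X, y, lam)
    rss = np.sum((y - X @ b) ** 2)
    # Mallows-style criterion with (assumed known) sigma^2 = 1
    cp = rss + 2.0 * ridge_df(X, lam)
    print(f"lambda={lam:6.1f}  df={ridge_df(X, lam):5.2f}  Cp={cp:8.2f}")
```

At $$\lambda = 0$$ the degrees of freedom equal $$p$$, and they shrink monotonically towards zero as the penalty grows; the criterion trades residual fit against that complexity term.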

See also matrix factorisations, optimisation, multiple testing, concentration inequalities, sparse flavoured icecream.

To discuss:

Ridge penalties, relationship with robust regression, statistical learning theory etc.

In nonparametric statistics we might estimate simultaneously what look like many, many parameters, which we constrain in some clever fashion, which usually boils down to something we can interpret as a “penalty” on the parameters.

“Penalization” has a genealogy unknown to me, but is probably the least abstruse term for common, general usage.

The “regularisation” nomenclature claims descent from Tikhonov (e.g. Tikhonov and Glasko (1965)), who wanted to solve ill-conditioned integral and differential equations, a slightly more general problem.

In statistics, the term “shrinkage” is used for very nearly the same thing.

“Smoothing” seems to be common in the spline and kernel estimate communities (e.g. Wahba 1990), who usually actually want to smooth curves. When we say “smoothing” we usually mean that the predictions can be expressed via a “linear smoother”/hat matrix, which has certain nice properties in generalised cross validation.

“Smoothing” is not a great general term, since penalisation does not necessarily cause “smoothness” from every perspective. For example, some penalties cause the coefficients to become sparse, and so, from the perspective of the coefficient vector, they promote non-smoothness. Often it is not obvious which object becomes smooth.
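On the linear-smoother point: ridge regression is a linear smoother, so the generalised cross validation score of Golub, Heath, and Wahba (1979) can be computed directly from the hat matrix. A rough numpy sketch on synthetic data (the data and all names are mine):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 80, 15
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + 0.5 * rng.standard_normal(n)

def gcv_score(X, y, lam):
    """GCV for the linear smoother y_hat = H y,
    with H = X (X'X + lam I)^{-1} X':
    GCV(lam) = n ||y - H y||^2 / (n - tr(H))^2."""
    n, p = X.shape
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    resid = y - H @ y
    return n * (resid @ resid) / (n - np.trace(H)) ** 2

# Pick the penalty weight minimising GCV over a crude grid
lams = np.logspace(-3, 3, 25)
best = min(lams, key=lambda lam: gcv_score(X, y, lam))
print(f"GCV-selected lambda: {best:.3g}")
```

The convenience is that nothing here requires refitting on held-out folds; the trace of the hat matrix stands in for the cost of model flexibility.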

Regardless, what these problems have in common is that we wish to solve an ill-conditioned inverse problem, so we tame it by adding a penalty that disfavours solutions we feel one should be reluctant to accept.
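A classic toy illustration of such taming, using a Hilbert matrix as the ill-conditioned operator (this exact setup is my own, not drawn from any of the cited papers):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10
# Hilbert matrix: a classically ill-conditioned linear operator
A = 1.0 / (np.arange(n)[:, None] + np.arange(n)[None, :] + 1.0)
x_true = np.ones(n)
b = A @ x_true + 1e-6 * rng.standard_normal(n)  # tiny observation noise

# Naive inversion: the noise is amplified by the huge condition number
x_naive = np.linalg.solve(A, b)

# Tikhonov-regularised solution: argmin ||Ax - b||^2 + lam ||x||^2
lam = 1e-6
x_tikh = np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ b)

err_naive = np.linalg.norm(x_naive - x_true)
err_tikh = np.linalg.norm(x_tikh - x_true)
print(f"naive error: {err_naive:.3g}, Tikhonov error: {err_tikh:.3g}")
```

The penalty damps the directions in which the operator barely acts, precisely the directions where the data carry almost no information and the noise would otherwise dominate.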

🏗 specifics

## 1 Connection to Bayesian priors

Famously, a penalty can have an interpretation as a Bayesian prior on the solution space. It is a fun exercise, for example, to “rediscover” lasso regression as a typical linear regression but with a Laplace prior on the coefficients. In that case the maximum a posteriori estimate under that prior and the lasso solution coincide. If you want the full posterior you have to do a lot more work, but the connection is suggestive nonetheless.
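The simplest instance of this correspondence is fully tractable: a Gaussian prior on the coefficients gives ridge regression, with the penalty weight $$\lambda = \sigma^2/\tau^2$$ set by the noise and prior variances, and since a Gaussian posterior's mode and mean coincide, the MAP estimate is available in closed form. A numpy sketch (variances chosen arbitrarily for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 40, 5
X = rng.standard_normal((n, p))
sigma2, tau2 = 1.0, 0.5  # noise variance, prior variance (assumed known)
y = X @ rng.standard_normal(p) + np.sqrt(sigma2) * rng.standard_normal(n)

# Ridge estimate with lambda = sigma^2 / tau^2
lam = sigma2 / tau2
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Posterior mode under beta ~ N(0, tau^2 I), y | beta ~ N(X beta, sigma^2 I)
beta_map = np.linalg.solve(X.T @ X / sigma2 + np.eye(p) / tau2, X.T @ y / sigma2)

print(np.max(np.abs(beta_ridge - beta_map)))  # agrees to floating-point precision
```

The Laplace-prior/lasso analogue has no such closed form, which is exactly why the coincidence of penalised optimum and posterior mode is the part that survives.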

A related and useful connection is the interpretation of covariance kernels as priors producing smoothness in solutions. A very elegant introduction to these is given in Miller, Glennie, and Seaton (2020).

## 2 As shrinkage

TBD. See James and Stein (1961); there is also a beautiful explainer video on the topic.
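In the meantime, a Monte Carlo sketch of the Stein phenomenon: for $$p \geq 3$$ Gaussian means observed with unit noise, the (positive-part) James–Stein estimator beats the MLE in total squared error. The toy setup below is my own:

```python
import numpy as np

rng = np.random.default_rng(4)
p, trials = 20, 2000
theta = rng.standard_normal(p)  # true mean vector

mse_mle, mse_js = 0.0, 0.0
for _ in range(trials):
    x = theta + rng.standard_normal(p)  # one noisy observation per coordinate
    # Positive-part James-Stein: shrink x toward zero
    shrink = max(0.0, 1.0 - (p - 2) / (x @ x))
    js = shrink * x
    mse_mle += np.sum((x - theta) ** 2)
    mse_js += np.sum((js - theta) ** 2)

mse_mle /= trials
mse_js /= trials
print(f"MLE risk ~ {mse_mle:.2f}, James-Stein risk ~ {mse_js:.2f}")
```

The MLE's risk is $$p$$ by construction; the shrunken estimator's is strictly smaller, even though each individual coordinate estimate is biased.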

## 3 Adaptive regularization

What should we regularize to attain specific kinds of solutions?

Here’s one thing I saw recently:

Venkat Chandrasekaran, Learning Semidefinite Regularizers via Matrix Factorization

> Abstract: Regularization techniques are widely employed in the solution of inverse problems in data analysis and scientific computing due to their effectiveness in addressing difficulties due to ill-posedness. In their most common manifestation, these methods take the form of penalty functions added to the objective in optimization-based approaches for solving inverse problems. The purpose of the penalty function is to induce a desired structure in the solution, and these functions are specified based on prior domain-specific expertise. We consider the problem of learning suitable regularization functions from data in settings in which prior domain knowledge is not directly available. Previous work under the title of ‘dictionary learning’ or ‘sparse coding’ may be viewed as learning a polyhedral regularizer from data. We describe generalizations of these methods to learn semidefinite regularizers by computing structured factorizations of data matrices. Our algorithmic approach for computing these factorizations combines recent techniques for rank minimization problems along with operator analogs of Sinkhorn scaling. The regularizers obtained using our framework can be employed effectively in semidefinite programming relaxations for solving inverse problems. (Joint work with Yong Sheng Soh)

## 4 References

Akaike, Hirotugu. 1973. In Proceedings of the Second International Symposium on Information Theory, edited by B. N. Petrov and F. Csáki, 199–213. Budapest: Akadémiai Kiadó.
Akaike, Hirotugu. 1973. Biometrika 60 (2): 255–65.
Azizyan, Martin, Akshay Krishnamurthy, and Aarti Singh. 2015. arXiv:1506.00898 [Cs, Math, Stat], June.
Bach, Francis. 2009. arXiv:0901.3202 [Cs, Stat].
Banerjee, Arindam, Sheng Chen, Farideh Fazayeli, and Vidyashankar Sivakumar. 2014. In Advances in Neural Information Processing Systems 27, edited by Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, 1556–64. Curran Associates, Inc.
Barron, Andrew R., Cong Huang, Jonathan Q. Li, and Xi Luo. 2008. In Information Theory Workshop, 2008. ITW’08. IEEE, 247–57. IEEE.
Battiti, Roberto. 1992. Neural Computation 4 (2): 141–66.
Bellec, Pierre C., and Alexandre B. Tsybakov. 2016. arXiv:1609.06675 [Math, Stat], September.
Bickel, Peter J., Bo Li, Alexandre B. Tsybakov, Sara A. van de Geer, Bin Yu, Teófilo Valdés, Carlos Rivero, Jianqing Fan, and Aad van der Vaart. 2006. Test 15 (2): 271–344.
Brown, Lawrence D., and Yi Lin. 2004. The Annals of Statistics 32 (4): 1723–43.
Bühlmann, Peter, and Sara van de Geer. 2011. In Statistics for High-Dimensional Data, 77–97. Springer Series in Statistics. Springer Berlin Heidelberg.
———. 2015. arXiv:1503.06426 [Stat] 9 (1): 1449–73.
Burman, P., and D. Nolan. 1995. Biometrika 82 (4): 877–86.
Candès, Emmanuel J., and Carlos Fernandez-Granda. 2013. Journal of Fourier Analysis and Applications 19 (6): 1229–54.
Candès, Emmanuel J., and Y. Plan. 2010. Proceedings of the IEEE 98 (6): 925–36.
Cavanaugh, Joseph E. 1997. Statistics & Probability Letters 33 (2): 201–8.
Chen, Yen-Chi, and Yu-Xiang Wang. n.d.
Chernozhukov, Victor, Christian Hansen, and Martin Spindler. 2015. Annual Review of Economics 7 (1): 649–88.
Efron, Bradley. 2004. Journal of the American Statistical Association 99 (467): 619–32.
Efron, Bradley, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. 2004. The Annals of Statistics 32 (2): 407–99.
Flynn, Cheryl J., Clifford M. Hurvich, and Jeffrey S. Simonoff. 2013. arXiv:1302.2068 [Stat], February.
Friedman, Jerome, Trevor Hastie, and Rob Tibshirani. 2010. Journal of Statistical Software 33 (1): 1–22.
Fuglstad, Geir-Arne, Daniel Simpson, Finn Lindgren, and Håvard Rue. 2019. Journal of the American Statistical Association 114 (525): 445–52.
Geer, Sara van de. 2014a. Scandinavian Journal of Statistics 41 (1): 72–86.
———. 2014b. arXiv:1409.8557 [Math, Stat], September.
Giryes, Raja, Guillermo Sapiro, and Alex M. Bronstein. 2014. arXiv:1412.5896 [Cs, Math, Stat], December.
Golub, Gene H., Michael Heath, and Grace Wahba. 1979. Technometrics 21 (2): 215–23.
Golubev, Grigori K., and Michael Nussbaum. 1990. The Annals of Statistics 18 (2): 758–78.
Green, P. J. 1990. IEEE Transactions on Medical Imaging 9 (1): 84–93.
Green, Peter J. 1990. Journal of the Royal Statistical Society. Series B (Methodological) 52 (3): 443–52.
Gu, Chong. 1993. Journal of the American Statistical Association 88 (422): 495–504.
Gui, Jiang, and Hongzhe Li. 2005. Bioinformatics 21 (13): 3001–8.
Hastie, Trevor J., and Robert J. Tibshirani. 1990. Generalized Additive Models. Vol. 43. CRC Press.
Hastie, Trevor J., Rob Tibshirani, and Martin J. Wainwright. 2015. Statistical Learning with Sparsity: The Lasso and Generalizations. Boca Raton: Chapman and Hall/CRC.
Hawe, S., M. Kleinsteuber, and K. Diepold. 2013. IEEE Transactions on Image Processing 22 (6): 2138–50.
Hegde, Chinmay, Piotr Indyk, and Ludwig Schmidt. 2015. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), 928–37.
Hoerl, Arthur E., and Robert W. Kennard. 1970. Technometrics 12 (1): 55–67.
Huang, Jianhua Z., Naiping Liu, Mohsen Pourahmadi, and Linxu Liu. 2006. Biometrika 93 (1): 85–98.
James, William, and Charles Stein. 1961. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, 1:361–79. University of California Press.
Janson, Lucas, William Fithian, and Trevor J. Hastie. 2015. Biometrika 102 (2): 479–85.
Javanmard, Adel, and Andrea Montanari. 2014. Journal of Machine Learning Research 15 (1): 2869–909.
Kaufman, S., and S. Rosset. 2014. Biometrika 101 (4): 771–84.
Kloft, Marius, Ulrich Rückert, and Peter L. Bartlett. 2010. In Machine Learning and Knowledge Discovery in Databases, edited by José Luis Balcázar, Francesco Bonchi, Aristides Gionis, and Michèle Sebag, 66–81. Lecture Notes in Computer Science. Springer Berlin Heidelberg.
Koenker, Roger, and Ivan Mizera. 2006. Advances in Statistical Modeling and Inference, 613–34.
Konishi, Sadanori, and G. Kitagawa. 2008. Information Criteria and Statistical Modeling. Springer Series in Statistics. New York: Springer.
Konishi, Sadanori, and Genshiro Kitagawa. 1996. Biometrika 83 (4): 875–90.
Lange, K. 1990. IEEE transactions on medical imaging 9 (4): 439–46.
Liu, Han, Kathryn Roeder, and Larry Wasserman. 2010. In Advances in Neural Information Processing Systems 23, edited by J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, 1432–40. Curran Associates, Inc.
Meinshausen, Nicolai, and Peter Bühlmann. 2010. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 72 (4): 417–73.
Meyer, Mary C. 2008. The Annals of Applied Statistics 2 (3): 1013–33.
Miller, David L., Richard Glennie, and Andrew E. Seaton. 2020. Journal of Agricultural, Biological and Environmental Statistics 25 (1): 1–16.
Montanari, Andrea. 2012. Compressed Sensing: Theory and Applications, 394–438.
Needell, D., and J. A. Tropp. 2008. arXiv:0803.2392 [Cs, Math], March.
Rahimi, Ali, and Benjamin Recht. 2009. In Advances in Neural Information Processing Systems, 1313–20. Curran Associates, Inc.
Rezende, Danilo Jimenez, Shakir Mohamed, and Daan Wierstra. 2015. In Proceedings of ICML.
Rigollet, Philippe, and Jonathan Weed. 2018. arXiv.
Shen, Xiaotong, and Hsin-Cheng Huang. 2006. Journal of the American Statistical Association 101 (474): 554–68.
Shen, Xiaotong, Hsin-Cheng Huang, and Jimmy Ye. 2004. Technometrics 46 (3): 306–17.
Shen, Xiaotong, and Jianming Ye. 2002. Journal of the American Statistical Association 97 (457): 210–21.
Silverman, B. W. 1982. The Annals of Statistics 10 (3): 795–810.
———. 1984. The Annals of Statistics 12 (3): 898–916.
Simon, Noah, Jerome Friedman, Trevor Hastie, and Rob Tibshirani. 2011. Journal of Statistical Software 39 (5).
Smola, Alex J., Bernhard Schölkopf, and Klaus-Robert Müller. 1998. Neural Networks 11 (4): 637–49.
Somekh-Baruch, Anelia, Amir Leshem, and Venkatesh Saligrama. 2016. arXiv:1609.07415 [Cs, Math, Stat], September.
Stein, Charles M. 1981. The Annals of Statistics 9 (6): 1135–51.
Tansey, Wesley, Oluwasanmi Koyejo, Russell A. Poldrack, and James G. Scott. 2014. arXiv:1411.6144 [Stat], November.
Tikhonov, A. N., and V. B. Glasko. 1965. USSR Computational Mathematics and Mathematical Physics 5 (3): 93–107.
Uematsu, Yoshimasa. 2015. arXiv:1504.06706 [Math, Stat], April.
Wahba, Grace. 1990. Spline Models for Observational Data. SIAM.
Weng, Haolei, Arian Maleki, and Le Zheng. 2018. The Annals of Statistics 46 (6A): 3099–129.
Wood, S. N. 2000. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 62 (2): 413–28.
Wood, Simon N. 2008. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70 (3): 495–518.
Wu, Tong Tong, and Kenneth Lange. 2008. The Annals of Applied Statistics 2 (1): 224–44.
Xie, Bo, Yingyu Liang, and Le Song. 2016. arXiv:1611.03131 [Cs, Stat], November.
Ye, Jianming. 1998. Journal of the American Statistical Association 93 (441): 120–31.
Zhang, Cun-Hui, and Stephanie S. Zhang. 2014. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 76 (1): 217–42.
Zhang, Yiyun, Runze Li, and Chih-Ling Tsai. 2010. Journal of the American Statistical Association 105 (489): 312–23.
Zou, Hui, and Trevor Hastie. 2005. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67 (2): 301–20.
Zou, Hui, Trevor Hastie, and Robert Tibshirani. 2007. The Annals of Statistics 35 (5): 2173–92.