# Sparse regression

June 23, 2016 — October 24, 2019

Penalised regression where the penalties are sparsifying. The prediction losses could be anything — likelihood, least-squares, robust Huberised losses, absolute deviation etc.

I will play fast and loose with terminology here regarding theoretical and empirical losses, and the statistical models we attempt to fit.

In nonparametric statistics we might estimate simultaneously what look like many, many parameters, which we constrain in some clever fashion, which usually boils down to something we can interpret as a smoothing parameters, controlling how many factors we still have to consider, from a subset of the original.

I will usually discuss our intent to minimise prediction error, but one could also try to minimise model selection error too.

Then we have a simultaneous estimation and model selection procedure, probably a specific sparse model selection procedure and we possibly have to choose clever optimisation method to do the whole thing fast. Related to compressed sensing, but here we consider sampling complexity and measurement error.

See also matrix factorisations, optimisation, multiple testing, concentration inequalities, sparse flavoured icecream.

🏗 disambiguate the optimisation technologies at play — iteratively reweighted least squares etc.

Now! A set of headings under which I will try to understand some things, mostly the LASSO variants.

## 1 LASSO

Quadratic loss penalty, absolute coefficient penalty. We estimate the regression coefficients \(\beta\) by solving

\[\begin{aligned} \hat{\beta} = \underset{\beta \in \mathbb{R}^p}{\text{argmin}} \: \frac{1}{2} \| y - {\bf X} \beta \|_2^2 + \lambda \| \beta \|_1, \end{aligned}\]

The penalty coefficient \(\lambda\) is left for you to choose, but one of the magical properties of the lasso is that it is easy to test many possible values of \(\lambda\) at low marginal cost.

Popular because, amongst other reasons, it turns out to be in practice fast and convenient, and amenable to various performance accelerations e.g. aggressive approximate variable selection.

## 2 Adaptive LASSO

🏗 This is the one with famous oracle properties if you choose \(\lambda\) correctly. Hsi Zou’s paper on this (Zou 2006) is readable. I am having trouble digesting Sara van de Geer’s paper (S. A. van de Geer 2008) on the Generalised Lasso, but it *seems* to offer me guarantees for something *very similar* to the Adaptive Lasso, but with far more general assumptions on the model and loss functions, and some finite sample guarnatees.

## 3 LARS

A confusing one; LASSO and LARS are not the same thing but you can use one to calculate the other? Something like that? I need to work this one through with a pencil and paper.

## 4 Graph LASSO

As used in graphical models. 🏗

## 5 Elastic net

Combination of \(L_1\) and \(L_2\) penalties. 🏗

## 6 Grouped LASSO

AFAICT this is the usual LASSO but with grouped factors. See (Yuan and Lin 2006).

## 7 Model selection

Can be fiddly with sparse regression, which couples variable selection tightly with parameter estimation. See sparse model selection.

## 8 Debiased LASSO

There exist a few versions, but the one I have needed is (S. A. van de Geer 2008), section 2.1. See also and (S. van de Geer 2014b). (🏗 relation to (S. A. van de Geer 2008)?)

## 9 Sparse basis expansions

Wavelets etc; mostly handled under sparse dictionary bases.

## 10 Sparse neural nets

That is, sparse regressions as the layers in a neural network? Sure thing. (Wisdom et al. 2016)

## 11 Other coefficient penalties

Put a weird penalty on the coefficients! E.g. “Smoothly Clipped Absolute Deviation” (SCAD). 🏗

## 12 Other prediction losses

Put a weird penalty on the error! MAD prediction penalty, lasso-coefficient penalty, etc.

See (H. Wang, Li, and Jiang 2007; Portnoy and Koenker 1997) for some implementations using e.g. maximum absolute prediction error.

## 13 Bayesian Lasso

See Bayesian sparsity.

## 14 Implementations

Hastie, Friedman eta’s glmnet for R is fast and well-regarded, and has a MATLAB version. Here’s how to use it for adaptive lasso. Kenneth Tay has implemented elasticnet penalty for any GLM in `glmnet`

.

SPAMS (C++, MATLAB, R, python) by Mairal, looks interesting. It’s an optimisation library for many, many sparse problems.

liblinear also include lasso-type solvers, as well as support-vector regression.

## 15 Tidbits

Sparse regression as a universal classifier explainer? Local Interpretable Model-agnostic Explanations (Ribeiro, Singh, and Guestrin 2016) uses LASSO for model interpretation this. (See the blog post, or the source.

## 16 References

*The Annals of Statistics*.

*arXiv:1611.05162 [Cs, Stat]*.

*arXiv:1511.08963 [Cs, Math, Stat]*.

*arXiv:1910.12327 [Cs, Math, Stat]*.

*arXiv:1506.00898 [Cs, Math, Stat]*.

*arXiv:0901.3202 [Cs, Stat]*.

*Optimization With Sparsity-Inducing Penalties*. Foundations and Trends(r) in Machine Learning 1.0.

*arXiv:1501.00046 [Cs, Math, Stat]*.

*Advances in Neural Information Processing Systems 27*.

*Journal of Machine Learning Research*.

*The Annals of Statistics*.

*arXiv:1511.01650 [Cs, Math]*.

*IEEE Transactions on Signal Processing*.

*The Annals of Statistics*.

*Information Theory Workshop, 2008. ITW’08. IEEE*.

*Neural Computation*.

*IEEE Transactions on Information Theory*.

*arXiv:1609.06675 [Math, Stat]*.

*Biometrika*.

*The Annals of Statistics*.

*Annales de l’Institut Henri Poincaré, Probabilités Et Statistiques*.

*Mathematical Programming*.

*Journal of Computational and Graphical Statistics*.

*arXiv:1507.03652 [Math, Stat]*.

*Biometrics*.

*arXiv:1401.2906 [Math]*.

*arXiv:1606.04838 [Cs, Math, Stat]*.

*Technometrics*.

*IEEE Transactions on Information Theory*.

*Proceedings of the National Academy of Sciences*.

*Statistics for High-Dimensional Data*. Springer Series in Statistics.

*arXiv:1503.06426 [Stat]*.

*arXiv:1704.02739 [Stat]*.

*Electronic Journal of Statistics*.

*Learning Theory*. Lecture Notes in Computer Science.

*arXiv:1104.5246 [Cs, Math, Stat]*.

*arXiv Preprint arXiv:1610.02351*.

*Journal of Fourier Analysis and Applications*.

*Proceedings of the IEEE*.

*Communications on Pure and Applied Mathematics*.

*Journal of Fourier Analysis and Applications*.

*Digital Signal Processing*.

*Compressed Sensing & Sparse Filtering*. Signals and Communication Technology.

*Advances in Neural Information Processing Systems*.

*IEEE International Conference on Acoustics, Speech and Signal Processing, 2008. ICASSP 2008*.

*arXiv:1909.10140 [Math, Stat]*.

*Mathematical Programming*.

*IEEE Transactions on Signal Processing*.

*IEEE Transactions on Signal Processing*.

*The Econometrics Journal*.

*arXiv:1812.08089 [Math, Stat]*.

*arXiv:1809.05224 [Econ, Math, Stat]*.

*arXiv:1605.02214 [Math, Stat]*.

*arXiv:1410.0247 [Math, Stat]*.

*arXiv Preprint arXiv:1602.03589*.

*ICML*.

*arXiv:1805.05133 [Stat]*.

*The Annals of Statistics*.

*arXiv:1111.1162 [Cs, Math, Stat]*.

*The Annals of Statistics*.

*University of California, Berkeley*.

*IEEE Transactions on Pattern Analysis and Machine Intelligence*.

*Clinical Epigenetics*.

*arXiv:1507.05315 [Math, Stat]*.

*Journal of Machine Learning Research*.

*Journal of the American Statistical Association*.

*Statistica Sinica*.

*arXiv:1302.2068 [Stat]*.

*arXiv:1108.0373 [Math, Stat]*.

*The Annals of Applied Statistics*.

*Biostatistics*.

*Journal of the American Statistical Association*.

*IEEE Transactions on Signal Processing*.

*SIAM Journal on Optimization*.

*arXiv:1310.3787 [Math]*.

*Neural Computation*.

*arXiv:1412.5896 [Cs, Math, Stat]*.

*arXiv:1606.01586 [Math]*.

*arXiv:1403.2310 [Stat]*.

*Bioinformatics*.

*arXiv:1605.07913 [Math, Stat]*.

*arXiv:1507.00280 [Cs, Math, Stat]*.

*Bernoulli*.

*Computational Statistics & Data Analysis*.

*Bernoulli*.

*Statistical Learning with Sparsity: The Lasso and Generalizations*.

*IEEE Transactions on Image Processing*.

*Electronic Journal of Statistics*.

*IEEE Transactions on Information Theory*.

*Proceedings of the 32nd International Conference on Machine Learning (ICML-15)*.

*Proceedings of the 2014 SIAM International Conference on Data Mining*. Proceedings.

*Statistics Surveys*.

*arXiv:1109.2411 [Stat]*.

*IEEE Transactions on Signal Processing*.

*Journal of Machine Learning Research*.

*2014 48th Asilomar Conference on Signals, Systems and Computers*.

*The Annals of Statistics*.

*arXiv:1610.01353 [Math, Stat]*.

*Biometrika*.

*Journal of Machine Learning Research*.

*Advances in Neural Information Processing Systems 29*.

*Journal of Machine Learning Research*.

*Journal of Multivariate Analysis*.

*Journal of Machine Learning Research*.

*Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems*. Lecture Notes in Mathematics École d’Été de Probabilités de Saint-Flour 2033.

*arXiv:1612.04111 [Cs, Stat]*.

*SPARS’09-Signal Processing with Adaptive Sparse Structured Representations*.

*BMC Bioinformatics*.

*Electronic Journal of Statistics*.

*Annals of Statistics*.

*Advances in Neural Information Processing Systems 21*.

*arXiv:2004.11554 [Stat]*.

*arXiv:1311.6238 [Math, Stat]*.

*Journal of Machine Learning Research*.

*Journal of Statistical Planning and Inference*.

*arXiv:1609.07195 [Stat]*.

*The Annals of Statistics*.

*Biometrika*.

*Advances in Neural Information Processing Systems*.

*arXiv Preprint arXiv:1608.04845*.

*SIAM Journal on Optimization*.

*Group*.

*The Annals of Statistics*.

*The Annals of Statistics*.

*Proceedings of ICML*.

*Compressed Sensing: Theory and Applications*.

*ICASSP*.

*TEST*.

*Biometrika*.

*2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)*.

*arXiv:0803.2392 [Cs, Math]*.

*Mathematical Programming*.

*Electronic Journal of Statistics*.

*Advances in Neural Information Processing Systems 24*.

*The Annals of Statistics*.

*2013 IEEE 52nd Annual Conference on Decision and Control (CDC)*.

*IEEE Transactions on Signal Processing*.

*Statistical Science*.

*Proceedings of The 32nd International Conference on Machine Learning*.

*Statistical Science*.

*Annals of the Institute of Statistical Mathematics*.

*Mathematical Programming Computation*.

*Advances in Neural Information Processing Systems*.

*Electronic Journal of Statistics*.

*arXiv:1501.02923 [Cs, Stat]*.

*IEEE Transactions on Signal Processing*.

*Probability Theory and Related Fields*.

*The Annals of Statistics*.

*Proceedings of the 22Nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining*. KDD ’16.

*Compressed Sensing & Sparse Filtering*. Signals and Communication Technology.

*Sparse Modeling: Theory, Algorithms, and Applications*. Chapman & Hall/CRC Machine Learning & Pattern Recognition Series.

*Journal of the American Statistical Association*.

*Scandinavian Journal of Statistics*.

*arXiv:1908.01755 [Cs, Stat]*.

*Journal of the American Statistical Association*.

*Technometrics*.

*Journal of the American Statistical Association*.

*Journal of Statistical Software*.

*arXiv:1512.04011 [Cs]*.

*arXiv:1701.01207 [Cs, Math, Stat]*.

*Statistics*.

*IEEE Transactions on Image Processing*.

*The Annals of Statistics*.

*arXiv:1511.01957 [Cs, Math, Stat]*.

*arXiv:1308.5623 [Stat]*.

*Journal of Statistical Software*.

*Statistical Science*.

*Advances in Neural Information Processing Systems 28*.

*Journal of the Royal Statistical Society. Series B (Methodological)*.

*Journal of the Royal Statistical Society: Series B (Statistical Methodology)*.

*arXiv:1408.5801 [Stat]*.

*Analysis of Images, Social Networks and Texts*. Communications in Computer and Information Science 542.

*arXiv:1611.02101 [Cs, Stat]*.

*Proceedings of the IEEE*.

*arXiv:1612.03450 [Cs, Math, Stat]*.

*arXiv:1504.06706 [Math, Stat]*.

*An Introduction to Sparse Stochastic Processes*.

*IEEE Transactions on Information Theory*.

*IEEE Transactions on Information Theory*.

*The Annals of Statistics*.

*Scandinavian Journal of Statistics*.

*arXiv:1403.7023 [Math, Stat]*.

*arXiv:1409.8557 [Math, Stat]*.

*Estimation and Testing Under Sparsity*. Lecture Notes in Mathematics.

*The Annals of Statistics*.

*Electronic Journal of Statistics*.

*arXiv:1512.03099 [Cs, Math, Stat]*.

*Spline Models for Observational Data*.

*Sixth International Conference on Data Mining (ICDM’06)*.

*Journal of Business & Economic Statistics*.

*Advances in Neural Information Processing Systems 29*.

*arXiv:1504.02923 [Cs, Math]*.

*IEEE Transactions on Signal Processing*.

*The Annals of Applied Statistics*.

*IEEE Transactions on Information Theory*.

*IEEE Transactions on Pattern Analysis and Machine Intelligence*.

*2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)*.

*ICML (3)*.

*Journal of Machine Learning Research*.

*Journal of the Royal Statistical Society: Series B (Statistical Methodology)*.

*Biometrika*.

*Computational Optimization and Applications*.

*The Annals of Statistics*.

*Journal of the American Statistical Association*.

*arXiv:1511.03766 [Cs]*.

*Journal of the Royal Statistical Society: Series B (Statistical Methodology)*.

*The Annals of Statistics*.

*The Annals of Statistics*.

*Journal of Machine Learning Research*.

*Data Mining and Knowledge Discovery*.

*Journal of the American Statistical Association*.

*Journal of the Royal Statistical Society: Series B (Statistical Methodology)*.

*The Annals of Statistics*.

*The Annals of Statistics*.