# Compressing neural nets

pruning, compacting and otherwise fitting a good estimate into fewer parameters

October 14, 2016 — May 7, 2021

How to make neural nets smaller while preserving their performance. This is a subtle problem, because we suspect that part of their special sauce is precisely that they are overparameterized: one reason they work is that they are bigger than they “need” to be. Finding a network that is *smaller than the bigger one it seems to need to be* is therefore tricky. My instinct is to use some sparse regularization, but this does not carry over to the deep network setting, at least naïvely.

## 1 Pruning

Train a big network, then delete neurons and see if it still works. See Jacob Gildenblat’s “Pruning deep neural networks to make them fast and small”, or “Why reducing the costs of training neural networks remains a challenge”.
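
A minimal sketch of the "delete neurons and see if it still works" recipe, for a network with one hidden ReLU layer stored as nested lists. The function names and the importance score (norm of incoming weights times norm of outgoing weights) are my assumptions for illustration; many other scoring heuristics exist.

```python
import math

def forward(W1, b1, W2, x):
    """One hidden ReLU layer followed by a linear output layer."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) for row in W2]

def prune_hidden_neurons(W1, b1, W2, keep):
    """Keep the `keep` hidden neurons with the largest importance score.

    W1 is hidden x in, b1 has one bias per hidden neuron, W2 is out x hidden.
    The score is a hypothetical heuristic, not a canonical choice.
    """
    def score(j):
        incoming = math.sqrt(sum(w * w for w in W1[j]))
        outgoing = math.sqrt(sum(row[j] ** 2 for row in W2))
        return incoming * outgoing

    ranked = sorted(range(len(W1)), key=score, reverse=True)
    kept = sorted(ranked[:keep])  # preserve the original neuron order
    W1_small = [W1[j] for j in kept]            # drop rows of W1
    b1_small = [b1[j] for j in kept]            # drop matching biases
    W2_small = [[row[j] for j in kept] for row in W2]  # drop columns of W2
    return W1_small, b1_small, W2_small, kept

# Toy network: the middle hidden neuron contributes almost nothing.
W1 = [[1.0, -2.0], [0.01, 0.02], [0.5, 0.5]]
b1 = [0.0, 0.0, 0.1]
W2 = [[1.0, 0.001, -1.0]]

W1_s, b1_s, W2_s, kept = prune_hidden_neurons(W1, b1, W2, keep=2)
x = [2.0, 0.5]
full = forward(W1, b1, W2, x)
small = forward(W1_s, b1_s, W2_s, x)
print(kept, full, small)
```

In practice the "see if it still works" step means re-evaluating on held-out data, and usually fine-tuning the smaller network, rather than comparing outputs at a single input as done here.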

## 2 Lottery tickets

Kim Martineau’s summary of the state of the art in “Lottery ticket” (Frankle and Carbin 2019) pruning strategies is fun; see also You et al. (2019) for an elaboration. The idea is that we can try to “prune early” and never bother fitting the big network as in classic pruning.
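
For orientation, here is a sketch of iterative magnitude pruning with rewinding to the initial weights, which is how I understand the basic procedure for finding a ticket in Frankle and Carbin (2019). Note that this baseline still trains the full network; the "prune early" variants aim to find the mask more cheaply. The names `find_ticket`, `magnitude_mask` and `toy_train` are hypothetical, and the toy quadratic objective stands in for real training.

```python
def magnitude_mask(weights, mask, prune_fraction):
    """Prune the smallest-magnitude `prune_fraction` of surviving weights."""
    alive = [i for i, m in enumerate(mask) if m]
    n_prune = int(len(alive) * prune_fraction)
    to_prune = sorted(alive, key=lambda i: abs(weights[i]))[:n_prune]
    new_mask = list(mask)
    for i in to_prune:
        new_mask[i] = 0
    return new_mask

def find_ticket(init_weights, train, rounds=2, prune_fraction=0.5):
    """Alternate training and pruning, rewinding to the initial weights."""
    mask = [1] * len(init_weights)
    for _ in range(rounds):
        # Rewind: restart from the original initialisation, pruned weights at 0.
        start = [w * m for w, m in zip(init_weights, mask)]
        trained = train(start, mask)
        mask = magnitude_mask(trained, mask, prune_fraction)
    ticket = [w * m for w, m in zip(init_weights, mask)]
    return mask, ticket

# Toy stand-in for training: gradient descent on sum((w - target)^2),
# with masked weights frozen at zero. Only two targets are large.
TARGET = [3.0, 0.0, 0.0, -2.0, 0.0, 0.0, 0.1, 0.0]

def toy_train(weights, mask, steps=200, lr=0.1):
    w = list(weights)
    for _ in range(steps):
        w = [wi - lr * 2.0 * (wi - t) * m for wi, t, m in zip(w, TARGET, mask)]
    return w

init = [0.5, -0.4, 0.3, 0.2, -0.1, 0.6, -0.7, 0.05]
mask, ticket = find_ticket(init, toy_train, rounds=2, prune_fraction=0.5)
print(mask)    # surviving positions
print(ticket)  # initial weights restricted to the surviving positions
```

The returned `ticket` is the sparse subnetwork at its original initialisation; the claim of the lottery ticket hypothesis is that training it in isolation can match the dense network.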

## 3 Regularising away neurons

It seems like it should be easy to apply something like the LASSO to deep neural nets to trim away irrelevant features. Aren’t they just stacked layers of regressions, after all? And it works so well in linear regression. But in deep nets, it is not generally obvious how to shrink away whole neurons: an ℓ1 penalty zeroes individual weights, whereas deleting a neuron requires all of its weights to vanish together.
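
One standard way to make whole neurons vanish together is a group-lasso penalty, with one group per neuron. This is a technique I am swapping in for illustration, not something the notes above commit to. The sketch below shows the proximal step for that penalty (group soft-thresholding) applied to a weight matrix whose rows are the incoming weights of each neuron; `group_soft_threshold` is a hypothetical name.

```python
import math

def group_soft_threshold(W, threshold):
    """Proximal operator of threshold * sum_j ||W[j]||_2, applied row by row.

    Each row is shrunk towards zero by `threshold` in Euclidean norm; rows
    whose norm is below the threshold are set exactly to zero, which removes
    the corresponding neuron.
    """
    shrunk = []
    for row in W:
        norm = math.sqrt(sum(w * w for w in row))
        scale = max(0.0, 1.0 - threshold / norm) if norm > 0 else 0.0
        shrunk.append([scale * w for w in row])
    return shrunk

W = [[3.0, 4.0],   # norm 5: survives, shrunk by a factor of 0.8
     [0.3, 0.4]]   # norm 0.5: below the threshold, zeroed entirely
W_new = group_soft_threshold(W, threshold=1.0)
alive = [j for j, row in enumerate(W_new) if any(w != 0.0 for w in row)]
print(W_new, alive)
```

In a training loop this step would follow each gradient update on the unpenalised loss (proximal gradient descent), with the threshold set to the learning rate times the penalty strength.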

I am curious if Lemhadri et al. (2021) does the job:

> Much work has been done recently to make neural networks more interpretable, and one approach is to arrange for the network to use only a subset of the available features. In linear models, Lasso (or ℓ1-regularized) regression assigns zero weights to the most irrelevant or redundant features, and is widely used in data science. However, the Lasso only applies to linear models. Here we introduce LassoNet, a neural network framework with global feature selection. Our approach achieves feature sparsity by adding a skip (residual) layer and allowing a feature to participate in any hidden layer only if its skip-layer representative is active. Unlike other approaches to feature selection for neural nets, our method uses a modified objective function with constraints, and so integrates feature selection with the parameter learning directly. As a result, it delivers an entire regularization path of solutions with a range of feature sparsity. We apply LassoNet to a number of real-data problems and find that it significantly outperforms state-of-the-art methods for feature selection and regression. LassoNet uses projected proximal gradient descent and generalizes directly to deep networks. It can be implemented by adding just a few lines of code to a standard neural network.
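
To make the "skip-layer representative" constraint concrete, here is the objective as I read it from the paper; the notation is my recollection of theirs and should be checked against the original. The model is a linear skip connection plus a feed-forward network, $f_{\theta,W}(x) = \theta^\top x + g_W(x)$, and the fit solves

$$
\min_{\theta, W} \; L(\theta, W) + \lambda \|\theta\|_1
\quad \text{subject to} \quad
\big\|W^{(1)}_j\big\|_\infty \le M\,|\theta_j|, \qquad j = 1, \dots, d,
$$

where $W^{(1)}_j$ collects the first-layer weights attached to input feature $j$, $\lambda$ controls sparsity and $M$ controls how much nonlinearity each feature is allowed relative to its linear weight. If $\theta_j = 0$ the constraint forces $W^{(1)}_j = 0$, so feature $j$ drops out of the whole network.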

## 4 Edge ML

A.k.a. Tiny ML, Mobile ML, etc. A major consumer of neural net compression, since small devices cannot fit large neural nets. See Edge ML.

## 5 Incoming

## 6 Computational cost of

TBD.

## 7 References

*arXiv:1611.05162 [Cs, Stat]*.

*arXiv:2003.03033 [Cs, Stat]*.

*SIAM Journal on Mathematics of Data Science*.

*arXiv:1612.01183 [Cs, Math]*.

*arXiv:1511.05641 [Cs]*.

*arXiv:1710.09282 [Cs]*.

*arXiv:1506.04449 [Cs]*.

*PMLR*.

*arXiv:1702.08489 [Cs, Stat]*.

*Acta Numerica*.

*IEEE Transactions on Information Theory*.

*arXiv:1803.03635 [Cs]*.

*arXiv:1701.06106 [Cs, Stat]*.

*arXiv:1701.02291 [Cs, Stat]*.

*arXiv:1606.05316 [Cs]*.

*Advances in Neural Information Processing Systems*.

*arXiv:1609.09106 [Cs]*.

*arXiv:1509.01240 [Cs, Math, Stat]*.

*arXiv:2002.08797 [Cs, Stat]*.

*arXiv:1802.03494 [Cs]*.

*arXiv:1704.04861 [Cs]*.

*arXiv:1602.07360 [Cs]*.

*Advances in Neural Information Processing Systems*.

*arXiv:1702.07028 [Cs]*.

*Journal of Machine Learning Research*.

*arXiv:2103.03014 [Cs]*.

*Workshop on Learning to Generate Natural Language*.

*arXiv:1712.01312 [Cs, Stat]*.

*Proceedings of ICML*.

*arXiv:1711.02782 [Cs, Stat]*.

*arXiv:1606.07326 [Cs, Stat]*.

*Pattern Recognition*.

*arXiv:2003.02389 [Cs, Stat]*.

*arXiv:1607.00485 [Cs, Stat]*.

*arXiv:1605.06560 [Cs]*.

*arXiv:1611.06791 [Cs]*.

*arXiv:1702.04008*.

*arXiv:1603.05691 [Cs, Stat]*.

*arXiv:1507.02284 [Cs, Math, Stat]*.

*IEEE Transactions on Pattern Analysis and Machine Intelligence*.

*Advances in Neural Information Processing Systems 29*.

*TinyML: Machine Learning with TensorFlow Lite on Arduino and Ultra-Low-Power Microcontrollers*.

*Proceedings of the 34th International Conference on Neural Information Processing Systems*. NIPS’20.