Compressing neural nets
pruning, compacting and otherwise fitting a good estimate into fewer parameters
October 14, 2016 — May 7, 2021
How to make neural nets smaller while still preserving their performance. This is a subtle problem, as we suspect that part of their special sauce is precisely that they are overparameterized, which is to say, one reason they work is precisely that they are bigger than they “need” to be. The problem of finding the network that is smaller than the bigger one that it seems to need to be is tricky. My instinct is to use some sparse regularization, but this does not carry over to the deep network setting, at least naïvely.
1 Pruning
Train a big network, then delete neurons and see if it still works. See Jacob Gildenblat, Pruning deep neural networks to make them fast and small, or Why reducing the costs of training neural networks remains a challenge.
2 Lottery tickets
Kim Martineau’s summary of the state of the art in “Lottery ticket” (Frankle and Carbin 2019) pruning strategies is fun; see also You et al. (2019) for an elaboration. The idea is that we can try to “prune early” and never bother fitting the big network as in classic pruning.
3 Regularising away neurons
Seems like it should be easy to apply something like LASSO in the NN setting to deep neural nets to trim away irrelevant features. Aren’t they just stacked layers of regressions, after all? And it works so well in linear regressions. But in deep nets, it is not generally obvious how to shrink away whole neurons.
I am curious if Lemhadri et al. (2021) does the job:
Much work has been done recently to make neural networks more interpretable, and one approach is to arrange for the network to use only a subset of the available features. In linear models, Lasso (or ℓ1-regularized) regression assigns zero weights to the most irrelevant or redundant features, and is widely used in data science. However, the Lasso only applies to linear models. Here we introduce LassoNet, a neural network framework with global feature selection. Our approach achieves feature sparsity by adding a skip (residual) layer and allowing a feature to participate in any hidden layer only if its skip-layer representative is active. Unlike other approaches to feature selection for neural nets, our method uses a modified objective function with constraints, and so integrates feature selection with the parameter learning directly. As a result, it delivers an entire regularization path of solutions with a range of feature sparsity. We apply LassoNet to a number of real-data problems and find that it significantly outperforms state-of-the-art methods for feature selection and regression. LassoNet uses projected proximal gradient descent and generalizes directly to deep networks. It can be implemented by adding just a few lines of code to a standard neural network.
4 Edge ML
A.k.a. Tiny ML, Mobile ML, etc. A major consumer of compressing neural nets, since small devices cannot fit large neural nets. See Edge ML.
5 Incoming
6 Computational cost of
TBD.