Compressing neural nets

pruning, compacting and otherwise fitting a good estimate into fewer parameters

How to make neural nets smaller while preserving their performance. This is a subtle problem, since we suspect that part of their special sauce is precisely that they are overparameterized; which is to say, one reason they work is that they are bigger than they β€œneed” to be. Finding the small network hiding inside the big one it apparently needed to be is therefore tricky. My instinct is to use some kind of sparse regularisation, but this does not carry over to the deep network setting, at least not naΓ―vely.


Train a big network, then delete neurons and see if it still works. See Jacob Gildenblat, Pruning deep neural networks to make them fast and small, or Why reducing the costs of training neural networks remains a challenge.
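The simplest version of this is global magnitude pruning: after training, zero out the weights with the smallest absolute values and keep a binary mask of the survivors. A minimal numpy sketch (the function name and sparsity level are illustrative, not from any particular library):

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude entries so that roughly
    `sparsity` fraction of the weights become zero."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy(), np.ones(weights.shape, dtype=bool)
    # k-th smallest magnitude becomes the pruning threshold
    threshold = np.partition(flat, k - 1)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask, mask

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))          # stand-in for a trained layer
W_pruned, mask = magnitude_prune(W, sparsity=0.9)
print(f"{mask.mean():.2%} of weights survive")
```

In practice one then fine-tunes the surviving weights with the mask held fixed; the question the literature wrestles with is how much accuracy survives this process.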

Lottery tickets

Kim Martineau’s summary of the state of the art in β€œlottery ticket” (Frankle and Carbin 2019) pruning strategies is fun; see also You et al. (2019) for an elaboration. The idea is that we can try to β€œprune early” and never bother fitting the big network at all, as classic pruning requires.
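The classic lottery-ticket procedure is an iterative loop: train, prune the smallest surviving weights, rewind the survivors to their initial values, and retrain. A toy numpy sketch of that loop, using gradient-descent linear regression as a stand-in for a real deep net (all names and hyperparameters here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy "network": linear regression, standing in for a deep net.
# Only the first 5 of 30 features actually matter.
X = rng.normal(size=(200, 30))
true_w = np.zeros(30)
true_w[:5] = rng.normal(size=5)
y = X @ true_w + 0.01 * rng.normal(size=200)

def train(w, mask, steps=500, lr=0.01):
    """Gradient descent on squared error, with pruned weights frozen at zero."""
    for _ in range(steps):
        grad = X.T @ (X @ (w * mask) - y) / len(y)
        w = w - lr * grad * mask
    return w * mask

w_init = rng.normal(size=30) * 0.1
mask = np.ones(30, dtype=bool)

# Iterative magnitude pruning with rewinding, in the spirit of
# Frankle and Carbin (2019): each round halves the surviving weights,
# then resets the survivors to their original initialisation.
for _ in range(3):
    w = train(w_init.copy(), mask)
    threshold = np.median(np.abs(w[mask]))
    mask &= np.abs(w) > threshold
w_final = train(w_init.copy(), mask)   # retrain the "winning ticket"
print(mask.sum(), "of 30 weights survive")
```

The lottery-ticket claim is that the sparse subnetwork found this way, trained from its original initialisation, matches the dense network; the β€œearly-bird” variant of You et al. (2019) tries to find the mask before full training.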

Regularising away neurons

Salome prepares to trim the neural net.

It seems like it should be easy to apply something like the LASSO to deep neural nets to trim away irrelevant features. Aren’t they just stacked layers of regressions, after all? And it works so well in linear regression. But in deep nets it is not generally obvious how to shrink away whole neurons.
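One route around this is the group lasso (as in Scardapane et al. 2016): penalise the norm of each neuron’s whole weight vector rather than individual weights, so that the proximal step zeroes entire rows at once. A minimal numpy sketch of the row-wise proximal operator (function name and penalty strength are illustrative):

```python
import numpy as np

def group_soft_threshold(W, lam):
    """Proximal step for a group lasso penalty with one group per row
    (one row = one neuron's weights): shrink each row's norm by lam,
    zeroing any row whose norm falls below lam."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.maximum(1.0 - lam / np.maximum(norms, 1e-12), 0.0)
    return W * scale

rng = np.random.default_rng(2)
W = rng.normal(size=(10, 4))   # 10 neurons, 4 inputs each
W[3] *= 0.01                   # one nearly-dead neuron
W_shrunk = group_soft_threshold(W, lam=0.5)
dead = np.linalg.norm(W_shrunk, axis=1) == 0
print(dead.sum(), "neurons removed entirely")
```

Interleaving this step with gradient updates (proximal gradient descent) is what lets the penalty delete neurons rather than merely shrinking scattered weights, which is the mechanism LassoNet also builds on.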

I am curious if Lemhadri et al. (2021) does the job:

Much work has been done recently to make neural networks more interpretable, and one approach is to arrange for the network to use only a subset of the available features. In linear models, Lasso (or β„“1-regularized) regression assigns zero weights to the most irrelevant or redundant features, and is widely used in data science. However the Lasso only applies to linear models. Here we introduce LassoNet, a neural network framework with global feature selection. Our approach achieves feature sparsity by adding a skip (residual) layer and allowing a feature to participate in any hidden layer only if its skip-layer representative is active. Unlike other approaches to feature selection for neural nets, our method uses a modified objective function with constraints, and so integrates feature selection with the parameter learning directly. As a result, it delivers an entire regularization path of solutions with a range of feature sparsity. We apply LassoNet to a number of real-data problems and find that it significantly outperforms state-of-the-art methods for feature selection and regression. LassoNet uses projected proximal gradient descent, and generalizes directly to deep networks. It can be implemented by adding just a few lines of code to a standard neural network.

Edge ML

A.k.a. TinyML, mobile ML etc. A major consumer of compressed neural nets, since small devices cannot fit large neural nets. See Edge ML.

Computational cost



Aghasi, Alireza, Nam Nguyen, and Justin Romberg. 2016. β€œNet-Trim: A Layer-Wise Convex Pruning of Deep Neural Networks.” arXiv:1611.05162 [Cs, Stat], November.
Bardes, Adrien, Jean Ponce, and Yann LeCun. 2022. β€œVICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning.” arXiv.
Blalock, Davis, Jose Javier Gonzalez Ortiz, Jonathan Frankle, and John Guttag. 2020. β€œWhat Is the State of Neural Network Pruning?” arXiv:2003.03033 [Cs, Stat], March.
BΓΆlcskei, Helmut, Philipp Grohs, Gitta Kutyniok, and Philipp Petersen. 2019. β€œOptimal Approximation with Sparsely Connected Deep Neural Networks.” SIAM Journal on Mathematics of Data Science 1 (1): 8–45.
Borgerding, Mark, and Philip Schniter. 2016. β€œOnsager-Corrected Deep Networks for Sparse Linear Inverse Problems.” arXiv:1612.01183 [Cs, Math], December.
Cai, Han, Chuang Gan, Tianzhe Wang, Zhekai Zhang, and Song Han. 2020. β€œOnce-for-All: Train One Network and Specialize It for Efficient Deployment.” In.
Chen, Tianqi, Ian Goodfellow, and Jonathon Shlens. 2015. β€œNet2Net: Accelerating Learning via Knowledge Transfer.” arXiv:1511.05641 [Cs], November.
Chen, Wenlin, James T. Wilson, Stephen Tyree, Kilian Q. Weinberger, and Yixin Chen. 2015. β€œCompressing Convolutional Neural Networks.” arXiv:1506.04449 [Cs], June.
Cheng, Yu, Duo Wang, Pan Zhou, and Tao Zhang. 2017. β€œA Survey of Model Compression and Acceleration for Deep Neural Networks.” arXiv:1710.09282 [Cs], October.
Cutajar, Kurt, Edwin V. Bonilla, Pietro Michiardi, and Maurizio Filippone. 2017. β€œRandom Feature Expansions for Deep Gaussian Processes.” In PMLR.
Daniely, Amit. 2017. β€œDepth Separation for Neural Networks.” arXiv:1702.08489 [Cs, Stat], February.
DeVore, Ronald, Boris Hanin, and Guergana Petrova. 2021. β€œNeural Network Approximation.” Acta Numerica 30 (May): 327–444.
ElbrΓ€chter, Dennis, Dmytro Perekrestenko, Philipp Grohs, and Helmut BΓΆlcskei. 2021. β€œDeep Neural Network Approximation Theory.” IEEE Transactions on Information Theory 67 (5): 2581–2623.
Frankle, Jonathan, and Michael Carbin. 2019. β€œThe Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks.” arXiv:1803.03635 [Cs], March.
Garg, Sahil, Irina Rish, Guillermo Cecchi, and Aurelie Lozano. 2017. β€œNeurogenesis-Inspired Dictionary Learning: Online Model Adaption in a Changing World.” In arXiv:1701.06106 [Cs, Stat].
Gelder, Maxwell van, Mitchell Wortsman, and Kiana Ehsani. 2020. β€œDeconstructing the Structure of Sparse Neural Networks.” arXiv.
Ghosh, Tapabrata. 2017. β€œQuickNet: Maximizing Efficiency and Efficacy in Deep Architectures.” arXiv:1701.02291 [Cs, Stat], January.
Globerson, Amir, and Roi Livni. 2016. β€œLearning Infinite-Layer Networks: Beyond the Kernel Trick.” arXiv:1606.05316 [Cs], June.
Gray, Scott, Alec Radford, and Diederik P Kingma. n.d. β€œGPU Kernels for Block-Sparse Weights.”
Gu, Albert, Isys Johnson, Karan Goel, Khaled Saab, Tri Dao, Atri Rudra, and Christopher RΓ©. 2021. β€œCombining Recurrent, Convolutional, and Continuous-Time Models with Linear State Space Layers.” In Advances in Neural Information Processing Systems, 34:572–85. Curran Associates, Inc.
Ha, David, Andrew Dai, and Quoc V. Le. 2016. β€œHyperNetworks.” arXiv:1609.09106 [Cs], September.
Hardt, Moritz, Benjamin Recht, and Yoram Singer. 2015. β€œTrain Faster, Generalize Better: Stability of Stochastic Gradient Descent.” arXiv:1509.01240 [Cs, Math, Stat], September.
Hayou, Soufiane, Jean-Francois Ton, Arnaud Doucet, and Yee Whye Teh. 2020. β€œPruning Untrained Neural Networks: Principles and Analysis.” arXiv:2002.08797 [Cs, Stat], June.
Hazimeh, Hussein, Natalia Ponomareva, Petros Mol, Zhenyu Tan, and Rahul Mazumder. 2020. β€œThe Tree Ensemble Layer: Differentiability Meets Conditional Computation,” February.
He, Yihui, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, and Song Han. 2019. β€œAMC: AutoML for Model Compression and Acceleration on Mobile Devices.” arXiv:1802.03494 [Cs], January.
Howard, Andrew G., Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017. β€œMobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications.” arXiv:1704.04861 [Cs], April.
Iandola, Forrest N., Song Han, Matthew W. Moskewicz, Khalid Ashraf, William J. Dally, and Kurt Keutzer. 2016. β€œSqueezeNet: AlexNet-Level Accuracy with 50x Fewer Parameters and <0.5MB Model Size.” arXiv:1602.07360 [Cs], February.
Ke, Xiongwen, and Yanan Fan. 2022. β€œOn the Optimization and Pruning for Bayesian Deep Learning.” arXiv.
LeCun, Yann, John S. Denker, and Sara A. Solla. 1990. β€œOptimal Brain Damage.” In Advances in Neural Information Processing Systems, 598–605.
Lee, Holden, Rong Ge, Tengyu Ma, Andrej Risteski, and Sanjeev Arora. 2017. β€œOn the Ability of Neural Nets to Express Distributions.” In arXiv:1702.07028 [Cs].
Lemhadri, Ismael, Feng Ruan, Louis Abraham, and Robert Tibshirani. 2021. β€œLassoNet: A Neural Network with Feature Sparsity.” Journal of Machine Learning Research 22 (127): 1–29.
Liebenwein, Lucas, Cenk Baykal, Brandon Carter, David Gifford, and Daniela Rus. 2021. β€œLost in Pruning: The Effects of Pruning Neural Networks Beyond Test Accuracy.” arXiv:2103.03014 [Cs], March.
Lobacheva, Ekaterina, Nadezhda Chirkova, and Dmitry Vetrov. 2017. β€œBayesian Sparsification of Recurrent Neural Networks.” In Workshop on Learning to Generate Natural Language.
Louizos, Christos, Max Welling, and Diederik P. Kingma. 2017. β€œLearning Sparse Neural Networks Through \(L_0\) Regularization.” arXiv:1712.01312 [Cs, Stat], December.
Mariet, Zelda Elaine. 2016. β€œLearning and enforcing diversity with Determinantal Point Processes.” Thesis, Massachusetts Institute of Technology.
Molchanov, Dmitry, Arsenii Ashukha, and Dmitry Vetrov. 2017. β€œVariational Dropout Sparsifies Deep Neural Networks.” In Proceedings of ICML.
Narang, Sharan, Eric Undersander, and Gregory Diamos. 2017. β€œBlock-Sparse Recurrent Neural Networks.” arXiv:1711.02782 [Cs, Stat], November.
Pan, Wei, Hao Dong, and Yike Guo. 2016. β€œDropNeuron: Simplifying the Structure of Deep Neural Networks.” arXiv:1606.07326 [Cs, Stat], June.
β€œPruning by Explaining: A Novel Criterion for Deep Neural Network Pruning.” 2021. Pattern Recognition 115 (July): 107899.
Renda, Alex, Jonathan Frankle, and Michael Carbin. 2020. β€œComparing Rewinding and Fine-Tuning in Neural Network Pruning.” arXiv:2003.02389 [Cs, Stat], March.
Scardapane, Simone, Danilo Comminiello, Amir Hussain, and Aurelio Uncini. 2016. β€œGroup Sparse Regularization for Deep Neural Networks.” arXiv:1607.00485 [Cs, Stat], July.
Shi, Lei, Shikun Feng, and Zhifan Zhu. 2016. β€œFunctional Hashing for Compressing Neural Networks.” arXiv:1605.06560 [Cs], May.
Srinivas, Suraj, and R. Venkatesh Babu. 2016. β€œGeneralized Dropout.” arXiv:1611.06791 [Cs], November.
Steeg, Greg ver, and Aram Galstyan. 2015. β€œThe Information Sieve.” arXiv:1507.02284 [Cs, Math, Stat], July.
Ullrich, Karen, Edward Meeds, and Max Welling. 2017. β€œSoft Weight-Sharing for Neural Network Compression.” arXiv Preprint arXiv:1702.04008.
Urban, Gregor, Krzysztof J. Geras, Samira Ebrahimi Kahou, Ozlem Aslan, Shengjie Wang, Rich Caruana, Abdelrahman Mohamed, Matthai Philipose, and Matt Richardson. 2016. β€œDo Deep Convolutional Nets Really Need to Be Deep (Or Even Convolutional)?” arXiv:1603.05691 [Cs, Stat], March.
Venturi, Daniele, and Xiantao Li. 2022. β€œThe Mori-Zwanzig Formulation of Deep Learning.” arXiv.
Wang, Yunhe, Chang Xu, Chao Xu, and Dacheng Tao. 2019. β€œPacking Convolutional Neural Networks in the Frequency Domain.” IEEE transactions on pattern analysis and machine intelligence 41 (10): 2495–2510.
Wang, Yunhe, Chang Xu, Shan You, Dacheng Tao, and Chao Xu. 2016. β€œCNNpack: Packing Convolutional Neural Networks in the Frequency Domain.” In Advances in Neural Information Processing Systems 29, edited by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, 253–61. Curran Associates, Inc.
Wang, Zhangyang, Shiyu Chang, Qing Ling, Shuai Huang, Xia Hu, Honghui Shi, and Thomas S. Huang. 2016. β€œStacked Approximated Regression Machine: A Simple Deep Learning Approach.” In.
Warden, Pete, and Daniel Situnayake. 2020. TinyML: Machine Learning with TensorFlow Lite on Arduino and Ultra-Low-Power Microcontrollers. O’Reilly Media, Incorporated.
Yarotsky, Dmitry, and Anton Zhevnerchuk. 2020. β€œThe Phase Diagram of Approximation Rates for Deep Neural Networks.” In Proceedings of the 34th International Conference on Neural Information Processing Systems, 33:13005–15. NIPS’20. Red Hook, NY, USA: Curran Associates Inc.
You, Haoran, Chaojian Li, Pengfei Xu, Yonggan Fu, Yue Wang, Xiaohan Chen, Richard G. Baraniuk, Zhangyang Wang, and Yingyan Lin. 2019. β€œDrawing Early-Bird Tickets: Toward More Efficient Training of Deep Networks.” In.
Zhao, Liang. 2017. β€œFast Algorithms on Random Matrices and Structured Matrices.”
