Neural network activation functions



The Rectified Linear Unit circa 1920. Don’t you long to be as cool as this guy?

There is a whole cottage industry in showing that neural networks are reasonably universal function approximators with various nonlinearities as activations, under various conditions. Usually we take it as given that the particular activation function is not too important.

Sometimes we might like to play with the precise form of the nonlinearities, even making them directly learnable, because some function shapes might have better approximation properties with respect to various assumptions about the learning problem, in a sense which I will not attempt to make rigorous now; vague hand-waving arguments are, after all, the whole point of deep learning.

I think a part of this field has been subsumed into the stability-of-dynamical-systems setting? Or we do not care because something-something BatchNorm?

The current default activation function is ReLU, i.e. \(x\mapsto \max\{0,x\}\), which has many nice properties. However, networks built from it are piecewise-linear spline approximators. One could regard that as a plus (Unser 2019), but OTOH it makes them awkward for solving differential equations, where higher-order derivatives matter.
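To make the piecewise-linearity concrete, here is a minimal NumPy sketch of ReLU; the function names and sample points are mine, for illustration only:

```python
import numpy as np

def relu(x):
    """Rectified Linear Unit: elementwise max(0, x)."""
    return np.maximum(0.0, x)

# A ReLU network computes a continuous piecewise-linear function of its
# input: each hidden unit contributes a potential "kink" where its
# pre-activation crosses zero, and the network is affine between kinks.
x = np.linspace(-2.0, 2.0, 5)   # [-2, -1, 0, 1, 2]
y = relu(x)                     # [0, 0, 0, 1, 2]
```

Note the second derivative is zero almost everywhere, which is exactly what frustrates differential-equation applications.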

Sometimes, then, we want something different. Other classic activations such as \(x\mapsto\tanh x\) have fallen from favour, supplanted by ReLU. However, differentiable activations are useful, especially if higher-order gradients of the solution will be important. Many virtues of differentiable activation functions are documented in Implicit Neural Representations with Periodic Activation Functions (Sitzmann et al. 2020), which argues for \(x\mapsto\sin x\); it has some handy properties. Ramachandran, Zoph, and Le (2017) advocate Swish, \(x\mapsto \frac{x}{1+\exp(-x)}\), i.e. \(x\,\sigma(x)\).
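A quick sketch of both smooth activations. The \(\omega_0\) frequency scale in the periodic one is a hyperparameter; 30 is the value the SIREN paper suggests for its first layer, but treat that as an assumption carried over from their setting:

```python
import numpy as np

def swish(x, beta=1.0):
    # Swish (Ramachandran, Zoph, and Le 2017): x * sigmoid(beta * x).
    return x / (1.0 + np.exp(-beta * x))

def siren_activation(x, omega0=30.0):
    # Periodic activation in the style of Sitzmann et al. (2020);
    # omega0 sets the frequency scale of the representable signal.
    return np.sin(omega0 * x)

# Both are smooth, so higher-order derivatives exist everywhere,
# unlike ReLU, whose second derivative vanishes almost everywhere.
```

Note Swish interpolates between linear (small \(\beta\)) and ReLU-like (large \(\beta\)) behaviour, which is part of its appeal.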

Another fun one is SELU, the “self-normalizing” scaled exponential linear unit of Klambauer et al. (2017).
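For reference, a sketch of SELU; the two constants are the ones derived in the paper so that, under their weight-initialization assumptions, activations are driven toward zero mean and unit variance as they propagate through layers:

```python
import numpy as np

# Fixed-point constants from Klambauer et al. (2017).
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def selu(x):
    # scale * x for x > 0, scale * alpha * (exp(x) - 1) otherwise;
    # expm1 computes exp(x) - 1 accurately for small |x|.
    x = np.asarray(x, dtype=float)
    return SELU_SCALE * np.where(x > 0, x, SELU_ALPHA * np.expm1(x))
```

The self-normalizing property only holds under the paper's conditions (e.g. suitably initialized dense layers), so it is less a drop-in ReLU replacement than a package deal.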

References

Agostinelli, Forest, Matthew Hoffman, Peter Sadowski, and Pierre Baldi. 2015. “Learning Activation Functions to Improve Deep Neural Networks.” In Proceedings of International Conference on Learning Representations (ICLR) 2015.
Anil, Cem, James Lucas, and Roger Grosse. 2018. “Sorting Out Lipschitz Function Approximation,” November.
Arjovsky, Martin, Amar Shah, and Yoshua Bengio. 2016. “Unitary Evolution Recurrent Neural Networks.” In Proceedings of the 33rd International Conference on Machine Learning, 1120–28. ICML’16. New York, NY, USA: JMLR.org.
Balduzzi, David, Marcus Frean, Lennox Leary, J. P. Lewis, Kurt Wan-Duo Ma, and Brian McWilliams. 2017. “The Shattered Gradients Problem: If Resnets Are the Answer, Then What Is the Question?” In PMLR, 342–50.
Cho, Youngmin, and Lawrence K. Saul. 2009. “Kernel Methods for Deep Learning.” In Proceedings of the 22nd International Conference on Neural Information Processing Systems, 22:342–50. NIPS’09. Red Hook, NY, USA: Curran Associates Inc.
Clevert, Djork-Arné, Thomas Unterthiner, and Sepp Hochreiter. 2016. “Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs).” In Proceedings of ICLR.
Duch, Włodzisław, and Norbert Jankowski. 1999. “Survey of Neural Transfer Functions.”
Glorot, Xavier, and Yoshua Bengio. 2010. “Understanding the Difficulty of Training Deep Feedforward Neural Networks.” In AISTATS, 9:249–56.
Glorot, Xavier, Antoine Bordes, and Yoshua Bengio. 2011. “Deep Sparse Rectifier Neural Networks.” In AISTATS, 15:275.
Goodfellow, Ian J., David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. 2013. “Maxout Networks.” In ICML (3), 28:1319–27.
Hayou, Soufiane, Arnaud Doucet, and Judith Rousseau. 2019. “On the Impact of the Activation Function on Deep Neural Networks Training.” arXiv:1902.06853 [cs, stat], May.
He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015a. “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification.” arXiv:1502.01852 [cs], February.
———. 2015b. “Deep Residual Learning for Image Recognition.”
———. 2016. “Identity Mappings in Deep Residual Networks.” In arXiv:1603.05027 [cs].
Hochreiter, Sepp. 1998. “The Vanishing Gradient Problem During Learning Recurrent Neural Nets and Problem Solutions.” International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 6: 107–15.
Hochreiter, Sepp, Yoshua Bengio, Paolo Frasconi, and Jürgen Schmidhuber. 2001. “Gradient Flow in Recurrent Nets: The Difficulty of Learning Long-Term Dependencies.” In A Field Guide to Dynamical Recurrent Neural Networks. IEEE Press.
Klambauer, Günter, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. 2017. “Self-Normalizing Neural Networks.” In Proceedings of the 31st International Conference on Neural Information Processing Systems, 972–81. Red Hook, NY, USA: Curran Associates Inc.
Laurent, Thomas. n.d. “The Multilinear Structure of ReLU Networks,” 9.
Lederer, Johannes. 2021. “Activation Functions in Artificial Neural Networks: A Systematic Overview.” arXiv:2101.09957 [cs, stat], January.
Lee, Jaehoon, Yasaman Bahri, Roman Novak, Samuel S. Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein. 2018. “Deep Neural Networks as Gaussian Processes.” In ICLR.
Maas, Andrew L., Awni Y. Hannun, and Andrew Y. Ng. 2013. “Rectifier Nonlinearities Improve Neural Network Acoustic Models.” In Proceedings of ICML. Vol. 30.
Pascanu, Razvan, Tomas Mikolov, and Yoshua Bengio. 2013. “On the Difficulty of Training Recurrent Neural Networks.” In arXiv:1211.5063 [cs], 1310–18.
Rahaman, Nasim, Aristide Baratin, Devansh Arpit, Felix Draxler, Min Lin, Fred A. Hamprecht, Yoshua Bengio, and Aaron Courville. 2019. “On the Spectral Bias of Neural Networks.” arXiv:1806.08734 [cs, stat], May.
Ramachandran, Prajit, Barret Zoph, and Quoc V. Le. 2017. “Searching for Activation Functions.” arXiv:1710.05941 [cs], October.
Sitzmann, Vincent, Julien N. P. Martel, Alexander W. Bergman, David B. Lindell, and Gordon Wetzstein. 2020. “Implicit Neural Representations with Periodic Activation Functions.” arXiv:2006.09661 [cs, eess], June.
Srivastava, Rupesh Kumar, Klaus Greff, and Jürgen Schmidhuber. 2015. “Highway Networks.” In arXiv:1505.00387 [cs].
Unser, Michael. 2019. “A Representer Theorem for Deep Neural Networks.” Journal of Machine Learning Research 20 (110): 30.
Wisdom, Scott, Thomas Powers, John Hershey, Jonathan Le Roux, and Les Atlas. 2016. “Full-Capacity Unitary Recurrent Neural Networks.” In Advances in Neural Information Processing Systems, 4880–88.
Yang, Greg, and Hadi Salman. 2020. “A Fine-Grained Spectral Perspective on Neural Networks.” arXiv:1907.10599 [cs, stat], April.
