Neural network activation functions
2017-01-12 — 2025-05-20
There is a cottage industry built upon showing that neural networks are reasonably universal function approximators with various nonlinearities as activations, under various conditions. Usually, we take it as a given that the particular activation function is not too important.
Sometimes, we like to play with the exact form of the nonlinearities, even making the nonlinearities themselves directly learnable. The idea is that some function shapes might have better approximation properties under various assumptions about the learning problem, in a sense I won’t make rigorous now. Vague hand-waving arguments are the whole point of deep learning. Taken to its extreme, learning activations instead of weights leads to Kolmogorov-Arnold networks.
I think a part of this field has been subsumed into the stability-of-dynamical-systems setting? Or we don’t care because of something-something BatchNorm?
1 ReLU
The current default activation function is ReLU, i.e. \(x\mapsto \max\{0,x\}\), which has many nice properties. However, it does lead to piecewise-linear spline approximators. One could regard that as a plus (Unser 2019), but OTOH it makes it hard to solve differential equations, since the second and higher derivatives of the network output vanish almost everywhere.
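For concreteness, here is a minimal NumPy sketch (my own, not from any of the cited papers) of ReLU and of why a ReLU network is a piecewise-linear spline in its input:

```python
import numpy as np

def relu(x):
    # ReLU: elementwise max(0, x).
    return np.maximum(0.0, x)

# A one-hidden-layer ReLU network is piecewise linear in its input:
# away from the "kinks" contributed by each hidden unit it is exactly affine,
# so all second and higher derivatives vanish.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(16, 1)), rng.normal(size=16)
w2 = rng.normal(size=16)

def net(x):
    return relu(W1 @ np.atleast_1d(x) + b1) @ w2

print([float(net(x)) for x in np.linspace(-3.0, 3.0, 7)])
```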
2 Continuously differentiable activations
Sometimes, we want something different. Other classic activations such as \(x\mapsto\tanh x\) have fallen out of favour, supplanted by ReLU. However, differentiable activations are useful, especially if higher-order gradients of the solution will be important, e.g. in implicit representation NNs. Many virtues of differentiable activation functions for that purpose are documented in Implicit Neural Representations with Periodic Activation Functions; Sitzmann et al. (2020) argue for \(x\mapsto\sin x\) on the basis of various handy properties. This seems to require careful initialization.
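A rough NumPy sketch of such a sine-activation layer. The initialization bounds below are my paraphrase of the scheme proposed by Sitzmann et al. (2020); check the paper for the exact recipe:

```python
import numpy as np

rng = np.random.default_rng(0)

def sine_layer(fan_in, fan_out, omega0=30.0, first=False):
    # Sine activation with (approximately) SIREN-style initialization:
    # the first layer draws weights from U(-1/fan_in, 1/fan_in), later layers
    # from U(-sqrt(6/fan_in)/omega0, sqrt(6/fan_in)/omega0), so that
    # pre-activations stay in a well-behaved range layer after layer.
    bound = 1.0 / fan_in if first else np.sqrt(6.0 / fan_in) / omega0
    W = rng.uniform(-bound, bound, size=(fan_out, fan_in))
    b = rng.uniform(-bound, bound, size=fan_out)
    return lambda x: np.sin(omega0 * (W @ x + b))

layer1 = sine_layer(1, 64, first=True)
layer2 = sine_layer(64, 64)
h = layer2(layer1(np.array([0.5])))  # shape (64,)
```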
2.1 Swish
Ramachandran, Zoph, and Le (2017) search through many candidate functions and find some fun ones, most famously Swish, \(x\mapsto \frac{x}{1+\exp(-x)}\). That is, it’s the classic sigmoid function multiplied by \(x\).
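A one-liner, if you want to play with it. Setting \(\beta=1\) recovers the form quoted above; the paper also considers a constant or trainable \(\beta\):

```python
import numpy as np

def swish(x, beta=1.0):
    # Swish: x * sigmoid(beta * x).  With beta = 1 this is x / (1 + exp(-x)),
    # also known as SiLU.
    return x / (1.0 + np.exp(-beta * x))
```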
2.2 SELU
There is also SELU, the “self-normalising” scaled exponential linear unit (Klambauer et al. 2017), designed so that activations stay approximately zero-mean and unit-variance as they propagate through a deep network, without explicit normalisation layers.
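A sketch, using the constants from Klambauer et al. (2017), quoted here to limited precision:

```python
import numpy as np

SELU_ALPHA = 1.6732632423543772   # alpha from Klambauer et al. (2017)
SELU_LAMBDA = 1.0507009873554805  # lambda ("scale") from the same paper

def selu(x):
    # SELU: lambda * x for x > 0, lambda * alpha * (exp(x) - 1) otherwise.
    return SELU_LAMBDA * np.where(x > 0, x, SELU_ALPHA * np.expm1(x))
```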
3 Snake
Ziyin, Hartwig, and Ueda (2020) credit the Snake activation function to Ramachandran, Zoph, and Le (2017), but I couldn’t find it in that paper. Anyway, it’s supposed to be helpful for learning periodic functions.
\(\operatorname{Snake}_a : x \mapsto x+\frac{1}{a} \sin^2(a x)\)
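In code, something like:

```python
import numpy as np

def snake(x, a=1.0):
    # Snake: x + sin^2(a x) / a.  The frequency a can be fixed or treated
    # as a trainable parameter.
    return x + np.sin(a * x) ** 2 / a
```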
4 Learnable activations
Learnable activations are a thing, e.g. Ramachandran, Zoph, and Le (2017), Agostinelli et al. (2015), and Lederer (2021); they achieve their apotheosis in Kolmogorov-Arnold Networks.
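As a minimal illustration of the idea, a PReLU-style learnable slope (which is not the exact parameterization of any of the papers above):

```python
import numpy as np

def prelu(x, alpha):
    # Parametric ReLU: the negative-side slope alpha is itself a trainable
    # parameter, learned jointly with the weights.  Richer parametric
    # families (adaptive piecewise-linear pieces, trainable Swish-beta, etc.)
    # work the same way: the activation's parameters simply join the
    # parameter vector that the optimizer updates.
    return np.where(x > 0, x, alpha * x)
```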
5 Kolmogorov-Arnold networks
A cute related case of a learnable activation function is the Kolmogorov-Arnold network (Liu, Wang, et al. 2024), where each edge learns an activation function and there are no other weights. This has various nice properties, such as being easy to compress, somehow. See Kolmogorov-Arnold Networks for a deeper treatment.
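To make that concrete, here is a toy KAN-style layer in NumPy in which every edge carries its own learnable univariate function. I use a linear combination of fixed Gaussian bumps purely for illustration; Liu, Wang, et al. (2024) use B-splines plus a base function, so treat this as a stand-in, not their parameterization:

```python
import numpy as np

rng = np.random.default_rng(0)

def kan_layer(n_in, n_out, n_basis=8):
    # Each edge (j, i) gets its own univariate function phi_{j,i},
    # parameterized by coefficients over a fixed set of basis functions.
    # The coefficients are the *only* learnable parameters: there is no
    # separate weight matrix.
    centers = np.linspace(-2.0, 2.0, n_basis)
    coeffs = rng.normal(scale=0.1, size=(n_out, n_in, n_basis))  # learnable

    def forward(x):                                   # x: shape (n_in,)
        basis = np.exp(-(x[:, None] - centers) ** 2)  # (n_in, n_basis)
        # Output j is the sum over inputs i of phi_{j,i}(x_i).
        return np.einsum("oib,ib->o", coeffs, basis)

    return forward

layer = kan_layer(3, 2)
print(layer(np.array([0.1, -0.5, 1.2])))
```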