Neural network activation functions
2017-01-12 — 2025-05-20
There is a cottage industry built upon showing that neural networks are reasonably universal function approximators with various nonlinearities as activations, under various conditions. Usually, we take it as a given that the particular activation function is not too important.
Sometimes, we like to play with the exact form of the nonlinearities, even making the nonlinearities themselves directly learnable. The idea is that some function shapes might have better approximation properties under various assumptions about the learning problem, in a sense I won’t make rigorous now. Vague hand-waving arguments are the whole point of deep learning. Taken to its extreme, learning activations instead of weights leads to Kolmogorov-Arnold networks.
I think a part of this field has been subsumed into the stability-of-dynamical-systems setting? Or we don’t care because of something-something BatchNorm?
1 ReLU
The current default activation function is ReLU, i.e. \(x\mapsto \max\{0,x\}\), which has many nice properties. However, it does lead to piecewise-linear spline approximators. One could regard that as a plus (Unser 2019), but OTOH it makes it hard to solve differential equations, since the second and higher derivatives of the network output vanish almost everywhere.
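For concreteness, here is a minimal NumPy sketch (my own, not from any of the cited papers) of ReLU and of why a ReLU network is a piecewise-linear spline in its input:

```python
import numpy as np

def relu(x):
    # ReLU: elementwise max(0, x).
    return np.maximum(0.0, x)

# A one-hidden-layer ReLU network is piecewise linear in its input:
# away from the "kinks" contributed by each hidden unit it is exactly affine,
# so all second and higher derivatives vanish.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(16, 1)), rng.normal(size=16)
w2 = rng.normal(size=16)

def net(x):
    return relu(W1 @ np.atleast_1d(x) + b1) @ w2

print([float(net(x)) for x in np.linspace(-3.0, 3.0, 7)])
```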
2 Continuously differentiable activations
Sometimes, we want something different. Other classic activations such as \(x\mapsto\tanh x\) have fallen out of favour, supplanted by ReLU. However, differentiable activations are useful, especially if higher-order gradients of the solution will be important, e.g. in implicit representation NNs. Many virtues of differentiable activation functions for that purpose are documented in Implicit Neural Representations with Periodic Activation Functions; Sitzmann et al. (2020) argue for \(x\mapsto\sin x\) on the basis of various handy properties. This seems to require careful initialization.
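A rough NumPy sketch of such a sine-activation layer. The initialization bounds below are my paraphrase of the scheme proposed by Sitzmann et al. (2020); check the paper for the exact recipe:

```python
import numpy as np

rng = np.random.default_rng(0)

def sine_layer(fan_in, fan_out, omega0=30.0, first=False):
    # Sine activation with (approximately) SIREN-style initialization:
    # the first layer draws weights from U(-1/fan_in, 1/fan_in), later layers
    # from U(-sqrt(6/fan_in)/omega0, sqrt(6/fan_in)/omega0), so that
    # pre-activations stay in a well-behaved range layer after layer.
    bound = 1.0 / fan_in if first else np.sqrt(6.0 / fan_in) / omega0
    W = rng.uniform(-bound, bound, size=(fan_out, fan_in))
    b = rng.uniform(-bound, bound, size=fan_out)
    return lambda x: np.sin(omega0 * (W @ x + b))

layer1 = sine_layer(1, 64, first=True)
layer2 = sine_layer(64, 64)
h = layer2(layer1(np.array([0.5])))  # shape (64,)
```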
2.1 Swish
Ramachandran, Zoph, and Le (2017) search through many candidate functions and find some fun ones, most famously Swish, \(x\mapsto \frac{x}{1+\exp(-x)}\). That is, it’s the classic sigmoid function multiplied by \(x\).
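A one-liner, if you want to play with it. Setting \(\beta=1\) recovers the form quoted above; the paper also considers a constant or trainable \(\beta\):

```python
import numpy as np

def swish(x, beta=1.0):
    # Swish: x * sigmoid(beta * x).  With beta = 1 this is x / (1 + exp(-x)),
    # also known as SiLU.
    return x / (1.0 + np.exp(-beta * x))
```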
2.2 SELU
There is also SELU, the “self-normalising” scaled exponential linear unit (Klambauer et al. 2017), designed so that activations stay approximately zero-mean and unit-variance as they propagate through a deep network, without explicit normalisation layers.
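A sketch, using the constants from Klambauer et al. (2017), quoted here to limited precision:

```python
import numpy as np

SELU_ALPHA = 1.6732632423543772   # alpha from Klambauer et al. (2017)
SELU_LAMBDA = 1.0507009873554805  # lambda ("scale") from the same paper

def selu(x):
    # SELU: lambda * x for x > 0, lambda * alpha * (exp(x) - 1) otherwise.
    return SELU_LAMBDA * np.where(x > 0, x, SELU_ALPHA * np.expm1(x))
```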
3 Snake
Ziyin, Hartwig, and Ueda (2020) credit the Snake activation function to Ramachandran, Zoph, and Le (2017), but I couldn’t find it in that paper. Anyway, it’s supposed to be helpful for learning periodic functions.
\(\operatorname{Snake}_a : x \mapsto x+\frac{1}{a} \sin^2(a x)\)
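In code, something like:

```python
import numpy as np

def snake(x, a=1.0):
    # Snake: x + sin^2(a x) / a.  The frequency a can be fixed or treated
    # as a trainable parameter.
    return x + np.sin(a * x) ** 2 / a
```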
4 Learnable activations
Learnable activations are a thing, e.g. Ramachandran, Zoph, and Le (2017), Agostinelli et al. (2015), and Lederer (2021); they achieve their apotheosis in Kolmogorov-Arnold Networks.
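As a minimal illustration of the idea, a PReLU-style learnable slope (which is not the exact parameterization of any of the papers above):

```python
import numpy as np

def prelu(x, alpha):
    # Parametric ReLU: the negative-side slope alpha is itself a trainable
    # parameter, learned jointly with the weights.  Richer parametric
    # families (adaptive piecewise-linear pieces, trainable Swish-beta, etc.)
    # work the same way: the activation's parameters simply join the
    # parameter vector that the optimizer updates.
    return np.where(x > 0, x, alpha * x)
```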
5 Kolmogorov-Arnold networks
A cute related case of a learnable activation function is the Kolmogorov-Arnold network (Liu, Wang, et al. 2024), where each edge learns an activation function and there are no other weights. This has various nice properties, such as being easy to compress, somehow. See Kolmogorov-Arnold Networks for a deeper treatment.
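To make that concrete, here is a toy KAN-style layer in NumPy in which every edge carries its own learnable univariate function. I use a linear combination of fixed Gaussian bumps purely for illustration; Liu, Wang, et al. (2024) use B-splines plus a base function, so treat this as a stand-in, not their parameterization:

```python
import numpy as np

rng = np.random.default_rng(0)

def kan_layer(n_in, n_out, n_basis=8):
    # Each edge (j, i) gets its own univariate function phi_{j,i},
    # parameterized by coefficients over a fixed set of basis functions.
    # The coefficients are the *only* learnable parameters: there is no
    # separate weight matrix.
    centers = np.linspace(-2.0, 2.0, n_basis)
    coeffs = rng.normal(scale=0.1, size=(n_out, n_in, n_basis))  # learnable

    def forward(x):                                   # x: shape (n_in,)
        basis = np.exp(-(x[:, None] - centers) ** 2)  # (n_in, n_basis)
        # Output j is the sum over inputs i of phi_{j,i}(x_i).
        return np.einsum("oib,ib->o", coeffs, basis)

    return forward

layer = kan_layer(3, 2)
print(layer(np.array([0.1, -0.5, 1.2])))
```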