Kolmogorov-Arnold neural networks

Don’t learn weights, learn activations that encode a physical process.

2024-10-13 — 2024-10-14

Wherein a Neural Architecture Is Described That Learns Univariate Activation Splines, Is Placed Between Symbolic Regression and MLPs, and an N^{-4} Scaling of Test Loss Is Reported.

A hyped variant of classic NNs.

Where the classic NN (i.e. the MLP) relies on layers of linear transformations (weights) and fixed activation functions (like ReLU or tanh) at the nodes, Kolmogorov-Arnold Networks (KANs) instead learn the activation functions themselves.

Interesting things about these networks, from my first impression:

They seem to fill a niche between symbolic regression and MLPs.
It seems to be easy to sparsify them in a way that is not possible with MLPs.

1 Kolmogorov-Arnold Theorem

The Kolmogorov-Arnold representation theorem states that any continuous multivariate function $f(x_1, \dots, x_n)$ on a bounded domain such as $[0,1]^n$ can be written using only continuous univariate functions and addition. The classic representation looks like this:

\[ f(x_1, \dots, x_n) = \sum_{q=1}^{2n+1} \Phi_q \left( \sum_{p=1}^{n} \varphi_{q,p}(x_p) \right) \]

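This means that any continuous multivariate function, however complicated, can be broken down into a composition of univariate functions plus some addition. Here the inner functions $\varphi_{q,p}$ and the outer functions $\Phi_q$ are all continuous and univariate.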
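As a small worked instance (my own illustration, not taken from the sources cited here), the product of two variables looks irreducibly bivariate, yet it can be written with univariate functions and addition only:

\[ xy = \tfrac{1}{4}(x+y)^2 - \tfrac{1}{4}(x-y)^2 = \Phi_1\bigl(\varphi_{1,1}(x) + \varphi_{1,2}(y)\bigr) + \Phi_2\bigl(\varphi_{2,1}(x) + \varphi_{2,2}(y)\bigr) \]

with $\Phi_1(u) = u^2/4$, $\Phi_2(u) = -u^2/4$, $\varphi_{1,1}(x) = \varphi_{2,1}(x) = x$, $\varphi_{1,2}(y) = y$ and $\varphi_{2,2}(y) = -y$. The theorem guarantees that $2n+1$ outer terms always suffice; this example happens to need only two, and the remaining $\Phi_q$ can be taken to be zero.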

2 KAN networks

In a KAN (Liu, Wang, et al. 2024), we learn how these univariate functions compose themselves into a multivariate structure, instead of fixing the composition in advance. The “weights” between nodes, represented by splines or other parameterized functions, are free to learn what the best local univariate relationship is.

KANs are structured by stacking KAN layers, where each layer looks something like this:

\[ x_{l+1,j} = \sum_{i=1}^{n_l} \varphi_{l,j,i}(x_{l,i}) \]

where $\varphi_{l,j,i}$ is a learnable univariate function (parameterized as a spline, for instance), and $x_{l,i}$ is the $i$-th activation value of layer $l$. So while each function is univariate, the layer as a whole is still a multivariate map, because every output sums contributions from all inputs. Writing $\Phi_l$ for the map $x_l \mapsto x_{l+1}$ defined by layer $l$, the final function learned by a KAN is a composition of these layers:

\[ \operatorname{KAN}(x) = (\Phi_L \circ \Phi_{L-1} \circ \dots \circ \Phi_1)(x) \]

This makes KANs flexible like MLPs, but maybe the information is combined in a way that is more comprehensible to the human mind: We can visualize or probe the learned univariate functions $\varphi_{l,j,i}$.
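A minimal NumPy sketch of the layer equation and the layer composition above. This is not the implementation from Liu, Wang, et al. (2024), which uses B-splines. As a simplifying assumption, each $\varphi_{l,j,i}$ here is a piecewise-linear spline on a shared fixed grid, evaluated with `np.interp`. The names `PiecewiseLinearKANLayer` and `kan_forward` are hypothetical, and only the forward pass is shown (no training).

```python
import numpy as np


class PiecewiseLinearKANLayer:
    """Toy KAN layer: every edge (i -> j) carries its own univariate function.

    Assumption for illustration: each function is piecewise linear,
    defined by its values at the points of a shared, fixed grid.
    """

    def __init__(self, n_in, n_out, grid_size=8, x_range=(-1.0, 1.0), rng=None):
        rng = np.random.default_rng() if rng is None else rng
        self.grid = np.linspace(x_range[0], x_range[1], grid_size)
        # coef[j, i, :] holds the knot values of phi_{j,i}; these would be
        # the trainable parameters in a real model.
        self.coef = rng.normal(scale=0.1, size=(n_out, n_in, grid_size))

    def phi(self, j, i, x):
        """Evaluate the univariate function on edge i -> j.

        np.interp extrapolates with the end values outside the grid range.
        """
        return np.interp(x, self.grid, self.coef[j, i])

    def __call__(self, x):
        """Map a batch x of shape (batch, n_in) to shape (batch, n_out)."""
        n_out, n_in, _ = self.coef.shape
        out = np.zeros((x.shape[0], n_out))
        for j in range(n_out):
            for i in range(n_in):
                # x_{l+1, j} = sum_i phi_{l, j, i}(x_{l, i})
                out[:, j] += self.phi(j, i, x[:, i])
        return out


def kan_forward(layers, x):
    """Compose the layers in order: apply the first layer, then the next, etc."""
    for layer in layers:
        x = layer(x)
    return x


rng = np.random.default_rng(0)
layers = [PiecewiseLinearKANLayer(2, 3, rng=rng), PiecewiseLinearKANLayer(3, 1, rng=rng)]
x = rng.uniform(-1.0, 1.0, size=(5, 2))
y = kan_forward(layers, x)  # shape (5, 1)
```

Because every edge function lives in its own slice `coef[j, i]`, plotting `layer.grid` against `layer.coef[j, i]` is all it takes to inspect one learned univariate function in this toy version.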

3 As symbolic-ish regression

Symbolic regression tries to discover closed-form expressions (think $y = \sin(x) + x^2$) directly from the data, i.e. a symbolic representation of a function. Symbolic regression is powerful because it gives you a human-readable formula, something interpretable, but it is not robust to noise. The search space of possible functions is huge, and small changes in data or noise can cause symbolic regression to completely fail.

On the other side of the spectrum, we have traditional NNs (MLPs), which are universal function approximators but work as “black boxes.” They don’t tell us how they approximate a function; they just do it. We get almost zero interpretability.

A KAN can, in theory, produce output that mimics symbolic regression by learning a function’s compositional structure. For example, if we’re modelling something like:

\[ f(x, y) = \exp(\sin(\pi x) + y^2) \]

An MLP would approximate this with layers of matrix multiplications and fixed activations (like ReLU). A KAN, by contrast, can potentially learn the internal univariate functions themselves ($\sin(\pi x)$, $y^2$ and $\exp$) and how to combine them. Once the KAN has trained, we could probe its learned activation functions and discover, for example, that it has closely approximated $\sin(\pi x)$ and $\exp$ as part of its learned structure.
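One way to make "probing" concrete is to compare samples of a learned activation against a small library of candidate formulas. The sketch below is a simplified illustration under my own assumptions, not the procedure of any particular KAN library. The helper names (`r_squared_after_affine_fit`, `best_symbolic_match`) and the candidate list are hypothetical, only an affine fit $c\,g(u) + d$ is tried, and the "learned" activation is a synthetic stand-in rather than the output of a trained network.

```python
import numpy as np

# Hypothetical library of candidate univariate formulas.
CANDIDATES = {
    "sin(pi*u)": lambda u: np.sin(np.pi * u),
    "u^2": lambda u: u**2,
    "exp(u)": np.exp,
}


def r_squared_after_affine_fit(u, v, g):
    """Fit v ~ c * g(u) + d by least squares and return the R^2 of the fit."""
    design = np.column_stack([g(u), np.ones_like(u)])
    coef, *_ = np.linalg.lstsq(design, v, rcond=None)
    residual = v - design @ coef
    return 1.0 - residual.var() / v.var()


def best_symbolic_match(u, v):
    """Return the best-scoring candidate name and the R^2 of every candidate."""
    scores = {
        name: r_squared_after_affine_fit(u, v, g) for name, g in CANDIDATES.items()
    }
    return max(scores, key=scores.get), scores


# Synthetic stand-in for samples of a trained activation (not real KAN output).
u = np.linspace(-1.0, 1.0, 200)
learned = 0.7 * np.sin(np.pi * u) + 0.1

name, scores = best_symbolic_match(u, learned)
print(name, scores)
```

With noisy or only approximately matching activations the scores will be closer together, which is where the robustness caveats about symbolic regression apply again.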

4 Scaling accuracy

The paper claims that KANs enjoy a neural scaling law of $\ell \propto N^{-4}$ (where $\ell$ is the test loss and $N$ is the number of parameters).
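To get a feel for what an exponent of 4 means: doubling $N$ would cut the loss by a factor of $2^4 = 16$. If we had measured (parameter count, test loss) pairs, the exponent could be estimated as the slope of a straight-line fit in log-log space. The sketch below does that on synthetic data generated exactly from $\ell = C N^{-4}$ with an arbitrary $C = 3$, so it demonstrates the fitting procedure only and reproduces no result from the paper. The function name `fit_scaling_exponent` is hypothetical.

```python
import numpy as np


def fit_scaling_exponent(n_params, test_loss):
    """Least-squares fit of log(loss) = log(C) - alpha * log(N).

    Returns (alpha, C) for the power law loss = C * N**(-alpha).
    """
    slope, intercept = np.polyfit(np.log(n_params), np.log(test_loss), deg=1)
    return -slope, np.exp(intercept)


# Synthetic, noise-free data generated from the claimed law (not measurements).
n_params = np.array([100.0, 200.0, 400.0, 800.0, 1600.0])
synthetic_loss = 3.0 * n_params**-4.0

alpha, C = fit_scaling_exponent(n_params, synthetic_loss)
print(f"alpha = {alpha:.3f}, C = {C:.3f}")
```

On real training runs the points will not lie exactly on a line, so the fitted slope is only an estimate of the exponent.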

5 References

Abueidda, Pantidis, and Mobasher. 2024. “DeepOKAN: Deep Operator Network Based on Kolmogorov Arnold Networks for Mechanics Problems.”

Genet and Inzirillo. 2024. “TKAN: Temporal Kolmogorov-Arnold Networks.”

Howard, Jacob, Murphy, et al. 2024. “Finite Basis Kolmogorov-Arnold Networks: Domain Decomposition for Data-Driven and Physics-Informed Problems.”

Koenig, Kim, and Deng. 2024. “KAN-ODEs: Kolmogorov–Arnold Network Ordinary Differential Equations for Learning Dynamical Systems and Hidden Physics.” Computer Methods in Applied Mechanics and Engineering.

Li. 2024. “Kolmogorov-Arnold Networks Are Radial Basis Function Networks.”

Liu, Ma, Wang, et al. 2024. “KAN 2.0: Kolmogorov-Arnold Networks Meet Science.”

Liu, Wang, Vaidya, et al. 2024. “KAN: Kolmogorov-Arnold Networks.”

Shen, Zeng, Wang, et al. 2024. “Reduced Effectiveness of Kolmogorov-Arnold Networks on Functions with Noise.”

Vaca-Rubio, Blanco, Pereira, et al. 2024. “Kolmogorov-Arnold Networks (KANs) for Time Series Analysis.”

Wang, Sun, Bai, et al. 2024. “Kolmogorov Arnold Informed Neural Network: A Physics-Informed Deep Learning Framework for Solving Forward and Inverse Problems Based on Kolmogorov Arnold Networks.”

Xu, Chen, and Wang. 2024. “Kolmogorov-Arnold Networks for Time Series: Bridging Predictive Power and Interpretability.”

Yang and Wang. 2024. “Kolmogorov-Arnold Transformer.”