Neural tangent kernel
December 9, 2020 — October 14, 2022
See also: infinite width networks, statistical mechanics of statistics.
Good starting points: Lilian Weng, Some Math behind Neural Tangent Kernel, and Ferenc Huszár, Some Intuition on the Neural Tangent Kernel, which walks through the paper (Lee et al. 2019).
It turns out the neural tangent kernel becomes particularly useful when studying learning dynamics in infinitely wide feed-forward neural networks. Why? Because in this limit, two things happen:
- First: if we initialize \(θ_0\) randomly from appropriately chosen distributions, the initial NTK of the network \(k_{θ_0}\) approaches a deterministic kernel as the width increases. That is, at initialization \(k_{θ_0}\) doesn’t really depend on the particular random draw of \(θ_0\); it is a fixed kernel independent of the specific initialization (see the sketch after this list).
- Second: in the infinite-width limit the kernel \(k_{θ_t}\) stays constant over time as we optimise \(\theta_t\). This removes the parameter dependence during training.
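To make the object concrete, here is a minimal sketch in plain JAX (all function and variable names are mine, not taken from any of the sources above) of the empirical NTK \(k_θ(x, x') = \langle ∇_θ f(x; θ), ∇_θ f(x'; θ) \rangle\) for a one-hidden-layer network in the NTK parameterisation. Re-running it over several seeds at increasing widths should show the initial kernel value concentrating, in line with the first fact.

```python
import jax
import jax.numpy as jnp

def init_params(key, width, in_dim=1):
    # Standard-normal weights; the 1/sqrt(fan-in) scaling lives in the forward
    # pass (the "NTK parameterisation").
    k1, k2 = jax.random.split(key)
    return {"W1": jax.random.normal(k1, (width, in_dim)),
            "W2": jax.random.normal(k2, (1, width))}

def f(params, x):
    # One-hidden-layer network with a scalar output.
    h = jnp.tanh(params["W1"] @ x / jnp.sqrt(x.shape[0]))
    return (params["W2"] @ h / jnp.sqrt(h.shape[0]))[0]

def empirical_ntk(params, x1, x2):
    # k_theta(x1, x2) = <grad_theta f(x1), grad_theta f(x2)>
    g1 = jax.grad(f)(params, x1)
    g2 = jax.grad(f)(params, x2)
    return sum(jnp.vdot(g1[k], g2[k]) for k in g1)

x, xp = jnp.array([0.3]), jnp.array([-0.7])
for width in (10, 100, 1000, 10000):
    vals = [empirical_ntk(init_params(jax.random.PRNGKey(seed), width), x, xp)
            for seed in range(5)]
    # The spread across seeds should shrink as the width grows.
    print(width, [round(float(v), 3) for v in vals])
```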
These two facts put together imply that gradient descent, in the limit of infinite width and infinitesimally small learning rate, can be understood as a pretty simple algorithm called kernel gradient descent with a fixed kernel function that depends only on the architecture (number of layers, activations, etc.).
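For intuition, a minimal sketch of kernel gradient descent under squared-error loss: only function values are updated, via \(f \leftarrow f - η K (f - y)\) on the training set, plus the corresponding update with the cross-kernel on test points. The RBF kernel below is an arbitrary stand-in for the architecture-determined NTK, and the data is made up for illustration.

```python
import jax.numpy as jnp

def rbf_kernel(X1, X2, lengthscale=1.0):
    # Squared-exponential kernel; a stand-in for the fixed kernel of some architecture.
    d2 = jnp.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return jnp.exp(-0.5 * d2 / lengthscale**2)

def kernel_gradient_descent(X_train, y_train, X_test, eta=0.1, steps=500):
    K_tt = rbf_kernel(X_train, X_train)  # kernel among training points
    K_st = rbf_kernel(X_test, X_train)   # kernel between test and training points
    f_train = jnp.zeros_like(y_train)    # function values at "initialization"
    f_test = jnp.zeros(X_test.shape[0])
    for _ in range(steps):
        residual = f_train - y_train
        # Gradient descent on the function values themselves, not on parameters.
        f_train = f_train - eta * K_tt @ residual
        f_test = f_test - eta * K_st @ residual
    return f_test

# Toy 1-d regression problem, made up for illustration.
X_train = jnp.linspace(-3.0, 3.0, 20)[:, None]
y_train = jnp.sin(X_train[:, 0])
X_test = jnp.linspace(-3.0, 3.0, 101)[:, None]
print(kernel_gradient_descent(X_train, y_train, X_test)[:5])
```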
These results, taken together with an older known result (Neal 1996), allow us to characterise the probability distribution over the trained functions that gradient descent converges to in this infinite-width limit as a Gaussian process.
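Concretely (as in Lee et al. 2019, stated here from memory, so treat the details with caution): for squared-error loss trained to convergence, the limiting predictive mean at a test input \(x_*\), averaged over random initializations, takes the kernel-regression form

\[
\mu(x_*) = \Theta(x_*, X)\, \Theta(X, X)^{-1} y,
\]

where \(\Theta\) is the limiting NTK and \(X, y\) are the training inputs and targets; the predictive covariance also has a closed form, involving both \(\Theta\) and the NNGP kernel from the (Neal 1996) construction.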
google/neural-tangents: Fast and Easy Infinite Neural Networks in Python (paper), (poster), (blog post)
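A brief sketch of how I understand the neural-tangents API is used; treat the exact signatures as my assumption rather than authoritative, and see the linked docs for the real thing. `stax.serial` describes the architecture and returns, alongside the usual init/apply functions, a `kernel_fn` that evaluates the corresponding infinite-width kernels analytically.

```python
import jax.numpy as jnp
from neural_tangents import stax  # assumed import path, per my reading of the docs

# Architecture description; the returned kernel_fn depends only on this.
init_fn, apply_fn, kernel_fn = stax.serial(
    stax.Dense(512), stax.Relu(),
    stax.Dense(512), stax.Relu(),
    stax.Dense(1),
)

x1 = jnp.linspace(-1.0, 1.0, 10)[:, None]
x2 = jnp.linspace(-1.0, 1.0, 7)[:, None]

# Analytic infinite-width NTK between the two batches of inputs.
ntk = kernel_fn(x1, x2, get="ntk")
print(ntk.shape)  # expected: (10, 7)
```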
When are Neural Networks more powerful than Neural Tangent Kernels? introduces two interesting recent takes, Bai and Lee (2020) and M. Chen et al. (2021), which consider quadratic approximations instead of mere linearization. It is a sequel to Ultra-Wide Deep Nets and Neural Tangent Kernel.
Reverse engineering the NTK: towards first-principles architecture design