Neural tangent kernel

December 9, 2020 — October 14, 2022

algebra
Bayes
functional analysis
Gaussian
Hilbert space
kernel tricks
machine learning
metrics
model selection
neural nets
nonparametric
optimization
probabilistic algorithms
SDEs
stochastic processes

See also: infinite width networks.

Good starting points: Lilian Weng, Some Math behind Neural Tangent Kernel. Ferenc Huszár provides Some Intuition on the Neural Tangent Kernel, a commentary on the paper by Lee et al. (2019).

It turns out the neural tangent kernel becomes particularly useful when studying learning dynamics in infinitely wide feed-forward neural networks. Why? Because in this limit, two things happen:

  1. First: if we initialize \(θ_0\) randomly from appropriately chosen distributions, the initial NTK of the network \(k_{θ_0}\) approaches a deterministic kernel as the width increases. This means that, at initialization, \(k_{θ_0}\) doesn’t really depend on \(θ_0\) but is a fixed kernel independent of the specific initialization.
  2. Second: in the infinite-width limit the kernel \(k_{θ_t}\) stays constant over time as we optimise \(θ_t\). This removes the parameter dependence during training.
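The first of these facts can be checked numerically. The following is a minimal JAX sketch, not the construction used in any of the cited papers: it assumes a two-layer ReLU network in the NTK parameterisation (standard-normal weights, with the \(1/\sqrt{\text{width}}\) scaling applied in the forward pass), and the names `init_params`, `f`, `empirical_ntk` and `ntk_spread` are invented for illustration. It computes the empirical kernel \(k_{θ_0}(x, x') = \langle \nabla_θ f_{θ_0}(x), \nabla_θ f_{θ_0}(x') \rangle\) for many random draws of \(θ_0\) and compares how much it varies at a narrow and a wide width.

```python
import jax
import jax.numpy as jnp


def init_params(key, width, d_in=3):
    """Standard-normal weights; the NTK scaling is applied in the forward pass."""
    k1, k2 = jax.random.split(key)
    return {
        "W1": jax.random.normal(k1, (width, d_in)),
        "W2": jax.random.normal(k2, (width,)),
    }


def f(params, x):
    """Scalar-output two-layer ReLU network in the NTK parameterisation."""
    d_in = x.shape[0]
    width = params["W2"].shape[0]
    h = jax.nn.relu(params["W1"] @ x / jnp.sqrt(d_in))
    return params["W2"] @ h / jnp.sqrt(width)


def empirical_ntk(params, x1, x2):
    """k_theta(x1, x2): inner product of the parameter gradients at x1 and x2."""
    g1 = jax.tree_util.tree_leaves(jax.grad(f)(params, x1))
    g2 = jax.tree_util.tree_leaves(jax.grad(f)(params, x2))
    return sum(jnp.vdot(a, b) for a, b in zip(g1, g2))


def ntk_spread(width, n_seeds=20):
    """Standard deviation of k_{theta_0}(x1, x2) across random initialisations."""
    x1 = jnp.array([1.0, 0.5, -0.3])
    x2 = jnp.array([0.2, -1.0, 0.7])
    keys = jax.random.split(jax.random.PRNGKey(0), n_seeds)
    vals = jnp.stack([empirical_ntk(init_params(k, width), x1, x2) for k in keys])
    return float(jnp.std(vals))


narrow_spread = ntk_spread(width=16)
wide_spread = ntk_spread(width=4096)
print(narrow_spread, wide_spread)
```

Under these assumptions the spread across initialisations should shrink as the width grows, which is the finite-width shadow of the kernel becoming deterministic. The second fact (constancy during training) would need a training loop and is not shown here.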

These two facts together imply that gradient descent, in the limit of infinite width and infinitesimally small learning rate, can be understood as a fairly simple algorithm called kernel gradient descent, with a fixed kernel function that depends only on the architecture (number of layers, activations, etc.).
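To make kernel gradient descent concrete, here is a small JAX sketch under loud assumptions: squared loss, function values initialised at zero, a finite step size rather than the infinitesimal one, and an RBF kernel standing in for the NTK (any fixed positive-definite kernel would do for the illustration). The helper names `rbf_kernel` and `kernel_gradient_descent` are hypothetical. The update moves the function values at every point by the kernel-weighted training residuals, \(f(x) \leftarrow f(x) - \eta \sum_i k(x, x_i)\,(f(x_i) - y_i)\).

```python
import jax
import jax.numpy as jnp


def rbf_kernel(xa, xb, lengthscale=1.0):
    """Stand-in fixed kernel (an assumption; the NTK would take its place)."""
    sq_dists = (xa[:, None] - xb[None, :]) ** 2
    return jnp.exp(-0.5 * sq_dists / lengthscale**2)


def kernel_gradient_descent(k_train, k_test, y, lr=0.5, n_steps=2000):
    """Run kernel gradient descent on 0.5 * ||f(X) - y||^2, tracking test points too."""

    def step(_, state):
        f_train, f_test = state
        residual = f_train - y  # dL/df at the training points
        return (f_train - lr * k_train @ residual,
                f_test - lr * k_test @ residual)

    init = (jnp.zeros_like(y), jnp.zeros(k_test.shape[0]))
    return jax.lax.fori_loop(0, n_steps, step, init)


x_train = jnp.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y_train = jnp.sin(x_train)
x_test = jnp.array([-0.5, 0.5, 1.5])

k_train = rbf_kernel(x_train, x_train)
k_test = rbf_kernel(x_test, x_train)

f_train, f_test = kernel_gradient_descent(k_train, k_test, y_train)

# With zero initialisation the iteration should approach kernel regression.
closed_form = k_test @ jnp.linalg.solve(k_train, y_train)
print(f_test, closed_form)
```

Because the kernel never changes, the whole trajectory is linear in the residuals, and with zero initialisation the limit coincides with the kernel regression predictor \(k(x, X) K^{-1} y\). That linearity is what makes the infinite-width dynamics tractable.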

These results, combined with an older result (Neal 1996), let us characterise the probability distribution of the minima that gradient descent converges to in this infinite-width limit as a Gaussian process.
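A sketch of why the limit is Gaussian, under assumptions not spelled out above: squared loss, training inputs \(X\) with targets \(Y\), a fixed limiting NTK written \(\Theta\), and the output at initialisation distributed as a zero-mean Gaussian process with covariance \(K\) (the Neal-style prior). The symbols \(\Theta\), \(K\), \(X\) and \(Y\) are introduced here for illustration. Kernel gradient descent run to convergence is an affine map of the initial function, so Gaussianity carries over:

```latex
\begin{aligned}
f_\infty(x) &= f_0(x) + \Theta(x, X)\,\Theta(X, X)^{-1}\bigl(Y - f_0(X)\bigr),
  \qquad f_0 \sim \mathcal{GP}(0, K), \\[4pt]
\mathbb{E}\,[f_\infty(x)] &= \Theta(x, X)\,\Theta(X, X)^{-1}\,Y, \\[4pt]
\operatorname{Cov}\bigl(f_\infty(x), f_\infty(x')\bigr)
  &= K(x, x')
   - \Theta(x, X)\,\Theta(X, X)^{-1} K(X, x')
   - K(x, X)\,\Theta(X, X)^{-1}\,\Theta(X, x') \\
  &\quad + \Theta(x, X)\,\Theta(X, X)^{-1} K(X, X)\,\Theta(X, X)^{-1}\,\Theta(X, x').
\end{aligned}
```

The mean is the kernel regression predictor under \(\Theta\). The covariance is in general not the posterior covariance of a Gaussian process with kernel \(K\), so the distribution of trained networks is a Gaussian process but need not be a Bayesian posterior.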

1 References

Bai, and Lee. 2020. “Beyond Linearization: On Quadratic and Higher-Order Approximation of Wide Neural Networks.” arXiv:1910.01619 [Cs, Math, Stat].
Chen, Minshuo, Bai, Lee, et al. 2021. “Towards Understanding Hierarchical Learning: Benefits of Neural Representations.” arXiv:2006.13436 [Cs, Stat].
Chen, Lin, and Xu. 2020. “Deep Neural Tangent Kernel and Laplace Kernel Have the Same RKHS.” arXiv:2009.10683 [Cs, Math, Stat].
Fan, and Wang. 2020. “Spectra of the Conjugate Kernel and Neural Tangent Kernel for Linear-Width Neural Networks.” In Advances in Neural Information Processing Systems.
Fort, Dziugaite, Paul, et al. 2020. “Deep Learning Versus Kernel Learning: An Empirical Study of Loss Landscape Geometry and the Time Evolution of the Neural Tangent Kernel.” In Advances in Neural Information Processing Systems.
Geifman, Yadav, Kasten, et al. 2020. “On the Similarity Between the Laplace and Neural Tangent Kernels.” In arXiv:2007.01580 [Cs, Stat].
He, Lakshminarayanan, and Teh. 2020. “Bayesian Deep Ensembles via the Neural Tangent Kernel.” In Advances in Neural Information Processing Systems.
Jacot, Gabriel, and Hongler. 2018. “Neural Tangent Kernel: Convergence and Generalization in Neural Networks.” In Advances in Neural Information Processing Systems. NIPS’18.
Lee, Xiao, Schoenholz, et al. 2019. “Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent.” In Advances in Neural Information Processing Systems.
Liu, Zhu, and Belkin. 2020. “On the Linearity of Large Non-Linear Models: When and Why the Tangent Kernel Is Constant.” In Advances in Neural Information Processing Systems.
Neal. 1996. “Priors for Infinite Networks.” In Bayesian Learning for Neural Networks. Lecture Notes in Statistics.
Novak, Xiao, Hron, et al. 2019. “Neural Tangents: Fast and Easy Infinite Neural Networks in Python.” arXiv:1912.02803 [Cs, Stat].
Sachdeva, Dhaliwal, Wu, et al. 2022. “Infinite Recommendation Networks: A Data-Centric Approach.”
Simon, Anand, and DeWeese. 2022. “Reverse Engineering the Neural Tangent Kernel.”
Xu, Zhang, Li, et al. 2021. “How Neural Networks Extrapolate: From Feedforward to Graph Neural Networks.”