Large-width limits of neural nets.
Neural Network Gaussian Process
See Neural network Gaussian process on Wikipedia.
The field that sprang from the insight (Neal 1996a) that, in the infinite-width limit, deep NNs asymptotically approach Gaussian processes, which means there is GP theory we can draw upon. Even far from the infinite limit there are neural nets which exploit this; see random neural nets.
\[\begin{aligned} k(\mathbf{x}_p, \mathbf{x}_q) &= \mathbb{E}\big[ \psi(Z_p) \psi(Z_q) \big], \quad \text{ where} \\ \begin{pmatrix} Z_p \\ Z_q \end{pmatrix} &\sim \mathcal{N} \Bigg( \mathbf{0}, \underbrace{\begin{pmatrix} \mathbf{x}_p^\top \mathbf{x}_p & \mathbf{x}_p^\top \mathbf{x}_q \\ \mathbf{x}_q^\top \mathbf{x}_p & \mathbf{x}_q^\top \mathbf{x}_q \end{pmatrix}}_{:=\Sigma} \Bigg).\end{aligned}\] Note that \(\begin{pmatrix} Z_p \\ Z_q \end{pmatrix}\overset{d}{=} \operatorname{Chol}(\Sigma)\mathbf{Z}\) where \(\mathbf{Z}\sim \mathcal{N} \Bigg( \mathbf{0}, \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \Bigg).\)
Explicitly, writing \(\theta\) for the angle between \(\mathbf{x}_p\) and \(\mathbf{x}_q\) (so \(\cos\theta = \mathbf{x}_p^\top \mathbf{x}_q / (\|\mathbf{x}_p\|\,\|\mathbf{x}_q\|)\)), the lower-triangular factor is \(\operatorname{Chol}(\Sigma)= \begin{pmatrix} \|\mathbf{x}_p\| & 0 \\ \|\mathbf{x}_q\|\cos \theta & \|\mathbf{x}_q\|\sqrt{1-\cos^2 \theta} \end{pmatrix}.\)
The arc-cosine kernel of order \(1\), corresponding to the case where \(\psi\) is the ReLU, is given (Cho and Saul 2009) by \[\begin{aligned} k(\mathbf{x}_p, \mathbf{x}_q) &= \frac{\sigma_w^2 \Vert \mathbf{x}_p \Vert \Vert \mathbf{x}_q \Vert }{2\pi} \Big( \sin |\theta| + \big(\pi - |\theta| \big) \cos\theta \Big),\end{aligned}\] where \(\sigma_w^2\) is the weight-variance scale.
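As a sanity check, here is a minimal sketch (taking \(\sigma_w = 1\); the variable names are illustrative) comparing this closed form against a Monte Carlo estimate of \(\mathbb{E}[\psi(Z_p)\psi(Z_q)]\), sampling \((Z_p, Z_q)\) via the Cholesky factor of \(\Sigma\) as above.

```python
import numpy as np

rng = np.random.default_rng(0)
xp, xq = rng.normal(size=3), rng.normal(size=3)

# Closed-form order-1 arc-cosine kernel (Cho and Saul 2009), sigma_w = 1.
norm_p, norm_q = np.linalg.norm(xp), np.linalg.norm(xq)
theta = np.arccos(np.clip(xp @ xq / (norm_p * norm_q), -1.0, 1.0))
k_closed = norm_p * norm_q / (2 * np.pi) * (np.sin(theta) + (np.pi - theta) * np.cos(theta))

# Monte Carlo estimate of E[relu(Z_p) relu(Z_q)]:
# sample (Z_p, Z_q) = Chol(Sigma) @ Z with Z standard normal.
Sigma = np.array([[xp @ xp, xp @ xq],
                  [xq @ xp, xq @ xq]])
L = np.linalg.cholesky(Sigma)                    # lower-triangular factor
Z = L @ rng.standard_normal((2, 1_000_000))
relu = lambda z: np.maximum(z, 0.0)
k_mc = (relu(Z[0]) * relu(Z[1])).mean()

print(k_closed, k_mc)                            # agree up to Monte Carlo error
```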
Writing \(Z_p = \mathbf{w}^\top \mathbf{x}_p\) and \(Z_q = \mathbf{w}^\top \mathbf{x}_q\) with \(\mathbf{w}\sim\mathcal{N}(\mathbf{0}, I)\), and \(Z_{pi} := w_i x_{pi}\), \(Z_{qj} := w_j x_{qj}\) for the coordinate-wise contributions, we have (no summation over repeated indices) \[\begin{aligned} \kappa &= \mathbb{E} \big[ \psi\big(Z_p\big) \psi\big(Z_q \big) \big] \\ \frac{\partial \kappa}{\partial x_{pi}} x_{pi} &= \mathbb{E} \big[ \psi'\big(Z_p\big) \psi\big(Z_q \big) Z_{pi}\big] \\ \frac{\partial^2 \kappa}{\partial x_{pi} \partial x_{qj}} x_{pi} x_{qj} &= \mathbb{E} \big[ \psi'\big(Z_p\big) \psi'\big(Z_q \big) Z_{pi} Z_{qj} \big] \\ \frac{\partial^2 \kappa}{\partial x_{pi} \partial x_{pj}} x_{pi}x_{pj} &= \mathbb{E} \big[ \psi''\big(Z_p\big) \psi\big(Z_q \big) Z_{pi} Z_{pj} \big]\end{aligned}\]
For an activation that is positively homogeneous of degree one, such as the ReLU (so that \(\psi'(z)z=\psi(z)\) and \(\psi''(z)z^2=0\)), we can sum these derivatives over the coordinate indices: \[\begin{aligned} \sum_{i,j=1}^d \frac{\partial^2 \kappa}{\partial x_{pi} \partial x_{pj}} x_{pi} x_{pj} &= \mathbb{E} \big[ \psi''\big(Z_p\big) \psi\big(Z_q \big) (Z_p)^2 \big] = 0 \\ \sum_{i,j=1}^d \frac{\partial^2 \kappa}{\partial x_{qi} \partial x_{qj}} x_{qi} x_{qj} &= \mathbb{E} \big[ \psi''\big(Z_q\big) \psi\big(Z_p \big) (Z_q)^2 \big] = 0 \\ \sum_{i,j=1}^d \frac{\partial^2 \kappa}{\partial x_{pi} \partial x_{qj}} x_{pi} x_{qj}&= \kappa, \end{aligned}\] i.e. \[\begin{aligned} \mathbf{x}_p^\top\frac{\partial^2 \kappa}{ \partial \mathbf{x}_{p} \partial \mathbf{x}_{q}^\top} \mathbf{x}_q &=\kappa\\ \mathbf{x}_p^\top\frac{\partial^2 \kappa}{ \partial \mathbf{x}_{p} \partial \mathbf{x}_{p}^\top} \mathbf{x}_p &=0\\ \mathbf{x}_q^\top\frac{\partial^2 \kappa}{ \partial \mathbf{x}_{q} \partial \mathbf{x}_{q}^\top} \mathbf{x}_q &=0. \end{aligned}\]
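A quick numerical check of these contracted identities, as a sketch: differentiate the closed-form arc-cosine kernel with automatic differentiation (again taking \(\sigma_w = 1\); the function name `arccos1_kernel` is just for illustration).

```python
import jax
import jax.numpy as jnp

def arccos1_kernel(xp, xq):
    """Order-1 arc-cosine kernel (Cho and Saul 2009) for the ReLU, sigma_w = 1."""
    norm_p, norm_q = jnp.linalg.norm(xp), jnp.linalg.norm(xq)
    theta = jnp.arccos(jnp.clip(xp @ xq / (norm_p * norm_q), -1.0, 1.0))
    return norm_p * norm_q / (2 * jnp.pi) * (jnp.sin(theta) + (jnp.pi - theta) * jnp.cos(theta))

xp = jax.random.normal(jax.random.PRNGKey(0), (5,))
xq = jax.random.normal(jax.random.PRNGKey(1), (5,))

# Mixed second derivative, contracted with x_p and x_q: should equal kappa.
mixed = jax.jacfwd(jax.grad(arccos1_kernel, argnums=0), argnums=1)(xp, xq)
print(xp @ mixed @ xq, arccos1_kernel(xp, xq))

# Hessian in x_p alone, contracted twice with x_p: should vanish.
hess_pp = jax.hessian(arccos1_kernel, argnums=0)(xp, xq)
print(xp @ hess_pp @ xp)
```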
Neural Network Tangent Kernel
See Neural tangent kernel on Wikipedia. Ferenc Huszár provides some Intuition on the Neural Tangent Kernel, i.e. an intuitive explanation of the paper (Lee et al. 2019).
It turns out the neural tangent kernel becomes particularly useful when studying learning dynamics in infinitely wide feed-forward neural networks. Why? Because in this limit, two things happen:
- First: if we initialize \(θ_0\) randomly from appropriately chosen distributions, the initial NTK of the network \(k_{θ_0}\) approaches a deterministic kernel as the width increases. This means that, at initialization, \(k_{θ_0}\) doesn’t really depend on \(θ_0\) but is a fixed kernel independent of the specific initialization.
- Second: in the infinite-width limit the kernel \(k_{θ_t}\) stays constant over time as we optimise \(\theta_t\). This removes the parameter dependence during training.
These two facts put together imply that gradient descent in the infinitely wide and infinitesimally small learning rate limit can be understood as a pretty simple algorithm called kernel gradient descent with a fixed kernel function that depends only on the architecture (number of layers, activations, etc).
These results, taken together with an older known result (Neal 1996b), allow us to characterise the probability distribution of the minima that gradient descent converges to in this infinite-width limit as a Gaussian process.
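To make that concrete, here is a sketch of kernel gradient descent mean predictions under squared loss, in the spirit of the result in Lee et al. (2019) (starting from \(f_0 = 0\); learning-rate and loss-scaling conventions vary between papers). Any fixed positive-semidefinite kernel can be plugged in; an RBF kernel is used here purely as a stand-in for the NTK.

```python
import numpy as np
from scipy.linalg import expm

def rbf(A, B, lengthscale=1.0):
    """Stand-in PSD kernel; replace with an NTK evaluation."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(20, 1))
y = np.sin(X[:, 0])
X_test = np.linspace(-3, 3, 100)[:, None]

K_tt = rbf(X, X) + 1e-8 * np.eye(len(X))   # train/train kernel, jittered
K_st = rbf(X_test, X)                      # test/train kernel
eta = 1.0                                  # learning rate

def mean_prediction(t):
    # f_t(x*) = K(x*, X) K(X, X)^{-1} (I - exp(-eta K(X, X) t)) y
    decay = np.eye(len(X)) - expm(-eta * K_tt * t)
    return K_st @ np.linalg.solve(K_tt, decay @ y)

for t in (0.1, 1.0, 100.0):
    print(t, np.abs(mean_prediction(t) - np.sin(X_test[:, 0])).mean())
```

As \(t \to \infty\) the exponential term vanishes and the prediction converges to the kernel-regression / GP-posterior-mean form \(K(x_*, X)K(X,X)^{-1}\mathbf{y}\), which is the sense in which the trained infinite-width network looks like a GP.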
google/neural-tangents: Fast and Easy Infinite Neural Networks in Python
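A minimal sketch of that library’s `stax` API (as in its README; exact function names and signatures may differ between versions), computing the analytic infinite-width NNGP and NTK kernels of a small ReLU MLP:

```python
import jax
from neural_tangents import stax

# The widths passed to Dense only matter for the finite-width functions;
# kernel_fn is the analytic infinite-width kernel.
init_fn, apply_fn, kernel_fn = stax.serial(
    stax.Dense(512), stax.Relu(),
    stax.Dense(512), stax.Relu(),
    stax.Dense(1),
)

x1 = jax.random.normal(jax.random.PRNGKey(0), (3, 8))   # 3 inputs of dimension 8
x2 = jax.random.normal(jax.random.PRNGKey(1), (5, 8))

kernels = kernel_fn(x1, x2, ('nngp', 'ntk'))
print(kernels.nngp.shape, kernels.ntk.shape)             # both (3, 5)
```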
When are Neural Networks more powerful than Neural Tangent Kernels? – Off the convex path introduces two recent interesting takes, Bai and Lee (2020) and M. Chen et al. (2021), which consider quadratic approximations instead of mere linearization. It is a sequel to Ultra-Wide Deep Nets and Neural Tangent Kernel.
Implicit regularization
Here’s one interesting perspective (Zhang et al. 2017).
The effective capacity of neural networks is large enough for a brute-force memorization of the entire data set.
Even optimization on random labels remains easy. In fact, training time increases only by a small constant factor compared with training on the true labels.
Randomizing labels is solely a data transformation, leaving all other properties of the learning problem unchanged.
[…] Explicit regularization may improve generalization performance, but is neither necessary nor by itself sufficient for controlling generalization error. […] Appealing to linear models, we analyze how SGD acts as an implicit regularizer.
Dropout
See Dropout.
As stochastic processes
We can find an SDE representation for a given NN-style kernel if we can find Green’s functions \(G\) such that \(\sigma^2_\varepsilon \langle G_\cdot(\mathbf{x}_p), G_\cdot(\mathbf{x}_q)\rangle = \mathbb{E} \big[ \psi\big(Z_p\big) \psi\big(Z_q \big) \big].\) One solution, observed by my colleague Russell Tsuchida, is \[G_\mathbf{s}(\mathbf{x}_p) = \psi(\mathbf{s}^\top \mathbf{x}_p) \sqrt{\phi(\mathbf{s})},\] where \(\phi\) is the pdf of an independent standard multivariate normal vector; substituting this in and integrating over \(\mathbf{s}\) recovers the desired expectation. Is this solution unique in some sense? No.
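A quick numerical illustration of this observation (a sketch, taking \(\sigma_\varepsilon = 1\) and \(\psi =\) ReLU): the \(\sqrt{\phi}\) factors turn the \(L^2\) inner product into an expectation over \(\mathbf{s} \sim \mathcal{N}(\mathbf{0}, I)\), i.e. exactly a random-features expansion of the arc-cosine kernel above.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
xp, xq = rng.normal(size=d), rng.normal(size=d)
relu = lambda z: np.maximum(z, 0.0)

# <G_.(x_p), G_.(x_q)> = integral of relu(s @ x_p) relu(s @ x_q) phi(s) ds,
# estimated by averaging over samples s ~ N(0, I).
S = rng.standard_normal((1_000_000, d))
inner = (relu(S @ xp) * relu(S @ xq)).mean()

# Compare against the closed-form arc-cosine kernel E[psi(Z_p) psi(Z_q)].
theta = np.arccos(np.clip(xp @ xq / (np.linalg.norm(xp) * np.linalg.norm(xq)), -1.0, 1.0))
k = np.linalg.norm(xp) * np.linalg.norm(xq) / (2 * np.pi) * (
    np.sin(theta) + (np.pi - theta) * np.cos(theta))
print(inner, k)   # agree up to Monte Carlo error
```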