Infinite width limits of neural networks

Large-width limits of neural nets.

Neural Network Gaussian Process

See Neural network Gaussian process on Wikipedia.

The field that sprang from the insight (Neal 1996a) that, in the infinite-width limit, deep NNs converge to Gaussian processes, which lets us bring a well-developed body of GP theory to bear on them. Even far from the infinite limit there are neural nets which exploit this. See random neural nets.

\[\begin{aligned} k(\mathbf{x}_p, \mathbf{x}_q) &= \mathbb{E}\big[ \psi(Z_p) \psi(Z_q) \big], \quad \text{ where} \\ \begin{pmatrix} Z_p \\ Z_q \end{pmatrix} &\sim \mathcal{N} \Bigg( \mathbf{0}, \underbrace{\begin{pmatrix} \mathbf{x}_p^\top \mathbf{x}_p & \mathbf{x}_p^\top \mathbf{x}_q \\ \mathbf{x}_q^\top \mathbf{x}_p & \mathbf{x}_q^\top \mathbf{x}_q \end{pmatrix}}_{:=\Sigma} \Bigg).\end{aligned}\] Note that \(\begin{pmatrix} Z_p \\ Z_q \end{pmatrix}\overset{d}{=} \operatorname{Chol}(\Sigma)\mathbf{Z}\) where \(\mathbf{Z}\sim \mathcal{N} \Bigg( \mathbf{0}, \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \Bigg).\)

Explicitly, the (lower-triangular) factor is \(\operatorname{Chol}(\Sigma)= \begin{pmatrix} \|\mathbf{x}_p\| & 0 \\ \|\mathbf{x}_q\|\cos \theta & \|\mathbf{x}_q\|\sqrt{1-\cos^2 \theta} \end{pmatrix},\) where \(\theta\) is the angle between \(\mathbf{x}_p\) and \(\mathbf{x}_q\); one checks that \(\operatorname{Chol}(\Sigma)\operatorname{Chol}(\Sigma)^\top=\Sigma\).
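As a quick numerical sanity check (my own, with arbitrary example vectors), the lower-triangular factor with rows \((\|\mathbf{x}_p\|, 0)\) and \((\|\mathbf{x}_q\|\cos\theta, \|\mathbf{x}_q\|\sin\theta)\) does satisfy \(LL^\top=\Sigma\):

```python
import numpy as np

# arbitrary example vectors (any non-parallel pair works)
xp = np.array([1.0, 0.5])
xq = np.array([-0.3, 2.0])

cos_t = xp @ xq / (np.linalg.norm(xp) * np.linalg.norm(xq))
L = np.array([
    [np.linalg.norm(xp), 0.0],
    [np.linalg.norm(xq) * cos_t, np.linalg.norm(xq) * np.sqrt(1 - cos_t**2)],
])
Sigma = np.array([[xp @ xp, xp @ xq],
                  [xq @ xp, xq @ xq]])
assert np.allclose(L @ L.T, Sigma)  # L L^T recovers the Gram matrix
```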

The arc-cosine kernel of order \(1\), which corresponds to taking \(\psi\) to be the ReLU, is given by (Cho and Saul 2009) \[\begin{aligned} k(\mathbf{x}_p, \mathbf{x}_q) &= \frac{\sigma_w^2 \Vert \mathbf{x}_p \Vert \Vert \mathbf{x}_q \Vert }{2\pi} \Big( \sin \theta + \big(\pi - \theta \big) \cos\theta \Big), \end{aligned}\] where \(\theta \in [0, \pi]\) is the angle between \(\mathbf{x}_p\) and \(\mathbf{x}_q\).
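A sketch of a sanity check (example vectors and sample size are my own choices, with \(\sigma_w = 1\)): estimate \(\mathbb{E}[\psi(Z_p)\psi(Z_q)]\) by Monte Carlo, sampling \((Z_p, Z_q)\) via the Cholesky factor, and compare against the arc-cosine closed form.

```python
import numpy as np

rng = np.random.default_rng(0)

def arccos_kernel(xp, xq, sigma_w=1.0):
    # Closed-form order-1 arc-cosine kernel (Cho & Saul 2009) for ReLU psi.
    n_p, n_q = np.linalg.norm(xp), np.linalg.norm(xq)
    theta = np.arccos(np.clip(xp @ xq / (n_p * n_q), -1.0, 1.0))
    return sigma_w**2 * n_p * n_q / (2 * np.pi) * (
        np.sin(theta) + (np.pi - theta) * np.cos(theta))

def mc_kernel(xp, xq, n=200_000):
    # Monte Carlo estimate of E[psi(Z_p) psi(Z_q)], (Z_p, Z_q) ~ N(0, Sigma).
    Sigma = np.array([[xp @ xp, xp @ xq], [xq @ xp, xq @ xq]])
    L = np.linalg.cholesky(Sigma)          # lower-triangular factor
    Z = L @ rng.standard_normal((2, n))    # correlated Gaussian pairs
    return np.mean(np.maximum(Z[0], 0.0) * np.maximum(Z[1], 0.0))

xp = np.array([1.0, 0.5]); xq = np.array([-0.3, 2.0])
print(arccos_kernel(xp, xq), mc_kernel(xp, xq))  # agree to MC accuracy
```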

Writing \(Z_p=\mathbf{w}^\top\mathbf{x}_p\) with \(\mathbf{w}\sim\mathcal{N}(\mathbf{0},I)\), and \(Z_{pi}:=w_i x_{pi}\), we have \[\begin{aligned} \kappa &= \mathbb{E} \big[ \psi\big(Z_p\big) \psi\big(Z_q \big) \big] \\ \frac{\partial \kappa}{\partial x_{pi}} x_{pi} &= \mathbb{E} \big[ \psi'\big(Z_p\big) \psi\big(Z_q \big) Z_{pi}\big] \\ \frac{\partial^2 \kappa}{\partial x_{pi} \partial x_{qj}} x_{pi} x_{qj} &= \mathbb{E} \big[ \psi'\big(Z_p\big) \psi'\big(Z_q \big) Z_{pi} Z_{qj} \big] \\ \frac{\partial^2 \kappa}{\partial x_{pi} \partial x_{pj}} x_{pi}x_{pj} &= \mathbb{E} \big[ \psi''\big(Z_p\big) \psi\big(Z_q \big) Z_{pi} Z_{pj} \big]\end{aligned}\]

For an absolutely homogeneous activation we can sum the derivatives over the coordinate indices \[\begin{aligned} \sum_{i,j=1}^d \frac{\partial^2 \kappa}{\partial x_{pi} \partial x_{pj}} x_{pi} x_{pj} &= \mathbb{E} \big[ \psi''\big(Z_p\big) \psi\big(Z_q \big) (Z_p)^2 \big] = 0 \\ \sum_{i,j=1}^d \frac{\partial^2 \kappa}{\partial x_{qi} \partial x_{qj}} x_{qi} x_{qj} &= \mathbb{E} \big[ \psi''\big(Z_q\big) \psi\big(Z_p \big) (Z_q)^2 \big] = 0 \\ \sum_{i,j=1}^d \frac{\partial^2 \kappa}{\partial x_{pi} \partial x_{qj}} x_{pi} x_{qj}&= \kappa. \end{aligned}\] i.e. \[\begin{aligned} \mathbf{x}_p\frac{\partial^2 \kappa}{ \partial \mathbf{x}_{p} \partial \mathbf{x}_{q}^\top} \mathbf{x}_q^{\top} &=\kappa\\ \mathbf{x}_p\frac{\partial^2 \kappa}{ \partial \mathbf{x}_{p} \partial \mathbf{x}_{p}^\top} \mathbf{x}_p^{\top} &=0\\ \mathbf{x}_q\frac{\partial^2 \kappa}{ \partial \mathbf{x}_{q} \partial \mathbf{x}_{q}^\top} \mathbf{x}_q^{\top} &=0. \end{aligned}\]
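We can check the first identity numerically in the ReLU case (my own check, arbitrary example vectors): by the chain rule, the mixed derivative \(\partial^2 f/\partial a\,\partial b\) of \(f(a,b)=\kappa(a\mathbf{x}_p,b\mathbf{x}_q)\) at \(a=b=1\) equals \(\mathbf{x}_p\, \frac{\partial^2 \kappa}{\partial \mathbf{x}_p \partial \mathbf{x}_q^\top}\, \mathbf{x}_q^{\top}\), so a finite-difference estimate of it should recover \(\kappa\) itself.

```python
import numpy as np

def kappa(xp, xq):
    # order-1 arc-cosine kernel (ReLU activation, sigma_w = 1)
    n_p, n_q = np.linalg.norm(xp), np.linalg.norm(xq)
    theta = np.arccos(np.clip(xp @ xq / (n_p * n_q), -1.0, 1.0))
    return n_p * n_q / (2 * np.pi) * (np.sin(theta) + (np.pi - theta) * np.cos(theta))

xp = np.array([1.0, 0.5]); xq = np.array([-0.3, 2.0])
h = 1e-4
# f(a, b) = kappa(a*xp, b*xq); its mixed second derivative at (1, 1)
# equals x_p (d^2 kappa / dx_p dx_q^T) x_q^T by the chain rule.
f = lambda a, b: kappa(a * xp, b * xq)
mixed = (f(1+h, 1+h) - f(1+h, 1-h) - f(1-h, 1+h) + f(1-h, 1-h)) / (4 * h**2)
print(mixed, kappa(xp, xq))  # the two agree
```

(The kernel is \(1\)-homogeneous in each argument, so \(f(a,b)=ab\,\kappa\) exactly and the finite difference is essentially exact.)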

Neural Network Tangent Kernel

See Neural tangent kernel on Wikipedia. Ferenc Huszár provides some Intuition on the Neural Tangent Kernel, i.e. the paper (Lee et al. 2019).

It turns out the neural tangent kernel becomes particularly useful when studying learning dynamics in infinitely wide feed-forward neural networks. Why? Because in this limit, two things happen:

  1. First: if we initialize \(θ_0\) randomly from appropriately chosen distributions, the initial NTK of the network \(k_{θ_0}\) approaches a deterministic kernel as the width increases. This means that, at initialization, \(k_{θ_0}\) doesn’t really depend on \(θ_0\): it is a fixed kernel independent of the specific initialization.
  2. Second: in the infinite-width limit, the kernel \(k_{θ_t}\) stays constant over time as we optimise \(\theta_t\). This removes the parameter dependence during training.

These two facts put together imply that gradient descent in the infinitely wide and infinitesimally small learning rate limit can be understood as a pretty simple algorithm called kernel gradient descent with a fixed kernel function that depends only on the architecture (number of layers, activations, etc).
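A rough numerical sketch of the first fact (my own construction, not from the cited papers; widths, inputs, and tolerances are arbitrary choices): for a one-hidden-layer ReLU network in the NTK parameterisation \(f(\mathbf{x})=\mathbf{a}^\top\psi(W\mathbf{x})/\sqrt{m}\), the empirical tangent kernel at a random initialization approaches a deterministic limit, namely an arc-cosine kernel plus a derivative term, as the width \(m\) grows.

```python
import numpy as np

rng = np.random.default_rng(1)
m, d = 8192, 3                       # width, input dimension
W = rng.standard_normal((m, d))      # hidden weights ~ N(0, 1)
a = rng.standard_normal(m)           # output weights ~ N(0, 1)

def empirical_ntk(x1, x2):
    # <grad_theta f(x1), grad_theta f(x2)> for f(x) = a . relu(W x) / sqrt(m)
    h1, h2 = W @ x1, W @ x2
    g_a = np.maximum(h1, 0) @ np.maximum(h2, 0)              # df/da_i terms
    g_W = ((a**2) * (h1 > 0) * (h2 > 0)).sum() * (x1 @ x2)   # df/dW_i terms
    return (g_a + g_W) / m

def limit_ntk(x1, x2):
    # Infinite-width NTK: order-1 arc-cosine kernel + (x1.x2) * P(Z1>0, Z2>0)
    n1, n2 = np.linalg.norm(x1), np.linalg.norm(x2)
    theta = np.arccos(np.clip(x1 @ x2 / (n1 * n2), -1.0, 1.0))
    k1 = n1 * n2 / (2 * np.pi) * (np.sin(theta) + (np.pi - theta) * np.cos(theta))
    k0 = (np.pi - theta) / (2 * np.pi)
    return k1 + (x1 @ x2) * k0

x1 = np.array([1.0, 0.0, 0.5]); x2 = np.array([0.2, -1.0, 0.3])
print(empirical_ntk(x1, x2), limit_ntk(x1, x2))  # close for large m
```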

These results, taken together with an older known result (Neal 1996b), allow us to characterise the probability distribution of the minima that gradient descent converges to in this infinite limit as a Gaussian process.

Implicit regularization

Here’s one interesting perspective (Zhang et al. 2017).

  • The effective capacity of neural networks is large enough for a brute-force memorization of the entire data set.

  • Even optimization on random labels remains easy. In fact, training time increases only by a small constant factor compared with training on the true labels.

  • Randomizing labels is solely a data transformation, leaving all other properties of the learning problem unchanged.

[…] Explicit regularization may improve generalization performance, but is neither necessary nor by itself sufficient for controlling generalization error. […] Appealing to linear models, we analyze how SGD acts as an implicit regularizer.
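A toy illustration of the memorisation point, in the spirit of (Zhang et al. 2017) but not taken from the paper (all sizes here are arbitrary choices of mine): an over-parameterised random-feature model fits completely random labels exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 50, 10, 200                    # samples, input dim, random features
X = rng.standard_normal((n, d))
y = rng.integers(0, 2, n) * 2.0 - 1.0    # random +/-1 labels: no signal at all

W = rng.standard_normal((d, m))
Phi = np.maximum(X @ W, 0)               # random ReLU features, m >> n

# least-squares fit; with full row rank the training error is exactly zero
beta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(np.abs(Phi @ beta - y).max())      # ~0: random labels memorised
```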


See Dropout.

As stochastic processes

We can find an SDE for a given NN-style kernel if we can find Green’s functions satisfying \(\sigma^2_\varepsilon \langle G_\cdot(\mathbf{x}_p), G_\cdot(\mathbf{x}_q)\rangle = \mathbb{E} \big[ \psi\big(Z_p\big) \psi\big(Z_q \big) \big].\) My colleague Russell Tsuchida observes that one solution is \[G_\mathbf{s}(\mathbf{x}_p) = \psi(\mathbf{s}^\top \mathbf{x}_p) \sqrt{\phi(\mathbf{s})},\] where \(\phi\) is the pdf of a standard multivariate normal vector: the inner product is then \(\int \psi(\mathbf{s}^\top \mathbf{x}_p)\psi(\mathbf{s}^\top \mathbf{x}_q)\phi(\mathbf{s})\,d\mathbf{s}\), which is the desired expectation since \((\mathbf{s}^\top \mathbf{x}_p, \mathbf{s}^\top \mathbf{x}_q)\overset{d}{=}(Z_p,Z_q)\) for \(\mathbf{s}\sim\mathcal{N}(\mathbf{0},I)\). Whether this solution is unique in any useful sense is not clear to me.
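A Monte Carlo sketch of that inner-product claim (my own check, with ReLU \(\psi\) and arbitrary example vectors): sampling \(\mathbf{s}\sim\mathcal{N}(\mathbf{0},I)\) and averaging \(\psi(\mathbf{s}^\top\mathbf{x}_p)\psi(\mathbf{s}^\top\mathbf{x}_q)\) recovers the arc-cosine closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
xp = np.array([1.0, 0.5]); xq = np.array([-0.3, 2.0])
relu = lambda z: np.maximum(z, 0.0)

# <G(x_p), G(x_q)> = int psi(s.x_p) psi(s.x_q) phi(s) ds, estimated by sampling s
S = rng.standard_normal((500_000, 2))
inner = np.mean(relu(S @ xp) * relu(S @ xq))

# closed-form order-1 arc-cosine kernel for comparison
n_p, n_q = np.linalg.norm(xp), np.linalg.norm(xq)
theta = np.arccos(np.clip(xp @ xq / (n_p * n_q), -1.0, 1.0))
closed = n_p * n_q / (2 * np.pi) * (np.sin(theta) + (np.pi - theta) * np.cos(theta))
print(inner, closed)  # agree to MC accuracy
```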


Adlam, Ben, Jaehoon Lee, Lechao Xiao, Jeffrey Pennington, and Jasper Snoek. 2020. “Exploring the Uncertainty Properties of Neural Networks’ Implicit Priors in the Infinite-Width Limit.” October 14, 2020.
Arora, Sanjeev, Simon S Du, Wei Hu, Zhiyuan Li, Russ R Salakhutdinov, and Ruosong Wang. 2019. “On Exact Computation with an Infinitely Wide Neural Net.” In Advances in Neural Information Processing Systems, 10.
Bai, Yu, and Jason D. Lee. 2020. “Beyond Linearization: On Quadratic and Higher-Order Approximation of Wide Neural Networks.” February 14, 2020.
Belkin, Mikhail, Siyuan Ma, and Soumik Mandal. 2018. “To Understand Deep Learning We Need to Understand Kernel Learning.” In International Conference on Machine Learning, 541–49.
Chen, Lin, and Sheng Xu. 2020. “Deep Neural Tangent Kernel and Laplace Kernel Have the Same RKHS.” October 1, 2020.
Chen, Minshuo, Yu Bai, Jason D. Lee, Tuo Zhao, Huan Wang, Caiming Xiong, and Richard Socher. 2021. “Towards Understanding Hierarchical Learning: Benefits of Neural Representations.” March 5, 2021.
Cho, Youngmin, and Lawrence K. Saul. 2009. “Kernel Methods for Deep Learning.” In Proceedings of the 22nd International Conference on Neural Information Processing Systems, 22:342–50. NIPS’09. Red Hook, NY, USA: Curran Associates Inc.
Domingos, Pedro. 2020. “Every Model Learned by Gradient Descent Is Approximately a Kernel Machine.” November 30, 2020.
Fan, Zhou, and Zhichao Wang. 2020. “Spectra of the Conjugate Kernel and Neural Tangent Kernel for Linear-Width Neural Networks.” In Advances in Neural Information Processing Systems, 33:12.
Fort, Stanislav, Gintare Karolina Dziugaite, Mansheej Paul, Sepideh Kharaghani, Daniel M. Roy, and Surya Ganguli. 2020. “Deep Learning Versus Kernel Learning: An Empirical Study of Loss Landscape Geometry and the Time Evolution of the Neural Tangent Kernel.” In Advances in Neural Information Processing Systems. Vol. 33.
Gal, Yarin, and Zoubin Ghahramani. 2015. “Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning.” In Proceedings of the 33rd International Conference on Machine Learning (ICML-16).
———. 2016. “A Theoretically Grounded Application of Dropout in Recurrent Neural Networks.” In.
Matthews, Alexander Graeme de Garis, Mark Rowland, Jiri Hron, Richard E. Turner, and Zoubin Ghahramani. 2018. “Gaussian Process Behaviour in Wide Deep Neural Networks.” August 16, 2018.
Geifman, Amnon, Abhay Yadav, Yoni Kasten, Meirav Galun, David Jacobs, and Ronen Basri. 2020. “On the Similarity Between the Laplace and Neural Tangent Kernels.” In.
Ghahramani, Zoubin. 2013. “Bayesian Non-Parametrics and the Probabilistic Approach to Modelling.” Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 371 (1984): 20110553.
Giryes, R., G. Sapiro, and A. M. Bronstein. 2016. “Deep Neural Networks with Random Gaussian Weights: A Universal Classification Strategy?” IEEE Transactions on Signal Processing 64 (13): 3444–57.
He, Bobby, Balaji Lakshminarayanan, and Yee Whye Teh. 2020. “Bayesian Deep Ensembles via the Neural Tangent Kernel.” In Advances in Neural Information Processing Systems. Vol. 33.
Jacot, Arthur, Franck Gabriel, and Clement Hongler. 2018. “Neural Tangent Kernel: Convergence and Generalization in Neural Networks.” In Advances in Neural Information Processing Systems, 31:8571–80. NIPS’18. Red Hook, NY, USA: Curran Associates Inc.
Karakida, Ryo, and Kazuki Osawa. 2020. “Understanding Approximate Fisher Information for Fast Convergence of Natural Gradient Descent in Wide Neural Networks.” Advances in Neural Information Processing Systems 33.
Lee, Jaehoon, Yasaman Bahri, Roman Novak, Samuel S. Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein. 2018. “Deep Neural Networks as Gaussian Processes.” In ICLR.
Lee, Jaehoon, Lechao Xiao, Samuel S. Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, and Jeffrey Pennington. 2019. “Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent.” In Advances in Neural Information Processing Systems, 8570–81.
Meronen, Lassi, Christabella Irwanto, and Arno Solin. 2020. “Stationary Activations for Uncertainty Calibration in Deep Learning.” In Advances in Neural Information Processing Systems. Vol. 33.
Neal, Radford M. 1996a. “Bayesian Learning for Neural Networks.” Secaucus, NJ, USA: Springer-Verlag New York, Inc.
———. 1996b. “Priors for Infinite Networks.” In Bayesian Learning for Neural Networks, edited by Radford M. Neal, 29–53. Lecture Notes in Statistics. New York, NY: Springer.
Novak, Roman, Lechao Xiao, Jiri Hron, Jaehoon Lee, Alexander A. Alemi, Jascha Sohl-Dickstein, and Samuel S. Schoenholz. 2019. “Neural Tangents: Fast and Easy Infinite Neural Networks in Python.” December 5, 2019.
Novak, Roman, Lechao Xiao, Jaehoon Lee, Yasaman Bahri, Greg Yang, Jiri Hron, Daniel A. Abolafia, Jeffrey Pennington, and Jascha Sohl-Dickstein. 2020. “Bayesian Deep Convolutional Networks with Many Channels Are Gaussian Processes.” In The International Conference on Learning Representations.
Pearce, Tim, Russell Tsuchida, Mohamed Zaki, Alexandra Brintrup, and Andy Neely. n.d. “Expressive Priors in Bayesian Neural Networks: Kernel Combinations and Periodic Functions.” Uncertainty in Artificial Intelligence, 11.
Williams, Christopher K. I. 1996. “Computing with Infinite Networks.” In Proceedings of the 9th International Conference on Neural Information Processing Systems, 295–301. NIPS’96. Cambridge, MA, USA: MIT Press.
Yang, Greg. 2019. “Tensor Programs I: Wide Feedforward or Recurrent Neural Networks of Any Architecture Are Gaussian Processes.” December 7, 2019.
Yang, Greg, and Edward J. Hu. 2020. “Feature Learning in Infinite-Width Neural Networks.” November 29, 2020.
Zhang, Chiyuan, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. 2017. “Understanding Deep Learning Requires Rethinking Generalization.” In Proceedings of ICLR.
