Neural net kernels



How I imagine the hyperspherical regularity of an NN kernel.

Random infinite-width NNs induce covariances which are nearly dot-product kernels in the inputs. Say we wish to compare the outputs given two input examples \(\mathbf{x}\) and \(\mathbf{y}\). The covariance depends on the inputs only through the dot products \(\mathbf{x}^{\top} \mathbf{x}\), \(\mathbf{x}^{\top} \mathbf{y}\) and \(\mathbf{y}^{\top} \mathbf{y}\). Often it is convenient to discuss the angle \(\theta\) between the inputs: \[ \theta=\cos ^{-1}\left(\frac{\mathbf{x} ^{\top} \mathbf{y}}{\|\mathbf{x}\|\|\mathbf{y}\|}\right) \]

The classic result here is that, in a wide single-layer neural net with activation \(\psi\), \[\begin{aligned} \kappa(\mathbf{x}, \mathbf{y}) &= \mathbb{E}\big[ \psi(Z_x) \psi(Z_y) \big], \quad \text{ where} \\ \begin{pmatrix} Z_x \\ Z_y \end{pmatrix} &\sim \mathcal{N} \Bigg( \mathbf{0}, \underbrace{\begin{pmatrix} \mathbf{x}^\top \mathbf{x} & \mathbf{x}^\top \mathbf{y} \\ \mathbf{y}^\top \mathbf{x} & \mathbf{y}^\top \mathbf{y} \end{pmatrix}}_{:=\Sigma} \Bigg). \end{aligned}\]
It is sometimes useful to note that \(\begin{pmatrix} Z_x \\ Z_y \end{pmatrix}\overset{d}{=} \operatorname{Chol}(\Sigma)\boldsymbol{Z}_1,\) where \(\boldsymbol{Z}_1\sim \mathcal{N} \Bigg( \mathbf{0}, \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \Bigg)\) and \(\operatorname{Chol}(\Sigma)\) is the lower-triangular Cholesky factor \(\operatorname{Chol}(\Sigma)= \begin{pmatrix} \|\mathbf{x}\| & 0 \\ \|\mathbf{y}\|\cos \theta & \|\mathbf{y}\|\sqrt{1-\cos^2 \theta} \end{pmatrix},\) so that \(\operatorname{Chol}(\Sigma)\operatorname{Chol}(\Sigma)^{\top}=\Sigma.\)
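The Gaussian expectation above is easy to approximate by sampling. The following is a minimal sketch, assuming NumPy and a lower-triangular Cholesky factor as returned by `np.linalg.cholesky`; the function name `mc_kernel`, the sample size, and the test inputs are illustrative choices, not anything from the original references.

```python
import numpy as np

def mc_kernel(x, y, psi, n_samples=200_000, seed=0):
    """Monte Carlo estimate of E[psi(Z_x) psi(Z_y)], (Z_x, Z_y) ~ N(0, Sigma)."""
    rng = np.random.default_rng(seed)
    # Gram matrix of the two inputs; this plays the role of Sigma.
    sigma = np.array([[x @ x, x @ y],
                      [y @ x, y @ y]])
    # Lower-triangular factor L with L @ L.T == sigma
    # (requires x and y not to be collinear).
    chol = np.linalg.cholesky(sigma)
    # Independent standard normals, one column per sample.
    z1 = rng.standard_normal((2, n_samples))
    z_x, z_y = chol @ z1
    return float(np.mean(psi(z_x) * psi(z_y)))

# Hypothetical example inputs.
x = np.array([1.0, 0.5, -0.3])
y = np.array([0.2, -1.0, 0.7])

# Sanity check: with the identity "activation" the kernel is just x^T y.
k_lin = mc_kernel(x, y, lambda z: z)
```

With the identity activation the estimate should land close to \(\mathbf{x}^{\top}\mathbf{y}\), which gives a cheap check that the factorisation is oriented correctly.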

These \(Z_{x}\) and \(Z_{y}\) terms arise from (the appropriately scaled limit of) the random weight matrix
\[\begin{aligned} Z_x &= \mathbf{W}^\top\mathbf{x} \\ Z_y &= \mathbf{W}^\top \mathbf{y}. \end{aligned}\] Now, define \[\begin{aligned} Z_{xi} &:= W_{i} x_{i}, \\ Z_{yj} &:= W_{j} y_{j}, \\ Z'_{xi} &:= W_i, \\ Z'_{yj} &:= W_j. \end{aligned}\] We have that \[\begin{aligned} \kappa &= \mathbb{E} \big[ \psi\big(Z_x\big) \psi\big(Z_y \big) \big] \\ \frac{\partial \kappa}{\partial x_{i}} x_{i} &= \mathbb{E} \big[ \psi'\big(Z_x\big) \psi\big(Z_y \big) Z_{xi}\big] \\ \frac{\partial^2 \kappa}{\partial x_{i} \partial y_{j}} x_{i} y_{j} &= \mathbb{E} \big[ \psi'\big(Z_x\big) \psi'\big(Z_y \big) Z_{xi} Z_{yj} \big] \\ \frac{\partial^2 \kappa}{\partial x_{i} \partial x_{j}} x_{i}x_{j} &= \mathbb{E} \big[ \psi''\big(Z_x\big) \psi\big(Z_y \big) Z_{xi} Z_{xj} \big]\end{aligned}\] and thus \[\begin{aligned} \frac{\partial \kappa}{\partial x_{i}} &= \mathbb{E} \big[ \psi'\big(Z_x\big) \psi\big(Z_y \big) Z_{xi}'\big] \\ \frac{\partial^2 \kappa}{\partial x_{i} \partial y_{j}} &= \mathbb{E} \big[ \psi'\big(Z_x\big) \psi'\big(Z_y \big) Z_{xi}' Z_{yj}' \big] \\ \frac{\partial^2 \kappa}{\partial x_{i} \partial x_{j}} &= \mathbb{E} \big[ \psi''\big(Z_x\big) \psi\big(Z_y \big) Z_{xi}' Z_{xj}'\big] . \end{aligned}\]
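The first-derivative identity can be checked numerically. This is a sketch under assumptions of my own: a smooth activation (`tanh`, chosen only so that \(\psi'\) is unambiguous), i.i.d. standard normal weights, and the same weight samples reused for both estimates so that the comparison is not swamped by Monte Carlo noise.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_samples = 3, 100_000
# One row per sampled weight vector W ~ N(0, I_d).
w = rng.standard_normal((n_samples, d))

# Hypothetical example inputs.
x = np.array([1.0, 0.5, -0.3])
y = np.array([0.2, -1.0, 0.7])

psi = np.tanh
dpsi = lambda z: 1.0 - np.tanh(z) ** 2  # derivative of tanh

def kappa_hat(x, y):
    """Sample-average estimate of E[psi(W^T x) psi(W^T y)]."""
    return np.mean(psi(w @ x) * psi(w @ y))

# Formula from the text: d kappa / d x_i = E[psi'(Z_x) psi(Z_y) W_i].
grad_formula = np.mean((dpsi(w @ x) * psi(w @ y))[:, None] * w, axis=0)

# Central finite differences of the same sample average.
h = 1e-5
grad_fd = np.array([
    (kappa_hat(x + h * e, y) - kappa_hat(x - h * e, y)) / (2 * h)
    for e in np.eye(d)
])
```

Because both quantities are computed from the same samples, `grad_fd` and `grad_formula` should agree up to finite-difference error rather than sampling error.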

Erf kernel

Williams (1996) recovers a kernel that corresponds to the erf sigmoidal activation, here denoted \(\Phi\), in the infinite-width limit. Let \(\tilde{\mathbf{x}}=\left(1, x_{1}, \ldots, x_{d}\right)\) be an augmented copy of the inputs with a 1 prepended so that it includes the bias, and let \(\Sigma\) be the covariance matrix of the weights (which are usually isotropic, \(\Sigma=\mathrm{I}\) ). Then \(\kappa_{\mathrm{erf}}\left(\mathbf{x}, \mathbf{y}\right)\) can be written as \[ \kappa_{\mathrm{erf}}\left(\mathbf{x}, \mathbf{y}\right)=\frac{1}{(2 \pi)^{\frac{d+1}{2}}|\Sigma|^{1 / 2}} \int \Phi\left(\mathbf{w}^{\top} \tilde{\mathbf{x}}\right) \Phi\left(\mathbf{w}^{\top} \tilde{\mathbf{y}}\right) \exp \left(-\frac{1}{2} \mathbf{w}^{\top} \Sigma^{-1} \mathbf{w}\right) \mathrm{d}\mathbf{w}. \] This integral can be evaluated analytically to give

\[ \kappa_{\mathrm{erf}}(\mathbf{x}, \mathbf{y}) =\frac{2}{\pi} \sin^{-1} \frac{ 2 \tilde{\mathbf{x}}^{\top} \Sigma \tilde{\mathbf{y}} }{ \sqrt{\left( 1+2 \tilde{\mathbf{x}}^{\top} \Sigma \tilde{\mathbf{x}} \right)\left( 1+2 \tilde{\mathbf{y}}^{\top} \Sigma \tilde{\mathbf{y}} \right)}}. \]

If there is no bias term, you can lop those tildes off, along with a factor of \(\sqrt{2\pi}\) from the normalizing constant, and the result should still hold. If the weights are isotropic, \(\Sigma=\mathrm{I}\) and the \(\Sigma\)s drop out as well.
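As a sanity check, the closed form can be compared with a direct Monte Carlo estimate. The sketch below assumes the bias-free case with isotropic weights by default, and takes the activation to be the standard error function `math.erf`; the function names and test inputs are illustrative.

```python
import math
import numpy as np

def erf_kernel(x, y, sigma=None):
    """Closed-form erf kernel without bias; sigma is the weight covariance."""
    if sigma is None:
        sigma = np.eye(len(x))  # isotropic weights
    num = 2.0 * (x @ sigma @ y)
    den = math.sqrt((1.0 + 2.0 * (x @ sigma @ x)) * (1.0 + 2.0 * (y @ sigma @ y)))
    return 2.0 / math.pi * math.asin(num / den)

def erf_kernel_mc(x, y, n_samples=200_000, seed=0):
    """Monte Carlo estimate of E[erf(w^T x) erf(w^T y)] with w ~ N(0, I)."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((n_samples, len(x)))
    erf = np.vectorize(math.erf)  # elementwise erf without extra dependencies
    return float(np.mean(erf(w @ x) * erf(w @ y)))

# Hypothetical example inputs.
x = np.array([1.0, 0.5, -0.3])
y = np.array([0.2, -1.0, 0.7])
k_exact = erf_kernel(x, y)
k_mc = erf_kernel_mc(x, y)
```

To include a bias, one could prepend a 1 to each input before calling `erf_kernel`, matching the augmented vectors \(\tilde{\mathbf{x}}\) above.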

Arc-cosine kernel

An interesting dot-product kernel is the arc-cosine kernel (Cho and Saul 2009):

\[ \kappa_{n}(\mathbf{x}, \mathbf{y})= \frac{2}{(2 \pi)^{\frac{d}{2}}} \int \Theta(\mathbf{w} ^{\top} \mathbf{x}) \Theta(\mathbf{w} ^{\top} \mathbf{y})(\mathbf{w} ^{\top} \mathbf{x})^{n}(\mathbf{w} ^{\top} \mathbf{y})^{n} \exp\left(-\frac{1}{2}\mathbf{w}^{\top}\mathbf{w}\right) \mathrm{d}\mathbf{w} \]

Specifically, \[ \kappa_{n}(\mathbf{x}, \mathbf{y})=\frac{1}{\pi}\|\mathbf{x}\|^{n}\|\mathbf{y}\|^{n} J_{n}(\theta) \] where \(J_{n}(\theta)\) is given by: \[ J_{n}(\theta)=(-1)^{n}(\sin \theta)^{2 n+1}\left(\frac{1}{\sin \theta} \frac{\partial}{\partial \theta}\right)^{n}\left(\frac{\pi-\theta}{\sin \theta}\right) \] The first few \(J_{n}\) are \[ \begin{array}{l} J_{0}(\theta)=\pi-\theta \\ J_{1}(\theta)=\sin \theta+(\pi-\theta) \cos \theta. \end{array} \] The order-1 case corresponds to the ReLU activation in the infinite-width limit, i.e. the arc-cosine kernel of order \(1\), for the case where \(\psi\) is the ReLU, is \[\begin{aligned} k(\mathbf{x}, \mathbf{y}) &= \frac{\sigma_w^2 \Vert \mathbf{x} \Vert \Vert \mathbf{y} \Vert }{2\pi} \Big( \sin |\theta| + \big(\pi - |\theta| \big) \cos\theta \Big) \end{aligned} \] where \(\sigma_w^2\) is the variance of the weights. The factor of 2 relative to \(\kappa_1\) comes from the leading 2 in the integral definition of \(\kappa_n\).
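The ReLU case is simple to verify by simulation. A minimal sketch, assuming i.i.d. standard normal weights (so \(\sigma_w^2 = 1\) unless overridden) and no bias; the helper names are illustrative.

```python
import numpy as np

def angle(x, y):
    """Angle between x and y, clipped for numerical safety."""
    c = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
    return np.arccos(np.clip(c, -1.0, 1.0))

def relu_kernel(x, y, sigma_w2=1.0):
    """Closed form E[relu(Z_x) relu(Z_y)] via the order-1 arc-cosine kernel."""
    theta = angle(x, y)
    j1 = np.sin(theta) + (np.pi - theta) * np.cos(theta)
    return sigma_w2 * np.linalg.norm(x) * np.linalg.norm(y) * j1 / (2 * np.pi)

def relu_kernel_mc(x, y, n_samples=200_000, seed=0):
    """Monte Carlo estimate with w ~ N(0, I)."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((n_samples, len(x)))
    relu = lambda z: np.maximum(z, 0.0)
    return float(np.mean(relu(w @ x) * relu(w @ y)))

# Hypothetical example inputs.
x = np.array([1.0, 0.5, -0.3])
y = np.array([0.2, -1.0, 0.7])
k_exact = relu_kernel(x, y)
k_mc = relu_kernel_mc(x, y)
```

A useful special case: at \(\theta = 0\) we have \(J_1(0) = \pi\), so `relu_kernel(x, x)` reduces to \(\|\mathbf{x}\|^2/2\), the second moment of a rectified Gaussian.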

Observation: This appears related to Grothendieck’s identity: for any fixed vectors \(u, v \in \mathbb{S}^{n-1},\) with \(X_{u}=g^{\top}u\) and \(X_{v}=g^{\top}v\) for a standard Gaussian vector \(g\sim\mathcal{N}(\mathbf{0}, \mathrm{I}_n)\), we have \[ \mathbb{E} \operatorname{sign}X_{u} \operatorname{sign}X_{v}=\frac{2}{\pi} \arcsin u^{\top} v. \] I don’t have any use for that; it is just a cool identity I wanted to note down. In an aside, Djalil Chafaï observes that the Rademacher distribution is the uniform distribution on the unit sphere of \(\mathbb{R}^{1}\), i.e. \(\mathbb{S}^{0}=\{-1,+1\}.\) Is that what makes this go?
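The identity is easy to test by simulation. The sketch below assumes the usual reading in which \(X_u\) and \(X_v\) are projections of one shared standard Gaussian vector onto \(u\) and \(v\); the dimension and sample size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_samples = 5, 200_000

# Two fixed unit vectors on the sphere S^{n-1}.
u = rng.standard_normal(n)
u /= np.linalg.norm(u)
v = rng.standard_normal(n)
v /= np.linalg.norm(v)

# One Gaussian vector per row; X_u and X_v share the same g.
g = rng.standard_normal((n_samples, n))
lhs = float(np.mean(np.sign(g @ u) * np.sign(g @ v)))
rhs = float(2.0 / np.pi * np.arcsin(np.clip(u @ v, -1.0, 1.0)))
```

The link to the arc-cosine kernel can be made explicit, as far as I can tell: since \(\operatorname{sign}(z) = 2\Theta(z) - 1\) almost surely and \(\mathbb{E}\,\Theta(X_u) = \tfrac{1}{2}\), the order-0 kernel gives \(\mathbb{E} \operatorname{sign}X_{u} \operatorname{sign}X_{v} = 4\cdot\frac{\pi-\theta}{2\pi} - 1 = 1 - \frac{2\theta}{\pi} = \frac{2}{\pi}\arcsin(\cos\theta)\).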

Absolutely homogeneous

Activation functions which are absolutely homogeneous of degree \(r\), i.e. which satisfy \(\psi(|a|z)=|a|^r\psi(z)\), have additional structure. This class includes the ReLU and leaky ReLU activations, both of degree \(r=1\) (the ReLU giving the first-order arc-cosine kernel above). It follows from the definition that functions \(f\) drawn from an NN with such an activation a.s. satisfy \(f(|a|\mathbf{x}) = |a|^r f(\mathbf{x})\).

For absolutely homogeneous activations of degree \(r=1\), such as the ReLU, we can sum the derivatives over the coordinate indices: \[\begin{aligned} \sum_{i,j=1}^d \frac{\partial^2 \kappa}{\partial x_{i} \partial x_{j}} x_{i} x_{j} &= \mathbb{E} \big[ \psi''\big(Z_x\big) \psi\big(Z_y \big) (Z_x)^2 \big] = 0 \\ \sum_{i,j=1}^d \frac{\partial^2 \kappa}{\partial y_{i} \partial y_{j}} y_{i} y_{j} &= \mathbb{E} \big[ \psi''\big(Z_y\big) \psi\big(Z_x \big) (Z_y)^2 \big] = 0 \\ \sum_{i,j=1}^d \frac{\partial^2 \kappa}{\partial x_{i} \partial y_{j}} x_{i} y_{j}&= \mathbb{E} \big[ \psi'\big(Z_x\big) \psi'\big(Z_y \big) Z_x Z_y \big] = \kappa. \end{aligned}\] The zeros arise because \(\psi''(z)z^2 = r(r-1)\psi(z)\) vanishes when \(r=1\) (in the distributional sense for the ReLU), and the last line uses \(\psi'(z)z=\psi(z)\). In matrix form, \[\begin{aligned} \mathbf{x}^{\top}\frac{\partial^2 \kappa}{ \partial \mathbf{x} \partial \mathbf{y}^\top} \mathbf{y} &=\kappa\\ \mathbf{x}^{\top}\frac{\partial^2 \kappa}{ \partial \mathbf{x} \partial \mathbf{x}^\top} \mathbf{x} &=0\\ \mathbf{y}^{\top}\frac{\partial^2 \kappa}{ \partial \mathbf{y} \partial \mathbf{y}^\top} \mathbf{y} &=0. \end{aligned}\]
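These contraction identities can be checked on the closed-form ReLU kernel with finite differences. A rough sketch, assuming unit weight variance and a point where \(\mathbf{x}\) and \(\mathbf{y}\) are not collinear, so that the kernel is smooth; the step size and helper names are arbitrary choices.

```python
import numpy as np

def relu_kernel(x, y):
    """Order-1 arc-cosine (ReLU) kernel with unit weight variance."""
    nx, ny = np.linalg.norm(x), np.linalg.norm(y)
    theta = np.arccos(np.clip((x @ y) / (nx * ny), -1.0, 1.0))
    return nx * ny * (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / (2 * np.pi)

def second_derivatives(f, x, y, h=1e-3):
    """Central finite-difference estimates of d2f/dx dx^T and d2f/dx dy^T."""
    d = len(x)
    steps = h * np.eye(d)  # row i is h times the i-th basis vector
    h_xx = np.empty((d, d))
    h_xy = np.empty((d, d))
    for i, ei in enumerate(steps):
        for j, ej in enumerate(steps):
            h_xx[i, j] = (f(x + ei + ej, y) - f(x + ei - ej, y)
                          - f(x - ei + ej, y) + f(x - ei - ej, y)) / (4 * h * h)
            h_xy[i, j] = (f(x + ei, y + ej) - f(x + ei, y - ej)
                          - f(x - ei, y + ej) + f(x - ei, y - ej)) / (4 * h * h)
    return h_xx, h_xy

# Hypothetical example inputs.
x = np.array([1.0, 0.5, -0.3])
y = np.array([0.2, -1.0, 0.7])
h_xx, h_xy = second_derivatives(relu_kernel, x, y)
quad_xx = float(x @ h_xx @ x)  # should be close to 0
quad_xy = float(x @ h_xy @ y)  # should be close to relu_kernel(x, y)
```

Both follow from Euler's theorem for homogeneous functions, since \(\kappa(a\mathbf{x}, b\mathbf{y}) = ab\,\kappa(\mathbf{x}, \mathbf{y})\) for \(a, b > 0\) when \(r = 1\).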

References

Adlam, Ben, Jaehoon Lee, Lechao Xiao, Jeffrey Pennington, and Jasper Snoek. 2020. “Exploring the Uncertainty Properties of Neural Networks’ Implicit Priors in the Infinite-Width Limit.” October 14, 2020. http://arxiv.org/abs/2010.07355.
Arora, Sanjeev, Simon S Du, Wei Hu, Zhiyuan Li, Russ R Salakhutdinov, and Ruosong Wang. 2019. “On Exact Computation with an Infinitely Wide Neural Net.” In Advances in Neural Information Processing Systems, 10.
Belkin, Mikhail, Siyuan Ma, and Soumik Mandal. 2018. “To Understand Deep Learning We Need to Understand Kernel Learning.” In International Conference on Machine Learning, 541–49. http://arxiv.org/abs/1802.01396.
Chen, Lin, and Sheng Xu. 2020. “Deep Neural Tangent Kernel and Laplace Kernel Have the Same RKHS.” October 1, 2020. http://arxiv.org/abs/2009.10683.
Cho, Youngmin, and Lawrence K. Saul. 2009. “Kernel Methods for Deep Learning.” In Proceedings of the 22nd International Conference on Neural Information Processing Systems, 22:342–50. NIPS’09. Red Hook, NY, USA: Curran Associates Inc. https://papers.nips.cc/paper/2009/hash/5751ec3e9a4feab575962e78e006250d-Abstract.html.
Domingos, Pedro. 2020. “Every Model Learned by Gradient Descent Is Approximately a Kernel Machine.” November 30, 2020. http://arxiv.org/abs/2012.00152.
Fan, Zhou, and Zhichao Wang. 2020. “Spectra of the Conjugate Kernel and Neural Tangent Kernel for Linear-Width Neural Networks.” In Advances in Neural Information Processing Systems, 33:12. https://proceedings.neurips.cc//paper_files/paper/2020/hash/572201a4497b0b9f02d4f279b09ec30d-Abstract.html.
Fort, Stanislav, Gintare Karolina Dziugaite, Mansheej Paul, Sepideh Kharaghani, Daniel M. Roy, and Surya Ganguli. 2020. “Deep Learning Versus Kernel Learning: An Empirical Study of Loss Landscape Geometry and the Time Evolution of the Neural Tangent Kernel.” In Advances in Neural Information Processing Systems. Vol. 33. https://proceedings.neurips.cc//paper_files/paper/2020/hash/405075699f065e43581f27d67bb68478-Abstract.html.
Geifman, Amnon, Abhay Yadav, Yoni Kasten, Meirav Galun, David Jacobs, and Ronen Basri. 2020. “On the Similarity Between the Laplace and Neural Tangent Kernels.” http://arxiv.org/abs/2007.01580.
He, Bobby, Balaji Lakshminarayanan, and Yee Whye Teh. 2020. “Bayesian Deep Ensembles via the Neural Tangent Kernel.” In Advances in Neural Information Processing Systems. Vol. 33. https://proceedings.neurips.cc//paper_files/paper/2020/hash/0b1ec366924b26fc98fa7b71a9c249cf-Abstract.html.
Jacot, Arthur, Franck Gabriel, and Clement Hongler. 2018. “Neural Tangent Kernel: Convergence and Generalization in Neural Networks.” In Advances in Neural Information Processing Systems, 31:8571–80. NIPS’18. Red Hook, NY, USA: Curran Associates Inc. https://papers.nips.cc/paper/2018/hash/5a4be1fa34e62bb8a6ec6b91d2462f5a-Abstract.html.
Neal, Radford M. 1996. “Priors for Infinite Networks.” In Bayesian Learning for Neural Networks, edited by Radford M. Neal, 29–53. Lecture Notes in Statistics. New York, NY: Springer. https://doi.org/10.1007/978-1-4612-0745-0_2.
Pearce, Tim, Russell Tsuchida, Mohamed Zaki, Alexandra Brintrup, and Andy Neely. 2019. “Expressive Priors in Bayesian Neural Networks: Kernel Combinations and Periodic Functions.” In Uncertainty in Artificial Intelligence, 11.
Tsuchida, Russell, Fred Roosta, and Marcus Gallagher. 2018. “Invariance of Weight Distributions in Rectified MLPs.” In International Conference on Machine Learning, 4995–5004. PMLR. http://proceedings.mlr.press/v80/tsuchida18a.html.
Williams, Christopher K. I. 1996. “Computing with Infinite Networks.” In Proceedings of the 9th International Conference on Neural Information Processing Systems, 295–301. NIPS’96. Cambridge, MA, USA: MIT Press. https://openreview.net/forum?id=S1V2dDbdZH.
