Random infinite-width NNs induce covariances which are nearly dot-product kernels in the input parameters. Say we wish to compare the outputs given two input examples $x$ and $x'$. They depend on several dot products: $\langle x, x\rangle$, $\langle x', x'\rangle$, and $\langle x, x'\rangle$. Often it is convenient to discuss the angle between the inputs:
$$\theta = \cos^{-1}\frac{\langle x, x'\rangle}{\sqrt{\langle x, x\rangle\,\langle x', x'\rangle}}.$$
The classic result (Neal 1996) is that in a single-layer wide neural net with activation $\sigma$, the output covariance converges to
$$K(x, x') = \sigma_b^2 + \sigma_w^2\,\mathbb{E}_{w}\big[\sigma(w^\top x)\,\sigma(w^\top x')\big].$$
It is sometimes useful to note that
$$\mathbb{E}_{w}\big[\sigma(w^\top x)\,\sigma(w^\top x')\big] = \mathbb{E}\big[\sigma(u)\,\sigma(v)\big]$$
where $u = w^\top x$ and $v = w^\top x'$.
These terms arise from the (appropriately scaled) limit of the random weight matrix $W$. Now, define
$$\Sigma = \begin{pmatrix}\langle x, x\rangle & \langle x, x'\rangle\\ \langle x, x'\rangle & \langle x', x'\rangle\end{pmatrix}.$$
We have that $(u, v) \sim \mathcal{N}(0, \Sigma)$ when $w \sim \mathcal{N}(0, I)$, and thus the kernel depends on the inputs only through the three dot products, or equivalently through $\|x\|$, $\|x'\|$ and $\theta$.
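As a quick sanity check, here is a minimal Monte Carlo sketch, assuming a standard Gaussian weight prior and (arbitrarily, for illustration) the erf activation; it estimates the expectation both directly over random weights and via the bivariate-Gaussian reduction above, and the two estimates should agree.

```python
import numpy as np
from scipy.special import erf

rng = np.random.default_rng(0)
x = np.array([1.0, 0.5, -0.3])
xp = np.array([0.2, -1.0, 0.8])

# Direct Monte Carlo over random weight rows w ~ N(0, I).
W = rng.standard_normal((200_000, x.size))
direct = np.mean(erf(W @ x) * erf(W @ xp))

# Reduction to a bivariate Gaussian (u, v) whose covariance is built
# from the three dot products <x,x>, <x,x'>, <x',x'>.
Sigma = np.array([[x @ x, x @ xp],
                  [x @ xp, xp @ xp]])
uv = rng.multivariate_normal(np.zeros(2), Sigma, size=200_000)
reduced = np.mean(erf(uv[:, 0]) * erf(uv[:, 1]))

print(direct, reduced)  # the two estimates should be close
```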
Erf kernel
Williams (1996) recovers a kernel that corresponds to the Erf sigmoidal activation in the infinite-width limit. Let $\tilde{x} = (1, x_1, \dots, x_d)^\top$ be an augmented copy of the inputs with a 1 prepended so that it includes the bias, and let $\Sigma$ be the covariance matrix of the weights (which are usually isotropic, $\Sigma = \sigma^2 I$). Then the kernel can be written as
$$K_{\mathrm{erf}}(x, x') = \mathbb{E}_{u \sim \mathcal{N}(0, \Sigma)}\big[\operatorname{erf}(u^\top \tilde{x})\,\operatorname{erf}(u^\top \tilde{x}')\big].$$
This integral can be evaluated analytically to give
$$K_{\mathrm{erf}}(x, x') = \frac{2}{\pi}\sin^{-1}\frac{2\,\tilde{x}^\top \Sigma\, \tilde{x}'}{\sqrt{\big(1 + 2\,\tilde{x}^\top \Sigma\, \tilde{x}\big)\big(1 + 2\,\tilde{x}'^\top \Sigma\, \tilde{x}'\big)}}.$$
If there is no bias term, you can lop those tildes off (along with the corresponding row and column of $\Sigma$) and the result should still hold. If the weights are isotropic, the $\Sigma$s reduce to a scalar factor $\sigma^2$ and effectively vanish from the formula also.
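For concreteness, a minimal sketch of the closed form above, assuming isotropic weights $\Sigma = \sigma^2 I$; the function and parameter names are mine:

```python
import numpy as np

def erf_kernel(x, xp, sigma2=1.0):
    """Williams' Erf kernel with isotropic weight covariance sigma2 * I,
    evaluated on bias-augmented inputs (a 1 prepended to each input)."""
    xt = np.concatenate(([1.0], x))    # tilde-x
    xpt = np.concatenate(([1.0], xp))  # tilde-x'
    num = 2.0 * sigma2 * (xt @ xpt)
    den = np.sqrt((1.0 + 2.0 * sigma2 * (xt @ xt))
                  * (1.0 + 2.0 * sigma2 * (xpt @ xpt)))
    return (2.0 / np.pi) * np.arcsin(num / den)

print(erf_kernel(np.array([1.0, 0.5]), np.array([0.2, -1.0])))
```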
Arc-cosine kernel
An interesting dot-product kernel is the arc-cosine kernel (Cho and Saul 2009):
$$k_n(x, x') = \frac{1}{\pi}\,\|x\|^n\,\|x'\|^n\,J_n(\theta).$$
Specifically, $J_n$ is given by
$$J_n(\theta) = (-1)^n\,(\sin\theta)^{2n+1}\left(\frac{1}{\sin\theta}\frac{\partial}{\partial\theta}\right)^{\!n}\left(\frac{\pi - \theta}{\sin\theta}\right).$$
The first few are
$$J_0(\theta) = \pi - \theta,\qquad
J_1(\theta) = \sin\theta + (\pi - \theta)\cos\theta,\qquad
J_2(\theta) = 3\sin\theta\cos\theta + (\pi - \theta)(1 + 2\cos^2\theta).$$
Order $n = 1$ recovers the ReLU activation in the infinite-width limit, i.e. $\sigma(t) = \max(0, t)$. The arc-cosine kernel of order $n = 1$, corresponding to the case where the activation is the ReLU, is
$$k_1(x, x') = \frac{1}{\pi}\,\|x\|\,\|x'\|\,\big(\sin\theta + (\pi - \theta)\cos\theta\big).$$
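A small sketch of the order-1 case, with function names of my own choosing, comparing the closed form against a Monte Carlo estimate of the underlying ReLU expectation $2\,\mathbb{E}_{w}\big[\max(0, w^\top x)\,\max(0, w^\top x')\big]$:

```python
import numpy as np

def arccos_kernel_1(x, xp):
    """Order-1 arc-cosine kernel of Cho & Saul (2009), closed form."""
    cos_t = (x @ xp) / (np.linalg.norm(x) * np.linalg.norm(xp))
    theta = np.arccos(np.clip(cos_t, -1.0, 1.0))
    J1 = np.sin(theta) + (np.pi - theta) * np.cos(theta)
    return np.linalg.norm(x) * np.linalg.norm(xp) * J1 / np.pi

# Monte Carlo check: k_1(x, x') = 2 E_w[relu(w.x) relu(w.x')], w ~ N(0, I).
rng = np.random.default_rng(0)
x, xp = np.array([1.0, 0.5, -0.3]), np.array([0.2, -1.0, 0.8])
W = rng.standard_normal((500_000, 3))
mc = 2.0 * np.mean(np.maximum(W @ x, 0) * np.maximum(W @ xp, 0))
print(arccos_kernel_1(x, xp), mc)  # these should be close
```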
Observation: the arc-cosine kernel appears related to Grothendieck’s identity, that for any fixed unit vectors $u, v \in \mathbb{R}^d$ and a standard Gaussian vector $g \sim \mathcal{N}(0, I_d)$ we have
$$\mathbb{E}\big[\operatorname{sign}(\langle g, u\rangle)\,\operatorname{sign}(\langle g, v\rangle)\big] = \frac{2}{\pi}\arcsin\langle u, v\rangle.$$
I don’t have any use for that; it is just a cool identity I wanted to note down. In an aside, Djalil Chafaï observes that the Rademacher RV is the uniform distribution over the unit sphere in $\mathbb{R}^1$, i.e. over $\{-1, +1\}$. Is that what makes this go?
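For what it is worth, a quick Monte Carlo check of the identity, with arbitrary unit vectors of my own choosing:

```python
import numpy as np

rng = np.random.default_rng(0)
u = np.array([1.0, 2.0, -1.0]); u /= np.linalg.norm(u)
v = np.array([0.5, -1.0, 0.3]); v /= np.linalg.norm(v)

# Empirical E[sign(<g,u>) sign(<g,v>)] for g ~ N(0, I) vs. the closed form.
G = rng.standard_normal((500_000, 3))
lhs = np.mean(np.sign(G @ u) * np.sign(G @ v))
rhs = (2.0 / np.pi) * np.arcsin(u @ v)
print(lhs, rhs)  # these should be close
```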
Absolutely homogeneous
Activation functions which are absolutely homogeneous of degree 1, in the sense of positive homogeneity, satisfying $\sigma(\lambda t) = \lambda\,\sigma(t)$ for all $\lambda \geq 0$, have additional structure. This class includes the ReLU and leaky ReLU activations (the ReLU case is also covered by the first-order arc-cosine kernel above). It follows from the definition that functions drawn from a bias-free NN with such an activation a.s. satisfy $f(\lambda x) = \lambda\,f(x)$ for all $\lambda \geq 0$.
For absolutely homogeneous activations we can sum the derivatives over the coordinate indices, à la Euler’s identity for homogeneous functions, i.e.
$$\sum_i x_i\,\frac{\partial f(x)}{\partial x_i} = f(x).$$
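A tiny numerical sketch, using a bias-free one-hidden-layer ReLU network of my own construction, checking both the homogeneity property and the derivative sum:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((16, 3))
w2 = rng.standard_normal(16)

def f(x):
    """A bias-free one-hidden-layer ReLU network."""
    return w2 @ np.maximum(W1 @ x, 0.0)

x = np.array([0.7, -0.2, 1.1])

# Homogeneity: f(lambda * x) = lambda * f(x) for lambda >= 0.
print(f(2.5 * x), 2.5 * f(x))

# Euler's identity: sum_i x_i df/dx_i = f(x), via central finite differences.
eps = 1e-6
grad = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                 for e in np.eye(3)])
print(x @ grad, f(x))
```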
References
Arora, Du, Hu, et al. 2019. “On Exact Computation with an Infinitely Wide Neural Net.” In Advances in Neural Information Processing Systems.
Belkin, Ma, and Mandal. 2018. “To Understand Deep Learning We Need to Understand Kernel Learning.” In International Conference on Machine Learning.
Cho, and Saul. 2009. “Kernel Methods for Deep Learning.” In Proceedings of the 22nd International Conference on Neural Information Processing Systems. NIPS’09.
Geifman, Yadav, Kasten, et al. 2020. “On the Similarity Between the Laplace and Neural Tangent Kernels.” arXiv:2007.01580 [cs, stat].
He, Lakshminarayanan, and Teh. 2020. “Bayesian Deep Ensembles via the Neural Tangent Kernel.” In Advances in Neural Information Processing Systems.
Jacot, Gabriel, and Hongler. 2018. “Neural Tangent Kernel: Convergence and Generalization in Neural Networks.” In Advances in Neural Information Processing Systems. NIPS’18.
Marteau-Ferey, Bach, and Rudi. 2020. “Non-Parametric Models for Non-Negative Functions.” In Proceedings of the 34th International Conference on Neural Information Processing Systems. NIPS ’20.
Neal. 1996. “Priors for Infinite Networks.” In Bayesian Learning for Neural Networks. Lecture Notes in Statistics.
Pearce, Tsuchida, Zaki, et al. 2019. “Expressive Priors in Bayesian Neural Networks: Kernel Combinations and Periodic Functions.” In Uncertainty in Artificial Intelligence.
Tsuchida, Roosta, and Gallagher. 2018. “Invariance of Weight Distributions in Rectified MLPs.” In International Conference on Machine Learning.
Williams. 1996. “Computing with Infinite Networks.” In Proceedings of the 9th International Conference on Neural Information Processing Systems. NIPS’96.