Last-layer Bayes neural nets

Bayesian and other probabilistic inference in overparameterized ML

Consider the original linear model. We have a (column) vector \(\mathbf{y}=[y_1,y_2,\dots,y_n]^T\) of \(n\) observations and an \(n\times p\) matrix \(\mathbf{X}\) of \(p\) covariates, where each column corresponds to a different covariate and each row to a different observation.

We assume the observations are related to the covariates by \[ \mathbf{y}=\mathbf{Xb}+\mathbf{e} \] where \(\mathbf{b}=[b_1,b_2,\dots,b_p]^T\) is the vector of model parameters, which we do not yet know. We call \(\mathbf{e}\) the “residual” vector. Legendre and Gauss pioneered the estimation of the parameters of a linear model by minimising the squared residuals, \(\mathbf{e}^T\mathbf{e}\), i.e. \[ \begin{aligned}\hat{\mathbf{b}} &=\operatorname{arg min}_\mathbf{b} (\mathbf{y}-\mathbf{Xb})^T (\mathbf{y}-\mathbf{Xb})\\ &=\operatorname{arg min}_\mathbf{b} \|\mathbf{y}-\mathbf{Xb}\|_2^2\\ &=\mathbf{X}^+\mathbf{y} \end{aligned} \] where \(\mathbf{X}^+\) is the Moore–Penrose pseudoinverse, which we compute with a numerical solver of some kind, using one of the many carefully optimised methods that exist for least squares.
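A minimal sketch of that estimator in NumPy, on synthetic data (the variable names and the true coefficients here are illustrative, not from the text). `np.linalg.lstsq` uses an SVD-based solver rather than forming \(\mathbf{X}^+\) explicitly, and agrees with the normal-equations solution when \(\mathbf{X}\) has full column rank:

```python
import numpy as np

rng = np.random.default_rng(42)
n, p = 100, 3                              # observations, covariates
X = rng.normal(size=(n, p))                # design matrix
b_true = np.array([1.5, -2.0, 0.5])        # illustrative "true" parameters
y = X @ b_true + 0.1 * rng.normal(size=n)  # noisy observations

# Least-squares estimate, b_hat = X^+ y, via an SVD-based solver.
b_hat, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)

# Equivalent route (full column rank): solve the normal equations X^T X b = X^T y.
b_normal = np.linalg.solve(X.T @ X, X.T @ y)

print(b_hat)  # close to b_true
```

Note that so far nothing here is statistical; this is purely an approximation-by-projection calculation.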

So far there is no statistical argument, merely function approximation.

However, it turns out that if you assume the errors \(e_i\) in the observations are i.i.d. (or at least independent with constant variance), then there is also a statistical justification for this idea.
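As a numerical check of that connection (a sketch, assuming i.i.d. Gaussian errors with known variance; the data and names are illustrative): the Gaussian negative log-likelihood in \(\mathbf{b}\) is, up to constants, proportional to the sum of squared residuals, so minimising one minimises the other and the maximum-likelihood estimate coincides with the least-squares estimate.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, p = 50, 2
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, -1.0]) + 0.2 * rng.normal(size=n)

def nll(b, sigma=1.0):
    """Gaussian negative log-likelihood in b (sigma held fixed), dropping
    additive constants: 0.5 * ||y - Xb||^2 / sigma^2."""
    r = y - X @ b
    return 0.5 * np.sum(r**2) / sigma**2

# Maximum likelihood via a generic optimiser...
b_mle = minimize(nll, x0=np.zeros(p)).x
# ...and least squares via the pseudoinverse: these agree.
b_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
```

The agreement holds for any fixed \(\sigma\), since rescaling the objective does not move its minimiser.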

πŸ— more exposition of these. Linkage to Maximum likelihood.

For now, handball to Lu (2022).


Bishop, Christopher M. 2006. Pattern Recognition and Machine Learning. Information Science and Statistics. New York: Springer.
Buja, Andreas, Trevor Hastie, and Robert Tibshirani. 1989. β€œLinear Smoothers and Additive Models.” Annals of Statistics 17 (2): 453–510.
Hoaglin, David C., and Roy E. Welsch. 1978. β€œThe Hat Matrix in Regression and ANOVA.” The American Statistician 32 (1): 17–22.
Lu, Jun. 2022. β€œA Rigorous Introduction to Linear Models.” arXiv.
Riutort-Mayol, Gabriel, Paul-Christian BΓΌrkner, Michael R. Andersen, Arno Solin, and Aki Vehtari. 2020. β€œPractical Hilbert Space Approximate Bayesian Gaussian Processes for Probabilistic Programming.” arXiv:2004.11408 [Stat], April.
Wang, Sinong, Belinda Z. Li, Madian Khabsa, Han Fang, and Hao Ma. 2020. β€œLinformer: Self-Attention with Linear Complexity.” arXiv.
Wilson, James T, Viacheslav Borovitskiy, Alexander Terenin, Peter Mostowsky, and Marc Deisenroth. 2020. β€œEfficiently Sampling Functions from Gaussian Process Posteriors.” In Proceedings of the 37th International Conference on Machine Learning, 10292–302. PMLR.
Zammit-Mangion, Andrew, and Noel Cressie. 2021. β€œFRK: An R Package for Spatial and Spatio-Temporal Prediction with Large Datasets.” Journal of Statistical Software 98 (May): 1–48.
