Last-layer Bayes neural nets

Bayesian and other probabilistic inference in overparameterized ML

Consider the original linear model. We have a (column) vector \(\mathbf{y}=[y_1,y_2,\dots,y_n]^T\) of \(n\) observations and an \(n\times p\) matrix \(\mathbf{X}\) of \(p\) covariates, where each column corresponds to a different covariate and each row to a different observation.

We assume the observations are related to the covariates by \[ \mathbf{y}=\mathbf{Xb}+\mathbf{e} \] where \(\mathbf{b}=[b_1,b_2,\dots,b_p]^T\) is the vector of model parameters, which we do not yet know. We call \(\mathbf{e}\) the “residual” vector. Legendre and Gauss pioneered the estimation of the parameters of a linear model by minimising the squared residuals, \(\mathbf{e}^T\mathbf{e}\), i.e. \[ \begin{aligned}\hat{\mathbf{b}} &=\operatorname{arg min}_\mathbf{b} (\mathbf{y}-\mathbf{Xb})^T (\mathbf{y}-\mathbf{Xb})\\ &=\operatorname{arg min}_\mathbf{b} \|\mathbf{y}-\mathbf{Xb}\|_2^2\\ &=\mathbf{X}^+\mathbf{y} \end{aligned} \] where \(\mathbf{X}^+\) is the Moore–Penrose pseudoinverse, which we compute with a numerical solver of some kind, using one of the many carefully optimised methods that exist for least squares.
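A minimal sketch of that estimator in NumPy, on synthetic data (the variable names and the true coefficients here are illustrative, not from the text). `np.linalg.lstsq` uses an SVD-based solver rather than forming \(\mathbf{X}^+\) explicitly, and agrees with the normal-equations solution when \(\mathbf{X}\) has full column rank:

```python
import numpy as np

rng = np.random.default_rng(42)
n, p = 100, 3                              # observations, covariates
X = rng.normal(size=(n, p))                # design matrix
b_true = np.array([1.5, -2.0, 0.5])        # illustrative "true" parameters
y = X @ b_true + 0.1 * rng.normal(size=n)  # noisy observations

# Least-squares estimate, b_hat = X^+ y, via an SVD-based solver.
b_hat, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)

# Equivalent route (full column rank): solve the normal equations X^T X b = X^T y.
b_normal = np.linalg.solve(X.T @ X, X.T @ y)

print(b_hat)  # close to b_true
```

Note that so far nothing here is statistical; this is purely an approximation-by-projection calculation.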

So far there is no statistical argument, merely function approximation.

However, it turns out that if you assume the errors \(e_i\) in the observations are i.i.d. (or at least independent with constant variance), then there is also a statistical justification for this idea.
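As a numerical check of that connection (a sketch, assuming i.i.d. Gaussian errors with known variance; the data and names are illustrative): the Gaussian negative log-likelihood in \(\mathbf{b}\) is, up to constants, proportional to the sum of squared residuals, so minimising one minimises the other and the maximum-likelihood estimate coincides with the least-squares estimate.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, p = 50, 2
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, -1.0]) + 0.2 * rng.normal(size=n)

def nll(b, sigma=1.0):
    """Gaussian negative log-likelihood in b (sigma held fixed), dropping
    additive constants: 0.5 * ||y - Xb||^2 / sigma^2."""
    r = y - X @ b
    return 0.5 * np.sum(r**2) / sigma**2

# Maximum likelihood via a generic optimiser...
b_mle = minimize(nll, x0=np.zeros(p)).x
# ...and least squares via the pseudoinverse: these agree.
b_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
```

The agreement holds for any fixed \(\sigma\), since rescaling the objective does not move its minimiser.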

πŸ— more exposition of these. Linkage to Maximum likelihood.

For now, handball to Lu (2022).


Bishop, Christopher M. 2006. Pattern Recognition and Machine Learning. Information Science and Statistics. New York: Springer.
Buja, Andreas, Trevor Hastie, and Robert Tibshirani. 1989. β€œLinear Smoothers and Additive Models.” Annals of Statistics 17 (2): 453–510.
Hoaglin, David C., and Roy E. Welsch. 1978. β€œThe Hat Matrix in Regression and ANOVA.” The American Statistician 32 (1): 17–22.
Lu, Jun. 2022. β€œA Rigorous Introduction to Linear Models.” arXiv.
Riutort-Mayol, Gabriel, Paul-Christian BΓΌrkner, Michael R. Andersen, Arno Solin, and Aki Vehtari. 2020. β€œPractical Hilbert Space Approximate Bayesian Gaussian Processes for Probabilistic Programming.” arXiv:2004.11408 [Stat], April.
Wang, Sinong, Belinda Z. Li, Madian Khabsa, Han Fang, and Hao Ma. 2020. β€œLinformer: Self-Attention with Linear Complexity.” arXiv.
Wilson, James T, Viacheslav Borovitskiy, Alexander Terenin, Peter Mostowsky, and Marc Deisenroth. 2020. β€œEfficiently Sampling Functions from Gaussian Process Posteriors.” In Proceedings of the 37th International Conference on Machine Learning, 10292–302. PMLR.
Zammit-Mangion, Andrew, and Noel Cressie. 2021. β€œFRK: An R Package for Spatial and Spatio-Temporal Prediction with Large Datasets.” Journal of Statistical Software 98 (May): 1–48.
