Gaussian process inference by partial updates



\[\renewcommand{\var}{\operatorname{Var}} \renewcommand{\cov}{\operatorname{Cov}} \renewcommand{\corr}{\operatorname{Corr}} \renewcommand{\dd}{\mathrm{d}} \renewcommand{\bb}[1]{\mathbb{#1}} \renewcommand{\vv}[1]{\boldsymbol{#1}} \renewcommand{\rv}[1]{\mathsf{#1}} \renewcommand{\vrv}[1]{\vv{\rv{#1}}} \renewcommand{\disteq}{\stackrel{d}{=}} \renewcommand{\dif}{\backslash} \renewcommand{\gvn}{\mid} \renewcommand{\Ex}{\mathbb{E}} \renewcommand{\Pr}{\mathbb{P}}\]

Notoriously, GP regression scales badly with dataset size, requiring us to invert a matrix full of observation covariances. But inverting a matrix is just solving a least square optimisation, when you think about it. So can we solve it by gradient descent and have it somehow come out cheaper? Can we incorporate updates from only some margins at a time and still converge to the same answer? More cheaply? Maybe.

Subsampling sites

Chen et al. (2020) shows that we can optimise hyperparametrs by subsampling sites and performing SGD if the kernels are smooth enough and the batch sizes large enough. At inference time we are still in trouble though.

Subsampling observations and sites

Minh (2022) bounds Wasserstein when we subsample observations and sites. e.g their algorithm 5.1 goes:

  • Input: Finite samples \(\left\{\xi_k^i\left(x_j\right)\right\}\), from \(N_i\) realizations \(\xi_k^i, 1 \leq k \leq N_i\), of processes \(\xi^i, i=1,2\), sampled at \(m\) points \(x_j, 1 \leq j \leq m\)

  • Procedure:

    1. Form \(m \times N_i\) data matrices \(Z_i\), with \(\left(Z_i\right)_{j k}=\xi_k^i\left(x_j\right), i=1,2,1 \leq j \leq m, 1 \leq k \leq N_i\)
    2. Compute \(m \times m\) empirical covariance matrices \(\hat{K}^i=\frac{1}{N} Z_i Z_i^T, i=1,2\)
    3. Compute \(W=W_2\left[\mathcal{N}\left(0, \frac{1}{m} \hat{K}^1\right), \mathcal{N}\left(0, \frac{1}{m} \hat{K}^2\right)\right]\) according to \[ W_2^2\left(\nu_0, \nu_1\right)=\left\|m_0-m_1\right\|^2+\operatorname{tr}\left(C_0\right)+\operatorname{tr}\left(C_1\right)-2 \operatorname{tr}\left(C_0^{1 / 2} C_1 C_0^{1 / 2}\right)^{1 / 2}\]

This is pretty much as we would expect to do it naively by plugging in the sample estimates of the target quantity, except they provide bounds for the quality of the estimate we get. AFAICS the bounds are trivial for Wasserstein in infinite-dimensional Hilbert spaces. But if we care about Sinkhorn divergences they seem to have useful bounds?

Anyway, sometimes subsampling is OK, it seems, if we want to approximate some GP in Sinkhorn divergence. Does this tell us anything about the optimisation problem?

Random projections

Song et al. (2019) maybe? I wonder if the Slice Score Matching approach is feasible here? Works great in diffusion.

References

Chen, Hao, Lili Zheng, Raed Al Kontar, and Garvesh Raskutti. 2020. “Stochastic Gradient Descent in Correlated Settings: A Study on Gaussian Processes.” In Proceedings of the 34th International Conference on Neural Information Processing Systems, 2722–33. NIPS’20. Red Hook, NY, USA: Curran Associates Inc.
Gardner, Jacob R., Geoff Pleiss, David Bindel, Kilian Q. Weinberger, and Andrew Gordon Wilson. 2018. GPyTorch: Blackbox Matrix-Matrix Gaussian Process Inference with GPU Acceleration.” In Proceedings of the 32nd International Conference on Neural Information Processing Systems, 31:7587–97. NIPS’18. Red Hook, NY, USA: Curran Associates Inc.
Hensman, James, Nicolò Fusi, and Neil D. Lawrence. 2013. Gaussian Processes for Big Data.” In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, 282–90. UAI’13. Arlington, Virginia, USA: AUAI Press.
Minh, Hà Quang. 2022. Finite Sample Approximations of Exact and Entropic Wasserstein Distances Between Covariance Operators and Gaussian Processes.” SIAM/ASA Journal on Uncertainty Quantification, February, 96–124.
Song, Yang, Sahaj Garg, Jiaxin Shi, and Stefano Ermon. 2019. Sliced Score Matching: A Scalable Approach to Density and Score Estimation.” arXiv.
Ubaru, Shashanka, Jie Chen, and Yousef Saad. 2017. Fast Estimation of \(tr(f(A))\) via Stochastic Lanczos Quadrature.” SIAM Journal on Matrix Analysis and Applications 38 (4): 1075–99.
Wang, Ke, Geoff Pleiss, Jacob Gardner, Stephen Tyree, Kilian Q. Weinberger, and Andrew Gordon Wilson. 2019. Exact Gaussian Processes on a Million Data Points.” In Advances in Neural Information Processing Systems, 32:14648–59. Red Hook, NY, USA.

No comments yet. Why not leave one?

GitHub-flavored Markdown & a sane subset of HTML is supported.