Notoriously, GP regression scales badly with dataset size, requiring us to invert a matrix full of observation covariances. But applying that inverse to the data is just solving a least-squares optimisation, when you think about it. So can we solve it by gradient descent and have it somehow come out cheaper? Can we incorporate updates from only some margins at a time and still converge to the same answer? More cheaply? Maybe.
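To make that concrete: the expensive part of the posterior mean is solving $K\alpha = y$, which is the minimiser of a convex quadratic, so we can attack it by gradient descent at $O(n^2)$ per step instead of an $O(n^3)$ direct solve. A minimal sketch in JAX, with made-up data and an arbitrary RBF kernel (everything here is my own illustration, not from any of the papers below):

```python
import jax
import jax.numpy as jnp

def rbf_kernel(x, y, lengthscale=1.0):
    # Squared-exponential Gram matrix between 1-d input vectors x and y.
    return jnp.exp(-0.5 * (x[:, None] - y[None, :]) ** 2 / lengthscale**2)

key = jax.random.PRNGKey(0)
kx, ky = jax.random.split(key)
x = 10.0 * jax.random.uniform(kx, (200,))             # observation sites (made up)
y = jnp.sin(x) + 0.5 * jax.random.normal(ky, (200,))  # noisy observations (made up)

noise_var = 0.25
K = rbf_kernel(x, x) + noise_var * jnp.eye(x.shape[0])  # K + sigma^2 I

# The posterior-mean weights alpha solve K alpha = y, i.e. they minimise the
# convex quadratic f(alpha) = 0.5 alpha^T K alpha - y^T alpha.
# Each gradient step is one mat-vec, O(n^2), versus O(n^3) for a direct solve.
step = 1.0 / jnp.linalg.eigvalsh(K)[-1]   # conservative step size: 1 / largest eigenvalue
alpha = jnp.zeros_like(y)
for _ in range(3000):
    alpha = alpha - step * (K @ alpha - y)

print(jnp.linalg.norm(K @ alpha - y))                    # residual of the iterative solve
print(jnp.linalg.norm(K @ jnp.linalg.solve(K, y) - y))   # residual of the direct solve
```

Each iteration is a single mat-vec; whether this ever beats a Cholesky solve hinges on the conditioning of $K$, which is the crux of the question.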
1 Subsampling sites
Chen et al. (2020) show that we can optimise hyperparameters by subsampling sites and performing SGD, provided the kernels are smooth enough and the batch sizes large enough. At inference time we are still in trouble, though.
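The recipe is roughly: draw a random minibatch of sites, evaluate the GP marginal likelihood (and its gradient) on that sub-kernel only, and take an SGD step on the hyperparameters. Here is a sketch in JAX with hypothetical data; it uses the naive minibatched objective rather than Chen et al.'s precise estimator:

```python
import jax
import jax.numpy as jnp

def nll_minibatch(params, x_batch, y_batch, noise_var=0.25):
    # Negative log marginal likelihood of a zero-mean GP restricted to a
    # minibatch of sites -- the naive subsampled objective.
    lengthscale, outputscale = jnp.exp(params)   # log-parameterised for positivity
    d2 = (x_batch[:, None] - x_batch[None, :]) ** 2
    K = outputscale * jnp.exp(-0.5 * d2 / lengthscale**2)
    K = K + noise_var * jnp.eye(x_batch.shape[0])
    L = jnp.linalg.cholesky(K)
    alpha = jax.scipy.linalg.cho_solve((L, True), y_batch)
    return (0.5 * y_batch @ alpha
            + jnp.sum(jnp.log(jnp.diag(L)))
            + 0.5 * x_batch.shape[0] * jnp.log(2.0 * jnp.pi))

grad_fn = jax.jit(jax.grad(nll_minibatch))

key = jax.random.PRNGKey(1)
kx, ky = jax.random.split(key)
x = 10.0 * jax.random.uniform(kx, (2000,))              # full set of sites (made up)
y = jnp.sin(x) + 0.5 * jax.random.normal(ky, (2000,))   # observations (made up)

params = jnp.zeros(2)          # (log lengthscale, log outputscale)
lr, batch_size = 1e-2, 256
for _ in range(500):
    key, sub = jax.random.split(key)
    idx = jax.random.choice(sub, x.shape[0], (batch_size,), replace=False)
    params = params - lr * grad_fn(params, x[idx], y[idx])

print(jnp.exp(params))         # learned (lengthscale, outputscale)
```

Note that the minibatch gradient is not in general an unbiased estimate of the full-data gradient (neither the log-determinant nor the quadratic form decomposes over sites), which is presumably why the smoothness and batch-size conditions matter.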
2 Subsampling observations and sites
Minh (2022) bounds the Wasserstein distance when we subsample both observations and sites. E.g. their Algorithm 5.1 goes roughly as follows:
Input: finite samples from realizations of the two processes, observed at a common finite set of sites.
Procedure:
- Form the data matrices from the sampled realizations.
- Compute the empirical covariance matrices from the data matrices.
- Compute the distance between the Gaussians with those empirical covariances, via the closed-form expression for Gaussian measures.
- Form the resulting plug-in estimate of the divergence between the two processes.
This is pretty much what we would do naively, plugging the sample estimates into the target quantity; what they add is bounds on the quality of the resulting estimate. AFAICS the bounds are trivial for the exact Wasserstein distance in infinite-dimensional Hilbert spaces. But if we care about Sinkhorn divergences, the bounds seem to be useful?
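In code, such a plug-in estimate might look like the following sketch (my own illustration, not Minh's exact algorithm): form empirical covariance matrices at the shared sites, then evaluate the closed-form 2-Wasserstein (Bures) distance between the corresponding zero-mean Gaussians; the entropic/Sinkhorn variants swap in the corresponding closed form.

```python
import jax
import jax.numpy as jnp

def psd_sqrt(A):
    # Symmetric PSD square root via eigendecomposition.
    w, V = jnp.linalg.eigh(A)
    return (V * jnp.sqrt(jnp.clip(w, 0.0))) @ V.T

def bures_wasserstein_sq(A, B):
    # Squared 2-Wasserstein distance between zero-mean Gaussians N(0, A), N(0, B):
    #   tr A + tr B - 2 tr[(A^{1/2} B A^{1/2})^{1/2}]
    rA = psd_sqrt(A)
    cross = psd_sqrt(rA @ B @ rA)
    return jnp.trace(A) + jnp.trace(B) - 2.0 * jnp.trace(cross)

def plugin_estimate(X1, X2):
    # X1, X2: (m, n) arrays of m sample paths observed at the same n sites.
    K1 = jnp.cov(X1, rowvar=False)      # empirical covariance matrices at the sites...
    K2 = jnp.cov(X2, rowvar=False)
    return jnp.sqrt(bures_wasserstein_sq(K1, K2))  # ...plugged into the Gaussian formula

# Hypothetical example: sample paths of two GPs observed at shared sites.
def rbf(t, ls):
    return jnp.exp(-0.5 * (t[:, None] - t[None, :]) ** 2 / ls**2) + 1e-6 * jnp.eye(t.shape[0])

key = jax.random.PRNGKey(2)
k1, k2, kt = jax.random.split(key, 3)
t = jnp.sort(10.0 * jax.random.uniform(kt, (50,)))   # n = 50 shared sites
C1 = jnp.linalg.cholesky(rbf(t, 1.0))
C2 = jnp.linalg.cholesky(rbf(t, 2.0))
X1 = (C1 @ jax.random.normal(k1, (50, 200))).T       # m = 200 realizations of each process
X2 = (C2 @ jax.random.normal(k2, (50, 200))).T
print(plugin_estimate(X1, X2))
```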
Anyway, sometimes subsampling is OK, it seems, if we want to approximate some GP in Sinkhorn divergence. Does this tell us anything about the optimisation problem?
3 Random projections
Song et al. (2019), maybe? I wonder if the Sliced Score Matching approach is feasible here? It works great in diffusion models.
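For reference, the sliced score matching objective of Song et al. (2019) estimates the score-matching loss with random projection directions and a Jacobian-vector product, so it never needs the full Jacobian. A toy sketch (nothing GP-specific; the linear score model is just a placeholder):

```python
import jax
import jax.numpy as jnp

def sliced_score_matching_loss(score_fn, params, x, key):
    # Sliced score matching objective:
    #   E_v E_x [ v^T grad_x (v^T s(x)) + 0.5 (v^T s(x))^2 ],
    # estimated with one random projection direction v per data point.
    v = jax.random.normal(key, x.shape)           # random slicing directions

    def per_example(xi, vi):
        s = lambda z: score_fn(params, z)         # model score s(x) ~ grad log p(x)
        si, jvp_v = jax.jvp(s, (xi,), (vi,))      # s(x) and J_s(x) v in one pass
        return vi @ jvp_v + 0.5 * (vi @ si) ** 2

    return jnp.mean(jax.vmap(per_example)(x, v))

# Hypothetical toy score model: linear score s(x) = -x @ P with parameter matrix P.
def score_fn(P, x):
    return -x @ P

key = jax.random.PRNGKey(3)
kx, kv = jax.random.split(key)
x = jax.random.normal(kx, (512, 4))               # made-up data
P = jnp.eye(4)
print(sliced_score_matching_loss(score_fn, P, x, kv))
```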