Statistical learning theory for dependent data
November 3, 2016 — December 2, 2024
Statistical learning theory for dependent data such as time series (and possibly other dependency structures, but I only know of results for time series).
Non-stationary, non-asymptotic bounds please. Keywords: Ergodic, α-, β-mixing.
1 Time series
Mohri and Kuznetsov have done lots of work here; see, e.g., their NIPS 2016 tutorial. There seem to be a lot of results that depend on stringent ergodicity assumptions. For an overview up to 2011, see McDonald, Shalizi, and Schervish (2011b):
Yu (1994) sets forth many of the uniform ergodic theorems that are needed to derive generalization error bounds for stochastic processes. Meir (2000) is one of the first papers to construct risk bounds for time series. […]
More recently, others have provided PAC results for non-IID data. Steinwart and Christmann (2009) prove an oracle inequality for generic regularized empirical risk minimisation algorithms learning from α-mixing processes, a fairly general sort of weak serial dependence, getting learning rates for least-squares support vector machines (SVMs) close to the optimal IID rates. Mohri and Rostamizadeh (2009b) prove stability-based generalization bounds when the data are stationary and φ-mixing or β-mixing, strictly generalising IID results and applying to all stable learning algorithms. […] Karandikar and Vidyasagar (n.d.) show that if an algorithm is “subadditive” and yields a predictor whose risk can be upper bounded when the data are IID, then the same algorithm yields predictors whose risk can be bounded if data are β-mixing. They use this result to derive generalization error bounds in terms of the learning rates for IID data and the β-mixing coefficients.
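For reference (and because I always forget the definitions), the standard mixing coefficients for a sequence $(X_t)$, writing $\sigma(X_i^j)$ for the σ-algebra generated by $X_i, \dots, X_j$, are

$$\alpha(k) = \sup_{n} \sup_{A \in \sigma(X_{-\infty}^{n}),\, B \in \sigma(X_{n+k}^{\infty})} \bigl| P(A \cap B) - P(A)P(B) \bigr|,$$

$$\beta(k) = \sup_{n} \, \mathbb{E}\Bigl[ \sup_{B \in \sigma(X_{n+k}^{\infty})} \bigl| P(B \mid \sigma(X_{-\infty}^{n})) - P(B) \bigr| \Bigr].$$

The process is α- (respectively β-) mixing if the corresponding coefficient vanishes as $k \to \infty$; since $2\alpha(k) \le \beta(k)$, β-mixing is the stronger assumption.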
Bounds that depend on mixing coefficients can be unsatisfying: even when the mixing coefficient can be estimated from the data, the estimate depends on the parameters fit to the data, which in turn depend on the mixing coefficient, so the argument is circular.
One possible resolution sidesteps mixing coefficients entirely (Kuznetsov and Mohri 2015, 2014).
Read those last two papers in reverse order of publication date to reduce confusion.
Time series forecasting plays a crucial role in a number of domains ranging from weather forecasting and earthquake prediction to applications in economics and finance. The classical statistical approaches to time series analysis are based on generative models such as the autoregressive moving average (ARMA) models, or their integrated versions (ARIMA) and several other extensions […] . Most of these models rely on strong assumptions about the noise terms, often assumed to be i.i.d. random variables sampled from a Gaussian distribution, and the guarantees provided in their support are only asymptotic. An alternative non-parametric approach to time series analysis consists of extending the standard i.i.d. statistical learning theory framework to that of stochastic processes.
[…] we consider the general case of non-stationary non-mixing processes. We are not aware of any prior work providing generalization bounds in this setting. In fact, our bounds appear to be novel even when the process is stationary (but not mixing). The learning guarantees that we present hold for both bounded and unbounded memory models. […] Our guarantees cover the majority of approaches used in practice, including various autoregressive and state space models. The key ingredients of our generalization bounds are a data-dependent measure of sequential complexity (expected sequential covering number or sequential Rademacher complexity [Rakhlin et al., 2010]) and a measure of discrepancy between the sample and target distributions. Kuznetsov and Mohri (2014) also give generalization bounds in terms of discrepancy. However, unlike [that result], our analysis does not require any mixing assumptions which are hard to verify in practice. More importantly, under some additional mild assumption, the discrepancy measure that we propose can be estimated from data, which leads to data-dependent learning guarantees for non-stationary non-mixing case.
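From memory, so check the paper for the exact conditioning conventions: given a sample $Z_1, \dots, Z_T$, a hypothesis class $\mathcal{H}$, a loss $\ell$, and a weight vector $\mathbf{q} = (q_1, \dots, q_T)$, their key discrepancy is roughly

$$\operatorname{disc}(\mathbf{q}) = \sup_{h \in \mathcal{H}} \left( \mathbb{E}\bigl[\ell(h, Z_{T+1}) \mid Z_1^{T}\bigr] - \sum_{t=1}^{T} q_t\, \mathbb{E}\bigl[\ell(h, Z_t) \mid Z_1^{t-1}\bigr] \right),$$

i.e. the worst-case gap between the conditional risk at the forecast time and a weighted average of conditional risks over the sample. It vanishes in the i.i.d. case with uniform weights, and, per the quote above, can be estimated from data under additional mild assumptions.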
They still want their time series to have a discrete time index, which can be unsatisfying for continuous-time processes.
They have also worked with mixing coefficients (e.g. Kuznetsov and Mohri 2016), but the last time I saw them speak, they were critical of the whole mixing-coefficient setting.
2 General graph dependencies
I just saw these works presented: R. (Ray) Zhang et al. (2019) and R.-R. Zhang and Amini (2024), which suggest we can achieve results under more complicated dependency graphs than those induced by time series.
I am not quite sure how this works in practice. The D-graph they use is not the same as the more conventional I-graph (independence graph) I am used to from graphical models, although both are graphical encodings of dependence. From the brief presentation I just saw, I am having a hard time working out which dependencies are actually encoded in the D-graph; I would need to think about it further to have an opinion. In particular, I do not immediately see how to recover even the above time series results in terms of D-graphs; they seem to produce dense graphs. My current guess is sketched below.
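Here is my current reading, with the loud caveat that I have not verified it against their papers: I assume the D-graph is the Janson-style dependency graph, i.e. vertices are the random variables, and any two non-adjacent sets of variables are mutually independent. Under that assumption, an m-dependent sequence yields a banded graph that colours with m + 1 colours, whereas a merely mixing sequence yields the complete graph, which would explain the density. A minimal sketch:

```python
# Sketch only: assumes the D-graph is the Janson-style dependency graph,
# which I have NOT verified against R. Zhang et al. (2019).
from itertools import combinations

T, m = 20, 2  # sequence length; X_s, X_t dependent iff |t - s| <= m

# Dependency graph of an m-dependent sequence: a banded graph.
edges = [(s, t) for s, t in combinations(range(T), 2) if t - s <= m]

# A proper colouring partitions the variables into mutually independent
# blocks; graph-dependent concentration bounds degrade with the number
# of colours (via the fractional chromatic number). For a banded graph,
# m + 1 colours suffice: colour each index by t mod (m + 1).
colouring = {t: t % (m + 1) for t in range(T)}
assert all(colouring[s] != colouring[t] for s, t in edges)
print(len(set(colouring.values())))  # -> 3, i.e. m + 1

# Contrast: under mere alpha- or beta-mixing, no pair of variables is
# exactly independent, so every pair gets an edge and the dependency
# graph is complete (T colours needed), which is presumably why the
# time-series examples looked dense in the talk.
```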