Rough path theory and signature methods
April 2, 2021 — April 30, 2024
I am not sure yet what this is. Do they mean rough in the sense of approximate or the sense of not smooth? (The latter, it seems: "rough" refers to driving paths of low Hölder regularity, such as Brownian sample paths, for which classical integration does not apply.)
Seems to originate in a fairly impenetrable body of work by Lyons, e.g. T. Lyons (1994), but the modern recommendation is to start from more approachable material. Friz and Hairer (2020), available free online, is a good introduction; it covers the simplest (?) case of Gaussian noise.
1 Rough differential equations
Try Morrill et al. (2021)?
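As a placeholder for my own orientation (standard notation, not specific to that paper): a rough differential equation is a controlled ODE

\[ \mathrm{d} Y_t = f\left(Y_t\right)\, \mathrm{d} X_t, \]

where the driving path \(X\) is too irregular, say only \(\alpha\)-Hölder with \(\alpha \leq 1/2\) as for Brownian motion, for the integral to make classical (Young) sense; the rough-path move is to enrich \(X\) with its iterated integrals so that solutions can be defined pathwise.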
2 Discrete approximation
Wong-Zakai approximations: Twardowska (1996). (A Martin Hairer recommendation.)
Possibly compact refs: (Kelly 2016; Kelly and Melbourne 2014).
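For my own notes, the shape of the Wong-Zakai result (hedged paraphrase, not quoted from those references): if \(W^{(n)}\) is a piecewise-linear (or otherwise smoothed) approximation of a Brownian motion \(W\), then the solutions of the ordinary differential equations

\[ \mathrm{d} Y^{(n)}_t = f\left(Y^{(n)}_t\right)\, \mathrm{d} W^{(n)}_t \]

converge to the solution of the Stratonovich SDE \(\mathrm{d} Y_t = f\left(Y_t\right) \circ \mathrm{d} W_t\), rather than the Itô one. Rough path theory recasts this kind of statement as continuity of the solution map in the driving rough path.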
3 In learning
Hodgkinson, Roosta, and Mahoney (2021) makes use of rough path integrals to justify learning by the adjoint method in stochastic differential equations. Cass and Salvi (2024) is a friendly introduction to this area.
4 Signatures
Chevyrev and Kormilitzin (2016) discusses path signatures in particular, something arising in the theory about which I know little. Bonnier et al. (2019) summarises:
When data is ordered sequentially then it comes with a natural path-like structure: the data may be thought of as a discretisation of a path \(X:[0,1] \rightarrow V\), where \(V\) is some Banach space. In practice we shall always take \(V=\mathbb{R}^d\) for some \(d \in \mathbb{N}\). For example the changing air pressure at a particular location may be thought of as a path in \(\mathbb{R}\); the motion of a pen on paper may be thought of as a path in \(\mathbb{R}^2\); the changes within financial markets may be thought of as a path in \(\mathbb{R}^d\), with \(d\) potentially very large.
Given a path, we may define its signature, which is a collection of statistics of the path. The map from a path to its signature is called the signature transform. Definition 1.1. Let \(\mathbf{x}=\left(x_1, \ldots, x_n\right)\), where \(x_i \in \mathbb{R}^d\). Let \(f=\left(f_1, \ldots, f_d\right):[0,1] \rightarrow \mathbb{R}^d\) be continuous, such that \(f\left(\frac{i-1}{n-1}\right)=x_i\), and linear on the intervals in between. Then the signature of \(\mathbf{x}\) is defined as the collection of iterated integrals \[ \operatorname{Sig}(\mathbf{x})=\left(\left(\int_{0<t_1<\cdots<t_k<1} \prod_{j=1}^k \frac{\mathrm{d} f_{i_j}}{\mathrm{d} t}\left(t_j\right)\, \mathrm{d} t_1 \cdots \mathrm{d} t_k\right)_{1 \leq i_1, \ldots, i_k \leq d}\right)_{k \geq 0} \]
…In short, the signature of a path determines the path essentially uniquely, and does so in an efficient, computable way. Furthermore, the signature is rich enough that every continuous function of the path may be approximated arbitrarily well by a linear function of its signature; it may be thought of as a ‘universal nonlinearity’. Taken together these properties make the signature an attractive tool for machine learning. The most simple way to use the signature is as feature transformation, as it may often be simpler to learn a function of the signature than of the original path.
This makes it sound like we have a connection to Koopman operators?
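To unpack the definition a little (my own worked gloss, not from the papers above): the depth-1 terms are just the total increments of the path, and the genuinely new information at depth 2 is a signed area:

\[ S^{(i)}=\int_0^1 \mathrm{d} f_i = f_i(1)-f_i(0), \qquad S^{(i, j)}=\int_{0<t_1<t_2<1} \mathrm{d} f_i\left(t_1\right)\, \mathrm{d} f_j\left(t_2\right). \]

The shuffle identity \(S^{(i, j)}+S^{(j, i)}=S^{(i)} S^{(j)}\) says the symmetric part of depth 2 is redundant given depth 1; what is new is the antisymmetric part \(\frac{1}{2}\left(S^{(i, j)}-S^{(j, i)}\right)\), the Lévy area of the path.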
5 Code
- patrick-kidger/Deep-Signature-Transforms: Code for “Deep Signature Transforms” (Bonnier et al. 2019)
The signature of a stream of data is essentially a collection of statistics about that stream of data. This collection of statistics does such a good job of capturing the information about the stream of data that it actually determines the stream of data uniquely. (Up to something called ‘tree-like equivalence’ anyway, which is really just a technicality. It’s an equivalence relation that matters about as much as two functions being equal almost everywhere. That is to say, not much at all.) The signature transform is a particularly attractive tool in machine learning because it is what we call a ‘universal nonlinearity’: it is sufficiently rich that it captures every possible nonlinear function of the original stream of data. Any function of a stream is linear on its signature. Now for various reasons this is a mathematical idealisation not borne out in practice (which is why we put them in a neural network and don’t just use a simple linear model), but they still work very well!
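For my own understanding, a minimal sketch of computing a truncated signature directly from the definition, using Chen's identity: the signature of a concatenation of paths is the (truncated) tensor product of the signatures of the pieces, and a straight-line segment with increment \(\Delta\) has signature \(\left(1, \Delta, \Delta^{\otimes 2}/2!, \ldots\right)\). Plain numpy, just to demystify things; in practice one would presumably reach for a dedicated library such as iisignature or signatory.

```python
import numpy as np

def linear_signature(delta, depth):
    """Signature of one straight-line segment with increment `delta`:
    level k is the k-fold tensor power of delta, divided by k!."""
    levels = [np.ones(())]          # level 0 is the scalar 1
    term = np.ones(())
    for k in range(1, depth + 1):
        term = np.multiply.outer(term, delta) / k
        levels.append(term)
    return levels

def chen_product(a, b, depth):
    """Chen's identity: level k of the concatenated path's signature is
    sum over i + j = k of (level i of a) tensor (level j of b)."""
    return [
        sum(np.multiply.outer(a[i], b[k - i]) for i in range(k + 1))
        for k in range(depth + 1)
    ]

def signature(path, depth=3):
    """Truncated signature of the piecewise-linear path through the rows of `path`."""
    increments = np.diff(path, axis=0)
    sig = linear_signature(increments[0], depth)
    for delta in increments[1:]:
        sig = chen_product(sig, linear_signature(delta, depth), depth)
    return sig

# Example: a short stream in R^2 (hypothetical data, for illustration only).
x = np.array([[0.0, 0.0], [1.0, 0.5], [2.0, 0.0], [3.0, 1.0]])
sig = signature(x, depth=3)
print([level.shape for level in sig])  # [(), (2,), (2, 2), (2, 2, 2)]
print(sig[1])                          # level 1 = total increment, [3. 1.]
```

Note the output sizes: level \(k\) is a tensor with \(d^k\) entries, which is why the truncation depth matters in practice.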