Using statistical/ML approaches to solve PDEs, and indeed perform inference through them.
There are various approaches here, which I will document on an *ad hoc* basis as I need them.
No claim is made to completeness

## The Fourier neural operator approach

Zongyi Li blogs a neat trick here: We use Fourier transforms to capture resolution-invariant and non-local behaviour in PDE forward-solvers. There is a bouquet of papers designed to leverage this (Li, Kovachki, Azizzadenesheli, Liu, Bhattacharya, et al. 2020b, 2020a; Li, Kovachki, Azizzadenesheli, Liu, Stuart, et al. 2020).

Learning a PDE operator is naturally expressed by a layer which acts on functions, and which is indifferent to how those functions are discretized, and in particular the scale at which they are discretized. Because this is neural net land we can be a certain type of sloppy and we don’t worry overmuch about the bandwidth of the functions to approximate; We will keep some conservative number of low harmonics from the Fourier transform and use a fairly arbitrary non-Fourier maps too, and conduct the whole thing in a “high” dimensional space with vague relationship to the original problem without worrying too much about what it means.

Code is at zongyi-li/fourier_neural_operator.

The basic concept is heuristic and the main trick is doing lots of accounting of dimensions, and being ok with an arbitrary-dimensional Fourier transform \(\mathcal{F}:D \rightarrow \mathbb{R}^{d} \to D \rightarrow \mathbb{C}^{d}\), which, applied to a function \(f: D \rightarrow \mathbb{R}^{d}\) looks like this:

\[\begin{aligned} (\mathcal{F} f)_{j}(k) &=\int_{D} f_{j}(x) e^{-2 i \pi\langle x, k\rangle} \mathrm{d} x \\ \left(\mathcal{F}^{-1} f\right)_{j}(x) &=\int_{D} f_{j}(k) e^{2 i \pi\langle x, k\rangle} \mathrm{d} k \end{aligned}\] for each dimension \(j=1, \ldots, d\).

We notionally use the property convolutions become multiplications under Fourier transform, which motivates the use of use convolutions to construct our operators. Good.

So the problem set up is that we are going to assume that there is a map, \(G^{\dagger}\) which arises from the solution of a PDE. Let \(D \subset \mathbb{R}^{d}\) be a bounded, open set which is the index set of our PDE solutions. We consider the input space \(\mathcal{A}=\mathcal{A}\left(D ; \mathbb{R}^{d_{a}}\right)\) of \(D\to\mathbb{R}^{d_{a}}\), and the solution space of \(\mathcal{U}=\mathcal{U}\left(D ; \mathbb{R}^{d_{u}}\right)\) of \(D\to\mathbb{R}^{d_{u}}\) functions. They are both Banach spaces. \(G^{\dagger}: \mathcal{A} \rightarrow \mathcal{U}\) is a map taking input functions to solutions functions which we learn from function pairs \(\left\{a_{j}, u_{j}\right\}_{j=1}^{N}\) where \(a_{j}\in \mathcal{A}\) and \(u_{j}=G^{\dagger}(a_{j})+\epsilon_{j}\) where \(\epsilon{j}\) is a corrupting noise. Further, we observe these functions only at certain points \(\{x_{1},\dots,x_{n}\}\subset D.\)

We approximate \(G^{\dagger}\) by choosing good parameters \(\theta^{\dagger} \in \Theta\) from the parameter space in some parametric family of maps \(G\left(\cdot, \theta\right)\) so that \(G\left(\cdot, \theta^{\dagger}\right)=G_{\theta^{\dagger}} \approx G^{\dagger}\). Specifically we minimise some cost functional \(C: \mathcal{U} \times \mathcal{U} \rightarrow \mathbb{R}\) to find \[ \min _{\theta \in \Theta} \mathbb{E}_{a \sim \mu}\left[C\left(G(a, \theta), G^{\dagger}(a)\right)\right]. \]

For the neural Fourier operator in particular we assume that \(G_{\theta}\) has a particular iterative form, i.e. \(G= Q \circ V_{T} \circ V_{T-1} \circ \ldots \circ V_{1} \circ P\). We introduce a new space \(\mathcal{V}=\mathcal{V}\left(D ; \mathbb{R}^{d_{v}}\right)\). \(P\) is a map \(\mathcal{A}\to\mathcal{V}\) and \(Q\) is a map \(\mathcal{V}\to\mathcal{U}\), and each \(V_{t}\) is a map \(\mathcal{V}\to\mathcal{U}\). Each of \(P\) and \(Q\) is ‘local’ in that they depend only upon pointwise evaluations of the function, e.g. for \(a\in\mathcal{A}\), \((Pa)(x)=p(a(x)\) for some \(p:\mathbb{R}^{d_{a}}\to\mathbb{R}^{d_{v}}\). Each \(v_{j}\) is function \(\mathbb{R}^{d_{v}}\). As a rule we are assuming \(\mathbb{R}^{d_{v}}>\mathbb{R}^{d_{a}}>\mathbb{R}^{d_{u}}.\) \(V_{t}\) is not local. In fact, we define \[ (V_{t}v)(x):=\sigma\left(W v(x)+\left(\mathcal{K}(a ; \phi) v\right)(x)\right), \quad \forall x \in D \] where \(\mathcal{K}(\cdot;\phi): \mathcal{A} \rightarrow \mathcal{L}\left(\mathcal{V}, \mathcal{V}\right)\). This map is parameterized by \(\phi \in \Theta_{\mathcal{K}}\). \(W: \mathbb{R}^{d_{v}} \rightarrow \mathbb{R}^{d_{v}}\) is a linear transformation, and \(\sigma: \mathbb{R} \rightarrow \mathbb{R}\) is a local, component-wise, non-linear activation function. \(\mathcal{K}(a ; \phi)\) is a kernel integral transformation, by which is meant \[ \left(\mathcal{K}(a ; \phi) v\right)(x):=\int_{D} \kappa_{\phi}(x, a(x), y, a(y)) v(y) \mathrm{d} y, \quad \forall x \in D \] where \(\kappa_{\phi}: \mathbb{R}^{2\left(d+d_{a}\right)} \rightarrow \mathbb{R}^{d_{v} \times d_{v}}\) is some mapping parameterized by \(\phi \in \Theta_{\mathcal{K}}\).

Anyway, it looks like this:

NB: I am not sure why there is a dependence on \(a\) in the kernel definition, because the authors immediately throw it out and replace it with
\[\kappa_{\phi}(x, a(x), y, a(y)) := \kappa_{R}(x-y)\] so that the integral operator becomes a convolution.
This convolution is can be calculated cheaply in Fourier space, which suggests we may as well define and calculate it also in Fourier space.
Thus they define the Fourier integral operator
\[
\left(\mathcal{K}(\phi) v\right)(x)=\mathcal{F}^{-1}\left(R \cdot\left(\mathcal{F} v\right)\right)(x) \quad \forall x \in D
\]
where \(R\) is the Fourier transform of a periodic function \(\kappa: D \rightarrow \mathbb{R}^{d_{v} \times d_{v}}\) parameterized by \(\phi \in \Theta_{\mathcal{K}}\).
Checking our units here, we have that
\(\left(\mathcal{F} v\right):D \to \mathbb{C}^{d_{v}}\)
and \(R (k): D \to \mathbb{C}^{d_{v} \times d_{v}}\).
In practice, since we can work with a Fourier series rather than a continuous transform, we will choose \(k\in\{0,1,2,\dots,k_{\text{max}}\}\) and then \(R\) can be represented by a tensor \(\mathbb{C}^{k_{\text{max}}\times d_{v}\times d_{v}}.\)
NB - my calculations occasionally came out differing from the versions the authors gave in the paper with regards to the dimensionality of the spaces. Not quite sure who is right here. *Caveat emptor*.
We can use a different \(W\) and \(R\) for each iteration if we want, say \(\{W_t,R_t\}_{1\leq t \leq T}\).
So the parameters of each of these, plus those of the maps \(P,Q\) comprise the parameters of the whole process.

Anyway, every step in this construction is differentiable in those parameters, and some of them can even be optimised with some FFTs and so on, so we are done with the setup, and have an operator that can be learned from data. Is it any good? Empirically the authors report that it is fast and precise, but I have yet to try it myself.

Quibble: They use the term ‘resolution-invariant’ loosely.
They do not actually prove that things are resolution *invariant* per se.
Rather, it is not even clear what that would specifically mean in this context — no attempt to prove Nyqvist conditions or other sampling-theoretic properties).
What *is* clear is that there is a an obvious interpretation of their solution as a continuous operator, in the sense that it can be evaluated at arbitrary points for the same computational cost as evaluating it at the training points.
Thus there is a sense in which it does not depend upon the resolution of the training set, in that we don’t need to rescale the coefficients of the solution in any sense to evaluate the functions our operator produces at unobserved coordinates.
You can, in a certain sense, treat *many* network structures as discrete approximations to PDE operators with a certain resolution (at least, deep Resnets with ReLU activations have such an interpretation, possibly others) and then use resampling methods to evaluate them at a different resolution, which is a more laborious process that potentially gets a similar result —
see the notebook on deep learning as dynamical system for an example of doing that.

Next quibble: Why *Fourier* transforms, instead of a different basis?
The Fourier transforms as used in this paper are truncated Fourier series, which amounts to an assumption that the basis functions are periodic, which in turn amounts to assuming that the domain of functions is toroidal, or, I suppose, a box with uniform boundary conditions.
This seems like an implausible restriction.
Li, Kovachki, Azizzadenesheli, Liu, Bhattacharya, et al. (2020b) argue that does not matter because the edge effects can be more-or-less ignored and that things will still work OK because of the “local linear projection” part, which is … fine I guess?
Somewhat vague though.
Would alternative bases fix that, or just be more of an annoying PITA?
It is suggestive that some other basis expansion might be nicer.
Would there be anything to gain from learning a basis for the expansion?

Also, this uses a lot of the right buzzwords to sound like a kernel trick approach, and one can’t help but feel there might be a logical Gaussian process regression formulation with nice Bayesian guarantees. But I can’t see a simple one; the operator passes though a lot of nonlinear functions and thus the pushforward measure will get gnarly. I suppose it can be ‘just’ another deep Gaussian process with a tactically chosen kernel.

*Journal of Nonlinear Science*29 (4): 1563–1619. https://doi.org/10.1007/s00332-018-9525-3.

*Journal of Computational Physics*384 (May): 1–15. https://doi.org/10.1016/j.jcp.2019.02.002.

*ACM Transactions on Graphics*38 (6): 1–16. https://doi.org/10.1145/3355089.3356506.

*Advances in Neural Information Processing Systems*. Vol. 33. tensor.

*Probabilistic Engineering Mechanics*57 (July): 14–25. https://doi.org/10.1016/j.probengmech.2019.05.001.

*Journal of Computational Physics*378 (February): 686–707. https://doi.org/10.1016/j.jcp.2018.10.045.

*Imageio/Imageio V0.9.0*(version v0.9.0). Zenodo. https://doi.org/10.5281/ZENODO.1488561.

*SIAM Journal on Scientific Computing*42 (1): A292–317. https://doi.org/10.1137/18M1225409.

*Journal of Computational Physics*411 (June): 109409. https://doi.org/10.1016/j.jcp.2020.109409.

*SIAM Journal on Scientific Computing*42 (2): A639–65. https://doi.org/10.1137/19M1260141.

*Journal of Computational Physics*397 (November): 108850. https://doi.org/10.1016/j.jcp.2019.07.048.