Using statistical or machine learning approaches to solve PDEs, and maybe even to perform inference through them.
There are various approaches here, which I will document on an *ad hoc* basis as I need them.
No claim is made to completeness.

There are presumably many approaches to ML learning of PDEs.

A classic is to learn a PDE on a grid of values and treat it as a classic convnet regression, and indeed the dynamical treatment of neural nets takes that to an extreme. For various practical reasons I would like to avoid assuming a grid. For one thing, grid systems are memory intensive and need expensive GPUs. For another, it is hard to integrate observations at multiple resolutions into a gridded data system. For a third, the research field of image prediction is too crowded for easy publications.

A grid free approach is graph networks that learn a topology and interaction system.
Nothing wrong with this idea *per se*, but it does not seem to be the most compelling approach to me for my domain of spatiotemporal prediction where we already know the topology and can avoid all the complicated bits of this approach.
So this I will also ignore for now.

Two other approaches will be handled here:

One learns an entire network which defines a function mapping from output-space coordinates to solution values (Raissi, Perdikaris, and Karniadakis 2019). This is the annoyingly-named implicit representation trick, whcih comes out very simply and elegantly although it has some limitations.

Another method is used in networks like Li, Kovachki, Azizzadenesheli, Liu, Bhattacharya, et al. (2020b) which learn a *PDE propagator* or *operator* which produce a representation of the *entire solution surface* in terms of some basis functions (e.g. a Fourier basis).

So! Let us examine some methods which use each of these last two approaches.

## Learning a PDE

This might seem weird if you are used to typical ML research. Unlike the usual neural network setting, we start by not trying to solve a statistical inference problem, where we have to learn an unknown prediction function from data, but here we have a partially or completely known function (PDE solver) that we are trying to approximate with a more convenient substitute (a neural approximation to that PDE solver).

That approximant is, IMO not that exciting as a PDE solver in itself. Probably you could have implemented your reference PDE solver on the GPU, or tweaked it a little, and got a faster PDE solver so the speed benefit is not so useful.

However, I would like it if the reference solvers were easier to differentiate through, and to construct posteriors with - what you might call tomography, or inverse problem. Enabling advanced tomography is what I would like to do here; but first we need to approximate the operator… or do we? In fact, if I already know the PDE operator and am implementing it in any case, I could avoid this learning step and simply implement the PDE solver using an off-the-shelf differentiable solver, which should have few disadvantages.

That said, let us now assume that we are learning to approximate a PDE, for whatever reason. In my case it is that am required to match an industry-standard black-box solver, which is a common reason.

There are several approaches to learning the dynamics of a PDE solver for given parameters.

## The PINN lineage

TODO: Harmonise the notation used in this section with the Fourier neural operator section; right now they match the papers’ notation but not each other.

This body of literature encompasses both “DeepONet” (operator learning) and “PINN” (physics informed neural nets) approaches. Distinctions TBD.

### Deterministic PINN

Archetypally, the PINN. Recently these have been hip (Raissi, Perdikaris, and Karniadakis 2017a, 2017b; Raissi, Perdikaris, and Karniadakis 2019; Yang, Zhang, and Karniadakis 2020; Zhang, Guo, and Karniadakis 2020; Zhang et al. 2019). Zhang et al. (2019) credits Lagaris, Likas, and Fotiadis (1998) with originating the idea in 1998, so I suppose this is not super fresh.

Let me introduce the basic “forward” PINN setup as given in Raissi, Perdikaris, and Karniadakis (2019):

In the basic model we have the following problem \[ u +\mathcal{N}[u;\eta]=0, x \in \Omega, t \in[0, T] \] where \(u(t, x)\) denotes the latent (hidden) solution. We assume the differential operator \(\mathcal{N}\) is parameterised by some \(\eta\) which for now we take to be known and suppress. We also assume we have some observations from the true PDE solutions, presumably simulated or analytically tractable enough to be given analytically. The latter case is presumably for benchmarking, as it makes this entire approach pointless AFAICS if the analytic version is easy.

We define *residual network* \(f(t, x)\) to be given by the left-hand side of the above
\[
f:=u +\mathcal{N}[u]
\]
and proceed by approximating \(u(t, x;\theta)\) with a deep neural network \(f(t, x;\theta) .\)

The approximation is data-driven, with samples set \(\mathcal{S}_{t}\) from a run of the PDE solver, \[ \mathcal{S}=\left\{ \left\{ u( {t}_u^{(i)}, {x}_u^{(i)}) \right\}_{i=1}^{N_{u}}, \left\{ f(t_{f}^{(i)}, (x_{f}^{(i)}) \right\}_{i=1}^{N_{f}} \right\}. \]

\(u(t, x;\theta)\) and \(f(t, x;\theta)\) share parameters \(\theta\) (but differ in output).
This seems to be a neural implicit representation-style approach, were we learn functions on *coordinates*.
Each parameter set for the simulator to be approximated is a new dataset, and training examples are pointwise-sampled from the solution.

We train by minimising a combined loss, \[ \mathcal{L}(\theta)=\operatorname{MSE}_{u}(\theta)+\operatorname{MSE}_{f}(\theta) \] where \[ \operatorname{MSE}_{u}=\frac{1}{N_{u}} \sum_{i=1}^{N_{u}}\left|u\left(t_{u}^{(i)}, x_{u}^{(i)}\right)-u^{(i)}\right|^{2} \] and \[ \operatorname{MSE}_{f}=\frac{1}{N_{f}} \sum_{i=1}^{N_{f}}\left|f\left(t_{f}^{(i)}, x_{f}^{(i)}\right)\right|^{2} \] Loss \(\operatorname{MSE}_{u}\) corresponds to the initial and boundary data while \(\operatorname{MSE}_{f}\) enforces the structure imposed by the defining differential operator at a finite set of collocation points. This trick allows us to learn while nearly enforcing a conservation law

The key insight is that if we are elbows-deep in a neural network framework anyway, we already have access to automatic differentiation, so differential operations over the input field are basically free.

An example is illustrative. Here is the reference Tensorflow interpretation from Raissi, Perdikaris, and Karniadakis (2019) for the Burger’s equation. In one space dimension, the Burger’s equation with Dirichlet boundary conditions reads \[ \begin{array}{l} u +u u_{x}-(0.01 / \pi) u_{x x}=0, \quad x \in[-1,1], \quad t \in[0,1] \\ u(0, x)=-\sin (\pi x) \\ u(t,-1)=u(t, 1)=0 \end{array} \] We define \(f(t, x)\) to be given by \[ f:=u +u u_{x}-(0.01 / \pi) u_{x x} \]

The python implementation of these two parts is essentially a naïve transcription of those equations.

```
def u(t, x):
u = neural_net(tf.concat([t,x],1), weights, biases)
return u
def f(t, x):
u = u(t, x)
u_t = tf.gradients(u, t)[0]
u_x = tf.gradients(u, x)[0]
u_xx = tf.gradients(u_x, x)[0]
f = u_t + u∗u_x − (0.01/ tf.pi)∗u_xx
return f
```

Because the outputs are parameterised by coordinates, the built-in autodiff does all the work.

The authors summarise the resulting network topology so:

What has this gained us?
So far, we have acquired a model which can, the authors assert, solve deterministic PDEs, which is nothing we could not do before.
We have sacrificed any guarantee that our method will in fact do well on data from outside our observations.
Also, I do not understand how I can plug alternative initial or boundary conditions in to this.
There is no *data* input, as such, at inference time, merely coordinates.
On the other hand, the author assert that this is faster *and* more stable than traditional solvers.
It has the nice feature that the solution is continuous in its arguments; there is no grid.
As far as NN things go, it has some weird and refreshing properties: it is simple, has small data, and has few tuning parameters.

But! what if we don’t know the parameters of the PDE Assume the differential operator has parameter \(\eta\) which is not in fact known. \[ u +\mathcal{N}[u;\eta]=0, x \in \Omega, t \in[0, T] \] The trick, as far as I can tell, is simply to include \(\eta\) in trainable parameters. \[ f(\eta):=u (\eta)+\mathcal{N}[u;\eta] \] and proceed by approximating \(u(t, x;\theta,\eta)\) with a deep neural network \(f(t, x;\theta,\eta) .\) Everything else proceeds as before.

Fine; now what? Two obvious challenges from where I am sitting.

- No way of changing inputs in the sense of initial or boundary conditions, without re-training the model
- Point predictions. No accounting for randomness or uncertainty.

### Stochastic PINN

Zhang et al. (2019) address point 2 via chaos expansions to handle the PDE emulation as a stochastic process regression, which apparently gives us estimates of parametric and process uncertainty. All diagrams in this section come from that paper.

The extended model adds a random noise parameter \(k(x ; \omega)\):

\[\begin{array}{c} \mathcal{N}_{x}[u(x ; \omega) ; k(x ; \omega)]=0, \quad x \in \mathcal{D}, \quad \omega \in \Omega \\ \text { B.C. } \quad \mathcal{B}_{x}[u(x ; \omega)]=0, \quad x \in \Gamma \end{array}\]The randomness in this could indicate a random coupling term, or uncertainty in some parameter of the model. Think of a Gaussian process prior over the forcing term of the PDE.

We sample this noise parameter also and augment the data set with it, over \(N\) distinct realisations, giving a data set like this:

\[ \mathcal{S}=\left\{ \left\{ k( {t}_u^{(i)}, {x}_u^{(i)}; \omega_{s}) \right\}_{i=1}^{N_{u}}, \left\{ u( {t}_u^{(i)}, {x}_u^{(i)}; \omega_{s}) \right\}_{i=1}^{N_{u}}, \left\{ f(t_{f}^{(i)}, (x_{f}^{(i)}) \right\}_{i=1}^{N_{f}} \right\}_{s=1}^{N}. \]

Note that I have kept the time variable explicit, unlike the paper to match the previous section, but it gets cluttered if we continue to do this, so let’s suppress \(t\) hereafter, and make it *just another* axis of a multidimensional \(x\).

So now we approximate \(k\). Why? AFAICT that is because we are going to make a polynomial basis for \(\xi\) which means that we will want few dimensions.

We let \(K\) be the \(N_{k} \times N_{k}\) covariance matrix for the sensor measurements on \(k,\) i.e., \[ K_{i, j}=\operatorname{Cov}\left(k^{(i)}, k^{(j)}\right) \] We take an eigendemposition of \(K\). Let \(\lambda_{l}\) and \(\phi_{l}\) denote \(l\)-th largest eigenvalue and its associated normalized eigenvector of Then we have \[ K=\Phi^{T} \Lambda \Phi \] where \(\mathbf{\Phi}=\left[\phi_{1}, \phi_{2}, \ldots, \phi_{N_{k}}\right]\) is an orthonormal matrix and \(\boldsymbol{\Lambda}=\operatorname{diag}\left(\lambda_{1}, \lambda_{2}, \ldots \lambda_{N_{k}}\right)\) is a diagonal matrix. Let \(\boldsymbol{k}_{s}=\left[k_{s}^{(1)}, k_{s}^{(2)}, \ldots, k_{s}^{\left(N_{k}\right)}\right]^{T}\) be the results of the \(k\) measurements of the \(s\)-th snapshot, then \[ \boldsymbol{\xi}_{s}=\boldsymbol{\Phi}^{T} \sqrt{\boldsymbol{\Lambda}}^{-1} \boldsymbol{k}_{s} \] is an whitened, i.e. uncorrelated, random vector, and hence \(\boldsymbol{k}_{s}\) can be rewritten as a reduced dimensional expansion \[ \boldsymbol{k}_{s} \approx \boldsymbol{k}_{0}+\sqrt{\boldsymbol{\Lambda}^{M}} \boldsymbol{\Phi}^{M} \boldsymbol{\xi}_{s}^{M}, \quad M<N_{k} \] where \(\boldsymbol{k}_0=\mathbb{E}\boldsymbol{k}.\) We fix \(M\ll N_k\) and suppress it herafter.

Now we have approximated away the correlated \(\omega\) noise and in favour of this \(\xi\) which we have finite-dimensional representations of. \[k\left(x_{k}^{(i)} ; \omega_{s}\right) \approx k_{0}\left(x_{k}^{(i)}\right)+\sum_{l=1}^{M} \sqrt{\lambda_{l}} k_{l}\left(x_{k}^{(i)}\right) \xi_{s, l}, \quad M<N_{k}\] Note that this is defined only at the observation points, though.

Next is where we use the *chaos expansion* trick to construct an interpolant.
Suppose the measure of RV \(\xi\) is \(\rho\).
We approximate this unknown measure by its empirical measure \(\nu_{S}\).
\[
\rho(\boldsymbol{\xi}) \approx \nu_{S}(\boldsymbol{\xi})=\frac{1}{N} \sum_{\boldsymbol{\xi}_{s} \in S} \delta_{\xi_{s}}(\boldsymbol{\xi})
\]
where \(\delta_{\xi_{s}}\) is the Dirac measure.

We construct a polynomial basis which is orthogonal with respect to the inner product associated to this measure, specifically \[\begin{aligned} \langle \phi, \psi\rangle &:= \int \phi(x)\psi(x)\rho(x)\mathrm{d}x\\ &\approx \int \phi(x)\psi(x)\nu_{S}(x)\mathrm{d}x \end{aligned}\]

OK, so we construct an orthonormal polynomial basis
\(\left\{\psi_{\alpha}(\boldsymbol{\xi})\right\}_{\alpha=0}^{P}\) via Gram-Schmidt orthogonalization process.^{1}
With the polynomial basis \(\left\{\psi_{\alpha}(\boldsymbol{\xi})\right\}\) we can write a function \(g(x ; \boldsymbol{\xi})\) in the form of the aPC expansion,
\[
g(x ; \boldsymbol{\xi})=\sum_{\alpha=0}^{P} g_{\alpha}(x) \psi_{\alpha}(\boldsymbol{\xi})
\]
where the each \(g_{\alpha}(x)\) is calculated by
\[
g_{\alpha}(x)=\frac{1}{N} \sum_{s=1}^{N} \psi_{\alpha}\left(\boldsymbol{\xi}_{s}\right) g\left(x ; \boldsymbol{\xi}_{s}\right).
\]

So we are going to pick \(g\) to be some quantity of interest in our sim, and in fact, we will take it top be two separate quantities, \(u\) and \(k\).

Then, we can approximate \(k\) and \(u\) at the \(s\)-th snapshot by \[ \tilde{k}\left(x ; \omega_{s}\right)=\widehat{k_{0}}(x)+\sum_{i=1}^{M} \sqrt{\lambda_{i}} \widehat{k_{i}}(x) \xi_{s, i} \] and \[ \tilde{u}\left(x ; \omega_{s}\right)=\sum_{\alpha=0}^{P} \widehat{u_{\alpha}}(x) \psi_{\alpha}\left(\boldsymbol{\xi}_{s}\right). \]

We construct two networks here,

- the network \(\widehat{u_{\alpha}}\), which takes the coordinate \(x\) as the input and outputs a \((P+1) \times 1\) vector of the aPC modes of \(u\) evaluated at \(x,\) and
- the network \(\widehat{k_{i}}\) that takes the coordinate \(x\) as the input and outputs a \((M+1) \times 1\) vector of the \(k\) modes.

The resulting network topology is

For concreteness, here is the topology for an example problem \(\mathcal{N}:=-\frac{\mathrm{d}}{\mathrm{d} x}\left(k(x ; \omega) \frac{\mathrm{d}}{\mathrm{d} x} u\right)-f\):

At inference time we take observations of \(k\), calculate the whitened \(\xi\), then use the chaos expansion representation to calculate the values at unobserved locations. \[ \mathcal{L}\left(\mathcal{S}_{t}\right)=\operatorname{MSE}_{u}+\operatorname{MSE}_{k}+\operatorname{MSE}_{f} \] where \[ \begin{array}{l} \operatorname{MSE}_{u}=\frac{1}{N N_{u}} \sum_{s=1}^{N} \sum_{i=1}^{N_{u}}\left[\left(\tilde{u}\left(x_{u}^{(i)} ; \omega_{s}\right)-u\left(x_{u}^{(i)} ; \omega_{s}\right)\right)^{2}\right] \\ \operatorname{MSE}_{k}=\frac{1}{N N_{k}} \sum_{s=1}^{N} \sum_{i=1}^{N_{k}}\left[\left(\tilde{k}\left(x_{k}^{(i)} ; \omega_{s}\right)-k\left(x_{k}^{(i)} ; \omega_{s}\right)\right)^{2}\right] \end{array} \] and \[ \operatorname{MSE}_{f}=\frac{1}{N N_{f}} \sum_{s=1}^{N} \sum_{i=1}^{N_{f}}\left[\left(\mathcal{N}_{x}\left[\tilde{u}\left(x_{f}^{(i)} ; \omega_{s}\right) ; \tilde{k}\left(x_{f}^{(i)} ; \omega_{s}\right)\right]\right)^{2}\right] \]

After all that I would describe this as a method to *construct a stochastic PDE with the desired covariance structure*.

OK, all that was very complicated. Presuming the neural networks are perfect, we have a good estimate of the distribution of random parameters and random output of a stochastic PDE evaluated over the whole surface from partial discrete measurements.

How do we estimate the uncertainty introduce by the neural net? Dropout.

Further questions:

- Loss scale; gradient errors may not be comparable to value errors in the loss function.
- Network capacity: What size networks are necessary (not the ones we lear here are tiny, with only hundreds of parameters)
- How do we generalise this to different initial conditions? Can we learn an observation-conditional PDE?
- After all this work it looks like I still can’t do
*inference*on this thing. How do I update a distribution over \(k\) by this method from observations of a new PDE? - Notice how the parameter inference problem for \(\eta\) vanished for the stochastic PDE? Can we learn an estimate for \(u\), \(\eta\) and \(k\) simultaneously in this setting? I imagine we repeat the trick where that parameter is learned along with the \(u\) network.

### Weak formulation

A different network topology using the implicit representaiton trick is is explored in Zang et al. (2020) and extended to inverse problems in Bao et al. (2020),
They consider the *weak* formulation of a PDE.

We start with the example second-order elliptic PDE with on domain \(\Omega \subset \mathbb{R}^{d}\) given \[\mathcal{N}[u]-f:=-\sum_{i=1}^{d} \partial_{i}\left(\sum_{j=1}^{d} a_{i j} \partial_{j} u\right)+\sum_{i=1}^{d} b_{i} \partial_{i} u+c u-f=0\] where \(a_{i j}, b_{i}, c: \Omega \rightarrow \mathbb{R}\) for \(i, j \in[d] \triangleq\{1, \ldots, d\}, f: \Omega \rightarrow \mathbb{R}\) and \(g: \partial \Omega \rightarrow \mathbb{R}\) are all given. We start by assuming Dirichlet boundary conditions, \(u(x)-g(x)=0,\) although this is rapidly generalised.

By multiplying both sides by a test function \(\varphi \in H_{0}^{1}(\Omega ; \mathbb{R})\) and integrating by parts: \[\left\{\begin{array}{l}\langle\mathcal{N}[u], \varphi\rangle \triangleq \int_{\Omega}\left(\sum_{j=1}^{d} \sum_{i=1}^{d} a_{i j} \partial_{j} u \partial_{i} \varphi+\sum_{i=1}^{d} b_{i} \varphi \partial_{i} u+c u \varphi-f \varphi\right) \mathrm{d} x=0 \\ \mathcal{B}[u]=0, \quad \text { on } \partial \Omega\end{array}\right.\]

The clever insight is that this inspires an adversarial problem to find the weak solutions, by considering the \(L^2\) operator norm of \(\mathcal{N}[u](\varphi) \triangleq\langle\mathcal{N}[u], \varphi\rangle\). Then the operator norm of \(\mathcal{N}[u]\) is defined \[\|\mathcal{N}[u]\|_{o p} \triangleq \max \left\{\langle\mathcal{N}[u], \varphi\rangle /\|\varphi\|_{2} \mid \varphi \in H_{0}^{1}, \varphi \neq 0\right\}.\] Therefore, \(u\) is a weak solution of of the PDE if and only if \(\|\mathcal{N}[u]\|_{o p}=0\) and the boundary condition \(\mathscr{B}[u]=0\) is satisfied on \(\delta \Omega\). As \(\|\mathcal{N}[u]\|_{o p} \geq 0\), we know that a weak solution \(u\) thus solves the following two equivalent problems in observation: \[\min _{u \in H^{1}}\|\mathcal{N}[u]\|_{o p}^{2} \Longleftrightarrow \min _{u \in H^{1}} \max _{\varphi \in H_{0}^{1}}|\langle\mathcal{N}[u], \varphi\rangle|^{2} /\|\varphi\|_{2}^{2}.\]

Specifically the solutions \(u_{\theta}: \mathbb{R}^{d} \rightarrow \mathbb{R}\) are realized as a deep neural network with parameter \(\theta\) to be learned, such that \(\mathscr{N}\left[u_{\theta}\right]\) minimizes the operator norm. The test function \(\varphi\), is a deep adversarial network with parameter \(\eta\), whcih adversarially challenges \(u_{\theta}\) by maximizing \(\left\langle\mathcal{N}\left[u_{\theta}\right], \varphi_{\eta}\right\rangle/\left\|\varphi_{\eta}\right\|_{2}\) for every given \(u_{\theta}\).

To train the deep neural network \(u_{\theta}\) and the adversarial network \(\varphi_{\eta}\) we construct appropriate loss functions \(u_{\theta}\) and \(\varphi_{\eta}\). Since logarithm function is monotone and strictly increasing, we can for convenience formulate the objective of \(u_{\theta}\) and \(\varphi_{\eta}\) in the interior of \(\Omega\) as \[L_{\text {int }}(\theta, \eta) \triangleq \log \left|\left\langle\mathcal{N}\left[u_{\theta}\right], \varphi_{\eta}\right\rangle\right|^{2}-\log \left\|\varphi_{\eta}\right\|_{2}^{2}.\] In addition, the weak solution \(u_{\theta}\) must also satisfy the boundary condition \(\mathscr{B}[u]=0\) on \(\delta \Omega\) which we fill in as above, calling it \(L_{\text {bdry }}(\theta).\) The total adversarial objective function is the weighted sum of the two objectives for which we seek for a saddle point that solves the minimax problem: \[\min _{o} \max L(\theta, \eta), \text{ where } L(\theta, \eta) \triangleq L_{\text {int }}(\theta, \eta)+\alpha L_{\text {bdry }}(\theta).\] \(\alpha\) might seem arbitrary; apparently it is useful as a tuning parameter.

This is a very elegant idea, although the implicit representation thing is still a problem for my use cases.

## Learning a PDE forward operator

Learning to solve a known PDE using a neural network was interesting but it left us somehow divorced from the inference problem of learning dynamics. Perhaps it would be nice to learn an operator that projects estimates of current states forward.

### Fourier neural operator

Zongyi Li blogs a neat trick here: We use Fourier transforms to capture resolution-invariant and non-local behaviour in PDE forward-solvers. There is a bouquet of papers designed to leverage this (Li, Kovachki, Azizzadenesheli, Liu, Bhattacharya, et al. 2020b, 2020a; Li, Kovachki, Azizzadenesheli, Liu, Stuart, et al. 2020). See also Anima Anandkumar’s presentation, which phrases this in terms of Green’s functions relationship with Fourier transforms.

See also nostalgebraist who has added some interesting commentary and also fixed up a typo in this page for me (Thanks!).

Learning a PDE operator is naturally expressed by a layer which acts on functions, and which is indifferent to how those functions are discretized, and in particular the scale at which they are discretized. Because this is neural net land we can be a certain type of sloppy and we don’t worry overmuch about the bandwidth of the signals we approximate; We keep some conservative number of low harmonics from the Fourier transform and use a fairly arbitrary non-Fourier maps too, and conduct the whole thing in a “high” dimensional space with vague relationship to the original problem without worrying too much about what it means.

Code is at zongyi-li/fourier_neural_operator.

The basic concept is heuristic and the main trick is doing lots of accounting of dimensions, and being ok with an arbitrary-dimensional Fourier transform \(\mathcal{F}:D \rightarrow \mathbb{R}^{d} \to D \rightarrow \mathbb{C}^{d}\), which, applied to a function \(f: D \rightarrow \mathbb{R}^{d}\) looks like this:

\[\begin{aligned} (\mathcal{F} f)_{j}(k) &=\int_{D} f_{j}(x) e^{-2 i \pi\langle x, k\rangle} \mathrm{d} x \\ \left(\mathcal{F}^{-1} f\right)_{j}(x) &=\int_{D} f_{j}(k) e^{2 i \pi\langle x, k\rangle} \mathrm{d} k \end{aligned}\] for each dimension \(j=1, \ldots, d\).

We notionally use the property that convolutions become multiplications under Fourier transform, which motivates the use of use convolutions to construct our operators. Good.

Problem set up: We assume that there is a map, \(G^{\dagger}\) which arises from the solution of a PDE. Let \(D \subset \mathbb{R}^{d}\) be a bounded, open set which is the argument set of our PDE solutions. We consider the input space \(\mathcal{A}=\mathcal{A}\left(D ; \mathbb{R}^{d_{a}}\right)\) of \(D\to\mathbb{R}^{d_{a}}\), and the solution space of \(\mathcal{U}=\mathcal{U}\left(D ; \mathbb{R}^{d_{u}}\right)\) of \(D\to\mathbb{R}^{d_{u}}\) functions. They are both Banach spaces. \(G^{\dagger}: \mathcal{A} \rightarrow \mathcal{U}\) is a map taking input functions to solutions functions which we learn from function pairs \(\left\{a_{j}, u_{j}\right\}_{j=1}^{N}\) where \(a_{j}\in \mathcal{A}\) and \(u_{j}=G^{\dagger}(a_{j})+\epsilon_{j}\) where \(\epsilon{j}\) is a corrupting noise. Further, we observe these functions only at certain points \(\{x_{1},\dots,x_{n}\}\subset D.\)

We approximate \(G^{\dagger}\) by choosing good parameters \(\theta^{\dagger} \in \Theta\) from the parameter space in some parametric family of maps \(G\left(\cdot, \theta\right)\) so that \(G\left(\cdot, \theta^{\dagger}\right)=G_{\theta^{\dagger}} \approx G^{\dagger}\). Specifically we minimise some cost functional \(C: \mathcal{U} \times \mathcal{U} \rightarrow \mathbb{R}\) to find \[ \min _{\theta \in \Theta} \mathbb{E}_{a \sim \mu}\left[C\left(G(a, \theta), G^{\dagger}(a)\right)\right]. \]

For the neural Fourier operator in particular we assume that \(G_{\theta}\) has a particular iterative form, i.e. \(G= Q \circ V_{T} \circ V_{T-1} \circ \ldots \circ V_{1} \circ P\). We introduce a new space \(\mathcal{V}=\mathcal{V}\left(D ; \mathbb{R}^{d_{v}}\right)\). \(P\) is a map \(\mathcal{A}\to\mathcal{V}\) and \(Q\) is a map \(\mathcal{V}\to\mathcal{U}\), and each \(V_{t}\) is a map \(\mathcal{V}\to\mathcal{V}\). Each of \(P\) and \(Q\) is ‘local’ in that they depend only upon pointwise evaluations of the function, e.g. for \(a\in\mathcal{A}\), \((Pa)(x)=p(a(x)\) for some \(p:\mathbb{R}^{d_{a}}\to\mathbb{R}^{d_{v}}\). Each \(v_{j}\) is function \(\mathbb{R}^{d_{v}}\). As a rule we are assuming \(\mathbb{R}^{d_{v}}>\mathbb{R}^{d_{a}}>\mathbb{R}^{d_{u}}.\) \(V_{t}\) is not local. In fact, we define \[ (V_{t}v)(x):=\sigma\left(W v(x)+\left(\mathcal{K}(a ; \phi) v\right)(x)\right), \quad \forall x \in D \] where \(\mathcal{K}(\cdot;\phi): \mathcal{A} \rightarrow \mathcal{L}\left(\mathcal{V}, \mathcal{V}\right)\). This map is parameterized by \(\phi \in \Theta_{\mathcal{K}}\). \(W: \mathbb{R}^{d_{v}} \rightarrow \mathbb{R}^{d_{v}}\) is a linear transformation, and \(\sigma: \mathbb{R} \rightarrow \mathbb{R}\) is a local, component-wise, non-linear activation function. \(\mathcal{K}(a ; \phi)\) is a kernel integral transformation, by which is meant \[ \left(\mathcal{K}(a ; \phi) v\right)(x):=\int_{D} \kappa_{\phi}(x, a(x), y, a(y)) v(y) \mathrm{d} y, \quad \forall x \in D \] where \(\kappa_{\phi}: \mathbb{R}^{2\left(d+d_{a}\right)} \rightarrow \mathbb{R}^{d_{v} \times d_{v}}\) is some mapping parameterized by \(\phi \in \Theta_{\mathcal{K}}\).

Anyway, it looks like this:

We immediately throw out the dependence on \(a\) in the kernel definition replacing it with
\[\kappa_{\phi}(x, a(x), y, a(y)) := \kappa_{R}(x-y)\] so that the integral operator becomes a convolution.
This convolution can be calculated cheaply in Fourier space, which suggests we may as well define and calculate it also in Fourier space.
Accordingly, the real work happens when they define the Fourier integral operator
\[
\left(\mathcal{K}(\phi) v\right)(x)=\mathcal{F}^{-1}\left(R \cdot\left(\mathcal{F} v\right)\right)(x) \quad \forall x \in D
\]
where \(R\) is the Fourier transform of a periodic function \(\kappa: D \rightarrow \mathbb{R}^{d_{v} \times d_{v}}\) parameterized by \(\phi \in \Theta_{\mathcal{K}}\).
Checking our units here, we have that
\(\left(\mathcal{F} v\right):D \to \mathbb{C}^{d_{v}}\)
and \(R (k): D \to \mathbb{C}^{d_{v} \times d_{v}}\).
In practice, since we can work with a Fourier series rather than a continuous transform, we will choose \(k\in\{0,1,2,\dots,k_{\text{max}}\}\) and then \(R\) can be represented by a tensor \(\mathbb{C}^{k_{\text{max}}\times d_{v}\times d_{v}}.\)
NB - my calculations occasionally came out differing from the versions the authors gave in the paper with regards to the dimensionality of the spaces.
Not quite sure who is right here. *Caveat emptor*.
We can use a different \(W\) and \(R\) for each iteration if we want, say \(\{W_t,R_t\}_{1\leq t \leq T}\).
So, the parameters of each of these, plus those of the maps \(P,Q\) comprise the parameters of the whole process.

Anyway, every step in this construction is differentiable in those parameters, and some of the steps can even be found rapidly using FFTs and so on, so we are done with the setup, and have an operator that can be learned from data. Is it any good? Empirically the authors report that it is fast and precise, but I have yet to try it myself.

Quibble: They use the term ‘resolution-invariant’ loosely at a few points in the paper, and it takes some work to understand the actual defensible claim here.
They do not actually prove that things are resolution *invariant* per se.
Rather, it is not even clear what that would specifically mean in this context — no attempt to prove Nyquist conditions or other sampling-theoretic properties over the spatial domain.
And, AFAICT in the time domain their method is resolution-dependent in every sense.
Step size is fixed.

What *is* clear and what I think they mean is that there is an obvious interpretation of the solution as a continuous operator, in the sense that it can be evaluated at arbitrary (spatial) points for the same computational cost as evaluating it at the training points.
Thus there is a sense in which it does not depend upon the resolution of the training set, in that we don’t any resampling of the solution to evaluate the functions our operator produces at unobserved coordinates.

What does this get us?
You can, in a certain sense, treat *many* network structures as discrete approximations to PDE operators with a certain resolution (at least, deep Resnets with ReLU activations have such an interpretation, presumably others) and then use resampling methods to evaluate them at a different resolution, which is a more laborious process that potentially gets a similar result —
see the notebook on deep learning as dynamical system for an example of doing that.

Next quibble: Why *Fourier* transforms, instead of a different basis?
Their rationale is that it integral operators are basically convolutions in the linear case, and therefore a bit like convolutions more generally.
Heuristically, stacked convolutions and non-linearities might well-approximate a nonlinear PDE.
So convolutions are what we want, and Fourier transforms turn convolutions into multiplications so we should use them.
For me a better rationale is that these are “nice” bases for PDEs because Fourier transforms have well-understood interpretations which encode derivatives, translations, interpolations and other useful operations, which is why we use them normally in classic PDE solvers. The convolution thing is more “deep net”-like though.

Also Fourier transforms are notionally fast, which is another popular deep net rationale. AFAICT the authors do not make any particular effort to be maximally fast by using dilation to truncate the Fourier series. Instead they calculate the whole Fourier transform then throw some bits out. Presumably because they were already fast enough without being clever.

OTOH Fourier series also encode some troubling assumptions. Essentially amounts to an assumption that the basis functions are periodic, which in turn amounts to assuming that the domain of functions is toroidal, or, I suppose, a box with uniform boundary conditions. This is pretty common, but also annoying. Li, Kovachki, Azizzadenesheli, Liu, Bhattacharya, et al. (2020b) argue that does not matter because the edge effects can be more-or-less ignored and that things will still work OK because of the “local linear projection” part, which is … fine I guess? Maybe true. It still grates. Certainly the boundary conditions situation will just be messy AFAICT for any neural operator so I am at ease with the idea they would like to simply ignore it for now.

Would alternative bases fix that boundary problem, or just be more of an annoying PITA?
Would there be anything to gain from *learning* a basis for the expansion?
Certainly you would lose speed.

Also, this method uses a lot of the right buzzwords to sound like a kernel trick approach, and one can’t help but feel there might be a logical Gaussian process regression formulation with nice Bayesian interpretation. But I can’t see an obvious way of making that fly; the operator passes though a lot of nonlinear functions and thus the pushforward measure will get gnarly, not to mention the correlation kernel. I suppose it could possibly be approximated as ‘just’ another deep Gaussian process with some tactical assumptions, perhaps infinite-width asymptotics. Or a variational approximation? But what is a plausible variational family here?

### DeepONet

Lu, Jin, and Karniadakis (2020) is from the people who brought you PINN, above. The setup is related, but AFAICT differs in a few ways in that

- we don’t (necessarily?) use the derivative information at the sensor locations
- we learn an operator mapping initial/latent conditions to output functions
- we decompose the input function space into a basis and them sample randomly from the bases in order to span the input space at training time

The authors argue they have found a good topology for this

A DeepONet consists of two sub-networks, one for encoding the input function at a fixed number of sensors \(x_i, i = 1, \dots, m\) (branch net), and another for encoding the locations for the output functions (trunk net).

This addresses some problems with generalisation that make the PINN setup seem unsatisfactory; in particular we can change the inputs, or project arbitrary inputs forward.

The boundary conditions and input points appear to stay fixed though, and inference of the unknowns is still vexed.

## Advection-diffusion PDEs in particular

F. Sigrist, Künsch, and Stahel (2015b) finds a nice spectral representation of certain classes of stochastic PDE. These are extended in Liu, Yeo, and Lu (2020) to non-stationary operators. By being less generic, these come out with computationally convenient spectral representations.

## Boundary conditions

Are hard to handle in a neural network context. Solutions seem diverse. Liu, Yeo, and Lu (2020) just assumes that there are some initial conditions and boundaries can be “handled” by hoping that simulating a toroidal domain then cutting out a segment of interest.

## Inverse problems

Tomography through PDEs.

Suppose I have a PDE, possibly with some unknown parameters in the driving equation. All being equal I can do not too badly at approximating that with tools already mentioned. What if I wish simultaneously infer some unknown inputs, and in fact learn to solve for a such inputs? Then we consider it as an inverse problem. The methods so far do not solve that kind of problem, but if we squint at them a little we might hope they do better.

Liu, Yeo, and Lu (2020) is one approach, generalizing the approach of F. Sigrist, Künsch, and Stahel (2015b) to a system with noise.

## Differentiable solvers

OK, suppose I am keen to make another method again that will do clever things to augment the potential of PDE solvers. To this end it would be nice to have a PDE solver that is not a completely black box but which you can interrogate for useful gradients. Obviously all PDE solvers use derivative information, but only some of them expose that to users so that we can use them as an implicit step in machine learning algorithms.

OTOH, there is a lot of stuff that needs doing in PDEs that maybe we would like a more full-featured library to solve. That is why PDE libraries are a thing. A common thing. A massively common thing which is endlessly re-implemented. Some of the solvers listed at PDE solvers are do expose the information we want; and some have been re-reimplemented inside machine learning frameworks. How about I list some of each?

### DeepXDE

DeepXDE is the reference solver for PINN and DeepONet.

Use DeepXDE if you need a deep learning library that

- solves forward and inverse partial differential equations (PDEs) via physics-informed neural network (PINN),
- solves forward and inverse integro-differential equations (IDEs) via PINN,
- solves forward and inverse fractional partial differential equations (fPDEs) via fractional PINN (fPINN),
- approximates functions from multi-fidelity data via multi-fidelity NN (MFNN),
- approximates nonlinear operators via deep operator network (DeepONet),
- approximates functions from a dataset with/without constraints.

You might need to moderate your expectations a little. This is an impressive library, but as covered above, some of the types of problems that it can solve are more limited than you might hope upon reading the description. Think of it as a neural network library that handles certain PDEs and you will not go too far astray.

### ADCME

ADCME is suitable for conducting inverse modeling in scientific computing; specifically, ADCME targets physics informed machine learning, which leverages machine learning techniques to solve challenging scientific computing problems. The purpose of the package is to:

- provide differentiable programming framework for scientific computing based on TensorFlow automatic differentiation (AD) backend;
- adapt syntax to facilitate implementing scientific computing, particularly for numerical PDE discretization schemes;
- supply missing functionalities in the backend (TensorFlow) that are important for engineering, such as sparse linear algebra, constrained optimization, etc.
Applications include

- physics informed machine learning (a.k.a., scientific machine learning, physics informed learning, etc.)
- coupled hydrological and full waveform inversion
- constitutive modeling in solid mechanics
- learning hidden geophysical dynamics
- parameter estimation in stochastic processes
The package inherits the scalability and efficiency from the well-optimized backend TensorFlow. Meanwhile, it provides access to incorporate existing C/C++ codes via the custom operators. For example, some functionalities for sparse matrices are implemented in this way and serve as extendable “plugins” for ADCME.

### TenFEM

TenFEM offers a small selection of differentiable FEM solvers fpr Tensorflow.

### JuliaFEM

Julaifem is an umbrella organisation supporting julia-backed FEM solvers. The documentation is tricksy, but check out the examples. Analyses and solvers · JuliaFEM.jl. I assume these are all differentiable since that is a selling point of the SciML.jl ecosystem they spring from.

### Trixi

Trixi.jl is a numerical simulation framework for hyperbolic conservation laws written in Julia. A key objective for the framework is to be useful to both scientists and students. Therefore, next to having an extensible design with a fast implementation, Trixi is focused on being easy to use for new or inexperienced users, including the installation and postprocessing procedures.

### FEniCS

Also seems to be a friendly PDE solver, lacking in GPU support. However, it does have an interface to pytorch, barkm/torch-fenics.

### taichi

“Sparse simulator” Tai Chi is presumably also able to solve PDEs? 🤷🏼♂️ If so that would be nifty because it is also differentiable. I suspect it is more of a graph network approach.

## References

*Proceedings of The 28th Conference on Learning Theory*, 40:113–49. Paris, France: PMLR. http://proceedings.mlr.press/v40/Arora15.html.

*Inverse Problems*36 (11): 115003. https://doi.org/10.1088/1361-6420/abb447.

*Proceedings of the National Academy of Sciences*116 (31): 15344–49. https://doi.org/10.1073/pnas.1814058116.

*Journal of Nonlinear Science*29 (4): 1563–1619. https://doi.org/10.1007/s00332-018-9525-3.

*Advances in Computational Mathematics*45 (5-6): 2503–32. https://doi.org/10.1007/s10444-019-09723-8.

*Journal of Computational Physics*384 (May): 1–15. https://doi.org/10.1016/j.jcp.2019.02.002.

*ACM Transactions on Graphics*38 (6): 1–16. https://doi.org/10.1145/3355089.3356506.

*The Journal of Machine Learning Research*17 (1): 613–66. http://arxiv.org/abs/1510.08231.

*IEEE Transactions on Neural Networks*9 (5): 987–1000. https://doi.org/10.1109/72.712178.

*Advances in Neural Information Processing Systems*. Vol. 33. tensor.

*Canadian Journal of Statistics*35 (4): 597–606. https://doi.org/10.1002/cjs.5550350410.

*Journal of the American Statistical Association*0 (0): 1–18. https://doi.org/10.1080/01621459.2020.1863223.

*Probabilistic Engineering Mechanics*57 (July): 14–25. https://doi.org/10.1016/j.probengmech.2019.05.001.

*Reliability Engineering & System Safety*106 (October): 179–90. https://doi.org/10.1016/j.ress.2012.05.002.

*Physica D: Nonlinear Phenomena*406 (May): 132401. https://doi.org/10.1016/j.physd.2020.132401.

*Journal of Computational Physics*378 (February): 686–707. https://doi.org/10.1016/j.jcp.2018.10.045.

*Advances in Neural Information Processing Systems*. Vol. 33. https://proceedings.neurips.cc//paper_files/paper/2020/hash/9f319422ca17b1082ea49820353f14ab-Abstract.html.

*Journal of Statistical Software*63 (14). https://doi.org/10.18637/jss.v063.i14.

*Journal of the Royal Statistical Society: Series B (Statistical Methodology)*77 (1): 3–33. https://doi.org/10.1111/rssb.12061.

*Imageio/Imageio V0.9.0*(version v0.9.0). Zenodo. https://doi.org/10.5281/ZENODO.1488561.

*Statistics and Computing*30 (2): 419–46. https://doi.org/10.1007/s11222-019-09886-w.

*IEEE Transactions on Pattern Analysis and Machine Intelligence*42 (8): 1968–80. https://doi.org/10.1109/TPAMI.2019.2904255.

*SIAM Journal on Scientific Computing*42 (1): A292–317. https://doi.org/10.1137/18M1225409.

*Journal of Computational Physics*411 (June): 109409. https://doi.org/10.1016/j.jcp.2020.109409.

*SIAM Journal on Scientific Computing*42 (2): A639–65. https://doi.org/10.1137/19M1260141.

*Journal of Computational Physics*397 (November): 108850. https://doi.org/10.1016/j.jcp.2019.07.048.

\(P\), the size of the basis, depends on the highest allowed polynomial order \(r\) in \(\psi_{\alpha}(\boldsymbol{\xi}),\) following the formula \[ P+1=\frac{(r+M) !}{r ! M !}. \]↩︎

## No comments yet. Why not leave one?