Machine learning for partial differential equations



\(\newcommand{\solop}{\mathcal{G}^{\dagger}}\)

Using statistical or machine learning approaches to solve PDEs, and maybe even to perform inference through them. There are many approaches to ML for PDEs; I will document them on an ad hoc basis as I need them. No claim is made to completeness.

TODO: avoid a proliferation of unclear symbols by introducing a specific example; clarify which neural nets represent operators, which represent specific functions, and between which spaces they map.

TODO: Harmonise the notation used in this section with subsections below; right now they match the papers’ notation but not each other.

TODO: should the intro section actually be filed under PDEs?

TODO: introduce a consistent notation for coordinate space, output spaces, and function space?

Background

Suppose we have a PDE defined over some input domain, which we presume is a time dimension and some number of spatial dimensions. The PDE is specified by some differential operator \(\mathcal{D}\) and some forcing or boundary condition \(u\in \mathscr{U},\) as \[\mathcal{D}[f]=u.\] These functions will map from some coordinate space \(C\) to some output space \(O\). At the moment we consider only compact sets of positive Lebesgue measure, \(C\subseteq\mathbb{R}^{d_C}\) and \(O\subseteq\mathbb{R}^{d_O}.\) The first coordinate of the input space often has the special interpretation as time \(t\in \mathbb{R}\) and the subsequent coordinates are then spatial coordinates \(x\in D\subseteq \mathbb{R}^{d_{D}}\) where \(d_{D}=d_{C}-1.\) Sometimes we make this explicit by writing the time coordinate separately as \(f(t,x).\) A common case, concretely, is \(C=\mathbb{R} \times \mathbb{R}^2=\mathbb{R} \times D\) and \(O=\mathbb{R}.\) For each time \(t\in \mathbb{R}\) we assume the instantaneous solution \(f(t, \cdot)\) to be an element of some Banach space \(f\in \mathscr{A}\) of functions \(f(t, \cdot): D\to O.\) The overall solutions \(f: C\to O\) have their own Banach space \(\mathscr{F}\). More particularly, we might consider solutions on a restricted time domain \(t\in [0,T]\) and some spatial domain \(D\subseteq \mathbb{R}^2\), where a solution is a function \(f\) that maps \([0,T] \times D \to \mathbb{R}.\) This would naturally model, say, a 2D height-field evolving over time.
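
For a concrete instance (a standard textbook example rather than one drawn from the papers cited below), take the forced heat equation on \(C=[0,T]\times D\) with \(O=\mathbb{R}\): \[\mathcal{D}[f] := \partial_t f - \kappa \Delta_x f, \qquad \mathcal{D}[f]=u,\] so that \(f\in\mathscr{F}\) is the evolving temperature field, \(u\in\mathscr{U}\) is the forcing, and each time-slice \(f(t,\cdot):D\to\mathbb{R}\) is an element of \(\mathscr{A}\).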

We have thrown the term Banach space about without making it clear which one we mean. There are usually some implied smoothness properties, and of course we would want to specify a norm to fully pin these spaces down, but we gloss over that for now.

We have introduced one operator, the defining operator \(\mathcal{D}\) . Another that we think about a lot is the PDE propagator or forward operator \(\mathcal{P}_s,\) which produces a representation of the entire solution surface at some future moment, given current and boundary conditions. \[\mathcal{P}_s[f(t, \cdot)]=f( t+s, \cdot).\] We might also discuss a solution operator \[\solop:\begin{array}{l}\mathscr{U}\to\mathscr{F}\\ u\mapsto f\end{array}\] such that \[\mathcal{D}\left[\solop[u]\right]=u.\]

Handling all these weird, and presumably infinite-dimensional, function spaces \(\mathscr{A},\mathscr{U},\mathscr{F},\dots\) on a finite computer requires us to introduce a notion of discretisation. We need to find some finite-dimensional representations of these functions so that they can be computed in a finite machine. PDE solvers use various tricks to do that, and each one is its own research field. Finite difference approximations treat all the solutions as values on a grid, effectively approximating \(\mathscr{F}\) with some new space of functions \(\mathbb{Z}^2 \times \mathbb{Z} \to \mathbb{R},\) or, if you’d like, in terms of “bar chart” basis functions. Finite element methods define the PDE over a more complicated indexing system of compactly-supported basis functions which form a mesh. Particle systems approximate PDEs with moving particles which define their own adaptive basis. If there is some other natural (preferably orthogonal) basis of functions on the solution surface we might use that instead; for example, with the right structure the eigenfunctions of the defining operator might give us such a basis. Fourier bases are famous in this case.
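
As a concrete, deliberately crude sketch of the grid approach (my own toy example, not taken from any of the cited references), here is an explicit finite-difference propagator \(\mathcal{P}_{\delta t}\) for the 1D heat equation \(\partial_t f = \kappa \partial_{xx} f\) on a periodic grid:

import numpy as np

def heat_step(f, dt=2e-5, dx=1e-2, kappa=1.0):
    # one explicit Euler / central-difference step of f_t = kappa * f_xx;
    # a gridded stand-in for the propagator P_dt acting on f(t, .)
    f_xx = (np.roll(f, -1) - 2 * f + np.roll(f, 1)) / dx**2
    return f + dt * kappa * f_xx

x = np.linspace(0.0, 1.0, 100, endpoint=False)
f = np.sin(2 * np.pi * x)     # the initial condition f(0, .) sampled on the grid
for _ in range(500):          # crude approximation of P_{500 dt}[f(0, .)]
    f = heat_step(f)

The step size is constrained by stability (here \(\kappa\,\delta t/\delta x^{2}=0.2\le 1/2\)), which is part of what makes fixed grids expensive.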

A classic approach for neural nets is to learn a finite-difference approximation of the PDE on a grid of values and treat it as a convnet regression problem, and indeed the dynamical treatment of neural nets is based on that. For various practical reasons I would like to avoid requiring a grid on my input values as much as possible. For one thing, grid systems are memory intensive and need expensive GPUs. For another, it is hard to integrate observations at multiple resolutions into a gridded data system. For a third, the research field of image prediction is too crowded for easy publications. Thus, that approach will not be treated further.

A grid-free approach is graph networks that learn a topology and interaction system. This seems to naturally map on to PDEs of the kind that we usually solve by particle systems, e.g. fluid dynamics with immiscible substances. Nothing wrong with this idea per se, but it does not seem to be the most compelling approach to me for my domain of spatiotemporal prediction where we already know the topology and can avoid all the complicated bits of graph networks. So this I will also ignore for now.

There are a few options. For an overview of many other techniques see Physics-based Deep Learning by Philipp Holl, Maximilian Mueller, Patrick Schnell, Felix Trost, Nils Thuerey, Kiwon Um (Thuerey et al. 2021). Brunton and Kutz’s Data-Driven Science and Engineering (Brunton and Kutz 2019) covers related material; both go further than mere PDEs and consider general scientific settings. Also, the seminar series by the authors of that latter book is a moving feast of the latest results in this area.

Here we look in depth mainly at two important ones.

One approach learns a network \(\hat{f}\in \mathscr{F}, \hat{f}: C \to O\) such that \(\hat{f}\approx f\) (Raissi, Perdikaris, and Karniadakis 2019). This is the annoyingly-named implicit representation trick. Another approach is used in networks like Li, Kovachki, Azizzadenesheli, Liu, Bhattacharya, et al. (2020b), which learn the forward operator \(\mathcal{P}_1: \mathscr{A}\to\mathscr{A}.\) When the papers mentioned talk about operator learning, this is the operator that they seem to mean by default.

Physics-informed approximation of dynamics

This entire idea might seem weird if you are used to typical ML research. Unlike the usual neural network setting, we are not trying to solve a statistical inference problem, where we must learn an unknown prediction function from data; instead we have a partially or completely known function (a PDE solver) that we are trying to approximate with a more convenient substitute (a neural approximation to that PDE solver).

That approximant is, IMO, not that exciting as a PDE solver in itself. Probably we could have implemented the reference PDE solver on the GPU, or tweaked it a little, and got a faster PDE solver, so the speed benefit is not so useful.

However, I would like it if the reference solvers were easier to differentiate through, and to construct posteriors with; what you might call tomography, or inverse problems. Enabling advanced tomography is what I would like to do. But note that we still do not need to use ML methods to do that. In fact, if I already know the PDE operator and am implementing it in any case, I could avoid the learning step and simply implement the PDE using an off-the-shelf differentiable solver, which would allow us to perform this inference.

Nonetheless, we might wish to learn to approximate a PDE, for whatever reason. In my case it is that I am required to match an industry-standard black-box solver that is not flexible, which is a common situation. Sometimes researchers hope it might be faster to use an approximated method. YMMV.

There are several approaches to learning the dynamics of a PDE solver for given parameters.

The PINN lineage

This body of literature encompasses both DeepONet (‘operator learning’) and PINN (‘physics informed neural nets’) approaches. Distinctions TBD.

Deterministic PINN

NB, notation here is not yet harmonised with the notation I used in the intro. Archetypally, the PINN. Recently these have been hip (Raissi, Perdikaris, and Karniadakis 2017a, 2017b; Raissi, Perdikaris, and Karniadakis 2019; Yang, Zhang, and Karniadakis 2020; Zhang, Guo, and Karniadakis 2020; Zhang et al. 2019). Zhang et al. (2019) credits Lagaris, Likas, and Fotiadis (1998) with originating the idea in 1998, so I suppose this is not super fresh. The key insight is that if we are elbows-deep in a neural network framework anyway, we already have access to automatic differentiation, so differential operations over the input field are basically free.

Let us introduce the basic “forward” PINN setup as given in Raissi, Perdikaris, and Karniadakis (2019): In the basic model we have the following problem \[ f_t +\mathcal{D}[f;\eta]=0, \quad x \in \Omega, \quad t \in[0, T] \] where \(f(t, x)\) denotes the solution (note this \(\mathcal{D}\) has the opposite sign to the convention we used in the intro). We assume the differential operator \(\mathcal{D}\) is parameterised by some \(\eta\), which for now we take to be known and suppress. Time and space axes are not treated specially in this approach, but we keep them separate so that we can more closely approximate the terminology of the original paper. Our goal is to learn a neural network \(\hat{f}\in\mathscr{F},\ \hat{f}:C\to O,\) that approximates the solution to this PDE. We also assume we have some observations from the true PDE solution, presumably simulated, or analytically tractable enough to be given in closed form. The latter case is presumably for benchmarking, as it makes this entire approach pointless AFAICS if the analytic version is easy.

We define the residual \(r(t, x)\) to be the left-hand side of the above, \[ r:=f_t +\mathcal{D}[f], \] and proceed by approximating \(f(t, x)\) with a deep neural network \(f(t, x;\theta),\) which induces a second network \(r(t, x;\theta)\) for the residual.

The approximation is data-driven, with a sample set \(S\) from a run of the PDE solver, \[ S=\left\{ \left\{ f( {t}_{f}^{(i)}, {x}_{f}^{(i)}) \right\}_{i=1}^{N_{f}}, \left\{ r(t_{r}^{(i)}, x_{r}^{(i)}) \right\}_{i=1}^{N_{r}} \right\}. \]

\(f(t, x;\theta)\) and \(r(t, x;\theta)\) share parameters \(\theta\) (but differ in output). This seems to be a neural implicit representation-style approach, where we learn functions on coordinates. Each parameter set for the simulator to be approximated is a new dataset, and training examples are pointwise-sampled from the solution.


We train by minimising a combined loss, \[ L(\theta)=\operatorname{MSE}_{f}(\theta)+\operatorname{MSE}_{r}(\theta) \] where \[ \operatorname{MSE}_{f}=\frac{1}{N_{f}} \sum_{i=1}^{N_{f}}\left|f\left(t_{f}^{(i)}, x_{f}^{(i)}\right)-f^{(i)}\right|^{2} \] and \[ \operatorname{MSE}_{r}=\frac{1}{N_{r}} \sum_{i=1}^{N_{r}}\left|r\left(t_{r}^{(i)}, x_{r}^{(i)}\right)\right|^{2}. \] The loss \(\operatorname{MSE}_{f}\) corresponds to the initial and boundary data while \(\operatorname{MSE}_{r}\) enforces the structure imposed by the defining differential operator at a finite set of collocation points. This trick allows us to learn an approximate solution which nearly enforces the desired conservation law.

An example is illustrative. Here is the reference Tensorflow interpretation from Raissi, Perdikaris, and Karniadakis (2019) for the Burgers equation. In one space dimension, the Burgers equation with Dirichlet boundary conditions reads \[ \begin{array}{l} f_t +f f_{x}-(0.01 / \pi) f_{x x}=0, \quad x \in[-1,1], \quad t \in[0,1] \\ f(0, x)=-\sin (\pi x) \\ f(t,-1)=f(t, 1)=0 \end{array} \] We define \(r(t, x)\) to be given by \[ r:=f_t +f f_{x}-(0.01 / \pi) f_{x x}. \]

The python implementation of these two parts is essentially a naïve transcription of those equations.

import numpy as np
import tensorflow as tf  # TF1-style graph mode, as in the original paper's repo

def f(t, x):
    # the network maps coordinates (t, x) to the approximate solution value
    return neural_net(tf.concat([t, x], 1), weights, biases)

def r(t, x):
    # PDE residual, computed from f via automatic differentiation
    f_ = f(t, x)
    f_t = tf.gradients(f_, t)[0]
    f_x = tf.gradients(f_, x)[0]
    f_xx = tf.gradients(f_x, x)[0]
    return f_t + f_ * f_x - (0.01 / np.pi) * f_xx
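
To complete the picture, here is a sketch of the combined training loss in the same TF1 style; this is my own gloss rather than the paper’s code, and the placeholder tensors t_f, x_f, f_obs, t_r, x_r (the initial/boundary data and the collocation points) are hypothetical names:

mse_f = tf.reduce_mean(tf.square(f(t_f, x_f) - f_obs))  # fit to initial/boundary data
mse_r = tf.reduce_mean(tf.square(r(t_r, x_r)))          # PDE residual at collocation points
loss = mse_f + mse_r
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)  # TF1-style graph-mode training step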

Because the outputs are parameterised by coordinates, the built-in autodiff does all the work. The authors summarise the resulting network topology so:

Raissi’s \(\solop_x\) corresponds to \(\mathcal{D}[f]-f\) in the terminology of this post, and his \(u\) corresponds to our \(f.\)

What has this gained us? So far, we have acquired a model which can, the authors assert, solve deterministic PDEs, which is nothing we could not do before. We have sacrificed any guarantee that our method will in fact do well on data from outside our observations. Also, I do not understand how I can plug alternative initial or boundary conditions into this; there is no data input, as such, at inference time, merely coordinates. On the other hand, the authors assert that this is faster and more stable than traditional solvers. It has the nice feature that the solution is continuous in its arguments; there is no grid. As far as NN things go, it has some weird and refreshing properties: it is simple, has small data, and has few tuning parameters.

But! What if we don’t know the parameters of the PDE? Assume the differential operator has a parameter \(\eta\) which is not in fact known, \[ f_t +\mathcal{D}[f;\eta]=0, \quad x \in \Omega, \quad t \in[0, T]. \] The trick, as far as I can tell, is simply to include \(\eta\) among the trainable parameters, \[ r(\eta):=f_t +\mathcal{D}[f;\eta], \] and proceed by approximating \(f(t, x)\) with a deep neural network \(f(t, x;\theta,\eta)\), with induced residual network \(r(t, x;\theta,\eta).\) Everything else proceeds as before.
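
In code, continuing the hypothetical TF1 sketch above for the Burgers example, with the viscosity as the unknown \(\eta\):

eta = tf.Variable(0.0, dtype=tf.float32)  # unknown PDE parameter, trained jointly with the weights

def r(t, x):
    # same residual as before, but with the learnable eta replacing the known viscosity
    f_ = f(t, x)
    f_t = tf.gradients(f_, t)[0]
    f_x = tf.gradients(f_, x)[0]
    f_xx = tf.gradients(f_x, x)[0]
    return f_t + f_ * f_x - eta * f_xx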

Fine; now what? Two obvious challenges from where I am sitting.

  1. No way of changing inputs in the sense of initial or boundary conditions, without re-training the model
  2. Point predictions. No accounting for randomness or uncertainty.

Stochastic PINN

Zhang et al. (2019) address point 2 via chaos expansions to handle the PDE emulation as a stochastic process regression, which apparently gives us estimates of parametric and process uncertainty. All diagrams in this section come from that paper.

🏗️ Terminology warning: I have not yet harmonised the terminology of this section with the rest of the page.

The extended model adds a random noise parameter \(k(x ; \omega)\): \[ \begin{array}{c} \mathcal{D}_{x}[u(x ; \omega) ; k(x ; \omega)]=0, \quad x \in \mathcal{D}, \quad \omega \in \Omega \\ \text { B.C. } \quad \mathcal{B}_{x}[u(x ; \omega)]=0, \quad x \in \Gamma \end{array} \]

The randomness in this could indicate a random coupling term, or uncertainty in some parameter of the model. Think of a Gaussian process prior over the forcing term of the PDE. We sample this noise parameter also and augment the data set with it, over \(N\) distinct realisations, giving a data set like this:

\[ S=\left\{ \left\{ k( {t}_u^{(i)}, {x}_u^{(i)}; \omega_{s}) \right\}_{i=1}^{N_{u}}, \left\{ u( {t}_u^{(i)}, {x}_u^{(i)}; \omega_{s}) \right\}_{i=1}^{N_{u}}, \left\{ r(t_{r}^{(i)}, x_{r}^{(i)}) \right\}_{i=1}^{N_{r}} \right\}_{s=1}^{N}. \]

Note that I have kept the time variable explicit, unlike the paper, to match the previous section, but it gets cluttered if we continue to do this, so let’s suppress \(t\) hereafter, and make it just another axis of a multidimensional \(x\).

So now we approximate \(k\). Why? AFAICT that is because we are going to make a polynomial basis for \(\xi\) which means that we want few dimensions.

We let \(K\) be the \(N_{k} \times N_{k}\) covariance matrix for the sensor measurements on \(k,\) i.e., \[ K_{i, j}=\operatorname{Cov}\left(k^{(i)}, k^{(j)}\right). \] We take an eigendecomposition of \(K\). Let \(\lambda_{l}\) and \(\phi_{l}\) denote the \(l\)-th largest eigenvalue and its associated normalized eigenvector of \(K\). Then we have \[ K=\Phi^{T} \Lambda \Phi \] where \(\mathbf{\Phi}=\left[\phi_{1}, \phi_{2}, \ldots, \phi_{N_{k}}\right]\) is an orthonormal matrix and \(\boldsymbol{\Lambda}=\operatorname{diag}\left(\lambda_{1}, \lambda_{2}, \ldots \lambda_{N_{k}}\right)\) is a diagonal matrix. Let \(\boldsymbol{k}_{s}=\left[k_{s}^{(1)}, k_{s}^{(2)}, \ldots, k_{s}^{\left(N_{k}\right)}\right]^{T}\) be the results of the \(k\) measurements of the \(s\)-th snapshot; then \[ \boldsymbol{\xi}_{s}=\boldsymbol{\Phi}^{T} \sqrt{\boldsymbol{\Lambda}}^{-1} \boldsymbol{k}_{s} \] is a whitened, i.e. uncorrelated, random vector, and hence \(\boldsymbol{k}_{s}\) can be rewritten as a reduced-dimensional expansion \[ \boldsymbol{k}_{s} \approx \boldsymbol{k}_{0}+\sqrt{\boldsymbol{\Lambda}^{M}} \boldsymbol{\Phi}^{M} \boldsymbol{\xi}_{s}^{M}, \quad M<N_{k} \] where \(\boldsymbol{k}_0=\mathbb{E}\boldsymbol{k}.\) We fix \(M\ll N_k\) and suppress it hereafter.

Now we have approximated away the correlated \(\omega\) noise in favour of this \(\xi\), which we have finite-dimensional representations of: \[k\left(x_{k}^{(i)} ; \omega_{s}\right) \approx k_{0}\left(x_{k}^{(i)}\right)+\sum_{l=1}^{M} \sqrt{\lambda_{l}} k_{l}\left(x_{k}^{(i)}\right) \xi_{s, l}, \quad M<N_{k}.\] Note that this is defined only at the observation points, though.
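
A minimal numpy sketch of that whitening-and-truncation step (my own, not from the paper; k_snapshots is a hypothetical \(N\times N_k\) array of sensor measurements, and I use the convention \(K\approx\Phi\Lambda\Phi^{\top}\)):

import numpy as np

def truncated_kl(k_snapshots, M):
    # k_snapshots: N x N_k array; row s holds the s-th snapshot at the N_k sensors
    k0 = k_snapshots.mean(axis=0)                  # mean field at the sensors
    K = np.cov(k_snapshots, rowvar=False)          # N_k x N_k sample covariance
    lam, Phi = np.linalg.eigh(K)                   # eigenvalues in ascending order
    lam, Phi = lam[::-1][:M], Phi[:, ::-1][:, :M]  # keep the M leading modes
    xi = (k_snapshots - k0) @ Phi / np.sqrt(lam)   # whitened coordinates xi_s
    return k0, lam, Phi, xi                        # k_s ~ k0 + Phi sqrt(diag(lam)) xi_s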

Next is where we use the chaos expansion trick to construct an interpolant. Suppose the measure of RV \(\xi\) is \(\rho\). We approximate this unknown measure by its empirical measure \(\nu_{S}\). \[ \rho(\boldsymbol{\xi}) \approx \nu_{S}(\boldsymbol{\xi})=\frac{1}{N} \sum_{\boldsymbol{\xi}_{s} \in S} \delta_{\xi_{s}}(\boldsymbol{\xi}) \] where \(\delta_{\xi_{s}}\) is the Dirac measure.

We construct a polynomial basis which is orthogonal with respect to the inner product associated to this measure, specifically \[\begin{aligned} \langle \phi, \psi\rangle &:= \int \phi(x)\psi(x)\rho(x)\mathrm{d}x\\ &\approx \int \phi(x)\psi(x)\nu_{S}(x)\mathrm{d}x \end{aligned}\]

OK, so we construct an orthonormal polynomial basis \(\left\{\psi_{\alpha}(\boldsymbol{\xi})\right\}_{\alpha=0}^{P}\) via the Gram-Schmidt orthogonalization process. With the polynomial basis \(\left\{\psi_{\alpha}(\boldsymbol{\xi})\right\}\) we can write a function \(g(x ; \boldsymbol{\xi})\) in the form of the aPC expansion, \[ g(x ; \boldsymbol{\xi})=\sum_{\alpha=0}^{P} g_{\alpha}(x) \psi_{\alpha}(\boldsymbol{\xi}) \] where each \(g_{\alpha}(x)\) is calculated by \[ g_{\alpha}(x)=\frac{1}{N} \sum_{s=1}^{N} \psi_{\alpha}\left(\boldsymbol{\xi}_{s}\right) g\left(x ; \boldsymbol{\xi}_{s}\right). \]
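
Here is a hedged numpy sketch of that aPC construction for a one-dimensional \(\xi\) (my own simplification; the paper works with multivariate \(\xi\) and an explicit Gram-Schmidt recursion, whereas a QR factorisation of the monomial design matrix does the same orthogonalisation against the empirical measure):

import numpy as np

def apc_basis(xi_samples, P):
    # monomials 1, xi, ..., xi^P at the samples, orthonormalised against
    # the empirical measure nu_S via QR (equivalent to Gram-Schmidt)
    N = len(xi_samples)
    V = np.vander(xi_samples, P + 1, increasing=True)   # N x (P+1)
    Q, _ = np.linalg.qr(V / np.sqrt(N))
    return np.sqrt(N) * Q                                # psi[s, alpha] = psi_alpha(xi_s)

def apc_coefficients(psi, g_at_xi):
    # g_alpha(x) = (1/N) sum_s psi_alpha(xi_s) g(x; xi_s), for a fixed x
    return psi.T @ g_at_xi / psi.shape[0]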

So we are going to pick \(g\) to be some quantity of interest in our sim; in fact, we take it to be two separate quantities, \(u\) and \(k\).

Then, we can approximate \(k\) and \(u\) at the \(s\)-th snapshot by \[ \tilde{k}\left(x ; \omega_{s}\right)=\widehat{k_{0}}(x)+\sum_{i=1}^{M} \sqrt{\lambda_{i}} \widehat{k_{i}}(x) \xi_{s, i} \] and \[ \tilde{u}\left(x ; \omega_{s}\right)=\sum_{\alpha=0}^{P} \widehat{u_{\alpha}}(x) \psi_{\alpha}\left(\boldsymbol{\xi}_{s}\right). \]

We construct two networks,

  1. the network \(\widehat{u_{\alpha}}\), which takes the coordinate \(x\) as the input and outputs a \((P+1) \times 1\) vector of the aPC modes of \(u\) evaluated at \(x,\) and
  2. the network \(\widehat{k_{i}}\) that takes the coordinate \(x\) as the input and outputs a \((M+1) \times 1\) vector of the \(k\) modes.

The resulting network topology is illustrated in the paper.

For concreteness, here is the topology for an example problem \(\mathcal{D}:=-\frac{\mathrm{d}}{\mathrm{d} x}\left(k(x ; \omega) \frac{\mathrm{d}}{\mathrm{d} x} u\right)-f\):

At inference time we take observations of \(k\), calculate the whitened \(\xi\), then use the chaos expansion representation to calculate the values at unobserved locations. The networks are trained by minimising \[ \mathcal{L}\left(S_{t}\right)=\operatorname{MSE}_{u}+\operatorname{MSE}_{k}+\operatorname{MSE}_{f} \] where \[ \begin{array}{l} \operatorname{MSE}_{u}=\frac{1}{N N_{u}} \sum_{s=1}^{N} \sum_{i=1}^{N_{u}}\left[\left(\tilde{u}\left(x_{u}^{(i)} ; \omega_{s}\right)-u\left(x_{u}^{(i)} ; \omega_{s}\right)\right)^{2}\right] \\ \operatorname{MSE}_{k}=\frac{1}{N N_{k}} \sum_{s=1}^{N} \sum_{i=1}^{N_{k}}\left[\left(\tilde{k}\left(x_{k}^{(i)} ; \omega_{s}\right)-k\left(x_{k}^{(i)} ; \omega_{s}\right)\right)^{2}\right] \end{array} \] and \[ \operatorname{MSE}_{f}=\frac{1}{N N_{f}} \sum_{s=1}^{N} \sum_{i=1}^{N_{f}}\left[\left(\mathcal{D}_{x}\left[\tilde{u}\left(x_{f}^{(i)} ; \omega_{s}\right) ; \tilde{k}\left(x_{f}^{(i)} ; \omega_{s}\right)\right]\right)^{2}\right]. \]

After all that, I would describe this as a method to construct a stochastic PDE emulator with a desired covariance structure, which is a genuinely hard thing to do; consider the mess we get into with the Karhunen-Loève expansion and spectral expansions. Anyway, after all this, presuming the neural networks are perfect, we have a good estimate of the distribution of the random parameters and the random output of a stochastic PDE evaluated over the whole surface, from partial, discrete measurements.

How do we estimate the uncertainty introduced by the neural net itself? Dropout.

Further questions:

  1. Loss scale; gradient errors may not be comparable to value errors in the loss function.
  2. Network capacity: what size networks are necessary? (Note the ones we learn here are tiny, with only hundreds of parameters.)
  3. How do we generalise this to different initial conditions? Can we learn an observation-conditional PDE?
  4. After all this work it looks like I still can’t do inference on this thing. How do I update a distribution over \(k\) by this method from observations of a new PDE?
  5. Notice how the parameter inference problem for \(\eta\) vanished for the stochastic PDE? Can we learn an estimate for \(u\), \(\eta\) and \(k\) simultaneously in this setting? I imagine we repeat the trick where that parameter is learned along with the \(u\) network.

Weak formulation

A different network topology using the implicit representation trick is explored in Zang et al. (2020) and extended to inverse problems in Bao et al. (2020). They discuss this in terms of a weak formulation of the PDE.

🏗️ Terminology warning: I have not yet harmonised the terminology of this section with the rest of the page.

We start with an example second-order elliptic PDE on a domain \(\Omega \subset \mathbb{R}^{d}\), given by \[\mathcal{D}[u]-f:=-\sum_{i=1}^{d} \partial_{i}\left(\sum_{j=1}^{d} a_{i j} \partial_{j} u\right)+\sum_{i=1}^{d} b_{i} \partial_{i} u+c u-f=0\] where \(a_{i j}, b_{i}, c: \Omega \rightarrow \mathbb{R}\) for \(i, j \in[d] \triangleq\{1, \ldots, d\}, f: \Omega \rightarrow \mathbb{R}\) and \(g: \partial \Omega \rightarrow \mathbb{R}\) are all given. We start by assuming Dirichlet boundary conditions, \(u(x)-g(x)=0\) on \(\partial\Omega,\) although this is rapidly generalised.

By multiplying both sides by a test function \(\varphi \in H_{0}^{1}(\Omega ; \mathbb{R})\) and integrating by parts: \[\left\{\begin{array}{l}\langle\mathcal{D}[u], \varphi\rangle \triangleq \int_{\Omega}\left(\sum_{j=1}^{d} \sum_{i=1}^{d} a_{i j} \partial_{j} u \partial_{i} \varphi+\sum_{i=1}^{d} b_{i} \varphi \partial_{i} u+c u \varphi-f \varphi\right) \mathrm{d} x=0 \\ \mathcal{B}[u]=0, \quad \text { on } \partial \Omega\end{array}\right.\]

The clever insight is that this inspires an adversarial problem to find the weak solutions, by considering the \(L^2\) operator norm of \(\mathcal{D}[u](\varphi) \triangleq\langle\mathcal{D}[u], \varphi\rangle\). The operator norm of \(\mathcal{D}[u]\) is defined as \[\|\mathcal{D}[u]\|_{o p} \triangleq \max \left\{\langle\mathcal{D}[u], \varphi\rangle /\|\varphi\|_{2} \mid \varphi \in H_{0}^{1}, \varphi \neq 0\right\}.\] Therefore, \(u\) is a weak solution of the PDE if and only if \(\|\mathcal{D}[u]\|_{o p}=0\) and the boundary condition \(\mathcal{B}[u]=0\) is satisfied on \(\partial \Omega\). As \(\|\mathcal{D}[u]\|_{o p} \geq 0\), a weak solution \(u\) thus solves the following two equivalent problems: \[\min _{u \in H^{1}}\|\mathcal{D}[u]\|_{o p}^{2} \Longleftrightarrow \min _{u \in H^{1}} \max _{\varphi \in H_{0}^{1}}|\langle\mathcal{D}[u], \varphi\rangle|^{2} /\|\varphi\|_{2}^{2}.\]

Specifically, the solution \(u_{\theta}: \mathbb{R}^{d} \rightarrow \mathbb{R}\) is realized as a deep neural network with parameters \(\theta\) to be learned, such that \(u_{\theta}\) minimizes the (estimated) operator norm \(\left\|\mathcal{D}\left[u_{\theta}\right]\right\|_{op}\). The test function \(\varphi\) is a deep adversarial network with parameters \(\eta\), which adversarially challenges \(u_{\theta}\) by maximizing \(\left\langle\mathcal{D}\left[u_{\theta}\right], \varphi_{\eta}\right\rangle/\left\|\varphi_{\eta}\right\|_{2}\) for every given \(u_{\theta}\).

To train the deep neural network \(u_{\theta}\) and the adversarial network \(\varphi_{\eta}\) we construct appropriate loss functions for each. Since the logarithm is monotone and strictly increasing, we can for convenience formulate the objective of \(u_{\theta}\) and \(\varphi_{\eta}\) in the interior of \(\Omega\) as \[L_{\text {int }}(\theta, \eta) \triangleq \log \left|\left\langle\mathcal{D}\left[u_{\theta}\right], \varphi_{\eta}\right\rangle\right|^{2}-\log \left\|\varphi_{\eta}\right\|_{2}^{2}.\] In addition, the weak solution \(u_{\theta}\) must also satisfy the boundary condition \(\mathcal{B}[u]=0\) on \(\partial \Omega\), which we encode as above, calling it \(L_{\text {bdry }}(\theta).\) The total adversarial objective function is the weighted sum of the two objectives, for which we seek a saddle point that solves the minimax problem \[\min _{\theta} \max_{\eta} L(\theta, \eta), \text{ where } L(\theta, \eta) \triangleq L_{\text {int }}(\theta, \eta)+\alpha L_{\text {bdry }}(\theta).\] \(\alpha\) might seem arbitrary; apparently it is useful as a tuning parameter.
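
To make the minimax structure concrete, here is a rough PyTorch sketch for the simplest case \(-\Delta u=f\) on the unit square; this is my own toy rendering, not the authors’ code, and the boundary term plus most practical details are omitted:

import torch

u_net = torch.nn.Sequential(torch.nn.Linear(2, 64), torch.nn.Tanh(), torch.nn.Linear(64, 1))
phi_net = torch.nn.Sequential(torch.nn.Linear(2, 64), torch.nn.Tanh(), torch.nn.Linear(64, 1))

def grad(y, x):
    # per-sample gradient of y with respect to the coordinates x
    return torch.autograd.grad(y, x, grad_outputs=torch.ones_like(y), create_graph=True)[0]

def interior_objective(x, f):
    # Monte Carlo estimate of <D[u], phi> = integral of (grad u . grad phi - f phi),
    # then L_int = log |<D[u], phi>|^2 - log ||phi||^2
    u, phi = u_net(x), phi_net(x)
    weak = ((grad(u, x) * grad(phi, x)).sum(1, keepdim=True) - f * phi).mean()
    return torch.log(weak ** 2) - torch.log((phi ** 2).mean())

x = torch.rand(2048, 2, requires_grad=True)  # interior collocation points
f = torch.ones(2048, 1)                      # forcing term
opt_u = torch.optim.Adam(u_net.parameters(), lr=1e-3)
opt_phi = torch.optim.Adam(phi_net.parameters(), lr=1e-3)
for step in range(5000):
    loss = interior_objective(x, f)          # + alpha * L_bdry(theta), omitted here
    opt_u.zero_grad()
    opt_phi.zero_grad()
    loss.backward()
    opt_u.step()                             # descend in theta
    for p in phi_net.parameters():
        p.grad.neg_()                        # flip the sign so this step ascends in eta
    opt_phi.step()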

This is a very elegant idea, although the implicit representation thing is still a problem for my use cases.

Learning a PDE forward operator

Learning to solve a known PDE using a neural network was interesting, but it left us somehow divorced from the inference problem of learning dynamics. Perhaps it would be nice to learn an operator that projects estimates of current states forward, by learning to approximate the series of solver steps that a PDE solver would take. This clearly rhymes with the idea of implicit neural networks.

Fourier neural operator

Zongyi Li blogs a neat trick: We use Fourier transforms to capture resolution-invariant and non-local behaviour in PDE forward-propagators. Essentially, we learn to approximate \(\mathcal{P}_1: \mathscr{A}\to\mathscr{A}.\) There is a bouquet of papers designed to leverage this (Li, Kovachki, Azizzadenesheli, Liu, Bhattacharya, et al. 2020b, 2020a; Li, Kovachki, Azizzadenesheli, Liu, Stuart, et al. 2020). See also Anima Anandkumar’s presentation, which phrases this in terms of Green’s functions and their relationship with Fourier transforms. See also nostalgebraist who has added some interesting commentary and also fixed up a typo in this page for me (Thanks!). Yannic Kilcher’s Fourier Neural Operator for Parametric Partial Differential Equations Explained is popular but I have not watched it. Code is at zongyi-li/fourier_neural_operator.

Ok, so what is going on? Learning a PDE operator is naturally expressed by a layer which acts on functions, and which is indifferent to how those functions are discretized, and in particular to the scale at which they are discretized. Because this is neural net land we can be a certain type of sloppy and not worry overmuch about the bandwidth of the signals we approximate; we keep some conservative number of low harmonics from the Fourier transform, use some fairly arbitrary non-Fourier maps too, and conduct the whole thing in a “high”-dimensional space with a vague relationship to the original problem, without worrying too much about what it means.

The basic concept is heuristic and the main trick is doing lots of accounting of dimensions, and being ok with an arbitrary-dimensional Fourier transform \(\mathcal{F}:(D \rightarrow \mathbb{R}^{d}) \to (D \rightarrow \mathbb{C}^{d})\), which looks like this:

\[\begin{aligned} (\mathcal{F} f)_{j}(k) &=\int_{D} f_{j}(x) e^{-2 i \pi\langle x, k\rangle} \mathrm{d} x \\ \left(\mathcal{F}^{-1} f\right)_{j}(x) &=\int_{D} f_{j}(k) e^{2 i \pi\langle x, k\rangle} \mathrm{d} k \end{aligned}\] for each dimension \(j=1, \ldots, d\).

We notionally use the property that convolutions become multiplications under the Fourier transform, which motivates the use of convolutions to construct our operators. Good.

Terminology warning: \(\mathscr{A},\mathscr{U}\) are not the same as in the first section. I need to disambiguate what I mean by input space and solution space here.

Problem set up: We assume that there is a map \(\mathcal{G}^{\dagger}\) which solves PDEs with given forcing conditions. We consider PDEs defined in the following informal fashion: solutions are functions in some Banach space \(\mathscr{A}\) comprising functions \(f:C\to O\), where \(C\subseteq\mathbb{R}^{d_C}\) is a compact set of positive Lebesgue measure and \(O=\mathbb{R}^{d_O}.\) We are also concerned with a Banach space \(\mathscr{U}\) of forcing functions \(u:C\to O'\). The PDE is specified by a differential operator \(\mathcal{D}:\mathscr{A}\to\mathscr{U}\) such that \[\mathcal{D}[f]=u.\]

The first coordinate of the input space \(C\) often has the special interpretation as time \(t\in \mathbb{R}\) and the subsequent coordinates are then spatial coordinates \(x\in D\subseteq \mathbb{R}^{d_{D}}\) where \(d_{D}=d_{C}-1.\) Sometimes we make this explicit by writing the time coordinate separately, as \(f(t,x).\) A common case, concretely, is \(C=\mathbb{R} \times \mathbb{R}^2=\mathbb{R} \times D\) and \(O=\mathbb{R}.\) For each time \(t\in \mathbb{R}\) we assume the instantaneous solution \(f(t, \cdot)\) to be an element of some Banach space \(f\in \mathscr{A}\) of functions \(f(t, \cdot): D\to O.\) The overall solutions \(f: C\to O\) have their own Banach space \(\mathscr{F}\). More particularly, we might consider solutions on a restricted time domain \(t\in [0,T]\) and some spatial domain \(D\subseteq \mathbb{R}^2\), where a solution is a function \(f\) that maps \([0,T] \times D \to \mathbb{R}.\) This would naturally model, say, a 2D height-field evolving over time.

We have introduced one operator, the defining operator \(\mathcal{D}\). Another operator of interest is a solution operator \[\solop:\begin{array}{l}\mathscr{U}\to\mathscr{F}\\ u\mapsto f\end{array}\] such that \[\mathcal{D}\left[\solop[u]\right]=u.\] In the case that the values of the solution are not known for all time, but rather must be predicted forward, it is natural to decompose the forward operator into one that advances a solution (how is this phrased in the FNO paper?). We call this the propagator \(\mathcal{P}_s,\) which produces the entire solution surface at some future time \(t+s\), given current and boundary conditions: \[\mathcal{P}_s[f(t, \cdot)]=f( t+s, \cdot).\] The Fourier Neural Operator is a method for approximating the operators \(\solop\) and \(\mathcal{P}_{1}\).

\(\mathcal{G}^{\dagger}: \mathscr{A} \rightarrow \mathscr{U}\) is a map taking input functions to solution functions. We learn to approximate this from function pairs \(\left\{a_{j}, u_{j}\right\}_{j=1}^{N}\) where \(a_{j}\in \mathscr{A}\) and \(u_{j}=\mathcal{G}^{\dagger}(a_{j})+\epsilon_{j}\), where \(\epsilon_{j}\) is a corrupting noise. Further, we observe these functions only at certain points \(\{x_{1},\dots,x_{n}\}\subset D.\) We approximate \(\mathcal{G}^{\dagger}\) by choosing good parameters \(\theta^{\dagger} \in \Theta\) from the parameter space in some parametric family of maps \(\mathcal{G}\left(\cdot, \theta\right)\) so that \(\mathcal{G}\left(\cdot, \theta^{\dagger}\right)=\mathcal{G}_{\theta^{\dagger}} \approx \mathcal{G}^{\dagger}\). Specifically we minimise some cost functional \(C: \mathscr{U} \times \mathscr{U} \rightarrow \mathbb{R}\) to find \[ \min _{\theta \in \Theta} \mathbb{E}_{a \sim \mu}\left[C\left(\mathcal{G}(a, \theta), \mathcal{G}^{\dagger}(a)\right)\right]. \]

For the Fourier neural operator in particular we assume that \(\mathcal{G}_{\theta}\) has a particular iterative form, i.e. \(\mathcal{G}= Q \circ V_{T} \circ V_{T-1} \circ \ldots \circ V_{1} \circ P\). We introduce a new space \(\mathscr{V}=\mathscr{V}\left(D ; \mathbb{R}^{d_{v}}\right)\). \(P\) is a map \(\mathscr{A}\to\mathscr{V}\) and \(Q\) is a map \(\mathscr{V}\to\mathscr{U}\), and each \(V_{t}\) is a map \(\mathscr{V}\to\mathscr{V}\). Each of \(P\) and \(Q\) is ‘local’ in that it depends only upon pointwise evaluations of the function, e.g. for \(a\in\mathscr{A}\), \((Pa)(x)=p(a(x))\) for some \(p:\mathbb{R}^{d_{a}}\to\mathbb{R}^{d_{v}}\). Each \(v_{t}\) is a function \(D\to\mathbb{R}^{d_{v}}\). As a rule we are assuming \(d_{v}>d_{a}>d_{u}.\) \(V_{t}\) is not local. In fact, we define \[ (V_{t}v)(x):=\sigma\left(W v(x)+\left(\mathcal{K}(a ; \phi) v\right)(x)\right), \quad \forall x \in D \] where \(\mathcal{K}(\cdot;\phi): \mathscr{A} \rightarrow \mathcal{L}\left(\mathscr{V}, \mathscr{V}\right)\). This map is parameterized by \(\phi \in \Theta_{\mathcal{K}}\). \(W: \mathbb{R}^{d_{v}} \rightarrow \mathbb{R}^{d_{v}}\) is a linear transformation, and \(\sigma: \mathbb{R} \rightarrow \mathbb{R}\) is a local, component-wise, non-linear activation function. \(\mathcal{K}(a ; \phi)\) is a kernel integral transformation, by which is meant \[ \left(\mathcal{K}(a ; \phi) v\right)(x):=\int_{D} \kappa_{\phi}(x, a(x), y, a(y)) v(y) \mathrm{d} y, \quad \forall x \in D \] where \(\kappa_{\phi}: \mathbb{R}^{2\left(d+d_{a}\right)} \rightarrow \mathbb{R}^{d_{v} \times d_{v}}\) is some mapping parameterized by \(\phi \in \Theta_{\mathcal{K}}\).

Anyway, it looks like this:

Li’s Fourier neural operator layer.

We immediately throw out the dependence on \(a\) in the kernel definition, replacing it with \[\kappa_{\phi}(x, a(x), y, a(y)) := \kappa_{R}(x-y)\] so that the integral operator becomes a convolution. This convolution can be calculated cheaply in Fourier space, which suggests we may as well define and calculate it also in Fourier space. Accordingly, the real work happens when they define the Fourier integral operator \[ \left(\mathcal{K}(\phi) v\right)(x)=\mathcal{F}^{-1}\left(R \cdot\left(\mathcal{F} v\right)\right)(x) \quad \forall x \in D \] where \(R\) is the Fourier transform of a periodic function \(\kappa: D \rightarrow \mathbb{R}^{d_{v} \times d_{v}}\) parameterized by \(\phi \in \Theta_{\mathcal{K}}\). Checking our units, each Fourier coefficient satisfies \(\left(\mathcal{F} v\right)(k)\in \mathbb{C}^{d_{v}}\) and \(R(k)\in \mathbb{C}^{d_{v} \times d_{v}}\). In practice, since we can work with a Fourier series rather than a continuous transform, we will choose \(k\in\{0,1,2,\dots,k_{\text{max}}\}\) and then \(R\) can be represented by a tensor in \(\mathbb{C}^{k_{\text{max}}\times d_{v}\times d_{v}}\) (not quite sure who is right here; caveat emptor). We can use a different \(W\) and \(R\) for each iteration if we want, say \(\{W_t,R_t\}_{1\leq t \leq T}\). So the parameters of each of these, plus those of the maps \(P,Q\), comprise the parameters of the whole process.
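
For intuition, here is a minimal PyTorch sketch of a 1D layer of this kind; it is a paraphrase of the idea rather than the authors’ reference implementation (for that, see zongyi-li/fourier_neural_operator):

import torch

class SpectralConv1d(torch.nn.Module):
    # FFT the channels, multiply the lowest k_max modes by a learned complex
    # tensor R (one d_v x d_v matrix per retained mode), then inverse FFT.
    # Assumes the grid is long enough that n_grid // 2 + 1 >= k_max.
    def __init__(self, d_v, k_max):
        super().__init__()
        self.k_max = k_max
        self.R = torch.nn.Parameter(
            torch.randn(k_max, d_v, d_v, dtype=torch.cfloat) / d_v)

    def forward(self, v):                         # v: (batch, d_v, n_grid)
        v_hat = torch.fft.rfft(v)
        out = torch.zeros_like(v_hat)
        out[:, :, :self.k_max] = torch.einsum(
            "bik,kio->bok", v_hat[:, :, :self.k_max], self.R)
        return torch.fft.irfft(out, n=v.shape[-1])

class FourierLayer(torch.nn.Module):
    # one iteration v -> sigma(W v + K v), cf. the update V_t above
    def __init__(self, d_v, k_max):
        super().__init__()
        self.K = SpectralConv1d(d_v, k_max)
        self.W = torch.nn.Conv1d(d_v, d_v, kernel_size=1)  # pointwise linear map

    def forward(self, v):
        return torch.relu(self.W(v) + self.K(v))

The lifting map \(P\) and the projection \(Q\) from the text would then be pointwise layers (e.g. further 1×1 convolutions) wrapped around a stack of such FourierLayer blocks.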

Anyway, every step in this construction is differentiable in those parameters, and some of the steps can even be found rapidly using FFTs and so on, so we are done with the setup, and have an operator that can be learned from data. Is it any good? Empirically the authors report that it is fast and precise, but I have yet to try it myself.

Quibble: They use the term ‘resolution-invariant’ loosely at a few points in the paper, and it takes some work to understand the actual defensible claim. They do not actually prove that things are resolution invariant per se. Rather, it is not even clear what that would specifically mean in this context — no attempt to prove Nyquist conditions or other sampling-theoretic properties over the spatial domain. And, AFAICT in the time domain their method is resolution-dependent in every sense. Step size is fixed.

What is clear, and what I think they mean, is that there is an obvious interpretation of the solution as a continuous operator, in the sense that it can be evaluated at arbitrary (spatial) points for the same computational cost as evaluating it at the training points. Thus there is a sense in which it does not depend upon the resolution of the training set, in that we don’t need any resampling of the solution to evaluate the functions our operator produces at unobserved coordinates.

What does this get us? You can, in a certain sense, treat many network structures as discrete approximations to PDE operators with a certain resolution (at least, deep Resnets with ReLU activations have such an interpretation, presumably others) and then use resampling methods to evaluate them at a different resolution, which is a more laborious process that potentially gets a similar result — see the notebook on deep learning as dynamical system for examples of doing that.

Next quibble: Why Fourier transforms, instead of a different basis? Their rationale is that integral operators can be understood as continuous convolutions in the linear case, and are therefore a bit like convolutions more generally. Heuristically, stacked continuous convolutions and non-linearities might well approximate the operations of solving a nonlinear PDE. So convolutions are what we want, and Fourier transforms turn convolutions into multiplications, so we can use them. It might sound like we should use actual convolutional neural networks to solve the PDE, but that would impose a given resolution on the solution, which is not what we want. For me, a better rationale is that the Fourier transform puts us in a “nice” basis for PDEs, because Fourier transforms have well-understood interpretations which encode derivatives, translations, interpolations and other useful operations, which is why we use them in classic PDE solvers. The convolution thing is more “deep net-like” though.

Also Fourier transforms are notionally fast, which is another popular deep net rationale. Implementation detail: the authors do not make any particular effort to be maximally fast by using dilation to truncate the Fourier series. Instead they calculate the whole Fourier transform then throw some bits out. Presumably because they were already fast enough without bothering with being clever.

OTOH Fourier series also encode some troubling assumptions. Essentially these amount to an assumption that the basis functions are periodic, which in turn amounts to assuming that the domain of the functions is toroidal, or, I suppose, a box with uniform boundary conditions. This is pretty common, but also annoying. Li, Kovachki, Azizzadenesheli, Liu, Bhattacharya, et al. (2020b) argue that this does not matter, because the edge effects can be more-or-less ignored and things will still work OK because of the “local linear projection” part, which is … fine I guess? Maybe true. It still grates. Certainly the boundary-conditions situation will be messy AFAICT for any neural operator, so I am at ease with the idea that they would like to simply ignore it for now.

Would alternative bases fix that boundary problem, or just be more of an annoying PITA? Would there be anything to gain from learning a basis for the expansion? Certainly you would lose speed.

Also, this method uses a lot of the right buzzwords to sound like a kernel trick approach, and one can’t help but feel there might be a natural Gaussian process regression formulation with a nice Bayesian interpretation. But I can’t see an obvious way of making that fly; the operator passes through a lot of nonlinear functions, and thus the pushforward measure will get gnarly, not to mention the correlation kernel. I suppose it could possibly be approximated as ‘just’ another deep Gaussian process with some tactical assumptions, perhaps infinite-width asymptotics. Laplace approximation? Or a variational approximation? But what is a plausible variational family?

Update: See Kovachki, Lanthaler, and Mishra (2021) for some analysis of the performance of this method. TBD: read that mammoth paper.

DeepONet

From the people who brought you PINN, above, comes the paper of Lu, Jin, and Karniadakis (2020). The setup is related, but AFAICT differs in a few ways in that

  1. we don’t (necessarily?) use the derivative information at the sensor locations
  2. we learn an operator mapping initial/latent conditions to output functions
  3. we decompose the input function space into a basis and then sample randomly from the basis in order to span (in some sense) the input space at training time

The authors argue they have found a good topology for a network that does this:

A DeepONet consists of two sub-networks, one for encoding the input function at a fixed number of sensors \(x_i, i = 1, \dots, m\) (branch net), and another for encoding the locations for the output functions (trunk net).
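
A minimal PyTorch sketch of that branch/trunk structure (my own reading of the description above; the layer sizes are illustrative, not the paper’s):

import torch

class DeepONet(torch.nn.Module):
    # branch net encodes the input function sampled at m fixed sensors,
    # trunk net encodes the output coordinate y; the prediction is their inner product
    def __init__(self, m, d_coord=1, p=64):
        super().__init__()
        self.branch = torch.nn.Sequential(
            torch.nn.Linear(m, 128), torch.nn.Tanh(), torch.nn.Linear(128, p))
        self.trunk = torch.nn.Sequential(
            torch.nn.Linear(d_coord, 128), torch.nn.Tanh(), torch.nn.Linear(128, p))

    def forward(self, u_sensors, y):
        # u_sensors: (batch, m) samples of the input function at the sensors
        # y: (batch, d_coord) coordinates at which to evaluate the output function
        return (self.branch(u_sensors) * self.trunk(y)).sum(-1, keepdim=True)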

This addresses some problems with generalisation that make the PINN setup seem unsatisfactory; in particular we can change the inputs, or project arbitrary inputs forward.

The boundary conditions and input points appear to stay fixed though, and inference of the unknowns is still vexed.

🏗️

GAN approaches

One approach I am less familiar with advocates for conditional GAN models to simulate conditional latent distributions. I’m curious about these but they look more computationally expensive and specific than I need at the moment, so I’m filing for later (Bao et al. 2020; Yang, Zhang, and Karniadakis 2020; Zang et al. 2020).

A recent example from fluid-flow dynamics (Chu et al. 2021) has particularly beautiful animations attached.

Advection-diffusion PDEs in particular

F. Sigrist, Künsch, and Stahel (2015b) find a nice spectral representation of certain classes of stochastic PDE. These are extended in Liu, Yeo, and Lu (2020) to non-stationary operators. By being less generic, these methods come out with computationally convenient spectral representations.

Boundary conditions

Are hard to handle in a neural network context. Solutions seem diverse. Liu, Yeo, and Lu (2020) just assume that there are some initial conditions and that boundaries can be “handled” by simulating a toroidal domain and then cutting out the segment of interest.

Inverse problems

Tomography through PDEs.

See Inverse problems in PDEs.

As implicit representations

Many of these PDE methods effectively use the “implicit representation” trick, i.e. they produce networks that map from input coordinates to values of solutions at those coordinates. This means we share some interesting tools with those networks, such as position encodings. TBD.

Differentiable solvers

Suppose we are keen to devise yet another method that will do clever things to augment PDE solvers with ML somehow. To that end it would be nice to have a PDE solver that was not a complete black box, but which we could interrogate for useful gradients. Obviously all PDE solvers use gradient information, but only some of them expose it to us as users; e.g. MODFLOW will give me a solution field but not the gradients of that field that were used to calculate it. In ML toolkits, accessing this information is easy.

TODO: define adjoint method etc.
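
In the meantime, a one-equation sketch of the adjoint idea: for a discretised PDE constraint \(g(f,\theta)=0\) and a scalar objective \(J(f,\theta)\), \[\frac{\mathrm{d}J}{\mathrm{d}\theta} = \frac{\partial J}{\partial \theta} - \lambda^{\top}\frac{\partial g}{\partial \theta}, \qquad \left(\frac{\partial g}{\partial f}\right)^{\top}\lambda = \left(\frac{\partial J}{\partial f}\right)^{\top},\] so the gradient with respect to all parameters costs one extra linear (adjoint) solve rather than one solve per parameter.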

OTOH, there is a lot of sophisticated work done by PDE solvers that is hard for ML toolkits to recreate. That is why PDE solvers are a thing.

Tools which combine both worlds, PDE solutions and ML optimisations, do exist; there are adjoint method systems for mainstream PDE solvers just as there are PDE solvers for ML frameworks. Let us list some of the options under differentiable PDE solvers.

References

Alexanderian, Alen. 2021. Optimal Experimental Design for Infinite-Dimensional Bayesian Inverse Problems Governed by PDEs: A Review.” arXiv:2005.12998 [Math], January.
Alexanderian, Alen, Noemi Petra, Georg Stadler, and Omar Ghattas. 2016. A Fast and Scalable Method for A-Optimal Design of Experiments for Infinite-Dimensional Bayesian Nonlinear Inverse Problems.” SIAM Journal on Scientific Computing 38 (1): A243–72.
Arora, Sanjeev, Rong Ge, Tengyu Ma, and Ankur Moitra. 2015. Simple, Efficient, and Neural Algorithms for Sparse Coding.” In Proceedings of The 28th Conference on Learning Theory, 40:113–49. Paris, France: PMLR.
Atkinson, Steven, Waad Subber, and Liping Wang. 2019. “Data-Driven Discovery of Free-Form Governing Differential Equations.” In, 7.
Bao, Gang, Xiaojing Ye, Yaohua Zang, and Haomin Zhou. 2020. Numerical Solution of Inverse Problems by Weak Adversarial Networks.” Inverse Problems 36 (11): 115003.
Bar-Sinai, Yohai, Stephan Hoyer, Jason Hickey, and Michael P. Brenner. 2019. Learning Data-Driven Discretizations for Partial Differential Equations.” Proceedings of the National Academy of Sciences 116 (31): 15344–49.
Basir, Shamsulhaq, and Inanc Senocak. n.d. Critical Investigation of Failure Modes in Physics-Informed Neural Networks.” In AIAA SCITECH 2022 Forum. American Institute of Aeronautics and Astronautics.
Beck, Christian, Weinan E, and Arnulf Jentzen. 2019. Machine Learning Approximation Algorithms for High-Dimensional Fully Nonlinear Partial Differential Equations and Second-Order Backward Stochastic Differential Equations.” Journal of Nonlinear Science 29 (4): 1563–1619.
Bezgin, Deniz A., Aaron B. Buhendwa, and Nikolaus A. Adams. 2022. JAX-FLUIDS: A Fully-Differentiable High-Order Computational Fluid Dynamics Solver for Compressible Two-Phase Flows.” arXiv:2203.13760 [Physics], March.
Bhattacharya, Kaushik, Bamdad Hosseini, Nikola B. Kovachki, and Andrew M. Stuart. 2020. Model Reduction and Neural Networks for Parametric PDEs.” arXiv:2005.03180 [Cs, Math, Stat], May.
Blechschmidt, Jan, and Oliver G. Ernst. 2021. Three Ways to Solve Partial Differential Equations with Neural Networks — A Review.” GAMM-Mitteilungen 44 (2): e202100006.
Brandstetter, Johannes, Daniel Worrall, and Max Welling. 2022. Message Passing Neural PDE Solvers.” arXiv:2202.03376 [Cs, Math], March.
Brehmer, Johann, Kyle Cranmer, Siddharth Mishra-Sharma, Felix Kling, and Gilles Louppe. 2019. “Mining Gold: Improving Simulation-Based Inference with Latent Information.” In, 7.
Brunton, Steven L., and Jose Nathan Kutz. 2019. Data-Driven Science and Engineering: Machine Learning, Dynamical Systems, and Control. Cambridge: Cambridge University Press.
Chu, Mengyu, Nils Thuerey, Hans-Peter Seidel, Christian Theobalt, and Rhaleb Zayer. 2021. Learning Meaningful Controls for Fluids.” ACM Transactions on Graphics 40 (4): 1–13.
Cockayne, Jon, and Andrew B. Duncan. 2020. Probabilistic Gradients for Fast Calibration of Differential Equation Models,” September.
Cranmer, Miles D, Rui Xu, Peter Battaglia, and Shirley Ho. 2019. “Learning Symbolic Physics with Graph Networks.” In Machine Learning and the Physical Sciences Workshop at the 33rd Conference on Neural Information Processing Systems (NeurIPS), 6.
Dandekar, Raj, Karen Chung, Vaibhav Dixit, Mohamed Tarek, Aslan Garcia-Valadez, Krishna Vishal Vemula, and Chris Rackauckas. 2021. Bayesian Neural Ordinary Differential Equations.” arXiv:2012.07244 [Cs], March.
Duffin, Connor, Edward Cripps, Thomas Stemler, and Mark Girolami. 2021. Statistical Finite Elements for Misspecified Models.” Proceedings of the National Academy of Sciences 118 (2).
E, Weinan, Jiequn Han, and Arnulf Jentzen. 2020. Algorithms for Solving High Dimensional PDEs: From Nonlinear Monte Carlo to Machine Learning.” arXiv:2008.13333 [Cs, Math], September.
Eigel, Martin, Reinhold Schneider, Philipp Trunschke, and Sebastian Wolf. 2019. Variational Monte Carlo — Bridging Concepts of Machine Learning and High Dimensional Partial Differential Equations.” Advances in Computational Mathematics 45 (5-6): 2503–32.
Fan, Yuwei, Cindy Orozco Bohorquez, and Lexing Ying. 2019. BCR-Net: A Neural Network Based on the Nonstandard Wavelet Form.” Journal of Computational Physics 384 (May): 1–15.
Finzi, Marc, Roberto Bondesan, and Max Welling. 2020. Probabilistic Numeric Convolutional Neural Networks.” arXiv:2010.10876 [Cs], October.
Freeman, C Daniel, Erik Frey, Anton Raichuk, Sertan Girgin, Igor Mordatch, and Olivier Bachem. 2021. Brax–A Differentiable Physics Engine for Large Scale Rigid Body Simulation.” arXiv Preprint arXiv:2106.13281.
Gan, Chuang, Jeremy Schwartz, Seth Alter, Martin Schrimpf, James Traer, Julian De Freitas, Jonas Kubilius, et al. 2020. Threedworld: A Platform for Interactive Multi-Modal Physical Simulation.” arXiv Preprint arXiv:2007.04954.
Girolami, Mark, Eky Febrianto, Ge Yin, and Fehmi Cirak. 2021. The Statistical Finite Element Method (statFEM) for Coherent Synthesis of Observation Data and Model Predictions.” Computer Methods in Applied Mechanics and Engineering 375 (March): 113533.
Granas, Andrzej, and James Dugundji. 2003. Fixed Point Theory. Springer Monographs in Mathematics. New York, NY: Springer New York.
Guibas, John, Morteza Mardani, Zongyi Li, Andrew Tao, Anima Anandkumar, and Bryan Catanzaro. 2021. Adaptive Fourier Neural Operators: Efficient Token Mixers for Transformers,” November.
Gulian, Mamikon, Ari Frankel, and Laura Swiler. 2020. Gaussian Process Regression Constrained by Boundary Value Problems.” arXiv:2012.11857 [Cs, Math, Stat], December.
Han, Jiequn, Arnulf Jentzen, and Weinan E. 2018. Solving High-Dimensional Partial Differential Equations Using Deep Learning.” Proceedings of the National Academy of Sciences 115 (34): 8505–10.
Hennigh, Oliver, Susheela Narasimhan, Mohammad Amin Nabian, Akshay Subramaniam, Kaustubh Tangsali, Max Rietmann, Jose del Aguila Ferrandis, Wonmin Byeon, Zhiwei Fang, and Sanjay Choudhry. 2020. NVIDIA SimNet™️: An AI-Accelerated Multi-Physics Simulation Framework.” arXiv:2012.07938 [Physics], December.
Hoffimann, Júlio, Maciel Zortea, Breno de Carvalho, and Bianca Zadrozny. 2021. Geostatistical Learning: Challenges and Opportunities.” Frontiers in Applied Mathematics and Statistics 7.
Holl, Philipp, and Vladlen Koltun. 2020. Phiflow: A Differentiable PDE Solving Framework for Deep Learning via Physical Simulations,” 5.
Holl, Philipp, Nils Thuerey, and Vladlen Koltun. 2020. Learning to Control PDEs with Differentiable Physics.” In ICLR, 5.
Hu, Yuanming, Tzu-Mao Li, Luke Anderson, Jonathan Ragan-Kelley, and Frédo Durand. 2019. Taichi: A Language for High-Performance Computation on Spatially Sparse Data Structures.” ACM Transactions on Graphics 38 (6): 1–16.
Huang, Zizhou, Teseo Schneider, Minchen Li, Chenfanfu Jiang, Denis Zorin, and Daniele Panozzo. 2021. A Large-Scale Benchmark for the Incompressible Navier-Stokes Equations.” arXiv:2112.05309 [Cs], December.
Innes, Mike, Alan Edelman, Keno Fischer, Chris Rackauckas, Elliot Saba, Viral B. Shah, and Will Tebbutt. 2019. A Differentiable Programming System to Bridge Machine Learning and Scientific Computing.” arXiv.
Jiang, Chiyu Max, Soheil Esmaeilzadeh, Kamyar Azizzadenesheli, Karthik Kashinath, Mustafa Mustafa, Hamdi A. Tchelepi, Philip Marcus, Prabhat, and Anima Anandkumar. 2020. MeshfreeFlowNet: A Physics-Constrained Deep Continuous Space-Time Super-Resolution Framework,” May.
Jo, Hyeontae, Hwijae Son, Hyung Ju Hwang, and Eun Heui Kim. 2020. Deep Neural Network Approach to Forward-Inverse Problems.” Networks & Heterogeneous Media 15 (2): 247.
Kadri, Hachem, Emmanuel Duflos, Philippe Preux, Stéphane Canu, Alain Rakotomamonjy, and Julien Audiffren. 2016. Operator-Valued Kernels for Learning from Functional Response Data.” The Journal of Machine Learning Research 17 (1): 613–66.
Kasim, M. F., D. Watson-Parris, L. Deaconu, S. Oliver, P. Hatfield, D. H. Froula, G. Gregori, et al. 2020. Up to Two Billion Times Acceleration of Scientific Simulations with Deep Neural Architecture Search.” arXiv:2001.08055 [Physics, Stat], January.
Kasim, Muhammad, J Topp-Mugglestone, P Hatfield, D H Froula, G Gregori, M Jarvis, E Viezzer, and Sam Vinko. 2019. “A Million Times Speed up in Parameters Retrieval with Deep Learning.” In, 5.
Kharazmi, E., Z. Zhang, and G. E. Karniadakis. 2019. Variational Physics-Informed Neural Networks For Solving Partial Differential Equations.” arXiv:1912.00873 [Physics, Stat], November.
Khodayi-Mehr, Reza, and Michael M. Zavlanos. 2019. VarNet: Variational Neural Networks for the Solution of Partial Differential Equations.” arXiv:1912.07443 [Physics, Stat], December.
Kochkov, Dmitrii, Alvaro Sanchez-Gonzalez, Jamie Smith, Tobias Pfaff, Peter Battaglia, and Michael P Brenner. 2020. “Learning Latent FIeld Dynamics of PDEs.” In, 7.
Kochkov, Dmitrii, Jamie A. Smith, Ayya Alieva, Qing Wang, Michael P. Brenner, and Stephan Hoyer. 2021. Machine Learning–Accelerated Computational Fluid Dynamics.” Proceedings of the National Academy of Sciences 118 (21).
Kononenko, O., and I. Kononenko. 2018. Machine Learning and Finite Element Method for Physical Systems Modeling.” arXiv:1801.07337 [Physics], March.

  1. \(P\), the size of the basis, depends on the highest allowed polynomial order \(r\) in \(\psi_{\alpha}(\boldsymbol{\xi})\) via \[ P+1=\frac{(r+M)!}{r!\,M!}. \] A small code sketch follows these notes.

  2. For five internet points, can you explain to me why it must be elliptic?

  3. NB: my calculations of the dimensionality of the spaces occasionally came out differently from the versions the authors give in the paper.
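
To make the counting in footnote 1 concrete, here is a minimal sketch in Python. It assumes the usual total-degree truncation and the footnote's convention that basis indices run \(0,\dots,P\); the helper name `pce_basis_size` is hypothetical, not taken from any particular library.

```python
from math import comb


def pce_basis_size(r: int, M: int) -> int:
    """Count polynomial-chaos basis functions of total degree <= r in M variables.

    Returns P + 1 = (r + M)! / (r! * M!), so the basis indices run 0, ..., P.
    """
    return comb(r + M, M)


# A degree-3 expansion in M = 4 random variables already needs 35 terms (P = 34),
# which is one reason truncated PCE bases grow quickly with dimension.
assert pce_basis_size(3, 4) == 35
```

Here `comb(r + M, M)` is simply the binomial-coefficient form of the factorial ratio in the footnote.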


1 comment

Jack: As a geophysicist working with 3D models, lots of data, and coupled PDEs, I find a fast solver is nice but often still intractably slow. Even with modern solvers. Even with a GPU. Replacing the solver with an NN approximant is potentially much faster, even if the speed-up is merely amortized. That has so many benefits for real-world modeling work.