Machine learning for partial differential equations

Using statistical/ML approaches to solve PDEs, and indeed to perform inference through them. There are various approaches here, which I will document on an ad hoc basis as I need them. No claim is made to completeness.

The Fourier neural operator approach

Zongyi Li blogs a neat trick here: We use Fourier transforms to capture resolution-invariant and non-local behaviour in PDE forward-solvers. There is a bouquet of papers designed to leverage this (Li, Kovachki, Azizzadenesheli, Liu, Bhattacharya, et al. 2020b, 2020a; Li, Kovachki, Azizzadenesheli, Liu, Stuart, et al. 2020).

Learning a PDE operator is naturally expressed by a layer which acts on functions, and which is indifferent to how those functions are discretized, and in particular to the scale at which they are discretized. Because this is neural-net land we can be a certain type of sloppy, and we don’t worry overmuch about the bandwidth of the functions we approximate; we keep some conservative number of low harmonics from the Fourier transform, use some fairly arbitrary non-Fourier maps too, and conduct the whole thing in a “high”-dimensional space with a vague relationship to the original problem, without worrying too much about what it means.

Code is at zongyi-li/fourier_neural_operator.

The basic concept is heuristic and the main trick is doing lots of accounting of dimensions, and being OK with an arbitrary-dimensional Fourier transform \(\mathcal{F}:(D \rightarrow \mathbb{R}^{d}) \to (D \rightarrow \mathbb{C}^{d})\), which, applied to a function \(f: D \rightarrow \mathbb{R}^{d}\), looks like this:

\[\begin{aligned} (\mathcal{F} f)_{j}(k) &=\int_{D} f_{j}(x) e^{-2 i \pi\langle x, k\rangle} \mathrm{d} x \\ \left(\mathcal{F}^{-1} f\right)_{j}(x) &=\int_{D} f_{j}(k) e^{2 i \pi\langle x, k\rangle} \mathrm{d} k \end{aligned}\] for each dimension \(j=1, \ldots, d\).
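For concreteness, here is a sketch of what the discretized transform does to a sampled vector-valued function, using NumPy’s FFT along the grid axis for each component \(j\); the grid, dimensions, and normalisation are my own choices, not the paper’s:

```python
import numpy as np

# Hypothetical discretization: sample f: [0,1) -> R^d at n uniform grid points.
n, d = 64, 3
x = np.linspace(0.0, 1.0, n, endpoint=False)
# component j is sin(2π(j+1)x), so we know its Fourier coefficients exactly
f = np.stack([np.sin(2 * np.pi * (j + 1) * x) for j in range(d)], axis=-1)  # (n, d)

# FFT along the grid axis, one transform per component j; dividing by n makes
# the coefficients match the Fourier-series amplitudes of the continuous f.
F = np.fft.fft(f, axis=0) / n  # (n, d), complex

# sin(2πkx) has coefficients ∓i/2 at wavenumbers ±k, e.g. F[1, 0] ≈ -0.5j.
```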

We notionally use the property that convolutions become multiplications under the Fourier transform, which motivates using convolutions to construct our operators. Good.
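The convolution theorem is easy to check numerically on a discretized periodic signal; a sketch with arbitrary test data:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 32
u = rng.standard_normal(n)  # a sampled periodic signal
k = rng.standard_normal(n)  # a sampled convolution kernel

# Direct circular convolution: (k * u)[i] = Σ_j k[j] u[(i - j) mod n]
direct = np.array([sum(k[j] * u[(i - j) % n] for j in range(n))
                   for i in range(n)])

# Via the convolution theorem: pointwise multiplication in Fourier space.
spectral = np.fft.ifft(np.fft.fft(k) * np.fft.fft(u)).real

# The two agree to floating-point precision.
```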

So the problem set up is that we are going to assume that there is a map, \(G^{\dagger}\), which arises from the solution of a PDE. Let \(D \subset \mathbb{R}^{d}\) be a bounded, open set which is the index set of our PDE solutions. We consider the input space \(\mathcal{A}=\mathcal{A}\left(D ; \mathbb{R}^{d_{a}}\right)\) of functions \(D\to\mathbb{R}^{d_{a}}\), and the solution space \(\mathcal{U}=\mathcal{U}\left(D ; \mathbb{R}^{d_{u}}\right)\) of functions \(D\to\mathbb{R}^{d_{u}}\). They are both Banach spaces. \(G^{\dagger}: \mathcal{A} \rightarrow \mathcal{U}\) is a map taking input functions to solution functions, which we learn from function pairs \(\left\{a_{j}, u_{j}\right\}_{j=1}^{N}\), where \(a_{j}\in \mathcal{A}\) and \(u_{j}=G^{\dagger}(a_{j})+\epsilon_{j}\), with \(\epsilon_{j}\) a corrupting noise. Further, we observe these functions only at certain points \(\{x_{1},\dots,x_{n}\}\subset D.\)

We approximate \(G^{\dagger}\) by choosing good parameters \(\theta^{\dagger} \in \Theta\) from the parameter space in some parametric family of maps \(G\left(\cdot, \theta\right)\) so that \(G\left(\cdot, \theta^{\dagger}\right)=G_{\theta^{\dagger}} \approx G^{\dagger}\). Specifically we minimise some cost functional \(C: \mathcal{U} \times \mathcal{U} \rightarrow \mathbb{R}\) to find \[ \min _{\theta \in \Theta} \mathbb{E}_{a \sim \mu}\left[C\left(G(a, \theta), G^{\dagger}(a)\right)\right]. \]
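In practice the expectation over \(a \sim \mu\) is replaced by an empirical average over the \(N\) training pairs. A minimal sketch, assuming a relative \(L^2\) cost (my choice for illustration, not necessarily the authors’):

```python
import numpy as np

def relative_l2(pred, target):
    # C(u, u'): relative L2 discrepancy between a predicted and a true solution,
    # both given as arrays of pointwise evaluations on the observation grid.
    return np.linalg.norm(pred - target) / np.linalg.norm(target)

def empirical_risk(G_theta, pairs):
    # Monte Carlo estimate of E_{a~μ}[C(G(a, θ), G†(a))] from training pairs
    # (a_j, u_j); G_theta maps a sampled input function to a sampled solution.
    return np.mean([relative_l2(G_theta(a), u) for a, u in pairs])
```

Minimising this over \(\theta\) is then ordinary stochastic-gradient training.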

For the neural Fourier operator in particular we assume that \(G_{\theta}\) has a particular iterative form, i.e. \(G= Q \circ V_{T} \circ V_{T-1} \circ \ldots \circ V_{1} \circ P\). We introduce a new space \(\mathcal{V}=\mathcal{V}\left(D ; \mathbb{R}^{d_{v}}\right)\). \(P\) is a map \(\mathcal{A}\to\mathcal{V}\), \(Q\) is a map \(\mathcal{V}\to\mathcal{U}\), and each \(V_{t}\) is a map \(\mathcal{V}\to\mathcal{V}\). Each of \(P\) and \(Q\) is ‘local’ in that they depend only upon pointwise evaluations of the function, e.g. for \(a\in\mathcal{A}\), \((Pa)(x)=p(a(x))\) for some \(p:\mathbb{R}^{d_{a}}\to\mathbb{R}^{d_{v}}\). Each intermediate representation \(v_{t}\) is a function \(D\to\mathbb{R}^{d_{v}}\). As a rule we are assuming \(d_{v}>d_{a}>d_{u}.\) \(V_{t}\) is not local. In fact, we define \[ (V_{t}v)(x):=\sigma\left(W v(x)+\left(\mathcal{K}(a ; \phi) v\right)(x)\right), \quad \forall x \in D \] where \(\mathcal{K}(\cdot;\phi): \mathcal{A} \rightarrow \mathcal{L}\left(\mathcal{V}, \mathcal{V}\right)\) is parameterized by \(\phi \in \Theta_{\mathcal{K}}\), \(W: \mathbb{R}^{d_{v}} \rightarrow \mathbb{R}^{d_{v}}\) is a linear transformation, and \(\sigma: \mathbb{R} \rightarrow \mathbb{R}\) is a local, component-wise, non-linear activation function. \(\mathcal{K}(a ; \phi)\) is a kernel integral transformation, by which is meant \[ \left(\mathcal{K}(a ; \phi) v\right)(x):=\int_{D} \kappa_{\phi}(x, a(x), y, a(y)) v(y) \mathrm{d} y, \quad \forall x \in D \] where \(\kappa_{\phi}: \mathbb{R}^{2\left(d+d_{a}\right)} \rightarrow \mathbb{R}^{d_{v} \times d_{v}}\) is some mapping parameterized by \(\phi \in \Theta_{\mathcal{K}}\).
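A minimal sketch of one such layer on a uniform 1D grid, with the kernel integral evaluated by naive quadrature; the kernel here is a stand-in of my own (in the actual method it gets replaced by a learned convolution, as below):

```python
import numpy as np

# One layer v ↦ σ(W v + K v) on n uniform grid points over [0,1).
n, d_v = 16, 4
rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, n, endpoint=False)
v = rng.standard_normal((n, d_v))      # v: D -> R^{d_v}, sampled on the grid
W = rng.standard_normal((d_v, d_v))    # local linear map

def kappa(xi, yi):
    # stand-in for the learned kernel κ_φ(x, a(x), y, a(y)) ∈ R^{d_v×d_v};
    # a hypothetical smooth choice, ignoring the a-dependence as the paper does
    return np.exp(-abs(xi - yi)) * np.eye(d_v)

# (K v)(x_i) ≈ (1/n) Σ_j κ(x_i, x_j) v(x_j): quadrature over the grid
Kv = np.stack([sum(kappa(xi, xj) @ v[j] for j, xj in enumerate(x)) / n
               for xi in x])

out = np.maximum(0.0, v @ W.T + Kv)    # σ = ReLU, applied componentwise
```

This direct quadrature costs \(O(n^2)\) per layer, which is the expense the Fourier trick below avoids.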

Anyway, it looks like this:

Li’s Fourier neural operator layer.

NB: I am not sure why there is a dependence on \(a\) in the kernel definition, because the authors immediately throw it out and replace it with \[\kappa_{\phi}(x, a(x), y, a(y)) := \kappa_{R}(x-y)\] so that the integral operator becomes a convolution. This convolution can be calculated cheaply in Fourier space, which suggests we may as well define and calculate it in Fourier space directly. Thus they define the Fourier integral operator \[ \left(\mathcal{K}(\phi) v\right)(x)=\mathcal{F}^{-1}\left(R \cdot\left(\mathcal{F} v\right)\right)(x) \quad \forall x \in D \] where \(R\) is the Fourier transform of a periodic function \(\kappa: D \rightarrow \mathbb{R}^{d_{v} \times d_{v}}\) parameterized by \(\phi \in \Theta_{\mathcal{K}}\). Checking our units here, we have that \(\left(\mathcal{F} v\right)(k)\in \mathbb{C}^{d_{v}}\) and \(R(k) \in \mathbb{C}^{d_{v} \times d_{v}}\) for each wavenumber \(k\). In practice, since we can work with a Fourier series rather than a continuous transform, we choose \(k\in\{0,1,2,\dots,k_{\text{max}}\}\), and then \(R\) can be represented by a tensor in \(\mathbb{C}^{k_{\text{max}}\times d_{v}\times d_{v}}.\) NB: my calculations occasionally came out differing from the versions the authors gave in the paper with regard to the dimensionality of the spaces. Not quite sure who is right here; caveat emptor. We can use a different \(W\) and \(R\) for each iteration if we want, say \(\{W_t,R_t\}_{1\leq t \leq T}\). The parameters of each of these, plus those of the maps \(P,Q\), comprise the parameters of the whole process.
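In code, the truncated spectral multiplication is a few lines. Here is a 1D sketch with randomly initialised weights standing in for the learned \(R\); the shapes follow the tensor parameterisation above, but the variable names are mine:

```python
import numpy as np

n, d_v, k_max = 64, 8, 12
rng = np.random.default_rng(2)
v = rng.standard_normal((n, d_v))                    # (grid points, channels)
# learned weights R ∈ C^{k_max × d_v × d_v}; random stand-ins here
R = (rng.standard_normal((k_max, d_v, d_v))
     + 1j * rng.standard_normal((k_max, d_v, d_v)))

v_hat = np.fft.rfft(v, axis=0)                       # (n//2+1, d_v), complex
out_hat = np.zeros_like(v_hat)
# keep only the lowest k_max modes; multiply each by its d_v×d_v matrix R[k]
out_hat[:k_max] = np.einsum("kij,kj->ki", R, v_hat[:k_max])
out = np.fft.irfft(out_hat, n=n, axis=0)             # back to (n, d_v), real
```

The real FFT handles the Hermitian symmetry of real-valued \(v\), so only non-negative wavenumbers need weights.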

Anyway, every step in this construction is differentiable in those parameters, and some of them can even be optimised with some FFTs and so on, so we are done with the setup, and have an operator that can be learned from data. Is it any good? Empirically the authors report that it is fast and precise, but I have yet to try it myself.

Quibble: They use the term ‘resolution-invariant’ loosely. They do not actually prove that things are resolution-invariant per se; it is not even clear what that would specifically mean in this context, since there is no attempt to establish Nyquist conditions or other sampling-theoretic properties. What is clear is that there is an obvious interpretation of their solution as a continuous operator, in the sense that it can be evaluated at arbitrary points for the same computational cost as evaluating it at the training points. Thus there is a sense in which it does not depend upon the resolution of the training set, in that we don’t need to rescale the coefficients of the solution to evaluate the functions our operator produces at unobserved coordinates. You can, in a certain sense, treat many network structures as discrete approximations to PDE operators at a certain resolution (at least, deep ResNets with ReLU activations have such an interpretation, possibly others) and then use resampling methods to evaluate them at a different resolution, which is a more laborious process that potentially gets a similar result; see the notebook on deep learning as a dynamical system for an example of doing that.
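That reading is easy to sanity-check in a toy setting: fix one set of truncated Fourier weights, apply them to the same band-limited function sampled at two resolutions, and compare the outputs at shared grid points. All choices here are mine:

```python
import numpy as np

k_max, d_v = 4, 2
rng = np.random.default_rng(3)
# one fixed set of spectral weights, shared across resolutions
R = (rng.standard_normal((k_max, d_v, d_v))
     + 1j * rng.standard_normal((k_max, d_v, d_v)))

def apply_operator(n):
    x = np.linspace(0.0, 1.0, n, endpoint=False)
    # a band-limited test function with two channels
    v = np.stack([np.sin(2 * np.pi * x), np.cos(4 * np.pi * x)], axis=-1)
    v_hat = np.fft.rfft(v, axis=0) / n   # resolution-independent coefficients
    out_hat = np.zeros((n // 2 + 1, d_v), dtype=complex)
    out_hat[:k_max] = np.einsum("kij,kj->ki", R, v_hat[:k_max])
    return np.fft.irfft(out_hat * n, n=n, axis=0)

coarse, fine = apply_operator(32), apply_operator(64)
# every second point of the fine grid coincides with the coarse grid,
# and the operator's output agrees there despite the different resolutions
```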

Next quibble: Why Fourier transforms, instead of a different basis? The Fourier transforms as used in this paper are truncated Fourier series, which amounts to an assumption that the basis functions are periodic, which in turn amounts to assuming that the domain of the functions is toroidal, or, I suppose, a box with uniform boundary conditions. This seems like an implausible restriction. Li, Kovachki, Azizzadenesheli, Liu, Bhattacharya, et al. (2020b) argue that this does not matter because the edge effects can be more-or-less ignored and things will still work OK because of the “local linear projection” part, which is … fine, I guess? Somewhat vague, though. Would alternative bases fix that, or just be more of an annoying PITA? It is suggestive that some other basis expansion might be nicer. Would there be anything to gain from learning a basis for the expansion?

Also, this uses a lot of the right buzzwords to sound like a kernel trick approach, and one can’t help but feel there might be a logical Gaussian process regression formulation with nice Bayesian guarantees. But I can’t see a simple one; the operator passes through a lot of nonlinear functions, and thus the pushforward measure will get gnarly. I suppose it could be ‘just’ another deep Gaussian process with a tactically chosen kernel.

Atkinson, Steven, Waad Subber, and Liping Wang. 2019. “Data-Driven Discovery of Free-Form Governing Differential Equations.” In, 7.
Beck, Christian, Weinan E, and Arnulf Jentzen. 2019. “Machine Learning Approximation Algorithms for High-Dimensional Fully Nonlinear Partial Differential Equations and Second-Order Backward Stochastic Differential Equations.” Journal of Nonlinear Science 29 (4): 1563–1619.
Bhattacharya, Kaushik, Bamdad Hosseini, Nikola B. Kovachki, and Andrew M. Stuart. 2020. “Model Reduction and Neural Networks for Parametric PDEs.” May 6, 2020.
Fan, Yuwei, Cindy Orozco Bohorquez, and Lexing Ying. 2019. “BCR-Net: A Neural Network Based on the Nonstandard Wavelet Form.” Journal of Computational Physics 384 (May): 1–15.
Holl, Philipp, Nils Thuerey, and Vladlen Koltun. 2019. “Learning to Control PDEs with Differentiable Physics.” In, 5.
Hu, Yuanming, Tzu-Mao Li, Luke Anderson, Jonathan Ragan-Kelley, and Frédo Durand. 2019. “Taichi: A Language for High-Performance Computation on Spatially Sparse Data Structures.” ACM Transactions on Graphics 38 (6): 1–16.
Jiang, Chiyu Max, Soheil Esmaeilzadeh, Kamyar Azizzadenesheli, Karthik Kashinath, Mustafa Mustafa, Hamdi A. Tchelepi, Philip Marcus, Prabhat, and Anima Anandkumar. 2020. “MeshfreeFlowNet: A Physics-Constrained Deep Continuous Space-Time Super-Resolution Framework,” May.
Kononenko, O., and I. Kononenko. 2018. “Machine Learning and Finite Element Method for Physical Systems Modeling.” March 16, 2018.
Li, Zongyi, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. 2020a. “Neural Operator: Graph Kernel Network for Partial Differential Equations.” March 6, 2020.
———. 2020b. “Fourier Neural Operator for Parametric Partial Differential Equations.” October 17, 2020.
Li, Zongyi, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Andrew Stuart, Kaushik Bhattacharya, and Anima Anandkumar. 2020. “Multipole Graph Neural Operator for Parametric Partial Differential Equations.” In Advances in Neural Information Processing Systems. Vol. 33.
Lu, Lu, Zhiping Mao, and Xuhui Meng. 2019. “DeepXDE: A Deep Learning Library for Solving Differential Equations.” In, 6.
Nabian, Mohammad Amin, and Hadi Meidani. 2019. “A Deep Learning Solution Approach for High-Dimensional Random Differential Equations.” Probabilistic Engineering Mechanics 57 (July): 14–25.
Raissi, M., P. Perdikaris, and G. E. Karniadakis. 2019. “Physics-Informed Neural Networks: A Deep Learning Framework for Solving Forward and Inverse Problems Involving Nonlinear Partial Differential Equations.” Journal of Computational Physics 378 (February): 686–707.
Silvester, Steven, Anthony Tanbakuchi, Paul Müller, Juan Nunez-Iglesias, Mark Harfouche, Almar Klein, Matt McCormick, et al. 2020. Imageio/Imageio V0.9.0 (version v0.9.0). Zenodo.
Tait, Daniel J., and Theodoros Damoulas. 2020. “Variational Autoencoding of PDE Inverse Problems.” June 28, 2020.
Tartakovsky, Alexandre M., Carlos Ortiz Marrero, Paris Perdikaris, Guzel D. Tartakovsky, and David Barajas-Solano. 2018. “Learning Parameters and Constitutive Relationships with Physics Informed Deep Neural Networks,” August.
Yang, Liu, Dongkun Zhang, and George Em Karniadakis. 2020. “Physics-Informed Generative Adversarial Networks for Stochastic Differential Equations.” SIAM Journal on Scientific Computing 42 (1): A292–317.
Zang, Yaohua, Gang Bao, Xiaojing Ye, and Haomin Zhou. 2020. “Weak Adversarial Networks for High-Dimensional Partial Differential Equations.” Journal of Computational Physics 411 (June): 109409.
Zhang, Dongkun, Ling Guo, and George Em Karniadakis. 2020. “Learning in Modal Space: Solving Time-Dependent Stochastic PDEs Using Physics-Informed Neural Networks.” SIAM Journal on Scientific Computing 42 (2): A639–65.
Zhang, Dongkun, Lu Lu, Ling Guo, and George Em Karniadakis. 2019. “Quantifying Total Uncertainty in Physics-Informed Neural Networks for Solving Forward and Inverse Stochastic Problems.” Journal of Computational Physics 397 (November): 108850.