Neural operators are a particular way of using statistical or machine learning approaches to solve PDEs, and maybe even to perform inference through them, by learning to predict one step ahead dynamics of the model.

## Fourier neural operator

Zongyi Li blogs a neat trick: We use Fourier transforms to capture resolution-invariant and non-local behaviour in PDE forward-propagators. Essentially, we learn to approximate \(\mathcal{P}_1: \mathscr{A}\to\mathscr{A}.\) There is a bouquet of papers designed to leverage this (Li, Kovachki, Azizzadenesheli, Liu, Bhattacharya, et al. 2020b, 2020a; Li, Kovachki, Azizzadenesheli, Liu, Stuart, et al. 2020). See also Anima Anandkumar’s presentation, which phrases this in terms of Green’s functions and their relationship with Fourier transforms. See also nostalgebraist who has added some interesting commentary and also fixed up a typo in this page for me (Thanks!). Yannic Kilcher’s Fourier Neural Operator for Parametric Partial Differential Equations Explained is popular but I have not watched it. Code is at zongyi-li/fourier_neural_operator.

Ok, so what is going on ? Learning a PDE operator is naturally expressed by a layer which acts on functions, and which is indifferent to how those functions are discretized, and in particular the scale at which they are discretized. Because this is neural net land we can be a certain type of sloppy and we don’t worry overmuch about the bandwidth of the signals we approximate; We keep some conservative number of low harmonics from the Fourier transform and use a fairly arbitrary non-Fourier maps too, and conduct the whole thing in a “high” dimensional space with vague relationship to the original problem without worrying too much about what it means.

The basic concept is heuristic and the main trick is doing lots of accounting of dimensions, and being ok with an arbitrary-dimensional Fourier transform \(\mathcal{F}:(D \rightarrow \mathbb{R}^{d}) \to (D \rightarrow \mathbb{C}^{d})\), which looks like this:

\[\begin{aligned} (\mathcal{F} f)_{j}(k) &=\int_{D} f_{j}(x) e^{-2 i \pi\langle x, k\rangle} \mathrm{d} x \\ \left(\mathcal{F}^{-1} f\right)_{j}(x) &=\int_{D} f_{j}(k) e^{2 i \pi\langle x, k\rangle} \mathrm{d} k \end{aligned}\] for each dimension \(j=1, \ldots, d\).

We notionally use the property that convolutions become multiplications under Fourier transform, which motivates the use of use convolutions to construct our operators. Good.

**Terminology warning**: \(\mathscr{A},\mathscr{U}\) are not the same as in the first section. I need to disambiguate what I mean by input space and solution space here.

Problem set up: We assume that there is a map, \(\mathcal{G}^{\dagger}\) solves PDEs with given focring condions. We consider PDEs defined in the following informal fashion: Solutions are functions from some Banach space \(\mathscr{A}\) comprising functions \(f:C\to O\) functions, where \(C\subseteq\mathbb{R}^{d_C}\) is a compact sets of positive Lebesgue measure and \(O=\mathbb{R}^{d_O}.\) We are also concerned with a Banach space \(\mathscr{U}\) of forcing functions \(f:C\to O'\). The PDE is specified by a differential operator \(\mathcal{D}:\mathcal{U}\to\mathcal{A}\) and such that \[\mathcal{D}[f]=u.\]

The first coordinate of the input space \(C\) often has the special interpretation as time \(t\in \mathbb{R}\) and the subsequent coordinates are then spatial coordinate \(x\in D\subseteq \mathbb{R}^{d_{D}}\) where \(d_{D}=d_{C}-1.\) Sometimes we make this explicit by writing the time coordinate separately, as \(f(t,x).\) A common case, concretely, is \(C=\mathbb{R} \times \mathbb{R}^2=\mathbb{R} \times D\) and \(O=\mathbb{R}.\) For each time \(t\in \mathbb{R}\) we assume the instantaneous solution \(f(t, \cdot)\) to be an element of some Banach space \(f\in \mathscr{A}\) of functions \(f(t, \cdot): D\to O.\) The overall solutions \(f: C\to O\) have their own Banach space \(\mathscr{F}\). More particularly, we might consider solutions a restricted time domain \(t\in [0,T]\) and some spatial domain \(D\subseteq \mathbb{R}^2\) where a solution is a function \(f\) that maps \([0,T] \times D \to \mathbb{R}.\) This would naturally model, say, a 2D height-field evolving over time.

We have introduced one operator, the defining operator \(\mathcal{D}\) . Another operator of interest a \[\solop:\begin{array}{l}\mathscr{U}\to\mathscr{F}\\ u\mapsto f\end{array}\] such that \[\mathcal{D}\left[\solop[u]\right]=u.\] In the case that the values of the solution for all time are not known, but rather must be predicted forward it is natural to decompose the forward operator into one that generates a solution (how is theis phrased in the fNO paper?) We cal this the or \(\mathcal{P}_s,\) which produces the entire solution surface at some future time \(t+s\), given current and boundary conditions. \[\mathcal{P}_s[f(t, \cdot)]=f( t+s, \cdot).\] The Fourier Neural Operator is a method for approximating the operators \(\solop\) and \(\mathcal{P}_{1}\).

\(\mathcal{G}^{\dagger}: \mathscr{A} \rightarrow \mathscr{U}\) is a map taking input functions to solutions functions. We learn to approxiamte this from function pairs \(\left\{a_{j}, u_{j}\right\}_{j=1}^{N}\) where \(a_{j}\in \mathscr{A}\) and \(u_{j}=\mathcal{G}^{\dagger}(a_{j})+\epsilon_{j}\) where \(\epsilon{j}\) is a corrupting noise. Further, we observe these functions only at certain points \(\{x_{1},\dots,x_{n}\}\subset D.\) We approximate \(\mathcal{G}^{\dagger}\) by choosing good parameters \(\theta^{\dagger} \in \Theta\) from the parameter space in some parametric family of maps \(\mathcal{G}\left(\cdot, \theta\right)\) so that \(\mathcal{G}\left(\cdot, \theta^{\dagger}\right)=\mathcal{\mathcal{G}}_{\theta^{\dagger}} \approx \mathcal{G}^{\dagger}\). Specifically we minimise some cost functional \(C: \mathscr{U} \times \mathscr{U} \rightarrow \mathbb{R}\) to find \[ \min _{\theta \in \Theta} \mathbb{E}_{a \sim \mu}\left[C\left(\mathcal{G}(a, \theta), \mathcal{G}^{\dagger}(a)\right)\right]. \]

For the neural Fourier operator in particular we assume that \(\mathcal{\mathcal{G}}_{\theta}\) has a particular iterative form, i.e. \(\mathcal{G}= Q \circ V_{T} \circ V_{T-1} \circ \ldots \circ V_{1} \circ P\). We introduce a new space \(\mathscr{V}=\mathscr{V}\left(D ; \mathbb{R}^{d_{v}}\right)\). \(P\) is a map \(\mathscr{A}\to\mathscr{V}\) and \(Q\) is a map \(\mathscr{V}\to\mathscr{U}\), and each \(V_{t}\) is a map \(\mathscr{V}\to\mathscr{V}\). Each of \(P\) and \(Q\) is ‘local’ in that they depend only upon pointwise evaluations of the function, e.g. for \(a\in\mathscr{A}\), \((Pa)(x)=p(a(x)\) for some \(p:\mathbb{R}^{d_{a}}\to\mathbb{R}^{d_{v}}\). Each \(v_{j}\) is function \(\mathbb{R}^{d_{v}}\). As a rule we are assuming \(d_{v}>d_{a}>d_{u}.\) \(V_{t}\) is not local. In fact, we define \[ (V_{t}v)(x):=\sigma\left(W v(x)+\left(\mathcal{K}(a ; \phi) v\right)(x)\right), \quad \forall x \in D \] where \(\mathcal{K}(\cdot;\phi): \mathscr{A} \rightarrow \mathcal{L}\left(\mathscr{V}, \mathscr{V}\right)\). This map is parameterized by \(\phi \in \Theta_{\mathcal{K}}\). \(W: \mathbb{R}^{d_{v}} \rightarrow \mathbb{R}^{d_{v}}\) is a linear transformation, and \(\sigma: \mathbb{R} \rightarrow \mathbb{R}\) is a local, component-wise, non-linear activation function. \(\mathcal{K}(a ; \phi)\) is a kernel integral transformation, by which is meant \[ \left(\mathcal{K}(a ; \phi) v\right)(x):=\int_{D} \kappa_{\phi}(x, a(x), y, a(y)) v(y) \mathrm{d} y, \quad \forall x \in D \] where \(\kappa_{\phi}: \mathbb{R}^{2\left(d+d_{a}\right)} \rightarrow \mathbb{R}^{d_{v} \times d_{v}}\) is some mapping parameterized by \(\phi \in \Theta_{\mathcal{K}}\).

Anyway, it looks like this:

We immediately throw out the dependence on \(a\) in the kernel definition replacing it with
\[\kappa_{\phi}(x, a(x), y, a(y)) := \kappa_{R}(x-y)\] so that the integral operator becomes a convolution.
This convolution can be calculated cheaply in Fourier space, which suggests we may as well define and calculate it also in Fourier space.
Accordingly, the real work happens when they define the Fourier integral operator
\[
\left(\mathcal{K}(\phi) v\right)(x)=\mathcal{F}^{-1}\left(R \cdot\left(\mathcal{F} v\right)\right)(x) \quad \forall x \in D
\]
where \(R\) is the Fourier transform of a periodic function \(\kappa: D \rightarrow \mathbb{R}^{d_{v} \times d_{v}}\) parameterized by \(\phi \in \Theta_{\mathcal{K}}\).
Checking our units, we have that
\(\left(\mathcal{F} v\right):D \to \mathbb{C}^{d_{v}}\)
and \(R (k): D \to \mathbb{C}^{d_{v} \times d_{v}}\).
In practice, since we can work with a Fourier series rather than a continuous transform, we will choose \(k\in\{0,1,2,\dots,k_{\text{max}}\}\) and then \(R\) can be represented by a tensor \(\mathbb{C}^{k_{\text{max}}\times d_{v}\times d_{v}}.\)
^{1}
Not quite sure who is right. *Caveat emptor*.
We can use a different \(W\) and \(R\) for each iteration if we want, say \(\{W_t,R_t\}_{1\leq t \leq T}\).
So, the parameters of each of these, plus those of the maps \(P,Q\) comprise the parameters of the whole process.

Anyway, every step in this construction is differentiable in those parameters, and some of the steps can even be found rapidly using FFTs and so on, so we are done with the setup, and have an operator that can be learned from data. Is it any good? Empirically the authors report that it is fast and precise, but I have yet to try it myself.

Quibble: They use the term ‘resolution-invariant’ loosely at a few points in the paper, and it takes some work to understand the actual defensible claim.
They do not actually prove that things are resolution *invariant* per se.
Rather, it is not even clear what that would specifically mean in this context — no attempt to prove Nyquist conditions or other sampling-theoretic properties over the spatial domain.
And, AFAICT in the time domain their method is resolution-*dependent* in every sense.
Step size is fixed.

What *is* clear and what I think they mean is that there is an obvious interpretation of the solution as a continuous operator, in the sense that it can be evaluated at arbitrary (spatial) points for the same computational cost as evaluating it at the training points.
Thus there is a sense in which it does not depend upon the resolution of the training set, in that we don’t any resampling of the solution to evaluate the functions our operator produces at unobserved coordinates.

What does this get us?
You can, in a certain sense, treat *many* network structures as discrete approximations to PDE operators with a certain resolution (at least, deep Resnets with ReLU activations have such an interpretation, presumably others) and then use resampling methods to evaluate them at a different resolution, which is a more laborious process that potentially gets a similar result —
see the notebook on deep learning as dynamical system for examples of doing that.

Next quibble: Why *Fourier* transforms, instead of a different basis?
Their rationale is that integral operators can be understood as continuous convolutions in the linear case, and therefore *a bit like convolutions* more generally.
Heuristically, stacked continuous convolutions and non-linearities might well-approximate the operations of solving a nonlinear PDE.
So convolutions are what we want, and Fourier transforms turn convolutions into multiplications so we can use them.
It might sounds like we should use actual convolutional neural networks to solve the PDE, but that would impose a given resolution on the solution which is not what we want.
For me, a better rationale is that this Fourier transform puts us in a “nice” basis for PDEs because Fourier transforms have well-understood interpretations which encode derivatives, translations, interpolations and other useful operations, which is why we use them in classic PDE solvers.
The convolution thing is more “deep net-like” though.

Also Fourier transforms are notionally fast, which is another popular deep net rationale. Implementation detail: the authors do not make any particular effort to be maximally fast by using dilation to truncate the Fourier series. Instead they calculate the whole Fourier transform then throw some bits out. Presumably because they were already fast enough without bothering with being clever.

OTOH Fourier series also encode some troubling assumptions. Essentially these amount to an assumption that the basis functions are periodic, which in turn amounts to assuming that the domain of functions is toroidal, or, I suppose, a box with uniform boundary conditions. This is pretty common, but also annoying. Li, Kovachki, Azizzadenesheli, Liu, Bhattacharya, et al. (2020b) argue that does not matter because the edge effects can be more-or-less ignored and that things will still work OK because of the “local linear projection” part, which is … fine I guess? Maybe true. It still grates. Certainly the boundary conditions situation will just be messy AFAICT for any neural operator so I am at ease with the idea they would like to simply ignore it for now.

Would alternative bases fix that boundary problem, or just be more of an annoying PITA?
Would there be anything to gain from *learning* a basis for the expansion?
Certainly you would lose speed.

Also, this method uses a lot of the right buzzwords to sound like a kernel trick approach, and one can’t help but feel there might be a logical Gaussian process regression formulation with nice Bayesian interpretation. But I can’t see an obvious way of making that fly; the operator passes though a lot of nonlinear functions and thus the pushforward measure will get gnarly, not to mention the correlation kernel. I suppose it could possibly be approximated as ‘just’ another deep Gaussian process with some tactical assumptions, perhaps infinite-width asymptotics. Laplace approximation? Or a variational approximation? But what is a plausible variational family?

Update: See Kovachki, Lanthaler, and Mishra (2021) for some analysis of the performance of this method. TBD: read that mammoth paper.

## U-net

TBD.

## Boundary conditions

Are hard to handle in a neural network context. Solutions seem diverse. Liu, Yeo, and Lu (2020) just assumes that there are some initial conditions and boundaries can be “handled” by hoping that simulating a toroidal domain then cutting out a segment of interest.

## References

*GAMM-Mitteilungen*44 (2): e202100006.

*arXiv:2202.03376 [Cs, Math]*, March.

*The Journal of Machine Learning Research*17 (1): 613–66.

*arXiv:2107.07562 [Cs, Math]*, July.

*arXiv:2108.08481 [Cs, Math]*.

*arXiv:2010.08895 [Cs, Math]*, October.

*Advances in Neural Information Processing Systems*. Vol. 33.

*Journal of the American Statistical Association*0 (0): 1–18.

*Proceedings of the 35th International Conference on Machine Learning*, 3208–16. PMLR.

*arXiv:1910.03193 [Cs, Stat]*, April.

*SIAM Review*63 (1): 208–28.

*Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015*. Lecture Notes in Computer Science. Cham: Springer International Publishing.

*Advances in Neural Information Processing Systems*. Vol. 33.

*Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining*, 1457–66. KDD ’20. New York, NY, USA: Association for Computing Machinery.

*Advances in Water Resources*163 (May): 104180.

*arXiv:2011.11955 [Cs, Math]*.

NB — my calculations occasionally came out differing from the versions the authors gave in the paper with regards to the dimensionality of the spaces.↩︎

## No comments yet. Why not leave one?