Data sets for machine learning for partial differential equations
2017-05-15 — 2025-05-26
Datasets, training harnesses, and benchmarks for machine learning on partial differential equations (PDEs).
You’ll notice an emphasis on Computational Fluid Dynamics (CFD) problems here, especially single-phase ones. That is where the early successes of operator learning have come from (although, I’d argue, not where it is most needed).
PLAID
Casenave et al. (2025):
Machine learning-based surrogate models have emerged as a powerful tool to accelerate simulation-driven scientific workflows. However, their widespread adoption is hindered by the lack of large-scale, diverse, and standardized datasets tailored to physics-based simulations. While existing initiatives provide valuable contributions, many are limited in scope, focusing on specific physics domains, relying on fragmented tooling, or adhering to overly simplistic datamodels that restrict generalization. To address these limitations, we introduce PLAID (Physics-Learning AI Datamodel), a flexible and extensible framework for representing and sharing datasets of physics simulations. PLAID defines a unified standard for describing simulation data and is accompanied by a library for creating, reading, and manipulating complex datasets across a wide range of physical use cases (DRTI/plaid). We release six carefully crafted datasets under the PLAID standard, covering structural mechanics and computational fluid dynamics, and provide baseline benchmarks using representative learning methods. Benchmarking tools are made available on Hugging Face, enabling direct participation by the community and contribution to ongoing evaluation efforts (PLAIDbenchmarks).
pdebench/PDEBench: An Extensive Benchmark for Scientific Machine Learning (Takamoto et al. 2022) (Disclaimer: I contributed significantly to this project)
PDEArena (Brandstetter et al. 2022; Gupta and Brandstetter 2022)
Johns Hopkins Turbulence Databases (JHTDB) (Li et al. 2008; Yu et al. 2012)
karlotness/nn-benchmark: An extensible benchmark suite to evaluate data-driven physical simulation (Otness et al. 2021)
The Well (Ohana et al. 2024)
Welcome to the Well, a large-scale collection of machine learning datasets containing numerical simulations of a wide variety of spatiotemporal physical systems. The Well draws from domain scientists and numerical software developers to provide 15TB of data across 16 datasets covering diverse domains such as biological systems, fluid dynamics, acoustic scattering, as well as magneto-hydrodynamic simulations of extra-galactic fluids or supernova explosions. These datasets can be used individually or as part of a broader benchmark suite for accelerating research in machine learning and computational sciences.
APEBench: A Benchmark for Autoregressive Neural Emulators of PDEs (Koehler et al. 2024)
APEBench is a JAX-based tool to evaluate autoregressive neural emulators for PDEs on periodic domains in 1d, 2d, and 3d. It comes with an efficient reference simulator based on spectral methods that is used for procedural data generation (no need to download large datasets with APEBench). Since this simulator can also be embedded into emulator training (e.g., for a “solver-in-the-loop” correction setting), this is the first benchmark suite to support differentiable physics.
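APEBench’s actual simulator is a JAX spectral solver; the flavour of “procedural data generation” it describes — exact Fourier-space stepping plus randomly sampled initial conditions, no dataset downloads — can be sketched in plain numpy for the 1D heat equation on a periodic domain (all names and parameters here are illustrative, not APEBench’s API):

```python
import numpy as np

def spectral_heat_step(u, dt, nu, dx):
    """Advance u_t = nu * u_xx by one step via an exact
    integrating-factor update in Fourier space (periodic domain)."""
    k = 2 * np.pi * np.fft.fftfreq(u.size, d=dx)   # angular wavenumbers
    u_hat = np.fft.fft(u)
    u_hat *= np.exp(-nu * k**2 * dt)               # each mode decays exactly
    return np.fft.ifft(u_hat).real

# Procedural data generation: draw random initial conditions and roll
# out trajectories on the fly, rather than storing them on disk.
rng = np.random.default_rng(0)
n, dx, dt, nu = 128, 1.0 / 128, 1e-3, 0.01
x = np.arange(n) * dx

def random_trajectory(steps=50):
    # random low-frequency initial condition
    u = sum(rng.normal() * np.sin(2 * np.pi * m * x) for m in range(1, 5))
    frames = [u]
    for _ in range(steps):
        u = spectral_heat_step(u, dt, nu, dx)
        frames.append(u)
    return np.stack(frames)   # (steps + 1, n) array: one training sample

traj = random_trajectory()
```

Because each trajectory is cheap to regenerate from a seed, the “dataset” is effectively a sampler, which is what makes solver-in-the-loop training settings practical.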
If we have a simulator, we can run it live and generate data on the fly. Here is one tool to facilitate that.
INRIA’s Melissa (Ribés and Raffin 2020; Terraz et al. 2017)
Melissa is a file-avoiding, fault-tolerant, and elastic framework to run large-scale sensitivity analysis (Melissa-SA) and large-scale deep surrogate training (Melissa-DL) on supercomputers. With Melissa-SA, the largest runs so far involved up to 30k cores, executed 80,000 parallel simulations, and generated 288 TB of intermediate data that did not need to be stored on the file system …
Classical sensitivity analysis and deep surrogate training consist of running different instances of a simulation with different sets of input parameters, storing the results to disk to later read them back to train a Neural Network or compute the required statistics. The amount of storage needed can quickly become overwhelming, with the associated long read time making data processing time-consuming. To avoid this pitfall, scientists reduce their study size by running low-resolution simulations or down-sampling output data in space and time.
Melissa (Fig. 1) bypasses this limitation by avoiding intermediate file storage. Melissa processes the data online (in transit), enabling very large-scale data processing:
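Melissa itself is a distributed client/server framework, but the core in-transit idea — consume each simulation snapshot the moment it arrives, update running statistics, and discard it — can be sketched with Welford’s online algorithm (the random “snapshots” below are a stand-in for real parallel simulation members):

```python
import numpy as np

class OnlineMoments:
    """Streaming per-gridpoint mean/variance over simulation snapshots
    (Welford's algorithm): no intermediate field ever touches the disk."""
    def __init__(self):
        self.n = 0
        self.mean = None
        self.m2 = None

    def update(self, field):
        self.n += 1
        if self.mean is None:
            self.mean = np.zeros_like(field)
            self.m2 = np.zeros_like(field)
        delta = field - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (field - self.mean)

    def variance(self):
        return self.m2 / (self.n - 1)   # unbiased sample variance

# Stand-in for simulation instances streaming results in transit.
rng = np.random.default_rng(1)
stats = OnlineMoments()
for run in range(1000):                      # each "run" is one member
    snapshot = rng.normal(2.0, 3.0, size=64)  # fake 64-point field
    stats.update(snapshot)                    # processed, then discarded
```

The same single-pass pattern generalises to the sensitivity indices and surrogate-training gradients that Melissa computes at scale: only the accumulators persist, never the 288 TB of raw fields.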
Working out which data to simulate to optimally train the neural network (active learning) is a key part of the problem, and I’m not aware of much work in that area.
Bhan et al. (2024) tackles the closely related problem of controlling PDEs. Kim, Kim, and Lee (2024) is the only actual active learning approach I have seen in recent literature.
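Neither paper’s method is reproduced here, but the generic shape of such an active-learning loop is easy to sketch: train an ensemble of cheap surrogates, then run the expensive solver only at the parameter values where the ensemble disagrees most (query by committee). Everything below is a toy illustration — `simulate` is a stand-in analytic function, not a PDE solver, and the polynomial surrogates are placeholders:

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate(theta):
    """Stand-in for an expensive PDE solve at parameter theta."""
    return np.sin(3 * theta) + 0.1 * theta**2

def fit_ensemble(thetas, ys, n_members=8, degree=3):
    # bootstrap ensemble of polynomial surrogates
    members = []
    for _ in range(n_members):
        idx = rng.integers(0, len(thetas), len(thetas))
        members.append(np.polyfit(thetas[idx], ys[idx], degree))
    return members

def acquire(members, candidates):
    # query the candidate where committee disagreement is largest
    preds = np.stack([np.polyval(c, candidates) for c in members])
    return candidates[np.argmax(preds.std(axis=0))]

thetas = rng.uniform(-2, 2, size=12)          # small initial design
ys = simulate(thetas)
candidates = np.linspace(-2, 2, 201)
for _ in range(8):                            # active-learning rounds
    members = fit_ensemble(thetas, ys)
    theta_next = acquire(members, candidates)
    thetas = np.append(thetas, theta_next)
    ys = np.append(ys, simulate(theta_next))
```

The open question the text gestures at is what the right acquisition function is when each “simulate” call costs hours of supercomputer time and the surrogate is a neural operator rather than a polynomial.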