Data sets for machine learning for partial differential equations
2017-05-15 — 2025-05-26
Datasets, training harnesses, and benchmarks for machine learning on partial differential equations (PDEs).
You’ll notice an emphasis on Computational Fluid Dynamics (CFD) problems here, especially single-phase ones. That is where the early successes of operator learning have come from (although, I’d argue, not where it is most needed).
PLAID
Casenave et al. (2025):
Machine learning-based surrogate models have emerged as a powerful tool to accelerate simulation-driven scientific workflows. However, their widespread adoption is hindered by the lack of large-scale, diverse, and standardized datasets tailored to physics-based simulations. While existing initiatives provide valuable contributions, many are limited in scope, focusing on specific physics domains, relying on fragmented tooling, or adhering to overly simplistic datamodels that restrict generalization. To address these limitations, we introduce PLAID (Physics-Learning AI Datamodel), a flexible and extensible framework for representing and sharing datasets of physics simulations. PLAID defines a unified standard for describing simulation data and is accompanied by a library for creating, reading, and manipulating complex datasets across a wide range of physical use cases (DRTI/plaid). We release six carefully crafted datasets under the PLAID standard, covering structural mechanics and computational fluid dynamics, and provide baseline benchmarks using representative learning methods. Benchmarking tools are made available on Hugging Face, enabling direct participation by the community and contribution to ongoing evaluation efforts (PLAIDbenchmarks).
pdebench/PDEBench: An Extensive Benchmark for Scientific Machine Learning (Takamoto et al. 2022) (Disclaimer: I contributed significantly to this project)
PDEArena (Brandstetter et al. 2022; Gupta and Brandstetter 2022)
Johns Hopkins Turbulence Databases (JHTDB) (Li et al. 2008; Yu et al. 2012)
karlotness/nn-benchmark: An extensible benchmark suite to evaluate data-driven physical simulation (Otness et al. 2021)
The Well (Ohana et al. 2024)
Welcome to the Well, a large-scale collection of machine learning datasets containing numerical simulations of a wide variety of spatiotemporal physical systems. The Well draws from domain scientists and numerical software developers to provide 15TB of data across 16 datasets covering diverse domains such as biological systems, fluid dynamics, acoustic scattering, as well as magneto-hydrodynamic simulations of extra-galactic fluids or supernova explosions. These datasets can be used individually or as part of a broader benchmark suite for accelerating research in machine learning and computational sciences.
APEBench: A Benchmark for Autoregressive Neural Emulators of PDEs (Koehler et al. 2024)
APEBench is a JAX-based tool to evaluate autoregressive neural emulators for PDEs on periodic domains in 1d, 2d, and 3d. It comes with an efficient reference simulator based on spectral methods that is used for procedural data generation (no need to download large datasets with APEBench). Since this simulator can also be embedded into emulator training (e.g., for a “solver-in-the-loop” correction setting), this is the first benchmark suite to support differentiable physics.
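APEBench’s actual simulator is a JAX spectral solver; the flavour of “procedural data generation” it describes — exact Fourier-space stepping plus randomly sampled initial conditions, no dataset downloads — can be sketched in plain numpy for the 1D heat equation on a periodic domain (all names and parameters here are illustrative, not APEBench’s API):

```python
import numpy as np

def spectral_heat_step(u, dt, nu, dx):
    """Advance u_t = nu * u_xx by one step via an exact
    integrating-factor update in Fourier space (periodic domain)."""
    k = 2 * np.pi * np.fft.fftfreq(u.size, d=dx)   # angular wavenumbers
    u_hat = np.fft.fft(u)
    u_hat *= np.exp(-nu * k**2 * dt)               # each mode decays exactly
    return np.fft.ifft(u_hat).real

# Procedural data generation: draw random initial conditions and roll
# out trajectories on the fly, rather than storing them on disk.
rng = np.random.default_rng(0)
n, dx, dt, nu = 128, 1.0 / 128, 1e-3, 0.01
x = np.arange(n) * dx

def random_trajectory(steps=50):
    # random low-frequency initial condition
    u = sum(rng.normal() * np.sin(2 * np.pi * m * x) for m in range(1, 5))
    frames = [u]
    for _ in range(steps):
        u = spectral_heat_step(u, dt, nu, dx)
        frames.append(u)
    return np.stack(frames)   # (steps + 1, n) array: one training sample

traj = random_trajectory()
```

Because each trajectory is cheap to regenerate from a seed, the “dataset” is effectively a sampler, which is what makes solver-in-the-loop training settings practical.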
If we have a simulator, we can run it live and generate data on the fly. Here is one tool to facilitate that.
INRIA’s Melissa (Ribés and Raffin 2020; Terraz et al. 2017)
Melissa is a file-avoiding, fault-tolerant, and elastic framework to run large-scale sensitivity analysis (Melissa-SA) and large-scale deep surrogate training (Melissa-DL) on supercomputers. With Melissa-SA, the largest runs so far involved up to 30k cores, executed 80,000 parallel simulations, and generated 288 TB of intermediate data that did not need to be stored on the file system …
Classical sensitivity analysis and deep surrogate training consist of running different instances of a simulation with different sets of input parameters, storing the results to disk to later read them back to train a Neural Network or compute the required statistics. The amount of storage needed can quickly become overwhelming, with the associated long read time making data processing time-consuming. To avoid this pitfall, scientists reduce their study size by running low-resolution simulations or down-sampling output data in space and time.
Melissa (Fig. 1) bypasses this limitation by avoiding intermediate file storage. Melissa processes the data online (in transit), enabling very large-scale data processing:
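Melissa itself is a distributed client/server framework, but the core in-transit idea — consume each simulation snapshot the moment it arrives, update running statistics, and discard it — can be sketched with Welford’s online algorithm (the random “snapshots” below are a stand-in for real parallel simulation members):

```python
import numpy as np

class OnlineMoments:
    """Streaming per-gridpoint mean/variance over simulation snapshots
    (Welford's algorithm): no intermediate field ever touches the disk."""
    def __init__(self):
        self.n = 0
        self.mean = None
        self.m2 = None

    def update(self, field):
        self.n += 1
        if self.mean is None:
            self.mean = np.zeros_like(field)
            self.m2 = np.zeros_like(field)
        delta = field - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (field - self.mean)

    def variance(self):
        return self.m2 / (self.n - 1)   # unbiased sample variance

# Stand-in for simulation instances streaming results in transit.
rng = np.random.default_rng(1)
stats = OnlineMoments()
for run in range(1000):                      # each "run" is one member
    snapshot = rng.normal(2.0, 3.0, size=64)  # fake 64-point field
    stats.update(snapshot)                    # processed, then discarded
```

The same single-pass pattern generalises to the sensitivity indices and surrogate-training gradients that Melissa computes at scale: only the accumulators persist, never the 288 TB of raw fields.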
Working out which data to simulate to optimally train the neural network (active learning) is a key part of the problem, and I’m not aware of much work in that area.
Bhan et al. (2024) tackles the closely related problem of controlling PDEs. Kim, Kim, and Lee (2024) is the only actual active learning approach I have seen in recent literature.
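Neither paper’s method is reproduced here, but the generic shape of such an active-learning loop is easy to sketch: train an ensemble of cheap surrogates, then run the expensive solver only at the parameter values where the ensemble disagrees most (query by committee). Everything below is a toy illustration — `simulate` is a stand-in analytic function, not a PDE solver, and the polynomial surrogates are placeholders:

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate(theta):
    """Stand-in for an expensive PDE solve at parameter theta."""
    return np.sin(3 * theta) + 0.1 * theta**2

def fit_ensemble(thetas, ys, n_members=8, degree=3):
    # bootstrap ensemble of polynomial surrogates
    members = []
    for _ in range(n_members):
        idx = rng.integers(0, len(thetas), len(thetas))
        members.append(np.polyfit(thetas[idx], ys[idx], degree))
    return members

def acquire(members, candidates):
    # query the candidate where committee disagreement is largest
    preds = np.stack([np.polyval(c, candidates) for c in members])
    return candidates[np.argmax(preds.std(axis=0))]

thetas = rng.uniform(-2, 2, size=12)          # small initial design
ys = simulate(thetas)
candidates = np.linspace(-2, 2, 201)
for _ in range(8):                            # active-learning rounds
    members = fit_ensemble(thetas, ys)
    theta_next = acquire(members, candidates)
    thetas = np.append(thetas, theta_next)
    ys = np.append(ys, simulate(theta_next))
```

The open question the text gestures at is what the right acquisition function is when each “simulate” call costs hours of supercomputer time and the surrogate is a neural operator rather than a polynomial.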