A data format I need to know about

HDF5: (Python/R/Java/C/Fortran/MATLAB/whatever) An optimised and flexible data format from the Big Science applications of the 90s. Built-in compression and indexing support for data too big for memory. A good default with wide support and good performance. For plain numerical arrays it is very simple.
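For the simple numerical case, a minimal sketch with h5py (the file and dataset names here are arbitrary choices):

```python
# Minimal sketch: write and read a numerical array with h5py.
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "example.h5")

with h5py.File(path, "w") as f:
    f.create_dataset("data", data=np.arange(10.0))

with h5py.File(path, "r") as f:
    arr = f["data"][:]     # read the whole dataset into memory
    part = f["data"][2:5]  # or slice without loading everything
```

Slicing a dataset without reading the whole thing is the point: HDF5 only pulls the requested region off disk.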

The table definition when writing structured data is booooring, and it mangles text encodings if you aren't careful, which means a surprising amount of time can be lost to writing schemas if there is highly structured data to store.
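A sketch of defining such a table in h5py, pinning the string encoding explicitly so text does not get mangled (field and file names here are illustrative):

```python
# Sketch: a structured ("compound") dataset with an explicit UTF-8
# string field, so the text encoding is declared rather than guessed.
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "table.h5")

# The boring part: a compound dtype, i.e. the schema.
dt = np.dtype([
    ("name", h5py.string_dtype(encoding="utf-8")),
    ("value", np.float64),
])
rows = np.array([("alpha", 1.0), ("béta", 2.0)], dtype=dt)

with h5py.File(path, "w") as f:
    f.create_dataset("table", data=rows)

with h5py.File(path, "r") as f:
    back = f["table"][:]

# h5py returns variable-length strings as bytes; decode with the
# encoding you declared.
names = [n.decode("utf-8") for n in back["name"]]
```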

Built-in lossless compression in HDF5 is not impressive on floats, and the built-in lossy option is bad. Recent HDF5 supports filter plugins providing fancier methods.

I am currently doing a lot of heavy HDF5 processing, so these notes are crufty.


I have been trying to use compression from HDF5, but the built-in options are not great. More sophisticated lossy compression is available via third-party HDF5 filter plugins.
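For what the built-in lossless filters can do, a sketch: compression in HDF5 operates per chunk (so chunking is required), and the byte-shuffle filter often improves gzip's ratio on float data.

```python
# Sketch: chunked gzip compression with the shuffle filter in h5py.
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "compressed.h5")
data = np.linspace(0.0, 1.0, 100_000)

with h5py.File(path, "w") as f:
    f.create_dataset(
        "x",
        data=data,
        chunks=(10_000,),    # compression operates per chunk
        compression="gzip",
        compression_opts=4,  # gzip level, 1-9
        shuffle=True,        # byte-shuffle before gzip; helps floats
    )

with h5py.File(path, "r") as f:
    roundtrip = f["x"][:]
    filters = (f["x"].compression, f["x"].shuffle)
```

This is lossless, so the round trip is bit-exact; the disappointment is in the ratio, not the fidelity.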

Untidy notes

Parallel access

Using HDF5 from multiple processes is complicated; parallel HDF5 (built against MPI) or single-writer/multiple-reader (SWMR) mode may be worthwhile.
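A sketch of the SWMR (single-writer/multiple-readers) pattern in h5py: one process appends and flushes, and other processes can safely read concurrently by opening with `swmr=True`. File and dataset names are illustrative; here writer and reader run in the same process for brevity.

```python
# Sketch: SWMR mode in h5py. All objects must be created before
# enabling swmr_mode, and libver="latest" is required.
import os
import tempfile

import h5py

path = os.path.join(tempfile.mkdtemp(), "stream.h5")

# Writer side.
with h5py.File(path, "w", libver="latest") as f:
    dset = f.create_dataset("stream", shape=(0,), maxshape=(None,), dtype="f8")
    f.swmr_mode = True
    for i in range(3):
        n = dset.shape[0]
        dset.resize((n + 1,))
        dset[n] = float(i)
        dset.flush()  # make the new row visible to readers

# Reader side (normally a separate process).
with h5py.File(path, "r", swmr=True) as f:
    seen = f["stream"][:]
```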

Useful tools: visualisers such as HDFView, plus the h5ls and h5dump command-line utilities.

To install h5py against the Homebrew HDF5 libraries (useful for Apple Silicon or other weird architectures) try this tip:

brew install hdf5
HDF5_DIR="$(brew --prefix hdf5)" pip install --no-binary=h5py h5py
# or, if the isolated build environment causes trouble:
HDF5_DIR="$(brew --prefix hdf5)" pip install --no-build-isolation h5py

Virtual datasets

a.k.a. arrays-made-up-of-multiple-arrays-across-multiple-files.

I wonder how this interacts with parallelism, especially when writing. It sounds like virtual datasets are a supported means of getting multiple writers, if we are careful.
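A sketch of the mechanics in h5py: each source file holds an ordinary dataset, and a virtual layout maps each one into a region of a single logical array. File and dataset names are illustrative.

```python
# Sketch: a 2x5 virtual dataset stitched from two single-row files.
import os
import tempfile

import h5py
import numpy as np

d = tempfile.mkdtemp()

# Create two source files, each holding one row of the logical array.
for i in range(2):
    with h5py.File(os.path.join(d, f"part{i}.h5"), "w") as f:
        f.create_dataset("data", data=np.full(5, float(i)))

# Map each source dataset into a row of the virtual layout.
layout = h5py.VirtualLayout(shape=(2, 5), dtype="f8")
for i in range(2):
    layout[i] = h5py.VirtualSource(
        os.path.join(d, f"part{i}.h5"), "data", shape=(5,)
    )

with h5py.File(os.path.join(d, "combined.h5"), "w", libver="latest") as f:
    f.create_virtual_dataset("vds", layout, fillvalue=np.nan)

with h5py.File(os.path.join(d, "combined.h5"), "r") as f:
    combined = f["vds"][:]
```

Since each process writes only its own source file, this sidesteps concurrent writes to a shared file; the virtual dataset is assembled read-only afterwards.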


