A data format I need to know about

HDF5: (Python/R/Java/C/Fortran/MATLAB/whatever) An optimised and flexible data format originating in the Big Science applications of the 90s. Inbuilt compression and indexing support for data too big for memory. A good default, with wide language support and good performance. For plain numerical arrays it is very simple to use.
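For the simple numerical case, a minimal round trip with h5py looks something like this (the file path and dataset name are arbitrary):

```python
import os
import tempfile

import numpy as np
import h5py

path = os.path.join(tempfile.mkdtemp(), "example.h5")
data = np.random.default_rng(0).normal(size=(100, 3))

# Write an array to a named dataset inside the file
with h5py.File(path, "w") as f:
    f.create_dataset("samples", data=data)

# Read it back; [:] pulls the whole dataset into memory
with h5py.File(path, "r") as f:
    loaded = f["samples"][:]
```

Lossless storage, so the round trip is exact.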

The table definition when writing structured data is booooring, and HDF5 mangles text encodings if you aren't careful, which means a surprising amount of time can be lost to writing schemas when there is highly structured data to store.
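One way to pin down both the schema and the text encoding explicitly is a compound dtype; a minimal sketch with h5py, where the field names are hypothetical and `h5py.string_dtype` fixes the encoding up front:

```python
import os
import tempfile

import numpy as np
import h5py

path = os.path.join(tempfile.mkdtemp(), "table.h5")

# Hypothetical record layout; string_dtype pins the encoding explicitly
dt = np.dtype([
    ("name", h5py.string_dtype(encoding="utf-8")),
    ("value", np.float64),
])
rows = np.array([("α", 1.0), ("b", 2.0)], dtype=dt)

with h5py.File(path, "w") as f:
    f.create_dataset("table", data=rows)

with h5py.File(path, "r") as f:
    out = f["table"][:]

# NB h5py >= 3 hands strings back as bytes; decode explicitly
first_name = out["name"][0].decode("utf-8")
```

Tedious, as complained above, but at least the encoding is no longer implicit.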

HDF5's built-in lossless compression is not impressive on floats, and its built-in lossy options are bad. Recent HDF5 releases support filter plugins that provide fancier methods.
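A sketch of the built-in lossless path with h5py: chunking plus the gzip filter and byte-shuffle, which is about as far as you get without external filter plugins (packages such as hdf5plugin supply Zstd, Bitshuffle and friends):

```python
import os
import tempfile

import numpy as np
import h5py

path = os.path.join(tempfile.mkdtemp(), "compressed.h5")
data = np.random.default_rng(0).normal(size=(1000, 100))

with h5py.File(path, "w") as f:
    # Compression operates per chunk; shuffle reorders bytes so
    # gzip sees longer runs, which helps a little on floats
    f.create_dataset(
        "x",
        data=data,
        chunks=(100, 100),
        compression="gzip",
        compression_opts=4,
        shuffle=True,
    )

with h5py.File(path, "r") as f:
    restored = f["x"][:]
```

Don't expect miracles on noisy floats; the win is much bigger on integer or smooth data.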

I am currently doing a lot of heavy HDF5 processing, so these notes are crufty.

Parallel access

HDF5 with multiple processes is complicated, but may be worthwhile: a parallel (MPI-enabled) build of the library supports collective access, and single-writer/multiple-reader (SWMR) mode covers the simpler cases.
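Short of a full MPI build, SWMR mode is the easiest concurrency story. A sketch of the writer side with h5py (dataset names arbitrary); a reader in another process would open the same file with `swmr=True` while the writer is live, which is simulated here by reading after the writer closes:

```python
import os
import tempfile

import h5py

path = os.path.join(tempfile.mkdtemp(), "stream.h5")

# Writer: libver="latest" is required before enabling SWMR,
# and all objects must exist before swmr_mode is switched on
with h5py.File(path, "w", libver="latest") as f:
    dset = f.create_dataset("stream", shape=(0,), maxshape=(None,), dtype="f8")
    f.swmr_mode = True
    for i in range(3):
        dset.resize((i + 1,))
        dset[i] = float(i)
        dset.flush()  # make the new data visible to readers

# Reader side: concurrent readers open with swmr=True
with h5py.File(path, "r", libver="latest", swmr=True) as f:
    out = f["stream"][:]
```

A live reader would call `dset.refresh()` to see newly flushed data.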

Useful tools: visualisers such as HDFView, plus the h5ls/h5dump command-line utilities for poking at file contents.

To install h5py against the Homebrew HDF5 libraries (useful on Apple Silicon or other unusual architectures), try this tip:

brew install hdf5
HDF5_DIR="$(brew --prefix hdf5)" pip install --no-binary=h5py h5py
HDF5_DIR="$(brew --prefix hdf5)" pip install --no-build-isolation h5py

Virtual datasets

a.k.a. arrays-made-up-of-multiple-arrays-across-multiple-files.

I wonder how this interacts with parallelism? What about when writing? It sounds like it is a supported means of getting multiple writers if we are careful.
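A sketch of stitching two source files into one virtual dataset via h5py's `VirtualLayout`/`VirtualSource` API (file names hypothetical):

```python
import os
import tempfile

import numpy as np
import h5py

d = tempfile.mkdtemp()

# Two source files, each holding one 1-D chunk of the data
for i in range(2):
    with h5py.File(os.path.join(d, f"part{i}.h5"), "w") as f:
        f.create_dataset("x", data=np.arange(4) + 4 * i, dtype="i8")

# Map each source into its slice of the virtual layout
layout = h5py.VirtualLayout(shape=(8,), dtype="i8")
for i in range(2):
    src = h5py.VirtualSource(os.path.join(d, f"part{i}.h5"), "x", shape=(4,))
    layout[4 * i : 4 * i + 4] = src

with h5py.File(os.path.join(d, "combined.h5"), "w") as f:
    f.create_virtual_dataset("x", layout)

# The combined file reads as one contiguous array
with h5py.File(os.path.join(d, "combined.h5"), "r") as f:
    out = f["x"][:]
```

The virtual file itself stores only the mapping, so the source files must stay where the layout can find them.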


