Data storage formats



What format should I use to send my data over the network or to stash it on disk? Files full of stored data are the most basic form of database. Classic enterprise databases are optimised for things I don’t often need in data science research, such as structured records, high write concurrency and so on.

Textual formats

For CSV, JSON etc, see text data processing.

These textual formats are a slow hot mess whose main virtue is that many things claim to support them, although they all do so badly, inconsistently and slowly. For reproducible projects they are best parsed once with pandas or tablib and thereafter ignored. For exploratory small-data analysis, though, they are tolerable and common.
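A minimal sketch of the parse-once-then-ignore pattern I mean, assuming pandas with a Parquet engine installed; the file name and columns are hypothetical.

```python
import pandas as pd

# Parse the CSV once, being explicit about types, then cache it to a saner
# binary format. "observations.csv" and its columns are made up.
df = pd.read_csv(
    "observations.csv",
    dtype={"station": "string", "temp": "float64"},
    parse_dates=["time"],
)
df.to_parquet("observations.parquet")  # requires pyarrow or fastparquet
```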

HDF5

So important I made a new notebook. See HDF5.

Arrow

Apache Arrow (Python/R/C++/many more) (source) is a fresh entrant designed to serialise the same types of data as HDF5, while being both simpler from the user side and faster at scale. See Wes McKinney’s pyarrow blog post or Notes from a data witch: Getting started with Apache Arrow. A small usage sketch follows the Polars feature list below.

Polars is a blazingly fast DataFrames library implemented in Rust using Apache Arrow Columnar Format as the memory model.

  • Lazy | eager execution
  • Multi-threaded
  • SIMD
  • Query optimization
  • Powerful expression API
  • Hybrid Streaming (larger than RAM datasets)
  • Rust | Python | NodeJS | …
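As a minimal sketch of the Arrow-backed workflow (assuming pyarrow and polars are installed; the file name and columns are made up): build an Arrow table, persist it as Parquet, then query it lazily with Polars.

```python
import pyarrow as pa
import pyarrow.parquet as pq
import polars as pl

# Build an Arrow table in memory and persist it as Parquet.
table = pa.table({"station": ["a", "a", "b"], "temp": [21.5, 19.0, 25.3]})
pq.write_table(table, "obs.parquet")  # hypothetical filename

# Polars scans the file lazily and only materialises what the query needs.
result = (
    pl.scan_parquet("obs.parquet")
    .filter(pl.col("temp") > 20.0)
    .select(["station", "temp"])
    .collect()
)
print(result)
```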

Pandas

pandas can use various data formats for backend access, e.g. HDF5, CSV, JSON, SQL, Excel, Parquet.
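For concreteness, a hedged sketch of pandas round-tripping one frame through a few of those backends; the optional dependencies (pyarrow or fastparquet, PyTables, openpyxl) are assumed to be installed, and the file names are arbitrary.

```python
import pandas as pd

df = pd.DataFrame({"x": range(5), "y": [0.1, 0.2, 0.3, 0.4, 0.5]})

df.to_csv("frame.csv", index=False)
df.to_parquet("frame.parquet")           # needs pyarrow or fastparquet
df.to_hdf("frame.h5", key="frame")       # needs PyTables
df.to_excel("frame.xlsx", index=False)   # needs openpyxl

back = pd.read_parquet("frame.parquet")
print(back.dtypes)
```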

Petastorm

Is this a data file format or some kind of database? Possibly both?

In this article, we describe Petastorm, an open source data access library developed at Uber ATG. This library enables single machine or distributed training and evaluation of deep learning models directly from multi-terabyte datasets in Apache Parquet format. Petastorm supports popular Python-based machine learning (ML) frameworks such as Tensorflow, Pytorch, and PySpark. It can also be used from pure Python code.

It sounds like a nice backend built upon pyarrow, with special support for Spark and ML-flavoured data formats. It seems easy to use from Python, but might be less pleasant for non-Python users, even if still possible, because there are not so many luxurious supporting libraries.
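A hedged sketch of the reading path, assuming a plain Parquet dataset already exists at the (hypothetical) URL below; writing Petastorm-schema datasets goes through its Spark helpers, which I omit here.

```python
from petastorm import make_batch_reader

# Stream record batches out of an existing Parquet store; the path is hypothetical.
with make_batch_reader("file:///tmp/observations_parquet") as reader:
    for batch in reader:  # batches arrive as namedtuples of numpy arrays
        print(batch)
        break
```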

Dask

NumPy-style arrays, but chunked, lazily evaluated and (optionally) distributed across a cluster. See Dask.
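A tiny sketch of the NumPy-flavoured interface; the shapes and chunk sizes are arbitrary.

```python
import dask.array as da

# Lazy, chunked array; nothing is computed until .compute() is called.
x = da.random.random((100_000, 1_000), chunks=(10_000, 1_000))
col_means = x.mean(axis=0)
print(col_means[:5].compute())
```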

zarr

NumPy-style arrays, but chunked and compressed, stored on disk or in cloud object storage. See zarr.
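A tiny sketch of writing a chunked array to disk with zarr; the path, shape and chunking below are arbitrary.

```python
import numpy as np
import zarr

# Chunked, compressed, on-disk array; "data.zarr" is a hypothetical path.
z = zarr.open("data.zarr", mode="w", shape=(10_000, 10_000),
              chunks=(1_000, 1_000), dtype="f4")
z[0, :] = np.random.default_rng(0).random(10_000)  # only the affected chunks are written
print(z.shape, z.chunks)
```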

Protobuf

protobuf (Python/R/Java/C/Go/whatever): Google’s famous data format. It is recommended for TensorFlow, although it’s soooooo boooooooring that if I am reading that page I am very far from what I love. It is a data format rather than a storage solution per se (there is no assumption that it will be stored on the file system).

You might also want everything not to be hard. Try prototool or buf.
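For flavour, a hedged sketch assuming a hypothetical query.proto (shown in the comment, not from any real project) has been compiled with protoc; only the standard generated-message API is used.

```python
# Assumes query.proto contains:
#
#   syntax = "proto3";
#   message Query {
#     string text = 1;
#     int32 page = 2;
#   }
#
# and has been compiled with: protoc --python_out=. query.proto
import query_pb2

q = query_pb2.Query(text="lossy compression", page=1)
wire_bytes = q.SerializeToString()                 # compact binary wire format
decoded = query_pb2.Query.FromString(wire_bytes)   # parse it back
print(decoded.text, decoded.page)
```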

flatbuffers

flatbuffers:

FlatBuffers is an efficient cross platform serialization library for C++, C#, C, Go, Java, JavaScript, PHP, and Python. It was originally created at Google for game development and other performance-critical applications…

Why not use Protocol Buffers, or…?

Protocol Buffers is indeed relatively similar to FlatBuffers, with the primary difference being that FlatBuffers does not need a parsing/unpacking step to a secondary representation before you can access data, often coupled with per-object memory allocation. The code is an order of magnitude bigger, too. Protocol Buffers has neither optional text import/export nor schema language features like unions.

That sounds like an optimisation that I will not need.

Compressing scientific data

Lossy compression of dense numerical data is weirdly fascinating.

SZ3 and ZFP seem like competing contenders right now.

sdrbench.github.io/ notionally benchmarks different methods, although I cannot make any sense of that myself.
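A hedged sketch of error-bounded compression using the zfpy bindings to ZFP (pip install zfpy); the array and the tolerance are purely illustrative, not a recommendation.

```python
import numpy as np
import zfpy

field = np.random.default_rng(0).standard_normal((256, 256))

# Fixed-accuracy mode: the absolute error is bounded by the tolerance.
compressed = zfpy.compress_numpy(field, tolerance=1e-3)
recovered = zfpy.decompress_numpy(compressed)

print("compression ratio:", field.nbytes / len(compressed))
print("max abs error:", np.abs(field - recovered).max())
```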

References

Di, Sheng, Dingwen Tao, Xin Liang, and Franck Cappello. 2018. “Efficient Lossy Compression for Scientific Data Based on Pointwise Relative Error Bound.” IEEE Transactions on Parallel and Distributed Systems 30 (2).
Diffenderfer, James, Alyson L. Fox, Jeffrey A. Hittinger, Geoffrey Sanders, and Peter G. Lindstrom. 2019. “Error Analysis of ZFP Compression for Floating-Point Data.” SIAM Journal on Scientific Computing 41 (3): A1867–98.
Dube, Griffin, Jiannan Tian, Sheng Di, Dingwen Tao, Jon Calhoun, and Franck Cappello. 2022. “SIMD Lossy Compression for Scientific Data.” arXiv:2201.04614 [Cs], January.
Liang, Xin, Sheng Di, Dingwen Tao, Sihuan Li, Shaomeng Li, Hanqi Guo, Zizhong Chen, and Franck Cappello. 2018. “Error-Controlled Lossy Compression Optimized for High Compression Ratios of Scientific Datasets.” In 2018 IEEE International Conference on Big Data (Big Data), 438–47.
Zhao, Kai, Sheng Di, Maxim Dmitriev, Thierry-Laurent D. Tonellot, Zizhong Chen, and Franck Cappello. 2021. “Optimizing Error-Bounded Lossy Compression for Scientific Data by Dynamic Spline Interpolation.” In 2021 IEEE 37th International Conference on Data Engineering (ICDE), 1643–54.
Zhao, Kai, Sheng Di, Xin Liang, Sihuan Li, Dingwen Tao, Zizhong Chen, and Franck Cappello. 2020. “Significantly Improving Lossy Compression for HPC Datasets with Second-Order Prediction and Parameter Optimization.” In Proceedings of the 29th International Symposium on High-Performance Parallel and Distributed Computing, 89–100. HPDC ’20. New York, NY, USA: Association for Computing Machinery.
