In what format do I send my data over the network or stash it on disk? Especially interesting for experiment data, which is big and numbery.
Files full of stored data are the most basic form of database,1 and much less arsing around if we can get away with it. Classic enterprise databases are optimised for things I don't often need in data science research, such as structured records, high-write concurrency, etc.
For CSV, JSON etc., see text data processing.
A hot mess that has the virtue that many projects use it, although they all use it badly, inconsistently and slowly. For exploratory small-data analysis, textual tabular data is tolerable and common.
For ongoing projects, CSV is best parsed with a tabular library such as pandas or tablib and thereafter stored in some fancier format.
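Something like this, I think (file and column names invented): parse the CSV once with pandas, then keep the parsed table in a binary, typed format so subsequent loads are fast and lossless.

```python
import pandas as pd

# Parse the CSV once, being explicit about dates (and types, where needed)
df = pd.read_csv("experiment.csv", parse_dates=["timestamp"])

# Thereafter keep the parsed table in a fancier (binary, typed, compressed) format
df.to_parquet("experiment.parquet")
df2 = pd.read_parquet("experiment.parquet")  # fast reload, no re-parsing
```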
A numerical data storage system beloved of physicists. So important in my workflow that I made a new notebook. See HDF5.
Apache Arrow (Python/R/C++/many more) (source) is a fresh system designed to serialise the same types of data as HDF5, but to be both simpler from the user side and faster at scale. See Wes McKinney’s pyarrow blog post or Notes from a data witch: Getting started with Apache Arrow. It seems to be optimised for in-memory data, although it does store stuff on disk.
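A minimal pyarrow sketch (data and file names are illustrative): build a columnar Table in memory, then round-trip it through the Feather/Arrow IPC on-disk format.

```python
import pyarrow as pa
import pyarrow.feather as feather

# Build an in-memory Arrow table (columnar, zero-copy friendly)
table = pa.table({"t": [0.0, 0.1, 0.2], "y": [1.0, 4.0, 9.0]})

# Arrow will also persist to disk, e.g. via the Feather / Arrow IPC format
feather.write_feather(table, "data.feather")
table2 = feather.read_table("data.feather")
```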
Polars is a blazingly fast DataFrames library implemented in Rust using Apache Arrow Columnar Format as the memory model.
- Lazy | eager execution
- Query optimization
- Powerful expression API
- Hybrid Streaming (larger than RAM datasets)
- Rust | Python | NodeJS | …
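A sketch of the lazy/eager distinction (the CSV file and column names are made up): nothing is read until `.collect()`, so the query optimiser can push the filter and column selection down into the scan.

```python
import polars as pl

# Lazy scan: builds a query plan, reads nothing yet
lazy = (
    pl.scan_csv("experiment.csv")      # hypothetical file
    .filter(pl.col("y") > 0)
    .group_by("run")
    .agg(pl.col("y").mean())
)

result = lazy.collect()                # execute the optimised plan, eagerly
```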
Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk. Apache Parquet is designed to be a common interchange format for both batch and interactive workloads. It is similar to other columnar-storage file format[…]
- Free and open source file format.
- Language agnostic.
- Column-based format - files are organized by column, rather than by row, which saves storage space and speeds up analytics queries.
- Used for analytics (OLAP) use cases, typically in conjunction with traditional OLTP databases.
- Highly efficient data compression and decompression.
- Supports complex data types and advanced nested data structures.
It seems happier to serialise to disk than Apache Arrow does.
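For example, a hedged pyarrow sketch (data and file names invented): because the layout is columnar, we can read back only the columns we care about.

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"t": [0.0, 0.1, 0.2], "y": [1.0, 4.0, 9.0], "run": [1, 1, 2]})
pq.write_table(table, "data.parquet", compression="zstd")

# Column-oriented layout: read back only the columns we actually need
subset = pq.read_table("data.parquet", columns=["t", "y"])
```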
pandas can use various data formats for backend access, e.g. HDF5, CSV, JSON, SQL, Excel, Parquet.
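E.g. the same DataFrame can be pushed through the matching `to_*`/`read_*` pairs (a quick sketch; HDF5 support needs the optional pytables dependency):

```python
import pandas as pd

df = pd.DataFrame({"t": [0.0, 0.1], "y": [1.0, 4.0]})

df.to_parquet("df.parquet")              # Parquet (via pyarrow or fastparquet)
df.to_hdf("df.h5", key="df", mode="w")   # HDF5 (needs pytables)
df.to_json("df.json", orient="records")  # JSON

same = pd.read_hdf("df.h5", key="df")
```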
Is this a data file format or some kind of database? Possibly both?
In this article, we describe Petastorm, an open source data access library developed at Uber ATG. This library enables single machine or distributed training and evaluation of deep learning models directly from multi-terabyte datasets in Apache Parquet format. Petastorm supports popular Python-based machine learning (ML) frameworks such as Tensorflow, Pytorch, and PySpark. It can also be used from pure Python code.
It sounds like a nice backend built upon pyarrow with special support for Spark and ML formats. It seems like this might be easy to use from Python but less pleasant (though still possible) for non-Python users, because there will not be so many luxurious supporting libraries.
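A hedged sketch of what that looks like from pure Python, assuming an existing Parquet dataset at a made-up path; `make_batch_reader` is the entry point for “vanilla” Parquet, i.e. datasets not written by Petastorm itself.

```python
from petastorm import make_batch_reader

# Stream batches straight out of a (possibly multi-terabyte) Parquet dataset
with make_batch_reader("file:///data/experiment_parquet") as reader:
    for batch in reader:   # batches arrive as namedtuples of numpy arrays
        print(batch)       # stand-in for whatever we actually do with them
```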
Numpy-style arrays and pandas-style dataframes and more, but distributed. See Dask.
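Illustrative sketch: a chunked dask array looks like numpy but evaluates lazily across chunks (and potentially across workers).

```python
import dask.array as da

# A 10000 x 10000 array, held as 100 lazily-evaluated 1000 x 1000 chunks
x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))
m = x.mean(axis=0)     # builds a task graph, computes nothing yet
result = m.compute()   # triggers the (possibly distributed) computation
```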
Numpy-style arrays and pandas-style dataframes and more, but distributed. See zarr.
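A zarr sketch along the same lines (paths and sizes are illustrative): a chunked, compressed on-disk array that supports concurrent reads and writes per chunk.

```python
import numpy as np
import zarr

# Create a chunked, compressed on-disk array
z = zarr.open("data.zarr", mode="w", shape=(10_000, 10_000),
              chunks=(1_000, 1_000), dtype="f4")

z[0:1_000, :] = np.random.rand(1_000, 10_000)  # write one block of chunks
block = z[0:10, 0:10]                          # read back a small slice
```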
Google’s famous data format.
Recommended for tensorflow, although it’s soooooo boooooooring; if I am reading that page I am very far from what I love.
A data format rather than a storage solution per se (there is no assumption that it will be stored on the file system).
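For illustration (a hedged sketch, and using TensorFlow’s pre-baked `tf.train.Example` message rather than a custom `.proto` schema): serialise a record to protobuf bytes and stash it in a TFRecord file.

```python
import tensorflow as tf

# tf.train.Example is a ready-made protobuf message type,
# so no custom .proto compilation step is needed here
example = tf.train.Example(features=tf.train.Features(feature={
    "y": tf.train.Feature(float_list=tf.train.FloatList(value=[1.0, 4.0, 9.0])),
}))

serialised = example.SerializeToString()          # protobuf wire format (bytes)
with tf.io.TFRecordWriter("data.tfrecord") as w:  # one common place to put it
    w.write(serialised)
```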
Why not use Protocol Buffers, or…?
Protocol Buffers is indeed relatively similar to FlatBuffers, with the primary difference being that FlatBuffers does not need a parsing/ unpacking step to a secondary representation before you can access data, often coupled with per-object memory allocation. The code is an order of magnitude bigger, too. Protocol Buffers has neither optional text import/export nor schema language features like unions.
That sounds like an optimisation that I will not need.
Algorithms for compressing scientific data
Lossy compression of dense numerical data is weirdly fascinating. Numerical data has various special features: values are stored in floating point, which is wasteful and typically more precise than we need; there is usually much redundancy; and we need to load the data quickly, so we might naturally imagine it compresses well. But can we control the accuracy with which the data are compressed and stored, such that they are reliable to load and the errors introduced by lossy compression do not invalidate our inference? What that even means is context-dependent.
SZ3 and ZFP seem to be the leading contenders right now for flexible, reasonable data storage for science.
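For example, with the zfpy bindings to ZFP (a sketch, assuming they are installed; the tolerance value is arbitrary), the accuracy of the lossy compression can be controlled via an absolute error tolerance:

```python
import numpy as np
import zfpy

data = np.random.rand(256, 256, 256)

# Fixed-accuracy mode: the absolute error of each value is bounded by `tolerance`
compressed = zfpy.compress_numpy(data, tolerance=1e-4)
restored = zfpy.decompress_numpy(compressed)

assert np.max(np.abs(restored - data)) <= 1e-4
```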
sdrbench.github.io/ notionally benchmarks different methods, although I cannot make any sense of it myself. Where are the actual results?
In some weird sense of basic, insofar as a filesystem can be thought of as a sophisticated database of sorts.↩︎