Data storage formats


In what format do I send my data over the network or stash it on disk? Files full of stored data are the most basic (and, for non-enterprise uses, usually the most sensible) form of database.

Textual formats

For processing CSV, JSON and other text data.

A slow hot mess that has the virtue of many things claiming they use it, although they all use it badly, inconsistently and slowly. For reproducible projects, this is best parsed with pandas or tablib and thereafter ignored. For exploratory small data analysis though this is tolerable and common.
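A minimal sketch of why text formats need a parsing step at all, using only Python's standard-library `csv` and `io` modules. The column names and values are made up for illustration. Typed values written to CSV come back as strings, so every consumer has to re-guess the types, which is where the inconsistency creeps in.

```python
import csv
import io

# Hypothetical records with mixed types.
rows = [
    {"id": 1, "score": 0.5, "label": "a"},
    {"id": 2, "score": 1.5, "label": "b"},
]

# Write to an in-memory text buffer rather than a real file.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "score", "label"])
writer.writeheader()
writer.writerows(rows)

# Read it back: CSV carries no type information, so every value is a str.
buf.seek(0)
parsed = list(csv.DictReader(buf))

# The types have to be restored by hand (or guessed by a library).
restored = [
    {"id": int(r["id"]), "score": float(r["score"]), "label": r["label"]}
    for r in parsed
]
```

Doing this conversion once, at the boundary, and then working with typed data is roughly what "parse it and thereafter ignore it" amounts to.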

HDF5

HDF5: (Python/R/Java/C/Fortran/MATLAB/whatever) An optimised and flexible data format from the Big Science applications of the 90s. Inbuilt compression and indexing support for things too big for memory. A good default with wide support, although the table definition when writing data is booooring and it mangles text encodings if you aren’t careful, which ends up meaning a surprising amount of time gets lost to writing schemas.
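A minimal sketch of the compression and partial-read features, assuming the `h5py` and `numpy` packages are installed. The dataset name `measurements`, the array shape and the chunk size are arbitrary choices for illustration.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical data: a 1000 x 10 array of floats.
data = np.arange(10_000, dtype="float64").reshape(1000, 10)

path = os.path.join(tempfile.mkdtemp(), "example.h5")

# Write with chunking and gzip compression enabled.
with h5py.File(path, "w") as f:
    f.create_dataset(
        "measurements", data=data, chunks=(100, 10), compression="gzip"
    )

# Read back only rows 200..209; the rest of the dataset stays on disk.
with h5py.File(path, "r") as f:
    dset = f["measurements"]
    shape = dset.shape
    subset = dset[200:210, :]
```

Slicing the dataset object reads only the chunks that overlap the requested rows, which is what makes larger-than-memory files workable.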

useful tools:

arrow

arrow (Python/R/C++) is a fresh entrant designed to serialise the same types of data as HDF5, but be much simpler and faster at scale. See Wes McKinney’s pyarrow blog post. This is part of an ecosystem including more specific applications such as feather.

Dask

dask is a flexible parallel computing library for analytics.

howto

Protobuf

protobuf: (Python/R/Java/C/Go/whatever) Google’s famous data format. Recommended for tensorflow, although it’s soooooo boooooooring; if I am reading that page I am very far from what I love. A data format rather than a storage solution per se (there is no assumption that it will be stored on the file system).

You might also want everything not to be hard. Try prototool or buf.

flatbuffers

flatbuffers:

FlatBuffers is an efficient cross platform serialization library for C++, C#, C, Go, Java, JavaScript, PHP, and Python. It was originally created at Google for game development and other performance-critical applications…

Why not use Protocol Buffers, or…?

Protocol Buffers is indeed relatively similar to FlatBuffers, with the primary difference being that FlatBuffers does not need a parsing/ unpacking step to a secondary representation before you can access data, often coupled with per-object memory allocation. The code is an order of magnitude bigger, too. Protocol Buffers has neither optional text import/export nor schema language features like unions.

That sounds like an optimisation that I will not need.
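For the curious, the "no parsing/unpacking step" idea can be sketched with Python's standard-library `struct` module. This is not the FlatBuffers API or its wire format (real FlatBuffers uses schema-generated accessors and vtables); the fixed record layout below is a hypothetical stand-in that only shows what reading a field straight out of a buffer looks like.

```python
import struct

# Hypothetical fixed layout: uint32 id, float64 score, int16 flag
# (little-endian, no padding).
RECORD = struct.Struct("<Idh")

# Pack three records back to back into one contiguous buffer.
buf = b"".join(RECORD.pack(i, i * 0.5, -i) for i in range(3))


def read_score(buffer, index):
    """Read only the score of record `index`, straight from its byte offset."""
    offset = index * RECORD.size + 4  # skip earlier records, then the 4-byte id
    return struct.unpack_from("<d", buffer, offset)[0]


# No record objects are built; one field is decoded from the raw bytes.
score = read_score(memoryview(buf), 2)
```

The alternative, decoding every record into Python objects before touching any of them, is the per-object allocation cost the quoted passage refers to.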