Data storage formats

What format do I send my data over the network or stash it on disk in? Files full of stored data are the most basic form of database. Classic enterprise databases are optimised for things I don’t often need in data science research, such as structured records, high-write concurrency etc.

Textual formats

For CSV, JSON etc, see text data processing.

CSV is a slow hot mess whose main virtue is that many things claim to support it, although they all support it badly, inconsistently and slowly. For reproducible projects it is best parsed once with pandas or tablib and thereafter ignored. For exploratory small-data analysis, though, it is tolerable and common.
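If you must parse CSV reproducibly, the main defence is to stop pandas guessing: pin down the dtypes explicitly rather than trusting inference. A minimal sketch (the column names and values here are made up for illustration):

```python
import io

import pandas as pd

# An inline stand-in for a CSV file on disk.
raw = "id,score,label\n1,0.5,cat\n2,0.75,dog\n"

# Declaring dtypes up front makes the parse deterministic instead of
# depending on pandas' type inference, which can vary with the data.
df = pd.read_csv(
    io.StringIO(raw),
    dtype={"id": "int64", "score": "float64", "label": "string"},
)

print(df.dtypes["score"])  # float64
```

Parse it once like this, write it out to a saner format, and never look at the CSV again.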


HDF5: (Python/R/Java/C/Fortran/MATLAB/whatever) An optimised and flexible data format from the Big Science applications of the 90s, with inbuilt compression and indexing support for things too big for memory. A good default with wide support, and very simple for numerical data. The table definition when writing structured data is booooring, though, and it mangles text encodings if you aren’t careful, which means a surprising amount of time gets lost to writing schemas.

Useful tools: visualisers.
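The numerical case really is simple. A sketch using h5py (the dataset name and chunk shape are arbitrary choices, not anything HDF5 requires):

```python
import os
import tempfile

import h5py
import numpy as np

arr = np.arange(1000, dtype="float64").reshape(100, 10)
path = os.path.join(tempfile.mkdtemp(), "example.h5")

# Write a chunked, gzip-compressed dataset. HDF5's inbuilt compression
# and indexing mean a later slice read only touches the chunks it needs.
with h5py.File(path, "w") as f:
    f.create_dataset("measurements", data=arr, compression="gzip", chunks=(10, 10))

# Partial read: pull out a slice without loading the whole array.
with h5py.File(path, "r") as f:
    part = f["measurements"][:5, :]

print(part.shape)  # (5, 10)
```

It is the structured/text tables, not arrays like this, where the schema tedium bites.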


Apache Arrow (Python/R/C++/many more) (source) is a fresh entrant designed to serialise the same types of data as HDF5, but to be much simpler and faster at scale. See Wes McKinney’s pyarrow blog post, or alternatively Notes from a data witch: Getting started with Apache Arrow.


Dask: Python-specific. See Dask.


zarr: Python-specific. See zarr.


protobuf: (Python/R/Java/C++/Go/whatever) Google’s famous data format. Recommended for TensorFlow, although it’s soooooo boooooooring that if I am reading that page I am very far from what I love. It is a data format rather than a storage solution per se (there is no assumption that it will be stored on the file system).

You might also want everything not to be hard. Try prototool or buf
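For flavour, a protobuf schema looks like this (the `Measurement` message and its fields are a made-up example, not from any real project):

```protobuf
syntax = "proto3";

// A hypothetical record type; each field gets a stable numbered tag,
// which is what makes the wire format forward/backward compatible.
message Measurement {
  int64 id = 1;
  double value = 2;
  string label = 3;
}
```

You compile this with `protoc` into generated classes for your language of choice, which is exactly the boooooring part.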



FlatBuffers is an efficient cross platform serialization library for C++, C#, C, Go, Java, JavaScript, PHP, and Python. It was originally created at Google for game development and other performance-critical applications…

Why not use Protocol Buffers, or…?

Protocol Buffers is indeed relatively similar to FlatBuffers, with the primary difference being that FlatBuffers does not need a parsing/unpacking step to a secondary representation before you can access data, often coupled with per-object memory allocation. The code is an order of magnitude bigger, too. Protocol Buffers has neither optional text import/export nor schema language features like unions.

That sounds like an optimisation that I will not need.
