In what format do I send my data over the network, or stash it on disk? Files full of stored data are the most basic (and, for non-enterprise uses, usually the most sensible) form of database.
For CSV, JSON etc., see text data processing.
A slow hot mess that has the virtue of many things claiming to use it, although they all use it badly, inconsistently and slowly. For reproducible projects, it is best parsed with pandas or tablib and thereafter ignored. For exploratory small-data analysis, though, it is tolerable and common.
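The "parse once with pandas, then ignore" workflow can be sketched like this (column names and values are made up for illustration):

```python
import io
import pandas as pd

# Parse the CSV once into a DataFrame; from here on we work with the
# DataFrame and never touch the raw text again.
raw = io.StringIO("id,value\n1,3.5\n2,4.0\n")
df = pd.read_csv(raw)

print(df["value"].mean())  # 3.75
```

After this step the hot mess is pandas' problem, not yours.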
An optimised and flexible data format from the world of Big Science applications, with inbuilt compression and indexing support for things too big for memory.
A good default with wide support, although defining the table when writing data is booooring, and it mangles text encodings if you aren’t careful, which means a surprising amount of time gets lost to writing schemas.
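A minimal sketch of the schema-writing tedium, using h5py (the dataset names and file path are invented for illustration):

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "example.h5")

with h5py.File(path, "w") as f:
    # Chunked, gzip-compressed numeric arrays: the part HDF5 is good at.
    f.create_dataset("measurements", data=np.arange(10.0), compression="gzip")
    # Strings need an explicit dtype with an explicit encoding, or you risk
    # the text-mangling mentioned above.
    dt = h5py.string_dtype(encoding="utf-8")
    f.create_dataset("labels", data=["alpha", "beta"], dtype=dt)

with h5py.File(path, "r") as f:
    total = f["measurements"][...].sum()   # 45.0
    # String data reads back as bytes; decode it yourself.
    label = f["labels"][1].decode("utf-8")  # "beta"
```

Note that every dataset gets a name, a dtype and optional compression settings up front; that is where the schema-writing time goes.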
Arrow (Python/R/C++) is a fresh entrant designed to serialise the same types of data as HDF5, but to be much simpler and faster at scale. See Wes McKinney’s pyarrow blog post.
It is part of an ecosystem including more specific applications such as Feather.
Google’s famous data format.
Recommended for TensorFlow, although it’s soooooo boooooooring; if I am reading that page, I am very far from what I love.
A data format rather than a storage
solution per se (there is no assumption that it will be stored on the file system)
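For flavour, a hypothetical Protocol Buffers schema (the message and field names are invented); you declare this in a `.proto` file and compile it to code in your language of choice:

```protobuf
// A made-up schema sketch, proto3 syntax.
syntax = "proto3";

message Measurement {
  int64 id = 1;              // field numbers, not names, go on the wire
  double value = 2;
  repeated string tags = 3;  // "repeated" is how protobuf spells "list"
}
```

The upside of all this ceremony is a compact, versionable wire format; the downside is, well, the ceremony.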
Why not use Protocol Buffers, or…?
Protocol Buffers is indeed relatively similar to FlatBuffers, with the primary difference being that FlatBuffers does not need a parsing/unpacking step to a secondary representation before you can access data, often coupled with per-object memory allocation. The code is an order of magnitude bigger, too. Protocol Buffers has neither optional text import/export nor schema language features like unions.
That sounds like an optimisation that I will not need.