In what format do I send my data over the network, or stash it on disk? Files full of stored data are the most basic form of database. Classic enterprise databases are optimised for things I don’t often need in data science research, such as structured records, high-write concurrency and so on.
For CSV, JSON etc, see text data processing.
A slow hot mess whose chief virtue is that many things claim to support it, although they all support it badly, inconsistently and slowly. For reproducible projects it is best parsed once with pandas or tablib and thereafter ignored. For exploratory small-data analysis, though, it is tolerable and common.
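To make the inconsistency concrete: even Python’s own stdlib reader hands everything back as strings, so every consumer ends up reinventing type inference and null handling. A minimal sketch (the column names and values are invented for illustration):

```python
import csv
import io

# A CSV payload as it might arrive from some other tool.
raw = "id,score,label\n1,0.5,cat\n2,,dog\n"

rows = list(csv.DictReader(io.StringIO(raw)))

# Every field comes back as a string, and missing values are empty
# strings, so types and nulls are the caller's problem.
print(rows[0]["score"])  # '0.5', not the float 0.5
print(rows[1]["score"])  # '' rather than None or NaN
```

pandas papers over this with its own type inference, which is convenient but one more reason two tools rarely agree on what a given CSV contains.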
An optimised and flexible data format from the Big Science applications of the 90s.
Inbuilt compression and indexing support for things too big for memory.
A good default with wide support; for plain numerical data it is very simple to use.
Defining tables when writing structured data is booooring, and HDF5 mangles text encodings if you aren’t careful, which means a surprising amount of time gets lost to writing schemas.
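For the plain numerical case it really is simple, though. A minimal sketch, assuming the h5py bindings (the file and dataset names are arbitrary):

```python
import h5py
import numpy as np

# Write a chunked, gzip-compressed array. The chunk is the unit of
# I/O and compression, which is what lets HDF5 handle arrays bigger
# than memory.
with h5py.File("example.h5", "w") as f:
    f.create_dataset(
        "measurements",
        data=np.random.rand(1000, 100),
        chunks=(100, 100),
        compression="gzip",
    )

# Read back a slice without loading the whole array into memory.
with h5py.File("example.h5", "r") as f:
    block = f["measurements"][:10, :5]
    print(block.shape)  # (10, 5)
```

The pain starts when the data is not a homogeneous array: compound dtypes and string encodings are where the schema-writing time goes.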
Useful tools: visualisers.
Python-specific. See Dask.
Python-specific. See zarr.
Google’s famous data format.
Recommended for TensorFlow, although it’s soooooo boooooooring that if I am reading that page I am very far from what I love.
A data format rather than a storage solution per se (there is no assumption that it will be stored on the file system).
Why not use Protocol Buffers, or…?
Protocol Buffers is indeed relatively similar to FlatBuffers, with the primary difference being that FlatBuffers does not need a parsing/unpacking step to a secondary representation before you can access data, often coupled with per-object memory allocation. The code is an order of magnitude bigger, too. Protocol Buffers has neither optional text import/export nor schema language features like unions.
That sounds like an optimisation that I will not need.
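Still, the “no unpacking step” idea is easy to make concrete with stdlib tools. This is a toy analogue using a hand-rolled fixed binary layout, not the real FlatBuffers wire format or its generated accessors:

```python
import struct

# A record with a fixed layout: int32 at offset 0, float32 at offset 4,
# then 8 bytes of text. "<" means little-endian with no padding.
record = struct.pack("<if8s", 42, 2.5, b"example\x00")

def field_id(buf):
    # Read the int32 in place; nothing else in the buffer is touched.
    return struct.unpack_from("<i", buf, 0)[0]

def field_score(buf):
    return struct.unpack_from("<f", buf, 4)[0]

print(field_id(record))     # 42
print(field_score(record))  # 2.5
```

Protocol Buffers would instead decode the whole message into a freshly allocated object before either field is readable; FlatBuffers’ pitch is accessing fields in the received buffer directly, as above.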