In what format do I send my data over the network, or stash it on disk? Files full of stored data are the most basic form of database. Classic enterprise databases are optimised for things I don’t often need in data science research, such as structured records and high write concurrency.
Textual formats
For CSV, JSON etc, see text data processing.
A slow hot mess whose chief virtue is that many things claim to support it, although they all do so badly, inconsistently and slowly. For reproducible projects it is best parsed with pandas or tablib and thereafter ignored. For exploratory small-data analysis, though, it is tolerable and common.
HDF5
So important I made a new notebook. See HDF5.
Arrow
Apache Arrow (Python/R/C++/many more) (source) is a fresh entrant designed to serialise the same kinds of data as HDF5, but to be both simpler for the user and faster at scale. See Wes McKinney’s pyarrow blog post or Notes from a data witch: Getting started with Apache Arrow.
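To give a flavour of the Python API, here is a minimal sketch (filenames hypothetical) that round-trips a table through Parquet via pyarrow:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Build an in-memory Arrow table from plain Python columns.
table = pa.table({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})

# Write it out as Parquet and read it back (hypothetical filename).
pq.write_table(table, "example.parquet")
roundtrip = pq.read_table("example.parquet")
print(roundtrip.to_pandas())
```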
Polars is a blazingly fast DataFrames library implemented in Rust using Apache Arrow Columnar Format as the memory model.
- Lazy | eager execution
- Multi-threaded
- SIMD
- Query optimization
- Powerful expression API
- Hybrid Streaming (larger than RAM datasets)
- Rust | Python | NodeJS | …
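A minimal sketch of the lazy API, assuming a hypothetical example.parquet on disk; the query only runs when collect() is called:

```python
import polars as pl

# Lazily scan a (hypothetical) Parquet file; nothing is read yet.
query = (
    pl.scan_parquet("example.parquet")
    .filter(pl.col("value") > 0.1)
    .select(pl.col("value").sum())
)

# Query optimisation and execution happen only at collect().
print(query.collect())
```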
Pandas
pandas can use various data formats for backend access, e.g. HDF5, CSV, JSON, SQL, Excel, Parquet.
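For instance, a minimal sketch writing the same frame to a few of those backends (filenames hypothetical; Parquet needs pyarrow or fastparquet installed, HDF5 needs PyTables):

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})

df.to_csv("example.csv", index=False)
df.to_parquet("example.parquet")   # needs pyarrow or fastparquet
df.to_hdf("example.h5", key="df")  # needs PyTables

back = pd.read_parquet("example.parquet")
print(back.equals(df))
```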
Petastorm
Is this a data file format or some kind of database? Possibly both?
In this article, we describe Petastorm, an open source data access library developed at Uber ATG. This library enables single machine or distributed training and evaluation of deep learning models directly from multi-terabyte datasets in Apache Parquet format. Petastorm supports popular Python-based machine learning (ML) frameworks such as Tensorflow, Pytorch, and PySpark. It can also be used from pure Python code.
It sounds like a nice backend built upon pyarrow with special support for Spark and ML formats. It seems easy to use from Python, but probably less pleasant for non-Python users; still possible, just without so many luxurious supporting libraries.
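A minimal reading sketch, assuming a Petastorm-materialised dataset already exists at the (hypothetical) URL below:

```python
from petastorm import make_reader

# Stream rows out of a (hypothetical) Petastorm dataset;
# make_batch_reader is the analogous entry point for plain Parquet.
with make_reader("file:///tmp/example_dataset") as reader:
    for row in reader:
        print(row)
        break
```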
Dask
Numpy-style array, but distributed. See Dask
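A minimal sketch of the array interface; the task graph is only evaluated at compute():

```python
import dask.array as da

# A chunked, lazily evaluated array spread over many numpy blocks.
x = da.random.random((20_000, 20_000), chunks=(2_000, 2_000))

# Nothing has actually been computed until this point.
print(x.mean().compute())
```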
zarr
Numpy-style array, but distributed. See zarr
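A minimal sketch of a chunked on-disk array (the path is hypothetical):

```python
import numpy as np
import zarr

# Create a chunked, compressed array backed by a directory store.
z = zarr.open("example.zarr", mode="w",
              shape=(10_000, 10_000), chunks=(1_000, 1_000), dtype="f4")
z[0, :] = np.arange(10_000, dtype="f4")

# Reading a slice only touches the chunks it overlaps.
print(z[0, :5])
```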
Protobuf
protobuf
: (Python/R/Java/C/Go/whatever)
Google’s famous data format. Recommended for tensorflow, although it’s soooooo boooooooring; if I am reading that page I am very far from what I love. A data format rather than a storage solution per se (there is no assumption that it will be stored on the file system). You might also want everything not to be hard: try prototool or buf.
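A minimal sketch of the usual workflow, assuming a hypothetical example.proto defining a Reading message, compiled with protoc --python_out=. to produce example_pb2:

```python
# Hypothetical generated module from: protoc --python_out=. example.proto
import example_pb2

msg = example_pb2.Reading(sensor_id="a1", value=3.2)
payload = msg.SerializeToString()  # compact binary wire format

decoded = example_pb2.Reading()
decoded.ParseFromString(payload)
print(decoded.sensor_id, decoded.value)
```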
flatbuffers
FlatBuffers is an efficient cross platform serialization library for C++, C#, C, Go, Java, JavaScript, PHP, and Python. It was originally created at Google for game development and other performance-critical applications…
Why not use Protocol Buffers, or…?
Protocol Buffers is indeed relatively similar to FlatBuffers, with the primary difference being that FlatBuffers does not need a parsing/unpacking step to a secondary representation before you can access data, often coupled with per-object memory allocation. The code is an order of magnitude bigger, too. Protocol Buffers has neither optional text import/export nor schema language features like unions.
That sounds like an optimisation that I will not need.
Compressing scientific data
Lossy compression of dense numerical data is weirdly fascinating.
SZ3 and ZFP seem like competing contenders right now.
- SZ Lossy Compression
- szcompressor/SZ3
- Welcome to H5Z-ZFP
- LLNL/zfp: Compressed numerical arrays that support high-speed random access
- zfp
- Compression Modes
sdrbench.github.io notionally benchmarks different methods, although I cannot make any sense of that myself.
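To get a feel for the error-bounded flavour of this, here is a minimal sketch using zfpy, the Python bindings that ship with LLNL’s zfp, in fixed-accuracy mode:

```python
import numpy as np
import zfpy  # Python bindings distributed with LLNL's zfp

data = np.random.rand(256, 256)

# Fixed-accuracy mode: pointwise absolute error bounded by `tolerance`.
compressed = zfpy.compress_numpy(data, tolerance=1e-4)
restored = zfpy.decompress_numpy(compressed)

print(len(compressed), np.max(np.abs(restored - data)))
```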