I do a lot of data processing, and not so much running of websites and such, which is not the typical target workload for a database. Here are the stores I find most convenient for my needs, working at a particular, sub-Google scale, where my datasets are a few gigabytes but never a few terabytes.
- Pre-sharded or non-concurrent-write datasets too big for RAM:
hdf5. Annoying schema definition needed, but once you’ve done that it’s fast for numerical data.
- Pre-sharded or non-concurrent-write datasets:
sqlite. Has great tooling, a pity that it’s wasteful for numerical data.
- Pre-sharded or non-concurrent-write datasets shared between python and R: feather, from the arrow ecosystem.
- Concurrent but not incessant writes (i.e. I just want to manage my processes):
- Sometimes I just want to quickly browse a database to see what is in it. See DB UIs.
- Concurrent frequent writes: redis (quick sketch just below this list).
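A minimal sketch of what that looks like with the redis-py client, assuming a redis server is already running on localhost; the key names are made up:

```python
# Sketch: redis for many small concurrent writes, e.g. bookkeeping for worker
# processes. Assumes a local redis server; key names are placeholders.
import redis

r = redis.Redis()
r.incr("jobs:completed")  # atomic counter, safe across many writers
r.hset("jobs:42", mapping={"status": "done", "loss": 0.17})
print(r.hgetall("jobs:42"))
```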
Or maybe I could pipeline my python analysis using blaze?
OK, full notes now:
With a focus on slightly specialised data stores for use in my statistical jiggery-pokery. Which is to say: I care about analysing lots of data fast. This is probably inimical to running, e.g., your webapp from the same store, which has different requirements (massively concurrent writes, consistency guarantees, many small queries instead of a few large ones). Don’t ask me about that.
I prefer to avoid running a database server at all if I can, at least in the sense of a highly specialised multi-client server process; those are not optimised for a typical scientific workflow. First stop is in-process, non-concurrent-write data storage, e.g. HDF5 or sqlite.
However, if you want to mediate between lots of threads/processes/machines updating your data in parallel, a “real” database server can be justified.
OTOH if your data is big enough, perhaps you need a crazy giant distributed store of some kind? Requirements change vastly depending on your scale.
Unless my data is enormous, or I need to write to it concurrently, this is what I want, because
- no special server process is required and
- migrating data is just copying a file
An optimised and flexible data format from the world of Big Science applications.
Inbuilt compression and indexing support for things too big for memory.
A good default with wide support, although the table definition needed when writing data is booooring, and it mangles text encodings if you aren’t careful, which ends up meaning a surprising amount of time gets lost to writing schemas.
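For the simple array case, a minimal sketch with the h5py bindings, showing the chunked, compressed storage and partial reads mentioned above; the file and dataset names are placeholders:

```python
# Sketch: write a compressed, chunked array with h5py, then read back a slice
# without loading the whole dataset into RAM. File/dataset names are placeholders.
import h5py
import numpy as np

data = np.random.standard_normal((100_000, 8))

with h5py.File("measurements.h5", "w") as f:
    f.create_dataset("x", data=data, compression="gzip", chunks=True)

with h5py.File("measurements.h5", "r") as f:
    head = f["x"][:1000]  # reads only this slice from disk
```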
arrow (Python/R/C++) is a fresh entrant designed to serialise
the same types of data as HDF5, but be much simpler and faster at scale.
See Wes McKinney’s pyarrow blog post.
This is part of an ecosystem including more specific applications such as feather.
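As a sketch of the Python/R interchange case, feather round-trips a pandas frame to a file that R’s arrow package can also read; the file name is a placeholder:

```python
# Sketch: round-trip a pandas DataFrame through feather, arrow's on-disk
# data-frame format. The file name is a placeholder.
import pandas as pd
import pyarrow.feather as feather

df = pd.DataFrame({"id": range(5), "value": [0.1, 0.2, 0.3, 0.4, 0.5]})
feather.write_feather(df, "frame.feather")  # readable from R via arrow::read_feather()
back = feather.read_feather("frame.feather")
```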
dask is “a flexible parallel computing library for analytics” that mimics the numpy and pandas interfaces.
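A rough sketch of that pandas-flavoured, out-of-core style; the file pattern and column names here are invented:

```python
# Sketch: lazy, partitioned computation over many CSV files with dask.
# Nothing is read or computed until .compute() is called.
import dask.dataframe as dd

df = dd.read_csv("logs-2024-*.csv")  # placeholder file pattern
summary = df.groupby("user_id")["latency"].mean().compute()
```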
Google’s famous data format.
Recommended for tensorflow, although it’s soooooo boooooooring; if I am reading that page I am very far from what I love.
A data format rather than a storage solution per se (there is no assumption that this will be stored on the file system).
You might also want everything not to be hard.
Why not use Protocol Buffers, or…?
Protocol Buffers is indeed relatively similar to FlatBuffers, with the primary difference being that FlatBuffers does not need a parsing/unpacking step to a secondary representation before you can access data, often coupled with per-object memory allocation. The code is an order of magnitude bigger, too. Protocol Buffers has neither optional text import/export nor schema language features like unions.
A slow hot mess that has the virtue of many things claiming they use it, although they all use it badly, inconsistently and slowly. For reproducible projects, this is best parsed with pandas or tablib and thereafter ignored. For exploratory small-data analysis, though, this is tolerable and common. See [text data processing](./text_data_processing.html).
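In the spirit of parse-once-then-ignore, a sketch of pinning down dtypes and encoding at the single pandas parse, then parking the result in a faster format; column names, dtypes and paths are placeholders, and to_parquet needs pyarrow or fastparquet installed:

```python
# Sketch: parse the CSV exactly once with explicit dtypes and encoding so pandas
# doesn't guess, then store it in a faster format for the rest of the project.
import pandas as pd

df = pd.read_csv(
    "survey.csv",
    encoding="utf-8",
    dtype={"site": "category", "count": "int64"},
    parse_dates=["timestamp"],
)
df.to_parquet("survey.parquet")  # later loads skip CSV parsing entirely
```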
Array stores that are not filesystem stores
Time series/Event crunching/Streaming
Want to handle floppy ill-defined documents of ill-specified possibly changing metadata? Already resigned to the process of querying and processing this stuff being depressingly slow and/or storage-greedy?
You’re looking for document stores!
If you are looking at document stores as your primary workhorse, as opposed to something you want to get data out of for other storage, then you have
- Not much data so performance is no problem, or
- a problem, or
- a big engineering team.
Let’s assume number 1, which is common.
Mongodb has a pleasant JS api but is not all that
good at concurrent storage, so why are you bothering to do this in a
document store? If your data is effectively single-writer you could just be
doing this from the filesystem. Still I can imagine scenarios where the
dynamic indexing of post hoc metadata is nice, for example in the
exploratory phase with a data subset?
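For that exploratory scenario, a sketch with pymongo, assuming a local mongod and made-up field names, showing schemaless inserts plus an index created after the fact:

```python
# Sketch: dump loosely structured run metadata into mongodb, then index
# whichever field turns out to matter. Database/collection names are placeholders.
from pymongo import MongoClient

runs = MongoClient()["scratch"]["runs"]
runs.insert_one({"run_id": 7, "config": {"lr": 0.01}, "notes": "first pass"})
runs.create_index("config.lr")  # post hoc index on dynamic metadata
hits = list(runs.find({"config.lr": {"$lt": 0.1}}))
```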
Couchdb was the poster child of an earlier crop of non-SQL databases, but seems to be unfashionable now.
kinto “is a lightweight JSON
storage service with synchronisation and sharing abilities. It is meant to
be easy to use and easy to self-host. Supports fine permissions, easy
host-proof encryption, automatic versioning for device sync.”
So this is probably for the smartphone app version.
lmdb looks interesting if you want
a simple store that just guarantees
you can write to it without corrupting data, and without requiring a custom
server process. Most efficient for small records (under 2 KB).
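A minimal sketch with the py-lmdb bindings; the path and map_size are placeholders, and keys and values are plain bytes:

```python
# Sketch: lmdb as a transactional key-value file with no server process.
import lmdb

env = lmdb.open("cache.lmdb", map_size=2**30)  # reserve up to 1 GiB
with env.begin(write=True) as txn:
    txn.put(b"model:7", b"serialised-params")
with env.begin() as txn:
    params = txn.get(b"model:7")
```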
- UNSTRUCTURED DATA
- QMiner provides support for unstructured data, such as text and social networks across the entire processing pipeline, from feature engineering and indexing to aggregation and machine learning.
- QMiner provides out-of-the-box support for indexing, querying and aggregating structured, unstructured and geospatial data using a simple query language.
- C++ LIBRARY
- QMiner is implemented in C++ and can be included as a library into custom C++ projects, thus providing them with stream processing and data analytics capabilities.
Long lists of numbers? Spreadsheet-like tables?
Wish to do queries mostly of the sort supported by database engines,
such as grouping, sorting and range queries?
Sqlite if it fits in memory.
(No need to click on that link though,
sqlite is already embedded in your tool of choice.)
🏗 how to write safely to sqlite from multiple processes through write locks.
Also: Mark Litwintschik’s
Minimalist Guide to SQLite.
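Pending proper notes on that, a sketch of the usual mitigations for multi-process writes, using only the standard-library sqlite3 module: WAL journalling plus a generous busy timeout. Table and path names are placeholders.

```python
# Sketch: sqlite with WAL mode and a busy timeout, which makes concurrent
# writers from several processes much less fragile.
import sqlite3

conn = sqlite3.connect("results.db", timeout=30)  # wait up to 30 s on a locked db
conn.execute("PRAGMA journal_mode=WAL;")          # readers no longer block the writer
conn.execute("CREATE TABLE IF NOT EXISTS results (run INTEGER, loss REAL)")
with conn:  # the context manager commits (or rolls back) the transaction
    conn.executemany("INSERT INTO results VALUES (?, ?)", [(1, 0.23), (2, 0.21)])
```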
If not, or if you need to handle concurrent writing by multiple processes, MySQL or Postgres. Not because they are best for this job, but because they are common. Honestly, though, unless this is a live production service for many users, you should probably be using a simple file-backed store such as sqlite or HDF5 instead.
Clickhouse for example is a columnar database that avoids some of the problems of row-oriented tabular databases. I guess you could try that? And Amazon Athena turns arbitrary data into SQL-queryable data, apparently. So the skills here are general.
Accessing RDBMSs from python
Maybe you can make numerical work easier using Blaze:
Blaze translates a subset of modified NumPy and Pandas-like syntax to databases and other computing systems. Blaze allows Python users a familiar interface to query data living in other data storage systems.
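Failing that, the boring route is pandas plus SQLAlchemy, letting the database do the aggregation and pulling only the result into Python; the connection URL, table and columns below are placeholders:

```python
# Sketch: push the grouping into the database and read back only the aggregate.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://localhost/experiments")  # placeholder URL
df = pd.read_sql(
    "SELECT subject, AVG(score) AS mean_score FROM trials GROUP BY subject",
    engine,
)
```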
Ever since Google, every CS graduate wants to write one of these. There are dozens of options; you probably need none of them.
- Hbase for Hadoop (original hip open source one, no longer hip)
- Hypertable, Baidu’s open competitor to Google’s internal database
[…] is a networking and distributed transaction layer built atop SQLite, the fastest, most reliable, and most widely distributed database in the world.
Bedrock is written for modern hardware with large SSD-backed RAID drives and generous RAM file caches, and thereby doesn’t mess with the zillion hacky tricks the other databases do to eke out high performance on largely obsolete hardware. This results in fewer esoteric knobs, and sane defaults that “just work”.
Build flexible, distributed systems that can leverage the entire history of your critical data, not just the most current state. Build them on your existing infrastructure or jump straight to the cloud.
See parallel computing.
See python caches for the practicalities of doing this for one particular language.
Graph-tuple oriented processing.
GE is also a flexible computation engine powered by declarative message passing. GE is for you, if you are building a system that needs to perform fine-grained user-specified server-side computation.
From the perspective of graph computation, GE is not a graph system specifically optimized for a certain graph operation. Instead, with its built-in data and computation modeling capability, we can develop graph computation modules with ease. In other words, GE can easily morph into a system supporting a specific graph computation.
Nebula Graph is an open-source graph database capable of hosting super large scale graphs with dozens of billions of vertices (nodes) and trillions of edges, with milliseconds of latency.