Databases for realtime stuff

Databases at the intersection of storing data and processing streams, e.g. for time series forecasting and realtime analytics.

  • RRD is not flashy or new but is remarkably capable. It was designed for network and computer monitoring, but basically any real-time data that suits a ring buffer (i.e. has a fixed maximum history size of interest) can be stored and processed. I like the utilitarian aesthetic.

    What data can be put into an RRD?

    You name it, it will probably fit as long as it is some sort of time-series data. This means you have to be able to measure some value at several points in time and provide this information to RRDtool. If you can do this, RRDtool will be able to store it. The values must be numerical but don’t have to be integers, as is the case with MRTG (the next section will give more details on this more specialized application).

  • Redis is adept at heavy write transactions. You can just run it without setting up a special dedicated server. Convenient for things that are just big enough to fit in memory but that you need to process the shit out of, fast. Easy setup, built-in Lua interpreter, hip so widely compatible.

  • Influxdb is a database designed to query time-series live, by current time, relative age and so on. The sort of thing designed to run the kind of elaborate real time situation visualisation that evil overlords have in holographic displays in their lairs. Comes with free count aggregation and lite visualisations. Haven’t used it, just noting it here, will return if I need a dashboard of malevolence for my headquarters.

  • druid, as used by airbnb, is “a high-performance, column-oriented, distributed data store” that happens to be good at events also.

  • prometheus

    Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud. Prometheus’s main features are:

    • a multi-dimensional data model with time series data identified by metric name and key/value pairs
    • PromQL, a flexible query language to leverage this dimensionality
    • no reliance on distributed storage; single server nodes are autonomous
    • time series collection happens via a pull model over HTTP
    • pushing time series is supported via an intermediary gateway
    • targets are discovered via service discovery or static configuration
    • multiple modes of graphing and dashboarding support

  • timescaledb is a realtime/time series extension to postgres.

  • Heroic, by Spotify

    Heroic is our in-house time series database. We built it to address the challenges we were facing with near real-time data collection and presentation at scale. At the core are two key pieces of technology: Cassandra and Elasticsearch. Cassandra acts as the primary means of storage with Elasticsearch being used to index all data. We currently operate over 200 Cassandra nodes in several clusters across the world serving over 50 million distinct time series.

  • rethinkdb is a database that pushes changes to clients instead of being polled. Recently open-sourced, very fancy pedigree, haven’t used it.

  • qminer

    QMiner provides support for unstructured data, such as text and social networks across the entire processing pipeline, from feature engineering and indexing to aggregation and machine learning.
    QMiner provides out-of-the-box support for indexing, querying and aggregating structured, unstructured and geospatial data using a simple query language.
    QMiner applications are implemented in JavaScript, making it easy to get started. Using the Javascript API it is easy to compose complete data processing pipelines and integrate with other systems via RESTful web services.
    QMiner is implemented in C++ and can be included as a library into custom C++ projects, thus providing them with stream processing and data analytics capabilities.
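To make the ring-buffer idea from the RRD entry concrete, here is a minimal sketch in Python. It is not RRDtool's file format or API; the names `RingSeries`, `update` and `average` are made up for illustration, and the only thing it demonstrates is a store with a fixed maximum history where each new sample evicts the oldest one.

```python
from collections import deque


class RingSeries:
    """Fixed-size time series: once full, each new sample evicts the oldest.

    Hypothetical illustration only; not how RRDtool is implemented.
    """

    def __init__(self, max_samples):
        # deque with maxlen silently drops items from the left when full
        self.samples = deque(maxlen=max_samples)

    def update(self, timestamp, value):
        # Values must be numerical but need not be integers
        self.samples.append((timestamp, float(value)))

    def average(self):
        # Simple aggregate over whatever history is still retained
        if not self.samples:
            return None
        return sum(v for _, v in self.samples) / len(self.samples)


series = RingSeries(max_samples=3)
for t, v in enumerate([10, 20, 30, 40]):
    series.update(t, v)

print(list(series.samples))  # the t=0 sample has been evicted
print(series.average())      # average of the three retained values
```

Storage stays constant no matter how long the process runs, which is the property that makes this kind of store attractive for monitoring.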

Document stores

Want to handle floppy ill-defined documents of ill-specified possibly changing metadata? Already resigned to the process of querying and processing this stuff being depressingly slow and/or storage-greedy?

You’re looking for document stores!

If you are looking at document stores as your primary workhorse, as opposed to something you want to get data out of for other storage, then you have either

  • not much data, so performance is no problem, or

  • a problem.

Let’s assume the first case, which is common.

  • Mongodb has a pleasant JS API but is not all that good at concurrent storage, so why are you bothering to do this in a document store? If your data is effectively single-writer you could just be doing this from the filesystem. Still, I can imagine scenarios where the dynamic indexing of post hoc metadata is nice, for example in the exploratory phase with a data subset.

  • Couchdb was the poster child of the current crop of non-SQL databases, but seems to have become unfashionable.

  • kinto “is a lightweight JSON storage service with synchronisation and sharing abilities. It is meant to be easy to use and easy to self-host. Supports fine permissions, easy host-proof encryption, automatic versioning for device sync.”

    So this is probably the one to use if you are building a smartphone app.

  • lmdb looks interesting if you want a simple store that just guarantees you can write to it without corrupting data, and without requiring a custom server process. Most efficient for small records (2K).
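As a rough illustration of what "dynamic indexing of post hoc metadata" means in the MongoDB entry, here is a toy in-memory sketch in Python. It is not how MongoDB or any of the stores above work internally; `TinyDocStore` and its methods are hypothetical names. It assumes document ids are never reused and that indexed values are hashable.

```python
from collections import defaultdict


class TinyDocStore:
    """In-memory store for schemaless dict documents (illustrative only)."""

    def __init__(self):
        self.docs = {}
        self.indexes = {}  # field name -> {value -> set of doc ids}

    def insert(self, doc_id, doc):
        # Assumes doc_id is new; overwriting would leave stale index entries
        self.docs[doc_id] = doc
        # Keep any already-created indexes up to date
        for field, index in self.indexes.items():
            if field in doc:
                index[doc[field]].add(doc_id)

    def create_index(self, field):
        """Build an index after the fact over documents that have the field."""
        index = defaultdict(set)
        for doc_id, doc in self.docs.items():
            if field in doc:
                index[doc[field]].add(doc_id)
        self.indexes[field] = index

    def find(self, field, value):
        if field in self.indexes:
            ids = self.indexes[field].get(value, set())
        else:
            # No index on this field: fall back to a full scan
            ids = {i for i, d in self.docs.items()
                   if field in d and d[field] == value}
        return sorted(ids)


store = TinyDocStore()
store.insert(1, {"title": "notes", "tag": "draft"})
store.insert(2, {"title": "paper", "year": 2016})
store.create_index("tag")  # decided later that "tag" is worth indexing
store.insert(3, {"tag": "draft", "author": "anon"})

print(store.find("tag", "draft"))  # answered from the index
print(store.find("year", 2016))    # answered by scanning every document
```

Documents need not share fields, and you can decide which fields deserve an index only after seeing the data. The unindexed path scans everything, which is where the "depressingly slow" part comes from.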