Databases

structured data storage for completists and “data science”

tl;dr

I do a lot of data processing, and not so much running of websites and such. This is not the typical target workflow for a database. But here are the most convenient options for my needs: working at a particular, sub-Google scale, where my datasets are a few gigabytes but never a few terabytes.

  • Pre-sharded or non-concurrent-write datasets too big for RAM: hdf5. Annoying schema definition needed, but once you’ve done that it’s fast for numerical data.

  • Pre-sharded or non-concurrent-write datasets: sqlite. Has great tooling, a pity that it’s wasteful for numerical data.

  • Pre-sharded or non-concurrent-write datasets shared between python and R: feather. (A quick sketch of these first three options follows this list.)

  • Concurrent but not incessant writes (i.e. I just want to manage my processes): Maybe dogpile.cache or joblib?

  • Concurrent frequent writes: redis.
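
A minimal pandas sketch of those first three options, for concreteness (the file names are invented; to_hdf needs the tables package installed and to_feather needs pyarrow):

    import sqlite3

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"t": np.arange(1_000_000), "x": np.random.randn(1_000_000)})

    # HDF5: compact and fast for numerical columns (requires the `tables` package)
    df.to_hdf("experiment.h5", key="run1", complevel=5)

    # sqlite: great tooling, but less compact for purely numerical data
    with sqlite3.connect("experiment.db") as conn:
        df.to_sql("run1", conn, if_exists="replace", index=False)

    # feather: painless interchange between python and R (requires `pyarrow`)
    df.to_feather("run1.feather")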

Or maybe I could pipeline my python analysis using blaze?

OK, full notes now:

With a focus on slightly specialised data stores for use in my statistical jiggery-pokery. Which is to say: I care about analysing lots of data fast. This is probably inimical to running, e.g., your webapp from the same store, which has different requirements (massively concurrent writes, consistency guarantees, many small queries instead of a few large ones). Don’t ask me about that.

I prefer to avoid running a database server at all if I can; at least in the sense of a highly specialised multi-client server process. Those are not optimised for a typical scientific workflow. First stop is in-process, non-concurrent-write data storage, e.g. HDF5 or sqlite.

However, if you want to mediate between lots of threads/processes/machines updating your data in parallel, a “real” database server can be justified.

OTOH if your data is big enough, perhaps you need a crazy giant distributed store of some kind? Requirements change vastly depending on your scale.

Filesystem stores

Unless your data is very very big, this is what you want.

  • HDF5: (Python/R/Java/C/Fortran/MATLAB/whatever) An optimised and flexible data format from the Big Science applications of the 90s. Inbuilt compression and indexing support for things too big for memory. A good default with wide support, although the table definition when writing data is booooring (there’s a sketch of that boilerplate at the end of this list) and it mangles text encodings if you aren’t careful.

  • arrow (Python/R/C++) is a fresh entrant designed to serialise the same types of data as HDF5, but to be much simpler and faster at scale. See Wes McKinney’s pyarrow blog post.

  • dask

    is “a flexible parallel computing library for analytics”. Handy in this context because it can schedule chunked computation over on-disk arrays (e.g. HDF5-backed) and dataframes too big for memory.

  • protobuf: (Python/R/Java/C/Go/whatever) Google’s famous data format. Recommended for tensorflow although it’s soooooo boooooooring. A data format rather than a storage solution per se, so if you end up using it you’ve still only solved part of the problem and you need a db or caching layer, or at least a file naming and locking convention.

    Bonus wrinkle: the default pure-python implementation is slow - for speed you need the C++-backed version of google’s protobuf package.

    You might also want everything not to be hard. Try prototool.

  • flatbuffers:

    FlatBuffers is an efficient cross platform serialization library for C++, C#, C, Go, Java, JavaScript, PHP, and Python. It was originally created at Google for game development and other performance-critical applications…

    Why not use Protocol Buffers, or .. ?

    Protocol Buffers is indeed relatively similar to FlatBuffers, with the primary difference being that FlatBuffers does not need a parsing/unpacking step to a secondary representation before you can access data, often coupled with per-object memory allocation. The code is an order of magnitude bigger, too. Protocol Buffers has neither optional text import/export nor schema language features like unions.

  • CSV:

    A slow hot mess that has the virtue of many things claiming they use it, although they all use it badly, inconsistently and slowly. Best parsed with pandas or tablib and thereafter ignored.
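
To give a flavour of the HDF5 schema boilerplate grumbled about above, a minimal h5py sketch (the dataset name, shapes and dtypes are invented; pandas’ HDFStore hides most of this if your data is tabular):

    import h5py
    import numpy as np

    # You commit to shapes, dtypes, chunking and compression up front - the boring part -
    # but afterwards reading and writing slices of an on-disk array is fast.
    with h5py.File("sensors.h5", "w") as f:
        readings = f.create_dataset(
            "readings",
            shape=(10_000_000, 8),
            dtype="float32",
            chunks=(100_000, 8),      # chunked, so slices load without reading everything
            compression="gzip",
        )
        readings[:100_000, :] = np.random.randn(100_000, 8)  # write one chunk
        readings.attrs["units"] = "volts"                     # attach metadata

    with h5py.File("sensors.h5", "r") as f:
        block = f["readings"][200:300, :]  # read a slice, not the whole array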

Array stores that are not filesystem stores

TileDB.

Time series/Event crunching/Streaming

Databases at the intersection of storing data and processing streams, for e.g. time-series forecasting and real-time analytics.

  • Redis is adept at heavy write traffic. You can just run it without setting up a special dedicated server installation. Convenient for things that just fit in memory but which you need to process the shit out of, fast. Easy setup, built-in lua interpreter, and hip, so widely supported. (A minimal sketch appears at the end of this section.)

  • Influxdb is a database designed to query time-series live, by current time, relative age and so on. The sort of thing designed to run the kind of elaborate real time situation visualisation that evil overlords have in holographic displays in their lairs. Comes with free count aggregation and lite visualisations. Haven’t used it, just noting it here, will return if I need a dashboard of malevolence for my headquarters.

  • druid, as used by airbnb, is “a high-performance, column-oriented, distributed data store” that happens to be good at events also.

  • prometheus

    Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud. Prometheus’s main features are:

    • a multi-dimensional data model with time series data identified by metric name and key/value pairs
    • PromQL, a flexible query language to leverage this dimensionality
    • no reliance on distributed storage; single server nodes are autonomous
    • time series collection happens via a pull model over HTTP
    • pushing time series is supported via an intermediary gateway
    • targets are discovered via service discovery or static configuration
    • multiple modes of graphing and dashboarding support
  • timescaledb is a realtime/time series extension to postgres.

  • Heroic, by Spotify

    Heroic is our in-house time series database. We built it to address the challenges we were facing with near real-time data collection and presentation at scale. At the core are two key pieces of technology: Cassandra and Elasticsearch. Cassandra acts as the primary means of storage with Elasticsearch being used to index all data. We currently operate over 200 Cassandra nodes in several clusters across the world serving over 50 million distinct time series.

  • rethinkdb is a database which does push instead of being polled. Recently open-sourced, very fancy pedigree, haven’t used it.

  • qminer

    UNSTRUCTURED DATA
    QMiner provides support for unstructured data, such as text and social networks across the entire processing pipeline, from feature engineering and indexing to aggregation and machine learning.
    SEARCH
    QMiner provides out-of-the-box support for indexing, querying and aggregating structured, unstructured and geospatial data using a simple query language.
    JAVASCRIPT API
    QMiner applications are implemented in JavaScript, making it easy to get started. Using the Javascript API it is easy to compose complete data processing pipelines and integrate with other systems via RESTful web services.
    C++ LIBRARY
    QMiner is implemented in C++ and can be included as a library into custom C++ projects, thus providing them with stream processing and data analytics capabilities.
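
Back to Redis from the top of this section: a minimal redis-py sketch of the “concurrent frequent writes” case, with several worker processes dumping results into one list (the key names are invented; assumes a redis server on localhost):

    import json

    import redis

    r = redis.Redis()  # assumes redis-server is listening on localhost:6379

    def record_result(worker_id, value):
        # rpush and incr are atomic on the server, so many worker processes
        # can call this concurrently without coordinating with each other.
        r.rpush("results", json.dumps({"worker": worker_id, "value": value}))
        r.incr("results:count")

    record_result("worker-3", 0.27)

    # Later, drain everything for analysis:
    results = [json.loads(raw) for raw in r.lrange("results", 0, -1)]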

Document stores

Want to handle floppy ill-defined documents of ill-specified possibly changing metadata? Already resigned to the process of querying and processing this stuff being depressingly slow and/or storage-greedy?

You’re looking for document stores!

If you are looking at document stores as your primary workhorse, as opposed to something you want to get data out of for other storage, then you have either

  • Not much data so performance is no problem, or

  • a problem.

Let’s assume number 1, which is common.

  • Mongodb has a pleasant JS API but is not all that good at concurrent storage, so why are you bothering to do this in a document store? If your data is effectively single-writer you could just be doing this from the filesystem. Still, I can imagine scenarios where the dynamic indexing of post hoc metadata is nice, for example in the exploratory phase with a data subset?

  • Couchdb was the pinup child of the current crop of non-SQL databases, but seems to have fallen out of fashion.

  • kinto “is a lightweight JSON storage service with synchronisation and sharing abilities. It is meant to be easy to use and easy to self-host. Supports fine permissions, easy host-proof encryption, automatic versioning for device sync.”

    So this is probably for the smartphone app version.

  • lmdb looks interesting if you want a simple store that just guarantees you can write to it without corrupting data, and without requiring a custom server process. Most efficient for small records (around 2 KB or less). A minimal sketch follows this list.
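
The lmdb API is small enough to show in full (the path and keys are invented; keys and values are bytes, so you serialise them yourself):

    import json

    import lmdb

    env = lmdb.open("./metadata.lmdb", map_size=2**30)  # reserve up to 1 GiB

    with env.begin(write=True) as txn:   # writes are transactional
        txn.put(b"doc:42", json.dumps({"title": "draft", "tags": ["a", "b"]}).encode())

    with env.begin() as txn:             # readers don't block the writer, or each other
        doc = json.loads(txn.get(b"doc:42"))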

Relational databases

Long lists of numbers? Spreadsheet-like tables? Wish to do queries mostly of the sort supported by database engines, such as grouping, sorting and range queries? Sqlite if it fits in memory. (No need to click on that link though, sqlite is already embedded in your tool of choice.) 🏗 how to write safely to sqlite from multiple processes through write locks. Also: Mark Litwintschik’s Minimalist Guide to SQLite.
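
For what it’s worth, what I’d try for the multi-process sqlite case flagged above: switch on WAL mode and set a generous busy timeout, so writers queue for the lock instead of erroring out (a sketch, not gospel):

    import sqlite3

    conn = sqlite3.connect("results.db", timeout=30.0)  # wait up to 30 s for the write lock
    conn.execute("PRAGMA journal_mode=WAL;")    # readers no longer block the (single) writer
    conn.execute("PRAGMA synchronous=NORMAL;")  # common speed/durability trade-off under WAL

    conn.execute("CREATE TABLE IF NOT EXISTS results (run TEXT, loss REAL)")
    with conn:  # commits the transaction on success, rolls back on error
        conn.execute("INSERT INTO results VALUES (?, ?)", ("run-17", 0.03))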

If not, or if you need to handle concurrent writing by multiple processes, MySQL or Postgres. Not because they are best for this job, but because they are common. Honestly, though, unless this is a live production service for many users, you should probably be using one of the filesystem stores above.

Clickhouse, for example, is a columnar database that avoids some of the problems of row-oriented tabular databases. I guess you could try that? And Amazon Athena turns arbitrary data into SQL-queryable data, apparently. So the skills here are general.

Accessing RDBMSs from python

Maybe you can make numerical work easier using Blaze?

Blaze translates a subset of modified NumPy and Pandas-like syntax to databases and other computing systems. Blaze allows Python users a familiar interface to query data living in other data storage systems.

More generally, records, which wraps tablib and sqlalchemy, is good at this.
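
If all you want is query results in a DataFrame, the boring-but-reliable route for me is pandas.read_sql over a SQLAlchemy engine (a sketch; the table and columns are invented):

    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("sqlite:///results.db")  # any SQLAlchemy URL works here

    # Push the grouping and sorting down to the database, get a DataFrame back.
    summary = pd.read_sql(
        "SELECT run, AVG(loss) AS mean_loss FROM results GROUP BY run ORDER BY mean_loss",
        engine,
    )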

Distributed stores

Ever since google, every CS graduate wants to write one of these. There are dozens of options; you probably need none of them.

  • Hbase for Hadoop (original hip open source one, no longer hip)

  • Voldemort

  • Cassandra

  • Hypertable: Baidu’s open competitor to Google’s internal database

  • bedrockdb:

    is a networking and distributed transaction layer built atop SQLite, the fastest, most reliable, and most widely distributed database in the world.

    Bedrock is written for modern hardware with large SSD-backed RAID drives and generous RAM file caches, and thereby doesn’t mess with the zillion hacky tricks the other databases do to eke out high performance on largely obsolete hardware. This results in fewer esoteric knobs, and sane defaults that “just work”.

  • datalog seems to be a query protocol/language (a Prolog subset) designed for largish stores, with implementations such as datomic getting good press for being scalable. Read this tutorial and explain it to me.

    datomic:

    Build flexible, distributed systems that can leverage the entire history of your critical data, not just the most current state. Build them on your existing infrastructure or jump straight to the cloud.

  • orbitdb is not necessarily giant (I mean, I don’t know how it scales) but it is convenient for offline/online syncing and definitely distributed; orbitdb uses ipfs for its backend.

    See parallel computing.

Caches

redis and memcached are the default generic choices here. Redis is newer and more flexible. memcached is sometimes faster? Dunno. Perhaps see Why Redis beats Memcached for caching.

See python caches for the practicalities of doing this for one particular language.
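
And for the “I just want to manage my own processes” case from the tl;dr, a joblib.Memory sketch (the cache directory and function are invented):

    from joblib import Memory

    memory = Memory("./cache", verbose=0)  # results get pickled into this directory

    @memory.cache
    def expensive_fit(dataset_path, alpha):
        # stand-in for hours of work; repeat calls with the same arguments
        # are served from the on-disk cache instead of recomputed
        return {"path": dataset_path, "alpha": alpha}

    result = expensive_fit("data/run1.h5", alpha=0.1)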

Graph stores

Graph-tuple oriented processing.

graphengine:

GE is also a flexible computation engine powered by declarative message passing. GE is for you, if you are building a system that needs to perform fine-grained user-specified server-side computation.

From the perspective of graph computation, GE is not a graph system specifically optimized for a certain graph operation. Instead, with its built-in data and computation modeling capability, we can develop graph computation modules with ease. In other words, GE can easily morph into a system supporting a specific graph computation.

UIs

If you want to access SQL databases, there are a couple of nice options in open-source land (only a few decades after SQL’s birth).

  • grafana

    Open source software for time series analytics

    From heatmaps to histograms. Graphs to geomaps. Grafana has a plethora of visualization options to help you understand your data, beautifully. Bring your data together to get better context. Grafana supports dozens of databases, natively. Mix them together in the same Dashboard.

  • database flow

    Database Flow is an open source self-hosted SQL client, GraphQL server, and charting application that works with your database.

    Visualize schemas, query plans, charts, and results.

    Java app.

  • metabase

    Shanker Sneh, about whom I know nothing, says

    Good:

    1. Robust and clearly laid out framework. Supports proper database for application metadata.
    2. Feature-rich with easy user, query, segment & dashboard management & classification.
    3. Supports Google SSO, Slack, Email integration.

    Not-so-good:

    1. Framework is Java based. Any customisation will require dev activities from our end.
  • R can do lots of data analysis, including database analytics as a special case. If you want it to be web-based, shiny can put many queries online.

  • superset

    tl;dr autogenerates smooth dashboards based on your database, makes it look like you have been doing something.

    Apache Superset is a data exploration and visualization web application.

    Superset provides:

    • A wide array of beautiful visualizations to showcase your data.

    • A state of the art SQL editor/IDE exposing a rich metadata browser, and an easy workflow to create visualizations out of any result set.

    • Out of the box support for most SQL-speaking databases

    • [other keywords that only boring bizdev types care about and no one real ever needs]

  • redash (redash source)

    Redash consists of two parts:

    • Query Editor: Think of JS Fiddle for SQL queries. It’s your way to share data in the organization in an open way, by sharing both the dataset and the query that generated it. This way everyone can peer review not only the resulting dataset but also the process that generated it. Also it’s possible to fork it and generate new datasets and reach new insights.

    • Dashboards/Visualizations: once you have a dataset, you can create different visualizations out of it, and then combine several visualizations into a single dashboard. Currently it supports charts, pivot table and cohorts.

  • blazer is a dashboarding/interactive query UI.

    features:

    • Multiple data sources - PostgreSQL, MySQL, Redshift, Mongodb…
    • Variables - run the same queries with different values
    • Checks & alerts - get emailed when bad data appears
    • Audits - all queries are tracked
    • Security - works with your authentication system
  • dbeaver

    Free multi-platform database tool for developers, SQL programmers, database administrators and analysts. Supports all popular databases: MySQL, PostgreSQL, MariaDB, SQLite, Oracle, DB2, SQL Server, Sybase, MS Access, Teradata, Firebird, Derby, etc.

    (Java Eclipse app)

  • sqlitebrowser does only sqlite but is open source and featureful.

  • sqlitestudio is a qt-based sqlite manager/browser, also open source.

  • datasette provides a read-only web JSON API for SQLite.

  • sqlectron

    A simple and lightweight SQL client desktop/terminal with cross database and platform support.

    Unmaintained.