Databases (for scientists)

Stashing experimental data for later use

You have to store it somewhere

tl;dr I do a lot of data processing, and not so much running of websites and such. This is not the typical target workflow for a database, at least not as they usually imagine at enterprise database focus groups. So, here are some convenient databases for my needs: working at a particular, sub-Google, scale, where my datasets are a few gigabytes but never a few terabytes, and capturing stuff like experiment data, data processing pipelines and that kind of thing. Not covered: processing credit card transactions, running websites, etc.

Short list of things I have used for different purposes:

  • Pre-sharded or non-concurrent-write datasets too big for RAM: some files formatted as hdf5. Annoying schema definition needed, but once you’ve done that it’s fast for numerical data. Not a database application per se, but a structured data format that fills a similar niche.

  • Pre-sharded or non-concurrent-write datasets: sqlite. Has great tooling. A pity that it’s not super-high performance for numerical data (but typically this is not a bottleneck except for biiiig data)

  • Concurrent but not incessant writes (i.e. I just want to manage my processes): Maybe dogpile.cache or joblib, or perhaps use berkeley DB which has lock support.

  • Concurrent frequent writes: redis.

  • honourable mention: If I want a low-fuss searchable index over some structured content from a python script, TinyDB!

  • Honourable mention: Pre-sharded or non-concurrent-write datasets between python/R/Go/Rust/JS/etc arrow.

  • Sometimes I want to quickly browse a database to see what is in it. For that I use DB UIs.

  • I want to share my data: Data sharing

Point of contact here: Data Lakes, big stores of (usually?) tabular data which are not yet normalized into data tables and are implicitly large. I don’t know much about those but see, e.g. Introducing Walden.

Maybe one could get a bit of perspective on the tools here by write-ups such as Luc Perkins’s Recent database technology that should be on your radar.

OK, longer notes begin here:

With a focus on slightly specialised data stores for use in my statistical jiggerypokery. Which is to say: I care about analysis of lots of data fast. This is probably inimical to running, e.g. your webapp from the same store, which has different requirements. (Massively concurrent writes, consistency guarantees, many small queries instead of few large) Don’t ask me about that.

I would prefer to avoid running a database server at all if I can; At least in the sense of a highly specialized multi-client server process. Those are not optimised for a typical scientific workflow. First stop is in-process non-concurrent-write data storage e.g. HDF5 or sqlite.

However, if you want to mediate between lots of threads/processes/machines updating your data in parallel, a “real” database server can be justified.

OTOH if your data is big enough, perhaps you need a crazy giant distributed store of some kind? Requirements change vastly depending on your scale.


Unless my data is enormous, or I need to write to it concurrently, this is what I want, because

  1. no special server process is required and
  2. migrating data is just copying a file

But how to encode the file? See data format.

Document stores

Want to handle floppy ill-defined documents of ill-specified possibly changing metadata? Already resigned to the process of querying and processing this stuff being depressingly slow and/or storage-greedy?

You’re looking for document stores!

If you are looking at document stores as your primary workhorse, as opposed to something you want to get data out of for other storage, then you have

  1. Not much data so performance is no problem, or
  2. a problem, or
  3. a big engineering team.

Let’s assume number 1, which is common.


Mongodb has a pleasant JS api but is not all that good at concurrent storage, so why are you bothering to do this in a document store? If your data is effectively single-writer you could just be doing this from the filesystem. Still I can imagine scenarios where the dynamic indexing of post hoc metadata is nice, for example in the exploratory phase with a data subset?


Couchdb was the pinup child of the current crop of non SQL-based databases, but seems to be unfashionable.


kinto “is a lightweight JSON storage service with synchronisation and sharing abilities. It is meant to be easy to use and easy to self-host. Supports fine permissions, easy host-proof encryption, automatic versioning for device sync.”

So this is probably for the smartphone app version.


lmdb looks interesting if you want a simple store that just guarantees you can write to it without corrupting data, and without requiring a custom server process. Most efficient for small records (2K)




QMiner provides support for unstructured data, such as text and social networks across the entire processing pipeline, from feature engineering and indexing to aggregation and machine learning.
QMiner provides out-of-the-box support for indexing, querying and aggregating structured, unstructured and geospatial data using a simple query language.
QMiner applications are implemented in JavaScript, making it easy to get started. Using the Javascript API it is easy to compose complete data processing pipelines and integrate with other systems via RESTful web services.
QMiner is implemented in C++ and can be included as a library into custom C++ projects, thus providing them with stream processing and data analytics capabilities.

Berkeley DB

berkeley is a venerable key-value store that is no longer fashionable. Howeever it is efficient for storing binary data, and supports multi-process concurrency via lock files, all without using a server process. As such it may be useful for low-fuss HPC data storage and processing. There are, e.g. python bindings.

Relational databases

Long lists of numbers? Spreadsheet-like tables? Wish to do queries mostly of the sort supported by database engines, such as grouping, sorting and range queries? Sqlite if it fits in memory. (No need to click on that link though, sqlite is already embedded in your tool of choice.) 🏗 how to write safely to sqlite from multiple processes through write locks. Also: Mark Litwintschik’s Minimalist Guide to SQLite.

If not, or if you need to handle concurrent writing by multiple processes, MySQL or Postgres. Not because they are best for this job, but because they are common. Honestly, though, unless this is a live production service for many users, you should probably be using a disk-backed store.

Clickhouse for example is a columnar database that avoids some of the problems of row-oriented tabular databases. I guess you could try that? And Amazon Athena turns arbitrary data into SQL-queryable data, apparently. So the skills here are general.

Accessing RDBMSs from python

Maybe you can make numerical work easier using Blaze?

Blaze translates a subset of modified NumPy and Pandas-like syntax to databases and other computing systems. Blaze allows Python users a familiar interface to query data living in other data storage systems.

More generally, records, which wraps tablib and sqlalchemy, is good at this.

Distributed stores

Ever since google, every CS graduate wants to write one of these. There are dozens of options; you probably need none of them.

I have used non of them and only mention them here to keep them straight in my head.

  • Hbase for Hadoop (original hip open source one, no longer hip)

  • Voldemort

  • Cassandra

  • Hypertable is Baidu’s open competitor to google internal database

  • bedrockdb:

    […] is a networking and distributed transaction layer built atop SQLite, the fastest, most reliable, and most widely distributed database in the world.

    Bedrock is written for modern hardware with large SSD-backed RAID drives and generous RAM file caches, and thereby doesn’t mess with the zillion hacky tricks the other databases do to eke out high performance on largely obsolete hardware. This results in fewer esoteric knobs, and sane defaults that “just work”.

  • datalog seems to be a protocol/language designed for largish stores, with implementations such as datomic getting good press for being scalable. Read this tutorial and explain it to me.


    Build flexible, distributed systems that can leverage the entire history of your critical data, not just the most current state. Build them on your existing infrastructure or jump straight to the cloud.

  • orbitdb Not necessarily giant (I mean, I don’t know the how it scales) but convenient for offline/online syncing and definitely distributed, orbitdb uses ipfs for its backend.


redis and memcached are the default generic choices here. Redis is newer and more flexible. memcached is sometimes faster? Dunno. Perhaps see Why Redis beats Memcached for caching.

See python caches for the practicalities of doing this for one particular languages.

Graph stores

Graph-tuple oriented processing.


GE is also a flexible computation engine powered by declarative message passing. GE is for you, if you are building a system that needs to perform fine-grained user-specified server-side computation.

From the perspective of graph computation, GE is not a graph system specifically optimized for a certain graph operation. Instead, with its built-in data and computation modeling capability, we can develop graph computation modules with ease. In other words, GE can easily morph into a system supporting a specific graph computation.


Nebula Graph is an open-source graph database capable of hosting super large scale graphs with dozens of billions of vertices (nodes) and trillions of edges, with milliseconds of latency.

There are a lot more of these. Everyone is inventing new graph stores at the moment.



immudb is a lightweight, high-speed immutable database for systems and applications, written in Go. With immudb you can track changes in sensitive data in your transactional databases and then record those changes permanently in a tamperproof immudb database. This allows you to keep an indelible history of sensitive data, for example debit/credit card transactions.

Traditional transaction logs are hard to scale and are mutable. So there is no way to know for sure if your data has been compromised.

As such, immudb provides unparalleled insights retroactively of changes to your sensitive data, even if your perimeter has been compromised. immudb guarantees immutability by using a Merkle tree structure internally.

immudb gives you the same cryptographic verification of the integrity of data written with SHA-256 as a classic blockchain without the cost and complexity associated with blockchains today.

Array stores that are not filesystem stores

Luc Perkins mentions

TileDB is a DB built around multi-dimensional arrays that enables you to easily work with types that aren’t a great fit for existing RDBMS systems, such as dense and sparse arrays and dataframes. TileDB is specifically geared toward use cases like genomics and geospatial data.

Noteworthy features

Time series/Event crunching/Streaming

See databases for realtime use.

Interesting experiments


Self-learning database: NoisePage

NoisePage is a relational database management system (DBMS) designed from the ground up for autonomous deployment. It uses integrated machine learning components to control its configuration, optimization, and tuning. The system will support automated physical database design (e.g., indexes, materialized views, sharding), knob configuration tuning, SQL tuning, and hardware capacity/scaling. Our research focuses on building the system components that support such self-driving operation with little to no human guidance.

I believe that it does notvwork yet.


logica (source)

Logica is for engineers, data scientists and other specialists who want to use logic programming syntax when writing queries and pipelines to run on BigQuery.

Logica compiles to StandardSQL and gives you access to the power of BigQuery engine with the convenience of logic programming syntax. This is useful because BigQuery is magnitudes more powerful than state of the art native logic programming engines.

We encourage you to try Logica, especially if

  • you already use logic programming and need more computational power, or
  • you use SQL, but feel unsatisfied about its readability, or
  • you want to learn logic programming and apply it to processing of Big Data.

In the future we plan to support more SQL dialects and engines.

No comments yet. Why not leave one?

GitHub-flavored Markdown & a sane subset of HTML is supported.