Databases (for scientists)

Stashing experimental data for later use

2015-03-04 — 2025-04-11

Wherein scientific database choices are surveyed, and a preference for serverless, file-backed storage for gigabyte-scale experiment data (HDF5, SQLite, DuckDB) is articulated, and concurrent-write tradeoffs are catalogued.

computers are awful

concurrency hell

data sets

tl;dr Databases. Systems to store data and then query them after. I use them a lot. Unlike most of the internet my needs are science-y, which means I have some specific and unusual requirements. I do a lot of data processing, and not so much running of websites and such. This is not the typical target workflow for a database, at least not as they usually imagined by enterprise database focus groups.

So, this notebook documents some convenient databases for my needs: working at a particular, sub-Google scale, where my datasets are a few gigabytes but never a few terabytes, and capturing stuff like experiment data, data processing pipelines and that kind of thing. I hav ea strong preference for not needing a server if possible, and also for not needing to install a lot of dependencies, and ideally also to handle concurrent writes from lots of parallel processes. Achieving these all at once is annoying, but we can achiveve good compromises depending on the use case.

Short list of things I have used for various purposes, which will probably do as a starting point for most data science:

Non-concurrent-write datasets too big for RAM? Do not bother with a database per se. Instead stash the stuff in some files. hdf5 is a good starting choice that I tend to use. Annoying schema definition needed, but once that’s done it’s fast for numerical data, on a diversity of storage backends. Dependencies are easy to install and self-contained.
Non-concurrent-write datasets where we want to do complicated queries? sqlite. Has great tooling. A pity that it’s not super-high performance for numerical data (but for structured, relational data it is pretty good). Actually these days the cool kids use DuckDB instead of sqlite for this kind of thing; I have not done that myself though. It seems to have better support for analytical queries.
Concurrent but not incessant writes and simple queries (i.e. I just want to manage my processes and store some data): Maybe dogpile.cache or joblib, or perhaps use berkeley DB which has lock support, or Diskcache
Concurrent frequent writes: redis, but now we need a server.
Honourable mention: If I want a low-fuss searchable index over some structured content from a python script, TinyDB!
Sometimes I want to quickly browse a database to see what is in it. For that I use DB UIs.
I want to share my data: Data sharing

Point of contact here: Data Lakes, big stores of (usually?) somewhat tabular data which are not yet normalised into data tables and are implicitly large. I don’t know much about those but see, e.g. Introducing Walden.

Maybe one could get a bit of perspective on the tools here by write-ups such as Luc Perkins’s recent database technology that should be on your radar.

1 Too-long bit begins here

With a focus on slightly specialised data stores for use in my statistical jiggery-pokery. Which is to say: I care about analysis of lots of data fast. This is probably inimical to running, e.g. your webapp from the same store, which has different requirements. (Massively concurrent writes, consistency guarantees, many small queries instead of few large) Don’t ask me about that.

I would prefer to avoid running a database server at all if I can, at least in the sense of a highly specialised multi-client server process. Those are not optimised for a typical scientific workflow. First stop is in-process non-concurrent-write data storage, e.g. HDF5 or sqlite.

However, if I want to mediate between lots of threads/processes/machines updating my data in parallel, a “real” database server might be justified.

OTOH if my data is big enough, perhaps I need a crazy giant distributed store of some kind? Requirements change vastly depending on scale.

2 Files

Unless my data is enormous, or I need to write to it concurrently, this is what I want, because

no special server process is required and
migrating data is just copying a file

But how to encode the file? See data formats.

3 Document stores

Want to handle floppy ill-defined documents of ill-specified possibly changing metadata? Already resigned to the process of querying and processing this stuff being depressingly slow and/or storage-greedy? I must be looking for document stores!

If I am looking at document stores as my primary workhorse, as opposed to something I want to get data out of for other storage, then I have

Not much data so performance is no problem, or
a problem, or
a big engineering team.

Let’s assume number 1, which is common for me.

4 DDB

DictDataBase is a fast document-based database that uses json files or compressed json files for storage.

Multi threading and multi processing safe. Multiple processes on the same machine can simultaneously read and write to dicts without losing data.

ACID compliant. Unlike TinyDB, it is suited for concurrent environments.

No Conflict resolution required. Unlike with ZODB, lock-based access control is used, such that conflicts never occur.

No database server required. Simply import DictDataBase in your project and use it.

Compression. Configure if the files should be stored as raw json or as json compressed with zlib.

Fast. Key-value pairs inside a json file can be accessed quickly and efficiently because the keys are indexed.

Tested with 98%+ coverage on Python 3.8 to 3.13.

4.1 Mongodb

Mongodb has a pleasant JS API but is notoriously not all that good at concurrent storage. It requires a server process, with heavy dependencies and root access. If my data is effectively single-writer I could just be doing this from the filesystem; why do I need a server? Still, I can imagine scenarios where the dynamic indexing of post hoc metadata is nice, for example in the exploratory phase with a data subset?

Since it went closed-source, it is worth knowing there is an open-source alternative, FerretDB 1.0 GA.

4.2 Couchdb

Couchdb was the pin-up child of the current crop of non SQL-based databases, but seems to be unfashionable rn? A big ecosystem of different implementations of the core DB for different purposes, all of which promise to support replication into eventually-consistent clusters.

4.3 kinto

kinto “is a lightweight JSON storage service with synchronization and sharing abilities. It is meant to be easy to use and easy to self-host. Supports fine permissions, easy host-proof encryption, automatic versioning for device sync.”

I can imagine distributed data analysis applications.

5 Key value stores

You have names for things (“keys”), and the things (“values”). You build indices out of that if you want.

5.1 LevelDB

google/leveldb: LevelDB is a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values.

5.2 RocksDB

The RocksDB library provides a persistent key value store. Keys and values are arbitrary byte arrays. The keys are ordered within the key value store according to a user-specified comparator function.

The library is maintained by the Facebook Database Engineering Team, and is based on LevelDB, by Sanjay Ghemawat and Jeff Dean at Google.

5.3 lmdb

lmdb looks interesting for its concurrency support.

LMDB is a tiny database with some excellent properties:

Ordered map interface (keys are always lexicographically sorted).

Reader/writer transactions: readers don’t block writers, writers don’t block readers. Each environment supports one concurrent write transaction.

Read transactions are extremely cheap.

Environments may be opened by multiple processes on the same host, making it ideal for working around Python’s GIL.

Multiple named databases may be created with transactions covering all named databases.

Memory mapped, allowing for zero copy lookup and iteration. This is optionally exposed to Python using the buffer() interface.

Maintenance requires no external process or background threads.

No application-level caching is required: LMDB fully exploits the operating system’s buffer cache.

Most efficient for small records (2K).

Records are grouped into pages matching the operating system’s VM page size, which is usually 4096 bytes. Each page must contain at least 2 records, in addition to 8 bytes per record and a 16 byte header. Due to this the engine is most space-efficient when the combined size of any (8+key+value) combination does not exceed 2040 bytes.

When an attempt to store a record would exceed the maximum size, its value part is written separately to one or more dedicated pages. Since the trailer of the last page containing the record value cannot be shared with other records, it is more efficient when large values are an approximate multiple of 4096 bytes, minus 16 bytes for an initial header.

5.4 Berkeley DB

berkeley is a venerable key-value store that is no longer fashionable. However, it is efficient for storing binary data and supports multi-process concurrency via lock files, all without using a server process. As such, it may be useful for low-fuss HPC data storage and processing. There are, e.g. python bindings.

6 Relational databases

Long lists of numbers? Spreadsheet-like tables? Wish to do queries mostly of the sort supported by database engines, such as grouping, sorting and range queries? Note that if I have tabular data but do not particularly wish to perform diverse RDBMS-style queries, then I should just use HDF5 or some other simple disk data store.

If not, or if I need to handle concurrent writing by multiple processes, we need one of the classic RDBMS servers, e.g. MySQL or Postgres. Scientific use cases are not usually like this; we are not usually concurrently generating lots of data.

6.1 Sqlite

First stop is Sqlite, if the data fits in memory, in the sense of the bit of the data I am mostly-using mostly-fitting in memory. Battle-tested and has a lot of tooling.

🏗 how to write safely to sqlite from multiple processes with write locks, worked examples. Also: See Mark Litwintschik’s Minimalist Guide to SQLite.

Interestingly, if we delete stuff from SQLite it can be relatively easy to recover. See Undark - a SQLite deleted data recovery tool / corrupt database data recovery tool. See also: Make sure Delete is Delete.

6.2 DuckDB

Columnar in-process DB. Similar to SQLite but columnar, so better for queries, i.e. re-analysing all that data you just stored.

See DuckDB:

DuckDB is easy to install and deploy. It has zero external dependencies and runs in-process in its host application or as a single binary.

DuckDB runs on Linux, macOS, Windows, and all popular hardware architectures. It has idiomatic client APIs for major programming languages.

DuckDB offers a rich SQL dialect. It can read and write file formats such as CSV, Parquet, and JSON, to and from the local file system and remote endpoints such as S3 buckets.

DuckDB runs analytical queries at blazing speed thanks to its columnar engine, which supports parallel execution and can process larger-than-memory workloads.

DuckDB is extensible by third-party features such as new data types, functions, file formats and new SQL syntax.

DuckDB and its core extensions are open-source under the permissive MIT License.

6.3 Mysql

TBD

6.4 MariaDB

MariaDB/server.

MariaDB server is a community developed fork of MySQL server. Started by core members of the original MySQL team, MariaDB actively works with outside developers to deliver the most featureful, stable, and sanely licensed open SQL server in the industry

6.5 postgresql

Classic. What I tended to use because it has powerful embedded scripting and good support for spatial data. But running servers saddens me.

6.6 openGauss

Huawei’s postgres fork is openGauss. What is its value proposition?

6.7 Accessing RDBMSs from python

Maybe we can make numerical work easier using Blaze?

Blaze translates a subset of modified NumPy and Pandas-like syntax to databases and other computing systems. Blaze allows Python users a familiar interface to query data living in other data storage systems.

More generally, records, which wraps tablib and sqlalchemy, are all good at this.

Julia Evan points out sqlite-utils, and tool that magically converts JSON to sqlite.

6.8 Dolt

Also covered under data versioning

Dolt is a SQL database that you can fork, clone, branch, merge, push and pull just like a git repository. Connect to Dolt just like any MySQL database to run queries or update the data using SQL commands. Use the command line interface to import CSV files, commit your changes, push them to a remote, or merge your teammate’s changes.

All the commands you know for Git work exactly the same for Dolt. Git versions files, Dolt versions tables. It’s like Git and MySQL had a baby.

We also built DoltHub, a place to share Dolt databases. We host public data for free. If you want to host your own version of DoltHub, we have DoltLab. If you want us to run a Dolt server for you, we have Hosted Dolt.

7 Distributed stores

Ever since Google, every CS graduate wants to write one of these. There are dozens of options; you probably need none of them.

I have used none of them and only mention them here to keep them straight in my head.

Hbase for Hadoop (original hip open source one, no longer hip)
Voldemort
Cassandra
Hypertable is Baidu’s open competitor to Google’s internal database
bedrockdb:

[…] is a networking and distributed transaction layer built atop SQLite, the fastest, most reliable, and most widely distributed database in the world.

Bedrock is written for modern hardware with large SSD-backed RAID drives and generous RAM file caches, and thereby doesn’t mess with the zillion hacky tricks the other databases do to eke out high performance on largely obsolete hardware. This results in fewer esoteric knobs, and sane defaults that “just work”.
datalog seems to be a protocol/language designed for large stores, with implementations such as datomic getting good press for being scalable. Read this tutorial and explain it to me.

datomic:

Build flexible, distributed systems that can leverage the entire history of your critical data, not just the most current state. Build them on your existing infrastructure or jump straight to the cloud.
orbitdb Not necessarily giant (I mean, I don’t know how it scales) but convenient for offline/online syncing and definitely distributed, orbitdb uses ipfs for its backend.

8 Caches

redis and memcached are the default generic choices here. Redis is newer and more flexible. memcached is sometimes faster? Dunno. Perhaps see Why Redis beats Memcached for caching.

See python caches for the practicalities of doing this for one particular language.

9 Graph stores

Graph-tuple oriented processing.

graphengine:

GE is also a flexible computation engine powered by declarative message passing. GE is for you, if you are building a system that needs to perform fine-grained user-specified server-side computation.

From the perspective of graph computation, GE is not a graph system specifically optimized for a certain graph operation. Instead, with its built-in data and computation modelling capability, we can develop graph computation modules with ease. In other words, GE can easily morph into a system supporting a specific graph computation.

nebula

Nebula Graph is an open-source graph database capable of hosting super large-scale graphs with dozens of billions of vertices (nodes) and trillions of edges, with milliseconds of latency.

There are a lot more of these. Everyone is inventing new graph stores at the moment.

10 Other

immudb

immudb is a lightweight, high-speed immutable database for systems and applications, written in Go. With immudb you can track changes in sensitive data in your transactional databases and then record those changes permanently in a tamperproof immudb database. This allows you to keep an indelible history of sensitive data, for example debit/credit card transactions.

Traditional transaction logs are hard to scale and are mutable. So there is no way to know for sure if your data has been compromised.

As such, immudb provides unparalleled insights retroactively of changes to your sensitive data, even if your perimeter has been compromised. immudb guarantees immutability by using a Merkle tree structure internally.

immudb gives you the same cryptographic verification of the integrity of data written with SHA-256 as a classic blockchain without the cost and complexity associated with blockchains today.

11 Array stores that are not filesystem stores

Luc Perkins mentions

TileDB is a DB built around multi-dimensional arrays that enables you to easily work with types that aren’t a great fit for existing RDBMS systems, such as dense and sparse arrays and dataframes. TileDB is specifically geared toward use cases like genomics and geospatial data.

11.1 Noteworthy features

Swappable data backends with support for Amazon S3, MinIO, and others

Storage integrations for HDFS, PrestoDB, Apache Spark, Dask, and others

Language bindings for C/C++, Python, R, Java, and Go

12 Vectors

See vector databases.

13 Time series/Event crunching/Streaming

See databases for realtime use.

14 Interesting experiments

14.1 noisepage

Self-learning database: NoisePage

NoisePage is a relational database management system (DBMS) designed from the ground up for autonomous deployment. It uses integrated machine learning components to control its configuration, optimization, and tuning. The system will support automated physical database design (e.g., indexes, materialized views, sharding), knob configuration tuning, SQL tuning, and hardware capacity/scaling. Our research focuses on building the system components that support such self-driving operation with little to no human guidance.

I believe that it does not work yet.

14.2 Logica

logica (source)

Logica is for engineers, data scientists and other specialists who want to use logic programming syntax when writing queries and pipelines to run on BigQuery.

Logica compiles to StandardSQL and gives you access to the power of BigQuery engine with the convenience of logic programming syntax. This is useful because BigQuery is magnitudes more powerful than state of the art native logic programming engines.

We encourage you to try Logica, especially if

you already use logic programming and need more computational power, or

you use SQL, but feel unsatisfied about its readability, or

you want to learn logic programming and apply it to processing of Big Data.

In the future we plan to support more SQL dialects and engines.

15 Incoming

Clickhouse for example is a columnar database that avoids some of the problems of row-oriented tabular databases. I guess you could try that? And Amazon Athena turns arbitrary data into SQL-queryable data, apparently. So the skills here are general.