Data versioning


Keeping track not only of changes in code but also of changes in data (which, these days, might be what generates the code anyway).

There is a snake's nest of tangled problems here: data sets can be big; they are encoded in weird ways; they might not be local files; they might be used by many people simultaneously; and they are coupled tightly with data-cleaning procedures, which are usually code, yet data sets are not updated as often as code is.

How do you get the modern affordances of source code management for data sets?

Here are some tools which variously solve some of these problems.

Go Get Data

GGD comes from the genomics community. It seems to be a system for fetching and processing data based on code recipes. The combination of raw data URL plus recipe plus some caching is what gives you the data set you actually use.

Go Get Data (ggd) is a data management system that provides access to data packages containing auto curated genomic data. ggd data packages contain all necessary information for data extraction, handling, and processing. With a growing number of scientific datasets, ggd provides access to these datasets without the hassle of finding, downloading, and processing them yourself. ggd leverages the conda package management system and the infrastructure of Bioconda to provide a fast and easy way to retrieve processed annotations and datasets, supporting data provenance, and providing a stable source of reproducibility. Using the ggd data management system allows any user to quickly access all desired datasets, manage that data within an environment, and provides a platform upon which to cite data access and use by way of the ggd data package name and version.


This strikes me as an elegant solution, applicable far beyond genomics. It seems to be a lighter version of Pachyderm.
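
For flavour, a session might look something like the following; I have not checked this against a current ggd release and the recipe name is a made-up placeholder, so treat it as a sketch of the workflow rather than gospel.

# search the recipe channels for a data package (hypothetical search term)
$ ggd search reference-genome

# install a recipe into the active conda environment (recipe name invented)
$ ggd install some-reference-genome-recipe-v1

# ask ggd where the installed files ended up
$ ggd get-files some-reference-genome-recipe-v1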

DVC

“Data Version Control”.

DVC looks hip and solves some problems related to build tools, although it is not one as such. It versions code together with data assets stored in some external data store such as S3 or whatever.

DVC runs on top of any Git repository and is compatible with any standard Git server or provider (Github, Gitlab, etc). Data file contents can be shared by network-accessible storage or any supported cloud solution. DVC offers all the advantages of a distributed version control system — lock-free, local branching, and versioning.

The single dvc repro command reproduces experiments end-to-end. DVC guarantees reproducibility by consistently maintaining a combination of input data, configuration, and the code that was initially used to run an experiment.
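
Concretely, a minimal workflow looks something like this sketch; the bucket, file names and training script are placeholders, and the exact flags vary between DVC versions.

# initialise DVC inside an existing git repository
$ dvc init

# track a large data file; DVC writes a small .dvc pointer file for git to version
$ dvc add data/raw.csv
$ git add data/raw.csv.dvc .gitignore
$ git commit -m "Track raw data with DVC"

# push the actual bytes to an external data store (a hypothetical S3 bucket here)
$ dvc remote add -d storage s3://my-bucket/dvc-store
$ dvc push

# declare a pipeline stage, then reproduce the experiment end-to-end
$ dvc run -n train -d data/raw.csv -d train.py -o model.pkl python train.py
$ dvc repro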

It resembles git-lfs, which is the classic git method of dealing with Really Big Files, and maybe also git-annex, which is a Big File handler built on git. However, it puts these at the service of reproducible and easily distributed experiments.

It has a GitHub-style overlay, dagshub, which specialises in hosting DVC projects.

Splitgraph

splitgraph:

Splitgraph is a data management, building and sharing tool inspired by Docker and Git that works on top of PostgreSQL and integrates seamlessly with anything that uses PostgreSQL.

Splitgraph allows the user to manipulate data images (snapshots of SQL tables at a given point in time) as if they were code repositories by versioning, pushing and pulling them. It brings the best parts of Git and Docker, tools well-known and loved by developers, to data science and data engineering, and allows users to build and manipulate datasets directly on their database using familiar commands and paradigms.

It works on top of PostgreSQL and uses SQL for all versioning and internal operations. You can “check out” data into actual PostgreSQL tables, offering read/write performance and feature parity with PostgreSQL and allowing you to query it with any SQL client. The client application has no idea that it’s talking to a Splitgraph table and you don’t need to rewrite any of your tools to use Splitgraph. Anything that works with PostgreSQL will work with Splitgraph.

Splitgraph also defines the declarative Splitfile language with Dockerfile-like caching semantics that allows you to build Splitgraph data images in a composable, maintainable and reproducible way. When you build data with Splitfiles, you get provenance tracking. You can inspect an image’s metadata to find the exact upstream images, tables and columns that went into it. With one command, Splitgraph can use this provenance data to rebuild an image against a newer version of its upstream dependencies. You can easily integrate Splitgraph into your existing CI pipelines, to keep your data up-to-date and stay on top of changes to its inputs.

You do not need to download the full Splitgraph image to query it. Instead, you can query Splitgraph images with layered querying, which will download only the regions of the table relevant to your query, using bloom filters and other metadata. This is useful when you’re exploring large datasets from your laptop, or when you’re only interested in a subset of data from an image. This is still completely transparent to the client application, which sees a PostgreSQL schema that it can talk to using the Postgres wire protocol.

Splitgraph does not limit your data sources to Postgres databases. It includes first-class support for importing and querying data from other databases using Postgres foreign data wrappers. You can create Splitgraph images or query data in MongoDB, MySQL, CSV files or other Postgres databases using the same interface.
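
A hedged sketch of what this looks like via the sgr command-line client; the repository and table names are invented and I have not verified these invocations against a current release.

# spin up a local Splitgraph engine (a dockerised PostgreSQL), if memory serves
$ sgr engine add

# pull a data image from a remote registry (hypothetical repository name)
$ sgr clone someuser/somedataset

# check the image out into real PostgreSQL tables and query it with any SQL client
# (the checked-out schema is named after the repository, I believe)
$ sgr checkout someuser/somedataset:latest
$ psql -c 'SELECT count(*) FROM "someuser/somedataset".some_table'

# commit local edits as a new image and push it back
$ sgr commit someuser/somedataset
$ sgr push someuser/somedataset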

Sno

Sno

Sno stores geospatial and tabular data in Git, providing version control at the row and cell level.

  • Built on Git, works like Git
  • Uses standard Git repositories and Git-like CLI commands. If you know Git, you’ll feel right at home with Sno.
  • Supports current GIS workflows
  • Provides repository working copies as GIS databases and files. Edit directly in common GIS software without plugins.

This is a neat approach if you have a large enough git repository, I suppose.
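
For what it is worth, the CLI is meant to mirror git, so a session presumably looks something like this sketch; the GeoPackage name is made up and the commands may have drifted since I looked.

# create a new repository by importing an existing GeoPackage (file name invented)
$ sno init --import GPKG:roads.gpkg my-repo
$ cd my-repo

# edit the working copy in your usual GIS tools, then inspect and commit as in git
$ sno status
$ sno diff
$ sno commit -m "Realign the northern roads"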

dolt

dolt

Dolt is a relational database, i.e. it has tables, and you can execute SQL queries against those tables. It also has version control primitives that operate at the level of the table cell. Thus Dolt is a database that supports fine-grained, value-wise version control, where all changes to data and schema are stored in a commit log.

It is inspired by RDBMS and Git, and attempts to blend concepts from both in a manner that allows users to better manage, distribute, and collaborate on data.

It has a twin project, dolthub, which is the github of dolts, i.e. data sharing infrastructure.
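
A quick sketch of the flavour, with table and column names invented for illustration:

# create a new dolt database in the current directory
$ dolt init

# ordinary SQL for the data…
$ dolt sql -q "CREATE TABLE measurements (id INT PRIMARY KEY, value DOUBLE)"
$ dolt sql -q "INSERT INTO measurements VALUES (1, 0.5), (2, 0.7)"

# …and git-like commands for the versioning
$ dolt add .
$ dolt commit -m "Initial measurements"
$ dolt sql -q "UPDATE measurements SET value = 0.9 WHERE id = 2"
$ dolt diff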

Pachyderm

Pachyderm

is a tool for production data pipelines. If you need to chain together data scraping, ingestion, cleaning, munging, wrangling, processing, modeling, and analysis in a sane way, then Pachyderm is for you. If you have an existing set of scripts which do this in an ad-hoc fashion and you’re looking for a way to “productionize” them, Pachyderm can make this easy for you.

  • Containerized: Pachyderm is built on Docker and Kubernetes. Whatever languages or libraries your pipeline needs, they can run on Pachyderm which can easily be deployed on any cloud provider or on prem.
  • Version Control: Pachyderm version controls your data as it’s processed. You can always ask the system how data has changed, see a diff, and, if something doesn’t look right, revert.
  • Provenance (aka data lineage): Pachyderm tracks where data comes from. Pachyderm keeps track of all the code and data that created a result.
  • Parallelization: Pachyderm can efficiently schedule massively parallel workloads.
  • Incremental Processing: Pachyderm understands how your data has changed and is smart enough to only process the new data.
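
For a sense of how this looks in practice, a minimal interaction via pachctl runs roughly as follows; the repo, file and pipeline names are placeholders in the spirit of the official tutorials, and details will vary by release.

# create a versioned data repository and commit a file into it
$ pachctl create repo images
$ pachctl put file images@master:photo.png -f photo.png

# wire up a containerised pipeline from a spec file
$ pachctl create pipeline -f edges.json

# inspect the resulting commit history
$ pachctl list commit images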

This is the wrong scale for me, but interesting to see how enterprise might be doing big versions of my little experiments.

Dat

dat tracks updates to your datasets and (unique selling point) shares the updates in a quasi-peer-to-peer fashion. It is similar to syncthing, with a different emphasis: sharing discoverable data with strangers rather than with friends. You could also use it for backups or somesuch, I suppose.

Dat is the package manager for data. Share files with version control, back up data to servers, browse remote files on demand, and automate long-term data preservation. Secure, distributed, fast.

However, you cannot, for example, use it for distributed collation of streams from many different servers in the cloud, because it is one-writer-many-readers.

$ npm install -g dat hypercored
…
$ mkdir MyData
$ cd MyData
$ dat create
> Title My Awesome Dat
> Description This is a dat

  Created empty Dat in /Users/me/MyData/.dat

$ dat share
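
On the consuming side, someone else can then replicate and follow the data set by its key; the key below is obviously a placeholder.

$ dat clone dat://<64-character-key> ~/MyDataCopy
$ cd ~/MyDataCopy
$ dat pull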

Some hacks exist for partial downloading. If you wanted that, you could use Dat’s base layer, hyperdrive, directly from node.js. (However, no one uses node.js for science at the moment, so if you find yourself working on this bit of plumbing, ask yourself whether you are yak shaving, and whether your procrastination might be better spent going outside for some fresh air.)

Despite the upbeat publicity around this project, I can't find a use for it. I think I must not be the target audience, but then… when does my lab ever have the single source of truth for a constantly updated data set that they want to pump out to others? Are there enough such cases to reach a critical mass of community?

git-annex

Choose this if… you are a giant nerd with harrowing restrictions on your data transfer and it is worth your while to leverage this very sophisticated yet confusing bit of software to work around those challenges, e.g. you are integrating sneakernets and various online options. Which I am not. It is not targeted specifically at data-science people but is much broader. It might still work, though.

git-annex supports explicit and customisable folder-tree synchronisation, merging, and sneakernets, and as such I am well disposed toward it. You can choose to have things in various stores, and to copy files to and from servers or disks as they become available. It doesn’t support iOS. Windows support is experimental. Granularity is per-file. It has a weird symlink-based file-access protocol which might be inconvenient for many uses. (I’m imagining this is trouble for Microsoft Word or whatever.)
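
A sketch of the kind of ceremony involved, with remote names and paths invented:

# inside an ordinary git repository
$ git annex init "laptop"
$ git annex add observations.h5
$ git commit -m "Add observations (annexed)"

# register another store, e.g. a USB disk that already holds an annex-enabled clone
$ git remote add usbdrive /media/usbdrive/annex
$ git annex copy observations.h5 --to usbdrive

$ git annex drop observations.h5      # free local space; annex checks a copy exists elsewhere
$ git annex get observations.h5       # …and fetch it back when needed
$ git annex whereis observations.h5   # which stores currently hold it?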

Also, do you want to invoke various disk-online-disk-offline-how-sync-when options from the command line, or do you want stuff to magically replicate itself across some machines without requiring you to remember the correct incantation on a regular basis?

The documentation is nerdy and unclear, but I think my needs are nerdy and unclear by modern standards. However, the combinatorial explosion of options and excessive hands-on-ness is a serious problem which I will not realistically get around to addressing due to my to-do list already being too long.