Data versioning

Keeping track not only of changes in code but also of changes in data.

DVC

“Data Version Control”.

DVC looks hip and solves some problems adjacent to build tools, although it is not a build tool as such. It versions code together with data assets kept in some external data store such as S3 or whatever.

DVC runs on top of any Git repository and is compatible with any standard Git server or provider (GitHub, GitLab, etc.). Data file contents can be shared by network-accessible storage or any supported cloud solution. DVC offers all the advantages of a distributed version control system: lock-free, local branching, and versioning.

The single dvc repro command reproduces experiments end-to-end. DVC guarantees reproducibility by consistently maintaining a combination of input data, configuration, and the code that was initially used to run an experiment.
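
A minimal sketch of that workflow, assuming a project already under Git and a hypothetical S3 bucket as the remote; consult the DVC docs for the current incantations:

$ dvc init                                  # creates .dvc/ metadata inside the Git repo
$ dvc add data/raw.csv                      # start tracking the file; writes a data/raw.csv.dvc pointer
$ git add data/raw.csv.dvc .gitignore
$ git commit -m "Track raw data with DVC"
$ dvc remote add -d storage s3://my-bucket/dvcstore   # hypothetical bucket
$ dvc push                                  # upload the file contents to the remote
$ dvc repro                                 # re-run whichever pipeline stages are out of date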

It resembles git-lfs, which is the classic Git method of dealing with Really Big Files, and perhaps also git-annex, which is a Big File Handler built on Git. However, DVC puts these ideas at the service of reproducible and easily distributed experiments.

There is a GitHub overlay, DAGsHub, that specialises in hosting DVC projects.

Dat

dat tracks updates to your datasets and (its unique selling point) shares those updates in a quasi-peer-to-peer fashion. It is similar to Syncthing, with a different emphasis: sharing discoverable data with strangers rather than with friends. You could also use it for backups or somesuch, I suppose.

Dat is the package manager for data. Share files with version control, back up data to servers, browse remote files on demand, and automate long-term data preservation. Secure, distributed, fast.

However, you cannot, for example, use it for distributed collation of streams from many different servers in the cloud, because each archive has one writer and many readers.

$ npm install -g dat hypercored
…
$ mkdir MyData
$ cd MyData
$ dat create
> Title My Awesome Dat
> Description This is a dat

  Created empty Dat in /Users/me/MyData/.dat

$ dat share
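
On another machine, fetching the archive is roughly symmetric; the key below is a placeholder for the 64-character address that dat share prints:

$ dat clone dat://<64-character-key> MyDataCopy   # download a copy of the archive
$ cd MyDataCopy
$ dat pull                                        # update to the writer’s latest version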

Some hacks exist for partial downloading. If you wanted that you could use Dat’s base layer, hyperdrive, directly from node.js. (However, no-one uses node.js for science at the moment, so if you find yourself working on this bit of plumbing, ask yourself if you are yak shaving, and whether your procrastination might be better spent going outside for some fresh air.)

Sno

Sno stores geospatial and tabular data in Git, providing version control at the row and cell level.

  • Built on Git, works like Git
  • Uses standard Git repositories and Git-like CLI commands. If you know Git, you’ll feel right at home with Sno.
  • Supports current GIS workflows
  • Provides repository working copies as GIS databases and files. Edit directly in common GIS software without plugins.
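
A hedged sketch of what that Git-like CLI looks like in practice; the repository URL and commit message are hypothetical, and the exact subcommands may have drifted since:

$ sno clone https://example.com/my-geodata.git    # hypothetical repo; typically also materialises a GeoPackage working copy
$ cd my-geodata
$ sno status                                      # which rows have changed in the working copy
$ sno diff                                        # row- and cell-level diff
$ sno commit -m "Update parcel boundaries"
$ sno push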

dolt

Dolt is a relational database, i.e. it has tables, and you can execute SQL queries against those tables. It also has version-control primitives that operate at the level of the table cell. Thus Dolt is a database that supports fine-grained, value-wise version control, where all changes to data and schema are stored in a commit log.

It is inspired by RDBMSs and Git, and attempts to blend concepts from both in a manner that allows users to better manage, distribute, and collaborate on data.
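
A small sketch of what cell-level versioning looks like from the dolt CLI; the table and column names here are made up:

$ dolt init
$ dolt sql -q "CREATE TABLE measurements (id INT PRIMARY KEY, temp_c DOUBLE)"
$ dolt sql -q "INSERT INTO measurements VALUES (1, 21.5), (2, 19.0)"
$ dolt add .
$ dolt commit -m "Initial measurements"
$ dolt sql -q "UPDATE measurements SET temp_c = 22.1 WHERE id = 1"
$ dolt diff                                       # reports the changed cell, not just a changed file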

git-annex

Choose this if… you are a giant nerd with harrowing restrictions on your data transfer, and it’s worth your while to leverage this very sophisticated yet confusing bit of software to work around those challenges, e.g. you are integrating sneakernets and various online options. Which I am not. It is not targeted specifically at data-science people but is much broader in scope. Might still work though.

git-annex supports explicit and customisable folder-tree synchronisation, merging, and sneakernets, and as such I am well disposed toward it. You can choose to keep things in various stores, and to copy files to and from servers or disks as they become available. It doesn’t support iOS. Windows support is experimental. Granularity is per-file. It has a weird symlink-based file-access scheme which might be inconvenient for many uses. (I’m imagining this is trouble for Microsoft Word or whatever.)
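
A sketch of the per-file bookkeeping, with a hypothetical USB drive already configured as a remote called usbdrive:

$ git init && git annex init "laptop"
$ git annex add big-scan.tiff                 # content goes into .git/annex; a symlink stays in the tree
$ git commit -m "Add scan"
$ git annex copy big-scan.tiff --to usbdrive  # push the content to the hypothetical remote
$ git annex drop big-scan.tiff                # free local space; annex checks another copy exists first
$ git annex get big-scan.tiff                 # fetch it back from whichever remote has it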

Also, do you want to invoke various disk-online-disk-offline-how-sync-when options from the command line, or do you want stuff to magically replicate itself across some machines without requiring you to remember the correct incantation on a regular basis?

The documentation is nerdy and unclear, but I think my needs are nerdy and unclear by modern standards. However, the combinatorial explosion of options and excessive hands-on-ness is a serious problem which I will not realistically get around to addressing due to my to-do list already being too long.

Pachyderm

Pachyderm is a tool for production data pipelines. If you need to chain together data scraping, ingestion, cleaning, munging, wrangling, processing, modeling, and analysis in a sane way, then Pachyderm is for you. If you have an existing set of scripts which do this in an ad-hoc fashion and you're looking for a way to "productionize" them, Pachyderm can make this easy for you.

  • Containerized: Pachyderm is built on Docker and Kubernetes. Whatever languages or libraries your pipeline needs, they can run on Pachyderm which can easily be deployed on any cloud provider or on prem.
  • Version Control: Pachyderm version controls your data as it’s processed. You can always ask the system how data has changed, see a diff, and, if something doesn’t look right, revert.
  • Provenance (aka data lineage): Pachyderm tracks where data comes from. Pachyderm keeps track of all the code and data that created a result.
  • Parallelization: Pachyderm can efficiently schedule massively parallel workloads.
  • Incremental Processing: Pachyderm understands how your data has changed and is smart enough to only process the new data.
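
A hedged sketch of the moving parts via pachctl, assuming a running cluster; the repo, file, and pipeline-spec names are made up:

$ pachctl create repo sensor-data                             # a versioned data repository
$ pachctl put file sensor-data@master:/day1.csv -f day1.csv   # each put creates a new commit
$ pachctl create pipeline -f clean-pipeline.json              # spec pairs a Docker image with a command to run
$ pachctl list commit sensor-data                             # inspect commit history and provenance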

This is the wrong scale for me, but interesting to see how enterprise might be doing big versions of my little experiments.