Data versioning
March 11, 2020 — September 7, 2022
Keeping track not only of changes in code but also of changes in data. This is one of the things we might need to do these days alongside configuring ML, tracking progress, optimizing hyperparameters etc.
There is a snake nest of tangled problems in data versioning: datasets can be big, datasets are encoded in weird ways, data might not be local files, datasets might be used by many people simultaneously, and datasets are coupled tightly with data-cleaning procedures, which are usually code, even though data is not updated as often as code is.
How do you get the modern affordances of source code management for datasets?
Here are some tools which variously solve some of these problems.
1 DVC
“Data Version Control”.
DVC looks hip and solves some problems related to experiment tracking. It versions code together with data assets stored in some external data store such as S3 or whatever.
DVC runs on top of any Git repository and is compatible with any standard Git server or provider (Github, Gitlab, etc). Data file contents can be shared by network-accessible storage or any supported cloud solution. DVC offers all the advantages of a distributed version control system — lock-free, local branching, and versioning.
The single `dvc repro` command reproduces experiments end-to-end. DVC guarantees reproducibility by consistently maintaining a combination of input data, configuration, and the code that was initially used to run an experiment.
It resembles git-lfs, which is the classic Git method of dealing with Really Big Files, and perhaps also git-annex, a Big File handler built on Git. However, DVC puts these mechanisms at the service of reproducible and easily distributed experiments. There is a GitHub overlay, DagsHub, which specialises in hosting DVC projects.
Perhaps in practice we should also think of DVC as an experiment-tracking tool?
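For orientation, here is a minimal sketch of the basic DVC loop, assuming a Git repository and an S3 bucket (the bucket name and paths are made up):

```bash
# Initialise DVC inside an existing Git repository
git init
dvc init

# Point DVC at an external data store (bucket name is hypothetical)
dvc remote add -d storage s3://my-bucket/dvc-store

# Track a large data directory with DVC rather than Git
dvc add data/raw
git add data/raw.dvc data/.gitignore .dvc/config
git commit -m "Track raw data with DVC"

# Upload the data content to the remote; collaborators fetch it with `dvc pull`
dvc push

# Re-run the pipeline end-to-end (stages defined in dvc.yaml)
dvc repro
```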
2 Renku
Renku is a fairly general open-source reproducible data science platform.
2.1 Renku features: Empowering all stages of your work
Renku gives you tools and functionality for each stage of the data science lifecycle: from datasets to workflow execution.
2.1.1 Versioned Data
Renku Datasets equip your files with versioning and metadata.
2.1.2 Interactive Computing
Access free computing resources directly in the browser with familiar front-ends like Jupyter, RStudio, and VSCode.
2.1.3 Automatic Provenance
Track inputs and outputs easily without having to learn a new workflow language.
2.1.4 Version Control by Default
Leverage Renku’s GitLab instance to automatically version your project’s files.
2.1.5 Containers as Standard
Access a maintained stack of Docker images and project templates which ensure computational reproducibility.
2.1.6 Reusable Workflows
Flexibly track your commands and reuse them as templates with different inputs or parameters.
2.2 Renku Use Cases: Built to be versatile
2.2.1 Collaborative Scientific Research
Ensure computational reproducibility between you and your colleagues throughout the entire scientific process.
2.2.2 Teach a Class or Workshop
Access project templates in Python, R, Julia (and more!) out of the box, or create your own template to share with students.
They can work together in the browser in or out of class.
2.2.3 Build, execute, and track workflows
Automate processes and follow them in real time. Rest easy, as re-executions are reproducible given the same computational environment.
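For flavour, here is a rough sketch of what this looks like from the renku CLI; the dataset, file and script names are made up, and exact flags may differ between versions:

```bash
# Inside an existing Renku project:
# create a versioned dataset and add a file to it (URL is hypothetical)
renku dataset create weather-obs
renku dataset add weather-obs https://example.com/stations.csv

# Run a command under Renku so its inputs and outputs are recorded as provenance
renku run python clean.py data/weather-obs/stations.csv stations_clean.csv

# Re-execute downstream steps when their inputs have changed
renku update
```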
3 datalad
Building on top of Git and git-annex, DataLad allows you to version control arbitrarily large files in datasets, without the need for custom data structures, central infrastructure, or third party services.
- Track changes to your data
- Revert to previous versions
- Capture full provenance records
- Ensure complete reproducibility
A DataLad dataset is a directory with files, managed by DataLad. You can link other datasets, known as subdatasets, and perform commands recursively across an arbitrarily deep hierarchy of datasets. This helps you to create structure while maintaining advanced provenance capture abilities, versioning, and actionable file retrieval.
DataLad lets you consume datasets provided by others, and collaborate with them. You can install existing datasets and update them from their sources, or create sibling datasets that you can publish updates to and pull updates from. The collaborative power of Git, for your data.
DataLad is integrated with a variety of hosting services and data management platforms, and extended and used by a diverse community. Export datasets to third party services such as GitHub or Figshare with built-in commands. Extend DataLad to be compatible with your preferred data supplier or workflow. Or use a multitude of other DataLad-compatible services such as Dropbox or Amazon S3. Search through all integrations, extensions, and use cases to find the right fit for your data!
I think we could read that as “a friendly Python frontend to git-annex”.
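A minimal sketch of the basic DataLad moves (the file names and clone URL are illustrative):

```bash
# Create a new dataset (a Git/git-annex repository managed by DataLad)
datalad create my-dataset
cd my-dataset

# Add data and record it in the dataset's history
cp /somewhere/observations.nc .
datalad save -m "Add raw observations"

# Record a computation together with its inputs and outputs (provenance capture)
datalad run -m "Clean observations" \
  --input observations.nc --output clean.nc \
  "python clean.py observations.nc clean.nc"

# Elsewhere: clone the dataset, then fetch only the file content you need
datalad clone https://example.com/my-dataset.git
datalad get clean.nc
```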
4 git-annex
It is not targeted specifically at data science people but is much broader.
git-annex supports explicit and customisable folder-tree synchronization, merging and sneakernets, and as such I am well disposed toward it. You can choose to have things in various stores, and to copy files to and from servers or disks as they become available. It doesn’t support iOS, and Windows support is experimental. Granularity is per-file. It has a weird symlink-based file-access scheme which might be inconvenient for many uses. (I’m imagining this is trouble for Microsoft Word or whatever.)
Also, do you want to invoke various disk-online-disk-offline-how-sync-when options from the command line, or do you want stuff to magically replicate itself across some machines without requiring you to remember the correct incantation on a regular basis?
The documentation is nerdy and unclear, but I think my needs are nerdy and unclear by modern standards. However, the combinatorial explosion of options and excessive hands-on-ness is a serious problem which I will not realistically get around to addressing due to my to-do list already being too long.
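A minimal sketch of shuffling a large file around with git-annex (the remote name is made up and assumed to be configured already):

```bash
# Turn a Git repository into an annex
git init project && cd project
git annex init "laptop"

# Add a large file: Git tracks a symlink, git-annex manages the content
git annex add data/big_scan.tiff
git commit -m "Add big scan via git-annex"

# Copy the content to another configured remote (here called `nas`),
# then drop the local copy to free space
git annex copy data/big_scan.tiff --to=nas
git annex drop data/big_scan.tiff

# Ask where the content lives, and fetch it back when needed
git annex whereis data/big_scan.tiff
git annex get data/big_scan.tiff
```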
5 Go Get Data
GGD comes from the genomics community. It seems to be a system for fetching and processing data based on code recipes. The combination of raw data URL plus recipe plus some caching is what gives you the dataset you actually use.
Go Get Data (ggd) is a data management system that provides access to data packages containing auto-curated genomic data. ggd data packages contain all necessary information for data extraction, handling, and processing. With a growing number of scientific datasets, ggd provides access to these datasets without the hassle of finding, downloading, and processing them yourself. ggd leverages the conda package management system and the infrastructure of Bioconda to provide a fast and easy way to retrieve processed annotations and datasets, supporting data provenance, and providing a stable source of reproducibility. Using the ggd data management system allows any user to quickly access all desired datasets, manage that data within an environment, and provides a platform upon which to cite data access and use by way of the ggd data package name and version.
ggd consists of:
- a repository of data recipes hosted on Github
- a command line interface (cli) to communicate with the ggd ecosystem
- a continually growing list of genomic recipes to provide quick and easy access to processed genomic data using the ggd cli tool
This strikes me as an elegant solution, applicable far beyond genomics. It seems to be a lighter-weight version of Pachyderm.
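A rough sketch of what usage looks like; the recipe name is hypothetical and I have not verified these subcommands against the current release:

```bash
# Search the recipe repository for a dataset
ggd search reference genome

# Install a data recipe into the active conda environment
# (recipe name is illustrative)
ggd install hg38-reference-genome-v1

# List the ggd data packages installed in this environment
ggd list
```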
6 Splitgraph
Splitgraph is a data management, building and sharing tool inspired by Docker and Git that works on top of PostgreSQL and integrates seamlessly with anything that uses PostgreSQL.
Splitgraph allows the user to manipulate data images (snapshots of SQL tables at a given point in time) as if they were code repositories by versioning, pushing and pulling them. It brings the best parts of Git and Docker, tools well-known and loved by developers, to data science and data engineering, and allows users to build and manipulate datasets directly on their database using familiar commands and paradigms.
It works on top of PostgreSQL and uses SQL for all versioning and internal operations. You can “check out” data into actual PostgreSQL tables, offering read/write performance and feature parity with PostgreSQL and allowing you to query it with any SQL client. The client application has no idea that it’s talking to a Splitgraph table and you don’t need to rewrite any of your tools to use Splitgraph. Anything that works with PostgreSQL will work with Splitgraph.
Splitgraph also defines the declarative Splitfile language with Dockerfile-like caching semantics that allows you to build Splitgraph data images in a composable, maintainable and reproducible way. When you build data with Splitfiles, you get provenance tracking. You can inspect an image’s metadata to find the exact upstream images, tables and columns that went into it. With one command, Splitgraph can use this provenance data to rebuild an image against a newer version of its upstream dependencies. You can easily integrate Splitgraph into your existing CI pipelines, to keep your data up-to-date and stay on top of changes to its inputs.
You do not need to download the full Splitgraph image to query it. Instead, you can query Splitgraph images with layered querying, which will download only the regions of the table relevant to your query, using bloom filters and other metadata. This is useful when you’re exploring large datasets from your laptop, or when you’re only interested in a subset of data from an image. This is still completely transparent to the client application, which sees a PostgreSQL schema that it can talk to using the Postgres wire protocol.
Splitgraph does not limit your data sources to Postgres databases. It includes first-class support for importing and querying data from other databases using Postgres foreign data wrappers. You can create Splitgraph images or query data in MongoDB, MySQL, CSV files or other Postgres databases using the same interface.
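A rough sketch of the Git-like workflow via the sgr command-line client; repository and table names are illustrative, and I have not verified the exact invocations:

```bash
# Clone a data image from a registry, then check a tag out into real
# PostgreSQL tables (repository name is illustrative)
sgr clone example/demo-weather
sgr checkout example/demo-weather:latest

# Query the checked-out schema with any ordinary PostgreSQL client
# (connection details omitted; the table name is made up)
psql -c 'SELECT count(*) FROM "example/demo-weather".observations;'

# Commit local changes to the tables as a new image, and push it
sgr commit example/demo-weather
sgr push example/demo-weather
```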
7 Pangeo Forge
Possibly interesting: pangeo-forge/roadmap, the Pangeo Forge public roadmap.
Pangeo Forge is inspired to copy the very successful pattern of Conda Forge. Conda Forge makes it easy for anyone to create a conda package, a binary software package that can be installed with the conda package manager. In Conda Forge, a maintainer contributes a recipe which is used to generate a conda package from a source code tarball. Behind the scenes, CI downloads the source code, builds the package, and uploads it to a repository. By automating the difficult parts of package creation, Conda Forge has enabled the open-source community to collaboratively maintain a huge and dynamic library of software packages.
8 Sno
Sno stores geospatial and tabular data in Git, providing version control at the row and cell level.
- Built on Git, works like Git
- Uses standard Git repositories and Git-like CLI commands. If you know Git, you’ll feel right at home with Sno.
- Supports current GIS workflows
- Provides repository working copies as GIS databases and files. Edit directly in common GIS software without plugins.
This is a neat approach if your Git hosting can cope with a repository that large, I suppose.
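A rough sketch of the intended workflow as I understand it; the repository URL is made up and I have not checked these commands against a current Sno release:

```bash
# Clone a Sno repository; the working copy is a GeoPackage you can open in QGIS
sno clone https://example.com/roads-dataset.git
cd roads-dataset

# Edit features in your usual GIS software, then inspect and commit the changes
sno status
sno diff
sno commit -m "Realign roads after survey"
```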
9 Dolt
Dolt is a SQL database that you can fork, clone, branch, merge, push and pull just like a git repository. Connect to Dolt just like any MySQL database to run queries or update the data using SQL commands. Use the command line interface to import CSV files, commit your changes, push them to a remote, or merge your teammate’s changes.
All the commands you know for Git work exactly the same for Dolt. Git versions files, Dolt versions tables. It’s like Git and MySQL had a baby.
We also built DoltHub, a place to share Dolt databases. We host public data for free. If you want to host your own version of DoltHub, we have DoltLab. If you want us to run a Dolt server for you, we have Hosted Dolt.
It has a twin project, DoltHub, which is the GitHub of Dolts, i.e. data-sharing infrastructure.
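A minimal sketch of the Git-style loop in Dolt (table, value and branch names are made up):

```bash
# Create a new Dolt database in the current directory
dolt init

# Tables play the role that files play in Git
dolt sql -q "CREATE TABLE measurements (id INT PRIMARY KEY, value DOUBLE);"
dolt add measurements
dolt commit -m "Create measurements table"

# Branch, edit the data, and merge, just as you would with source code
dolt checkout -b calibration-fix
dolt sql -q "INSERT INTO measurements VALUES (1, 0.97);"
dolt add measurements
dolt commit -m "Add calibrated value"

# The default branch is `main` or `master` depending on the Dolt version
dolt checkout main
dolt merge calibration-fix
```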
10 Intake
Intake TBC
11 Pachyderm
Pachyderm is a tool for production data pipelines. If you need to chain together data scraping, ingestion, cleaning, munging, wrangling, processing, modelling, and analysis in a sane way, then Pachyderm is for you. If you have an existing set of scripts which do this in an ad-hoc fashion and you’re looking for a way to “productionize” them, Pachyderm can make this easy for you.
- Containerized: Pachyderm is built on Docker and Kubernetes. Whatever languages or libraries your pipeline needs, they can run on Pachyderm, which can easily be deployed on any cloud provider or on-prem.
- Version Control: Pachyderm version controls your data as it’s processed. You can always ask the system how data has changed, see a diff, and, if something doesn’t look right, revert.
- Provenance (aka data lineage): Pachyderm tracks where data comes from. Pachyderm keeps track of all the code and data that created a result.
- Parallelization: Pachyderm can efficiently schedule massively parallel workloads.
- Incremental Processing: Pachyderm understands how your data has changed and is smart enough to only process the new data.
This is the wrong scale for me, but interesting to see how enterprise might be doing big versions of my little experiments.
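For a sense of the workflow, a minimal pachctl sketch against a running cluster; the repo, file and pipeline names are illustrative:

```bash
# Create a versioned data repository and commit a file into it
pachctl create repo images
pachctl put file images@master:photo1.png -f ./photo1.png

# Create a pipeline from a spec file (pipeline.json is hypothetical);
# Pachyderm then runs the pipeline's container over each new commit
pachctl create pipeline -f pipeline.json

# Inspect the resulting commits
pachctl list commit images
```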