Research data sharing

May 31, 2017 — January 16, 2023

academe
data sets
how do science
information provenance

Tips and tricks for collaborative data sharing, e.g. for reproducible research.

Related: the problem of organising the data efficiently for the task at hand (for that, see database) and the task of versioning the data. Also related: the problem of finding some good data for your research project, ideally without having to do the research yourself. For some classic datasets that use these data sharing methods (and others), see data sets.


1 Hosted data repositories

You’ve finished writing a paper? Congratulations.

Online services to host supporting data from your finished projects in the name of reproducible research. The data gets uploaded once and thereafter is static.

There is not much to this, except that you might want verification of your data — e.g. someone who will vouch that you have not tampered with it after publication. You might also want a persistent identifier such as a DOI so that other researchers can refer to your work in an academe-endorsed fashion.

  • Figshare, which hosts the supporting data for many researchers. It gives you a DOI for your dataset. Up to 5GB. Free.
  • Zenodo is similar. Backed by CERN, on their infrastructure. Uploads get a DOI. Up to 50GB. Free.
  • IEEE Dataport happily hosts 2TB datasets. It gives you a DOI, integrates with many IEEE publications, and allows convenient access via AWS, which might be where your data lives anyway. They charge USD 2000 for an open-access upload; otherwise only other IEEE Dataport users can get at your data. I know this is not an unusual way for access to journal articles to work, but for data sets it feels like a ham-fisted way of enforcing scarcity, and it is hard to see how they will compete with Zenodo except for the Very Large Dataset users.
  • Datadryad gives 10GB of data per DOI, and validates it. USD 120 per dataset. Free for members, which, on the balance of probability, your institution is not, but why not check?
  • Some campuses offer their own systems, e.g. my university offers resdata.
  • DIY option. You could probably upload your data, if it is not too large, to GitHub and, for veracity, get a trusted party to cryptographically sign it. Or indeed you could upload it anywhere and get someone to cryptographically sign it (a minimal checksum-and-sign sketch follows this list). The problem with such DIY solutions is that they are unstable: few data sets last more than a few years with this kind of setup. Campus web servers shut down, hosting fees go up, etc. On the plus side, you can make a nice presentational web page explaining everything and providing nice formatting for the text and tables and such.
  • Open Science Framework doesn’t host data, but it does index data sets in Google Drive or whatever and makes them coherently available to other researchers.
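To make the DIY verification step concrete, here is a minimal sketch: build a SHA-256 manifest of the released files, then ask the trusted party to sign that manifest (e.g. with a GPG detached signature). The directory and file names are placeholders, not part of any particular service.

```python
# Minimal sketch: build a SHA-256 manifest for a data release.
# A trusted third party can then sign manifest.sha256 (e.g. `gpg --detach-sign`)
# to vouch that the files have not changed since publication.
import hashlib
from pathlib import Path

def sha256sum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large blobs need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def write_manifest(data_dir: str = "release/", manifest: str = "manifest.sha256") -> None:
    """Write 'checksum  relative/path' lines, one per file under data_dir."""
    lines = [
        f"{sha256sum(p)}  {p.relative_to(data_dir)}"
        for p in sorted(Path(data_dir).rglob("*"))
        if p.is_file()
    ]
    Path(manifest).write_text("\n".join(lines) + "\n")

if __name__ == "__main__":
    write_manifest()
```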

The question that you are asking about all of these, if you are me, is: can I make a nice web front-end for my media examples? Can I play my cool movies or audio examples? The answer is, AFAICT, not in general; one would need to build an extra front-end, and even then it might have difficulty streaming video or whatever from the fancy data store. Media streaming is a DIY option.

Recommendation: if your data is small, make a DIY site to show it off for current users, and also make a site on e.g. Zenodo to host it for future users.
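If you want to script the Zenodo half of that recommendation, something like the following works against Zenodo’s documented deposition REST API (the sandbox instance is handy for testing). The token, file name, and metadata below are placeholders, not anything specific to my setup.

```python
# Sketch: push one file into a new (unpublished) Zenodo deposition via the REST API.
# Assumes a personal access token with deposit permissions; token and file names
# are placeholders.
import requests

ZENODO = "https://zenodo.org/api"   # use https://sandbox.zenodo.org/api for testing
TOKEN = "YOUR-ZENODO-TOKEN"
params = {"access_token": TOKEN}

# 1. Create an empty deposition; the response includes a file "bucket" URL.
dep = requests.post(f"{ZENODO}/deposit/depositions", params=params, json={}).json()

# 2. Stream the data file into the bucket.
with open("results.h5", "rb") as fh:
    requests.put(f"{dep['links']['bucket']}/results.h5", data=fh, params=params)

# 3. Attach minimal metadata; publish later (POST .../actions/publish) when ready.
metadata = {"metadata": {
    "title": "Supporting data for my paper",
    "upload_type": "dataset",
    "creators": [{"name": "Lastname, Firstname"}],
    "description": "Raw results and processing scripts.",
}}
requests.put(f"{ZENODO}/deposit/depositions/{dep['id']}", params=params, json=metadata)
print("Draft deposition at:", dep["links"]["html"])
```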

If you are sharing data for ongoing collaboration (the experiments are still accumulating data) you might want a different tool, with less focus on DOIs/verification and more on convenient updating and reproducibility of intermediate results.

Realistically, seeing how often data sets are found to be flawed, or how often they can be improved, I’m not especially interested in verifiable one-off spectacular data releases. I’m interested in accessing collaborative, incremental, and improvable data. That is, after all, how research itself progresses.

The next options are solutions to simplify that kind of thing.

2 Dataverse

Dataverse is an open-source data storage/archive system, hosted by some large partners.

TBC; I’m using this right now but have little time to say things.

For uploading serious data, the DVUploader app is best. Downloads are here, but it takes a little work to find the manual. It supports useful stuff like “direct upload” mode (sending files directly to the backend store instead of via the Dataverse frontend), which is an order of magnitude faster and more reliable than the indirect alternative.

The Python API pydataverse is not great at the time of writing: too much uploading and downloading of data, and file sizes are capped at 2GB with the default Python distribution… Most things are just about as easy if we use curl commands from the command line, and some things are impossible with Python.
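For reference, the native API calls in question are plain HTTP, so you can also drive them from requests rather than curl. A hedged sketch of adding one file to an existing dataset follows; the installation URL, DOI, and token are placeholders, and for very large files DVUploader’s direct-upload mode remains the better choice.

```python
# Sketch of the native Dataverse "add file to dataset" call that pydataverse wraps.
# Server URL, DOI, and token below are placeholders.
import json
import requests

SERVER = "https://dataverse.example.edu"
DOI = "doi:10.70122/FK2/EXAMPLE"
TOKEN = "YOUR-API-TOKEN"

with open("observations.parquet", "rb") as fh:
    r = requests.post(
        f"{SERVER}/api/datasets/:persistentId/add",
        params={"persistentId": DOI},
        headers={"X-Dataverse-key": TOKEN},
        files={
            "file": ("observations.parquet", fh),
            # jsonData is sent as an ordinary form field, as with curl -F 'jsonData=...'
            "jsonData": (None, json.dumps({"description": "Raw observations", "restrict": False})),
        },
    )
r.raise_for_status()
print(r.json()["status"])
```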

3 Globus

Not sure.

4 dolthub

dolthub is the collaborative/sharey arm of dolt, the versioning database for relational data. I presume that means it is implicitly a data sharing system.

5 Xethub

We’re excited to announce the public beta of XetHub, a collaborative storage platform for data management. XetHub aims to address each of the above requirements head-on towards our end goal: to make working with data as fast and collaborative as working with code.[…]

With XetHub, users can run the same flows and commands they already use for code (e.g., commits, pull requests, history, and audits) with repositories of up to 1 TB. Our Git-backed protocol allows easy integration with existing workflows, with no need to reformat files or adopt heavyweight data ecosystems, and also allows for incremental difference tracking on compatible data types.

6 DVC

a.k.a. Data Version Control

DVC looks like it gets us data sharing as a side effect. It versions code together with data assets in some external data store like S3 or whatever, which means they are shareable if you set the permissions right (see the sketch below). Read more at DVC/data versioning.
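As a taste of the sharing side: collaborators with read access to the remote can pull individual files straight out of a DVC-tracked repository with its Python API. The repository URL, tag, and file path below are hypothetical.

```python
# Sketch: read a DVC-tracked file directly from someone else's repository,
# without cloning it or running `dvc pull`. Works only if you have read access
# to the underlying remote (e.g. the S3 bucket). Repo URL and path are hypothetical.
import dvc.api

with dvc.api.open(
    "data/training_set.csv",
    repo="https://github.com/example-lab/example-project",
    rev="v1.2.0",  # any git ref: tag, branch, or commit
) as f:
    header = f.readline()
    print(header)
```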

7 Dat

dat tracks updates to your datasets and shares the updates in a distributed fashion. I would use this for sharing predictably updated research, because I wish to have the flexibility of updating my data, at the cost of keeping the software running. If I publish to Zenodo, by contrast, all my typos and errors are immortalised for all time, so I might be too afraid to ever get around to publishing at all. Read more at Dat/data versioning.

8 Orbitdb

Not sure yet. TBC.

9 Qu

Qu publishes any old data from a MongoDB store. MongoDB needs more effort to set up than I am usually prepared to tolerate, and isn’t great for dense binary blobs, which are my stock in trade, so I won’t explore that further.

10 Incoming

Google’s open data set protocol, which they call their “Dataset Publishing Language”, is a standard for medium-size datasets with EZ visualisations.

  • Open Science Framework seems to strive to be github-for-preserving-data-assets. TODO.

  • rOpensci provides a number of open data set importers that work seamlessly. They are a noble standard to aim for in your own data publishing efforts.

  • Dan Hopkins and Brendan Nyhan on How to make scientific research more trustworthy.

  • CKAN “is a powerful data management system that makes data accessible — by providing tools to streamline publishing, sharing, finding and using data. CKAN is aimed at data publishers (national and regional governments, companies and organizations) wanting to make their data open and available.”

    • Seems to have a data table library called recline.