Research data sharing



Tips and tricks for collaborative data sharing, e.g. for reproducible research.

Related: the problem of organising data efficiently for the task at hand (for that, see database), and the task of versioning the data. There is also the problem of getting good data for your research project; for that, you probably want a data set repository.

Web data repositories

You’ve finished writing a paper? Congratulations.

Online services to host supporting data from your finished projects in the name of reproducible research. The data gets uploaded once and thereafter is static.

There is not much to this, except that you might want verification of your data – e.g. someone who will vouch that you have not tampered with it after publication. You might also want a persistent identifier such as a DOI so that other researchers can refer to your work in an academe-endorsed fashion.

  • Figshare, which hosts the supporting data for many researchers. It gives you a DOI for your dataset. Up to 5GB. Free.
  • Datadryad gives 10GB of data per DOI, and validates it. USD120/dataset. Free for members.
  • Zenodo is similar. Backed by CERN, on their infrastructure. Uploads get a DOI. Up to 50GB. Free.
  • IEEE Dataport is free for IEEE members and happily hosts 2TB datasets. It gives you a DOI, integrates with many IEEE publications, and allows convenient access from the Amazon cloud via AWS, which might be where your data is anyway. For the beta period (all of 2019) it is free. Thereafter, they charge USD2000 for an open access upload; otherwise only other IEEE Dataport users can get at your data. I know this is not an unusual way for access to journal articles to work, but for data sets it feels like a ham-fisted way of enforcing scarcity.
  • Some campuses offer their own systems, e.g. my university offers resdata.
  • DIY option. You could probably upload your data, if not too large, to GitHub and, for veracity, get a trusted party to cryptographically sign it (see the sketch after this list). Indeed, you could upload it anywhere and get someone to cryptographically sign it. The problem with such DIY solutions is that they are unstable: very few data sets last more than a few years with this kind of setup. Campus web servers shut down, hosting fees go up, and so on. On the plus side, you can make a nice presentational web page explaining everything, with nice formatting for the text and tables and such.
  • Open Science Framework doesn’t host data itself, but it does index data sets in Google Drive or wherever else you keep them, and makes them coherently available to other researchers.
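
As a minimal sketch of the DIY veracity idea above: publish a checksum manifest alongside the data, so that anyone holding a signed copy of the manifest can verify the files have not been tampered with since release. Standard-library Python only; the data directory and manifest name are illustrative:

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large blobs do not fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(data_dir: Path, manifest: Path) -> None:
    """Record a checksum line for every file under the data directory."""
    lines = [
        f"{sha256_of(p)}  {p.relative_to(data_dir)}"
        for p in sorted(data_dir.rglob("*"))
        if p.is_file()
    ]
    manifest.write_text("\n".join(lines) + "\n")


if __name__ == "__main__":
    # Hypothetical layout: ./data holds the release; the manifest sits beside it.
    write_manifest(Path("data"), Path("SHA256SUMS"))
```

Your trusted party then signs the manifest (e.g. `gpg --detach-sign SHA256SUMS`) and you publish the data, manifest and signature together.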

The question that you are asking about all of these, if you are me, is: can I make a nice HTML front-end to my media examples? The answer is, AFAICT, no, except in the DIY option.

Recommendation: if your data is small, make a DIY site for users and also make a Zenodo deposit to host it.
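
If you take the Zenodo half of that advice, the upload can be scripted against Zenodo’s REST deposit API rather than clicked through the web form. A hedged sketch: the token, file name and metadata are placeholders, and you should check the current API docs before trusting the endpoints:

```python
import requests

ZENODO = "https://zenodo.org/api"
TOKEN = "YOUR-ZENODO-TOKEN"  # hypothetical; generate one in Zenodo account settings

# 1. Create an empty deposition to attach files to.
r = requests.post(f"{ZENODO}/deposit/depositions",
                  params={"access_token": TOKEN}, json={})
r.raise_for_status()
dep = r.json()

# 2. Upload the data file into the deposition's file bucket.
with open("dataset.tar.gz", "rb") as fp:  # illustrative file name
    requests.put(f"{dep['links']['bucket']}/dataset.tar.gz",
                 params={"access_token": TOKEN}, data=fp).raise_for_status()

# 3. Attach minimal metadata, then publish, which mints the DOI.
metadata = {"metadata": {
    "title": "Supporting data for my paper",
    "upload_type": "dataset",
    "description": "Raw data and media examples.",
    "creators": [{"name": "Surname, Given"}],
}}
requests.put(f"{ZENODO}/deposit/depositions/{dep['id']}",
             params={"access_token": TOKEN}, json=metadata).raise_for_status()
requests.post(f"{ZENODO}/deposit/depositions/{dep['id']}/actions/publish",
              params={"access_token": TOKEN}).raise_for_status()
```

Publishing makes the files for that version immutable, which is exactly the verification property discussed above.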

If you are sharing data for ongoing collaboration (you are still accumulating data) you might want a different tool, with less focus on DOIs/verification and more on convenient updating and reproducibility of intermediate results.

Realistically, seeing how often data sets are found to be flawed, or how often they can be improved, I’m not especially interested in verifiable one-off spectacular data releases. I’m interested in accessing collaborative, incremental, and improvable data. That is, after all, how research itself progresses.

The next options are solutions to simplify that kind of thing.

DVC

a.k.a. Data Version Control

DVC looks promising. It versions code together with data assets kept in some external data store such as S3, which means they are shareable if you set the permissions right. Read more at DVC/data versioning.
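
Once the data sits in a DVC remote, collaborators can read a particular version of it straight from Python via dvc.api, without pulling the whole data store. A sketch under assumptions: the repo URL, tracked path and tag are all placeholders:

```python
import dvc.api

# Stream one versioned file from a (hypothetical) DVC-tracked git repo.
# rev accepts any git revision: tag, branch or commit hash.
with dvc.api.open(
    "data/measurements.csv",                   # path as tracked in the repo
    repo="https://github.com/you/your-project",
    rev="v1.0",
) as f:
    print(f.readline())  # e.g. inspect the CSV header
```

Because the revision is pinned, a collaborator re-running your analysis gets the same bytes you used, which is most of what reproducibility of intermediate results requires.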

Dat

dat tracks updates to your datasets and shares the updates in a distributed fashion. I would use this for sharing predictably updated research, because I wish to keep the flexibility of updating my data, at the cost of keeping the software running. But if I publish to Zenodo, all my typos and errors are immortalised for all time, so I might be too afraid to ever get around to publishing at all. Read more at Dat/data versioning.

Orbitdb

Not sure yet. TBC.

Qu

Qu publishes any old data from a MongoDB store. MongoDB needs more effort to set up than I am usually prepared to tolerate, and isn’t great for dense binary blobs, which are my stock in trade, so I won’t explore it further.

Misc

Google’s open data set protocol, which they call their “Dataset Publishing Language”, is a standard for medium-size datasets with EZ visualisations.