Tips and tricks for collaborative data sharing, e.g. for reproducible research.
Related: the problem of organising data efficiently for the task at hand (for that, see database) and the problem of versioning the data. Also related: finding good data for your research project, ideally without having to do the research yourself. For some classic datasets that use these data-sharing methods (and others), see data sets.
Web data repositories
You’ve finished writing a paper? Congratulations.
Online services to host supporting data from your finished projects in the name of reproducible research. The data gets uploaded once and thereafter is static.
There is not much to this, except that you might want verification of your data — e.g. someone who will vouch that you have not tampered with it after publication. You might also want a persistent identifier such as a DOI so that other researchers can refer to your work in an academe-endorsed fashion.
- Figshare, which hosts the supporting data for many researchers. It gives you a DOI for your dataset. Up to 5GB. Free.
- Zenodo is similar. Backed by CERN, on their infrastructure. Uploads get a DOI. Up to 50GB. Free.
- IEEE Dataport happily hosts 2TB datasets. It gives you a DOI, integrates with many IEEE publications, and allows convenient access from the Amazon cloud via AWS, which might be where your data is anyway. They charge USD2000 for an open-access upload; otherwise only other IEEE Dataport users can get at your data. I know this is not an unusual way for access to journal articles to work, but for datasets it feels like a ham-fisted way of enforcing scarcity, and it is hard to see how they will compete with Zenodo except for very large datasets.
- Datadryad gives 10GB of data per DOI and validates it. USD120 per dataset; free for member institutions which, on balance of probability, yours is not, but why not check?
- Some campuses offer their own systems, e.g. my university offers resdata.
- DIY option. You could probably upload your data, if not too large, to GitHub, and for veracity get a trusted party to cryptographically sign it. Or indeed you could upload it anywhere and get someone to cryptographically sign it. The problem with such DIY solutions is that they are unstable: very few data sets last more than a few years with this kind of setup. Campus web servers shut down, hosting fees go up, and so on. On the plus side, you can make a nice presentational web page explaining everything and providing nice formatting for the text and tables and such.
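The "sign it for veracity" step in the DIY option usually boils down to publishing a cryptographic digest of the archive alongside it (and having your trusted party sign that digest, e.g. with GPG). A minimal sketch of the digest part in Python; the filename is hypothetical:

```python
import hashlib


def sha256_file(path, chunk_size=1 << 20):
    """Compute the SHA-256 digest of a file, reading it in chunks so even
    large dataset archives need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


# Hypothetical usage: checksum the archive before uploading it anywhere,
# then publish (or have someone sign) the resulting hex digest.
# print(sha256_file("mydataset-v1.tar.gz"))
```

Anyone who downloads the archive later can recompute the digest and compare; any post-publication tampering changes it.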
- Open Science Framework doesn’t host data itself, but it does index datasets in Google Drive or wherever and makes them coherently available to other researchers.
The question you are asking about all of these, if you are me, is: can I make a nice HTML front-end to my media examples? The answer is, AFAICT, no; only the DIY option allows that.
Recommendation: if your data is small, make a DIY site for users and also make a Zenodo deposit to host it.
If you are sharing data for ongoing collaboration (you are still accumulating data) you might want a different tool, with less focus on DOIs/verification and more on convenient updating and reproducibility of intermediate results.
Realistically, seeing how often data sets are found to be flawed, or how often they can be improved, I’m not especially interested in verifiable one-off spectacular data releases. I’m interested in accessing collaborative, incremental, and improvable data. That is, after all, how research itself progresses.
The next options are solutions to simplify that kind of thing.
a.k.a. data-science version control
dat tracks updates to your datasets and shares them in a distributed fashion. I would use this for sharing regularly updated research, because I want the flexibility of updating my data, at the cost of keeping the software running. If I publish to Zenodo, by contrast, all my typos and errors are immortalised for all time, so I might be too afraid to ever get around to publishing at all. Read more at Dat/data versioning.
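The underlying idea that makes dat-style tools work for incremental data is content addressing: each revision of a dataset is identified by a hash of its content, so updates get new identifiers while old ones remain citable. A toy sketch of that idea (not dat's actual format or API):

```python
import hashlib


class DatasetLog:
    """Toy append-only version log. Each committed revision is identified
    by the SHA-256 of its bytes, so silently editing a published revision
    is impossible: any change yields a different identifier.
    Illustrates the content-addressing idea only, not dat itself."""

    def __init__(self):
        self.revisions = []  # (digest, content) pairs, oldest first

    def commit(self, content: bytes) -> str:
        """Record a new revision and return its content-derived identifier."""
        digest = hashlib.sha256(content).hexdigest()
        self.revisions.append((digest, content))
        return digest

    def latest(self) -> bytes:
        """Return the content of the most recent revision."""
        return self.revisions[-1][1]


log = DatasetLog()
v1 = log.commit(b"col_a,col_b\n1,2\n")
v2 = log.commit(b"col_a,col_b\n1,2\n3,4\n")  # fixing or extending the data gets a new id
```

The payoff for collaborative, improvable data is that a paper can cite a specific revision identifier while the dataset itself keeps moving.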
Not sure yet. TBC.
Qu publishes any old data from a MongoDB store. MongoDB needs more effort to set up than I am usually prepared to tolerate, and it isn’t great for dense binary blobs, which are my stock in trade, so I won’t explore it further.
Google’s open dataset protocol, which they call their “Dataset Publishing Language”, is a standard for medium-sized datasets with EZ visualisations.
- Open Science Framework seems to strive to be GitHub-for-preserving-data-assets. TODO.
- rOpensci provides a number of open data set importers that work seamlessly. They are a noble target for your own data publishing efforts.
- Dan Hopkins and Brendan Nyhan on How to make scientific research more trustworthy.