Reproducible research

Open notebooks etc

June 7, 2016 — October 1, 2024

academe
collective knowledge
how do science
information provenance
workflow

Philosophies of how to share your methods for the purpose of Doing Science. More technical details are under build/pipelines tools, experiment tracking, scientific workbooks, and academic software engineering. There should probably be some continuous integration system mentioned here, but maybe later; we are taking baby steps. Also, as will be obvious, my knowledge is about machine learning research, not actual laboratory science, but many of the same tools and approaches apply.

The painful process of journals and article validation is under academic publishing.

1 Actually existing reproducible research tools


Basic steps toward reproducible research. Also useful: scientific workbooks and build/pipelines tools. The fraught process of getting stuff into journals is under academic publishing. Hypergraph is a vaunted new experiment-and-analysis tracking system that promises some collaborative tools. I have not yet tried it.

2 Peer review

Replication markets claims to provide prediction markets on whether experiments will replicate, which supposedly creates incentives to produce more replicable research.

The Knowledge Repository is one open workflow-sharing platform. Motivation:

[…] our process combines the code review of engineering with the peer review of academia, wrapped in tools to make it all go at startup speed. As in code reviews, we check for code correctness and best practices and tools. As in peer reviews, we check for methodological improvements, connections with preexisting work, and precision in expository claims. We typically don’t aim for a research post to cover every corner of investigation, but instead prefer quick iterations that are correct and transparent about their limitations.

3 Open notebooks

Figure 3: Tom Gauld, Hazard labels used in our laboratory

What do you get when you take your scientific workbooks and publish them online along with the data? An open notebook! These are huge in the machine-learning pedagogy world right now, and small-to-medium in the applied-machine-learning world, especially at the recruitment end of it. They are noticeable but rare, AFAICS, in the rest of the world.
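To make the mechanics concrete, here is a minimal sketch (not anyone's canonical recipe) of re-executing a Jupyter notebook top-to-bottom and exporting it to HTML for publishing, using nbformat and nbconvert; the filenames are hypothetical:

```python
# Minimal sketch: re-run a notebook, then export it to HTML for publishing.
# Filenames are hypothetical.
import nbformat
from nbconvert import HTMLExporter
from nbconvert.preprocessors import ExecutePreprocessor

nb = nbformat.read("analysis.ipynb", as_version=4)

# Re-executing before export catches stale, out-of-order cell state.
ExecutePreprocessor(timeout=600, kernel_name="python3").preprocess(
    nb, {"metadata": {"path": "."}}
)

body, _resources = HTMLExporter().from_notebook_node(nb)
with open("analysis.html", "w", encoding="utf-8") as f:
    f.write(body)
```

Re-executing before export is the point: a notebook that only runs in the order you happened to click is not reproducible for anyone else.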

If you want in-depth justifications for open notebooks, see Caleb McDaniel or Ben Marwick’s slide deck.

I’m interested in this because it seems like the actual original pitch for how scientific research was supposed to work, with rapid improvement upon each other’s ideas. Whether I get around to fostering such stuff, despite the fact that it is not valued by my employer, is the question.

3.1 Examples of online notebooks

4 Open process

ResearchEquals is an open-access publishing platform, launched at ResearchEquals.com. Its pitch:

Publish each step

You produce vital outputs at every research step. Why let them go unpublished?

Publish your text, data, code, or anything else you struggle to publish in articles.

Each step gets a DOI. Link them all together to document a journey.
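Since each step carries a DOI, the resulting chain is machine-readable with ordinary DOI infrastructure. A hedged sketch of resolving a step's DOI to citation metadata via doi.org content negotiation (generic DOI tooling, not a ResearchEquals-specific API):

```python
# Resolve a DOI to machine-readable CSL-JSON metadata via doi.org
# content negotiation. This is standard DOI infrastructure.
import requests

def doi_metadata(doi: str) -> dict:
    """Fetch CSL-JSON metadata for a DOI from doi.org."""
    resp = requests.get(
        f"https://doi.org/{doi}",
        headers={"Accept": "application/vnd.citationstyles.csl+json"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

# e.g. meta = doi_metadata("10.nnnn/xxxx")  # substitute a real step DOI
```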

5 Executable papers

See executable papers.

6 Replication markets

Nifty idea. No time to explain right now, but check out the worked example How I Made $10k Predicting Which Studies Will Replicate.

7 For ML in particular

See reproducible ML.

8 Containerized workflow

Docker is designed for reproducible deployment, which makes it an approximate fit for reproducible research. See docker for reproducible research.
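The basic pattern, sketched in Python for flavour (the image, paths, and script name are hypothetical), is to run the analysis inside a pinned container image so the software environment travels with the code:

```python
# Sketch: run the analysis inside a pinned container image, so the
# software environment travels with the code. Image and script names
# are hypothetical.
import os
import subprocess

subprocess.run(
    [
        "docker", "run", "--rm",
        "-v", f"{os.getcwd()}:/work",  # mount the project into the container
        "-w", "/work",                 # run from the project root
        "python:3.11-slim",            # the pinned environment
        "python", "analysis.py",
    ],
    check=True,
)
```

Pinning an exact digest (image@sha256:...) is stricter than a mutable tag, at the cost of convenience.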

9 Build tools

A reproducible experiment is closely coupled to build tools, which recreate all the (possibly complicated and lengthy) steps. Some of the build tools I document have reproducibility as a primary focus, notably DVC, drake, lancet, and pachyderm.
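For flavour, a hedged sketch with DVC's Python API: reading a tracked data file as it existed at a specific revision, so an analysis is pinned to exact inputs. The repo URL, file path, and tag here are hypothetical:

```python
# Sketch with DVC's Python API: read a tracked data file at a specific
# revision, pinning the analysis to exact inputs. Repo, path, and tag
# are hypothetical.
import dvc.api

with dvc.api.open(
    "data/measurements.csv",
    repo="https://github.com/example/project",
    rev="v1.0",  # a git tag or commit identifying the experiment
) as f:
    for line in f:
        ...  # parse and analyse
```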

10 Sundry data sharing ideas

See Data sharing.

11 Collaboration

Code Ocean seems to be targeting this use case.

For the first time, researchers, engineers, developers and scientists can upload code and data in any open source programming language and link working code in a computational environment with the associated article for free. We assign a Digital Object Identifier (DOI) to the algorithm, providing correct attribution and a connection to the published research.

The platform provides open access to the published software code and data to view and download for everyone for free. But the real treat is that users can execute all published code without installing anything on their personal computer. Everything runs in the cloud on CPUs or GPUs according to the user needs. We make it easy to change parameters, modify the code, upload data, run it again, and see how the results change.

They also ran a workshop on this.

Possibly Sylabs cloud is a similar project?

Nextjournal, a hybrid environment, might also be this; it is a collaborative coding machine that claims to make it easy for you and your colleagues to write in a workbook style together, and it uses containerised environments under the hood.

Less code-obsessed but possibly related, the Open Science Framework describes itself thus:

OSF is a free and open source project management tool that supports researchers throughout their entire project lifecycle.

As a collaboration tool, OSF helps research teams work on projects privately or make the entire project publicly accessible for broad dissemination. As a workflow system, OSF enables connections to the many products researchers already use, streamlining their process and increasing efficiency.
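OSF also has an API; a hedged sketch using the third-party osfclient package to list the files of a public project (the project id is a placeholder):

```python
# Sketch: list the files in a public OSF project with osfclient.
# The project id is a placeholder.
from osfclient import OSF

osf = OSF()  # anonymous access suffices for public projects
project = osf.project("abcde")  # substitute a real 5-character OSF id
storage = project.storage("osfstorage")

for f in storage.files:
    print(f.path)
```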

12 Communities and organisations

rOpensci

…fosters a culture that values open and reproducible research using shared data and reusable software.

We do this by:

  • Creating technical infrastructure in the form of carefully vetted, staff- and community-contributed R software tools that lower barriers to working with scientific data sources on the web
  • Creating social infrastructure through a welcoming and diverse community
  • Making the right data, tools and best practices more discoverable
  • Building capacity of software users and developers and fostering a sense of pride in their work
  • Promoting advocacy for a culture of data sharing and reusable software.

rOpenSci is a non-profit initiative founded in 2011 by Karthik Ram, Scott Chamberlain, and Carl Boettiger to make scientific data retrieval reproducible. Over the past seven years we have developed an ecosystem of open source tools; we run annual unconferences and review community-developed software.

NumFOCUS is a foundation supporting, in particular, open code for better science, via various interesting projects.

Dr Ulrich Schimmack’s Blog about Replicability is a readable explanation of which results are reproducible, from a guy who does lots of meta-analyses.


13 Incoming
