Software engineering for scientists

June 7, 2016 — September 16, 2021


1 Development practice

Placeholder: most of us are, at best, doing harm minimisation for our harmful software practices.

Philosophy from Gaël Varoquaux: Please destroy this software after publication.

Collaborative textbook The Turing Way:

The Turing Way is an open source community-driven guide to reproducible, ethical, inclusive and collaborative data science.

Our goal is to provide all the information that data scientists in academia, industry, government and in the third sector need at the start of their projects to ensure that they are easy to reproduce and reuse at the end.

Development practice is coupled with build tools, and probably also with scientific workbooks and experiment tracking.

2 Executable papers

A very direct kind of reproducible research: papers which […]

  • groundai

    tl;dr: This renders papers in a friendly format for public annotation, and makes it easy to link to related papers, supporting data, etc.

    Aims to affect both discovery and publishing, by providing community peer review.

    The potential of Community Peer Review

    Commenting will provide a way for researchers to ask for feedback about their work, then incorporating this feedback into revisions and to generate new ideas.

    Making this feedback openly accessible to everyone can help increase the public’s understanding and trust of scientific work and increase transparency.

    Having community support and a dialogue atmosphere inspire ideas to flow and be explored freely through insightful questions. In dialogue, people think together.

    […] Preprints discussions usually happen on twitter and facebook, but these comments are not housed with the preprint. We believe having the opportunity to provide feedback that is stored directly with the preprint will increase transparency and collaboration at all stages of the scientific process. We hope to see the dialogue becomes a part of the scholarly record.

  • gitxiv (source)

    In recent years, a highly interesting pattern has emerged: Computer scientists release new research findings on arXiv and just days later, developers release an open-source implementation on GitHub. This pattern is immensely powerful. […]

    GitXiv is a space to share links to open computer science projects. Countless Github and arXiv links are floating around the web. It’s hard to keep track of these gems. GitXiv attempts to solve this problem by offering a collaboratively curated feed of projects. Each project is conveniently presented as arXiv + Github + Links + Discussion. Members can submit their findings and let the community rank and discuss it. A regular newsletter makes it easy to stay up-to-date on recent advancements. It’s free and open.

    In terms of things that I will actually use, this source-code requirement idea is good. However, the site itself is no longer maintained at time of writing and has fallen into disrepair.

    Perhaps it has been superseded by…

  • Papers With Code, which is similar.

    The mission of Papers With Code is to create a free and open resource with Machine Learning papers, code and evaluation tables.

    We believe this is best done together with the community and powered by automation.

    We’ve already automated the linking of code to papers, and we are now working on automating the extraction of evaluation metrics from papers.

  • Executable Papers in CodaLab, e.g.:

    Run your machine learning experiments in the cloud. Manage them in a digital lab notebook. Publish them so other researchers can reproduce your results.

    Upload code (in any programming language) and datasets (in any format) as bundles. There are no constraints on how you structure your bundles.

    Run your code in the cloud by specifying an arbitrary command along with your bundle dependencies, a Docker execution environment, and resource requirements. The output of the run becomes a new bundle.
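The bundle idea quoted above can be sketched as a small data model. This is only an illustration of the concept, not CodaLab's actual API: the names `Bundle`, `RunSpec`, `upload` and `describe_run` are hypothetical, nothing is executed, and the run bundle's hash is derived from its specification rather than from real output. The point it tries to show is that a run is fully described by its command, its dependency bundles, its Docker image and its resource request, so two identical specifications identify the same result.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Bundle:
    """An immutable unit of code, data, or run output (hypothetical model)."""
    name: str
    content_hash: str  # stands in for the stored contents of the bundle


@dataclass(frozen=True)
class RunSpec:
    """Everything needed to re-execute a run (hypothetical model)."""
    command: str
    dependencies: tuple   # tuple of Bundle objects
    docker_image: str
    resources: tuple = ()  # pairs such as (("cpus", "1"), ("memory", "4g"))


def upload(name: str, data: bytes) -> Bundle:
    """Create a bundle whose identity is the hash of its contents."""
    return Bundle(name, hashlib.sha256(data).hexdigest())


def describe_run(name: str, spec: RunSpec) -> Bundle:
    """Return the bundle a run would produce, without executing anything.

    Assumption for this sketch: the output is identified by a hash of the
    full run specification, so changing any input changes the identity.
    """
    record = {
        "command": spec.command,
        "dependencies": [dep.content_hash for dep in spec.dependencies],
        "docker_image": spec.docker_image,
        "resources": dict(spec.resources),
    }
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return Bundle(name, hashlib.sha256(payload).hexdigest())


if __name__ == "__main__":
    code = upload("train.py", b"print('training')")
    data = upload("data.csv", b"1,2,3")
    spec = RunSpec("python train.py data.csv", (code, data), "python:3.11")
    print(describe_run("model", spec))
```

Because the output of `describe_run` is itself a `Bundle`, it can be listed as a dependency of a later `RunSpec`, which is how a chain of runs would record its own provenance.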

3 References

Baker. 2021. “Five Keys to Writing a Reproducible Lab Protocol.” Nature.
Balaban, Grytten, Rand, et al. 2021. “Ten Simple Rules for Quick and Dirty Scientific Programming.” PLOS Computational Biology.
Barnes. 2010. “Publish Your Computer Code: It Is Good Enough.” Nature.
Boettiger. 2015. “An Introduction to Docker for Reproducible Research, with Examples from the R Environment.” ACM SIGOPS Operating Systems Review.
Commons. 2022. “A National Agenda for Research Software.”
Community, Arnold, Bowler, et al. 2019. “The Turing Way: A Handbook for Reproducible Data Science.”
Leroy, Sallou, Bourcier, et al. 2021. “When Scientific Software Meets Software Engineering.” IEEE Computer Society.