Big data ML best practice

Being transparent about what I put in this black box

September 16, 2020 — May 22, 2024

computers are awful
machine learning
neural nets
Figure 1

A grab bag of links I have found pragmatically useful in the topsy-turvy world of ML research, where even though we have big data about the world, we still have small data about our own experimental models of the world, because those models are so computationally expensive to run.

1 Workflow rules-of-thumb

Martin Zinkevich’s Rules of ML for engineers, and Google’s broad-brush workflow overview. Andrej Karpathy’s Recipe for training neural networks.

2 Testing and debugging

Figure 2: Jordan’s test-ML pipeline

Zayd Enam on why debugging machine learning is hard and Jeremy Jordan on writing tests for ML.
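To make the idea of tests for ML concrete, here is a minimal sketch of three cheap checks that can run before any expensive training job: an output-shape check, a check that one gradient step reduces the loss, and a check that the model can overfit a tiny batch. It is my own illustration rather than code from either post. The toy linear model and every function name in it (`predict`, `mse`, `sgd_step`, `make_toy_data`) are hypothetical stand-ins for a real model and training step.

```python
import random


def predict(w, b, xs):
    """Toy linear model standing in for a real network (hypothetical)."""
    return [w * x + b for x in xs]


def mse(w, b, xs, ys):
    """Mean squared error of the toy model on (xs, ys)."""
    preds = predict(w, b, xs)
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)


def sgd_step(w, b, xs, ys, lr=0.01):
    """One full-batch gradient descent step on the MSE loss."""
    n = len(xs)
    preds = predict(w, b, xs)
    grad_w = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / n
    grad_b = sum(2 * (p - y) for p, y in zip(preds, ys)) / n
    return w - lr * grad_w, b - lr * grad_b


def make_toy_data(n=32, seed=0):
    """Small synthetic dataset with a fixed seed so the tests are deterministic."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    ys = [3.0 * x + 0.5 + rng.gauss(0, 0.1) for x in xs]
    return xs, ys


def test_output_length_matches_input():
    xs, _ = make_toy_data()
    assert len(predict(0.0, 0.0, xs)) == len(xs)


def test_one_step_reduces_loss():
    xs, ys = make_toy_data()
    before = mse(0.0, 0.0, xs, ys)
    w, b = sgd_step(0.0, 0.0, xs, ys)
    assert mse(w, b, xs, ys) < before


def test_can_overfit_tiny_batch():
    # Two points determine a line exactly, so the loss should go to ~0.
    xs, ys = [0.0, 1.0], [0.5, 3.5]
    w, b = 0.0, 0.0
    for _ in range(2000):
        w, b = sgd_step(w, b, xs, ys, lr=0.1)
    assert mse(w, b, xs, ys) < 1e-6
```

The functions follow the `test_*` naming convention, so a runner such as pytest should collect them as written. The point is that each check takes milliseconds, whereas the bug it catches might otherwise surface only after hours of compute.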

3 Data management

A whole field. See also data versioning.

4 As reproducible research

Figure 3: Abstruse Goose says

The Turing Way, by the Alan Turing Institute, covers many reproducible research and open notebook science ideas, including some tips applicable to ML research.
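One small habit in this spirit is to fix the random seed and store the full configuration and environment alongside every result. The following is a rough sketch of that habit, not something taken from The Turing Way. The helper names (`config_fingerprint`, `run_experiment`, `run_and_record`) and the config keys (`seed`, `n_samples`) are assumptions made up for illustration, and the "experiment" is a placeholder for a real training run.

```python
import hashlib
import json
import platform
import random
import sys


def config_fingerprint(config):
    """Short, stable ID for a config dict, independent of key order."""
    blob = json.dumps(config, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:12]


def run_experiment(config):
    """Placeholder experiment: all randomness flows from the configured seed."""
    rng = random.Random(config["seed"])
    samples = [rng.gauss(0, 1) for _ in range(config["n_samples"])]
    return {"mean": sum(samples) / len(samples)}


def run_and_record(config):
    """Run the experiment and bundle the result with what is needed to rerun it."""
    return {
        "config": config,
        "config_id": config_fingerprint(config),
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "result": run_experiment(config),
    }


if __name__ == "__main__":
    record = run_and_record({"seed": 42, "n_samples": 1000})
    print(json.dumps(record, indent=2))
```

Rerunning with the same config should reproduce the same result in the same environment. A real project would also want to record library versions and the code revision, which this sketch omits.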

5 Trustworthiness and transparency

6 Tools

See also configuring ML for some abstractions of use, and experiment tracking in ML.

7 Incoming

8 References

Albertoni, Colantonio, Skrzypczyński, et al. 2023. “Reproducibility of Machine Learning: Terminology, Recommendations and Open Issues.”
Ameisen. 2020. Building machine learning powered applications: going from idea to product.
Friedrich, Antes, Behr, et al. 2020. “Is There a Role for Statistics in Artificial Intelligence?” arXiv:2009.09070 [Cs].
Gibney. 2019. “This AI Researcher Is Trying to Ward Off a Reproducibility Crisis.” Nature.
Madaio, Stark, Wortman Vaughan, et al. 2020. “Co-Designing Checklists to Understand Organizational Challenges and Opportunities Around Fairness in AI.” In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems. CHI ’20.
Pineau, Vincent-Lamarre, Sinha, et al. 2020. “Improving Reproducibility in Machine Learning Research (A Report from the NeurIPS 2019 Reproducibility Program).”
Pushkarna, Zaldivar, and Kjartansson. 2022. “Data Cards: Purposeful and Transparent Dataset Documentation for Responsible AI.” In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency. FAccT ’22.