Tracking experiments in machine learning

2020-10-23 — 2025-09-02

computers are awful
faster pussycat
how do science
premature optimization
provenance
Wherein experiment-tracking for neural‑network training is examined, and the recording of runtime metadata — parameters, metrics, artifacts and GPU energy usage — is described as being centralized to local or remote stores to support reproducibility.

Experiment tracking, specialized for ML and in particular neural nets. ML experiments typically involve a long optimization process, and we care a lot about that whole process and the various metrics calculated during it. This is the nuts-and-bolts end of how we allow for reproducibility in AI, even when the model fitting is complicated, the computation steps are many, long and slow, and the code changes through a complicated development process.

Neptune reviews a few options, including its own product.

1 DIY

If it’s just a few configuration parameters that need tracking, we don’t need to be too fancy. hydra, for example, allows us to store the configuration alongside the output data, and we can probably capture other useful metrics too. But if we want to do non-trivial analytics on our massive NN, we need more elaborate tooling: a nice way of visualizing the metrics, and a way to update those visualizations dynamically… We could pump them into some nice data store and a data dashboard, but most people find they’ve reinvented tensorboard and switch to that.
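For concreteness, a minimal DIY sketch along those lines, assuming a Hydra project with a conf/config.yaml that defines a couple of hyperparameters; Hydra records the resolved config in each run’s output directory, and we dump our own metrics beside it:

import json

import hydra
from omegaconf import DictConfig, OmegaConf


@hydra.main(config_path="conf", config_name="config", version_base=None)
def main(cfg: DictConfig) -> None:
    # With version_base=None Hydra chdirs into a fresh per-run output
    # directory and saves the resolved config under .hydra/, so anything
    # we write here lands next to that config.
    metrics = {"loss": 0.123}  # placeholder for whatever the training loop returns
    with open("metrics.json", "w") as f:
        json.dump(metrics, f)
    print(OmegaConf.to_yaml(cfg))


if __name__ == "__main__":
    main()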

2 Tensorboard

Tensorboard is the de facto standard debugging and tracking tool. It’s easy-ish to install, hard to modify, and works well enough for NN visualizations, but less well for everything else. It’s a handy tool to have around — I gave it its own page.

NB: It doesn’t do comprehensive tracking out of the box; we need to set it up to track that this value came from that experiment.
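A minimal sketch of that setup, using PyTorch’s bundled SummaryWriter and an (entirely illustrative) convention of encoding the run’s hyperparameters in the log directory name so each logged value stays attributable to its experiment:

from torch.utils.tensorboard import SummaryWriter

hparams = {"learning_rate": 0.01, "batch_size": 32}
# Encode which experiment produced these logs in the log_dir itself
run_name = f"runs/lr{hparams['learning_rate']}_bs{hparams['batch_size']}"
writer = SummaryWriter(log_dir=run_name)

for step in range(100):
    loss = 1.0 / (step + 1)  # placeholder for a real training loss
    writer.add_scalar("train/loss", loss, global_step=step)

# add_hparams links the hyperparameters to final metrics in the HParams tab
writer.add_hparams(hparams, {"final/loss": loss})
writer.close()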

3 Trackio

A Hugging Face–integrated option. From the announcement, Introducing Trackio: A Lightweight Experiment Tracking Library from Hugging Face:

Trackio is an open-source Python library that lets you track any metrics and visualize them using a local Gradio dashboard. You can also sync this dashboard to Hugging Face Spaces, which means you can then share the dashboard with other users simply by sharing a URL. Since Spaces can be private or public, this means you can share a dashboard publicly or just within members of your Hugging Face organization.

Also

At Hugging Face, our science team has started using Trackio for our research projects, and we’ve found several key advantages over other tracking solutions:

Easy Sharing and Embedding: Trackio makes it incredibly simple to share training progress with colleagues or embed plots directly in blog posts and documentation using iframes. This is especially valuable when you want to showcase specific training curves or metrics without requiring others to set up accounts or navigate complex dashboards.

Standardization and Transparency: Metrics like GPU energy usage are important to track and share with the community so we can have a better idea of the energy demands and environmental impacts of model training. Using Trackio, which directly gets information from the nvidia-smi command, makes it easy to quantify and compare energy usage and to add it to model cards.

Data Accessibility: Unlike some tracking tools that lock your data behind proprietary APIs, Trackio makes it straightforward to extract and analyze the data being recorded. This is crucial for researchers who need to perform custom analysis or integrate training metrics into their research workflows.

Flexibility for Experimentation: Trackio’s lightweight design allows us to easily experiment with new tracking features during training runs. For instance, we can decide when to move tensors from GPU to CPU when logging tensors while training, which significantly improves training throughput when you need to track model/intermediate states without impacting performance.

Notably, it’s a near drop-in replacement for Weights & Biases.
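A sketch of what that drop-in claim looks like in practice; the alias-the-import pattern is the one the announcement suggests, though the exact API may still shift while the library is young:

import trackio as wandb  # the wandb-compatible alias Trackio advertises

run = wandb.init(project="my-experiments", config={"learning_rate": 0.01})
for step in range(10):
    wandb.log({"loss": 1.0 / (step + 1)})
wandb.finish()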

4 Weights and Biases

Weights & Biases is a full-featured model-training-tracking system that uses a third-party host. In practice, shunting everything off to their cloud makes setup a bit easier than TensorBoard, and the SDKs for various frameworks abstract away some boring details.

Documentation is here. For my purposes, the handiest entry point is their Experiment Tracking.

Track and visualize experiments in real time, compare baselines, and iterate quickly on ML projects

Use the wandb Python library to track machine learning experiments with a few lines of code. If you’re using a popular framework like PyTorch or Keras, we have lightweight integrations.

You can then review the results in an interactive dashboard or export your data to Python for programmatic access using our Public API.
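A minimal sketch of that loop; the project name and config values are illustrative, and it assumes you have run wandb login beforehand:

import wandb

run = wandb.init(project="my-project", config={"learning_rate": 0.01, "epochs": 5})
for epoch in range(run.config.epochs):
    loss = 1.0 / (epoch + 1)  # placeholder for a real training loss
    wandb.log({"epoch": epoch, "loss": loss})
run.finish()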


5 MLflow

I think MLflow does a few things, including experiment tracking and experiment configuration (see Configuring).

MLflow is an open-source platform for managing the ML lifecycle, with components for tracking experiments, structuring projects, and managing models. It records parameters, metrics, artifacts, and code so runs can be compared and reproduced reliably, and its tracking server and UI organize work into experiments and runs, with optional remote storage and integrations across common ML frameworks, enabling traceability from code to results.

An MLflow run is a single execution of training or evaluation code. An experiment is a collection of related runs that can be filtered and compared in the UI. Models are captured as artifacts associated with runs, and in recent versions, LoggedModels elevate models to first‑class entities for lifecycle tracking across training and evaluation. That’s cool.

Runs can log parameters (e.g., hyperparameters), metrics (e.g., accuracy or loss over steps), tags, and artifacts such as model files, plots, and datasets. Source information and environment details can also be recorded to support reproducibility and auditability. This consistent metadata makes it straightforward to benchmark alternatives and trace results back to inputs.

Experiments help segment efforts by project or hypothesis, with features to sort, filter, and compare runs to understand performance trade‑offs. Teams commonly name experiments by feature branch, dataset slice, or task to keep explorations structured and discoverable in the UI. Tags further aid searchability and reporting across large numbers of runs.

Runs log to an active experiment on a local or remote tracking server, which can be configured with a tracking URI to centralize across users and environments. Artifacts can be stored locally or in remote/cloud locations, enabling scalable storage for models and auxiliary files. This separation of metadata and artifacts supports both laptop workflows and enterprise setups.

Minimal example:

import mlflow
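
# Optional: point at a remote tracking server and name the experiment
# (URI and name are illustrative; MLflow defaults to a local ./mlruns store).
# mlflow.set_tracking_uri("http://localhost:5000")
# mlflow.set_experiment("my-experiment")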

with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_param("batch_size", 32)
    mlflow.log_metric("accuracy", 0.95)
    with open("model_summary.txt", "w") as f:
        f.write("Model achieved 95% accuracy.")
    mlflow.log_artifact("model_summary.txt")

6 Aimstack

AimStack: Dev tools for AI engineers. TBD.

7 Neptune

Neptune.ai calls itself a Metadata Store for MLOps. It has many team collaboration features. TBD.

8 Comet ML

Comet. TBD

9 DrWatson.jl

For Julia there is DrWatson which automatically attaches code versions to simulations and does other work to help keep simulations tracked and reproducible.

DrWatson is a scientific project assistant software. Here is what it can do:

  • Project Setup: A universal project structure and functions that allow you to consistently and robustly navigate through your project, no matter where it is located.
  • Naming Simulations: A robust and deterministic scheme for naming and handling your containers.
  • Saving Tools: Tools for safely saving and loading your data, tagging the Git commit ID to your saved files, safety when tagging with dirty repos, and more.
  • Running & Listing Simulations: Tools for producing tables of existing simulations/data, adding new simulation results to the tables, preparing batch parameter containers, and more.

See the DrWatson Workflow Tutorial page to get a quick overview of all of these functionalities.

10 DVC

DVC, the data versioning tool, can apparently do some useful tracking. TBD.
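In the meantime, a hedged sketch of what the tracking side looks like through DVC’s companion DVCLive logger (parameter names and values are purely illustrative):

from dvclive import Live

with Live() as live:
    live.log_param("learning_rate", 0.01)  # illustrative hyperparameter
    for epoch in range(5):
        live.log_metric("loss", 1.0 / (epoch + 1))  # placeholder metric
        live.next_step()  # advance the step counter so metrics are logged per epoch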

11 Configuring

A related problem to experiment tracking is experiment configuration. How do I even set up those parameters?