Configuring machine learning experiments

2021-10-20 — 2025-06-15

computers are awful
faster pussycat
how do science
premature optimization
provenance
python

A dual problem to experiment tracking is experiment configuring: how can I nicely define experiments and the parameters that make them go? Ideally, how can I do both of these things at the same time?


For a problem that we all need to solve constantly, and which has (hundreds of?) packages purporting to solve it, this is remarkably unsolved.

Almost everyone gets annoyed by the frictions of configuring experiments at some stage, typically triggered by a complicated situation: we have some messy hierarchical neural net and we wish to keep the configuration in a file, while still allowing ourselves to override it from the command line. And it’s a deeply nested hierarchical configuration. And we are editing the code. And sometimes we are invoking it programmatically from a notebook, and sometimes from a command-line script. We want to make it easy to tweak function arguments. Easy to say, but… how? How do we make it “easy” without introducing many more moving parts? I would rather not need to have “configuration objects” and “configuration files” and “configuration decorators” and “configuration classes” and “configuration functions” and “configuration arguments” and “configuration overrides” and “configuration defaults” and “configuration validation” and “configuration parsing”. Yet many solutions seem to involve a remarkably large number of such components. Somehow I need to write 200 lines of code when I just want to tweak the learning rate.

That’s the setting. Further, since all my experiments at the moment are in Python, I will assume python hereafter.

The competing design goals of experiment configurations are hard to resolve. I want my core business logic to be plain python functions. I also want to invoke them easily from various systems, such as the CLI, which will want to know a lot about my business logic in order to call it automatically. Is that argument always an integer? Can it return a negative number? Oh, that argument is an nn.Module? Which of its constructor arguments might we change? What is the default value for the others?

Of course, we are in neural network land, which is typically an uneasy truce between functional and object-oriented programming, with late binding, leaky abstractions, and a lot of late resolution of dimensions and other such annoyances that are seductively easy to configure if you are doing exactly the example in the user manual, but turn out to be brittle to automate in, like, actual applications. (I’m looking at you, UNet.) The question of who bears responsibility for which parts of this complexity is not settled; all the libraries try different approaches. Which is to say, it is not clear what the simplest thing is.

And there are so, so many libraries once you start looking for them. It is like mushroom picking; first you see one, and then another, and then you realise you are standing in a field of mushrooms, and you have no idea which ones are edible and which ones will have you hallucinating for a week. It seems this problem sits on a psychological weak spot in python’s design, where it tips over from empowering software to enabling yak shaving. ML config is in the Dunning-Kruger liminal zone where the waves of duck-typing lap upon the beach of solid software engineering, but demolish all you build upon it. Python has just enough metaprogramming facilities that an overconfident developer can convince themselves they can automate some of the labour, at the cost of it being ugly and hard to reason about, but how hard could it be? Just because a hundred people before you have tried and failed doesn’t mean you will fail; maybe THEY hadn’t read PEP 484! or PEP 557! or thought about implementing everything using a brand new Param class that is a dataclass but also a decorator and also a context manager and also a type hint and also a function signature and generates CLIs by introspection !!! This will make my life SO EASY!!!!

So, onto this beach littered with the flotsam and jetsam of abandoned configuration systems, and also somehow hallucinatory mushrooms, and probably in addition landmines, I will try to sell you my most premium sandcastle real estate.

I have provisionally decided that the following things are anti-patterns:

  1. recursively defining the configuration in some special language like JSON or YAML. It sounds elegant, but in practice it ends up being a sad journey into a mess of mapping arguments onto functions. I am open to being persuaded otherwise: a system that allowed incrementally adding configuration to a function call, rather than replacing the whole configuration, might work.
  2. auto-config. Fiddle tried to do that to make my life easier, and all it did was tank my LLM budget trying to debug fiddle.

But, you know what? I haven’t found anything that works so you should ignore my opinions and try everything anyway.

1 MLflow

Auditioning this one on the recommendation of a colleague. AFAICT, MLflow’s configuration is mostly about centralizing evidence: point everything at a tracking server, choose an artifact store, and (optionally) wire a model registry. With that in place, experiments stop living in ad-hoc folders and spreadsheets. My current understanding is that MLflow isn’t magic—defaults are local and easy to lose, the UI can be clunky, and the docs are scattered—but a single tracking URI and Tracking Server gives us a durable audit trail.

For experiment tracking, the USPs are straightforward. First, runs become queryable objects with parameters, metrics, tags, and artifacts we can slice in the UI, Python API, or CLI. Second, autologging reduces boilerplate for common libraries while still letting us log bespoke diagnostics. Third, artifacts are first-class—models, plots, datasets, whatever—so comparisons are reproducible rather than anecdotal. Finally, a Model Registry gives us promotable, versioned models with stage semantics when you need them.

Configuration-wise, my best read is: set MLFLOW_TRACKING_URI or call mlflow.set_tracking_uri(...), set a real artifact root (e.g., S3/MinIO/GCS) and the right env vars (MLFLOW_S3_ENDPOINT_URL etc.), pick an experiment via mlflow.set_experiment(...), and, if you care about promotion, set a registry URI too (tracking basics). I’d also lock the runtime with Projects so re-runs don’t drift—parameters, entry points, and environment in MLproject. Mild gotchas I’d plan for: nested runs don’t group cleanly unless they share the same experiment and are started with mlflow.start_run(nested=True) (see start_run docs), and “it worked locally” often means “you never set a durable artifact store.”
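
A minimal sketch of that wiring, assuming a hypothetical tracking server on localhost:5000 and an S3-compatible artifact store behind MLFLOW_S3_ENDPOINT_URL; every endpoint and name here is a placeholder:

import os
import mlflow

# Placeholder endpoints; substitute your real tracking server and object store.
os.environ["MLFLOW_S3_ENDPOINT_URL"] = "http://localhost:9000"  # MinIO / S3-compatible store
mlflow.set_tracking_uri("http://localhost:5000")  # or export MLFLOW_TRACKING_URI
mlflow.set_experiment("unet-ablation")  # created on first use if it does not exist

with mlflow.start_run(run_name="baseline"):
    mlflow.log_params({"learning_rate": 1e-3, "seed": 42, "dataset": "toy-v1"})
    # ... train the model here ...
    mlflow.log_metric("val_loss", 0.123)
    mlflow.log_artifact("config.yaml")  # assuming this resolved config file exists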

Extending this to something close to optimal experimental design under a fixed budget: I’d model a “study” as a parent run and each trial as a child run with fully parameterized inputs. Then I’d drive trials with a Bayesian optimizer (Hyperopt’s TPE seems to be the default choice) to maximize improvement per evaluation instead of spraying grid/random sweeps (see Databricks’ TPE tips).
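
A sketch of that study/trial shape under the same assumptions, using Hyperopt’s fmin with TPE; the objective and search space are toy placeholders:

import mlflow
from hyperopt import fmin, tpe, hp, Trials

def objective(params):
    # Each trial is a child run, nested under the study's parent run.
    with mlflow.start_run(nested=True):
        mlflow.log_params(params)
        val_loss = (params["lr"] - 1e-3) ** 2  # stand-in for real training
        mlflow.log_metric("val_loss", val_loss)
        return val_loss

space = {"lr": hp.loguniform("lr", -10, -1)}

with mlflow.start_run(run_name="lr-study"):  # the parent "study" run
    best = fmin(fn=objective, space=space, algo=tpe.suggest,
                max_evals=20, trials=Trials())
    mlflow.log_params({f"best_{k}": v for k, v in best.items()})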

Caveats. First, MLflow won’t version data; if that matters, pair it with a data versioning layer and log dataset fingerprints so run-to-run comparisons are valid (e.g., lakeFS’s take: Why pair with MLflow). Second, registry value is real only if we actually gate, promote, and deploy; otherwise it’s paperwork. Given those constraints, I think a sane setup is: central tracking + artifact store, Projects for reproducibility, disciplined tags (dataset/seed/augmentation), nested runs for searches, and a Bayesian loop to spend compute where it moves the metric.

I haven’t battle-tested this end-to-end; it’s simply my best current understanding of how to make MLflow earn its keep.

2 yaml_config_override

See sashank-tirumala/yaml_config_override.

This library does one thing: it automatically creates command-line arguments to override values in a YAML configuration file. I know I just said that I don’t like special configuration languages, but this library is so simple that maybe I should suck it up and have a go?

  • Extremely Lightweight: It’s a tiny library with a single purpose.
  • Non-Intrusive: You can add it to your existing workflow with just a couple of lines of code. Your business logic remains completely separate from the configuration logic.
  • Easy CLI Integration: This is its core feature. You get command-line overrides for free, without writing any argparse boilerplate.

Here’s how you might use it:

# main.py
from yaml_config_override import add_arguments
import yaml
from pathlib import Path

# Assume you have a 'config.yaml'
# outer:
#   x: 0
#   inner:
#     y: 1

my_config_path = 'config.yaml'
conf = yaml.safe_load(Path(my_config_path).read_text())
conf = add_arguments(conf)

print(conf)

# You can now run from the command line:
# python main.py --outer.x 2 --outer.inner.y 3

Looks to me like there might be trouble if one of those arguments is an nn.Module.

3 argbind

Looks simple conceptually. However, it’s intrusive; I do not enjoy the philosophy of my function signatures being the canonical definition of my experiment’s parameters.

import argbind

# Your logic is decorated, tying it to the config system
@argbind.bind
def train(learning_rate: float = 0.01, epochs: int = 10):
    ...  # training loop goes here

if __name__ == "__main__":
    args = argbind.parse_args()
    with argbind.scope(args):
        train()  # Arguments are magically passed in

4 fastargs

If you’re looking for a more programmatic and powerful solution, maybe fastargs? It allows you to define your configuration directly in Python and use decorators to inject configuration values into your functions.

  • Incremental Adoption: The decorator-based approach means you can start by configuring just a few functions and expand from there.
  • Separation of Concerns: It helps keep your configuration logic separate from your main code, but in a more programmatic way than YAML files.
  • Powerful Features for ML: It can handle complex types like Python modules as arguments, allowing you to specify things like which optimizer to use directly from the command line.
  • Clear Boilerplate: While it does involve some boilerplate with decorators, it’s very clear what needs to be written, making it easy to use with an LLM assistant.
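
A rough sketch of the pattern, going by my reading of the fastargs README; the Section/Param declarations and the @param decorator below are my recollection of its API, so treat the details as approximate:

import argparse
from fastargs import Section, Param, get_current_config
from fastargs.decorators import param

# Declare parameters in Python, grouped into namespaced sections.
Section("training", "training hyperparameters").params(
    lr=Param(float, "learning rate", default=0.01),
    epochs=Param(int, "number of epochs", default=10),
)

@param("training.lr")
@param("training.epochs")
def train(lr, epochs):
    print(f"lr={lr}, epochs={epochs}")

if __name__ == "__main__":
    config = get_current_config()
    parser = argparse.ArgumentParser("fastargs demo")
    config.augment_argparse(parser)        # exposes --training.lr etc. on the CLI
    config.collect_argparse_args(parser)
    config.validate(mode="stderr")
    train()                                # values injected by the decorators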

I dunno though, decorators are already a red flag for me.

5 YACS

rbgirshick/yacs: YACS – Yet Another Configuration System. It looks so ugly to configure, but people have won Kaggle competitions using it, so it might be good, right?
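
For flavour, the canonical pattern is a tree of defaults plus file and list merges; a minimal sketch (the keys and experiment.yaml are made up):

from yacs.config import CfgNode as CN

# Defaults, conventionally kept in a config.py at the project root.
_C = CN()
_C.TRAIN = CN()
_C.TRAIN.LR = 0.01
_C.TRAIN.EPOCHS = 10

def get_cfg():
    return _C.clone()

cfg = get_cfg()
# cfg.merge_from_file("experiment.yaml")   # per-experiment overrides, if such a file exists
cfg.merge_from_list(["TRAIN.LR", 0.001])   # e.g. overrides forwarded from the CLI
cfg.freeze()                               # further mutation now raises instead of silently succeeding
print(cfg.TRAIN.LR)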

Worked example: Building A Flexible Configuration System For Deep Learning Models · Julien Beaulieu.

6 Hydra

This does more or less everything, and in fact too many things to even understand, and it is very opinionated about them. See my Hydra page. Too heavy and confusing for me in practice, in hindsight.
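
For reference, the canonical entry point looks something like this, assuming a hypothetical conf/config.yaml:

import hydra
from omegaconf import DictConfig, OmegaConf

@hydra.main(config_path="conf", config_name="config", version_base=None)
def main(cfg: DictConfig) -> None:
    # cfg is composed from conf/config.yaml plus any CLI overrides,
    # e.g. python main.py optimizer.lr=0.001
    print(OmegaConf.to_yaml(cfg))

if __name__ == "__main__":
    main()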

7 Fiddle

Fiddle seems to resolve some problems in gin so I gave it a go. On paper the API is much simpler to use. It employs a code-based config system, where we define Python functions that mutate the configuration. In practice it is a land-war-in-Indochina situation, with shifting APIs, lagging documentation, counterintuitive patterns that are only simple in hindsight, and massive complexity that tanks my productivity. See my fiddle notebook.
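
The core primitives are at least small; a minimal sketch of the Config/build pattern, with my own stand-in train function:

import fiddle as fdl

def train(learning_rate: float = 0.01, epochs: int = 10):
    print(f"lr={learning_rate}, epochs={epochs}")

# A Config is a mutable, buildable record of how train will eventually be called.
cfg = fdl.Config(train)
cfg.learning_rate = 0.001  # "fiddling": mutate the config, not the code

fdl.build(cfg)  # actually calls train(learning_rate=0.001, epochs=10)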

8 Gin

gin-config configures default parameters in a useful way for ML experiments. It is made by Googlers, as opposed to Hydra, which is made by Facebookers. It is more limited than Hydra, but slightly lighter. Things I miss from Hydra: CLI parsing for overrides. It seems to be abandoned or deprecated in favour of fiddle.
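
A minimal sketch of the gin pattern; the inline bindings stand in for a config.gin file you would normally load with gin.parse_config_file:

import gin

@gin.configurable
def train(learning_rate=0.01, epochs=10):
    print(f"lr={learning_rate}, epochs={epochs}")

# These bindings would normally live in a .gin file;
# parsing them rebinds the defaults of any @gin.configurable.
gin.parse_config("""
train.learning_rate = 0.001
train.epochs = 20
""")

train()  # runs with the bound values, no arguments passed explicitly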

9 Spock

Also looks like it solves lots of problems. However, the code has been untouched since 2023. Does that mean it is finished, or unmaintained? Spock aims to solve so many problems that I am suspicious it could be finished.

Spock is a parameter configuration framework that uses decorated Python classes. It offers a comprehensive approach to managing complex configurations with strong typing.

Key features include:

  • Type-checked, immutable parameters
  • Support for inheritance and complex parameter dependencies
  • Multiple configuration file formats (YAML, TOML, JSON)
  • Automatic CLI argument generation
  • Serialization for reproducibility
  • Hierarchical configuration through composition

Spock seems quite feature-rich, supporting everything from simple parameter definitions to complex nested types and even hyperparameter tuning. While the repository hasn’t been updated since 2023, it appears fairly complete. The question remains whether it’s “finished” or unmaintained. idk.
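
For a feel of the decorated-class style, here is my best guess at the minimal pattern from the README; the @spock decorator and SpockBuilder names are from memory, so verify against the docs before relying on them:

from spock import spock, SpockBuilder

@spock
class TrainConfig:
    learning_rate: float = 0.01
    epochs: int = 10

def main():
    # Builds a typed, immutable namespace from defaults, config files, and CLI flags.
    config = SpockBuilder(TrainConfig, desc="spock demo").generate()
    print(config.TrainConfig.learning_rate)

if __name__ == "__main__":
    main()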

10 Pyrallis

eladrich/pyrallis seems to be a reimagining of Hydra. It seems a little less opinionated, which is relaxing, while leaving the user with fewer choices about how to map the configuration to the code.

The major trick is treating Python dataclasses (available since Python 3.7) as first-class citizens. Looks elegant but not very actively maintained.
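
A minimal sketch of the dataclass-as-config pattern; pyrallis.parse derives the CLI (and, if I recall correctly, an optional --config_path YAML) from the dataclass itself:

from dataclasses import dataclass
import pyrallis

@dataclass
class TrainConfig:
    learning_rate: float = 0.01
    epochs: int = 10

def main():
    # CLI flags such as --learning_rate 0.001 are generated from the dataclass fields.
    cfg = pyrallis.parse(config_class=TrainConfig)
    print(cfg)

if __name__ == "__main__":
    main()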

There is a fork, dlwh/draccus: Configuration with Dataclasses+YAML+Argparse. Fork of Pyrallis.

11 Allennlp Param

AllenNLP’s Params system is a kind of introductory trainer-wheels configuration system, but not recommended in practice. It comes with a lot of baggage: installing it will bring in many fragile and fussy dependencies for language parsing. Once I had used it for a while, I realised all the reasons I would want a better system, and I no longer recommend it in general.

12 DIY

Why use an external library for this? I could, of course, roll my own. I have done that quite a few times. It is a surprisingly large amount of work, remarkably easy to get wrong.
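
For the record, the usual starting point is a dataclass plus auto-generated argparse flags, something like the sketch below; this part is easy, and everything after it (nesting, config files, non-scalar arguments like an nn.Module, programmatic invocation from notebooks) is where the work hides:

import argparse
from dataclasses import dataclass, fields, asdict

@dataclass
class Config:
    learning_rate: float = 1e-3
    epochs: int = 10
    seed: int = 42

def parse_config() -> Config:
    # Generate one CLI flag per dataclass field; only flat scalar fields survive this approach.
    parser = argparse.ArgumentParser()
    for f in fields(Config):
        parser.add_argument(f"--{f.name}", type=f.type, default=f.default)
    return Config(**vars(parser.parse_args()))

if __name__ == "__main__":
    print(asdict(parse_config()))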