Scheduling ML jobs on HPC clusters

In Soviet Russia, job puts YOU in queue

2018-03-09 — 2025-09-30

Wherein classic HPC schedulers are surveyed and a practical habit is recommended: submitit is presented as a way to run Python functions via Slurm, and Snakemake’s job‑grouping is noted.

computers are awful
computers are awful together
concurrency hell
premature optimization

Working on classic HPC clusters.

Slurm, Torque, Platform LSF… Typically, a campus cluster ships with a few gruff example commands that worked for someone once and not much else. Often the examples run a molecular‑simulation package written in a language I’ve never heard of, or one I’d rather forget. This mixes the understandable desire to avoid writing documentation for every bizarre, idiosyncratic use case with a kind of “availability-through-obscurity” demand management; in my experience, a cluster’s IT team is often reluctant even to document which scheduler it runs. Cluster operators are typically less eager to allocate GPUs, a bit baffled by modern neural-network workflows, and downright flabbergasted by containers. The clusters themselves tend to share a similar API and provide concurrency guarantees via the infamous Byzantine committee-designed greasy totem pole priority system. That said, I’ve lived through the transition from classic compute to GPU‑everything, and these days sysadmins know all about GPUs.

Anyway, here are some methods that work well for my use cases, which lean toward statistical inference and neural nets.

1 submitit

A simple tool that tells Slurm to run a Python function as a job, cleanly. It’s the infrastructure for a lot of other things. I use it so much that I made a submitit notebook. Go take a look. It’s my current go-to option for Python.
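The basic pattern looks something like the following minimal sketch (assuming Slurm is reachable; the partition name and resource numbers are placeholders for whatever the local site expects):

import submitit

def train(lr):
    # the real training loop would go here; return whatever metric we care about
    return lr * 2  # placeholder

# AutoExecutor writes logs and pickled results under the given folder,
# and falls back to local execution when no Slurm is present.
executor = submitit.AutoExecutor(folder="slurm_logs")
executor.update_parameters(
    timeout_min=60,          # walltime in minutes
    slurm_partition="gpu",   # placeholder partition name
    gpus_per_node=1,
    cpus_per_task=4,
    mem_gb=16,
)

jobs = executor.map_array(train, [1e-3, 1e-4, 1e-5])  # one Slurm array task per argument
print([job.result() for job in jobs])                 # blocks until the jobs finish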

2 Snakemake

I missed it before, but Snakemake deserves mention: it’s a workflow engine (inspired by Make) with built‑in support for submitting jobs to HPC clusters, and it handles parallelism and dependencies cleanly.

  • We express a pipeline as a DAG of rules (using input, output, and shell/run directives for the logic, plus threads and resources for resource requirements).
  • Snakemake detects jobs that can run in parallel based on dependencies, so we don’t have to write orchestration code.
  • It has built-in support for submitting jobs to batch schedulers (Slurm, PBS, SGE, LSF) via its cluster execution mode.
  • It supports job grouping, which reduces scheduler load for many small tasks by bundling them into a single job.
  • We can define default resources and override them per rule (e.g. memory, threads), and forward scheduler-specific flags via templates or cluster-config.yaml.
  • Because Snakemake is workflow-aware, it avoids re-running tasks whose inputs haven’t changed, supports modular workflows, and integrates more easily with containerization and rule reuse.

A typical pattern is:

  1. Create a profile (e.g. in ~/.config/snakemake/profiles/mycluster/) that encodes the cluster submission command template (e.g. sbatch … or qsub …), default resources, concurrency limits, and translations from rule-level resource names to scheduler flags. Many communities publish example profiles for Slurm, PBS, etc.

  2. Write a Snakefile so each rule has resource hints: e.g.

    rule foo:
        input: …
        output: …
        threads: 4
        resources:
            mem_mb=16000,
            runtime_min=60
        shell:
            "some_tool --threads {threads} {input} > {output}"  # placeholder command
    
  3. When we run the workflow, we might do the following:

    snakemake --profile mycluster -j 100

    This causes Snakemake to submit each rule instance to the scheduler as its own job while respecting dependencies. Rules assigned to the same group (via the group: directive) are instead bundled into a single submission, which stops swarms of tiny tasks from flooding the queue.

Snakemake uses its own DSL. I don’t usually love DSLs, but this one might be worth it.

As an aside, Snakemake can also target public cloud backends — including Kubernetes clusters, AWS Batch, and Google’s Life Sciences API — by staging data in cloud object stores (S3/GCS) and containerizing rules. That lets us run the same Snakefile on-premises HPC clusters or managed cloud services with only profile or config changes, though costs and object-store quirks mean most people still run workflows on classic clusters.

What follows are some other options I do not intend to use in the near future.

3 Parsl

I’ve ended up on a cluster that doesn’t use Slurm — it uses PBS Pro. Parsl (“Parallel Scripting in Python”) seems like a good fit for this cluster. Parsl isn’t as seamless as submitit, but it has a long development history and major institutional backers, so we expect it to be more reliable.
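A minimal Parsl sketch for a PBS Pro site might look like the following; the queue name, module line, and resource values are placeholders, and real configs usually need site-specific launcher and worker_init tweaks:

import parsl
from parsl import python_app
from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.providers import PBSProProvider

config = Config(
    executors=[
        HighThroughputExecutor(
            label="pbs_htex",
            provider=PBSProProvider(
                queue="normal",                    # placeholder queue name
                worker_init="module load python",  # environment setup on the compute node
                nodes_per_block=1,
                walltime="01:00:00",
            ),
        )
    ]
)
parsl.load(config)

@python_app
def fit(seed):
    import random
    random.seed(seed)
    return random.random()  # stand-in for a real training run

futures = [fit(s) for s in range(8)]   # each call becomes a task on the PBS allocation
print([f.result() for f in futures])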

TODO: compare and contrast with Snakemake.

4 MLflow

MLflow is primarily a tracker, but it can also act as an executor. We can use MLflow as a “submitter + tracker”, but there are caveats.

4.1 Slurm + mlflow-slurm plugin

There’s an external plugin, mlflow-slurm, maintained by NCSA. It implements an MLflow Projects backend that submits jobs via sbatch. We call it like this:

mlflow run --backend slurm \
          --backend-config slurm_config.json \
          <project>

The slurm_config.json lets us control the partition, account, modules to load, memory, GPUs, nodes, ntasks, exclusivity and more.
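The same launch can be triggered from Python via mlflow.projects.run, which takes the backend and backend-config arguments that the CLI flags map to; the project URI, parameters, and config filename below are placeholders:

import mlflow

submitted = mlflow.projects.run(
    uri=".",                              # an MLproject directory or git URL
    entry_point="main",
    parameters={"epochs": "10"},          # hypothetical project parameter
    backend="slurm",                      # provided by the mlflow-slurm plugin
    backend_config="slurm_config.json",
    synchronous=False,                    # return immediately rather than blocking
)
submitted.wait()                          # block until the Slurm job completes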

One useful feature is “sequential workers”: if training can’t finish within the walltime, we can checkpoint and schedule dependent jobs to continue.

However:

  • The plugin isn’t first-party — it’s community-maintained and can lag behind core MLflow versions.
  • Because the plugin generates its own sbatch scripts, we lose flexibility and often have to duplicate its logic for modules, environment setup, and pre/post hooks.
  • For multi-step or multi-task workflows — especially those needing orchestration beyond a single project run — the plugin’s capabilities are fairly limited. We often embed our own scheduling logic or wrap calls instead of expecting MLflow to manage complex DAGs.
  • On the plus side, the plugin works well when an ML job is encapsulated in a single “project run”, i.e., when we want a self-contained entry point.

4.2 PBS / non-Slurm environments

In clusters that use PBS Pro, Torque, etc., there isn’t a widely adopted MLflow Projects backend equivalent to mlflow-slurm. In practice, we commonly:

  1. Write a PBS job script (or template) that activates the environment, sets MLFLOW_TRACKING_URI, and then calls either mlflow run or a Python entry point that logs to MLflow.

  2. Use external orchestrators or scheduling helpers (e.g. Dask-Jobqueue’s PBSCluster or other queue submission clients) to handle job submission while embedding MLflow tracking inside the job. In this model, MLflow acts as a tracking and packaging layer, not the scheduler (a sketch of this pattern follows below).
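A minimal sketch of that second pattern, assuming dask-jobqueue is installed; the queue, resources, and tracking URI are placeholders:

from dask.distributed import Client
from dask_jobqueue import PBSCluster

# Each "job" here is a PBS batch job running a Dask worker.
cluster = PBSCluster(
    queue="normal",        # placeholder queue name
    cores=8,
    memory="32GB",
    walltime="02:00:00",
)
cluster.scale(jobs=4)      # ask PBS for 4 worker jobs
client = Client(cluster)

def run_trial(lr):
    # MLflow is only the tracking layer; PBS (via dask-jobqueue) does the scheduling.
    import mlflow
    mlflow.set_tracking_uri("file:///scratch/mlruns")  # placeholder shared-filesystem store
    with mlflow.start_run():
        mlflow.log_param("lr", lr)
        loss = lr * 0.5    # stand-in for a real training loop
        mlflow.log_metric("loss", loss)
    return loss

futures = client.map(run_trial, [1e-2, 1e-3, 1e-4])
print(client.gather(futures))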

5 Nextflow


Nextflow is a workflow manager and a DSL that expresses pipelines as dataflow graphs of “processes” connected by “channels”, allowing parallelism to emerge naturally from data dependencies rather than from hand‑crafted orchestration logic. Pipelines can be versioned and run directly from Git repositories, parameterized at invocation time, and made reproducible through first‑class support for containers and environments (Docker/Podman, Apptainer/Singularity, Conda).

A neat operational feature is incremental “resume” (call caching). It avoids re‑running tasks whose inputs, parameters, or container/image haven’t changed, speeding up iterations on large analyses.

The DSL stuff looks cool, but I instinctively resist it — it suggests projects bigger than I’d like to tackle as a lone researcher. That might just mean my projects aren’t a good fit for it, rather than the tool itself being “bad”.

On classic HPC clusters, Nextflow’s main selling point is its native executors for common schedulers: it submits each process as a batch job with the requested CPUs, memory, and walltime, so we can keep a single pipeline and swap execution backends via configuration profiles. The supported executors include widely used managers such as Slurm, PBS/Torque/PBS Pro, SGE, and LSF, and the container runtimes suit multi‑tenant HPC (notably Apptainer/Singularity), which helps satisfy site policy while preserving consistent software stacks across nodes. Operational controls like queue size limits, submission throttling, retries, and per‑process resource directives help align pipeline behaviour with scheduler expectations. The same pipeline definitions can, in principle, target cloud backends (e.g. AWS Batch, Azure) without changes to the workflow logic.

In practice, teams using HPC need to develop site‑aware configuration profiles, containerized tools vetted for the cluster, and data‑locality strategies that minimize cross‑filesystem I/O during high fan‑out stages.


6 Naked DRMAA

Apparently there’s a modern programmatic API for some classic schedulers called DRMAA (Distributed Resource Management Application API). It supports fairly generic job definitions and, in principle, seems to work on my local cluster, but there’s no documentation on how to run it there. So I suspect that’s a battle I don’t want to pick.

7 The here-document trick

tbeason suggests this hack:

#!/usr/bin/env sh
#SBATCH -N 10
#SBATCH -n 8
#SBATCH -o %x-%j.out

module load julia/1.6.1 ## I have to load julia before calling julia

julia << EOF

using SomePackage
do_julia_stuff

EOF

Notice what happened there? We invoked our preferred programming language from within the job-submission shell script, so the submission script and the code stay together in one file.

8 Request a multicore job from the scheduler and manage it like a mini-cluster in Python

  • Dask.distributed works well for multi-machine cluster jobs and can even spawn the Slurm job for us (see the sketch after this list).

  • Easily distribute a parallel IPython notebook on a cluster.

    Have you ever asked yourself: “Do I want to spend 2 days adjusting this analysis to run on the cluster and wait 2 days for the jobs to finish or do I just run it locally with no extra work and just wait a week.”

  • ipython-cluster-helper automates the whole process for us.

    “Quickly and easily parallelize Python functions using IPython on a cluster, supporting multiple schedulers. Optimizes IPython defaults to handle larger clusters and simultaneous processes.” […]

    ipython-cluster-helper creates a throwaway parallel IPython profile, launches a cluster and returns a view. On program exit it shuts the cluster down and deletes the throwaway profile.

    It works with Platform LSF, Sun Grid Engine, Torque, and SLURM. [TODO clarify] Python-only.
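To make the Dask option above concrete, here is a minimal sketch using dask-jobqueue’s SLURMCluster; the partition and resource values are placeholders:

from dask.distributed import Client
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    queue="work",          # placeholder partition name
    cores=16,
    memory="64GB",
    walltime="04:00:00",
    processes=4,           # split each Slurm job into 4 worker processes
)
cluster.adapt(minimum=0, maximum=40)   # scale the worker count up and down with demand
client = Client(cluster)

def simulate(seed):
    import random
    random.seed(seed)
    return sum(random.random() for _ in range(1_000_000))

futures = client.map(simulate, range(200))  # fan out over the throwaway mini-cluster
results = client.gather(futures)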

9 Misc Python

See also DRMAA Python, a Python wrapper for the DRMAA API.
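A minimal sketch of what that looks like, assuming the site actually ships a DRMAA shared library (the drmaa package finds it via the DRMAA_LIBRARY_PATH environment variable); the command and script name are hypothetical:

import drmaa

session = drmaa.Session()
session.initialize()

template = session.createJobTemplate()
template.remoteCommand = "/usr/bin/env"   # placeholder command
template.args = ["python", "train.py"]    # hypothetical training script
template.jobName = "drmaa-demo"

job_id = session.runJob(template)
info = session.wait(job_id, drmaa.Session.TIMEOUT_WAIT_FOREVER)
print("job", job_id, "finished with exit status", info.exitStatus)

session.deleteJobTemplate(template)
session.exit()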

Other projects I looked at: Andrea Zonca wrote a script that lets us submit jobs to a cluster from a Jupyter notebook. After several iterations and improvements, it’s now called batchspawner.

10 Hadoop on the cluster

The hanythingondemand project provides scripts to set up an ad-hoc Hadoop cluster by submitting PBS jobs.

11 Misc Julia

In Julia, there’s a rather fancy system JuliaParallel/ClusterManagers.jl that supports most major HPC job managers.

There’s also a bare-bones cth/QsubCmds.jl: Run Julia external (shell) commands on an HPC cluster.

12 R-specific