Scheduling ML jobs on HPC clusters
In Soviet Russia, job puts YOU in queue
2018-03-09 — 2025-09-30
Wherein classic HPC schedulers are surveyed and a practical habit is recommended: submitit is presented as a way to run Python functions via Slurm, and Snakemake’s job‑grouping is noted.
Working on classic HPC clusters: Slurm, Torque, Platform LSF.
Typically, a campus cluster ships with a few gruff example commands that worked for someone once and not much else. Often the examples run a molecular‑simulation package written in a language I’ve never heard of, or one I’d rather forget. This mixes the understandable desire to avoid writing documentation for every bizarre, idiosyncratic use case with a kind of “availability-through-obscurity” demand management. In our experience, a cluster’s IT team is often reluctant even to document which scheduler it uses. The admins are typically less eager to allocate GPUs, a bit baffled by modern neural-network workflows, and downright flabbergasted by containers; at least, that used to be the case. I’ve lived through the transition from classic compute to GPU‑everything, and these days sysadmins know all about GPUs. Whatever the scheduler, clusters tend to share a similar API and provide concurrency guarantees via the infamous Byzantine committee-designed greasy totem pole priority system.
Anyway, here are some methods that work well for my use cases, which lean toward statistical inference and neural nets.
1 submitit
A simple tool that tells Slurm to run a Python function as a job, cleanly. It’s the infrastructure for a lot of other things. I use it so much that I made a submitit notebook. Go take a look. It’s my current go-to option for Python.
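A minimal sketch of the pattern (the partition name and resource numbers are placeholders):

```python
import submitit

def train(lr: float) -> float:
    # stand-in for a real training run
    return lr * 2

executor = submitit.AutoExecutor(folder="logs/%j")  # %j becomes the Slurm job id
executor.update_parameters(
    timeout_min=60,
    slurm_partition="gpu",  # placeholder partition name
    gpus_per_node=1,
    cpus_per_task=4,
    mem_gb=16,
)
jobs = [executor.submit(train, lr) for lr in (1e-3, 1e-2)]
print([job.result() for job in jobs])  # blocks until the Slurm jobs finish
```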
2 Snakemake
I missed it before, but Snakemake deserves mention: it’s a workflow engine (inspired by Make) with built‑in support for submitting jobs to HPC clusters, and it handles parallelism and dependencies cleanly.
- We express a pipeline as a DAG of rules (using input, output, and shell/run blocks for logic and resource requirements).
- Snakemake detects jobs that can run in parallel based on dependencies, so we don’t have to write orchestration code.
- It has built-in support for submitting jobs to batch schedulers (Slurm, PBS, SGE, LSF) via its cluster execution mode.
- It supports job grouping, which reduces scheduler load for many small tasks by bundling them into a single job.
- We can define default resources and override them per rule (e.g. memory, threads), and forward scheduler-specific flags via templates or cluster-config.yaml.
- Because Snakemake is workflow-aware, it avoids re-running tasks whose inputs haven’t changed, supports modular workflows, and integrates more easily with containerization and rule reuse.
A typical pattern is:
Create a profile (e.g. in ~/.config/snakemake/profiles/mycluster/) that encodes the cluster submission command template (e.g. sbatch … or qsub …), default resources, concurrency limits, and translations from rule-level resource names to scheduler flags. Many communities publish example profiles for Slurm, PBS, etc.

Write a Snakefile so each rule has resource hints, e.g.:
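Something like the following rule (a sketch only; the script name and resource numbers are made up, and the runtime resource in minutes assumes a reasonably recent Snakemake):

```
rule fit_model:
    input:
        "data/{dataset}.csv"
    output:
        "results/{dataset}/fit.json"
    threads: 4
    resources:
        mem_mb=16000,
        runtime=120,  # minutes
    shell:
        "python fit.py --data {input} --out {output} --threads {threads}"
```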
When we run the workflow, we might do the following:
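For example, using the profile sketched above (newer Snakemake, 8+, moves cluster submission to executor plugins instead, e.g. --executor slurm):

```bash
# submit up to 100 jobs at a time through the cluster profile
snakemake --profile mycluster --jobs 100
```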
This causes Snakemake to submit each rule instance as its own scheduler job while respecting dependencies; with job grouping (the group directive in a rule, or --groups/--group-components on the command line), several small rule instances are bundled into one submission to keep scheduler load down.
Snakemake uses its own DSL. I don’t usually love the DSL, but it might be worth it here.
As an aside, Snakemake can also target public cloud backends (Kubernetes clusters, AWS Batch, Google’s Life Sciences API) by staging data in cloud object stores (S3/GCS) and containerizing rules. That lets us run the same Snakefile on on-premises HPC clusters or on managed cloud services with only profile or config changes, though costs and object-store quirks mean most people still run workflows on classic clusters.
3 Parsl
I’ve ended up on a cluster that doesn’t use Slurm; it uses PBS Pro. Parsl (“Parallel Scripting in Python”) seems like a good fit for this cluster. Parsl isn’t as seamless as submitit, but it has a long development history and major institutional backers, so we expect it to be more reliable.
- code: Parsl/parsl
- docs: Parsl
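A minimal sketch for a PBS Pro site (the queue name, project code, and resource numbers are all placeholders):

```python
import parsl
from parsl import python_app
from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.providers import PBSProProvider

# Pilot-job setup: Parsl submits PBS jobs ("blocks") and runs workers inside them.
config = Config(
    executors=[
        HighThroughputExecutor(
            label="htex_pbs",
            provider=PBSProProvider(
                queue="normal",        # placeholder queue name
                account="my_project",  # placeholder project/account code
                cpus_per_node=8,
                nodes_per_block=1,
                init_blocks=1,
                max_blocks=4,
                walltime="01:00:00",
            ),
        )
    ]
)
parsl.load(config)

@python_app
def square(x):
    return x * x

futures = [square(i) for i in range(16)]
print([f.result() for f in futures])  # blocks until the PBS-hosted workers finish
```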
TODO: compare and contrast with Snakemake.
4 MLflow
MLflow is primarily a tracker, but it can also act as an executor. We can use MLflow as a “submitter + tracker”, but there are caveats.
4.1 Slurm + mlflow-slurm plugin
There’s an external plugin, mlflow-slurm, maintained by NCSA. It implements an MLflow Projects backend that submits jobs via sbatch. We call it like this:
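Roughly as follows (the project URI and config filename are placeholders; check the plugin’s README for the exact invocation):

```bash
mlflow run . --backend slurm --backend-config slurm_config.json
```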
The slurm_config.json file lets us control the partition, account, modules to load, memory, GPUs, nodes, ntasks, exclusivity and more.
One useful feature is “sequential workers”: if training can’t finish within the walltime, we can checkpoint and schedule dependent follow-on jobs to continue (see the plugin’s README).
However:
- The plugin isn’t first-party — it’s community-maintained and can lag behind core MLflow versions.
- Because the plugin generates its own sbatch scripts, we lose flexibility and often have to duplicate its logic for modules, environment setup, and pre/post hooks.
- For multi-step or multi-task workflows — especially those needing orchestration beyond a single project run — the plugin’s capabilities are fairly limited. We often embed our own scheduling logic or wrap calls instead of expecting MLflow to manage complex DAGs.
- That said, the plugin works well when an ML job is encapsulated in a single “project run”, i.e., when we want a self-contained entry point.
4.2 PBS / non-Slurm environments
In clusters that use PBS Pro, Torque, etc., there isn’t a widely adopted MLflow Projects backend equivalent to mlflow-slurm. In practice, we commonly:
- Write a PBS job script (or template) that activates the environment, sets MLFLOW_TRACKING_URI, and then calls either mlflow run or a Python entry point that logs to MLflow.
- Use external orchestrators or scheduling helpers (e.g. Dask-Jobqueue’s PBSCluster or other queue submission clients) to handle job submission while embedding MLflow tracking inside the job. In this model, MLflow acts as a tracking and packaging layer, not the scheduler.
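A sketch of the first option, assuming a PBS Pro site; the module name, paths, resource request, and tracking URI are all placeholders:

```bash
#!/bin/bash
#PBS -N mlflow-train
#PBS -q normal
#PBS -l select=1:ncpus=8:mem=32gb
#PBS -l walltime=02:00:00

cd "$PBS_O_WORKDIR"
module load python        # placeholder: whatever the site provides
source .venv/bin/activate

# Point runs at the shared tracking server (placeholder URI).
export MLFLOW_TRACKING_URI="https://mlflow.example.org"

# The script does its own MLflow logging; PBS only supplies resources and queueing.
python train.py --epochs 20
```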
5 Nextflow
Nextflow is a workflow manager and a DSL that expresses pipelines as dataflow graphs of “processes” connected by “channels”, allowing parallelism to emerge naturally from data dependencies rather than from hand‑crafted orchestration logic. Pipelines can be versioned and run directly from Git repositories, parameterized at invocation time, and made reproducible through first‑class support for containers and environments (Docker/Podman, Apptainer/Singularity, Conda).
A neat operational feature is incremental “resume” (call caching). It avoids re‑running tasks whose inputs, parameters, or container/image haven’t changed, speeding up iterations on large analyses.
The DSL stuff looks cool, but I instinctively resist it — it suggests projects bigger than I’d like to tackle as a lone researcher. That might just mean my projects aren’t a good fit for it, rather than the tool itself being “bad”.
On classic HPC clusters, Nextflow’s main selling point is its native executors for common schedulers. It submits each process as a batch job with the requested CPUs, memory, and walltime, so we can keep a single pipeline and swap execution backends via configuration profiles using Executors. This includes widely used managers such as Slurm, PBS/Torque/PBS Pro, SGE, and LSF, along with container runtimes suited to multi‑tenant HPC (notably Apptainer/Singularity). This helps satisfy site policy while preserving consistent software stacks across nodes. Operational controls like queue size limits, submission throttling, retries, and per‑process resource directives help align pipeline behaviour with scheduler expectations. The same pipeline definitions can, in principle, target cloud backends (e.g., AWS Batch, Azure) without code changes to the workflow logic.
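A sketch of what such profiles look like in nextflow.config (the executor names are Nextflow’s own; queue names and resource values are placeholders):

```groovy
// nextflow.config
profiles {
    slurm {
        process.executor    = 'slurm'
        process.queue       = 'gpu'      // placeholder partition
        process.memory      = '16 GB'
        process.time        = '2h'
        singularity.enabled = true
    }
    pbspro {
        process.executor = 'pbspro'
        process.queue    = 'normal'      // placeholder queue
    }
}
```

Then nextflow run main.nf -profile slurm (or -profile pbspro) selects the backend without touching the workflow logic.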
In practice, teams using HPC need to develop site‑aware configuration profiles, containerized tools vetted for the cluster, and data‑locality strategies that minimize cross‑filesystem I/O during high fan‑out stages.
6 Naked DRMAA
Apparently there’s a modern programmatic API for some classic schedulers called DRMAA (Distributed Resource Management Application API). It supports fairly generic job definitions and, in principle, seems to work on my local cluster, but there’s no documentation on how to run it there. So I suspect that’s a battle I don’t want to pick.
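For the record, the Python binding looks roughly like this; it needs the scheduler’s libdrmaa library installed and discoverable, which is exactly the undocumented part, and the command and arguments here are placeholders:

```python
import drmaa  # pip install drmaa; requires the site's libdrmaa shared library

with drmaa.Session() as s:
    jt = s.createJobTemplate()
    jt.remoteCommand = "python"             # placeholder command
    jt.args = ["train.py", "--lr", "1e-3"]  # placeholder arguments
    jt.jobName = "drmaa_demo"
    job_id = s.runJob(jt)
    info = s.wait(job_id, drmaa.Session.TIMEOUT_WAIT_FOREVER)
    print(f"job {job_id} finished with exit status {info.exitStatus}")
    s.deleteJobTemplate(jt)
```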
7 The here-document trick
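A minimal sketch with Slurm (the #SBATCH values are placeholders):

```bash
#!/bin/bash
#SBATCH --job-name=heredoc-demo
#SBATCH --time=00:30:00
#SBATCH --cpus-per-task=2
#SBATCH --mem=4G

# The Python payload is fed to the interpreter on stdin via a here-document,
# so the resource request and the code live in the same file.
python - <<'PYEOF'
import platform
print("hello from", platform.node())
# ... the actual analysis would go here ...
PYEOF
```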
Did we notice what happened? We invoked our preferred programming language from the job-submission shell script, so the submission script and the code stayed together.
8 Request a multicore job from the scheduler and manage it like a mini-cluster in Python
Dask.distributed works well for multi-machine cluster jobs and can even spawn the Slurm job for us.
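A sketch using Dask-Jobqueue, which writes and submits the Slurm worker jobs for us (queue, memory, and walltime are placeholders):

```python
from dask.distributed import Client
from dask_jobqueue import SLURMCluster
import dask.array as da

# Each "job" is one Slurm submission hosting Dask workers.
cluster = SLURMCluster(
    queue="general",      # placeholder partition
    cores=8,
    memory="16GB",
    walltime="01:00:00",
)
cluster.scale(jobs=4)     # ask Slurm for 4 worker jobs
client = Client(cluster)

x = da.random.random((20_000, 20_000), chunks=(2_000, 2_000))
print(x.mean().compute())  # computed across the Slurm-hosted workers
```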
Easily distribute a parallel IPython notebook on a cluster.
Have you ever asked yourself: “Do I want to spend 2 days adjusting this analysis to run on the cluster and wait 2 days for the jobs to finish or do I just run it locally with no extra work and just wait a week.”
ipython-cluster-helper automates the whole process for us.
“Quickly and easily parallelize Python functions using IPython on a cluster, supporting multiple schedulers. Optimizes IPython defaults to handle larger clusters and simultaneous processes.” […]
ipython-cluster-helper creates a throwaway parallel IPython profile, launches a cluster and returns a view. On program exit it shuts the cluster down and deletes the throwaway profile.
It works with Platform LSF, Sun Grid Engine, Torque, and SLURM. [TODO clarify] Python-only.
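The pattern looks roughly like this (scheduler and queue names are placeholders):

```python
from cluster_helper.cluster import cluster_view

def square(x):
    return x * x

# Spins up a throwaway IPython parallel cluster via the scheduler,
# maps the function over the inputs, then tears everything down on exit.
with cluster_view(scheduler="slurm", queue="general", num_jobs=8) as view:
    results = list(view.map(square, range(100)))

print(results[:5])
```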
9 Misc Python
See also DRMAA Python, a Python wrapper for the DRMAA API.
Other projects I looked at: Andrea Zonca wrote a script that lets us submit jobs to a cluster from a Jupyter notebook. After several iterations and improvements, it’s now called batchspawner.
10 Hadoop on the cluster
The hanythingondemand project provides scripts to set up an ad-hoc Hadoop cluster by submitting PBS jobs.
11 Misc Julia
In Julia, there’s a rather fancy system JuliaParallel/ClusterManagers.jl that supports most major HPC job managers.
There’s also a bare-bones cth/QsubCmds.jl: Run Julia external (shell) commands on an HPC cluster.