Scheduling ML jobs on HPC clusters

In Soviet Russia, job puts YOU in queue

2018-03-09 — 2025-09-07

computers are awful
computers are awful together
concurrency hell
premature optimization

Doing stuff on classic HPC clusters.

Slurm, Torque, and Platform LSF all implement a similar API providing concurrency guarantees specified by the famous Byzantine, committee-designed, greasy-totem-pole priority system. Empirical observation: the IT department for any given cluster often seems reluctant to document which one they are using. Typically a campus cluster comes with some gruff example commands that worked for that guy that time, but not much more. Usually that guy that time was running a molecular simulation package written in some language I have never heard of, or alternatively one I wish I could forget I have heard of. Presumably this is a combination of the understandable desire not to write documentation for all the bizarre idiosyncratic use cases, and a kind of availability-through-obscurity demand management. IT departments have also traditionally been less eager to allocate GPUs, slightly confused by all this modern neural network stuff, and downright flabbergasted by containers; although, having lived through the transition from classic compute to GPU-everything, I note that modern sysadmins now know all about GPUs.

Anyway, here are some methods for getting stuff done that work well for my use-cases, which tend towards statistical inference and neural nets etc.

1 submitit

My current go-to option for Python. I use this so much that I made a submitit notebook. Go there.
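For orientation, here is a minimal sketch of the futures-style workflow, assuming a Slurm cluster; the partition name and log folder are placeholders:

import submitit

def add(a, b):
    # any picklable function can become a cluster job
    return a + b

# AutoExecutor detects the scheduler (Slurm here); the log folder is arbitrary
executor = submitit.AutoExecutor(folder="logs/submitit")
executor.update_parameters(
    timeout_min=60,            # walltime in minutes
    slurm_partition="work",    # placeholder partition name
    cpus_per_task=2,
    mem_gb=4,
)
job = executor.submit(add, 2, 3)   # queues an sbatch job
print(job.result())                # blocks until the job finishes; prints 5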

2 MLFlow

On Slurm clusters, MLflow Projects can be launched through a dedicated plugin backend (for example, mlflow-slurm), so runs are queued via sbatch with parameters controlled by a simple config file, which is reasonably straightforward for HPC use. When a Python-first workflow is preferred, submitit provides a futures-style interface for scheduling functions on Slurm and pairs cleanly with MLflow logging inside the function body, covering single- and multi-node jobs with retries and checkpointing. For distributed workloads, the usual pattern is to rely on the framework launcher (e.g. torchrun) and let MLflow track runs from within the job, keeping Slurm in charge of placement and resources.
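By way of illustration, a hedged sketch of the submitit-plus-MLflow pattern just described; the tracking URI, partition, and hyperparameters are all placeholders:

import mlflow
import submitit

def train(lr):
    # MLflow logging happens inside the job, on the compute node
    mlflow.set_tracking_uri("file:./mlruns")   # placeholder; often an http:// tracking server
    with mlflow.start_run():
        mlflow.log_param("lr", lr)
        loss = 1.0 / lr                        # stand-in for a real training loop
        mlflow.log_metric("loss", loss)
    return loss

executor = submitit.AutoExecutor(folder="logs/mlflow_jobs")
executor.update_parameters(timeout_min=120, slurm_partition="gpu", gpus_per_node=1)
jobs = executor.map_array(train, [0.1, 0.01])  # one Slurm array task per learning rate
print([job.result() for job in jobs])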


On PBS Pro clusters, I could find no widely used, first-party MLflow Projects backend analogous to Slurm's, so the usual approach is to submit PBS scripts that activate the environment, set MLFLOW_TRACKING_URI, and run either mlflow run or a Python entrypoint with MLflow logging. Orchestrators that support PBS (e.g. Dask-Jobqueue's PBSCluster) can handle queue submission while the code logs to MLflow, cleanly separating scheduling from tracking on these systems. It seems MLflow is strongest as a cross-environment tracker and packager; "turnkey" scheduler submission is most mature on Slurm, while PBS Pro workflows typically rely on native qsub scripts or external orchestrators coupled with MLflow tracking.
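A sketch of that separation of concerns on PBS, using Dask-Jobqueue to do the qsub-ing while the tasks log to MLflow; the queue name, resource requests, and tracking URI are assumptions:

from dask.distributed import Client
from dask_jobqueue import PBSCluster
import mlflow

# PBSCluster writes and submits the qsub scripts for us
cluster = PBSCluster(
    queue="workq",        # placeholder queue name
    cores=8,
    memory="16GB",
    walltime="01:00:00",
)
cluster.scale(jobs=4)     # request 4 worker jobs from the scheduler
client = Client(cluster)

def objective(x):
    # each task logs to MLflow from the worker; PBS stays in charge of placement
    mlflow.set_tracking_uri("file:./mlruns")   # placeholder
    with mlflow.start_run():
        mlflow.log_param("x", x)
        mlflow.log_metric("score", x ** 2)
    return x ** 2

futures = client.map(objective, range(8))
print(client.gather(futures))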

MLflow has some other cool features, such as experiment tracking and run configuration, so it is nice if you can get it.

3 Parsl

I have been forced onto a cluster that does not use Slurm but rather PBS Pro. Parsl ("Parallel Scripting in Python") seems to be an alternative for this case. It is not as seamless as submitit, but it has a long development history and big institutional backers, so it may have fewer bugs?
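A minimal sketch of what Parsl looks like on PBS Pro, via its HighThroughputExecutor; the queue, walltime, and worker_init values are assumptions for an unnamed site, and provider parameter names may vary between Parsl versions:

import parsl
from parsl import python_app
from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.providers import PBSProProvider

config = Config(
    executors=[
        HighThroughputExecutor(
            label="htex",
            provider=PBSProProvider(
                queue="workq",                      # placeholder queue
                walltime="01:00:00",
                nodes_per_block=1,
                cpus_per_node=8,
                worker_init="module load python",   # placeholder environment setup
            ),
        )
    ]
)
parsl.load(config)

@python_app
def square(x):
    # runs as a task on a PBS-allocated worker
    return x ** 2

futures = [square(i) for i in range(10)]
print([f.result() for f in futures])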

What follows are some other options I do not intend to use in the near future.

4 Nextflow


Nextflow is a workflow manager and DSL that expresses pipelines as dataflow graphs of “processes” connected by “channels,” which allows parallelism to emerge naturally from data dependencies rather than hand-crafted orchestration logic. Pipelines can be versioned and run directly from Git repositories, parameterized at invocation time, and made reproducible through first-class support for containers and environments (Docker/Podman, Apptainer/Singularity, Conda). A cool operational feature is incremental “resume” (call caching), which avoids re-running tasks whose inputs, parameters, and container/image are unchanged, improving iteration speed on large analyses.

The DSL stuff looks cool, but I also instinctively resist using this because it seems to suggest bigger projects than the ones I would ideally like to be doing as a lone researcher. But possibly this reflects that my projects are ill-matched to it, rather than that the tool is "bad".

On classic HPC clusters, Nextflow's main affordance is native executors for common schedulers, submitting each process as a batch job with the requested CPUs, memory, and walltime, so users keep a single pipeline while swapping execution backends via configuration profiles. The supported executors seem to include widely used managers such as Slurm, PBS/Torque/PBS Pro, SGE, and LSF, along with container runtimes suited to multi-tenant HPC (notably Apptainer/Singularity), which helps satisfy site policy while preserving consistent software stacks across nodes. Operational controls like queue size limits, submission throttling, retries, and per-process resource directives help align pipeline behavior with scheduler expectations, and the same pipeline definition can target cloud backends (e.g. AWS Batch, Azure) when bursting is needed, without changes to the workflow logic.

HPC fit isn’t automatic: conservative defaults or naïve task sizing can generate many short jobs, stressing shared filesystems and confusing backfill policies; users often need to tune queueSize/submit rates, coarsen tasks, and place work directories on scratch to avoid metadata storms. Site limits like strict walltime caps and queue-specific policies can interact poorly with long fan-out pipelines, so profiling task durations and aligning resource requests to partitions/queues is essential to sustain throughput. In practice, teams on HPC probably need to develop site-aware configuration profiles, containerized tools vetted for the cluster, and data-locality choices that minimize cross-filesystem I/O during high fan-out stages etc etc.


5 Naked DRMAA

To investigate: apparently there is a modern programmatic API to some of the classic schedulers called DRMAA (Distributed Resource Management Application API), which allows fairly generic job definition and which notionally works on my local cluster, although they have not documented how. As such I suspect that this is a battle I do not wish to pick.
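For the record, the Python bindings (see Misc python below) make job submission look roughly like the following. This is a hedged sketch that assumes a working libdrmaa for the local scheduler (exactly the undocumented part) and PBS-style native flags:

import drmaa

# the session talks to whichever scheduler the local libdrmaa was built against
with drmaa.Session() as s:
    jt = s.createJobTemplate()
    jt.remoteCommand = "/bin/sleep"                   # stand-in for a real job script
    jt.args = ["60"]
    jt.jobName = "drmaa-demo"
    jt.nativeSpecification = "-l walltime=00:10:00"   # scheduler-specific flags; assumed PBS-ish
    job_id = s.runJob(jt)
    info = s.wait(job_id, drmaa.Session.TIMEOUT_WAIT_FOREVER)
    print("job", job_id, "exited with status", info.exitStatus)
    s.deleteJobTemplate(jt)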

6 The here-document trick

tbeason suggests this hack:

#!/usr/bin/env sh
## request 10 nodes and 8 tasks; write output to <jobname>-<jobid>.out
#SBATCH -N 10
#SBATCH -n 8
#SBATCH -o %x-%j.out

module load julia/1.6.1 ## I have to load julia before calling julia

## feed the Julia program to the interpreter via a here-document,
## so the batch directives and the code live in one file
julia << EOF

using SomePackage
do_julia_stuff

EOF

Did you see what happened? We invoked our preferred programming language from the job submission shell script, keeping the script and code in the same place.

7 Request a multicore job from the scheduler and manage it like a mini cluster in Python

  • Dask.distributed apparently works well for multi-machine jobs on the cluster, and (via dask-jobqueue) will even spawn the Slurm jobs for itself.

  • Easily distributing a parallel IPython Notebook on a cluster:

    Have you ever asked yourself: “Do I want to spend 2 days adjusting this analysis to run on the cluster and wait 2 days for the jobs to finish or do I just run it locally with no extra work and just wait a week.”

  • ipython-cluster-helper automates that.

    “Quickly and easily parallelize Python functions using IPython on a cluster, supporting multiple schedulers. Optimizes IPython defaults to handle larger clusters and simultaneous processes.” […]

    ipython-cluster-helper creates a throwaway parallel IPython profile, launches a cluster and returns a view. On program exit it shuts the cluster down and deletes the throwaway profile.

    Works on Platform LSF, Sun Grid Engine, Torque, and Slurm. Strictly Python. A minimal usage sketch follows this list.
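A hedged usage sketch of ipython-cluster-helper, assuming a Slurm site; the scheduler and queue names are placeholders:

from cluster_helper.cluster import cluster_view

def heavy_computation(x):
    # any picklable function; runs on an IPython engine on a compute node
    return x ** 2

# the context manager starts the throwaway cluster and tears it down on exit
with cluster_view(scheduler="slurm", queue="general",
                  num_jobs=4, cores_per_job=1) as view:
    results = list(view.map(heavy_computation, range(16)))
print(results)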

8 Misc python

See also DRMAA Python, which is a Python wrapper around the DRMAA API.

Other ones I looked at: Andrea Zonca wrote a script that allows spawning jobs on a cluster from a Jupyter notebook. After several iterations and improvements it is now called batchspawner.

snakemake supports make-like build workflows for clusters. It seems general and powerful but complicated.

9 Hadoop on the cluster

hanythingondemand provides a set of scripts to easily set up an ad-hoc Hadoop cluster through PBS jobs.

10 Misc julia

In Julia there is a rather fancy system JuliaParallel/ClusterManagers.jl which supports many major HPC job managers automatically.

There is also a bare-bones cth/QsubCmds.jl: run Julia external (shell) commands on an HPC cluster.

11 R-specific