Scheduling ML jobs on HPC clusters

In Soviet Russia, job puts YOU in queue

2018-03-09 — 2025-09-30

Wherein classic HPC schedulers are surveyed and a practical habit is recommended: submitit is presented as a way to run Python functions via Slurm, and Snakemake’s job‑grouping is noted.

computers are awful
computers are awful together
concurrency hell
premature optimization

Working on classic HPC clusters.

Slurm, Torque, Platform LSF… Typically, a campus cluster ships with a few gruff example commands that worked for someone once and not much else. Often the examples run a molecular‑simulation package written in a language I’ve never heard of, or one I’d rather forget. This mixes the understandable desire to avoid writing documentation for every bizarre, idiosyncratic use case with a kind of “availability-through-obscurity” demand management; in my experience, a cluster’s IT team is often reluctant even to document which scheduler it runs. Cluster operators are typically less eager to allocate GPUs, a bit baffled by modern neural-network workflows, and downright flabbergasted by containers. The clusters themselves tend to share a similar API and provide concurrency guarantees via the infamous Byzantine committee-designed greasy totem pole priority system. That said, I’ve lived through the transition from classic compute to GPU‑everything, and these days sysadmins know all about GPUs.

Anyway, here are some methods that work well for my use cases, which lean toward statistical inference and neural nets.

1 submitit

A simple tool that tells Slurm to run a Python function as a job, cleanly. It’s the infrastructure for a lot of other things. I use it so much that I made a submitit notebook. Go take a look. It’s my current go-to option for Python.
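The basic pattern looks something like the following minimal sketch (assuming Slurm is reachable; the partition name and resource numbers are placeholders for whatever the local site expects):

import submitit

def train(lr):
    # the real training loop would go here; return whatever metric we care about
    return lr * 2  # placeholder

# AutoExecutor writes logs and pickled results under the given folder,
# and falls back to local execution when no Slurm is present.
executor = submitit.AutoExecutor(folder="slurm_logs")
executor.update_parameters(
    timeout_min=60,          # walltime in minutes
    slurm_partition="gpu",   # placeholder partition name
    gpus_per_node=1,
    cpus_per_task=4,
    mem_gb=16,
)

jobs = executor.map_array(train, [1e-3, 1e-4, 1e-5])  # one Slurm array task per argument
print([job.result() for job in jobs])                 # blocks until the jobs finish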

2 Snakemake

I missed it before, but Snakemake deserves mention: it’s a workflow engine (inspired by Make) with built‑in support for submitting jobs to HPC clusters, and it handles parallelism and dependencies cleanly.

  • We express a pipeline as a DAG of rules (using input, output, and shell/run directives for the logic, plus threads and resources for resource requirements).
  • Snakemake detects jobs that can run in parallel based on dependencies, so we don’t have to write orchestration code.
  • It has built-in support for submitting jobs to batch schedulers (Slurm, PBS, SGE, LSF) via its cluster execution mode.
  • It supports job grouping, which reduces scheduler load for many small tasks by bundling them into a single job.
  • We can define default resources and override them per rule (e.g. memory, threads), and forward scheduler-specific flags via templates or cluster-config.yaml.
  • Because Snakemake is workflow-aware, it avoids re-running tasks whose inputs haven’t changed, supports modular workflows, and integrates more easily with containerization and rule reuse.

A typical pattern is:

  1. Create a profile (e.g. in ~/.config/snakemake/profiles/mycluster/) that encodes the cluster submission command template (e.g. sbatch … or qsub …), default resources, concurrency limits, and translations from rule-level resource names to scheduler flags. Many communities publish example profiles for Slurm, PBS, etc.

  2. Write a Snakefile so each rule has resource hints: e.g.

    rule foo:
        input: …
        output: …
        threads: 4
        resources:
            mem_mb=16000,
            runtime_min=60
        shell:
            "some_tool --threads {threads} {input} > {output}"  # placeholder command
    
  3. When we run the workflow, we might do the following:

    snakemake --profile mycluster -j 100

    This causes Snakemake to submit each rule instance to the scheduler as its own job while respecting dependencies. Rules assigned to the same group (via the group: directive) are instead bundled into a single submission, which stops swarms of tiny tasks from flooding the queue.

Snakemake uses its own DSL. I don’t usually love DSLs, but this one might be worth it.

As an aside, Snakemake can also target public cloud backends — including Kubernetes clusters, AWS Batch, and Google’s Life Sciences API — by staging data in cloud object stores (S3/GCS) and containerizing rules. That lets us run the same Snakefile on-premises HPC clusters or managed cloud services with only profile or config changes, though costs and object-store quirks mean most people still run workflows on classic clusters.

What follows are some other options I do not intend to use in the near future.

3 Parsl

I’ve ended up on a cluster that doesn’t use Slurm — it uses PBS Pro. Parsl (“Parallel Scripting in Python”) seems like a good fit for this cluster. Parsl isn’t as seamless as submitit, but it has a long development history and major institutional backers, so we expect it to be more reliable.
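A minimal Parsl sketch for a PBS Pro site might look like the following; the queue name, module line, and resource values are placeholders, and real configs usually need site-specific launcher and worker_init tweaks:

import parsl
from parsl import python_app
from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.providers import PBSProProvider

config = Config(
    executors=[
        HighThroughputExecutor(
            label="pbs_htex",
            provider=PBSProProvider(
                queue="normal",                    # placeholder queue name
                worker_init="module load python",  # environment setup on the compute node
                nodes_per_block=1,
                walltime="01:00:00",
            ),
        )
    ]
)
parsl.load(config)

@python_app
def fit(seed):
    import random
    random.seed(seed)
    return random.random()  # stand-in for a real training run

futures = [fit(s) for s in range(8)]   # each call becomes a task on the PBS allocation
print([f.result() for f in futures])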

TODO: compare and contrast with Snakemake.

4 MLflow

MLflow is primarily a tracker, but it can also act as an executor. We can use MLflow as a “submitter + tracker”, but there are caveats.

4.1 Slurm + mlflow-slurm plugin

There’s an external plugin, mlflow-slurm, maintained by NCSA. It implements an MLflow Projects backend that submits jobs via sbatch. We call it like this:

mlflow run --backend slurm \
          --backend-config slurm_config.json \
          <project>

The slurm_config.json lets us control the partition, account, modules to load, memory, GPUs, nodes, ntasks, exclusivity and more.
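The same launch can be triggered from Python via mlflow.projects.run, which takes the backend and backend-config arguments that the CLI flags map to; the project URI, parameters, and config filename below are placeholders:

import mlflow

submitted = mlflow.projects.run(
    uri=".",                              # an MLproject directory or git URL
    entry_point="main",
    parameters={"epochs": "10"},          # hypothetical project parameter
    backend="slurm",                      # provided by the mlflow-slurm plugin
    backend_config="slurm_config.json",
    synchronous=False,                    # return immediately rather than blocking
)
submitted.wait()                          # block until the Slurm job completes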

One useful feature is “sequential workers”: if training can’t finish within the walltime, we can checkpoint and schedule dependent jobs to continue.

However:

  • The plugin isn’t first-party — it’s community-maintained and can lag behind core MLflow versions.
  • Because the plugin generates its own sbatch scripts, we lose flexibility and often have to duplicate its logic for modules, environment setup, and pre/post hooks.
  • For multi-step or multi-task workflows — especially those needing orchestration beyond a single project run — the plugin’s capabilities are fairly limited. We often embed our own scheduling logic or wrap calls instead of expecting MLflow to manage complex DAGs.
  • On the plus side, the plugin works well when an ML job is encapsulated in a single “project run”, i.e., when we want a self-contained entry point.

4.2 PBS / non-Slurm environments

In clusters that use PBS Pro, Torque, etc., there isn’t a widely adopted MLflow Projects backend equivalent to mlflow-slurm. In practice, we commonly:

  1. Write a PBS job script (or template) that activates the environment, sets MLFLOW_TRACKING_URI, and then calls either mlflow run or a Python entry point that logs to MLflow.

  2. Use external orchestrators or scheduling helpers (e.g. Dask-Jobqueue’s PBSCluster or other queue submission clients) to handle job submission while embedding MLflow tracking inside the job. In this model, MLflow acts as a tracking and packaging layer, not the scheduler (a sketch of this pattern follows below).
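A minimal sketch of that second pattern, assuming dask-jobqueue is installed; the queue, resources, and tracking URI are placeholders:

from dask.distributed import Client
from dask_jobqueue import PBSCluster

# Each "job" here is a PBS batch job running a Dask worker.
cluster = PBSCluster(
    queue="normal",        # placeholder queue name
    cores=8,
    memory="32GB",
    walltime="02:00:00",
)
cluster.scale(jobs=4)      # ask PBS for 4 worker jobs
client = Client(cluster)

def run_trial(lr):
    # MLflow is only the tracking layer; PBS (via dask-jobqueue) does the scheduling.
    import mlflow
    mlflow.set_tracking_uri("file:///scratch/mlruns")  # placeholder shared-filesystem store
    with mlflow.start_run():
        mlflow.log_param("lr", lr)
        loss = lr * 0.5    # stand-in for a real training loop
        mlflow.log_metric("loss", loss)
    return loss

futures = client.map(run_trial, [1e-2, 1e-3, 1e-4])
print(client.gather(futures))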

5 Nextflow


Nextflow is a workflow manager and a DSL that expresses pipelines as dataflow graphs of “processes” connected by “channels”, allowing parallelism to emerge naturally from data dependencies rather than from hand‑crafted orchestration logic. Pipelines can be versioned and run directly from Git repositories, parameterized at invocation time, and made reproducible through first‑class support for containers and environments (Docker/Podman, Apptainer/Singularity, Conda).

A neat operational feature is incremental “resume” (call caching). It avoids re‑running tasks whose inputs, parameters, or container/image haven’t changed, speeding up iterations on large analyses.

The DSL stuff looks cool, but I instinctively resist it — it suggests projects bigger than I’d like to tackle as a lone researcher. That might just mean my projects aren’t a good fit for it, rather than the tool itself being “bad”.

On classic HPC clusters, Nextflow’s main selling point is its native executors for common schedulers: it submits each process as a batch job with the requested CPUs, memory, and walltime, so we can keep a single pipeline and swap execution backends via configuration profiles. The supported executors include widely used managers such as Slurm, PBS/Torque/PBS Pro, SGE, and LSF, and the container runtimes suit multi‑tenant HPC (notably Apptainer/Singularity), which helps satisfy site policy while preserving consistent software stacks across nodes. Operational controls like queue size limits, submission throttling, retries, and per‑process resource directives help align pipeline behaviour with scheduler expectations. The same pipeline definitions can, in principle, target cloud backends (e.g. AWS Batch, Azure) without changes to the workflow logic.

In practice, teams using HPC need to develop site‑aware configuration profiles, containerized tools vetted for the cluster, and data‑locality strategies that minimize cross‑filesystem I/O during high fan‑out stages.


6 Naked DRMAA

Apparently there’s a modern programmatic API for some classic schedulers called DRMAA (Distributed Resource Management Application API). It supports fairly generic job definitions and, in principle, seems to work on my local cluster, but there’s no documentation on how to run it there. So I suspect that’s a battle I don’t want to pick.

7 The here-document trick

tbeason suggests this hack:

#!/usr/bin/env sh
#SBATCH -N 10
#SBATCH -n 8
#SBATCH -o %x-%j.out

module load julia/1.6.1 ## I have to load julia before calling julia

julia << EOF

using SomePackage
do_julia_stuff

EOF

Notice what happened there? We invoked our preferred programming language from within the job-submission shell script, so the submission script and the code stay together in one file.

8 Request a multicore job from the scheduler and manage it like a mini-cluster in Python

  • Dask.distributed works well for multi-machine cluster jobs and can even spawn the Slurm job for us (see the sketch after this list).

  • Easily distribute a parallel IPython notebook on a cluster.

    Have you ever asked yourself: “Do I want to spend 2 days adjusting this analysis to run on the cluster and wait 2 days for the jobs to finish or do I just run it locally with no extra work and just wait a week.”

  • ipython-cluster-helper automates the whole process for us.

    “Quickly and easily parallelize Python functions using IPython on a cluster, supporting multiple schedulers. Optimizes IPython defaults to handle larger clusters and simultaneous processes.” […]

    ipython-cluster-helper creates a throwaway parallel IPython profile, launches a cluster and returns a view. On program exit it shuts the cluster down and deletes the throwaway profile.

    It works with Platform LSF, Sun Grid Engine, Torque, and SLURM. [TODO clarify] Python-only.
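To make the Dask option above concrete, here is a minimal sketch using dask-jobqueue’s SLURMCluster; the partition and resource values are placeholders:

from dask.distributed import Client
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    queue="work",          # placeholder partition name
    cores=16,
    memory="64GB",
    walltime="04:00:00",
    processes=4,           # split each Slurm job into 4 worker processes
)
cluster.adapt(minimum=0, maximum=40)   # scale the worker count up and down with demand
client = Client(cluster)

def simulate(seed):
    import random
    random.seed(seed)
    return sum(random.random() for _ in range(1_000_000))

futures = client.map(simulate, range(200))  # fan out over the throwaway mini-cluster
results = client.gather(futures)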

9 Misc Python

See also DRMAA Python, a Python wrapper for the DRMAA API.
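A minimal sketch of what that looks like, assuming the site actually ships a DRMAA shared library (the drmaa package finds it via the DRMAA_LIBRARY_PATH environment variable); the command and script name are hypothetical:

import drmaa

session = drmaa.Session()
session.initialize()

template = session.createJobTemplate()
template.remoteCommand = "/usr/bin/env"   # placeholder command
template.args = ["python", "train.py"]    # hypothetical training script
template.jobName = "drmaa-demo"

job_id = session.runJob(template)
info = session.wait(job_id, drmaa.Session.TIMEOUT_WAIT_FOREVER)
print("job", job_id, "finished with exit status", info.exitStatus)

session.deleteJobTemplate(template)
session.exit()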

Other projects I looked at: Andrea Zonca wrote a script that lets us submit jobs to a cluster from a Jupyter notebook. After several iterations and improvements, it’s now called batchspawner.

10 Hadoop on the cluster

The hanythingondemand project provides scripts to set up an ad-hoc Hadoop cluster by submitting PBS jobs.

11 Misc Julia

In Julia, there’s a rather fancy system JuliaParallel/ClusterManagers.jl that supports most major HPC job managers.

There’s also a bare-bones cth/QsubCmds.jl: Run Julia external (shell) commands on an HPC cluster.

12 R-specific