Scheduling jobs on HPC clusters for modern ML nerds

In Soviet Russia, job puts YOU in queue

March 9, 2018 — April 28, 2023

computers are awful
computers are awful together
concurrency hell
distributed
premature optimization

Doing stuff on classic HPC clusters.

Slurm, Torque, Platform LSF all implement a similar API providing concurrency guarantees specified by the famous Byzantine committee-designed greasy totem pole priority system. Empirical observation: the IT department for any given cluster often seems reluctant to document which one they are using. Typically a campus cluster will come with some gruff example commands that worked for that guy that time, but not much more. Usually that guy that time was running a molecular simulation package written in some language I have never heard of. Presumably this is a combination of the understandable desire not to write documentation and a kind of availability-through-obscurity demand management. These IT departments are also typically less eager to allocate GPUs, slightly confused by all this modern neural network stuff, and downright flabbergasted by containers.

To investigate: Apparently there is a modern programmatic API to some of these schedulers called DRMAA (Distributed Resource Management Application API), which allows fairly generic job definitions, and which supposedly works on my local cluster, although they have certainly not documented how.

Anyway, here are some methods for getting stuff done that work well for my use-cases, which tend towards statistical inference and neural nets etc.

1 submitit

My current go-to option for python. I use this so much that I made a submitit notebook. Go there.
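For orientation, the basic pattern looks something like the following minimal sketch; the log folder, partition name and resource numbers are placeholders for whatever your cluster actually wants:

import submitit

def train(lr):
    # stand-in for the real experiment
    return 2 * lr

executor = submitit.AutoExecutor(folder="submitit_logs")
executor.update_parameters(timeout_min=60, slurm_partition="gpu", gpus_per_node=1)

job = executor.submit(train, 0.1)  # queued as a Slurm job
print(job.result())                # blocks until the job finishes, then returns the value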

What follows are some other options I do not frequently use.

2 The here document trick

tbeason suggests this hack:

#!/usr/bin/env sh
## request 10 nodes, 8 tasks in total; write output to <jobname>-<jobid>.out
#SBATCH -N 10
#SBATCH -n 8
#SBATCH -o %x-%j.out

module load julia/1.6.1 ## I have to load julia before calling julia

julia << EOF

using SomePackage
do_julia_stuff

EOF
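Submit it with sbatch as usual. The point of the here document is that the Julia program is embedded in the submission script itself, so there is no separate .jl file to keep in sync with the job script.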

3 Request a multicore job from the scheduler and manage that like a mini cluster in python

Dask.distributed apparently works well for multi-machine jobs on the cluster, and will even spawn the Slurm jobs for you.
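For concreteness, here is a sketch of what that looks like via the companion dask-jobqueue package, which writes and submits the Slurm jobs on your behalf; the queue name and resource figures below are invented:

from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# each Slurm job provides one worker with these (made-up) resources
cluster = SLURMCluster(queue="normal", cores=8, memory="16GB", walltime="01:00:00")
cluster.scale(jobs=4)      # ask Slurm for 4 such jobs
client = Client(cluster)   # subsequent dask work runs on those workers

futures = client.map(lambda x: x ** 2, range(100))
print(sum(client.gather(futures)))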

Easily distributing a parallel IPython Notebook on a cluster:

Have you ever asked yourself: “Do I want to spend 2 days adjusting this analysis to run on the cluster and wait 2 days for the jobs to finish or do I just run it locally with no extra work and just wait a week.”

ipython-cluster-helper automates that.

“Quickly and easily parallelize Python functions using IPython on a cluster, supporting multiple schedulers. Optimizes IPython defaults to handle larger clusters and simultaneous processes.” […]

ipython-cluster-helper creates a throwaway parallel IPython profile, launches a cluster and returns a view. On program exit it shuts the cluster down and deletes the throwaway profile.

Works on Platform LSF, Sun Grid Engine, Torque, SLURM. Strictly python.
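The pattern, going by its README (the scheduler, queue and job count below are placeholders):

from cluster_helper.cluster import cluster_view

def heavy_analysis(seed):
    return seed ** 2  # stand-in for the real work

with cluster_view(scheduler="slurm", queue="normal", num_jobs=10) as view:
    results = view.map(heavy_analysis, range(100))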

4 hydra

Handy if what I am running is many parallel experiments, since it includes a parallel job-submission system. See hydra ML.
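Roughly, a Hydra app is an ordinary function decorated with hydra.main, and the sweep is launched from the command line; the Slurm-backed launcher lives in the separate hydra-submitit-launcher plugin. The script and parameter names here are invented:

# train.py -- hypothetical Hydra app
import hydra
from omegaconf import DictConfig

@hydra.main(config_path=None, config_name=None, version_base=None)
def main(cfg: DictConfig) -> None:
    print(f"pretending to train with lr={cfg.lr}")

if __name__ == "__main__":
    main()

Then something like python train.py --multirun +lr=0.001,0.01,0.1 hydra/launcher=submitit_slurm should submit one Slurm job per parameter value, assuming the plugin is installed.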

5 test-tube

Seems to be discontinued.

An alternative option for many use cases is test-tube, a “Python library to easily log experiments and parallelize hyperparameter search for neural networks”. AFAICT there is nothing neural-network specific in this and it will happily schedule a whole bunch of useful types of task, generating the necessary scripts and keeping track of what is going on. This function is not obvious from the front page description of this software library, but see test-tube/SlurmCluster.md. (Thanks for pointing me to this, Chris Jackett.)

6 Misc python

See also DRMAA Python, which is a Python wrapper around the DRMAA API.
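A hedged sketch of what using it looks like; the command and arguments are placeholders, and the cluster needs to expose a DRMAA library (typically via the DRMAA_LIBRARY_PATH environment variable) for the import to work at all:

import drmaa

with drmaa.Session() as session:
    jt = session.createJobTemplate()
    jt.remoteCommand = "/usr/bin/env"              # executable to run
    jt.args = ["python", "my_experiment.py"]       # its arguments
    jt.joinFiles = True                            # merge stdout and stderr

    job_id = session.runJob(jt)
    info = session.wait(job_id, drmaa.Session.TIMEOUT_WAIT_FOREVER)
    print(f"job {job_id} finished with exit status {info.exitStatus}")

    session.deleteJobTemplate(jt)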

Other ones I looked at: Andrea Zonca wrote a script that allows spawning jobs on a cluster from a Jupyter notebook. After several iterations and improvements it is now called batchspawner.

snakemake supports make-like build workflows for clusters. Seems general and powerful but complicated.

7 Hadoop on the cluster

hanythingondemand provides a set of scripts to easily set up an ad-hoc Hadoop cluster through PBS jobs.

8 Misc julia

In Julia there is a rather fancy system JuliaParallel/ClusterManagers.jl which supports most major HPC job managers automatically.

There is also a bare-bones cth/QsubCmds.jl: run Julia external (shell) commands on an HPC cluster.

9 Luxury parallelism with pipelines and coordination

More modern tools facilitate very sophisticated workflows, with execution graphs and pipelines and such. One that was briefly pitched to us, but which I did not ultimately use, is Nextflow:

Nextflow enables scalable and reproducible scientific workflows using software containers. It allows the adaptation of pipelines written in the most common scripting languages.

Its fluent DSL simplifies the implementation and the deployment of complex parallel and reactive workflows on clouds and clusters.

Nextflow supports Docker and Singularity containers technology.

This, along with the integration of the GitHub code sharing platform, allows you to write self-contained pipelines, manage versions and to rapidly reproduce any former configuration.

It provides out of the box executors for SGE, LSF, SLURM, PBS and HTCondor batch schedulers and for Kubernetes, Amazon AWS and Google Cloud platforms.

I think that we are actually going to be given D2iQ Kaptain: End-to-End Machine Learning Platform instead? TBC.

10 R-specific
