Scheduling jobs on HPC clusters for modern ML nerds

In Soviet Russia, job puts YOU in queue



Doing stuff on classic HPC clusters.

Slurm, Torque, and Platform LSF all implement a similar API providing concurrency guarantees specified by the famous Byzantine committee-designed greasy totem pole priority system. Empirical observation: the IT department for any given cluster often seems reluctant to document which one they are using. Typically a campus cluster will come with some gruff example commands that worked for that guy that time, but not much more. Usually that guy that time was running a molecular simulation package written in some language I have never heard of. Presumably this is a combination of the understandable desire not to write documentation, and a kind of availability-through-obscurity demand management. They are typically less eager to allocate GPUs, slightly confused by all this modern neural network stuff, and downright flabbergasted by containers.
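For orientation, here is a minimal sketch of the kind of batch script those gruff example commands usually boil down to on Slurm, rendered from Python. The helper `render_sbatch`, the script name `train.py` and the default values are made up for illustration; the `#SBATCH` directives are standard Slurm options, but which partitions and GPU resource names exist depends on the local cluster.

```python
def render_sbatch(command, job_name="demo", partition="dev",
                  minutes=10, cpus=1, gpus=0):
    """Return the text of a Slurm batch script that runs `command`.

    Hypothetical helper: it only builds the text and submits nothing.
    """
    hours, mins = divmod(minutes, 60)
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --partition={partition}",
        f"#SBATCH --time={hours:02d}:{mins:02d}:00",
        f"#SBATCH --cpus-per-task={cpus}",
    ]
    if gpus:
        # Generic-resource request; some sites also expect a GPU type here
        lines.append(f"#SBATCH --gres=gpu:{gpus}")
    lines += ["", command, ""]
    return "\n".join(lines)


# `train.py` is a placeholder for whatever actually needs running
script = render_sbatch("python train.py", minutes=90, cpus=4, gpus=1)
print(script)
```

Saving that text to a file and handing it to `sbatch` is the manual workflow that the tools below try to spare us.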

To investigate: apparently there is a modern programmatic API to some of these schedulers called DRMAA (Distributed Resource Management Application API), which allows fairly generic job definition and which supposedly works on my local cluster, although they have certainly not documented how.

Anyway, here are some methods for getting stuff done that work well for my use cases, which tend towards statistical inference, neural nets, etc.

submitit

My current fave for Python.

submitit is a recent entrant which programmatically submits jobs from inside Python. It looks like this:

import submitit

def add(a, b):
    return a + b

# executor is the submission interface (logs are dumped in the folder)
executor = submitit.AutoExecutor(folder="log_test")
# set timeout in min, and partition for running the job
executor.update_parameters(
    timeout_min=1, slurm_partition="dev",
    cpus_per_task=4  # number of CPU cores for the task
)
job = executor.submit(add, 5, 7)  # will compute add(5, 7)
print(job.job_id)  # ID of your job

output = job.result()  # waits for completion and returns output
assert output == 12  # 5 + 7 = 12...  your addition was computed in the cluster

The docs could be better. Here are some example pages that show how it goes though:

Here is a pattern I use when running a bunch of experiments via submitit:

import bz2
import cloudpickle

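job_name = "my_cool_job"

# `jobs` is the list of submitit Job handles returned by earlier
# executor.submit(...) calls; save it so a later session can pick it up
with bz2.open(job_name + ".job.pkl.bz2", "wb") as f:
    cloudpickle.dump(jobs, f)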

exp_list = []

# # Optionally append these results to a previous run
# with bz2.open(job_name + ".pkl.bz2", "rb") as f:
#     exp_list.extend(cloudpickle.load(f))

with bz2.open(job_name + ".job.pkl.bz2", "rb") as f:
    jobs = cloudpickle.load(f)

# Block until every job has left the queue
for job in jobs:
    job.wait()

# Split the jobs into failures and successes, keeping only successful results
fail_ids = [job.job_id for job in jobs if job.state not in ("DONE", "COMPLETED")]
exp_list.extend([job.result() for job in jobs if job.job_id not in fail_ids])
failures = [job for job in jobs if job.job_id in fail_ids]

if failures:
    print("failures")
    print("===")
    for job in failures:
        print(job.state, job.stderr())

with bz2.open(job_name + ".pkl.bz2", "wb") as f:
    cloudpickle.dump(exp_list, f)

Cool feature: the spawning script only needs to survive as long as it takes to put jobs on the queue, and then it can die. Later on we can reload those jobs from disk.
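The reason this works, as far as I can tell, is that a job handle is just a small record (a job ID plus the folder where logs and results land), so it survives a round trip through a pickle file. Here is a minimal sketch of the two phases using only the standard library. The plain dicts are hypothetical stand-ins for real job handles, and `save_handles` and `load_handles` are names invented for this illustration.

```python
import bz2
import pickle
import tempfile
from pathlib import Path


def save_handles(jobs, path):
    """Phase 1: called by the spawning script right after submission."""
    with bz2.open(path, "wb") as f:
        pickle.dump(jobs, f)


def load_handles(path):
    """Phase 2: called later, possibly from a fresh Python process."""
    with bz2.open(path, "rb") as f:
        return pickle.load(f)


# Hypothetical stand-ins for scheduler job handles
submitted = [
    {"job_id": "1001", "folder": "log_test"},
    {"job_id": "1002", "folder": "log_test"},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "my_cool_job.job.pkl.bz2"
    save_handles(submitted, path)   # the spawning script may exit after this
    restored = load_handles(path)   # a later session picks the handles up

print([job["job_id"] for job in restored])
```

In real use the two phases would live in separate scripts or notebook sessions, with the pickle file on a filesystem that both can see.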

Request a multicore job from the scheduler and manage that like a mini cluster in Python

Dask.distributed apparently works well on a multi-machine job on the cluster, and will even spawn the Slurm job.
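For a job that fits on a single node, the low-tech version of the mini-cluster idea needs only the standard library: ask the scheduler how many cores we were given and size a process pool to match. This is a sketch under assumptions. `SLURM_CPUS_PER_TASK` is an environment variable Slurm sets when the job requests `--cpus-per-task`; other schedulers use other variable names. `simulate` is a made-up placeholder for real work.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def allocated_cpus(environ=os.environ):
    """Cores granted to this job according to Slurm, falling back to 1."""
    return int(environ.get("SLURM_CPUS_PER_TASK", "1"))


def simulate(seed):
    """Hypothetical stand-in for one unit of CPU-bound work."""
    return sum((seed * k) % 7 for k in range(10_000))


def run_all(func, items):
    """Map `func` over `items` using one worker process per allocated core."""
    with ProcessPoolExecutor(max_workers=allocated_cpus()) as pool:
        return list(pool.map(func, items))


if __name__ == "__main__":
    results = run_all(simulate, range(8))
    print(len(results), "tasks finished on", allocated_cpus(), "worker(s)")
```

Sizing the pool from the environment rather than from `os.cpu_count()` matters on shared nodes, where the machine may have many more cores than the job was actually allocated.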

Easily distributing a parallel IPython Notebook on a cluster:

Have you ever asked yourself: “Do I want to spend 2 days adjusting this analysis to run on the cluster and wait 2 days for the jobs to finish or do I just run it locally with no extra work and just wait a week.”

Or: ipython-cluster-helper automates that.

“Quickly and easily parallelize Python functions using IPython on a cluster, supporting multiple schedulers. Optimizes IPython defaults to handle larger clusters and simultaneous processes.” […]

ipython-cluster-helper creates a throwaway parallel IPython profile, launches a cluster and returns a view. On program exit it shuts the cluster down and deletes the throwaway profile.

Works on Platform LSF, Sun Grid Engine, Torque, and SLURM. Strictly Python.

hydra

Handy if what I am running is many parallel experiments, and includes a parallel job submission system. See hydra ML.

test-tube

⚠️ seems to be discontinued.

An alternative option for many use cases is test-tube, a “Python library to easily log experiments and parallelize hyperparameter search for neural networks”. AFAICT there is nothing neural-network specific in this and it will happily schedule a whole bunch of useful types of task, generating the necessary scripts and keeping track of what is going on. This function is not obvious from the front page description of this software library, but see test-tube/SlurmCluster.md. (Thanks for pointing me to this, Chris Jackett.)

Misc python

See also DRMAA Python, which is a Python wrapper around the DRMAA API.

Other ones I looked at: Andrea Zonca wrote a script that allows spawning jobs on a cluster from a Jupyter notebook. After several iterations and improvements it is now called batchspawner.

snakemake supports make-like build workflows for clusters. Seems general and powerful, but complicated.

Hadoop on the cluster

hanythingondemand provides a set of scripts to easily set up an ad-hoc Hadoop cluster through PBS jobs.

Misc julia

In Julia there is a rather fancy system, JuliaParallel/ClusterManagers.jl, which supports most major HPC job managers automatically.

There is also a bare-bones cth/QsubCmds.jl: Run Julia external (shell) commands on a HPC cluster.

Luxury parallelism with pipelines and coordination

More modern tools facilitate very sophisticated workflows with execution graphs and pipelines and such. One that was briefly pitched to us that I did not ultimately use: nextflow

Nextflow enables scalable and reproducible scientific workflows using software containers. It allows the adaptation of pipelines written in the most common scripting languages.

Its fluent DSL simplifies the implementation and the deployment of complex parallel and reactive workflows on clouds and clusters.

Nextflow supports Docker and Singularity containers technology.

This, along with the integration of the GitHub code sharing platform, allows you to write self-contained pipelines, manage versions and to rapidly reproduce any former configuration.

It provides out of the box executors for SGE, LSF, SLURM, PBS and HTCondor batch schedulers and for Kubernetes, Amazon AWS and Google Cloud platforms.

I think that we are actually going to be given D2iQ Kaptain: End-to-End Machine Learning Platform instead? TBC.

