Doing stuff on classic HPC clusters.
These schedulers all implement a similar API, providing concurrency guarantees specified by the famous Byzantine committee-designed greasy totem pole priority system.
Empirical observation: the IT department for any given cluster often seems reluctant to document which one they are using.
Typically a campus cluster will come with some gruff example commands that worked for that guy that time, but not much more.
Usually that guy that time was running a molecular simulation package written in some language I have never heard of.
Presumably this is often a combination of the understandable desire not to write documentation, and a kind of availability-through-obscurity demand-management.
They are typically less eager to allocate GPUs, slightly confused by all this modern neural network stuff, and downright flabbergasted by containers.
To investigate: apparently there is a modern programmatic API to some of these schedulers called DRMAA (Distributed Resource Management Application API), which allows fairly generic job definition. It supposedly works on my local cluster, although they have certainly not documented how.
Anyway, here are some methods for getting stuff done that work well for my use-cases, which tend towards statistical inference and neural nets etc.
submitit, a recent entrant which programmatically submits jobs from inside Python, is my current go-to option for Python. In basic form it looks like this:
```python
import submitit

def add(a, b):
    return a + b

# executor is the submission interface (logs are dumped in the folder)
executor = submitit.AutoExecutor(folder="log_test")
# set timeout in min, and partition for running the job
executor.update_parameters(
    timeout_min=1,
    slurm_partition="dev",
    tasks_per_node=4,  # number of cores
)
job = executor.submit(add, 5, 7)  # will compute add(5, 7)
print(job.job_id)  # ID of your job

output = job.result()  # waits for completion and returns output
assert output == 12  # 5 + 7 = 12... your addition was computed in the cluster
```
The docs could be better, but here are some example pages that show how it goes:
Here is a more advanced pattern I use when running a bunch of experiments via submitit:
```python
import bz2
import cloudpickle
import submitit

job_name = "my_cool_job"

def add_one(a):
    return a + 1

# executor is the submission interface (logs are dumped in the folder)
executor = submitit.AutoExecutor(folder="log_test")
# set timeout in min, and partition for running the job
executor.update_parameters(
    timeout_min=1,
    slurm_partition="dev",
    # misc slurm args
    tasks_per_node=4,  # number of cores
)

# submit 10 jobs:
jobs = [executor.submit(add_one, num) for num in range(10)]

# but actually maybe we want to come back to those jobs later,
# so let's save them to disk
with bz2.open(job_name + ".job.pkl.bz2", "wb") as f:
    cloudpickle.dump(jobs, f)

# We can quit this python session and do something else now

# Resume after quitting
with bz2.open(job_name + ".job.pkl.bz2", "rb") as f:
    jobs = cloudpickle.load(f)

# wait for all jobs to finish
[job.wait() for job in jobs]

res_list = []
# # Alternatively, seed res_list with the results of a previous run
# with bz2.open(job_name + ".result.pkl.bz2", "rb") as f:
#     res_list.extend(cloudpickle.load(f))

# Examine job outputs for failures etc
fail_ids = [
    job.job_id for job in jobs
    if job.state not in ("DONE", "COMPLETED")
]
res_list.extend(
    [job.result() for job in jobs if job.job_id not in fail_ids]
)
failures = [job for job in jobs if job.job_id in fail_ids]
if failures:
    print("failures")
    print("===")
    for job in failures:
        print(job.state, job.stderr())

# Save results to disk
with bz2.open(job_name + ".result.pkl.bz2", "wb") as f:
    cloudpickle.dump(res_list, f)
```
Cool feature: the spawning script only needs to survive long enough to put the jobs on the queue, and then it can die. Later we can reload those jobs from disk as if it had all happened in the same session.
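The save/reload round-trip itself is nothing submitit-specific; it works for any picklable object. A minimal stdlib sketch of the mechanics (plain pickle, with dicts standing in for submitit job handles):

```python
import bz2
import os
import pickle
import tempfile

# Dicts stand in here for submitit job handles; anything picklable works.
jobs = [{"job_id": str(i), "state": "COMPLETED"} for i in range(3)]

path = os.path.join(tempfile.mkdtemp(), "jobs.pkl.bz2")

# Persist the handles, after which the spawning process is free to exit...
with bz2.open(path, "wb") as f:
    pickle.dump(jobs, f)

# ...and a later session can pick up exactly where we left off.
with bz2.open(path, "rb") as f:
    restored = pickle.load(f)

assert restored == jobs
```

cloudpickle is only needed above because submitit jobs close over user functions that plain pickle cannot always serialize.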
The here document trick
```sh
#!/usr/bin/env sh
#SBATCH -N 10
#SBATCH -n 8
#SBATCH -o %x-%j.out

module load julia/1.6.1  ## I have to load julia before calling julia

julia << EOF
using SomePackage
do_julia_stuff
EOF
```
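The same trick works for Python, or any interpreter that will read a script from stdin. A hypothetical sketch; the module name and resource requests are placeholders for whatever your cluster actually provides:

```shell
#!/usr/bin/env sh
#SBATCH -N 1
#SBATCH -n 8
#SBATCH -o %x-%j.out

module load python/3.9.4  ## placeholder module name

python << EOF
import math
print("pi is", math.pi)
EOF
```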
Request a multicore job from the scheduler and manage that like a mini cluster in python
Dask.distributed apparently works well in a multi-machine job on the cluster, and will even spawn the Slurm job for you.
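Even without Dask, the single-node version of this pattern needs nothing beyond the standard library: request one multicore allocation and fan work out with `concurrent.futures`. A sketch; reading `SLURM_CPUS_PER_TASK` assumes the usual Slurm environment variables, and `work` is a stand-in for your actual computation:

```python
import os
from concurrent.futures import ProcessPoolExecutor

def work(x):
    # stand-in for an expensive simulation step
    return x * x

def run_all(xs):
    # Inside an allocation, Slurm exports SLURM_CPUS_PER_TASK;
    # fall back to os.cpu_count() when running locally.
    n_workers = int(os.environ.get("SLURM_CPUS_PER_TASK", os.cpu_count() or 1))
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(work, xs))

if __name__ == "__main__":
    print(run_all(range(10)))
```

This buys no multi-machine scaling, of course; that is where Dask.distributed or submitit earn their keep.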
Have you ever asked yourself: “Do I want to spend 2 days adjusting this analysis to run on the cluster and then wait 2 days for the jobs to finish, or do I just run it locally with no extra work and wait a week?”
ipython-cluster-helper automates that.
“Quickly and easily parallelize Python functions using IPython on a cluster, supporting multiple schedulers. Optimizes IPython defaults to handle larger clusters and simultaneous processes.” […]
ipython-cluster-helper creates a throwaway parallel IPython profile, launches a cluster and returns a view. On program exit it shuts the cluster down and deletes the throwaway profile.
Works on Platform LSF, Sun Grid Engine, Torque, SLURM. Strictly python.
Handy if what I am running is many parallel experiments, and includes a parallel job submission system. See hydra ML.
⚠️ seems to be discontinued.
An alternative option for many use cases is test-tube, a “Python library to easily log experiments and parallelize hyperparameter search for neural networks”. AFAICT there is nothing neural-network specific in this and it will happily schedule a whole bunch of useful types of task, generating the necessary scripts and keeping track of what is going on. This function is not obvious from the front page description of this software library, but see test-tube/SlurmCluster.md. (Thanks for pointing me to this, Chris Jackett.)
See also DRMAA Python, which is a Python wrapper around the DRMAA API.
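I have not battle-tested this, but the basic shape of a drmaa-python submission looks roughly like the following. This is a sketch only: the script path is a placeholder, and it can run only on a cluster with a DRMAA implementation installed and `DRMAA_LIBRARY_PATH` pointing at it.

```python
import drmaa

# Sketch: requires a cluster-side DRMAA library to actually run.
with drmaa.Session() as s:
    jt = s.createJobTemplate()
    jt.remoteCommand = "/usr/bin/env"
    jt.args = ["python", "my_script.py"]  # placeholder script
    jt.jobName = "drmaa_demo"
    job_id = s.runJob(jt)
    # block until the job finishes and inspect its exit status
    info = s.wait(job_id, drmaa.Session.TIMEOUT_WAIT_FOREVER)
    print(job_id, "exited with", info.exitStatus)
    s.deleteJobTemplate(jt)
```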
make-like build workflows for clusters.
Seem general and powerful but complicated.
Hadoop on the cluster
hanythingondemand provides a set of scripts to easily set up an ad-hoc Hadoop cluster through PBS jobs.
There is also a bare-bones cth/QsubCmds.jl: Run Julia external (shell) commands on an HPC cluster.
Luxury parallelism with pipelines and coordination
More modern tools facilitate very sophisticated workflows with execution graphs and pipelines and such. One that was briefly pitched to us, but which I did not ultimately use: nextflow
Nextflow enables scalable and reproducible scientific workflows using software containers. It allows the adaptation of pipelines written in the most common scripting languages.
Its fluent DSL simplifies the implementation and the deployment of complex parallel and reactive workflows on clouds and clusters.
Nextflow supports Docker and Singularity containers technology.
This, along with the integration of the GitHub code sharing platform, allows you to write self-contained pipelines, manage versions and to rapidly reproduce any former configuration.
It provides out of the box executors for SGE, LSF, SLURM, PBS and HTCondor batch schedulers and for Kubernetes, Amazon AWS and Google Cloud platforms.
I think that we are actually going to be given D2iQ Kaptain: End-to-End Machine Learning Platform instead? TBC.