submitit
A less horrible way to parallelize on HPC clusters
October 9, 2021 — December 4, 2024
My current go-to option for Python HPC jobs. If anyone has ever told you to run some code by creating a bash script full of esoteric cruft like #SBATCH -N 10, then you have experienced how user-hostile classic compute clusters are. The wading-through-punctuation-swamp experience of submitting jobs via vanilla SLURM or TORQUE is slow and unwieldy.
submitit wraps all that nonsense up tidily and keeps the cluster compute workflow uncluttered. This makes executing code on the cluster feel much more like classic parallel code execution in Python, which is to say, annoying-but-tolerable.
1 Examples
submitit looks like this in basic form:
import submitit

def add(a, b):
    return a + b

# executor is the submission interface
executor = submitit.AutoExecutor(folder="log_test")
# set timeout in min, and partition for running the job
executor.update_parameters(
    timeout_min=1,
    slurm_partition="dev",
)
job = executor.submit(add, 5, 7)  # will compute add(5, 7)
print(job.job_id)  # ID of your job
output = job.result()  # waits for completion and returns output
assert output == 12  # 5 + 7 = 12... your addition was computed in the cluster
The docs could be better; mostly, work it out by reading the examples.
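For instance, the README demonstrates map_array, which fans one function over many argument sets; as far as I can tell it batches them into a single SLURM job array, which is kinder to the scheduler than a loop of submit calls:

import submitit

def add(a, b):
    return a + b

executor = submitit.AutoExecutor(folder="log_test")
executor.update_parameters(timeout_min=1, slurm_partition="dev")

# arguments are zipped element-wise: this queues 4 jobs as one job array
a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
jobs = executor.map_array(add, a, b)
print([job.result() for job in jobs])  # [11, 22, 33, 44]

If your cluster admins frown on queue-flooding, I believe slurm_array_parallelism caps how many of those array jobs run at once.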
Here is a more advanced pattern I use when running a bunch of experiments via submitit:
import bz2

import cloudpickle
import submitit

job_name = "my_cool_job"

def expensive_calc(a):
    return a + 1

# executor is the submission interface (logs are dumped in the folder)
executor = submitit.AutoExecutor(folder="log_test")
# set timeout in min, and partition for running the job
executor.update_parameters(
    timeout_min=1,
    slurm_partition="dev",  # misc slurm args
    tasks_per_node=4,  # careful: this means 4 tasks per job, not 4 cores (see Gotchas)
    mem_gb=8,  # memory in GB
)

# submit 10 jobs:
jobs = [executor.submit(expensive_calc, num) for num in range(10)]

# but actually maybe we want to come back to those jobs later, so let’s save them to disk
with bz2.open(job_name + ".job.pkl.bz2", "wb") as f:
    cloudpickle.dump(jobs, f)

# We can quit this python session and do something else now

# Resume after quitting
with bz2.open(job_name + ".job.pkl.bz2", "rb") as f:
    jobs = cloudpickle.load(f)

# wait for all jobs to finish
[job.wait() for job in jobs]

res_list = []
# # Alternatively, append these results to a previous run
# with bz2.open(job_name + ".result.pkl.bz2", "rb") as f:
#     res_list.extend(cloudpickle.load(f))

# Examine job outputs for failures etc
fail_ids = [job.job_id for job in jobs if job.state not in ('DONE', 'COMPLETED')]
res_list.extend([job.result() for job in jobs if job.job_id not in fail_ids])
failures = [job for job in jobs if job.job_id in fail_ids]
if failures:
    print("failures")
    print("===")
    for job in failures:
        print(job.state, job.stderr())

# Save results to disk
with bz2.open(job_name + ".result.pkl.bz2", "wb") as f:
    cloudpickle.dump(res_list, f)
Cool feature: the spawning script only needs to survive as long as it takes to put jobs on the queue, and then it can die. Later on we can reload those jobs from disk as if it all happened in the same session.
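To make that concrete, here is a sketch of a quick, non-blocking status check from a brand-new session, assuming (as I read the API) that job.done() returns immediately rather than waiting the way job.result() does:

import bz2

import cloudpickle

# reload the job handles saved by the earlier session
with bz2.open("my_cool_job.job.pkl.bz2", "rb") as f:
    jobs = cloudpickle.load(f)

# job.done() just checks state; it does not block like job.result()
pending = [job.job_id for job in jobs if not job.done()]
print(f"{len(jobs) - len(pending)} finished, {len(pending)} still queued or running")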
2 Gotchas
2.1 tasks per node
If I set tasks_per_node to 4 and submit 10 jobs, I get 40 jobs running on the cluster. This was unexpected to me, but see the examples.
I am not sure what the use case for this is. Forking multiple tasks with the same argument is maybe slightly quicker to transmit, but requires more free CPUs to be allocated, and breaks the tidiness of the calling convention. I cannot think of a circumstance where this would be a net win.
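A minimal sketch of the behaviour, assuming submitit.JobEnvironment and job.results() work as in the library's multi-task examples; each submission below produces one SLURM job containing four copies of the function, distinguishable only by their rank:

import submitit

def whoami(tag):
    # inside the job, JobEnvironment reports which task this copy is
    env = submitit.JobEnvironment()
    return f"{tag}: task {env.global_rank} of {env.num_tasks}"

executor = submitit.AutoExecutor(folder="log_test")
executor.update_parameters(timeout_min=1, slurm_partition="dev", tasks_per_node=4)

job = executor.submit(whoami, "probe")  # one submit call...
print(job.results())  # ...but four results, one per task, all with the same argument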
2.2 DebugExecutor considered harmful
When developing a function using submitit, it might seem convenient to reach for DebugExecutor to work out why things are crashing. But DebugExecutor has radically different semantics masquerading as the same interface. Code runs sequentially in-process, so DebugExecutor is blocking, and you can expect asynchronous code to behave differently when run in DebugExecutor than in AutoExecutor.
If we don’t want to debug concurrency problems, but just the function itself, it might be easier to execute the function in the normal way. executor = DebugExecutor(); executor.submit(func, *args, **kwargs) does not get us much that we do not already get from running func(*args, **kwargs) directly, except a drop-in replacement for submitit.AutoExecutor.
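If the drop-in property is all we want, a toggle like this sketch is about the extent of the payoff (assuming DebugExecutor takes the same folder argument as the other executors, as it does in the versions I have used):

import submitit

def expensive_calc(a):
    return a + 1

DEBUG = True  # flip to False to submit to the cluster instead

if DEBUG:
    # runs in-process and blocking
    executor = submitit.DebugExecutor(folder="log_test")
else:
    executor = submitit.AutoExecutor(folder="log_test")
    executor.update_parameters(timeout_min=1, slurm_partition="dev")

job = executor.submit(expensive_calc, 41)  # identical call site either way
print(job.result())  # 42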
What DebugExecutor does get us is one small extra gift that I did not ask for and do not relish: it always invokes pdb or ipdb, the interactive Python debuggers, upon errors. As such, DebugExecutor cannot be run from VS Code, because VS Code does not support those interactive debuggers. I dislike this feature strongly; even though I love debuggers, I find it best to invoke them myself, thank you very much.
2.3 Weird argument handling
submitit believes that the slurm_mem argument requisitions memory in GB, but slurm interprets it as MB by default (see --mem). There is an alternate argument, mem_gb, which does allocate GB of memory.
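A minimal sketch of the unambiguous route, contrasted with the trap described above:

import submitit

executor = submitit.AutoExecutor(folder="log_test")
# unambiguous: requests 8 GB of memory
executor.update_parameters(mem_gb=8)
# risky: per the above, slurm may read this as 8 MB rather than 8 GB
# executor.update_parameters(slurm_mem=8)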
3 hydra
Pro tip: submitit integrates with hydra.
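For example, with the hydra-submitit-launcher plugin installed, a hydra app can be pushed onto the cluster from the command line. This is a sketch assuming a recent hydra version; the launcher override shown is the plugin's documented one as far as I recall:

import hydra
from omegaconf import DictConfig

@hydra.main(version_base=None, config_path=None, config_name=None)
def main(cfg: DictConfig) -> None:
    # whatever the experiment does, configured by hydra overrides
    print(cfg)

if __name__ == "__main__":
    main()

# then launch sweeps on the cluster with something like:
#   pip install hydra-submitit-launcher
#   python my_app.py hydra/launcher=submitit_slurm --multirun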