Snakemake
2025-11-04
Wherein a build tool is described as a DAG‑driven workflow manager for reproducible analyses, with cluster profiles used to submit jobs to Slurm and other schedulers, and container support is noted.
I missed it before, but Snakemake deserves a look. At its core, it’s a build tool inspired by Make, but it’s built with data science and bioinformatics workflows in mind.
Its main purpose is to create reproducible, scalable data analyses. We express a pipeline as a DAG (Directed Acyclic Graph) of rules, and Snakemake detects which jobs can run in parallel based on their dependencies. This avoids re-running tasks whose inputs haven’t changed — super handy if the task is a 100-hour weather simulation or a neural-network training run.
1 The DSL
Snakemake defines tasks in a Snakefile, using its own DSL (domain-specific language).
I hate this. The docs spin this as a “human-readable, Python-based language,” but in practice it’s a custom file format. This odd choice breaks modern IDE support, linting and parsing, and is generally irritating.
So why stomach it? Because its other features are that useful — especially its explicit support for nightmare campus cluster horrors. It integrates with batch schedulers like Slurm, PBS, SGE and LSF. Combined with support for remote files and containerization (like Apptainer/Singularity and Conda), it’s valuable for the kind of work I do.
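To make the complaint concrete, here is a minimal, illustrative Snakefile (rule and file names are invented). It looks like Python right up until it doesn't:

```snakemake
# `rule` blocks are Snakemake syntax, not valid Python —
# which is exactly why generic IDE tooling chokes on this file.
rule all:
    input:
        "results/summary.txt"

rule summarize:
    input:
        "data/measurements.csv"
    output:
        "results/summary.txt"
    shell:
        "python summarize.py {input} > {output}"
```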
2 How Cluster Execution Works
The typical pattern for using Snakemake on an HPC cluster is to combine rule definitions with a “profile.”
1. Define rules in the Snakefile: each rule specifies input files, output files, and resource hints.
2. Create a profile: a directory (for example, ~/.config/snakemake/profiles/mycluster/) that holds configuration. It defines the cluster submission command (for example, sbatch or qsub), the default resources, and concurrency limits. The profile maps our rule's resource names (for example, mem_mb) to the scheduler's specific flags (for example, --mem).
3. Run Snakemake: it reads the profile, analyzes the DAG, and submits each job (or group of jobs) to the scheduler, respecting dependencies and resource requests. A useful job-grouping feature bundles many small tasks into a single scheduler job to reduce scheduler load.
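A sketch of a rule carrying such resource hints (the aligner command and file names are invented; cpus matches the custom resource name used in the profile later on):

```snakemake
rule align:
    input:
        "data/{sample}.fastq"
    output:
        "results/{sample}.bam"
    resources:
        mem_mb=32000,
        runtime=240,  # minutes
        cpus=8
    shell:
        # the rule never mentions sbatch or --mem;
        # the profile translates resources into scheduler flags
        "bwa mem -t {resources.cpus} ref.fa {input} > {output}"
```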
3 Choosing Cluster Mode
When we set up the profile, we have to decide how Snakemake will talk to the scheduler. There are two main ways.
3.1 Classic cluster-generic Mode
This is the classic method (my recommendation). We use the snakemake-executor-plugin-cluster-generic plugin and provide a shell command template for submission.
profiles/cluster_generic/config.yaml
executor: cluster-generic
jobs: 200
# template for the submission command
# (folded scalar: these lines are joined into one command)
cluster-generic-submit-cmd: >-
  sbatch --account=$SLURM_ACCOUNT
  --time={resources.runtime} --mem={resources.mem_mb}
  --cpus-per-task={resources.cpus} --gres={resources.gres}
  --job-name={rule}-{jobid}
default-resources:
  - runtime=120
  - mem_mb=64000
  - cpus=4
  - gres=gpu:1
# ... other settings
printshellcmds: true
latency-wait: 60

- Pros: highly flexible. It works with any scheduler and, critically, lets us use environment variables (like $SLURM_ACCOUNT) inside the template.
- Cons: we have to write and maintain the submission-command boilerplate.
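With the profile in place, running the workflow is a single command; -n does a dry run first (both are standard Snakemake flags):

```shell
# dry run: print the DAG and the generated sbatch commands without submitting
snakemake --profile profiles/cluster_generic -n

# real run: submits up to 200 concurrent jobs, respecting dependencies
snakemake --profile profiles/cluster_generic
```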
3.2 Executor Plugins (e.g., slurm)
In version 8, Snakemake introduced executor plugins, like snakemake-executor-plugin-slurm. They accept scheduler-specific resource names directly.
profiles/slurm/config.yaml
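A representative profile for the plugin, with the site-specific values baked in (my_project_account and the partition name are placeholders; the resource keys follow the plugin's conventions as best I know them):

```yaml
executor: slurm
jobs: 200
default-resources:
  - slurm_account=my_project_account  # must be a literal; env vars are not expanded
  - slurm_partition=gpu
  - runtime=120
  - mem_mb=64000
  - cpus_per_task=4
printshellcmds: true
latency-wait: 60
```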
This is notionally the more modern and luxurious way, but in my experience, it kinda sucks. The comforts are meagre, and the costs are high.
See all those hard‑coded resource defaults? We cannot use environment variables here, and we cannot override just one of them. Want to change the memory allocation for a run? We have to reproduce the entire config file, including the lines that did not change. Want to use a different account? Reproduce a whole config file again. Want to generate the config automatically, as works perfectly fine for other parts of your app? Nope, not for the executor plugin config.
This forces us to hard‑code site-specific things like my_project_account, which is terrible for portability and leads to a proliferation of config files lying around, defeating the purpose of Snakemake.
Worse, this design choice is not clearly documented, so it is a massive time suck for fools like me who assume that more modern = better.
3.3 Asides
Snakemake can also target public cloud backends (Kubernetes, AWS Batch, Google Life Sciences API) and stage data from S3/GCS. In theory, this lets us run the same Snakefile on-prem or in the cloud. In practice, costs and object-store quirks mean that most people I know still run their workflows on classic HPC clusters.
