Snakemake
2025-11-04
Wherein a build tool is described as a DAG‑driven workflow manager for reproducible analyses, with cluster profiles used to submit jobs to Slurm and other schedulers, and container support is noted.
I missed it before, but Snakemake deserves a look. At its core, it’s a build tool inspired by Make, but it’s built with data science and bioinformatics workflows in mind.
Its main purpose is to create reproducible, scalable data analyses. We express a pipeline as a DAG (Directed Acyclic Graph) of rules, and Snakemake detects which jobs can run in parallel based on their dependencies. This avoids re-running tasks whose inputs haven’t changed — super handy if the task is a 100-hour weather simulation or a neural-network training run.
1 The DSL
Snakemake defines tasks in a Snakefile, using its own DSL (domain-specific language).
I hate this. The docs spin this as a “human-readable, Python-based language,” but in practice it’s a custom file format. This odd choice breaks modern IDE support, linting and parsing, and is generally irritating.
So why stomach it? Because its other features are that useful — especially its explicit support for nightmare campus cluster horrors. It integrates with batch schedulers like Slurm, PBS, SGE and LSF. Combined with support for remote files and containerization (like Apptainer/Singularity and Conda), it’s valuable for the kind of work I do.
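To make the complaint concrete, here is a minimal, illustrative Snakefile (rule and file names are invented). It looks like Python right up until it doesn't:

```snakemake
# `rule` blocks are Snakemake syntax, not valid Python —
# which is exactly why generic IDE tooling chokes on this file.
rule all:
    input:
        "results/summary.txt"

rule summarize:
    input:
        "data/measurements.csv"
    output:
        "results/summary.txt"
    shell:
        "python summarize.py {input} > {output}"
```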
2 How Cluster Execution Works
The typical pattern for using Snakemake on an HPC cluster is to combine rule definitions with a “profile.”
1. Define rules in the Snakefile: each rule specifies input files, output files, and resource hints.
2. Create a profile: a directory (for example, ~/.config/snakemake/profiles/mycluster/) that holds configuration. It defines the cluster submission command (for example, sbatch or qsub), the default resources, and concurrency limits. The profile maps our rule's resource names (for example, mem_mb) to the scheduler's specific flags (for example, --mem).
3. Run Snakemake: it reads the profile, analyzes the DAG, and submits each job (or group of jobs) to the scheduler, respecting dependencies and resource requests. A useful job-grouping feature bundles many small tasks into a single scheduler job to reduce scheduler load.
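A sketch of a rule carrying such resource hints (the aligner command and file names are invented; cpus matches the custom resource name used in the profile later on):

```snakemake
rule align:
    input:
        "data/{sample}.fastq"
    output:
        "results/{sample}.bam"
    resources:
        mem_mb=32000,
        runtime=240,  # minutes
        cpus=8
    shell:
        # the rule never mentions sbatch or --mem;
        # the profile translates resources into scheduler flags
        "bwa mem -t {resources.cpus} ref.fa {input} > {output}"
```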
3 Choosing Cluster Mode
When we set up the profile, we have to decide how Snakemake will talk to the scheduler. There are two main ways.
3.1 Classic cluster-generic Mode
This is the classic method (my recommendation). We use the snakemake-executor-plugin-cluster-generic plugin and provide a shell command template for submission.
profiles/cluster_generic/config.yaml
executor: cluster-generic
jobs: 200
# template for the submission command
# (folded scalar: these lines are joined into one command)
cluster-generic-submit-cmd: >-
  sbatch --account=$SLURM_ACCOUNT
  --time={resources.runtime} --mem={resources.mem_mb}
  --cpus-per-task={resources.cpus} --gres={resources.gres}
  --job-name={rule}-{jobid}
default-resources:
  - runtime=120
  - mem_mb=64000
  - cpus=4
  - gres=gpu:1
# ... other settings
printshellcmds: true
latency-wait: 60

- Pros: highly flexible. It works with any scheduler and, critically, lets us use environment variables (like $SLURM_ACCOUNT) inside the template.
- Cons: we have to write and maintain the submission-command boilerplate.
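With the profile in place, running the workflow is a single command; -n does a dry run first (both are standard Snakemake flags):

```shell
# dry run: print the DAG and the generated sbatch commands without submitting
snakemake --profile profiles/cluster_generic -n

# real run: submits up to 200 concurrent jobs, respecting dependencies
snakemake --profile profiles/cluster_generic
```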
3.2 Executor Plugins (e.g., slurm)
In version 8, Snakemake introduced executor plugins, like snakemake-executor-plugin-slurm. They accept scheduler-specific resource names directly.
profiles/slurm/config.yaml
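A representative profile for the plugin, with the site-specific values baked in (my_project_account and the partition name are placeholders; the resource keys follow the plugin's conventions as best I know them):

```yaml
executor: slurm
jobs: 200
default-resources:
  - slurm_account=my_project_account  # must be a literal; env vars are not expanded
  - slurm_partition=gpu
  - runtime=120
  - mem_mb=64000
  - cpus_per_task=4
printshellcmds: true
latency-wait: 60
```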
This is notionally the more modern and luxurious way, but in my experience, it kinda sucks. The comforts are meagre, and the costs are high.
See all those hard‑coded resource defaults? We cannot use environment variables here, and we cannot override just one of them. Want to change the memory allocation for a run? We have to reproduce the entire config file, including the lines that did not change. Want to use a different account? Reproduce a whole config file again. Want to generate the config automatically, as works perfectly fine for other parts of your app? Nope, not for the executor plugin config.
This forces us to hard‑code site-specific things like my_project_account, which is terrible for portability and leads to a proliferation of config files lying around, defeating the purpose of Snakemake.
Worse, this design choice is not clearly documented, so it is a massive time suck for fools like me who assume that more modern = better.
3.3 Asides
Snakemake can also target public cloud backends (Kubernetes, AWS Batch, Google Life Sciences API) and stage data from S3/GCS. In theory, this lets us run the same Snakefile on-prem or in the cloud. In practice, costs and object-store quirks mean that most people I know still run their workflows on classic HPC clusters.
