Your computer uses some pre-cloud job manager such as (primordial) Platform LSF or Torque, or (recentish) Slurm. You don’t care about the details of this, since you have a vague suspicion that the entire world of old-school HPC is asymptotically approaching a 0% market share. All you know is that your hip Kubernetes VM-based solution, leveraging Apache Spark and some other stuff you got off GitHub, will not run here in the manner Google intended. But your insouciance doesn’t solve your immediate problem: you need to work in this hostile environment now, because your department prefers
- to shovel weeks of grad-student labour into the sunk-cost pit at the bottom of which lies the giant computing cluster they bought 5 years ago, to
- the horrifying prospect of giving you billing privileges for the cloud.
So how can you get usable labour from these cathedrals of computing, while learning the absolute minimum that you possibly can about anything to do with their magnificently baroque, obsolescent architecture?
What you are supposed to use
These job managers all implement a similar API for asking things
to be executed according to a byzantine, committee-designed priority greasy totem pole.
Typically your campus cluster will come with documentation mentioning
some example commands that worked for that guy that time,
but which doesn’t tell you what version of what package your computer is running.
They are all fairly similar though, which is to say, functional but dated.
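In practice that "similar API" usually means: write a shell script with scheduler directives in comments at the top, then hand it to a submit command. Here is a minimal sketch for Slurm, assuming `sbatch` is on your path. The job name, time limit and memory values are made-up placeholders, and `analysis.py` is a hypothetical script. Torque/PBS and LSF follow the same pattern with `#PBS`/`qsub` and `#BSUB`/`bsub` respectively, though the option names differ.

```python
import shlex
import shutil
import subprocess
import tempfile


def render_sbatch_script(command, job_name="demo", time_limit="00:10:00",
                         ntasks=1, mem="1G"):
    """Return the text of a Slurm batch script that runs `command`.

    `command` is a list of arguments; all resource values are placeholders.
    """
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --time={time_limit}",
        f"#SBATCH --ntasks={ntasks}",
        f"#SBATCH --mem={mem}",
        "#SBATCH --output=%x-%j.out",  # %x is the job name, %j the job id
        "",
        shlex.join(command),
    ]
    return "\n".join(lines) + "\n"


def submit(script_text):
    """Submit the script with sbatch; return its reply, or None if sbatch is absent."""
    if shutil.which("sbatch") is None:
        return None  # not on a Slurm machine, nothing to do
    with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as handle:
        handle.write(script_text)
    result = subprocess.run(["sbatch", handle.name],
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()


# Render only; calling submit(script) is what actually queues the job.
script = render_sbatch_script(["python", "analysis.py", "--seed", "1"])
print(script)
```

Rendering the script separately from submitting it means you can inspect exactly what will be queued before the totem pole gets involved.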
To investigate: Apparently there is a more modern programmatic API called DRMAA. Not sure if my campus runs it, or if they would make it easy to find out if they did.
Doing what you would like to do with what you are supposed to use
The most well-designed option for this seems to be
snakemake, which supports make-like build workflows, but on
horrible campus clusters.
Other contenders. hanythingondemand provides a set of scripts to easily set up an ad-hoc Hadoop cluster through PBS jobs.
A more ad hoc, probably-slower-but-more-robust approach, which perhaps avoids the damn thing: Easily distributing a parallel IPython Notebook on a cluster:
Have you ever asked yourself: “Do I want to spend 2 days adjusting this analysis to run on the cluster and wait 2 days for the jobs to finish or do I just run it locally with no extra work and just wait a week.”
Why yes, I have.
“Quickly and easily parallelize Python functions using IPython on a cluster, supporting multiple schedulers. Optimizes IPython defaults to handle larger clusters and simultaneous processes.” …
ipython-cluster-helper creates a throwaway parallel IPython profile, launches a cluster and returns a view. On program exit it shuts the cluster down and deletes the throwaway profile.
Works on Platform LSF, Sun Grid Engine, Torque, SLURM.
I do not at all understand how you get data back from this; I guess you run it in situ. Strictly Python.
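Going by the project README, results appear to come back simply as the return value of `view.map`, collected on the machine that launched the cluster. A minimal sketch under that assumption follows. The `"general"` queue name and the `square` function are hypothetical, and `parallel_map` is a helper invented here so the same code can be dry-run locally without any scheduler.

```python
def square(x):
    """Stand-in for an expensive, self-contained computation."""
    return x * x


def parallel_map(func, items, view=None):
    """Map func over items, locally by default or through an IPython view."""
    mapper = map if view is None else view.map
    return list(mapper(func, items))


def run_on_cluster(func, items, scheduler="slurm", queue="general", num_jobs=4):
    """Start a throwaway IPython cluster via the scheduler and map over it.

    Assumes ipython-cluster-helper is installed; "general" is a hypothetical
    queue name, so substitute whatever your cluster calls its queues.
    """
    # Third-party import, done lazily so the local path needs no extra packages.
    from cluster_helper.cluster import cluster_view

    with cluster_view(scheduler=scheduler, queue=queue, num_jobs=num_jobs) as view:
        return parallel_map(func, items, view=view)


# Local dry run; on a login node you would call run_on_cluster(square, range(8)).
local_results = parallel_map(square, range(8))
print(local_results)
```

Functions shipped to remote engines should probably do their own imports inside the function body, since the workers do not share the launching process's namespace.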
These might run on campus clusters if you schedule things just right, or if you have a bunch of unmanaged machines, which is nonstandard in my experience. I suspect normally you don’t have a bunch of idle hosts sitting around for arbitrary jobs.
See also the hip containerized deployment solution, singularity, if you are blessed with admins who support it.
ClusterShell is an event-driven open source Python library, designed to run local or distant commands in parallel on server farms or on large Linux clusters. No need to reinvent the wheel: you can use ClusterShell as a building block to create cluster aware administration scripts and system applications in Python. It will take care of common issues encountered on HPC clusters, such as operating on groups of nodes, running distributed commands using optimized execution algorithms, as well as gathering results and merging identical outputs, or retrieving return codes. ClusterShell takes advantage of existing remote shell facilities already installed on your systems, like SSH.
ClusterShell’s primary goal is to improve the administration of high-performance clusters by providing a lightweight but scalable Python API for developers. It also provides clush, clubak and nodeset/cluset, convenient command-line tools that allow traditional shell scripts to benefit from some of the library features.
pdsh is a variant of the rsh(1) command. Unlike rsh(1), which runs commands on a single remote host, pdsh can run multiple remote commands in parallel. pdsh uses a “sliding window” (or fanout) of threads to conserve resources on the initiating host while allowing some connections to time out.
For example, the following would duplicate using the ssh module to run hostname(1) across the hosts foo[0-10]:
pdsh -R exec -w foo[0-10] ssh -x -l %u %h hostname
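The "sliding window" idea can be approximated in a few lines of Python with a bounded thread pool. This is a sketch of the concept, not of how pdsh itself is implemented. The `expand_hosts`, `ssh_run` and `fan_out` names are invented for illustration, `ssh_run` assumes key-based SSH login to each host, and the demo uses a fake runner so nothing actually leaves your machine.

```python
import re
import subprocess
from concurrent.futures import ThreadPoolExecutor


def expand_hosts(pattern):
    """Expand 'foo[0-10]' into ['foo0', ..., 'foo10']; other names pass through."""
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", pattern)
    if match is None:
        return [pattern]
    prefix, low, high, suffix = match.groups()
    return [f"{prefix}{i}{suffix}" for i in range(int(low), int(high) + 1)]


def ssh_run(host, command, timeout=30):
    """Run command on host with the system ssh client; return (exit code, stdout)."""
    result = subprocess.run(["ssh", "-x", host, command],
                            capture_output=True, text=True, timeout=timeout)
    return result.returncode, result.stdout.strip()


def fan_out(hosts, command, runner=ssh_run, fanout=4):
    """Run command on every host, with at most `fanout` connections open at once."""
    with ThreadPoolExecutor(max_workers=fanout) as pool:
        results = pool.map(lambda host: runner(host, command), hosts)
        return dict(zip(hosts, results))


def fake_runner(host, command):
    """Pretend to run the command, so the demo needs no real hosts."""
    return 0, f"{host} ran {command}"


demo = fan_out(expand_hosts("foo[0-3]"), "hostname", runner=fake_runner, fanout=2)
for host, (code, output) in demo.items():
    print(host, code, output)
```

Swapping `fake_runner` for the default `ssh_run` would contact real hosts, with `fanout` capping how many SSH processes the initiating machine has to juggle at a time.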
Software dependency management
Fact: you are running some ancient decrepit edition of RHEL with too elderly a version of anything to run anything you need. You will need to install everything you use. That is fine, but be aware it’s a little different than provisioning virtual machines on your new-fangled fancy cloud thingy. This is to teach you very important and hugely useful lessons for later in life, such as which compile flags to set to get the matrix kernel library version 8.7.3patch3 to compile with python 2.5.4rc3 for the itanium architecture as at May 8, 2009. Why, think on how many times you will use that skill after you leave your current job! (We call such contemplation void meditation.)
There are lots of package managers. I suspect most of them work on HPC machines (I use homebrew).
For HPC specifically, there is also spack, which also lets you prototype on macOS.
Spack is a package manager for supercomputers, Linux, and macOS. It makes installing scientific software easy. With Spack, you can build a package with multiple versions, configurations, platforms, and compilers, and all of these builds can coexist on the same machine.
Spack isn’t tied to a particular language; you can build a software stack in Python or R, link to libraries written in C, C++, or Fortran, and easily swap compilers. Use Spack to install in your home directory, to manage shared installations and modules on a cluster, or to build combinatorial versions of software for testing.