On getting things done on the Big Computer your supervisor is convinced will solve that Big Problem with Big Data because of some article from the 1980s that they read.
I’m told the experience of HPC on campus is different for, e.g. physicists, who are working on a software stack co-evolved over decades with HPC clusters, and start from different preconceptions about what computers do. For us machine-learning types, though, the usual scenario is as follows:
Your cluster uses some job manager pre-dating many modern trends in cloud computing, including the word “cloud” applied to computing. Perhaps your campus computing cluster is based upon Platform LSF, Torque or Slurm. You don’t care about the details of this, since you have a vague suspicion that the entire world of traditional HPC is asymptotically approaching 0% market share as measured by share of jobs, and the main question is merely whether that approach is fast enough to save you. All you know is that your fancy-pants modern cloud-kubernetes-containerized-apache-spark-hogwild-SGD project, leveraging some disreputable stuff you found on GitHub, will not run here in the manner Google intended, and that if you try to ask for a GPU the system responds with the error:
>>> GET OFF MY LAWN
But your indifference is not a solution: you need to work in this alien environment now, because your department prefers
- to shovel grad-student labour into the sunk-cost pit at the bottom of which awaits the giant computing cluster they bought a share in 5 years ago, to
- the horrifying prospect of giving you billing privileges for the cloud.
You read the help documents that your campus IT staff wrote, but they are targeted at people who have never heard of the terminal before, and in the effort to hide complexity they also hide practicality. To make things seem simple they omit the details that you will in fact need to know, give you few useful keywords to search for, and usually presume you really want to run some FORTRAN software that you have never heard of and that has suspiciously few stars on GitHub.
So how can you get usable calculation from these cathedrals of computing, while filling up your time and brain space with the absolute minimum that you possibly can about their magnificently baroque, obsolescent architecture, all of which would detract from writing job applications for hip dotcoms?
There is a gradual convergence here, between classic-flavoured campus clusters and trendier self-service clouds. To get a feeling for how this happens, I recommend Simon Yin’s test drive of the “Nimbus” system, which is an Australian research-oriented hybrid-style system.
Getting data in and out
My new solution here is to use syncthing to move stuff from my HPC cluster to my desktop.
So far that has been so much easier than everything else that I do not bother with alternatives.
Chief benefit: HPC job scheduling is an unpredictable confusion, and inevitably my custom scripts miss a step, or get killed before the data is processed, or run out of disk space, or are cleaned up by a disk-space cleaning daemon, or some step of the job is easier to do off the cluster, or whatever.
In that case, a system which robustly synchronises a mess of files with no limit on transfer size and minimal manual intervention seems in practice to do what I want, whereas more structured solutions lead to me missing things.
ssh also does well; it is a little less robust against mistakes.
What you are supposed to use
The job managers mentioned above (Platform LSF, Torque, Slurm) all implement a similar API to schedule execution through the Byzantine committee-designed priority greasy totem pole system.
For some reason, the IT department seems reluctant to document which one you are using.
Typically your campus cluster will come with documentation mentioning some example commands that worked for that guy that time, but it will not tell you what version of what package your cluster is running.
They are all fairly similar though, which is to say, functional-but-dated, and uncomfortably eager to manage IPC-based concurrency, which is something I do not need.
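Since the documentation may not say which scheduler you are on, one low-effort check is to see which submit commands exist on your `PATH`. The following is a minimal sketch, not an authoritative detector: the function name `guess_schedulers` is made up for illustration, and it assumes the conventional command names (`sbatch` for Slurm, `bsub` for LSF, `qsub` for Torque/PBS and Grid Engine). A cluster that hides these behind environment modules will report nothing until the right module is loaded.

```python
import shutil

# Submit commands conventionally shipped with each scheduler family.
# `qsub` is shared by Torque/PBS and Grid Engine, so a hit there only
# narrows things down rather than settling the question.
SUBMIT_COMMANDS = {
    "sbatch": "Slurm",
    "bsub": "Platform LSF",
    "qsub": "Torque/PBS or Grid Engine",
}


def guess_schedulers(which=shutil.which):
    """Return {command: scheduler family} for each submit command found.

    `which` is injectable so the lookup can be exercised off-cluster.
    """
    return {cmd: name for cmd, name in SUBMIT_COMMANDS.items() if which(cmd)}


if __name__ == "__main__":
    found = guess_schedulers()
    if found:
        for cmd, name in found.items():
            print(f"{cmd}: probably {name}")
    else:
        print("No known submit command on PATH; maybe a module is not loaded.")
```

Once you know the command, its `--version` or man page usually gives you the keywords the local documentation left out.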
To investigate: Apparently there is a more modern programmatic API to some of these schedulers called DRMAA. Not sure if my local cluster runs it, or if they would make it easy to find out if they did.
Poor person’s parallelism
Usually, being a penniless grad student, what I need to do to get value from the cluster is to submit as many small single-core jobs as I can, so that they will be scheduled into the gaps between all the multicore horrors that the physics folks are running.
(These options skew towards python. I have not yet needed to do this for anything else.)
snakemake supports make-like build workflows, but on horrible campus clusters.
It seems general and powerful but complicated.
Other contenders? hanythingondemand provides a set of scripts to easily set up an ad-hoc Hadoop cluster through PBS jobs.
Or, bite the bullet and request lots of cores, maybe that will work: Easily distributing a parallel IPython Notebook on a cluster:
Have you ever asked yourself: “Do I want to spend 2 days adjusting this analysis to run on the cluster and wait 2 days for the jobs to finish or do I just run it locally with no extra work and just wait a week.”
Or: ipython-cluster-helper automates that.
“Quickly and easily parallelize Python functions using IPython on a cluster, supporting multiple schedulers. Optimizes IPython defaults to handle larger clusters and simultaneous processes.” […]
ipython-cluster-helper creates a throwaway parallel IPython profile, launches a cluster and returns a view. On program exit it shuts the cluster down and deletes the throwaway profile.
Works on Platform LSF, Sun Grid Engine, Torque, SLURM. Strictly python.
Nextflow enables scalable and reproducible scientific workflows using software containers. It allows the adaptation of pipelines written in the most common scripting languages.
Its fluent DSL simplifies the implementation and the deployment of complex parallel and reactive workflows on clouds and clusters.
Nextflow supports Docker and Singularity containers technology.
This, along with the integration of the GitHub code sharing platform, allows you to write self-contained pipelines, manage versions and to rapidly reproduce any former configuration.
It provides out of the box executors for SGE, LSF, SLURM, PBS and HTCondor batch schedulers and for Kubernetes, Amazon AWS and Google Cloud platforms.
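If none of these tools appeal, the many-small-single-core-jobs trick can also be done by hand with a scheduler's array jobs. Here is a rough sketch that assumes Slurm: it generates an `sbatch` script asking for one core and a short wall time per task. The function name `array_job_script`, the default limits and the worker script `run_one.py` are all hypothetical; only the `#SBATCH` directives and the `SLURM_ARRAY_TASK_ID` variable come from Slurm itself.

```python
def array_job_script(command, n_tasks, minutes=30, mem_mb=2000, job_name="sweep"):
    """Build an sbatch script that runs `command <index>` once per array task.

    Each task requests a single core and a short time limit, which gives
    the scheduler more chances to slot it into gaps between large jobs.
    """
    if n_tasks < 1:
        raise ValueError("n_tasks must be at least 1")
    hours, mins = divmod(minutes, 60)
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --array=0-{n_tasks - 1}",
        "#SBATCH --ntasks=1",
        "#SBATCH --cpus-per-task=1",
        f"#SBATCH --mem={mem_mb}M",
        f"#SBATCH --time={hours:02d}:{mins:02d}:00",
        # %A is the array job ID and %a the task index; the logs/ directory
        # has to exist before submission.
        "#SBATCH --output=logs/%A_%a.out",
        "",
        "# Slurm sets SLURM_ARRAY_TASK_ID to the index of this task.",
        f'{command} "$SLURM_ARRAY_TASK_ID"',
        "",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical worker script that takes the task index as its argument.
    print(array_job_script("python run_one.py", n_tasks=100, minutes=90))
```

Save the output as, say, `sweep.sh` and submit it with `sbatch sweep.sh`. The worker then maps its index to a parameter setting or a data shard however you like.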
Software dependency management
With high likelihood, the campus cluster is running some ancient decrepit edition of RHEL with too elderly a version of everything to run anything you need. There will be some weird semi-recent version of python that you can activate, but in practice even that will be some awkward version that does not mesh with the latest code that you are using (you are doing research after all, right?). You will need to install everything you use. That is fine, but be aware it’s a little different than provisioning virtual machines on your new-fangled fancy cloud thingy. This is to teach you important and hugely useful lessons for later in life, such as which compile flags to set to get the matrix kernel library version 8.7.3patch3 to compile with python 2.5.4rc3 for the itanium architecture as at May 8, 2009. Why, think on how many times you will use that skill after you leave your current job! (We call such contemplation void meditation and it is a powerful mindfulness technique.)
For HPC specifically, I have also had recommended spack, which also lets you prototype on macOS.
Spack is a package manager for supercomputers, Linux, and macOS. It makes installing scientific software easy. With Spack, you can build a package with multiple versions, configurations, platforms, and compilers, and all of these builds can coexist on the same machine.
Spack isn’t tied to a particular language; you can build a software stack in Python or R, link to libraries written in C, C++, or Fortran, and easily swap compilers. Use Spack to install in your home directory, to manage shared installations and modules on a cluster, or to build combinatorial versions of software for testing.
DIY cluster
Normally HPC is based on requesting jobs via Slurm and some shared file system. For my end of the budget this also means breaking up the job into small chunks and squeezing them in around the edges of people with a proper CPU allocation. However! Some people are blessed with the ability to request simultaneous control of a predictable number of machines. For these, you can roll your own deployment of execution across some machines, which might be useful. This might also work if you have a bunch of unmanaged machines not on the campus cluster, which I have personally never experienced.
If you have a containerized deployment, as with all the cloud providers these days, see perhaps the containerized deployment solution for HPC, singularity, if you are blessed with admins who support it.
For traditional clusters, clustershell:
ClusterShell is an event-driven open source Python library, designed to run local or distant commands in parallel on server farms or on large Linux clusters. No need to reinvent the wheel: you can use ClusterShell as a building block to create cluster aware administration scripts and system applications in Python. It will take care of common issues encountered on HPC clusters, such as operating on groups of nodes, running distributed commands using optimized execution algorithms, as well as gathering results and merging identical outputs, or retrieving return codes. ClusterShell takes advantage of existing remote shell facilities already installed on your systems, like SSH.
ClusterShell’s primary goal is to improve the administration of high-performance clusters by providing a lightweight but scalable Python API for developers. It also provides clush, clubak and nodeset/cluset, convenient command-line tools that allow traditional shell scripts to benefit from some of the library features.
A little rawer: pdsh:
pdsh is a variant of the rsh(1) command. Unlike rsh(1), which runs commands on a single remote host, pdsh can run multiple remote commands in parallel. pdsh uses a “sliding window” (or fanout) of threads to conserve resources on the initiating host while allowing some connections to time out.
For example, the following would duplicate using the ssh module to run hostname(1) across the hosts foo[0-10]:
pdsh -R exec -w foo[0-10] ssh -x -l %u %h hostname
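To make the fanout idea concrete, here is a toy Python imitation of what that command does. It expands a simple `foo[0-10]` range, builds the `ssh -x -l %u %h hostname` command for each host, and runs at most `fanout` of them at a time in a thread pool. It is an illustration only, not how pdsh is implemented. All the function names are invented, the expansion handles a single plain numeric range (real pdsh host lists are richer), and it assumes passwordless `ssh` to each host.

```python
import re
import subprocess
from concurrent.futures import ThreadPoolExecutor


def expand_hosts(pattern):
    """Expand one bracketed range, e.g. 'foo[0-10]' -> foo0, ..., foo10."""
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", pattern)
    if match is None:
        return [pattern]  # No range present: treat as a single host name.
    prefix, lo, hi, suffix = match.groups()
    return [f"{prefix}{i}{suffix}" for i in range(int(lo), int(hi) + 1)]


def ssh_argv(host, user, remote_command):
    """Fill in the template 'ssh -x -l %u %h <command>' for one host."""
    return ["ssh", "-x", "-l", user, host, *remote_command]


def run_ssh(argv, timeout=30):
    """Run one command and return (returncode, stdout).

    subprocess.TimeoutExpired propagates to the caller on a hung host.
    """
    done = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    return done.returncode, done.stdout.strip()


def fan_out(pattern, user, remote_command, fanout=4, runner=run_ssh):
    """Run remote_command on every host, at most `fanout` at once."""
    hosts = expand_hosts(pattern)
    jobs = [ssh_argv(host, user, remote_command) for host in hosts]
    with ThreadPoolExecutor(max_workers=fanout) as pool:
        results = list(pool.map(runner, jobs))
    return dict(zip(hosts, results))
```

Calling `fan_out("foo[0-10]", "alice", ["hostname"])` would then return a dict from host name to `(returncode, stdout)`, where `alice` is a placeholder user name. The `runner` argument exists so that the logic can be tried without any real machines.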