On getting things done on the Big Computer your supervisor is convinced will solve that Big Problem with Big Data because that was the phrasing used in the previous funding application.
I’m told the experience of HPC on campus is different for, e.g. physicists, who are working on a software stack co-evolved over decades with HPC clusters, and start from different preconceptions about what computers do. For us machine-learning types, though, the usual scenario is as follows (and I have been on-boarded 4 times to different clusters with similar pain points by now):
Our cluster uses some job manager pre-dating many modern trends in cloud computing, including the word cloud applied to computing. Perhaps the cluster is based upon Platform LSF, Torque or slurm. Typically I don’t care about the details of this, since I have a vague suspicion that the entire world of traditional HPC is asymptotically approaching 0% market share as measured by share of jobs and the main question is merely if that approach is fast enough to save me. All I know is that my fancy-pants modern cloud-kubernetes-containerized-apache-spark-hogwild-SGD project leveraging some disreputable stuff you found on github, that will not run here in the manner Google intended. And that if you try to ask for containerization the system responds with the error:
>>> GET OFF MY LAWN
But indifference is no solution: I need to work in this environment now, because my department prefers
- to shovel labour into the sunk-cost pit at the bottom of which awaits the giant computing cluster they bought a share in 5 years ago, to
- the horrifying prospect of giving you billing privileges for the cloud.
You read the help documents that your campus IT staff wrote, but they are targeted at people who have never heard of the terminal before, and in their effort hide complexity hide practicality too. Rarely do they let slip few useful keywords to search for in favour of cargo-cult instructions to run some FORTRAN software that I have never heard of and that has suspiciously few stars on github.
So how can we get usable calculation from the these cathedrals of computing? while filling up our time and brain space with the absolute minimum that we possibly can about anything to do with their magnificently baroque, obsolescent architecture which would detract from writing job applications for hip dotcoms?
There is a gradual convergence here, between classic-flavoured campus clusters and trendier self-service clouds. To get a feeling for how this happens, I recommend Simon Yin’s test drive of the “Nimbus” system, which is an Australian research-oriented hybrid-style system.
UPDATE: I am no longer a grad student. I am a postdoc at CSIRO’s Data61, where we have HPC resources in an intermediate state: some containerization, some cloud and much SLURM.
Getting data in and out
My last solution here was to use syncthing to move stuff from my HPC cluster to my desktop. The new mob have a perfectly functional NFS system, so that is apparently no longer needed.
Syncthing was much easier in the previous regime, though
Chief benefit: since HPC job scheduling is an unpredictable confusion and inevitably my custom scripts miss a step, or get killed before the data is processed, or run out of disk space, or are cleaned up by a disk-space cleaning daemon process, or the some step of the job is easer to do off the cluster or whatever…
in this case, a system which robustly synchronises a mess of files with no limit on transfer size and minimal manual intervention seems in practice to do what I want, and more structured solutions lead to me missing things.
ssh also does OK; it is a little less robust against mistakes; you really need to remember what you deleted, where and when.
all implement a similar API providing concurrency guarantees as specified in the Byzantine committee-designed greasy totem pole priority system.
For some reason, the IT department often seems reluctant to document which one they are using.
Typically a campus cluster will come with documentation mentioning
some example commands that worked for that guy that time,
but don’t mention what version of what package your computer is running.
All the systems seem fairly similar though, which is to say, functional-but-dated, and uncomfortably eager to manage IPC-type parallelism, which is something I do not need enough to even work out what it is.
They are typically less eager to allocate GPUs and such, or to manage containers.
To investigate: Apparently there is a more modern programmatic API to some of these schedulers called DRMAA. Not sure if my local cluster runs it, or if they would make it easy to find out if they did.
Poor person’s parallelism
As a penniless grad student, what I needed to do to get value from the cluster, was to schedule as many small single-core jobs as I can, so that they would be scheduled into the gaps between all the multicore horrors that the physics folks ran. This is a common problem, but I did not find any generic solutions for it; I simply rolled my own ad hoc ones.
Medium-fancy multicore processing
Running deep-learning-style jobs using classic HPC tools. Probably the easiest option for my use cases is test-tube, a “Python library to easily log experiments and parallelize hyperparameter search for neural networks”. AFAICT there is nothing neural-network specific in here and it will happily schedule a whole bunch of useful types of deployment for you, generating the necessary scripts and keeping track of what is going on. Thanks for pointing me to this, Chris Jackett.
snakemake which supports
make-like build workflows, but on horrible campus clusters.
Seem general and powerful but complicated.
hanythingondemand provides a set of scripts to easily set up an ad-hoc Hadoop cluster through PBS jobs.
Or, bite the bullet and request lots of cores, maybe that will work: Easily distributing a parallel IPython Notebook on a cluster:
Have you ever asked yourself: “Do I want to spend 2 days adjusting this analysis to run on the cluster and wait 2 days for the jobs to finish or do I just run it locally with no extra work and just wait a week.”
Or: ipython-cluster-helper automates that.
“Quickly and easily parallelize Python functions using IPython on a cluster, supporting multiple schedulers. Optimizes IPython defaults to handle larger clusters and simultaneous processes.” […]
ipython-cluster-helper creates a throwaway parallel IPython profile, launches a cluster and returns a view. On program exit it shuts the cluster down and deletes the throwaway profile.
works on Platform LSF, Sun Grid Engine, Torque, SLURM. Strictly python.
More modern tools facilitate very sophisticated workflows with execution graphs and pipelines and such. One that was briefly pitched to us that I did not ultimately use: nextflow
Nextflow enables scalable and reproducible scientific workflows using software containers. It allows the adaptation of pipelines written in the most common scripting languages.
Its fluent DSL simplifies the implementation and the deployment of complex parallel and reactive workflows on clouds and clusters.
Nextflow supports Docker and Singularity containers technology.
This, along with the integration of the GitHub code sharing platform, allows you to write self-contained pipelines, manage versions and to rapidly reproduce any former configuration.
It provides out of the box executors for SGE, LSF, SLURM, PBS and HTCondor batch schedulers and for Kubernetes, Amazon AWS and Google Cloud platforms.
I think that we are actually going to be given D2iQ Kaptain: End-to-End Machine Learning Platform instead? TBC.
Software dependency management
With high likelihood, the campus cluster is running some ancient decrepit edition of RHEL with too elderly a version of anything to run anything you need. There will be some weird semi-recent version of python that you can activate but in practice even that will be some awkward version that does not mesh with the latest code that you are using (you are doing research after all, right?). You will need to install everything you use. That is fine, but be aware it’s a little different than provisioning virtual machines on your new-fangled fancy cloud thingy. This is to teach you important and hugely useful lessons for later in life, such as which compile flags to set to get the matrix kernel library version 8.7.3patch3 to compile with python 2.5.4rc3 for the itanium architecture as at May 8, 2009 Why, think on how many times you will use that skill after you leave your current job! (We call such contemplation void meditation and it is a powerful mindfulness technique.)
For HPC specifically, I have also had recommended spack, which also lets you prototype on macOS.
Spack is a package manager for supercomputers, Linux, and macOS. It makes installing scientific software easy. With Spack, you can build a package with multiple versions, configurations, platforms, and compilers, and all of these builds can coexist on the same machine.
Spack isn’t tied to a particular language; you can build a software stack in Python or R, link to libraries written in C, C++, or Fortran, and easily swap compilers. Use Spack to install in your home directory, to manage shared installations and modules on a cluster, or to build combinatorial versions of software for testing.
DIY cluster. Normally HPC is based requesting jobs via slurm and some shared file system. For my end of the budget this also means requesting breaking up the job into small chunks and squeezing them in around the edges of people with a proper CPU allocation. However! some people are blessed with the ability to request to simultaneously control a predictable number of machines. For these, you can roll your own deploy of execution across some machines, which might be useful. This might work you have a bunch of unmanaged machines not on the campus cluster, which I have personally never experienced.
If you have a containerized deployment, as with all the cloud providers these days, see perhaps containerized deployment solution, singularity, if you are blessed with admins who support it.
For traditional clustershell:
ClusterShell is an event-driven open source Python library, designed to run local or distant commands in parallel on server farms or on large Linux clusters. No need to reinvent the wheel: you can use ClusterShell as a building block to create cluster aware administration scripts and system applications in Python. It will take care of common issues encountered on HPC clusters, such as operating on groups of nodes, running distributed commands using optimized execution algorithms, as well as gathering results and merging identical outputs, or retrieving return codes. ClusterShell takes advantage of existing remote shell facilities already installed on your systems, like SSH.
ClusterShell’s primary goal is to improve the administration of high-performance clusters by providing a lightweight but scalable Python API for developers. It also provides clush, clubak and nodeset/cluset, convenient command-line tools that allow traditional shell scripts to benefit from some of the library features.
A little rawer: pdsh:
pdshis a variant of the
rsh(1), which runs commands on a single remote host, pdsh can run multiple remote commands in parallel. pdsh uses a “sliding window” (or fanout) of threads to conserve resources on the initiating host while allowing some connections to time out.
For example, the following would duplicate using the ssh module to run hostname(1) across the hosts foo[0-10]:
pdsh -R exec -w foo[0-10] ssh -x -l %u %h hostname