Python cluster computing

Parallel computing, wherein a head process spawns workers executing some python function

August 23, 2016 — October 23, 2022

computers are awful
concurrency hell
distributed
number crunching
premature optimization
workflow

Various cluster options.

1 SSH tunnel management

Not a cluster thing per se, but I need to do it from time to time:
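For instance, the third-party sshtunnel package forwards a remote port to a local one from inside python. A minimal sketch; the hostname, username and ports are placeholders:

    from sshtunnel import SSHTunnelForwarder

    # Forward localhost:8888 on the remote host to localhost:8888 here,
    # e.g. to reach a jupyter server running on a cluster head node.
    with SSHTunnelForwarder(
        'head-node.example.com',
        ssh_username='me',
        remote_bind_address=('127.0.0.1', 8888),
        local_bind_address=('127.0.0.1', 8888),
    ) as tunnel:
        print('tunnel open on local port', tunnel.local_bind_port)
        # ... talk to the remote service via localhost:8888 ...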

2 Pathos

Pathos

[…] is a framework for heterogeneous computing. It primarily provides the communication mechanisms for configuring and launching parallel computations across heterogeneous resources. Pathos provides stagers and launchers for parallel and distributed computing, where each launcher contains the syntactic logic to configure and launch jobs in an execution environment. Some examples of included launchers are: a queue-less MPI-based launcher, an ssh-based launcher, and a multiprocessing launcher. Pathos also provides a map-reduce algorithm for each of the available launchers, thus greatly lowering the barrier for users to extend their code to parallel and distributed resources. Pathos provides the ability to interact with batch schedulers and queuing systems, thus allowing large computations to be easily launched on high-performance computing resources.

It integrates well with your jupyter notebook, which is the main thing; but, much like jupyter notebooks themselves, you are on your own when it comes to reproducibility, and might want to use it in concert with one of the other solutions here to achieve that.
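For basic use, pathos mirrors the stdlib multiprocessing pool API. A minimal sketch, assuming pathos is installed; the pool size of 4 is an arbitrary choice:

    from pathos.multiprocessing import ProcessingPool

    # pathos serializes with dill, so lambdas and closures also work,
    # unlike stdlib multiprocessing.
    def square(x):
        return x ** 2

    pool = ProcessingPool(nodes=4)       # 4 worker processes
    print(pool.map(square, range(10)))   # [0, 1, 4, 9, ...]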

3 joblib

See the joblib/dask.distributed section below.

4 Dask

dask seems to parallelize certain python tasks well and claims to scale up elastically. It’s purely for python.
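A minimal sketch of the dask.delayed idiom; the functions here are arbitrary placeholders:

    from dask import delayed

    def inc(x):
        return x + 1

    # Build a lazy task graph; nothing runs until .compute() is called.
    parts = [delayed(inc)(i) for i in range(10)]
    total = delayed(sum)(parts)

    print(total.compute())  # executes the graph on a local scheduler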

5 Dispy

dispy (HT cpill) seems to be a python solution if I have a mess of machines lying around to borrow.

dispy is a comprehensive, yet easy to use framework for creating and using compute clusters to execute computations in parallel across multiple processors in a single machine (SMP), among many machines in a cluster, grid or cloud. dispy is well suited for data parallel (SIMD) paradigm where a computation (Python function or standalone program) is evaluated with different (large) datasets independently with no communication among computation tasks (except for computation tasks sending Provisional/Intermediate Results or Transferring Files to the client).
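The canonical usage pattern looks roughly like this. A sketch only; node discovery and networking details are elided, and the computation is a placeholder:

    import dispy

    # The computation shipped to each node; imports go inside the function
    # because it runs in a fresh process on the remote machine.
    def compute(n):
        import time
        time.sleep(n)
        return n * n

    cluster = dispy.JobCluster(compute)           # finds nodes on the LAN by default
    jobs = [cluster.submit(i) for i in range(4)]
    for job in jobs:
        print(job())                              # blocks until that job's result arrives
    cluster.close()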

6 ipython native

IPython spawning overview: ipyparallel is the built-in jupyter option, less pluggable than the alternatives but easy to get going.
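A minimal sketch, assuming a local cluster has already been started with `ipcluster start -n 4`:

    import ipyparallel as ipp

    rc = ipp.Client()                # connect to the running ipcluster
    view = rc.load_balanced_view()   # dynamic load balancing across engines

    def square(x):
        return x ** 2

    print(view.map_sync(square, range(10)))  # [0, 1, 4, 9, ...]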

7 joblib/dask.distributed

joblib is a simple python scientific computing library with basic map-reduce and some nice caching that integrate well together. Not fancy, but super easy, which is what an academic usually wants, since fancy would imply we have a personnel budget.

>>> from math import sqrt
>>> from joblib import Parallel, delayed
>>> Parallel(n_jobs=2)(delayed(sqrt)(i ** 2) for i in range(10))
[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]

dask.distributed is a similar project which expands on joblib to handle networked computer clusters, and also does load management even without a cluster. In fact, it integrates with joblib, as shown below.
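For instance, recent joblib versions can hand their work to a dask.distributed scheduler via the dask backend. A sketch; the local Client() here stands in for a connection to a real cluster:

    from math import sqrt

    from dask.distributed import Client
    from joblib import Parallel, delayed, parallel_backend

    client = Client()  # local scheduler and workers; point at a real cluster instead

    with parallel_backend('dask'):
        # Same joblib code as above, now dispatched to dask workers.
        print(Parallel()(delayed(sqrt)(i ** 2) for i in range(10)))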

8 pytorch

has special needs. See pytorch distributed.