Python cluster computing
Parallel computing, wherein a head process spawns workers that execute some Python function
August 23, 2016 — September 28, 2024
Various cluster options.
1 SSH tunnel management
Not a cluster thing per se, but I need to do it from time to time:
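For instance, forwarding a local port to a remote one, driven from Python via subprocess so the tunnel can be torn down programmatically. A minimal sketch; the host, user and ports are placeholders.

```python
import subprocess

# Forward local port 8888 to port 8888 on the remote host, e.g. to reach
# a jupyter server running there. -N means "run no remote command".
# Host, user and ports are placeholders.
tunnel = subprocess.Popen(
    ["ssh", "-N", "-L", "8888:localhost:8888", "me@cluster.example.edu"]
)
# ... work through localhost:8888 ...
tunnel.terminate()
```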
2 Local parallelism
2.1 joblib
joblib is a simple Python scientific-computing library with basic map-reduce parallelism and some nice disk caching, which integrate well with each other. Not fancy, but super easy, which is what an academic usually wants, since fancy would imply we have a personnel budget.
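A minimal sketch of those two features together, parallel map and disk caching; the function and cache directory are made up for illustration.

```python
from joblib import Memory, Parallel, delayed

# Cache expensive results on disk; the cache path is a placeholder.
memory = Memory("/tmp/joblib_cache", verbose=0)

@memory.cache
def slow_square(x):
    return x ** 2

# Embarrassingly parallel map over 4 worker processes.
results = Parallel(n_jobs=4)(delayed(slow_square)(i) for i in range(10))
print(results)
```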
2.2 Multiprocess
For scientific computing I might want to use multiprocess instead: it is a drop-in replacement for the standard library's multiprocessing that serialises with dill rather than pickle, so lambdas, closures and interactively defined functions survive the trip to the workers.
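A minimal sketch; the API mirrors the standard library's, so this is the familiar pool map, except that the lambda serialises fine.

```python
from multiprocess import Pool  # drop-in for stdlib multiprocessing

with Pool(4) as pool:
    # dill-based serialisation lets lambdas and closures reach the
    # workers, where stdlib pickle would choke.
    print(pool.map(lambda x: x ** 2, range(10)))
```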
2.3 ipython native
IPython spawning overview. ipyparallel is the built-in Jupyter option: less pluggable than the alternatives, but much easier to use.
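A minimal sketch, assuming a local cluster has already been started with `ipcluster start -n 4`:

```python
import ipyparallel as ipp

# Connect to the running cluster's engines.
rc = ipp.Client()
view = rc.load_balanced_view()

# Map a function across the engines, blocking until all results arrive.
results = view.map_sync(lambda x: x ** 2, range(10))
print(results)
```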
3 Pathos
[…] is a framework for heterogeneous computing. It primarily provides the communication mechanisms for configuring and launching parallel computations across heterogeneous resources. Pathos provides stagers and launchers for parallel and distributed computing, where each launcher contains the syntactic logic to configure and launch jobs in an execution environment. Some examples of included launchers are: a queue-less MPI-based launcher, a ssh-based launcher, and a multiprocessing launcher. Pathos also provides a map-reduce algorithm for each of the available launchers, thus greatly lowering the barrier for users to extend their code to parallel and distributed resources. Pathos provides the ability to interact with batch schedulers and queuing systems, thus allowing large computations to be easily launched on high-performance computing resources.
Integrates well with jupyter notebooks.
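A minimal sketch of its local process pool; like multiprocess, pathos serialises via dill, so interactively defined functions and lambdas work.

```python
from pathos.multiprocessing import ProcessingPool

# A pool of 4 local worker processes.
pool = ProcessingPool(nodes=4)
print(pool.map(lambda x: x + 1, range(10)))
```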
4 Dask
dask seems to parallelize certain Python tasks well and claims to scale up elastically. It is purely for Python.
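A minimal sketch of the delayed-evaluation API: decorated calls build a task graph lazily, and compute() executes it in parallel.

```python
import dask

@dask.delayed
def inc(x):
    return x + 1

# Nothing runs yet; this just builds the task graph.
total = dask.delayed(sum)([inc(i) for i in range(10)])

# Execute the graph in parallel.
print(total.compute())
```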
4.1 dask.distributed
dask.distributed expands slightly on joblib to handle networked clusters of computers, and it does load management even without a cluster. In fact it integrates with joblib.
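A minimal sketch of that integration, routing joblib's workers through a dask scheduler; Client() with no arguments spins up a local cluster, and passing a scheduler address would make it networked. The function is a placeholder.

```python
import joblib
from dask.distributed import Client

# Local cluster by default; pass a scheduler address to go networked.
client = Client()

def slow_square(x):
    return x ** 2

# Hand joblib's parallelism off to the dask scheduler.
with joblib.parallel_backend("dask"):
    results = joblib.Parallel()(
        joblib.delayed(slow_square)(i) for i in range(10)
    )
print(results)
```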
5 Dispy
dispy (HT cpill) seems to be a Python solution if I have a mess of machines lying around to borrow.
dispy is a comprehensive, yet easy to use framework for creating and using compute clusters to execute computations in parallel across multiple processors in a single machine (SMP), among many machines in a cluster, grid or cloud. dispy is well suited for data parallel (SIMD) paradigm where a computation (Python function or standalone program) is evaluated with different (large) datasets independently with no communication among computation tasks (except for computation tasks sending Provisional/Intermediate Results or Transferring Files to the client).
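A minimal sketch of that SIMD pattern; by default JobCluster discovers dispynode servers on the local network, so the borrowed machines just need dispynode running.

```python
import dispy

def compute(n):
    # Runs on a remote node; any imports must happen inside the function.
    return n * n

# Discovers dispynode servers on the local network by default.
cluster = dispy.JobCluster(compute)
jobs = [cluster.submit(i) for i in range(10)]
for job in jobs:
    print(job())  # calling the job waits for and returns its result
cluster.close()
```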
6 pytorch
has special needs. See pytorch distributed.