Cloud machine learning

Cloudimificating my artificial data learning intelligence brain clever science analyticserisation

August 23, 2016 — September 1, 2021

computers are awful
concurrency hell
distributed
number crunching
premature optimization
workflow

Content warning:

This notebook is hopelessly outdated and taxonomically problematic.


Doing compute-heavy workloads in the cloud. My use for the cloud in this notebook is strictly for large-data and/or large-model ML. I don’t discuss serving web pages or streaming videos or whatever here, or “deploying” anything. That is someone else’s job.

I get lost in all the options for parallel computing on the cheap. I summarise some of them for myself here. There are many summaries of this topic area; e.g. the Cloud Native Computing Foundation provides a map, the Landscape. However, if that map solves anything for anyone, that anyone is not me. To me, that map is an underexplained information deluge masquerading as actionable advice, much like the onboarding documentation for many of the key players.

Better approach: I find some specific things that solve the problems I have, and generalise as needed.

Fashion dictates this should be called “cloud” computing, although I’m interested in finding methods that also work without a cloud, as such. In fact, I would prefer frictionless switching between local and cloud computing according to debugging and processing power needs.

Within that specialty, I mostly want to do embarrassingly parallel computation. That is, I run many calculations/simulations with limited shared state and aggregate them in some way at the end. UPDATE: Some of my stuff is deep learning now, which is not quite as embarrassingly parallel, but rather excruciatingly parallel.

Additional material on this theme is under scientific computation workflow and stream processing, and extra-large gradient-descent-specialised stuff is under gradient descent at scale. I might need to consider also how to store my data.

A local (i.e. Australian) hub for this stuff is the C3DIS Conference, the Collaborative Conference on Computational and Data Intensive Science.

Algorithms, implementations thereof, and providers of parallel processing services are all coupled closely. Nonetheless I’ll try to draw a distinction between the three.

Since I am not a startup trying to do machine learning on the cheap, but a researcher implementing algorithms, it’s essential that whatever I use gives me access “under the hood” of machine learning; I’m writing my own algorithms. Using frameworks which allow me to employ only someone else’s statistical algorithms is pointless for my job. OTOH, I want to know as little as possible about the mechanics and annoyances of parallelisation and compute cloud configuration and all that nonsense, although that does indeed impinge upon my algorithm design etc.

All the parts of ML algorithm design relate through complicated workflows, and there are a lot of ways to skin this ML cat. I’ll mention various tools that handle various bits of the workflow in a giant jumble.

Stuff specific to plain old HPC clusters, as seen on campus, I discuss at hpc hell.

1 Managed services

1.1 Determined

determined promises that

Going from Notebook to Cluster has never been simpler

Seems to be a system for exploratory deep learning training setups.

Determined allows you to focus on the task at hand — training models.

Jump right into a purpose-created environment for DL work — and spend your time getting your models right without having to worry about setup, teardown, and other boilerplate code that can be automated away.

We’ve done this enough to know what you don’t want to spend your time doing.

Includes:

A built-in training loop abstraction that supports experiment tracking, efficient data loading, fault tolerance, and flexible customization.

High-performance distributed training with no code changes.

Automatic hyperparameter optimization based on cutting-edge research.

1.2 Ray

Ray. Does Ray fit in here? It does remote future execution and fancy dependency graphs. Is that nice? It sounds nice. USP:

tasks and actors use the same Ray API and are used the same way. This unification of parallel tasks and actors has important benefits, both for simplifying the use of Ray and for building powerful applications through composition.

By way of comparison, popular data processing systems such as Apache Hadoop and Apache Spark allow stateless tasks (functions with no side effects) to operate on immutable data. This assumption simplifies the overall system design and makes it easier for applications to reason about correctness.

However, shared mutable state is common in machine learning applications. That state could be the weights of a neural network, the state of a third-party simulator, or a representation of interactions with the physical world. Ray’s actor abstraction provides an intuitive way to define and manage mutable state in a thread-safe way.

What makes this especially powerful is the way that Ray unifies the actor abstraction with the task-parallel abstraction inheriting the benefits of both approaches. Ray uses an underlying dynamic task graph to implement both actors and stateless tasks in the same framework. As a consequence, these two abstractions are completely interoperable.
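For concreteness, here is a minimal sketch of those two primitives side by side (toy functions of my own invention, assuming a local `ray.init()`):

```python
import ray

ray.init()  # start a local Ray runtime; on a cluster, point this at the head node

# A stateless task: any function becomes remotely executable.
@ray.remote
def square(x):
    return x * x

# A stateful actor: a class whose methods run in their own worker process.
@ray.remote
class Counter:
    def __init__(self):
        self.total = 0

    def add(self, x):
        self.total += x
        return self.total

futures = [square.remote(i) for i in range(4)]  # schedule tasks in parallel
counter = Counter.remote()                      # instantiate an actor
counter.add.remote(sum(ray.get(futures)))       # tasks and actors compose
print(ray.get(counter.add.remote(0)))           # -> 14 (0 + 1 + 4 + 9)
```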

I have a faint suspicion that even needing to know what tasks and actors are is a red flag for my role as a researcher who is being paid not to care about implementation details.

There is a commercial spinoff called Anyscale that … builds a Ray-compliant cloud for you? idk

1.3 MLFlow

MLFlow is a platform to streamline machine learning development, including tracking experiments, packaging code into reproducible runs, and sharing and deploying models. MLflow offers a set of lightweight APIs that can be used with any existing machine learning application or library (TensorFlow, PyTorch, XGBoost, etc), wherever you currently run ML code (e.g. in notebooks, standalone applications or the cloud). MLflow’s current components are:

  • MLflow Tracking: An API to log parameters, code, and results in machine learning experiments and compare them using an interactive UI.
  • MLflow Projects: A code packaging format for reproducible runs using Conda and Docker, so you can share your ML code with others.
  • MLflow Models: A model packaging format and tools that let you easily deploy the same model (from any ML library) to batch and real-time scoring on platforms such as Docker, Apache Spark, Azure ML and AWS SageMaker.
  • MLflow Model Registry: A centralized model store, set of APIs, and UI, to collaboratively manage the full lifecycle of MLflow Models.
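Of those, Tracking is the bit I touch most; a minimal sketch (hypothetical parameter and metric names, logging to the default local `mlruns/` store):

```python
import mlflow

# Each run logs parameters and metrics to a local mlruns/ directory by default;
# point MLFLOW_TRACKING_URI at a tracking server to share results with others.
with mlflow.start_run(run_name="toy-experiment"):
    mlflow.log_param("learning_rate", 1e-3)
    for step, loss in enumerate([0.9, 0.5, 0.3]):
        mlflow.log_metric("loss", loss, step=step)
```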

TBC?

1.4 Nextflow

Nextflow

Nextflow is built around the idea that Linux is the lingua franca of data science.

Nextflow enables scalable and reproducible scientific workflows using software containers. It allows the adaptation of pipelines written in the most common scripting languages.

Nextflow allows you to write a computational pipeline by making it simpler to put together many different tasks.

You may reuse your existing scripts and tools and you don’t need to learn a new language or API to start using it.

Nextflow supports Docker and Singularity containers technology.

This, along with the integration of the GitHub code sharing platform, allows you to write self-contained pipelines, manage versions and to rapidly reproduce any former configuration.

Nextflow provides an abstraction layer between your pipeline’s logic and the execution layer, so that it can be executed on multiple platforms without it changing.

It provides out of the box executors for SGE, LSF, SLURM, PBS and HTCondor batch schedulers and for Kubernetes, Amazon AWS and Google Cloud platforms.

1.5 Coiled

Coiled was recommended by some people at work.

You love building in Python for Data Science. But training models on your laptop is limiting. Coiled supercharges the Python tools you already use for scale and performance. Move to the cloud with a few clicks—and stay focused on solving important problems.
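The model, as I understand it, is that Coiled provisions a Dask cluster in your cloud account and you connect an ordinary Dask client to it. A hedged sketch (API details may have drifted; `n_workers` is illustrative):

```python
import coiled
from dask.distributed import Client

# Provision a Dask cluster in the cloud; Coiled handles the VMs and the
# software environment so the local and remote Python environments match.
cluster = coiled.Cluster(n_workers=10)
client = Client(cluster)

# From here it is ordinary Dask: work is shipped to the remote workers.
futures = client.map(lambda x: x ** 2, range(100))
print(sum(client.gather(futures)))

cluster.close()
```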

1.6 Flyte

flyte

From the introduction:

With data now being a primary asset for companies, executing large-scale compute jobs is critical to the business, but problematic from an operational standpoint. Scaling, monitoring, and managing compute clusters becomes a burden on each product team, slowing down iteration and subsequently product innovation. Moreover, these workflows often have complex data dependencies. Without platform abstraction, dependency management becomes untenable and makes collaboration and reuse across teams impossible.

Flyte’s mission is to increase development velocity for machine learning and data processing by abstracting this overhead. We make reliable, scalable, orchestrated compute a solved problem, allowing teams to focus on business logic rather than machines. Furthermore, we enable sharing and reuse across tenants so a problem only needs to be solved once. This is increasingly important as the lines between data and machine learning converge, including the roles of those who work on them.

All Flyte tasks and workflows have strongly typed inputs and outputs. This makes it possible to parameterize your workflows, have rich data lineage, and use cached versions of pre-computed artifacts. If, for example, you’re doing hyperparameter optimization, you can easily invoke different parameters with each run. Additionally, if the run invokes a task that was already computed from a previous execution, Flyte will smartly use the cached output, saving both time and money.
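A minimal flytekit sketch of that typed-task-plus-caching idea (the task itself is a hypothetical toy; treat the caching arguments as approximate):

```python
from flytekit import task, workflow

# Tasks declare typed inputs/outputs; Flyte can cache outputs keyed on the
# inputs plus cache_version, so repeated runs reuse earlier artifacts.
@task(cache=True, cache_version="1.0")
def train(learning_rate: float) -> float:
    # pretend training; return a "score"
    return 1.0 - learning_rate

@workflow
def sweep(learning_rate: float = 0.01) -> float:
    return train(learning_rate=learning_rate)

if __name__ == "__main__":
    # workflows are locally executable for debugging before registering them
    print(sweep(learning_rate=0.1))
```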

1.7 Airflow

airflow is Airbnb’s hybrid parallel-happy workflow tool:

While you can get up and running with Airflow in just a few commands, the complete architecture has the following components:

  • The job definitions, in source control.

  • A rich CLI (command line interface) to test, run, backfill, describe and clear parts of your DAGs.

  • A web application, to explore your DAGs’ definition, their dependencies, progress, metadata and logs. The web server is packaged with Airflow and is built on top of the Flask Python web framework.

  • A metadata repository, typically a MySQL or Postgres database that Airflow uses to keep track of task job statuses and other persistent information.

  • An array of workers, running the jobs task instances in a distributed fashion.

  • Scheduler processes, that fire up the task instances that are ready to run.
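For orientation, a minimal DAG definition in roughly the Airflow 2.x style (hypothetical task names; the file lives in the dags/ folder the scheduler watches):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("pretend to fetch data")


def train():
    print("pretend to fit a model")


with DAG(
    dag_id="toy_ml_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t1 = PythonOperator(task_id="extract", python_callable=extract)
    t2 = PythonOperator(task_id="train", python_callable=train)
    t1 >> t2  # dependency: extract must finish before train starts
```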

1.8 SystemML

SystemML created by IBM, now run by the Apache Foundation:

SystemML provides declarative large-scale machine learning (ML) that aims at flexible specification of ML algorithms and automatic generation of hybrid runtime plans ranging from single-node, in-memory computations, to distributed computations on Apache Hadoop and Apache Spark.

1.9 Pathos

Pathos

[…] is a framework for heterogeneous computing. It primarily provides the communication mechanisms for configuring and launching parallel computations across heterogeneous resources. Pathos provides stagers and launchers for parallel and distributed computing, where each launcher contains the syntactic logic to configure and launch jobs in an execution environment. Some examples of included launchers are: a queue-less MPI-based launcher, a ssh-based launcher, and a multiprocessing launcher. Pathos also provides a map-reduce algorithm for each of the available launchers, thus greatly lowering the barrier for users to extend their code to parallel and distributed resources. Pathos provides the ability to interact with batch schedulers and queuing systems, thus allowing large computations to be easily launched on high-performance computing resources.

Integrates well with your jupyter notebook, which is the main thing, but much like jupyter notebooks themselves, you are on your own when it comes to reproducibility, and you might want to use it in concert with one of the other solutions here to achieve that.
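A minimal sketch of the multiprocessing map (the selling point being that pathos, via dill, can pickle lambdas and interactively defined functions where the stdlib multiprocessing cannot):

```python
from pathos.multiprocessing import ProcessingPool as Pool

# Unlike stdlib multiprocessing, pathos serialises with dill, so lambdas and
# functions defined in a notebook cell can be mapped across worker processes.
pool = Pool(nodes=4)
results = pool.map(lambda x: x ** 2, range(10))
print(results)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
pool.close()
pool.join()
```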

1.10 Dataflow

Dataflow / Beam is Google’s job-handling layer as used in Google Cloud (see below), but they open-sourced this bit. Comparison with Spark. The claim has been made that Beam subsumes Spark. It’s not clear how easy it is to make a cluster of these things, but it can also use Apache Spark for processing. Here’s an example using docker. Java only, for now, although a Python SDK has since appeared. You can maybe plug it in to Tensorflow? 🤷‍♂
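A minimal sketch with that Python SDK and the default local runner (toy pipeline of my own; only the pipeline options change when handing it to the Dataflow or Spark runners):

```python
import apache_beam as beam

# The default DirectRunner executes locally; swapping runners via pipeline
# options is how the same code gets shipped to Dataflow or Spark.
with beam.Pipeline() as p:
    (
        p
        | "make" >> beam.Create(range(10))
        | "square" >> beam.Map(lambda x: x * x)
        | "sum" >> beam.CombineGlobally(sum)
        | "show" >> beam.Map(print)
    )
```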

1.11 Turi

Turi (formerly Dato, formerly GraphLab) claims to automate much cloud stuff, using their own libraries, which are a little… opaque. (Update: recently they open-sourced a bunch, so perhaps this has changed?) They have similar, but not identical, APIs to python’s scikit-learn. Their (open source) Tensorflow competitor mxnet claims to be the fastest thingy ever times whatever you said plus one.

1.12 Spark

Spark is…

…an open-source cluster computing framework originally developed in the AMPLab at UC Berkeley. […] By allowing user programs to load data into a cluster’s memory and query it repeatedly, Spark is well suited to machine learning algorithms.

Spark requires a cluster manager and a distributed storage system. […] Spark also supports a pseudo-distributed mode, usually used only for development or testing purposes, where distributed storage is not required and the local file system can be used instead; in this scenario, Spark is running on a single machine with one worker per CPU core.

Spark had over 465 contributors in 2014, making it the most active project in the Apache Software Foundation and among Big Data open source projects.

Sounds lavishly well-endowed with a community. Not an option for me, since the supercomputer I use has its own weird proprietary job management system. Although possibly I could set up temporary multiprocess clusters on our cluster? It does support python, for example.

It has, in fact, confusingly many cluster modes:

  • Standalone — a simple cluster manager included with Spark that makes it easy to set up a private cluster.
  • Apache Mesos — a general cluster manager that can also run Hadoop MapReduce and service applications.
  • Hadoop YARN — the resource manager in Hadoop 2.
  • basic EC2 launch scripts for free, btw
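The pseudo-distributed / local mode mentioned above is the zero-setup way to poke at it from Python; a minimal sketch:

```python
from pyspark.sql import SparkSession

# local[*] runs a pseudo-distributed Spark with one worker thread per core;
# swap the master URL for a real cluster manager (standalone, YARN, Mesos) later.
spark = SparkSession.builder.master("local[*]").appName("toy").getOrCreate()

rdd = spark.sparkContext.parallelize(range(1_000_000))
print(rdd.map(lambda x: x * x).sum())

spark.stop()
```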

Interesting application: Communication-Efficient Distributed Dual Coordinate Ascent (CoCoA):

By leveraging the primal-dual structure of these optimization problems, COCOA effectively combines partial results from local computation while avoiding conflict with updates simultaneously computed on other machines. In each round, COCOA employs steps of an arbitrary dual optimization method on the local data on each machine, in parallel. A single update vector is then communicated to the master node.

i.e. cunning optimisation stunts to do efficient distribution of optimisation problems over various machines.

Spark integrates with scikit-learn via spark-sklearn:

This package distributes simple tasks like grid-search cross-validation. It does not distribute individual learning algorithms (unlike Spark MLlib).

i.e. this one is for when your data fits in memory, but the optimisation over hyperparameters needs to be parallelised.
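A sketch of that idea (the package is old and its API may have moved on; historically spark_sklearn.GridSearchCV mirrored scikit-learn’s, except that a SparkContext comes first):

```python
from sklearn import datasets, svm
from pyspark.sql import SparkSession
from spark_sklearn import GridSearchCV

spark = SparkSession.builder.master("local[*]").getOrCreate()
sc = spark.sparkContext

iris = datasets.load_iris()
param_grid = {"kernel": ("linear", "rbf"), "C": [1, 10]}

# Same interface as sklearn's GridSearchCV, but each hyperparameter
# combination is fitted on a different Spark worker.
clf = GridSearchCV(sc, svm.SVC(), param_grid)
clf.fit(iris.data, iris.target)
print(clf.best_params_)
```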

1.13 Thunder

Thunder does time series and image analysis — seems to be a python/spark gig specialising in certain operations but at scale:

Spatial and temporal data is all around us, whether images from satellites or time series from electronic or biological sensors. These kinds of data are also the bread and butter of neuroscience. Almost all raw neural data consists of electrophysiological time series, or time-varying images of fluorescence or resonance.

Thunder is a library for analyzing large spatial and temporal data. Its core components are:

  • Methods for loading distributed collections of images and time series data

  • Data structures and methods for working with these data types

  • Analysis methods for extracting patterns from these data

It is built on top of the Spark distributed computing platform.

Not clear how easy it is to extend to more general operations.

1.14 Dask

dask seems to parallelize certain python tasks well and claims to scale up elastically. It’s purely for python.
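A minimal sketch: the same code runs against a laptop-local cluster or a remote scheduler, which is roughly the frictionless local/cloud switching I said I wanted above (the worker count is illustrative):

```python
from dask.distributed import Client, LocalCluster

# Swap LocalCluster() for the address of a remote scheduler to go from
# laptop to cluster without touching the rest of the code.
client = Client(LocalCluster(n_workers=4))

futures = client.map(lambda x: x ** 2, range(1000))
total = client.submit(sum, futures)
print(total.result())

client.close()
```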

1.15 Tessera

Tessera is a map-reduce library for R:

The Tessera computational environment is powered by a statistical approach, Divide and Recombine. At the front end, the analyst programs in R. At the back end is a distributed parallel computational environment such as Hadoop. In between are three Tessera packages: datadr, Trelliscope, and RHIPE. These packages enable the data scientist to communicate with the back end with simple R commands.

1.16 Thrill

Thrill is a C++ framework for distributed Big Data batch computations on a cluster of machines. It is currently being designed and developed as a research project at Karlsruhe Institute of Technology and is in early testing.

Some of the main goals for the design are:

  • To create a high-performance Big Data batch processing framework.

  • Expose a powerful C++ user interface, that is efficiently tied to the framework’s internals. The interface supports the Map/Reduce paradigm, but also versatile “dataflow graph” style computations like Apache Spark or Apache Flink with host language control flow. […]

  • Leverage newest C++11 and C++14 features like lambda functions and auto types to make writing user programs easy and convenient.

  • Enable compilation of binary programs with full compile-time optimization runnable directly on hardware without a virtual machine interpreter. Exploit cache effects due to less indirections than in Java and other languages. Save energy and money by reducing computation overhead.

  • Due to the zero-overhead concept of C++, enable applications to process small datatypes efficiently with no overhead.

  • […]

1.17 Dataiku

Dataiku is some kind of enterprise-happy exploratory data analytics swiss army knife thing that deploys jobs to other people’s clouds for you. Pricing and actual feature set unclear because someone let the breathless marketing people do word salad infographic fingerpainting all over the website.

2 Experiment tracking for ML

See experiment tracking.

3 Compute providers

See cloud compute providers.

3.1 RONIN cloud

RONIN (ALLCAPS apparently obligatory) is

an incredibly simplistic web application that allows researchers and scientists to launch complex compute resources within minutes, without the nerding.

It seems to handle provisioning virtual machines in an especially friendly way for ML. It also seems to be frighteningly sparsely documented, especially with regard to certain key features for me: how do I design my own machine with my desired data and code to actually do a specific thing? I’m sure there is an answer to this; it is just not anywhere obvious on the project site.

3.2 Parallel tasks on that “High Performance” computing cluster without modern conveniences, which the campus spent lots of money on, and it IS free so uh…

See HPC hell.

3.3 ad hoc private cluster

For python see python cluster.

4 Scientific VMs

(To work out - should I be listing Docker container images instead? Much hipper, seems less tedious.)

5 Serialising python tasks

See python pickling.

6 Scientific containers

6.1 Apptainer/Singularity

See Apptainer.

6.2 NVIDIA

NVIDIA provides GPU images with all their crap pre-downloaded and licences pre-approved. Tensorman is an alternative of sorts for this. See pop-os/tensorman: Utility for easy management of Tensorflow containers.