Cloud machine learning

Cloudimificating my artificial data learning intelligence brain clever science analyticserisation

How it works

I get lost in all the options for parallel computing on the cheap. I summarise for myself here.

There are roadmaps for this, e.g. the Cloud Native Computing Foundation’s Landscape. However, for me it exemplifies my precise problem with the industry, in that it mistakes an underexplained information deluge for actionable advice.

So, back to the old-skool: let’s find some specific things that work, implement solutions to the problems I have, and generalise as needed.

Fashion dictates this should be “cloud” computing, although I’m also interested in using the same methods without a cloud, as such. In fact, I would prefer frictionless switching between such things according to debugging and processing power needs.

My emphasis is strictly on the cloud doing large data analyses. I don’t care about serving web pages or streaming videos or whatever, or “deploying” anything. I’m a researcher, not an engineer.

In particular I mostly want to do embarrassingly parallel computation. That is, I run many calculations/simulations with absolutely no shared state and aggregate them in some way at the end. This avoids much of the complexity of graph computing. Some of my stuff is deep learning now, which is not quite as straightforward.

Additional material to this theme under scientific computation workflow and stream processing. I might need to consider how to store my data.

Software

Algorithms, implementations thereof, and providers of parallel processing services are all coupled closely. Nonetheless I’ll try to draw a distinction between the three.

Since I am not a startup trying to do machine learning on the cheap, but a grad student implementing algorithms, it’s essential that whatever I use gives me access “under the hood”; I can’t just hand in someone else’s library as my dissertation.

  • Plain old HPC clusters, as seen on campus, I discuss at scientific workbooks and build tools.

  • Ray

    USP:

    tasks and actors use the same Ray API and are used the same way. This unification of parallel tasks and actors has important benefits, both for simplifying the use of Ray and for building powerful applications through composition.

    By way of comparison, popular data processing systems such as Apache Hadoop and Apache Spark allow stateless tasks (functions with no side effects) to operate on immutable data. This assumption simplifies the overall system design and makes it easier for applications to reason about correctness.

    However, shared mutable state is common in machine learning applications. That state could be the weights of a neural network, the state of a third-party simulator, or a representation of interactions with the physical world. Ray’s actor abstraction provides an intuitive way to define and manage mutable state in a thread-safe way.

    What makes this especially powerful is the way that Ray unifies the actor abstraction with the task-parallel abstraction inheriting the benefits of both approaches. Ray uses an underlying dynamic task graph to implement both actors and stateless tasks in the same framework. As a consequence, these two abstractions are completely interoperable.
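
    Concretely, the unified API looks something like the following minimal, untested sketch; the toy function and actor are mine for illustration, not from the Ray docs:

    import ray

    ray.init()

    @ray.remote
    def square(x):
        # a stateless task: no side effects, safe to run anywhere
        return x * x

    @ray.remote
    class Counter:
        # an actor: mutable state lives inside the class instance
        def __init__(self):
            self.total = 0

        def add(self, x):
            self.total += x
            return self.total

    futures = [square.remote(i) for i in range(4)]   # tasks run in parallel
    counter = Counter.remote()                       # the actor gets its own worker process
    totals = [counter.add.remote(v) for v in ray.get(futures)]
    print(ray.get(totals))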

  • airflow is Airbnb’s hybrid parallel-happy workflow tool:

    While you can get up and running with Airflow in just a few commands, the complete architecture has the following components:

    • The job definitions, in source control.

    • A rich CLI (command line interface) to test, run, backfill, describe and clear parts of your DAGs.

    • A web application, to explore your DAGs’ definition, their dependencies, progress, metadata and logs. The web server is packaged with Airflow and is built on top of the Flask Python web framework.

    • A metadata repository, typically a MySQL or Postgres database that Airflow uses to keep track of task job statuses and other persistent information.

    • An array of workers, running the jobs task instances in a distributed fashion.

    • Scheduler processes, that fire up the task instances that are ready to run.
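
    From the analyst’s side, a job definition is just a Python file declaring a DAG, which the scheduler and workers then pick up. A minimal sketch (the task names and callables are made up, and import paths shift a little between Airflow versions):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator  # airflow.operators.python_operator in older releases

    def run_sim(seed, **context):
        print(f"simulating with seed {seed}")

    def aggregate(**context):
        print("aggregating results")

    with DAG(
        dag_id="simulation_sweep",
        start_date=datetime(2024, 1, 1),
        schedule_interval=None,  # only run when triggered or backfilled
        catchup=False,
    ) as dag:
        sims = [
            PythonOperator(task_id=f"sim_{i}", python_callable=run_sim, op_kwargs={"seed": i})
            for i in range(8)
        ]
        gather = PythonOperator(task_id="aggregate", python_callable=aggregate)
        sims >> gather  # all simulations must finish before aggregation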

  • SystemML created by IBM, now run by the Apache Foundation:

    SystemML provides declarative large-scale machine learning (ML) that aims at flexible specification of ML algorithms and automatic generation of hybrid runtime plans ranging from single-node, in-memory computations, to distributed computations on Apache Hadoop and Apache Spark.

  • Tensorflow is the hot new Google one, coming from the training of artificial neural networks, but more broadly applicable. See my notes. It handles general optimisation problems, especially for neural networks. Listed here as a parallel option because it has some parallel support, especially on Google’s own infrastructure. C++/Python.

  • Turi (formerly Dato, formerly GraphLab) claims to automate this stuff, using their own libraries, which are a little… opaque. (Update: recently they open-sourced a bunch, so perhaps this has changed?) They have similar, but not identical, APIs to python’s scikit-learn. Their (open source) Tensorflow competitor mxnet claims to be the fastest thingy ever times whatever you said plus one.

  • Spark is…

    …an open-source cluster computing framework originally developed in the AMPLab at UC Berkeley. By allowing user programs to load data into a cluster’s memory and query it repeatedly, Spark is well suited to machine learning algorithms.

    Spark requires a cluster manager and a distributed storage system. Spark also supports a pseudo-distributed mode, usually used only for development or testing purposes, where distributed storage is not required and the local file system can be used instead; in that scenario, Spark runs on a single machine with one worker per CPU core.

    Spark had over 465 contributors in 2014, making it the most active project in the Apache Software Foundation and among Big Data open source projects.

    Sounds lavishly well-endowed with a community. Not an option for me, since the supercomputer I use has its own weird proprietary job management system. Although possibly I could set up temporary multiprocess clusters on our cluster? It does support python, for example.

    It has, in fact, confusingly many cluster modes:

    • Standalone – a simple cluster manager included with Spark that makes it easy to set up a private cluster.
    • Apache Mesos – a general cluster manager that can also run Hadoop MapReduce and service applications.
    • Hadoop YARN – the resource manager in Hadoop 2.
    • basic EC2 launch scripts for free, btw
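
    The pseudo-distributed mode mentioned above is enough to prototype on a laptop before worrying about any of those managers. A rough sketch, where run_one is a stand-in for whatever embarrassingly parallel job I actually care about:

    from pyspark import SparkContext

    # "local[*]" runs one worker per CPU core on this machine, no cluster manager needed
    sc = SparkContext(master="local[*]", appName="sweep")

    def run_one(seed):
        # stand-in for an embarrassingly parallel simulation
        return seed ** 2

    results = sc.parallelize(range(100)).map(run_one).collect()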

    Interesting application: Communication-Efficient Distributed Dual Coordinate Ascent (CoCoA):

    By leveraging the primal-dual structure of these optimization problems, COCOA effectively combines partial results from local computation while avoiding conflict with updates simultaneously computed on other machines. In each round, COCOA employs steps of an arbitrary dual optimization method on the local data on each machine, in parallel. A single update vector is then communicated to the master node.

    i.e. cunning optimisation stunts to do efficient distribution of optimisation problems over various machines.

    Spark integrates with scikit-learn via spark-sklearn:

    This package distributes simple tasks like grid-search cross-validation. It does not distribute individual learning algorithms (unlike Spark MLlib).

    i.e. this one is for when your data fits in memory, but the optimisation over hyperparameters needs to be parallelised.
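
    The advertised usage is roughly a drop-in for scikit-learn’s own grid search, except you hand it a SparkContext. Something like the following untested sketch, which assumes an existing SparkContext called sc:

    from sklearn.datasets import load_digits
    from sklearn.ensemble import RandomForestClassifier
    from spark_sklearn import GridSearchCV

    digits = load_digits()
    param_grid = {"n_estimators": [20, 60, 120], "max_depth": [4, 8, None]}

    # each parameter combination is fitted on a different Spark executor;
    # the data itself still has to fit in memory on each worker
    search = GridSearchCV(sc, RandomForestClassifier(), param_grid)
    search.fit(digits.data, digits.target)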

  • Dataflow / Beam is Google’s job-handling framework as used in Google Cloud (see below), but they open-sourced this bit. Comparison with Spark. The claim has been made that Beam subsumes Spark. It’s not clear how easy it is to make a cluster of these things, but it can also use Apache Spark for processing. Here’s an example using docker. Java only, for now.

  • dispy (HT cpill) seems to be a python solution in this area, especially if you have a mess of machines lying around to borrow.

    dispy is a comprehensive, yet easy to use framework for creating and using compute clusters to execute computations in parallel across multiple processors in a single machine (SMP), among many machines in a cluster, grid or cloud. dispy is well suited for data parallel (SIMD) paradigm where a computation (Python function or standalone program) is evaluated with different (large) datasets independently with no communication among computation tasks (except for computation tasks sending Provisional/Intermediate Results or Transferring Files to the client).
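
    The basic pattern, per the dispy documentation, is to hand a plain function to a JobCluster and submit jobs to it; this sketch assumes dispynode is already running on the machines you want to borrow:

    import dispy

    def compute(n):
        # runs remotely on whichever dispynode picks the job up
        return n * n

    cluster = dispy.JobCluster(compute)  # discovers nodes on the local network by default
    jobs = [cluster.submit(i) for i in range(20)]
    results = [job() for job in jobs]    # job() blocks until that job finishes
    cluster.close()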

  • dask seems to parallelize certain python tasks well and claims to scale up elastically. It’s purely for python.
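
    For the embarrassingly parallel case, the dask.delayed interface is about as small as these things get; a sketch, where simulate is a placeholder:

    import dask

    @dask.delayed
    def simulate(seed):
        # placeholder for a real computation
        return seed ** 2

    lazy_results = [simulate(s) for s in range(100)]
    results = dask.compute(*lazy_results, scheduler="processes")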

  • Thunder does time series and image analysis - seems to be a python/spark gig specialising in certain operations but at scale:

    Spatial and temporal data is all around us, whether images from satellites or time series from electronic or biological sensors. These kinds of data are also the bread and butter of neuroscience. Almost all raw neural data consists of electrophysiological time series, or time-varying images of fluorescence or resonance.

    Thunder is a library for analyzing large spatial and temporal data. Its core components are:

    • Methods for loading distributed collections of images and time series data

    • Data structures and methods for working with these data types

    • Analysis methods for extracting patterns from these data

    It is built on top of the Spark distributed computing platform.

    Not clear how easy it is to extend to do more advanced operations.

  • Tessera is a map-reduce library for R:

    The Tessera computational environment is powered by a statistical approach, Divide and Recombine. At the front end, the analyst programs in R. At the back end is a distributed parallel computational environment such as Hadoop. In between are three Tessera packages: datadr, Trelliscope, and RHIPE. These packages enable the data scientist to communicate with the back end with simple R commands.

  • Thrill is a C++ framework for distributed Big Data batch computations on a cluster of machines. It is currently being designed and developed as a research project at Karlsruhe Institute of Technology and is in early testing.

    Some of the main goals for the design are:

    • To create a high-performance Big Data batch processing framework.

    • Expose a powerful C++ user interface, that is efficiently tied to the framework’s internals. The interface supports the Map/Reduce paradigm, but also versatile “dataflow graph” style computations like Apache Spark or Apache Flink with host language control flow.

    • Leverage newest C++11 and C++14 features like lambda functions and auto types to make writing user programs easy and convenient.

    • Enable compilation of binary programs with full compile-time optimization runnable directly on hardware without a virtual machine interpreter. Exploit cache effects due to less indirections than in Java and other languages. Save energy and money by reducing computation overhead.

    • Due to the zero-overhead concept of C++, enable applications to process small datatypes efficiently with no overhead.

  • Ray (already mentioned above); does Ray also fit in here? It does remote future execution and fancy dependency graphs. Is that nice? It sounds nice.

  • Runwayml does easy deployment of popular DNN models for you with particular focus on creatives. TBD.

  • Dataiku is some kind of enterprise-happy exploratory data analytics swiss army knife thing that deploys jobs to other people’s clouds for you. Pricing and actual feature set are unclear because someone let the breathless marketing people do word-salad infographic fingerpainting all over the website.

Computation node suppliers

Cloud, n.

  • A place of terror and dismay, a mysterious digital onslaught, into which we all quietly moved.

  • A fictitious place where dreams are stored. Once believed to be free and nebulous, now colonized and managed by monsters. See ‘Castle in the Air’.

  • other people’s computers

via Bryan Alexander’s Devil’s Dictionary of educational computing

See also Julia Carrie Wong and Matthew Cantor’s devil’s dictionary of Silicon Valley:

cloud, the (n) – Servers. A way to keep more of your data off your computer and in the hands of big tech, where it can be monetized in ways you don’t understand but may have agreed to when you clicked on the Terms of Service. Usually located in a city or town whose elected officials exchanged tens of millions of dollars in tax breaks for seven full-time security guard jobs.

If you want a GPU this all becomes incredibly tedious. Anyway…

  • Vast.ai allows everyone else who overinvested in buying GPUs during the bitcoin boom to sell their excess GPU time to make back the cash. Looks fiddly to use but also cheap.

  • Floydhub do deep-learning-oriented cloud stuff and come with an easy CLI.

  • Microsoft Azure - haven’t really tried it but presumably Microsoft are good at the computers?

  • Paperspace is a node supplier specialising in GPU/machine learning ease.

  • Amazon has stepped up the ease of doing this recently. It’s still over-engineered for people who aren’t building the next instagram or whatever. See my Amazon Cloud notes.

  • Google cloud might interoperate well with a bunch of Google products, such as Tensorflow, although it has weirdnesses like relying on esoteric Google APIs, so it is hard to prototype offline or with shit internet. See my google cloud notes.

  • IBM has a cloud offering (documentation) but now that my brother has left the company and can’t get me sweet inside deals I can’t be bothered.

  • Turi is also in this business, I think? I’ve gotten confused by all their varied ventures and offerings over many renames and pivots. I’m sure they are perfectly lovely.

  • cloudera supplies nodes that even run python, but AFAICT they cost thousands of bucks per year; very enterprisey. Not grad-student appropriate.

  • Databricks, spun off from the esteemed Apache Spark team, does automated Spark deployment. The product looks tasty, but has, for a solo grad student, a crazily high baseline rate of USD 99/month.

  • Runway.ml is

    RunwayML is a platform for creators of all kinds to use machine learning tools in intuitive ways without any coding experience. Find resources here to start creating with RunwayML quickly.

Parallel tasks on your awful ancient “High Performance” computing cluster that you hate but your campus spent lots of money on and it IS free so uh…

See HPC hell.

Local parallel tasks with python

See also the overlapping section on build tools for some other pipelining tools with less of a concurrency focus.

ipython native

Ipython spawning overview. ipyparallel is the built-in Jupyter option, with less pluggability but much ease.
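
A minimal sketch of the ipyparallel workflow, assuming a local cluster has already been started with something like ipcluster start -n 4:

import ipyparallel as ipp

rc = ipp.Client()                  # connect to the running cluster
view = rc.load_balanced_view()     # hand tasks to whichever engine is free
results = view.map_sync(lambda x: x ** 2, range(100))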

joblib/dask.distributed

joblib is a simple python scientific computing library with basic map-reduce and some nice caching that integrate well. Not fancy, but super easy, which is what an academic usually wants, since fancy would imply we have a personnel budget.

>>> from math import sqrt
>>> from joblib import Parallel, delayed
>>> Parallel(n_jobs=2)(delayed(sqrt)(i ** 2) for i in range(10))
[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
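
The caching mentioned above is joblib.Memory, which memoises expensive function calls to disk; a minimal sketch (the cache directory name is arbitrary):

>>> from joblib import Memory
>>> memory = Memory("./joblib_cache", verbose=0)
>>> @memory.cache
... def slow_square(x):
...     return x ** 2
>>> slow_square(3)   # computed once, then served from the on-disk cache
9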

dask.distributed is a similar project which expands slightly on joblib to handle networked computer clusters and also does load management even without a cluster. In fact it integrates with joblib.
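
A sketch of both modes of use, dask futures directly and recycling existing joblib code onto dask workers; Client() with no arguments spins up a purely local scheduler and workers:

from dask.distributed import Client
import joblib

client = Client()  # pass an address here to attach to a real cluster instead

# dask futures directly
futures = client.map(lambda x: x ** 2, range(100))
total = client.submit(sum, futures).result()

# or reuse joblib code, distributed over the dask workers
with joblib.parallel_backend("dask"):
    squares = joblib.Parallel(n_jobs=-1)(joblib.delayed(pow)(i, 2) for i in range(100))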

pathos

pathos is one general tool here. Looks a little… sedate… in development. Looks more powerful than joblib in principle, but joblib actually ships.
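
For reference, the basic pathos pattern looks much like the standard library’s multiprocessing, except that it can serialise lambdas and closures (via dill); a sketch:

from pathos.multiprocessing import ProcessingPool as Pool

results = Pool(4).map(lambda x: x ** 2, range(100))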

spark

You could also launch spark jobs.

Scientific VMs

(To work out - should I be listing Docker container images instead? Much hipper, seems less tedious.)

Scientific containers

Singularity

Singularity promises containerized science infrastructure.

Singularity provides a single universal on-ramp from the laptop, to HPC, to cloud.

Users of singularity can build applications on their desktops and run hundreds or thousands of instances—without change—on any public cloud.

Features include:

  • Support for data-intensive workloads—The elegance of Singularity’s architecture bridges the gap between HPC and AI, deep learning/machine learning, and predictive analytics.
  • A secure, single-file-based container format—Cryptographic signatures ensure trusted, reproducible, and validated software environments during runtime and at rest.
  • Extreme mobility—Use standard file and object copy tools to transport, share, or distribute a Singularity container. Any endpoint with Singularity installed can run the container.
  • Compatibility—Designed to support complex architectures and workflows, Singularity is easily adaptable to almost any environment.
  • Simplicity—If you can use Linux®, you can use Singularity.
  • Security—Singularity blocks privilege escalation inside containers by using an immutable single-file container format that can be cryptographically signed and verified.
  • User groups—Join the knowledgeable communities via GitHub, Google Groups, or in the Slack community channel.
  • Enterprise-grade features—Leverage SingularityPRO’s Container Library, Remote Builder, and expanded ecosystem of resources.

Released in 2016, Singularity is an open source-based container platform designed for scientific and high-performance computing (HPC) environments. Used by more than 25,000 top academic, government, and enterprise users, Singularity is installed on more than 3 million cores and trusted to run over a million jobs each day.

In addition to enabling greater control over the IT environment, Singularity also supports Bring Your Own Environment (BYOE)—where entire Singularity environments can be transported between computational resources (e.g., users’ PCs) with reproducibility.

NVIDIA

It is less fancy, but NVIDIA provides GPU images with all their crap pre-downloaded and licences pre-approved.