Practical cloud machine learning

Cloudimificating my artificial data learning intelligence brain clever science analyticserisation

2016-08-23 — 2025-08-24

computers are awful
concurrency hell
distributed
number crunching
premature optimization
workflow

Doing compute-heavy workloads in the cloud. My use for the cloud in this notebook is strictly for large-data and/or large-model ML. I don’t discuss serving web pages, streaming videos, or “deploying” anything. That is someone else’s job.

Since I am not a startup trying to do machine learning on the cheap, but a researcher implementing algorithms, it’s essential that whatever I use gets me access “under the hood”; I’m writing my own algorithms, so frameworks that only let me run someone else’s statistical algorithms are pointless for my job. OTOH, I want to know as little as possible about the mechanics and annoyances of parallelisation and compute-cloud configuration and all that nonsense, although those do indeed impinge upon my algorithm design etc.

All the parts of ML algorithm design relate through complicated workflows, and there are a lot of ways to skin this ML cat. I’ll mention various tools that handle various bits of the workflow in a giant jumble.

Additional material on this theme lives under scientific computation workflow and stream processing; extra-large, gradient-descent-specialised stuff is under gradient descent at scale. I might also need to consider how to store my data. Stuff specific to plain old HPC clusters, as seen on campus, I discuss at hpc hell.

I recently had to do a lot of compute for some job applications and could not, obviously, use my employer’s computer. So I made it an excuse to learn about two cloud GPU platforms, and then asked an LLM to synthesise the lessons learned from my notebook.

What follows is that.

1 Executing GPU jobs on the cheap

I recently needed to run a bunch of compute for some job applications and couldn’t use my employer’s machine. This forced me to learn about cloud GPU platforms, which was actually useful since I’d been putting it off.

The basic problem: I have ML ideas but no local GPU. The company cluster requires paperwork and waiting. I just want to test something quickly, but instead I’m fighting with Docker and SSH keys and wondering why nothing works.

As I wrote in my development log while struggling with one particularly frustrating setup:

“Cool cool the client is going to get a great lesson in how I battle their dev environment but learn little about how I think about NNs”. — DEV_LOG.md

Through this experience, I learned about two different approaches to cloud GPU access: treating it like a “Remote Computer” (RunPod) versus treating it like a “Serverless Function” (Modal). Each has its place, but choosing the wrong one for your task makes everything unnecessarily painful.

2 Execution Model 1: The “Remote Computer” (RunPod)

The most intuitive way to get GPU power is to rent a powerful computer in the cloud. This is the model offered by services like RunPod, Lambda Labs, and the big cloud providers (AWS, GCP).

How it works:

You click a button and get a GPU-equipped virtual machine running Linux. You SSH into it and it behaves like any other remote machine. If you’re comfortable with VS Code’s Remote-SSH extension, this feels natural - the remote pod becomes an extension of your laptop.

The workflow:

  1. Spin up a pod with your desired GPU
  2. SSH into it from your terminal or VS Code
  3. Install packages, clone your repository, configure environment
  4. Run your code, edit files, debug interactively
  5. Remember to shut down the pod to stop paying (this is important!)
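Steps 2–4 can be scripted from the laptop side. Here is a minimal sketch using only the standard library; the pod address, repository, and entry point are placeholders of mine, not anything RunPod provides:

    import subprocess

    POD = "root@203.0.113.7"   # placeholder address, taken from the RunPod console
    PORT = "22"                 # or the mapped SSH port for your pod

    # Steps 3-4 of the workflow, chained into one remote command.
    setup_and_run = " && ".join([
        "git clone https://github.com/example/experiment.git",  # placeholder repo
        "cd experiment",
        "pip install -r requirements.txt",
        "python train.py --config configs/baseline.yaml",       # placeholder entry point
    ])

    # -t allocates a terminal so the training logs stream back interactively.
    subprocess.run(["ssh", "-p", PORT, "-t", POD, setup_and_run], check=True)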

The friction points:

This sounds simple, but there are several annoyances:

  • Persistent data: When you destroy the pod, your dataset and model checkpoints disappear. You need to use network volumes - persistent storage that you attach and detach from different pods. This separates your data lifecycle from your compute lifecycle.

  • Repetitive setup: Every new pod means repeating the same setup steps. You can script this away - for example, I automate SSH key management by generating a disposable key, using the GitHub CLI to add it to my account, running the job, then removing the key (sketched below). But it’s still overhead. I could learn to use containerization, but that is precisely what I was hoping to avoid.
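For the record, here is a minimal sketch of that key dance, driving ssh-keygen and the GitHub CLI (via gh api against the /user/keys endpoint) from Python. The key title and paths are placeholders, and the actual job-running step is elided:

    import json
    import subprocess
    import tempfile
    from pathlib import Path

    def run(*cmd: str) -> str:
        """Run a command, raise on failure, return stdout."""
        return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

    with tempfile.TemporaryDirectory() as tmp:
        key_path = Path(tmp) / "disposable_key"
        # 1. Generate a throwaway keypair with no passphrase.
        run("ssh-keygen", "-t", "ed25519", "-N", "", "-f", str(key_path))

        # 2. Register the public key on my GitHub account so the pod can clone private repos.
        created = json.loads(run(
            "gh", "api", "user/keys",
            "-f", "title=disposable-pod-key",
            "-f", f"key={key_path.with_suffix('.pub').read_text().strip()}",
        ))

        try:
            # 3. ... copy the private key to the pod and run the job here ...
            pass
        finally:
            # 4. Delete the key again so it never outlives the pod.
            run("gh", "api", "-X", "DELETE", f"user/keys/{created['id']}")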

The remote computer model works well for interactive development and debugging, when you need a stateful environment. But for batch jobs like hyperparameter sweeps, the manual overhead gets annoying.

3 Execution Model 2: The “Serverless Function” (Modal)

The alternative approach: stop thinking about computers and start thinking about functions. This is what Modal offers for GPU computing.

How it works: You write a Python function and add a decorator that says “run this on a cloud server with an L40S GPU.” No provisioning servers, no SSH, and you only pay for the seconds your function actually runs.

I made this switch partway through my project:

“modal looks easier than runpod since I need to squeeze this task around my actual day job and don’t want to spin up and down infrastructure”. — DEV_LOG.md

The workflow:

  1. Define environment in code: Instead of a Dockerfile, you specify your environment in Python. Much more reproducible.

    import modal

    image = (
        modal.Image.debian_slim(python_version="3.12")
        .pip_install("torch", "optuna", "datasets")
        .add_local_python_source("experiment")  # copies your local `experiment` package into the image
    )
    app = modal.App("experiment-modal", image=image)
  2. Decorate your function: Turn the training script into a function and specify hardware requirements.

    @app.function(gpu="L40S", timeout=60*60)
    def train_remote(hp_overrides: dict) -> dict:
        # ... training logic here ...
        return {"final_val_loss": loss, "logs": log_content}
  3. Execute remotely: Call the function from the local machine like any Python function; .map() fans one call out per item for parallel jobs. It has to run inside the app context, e.g. from a local entrypoint invoked with modal run.

    @app.local_entrypoint()  # entered via `modal run`, which supplies the app context
    def main():
        # Run a parallel sweep of 3 hyperparameter sets
        for result in train_remote.map([hp1, hp2, hp3]):
            print(f"Final validation loss: {result['final_val_loss']}")

How this solves the friction points:

  • Persistent data: Modal has volumes for persistent storage, similar to RunPod’s network volumes. Upload your dataset once with modal volume put and mount it read-only in all function calls (sketched after this list).

  • Process persistence: This is where serverless really helps. Deploy your Modal app with modal deploy and your function becomes a persistent endpoint. Call it with .spawn() and Modal returns a call ID immediately while the job runs independently. You can close your laptop, go to dinner, and collect results later using the call ID. Perfect for long-running hyperparameter sweeps.

  • No lifecycle management: No servers to start or stop. Modal scales up workers for parallel jobs and scales down to zero when idle. No mental overhead and no forgotten VMs burning money.
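To make the first two points concrete, here is a minimal sketch in Modal’s Python API. The volume name, mount path, and hyperparameters are my own placeholders, and the exact modal volume put arguments may differ from what I show in the comment:

    import modal

    app = modal.App("experiment-modal")  # the same App as above, repeated so this sketch stands alone

    # Persistent storage: create the volume once and upload the dataset with
    # something like `modal volume put experiment-data ./data`.
    data_vol = modal.Volume.from_name("experiment-data", create_if_missing=True)

    @app.function(gpu="L40S", timeout=60 * 60, volumes={"/data": data_vol})
    def train_remote(hp_overrides: dict) -> dict:
        # ... read the dataset from /data, train, return metrics ...
        ...

    # --- in a separate driver script, after `modal deploy` ---
    # The function is now a persistent endpoint: spawn a job, keep the call id,
    # close the laptop, and collect the result later.
    fn = modal.Function.from_name("experiment-modal", "train_remote")
    call = fn.spawn({"lr": 3e-4})          # returns immediately
    print("call id:", call.object_id)      # note this down
    # ... later, possibly from a fresh session:
    result = modal.FunctionCall.from_id(call.object_id).get()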

3.1 The hybrid workflow

My final setup combines local orchestration with remote serverless execution for hyperparameter optimization.

How it works:

  1. Local brain: An Optuna script runs on my local machine, managing the HPO study in a local SQLite file. This makes the entire sweep resumable.
  2. Local loop: The script asks Optuna for new hyperparameters.
  3. Remote muscle: For each parameter set, it calls train_remote.map() on Modal. Modal spins up GPU containers, runs training jobs in parallel, and returns validation losses.
  4. Local feedback: The script tells Optuna the results, which updates its model and suggests the next batch.

This gets the best of both: familiar local control with scalable, robust serverless GPU execution.
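Here is a minimal sketch of that loop, assuming the Modal app and train_remote function above live in a hypothetical module experiment_modal; the study name, search space, and batch size are illustrative rather than what I actually used:

    import optuna

    from experiment_modal import app, train_remote  # hypothetical module holding the Modal app above

    # Local brain: the study lives in a local SQLite file, so the sweep is resumable.
    study = optuna.create_study(
        study_name="experiment-hpo",
        storage="sqlite:///hpo.db",
        load_if_exists=True,
        direction="minimize",
    )

    BATCH = 3    # trials launched in parallel per round
    ROUNDS = 10

    with app.run():  # ephemeral Modal app context for the duration of the sweep
        for _ in range(ROUNDS):
            trials = [study.ask() for _ in range(BATCH)]
            hp_batch = [
                {
                    "lr": t.suggest_float("lr", 1e-5, 1e-2, log=True),
                    "hidden_width": t.suggest_int("hidden_width", 64, 512),
                }
                for t in trials
            ]
            # Remote muscle: each dict becomes one GPU container on Modal.
            for trial, result in zip(trials, train_remote.map(hp_batch)):
                # Local feedback: Optuna updates its model with each validation loss.
                study.tell(trial, result["final_val_loss"])

    print("Best hyperparameters:", study.best_params)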

4 Takeaways

Reflecting on this journey from a frustrating dev container to a working HPO sweep, I learned a few things:

  1. Model Your Process, Not Just Your nn.Module: Before writing code, I ask: Is this interactive debugging or a batch job? Does it need to be stateful or stateless? Does it need to survive a network disconnect? These questions guide me to the right execution model.

    • For interactive exploration where I need a persistent digital workspace, the “Remote Computer” (RunPod + VS Code) works well.
    • For scalable, repeatable batch jobs like training or HPO, the “Serverless Function” (Modal) reduces engineering friction.
  2. Separate State from Compute: Whether using network volumes (RunPod) or dedicated storage volumes (Modal), I keep my datasets and results separate from ephemeral compute instances.

  3. Automate Away the Toil: Repetitive tasks like SSH key management are a sign that I need a script. Tools like the GitHub CLI (gh) help. The goal is to make spinning up a new experiment close to a one-button operation.

These tools don’t turn researchers into full-time DevOps engineers. They provide low-friction abstractions that handle the engineering, so I can get back to what I do best—pushing the boundaries of machine learning.

Useful tools exist for this! See below.

5 Incoming

  • Vast.ai “Rent Cloud GPU Servers for Deep Learning and AI”; offers both VMs and serverless options