Practical cloud machine learning

Cloudimificating my artificial data learning intelligence brain clever science analyticserisation

2016-08-23 — 2025-09-26

Wherein serverless GPU functions are treated as deployable experiments, and a hybrid local Optuna loop calling remote Modal functions is described, with persistent volumes separating data from ephemeral compute.

computers are awful
concurrency hell
distributed
number crunching
premature optimization
workflow

Doing compute-heavy workloads in the cloud. My use of the cloud in this notebook is strictly for large-data and/or large-model ML. I don’t discuss serving web pages, streaming videos, or “deploying” anything. That is someone else’s job.

I’m not a startup trying to do machine learning on the cheap; I’m a researcher implementing algorithms. It’s essential that whatever I use gives me access “under the hood” without being a total pain in the arse to set up with bare‑metal servers, CUDA drivers, and all that nonsense.

I’m writing my own algorithms. Frameworks that only let me use someone else’s statistical algorithms are pointless for my work. At the same time, I want to know as little as possible about the mechanics and annoyances of parallelization, compute‑cloud configuration, and all that nonsense, even though these things affect my algorithm design.

All parts of ML algorithm design are connected in complex workflows, and there are many ways to skin this ML cat. I’ll mention tools that handle different bits of the workflow in a bit of a jumble.

Additional material on this theme is under scientific computation workflow and stream processing, and extra-large gradient-descent-specialized stuff is under gradient descent at scale. I might also need to consider how to store my data. I discuss stuff specific to plain old HPC clusters, as seen on campus, at hpc hell.

I recently needed a lot of compute for some job applications and, obviously, I couldn’t use my employer’s computer. So I used that as an excuse to learn about two compute platforms and asked an LLM to synthesize the lessons from my notebook.

What follows are the results.

Historical note

This is one of the older notebooks on this blog, with the first version dating back to 2016. There is absolutely no content left over from the original version. This is a moving target; treat all information here as having a short shelf-life.

Caveat: Other backends exist (e.g. Snakemake)

I’ve skipped over other workflow engines that can also target cloud GPUs. Most notably, Snakemake supports Kubernetes, AWS Batch, and Google Life Sciences, but I haven’t used it yet.

1 Executing GPU jobs on the cheap

As mentioned, I needed to run a bunch of compute for some job applications and couldn’t use my employer’s machine. That forced me to learn about cloud GPU platforms, which turned out to be useful, since I’d been putting it off.

The basic problem: I have ML ideas but no local GPU, and the company cluster isn’t the right tool for personal side projects such as job applications. At the same time, spinning up a full-on dev container with Docker and Kubernetes felt like overkill for a few quick experiments. I just wanted to test something, but instead I was fighting with SSH keys and wondering why nothing worked.

Through that experience, I learned about two approaches to cloud GPU access: treating it like a “Remote Computer” (RunPod) or treating it like a “Serverless Function” (Modal).

2 Execution Model 1: The “Remote Computer” (RunPod)

The most intuitive way to get GPU power is to rent a powerful computer in the cloud. This is the model offered by services like RunPod, Lambda Labs, and the big cloud providers (AWS, GCP).

How it works:

We click a button to get a GPU-equipped Linux virtual machine; then we SSH into it, and it behaves like any other remote machine.

The workflow:

  1. Spin up a pod with your desired GPU
  2. SSH into it from your terminal or VS Code
  3. Install packages, clone your repository, configure the environment
  4. Run your code, edit files, debug interactively
  5. Remember to shut down the pod to stop paying (this is important!)

The friction points:

  • Persistent data: When we destroy the pod, our datasets and model checkpoints disappear. We need network volumes — persistent storage we can attach and detach across pods — to separate our data lifecycle from our compute lifecycle.

  • Repetitive setup: Every new pod forces us through the same setup. We can script it; for example, I automate SSH key management by generating a disposable key, using the GitHub CLI to add it to my account, running the job, then removing the key (a sketch follows this list). It’s still overhead, though. I could learn containerization properly, but that’s precisely what I was hoping to avoid.
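Here is a minimal sketch of that disposable-key dance in Python, assuming the GitHub CLI (gh) is installed and authenticated with a token that can manage SSH keys; the helper name, key title, and run_job callback are all illustrative:

    import subprocess
    import tempfile
    from pathlib import Path

    def run_with_disposable_github_key(run_job):
        """Generate a throwaway SSH key, register it on GitHub, run the job, then remove it."""
        workdir = Path(tempfile.mkdtemp())
        key = workdir / "id_ed25519"
        # 1. Throwaway keypair with no passphrase
        subprocess.run(["ssh-keygen", "-t", "ed25519", "-N", "", "-f", str(key)], check=True)
        pubkey = (workdir / "id_ed25519.pub").read_text().strip()
        # 2. Register it via the GitHub REST API (POST /user/keys) through the gh CLI;
        #    --jq extracts the new key's id so we can delete it again later
        key_id = subprocess.run(
            ["gh", "api", "user/keys",
             "-f", "title=disposable-pod-key", "-f", f"key={pubkey}", "--jq", ".id"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        try:
            run_job(key)  # e.g. clone the repo and launch training on the pod
        finally:
            # 3. Remove the key from the GitHub account again
            subprocess.run(["gh", "api", "-X", "DELETE", f"user/keys/{key_id}"], check=True)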

The remote computer model works well for interactive development and debugging, when we need a stateful environment. But for batch jobs like hyperparameter sweeps, the manual overhead gets annoying.

3 Execution Model 2: The “Serverless Function” (e.g. Modal)

The alternative approach is to stop thinking about computers and start thinking about functions. This is what Modal offers for GPU computing.

How it works: We write a Python function and add a decorator that says “run this on a cloud server with an L40S GPU.” No provisioning of servers, no SSH, and we only pay for the seconds our function runs.

I made this switch partway through my project, because Modal looked easier than RunPod. I needed to squeeze this task around my day job and didn’t want to spin infrastructure up and down by hand. Hat tip to Sandy Fraser, whose nifty z0u/mi-ni showed me how simple the serverless style could be for my use case.

The workflow:

  1. Define environment in code: We specify the environment in plain Python.

    image = (
        modal.Image.debian_slim(python_version="3.12")
        .pip_install("torch", "optuna", "datasets")
        .add_local_python_source("experiment") # Copies your code
    )
    app = modal.App("experiment-modal", image=image)
  2. Decorate our function for remote execution: We convert the training script into a function and declare its hardware requirements in the decorator (GPU type, timeout, memory, and so on).

    @app.function(gpu="L40S", timeout=60*60)
    def train_remote(hp_overrides: dict) -> dict:
        # ... training logic here ...
        return {"final_val_loss": loss, "logs": log_content}
  3. Execute remotely: We call the function from our local machine, almost like any Python function, using .remote(). Modal packages the code, sends it to the cloud, runs it, and returns the results.

    We can fan it out in parallel with .map(), with no extra orchestration code (a runnable driver sketch follows this list).

    # Run a parallel sweep of 3 hyperparameter sets
    for result in train_remote.map([hp1, hp2, hp3]):
        print(f"Final validation loss: {result['final_val_loss']}")
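Concretely, the local driver can live in the same file as the decorated function and be launched with modal run; the hyperparameter values below are just placeholders:

    @app.local_entrypoint()
    def main():
        # Runs locally when we invoke `modal run` on this file; the mapped calls run remotely.
        hp1, hp2, hp3 = {"lr": 1e-3}, {"lr": 3e-4}, {"lr": 1e-4}
        for result in train_remote.map([hp1, hp2, hp3]):
            print(f"Final validation loss: {result['final_val_loss']}")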

Modal provides volumes for persistent storage, similar to RunPod’s network volumes. We upload our dataset once with modal volume put and mount it read-only in all function calls.
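A sketch of that upload-once, mount-everywhere pattern, reusing the app and image from the workflow above; the volume name and mount path are illustrative:

    # One-off upload from the local machine:
    #   modal volume create experiment-data
    #   modal volume put experiment-data ./dataset /dataset
    data_volume = modal.Volume.from_name("experiment-data")

    @app.function(gpu="L40S", timeout=60*60, volumes={"/dataset": data_volume})
    def train_remote(hp_overrides: dict) -> dict:
        # Read training data from the /dataset mount (treat it as read-only by convention)
        ...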

Compared with the remote-computer model, we gain some real advantages:

  • Process persistence: This is where serverless helps. I can deploy my Modal app with modal deploy, and my functions become persistent endpoints. Kicking a job off with .spawn() returns a call ID immediately while the job runs remotely; I can close my laptop, go to dinner, and collect the results later using that ID (see the sketch after this list). Perfect for long-running hyperparameter sweeps.

  • No lifecycle management: There are no servers to start or stop. Modal scales up workers for parallel jobs and scales down to zero when idle. This means less mental overhead and a lower risk of forgotten VMs burning money.
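A sketch of that fire-and-forget pattern, assuming the experiment-modal app defined above has been deployed with modal deploy; the hyperparameters are illustrative:

    import modal

    # Look up the deployed function by its app and function name
    train_remote = modal.Function.from_name("experiment-modal", "train_remote")

    call = train_remote.spawn({"lr": 3e-4})  # returns a handle immediately
    print("call id:", call.object_id)        # stash this somewhere durable

    # ... later, possibly from another process or machine ...
    result = modal.FunctionCall.from_id(call.object_id).get()
    print(result["final_val_loss"])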

4 Execution Model 3: The “Deployed App” (Modal)

Somewhere halfway between the two previous options: instead of treating each function call like a throwaway “lambda”, we deploy an entire Modal App. This gives us a stable, long-lived endpoint, and it avoids repeating long build-and-setup work for every run, which can be significant if, for example, the image needs to install a massive PyTorch distro.

We still decorate Python functions, but now we modal deploy the whole file (modal_app.py). Modal packages the code and environment into an image, registers the functions in the control plane, and makes them callable by name from any client. It’s something like “serverless functions with versioning and persistence.”

The workflow:

  1. Define image + app once: Just like before, we specify the environment in code. We also create a persistent volume that functions can mount and commit artifacts back to.

    import modal
    
    image = (
        modal.Image.debian_slim(python_version="3.11")
        .pip_install_from_pyproject("pyproject.toml", optional_dependencies=["modal"])
        .add_local_python_source("experiment")  # importable local module name (illustrative), not a filesystem path
    )
    
    artifacts_volume = modal.Volume.from_name("llc-artifacts", create_if_missing=True)
    app = modal.App("llc-experiments", image=image)
  2. Decorate and deploy: We decorate the training function as before, then register the whole app once with modal deploy modal_app.py.

    # volumes= attaches the persistent Volume at a mount path of our choosing (here /artifacts)
    @app.function(gpu="L40S", timeout=60*60, volumes={"/artifacts": artifacts_volume})
    def train_remote(hp_overrides: dict) -> dict:
        # ... training logic here; write checkpoints and logs under /artifacts ...
        return {"final_val_loss": loss, "logs": log_content}
  3. Invoke: From any client process, look the deployed function up by name and fan out calls.

    # Run a parallel sweep of 3 hyperparameter sets against the deployed app
    train_remote = modal.Function.from_name("llc-experiments", "train_remote")
    for result in train_remote.map([hp1, hp2, hp3]):
        print(f"Final validation loss: {result['final_val_loss']}")

4.1 My hybrid workflow

My final setup for a recent job interview combined local orchestration with remote serverless execution for hyperparameter optimization.

How it works:

  1. Local brain: An Optuna script runs on my local machine, managing the HPO study in a local SQLite file. This makes the entire sweep resumable.
  2. Local loop: The script asks Optuna for new hyperparameters.
  3. Remote muscle: For each parameter set, it calls train_remote.map() on Modal. Modal spins up GPU containers, runs training jobs in parallel, and returns validation losses.
  4. Local feedback: The script reports the results to Optuna, which updates its model and suggests the next batch.

This gets the best of both worlds: familiar local control with scalable serverless GPU execution.
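A condensed sketch of that loop, assuming the deployed llc-experiments app from the previous section; the study name, storage path, search space, and batch size are all illustrative:

    import modal
    import optuna

    # Remote muscle: the deployed Modal function
    train_remote = modal.Function.from_name("llc-experiments", "train_remote")

    # Local brain: the Optuna study lives in a local SQLite file, so the sweep is resumable
    study = optuna.create_study(
        study_name="llc-hpo",
        storage="sqlite:///hpo.db",
        load_if_exists=True,
        direction="minimize",
    )

    BATCH = 3  # trials per parallel round
    for _ in range(10):  # ten rounds of ask -> train -> tell
        trials = [study.ask() for _ in range(BATCH)]
        overrides = [
            {"lr": t.suggest_float("lr", 1e-5, 1e-2, log=True)}
            for t in trials
        ]
        # Fan the batch out to GPU containers and wait for the validation losses
        results = list(train_remote.map(overrides))
        # Local feedback: report results so Optuna can propose the next batch
        for trial, result in zip(trials, results):
            study.tell(trial, result["final_val_loss"])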

5 Why I didn’t just use SageMaker or equivalent

  1. Because SageMaker is its own thing with its own learning curve
  2. Because I know vanilla NN frameworks very well already (I started with TensorFlow in 2016, which makes me 150 in AI years)
  3. No one runs coding job interviews in SageMaker

6 Execution Model 4: The “Workflow Engine” (Snakemake, hypothetically)

I know Snakemake primarily as a workflow engine for HPC clusters, but it also has executors for cloud backends such as Kubernetes and AWS Batch. I didn’t know that at the time, or I might have used it here. Unlike the “remote computer” (RunPod) or “serverless function/app” (Modal) approaches above, Snakemake treats the job as a workflow of tasks rather than a single container or function. We describe our pipeline in a Snakefile as rules with inputs, outputs, and resources; Snakemake builds the DAG, parallelizes independent steps, and submits jobs to the chosen backend.

In the cloud setting, that backend could be Kubernetes, AWS Batch, or Google’s Life Sciences API, with data staged through S3/GCS instead of a shared filesystem. The difference is philosophical: instead of me orchestrating sweeps by hand (Optuna loop + Modal map) or babysitting a stateful VM, Snakemake would schedule training, evaluation, and preprocessing as separate jobs, resume only failed or missing ones, and manage parallelism automatically.

I haven’t tried this yet — but the next time I need something more structured than “a bunch of Modal calls in a loop,” sketching a Snakefile and pointing it at a cloud executor might be worth the setup cost.

7 Takeaways

Reflecting on the journey from a frustrating dev container to a working HPO sweep, I learned a few things:

  1. Before writing code, I now ask: Is this interactive debugging or a batch job? Does it need to be stateful or stateless? Does it need to survive a network disconnect? These questions guide me to the right execution model.

    • For interactive exploration where I need a persistent digital workspace, the “Remote Computer” (RunPod + VS Code) works well.
    • For scalable, repeatable, but disposable batch jobs like training or HPO, the “Serverless Function” (Modal) reduces engineering friction.
  2. Separate State from Compute: Whether using network volumes (RunPod) or dedicated storage volumes (Modal), I keep my datasets and results separate from ephemeral compute instances.

  3. Automate Away the Toil: Most of this has been done before and can be scripted. I automate SSH key management, environment setup, and data uploads to minimize manual steps. The cost of this stuff on cloud providers is negligible compared with my time.

Should we run all ML workloads in this way? Absolutely not. This is the kind of thing I’d consider for a team of one, doing research and prototyping, and avoiding yak-shaving.

8 Further reading