Docker containerized apps (for scientists)

Doing things that previously took 0.5 computers using 0.4 computers

2015-11-05 — 2026-03-24

Wherein Docker’s terms are disentangled, Dockerfiles are prescribed for repeatable environs, and an opaque registry timeout is remedied by setting the daemon’s DNS to 8.8.8.8.

computers are awful
diy

Assumed audience:

People who want to do containerization for machine learning research

Content warning:

The needs of ML research people are not the usual scaling-web-apps type needs of many containerization users. Obsolete advice danger.

The most popular containerization stack.

Figure 1

Containerization is a broad family of technologies. Docker is such a prominent member of that family that it’s easiest to define the others in relation to Docker. “Like Docker, but …” is a common way to introduce a new containerization tool.

This is unfortunate, because Docker is neither the most comprehensible nor the best-explained member of the family, and it is not the one I typically need. Docker is well supported, but its terminology is awful, its analogies are confusing, and its explanations are poor. Fortunately, Julia Evans explains at least the filesystem, overlayfs, by example. The Google best practice page also has good illustrations that make it clear what’s going on. See also the Docker cheat sheet, as noted by digithead, who also explains the annoying terminology:

Docker terminology has spawned some confusion. For instance: images vs. containers and registry vs. repository. Luckily, there’s help, for example this stack-overflow post by a brilliant, but under-appreciated, hacker on the difference between images and containers.

  • Registry - a service that stores image repositories
  • Repository - a set of Docker images, usually versions of the same application
  • Image - an immutable snapshot of a running container. An image consists of layers of file system changes stacked up on top of a base image.
  • Container - a runtime instance of an image
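
As a rough illustration of how those terms show up in practice, a fully qualified image reference can be pulled apart into registry, repository and tag. This is a simplified sketch using plain shell parameter expansion, not Docker's own parsing: real references may omit the registry (Docker Hub is then assumed), include a port, or use a digest instead of a tag. The reference itself is just an example built from the Rocker image mentioned later on this page.

```shell
#!/bin/sh
# Simplified split of an image reference; assumes the form registry/repository:tag.
ref="docker.io/rocker/r-base:latest"

registry="${ref%%/*}"      # everything before the first slash
rest="${ref#*/}"           # everything after the first slash
repository="${rest%:*}"    # drop the trailing :tag
tag="${rest##*:}"          # keep only what follows the last colon

echo "registry=$registry repository=$repository tag=$tag"
```

A container is then what we get when we run that image, for example with docker run.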

Essentially, with Docker, we provide a recipe for building a reproducible execution environment, and the infrastructure ensures that environment exists for our program. The recipe is ideally encapsulated in the Dockerfile. The trade-off is that setup is a bit clunkier. The upside is that the second setup (and every one after that) is, in principle, effortless and portable.
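
A minimal sketch of such a recipe, written out from the shell so it can be pasted as one piece. Everything specific here is an assumption for illustration: the directory name demo-env, the python:3.12-slim base image, and the pinned numpy version stand in for whatever the project actually needs.

```shell
#!/bin/sh
set -eu
# Hypothetical project directory holding nothing but the recipe.
mkdir -p demo-env
cat > demo-env/Dockerfile <<'EOF'
# Base image from Docker Hub (the tag is an assumption).
FROM python:3.12-slim
WORKDIR /work
# Pin versions so that a later rebuild yields the same environment.
RUN pip install --no-cache-dir numpy==1.26.4
# Default command: report the installed version.
CMD ["python", "-c", "import numpy; print(numpy.__version__)"]
EOF

# Then, with a running Docker daemon:
#   docker build -t demo-env demo-env
#   docker run --rm demo-env
```

The first build is the clunky part; afterwards the same two commands should reproduce the environment on any machine with Docker.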

Docker is often not the best fit for my research-oriented needs. Those tend to involve heavy GPU use, which Docker has handled poorly in my experience. The things I do that do not need a GPU are usually tiny, and the massive Docker engine feels inefficient and ungainly, so I skip it. In between, there are people trying to scale web apps who love it. There is an ecosystem targeting researchers called Apptainer, but since researchers’ needs are rarely a priority, that ecosystem is not as rich. YMMV.

1 Installation

There’s a GUI called DockStation that covers most of the setup and might make some steps easier on some platforms. TBC.

1.1 Linux

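Installing Docker is easy. Don’t forget to give ourselves permission to run Docker without sudo (note that membership of the docker group is effectively root access):

sudo groupadd docker   # harmless error if the group already exists
sudo usermod -aG docker $USER
newgrp docker          # or log out and back in so the new group applies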

1.2 macOS

Docker Desktop on macOS has historically been a resource-hungry beast — a sprawling VM that hogs CPU, drains battery, and occasionally decides it needs several gigabytes of RAM just to sit there looking important.

Pick one:

  • OrbStack (GitHub) claims to be the answer to “what if Docker on macOS didn’t make my laptop sound like a jet engine?” It’s a drop-in replacement for Docker Desktop and aims to be lightweight. If we’re on macOS and not already locked into some enterprise Docker Desktop licensing situation, this looks like the place to start.

  • Docker Desktop for Mac. The canonical option. It works, but it will remind us at every opportunity that virtualization on macOS is a solved problem that Docker has chosen to solve the expensive way.

  • A Homebrew install of the Docker engine/CLI, without Desktop.

1.3 Windows

Docker for Windows.

2 Docker with GPU

Annoying: the last time I tried this, it required manual patching so intrusive that it was easier to avoid Docker entirely. Maybe it’s better now? I’m not doing this at the moment, and the terrain is shifting. The least-awful hack right now could be simple. Or not.

This might be an advantage of Apptainer.

3 Opaque timeout error

Are we getting the following error?

Error response from daemon: Get https://registry-1.docker.io/v2/:
net/http: request canceled while waiting for connection
(Client.Timeout exceeded while awaiting headers)

According to thaJeztah, the solution is to use Google DNS with Docker (or, presumably, some other non-awful DNS). We can set this by providing a JSON configuration in the Preferences panel (under daemon → advanced), for example:

{ "dns": [ "8.8.8.8", "8.8.4.4" ]}
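
On Linux there is no Preferences panel; the daemon reads its configuration from /etc/docker/daemon.json instead. The sketch below writes the fragment to a scratch file (the name daemon.json.example is made up) rather than overwriting a config that may already hold other settings. I have not verified that this cures the timeout on Linux, where the dns key mainly sets the resolvers handed to containers, so treat it as something to try.

```shell
#!/bin/sh
set -eu
# Write the DNS fragment to a scratch file; do not clobber the live config.
cat > daemon.json.example <<'EOF'
{ "dns": ["8.8.8.8", "8.8.4.4"] }
EOF

# Merge this key by hand into /etc/docker/daemon.json, then restart the daemon:
#   sudo systemctl restart docker
```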

4 Orchestrating

Docker Compose is a nice way to set up a dev environment; see the Docker Compose section below.

5 R + Docker

See also:

bindertools is a little R helper that seeks to make the bridge to binder for analyses in R even simpler by setting up the install.R file with all packages and versions (both for CRAN and github packages) in one step. The online binder can also be launched right from R, without needing to manually input repository information into the mybinder.org interface.

5.1 Rocker

Rocker has recipes for R and Docker.

## command-line R
docker run --rm -ti rocker/r-base
## Rstudio
docker run -e PASSWORD=yourpassword --rm -p 8787:8787 rocker/rstudio
## now browse to localhost:8787
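
For non-interactive work, one plausible pattern is to mount a host directory into the container and run a script with Rscript. This is a sketch under assumptions: the directory r-demo and the script analysis.R are invented for the example, and the script only summarises a dataset that ships with R.

```shell
#!/bin/sh
set -eu
# Hypothetical analysis directory with a stand-in script.
mkdir -p r-demo
cat > r-demo/analysis.R <<'EOF'
# Summarise a built-in dataset.
print(summary(cars))
EOF

# With a running Docker daemon, run the script inside the container.
# -v mounts the host directory at /work; -w makes it the working directory.
#   docker run --rm -v "$PWD/r-demo":/work -w /work rocker/r-base Rscript analysis.R
```

Anything the script writes under /work ends up in r-demo on the host, so results outlive the container.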

5.2 containerit

6 Docker Compose

Docker Compose: a nice way to set up a development environment:

Docker Compose basically lets you run a bunch of Docker containers that can communicate with each other. You configure all your containers in one file called docker-compose.yml.
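
A minimal sketch of what such a file might look like, again written out from the shell. The pairing is an assumption chosen for illustration: the rocker/rstudio image from above next to a postgres:16 database, with placeholder passwords. Within the Compose network the containers reach each other by service name, so RStudio would connect to the database at host db.

```shell
#!/bin/sh
set -eu
# Hypothetical project directory for the Compose file.
mkdir -p compose-demo
cat > compose-demo/docker-compose.yml <<'EOF'
services:
  rstudio:
    image: rocker/rstudio
    environment:
      PASSWORD: yourpassword
    ports:
      - "8787:8787"
    depends_on:
      - db
  db:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: example
EOF

# Then, from inside compose-demo and with a running Docker daemon:
#   docker compose up      # or docker-compose up with the older standalone binary
#   docker compose down    # stop and remove the containers again
```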

7 As a package manager

Russell Jones, Whalebrew: Docker Images as ‘Native’ Commands:

As I’ve previously written, containers can be started, perform a task, then stopped in a matter of milliseconds.[1] And that’s exactly what Whalebrew allows you to do in the form of Docker images aliased in your $PATH.
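
The general idea can be imitated by hand with a small wrapper script on the $PATH. To be clear, this is not how Whalebrew itself is implemented; it is a hand-rolled sketch of the same trick, and the wrapper name containerized-Rscript and the bin directory are invented for the example.

```shell
#!/bin/sh
set -eu
# Create a wrapper that forwards its arguments to Rscript inside a container.
mkdir -p bin
cat > bin/containerized-Rscript <<'EOF'
#!/bin/sh
# Mount the current directory at /work so the script can see local files.
exec docker run --rm -v "$PWD":/work -w /work rocker/r-base Rscript "$@"
EOF
chmod +x bin/containerized-Rscript

# Usage, with a running Docker daemon:
#   PATH="$PWD/bin:$PATH" containerized-Rscript analysis.R
```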

8 Kubernetes

Kubernetes is a large-scale container orchestration system. I don’t need Kubernetes since I am not on a team with 500 engineers.

Footnotes

  1. EDITOR’S NOTE: Clearly I’m doing it wrong then, because it is not nearly so fast on my MacBook.