Containerized apps

Doing things that previously took 0.5 computers using 0.4 computers



These are rapidly evolving standards. Check the timestamps on any advice.

Containerization: A solution that fills your need for pincers by giving you a crab.

A lighter, hipper alternative to virtual machines, which, AFAICT, attempts to make provisioning services more like installing an app than building a machine, because the (recommended) unit of packaging is apps or services, rather than machines. This emphasis leads to somehow even more webinars. Related to sandboxing, in that containerized apps are sandboxed from one another, so that even apps with conflicting dependencies can coexist. Containerization targets are typically intended to be light-weight reproducible execution environments for some program which might be deployed in the cloud.

I care about this because I hope to use it for practical cloud computing and reproducible research as part of my machine learning best practice; the document is biased accordingly.

Most containerization systems in practice comprise several moving parts bolted together. They provide systems for building reproducible sets of software dependencies to execute some desired process in a controlled and not-crazy environment, separate from your other apps. There is a built-in system for making recipes for other people/machines to duplicate what you are doing exactly, or at least, close enough. That is, it is a system for doing package management that uses supporting features of the host OS to make isolation easier (although Nix seems to manage it using only clever path management).

The most common hosts for containers are, or were, Linux-ish, but there are Windows/macOS solutions these days. AFAICT these run a lightweight Linux virtual machine behind the scenes to provide a container-capable host kernel.

Learn more from Julia Evans’s How Containers work.

When you first start using containers, they seem really weird. Is it a process? A virtual machine? What’s a container image? Why isn’t the networking working? I’m on a Mac but it’s somehow running Linux sort of? And now it’s written a million files to my home directory as root? What’s HAPPENING?

And you’re not wrong. Containers seem weird because they ARE really weird. They’re not just one thing, they’re what you get when you glue together 6 different features that were mostly designed to work together but have a bunch of confusing edge cases.

Go get Julia Evans’s How Containers work for an introduction. Julia Evans makes you smarter.

Another way to understand what is going on here is to build your own toy docker system. See Containers the hard way: Gocker: A mini Docker written in Go.

Also insightful: the build toolchain is well explained in Tiny Container Challenge: Building a 6kB Containerized HTTP Server!, which distinguishes between the crap your image accumulates from building versus the crap your image needs for running.

For reproducible research

Containerization may not be the ultimate tool for reproducible research but it is a start.

Notably, there are two dominant toolchains of interest in this domain: Docker and Singularity. Docker is popular on the wider internet but Singularity is more popular for HPC. Docker is cross-platform and easy to run on your personal machine, while Singularity is Linux-only (i.e. if you want to use Singularity from not-Linux, install a Linux VM). Singularity can run many Docker images (all?), so for some purposes you may as well just assume “Docker” and bear in mind that you might use Singularity as an implementation detail. 🤞

A challenge here is that most containerization tutorials are about how to deploy a webapp for my unicorn dotcom startup, which is how they get featured on Hacker News. That is useless to me. I want to Do Science™️. For some reason most of the tutorials bury the lede.

Here are some more pragmatically useful ones. The thing which confused the crap out of me starting out is “what is the ‘inner loop’ of containerized workflow for science?”. I don’t care about all this stuff about deploying vast swarms of Postgresql servers or whatever, or finding someone else’s pre-rolled docker image; I need to develop some research. How do I get my code into one of these containers? (and why do so few tutorials start from there?) How do I get my data in there? How does the blasted thing get onto the cluster?
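As far as I can tell, the short answer is: bake the slow-moving dependencies into the image, and mount your code and data into the container at runtime. A sketch of that inner loop, where the image name, paths and script are all invented for illustration:

# build an image from the Dockerfile in the current directory
docker build -t myproject .
# run the experiment; mounting code and data from the host means the
# container always sees your latest edits without a rebuild
docker run --rm \
  -v "$PWD":/workspace \
  -v /path/to/data:/data:ro \
  myproject python /workspace/experiment.py --data /data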

(TODO: rank in some pedagogic order.)

  • Tiffany Timbers gives a brisk run-through for academics.
  • Jon Zelner goes in-depth with R in a series culminating in continuous integration for science.
  • Keunwoo Choi’s guide for researchers teaches by example.
  • The Alan Turing Institute containerization guide is an excellent introduction to how to use containers for reproducible science.
  • Timothy Ko shows off a minimal Python app development workflow cycle.
  • The Pawsey containerisation class (possibly based off Software Carpentry?).
  • The Alan Turing Institute Docker intro.

A worked example, including many details and caveats that are normally glossed over is Jeremy Kun’s Searching for RH Counterexamples — Deploying with Docker.

Docker

The most common way of doing this; so common that it is easiest to define the alternatives with reference to it.

Docker is well supported, but its terminology is awful: riven with confusing analogies and poorly explained. Fortunately we have Julia Evans, who explains at least the filesystem, overlayfs, by example. The Google best-practices page also has good illustrations which make it clear what is going on. See also the Docker cheat sheet, as noted by digithead, who also explains the annoying terminology:

Docker terminology has spawned some confusion. For instance: images vs. containers and registry vs. repository. Luckily, there’s help, for example this stack-overflow post by a brilliant, but under-appreciated, hacker on the difference between images and containers.

  • Registry - a service that stores image repositories
  • Repository - a set of Docker images, usually versions of the same application
  • Image - an immutable snapshot of a running container. An image consists of layers of file system changes stacked up on top of a base image.
  • Container - a runtime instance of an image

Essentially, with Docker you provide a recipe for building a reproducible execution environment, and the infrastructure will ensure that environment exists for your program. The recipe is ideally encapsulated in the Dockerfile. The cost of this is that the initial setup is somewhat more clunky. The benefit is that setting things up the second time, and all subsequent times, is in principle effortless and portable.
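For concreteness, a minimal Dockerfile for a research project might look something like the following; this is a sketch, and the base image, file names and script are all illustrative:

# pin a base image for reproducibility
FROM python:3.11-slim
WORKDIR /workspace
# install pinned dependencies first, so this layer is cached
# across code edits
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# copy the (frequently changing) code last
COPY . .
CMD ["python", "experiment.py"]

The ordering matters: each instruction produces a cached layer, so putting slow-changing dependencies before fast-changing code makes rebuilds after a code edit nearly instant.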

Installation

NB there is a GUI for all this called DockStation which might make some things more transparent.

Linux

Installing docker is easy. Do not forget to give yourself permission to actually run docker:

sudo groupadd docker
sudo usermod -aG docker $USER
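# the group change only applies to new login sessions; either log out
# and back in, or apply it to the current shell:
newgrp docker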

macOS

Choose one:

Docker with GPU

Annoying: last time I tried, it required manual patching so intrusive that it was easier not to use Docker at all. Maybe better now? I’m not doing this at the moment, and the terrain is shifting. The currently least-awful hack could be simple. Or, not.
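For what it’s worth, the situation seems less dire now: with NVIDIA’s Container Toolkit installed on the host, recent Docker releases can pass GPUs through with a flag. Something like this (the image tag is illustrative):

# requires the nvidia-container-toolkit package on the host
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi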

This might be an advantage of singularity.

Opaque timeout error

Do you get the following error?

Error response from daemon: Get https://registry-1.docker.io/v2/:
net/http: request canceled while waiting for connection
(Client.Timeout exceeded while awaiting headers)

According to thaJeztah, the solution is to use Google DNS for Docker (or presumably some other non-awful DNS). You can set this by providing a JSON configuration in the preference panel (under daemon -> advanced), e.g.

{ "dns": [ "8.8.8.8", "8.8.4.4" ]}

R Docker

See also bindertools, a little R helper that seeks to make the bridge to Binder for analyses in R even simpler by setting up the install.R file with all packages and versions (both CRAN and GitHub packages) in one step. The online Binder can also be launched right from R, without needing to manually input repository information into the mybinder.org interface.

Rocker

rocker provides recipes and pre-built images for R in Docker.

## command-line R
docker run --rm -ti rocker/r-base
## RStudio
docker run -e PASSWORD=yourpassword --rm -p 8787:8787 rocker/rstudio
## now browse to localhost:8787
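To do actual work, rather than have a disposable session, bind-mount your project into the container so edits persist on the host; the mount path here is illustrative:

## RStudio with the current directory mounted into the default user's home
docker run -e PASSWORD=yourpassword --rm -p 8787:8787 \
  -v "$PWD":/home/rstudio/project rocker/rstudio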

Docker compose

Docker Compose: a nice way to set up a dev environment:

Docker Compose basically lets you run a bunch of Docker containers that can communicate with each other. You configure all your containers in one file called docker-compose.yml.
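A minimal docker-compose.yml sketch, assuming a project with a local Dockerfile plus a database the code talks to; the service names and ports are invented for illustration:

services:
  app:
    build: .            # built from the Dockerfile in this directory
    ports:
      - "8888:8888"     # expose e.g. a notebook server to the host
    volumes:
      - .:/workspace    # mount the project so code edits persist
  db:
    image: postgres:15  # reachable from the app container at hostname "db"
    environment:
      POSTGRES_PASSWORD: example

Then docker-compose up starts both containers and docker-compose down tears them down again.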

Singularity

Singularity promises potentially useful container infrastructure.

A USP is that Singularity containers do not need root privileges, which means that they are easy to put on clusters full of incompetent dingleberries, which is possibly every cluster that is silly enough to let me on.

Singularity provides a single universal on-ramp from the laptop, to HPC, to cloud.

Users of singularity can build applications on their desktops and run hundreds or thousands of instances—without change—on any public cloud.

Features include:

  • Support for data-intensive workloads—The elegance of Singularity’s architecture bridges the gap between HPC and AI, deep learning/machine learning, and predictive analytics.
  • A secure, single-file-based container format—Cryptographic signatures ensure trusted, reproducible, and validated software environments during runtime and at rest.
  • Extreme mobility—Use standard file and object copy tools to transport, share, or distribute a Singularity container. Any endpoint with Singularity installed can run the container.
  • Compatibility—Designed to support complex architectures and workflows, Singularity is easily adaptable to almost any environment.
  • Simplicity—If you can use Linux®, you can use Singularity.
  • Security—Singularity blocks privilege escalation inside containers by using an immutable single-file container format that can be cryptographically signed and verified.
  • User groups—Join the knowledgeable communities via GitHub, Google Groups, or in the Slack community channel.
  • Enterprise-grade features—Leverage SingularityPRO’s Container Library, Remote Builder, and expanded ecosystem of resources. […]

Released in 2016, Singularity is an open source-based container platform designed for scientific and high-performance computing (HPC) environments. Used by more than 25,000 top academic, government, and enterprise users, Singularity is installed on more than 3 million cores and trusted to run over a million jobs each day.

In addition to enabling greater control over the IT environment, Singularity also supports Bring Your Own Environment (BYOE)—where entire Singularity environments can be transported between computational resources (e.g., users’ PCs) with reproducibility.
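In practice, the Docker interoperability is the bit I care about. A typical session looks something like this (the image is illustrative):

# pull a Docker Hub image and convert it to a single .sif file
singularity pull docker://rocker/r-base
# run a command inside it, no root required; $HOME and the current
# directory are bind-mounted by default
singularity exec r-base_latest.sif R --version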

Podman

podman is a different and, I gather, more general runtime.

Podman and Buildah for Docker users:

Now that we’ve discussed some of the motivation it’s time to discuss what that means for the user migrating to Podman. There are a few things to unpack here and we’ll get into each one separately:

  • You install Podman instead of Docker. You do not need to start or manage a daemon process like the Docker daemon.
  • The commands you are familiar with in Docker work the same for Podman.
  • Podman stores its containers and images in a different place than Docker.
  • Podman and Docker images are compatible.
  • Podman does more than Docker for Kubernetes environments.
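The command compatibility is close enough that the linked article’s advice amounts to:

# podman aims to be a drop-in replacement for the docker CLI
alias docker=podman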

LXC

LXC is another containerization standard. Because Docker is the de facto default, let’s look at this one in terms of Docker.

Kubernetes

Kubernetes is a large-scale container orchestration system. I don’t need Kubernetes since I am not in a team with 500 engineers.

