GPU computation out of the cloud

How is deep learning awful this time?

March 23, 2017 — July 14, 2021

computers are awful

concurrency hell

premature optimization

Yak shaving risk.

I want an option to do machine learning without the cloud, which, as we learned previously, is awful.

But also I’m a vagabond with no safe and high-bandwidth spot to store a giant GPU machine.

So, I buy a Razer Blade 2017, a surprisingly cheap laptop with modern features and comparable performance to the kind of single-GPU desktop machine I could afford. Similar steps would probably work for any machine where I run a GUI and my GPU computation on the same machine.

I don’t want to do anything fancy, just process a few gigabytes of MP3 data, and prototype some algorithms to finally run on a rented GPU in the cloud. It’s all quite simple, but I need to prototype the algorithms on a GPU before I try to get the cloud GPUs working.

There are many trade-offs and some moving targets. Check Tim Dettmers’ 2020 roundup.

This involves installing and enabling access to the GPU libraries, then compiling my ML software (e.g. tensorflow) to take best advantage of them. And then getting the hardware to actually work with said software.

Useful things to bear in mind:

It is not optional for me to restrict access to the GPU to certain programs; I must do so. Tensorflow will crash if it shares the GPU unexpectedly.
I don’t necessarily want to render to the screen using my NVIDIA; it is primarily a compute device for me, not a graphics device. Many online guides assume something different.
Switching the NVIDIA off to save power and disabling access to the NVIDIA card to prevent clashing are distinct problems, but the software often conflates them.
There is a distinction between the basic NVIDIA drivers and the CUDA computing infrastructure which leverages them.
Switching graphics devices can be done with proprietary nvidia-prime or open-source bumblebee technologies, which are different and occasionally clash. AFAICT nvidia-prime is not granular enough for my purposes; I can use it to turn the NVIDIA off but then need to use bumblebee to turn it back on and enable access to certain apps.
You can run a GPU headless, i.e. with no access to the monitor ports, but still run a GUI on it (because you might access that X-server virtually over the network). I probably want neither for machine learning purposes.
There is a related option, which is to use one video card for display (e.g. to run the X server) but do background rendering on another, headless card. Getting the output of that GPU to the screen means installing e.g. virtualgl, or using NVIDIA’s PRIME render offloading via primus. This is probably also not what I generally want to do.
This is a rapidly moving target; check for updates in online HOWTOs.
One needs to choose between the open-source NVIDIA driver nouveau and the proprietary nvidia driver, which changes things. AFAICT nouveau is “good” these days but does not support CUDA, so one should ignore any instructions that result in using nouveau to avoid weird things going wrong. (How to disable it in recent ubuntu.)
One subtle problem is that things can get weird if you turn off the NVIDIA GPU using bbswitch and then load the kernel module; this leaves the card uncontactable until reboot, which is… confusing.
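Since nouveau and nvidia behave so differently, it can help to check which one the kernel has actually loaded before debugging anything else. The following is a minimal sketch rather than a canonical tool: loaded_gpu_driver is a hypothetical helper name, and it assumes the module list has the /proc/modules layout, with the module name in the first column.

```shell
# Report which NVIDIA-capable kernel driver is loaded, judging by a module list.
# loaded_gpu_driver is a hypothetical helper, not part of any package.
# Argument 1: a file in /proc/modules format (defaults to /proc/modules).
loaded_gpu_driver() {
    local modules_file="${1:-/proc/modules}"
    # The first whitespace-separated column is the module name.
    if awk '$1 == "nouveau" { found = 1 } END { exit !found }' "$modules_file"; then
        echo "nouveau"   # open-source driver: no CUDA
    elif awk '$1 == "nvidia" { found = 1 } END { exit !found }' "$modules_file"; then
        echo "nvidia"    # proprietary driver: CUDA is possible
    else
        echo "none"      # neither driver is loaded (card may be powered off)
    fi
}
```

Calling loaded_gpu_driver with no argument inspects the running kernel. If it prints nouveau, any CUDA instructions further down are unlikely to work until that driver is disabled.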

Argh. That sounds like a lot to be aware of for a quick prototyping hack. And it is.

1 Managing the GPU

First problem, how can I get the GPU behaving nicely? I am not in fact an expert at this, but after much reading, I think the following taxonomy of options might approximately reflect reality:

1.1 Do nothing but still use the GUI

Cross fingers, hope it is not too disruptive to do nothing special and fight with the X11 system for use of the GPU. Wastes power since GPU is always in use, crashes tensorflow.

1.2 NVIDIA-on-demand

NVIDIA driver version 435.17 and higher has NVIDIA-supported GPU offloading, which might do what we want, although it is not clear whether it actually powers down the GPU or how to invoke it, as the manual is gamer-oriented. Also, when I try to use this system it introduces many weird bugs and glitches, with the display freezing and the machine failing to reboot, even when the GPU is notionally not being used.

Possibly when doing this it would be enough to invoke an app with

__NV_PRIME_RENDER_OFFLOAD=1 __GLX_VENDOR_LIBRARY_NAME=nvidia glxinfo | grep vendor
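If that invocation works, the two variables can be wrapped in a small shell function so that only chosen programs get offloaded. This is a sketch that assumes render offload is already configured in the driver; prime_run is a name made up for this example.

```shell
# Run one command with the two render-offload variables from above set,
# leaving the rest of the session on the integrated GPU.
# prime_run is a hypothetical wrapper name.
prime_run() {
    __NV_PRIME_RENDER_OFFLOAD=1 __GLX_VENDOR_LIBRARY_NAME=nvidia "$@"
}
```

With this defined, prime_run glxinfo | grep vendor should behave the same as the one-liner above. The variables are only set for the wrapped command, so nothing else in the shell is affected.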

At the moment I cannot tolerate the burden of my computer only waking up from sleep sometimes so I am disabling this.

1.3 Headless on discrete GPU with X11 on integrated GPU

The simplest option for me in principle. NVIDIA drivers without support for the actual display, which will run (if at all) from my integrated intel graphics card, which is adequate for most purposes. However if I want to save power by turning the NVIDIA device off, can I do it this way? I think not, so I’ll be wasting 25W (which seems to be the base rate?) for doing nothing most of the time.
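That 25W figure is a guess, and nvidia-smi can report the instantaneous draw so it can be checked. A minimal sketch, assuming the proprietary driver is loaded and nvidia-smi is on the path; total_power_watts is a hypothetical helper that simply sums the numbers it is fed, and lines where the driver reports no number (e.g. [N/A], as happens on some laptop GPUs) count as zero.

```shell
# Sum per-GPU power readings (one number per line, in watts) from stdin.
# total_power_watts is a hypothetical helper; non-numeric lines count as 0.
total_power_watts() {
    awk '{ total += $1 } END { printf "%.1f\n", total }'
}
```

Intended usage is nvidia-smi --query-gpu=power.draw --format=csv,noheader,nounits | total_power_watts, run while nothing is using the card.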

Here is a guide by Ruohui Wang.

If you forget to install kernel headers for your Linux kernel, the modules will not be built and this will not work, and the only indication will be a small remark in the installation log.
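Because that failure is so quiet, it may be worth checking for the headers before installing the driver. A sketch, assuming the Debian/Ubuntu convention that the headers package provides a build directory under /lib/modules for the running kernel; kernel_headers_present is a hypothetical helper name.

```shell
# Succeed if kernel headers appear to be installed for a given kernel release.
# kernel_headers_present is a hypothetical helper.
# Argument 1: kernel release (defaults to the running kernel, uname -r).
# Argument 2: modules root (defaults to /lib/modules).
kernel_headers_present() {
    local release="${1:-$(uname -r)}"
    local modules_root="${2:-/lib/modules}"
    # build is normally a symlink into /usr/src; -d follows it, so a
    # dangling link (headers removed) counts as missing.
    [ -d "$modules_root/$release/build" ]
}
```

For example, kernel_headers_present || sudo apt install linux-headers-generic would install the generic headers only when the check fails.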

For me this tantalisingly simple option does not work well as the NVIDIA remains powered off and so the kernel modules complain there is nothing to talk to.

In particular

sudo modprobe -a nvidia nvidia-uvm

fails and there is no NVIDIA graphics listed when I do

lspci
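The lspci check is easy to script, which is handy when rebooting repeatedly to test configurations. A small sketch: nvidia_on_pci_bus is a hypothetical helper that reads lspci output on stdin and assumes the vendor string contains the word NVIDIA.

```shell
# Succeed if the lspci listing on stdin mentions an NVIDIA device.
# nvidia_on_pci_bus is a hypothetical helper.
nvidia_on_pci_bus() {
    grep -qi 'nvidia'
}
```

Usage would be lspci | nvidia_on_pci_bus && echo "card visible". If the card has been powered off it will not appear on the bus at all, which matches the failure described above.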

1.4 Headless no X11 at all

Do everything from the command-line! Develop from another computer if you are so soft that you want graphics!

Yes, I could boot without graphics in Ubuntu so the GPU is only for compute, but TBH I am not this hardcore. I like graphic terminals for interacting with my code and my graphs and so on. In practice were I to actually do this I would run another computer with a graphical terminal.

1.5 Switchable graphics

archlinux lists several options here:

PRIME, the official NVIDIA method to support switchable graphics. See PRIME#PRIME render offload for details.
Nouveau: See PRIME for graphics switching and nouveau for open-source NVIDIA driver.
nvidia-xrun.
optimus-manager.

I have only tried one, bumblebee, because that was the available option at the time and because this is all extremely boring.

See Askannz’ guide to NVIDIA power management.

1.5.1 Bumblebee

tl;dr this option is over-engineered but saves power, and it briefly worked for me. However I rebooted one day and it never worked again. So there is some kind of delicate race condition?

It was complex and fragile to set up for Ubuntu before approximately 17.10, so much so that I recommend switching to 18.04 or later. Instructions for earlier versions have been deleted.

AFAICT, this option allows me to switch off the graphics card except for my desired applications and they still somehow get to display stuff to my screen. The NVIDIA card is magically switched off to save power when not in use, and switched on when an app needs it. This seems flexible and also complicated. I can also manually switch power on and off for my NVIDIA using bbswitch which is nice when the laptop is not plugged in.

There are various supporting bits of technology, called primus and optimus and prime and so on that I am deeply bored by, and will mindlessly copy instructions for, until it works with the most minimal conceivable effort towards understanding what they are doing.

Bumblebee is its own small world. The documentation is improving but is still confusing. It mostly works by using the default Ubuntu packages now. Many standards and details change over time. For example as of 2019.12, the nvidia library on ubuntu is located in /usr/lib/x86_64-linux-gnu and blacklist-nvidia.conf is located in /lib/modprobe.d/. On the plus side, it is much simpler than in the old days. Things changed from slow and awful to sometimes-easy in Ubuntu 18.04 Bionic Beaver.

The master bug thread tracking the configuration changes is here. There is a scruffier thread on the Bumblebee repo. AFAICT this is partially automated in the bumblebee-nvidia package. System hardware-specific workarounds are in a special bug tracker thread. Search for your machine there and work out which zany firmware patch you will have to use in practice.

The following should be enough to set it all up by magic in principle

apt install bumblebee bumblebee-nvidia linux-headers-generic

In practice, /etc/bumblebee/bumblebee.conf gets this wrong in ubuntu 19.10 and I had to update, changing /usr/lib/nvidia-current to /usr/lib/x86_64-linux-gnu and /usr/lib32/nvidia to /usr/lib/i386-linux-gnu. Some other steps that still seem necessary include modifying /etc/environment to add

__GLVND_DISALLOW_PATCHING=1

modifying /etc/default/grub to add nogpumanager to GRUB_CMDLINE_LINUX_DEFAULT (alternatively, sudo systemctl disable gpu-manager.service), blacklisting nvidia-drm if the kernel module does not unload, and disabling nvidia-persistenced:

sudo systemctl disable nvidia-persistenced
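The path changes in bumblebee.conf described above can be applied with sed instead of by hand. This is a literal-minded sketch: fix_bumblebee_paths is a hypothetical helper, it blindly rewrites every occurrence of the two old paths, and the optional -current suffix on the 32-bit path is my assumption about what the stock file contains. It keeps a .bak copy, so diff the result before trusting it.

```shell
# Rewrite the stale NVIDIA library paths in a bumblebee.conf in place.
# fix_bumblebee_paths is a hypothetical helper; a backup is left at <file>.bak.
# Argument 1: config file (defaults to /etc/bumblebee/bumblebee.conf).
fix_bumblebee_paths() {
    local conf="${1:-/etc/bumblebee/bumblebee.conf}"
    # 32-bit path first; the "-current" suffix is optional (an assumption).
    sed -E -i.bak \
        -e 's#/usr/lib32/nvidia(-current)?#/usr/lib/i386-linux-gnu#g' \
        -e 's#/usr/lib/nvidia-current#/usr/lib/x86_64-linux-gnu#g' \
        "$conf"
}
```

On the real file this needs root, so something like sudo sh -c with the function body, or running it on a copy and moving that into place, would be required.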

To have some token opengl rendering apps for benchmarking:

apt install glmark2 mesa-utils-extra

Some instructions additionally say to force intel graphics via nvidia-prime. I think this is no longer needed.

sudo apt install nvidia-prime
sudo prime-select intel

After installing bumblebee, I can check if I can get to my NVIDIA card using

optirun nvidia-smi
#or
primusrun nvidia-smi

If I do not need to display anything (I do not), I can be more efficient and do this

optirun --no-xorg nvidia-smi
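To avoid typing this every time, and to let the same scripts run on machines without bumblebee (say, the rented cloud GPU), the call can be wrapped. A sketch under those assumptions; gpu_run is a hypothetical wrapper name.

```shell
# Run a compute job through optirun without starting a secondary X server,
# falling back to running the command directly where optirun is absent.
# gpu_run is a hypothetical wrapper.
gpu_run() {
    if command -v optirun >/dev/null 2>&1; then
        optirun --no-xorg "$@"
    else
        "$@"
    fi
}
```

Then gpu_run nvidia-smi on the laptop wakes the card via bumblebee, while on a machine without optirun it just runs nvidia-smi.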

As Ruohui Wang points out, the following should tell us we are using intel per default:

glxheads

Question: how can I tell if the GPU is switched off when I’m not running optirun?

I think this

cat /proc/acpi/bbswitch

should be OFF.
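For scripting that check, the state can be pulled out of the file. A minimal sketch, assuming the bbswitch module is loaded and that the file holds a single line whose last field is ON or OFF; bbswitch_state is a hypothetical helper name.

```shell
# Print the power state that bbswitch reports for the discrete GPU.
# bbswitch_state is a hypothetical helper.
# Argument 1: status file (defaults to /proc/acpi/bbswitch).
bbswitch_state() {
    local status_file="${1:-/proc/acpi/bbswitch}"
    if [ ! -r "$status_file" ]; then
        echo "unknown"   # bbswitch not loaded, or file not readable
        return 1
    fi
    # The line looks like "0000:01:00.0 OFF"; the last field is the state.
    awk '{ print $NF }' "$status_file"
}
```

Manual switching goes through the same file, e.g. echo OFF | sudo tee /proc/acpi/bbswitch, which only succeeds once the nvidia kernel modules have been unloaded.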

Here is another alternative setup.

I ran into problems with accessing NVIDIA devices

Failed to query NVIDIA devices. Please ensure that the NVIDIA device files (/dev/nvidia*) exist, and that user has read and write permissions for those files.

The solution involves

setfacl -m "g:nvidia-persistenced:rw" /dev/nvidia*
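Before and after changing the ACLs, it is useful to see which device files the current user can actually open. A sketch; check_dev_access is a hypothetical helper that takes the device paths as arguments and only tests read and write permission, not whether the driver behind them works.

```shell
# Report read/write access for each device file given as an argument.
# check_dev_access is a hypothetical helper; returns non-zero if any path fails.
check_dev_access() {
    local dev status=0
    for dev in "$@"; do
        if [ -r "$dev" ] && [ -w "$dev" ]; then
            echo "ok: $dev"
        else
            echo "no access: $dev"
            status=1
        fi
    done
    return "$status"
}
```

Run it as check_dev_access /dev/nvidia*. If no such files exist, the unexpanded pattern is reported as inaccessible, which is itself a useful hint that the kernel module never created them.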

The nvidia-modprobe package seems designed to address this. I installed it halfway through my own setup and I’m not sure how important it was in making things work, but I am not about to do any more debugging now that I have research to do.

1.6 Pop!_OS

Pop!_OS is an Ubuntu fork by the hardware manufacturer System76, which claims native support for NVIDIA switching. Maybe that would help me? A major OS reinstall sounds tedious at this point, but I have already lost many days to this problem, so I can imagine this bet being worth it. Since they claim special tensorflow support I suppose this might be a reasonable strategy.

2 CUDA

The libraries that do fancy accelerated computation using the NVIDIA graphics card.

If you are lucky these days you can avoid installing the whole CUDA toolkit. For example, pytorch does not require it unless you want to build it from source. For many other purposes, the desired app comes with the CUDA library, and for other purposes still, one can install the CUDA runtime library, which has a name like libcudart in Ubuntu.

For the whole CUDA toolkit, one can install a package with a name like nvidia-cuda-toolkit, which will be a particular dist-supported version. If I want to know where the CUDA libraries end up, I can list the package contents:

dpkg --listfiles libcudart10.1
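The dpkg query needs the exact package name, which changes with each CUDA version. An alternative that does not is to ask the dynamic linker cache. A sketch, assuming the runtime library is registered with ldconfig; cudart_paths is a hypothetical helper that filters ldconfig -p output on stdin.

```shell
# Print the file paths of any libcudart entries in `ldconfig -p` output (stdin).
# cudart_paths is a hypothetical helper.
cudart_paths() {
    # Each cache line ends with "=> /path/to/lib"; the last field is the path.
    awk '/libcudart/ { print $NF }'
}
```

Usage is ldconfig -p | cudart_paths. An empty result suggests the runtime is either not installed system-wide or is bundled privately inside an application such as a pip or conda package.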

I can force usage of the libraries by injecting these into paths, e.g.

LD_LIBRARY_PATH=/some.cuda/path:$LD_LIBRARY_PATH optirun blender

I can disable the use of CUDA

CUDA_VISIBLE_DEVICES= optirun blender
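The same variable also restricts a program to particular devices, which is one way to get the per-program GPU access control I said I needed. A sketch; with_gpus is a hypothetical wrapper whose first argument is the value for CUDA_VISIBLE_DEVICES (a comma-separated list of device indices, or an empty string to hide every GPU).

```shell
# Run a command with CUDA restricted to the listed device indices.
# with_gpus is a hypothetical wrapper.
# Argument 1: value for CUDA_VISIBLE_DEVICES ("" hides all GPUs).
# Remaining arguments: the command to run.
with_gpus() {
    local devices="$1"
    shift
    CUDA_VISIBLE_DEVICES="$devices" "$@"
}
```

For example, with_gpus 0 optirun --no-xorg python train.py would expose only the first device to a training script (train.py is a placeholder), while with_gpus "" forces a CPU-only run for comparison.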

If a particular CUDA version is needed, one needs to download the thing from NVIDIA. That last option is needlessly long and also interacts badly with Ubuntu settings. From the scuttlebutt on the internet, I understand it is complicated and probably not what you want to be doing unless you are being paid by a large corporation to develop GPU apps and have a tech support team.

2.1 Build Tensorflow for fancy GPU

UPDATE: no longer? Does this work now without any building? Looks like it, but I should verify with some performance checks.

Now I need to install Bazel and a whole bunch of Java. This will also be useful if I wish to do discount imitations of other Google infrastructure.

🏗

2.2 Sidestep compilation via anaconda

Maybe this will work:

conda create -n tensorflow pip python=3.6
conda activate tensorflow
conda install -c anaconda tensorflow-gpu

3 External GPU

Why install a GPU in some annoying machine?

Thunderbolt 3 eGPU enclosures are around, e.g. a Razer Core enclosure can host an NVIDIA Titan RTX.

4 NVIDIA ain’t everything

If one is really concerned about cheapness, it is worth mentioning that NVIDIA compute is not the only compute. Aside from the various data centre manufacturer chips, there are AMD Radeon units. These largely do not work so smoothly with ML yet, but they are around, and there is significant financial incentive to get them going.