How is deep learning on Amazon EC2 awful this week?

March 16, 2017

computers are awful
concurrency hell
premature optimization

I want to do cloud machine learning.

Let’s try this on Amazon Web Services and see what’s awful.

Yak shaving risk.

I don’t want to do anything fancy here, just process a few gigabytes of MP3 data. My data is stored in the AARNET owncloud server. It’s all quite simple, but the algorithm is just too slow without a GPU and I don’t have a GPU machine I can leave running. I’ve developed it in keras v1.2.2, which depends on tensorflow 1.0.

I was trying to use Google Cloud for this, but I got lost working out their big-data-optimised algorithms and then discovered they weren’t even going to save me any money over Amazon, so I may as well just take the easy route and do some Amazon thing. Gimme a fancy computer with no fuss please, Amazon. Let me run my tensorflow.

1 Preliminaries

Howto guide from bitfusion, and the Keras run-through.

If you want to upload or config locally, you should probably get the AWS CLI.

pip3 install awscli
aws configure

You will need to set a password to use X11 GUIs.

2 Attempt 1: Ubuntu 14.04

I will use the elderly and unloved Ubuntu NVIDIA images, since they support owncloud.

First we fire up tmux to persist jobs between network implosions.

Now, install some necessary things:

sudo apt install virtualenvwrapper  # build my own updated python
sudo apt install owncloud-client-cmd   # sync my files
sudo apt install libcupti-dev # recommended CUDA tools

Great. That all works.

owncloudcmd -u vice.chancellor@unsw.edu.au -p password1234 ~/Datasets https://cloudstor.aarnet.edu.au/plus/remote.php/webdav/Datasets

Oh, that segfaults. So perhaps they don’t support Owncloud. Bugger it, I’ll download my data manually. Let me at the actual calculations.

wget -r -nH -np --cut-dirs=1 -U Mozilla --user=vice.chancellor@unsw.edu.au \
    --password=password1234 https://cloudstor.aarnet.edu.au/plus/remote.php/webdav/Datasets

Huh, a 401 error. Hmm.

Well, I’ll rsync from my laptop. While that’s happening, I’ll upgrade Tensorflow.

pip install --upgrade https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.0.1-cp34-cp34m-linux_x86_64.whl

Oh no, turns out the shipped NVIDIA libs are too old for Tensorflow 1.0. (i.e. version 7.5 instead of the required 8.0). GOTO NVIDIA’s CUDA page, and embark upon a complicated install procedure. Oh wait, I need to register first.
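For future me: a quick way to check which CUDA runtime a box actually has, rather than trusting the AMI description. A sketch via ctypes; the bare .so name is an assumption and may need adjusting (e.g. libcudart.so.7.5) depending on what symlinks the image ships.

import ctypes

# cudaRuntimeGetVersion(int*) reports e.g. 7050 for CUDA 7.5, 8000 for 8.0.
cudart = ctypes.CDLL('libcudart.so')  # may need e.g. 'libcudart.so.7.5'
version = ctypes.c_int()
cudart.cudaRuntimeGetVersion(ctypes.byref(version))
print(version.value)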

<much downloading drivers and running mysterious install scripts omitted, after which it seems to claim to work.>

Oh, it’s missing ffmpeg. How about I fix that with some completely unverified packages from some guy on the internet? I could have compiled it myself, I guess?

sudo add-apt-repository ppa:mc3man/trusty-media
sudo apt install ffmpeg

Now I run my code.

Well, that bit kinda worked, except that now my tensorflow instance can’t see the video drivers at all. There’s no error, it just doesn’t see the GPU.
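For the record, here is a quick way to confirm that particular silent failure: list the devices Tensorflow can actually see. (device_lib is a semi-internal API, but it works in 1.0.)

# List every device this tensorflow build can see. A healthy GPU box
# shows /cpu:0 and /gpu:0; a silent CPU fallback shows only /cpu:0.
from tensorflow.python.client import device_lib

for device in device_lib.list_local_devices():
    print(device.name, device.device_type)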

So I’m paying money for no reason; this calculation in fact goes slightly faster on my laptop, for which I only pay the cost of electricity.

Bugger it, I’ll try to use the NVIDIA-supported AMI. That will be sweet, right?

nvidia-smi
+------------------------------------------------------+
| NVIDIA-SMI 352.63     Driver Version: 352.63         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID K520           Off  | 0000:00:03.0     Off |                  N/A |
| N/A   32C    P0    35W / 125W |     11MiB /  4095MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Summary: This turned out to be a terrible idea, as the image doesn’t actually include GPU libraries recent enough for Tensorflow 1.0, and I need those. I still had to join NVIDIA’s developer program and download gigabytes of crap. Then I broke it. If you are going to do that, you may as well just go to Ubuntu 16.04 and at least have modern libraries. Or Amazon Linux; see below.

3 Attempt 2: Amazon Linux AMI

Firstly, we need tmux for persistent jobs.

sudo yum install tmux
tmux ls
failed to connect to server
tmux new lg
[exited]

Ah, so tmux doesn’t work on Amazon Linux? Maybe they have no users with persistent remote jobs? (In hindsight, tmux new lg asks tmux to run a command called lg in the new session, which would exit immediately; tmux new -s lg is probably what I wanted.)

Uhhh. OK, well I’ll ignore that for now and install ffmpeg to analyse those MP3s.

sudo yum install ffmpeg
No package ffmpeg available.

Arse.

The forums recommend downloading some guy’s ffmpeg builds. (extra debugging info here) Or maybe you can install it from a repo?

ARGH my session just froze and I can’t resume it because I have no tmux. Bugger this for a game of soldiers.

4 Attempt 3: Ubuntu 16.04

We start with a recent Ubuntu AMI. Unfortunately I’m not allowed to run that on GPU instances.

5 Attempt 4: The original Ubuntu 14.04 image but I’ll take a deep breath and do the GPU driver install properly

Back to the elderly and unloved Ubuntu NVIDIA images.

Maybe I can do a cheeky upgrade?

sudo do-release-upgrade

No, that’s too terrifying.

OK, careful probing reveals that the Amazon G2 instances have NVIDIA GRID K520 GPUs. NVIDIA doesn’t list them on their main overview page, but careful searching will turn up a link to a driver numbered 367.57, so I’m probably looking for a driver number like that. And “compute capability” 3.0, I learnt from internet forums.

This is getting silly.

Hmm, maybe I can hope my code is Tensorflow 0.12 compatible?

sudo apt install python-pip python-dev python-virtualenv virtualenvwrapper
sudo apt install python3-pip python3-dev python3-virtualenv
virtualenv --system-site-packages ~/lg_virtualenv/ --python=`which python3`
source ~/lg_virtualenv/bin/activate
~/lg_virtualenv/bin/pip install --upgrade pip # or weird install errors
~/lg_virtualenv/bin/pip install audioread librosa jupyter #I think this will be fine for my app?
jupyter notebook --port=9188 workbooks

Oh crap. Turns out the version of scipy in this virtualenv is arbitrarily broken and won’t import:

from scipy.stats import poisson, geom, expon
ImportError: No module named 'scipy.lib.decorator'

What? OK, that looks like some obsolete version of scipy.

~/lg_virtualenv/bin/pip install --upgrade scipy

AAAAAAAAND now tensorflow is broken, because the scipy upgrade broke numpy, and I get RuntimeError: module compiled against API version 0xa but this version of numpy is 0x9.

OK, let’s see if I can get my virtualenv to use everything compiled from the parent distro, which will require me to work out how to set up jupyter to use a virtualenv kernel:

Instructions here.

NB that still breaks scipy, whose elderly version on this machine (0.13.1) seems to play badly with virtualenv.
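A diagnostic cell worth running inside the notebook, to see which interpreter, and whose scipy, the kernel actually picked up:

# Run in a notebook cell: which python is this kernel, and whose scipy?
import sys
import scipy

print(sys.executable)     # if not ~/lg_virtualenv/bin/python3, the kernel ignored the virtualenv
print(scipy.__version__)  # 0.13.1 means the elderly system scipy won
print(scipy.__file__)     # where it was imported from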

OK, how about I forcibly inject my code into the system python install? If not, I’ll recompile Tensorflow.

deactivate
sudo pip3 install -e .

Yes, that works. So there is some stupid interaction between jupyter and scipy and virtualenv.

I don’t care, this is day 4 of my attempt to boot up a GPU and get some work done. What is the filthiest most stupid possible solution which will make it clear to my advisor that I’m not spending all day masturbating to NASCAR?

OK, I repeat the magic ffmpeg incantation from before:

sudo add-apt-repository ppa:mc3man/trusty-media
sudo apt install ffmpeg

Now I can run my code! Is it Tensorflow 0.12 compatible? I fix my dependencies to keras 1.2.0 and give it a go:

ValueError: Input 0 is incompatible with layer conv_1: expected ndim=4, found ndim=5

Ah, so I do need Keras 1.2.2 unless I want to spend time working out why my code breaks on the older version.

This is what my Tensors should look like:

Tensor("Squeeze:0", shape=(20, ?, 128), dtype=float32)
Tensor("Reshape_4:0", shape=(?, 256, 128, 1), dtype=float32)

And this is what they actually look like:

Tensor("Squeeze:0", shape=(20, ?, 128), dtype=float32)
Tensor("Reshape_4:0", shape=(?, 20, 256, 128, 1), dtype=float32)

Something stupid has happened to the batch versus normal dimensions.
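If I cared, the usual trick would be to fold the extra leading axis into the batch axis before the convolution. A sketch only, with my dimensions hard-coded, and untested against my actual model:

# Fold the spurious extra axis into the batch axis, so a conv layer
# expecting ndim=4 sees (batch * 20, 256, 128, 1) instead of
# (batch, 20, 256, 128, 1).
from keras import backend as K
from keras.layers import Lambda

collapse = Lambda(
    lambda x: K.reshape(x, (-1, 256, 128, 1)),
    output_shape=(256, 128, 1))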

OK, I don’t care, I’m not a software guy. Time to recompile tensorflow.

sudo apt install libcupti-dev
sudo add-apt-repository ppa:webupd8team/java
sudo apt install oracle-java8-installer
echo "deb [arch=amd64] http://storage.googleapis.com/bazel-apt stable jdk1.8" | sudo tee /etc/apt/sources.list.d/bazel.list
curl https://bazel.build/bazel-release.pub.gpg | sudo apt-key add -
sudo apt update && sudo apt install bazel
git clone https://github.com/tensorflow/tensorflow
pushd tensorflow
git checkout r1.0
./configure  # NB CUDA compute capability 3.0
bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package

Uhhhhhh turns out that crashes after not finding the right CUDA stuff. But where is the CUDA stuff on this system? Who knows? It’s not documented.

OK, how about I reinstall all the cuDNN and CUDA nonsense?

6 Attempt 5: Ubuntu 16.04 I found on the internet somewhere plus complete reinstall of everything ever

Since the P2 Amazon instances have Tesla K80 GPUs, which are better documented and possibly better supported, I ditch everything. I search for HVM-backed Ubuntu 16.04 images in the Amazon Community AMI marketplace.

Eventually I find one which looks legit but honestly, who knows? Could be full of spyware. Because of the clunky AWS EC2 design I can’t even easily link to it here, so let’s pass over that in silence and let future explorers make their own malware mistakes. Anyway, HVM AMIs are allowed to access the GPU, so I grab a p2.xlarge instance and whack Ubuntu 16.04.2 on it.

Now! Boot time!

The P2 instances are probably worth compiling for so you can use all their sweet hardware to full advantage, so at least we’ll feel good about recompiling and wasting yet more time.

There’s a walkthrough I’ll follow when I need to do this. I mostly follow that one, but its advice about bazel versions is outdated. There’s also an alternative version, and the basic Ubuntu, non-AMI NVIDIA driver version. But wait! They are all somewhat altered by the new NVIDIA Drivers PPA for Ubuntu.

Which damn driver? Let’s try to reverse engineer it from the unix driver page, or the search page. 375.39 seems to be the goods.

NB I also have to download the cuDNN libraries from developer.nvidia.com separately (registration required) and upload them to the instance myself.

wget https://developer.nvidia.com/compute/cuda/8.0/Prod2/local_installers/cuda-repo-ubuntu1604-8-0-local-ga2_8.0.61-1_amd64-deb
sudo dpkg -i cuda-repo-ubuntu1604-8-0-local-ga2_8.0.61-1_amd64-deb
sudo dpkg -i libcudnn5_5.1.10-1+cuda8.0_amd64.deb
sudo dpkg -i libcudnn5-dev_5.1.10-1+cuda8.0_amd64.deb
sudo apt install cuda
sudo apt install libcupti-dev
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt --no-install-recommends install nvidia-375
echo "deb [arch=amd64] http://storage.googleapis.com/bazel-apt stable jdk1.8" | sudo tee /etc/apt/sources.list.d/bazel.list  # bazel repo again; fresh instance
curl https://bazel.build/bazel-release.pub.gpg | sudo apt-key add -
sudo apt update && sudo apt install bazel
sudo apt install python3-pip python3-dev python3-virtualenv
sudo apt install ffmpeg owncloud-client-cmd # finally.
sudo pip3 install jupyter librosa pydot_ng audioread numpy scipy seaborn keras==1.2.2

I put this stuff in ~/.profile:

export CUDA_HOME=/usr/local/cuda
export CUDA_ROOT=/usr/local/cuda
export PATH=$PATH:$CUDA_ROOT/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CUDA_ROOT/lib64

Build time.

For the ./configure step, I need to know that cuDNN ended up in /usr/lib/x86_64-linux-gnu, CUDA in /usr/local/cuda, and the python library path somehow ended up as /usr/local/lib/python3.5/dist-packages. The compute capability of the K80 is 3.7, and if you want to use the G2 instances as well, it might run if you also generate the 3.0 version, although I haven’t tested that.
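To double-check the loader can actually see those libraries before burning an hour on a doomed bazel build, something like this helps (find_library returns a soname, or None if ldconfig can’t see the library):

from ctypes.util import find_library

# None here means ldconfig cannot see the library, and the build or
# runtime will likely fail to locate CUDA or cuDNN.
print(find_library('cudart'))  # e.g. 'libcudart.so.8.0'
print(find_library('cudnn'))   # e.g. 'libcudnn.so.5'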

git clone --recurse-submodules https://github.com/tensorflow/tensorflow
cd tensorflow
./configure
bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
# Go and make a coffee, or perhaps a 3 course meal, because this takes an hour

bazel-bin/tensorflow/tools/pip_package/build_pip_package ~/tensorflow_pkg
sudo pip3 install ~/tensorflow_pkg/tensorflow-1.0.1-cp35-cp35m-linux_x86_64.whl

AAAAAH IT RUNS!
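A minimal smoke test to make sure the GPU genuinely did the work this time, rather than silently falling back to the CPU:

# Force a matmul onto the GPU and make tensorflow log op placement.
# allow_soft_placement=False makes it fail loudly if the GPU is absent.
import tensorflow as tf

with tf.device('/gpu:0'):
    a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
    b = tf.constant([[1.0, 1.0], [1.0, 1.0]])
    c = tf.matmul(a, b)

config = tf.ConfigProto(log_device_placement=True,
                        allow_soft_placement=False)
with tf.Session(config=config) as sess:
    print(sess.run(c))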

My total bill for this awful experience was

  • 37.28 USD, and
  • approximately 32 hours of work, including the 10-odd hours I pissed against the wall trying Google.

Now, hopefully my algorithm does something interesting.

Addendum: I couldn’t make owncloud authenticate and I’m bored of that, so I uploaded the results into an S3 bucket.

The magic IAM policy for that is:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::bucket_of_data"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::bucket_of_data/*"
            ]
        }
    ]
}

You can use this to do file sync via such commands as

aws s3 sync --exclude="/.*" ./output s3://bucket_of_data/output  # upload
aws s3 sync --exclude="/.*" s3://bucket_of_data/output ./output  # download

However, sync by default never deletes files at the destination (there is a --delete flag for that), so it’s annoying. I will probably need to manage that with git-annex or rclone. See synchronising files.
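If the CLI keeps annoying me I might script it from Python instead. A minimal boto3 upload sketch, assuming credentials are already set up by aws configure:

# Upload everything under ./output to the bucket. No deletion handling
# here either.
import os
import boto3

BUCKET = 'bucket_of_data'  # as in the policy above
s3 = boto3.client('s3')

for root, _dirs, files in os.walk('output'):
    for name in files:
        path = os.path.join(root, name)
        s3.upload_file(path, BUCKET, path.replace(os.sep, '/'))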