Tapping into my new GPU from Kubeflow
- Derek Ferguson
- Feb 7, 2019
- 4 min read
So, there are many, many instructions online for the various bits and pieces needed to go from installing a GPU to actually having it used by your machine learning exercises. I'm going to record the entire path I take here for the benefit of others who may need to do the same.
In terms of starting point, I just installed a GTX 1060 on an Ubuntu 18.04 desktop. I chose the desktop edition because I thought it stood a chance of recognizing my GPU during installation. Thus far, it appears it did not.
So, we'll start with the drivers per these instructions, which say that I should install with "sudo apt-get nvidia-384 nvidia-modprobe". So, I'll start with this attempt. And, right off the bat, I realize they've missed the word "install". Great... so I rerun with "sudo apt-get install nvidia-384 nvidia-modprobe" and, sure enough, it finds ~800 MB of software that it needs to install. That's a pretty strong argument that the GPU was not recognized when I installed the OS. So, I proceed.
Unlike in those instructions, I don't get any prompt to disable Secure Boot, so I assume Secure Boot is already disabled. However, running "nvidia-smi" did not work immediately after the install finished; it required a reboot, and then it magically came to life! :-)
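To recap the commands from this step exactly as I ran them (nothing here beyond what's described above):
# Driver install, with the "install" keyword the original instructions left out
sudo apt-get install nvidia-384 nvidia-modprobe
# nvidia-smi only came to life after a reboot
sudo reboot
nvidia-smi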

I proceeded on to install the CUDA toolkit, as the blog suggests. The only change here was substituting "vi" for "vim" on my particular machine.
I don't know if it was required or not, but I rebooted before trying the sample code, and this is where I found the next difference. The instructions say to build the examples from scratch, but I found no source code. Instead, I found the binaries at "/usr/local/cuda-9.0/extras/demo_suite". Running "deviceQuery" here produced some great output, further confirming that not only are the drivers installed, but they also recognize my specific card (the 1060).
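Concretely, the check that worked for me was running the prebuilt binary; the output should end with something like "Result = PASS" and name the card:
cd /usr/local/cuda-9.0/extras/demo_suite
./deviceQuery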

The rest of that article seemed to work fine - downloading cuDNN, unzipping it into /usr/local and then refreshing the loaded libraries.
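In shell terms that step amounts to something like the following - the archive name is just a placeholder for whichever cuDNN version you actually downloaded:
# Unpack the cuDNN archive into /usr/local (it extracts into a cuda/ subdirectory)
tar -xzvf cudnn-9.0-linux-x64-v7.tgz -C /usr/local
# Refresh the dynamic linker cache so the new libraries get picked up
sudo ldconfig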
So, now that we have NVIDIA installed, let's move on to installing a Docker version that will work with it, per these instructions.
The first thing I notice is that the very first command can't be run by a regular user, so I switch to root for all the remaining instructions. Everything works fine until "sudo apt-get install -y nvidia-docker2", at which point it tells me I can't proceed without installing a different version of Docker - and yet that version can't be installed, for reasons it doesn't specify. I decide to try a reboot to see if maybe some libraries are locked or something.

The reboot doesn't help, so I decide to check what I may still have installed from Docker. Sure enough, "sudo docker version" returns information. So, I follow the instructions here to try a fuller uninstall before proceeding.
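For reference, the "fuller uninstall" from Docker's docs amounts to roughly this (the rm -rf wipes any existing images and containers, so only run it if you don't need them):
sudo apt-get purge docker-ce
sudo rm -rf /var/lib/docker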
Based on some advice I find on an online forum, I continue on and run everything up to but NOT including "INSTALL DOCKER CE" beneath this. The theory here is that this should re-point me at the right repository for the correct Docker bits, as something prior on this machine may have pointed me to the wrong repo.
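For anyone who doesn't want to chase the link, the repository setup in question is, approximately, the standard Docker CE repo setup for Ubuntu:
# Prerequisites for fetching the repo over HTTPS
sudo apt-get update
sudo apt-get install apt-transport-https ca-certificates curl software-properties-common
# Add Docker's GPG key and the stable repository for this Ubuntu release
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
sudo apt-get update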
Sure enough - re-running "sudo apt-get install -y nvidia-docker2" now kicks off the full process! :-). After it finishes, I complete the rest of the instructions on the original page above and I'm rewarded with the ability to run a genuine GPU-requiring Docker image. The only change is that the command to do so requires sudo on my machine, unlike in the instructions.
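That smoke test looked roughly like this on my machine - note the sudo, and the image tag may differ depending on which CUDA base image the instructions point at:
sudo docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi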
Now that we have a Docker version for Nvidia GPU, let's install a version of Kubernetes that works with it by following these directions here.
This all worked perfectly, but the important thing to understand is that you only need to run the first set of instructions. Once you've run "kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.11/nvidia-device-plugin.yml" and it has worked -- you're done with this part!
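If you want to double-check that the device plugin actually registered the card with Kubernetes (my own addition, not part of those directions), the GPU should now show up as an allocatable resource on the node:
kubectl describe nodes | grep -i nvidia.com/gpu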
I tried to apply an additional set of instructions after this but hit an error and realized that they were just redoing parts of what was already done above. The only additional thing I had to do - which seemed a little odd, but... OK - was to start specifying the kubeflow namespace as a part of my forwarding command. So, now my forwarding command for the HTTP traffic on the Kubeflow UI is...
sudo kubectl port-forward `kubectl get pods --selector=service=ambassador -n kubeflow -o jsonpath='{.items[0].metadata.name}'` -n kubeflow 8080:80 --address 0.0.0.0
So, at this point -- you've got Kubeflow ready to work with your GPU! Go ahead and spawn a JupyterHub notebook, choosing one of the target images from the drop-down that specifies GPU, and try some simple code with TensorFlow. If you can spawn a session and do something without the error we saw before... you're probably on the GPU.
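For the "simple code", something like this works, assuming one of the TensorFlow 1.x GPU images (both calls are standard TF 1.x APIs):
import tensorflow as tf
from tensorflow.python.client import device_lib

# True if TensorFlow can see a CUDA-capable device
print(tf.test.is_gpu_available())

# Lists every device TensorFlow registered - look for a /device:GPU:0 entry
print(device_lib.list_local_devices())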
After I wrote the above, I found a better approach, and I'll include a screenshot of it to end this blog. This approach basically assigns some work to a GPU device. If this works, you are *definitely* on your GPU.
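Since I can't walk through the screenshot line by line here, this is a sketch of that kind of check in TF 1.x: pin an op to the GPU explicitly and turn on device-placement logging, so the session log shows the work landing on /device:GPU:0.
import tensorflow as tf

# Explicitly place a small matrix multiply on the first GPU
with tf.device('/device:GPU:0'):
    a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
    b = tf.constant([[1.0, 1.0], [0.0, 1.0]])
    c = tf.matmul(a, b)

# log_device_placement prints which device each op actually ran on
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    print(sess.run(c))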
