Kubeflow "Unboxing" - Part 7a - TFJob: Motivation
- Derek Ferguson
- Dec 26, 2018
- 3 min read
So, the little personal technology project at the heart of this is a very modest semantic analysis effort. In the way all good technology projects begin, I'm going to take some similar code that already works and pound it into shape for my own purposes.
Putting this code into a JupyterHub notebook revealed many alleged syntax errors. This puzzled me, because the code works fine on Google's own Colab environment. After a bit of investigation, I realized this is because the Google Colab environment has later versions of both TF (1.13 - basically nightly bits) and Keras (2.2.4-tf).
So, looking at the images in the drop-down for JupyterHub leads to disappointment - TensorFlow 1.10 is the latest there. But these are just suggestions, so I investigated at https://console.cloud.google.com/gcr/images/kubeflow-images-public/GLOBAL and found that there is a 1.12 image available, if only for Kubeflow 0.4. Keep in mind I'm on 0.3, but I'm a gambler and there don't seem to be too many dependencies between the two, so I went ahead and loaded up...
gcr.io/kubeflow-images-public/tensorflow-1.12.0-notebook-cpu:v0.4.0
Unfortunately, even this image didn't seem quite up to the challenge of the syntax changes since the original version. However, this is easily remedied with the addition of two small lines to the start of the file:
get_ipython().system('pip3 install tf-nightly --upgrade')
get_ipython().system('pip3 install keras --upgrade')
Keep in mind, I'm using Python 3, of course.
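Just to confirm the upgrades actually took, a quick sanity check in a fresh cell looks something like the sketch below (my own addition - the exact version strings will vary with whatever nightly build is current, and the kernel needs a restart after the installs for them to show up):

import tensorflow as tf
import keras

# After a kernel restart, these should report the upgraded versions -
# e.g. a 1.13.x nightly for TensorFlow and 2.2.4 or later for Keras.
print(tf.__version__)
print(keras.__version__)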
For simplicity's sake, I'll put the sample code for this at https://github.com/JavaDerek/KubeflowExperiments as I evolve it.
After this, the script runs through to completion and does what it is supposed to do, which is, over the course of 3 training epochs, to evolve a style of semi-random text that resembles the speech patterns of William Shakespeare, based on repeatedly training against his complete writings.
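For context, the heart of a script like that is just a small character-level LSTM. The sketch below is my own rough approximation, not the exact code in the repo - the window length, vocabulary size, layer sizes and the dummy arrays (which stand in for one-hot-encoded windows of the actual Shakespeare corpus) are all assumptions:

import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense

maxlen = 40      # length of each input character window (assumed)
num_chars = 60   # size of the character vocabulary (assumed)

# One LSTM layer feeding a softmax over the next character.
model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, num_chars)))
model.add(Dense(num_chars, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

# Placeholder arrays; the real x/y are one-hot slices of the complete works.
x = np.zeros((1000, maxlen, num_chars), dtype=np.float32)
y = np.zeros((1000, num_chars), dtype=np.float32)
model.fit(x, y, batch_size=128, epochs=3)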
However, this led me to my next problem: running 3 epochs on this Jupyter notebook took longer than running 30 epochs in the Google Colab environment. Clearly, I was not making the most of my cluster.
So, the first thing I tried was requesting one of the GPU-based notebook images. Do I have GPU capacity on my cluster? Well, I wasn't sure... I guess I was hoping I might have picked some up at some point. It turns out that I'm allowed to load this image, but trying to run the code in the notebook gives the error below about swig_import_helper(). OK, so I don't have GPU capacity on my cluster (or, at the very least, I haven't done any pre-work to configure it).
[Screenshot: the swig_import_helper() error from the GPU notebook image]
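In hindsight, it's possible to check for this from inside the notebook before ever hitting that error. A minimal sketch, assuming the TF 1.x API:

import tensorflow as tf
from tensorflow.python.client import device_lib

# On a cluster with no configured GPU nodes, this prints False and a CPU-only device list.
print(tf.test.is_gpu_available())
print([d.name for d in device_lib.list_local_devices()])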
So, the next thing I thought was that maybe I could just request an obscene amount of conventional CPU and memory. This wound up spinning for a long time in the UI, even though I saw that k8s generated a scheduling error right away.

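For what it's worth, the underlying scheduling failure shows up right away in kubectl even while the UI keeps spinning. A hypothetical check - the kubeflow namespace and the jupyter-admin pod name here are assumptions based on JupyterHub's usual jupyter-<username> convention:

get_ipython().system('kubectl -n kubeflow get pods')
# The Events section at the bottom of the describe output typically shows the
# reason, e.g. "Insufficient cpu" or "Insufficient memory".
get_ipython().system('kubectl -n kubeflow describe pod jupyter-admin')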
I couldn't figure out how to rescue the specific Jupyter user from this condition. Logging out and logging back in went straight back to waiting for this image to be spun up. Killing the pod didn't help. But, thankfully, Jupyter notebook users only seem to cost additional PVs, so I just spun up a new notebook and was able to run through to completion with 3 CPUs and 8 times my previous memory allocation. This didn't speed things up, so clearly I have to dive into the world of TFJob to try to distribute my training across more of my cluster.