
Kubeflow "Unboxing" - Part 7b - TFJob: Jupyter to Docker

  • Writer: Derek Ferguson
  • Dec 27, 2018
  • 2 min read

So, Distributed TensorFlow is the technology we're going to use to spread our training work across the cluster. Background reading on this is here -- https://www.tensorflow.org/deploy/distributed.

TFJob is the Kubeflow construct that is going to help us run Distributed TensorFlow. Background reading on this:

* https://github.com/Azure/kubeflow-labs/tree/master/6-tfjob (first half is non-Azure-specific)

* https://github.com/Azure/kubeflow-labs/tree/master/7-distributed-tensorflow

* https://www.kubeflow.org/docs/guides/components/tftraining/

Reading through this, the first thing you realize is that you've got to bundle your TensorFlow script up as a self-executing Docker image in order for this to be usable. So, let's do that first. We won't make any changes to our script in order to make it work in a distributed manner yet - just take it "as is" and load it into a Docker image.

My first instinct (a bad one) was to grab the same image we used for the Jupyter notebook. But if you stop and think about that for a second, it's clearly overkill - we don't need the notebook itself, just TensorFlow and Keras. Thinking about it a bit further, the next question is - why not grab an image that already has the bits in it (TF 1.13 and Keras 2.2.4-tf) that we had to alter our script to add when running in the Notebook?

I couldn't find anything that immediately fit that spec, so I created a Dockerfile that does the job, based on the tensorflow/tensorflow:nightly-py3 image that's available directly off Docker Hub. I've put the code for it up on the GitHub repo, but basically it just takes the base image above, installs the latest version of Keras onto it, then drops our Python script into the filesystem and runs it as the image's primary entry point.
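For anyone who wants a feel for it without clicking through, a minimal sketch along these lines does the job (the real Dockerfile is in the repo - the script name here is just a placeholder):

# Start from the nightly TensorFlow build with Python 3
FROM tensorflow/tensorflow:nightly-py3

# Put the latest standalone Keras on top of it
RUN pip install --upgrade keras

# Drop our training script into the image and make it the entry point
COPY train.py /train.py
ENTRYPOINT ["python", "/train.py"]

From there it's the usual docker build and docker push to whatever registry your cluster can pull from.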

The Python code itself doesn't need any changes at all to run in Docker - *except* that the two lines I had at the start to upgrade TF and Keras within the Jupyter Notebook don't work in stand-alone Python. So, I put a "try/except" around them, which lets us keep using the same code in the notebook if we want to, without blowing it up on the desktop or in Docker.
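The exact upgrade lines are in the repo, but the pattern is roughly this (the pip command shown here is illustrative, not the literal line from my script):

try:
    # Inside a Jupyter notebook, get_ipython() exists, so the upgrade runs.
    get_ipython().system('pip install --upgrade tensorflow keras')
except Exception:
    # In plain Python or inside Docker it doesn't exist, so this raises
    # and we just carry on with the versions already installed.
    pass

Either way, the rest of the script runs unmodified.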

You can check it out for yourself at... https://github.com/JavaDerek/KubeflowExperiments

The next step is to adapt the code to run in a distributed environment.

 
 
 
