
Kubeflow "Unboxing" - Part 7c - In-Proc Cluster & TFJob Peek

  • Writer: Derek Ferguson
  • Dec 29, 2018
  • 5 min read

OK, it all comes down to this. Our current script takes 90 minutes to perform 3 epochs of training on one CPU. To spread that work across multiple CPUs on our cluster and see whether we can speed it up, we need to:

1) Create a TFJob spec for Kubeflow that will automatically generate and manage the requested amount of training capacity on our cluster

2) Amend our Python TensorFlow code to...

a) Read in information about the pods that Kubeflow has set up for this training

b) Use TensorFlow's distributed APIs to set up communication with the other pods

c) Kick off our processing using the role Kubeflow has assigned to our code's current pod

The instructions at https://github.com/Azure/kubeflow-labs/tree/master/6-tfjob give a pretty good overview of how to configure a TFJob, but rather than manually editing another YAML file, let's try the UI editor that is built into Kubeflow.
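For reference, a hand-written v1alpha2 TFJob spec looks roughly like this (the job name below is a placeholder I made up; the image tag is the one used later in this post, and the container must be named "tensorflow" for the operator to inject TF_CONFIG):

```yaml
apiVersion: kubeflow.org/v1alpha2
kind: TFJob
metadata:
  name: predict-words-training   # hypothetical name
spec:
  tfReplicaSpecs:
    Ps:
      replicas: 1
      template:
        spec:
          containers:
            - name: tensorflow
              image: dotnetderek/tf-keras-latest:v1
    Worker:
      replicas: 2
      template:
        spec:
          containers:
            - name: tensorflow
              image: dotnetderek/tf-keras-latest:v1
```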

So, I go to the main Kubeflow UI and click "Create" in the upper, right-hand corner.

I notice immediately that there is a "Deploy" button at the bottom of the form. So, this gives me some concern -- does this mean that as soon as I fill this form out and hit the button, it is actually going to try to perform a distributed training? I was sort of thinking this would generate a YAML file for us and put it in some sort of storage for us to submit multiple times.

To avoid wasting time, I fill out some junk values and hit "Deploy" to see immediately what it does. It does nothing. No error message. OK, so I hit the back button and look in the browser to see if there are any TFJobs now. No such luck.

I can't resist trying one more thing. So, I go in and create another TFJob. This time, I just give it a name and hit Deploy. It takes me back to the browser and shows my name in there as a TFJob. Running kubectl confirms that this has created a TFJob out on the cluster.

So... good to know that this is going to be creating TFJobs using the v1alpha2 spec -- I saw that referenced in some of the documentation - can't remember the context, but it may come in handy to know later on. Something more important I realized as a part of this whole process is that, if this GUI is going to submit this job right on the spot, I need to fix my code and publish its Docker image to an accessible repository before I do anything with the TFJob itself. (Hoping it will be cool looking in the Docker Hub repo, as I've never published anything to gcr before. Wouldn't be terrible if I have to, but another thing to learn.)

OK, so according to https://github.com/Azure/kubeflow-labs/tree/master/7-distributed-tensorflow, after k8s spins up all the pods for my distributed training based on the TFJob I provide, it is going to pass an environment variable (called TF_CONFIG) into each of those pods telling them about each other and their specific role in the training. So, step #1 in my code changes will be to pretty much copy/paste the code to decipher this environment variable. I'll put it almost at the top of my code -- right after the bit that is responsible for upgrading TF and Keras if running inside a Notebook. In fact, it would probably be pretty slick to do this in a way where it can still run in non-Distributed mode inside the Jupyter notebook -- but I'll come back to that after I get this working in Distributed form.
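As a sketch, that copy/pasted parsing amounts to something like this (the sample value and its pod addresses are made up for illustration; the fallback just lets the snippet run outside a TFJob):

```python
import json
import os

# A TF_CONFIG of the shape Kubeflow injects into each pod.
# The pod addresses here are hypothetical.
SAMPLE_TF_CONFIG = json.dumps({
    "cluster": {
        "ps": ["ps-0:2222"],
        "worker": ["worker-0:2222", "worker-1:2222"],
    },
    "task": {"type": "worker", "index": 0},
})

# Fall back to the sample so this also runs outside a TFJob.
tf_config = json.loads(os.environ.get("TF_CONFIG", SAMPLE_TF_CONFIG))

cluster_spec = tf_config["cluster"]      # host:port lists, keyed by role
job_name = tf_config["task"]["type"]     # "ps", "worker", or "master"
task_index = tf_config["task"]["index"]  # which replica of that role we are
```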

After pasting that code in, I'm switching back to https://www.tensorflow.org/deploy/distributed - specifically the bit at the bottom about the example trainer program. This isn't *exactly* what we want, because it is getting its cluster information passed in on the command line, whereas we're getting it in the TF_CONFIG environment variable from Kubeflow, so we'll skip the bit where it sets up the various FLAG attributes, as we already parsed that out of the JSON in the environment variable above.

I'm already regretting not having made my code backwards-compatible with the Jupyter Notebook. After about an hour of reading, I can see that the crux of this distributed approach will be leveraging TensorFlow's MonitoredTrainingSession, which knows how to coordinate interactions with all the other pods in the training. However, this relies on the TensorFlow session, and the code I want to work uses Keras. There is some difference of opinion on how to get Keras to work in the TensorFlow session, so I think I am best off doing a single-node "cluster" right on my desktop (or within a single running Docker image, more precisely) to sort out the code changes to get to a TF session first, and then add in the distributed pieces.

So, I address the first issue first - wanting to make it backwards-compatible with Jupyter Notebook execution - basically I just wrap the "pip install" commands in some try/except blocks. Not elegant, but it works.
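The wrapper is roughly this (the function name is my own; the actual script just wraps its inline pip calls the same way):

```python
import subprocess
import sys

def pip_upgrade(package):
    """Try to upgrade a package in place; swallow failures so the same
    script still starts where pip can't run (e.g. a locked-down kernel)."""
    try:
        subprocess.check_call(
            [sys.executable, "-m", "pip", "install", "--upgrade", package])
        return True
    except Exception as exc:  # no network, no permissions, etc.
        print("Could not upgrade %s (%s); continuing anyway" % (package, exc))
        return False

# e.g. pip_upgrade("tensorflow"); pip_upgrade("keras")
```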

At this point, I use the presence or absence of the TF_CONFIG environment variable to determine whether we are running on k8s via TFJob or in a stand-alone Docker image. If we get this environment variable, we are running via TFJob, so I harvest all the parameters about what this specific pod is supposed to do from TF_CONFIG, as per the code at the bottom of https://github.com/Azure/kubeflow-labs/tree/master/7-distributed-tensorflow. If we don't have this environment variable, then I know we are running in a standalone Docker image and I set up a single distributed node right within the current Docker image per the code sample at the top of https://www.tensorflow.org/deploy/distributed.
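Sketching that branch (TF 1.x-era APIs; I've added a compat shim purely so the snippet also runs under TF 2.x, which the original code had no need for):

```python
import json
import os
import tensorflow as tf

try:  # shim: the post targets TF 1.x
    tf1 = tf.compat.v1
except AttributeError:
    tf1 = tf

tf_config = os.environ.get("TF_CONFIG")
if tf_config:
    # Running under a TFJob: build the cluster from what Kubeflow gave us.
    config = json.loads(tf_config)
    cluster = tf1.train.ClusterSpec(config["cluster"])
    server = tf1.train.Server(cluster,
                              job_name=config["task"]["type"],
                              task_index=config["task"]["index"])
else:
    # Stand-alone Docker image: a single-process, single-node "cluster".
    server = tf1.train.Server.create_local_server()

print(server.target)  # the grpc address other nodes (or just us) connect to
```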

What comes next is the most experimental bit - telling Keras to use distributed TensorFlow. Part 3 of https://blog.keras.io/keras-as-a-simplified-interface-to-tensorflow-tutorial.html has a suggestion, so we try that - basically just grabbing the session off the server we just created and telling Keras to use that.
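Concretely, the suggestion boils down to something like this (again TF 1.x-era backend calls, with the same compat shim; the local server here stands in for the distributed one):

```python
import tensorflow as tf

try:  # shim so the snippet also runs under TF 2.x
    tf1 = tf.compat.v1
except AttributeError:
    tf1 = tf

# Spin up a single-node server and attach a session to its target...
server = tf1.train.Server.create_local_server()
sess = tf1.Session(target=server.target)

# ...then tell the Keras backend to use that session for everything.
tf1.keras.backend.set_session(sess)
```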

The result? Well, my first attempts at running this with a very simple "docker run -it dotnetderek/tf-keras-latest:v1" caused it to blow up with the simple message "Killed" after < 10 iterations of the first epoch. That's not a great error message, and it occurred at a point when TensorFlow was sort of in the middle of doing its own thing - not my user code - so the only likely explanation would be something in the internals.

Backing out the bit that tells Keras to use the session improved things, but didn't make them perfect. Backing out the code that even spins up the separate server put things completely back to normal. After some reading, memory exhaustion seemed the most likely culprit.

For this, I simply bumped up the memory allocation in Docker Desktop to 8 GB and - to my considerable relief - not only does it run now, it goes about 3 times faster than before!

So, at this point, we have code that should work fine in a properly-sized Docker container. Let's keep this in mind as we try it on an actual distributed cluster. The code so far is committed to GitHub as the 3rd commit of https://github.com/JavaDerek/KubeflowExperiments/blob/master/PredictWords.py.

 
 
 




©2018 by Machine Learning for Non-Mathematicians. Proudly created with Wix.com
