Kubeflow "Unboxing" - Part 7d - Multi-Pod Cluster with TFJob
- Derek Ferguson
- Dec 31, 2018
- 6 min read
In order to get from here to a multi-pod cluster, the first thing we need to do is actually publish the Docker image we created in the previous instalment. "docker push dotnetderek/tf-keras-latest:v1" seems to get the job done, so we can proceed to try that TFJob UI again. But first, let's think about what the absolute simplest cluster we can create this way might be.
1) Hit "Create"
2) Name it "TryASinglePodCluster"
3) Put it in the "kubeflow" namespace
4) Add a Replica Type
5) Make it a "Chief" (which can also be a worker, we are told at https://github.com/Azure/kubeflow-labs/tree/master/7-distributed-tensorflow)
6) Since we're running our Python script from a Docker ENTRYPOINT, there's no need I can see for a Run command or arguments, so we'll leave that blank
7) For a first try, we'll leave everything else blank, too -- let's just see what happens and then adjust
8) Click "Deploy"
Nothing happens... Second click on "Deploy" -- still nothing.
The browser console shows an error about not being able to parse the JSON, but there's no sign of this in the web UI.

Sadly, the network monitor reveals that the server is treating this as a Bad Request (HTTP 400).

Let's try giving a command of "python" and an argument of our script name ("/usr/src/app/PredictWords.py", per the Dockerfile). We can change the Dockerfile and/or TFJobs accordingly if this will at least go through.
Adding in some guessed values for the limits section changes the error from 400 to 500.

So, at this point, I'm thinking: let's try crafting and submitting a TFJob from scratch and then work our way back to this UI later, if desirable.
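Before diving into the tooling, it's worth keeping in mind roughly what a TFJob boils down to under the hood. Here's a sketch, written as a Python dict rather than the real YAML/jsonnet just to keep things readable; the apiVersion and the "tensorflow" container name are my assumptions about the tf-operator version in this install, not something I've verified against the cluster.

simplest_tfjob = {
    "apiVersion": "kubeflow.org/v1beta1",   # assumption - this varies by Kubeflow release
    "kind": "TFJob",
    "metadata": {"name": "my-tf-job", "namespace": "kubeflow"},
    "spec": {
        "tfReplicaSpecs": {
            "Worker": {
                "replicas": 1,
                "template": {
                    "spec": {
                        "containers": [{
                            "name": "tensorflow",   # assumption - tf-operator looks for this name
                            "image": "docker.io/dotnetderek/tf-keras-latest:v1",
                        }]
                    }
                },
            }
        }
    },
}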
According to https://www.kubeflow.org/docs/guides/components/tftraining/, you get a ksonnet prototype to help create TFJobs manually, so let's give that a whirl.
Honestly, despite all this blogging, I couldn't remember at first exactly which directory I was supposed to run ks in. Finally, I found the directory called "new-kubeflow" that I had created during my more recent install, and running "ks generate tf-job-simple FirstRun --name=FirstRun" generated a jsonnet component, which I'm told is set up to run a sample job already out there on the Internet.
Scrolling down in the TF Training documentation, there is a list of 4 things we need to change to make this run our job, so let's follow that lead. Looking at the file, it really does look like it wants to be the thing that kicks off the Python script (rather than relying on the Docker ENTRYPOINT), so let's rebuild the Docker image and republish it as v2. After republishing, we edit the jsonnet file and change the local replica to point to "docker.io/dotnetderek/tf-keras-latest:v2". Down in the worker job, we'll just change it to launch our Python script and get rid of all the parameters.

Notice also that I haven't specified the path for PredictWords.py, but instead changed the workingDir to be that directory. Hopefully that works, as I made identical changes to the next node, which is for the parameter server.
So, at this point, it should be ready to run. I bring up the TFJob dashboard - which should show the job running if this takes successfully - then run the command shown below and get the response shown below.

So, as I suspected, this gives us the feedback that the TFJobs UI was missing... I shouldn't have used upper-case in the name for this (Kubernetes resource names have to be lowercase). So, I regenerate as "secondrun", make the same edits and run using the same command (except with the updated name).
And *then* it appears in the TFJobs console in Kubeflow. Rock on!

After a little while, the first item disappears - that's the Parameter Server. The second item - the worker - indicates a failure. Clicking on the logs doesn't work for either of these and the web console doesn't indicate any errors or any web requests generated by clicking on them. I'd think they weren't links, were it not for the fact that my cursor changes when I hover over them.
So, I turn to "kubectl describe pod secondrun-worker-0 -n kubeflow". It doesn't give me a ton to go on, but it *does* show me that the proper environment variable (TF_CONFIG) was passed, so that's some good news. We're making progress!
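For context, the value tf-operator injects into TF_CONFIG looks roughly like this (the addresses are made up for illustration, but the cluster/task shape is what my script has to deal with):

{
    "cluster": {
        "ps": ["secondrun-ps-0:2222"],
        "worker": ["secondrun-worker-0:2222"]
    },
    "task": {"type": "worker", "index": 0}
}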

So, after some debugging (see my separate blog post for details on this), I figured out that the TensorFlow image is looking for a CPU that supports SSE 4.2. I might have been able to compile a build that doesn't require it, but in this case, the computer involved is 10 years old, so I'm not sure it would contribute much to my calculations anyhow. I elected to remove that node from the cluster and re-run the deployment.

OK, so it looks like I have to add a "pip install simplejson" to the mix in order to get this running. It didn't turn up when I ran it on my desktop because, if you recall, this is the branch of code that only runs if it gets passed the TF_CONFIG environment variable and has to parse its execution parameters out of it as JSON.
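For the curious, once the dust settles over the next couple of paragraphs, that branch ends up looking roughly like the sketch below - the variable names are my best recollection rather than an exact copy of the script.

import os
import simplejson as json   # the dependency the Docker image was missing

# TF_CONFIG arrives as a JSON string with the cluster/task shape shown earlier
tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
cluster_spec = tf_config.get("cluster", {})
task_type = tf_config.get("task", {}).get("type", "worker")
task_id = tf_config.get("task", {}).get("index", 0)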
So, after making that code change and testing it on my local machine, I push the image to Docker Hub (v4) and the code to GitHub. Then I update secondrun.jsonnet to point to v4 and re-submit.
It turns out I forgot to import simplejson as "json" in my Python code. So it's time to try the whole thing over again: delete the job, update secondrun.jsonnet to v5 of the image and re-submit.
And it appears that the fifth time was the charm. At this point, both the parameter server and the worker show as running and pulling up logs on "secondrun-worker-0" shows proper output.

And, after perhaps 30 minutes, I am informed that it has completed. The web console shows "Succeeded" and I can view the fake Shakespearean text in the logs!
So now, I'm going to try giving it another worker and seeing what it does with this. First, delete "secondrun" from TFJobs. Then, go into secondrun.jsonnet and increase the replicas of the worker to 2.

Then resubmit with "ks apply default -c secondrun".
The web UI immediately reflects 1 parameter server and 2 workers with a status of Pending. As expected, this quickly changes to Running, because the Docker image was already pulled previously. I note the time at which I started it: 7:49 in the evening. Doing some kubectl describes, I see that both workers have been assigned to the same machine... well, I guess that's OK - it has multiple cores. Pulling logs on both of them... the output is indistinguishable. I can only hope that Keras has had the sense internally to divide the work between the nodes, because there is nothing visible at this point to make me think these 2 TF workers are doing anything other than both running exactly the same workload.
It ended at 9:03 PM - so... definitely not a time saver of any note. Just for giggles, I'm going to turn the number of replicas up to 8, just to see how they get assigned out and what they do. I started at 9:11 PM, and 7 of the 8 pods got assigned to my Mac Mini from 2011. Only 1 got assigned to the instance of Linux I have running on my old MacBook from 2012 (which was a top-of-the-line laptop at the time). Not sure what happened, but 8 seemed to result in all of the pods disappearing and the K8s networking going out on my Mini, requiring it to be restarted.
Retrying with 4 workers overnight turned out to have demonstrably zero positive impact on the execution time. In fact, it took 90 minutes to run, whereas top speed on my desktop was 30 minutes. So... even making allowances for older hardware, the blog post that said Keras would work with TensorFlow Distributed just by connecting the session was clearly misleading. We must also have to tag the bits of code that are supposed to be distributed.
So, to wrap this up, it turned out that the final secret ingredient was putting all of my code after the cluster setup inside a very simple execution bracket...
# Pin ops to this worker; variables automatically land on the parameter server(s)
with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:%d" % task_id,
        cluster=cluster_spec)):
    # ... everything after the cluster setup (model + training) goes in here ...
That code works with the cluster configuration code we added earlier to tell the code running in each pod which role it needs to play, and then lets it get on with it.
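To put the pieces together, here's a sketch of how that cluster configuration code and the device setter relate. It reuses the cluster_spec/task_type/task_id names from the TF_CONFIG sketch earlier; it's an approximation of the idea, not a copy of my actual script.

import tensorflow as tf

# cluster_spec, task_type and task_id come from the TF_CONFIG parsing sketched earlier.
# Describe the cluster and start this pod's server.
cluster = tf.train.ClusterSpec(cluster_spec)
server = tf.train.Server(cluster, job_name=task_type, task_index=task_id)

if task_type == "ps":
    server.join()   # parameter servers just sit here and serve variables to the workers
else:
    # workers fall into the tf.device(...) block above and run their
    # sessions against server.target
    pass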
Unfortunately, once I got this running, I was presented with an error from TensorFlow that Eager Execution is not supported in a distributed environment. I guess that makes sense, since distribution requires a graph and Eager Execution has no graph. But, in terms of using my current code base as the vehicle for exercising Kubeflow's TFJob abilities, I think this wraps it up for that. Time to move on to the next piece! :-)