Building a Kubeflow Pipelines demo - Part Three
- Derek Ferguson
- Mar 17, 2019
- 4 min read
So, this step is really the meat of the process -- the actual training. Having said that, I am (perhaps overly optimistically) hoping that this will be the smoothest implementation experience yet. Step 1 - I had to learn how to create a Kubeflow pipeline and steps within it... decent learning curve! Step 2 - I had to understand how to connect the steps together... easier, but still an undertaking. Now, knowing both of those parts, I really just need to cookie-cutter Step 2 and change my Python code and a little bit of the pipeline connection.
At this point, I actually decide to start step 3 by running a primitive version of the Python code *straight* from the pipeline! Basically, I add a step 3 to the pipeline that spins up a "train" image taking no inputs and no outputs, then I create a Python script that spins up a blank, untrained Keras model, a Dockerfile that runs that Python script on a TensorFlow image and a build script that puts it all together.
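For anyone following along, here is roughly what those two pieces look like - a sketch only, since the image tag, pipeline name and model shape below are placeholders of mine rather than the exact demo code. First, the bare step added to the pipeline:

```python
# pipeline.py -- adding a bare "train" step with no inputs or outputs yet
import kfp.dsl as dsl

@dsl.pipeline(name='demo-pipeline', description='Part 3: add a train step')
def demo_pipeline():
    # just prove the container spins up and runs the script
    train = dsl.ContainerOp(
        name='train',
        image='train:latest',  # placeholder tag for the image built below
    )
```

And the minimal script the "train" image runs (28x28 inputs and 10 classes are assumptions based on the image data from the earlier parts of this series):

```python
# train.py -- spins up a blank, untrained Keras model on the TensorFlow image
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.summary()  # visible in the step's logs, proving the script ran
```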
I discover at this point that my top-level build script was not doing what I thought it was doing. It really needs to go into each individual subdirectory to kick off those Docker builds, so I refactor that a little bit. Then I rebuild, upload my new pipeline file to Kubeflow pipelines and... boom! :-)
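The refactored idea, rendered here as a Python sketch (the real script may well be plain shell, and the subdirectory names are guesses):

```python
# build_all.py -- kick off each Docker build from inside its own subdirectory
import subprocess
from pathlib import Path

STEPS = ['download', 'preprocess', 'train']  # assumed: one directory per step

for step in STEPS:
    step_dir = Path(__file__).parent / step
    # cwd=step_dir is the fix: each build now runs with that step's own
    # Dockerfile and build context, instead of the repo root's
    subprocess.run(['docker', 'build', '-t', f'{step}:latest', '.'],
                   cwd=step_dir, check=True)
```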

So, this is really cool, because it answers a question I had before and wasn't sure how to get answered. In the more advanced pipeline samples, there are lines connecting very early steps with very late steps, and I didn't quite understand how control flow could move like that. This reminds me that we aren't looking so much at control flow here as we are at data flow. And, as it happens, I'm very happy now that I chose to refactor the outputs from my preprocess step, because it makes it much clearer to me what a diagram like this means.
Essentially, I start by downloading 2 pairs of data. The first pair is a training image set and its accompanying labels. The second pair is a test image set and its accompanying labels. 2 pairs of 2 items = 4 items. Now, in preprocess, we take the image set from each pair and normalize it, but we don't touch the labels - as explained in Part 2 of this series. So, for that reason, in step 3 ("train") I decide to pull the labels directly from the first step.
So, that is what this image is graphically showing us: the training and test labels flowing directly from "download" to "train", while the images themselves (training and test) must first pass through preprocessing. Slick, isn't it? This will make for a great talking point in my upcoming demos! :-) But, does it work at this point... or rather, does it at least execute without exploding? Why yes... yes, it does! :-)
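In pipeline code, that picture comes out as plain argument passing: the train op consumes the label outputs of download directly, and the image outputs of preprocess. The output names, paths and image tags below are my placeholders, not the exact code:

```python
# pipeline.py -- data flow, not control flow (names and paths are placeholders)
import kfp.dsl as dsl

@dsl.pipeline(name='demo-pipeline', description='labels skip preprocessing')
def demo_pipeline():
    download = dsl.ContainerOp(
        name='download',
        image='download:latest',
        file_outputs={
            'train_images': '/out/train_images.txt',
            'train_labels': '/out/train_labels.txt',
            'test_images': '/out/test_images.txt',
            'test_labels': '/out/test_labels.txt',
        })

    preprocess = dsl.ContainerOp(
        name='preprocess',
        image='preprocess:latest',
        arguments=[download.outputs['train_images'],
                   download.outputs['test_images']],
        file_outputs={
            'train_images': '/out/train_images.txt',
            'test_images': '/out/test_images.txt',
        })

    # the labels flow straight from download to train -- these are the
    # long lines in the pipeline graph connecting early and late steps
    dsl.ContainerOp(
        name='train',
        image='train:latest',
        arguments=[preprocess.outputs['train_images'],
                   download.outputs['train_labels'],
                   preprocess.outputs['test_images'],
                   download.outputs['test_labels']])
```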

So, I hack on it a bit more, adding in additional code for the training. One of the first things I add is code to load in the training images and labels, but I accidentally add it *after* the code that tries to use it. This answers another question for me: what does it look like in the pipeline when something blows up?
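For the record, the corrected order looks something like this - a sketch assuming the arrays land on local disk as .npy files (the real step gets its paths handed over by the pipeline):

```python
import numpy as np
from tensorflow import keras

# the fix: these loads originally sat *below* model.fit, which blew up
train_images = np.load('/data/train_images.npy')  # placeholder paths
train_labels = np.load('/data/train_labels.npy')

model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(train_images, train_labels, epochs=5)
```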

So, I re-code, re-upload and re-try. Something uglier happens this time... it gets to the end of one of the steps and just hangs -- no explanation given. Wouldn't be happy if this happened in a demo. In fact, it got worse after I wrote that sentence - my entire MacBook froze up and had to be cold re-started. I can only hope that the KF Pipelines behavior was a symptom of some OS-level issue on the Mac and not the cause!

So - just to be safe - I restored to the most recent snapshot of my MiniKF virtual machine (Minio running, but none of my pipelines yet) and retry - expecting this will be slow, because it will need to re-download all the Docker images. And yes, it is slow, but it works just fine. So... I guess, now I need to stuff the model in Minio so I can pass it to a separate Evaluation step.
I've not been down this path before, so I'll use the code shown in this article. This is getting a little on the long side in terms of execution time, so I will try running this from the command line inside my MiniKF virtual host going forward - until I can circle back and figure out how to re-run from a single step forward via the pipeline UI.
Side note: one pleasant surprise at this point is that my terminal windows both recalled their state when I spawned them with "terminal" from the macOS Finder. Rock on!
And saving the model to disk works on the first try, so let me add the code to push it into Minio and save the object name into an output file, so we can pass it to the next step. And that also works on the first try. So, I will modify the pipeline to pass that as an output file parameter and retry from inside Kubeflow, end-to-end!
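Here's the shape of that code. The Minio endpoint, credentials and bucket name below are purely illustrative - use whatever your MiniKF instance is configured with:

```python
# end of train.py -- save the model, push it to Minio, record the object name
from minio import Minio
from tensorflow import keras

# stand-in for the freshly trained model from the code above
model = keras.Sequential([keras.layers.Dense(10, input_shape=(784,))])
model.save('/tmp/model.h5')

# endpoint and credentials here are illustrative placeholders
client = Minio('minio-service.kubeflow:9000',
               access_key='minio', secret_key='minio123', secure=False)
if not client.bucket_exists('models'):
    client.make_bucket('models')
object_name = 'model.h5'
client.fput_object('models', object_name, '/tmp/model.h5')

# write the object name where the pipeline's file_outputs will look for it,
# so the evaluation step can receive it as an input parameter
with open('/out/model_name.txt', 'w') as f:
    f.write(object_name)
```

On the pipeline side, the matching change is a `file_outputs={'model_name': '/out/model_name.txt'}` entry on the train op, whose output can then be fed into the evaluation step's arguments.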
And, once again - everything gets stuck at the end of my preprocess step - the output shows clearly that it hit the end of the script, but... no close out. I notice that there is a blank line at the end of my Python script so... maybe that is in some way preventing Docker and/or KF from recognizing that the script is hitting the end? After waiting a couple of minutes - it ends of its own accord. So... something odd going on there, but not clear exactly what.
In any event, step 3 is finished! :-)