Building a Kubeflow Pipelines demo - Part Two
- Derek Ferguson
- Mar 17, 2019
- 5 min read
So, the key to adding a second step to the pipeline I started in my previous post will be reading the command-line arguments that pass in the object names and then pulling those objects back out of Minio.
Reading in command-line arguments appears to be a matter of importing the sys module and using its argv list. I guess I haven't had call to do that before - seems easy enough.
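For anyone following along, the whole trick is something like this (the variable names here are just illustrative - not necessarily what I called them):

```python
import sys

# sys.argv[0] is the script name itself; the object names the pipeline
# passes in start at index 1.
train_images_object = sys.argv[1]
train_labels_object = sys.argv[2]
```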
Looking at the Minio docs, it appears that the recommendation is to read the data out of the buckets and into a file. This is just as well, as I had previously written a little code for this step to read the numpy arrays back out of files - so fine... we'll use the second Docker image's file system as a bit of a pass-through for this data. Maybe not quite as efficient as going straight to and from numpy arrays and Minio but... not my primary focus now by a long shot!
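Roughly, that pass-through looks like this (the Minio endpoint, credentials and bucket name below are placeholders for whatever your MiniKF setup uses):

```python
import numpy as np
from minio import Minio

# Connect to the Minio instance running inside MiniKF
# (endpoint and credentials here are placeholders).
client = Minio("minio-service:9000",
               access_key="minio",
               secret_key="minio123",
               secure=False)

# Pull the object down into a local file, then load it back into numpy.
client.fget_object("fashionmnist", train_images_object, "/tmp/train_images.npy")
train_images = np.load("/tmp/train_images.npy")
```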
So, the next step before I test this will be to actually add this Docker image in as the second step in my pipeline. As I start this, I realize I was incorrectly calling the step #1 variable in my pipeline Python "preprocess" before - so I change its name to "download", to match the label in the pipeline. Now, my second step will be "preprocess."
I change my build.sh script for this Docker image to upload the full Docker image to Docker Hub and re-run it. This leads me to spot an error where I had forgotten to change the name of my script when I copied this verbiage from the previous step. I fix that and all is well.
Now I copy the second step from the article here and modify it to match my needs. Comparing the inputs here to the outputs in the previous step, I realize that I had another error where I used the same output file name for 2 of my variables, producing one inaccurate bucket name output in the previous step. So, I fix this.
This leads me to another cause for concern. I notice that the order and names of the variables in Kubeflow's output panel don't exactly match what I specified in my script. After thinking about this for a moment, though, I realize that in the subsequent bits of the pipeline, I never refer to the output variables by their sequence position and I *do* refer to them by their step-of-origin name... this must be why dashes are forbidden in these names - Argo must use the dash to differentiate between the step name and the variable name.
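To make that concrete, here is roughly what the wiring between the two steps looks like in the pipeline Python - just a sketch, with my Docker Hub image names and output file paths standing in as placeholders:

```python
import kfp.dsl as dsl

@dsl.pipeline(name="Fashion MNIST", description="Download and preprocess the data")
def fashion_mnist_pipeline():
    # Step 1: download the raw data into Minio and emit the Minio object
    # names as outputs (the file paths here are placeholders).
    download = dsl.ContainerOp(
        name="download",
        image="myrepo/download:latest",
        file_outputs={
            "train_images": "/train_images.txt",
            "train_labels": "/train_labels.txt",
            "test_images": "/test_images.txt",
            "test_labels": "/test_labels.txt",
        },
    )

    # Step 2: pass those object names to the preprocess image as
    # command-line arguments, referring to them by step-of-origin name.
    preprocess = dsl.ContainerOp(
        name="preprocess",
        image="myrepo/preprocess:latest",
        arguments=[
            download.outputs["train_images"],
            download.outputs["train_labels"],
            download.outputs["test_images"],
            download.outputs["test_labels"],
        ],
    )
```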
Then I notice a truly bizarre bug I made last night (Sunday Morning Derek is not altogether thrilled with Saturday Night Derek at the moment) - all of the variables in the code branch of step 1 that runs if we are starting on a later step are incorrect. All I really had to do was copy these from the bit of code 10 lines above... so... not sure what was going on there, but it is fixed now. :-)
As I go down this path, I realize I'm getting into an "all or nothing" testing model for this Docker image. Either I code this entire Python step to completion and test it in the pipeline, or I add some code at the start that falls back to default values if the command-line arguments haven't been passed in -- which is what I'll use as a signal that this isn't running as part of the pipeline. So, I do that, re-push to Docker Hub and give it a try.
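The defaulting logic is nothing fancy - roughly this, where the fallback object names are placeholders:

```python
import sys

# If the expected arguments are missing, we are running outside the
# pipeline, so fall back to known object names (placeholders here).
if len(sys.argv) > 4:
    train_images_object = sys.argv[1]
    train_labels_object = sys.argv[2]
    test_images_object = sys.argv[3]
    test_labels_object = sys.argv[4]
else:
    train_images_object = "train_images.npy"
    train_labels_object = "train_labels.npy"
    test_images_object = "test_images.npy"
    test_labels_object = "test_labels.npy"
```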
So, at this point I realize I made a mistake uploading the Docker image and have actually mapped preprocess back to my download image. This is exacerbated by the fact that this additional attempt to update the image causes me to run out of disk space. Trying to delete the data folder doesn't address the issue, so I decide to restore to the previous snapshot image. I figure I'll need this ability for my demos anyhow - so now is as good a time as any!
I roll back to the initial image and it auto-starts. I do a "vagrant ssh" and verify I'm able to log in. Navigating to 10.10.10.10 from my laptop browser takes me to the Kubeflow UI. Looking in the Pipelines shows no sign of the pipeline upon which I've been working. So, I feel reset. I run my Minio startup command and... it appears to start up.
I pull my download image and run it. It is back to working - so whatever disk space I'd exhausted and couldn't reclaim via a simple "rm -rf" in the data directory has been reclaimed.
Now I pull my new preprocess image and run that. Hmm... specified key does not exist. What did I do wrong now? The good news is, though, that I definitely have a good formula for restoring from a snapshot. I realize I need to take another snapshot pretty soon. Maybe I'll restore from scratch, start minio and upload my pipeline with 2 steps and that can be my second snapshot. But first, I have to sort out this error.
And it appears to be an easy one. In crafting my defaults for running from step 2, I assigned values for the normalized data - not the pre-normalized data. In other words, I gave the keys to which this step should ultimately save, not the ones from which it should pull. Recode and rerun... boom -- success! :-)
Extending this to the other 3 arrays is equally easy. And then, after all of that, we add 2 lines of code to normalize this data between 0 and 1, and then cookie-cutter the code from step 1 to stuff our normalized data back into buckets in S3. At this point, I realize that only the images needed to be normalized, not the labels (what would that even mean?). I could go back and only pass 2 data points into this step, but given my rate of coding errors so far, this seems likely to be unproductive... I'll leave the input to this step as all 4 parameters, and only adjust the output to be 2. :-)
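The normalization really is just a couple of lines - something like this, with the upload mirroring step 1 (bucket and object names again are placeholders):

```python
import numpy as np

# Fashion MNIST pixel values run 0-255, so dividing by 255 scales them to [0, 1].
train_images = train_images / 255.0
test_images = test_images / 255.0

# Save the normalized arrays to files and push them back into Minio,
# cookie-cuttered from step 1.
np.save("/tmp/train_images_norm.npy", train_images)
client.fput_object("fashionmnist", "train_images_norm.npy", "/tmp/train_images_norm.npy")
np.save("/tmp/test_images_norm.npy", test_images)
client.fput_object("fashionmnist", "test_images_norm.npy", "/tmp/test_images_norm.npy")
```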
At this point, I have run out of space yet again. Rather than continue futzing with the file system and restarting snapshots, I make a change to the download script to delete the fashionmnist bucket whenever it runs - with error handling in case it doesn't yet exist. And then the command to create the bucket will run after and will NOT have error handling - because we might as well abort if we can't create the bucket at this point.
And now I discover that the bucket has to be empty before we can remove it. :-( I refactor the code multiple times to get a nice, clean startup process for the download step every time, and yet it still seems that Minio is running out of space. It must be something in the logging that Minio is doing. Well, this step should help greatly, at least... I reset MiniKF to the original snapshot once again and retry it. Success! :-)
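For the record, the cleanup pattern I landed on looks roughly like this (bucket name is a placeholder, and the broad except is deliberate since the bucket may not exist on a first run):

```python
# A bucket can only be removed once it is empty, so delete its objects
# first, then the bucket itself; ignore errors if it isn't there yet.
try:
    for obj in client.list_objects("fashionmnist", recursive=True):
        client.remove_object("fashionmnist", obj.object_name)
    client.remove_bucket("fashionmnist")
except Exception:
    pass

# Re-create the bucket with no error handling - if this fails,
# we might as well abort right here.
client.make_bucket("fashionmnist")
```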
So, the final step now is to get the pipeline working. I think I've already gotten it written, but now is the time to try it out. As with the first step, I compile it into a .tar.gz file and upload it. Looks like it recognizes it, at least! :-)
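Compiling is a one-liner (the pipeline function name here is just my placeholder from the sketch above):

```python
import kfp.compiler as compiler

# Package the pipeline function into a .tar.gz the Kubeflow Pipelines UI can ingest.
compiler.Compiler().compile(fashion_mnist_pipeline, "fashion_mnist_pipeline.tar.gz")
```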

So now to try a run! To my delight, both steps run. And then, when I check the output of step #2 (preprocess), I realize that it is my messed up image from earlier, where I basically cloned "download". So, I have to fix the pipeline to point to "latest", reupload and retry. And with that, step 2 is done... it is truly a thing of beauty! :-)
