Building a Kubeflow Pipelines demo - Part One
- Derek Ferguson
- Mar 16, 2019
- 5 min read
So, looking at my calendar, I notice two things. First, it has been quite a while since I've blogged anything. Second, I have 6 Kubeflow demos coming up of one sort or another. My laptop doesn't have a GPU, all of the existing Pipeline demos for Kubeflow (because, let's face it, the Pipelines are the eye candy) use public cloud, and I don't trust Internet connections during demos. So, all signs point towards me needing to create a demo from scratch.
I will head into this using these two articles as my starting point...
https://towardsdatascience.com/how-to-create-and-deploy-a-kubeflow-machine-learning-pipeline-part-1-efea7a4b650f
https://www.tensorflow.org/tutorials/keras/basic_classification
The idea will be to create a pipeline that goes from scratch to a Fashion MNIST prediction service all running locally on my notebook in real time.
Code will be kept at https://github.com/JavaDerek/FashionMnistKF.
So, I see that every phase in a Kubeflow pipeline is a Docker image. Fine... so I start by creating a folder for my first Docker image and adding the Python code needed to download the Fashion MNIST data. Now I need to figure out how to pass it to the next step in the pipeline -- the preprocessor. I see from the first article linked above that files can be passed directly, but is that the only thing I can pass? What about 4 very large arrays, since that is what I have now?
Looking at https://www.kubeflow.org/docs/pipelines/build-component/, I see right away that I have not followed standard naming conventions. I called my first step "one". Eek -- let's rename that to "download".
So, I take the code to download the Fashion MNIST data and put it in one Docker image, persisting each of the 4 arrays to NumPy files at the end. Now, my intention is to pass these files to the next step, because I see that "file_outputs" is a mechanism available in the pipeline. After creating my second Docker image, to read in these files and normalize their data, I realize I have misunderstood what this parameter is. The underlying Argo workflow doesn't provide any sort of standard persistence mechanism for passing around files; "file_outputs" is just a way to read data out of files and pass it as command-line parameters to the Docker images of subsequent steps. This won't work for me, because I have binary data, and it is 60+ MB in many cases. :-(
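In rough outline, the download step boils down to something like the sketch below. The exact filenames are my own placeholders here, and I'm assuming the standard tf.keras loader from the TensorFlow tutorial linked above:

```python
import numpy as np
from tensorflow import keras

# Pull the Fashion MNIST data set via Keras (as in the TF tutorial).
(train_images, train_labels), (test_images, test_labels) = \
    keras.datasets.fashion_mnist.load_data()

# Persist each of the 4 arrays to a NumPy file for the next step.
# Filenames are illustrative, not necessarily what's in the repo.
np.save('train_images.npy', train_images)
np.save('train_labels.npy', train_labels)
np.save('test_images.npy', test_images)
np.save('test_labels.npy', test_labels)
```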
So, after thinking for a bit, I decide to stand up a Minio container. First, I run "vagrant ssh" to get into the VirtualBox VM where the MiniKF cluster is running. Next, I follow the directions here and here. "docker pull minio/minio" runs fine. And then "docker run -p 9000:9000 --name minio1 -e "MINIO_ACCESS_KEY=AKIAIOSFODNN7EXAMPLE" -e "MINIO_SECRET_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY" -v /mnt/data:/data -v /mnt/config:/root/.minio minio/minio server /data &". The only issue here is that the base instructions don't suggest putting it in the background, so it blocks my terminal; I kill it, remove the container and rerun with the ampersand at the end, as shown.
And now, I realize... something. I'm running Minio in my MiniKF cluster, but trying to test my code from my desktop. Port 9000 isn't exposed from the MiniKF cluster, so - of course - any attempt to connect from my desktop to the MiniKF cluster fails. It seems time to start pushing these images to Docker Hub, so I can grab them inside the virtual machine and run them there to test... I'll need that later for the actual demo, anyhow. So, with that, I update my build.sh scripts to do the tagging and pushing as well.
Or... maybe not. After pushing up to Docker Hub, pulling down inside my MiniKF virtual machine and trying there... this doesn't seem to work either. As an interesting side discovery, I learn here that the Minio client doesn't attempt a connection when instantiated, only when you try to perform some operation. This makes sense, but it means that my earlier celebration at having "made a connection", just because I had constructed the client, was premature.
At this point, I have my forehead-palming moment: this is all running in its own Docker container, so "localhost" is not pointing to my Minio instance; it is pointing right back at the container where my own code is running. I'll try the container's assigned IP for now, realizing this will be a giant "to do" to hook up a proper K8S service later.
It's disappointing when this blows up again, but I notice that the error is different -- it's complaining about SSL. I had just removed the parameter requesting a secure connection, so let me try putting that back.
After another 30 minutes of pulling my hair out, I come to realize that "secure=True" is the default for the Minio Python client, but the Minio Docker image is not ready to serve SSL immediately out of the box (that would need cert setup and such). So, the final essential fix is to explicitly set "secure=False" in the constructor.
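So the client construction ends up looking something like this. The endpoint IP is just a placeholder for whatever Docker assigned to the Minio container, and the credentials are the demo values from the docker run command above:

```python
from minio import Minio

# Placeholder: the IP Docker assigned to the Minio container.
# Note that constructing the client does NOT open a connection;
# errors only surface on the first real operation.
client = Minio('172.17.0.2:9000',
               access_key='AKIAIOSFODNN7EXAMPLE',
               secret_key='wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY',
               secure=False)  # the Minio image isn't serving TLS out of the box
```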
Now that Minio is up and running, I proceed to save my downloaded NumPy arrays of Fashion MNIST data into it. The only discovery here is that when you save a NumPy array, np.save automatically appends a ".npy" extension unless the filename already ends with one, so the operation that puts the file into Minio afterwards needs to include this extension.
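Concretely, continuing with the client from the sketch above (the bucket and object names here are just illustrative):

```python
import numpy as np

# np.save appends '.npy' unless the filename already ends with it...
# (train_images comes from the download sketch earlier)
np.save('train_images', train_images)   # actually writes train_images.npy

# ...so the upload has to reference the file, and name the object,
# with the extension included.
if not client.bucket_exists('fashionmnist'):
    client.make_bucket('fashionmnist')
client.fput_object('fashionmnist', 'train_images.npy', 'train_images.npy')
```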
It occurs to me at this point that I should probably be generating random names for these objects, to allow multiple copies of this pipeline to run concurrently. However, I will circle back to that once this main demo is running end-to-end. Too many further unknowns to spend time on this very well-known problem and solution right now.
For now, I add code to the Python script to write the hard-coded object names into each of 4 small output files, and I update the file kfp_fashion_mnist.py with the Python DSL for the first step in my proposed workflow. Following these instructions, I install the pipeline SDK and try to compile my pipeline.
I immediately receive an error telling me that only letters, numbers and dashes are permitted. Upon closer inspection, I see that this is referring to my choice of names for the file output variables for this step - which included underscores. I replace those with simple camel-casing to distinguish the words within the name, recompile and -- success!
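For the record, the DSL for this first step ends up looking roughly like the sketch below. The image tag and the in-container file paths are placeholders of mine; the camel-cased file_outputs keys are the part the compiler insisted on:

```python
import kfp.dsl as dsl
import kfp.compiler as compiler


@dsl.pipeline(
    name='Fashion MNIST',
    description='Fashion MNIST pipeline, running entirely locally'
)
def fashion_mnist_pipeline():
    # Step one: download the data and push the four NumPy files to Minio.
    # file_outputs reads each file's contents (the Minio object names)
    # and exposes them as outputs for later steps; keys may only contain
    # letters, numbers and dashes, hence the camel casing.
    download = dsl.ContainerOp(
        name='download',
        image='javaderek/fashion-mnist-download:latest',  # placeholder tag
        file_outputs={
            'trainImages': '/train_images_object.txt',
            'trainLabels': '/train_labels_object.txt',
            'testImages': '/test_images_object.txt',
            'testLabels': '/test_labels_object.txt',
        }
    )


if __name__ == '__main__':
    # Produces the .tar.gz package mentioned below.
    compiler.Compiler().compile(fashion_mnist_pipeline, 'fashion_mnist.tar.gz')
```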
I take the .tar.gz file that has been produced, navigate into the Kubeflow Pipelines UI, click Upload, give it a name and -- success!

I create a run. And OMG, it actually works on the first attempt! :-) Clicking the "download" box brings up the context pane and lets me inspect the Logs first, which show all the same output I had previously seen in the console during testing.

And then, clicking the Input/Output tab, I can see that the names of the objects have been properly set in the output variables. I am officially ready to begin the subsequent steps!
