Changing the Fashion MNIST demo to use persistent volumes
- Derek Ferguson
- Jul 21, 2019
- 5 min read
Before I start this tale, let me say - all the initial testing below was done against the MiniKF virtual machine image of a full Kubeflow deployment. This virtual machine (as of the time of writing) runs a set of bits that are ahead of Kubeflow 0.5.1. As a result, you will see steps that will not work on "vanilla" 0.5.1. There are some steps online for upgrading to a "0.5.1+" version, but I have not yet been able to make those work. So, for now, these are steps just for MiniKF.
So, functionality was recently added to Kubeflow pipelines to let developers easily leverage persistent volumes for sharing data between pipeline steps. In this blog, I intend to walk through an upgrade / branch of my Fashion MNIST demo to use persistent volumes instead of writing everything out to S3.
I start by cloning my GitHub repo.
I then follow the instructions in the "Commit 6" portion of this Kubeflow issue's discussion. Specifically, I add this code right before step 1 of my existing Fashion MNIST pipeline...
vop = dsl.VolumeOp(
    name="create_pvc",
    resource_name="my-pvc",
    mode=["ReadWriteMany"],
    size="1Gi"
)
I try to compile my pipeline and it blows up immediately - it doesn't recognize this syntax. I check my version of kfp and find out that I'm on 0.1.19. Hmm... looks like I could have been on 0.1.24. I upgrade and retry. It doesn't help. So, I remove the "mode" attribute and retry. That works... OK - so, I just have to hope that it will allow me to read and write from it many times by default. :-)
vop = dsl.VolumeOp(
    name="create_pvc",
    resource_name="my-pvc",
    size="1Gi"
)
Now I try to add this to step 1 in my pipeline using the same instructions and it blows up saying that it doesn't recognize the "volumes" parameter of the ContainerOp constructor. Thankfully, these instructions show me that the attribute is now called "pvolumes." So my entire step is...
download = dsl.ContainerOp(
    name='download',
    # image needs to be a compile-time string
    image='docker.io/dotnetderek/download:latest',
    arguments=[
        download_and_preprocess
    ],
    file_outputs={
        'trainImages': '/trainImagesObjectName.txt',
        'trainLabels': '/trainLabelsObjectName.txt',
        'testImages': '/testImagesObjectName.txt',
        'testLabels': '/testLabelsObjectName.txt'
    },
    pvolumes={
        "/mnt": vop.volume
    }
)
I upload my Fashion MNIST pipeline to try this out, and it runs perfectly up to the point in the download step where it tries to interact with Minio, which I haven't set up on the cluster. That's fine - it's just a reminder that I need to update the code in download.py.
So, looking at the code, I first delete all references to Minio and then realize that, by extension, I no longer need to pass S3 keys out of this step, either. Instead, I change the lines that used to write temporarily to the file system before loading into S3 so that they write into my /mnt mount, and the next step will (hopefully) be able to read the data directly from there.
pickle.dump(train_images, open("/mnt/train_images", "wb"), pickle.HIGHEST_PROTOCOL)
pickle.dump(train_labels, open("/mnt/train_labels", "wb"), pickle.HIGHEST_PROTOCOL)
pickle.dump(test_images, open("/mnt/test_images", "wb"), pickle.HIGHEST_PROTOCOL)
pickle.dump(test_labels, open("/mnt/test_labels", "wb"), pickle.HIGHEST_PROTOCOL)
The complete script is now 24 lines, instead of 90. I repackage it as a Docker image tagged ":vop" instead of ":latest" (I will make it "latest" if this works :-) ) and reattempt. It runs all the way to the end of step 1. So, by all appearances, writing this data to a persistent volume is a piece of cake. Will reading it back in step 2 be equally easy?
Something I very nearly miss as I start to edit the preprocess.py file is that it expects the switch that tells it whether or not to run to be the fifth command-line argument. But I just removed all the S3 file pointers from the pipeline script, so that flag is now the first command-line argument. I make this change in the code first.
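Concretely, the change in preprocess.py is just the argument index. Here's a minimal sketch - the variable name is an assumption, not the exact code:

import sys

# the S3 object-name arguments are gone from the pipeline,
# so the run/skip flag moves from position 5 to position 1
# before: should_preprocess = sys.argv[5]
should_preprocess = sys.argv[1]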
I go through the rest of the code, remove all the S3 references, and point the input to come directly from the same /mnt directory, which I have updated the pipeline file to mount as part of my second ContainerOp as well. I then load the rebuilt pipeline into my cluster and immediately notice that the Download and Preprocess steps have become parallel steps after the volume creation, rather than being sequenced one after the other. Of course, this won't work, because the images have to be downloaded before they can be preprocessed.

So - I will have to go back and pass *something* between the download and pre-process steps to signal that preprocess can't happen until download is finished.
Ultimately, I elect to pass a single file, into which I just write the string "ok".
print("starting to write conclusion message")
text_file = open("downloadOk.txt", "w")
text_file.write('ok')
text_file.close()
Sure enough, this puts the steps into proper sequence, and the data that was written to the volume in the download step gets read back and preprocessed in the preprocess step. This bodes very well for my ability to finish the pipeline in this manner! :-)
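For anyone following along, here is roughly what the pipeline-side wiring ends up looking like. This is a sketch trimmed to the relevant parts - the preprocess image name, argument list, and exact paths are assumptions - but the key idea is that consuming download's file output as an argument to preprocess is what forces the two steps back into sequence.

# download now exposes the "ok" marker as a file output
download = dsl.ContainerOp(
    name='download',
    image='docker.io/dotnetderek/download:vop',
    file_outputs={
        'downloadOk': '/downloadOk.txt'   # path assumed - must match where download.py writes it
    },
    pvolumes={
        "/mnt": vop.volume
    }
)

# passing that output into preprocess creates the dependency,
# so Kubeflow schedules preprocess only after download completes
preprocess = dsl.ContainerOp(
    name='preprocess',
    image='docker.io/dotnetderek/preprocess:vop',   # assumed image name
    arguments=[
        download.outputs['downloadOk']
    ],
    pvolumes={
        "/mnt": vop.volume
    }
)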

So, now on to the Training step. From past experience, we now know that we need to maintain the dependencies on the previous steps in order for this to work. So, I borrow the "downloadOk" logic from the download step and add it to the end of the preprocess step. Then I change the build script to build a ":vop" version of the image, like the previous 2 steps. Then I remove the Minio references from the Dockerfile and the Python code, and finally I add the mount to both the pipeline step and the paths in the Python code.
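On the Python side, loading the data in the training code ends up looking something like this (a sketch - the preprocessed file names are assumptions; they just need to match whatever the preprocess step wrote to the volume):

import pickle

# read the preprocessed arrays straight off the shared volume
train_images = pickle.load(open("/mnt/preprocessed_train_images", "rb"))
train_labels = pickle.load(open("/mnt/preprocessed_train_labels", "rb"))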
An interesting wrinkle as I proceed to try out the Training step: the Preprocess step blows up, after having worked fine the first time. I inspect the place where it stopped working and see that it is when it attempts to write the preprocessed data out to the share. I theorize that maybe the file mode I'm using won't overwrite an existing file and that the share is sticking around between pipeline runs (which would be a good thing - otherwise, how could we even retrieve the data later?). I change the mode at the end to "wb+" and re-run. Preprocess succeeds!
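The fix itself is just the mode string in the dump calls - a sketch, with the file name assumed:

import pickle

# "wb+" opens the file for binary read/write and truncates any copy
# left over from a previous pipeline run
pickle.dump(train_images, open("/mnt/preprocessed_train_images", "wb+"), pickle.HIGHEST_PROTOCOL)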
Training stumps me for a good long while. As part of changing it, I discover that the file names shifted - the later steps use underscores where the earlier steps don't. So, training blows up looking for data files that don't exist. I change this, rerun it, and - all is well! But then I add the "Evaluate" stage and training starts to fail at the end, even though the logs show it has performed all of its steps. I comment out "Evaluate" again and all is well again. Clearly, something is breaking in the move from the Training step to the Evaluate step!
I remove all the stuff about output variables from the Train step and it starts working again. I put it back and it begins failing again. So, I rename the output variable and have it write specifically to the root of the filesystem - something none of the other steps needed to do. That works! Not sure what happened here, but - glad to be past it!
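In pipeline terms, the workaround amounts to something like this (a sketch - the output name, path, and image tag are assumptions):

train = dsl.ContainerOp(
    name='train',
    image='docker.io/dotnetderek/train:vop',   # assumed image name
    file_outputs={
        # renamed output, written to the root of the container filesystem
        'trainOutput': '/trainOutput.txt'
    },
    pvolumes={
        "/mnt": vop.volume
    }
)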
Evaluate "just works", thankfully. The only change I needed to make beyond the ones above were that the code was expecting the test files to have ".npy" extensions, so I had to remove those.
Also, I comment out the code at the end that tries to serve everything up to a special version of TensorFlow Serving that I had on the same machine. That is a bigger problem for a different day.