The sad story of the Tensorboard Artifact viewer
- Derek Ferguson
- Apr 7, 2019
So, one of the features I've noticed about the pipelines as I've built them out is the ability to serve up various artifacts - like TensorBoard visualizations. I've not done this before, so I'm curious to see how easy (or difficult) it will be.
The starting point is this article. It indicates that we need to change the Python code in our steps that will be producing Tensorboards to have them create a JSON file at the root of their file systems. That JSON will provide a pointer to the location of our Tensorboard logs. OK, so... that sends us back even a little further in our process, because we now need to change our Python to actually produce the required logs.
For this, we'll try to apply some of the guidance found here. It seems to be a matter of adding some logging lines to our training code - and maybe to our evaluation code after that. One change that I have made between my last blog post and this one is that I have added a pipeline parameter to say whether it should perform the download and pre-processing, or just go straight to the training. Unfortunately, the only way to do this is for the Docker images to actually have knowledge of this flag and check the command line argument to see whether they should abort operation early or not. Hopefully there will be some way to interact with these flags at the pipeline level in a future version.
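A minimal sketch of what that flag check might look like inside a step's entrypoint - the flag name and values here are my own invention, not anything the pipeline mandates:

```python
import argparse
import sys


def main(argv=None):
    parser = argparse.ArgumentParser()
    # The pipeline parameter arrives as an ordinary command line argument
    # on the container; it comes through as a string, not a bool.
    parser.add_argument("--skip-preprocessing", default="false")
    args = parser.parse_args(argv)

    if args.skip_preprocessing.lower() == "true":
        # Abort early so the pipeline can go straight to training.
        print("Skipping download and pre-processing.")
        return 0

    print("Running download and pre-processing.")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

The string comparison is deliberate: pipeline parameters get serialized to text on the way into the container, so a naive `type=bool` argparse argument would treat any non-empty string as true.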
Reading through a bit of the guidance, I'm seeing that it is really aimed at Tensorflow in its pure state, rather than the Keras version that we are using in our demo. So, a little more Google magic leads me to this video. I'll follow it and see how it goes.
The first deviation I take from the video is that they use a timestamp to put the logs in a time-based directory. This would make sense if we were going to re-run this multiple times in the same container - or on the same actual server. In our case, though, we will run this just once for each instance of our training container, and we need to be able to tell Kubeflow where to find the logs, so I hard code it to a single location just off the root of the file system.
Wow -- according to that, I was able to instrument my entire training with TensorBoard in just 2 lines of actual code. It will be fascinating to see whether that actually works!
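Here's roughly what those two lines look like in context, wrapped in a toy Keras model so the snippet runs on its own (the model and data are placeholders, not my demo's actual training code):

```python
import numpy as np
import tensorflow as tf

# Placeholder model and data, just so the snippet runs end to end.
model = tf.keras.Sequential(
    [tf.keras.Input(shape=(4,)), tf.keras.layers.Dense(1)]
)
model.compile(optimizer="adam", loss="mse")
x_train = np.random.rand(32, 4)
y_train = np.random.rand(32, 1)

# The two instrumentation lines: a TensorBoard callback pointed at a
# hard-coded log dir (no timestamp, since each training container runs
# exactly once)...
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir="/logs")
# ...passed to fit() so Keras writes event files under /logs.
history = model.fit(
    x_train, y_train, epochs=1, verbose=0, callbacks=[tensorboard_callback]
)
```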
But before I can test for success, I have to go back to my original article above and add some Python to output the right metadata. Adding that code is quite straightforward - I just follow the instructions, together with the text-file-writing example I already had from emitting the output file parameters in this code.
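The metadata-writing code ends up looking something like this - the /mlpipeline-ui-metadata.json path and the "tensorboard" viewer type come from the Kubeflow Pipelines docs, while the /logs source is my hard-coded log directory from the training step:

```python
import json

# Describe the TensorBoard artifact for the pipelines UI.
metadata = {
    "outputs": [
        {
            "type": "tensorboard",
            "source": "/logs",  # where the training step wrote its event files
        }
    ]
}

# The pipelines UI expects this file at the root of the container's
# file system.
with open("/mlpipeline-ui-metadata.json", "w") as f:
    json.dump(metadata, f)
```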
But, when I try to re-run it -- everything seems to blow up, even the stuff that was previously working. I've been having issues with my laptop having "only" 16 GB of memory. When MiniKF hits that memory limit, weird things start to happen... I'm wondering if this might be part of it.
Also, I'm seeing odd references to Arrikto, and there was some chat on the Pipelines Slack about something that worked in the past having broken recently. I'm wondering if I should be bold and try deleting my entire installation and rebuilding based on the README instructions in my demo's GitHub repo. On one hand, it might solve the problem AND verify my instructions. On the other hand, it might blow everything up and put me in a situation where I'm not able to demo again until it gets fixed further upstream. I decide to be bold!
arrikto/minikf (virtualbox, 0.13.0-pre.857.g986fbbe)
That's the version number currently given for MiniKF. So I run "vagrant box update" -- it says this is the latest version. That sort of kills my idea that upgrading the box might resolve my issue. :-( I can still hope that rebuilding from scratch might reduce the number of things going on in the box and thereby reduce the memory pressure, though. So, I push on...
I go straight for "vagrant init arrikto/minikf" and "vagrant up", hoping that it might install next to my existing installation. Looking at the VirtualBox UI afterwards, I seem to have gotten my wish -- a new box created right below my old one.

Flash forward a few hours. Having run down all of these resourcing issues, I was no closer to addressing my main issue - which was that clicking "Artifacts" on the training step simply produced a spinning wheel. Finally, I decided to look into the browser console and - sure enough - it turns out that the viewer can only show log data stored in Google Cloud Storage or a Minio location. Since the whole point of this is to create a self-contained demo, I'm not going to touch GCS - I will have to try to stuff my log directory into Minio and return its location.
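What I had in mind was pointing the metadata's source at a Minio URI instead of the local path - the bucket and object path here are made up for illustration:

```python
import json

# Hypothetical variant of the metadata file, with the logs uploaded to
# Minio first. The bucket/path ("mlpipeline/tensorboard-logs") is an
# example, not a real location in my demo.
metadata = {
    "outputs": [
        {
            "type": "tensorboard",
            "source": "minio://mlpipeline/tensorboard-logs",
        }
    ]
}

with open("/mlpipeline-ui-metadata.json", "w") as f:
    json.dump(metadata, f)
```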
Except, thankfully -- before I spend too much more time on this effort, I see this bug report and realize that the Minio support is not yet finished, either. That was an unfortunate day. But, at least I feel more comfortable with tearing down and rebuilding my MiniKF instance now! :-)
Also - I'll leave the code in place. It isn't hurting anything and - at some point in the future - maybe it will be useful.