KubeFlow "Unboxing" - Part 1 - the K8s Cluster
- Derek Ferguson
- Dec 16, 2018
- 5 min read
I recently returned from Kubecon in Seattle. Given my focus on machine learning, I spent almost all of my time in the machine learning track at this event. Of all the technologies presented, two in particular caught my attention. The first is an effort called Kubeflow which, as I understand it, should provide a relatively "turn key" solution for going from developing models in a Jupyter notebook to having them trained and available for predictions on a pre-existing Kubernetes cluster. The other is an IBM offering known as "Fabric for Deep Learning" (FfDL - pronounced "fiddle"), which seems to offer more-or-less the same thing, but is particularly notable for having won some prestigious prizes recently and for being integrated with Watson.
It is something of a coin toss at this point, but since the IBM ecosphere tends to be a world unto itself - and a world with which I have little prior experience - I am going to try KubeFlow first. The purpose of this blog post is twofold. First, to give myself notes on what I have and haven't tried, since I have noticed - subsequent to my past posting on the value of interruptions - that, sadly, I am not as good at documenting my systems as I would like to be. More usefully to you, I also intend to make this something that people who wish to set up a KubeFlow instance of their own can follow.
Step 1 -- establishing my baseline.
So, when I left off with my home k8s cluster, it was July. My first step, then, will be to figure out how much of it remains.
* Network - in June, this was a hybrid wireless and wired network with a couple of the nodes plugged directly into my wireless hub via ethernet cable. Since then, I have invested in an eero wireless network and wired ethernet in every room of the house. So, my controller is wireless, my most powerful node is wireless and my two less powerful nodes are plugged into an ethernet port in another room - meaning they have to go through a wireless switch before getting to the eero hub. The one blessing in all of this is that everything ultimately goes into the eero network, so there should be no challenges with network line of sight.
* Kubernetes Controller -- a VirtualBox VM running in Bridged mode on my primary laptop. My primary laptop is a top-of-the-line MacBook, so I figure this gets the most power. The first discovery was that running "kubectl get pods" simply hung, and trying to do "kubeadm init" warned that the ports were already in use. So I'm getting the sense that my cluster was left in an inconsistent state. I'm not going to mess with trying to figure out what broke, since I had nothing terribly important on it anyhow - I'm just going to try to upgrade and build out from there.
systemctl enable docker.service (then enter credentials many times)
First init attempt complains that the Docker version is too old, so follow directions at https://docs.docker.com/install/linux/docker-ce/ubuntu/#install-docker-ce
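For reference, the gist of those directions at the time boiled down to something like the following (the exact docker-ce version string varies by Ubuntu release - "apt-cache madison docker-ce" will list what's actually available):
sudo apt-get remove docker docker-engine docker.io
sudo apt-get update
sudo apt-get install apt-transport-https ca-certificates curl software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
sudo apt-get update
sudo apt-get install docker-ce=18.06.1~ce~3-0~ubuntu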
Of course, there was an issue. Trying to run the final install command for 18.06.1 gave an error that one of the dependencies had failed. Now, while this has been running, Ubuntu (desktop, by the way) has been telling me about some urgently-needed upgrades. So, I'm going to let those install and then try this again. Maybe that process had a lock on some essential files?
Interestingly, this general update from the UI also hit an error which it indicated was with Docker. :-( So, I tried to report it (it refused, since Docker isn't Ubuntu software) and moved on. Next, it prompts me specifically to upgrade some Docker packages via the UI. This looks promising, so I'll let it try. It says it needs to restart... OK -- sounds like my theory about locked files is holding.
And, when the restart is finished, I see that Docker 18.09 is installed and running. That's 0.03 greater than the version Kubernetes says is verified, but we'll give this a whirl.
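A quick sanity check to confirm what actually ended up installed and running:
docker --version
sudo systemctl status docker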
However, kubeadm init gives another issue now -- the number of CPUs is less than required. It wants at least 2 cores. This wasn't an issue on the previous version - must be something new. So, I shut down the guest OS, go to the VirtualBox CPU configuration, give it another CPU and restart.
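(Aside: if you can't spare a second virtual CPU, I believe kubeadm will let you skip that particular preflight check - though adding the CPU is the cleaner fix:)
kubeadm init --ignore-preflight-errors=NumCPU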
kubeadm init <-- works now!
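As a note to myself: kubeadm prints a few follow-up commands so that kubectl works as a regular user - roughly these:
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config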
So now on to the nodes, to see how they have fared over the past 6 months. :-/
* My previous MacBook -- so, this will be the most powerful node, after kmaster (honestly can't remember if the master can also run a pod or not - guess I'll find out).
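(On the "can the master run pods" question: by default, kubeadm taints the master so ordinary workloads won't schedule there. If I end up wanting kmaster to carry pods, my understanding is that something like this checks for and removes the taint:)
kubectl describe node kmaster | grep Taints
kubectl taint nodes --all node-role.kubernetes.io/master-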
We'll start naively by simply trying to run the command that would join this machine to our new cluster. First, of course, the command needs to be copy/pasted via pastebin.com. Unfortunately, I see that I never got copy/paste integration hooked up with this console-only Ubuntu instance. So, let me just try a simple "sudo kubeadm join" first with no parameters to see what the initial complaints are. OK: Docker isn't running and... this machine has already been joined to a cluster. Also, I'm going to guess this is an ancient version. Trying to update it just results in a bunch of locked files and - all things considered - I'd rather work with a Desktop image anyhow. So, I'm just going to install from scratch...
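(Handy trick if you've lost the original join command: running this on the master regenerates one, as far as I recall:)
kubeadm token create --print-join-command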
Docker installed using the same instructions as the remove-and-upgrade above -- no problem.
Now to install kubernetes... I used pure command line stuff last time, and I'd be inclined to do the same - but there's something new called "conjure" from Ubuntu since last time. I read through that and it looks like this would've been a great way to start a new cluster, but I'm not sure it is going to play nice with joining my pre-existing cluster, unfortunately. So, I'll just go with the previous command lines.
sudo passwd root
su root
apt-get update && apt-get install -y apt-transport-https curl
curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
cat <<EOF >/etc/apt/sources.list.d/kubernetes.list
deb http://apt.kubernetes.io/ kubernetes-xenial main
EOF
apt-get update
swapoff -a
apt-get install kubeadm kubectl kubelet kubernetes-cni
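The join command itself looks roughly like this - the IP, token and hash below are placeholders, not values from my cluster; kubeadm init prints the real ones at the end of its output:
kubeadm join <master-ip>:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>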
* Mac Mini -- this one seems to be in fairly decent shape. The K8s version is v1.10.4, but it responded nicely to a "kubeadm reset" and "kubeadm join" -- so let's see how this goes!
* Old PC -- also in decent shape. K8s was on v1.10.2, but also allowed me to reset and rejoin.
Having now got my cluster of 4 machines up and running, I thought I would revisit the version upgrade from the master... maybe I could get my 2 older nodes up to the newest version? But I can't do this while there are NotReady nodes in my cluster and, right now, they're all NotReady, because I haven't put down Weave yet. So, I do that next...
kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version | base64 | tr -d '\n')"
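To watch the Weave pods come up and the nodes flip to Ready, plain old kubectl does the job:
kubectl get pods -n kube-system -o wide
kubectl get nodes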
Unfortunately, at this point, a problem quickly became apparent. The 2 new K8s instances (the master and the freshly reinstalled MacBook node) were able to move to Ready, but the 2 old ones were not. So, I ran the following commands on the Old PC to remove k8s...
kubeadm reset
sudo apt-get purge kubeadm kubectl kubelet kubernetes-cni kube*
sudo apt-get autoremove
sudo rm -rf ~/.kube
Then rebooted and re-installed with an abbreviated set of the above install steps -- basically just...
apt-get install kubeadm kubectl kubelet kubernetes-cni
Followed by the join command. After re-running the Weave install from kubectl above and waiting about 5 minutes, the Old PC went to Ready, also.
Rerunning the above on the Mac Mini also brought it to Ready. Finally - I'm ready to install Kubeflow! :-)
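(For the record, the final sanity check before declaring victory was just confirming that all four machines report Ready:)
kubectl get nodes -o wide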