
Troubleshooting TFJob and Removing K8s nodes

  • Writer: Derek Ferguson
  • Dec 30, 2018
  • 2 min read

I'm going to take a brief pause from the main narration of my journey of discovery with Kubeflow to talk about troubleshooting TFJob or - for that matter - any broken pod in K8s and, when needed, how to remove a node from your cluster that just isn't going to get the job done.

So, in my case, after spinning up my first TFJob, I saw in the Kubeflow UI that it failed shortly after launch. At first I was puzzled as to how to figure out what went wrong. This led me to a K8s command I guess I should already have known: logs. Since Kubeflow's UI let me know my failed pod was called secondrun-worker-0 in the "kubeflow" namespace, I simply ran:

kubectl logs secondrun-worker-0 -n kubeflow

And I got the following error message in response...
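For anyone following along, here is a minimal sketch of the same lookup wrapped in a small shell function, with two extra standard kubectl calls that often help when plain logs are not enough. The function name inspect_pod is hypothetical (it is not part of kubectl or Kubeflow); the pod and namespace names are the ones from this post.

```shell
# Sketch: gather the usual first clues for a failed pod.
# inspect_pod is a hypothetical helper name, not a kubectl feature.
# Usage: inspect_pod <pod-name> <namespace>
inspect_pod() {
  pod="$1"
  ns="$2"

  # Output of the pod's current container (the command used in this post).
  kubectl logs "$pod" -n "$ns"

  # Output of the previous container instance, if the pod was restarted.
  # kubectl returns an error when there is no previous instance; ignore it.
  kubectl logs "$pod" -n "$ns" --previous || true

  # Scheduling details, container state, exit code, and recent events.
  kubectl describe pod "$pod" -n "$ns"
}
```

Calling it as `inspect_pod secondrun-worker-0 kubeflow` runs the three commands in order against the failed worker pod.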

As I mentioned at the start of my Kubeflow series, I am using a set of old computers for my K8s cluster. In fact, the computer to which my pod was assigned was... 10 years old. Further digging has shown me that a computer this old has not been usable with TensorFlow since at least TF 1.6 (https://github.com/tensorflow/tensorflow/issues/18689). And so, old friend, it is time for you to go - at least from my current cluster, if not from my entire collection of computing technology.
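A rough sketch of how one might confirm which machine a pod landed on, and whether that machine's CPU is the likely problem. One known reason older CPUs stopped working with the prebuilt TensorFlow binaries from 1.6 onward is that those binaries use AVX instructions; that this was the cause here is an assumption, not something stated above. The helper names pod_node and has_avx are hypothetical, and has_avx is meant to be run on the node itself (Linux).

```shell
# Print the name of the node a pod was scheduled onto.
# pod_node is a hypothetical helper name.
# Usage: pod_node <pod-name> <namespace>
pod_node() {
  kubectl get pod "$1" -n "$2" -o jsonpath='{.spec.nodeName}'
}

# Report whether a cpuinfo file lists the AVX CPU flag.
# Assumption: AVX is the feature that matters for this failure.
# The file defaults to /proc/cpuinfo, so run this on the node in question.
# Usage: has_avx [cpuinfo-file]
has_avx() {
  # -w matches "avx" only as a whole word, so "avx2" alone does not count.
  if grep -qw avx "${1:-/proc/cpuinfo}"; then
    echo "avx: yes"
  else
    echo "avx: no"
  fi
}
```

For example, `pod_node secondrun-worker-0 kubeflow` prints the node name, and running `has_avx` in a shell on that node reports whether its CPU advertises AVX.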

First, I deleted the TFJob, so there would be nothing left running on this node. This took some futzing. For one thing, I had let my forwarding task in the console die, so I had to restart that. After that, although it showed the prompt below, it didn't actually do anything when I clicked "Delete". After closing and re-opening the browser, it finally seemed to work, although it is odd that it refers to the job as "test," since that isn't anywhere else in the UI.
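If the UI refuses to cooperate, the same cleanup can probably be done from the command line, since TFJob is installed as a custom resource that kubectl can list and delete. This is a sketch under assumptions: the helper names are hypothetical, and the job name "secondrun" is only inferred from the pod name secondrun-worker-0, so check the list output before deleting anything.

```shell
# List the TFJobs in a namespace, to find the job's real name.
# Usage: list_tfjobs <namespace>
list_tfjobs() {
  kubectl get tfjobs -n "$1"
}

# Delete one TFJob by name.
# Usage: delete_tfjob <job-name> <namespace>
delete_tfjob() {
  kubectl delete tfjob "$1" -n "$2"
}

# Show any pods, in every namespace, still scheduled on a given node.
# Usage: pods_on_node <node-name>
pods_on_node() {
  kubectl get pods --all-namespaces -o wide --field-selector "spec.nodeName=$1"
}
```

A possible sequence would be `list_tfjobs kubeflow`, then `delete_tfjob secondrun kubeflow` (assuming that is the job's name), then `pods_on_node gabriel` to confirm nothing of yours is left on the node.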

And then, I deleted the node with "kubectl delete node gabriel", since gabriel was the name of my "non-useful" node.

And that, per "kubectl get nodes", cleaned it up! :-)
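For a node that still has other workloads on it, a slightly more cautious variant of the same removal is sketched below: drain the node first so its pods are evicted, then delete it, then list the nodes to check. The function name remove_node is hypothetical; the kubectl subcommands are standard, and gabriel is the node name from this post.

```shell
# Sketch: drain a node, delete it from the cluster, then list what remains.
# remove_node is a hypothetical helper name.
# Usage: remove_node <node-name>
remove_node() {
  node="$1"

  # Mark the node unschedulable and evict its pods.
  # Pods managed by a DaemonSet are skipped rather than evicted.
  kubectl drain "$node" --ignore-daemonsets || return 1

  # Remove the node object from the cluster (the command used in this post).
  kubectl delete node "$node" || return 1

  # List the remaining nodes; the removed one should no longer appear.
  kubectl get nodes
}
```

Running `remove_node gabriel` stops at the first step that fails, so a node that cannot be drained is not deleted.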

 
 
 

©2018 by Machine Learning for Non-Mathematicians.
