Runbook: Node Not Ready

Alert

Severity

Critical. A NotReady node means pods scheduled on that node may be unreachable or evicted.

Impact

In a 3-node cluster (1 server + 2 agents), losing a node reduces capacity by 33-50% and may affect pod scheduling and HA guarantees.

Investigation Steps

  1. Check node status:
kubectl get nodes -o wide
  2. Describe the not-ready node for condition details:
kubectl describe node <node-name>
  3. Look at the conditions section for specific failures:
kubectl get node <node-name> -o jsonpath='{.status.conditions[*]}' | jq .
  4. Check whether the node is reachable via SSH:
ssh sre-admin@<node-ip> "uptime && free -h && df -h"
  5. If SSH is available, check kubelet status:
ssh sre-admin@<node-ip> "sudo systemctl status rke2-agent"
# Or for server nodes:
ssh sre-admin@<node-ip> "sudo systemctl status rke2-server"
  6. Check kubelet logs on the node:
ssh sre-admin@<node-ip> "sudo journalctl -u rke2-agent --no-pager --since '30 minutes ago' | tail -100"
  7. Check for disk pressure:
ssh sre-admin@<node-ip> "df -h && df -i"
  8. Check for memory pressure:
ssh sre-admin@<node-ip> "free -h && grep -E 'MemTotal|MemAvailable|SwapTotal' /proc/meminfo"
  9. Check for PID pressure:
ssh sre-admin@<node-ip> "ps aux | wc -l"
  10. Check containerd status:
ssh sre-admin@<node-ip> "sudo systemctl status containerd"
ssh sre-admin@<node-ip> "sudo crictl --runtime-endpoint unix:///run/k3s/containerd/containerd.sock ps"
  11. Check system logs for hardware or kernel errors:
ssh sre-admin@<node-ip> "sudo dmesg | tail -50"
ssh sre-admin@<node-ip> "sudo journalctl -p err --since '1 hour ago' --no-pager"
  12. Check the pods that were running on the not-ready node:
kubectl get pods -A --field-selector spec.nodeName=<node-name>
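
For a quick first pass, the status check in step 1 can be filtered down to just the unhealthy nodes. A minimal sketch (the `not_ready_nodes` helper name is ours, not a kubectl feature); it reads `kubectl get nodes` output on stdin so it also works on saved output:

```shell
# not_ready_nodes: print the name and status of every node whose STATUS
# column is not exactly "Ready". This also flags cordoned nodes, whose
# status reads "Ready,SchedulingDisabled".
not_ready_nodes() {
    awk 'NR > 1 && $2 != "Ready" { print $1, $2 }'
}

# Usage: kubectl get nodes | not_ready_nodes
```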

Resolution

kubelet/RKE2 service stopped

  1. Restart the RKE2 service:
# For agent nodes:
ssh sre-admin@<node-ip> "sudo systemctl restart rke2-agent"

# For server nodes:
ssh sre-admin@<node-ip> "sudo systemctl restart rke2-server"
  2. Wait 1-2 minutes and verify the node returns to Ready:
kubectl get node <node-name> -w
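
The manual watch can also be scripted as a bounded poll, which is handy in automation. A sketch; the `KUBECTL` override (for substituting a stub in dry runs) is an assumption of this sketch, not a kubectl convention:

```shell
# wait_node_ready <node-name> [timeout-seconds]
# Poll until the node's STATUS column reads Ready, or give up after the
# timeout (default 120s). Override KUBECTL to test without a cluster.
wait_node_ready() {
    local node="$1" timeout="${2:-120}" waited=0 status
    while [ "$waited" -lt "$timeout" ]; do
        status=$(${KUBECTL:-kubectl} get node "$node" --no-headers 2>/dev/null | awk '{ print $2 }')
        if [ "$status" = "Ready" ]; then
            echo "node $node is Ready"
            return 0
        fi
        sleep 5
        waited=$((waited + 5))
    done
    echo "timed out waiting for node $node" >&2
    return 1
}
```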

Disk pressure

  1. Identify large files or directories:
ssh sre-admin@<node-ip> "sudo du -sh /var/log/* | sort -rh | head -10"
ssh sre-admin@<node-ip> "sudo du -sh /var/lib/rancher/rke2/* | sort -rh | head -10"
  2. Clean up unused container images:
ssh sre-admin@<node-ip> "sudo crictl --runtime-endpoint unix:///run/k3s/containerd/containerd.sock rmi --prune"
  3. Trim the systemd journal to a size cap:
ssh sre-admin@<node-ip> "sudo journalctl --vacuum-size=500M"
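
The disk checks above can be driven by a small filter that flags filesystems over a usage threshold. A sketch (the 80% default is our choice; for reference, the kubelet's default hard eviction threshold is `nodefs.available<10%`):

```shell
# disk_over_threshold [percent]: read `df -h` output on stdin and print
# each mount point whose Use% column exceeds the threshold (default 80).
disk_over_threshold() {
    local limit="${1:-80}"
    awk -v limit="$limit" 'NR > 1 { gsub(/%/, "", $5); if ($5 + 0 > limit) print $6, $5 "%" }'
}

# Usage: ssh sre-admin@<node-ip> "df -h" | disk_over_threshold 85
```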

Memory pressure

  1. Check for pods consuming excessive memory:
kubectl top pods -A --sort-by=memory | head -20
  2. If a specific pod is the cause, check its memory limits and consider adjusting the HelmRelease values.
  3. If the pressure is system-level, check for non-Kubernetes processes:
ssh sre-admin@<node-ip> "ps aux --sort=-%mem | head -20"
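
To put a number on "pressure", MemAvailable as a percentage of MemTotal is a reasonable signal. A sketch (the helper name is ours); it reads `/proc/meminfo`-style input on stdin:

```shell
# mem_available_pct: read /proc/meminfo on stdin and print MemAvailable
# as an integer percentage of MemTotal. Values near zero mean the node
# is approaching the kubelet's memory.available eviction threshold.
mem_available_pct() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%d\n", avail * 100 / total }'
}

# Usage: ssh sre-admin@<node-ip> "cat /proc/meminfo" | mem_available_pct
```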

Network connectivity issues

  1. Check whether the node can reach the API server:
ssh sre-admin@<node-ip> "curl -k https://127.0.0.1:6443/healthz"
  2. Check firewall rules:
ssh sre-admin@<node-ip> "sudo firewall-cmd --list-all"
  3. Verify the required ports are open (RKE2 uses 6443, 9345, 10250, and 2379-2380).
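
Port reachability can be probed without installing nc by using bash's /dev/tcp pseudo-device. A sketch (the function name and the 2-second timeout are our choices):

```shell
# check_ports <host> <port...>: print "open" or "closed" for each port,
# probing via bash's /dev/tcp with a 2-second timeout per attempt.
check_ports() {
    local host="$1" port
    shift
    for port in "$@"; do
        if timeout 2 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
            echo "$port open"
        else
            echo "$port closed"
        fi
    done
}

# Usage: check_ports <node-ip> 6443 9345 10250 2379 2380
```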

Node completely unresponsive

  1. If SSH is not available, attempt console access via Proxmox:
# From a machine with Proxmox access
ssh root@<proxmox-host> "qm status <vmid>"
  2. If the VM is stopped, start it:
ssh root@<proxmox-host> "qm start <vmid>"
  3. If the VM is running but unresponsive, force a reset:
ssh root@<proxmox-host> "qm reset <vmid>"
  4. After the node comes back, verify it rejoins the cluster:
kubectl get nodes -w
kubectl get nodes -w

Cordon and drain (if node needs maintenance)

  1. Cordon the node to prevent new pods from scheduling:
kubectl cordon <node-name>
  2. Drain existing pods:
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --timeout=300s
  3. Perform the maintenance.
  4. Uncordon when ready:
kubectl uncordon <node-name>
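
The four steps can be combined into a single interactive wrapper. A sketch; the `KUBECTL` override (e.g. `KUBECTL=echo` for a dry run) is an assumption of this sketch, not kubectl behavior:

```shell
# drain_for_maintenance <node-name>: cordon and drain the node, wait for
# the operator to confirm the maintenance is done, then uncordon it.
drain_for_maintenance() {
    local node="$1"
    ${KUBECTL:-kubectl} cordon "$node" || return 1
    ${KUBECTL:-kubectl} drain "$node" \
        --ignore-daemonsets --delete-emptydir-data --timeout=300s || return 1
    read -r -p "Maintenance on $node done? Press Enter to uncordon. "
    ${KUBECTL:-kubectl} uncordon "$node"
}
```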

Prevention

Escalation