🕸️ Ada Research Browser

flux-reconciliation-failure.md
← Back

Runbook: Flux Reconciliation Failure

Alert

Severity

Warning -- Reconciliation failures mean the cluster state has drifted from the Git repository. Changes committed to Git are not being applied. Extended failures may indicate a broken deployment or dependency issue.

Impact

Investigation Steps

  1. Get the status of all Flux Kustomizations:
flux get kustomizations -A
  1. Get the status of all HelmReleases:
flux get helmreleases -A
  1. Identify the specific failing resource and check its events:
flux logs --kind=Kustomization --name=<name> --namespace=flux-system
flux logs --kind=HelmRelease --name=<name> --namespace=<namespace>
  1. Check the Flux source-controller for Git repository sync issues:
flux get sources git -A
kubectl logs -n flux-system deployment/source-controller --tail=100
  1. Check the Flux helm-controller for Helm-specific errors:
kubectl logs -n flux-system deployment/helm-controller --tail=100
  1. Check the Flux kustomize-controller:
kubectl logs -n flux-system deployment/kustomize-controller --tail=100
  1. Verify Flux system pods are running:
kubectl get pods -n flux-system
  1. Check for resource conflicts or validation errors:
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20
  1. Check if the HelmRelease has dependency issues:
kubectl get helmrelease <name> -n <namespace> -o yaml | grep -A 10 dependsOn

Resolution

HelmRelease stuck in "not ready" due to failed upgrade

  1. Check the Helm history:
helm history <release-name> -n <namespace>
  1. If a bad revision exists, let Flux retry:
flux reconcile helmrelease <name> -n <namespace> --with-source
  1. If retries are exhausted, reset the release:
flux suspend helmrelease <name> -n <namespace>
helm rollback <release-name> <last-good-revision> -n <namespace>
flux resume helmrelease <name> -n <namespace>

Kustomization failing due to invalid YAML

  1. Check the error message in the Kustomization status:
kubectl get kustomization <name> -n flux-system -o yaml | grep -A 5 'message:'
  1. Fix the YAML in the Git repository
  2. Push the fix and force reconciliation:
flux reconcile source git sre-platform -n flux-system
flux reconcile kustomization <name> -n flux-system

Git source not syncing

  1. Check the GitRepository status:
flux get sources git -A
kubectl describe gitrepository sre-platform -n flux-system
  1. Verify Git credentials are valid:
kubectl get secret flux-system -n flux-system -o yaml
  1. Test connectivity from the cluster to the Git repository:
kubectl run -n flux-system --rm -it --restart=Never curl-test --image=curlimages/curl:8.4.0 -- curl -I https://github.com

Dependency failure cascading

If component B depends on component A, and A is failing:

  1. Fix component A first
  2. Then reconcile B:
flux reconcile helmrelease <component-a> -n <namespace-a>
# Wait for A to become ready
flux reconcile helmrelease <component-b> -n <namespace-b>

The dependency chain is: istio-base -> cert-manager -> kyverno -> monitoring -> logging -> openbao -> harbor -> neuvector -> keycloak -> tempo -> velero

HelmRelease stuck with "another operation in progress"

  1. Check for stale Helm secrets:
kubectl get secrets -n <namespace> -l owner=helm
  1. If a pending install/upgrade secret exists, remove it:
kubectl delete secret sh.helm.release.v1.<name>.v<version> -n <namespace>
  1. Resume reconciliation:
flux reconcile helmrelease <name> -n <namespace>

Prevention

Escalation