Runbook: Keycloak SSO Failure
Alert
- Prometheus Alert: KeycloakDown/KeycloakRealmLoginFailures
- Grafana Dashboard: Keycloak metrics dashboard
- Firing condition: Keycloak pod is not ready, or OIDC login failure rate exceeds threshold for more than 5 minutes
Severity
Critical -- Keycloak SSO failure prevents authentication to all platform UIs (Grafana, Harbor, OpenBao, NeuVector). Users cannot log in to monitoring dashboards or the container registry.
Impact
- Login to Grafana via Keycloak SSO fails (fallback to local admin account still works)
- Login to Harbor via OIDC fails (local admin account still works)
- OpenBao UI authentication via OIDC fails
- NeuVector UI authentication via OIDC fails
- New user provisioning and group membership changes are blocked
- NIST IA-2 (Identification and Authentication) compliance control is degraded
Investigation Steps
- Check Keycloak pod status:
kubectl get pods -n keycloak
- Check Keycloak pod logs:
kubectl logs -n keycloak keycloak-0 --tail=200
- Check if the Keycloak database (bundled PostgreSQL) is running:
kubectl get pods -n keycloak -l app.kubernetes.io/component=postgresql
kubectl logs -n keycloak -l app.kubernetes.io/component=postgresql --tail=100
- Check the Keycloak HelmRelease:
flux get helmrelease keycloak -n keycloak
- Test Keycloak health endpoint:
kubectl exec -n keycloak keycloak-0 -- curl -s http://localhost:8080/health/ready
- Check the OIDC well-known endpoint:
kubectl port-forward -n keycloak svc/keycloak-http 8080:80 &
curl -s http://localhost:8080/realms/sre/.well-known/openid-configuration | jq .
- Check Keycloak events for failed logins:
kubectl port-forward -n keycloak svc/keycloak-http 8080:80 &
# Get admin token
TOKEN=$(curl -s -X POST "http://localhost:8080/realms/master/protocol/openid-connect/token" \
-d "client_id=admin-cli" \
-d "username=admin" \
-d "password=$(kubectl get secret keycloak -n keycloak -o jsonpath='{.data.admin-password}' | base64 -d)" \
-d "grant_type=password" | jq -r '.access_token')
# Get recent events
curl -s -H "Authorization: Bearer $TOKEN" \
"http://localhost:8080/admin/realms/sre/events?type=LOGIN_ERROR&max=20" | jq .
- Verify Istio VirtualService for Keycloak:
kubectl get virtualservice -n keycloak
kubectl describe virtualservice keycloak -n keycloak
- Check if the Keycloak service is accessible from other namespaces (e.g., Grafana):
kubectl run -n monitoring --rm -it --restart=Never curl-test --image=curlimages/curl:8.4.0 -- curl -s http://keycloak-http.keycloak.svc.cluster.local:80/realms/sre/.well-known/openid-configuration
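The one-off checks above can be wrapped in a few small helpers. A minimal sketch in bash -- function names are illustrative, and the realm name and URLs follow the steps in this runbook:

```shell
#!/usr/bin/env bash
# Diagnostic helpers for the investigation steps above.

# Extract the issuer from an OIDC discovery document read on stdin and
# compare it with the URL clients are configured to use. A mismatch here
# usually means KC_HOSTNAME is wrong.
check_issuer() {
  local expected="$1" issuer
  issuer=$(sed -n 's/.*"issuer" *: *"\([^"]*\)".*/\1/p')
  if [ "$issuer" = "$expected" ]; then
    echo "issuer OK: $issuer"
  else
    echo "issuer MISMATCH: got '$issuer', expected '$expected'"
    return 1
  fi
}

# Guard against a silently failed admin-token request: jq -r prints the
# literal string "null" when .access_token is absent, which later shows
# up as a confusing 401 on the events query.
require_token() {
  local t="$1"
  if [ -z "$t" ] || [ "$t" = "null" ]; then
    echo "ERROR: no admin token -- check the admin password secret" >&2
    return 1
  fi
  printf '%s\n' "$t"
}

# Translate common curl exit codes from the cross-namespace check into
# likely causes (codes per curl's documented exit codes).
explain_curl_exit() {
  case "$1" in
    0)  echo "success: Keycloak reachable" ;;
    6)  echo "could not resolve host: check Service name and namespace" ;;
    7)  echo "connection refused: pod not ready or wrong port" ;;
    28) echo "timeout: check NetworkPolicies / Istio sidecar" ;;
    *)  echo "curl exit $1: see the curl man page" ;;
  esac
}

# Example usage against a live port-forward (from the steps above):
#   curl -s http://localhost:8080/realms/sre/.well-known/openid-configuration \
#     | check_issuer "http://localhost:8080/realms/sre"
```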
Resolution
Keycloak pod not starting
- Check pod events:
kubectl describe pod keycloak-0 -n keycloak
- If the pod is in CrashLoopBackOff, check logs from the previous crash:
kubectl logs -n keycloak keycloak-0 --previous
- Common causes:
  - Database connection failure (PostgreSQL not ready)
  - Out of memory
  - Configuration error after upgrade
- If the database is not ready, restart it first:
kubectl rollout restart statefulset -n keycloak -l app.kubernetes.io/component=postgresql
- Then restart Keycloak:
kubectl delete pod keycloak-0 -n keycloak
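The restart order matters: Keycloak will crash-loop if it comes up before its database. The steps above can be sketched as one function with readiness waits between them (selectors and pod names follow the commands in this runbook; the function is defined, not executed):

```shell
# Restart the database and Keycloak in dependency order, waiting for
# readiness between steps.
restart_keycloak_stack() {
  # 1. Restart PostgreSQL and wait until it is ready again.
  kubectl rollout restart statefulset -n keycloak \
    -l app.kubernetes.io/component=postgresql
  kubectl wait --for=condition=ready pod -n keycloak \
    -l app.kubernetes.io/component=postgresql --timeout=300s || return 1

  # 2. Only then recycle the Keycloak pod, so it does not crash-loop
  #    against a database that is still starting.
  kubectl delete pod keycloak-0 -n keycloak
  kubectl wait --for=condition=ready pod/keycloak-0 -n keycloak \
    --timeout=300s
}

# Usage: restart_keycloak_stack
```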
OIDC token endpoint unreachable from Grafana
- Grafana reaches Keycloak via the internal service URL. Verify the service exists:
kubectl get svc keycloak-http -n keycloak
- Check that the Grafana OIDC configuration points to the correct URL:
kubectl get helmrelease kube-prometheus-stack -n monitoring -o yaml | grep -A 5 "token_url"
The token URL should be: http://keycloak-http.keycloak.svc.cluster.local:80/realms/sre/protocol/openid-connect/token
- If the URL is incorrect, update the monitoring HelmRelease values in Git
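For reference, the relevant part of the kube-prometheus-stack values looks roughly like this. This is a sketch of Grafana's generic OAuth settings (keys follow grafana.ini's auth.generic_oauth section); the client secret mount and exact nesting depend on how the HelmRelease is structured:

```yaml
# Sketch: Grafana OIDC settings inside the kube-prometheus-stack values.
grafana:
  grafana.ini:
    auth.generic_oauth:
      enabled: true
      name: Keycloak
      client_id: grafana
      client_secret: $__file{/etc/secrets/oidc/client_secret}  # assumption: secret mounted as a file
      scopes: openid profile email
      auth_url: https://keycloak.apps.sre.example.com/realms/sre/protocol/openid-connect/auth
      token_url: http://keycloak-http.keycloak.svc.cluster.local:80/realms/sre/protocol/openid-connect/token
      api_url: http://keycloak-http.keycloak.svc.cluster.local:80/realms/sre/protocol/openid-connect/userinfo
```

Note the split: auth_url must be reachable from the user's browser, so it uses the external hostname, while token_url and api_url are called by the Grafana pod and can use the internal service URL.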
SSO redirect loop
- This usually indicates a mismatch between the Keycloak hostname and the redirect URL.
- Check the Keycloak hostname configuration:
kubectl get pod keycloak-0 -n keycloak -o yaml | grep -A 2 "KC_HOSTNAME"
- Verify KC_HOSTNAME matches the external URL used by clients (e.g., keycloak.apps.sre.example.com)
- Check KC_HOSTNAME_PORT is set correctly (should be 443 for HTTPS via the Istio gateway)
- Verify the OIDC client redirect URIs in Keycloak match the actual application URLs
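To pull the effective hostname settings out of the running pod and derive the base URL clients should be redirected to, something like the following can help. Both helpers are illustrative: kc_env greps the pod YAML as in the step above, and redirect_base encodes the convention that browsers omit the default HTTPS port:

```shell
# Read a KC_HOSTNAME* environment variable from the running pod
# (same pod YAML as the grep step above; trailing $ anchors the name
# so KC_HOSTNAME does not also match KC_HOSTNAME_PORT).
kc_env() {
  kubectl get pod keycloak-0 -n keycloak -o yaml \
    | grep -A 1 "name: $1$" | sed -n 's/^ *value: //p'
}

# Derive the external base URL implied by the hostname settings; the
# default HTTPS port is omitted, matching what browsers send back.
redirect_base() {
  local host="$1" port="$2"
  if [ "$port" = "443" ] || [ -z "$port" ]; then
    echo "https://$host"
  else
    echo "https://$host:$port"
  fi
}

# Usage:
#   redirect_base "$(kc_env KC_HOSTNAME)" "$(kc_env KC_HOSTNAME_PORT)"
# The result must match the scheme and host clients actually use, and
# must be covered by each client's Valid Redirect URIs.
```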
OIDC client misconfigured
- Access the Keycloak admin console (port-forward or via Istio gateway)
- Navigate to the SRE realm -> Clients
- Verify each client has correct:
  - Valid Redirect URIs
  - Web Origins
  - Client secret matches what is configured in the consuming service
- For Grafana, the client configuration should be:
| Setting | Value |
|---|---|
| Client ID | grafana |
| Valid Redirect URIs | https://grafana.apps.sre.example.com/* |
| Web Origins | https://grafana.apps.sre.example.com |
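The same check can be done from the CLI instead of the admin console. A sketch using kcadm.sh inside the Keycloak pod -- paths follow the stock Keycloak image, and the password lookup matches the secret used in the token step above; quoting will break if the password contains a single quote:

```shell
# Print the registered redirect URIs and web origins for a client
# in the sre realm, via kcadm.sh inside the Keycloak pod.
show_client() {
  local client="${1:-grafana}" pass
  pass=$(kubectl get secret keycloak -n keycloak \
    -o jsonpath='{.data.admin-password}' | base64 -d)
  kubectl exec -n keycloak keycloak-0 -- sh -c "
    /opt/keycloak/bin/kcadm.sh config credentials \
      --server http://localhost:8080 --realm master \
      --user admin --password '$pass' &&
    /opt/keycloak/bin/kcadm.sh get clients -r sre -q clientId=$client \
      --fields clientId,redirectUris,webOrigins"
}

# Usage: show_client grafana
```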
Keycloak database corruption
- If Keycloak logs show database errors:
kubectl logs -n keycloak keycloak-0 --tail=200 | grep -i "database\|postgres\|sql"
- Check PostgreSQL pod:
kubectl logs -n keycloak -l app.kubernetes.io/component=postgresql --tail=100
- If the database is corrupted and persistence is disabled (lab environment), restart both:
kubectl delete pod -n keycloak --all
- If persistence is enabled, restore from the most recent Velero backup of the keycloak namespace
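The Velero restore can be sketched as below. Assumptions: backup names embed sortable timestamps, and `velero backup get` lists one backup per line with the name in the first column; the restore name is illustrative:

```shell
# Restore the keycloak namespace from the most recent Velero backup.
restore_keycloak() {
  local latest
  latest=$(velero backup get | awk '/keycloak/ {print $1}' | sort -r | head -n1)
  [ -n "$latest" ] || { echo "no keycloak backup found" >&2; return 1; }
  velero restore create "keycloak-restore-$(date +%s)" \
    --from-backup "$latest" \
    --include-namespaces keycloak
}

# Usage: restore_keycloak
```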
Emergency: bypass Keycloak for platform access
If Keycloak is completely down and you need direct access to the platform UIs:
- For Grafana, use the local admin account:
  - Username: admin
  - Password: prom-operator (or check the grafana-admin-credentials secret)
- For Harbor:
  - Username: admin
  - Password: check the harbor-core secret (HARBOR_ADMIN_PASSWORD field)
- For NeuVector:
  - Username: admin
  - Password: admin (default)
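The break-glass secret lookups above all follow one pattern; a small helper saves retyping it under pressure. The namespaces and field names in the examples are assumptions -- adjust them to where each chart is installed:

```shell
# Decode one field from a Kubernetes secret.
# Usage: secret_field <namespace> <secret> <data-key>
secret_field() {
  kubectl get secret "$2" -n "$1" -o jsonpath="{.data.$3}" | base64 -d
  echo  # trailing newline for readability
}

# Examples (namespaces and key names are assumptions):
#   secret_field monitoring grafana-admin-credentials admin-password
#   secret_field harbor harbor-core HARBOR_ADMIN_PASSWORD
```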
Prevention
- Monitor Keycloak health via the metrics ServiceMonitor (enabled in the HelmRelease)
- Set alerts on login failure rate and pod restart count
- Back up the Keycloak realm configuration regularly (export via admin API)
- Test SSO login after any Keycloak or client application upgrade
- Maintain local admin credentials as a break-glass mechanism for all platform UIs
- Keep the KC_HOSTNAME and KC_HOSTNAME_PORT environment variables synchronized with the Istio gateway configuration
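The realm backup mentioned above can be scripted against the admin API. A sketch reusing the admin token from the investigation steps ($TOKEN) and Keycloak's partial-export admin endpoint; note that partial-export covers realm configuration (clients, groups, roles) but not user credentials, so database backups remain the authoritative source for users:

```shell
# Export the sre realm configuration via Keycloak's admin REST API.
# Assumes an active port-forward to the service and a valid $TOKEN.
export_realm() {
  local realm="${1:-sre}"
  curl -s -X POST \
    -H "Authorization: Bearer $TOKEN" \
    "http://localhost:8080/admin/realms/$realm/partial-export?exportClients=true&exportGroupsAndRoles=true" \
    > "$realm-realm-$(date +%Y%m%d).json"
}

# Usage: export_realm sre
```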
Escalation
- If Keycloak is down and no local admin credentials are available: this is a P1 -- platform operators are locked out
- If user authentication data is lost (database corruption without backup): escalate to platform team lead for re-initialization of the SRE realm
- If SSO failures are intermittent: collect logs during failure windows and check for resource exhaustion or network instability