Runbook: Istio mTLS Failure
Alert
- Prometheus Alert: IstioMTLSError/IstioPilotConflictOutbound/Istio5xxResponseRate
- Grafana Dashboard: Istio Mesh dashboard
- Firing condition: Services report TLS handshake failures, 503 errors from sidecars, or PeerAuthentication STRICT mode rejecting connections
Severity
Critical -- mTLS failures break service-to-service communication within the mesh. In STRICT mode (the SRE platform default), any service without a valid Istio sidecar proxy will be unable to communicate with mesh services.
Impact
- Service-to-service communication fails with 503 or connection reset errors
- Pods without Istio sidecars cannot reach mesh services (STRICT mTLS rejects plaintext)
- Application health checks may fail if probes go through the sidecar
- Ingress traffic through the Istio gateway may be affected
- Compliance violation: NIST SC-8 (Transmission Confidentiality and Integrity) control is not being met
Investigation Steps
- Check Istiod (control plane) status:
kubectl get pods -n istio-system -l app=istiod
kubectl logs -n istio-system deployment/istiod --tail=100
- Check the PeerAuthentication policy:
kubectl get peerauthentication -A
- Verify mTLS mode for a specific namespace:
kubectl get peerauthentication -n <namespace> -o yaml
- Check proxy sync status and look for the failing pod in the output:
istioctl proxy-status
- If istioctl is not available, check the sidecar proxy logs:
kubectl logs <pod-name> -n <namespace> -c istio-proxy --tail=100
- Check for TLS handshake errors:
kubectl logs <pod-name> -n <namespace> -c istio-proxy --tail=200 | grep -i "tls\|handshake\|ssl\|certificate"
- Check if the destination service has sidecar injection enabled:
kubectl get namespace <namespace> --show-labels | grep istio-injection
- Verify the sidecar is present on both source and destination pods:
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].name}'
- Check Istio DestinationRules that might override mTLS settings:
kubectl get destinationrules -A
kubectl get destinationrules -A -o yaml | grep -B 5 -A 5 "tls"
- Check the Istio HelmRelease status:
flux get helmrelease istio-base -n istio-system
flux get helmrelease istiod -n istio-system
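- Verify the certificate presented by the destination is valid (this needs openssl inside the istio-proxy image, which distroless proxy images do not include):
kubectl exec <pod-name> -n <namespace> -c istio-proxy -- openssl s_client -connect <destination-svc>.<destination-ns>.svc.cluster.local:<port> -tls1_2 2>/dev/null | openssl x509 -noout -dates
- If openssl is not available in the proxy image, inspect the certificates loaded by the sidecar with istioctl instead:
istioctl proxy-config secret <pod-name> -n <namespace>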
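The per-pod sidecar check above can be run across a whole namespace in one pass. The following is a minimal sketch rather than an established platform tool: the helper name pods_missing_sidecar is hypothetical, and it assumes the sidecar container is named istio-proxy and is listed under .spec.containers (a native-sidecar setup that places it under initContainers would need a different jsonpath).

```shell
# pods_missing_sidecar (hypothetical helper): reads lines of the form
#   <pod-name><TAB><space-separated container names>
# on stdin and prints the pods that have no container named istio-proxy.
pods_missing_sidecar() {
  awk -F'\t' '{
    n = split($2, names, " ")
    found = 0
    for (i = 1; i <= n; i++) if (names[i] == "istio-proxy") found = 1
    if (!found) print $1
  }'
}

# Example usage (requires cluster access):
#   kubectl get pods -n <namespace> \
#     -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].name}{"\n"}{end}' \
#     | pods_missing_sidecar
```

Keeping the parsing separate from the kubectl call means the filter can be tried on saved output before it is pointed at a live cluster.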
Resolution
Pod missing Istio sidecar
- Verify the namespace has the injection label:
kubectl get namespace <namespace> -o jsonpath='{.metadata.labels.istio-injection}'
- If missing, add the label:
kubectl label namespace <namespace> istio-injection=enabled --overwrite
- Restart the pods to inject the sidecar:
kubectl rollout restart deployment <name> -n <namespace>
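The three steps above can be combined so that the label is only written when it is missing and the restart is confirmed before moving on. This is a sketch under stated assumptions: the helper name enable_injection and the 120s timeout are illustrative choices, and it assumes the cluster uses the istio-injection label rather than a revision label.

```shell
# enable_injection (hypothetical helper): labels the namespace only when the
# label is not already "enabled", then restarts one deployment and waits for it.
enable_injection() {
  ns="$1"
  deploy="$2"
  current=$(kubectl get namespace "$ns" -o jsonpath='{.metadata.labels.istio-injection}')
  if [ "$current" != "enabled" ]; then
    kubectl label namespace "$ns" istio-injection=enabled --overwrite || return 1
  fi
  kubectl rollout restart deployment "$deploy" -n "$ns" || return 1
  # Blocks until the new pods are ready; the 120s timeout is an arbitrary choice.
  kubectl rollout status deployment "$deploy" -n "$ns" --timeout=120s
}

# Example usage (requires cluster access):
#   enable_injection <namespace> <name>
```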
mTLS failing between namespaces
- Check if both namespaces have STRICT PeerAuthentication:
kubectl get peerauthentication -n <source-namespace>
kubectl get peerauthentication -n <destination-namespace>
- Ensure no DestinationRule is disabling mTLS:
kubectl get destinationrules -n <namespace> -o yaml | grep -A 10 "trafficPolicy"
- If a service needs to accept plaintext (e.g., external health check), create a permissive PeerAuthentication for that specific port:
apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  name: allow-plaintext-health
  namespace: <namespace>
spec:
  selector:
    matchLabels:
      app: <app-name>
  portLevelMtls:
    8080:
      mode: PERMISSIVE
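Grepping DestinationRule YAML for trafficPolicy, as in the step above, can be noisy on a large cluster. A narrower check might look like the sketch below. The helper name rules_disabling_tls is hypothetical, and it only inspects the top-level trafficPolicy.tls.mode of each rule, so port-level and subset-level settings would still need a manual look.

```shell
# rules_disabling_tls (hypothetical helper): reads lines of the form
#   <namespace>/<name><TAB><tls mode>
# on stdin and prints the DestinationRules whose top-level mode is DISABLE.
rules_disabling_tls() {
  awk -F'\t' '$2 == "DISABLE" { print $1 }'
}

# Example usage (requires cluster access):
#   kubectl get destinationrules -A \
#     -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{"\t"}{.spec.trafficPolicy.tls.mode}{"\n"}{end}' \
#     | rules_disabling_tls
```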
Platform namespace communication (Istio injection disabled)
Platform namespaces (kube-system, monitoring, logging, kyverno, etc.) do not have Istio injection enabled. If a platform service needs to reach a mesh service:
- Create a DestinationRule to disable mTLS for that specific service:
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: <service>-plaintext
  namespace: <mesh-namespace>
spec:
  host: <service>.<mesh-namespace>.svc.cluster.local
  trafficPolicy:
    tls:
      mode: DISABLE
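- Or set the PeerAuthentication to PERMISSIVE for that service. A DestinationRule only changes what clients with a sidecar send, so when the platform client has no sidecar, the server-side PERMISSIVE PeerAuthentication is the setting that lets its plaintext requests through.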
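The PERMISSIVE option is not spelled out above, so here is one way it could be generated. This is a sketch, not a platform-approved manifest: the helper name render_permissive_peerauth and the resource name suffix are hypothetical, and it assumes the target pods carry an app label as in the port-level example earlier in this runbook.

```shell
# render_permissive_peerauth (hypothetical helper): prints a workload-scoped
# PeerAuthentication that accepts both mTLS and plaintext.
# Assumes the workload's pods carry the label app=<app-name>.
render_permissive_peerauth() {
  ns="$1"
  app="$2"
  cat <<EOF
apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  name: ${app}-permissive
  namespace: ${ns}
spec:
  selector:
    matchLabels:
      app: ${app}
  mtls:
    mode: PERMISSIVE
EOF
}

# Example usage: review the output first, then apply it.
#   render_permissive_peerauth <mesh-namespace> <app-name> | kubectl apply -f -
```

Scoping the policy with a selector keeps the rest of the namespace in STRICT mode, which limits how much of the SC-8 control is relaxed.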
Istiod certificate rotation failure
- Check Istiod logs for certificate errors:
kubectl logs -n istio-system deployment/istiod --tail=200 | grep -i "cert\|ca\|root"
- Check if the Istio root CA secret exists:
kubectl get secret istio-ca-secret -n istio-system
- If certificates are expired, restart Istiod to trigger re-issuance:
kubectl rollout restart deployment istiod -n istio-system
- Then restart all application pods to get new certificates:
for ns in $(kubectl get namespaces -l istio-injection=enabled -o name); do
  kubectl rollout restart deployment -n "${ns##*/}" 2>/dev/null
done
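Restarting workloads before Istiod is back would leave the new proxies without a CA to request certificates from, so it may help to gate the loop above on the Istiod rollout. The sketch below shows one way to do that. The helper name restart_mesh_workloads and the 180s timeout are assumptions for illustration, not values from this platform.

```shell
# restart_mesh_workloads (hypothetical helper): waits for istiod to finish
# rolling out, then restarts the deployments in every injection-enabled namespace.
restart_mesh_workloads() {
  # The 180s timeout is an arbitrary choice for this sketch.
  if ! kubectl rollout status deployment/istiod -n istio-system --timeout=180s; then
    echo "istiod is not ready; skipping workload restarts" >&2
    return 1
  fi
  for ns in $(kubectl get namespaces -l istio-injection=enabled -o name); do
    kubectl rollout restart deployment -n "${ns##*/}" \
      || echo "restart failed in namespace ${ns##*/}" >&2
  done
}

# Example usage (requires cluster access):
#   restart_mesh_workloads
```

Unlike the loop above, this version reports failed restarts on stderr instead of discarding them, which makes it easier to see which namespaces still need attention.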
Istio sidecar injection webhook failure
- Check the webhook configuration:
kubectl get mutatingwebhookconfigurations istio-sidecar-injector -o yaml
- Verify the webhook service is reachable:
kubectl get svc -n istio-system istiod
kubectl get endpoints -n istio-system istiod
- If the webhook is down, restart Istiod:
kubectl rollout restart deployment istiod -n istio-system
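The endpoints check above can be turned into a pass/fail test, which is handy when deciding whether the restart is needed. This is a minimal sketch; the helper name istiod_ready_endpoints is hypothetical, and it relies on the Endpoints object listing only ready pods under subsets[].addresses.

```shell
# istiod_ready_endpoints (hypothetical helper): prints the ready endpoint IPs
# behind the istiod service and fails when there are none.
istiod_ready_endpoints() {
  ips=$(kubectl get endpoints istiod -n istio-system \
    -o jsonpath='{.subsets[*].addresses[*].ip}')
  if [ -z "$ips" ]; then
    echo "istiod has no ready endpoints; the injection webhook cannot be served" >&2
    return 1
  fi
  echo "$ips"
}

# Example usage (requires cluster access):
#   istiod_ready_endpoints || kubectl rollout restart deployment istiod -n istio-system
```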
Prevention
- Verify sidecar injection is working after any Istio upgrade
- Monitor istio_requests_total with response_code=503 for early detection of mTLS issues
- Monitor pilot_proxy_convergence_time for slow configuration propagation
- Ensure all tenant namespaces have the istio-injection: enabled label (enforced by Kyverno)
- Document which platform namespaces intentionally do NOT have Istio injection
- Test mTLS connectivity after PeerAuthentication or DestinationRule changes
- Keep Istiod and proxy versions in sync (currently 1.25.2)
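The version-sync item above could be checked by comparing sidecar image tags against the expected release. The following is a sketch under assumptions: the helper name proxies_not_at_version is hypothetical, it assumes sidecars are named istio-proxy and referenced by tag rather than by digest, and it treats a tag with a variant suffix such as 1.25.2-distroless as matching.

```shell
# proxies_not_at_version (hypothetical helper): reads lines of the form
#   <namespace>/<pod><TAB><istio-proxy image>
# on stdin and prints the pods whose image tag differs from the expected version.
# Pods with no istio-proxy container (empty image field) are skipped.
proxies_not_at_version() {
  awk -F'\t' -v want="$1" '
    $2 != "" {
      tag = $2
      sub(/^.*:/, "", tag)   # keep only the text after the last colon
      if (tag != want && index(tag, want "-") != 1) print $1 "\t" tag
    }'
}

# Example usage (requires cluster access):
#   kubectl get pods -A \
#     -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{"\t"}{.spec.containers[?(@.name=="istio-proxy")].image}{"\n"}{end}' \
#     | proxies_not_at_version 1.25.2
```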
Escalation
- If Istiod is completely down: this is a P1 -- no new proxies can connect, and certificate rotation stops
- If mTLS failures affect the Istio ingress gateway: all external traffic is affected -- escalate immediately
- If certificate rotation failure is cluster-wide: all proxy-to-proxy communication will eventually fail as certificates expire