
Runbook: Loki Ingestion Failure

Alert

Severity

Warning

Impact

Log ingestion failure means platform and application logs are being dropped. This creates gaps in observability and audit trails, violating the NIST AU-2 (Audit Events) and AU-4 (Audit Storage Capacity) controls.

Investigation Steps

  1. Check Loki pod status:
kubectl get pods -n logging -l app.kubernetes.io/name=loki
  2. Check Loki logs for errors:
kubectl logs -n logging -l app.kubernetes.io/name=loki --tail=200
  3. Check Alloy (log collector) pod status:
kubectl get pods -n logging -l app.kubernetes.io/name=alloy
  4. Check Alloy logs for delivery errors:
kubectl logs -n logging -l app.kubernetes.io/name=alloy --tail=200 | grep -i "error\|429\|500\|dropped\|failed"
  5. Check Loki metrics for the ingestion rate:
Prometheus query: rate(loki_distributor_bytes_received_total[5m])
  6. Check for rate limiting:
kubectl logs -n logging -l app.kubernetes.io/name=loki --tail=200 | grep -i "rate\|limit\|429"
  7. Check Loki storage (filesystem in SingleBinary mode):
kubectl exec -n logging $(kubectl get pod -n logging -l app.kubernetes.io/name=loki -o name | head -1) -- df -h /tmp/loki
  8. Check the HelmRelease status:
flux get helmrelease loki -n logging
flux get helmrelease alloy -n logging
  9. Check that Loki is reachable from Alloy:
kubectl exec -n logging $(kubectl get pod -n logging -l app.kubernetes.io/name=alloy -o name | head -1) -- wget -q -O- http://loki.logging.svc.cluster.local:3100/ready
  10. Check Loki readiness:
kubectl exec -n logging $(kubectl get pod -n logging -l app.kubernetes.io/name=loki -o name | head -1) -- wget -q -O- http://localhost:3100/ready
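
The error filter in step 4 can be sanity-checked locally before pointing it at live pods. A minimal sketch; the sample log lines below are made up for illustration, and in practice you would pipe real `kubectl logs` output into the function:

```shell
# Count error-like lines in Alloy output using the same pattern as step 4.
count_delivery_errors() {
  grep -ciE 'error|429|500|dropped|failed'
}

# Illustrative sample standing in for real `kubectl logs` output.
sample='ts=2024-01-01T00:00:00Z msg="batch sent" status=204
ts=2024-01-01T00:00:05Z msg="final error sending batch" status=429
ts=2024-01-01T00:00:10Z msg="dropped entries" count=12'

printf '%s\n' "$sample" | count_delivery_errors   # prints 2
```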

Resolution

Loki storage full

  1. Check current disk usage:
kubectl exec -n logging $(kubectl get pod -n logging -l app.kubernetes.io/name=loki -o name | head -1) -- du -sh /tmp/loki/chunks /tmp/loki/rules
  2. If the storage is full, reduce the retention period temporarily (retention is enforced by the Loki compactor, so deletions only occur when compactor retention is enabled):
# In the Loki HelmRelease values
loki:
  limits_config:
    retention_period: "168h"  # Reduce from 720h to 168h (7 days)
  3. Push the change to Git and reconcile:
flux reconcile helmrelease loki -n logging --with-source
  4. If using PVC storage, expand the PVC (if the StorageClass supports expansion):
kubectl edit pvc storage-loki-0 -n logging
  5. For immediate relief, restart Loki to trigger compaction:
kubectl rollout restart statefulset loki -n logging
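
The df output from step 1 of the investigation can be turned into an alertable check. A sketch; the sample df output is canned, and in practice you would pipe in the output of the kubectl exec command:

```shell
# Flag the Loki data volume as critical above a usage threshold.
# Feed this from: kubectl exec ... -- df -h /tmp/loki
usage_pct() {
  # Print the Use% column of the last df line, stripped of the % sign.
  awk 'END { gsub(/%/, "", $5); print $5 }'
}

# Canned sample standing in for real df output.
sample='Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        50G   47G  3.0G  95% /tmp/loki'

pct=$(printf '%s\n' "$sample" | usage_pct)
if [ "$pct" -ge 90 ]; then
  echo "CRITICAL: /tmp/loki is ${pct}% used"
fi
```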

Loki rate limiting (429 errors)

  1. Check the current rate limits:
kubectl get helmrelease loki -n logging -o yaml | grep -A 10 "limits_config"
  2. Increase the ingestion rate limits in the HelmRelease values:
loki:
  limits_config:
    ingestion_rate_mb: 10
    ingestion_burst_size_mb: 20
    per_stream_rate_limit: "5MB"
    per_stream_rate_limit_burst: "15MB"
  3. Alternatively, reduce log volume by filtering noisy namespaces in the Alloy configuration
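
For step 3, one way to drop a noisy namespace is a `stage.drop` block in Alloy's River configuration. This is a sketch under assumptions: the component labels (`filter_noise`, `loki.write.default`) and the namespace name are placeholders, and the block must be wired into the existing pipeline:

```
loki.process "filter_noise" {
  // Forward surviving entries to the existing write component.
  forward_to = [loki.write.default.receiver]

  // Drop entries whose namespace label matches the regex.
  stage.drop {
    source     = "namespace"
    expression = "noisy-namespace"
  }
}
```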

Alloy pods not collecting logs

  1. Check the Alloy DaemonSet:
kubectl get daemonset alloy -n logging
  2. Verify every node has an Alloy pod:
kubectl get pods -n logging -l app.kubernetes.io/name=alloy -o wide
  3. If a pod is missing, check node taints:
kubectl describe node <node-name> | grep Taints
  4. Restart the Alloy DaemonSet:
kubectl rollout restart daemonset alloy -n logging
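
A quick check that compares DESIRED with READY from step 1's output. The sample line is illustrative; in practice feed it from `kubectl get daemonset alloy -n logging --no-headers`:

```shell
# DaemonSet status columns: NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE ...
# Sample line standing in for live kubectl output.
sample='alloy   5   5   4   5   4   <none>   41d'

read -r _name desired _current ready _rest <<EOF
$sample
EOF

missing=$((desired - ready))
if [ "$missing" -gt 0 ]; then
  echo "alloy is missing $missing pod(s); check node taints"
fi
```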

Alloy cannot reach Loki

  1. Test connectivity from an Alloy pod:
kubectl exec -n logging $(kubectl get pod -n logging -l app.kubernetes.io/name=alloy -o name | head -1) -- wget -q -O- http://loki.logging.svc.cluster.local:3100/ready
  2. Check NetworkPolicies in the logging namespace:
kubectl get networkpolicies -n logging
  3. Verify the Loki Service exists and has endpoints:
kubectl get svc loki -n logging
kubectl get endpoints loki -n logging
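
Step 3's endpoint check can be scripted: `<none>` in the ENDPOINTS column means no ready Loki pods back the Service. A sketch; the sample line is illustrative, and in practice the input comes from `kubectl get endpoints loki -n logging --no-headers`:

```shell
# Exit non-zero if the ENDPOINTS column is "<none>".
has_endpoints() {
  awk '{ exit ($2 == "<none>") }'
}

# Sample line standing in for live kubectl output.
sample='loki   10.42.0.17:3100   41d'
if printf '%s\n' "$sample" | has_endpoints; then
  echo "loki service has ready endpoints"
else
  echo "loki service has NO endpoints; check pod readiness"
fi
```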

Loki pod crash-looping

  1. Check logs from the previous crash:
kubectl logs -n logging -l app.kubernetes.io/name=loki --previous --tail=100
  2. Common causes:
  - Out of memory (check resource limits)
  - Corrupted WAL (write-ahead log)
  - Schema migration failure after upgrade
  3. If WAL corruption is suspected:
# Delete the WAL directory (data since the last flush will be lost)
kubectl exec -n logging $(kubectl get pod -n logging -l app.kubernetes.io/name=loki -o name | head -1) -- rm -rf /tmp/loki/wal
kubectl rollout restart statefulset loki -n logging
  4. If out of memory, increase the memory limits in the HelmRelease
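
To separate the OOM case from the WAL case quickly, check the container's last termination reason. A sketch: the kubectl command in the comment is the live version (the pod name `loki-0` assumes the default StatefulSet naming); a canned value stands in for its output here:

```shell
# Why did the Loki container last die? In a live cluster:
#   kubectl get pod loki-0 -n logging \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
# Canned value standing in for that output.
last_reason='OOMKilled'

case "$last_reason" in
  OOMKilled) echo "raise memory limits in the HelmRelease" ;;
  Error)     echo "inspect --previous logs for WAL or schema errors" ;;
  *)         echo "no prior termination recorded" ;;
esac
```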

High cardinality labels causing memory pressure

  1. Check the number of active streams (a direct proxy for label cardinality):
Prometheus query: sum(loki_ingester_memory_streams)
  2. Identify namespaces producing the most log volume:
LogQL query in Grafana: sum by (namespace) (rate({namespace=~".+"}[5m]))
  3. Add label drop rules in the Alloy configuration for high-cardinality labels that are not useful

Emergency: log data gap recovery

If logs were lost during the outage:

  1. Check whether the Alloy pods buffered logs locally (they do not persist to disk by default, so logs generated during the outage are lost)
  2. For critical audit events, check the Kubernetes API audit log on the server node:
ssh sre-admin@<server-ip> "sudo tail -500 /var/lib/rancher/rke2/server/logs/audit.log"
  3. Check node journals for events during the gap:
ssh sre-admin@<node-ip> "sudo journalctl --since '<start-time>' --until '<end-time>'"

Prevention

Escalation