By IT Defined Team | April 20, 2026
Real-world Kubernetes troubleshooting scenarios — CrashLoopBackOff, ImagePullBackOff, networking, OOMKilled, and 22 more. With kubectl commands and root causes.
Why this playbook exists
I've been running Kubernetes troubleshooting workshops for IT Defined's batches for two years now. We have a 200-student program where we throw broken clusters at people and time how long they take to fix things. Patterns emerged.
Most failures aren't novel. The same 25-30 failure modes account for 90% of real-world Kubernetes incidents. Pod stuck Pending. Image won't pull. Service can't be reached. OOM kills. DNS broken.
If you can confidently debug these 26 scenarios, you'll handle most production incidents. If you can't — well, you'll learn fast in your first on-call rotation, but better to learn here.
Each scenario has the symptom, the diagnosis commands, the most likely root causes, and the fix. I'll skip the basics like "what is a pod" — this is for people who already know Kubernetes basics and want to get better at debugging.
Scenario 1: CrashLoopBackOff
Symptom: kubectl get pods shows STATUS as CrashLoopBackOff with restart count climbing.
Diagnosis:
kubectl describe pod POD_NAME
kubectl logs POD_NAME --previous
Likely causes: app crashes on startup (config error, missing env var, can't connect to DB), liveness probe too aggressive (kills app before it's ready), command/args misconfigured in the manifest.
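If the logs are empty, it helps to confirm what the container was actually asked to run and what exit code it died with. A quick sketch, assuming nothing beyond a pod name placeholder:
# what the container is configured to run
kubectl get pod POD_NAME -o jsonpath='{.spec.containers[0].command} {.spec.containers[0].args}'
# exit code of the last failed attempt
kubectl get pod POD_NAME -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
Exit code 127 usually means the command or entrypoint doesn't exist in the image; 137 means the container was killed (SIGKILL), often by the OOM killer.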
Fix: read the previous container's logs. The reason is usually right there. If logs are empty, the container died before logging — check if the entrypoint exists, check command and args.
Scenario 2: ImagePullBackOff or ErrImagePull
Symptom: pod stuck, image pull keeps failing.
Diagnosis: kubectl describe pod, look at the events at the bottom.
Likely causes: image name typo, image doesn't exist in the registry, registry credentials missing or expired (imagePullSecrets), wrong region (ECR is regional), node IAM role can't pull from ECR.
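To see the exact image string the node is trying to pull, and the pull errors in order, something like this is enough (POD_NAME is a placeholder):
# the image reference as the kubelet sees it
kubectl get pod POD_NAME -o jsonpath='{.spec.containers[*].image}'
# pull events for just this pod, newest last
kubectl get events --field-selector involvedObject.name=POD_NAME --sort-by=.lastTimestamp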
Fix: copy the image name from the manifest, run docker pull manually from a workstation that has registry access. If that works, it's a node permission issue. If it doesn't, the image really isn't there.
Scenario 3: Pod stuck Pending forever
Symptom: pod created, never schedules to a node.
Diagnosis: kubectl describe pod and look at events. Common messages: "0/3 nodes available: insufficient cpu" or "didn't match node selector."
Likely causes: insufficient cluster capacity (need to scale nodes), resource requests too high, taints/tolerations mismatch, node selector pointing to nonexistent labels, PVC not bound.
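To compare what the pod is asking for with what the nodes still have left, a rough check (POD_NAME is a placeholder):
# requests the scheduler has to satisfy
kubectl get pod POD_NAME -o jsonpath='{.spec.containers[*].resources.requests}'
# per-node view of what's already committed
kubectl describe nodes | grep -A 8 'Allocated resources'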
Fix: check kubectl describe nodes for available resources. If maxed out, autoscale. If selectors don't match, fix the manifest.
Scenario 4: OOMKilled
Symptom: pod restarting, kubectl describe shows "Last State: Terminated, Reason: OOMKilled."
Likely causes: container exceeded its memory limit, JVM not configured for container limits, memory leak.
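To confirm the kill reason and see the current limit before deciding whether to raise it, a minimal check (POD_NAME is a placeholder; kubectl top needs metrics-server):
# why the last container instance died
kubectl get pod POD_NAME -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
# the limit it was killed against
kubectl get pod POD_NAME -o jsonpath='{.spec.containers[0].resources.limits}'
# live usage, to see how close the replacement is running
kubectl top pod POD_NAME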
Fix: increase memory limits if the workload genuinely needs more. For Java apps, set -Xmx properly or use -XX:MaxRAMPercentage. Investigate memory leaks in the application code.
Scenario 5: Service unreachable ("connection refused")
Symptom: another pod tries to call your service, gets connection refused.
Diagnosis:
kubectl get svc
kubectl get endpoints SVC_NAME
kubectl describe svc SVC_NAME
Likely causes: no endpoints (selector doesn't match any pod labels), pod not listening on the port the service expects, network policy blocking traffic.
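The fastest way to spot a selector mismatch is to print the service's selector next to the pod labels and compare them by eye (SVC_NAME is a placeholder):
# labels the service is looking for
kubectl get svc SVC_NAME -o jsonpath='{.spec.selector}'
# labels the pods actually carry
kubectl get pods --show-labels
If no pod carries every label in the selector, the endpoints list stays empty and every connection gets refused or times out.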
Fix: 99% of the time it's a label selector mismatch. The service's selector has to match the pod's labels exactly.
Scenario 6: DNS resolution failing inside the cluster
Symptom: pod can't resolve service.namespace.svc.cluster.local.
Diagnosis: kubectl exec into the pod, run nslookup or dig. Check kubectl get pods -n kube-system for CoreDNS pods.
Likely causes: CoreDNS pods crashed or not ready, NetworkPolicy blocking DNS, /etc/resolv.conf misconfigured in the pod.
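If the broken pod's image has no DNS tools, a throwaway pod works just as well, and the CoreDNS logs usually explain the rest. A sketch, assuming the standard k8s-app=kube-dns label and a busybox tag of your choice:
# test resolution from inside the cluster
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 -- nslookup kubernetes.default.svc.cluster.local
# recent CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50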
Fix: restart CoreDNS pods if they're misbehaving. On EKS, sometimes the CoreDNS deployment needs to be scaled up — defaults are too low for busy clusters.
Scenario 7: Ingress returns 502 Bad Gateway
Symptom: ALB or NGINX ingress returns 502.
Likely causes: backend pod is down, target group health check failing, port mismatch between service and pod, slow startup so the ALB marks targets unhealthy.
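Before blaming the load balancer, confirm the service has healthy endpoints and the ports line up (names are placeholders):
# does the service resolve to any pods at all?
kubectl get endpoints SVC_NAME
# port/targetPort the service forwards to
kubectl get svc SVC_NAME -o jsonpath='{.spec.ports}'
# port the container actually exposes
kubectl get pod POD_NAME -o jsonpath='{.spec.containers[0].ports}'
The service's targetPort has to match the port the app really listens on, not just the one declared in the manifest.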
Fix: check target group health in the AWS console. If pods are unhealthy, fix the readiness probe. If pods are healthy but you still get 502, check the service-to-pod port mapping.
Scenario 8: Persistent Volume Claim stuck Pending
Symptom: PVC won't bind, pods waiting for it can't start.
Likely causes: no StorageClass set, StorageClass has the wrong provisioner, EBS CSI driver not installed, region mismatch, missing IAM permissions for the EBS CSI driver.
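Three quick checks before touching IAM, assuming an EKS cluster with the EBS CSI driver (PVC_NAME is a placeholder):
# is there a StorageClass, and is one marked (default)?
kubectl get storageclass
# the PVC's events say exactly what the provisioner rejected
kubectl describe pvc PVC_NAME
# is the CSI driver even running?
kubectl get pods -n kube-system | grep ebs-csi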
Fix on EKS: install the EBS CSI driver as an EKS add-on. Make sure the service account has the right IAM role via IRSA. Verify with kubectl describe pvc — events tell you what failed.
Scenario 9: Node Not Ready
Symptom: kubectl get nodes shows NotReady.
Likely causes: kubelet crashed, container runtime issue, disk pressure, network plugin failure, memory pressure on the node.
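The node's condition flags usually narrow it down before you even log in (NODE_NAME is a placeholder):
# MemoryPressure / DiskPressure / PIDPressure / Ready flags
kubectl describe node NODE_NAME | grep -A 10 'Conditions:'
# recent node-level events, newest last
kubectl get events --field-selector involvedObject.name=NODE_NAME --sort-by=.lastTimestamp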
Fix: SSH to the node (or use SSM Session Manager). Check journalctl -u kubelet. Check disk usage. Often it's disk full from log accumulation.
Scenario 10: HorizontalPodAutoscaler not scaling
Symptom: traffic spike, pods don't scale up.
Likely causes: metrics-server not installed or broken, HPA targeting CPU but pod has no CPU requests set, max replicas reached, scale-up policy too conservative.
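A rough checklist, assuming metrics-server is installed under its usual name in kube-system (DEPLOY_NAME is a placeholder):
# current vs target; <unknown> means no metrics
kubectl get hpa
# does the metrics pipeline work at all?
kubectl top pods
kubectl get deployment metrics-server -n kube-system
# HPA percentages need CPU requests on the pods
kubectl get deployment DEPLOY_NAME -o jsonpath='{.spec.template.spec.containers[0].resources.requests}'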
Fix: kubectl get hpa shows current vs target metrics. If <unknown> appears under metrics, metrics-server is broken. If pods don't have requests set, HPA can't compute percentage.
Scenario 11: Helm upgrade stuck
Symptom: helm upgrade hangs, never completes, or fails with timeout.
Diagnosis: kubectl get pods --watch in another terminal.
Likely causes: a pod can't start (one of the earlier scenarios), a pre-upgrade or pre-install hook failing, --wait flag waiting for resources that won't become ready.
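Helm's own status commands tell you whether the release is genuinely stuck or just waiting on a pod (RELEASE_NAME is a placeholder):
# release revisions and their statuses
helm history RELEASE_NAME
helm status RELEASE_NAME
# cluster-side events, newest last
kubectl get events --sort-by=.lastTimestamp | tail -20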
Fix: helm rollback if you need to. Then debug the actual pod failure. Don't blame Helm — Helm is just waiting for Kubernetes.
Scenario 12: kubectl exec fails
Symptom: "Unable to connect to the server" or similar when execing into a pod.Likely causes: kubeconfig expired, EKS auth token expired, network issue between you and the cluster, pod doesn't have a shell installed (alpine images sometimes lack bash).
Fix: aws eks update-kubeconfig to refresh credentials. If that's fine, try kubectl exec with sh instead of bash.
Scenario 13: Image too large, slow startup
Symptom: pod takes 5+ minutes to pull the image.
Likely causes: Dockerfile not optimized — single-stage build, includes build tools in the final image, copies the entire repo.
Fix: multi-stage builds, alpine or distroless base images. Get the final image under 200MB if possible.
Scenario 14: ConfigMap update not reflected in pod
Symptom: you updated a ConfigMap, the pod still uses old values.
Likely causes: ConfigMap values consumed as env vars only change on pod restart; even when mounted as a volume, the app may have cached the values at startup.
Fix: kubectl rollout restart deployment/X. There's no automatic propagation. Use a tool like Reloader if you want auto-restart on ConfigMap change.
Scenario 15: Secret rotation breaks pods
Symptom: rotated the DB password, pods now can't connect.
Likely causes: pods cached the old credentials, secret updates don't propagate to running pods immediately.
Fix: rolling restart the deployment after rotation. For zero-downtime rotations, use AWS Secrets Manager with the secrets-store-csi-driver and configure auto-rotation properly.
Scenario 16: Network Policy blocking traffic
Symptom: services that worked yesterday can't reach each other.
Likely causes: someone added a NetworkPolicy that's too restrictive, default-deny policy without explicit allows.
Fix: kubectl get networkpolicy in the namespace. Review what's allowed. Add explicit egress rules for kube-dns at minimum.
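As a reference shape, a minimal egress rule that keeps DNS working under a default-deny policy might look roughly like this. This is a sketch: the namespace name is a placeholder, and it assumes CoreDNS runs in kube-system with the standard k8s-app=kube-dns label:
kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: my-namespace        # placeholder namespace
spec:
  podSelector: {}                # applies to every pod in the namespace
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
EOF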
Scenario 17: PodDisruptionBudget blocks node drain
Symptom: trying to drain a node for upgrade, kubectl drain hangs.
Likely causes: PDB requires N pods available, draining would violate it, deployment only has 1 replica.
Fix: scale up the deployment temporarily, or adjust the PDB. For single-replica deployments, you can't have a PDB that requires availability — that's a bug in the manifest.
Scenario 18: ClusterAutoscaler not provisioning new nodes
Symptom: pods stuck Pending due to insufficient resources, but no new nodes are coming up.
Likely causes: ASG max reached, autoscaler IAM permissions wrong, scale-up disabled, or errors visible in the autoscaler logs.
Fix: check kubectl logs -n kube-system deploy/cluster-autoscaler. On EKS in 2026, prefer Karpenter over cluster-autoscaler — it's more flexible.
Scenario 19: kube-proxy or CNI plugin issues
Symptom: weird intermittent connectivity, packet loss.
Likely causes: kube-proxy crashed, IPAM exhausted (the AWS VPC CNI is sensitive to this on EKS), conflicting CNI configurations.
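On EKS, the aws-node daemonset (the VPC CNI) is the first thing to check, including whether prefix delegation is on. A sketch, assuming the default labels and env var names used by the AWS VPC CNI:
# are the CNI pods healthy?
kubectl get pods -n kube-system -l k8s-app=aws-node
# is prefix delegation already enabled?
kubectl describe daemonset aws-node -n kube-system | grep ENABLE_PREFIX_DELEGATION
# turn it on (needs Nitro instance types; generally only helps nodes launched afterwards)
kubectl set env daemonset aws-node -n kube-system ENABLE_PREFIX_DELEGATION=true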
Fix: kubectl get pods -n kube-system -l k8s-app=kube-proxy. Check logs. On EKS, IP exhaustion often requires enabling prefix delegation or moving to a CIDR with more IPs.
Scenario 20: Job stuck or won't complete
Symptom: a Job created pods that finished, but the Job itself shows incomplete.
Likely causes: completions field misconfigured, parallel pods needed but only 1 ran, backoffLimit exceeded.
Fix: kubectl describe job shows the reason. Most common is backoffLimit reached because pods kept failing — fix the underlying pod failure first.
Scenario 21: ServiceAccount can't access AWS resource
Symptom: pod with IRSA gets AccessDenied calling AWS APIs.
Likely causes: trust policy on the IAM role doesn't include the right service account, eksctl/Terraform misconfigured the OIDC provider, pod not actually using the SA.
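Two checks catch most IRSA problems: does the service account carry the role annotation, and did the pod actually get the injected credentials (names are placeholders):
# role annotation on the service account
kubectl get sa SA_NAME -o yaml | grep role-arn
# env vars the IRSA webhook injects into the pod
kubectl exec POD_NAME -- env | grep -E 'AWS_ROLE_ARN|AWS_WEB_IDENTITY_TOKEN_FILE'
If those env vars are missing, the pod isn't using the annotated service account, or it was created before the annotation existed and needs a restart.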
Fix: kubectl describe pod, check the SA name. kubectl get sa -o yaml, check the eks.amazonaws.com/role-arn annotation. Then check that role's trust policy.
Scenario 22: Webhook admission controller blocking deploys
Symptom: every kubectl apply fails with a webhook error.
Likely causes: admission webhook pod is down, certificate expired, network issue between the API server and the webhook.
Fix: kubectl get validatingwebhookconfiguration, kubectl get mutatingwebhookconfiguration. Check if the backing pods are healthy. If a webhook is broken and blocking everything, you can temporarily delete the webhook config (carefully).
Scenario 23: HPA scaled to 0, app dies
Symptom: deployment scaled to 0 replicas overnight.
Likely causes: scale-to-zero misconfigured, KEDA wrongly scaled to zero, manual scaling that wasn't documented.
Fix: kubectl describe hpa shows scaling events. Check audit logs for who scaled what.
Scenario 24: Liveness probe killing healthy pods
Symptom: pods restart constantly even though the app is fine.
Likely causes: probe too aggressive — initialDelaySeconds too short, app needs longer to start, probe endpoint too expensive (queries the DB on every check).
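If the app just needs more time to boot, a startup probe is usually cleaner than stretching initialDelaySeconds everywhere. A hedged sketch with kubectl patch; the deployment name, endpoint path, and port are placeholders to adapt:
# gives the app up to 30 x 10s = 300s to come up before liveness checks start
kubectl patch deployment DEPLOY_NAME --type=json -p='[{"op":"add","path":"/spec/template/spec/containers/0/startupProbe","value":{"httpGet":{"path":"/healthz","port":8080},"failureThreshold":30,"periodSeconds":10}}]'
While the startup probe is still failing, liveness and readiness checks are held off, so slow starts stop triggering restart loops.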
Fix: increase initialDelaySeconds. Use a separate /healthz endpoint that doesn't query the DB. Use startup probes for slow-starting apps.
Scenario 25: PV stuck after PVC deletion
Symptom: deleted a PVC, the underlying PV is stuck Released and can't be reused.
Likely causes: reclaim policy is Retain, finalizers preventing cleanup.
Fix: edit the PV, remove finalizers if needed, change the reclaim policy or just delete the PV manually. The underlying EBS volume still needs to be deleted separately if you don't want the bill.
Scenario 26: Cluster upgrade breaks workloads
Symptom: upgraded EKS from 1.28 to 1.29, things broke.
Likely causes: deprecated APIs removed in the new version, deprecated container runtimes (dockershim was removed in 1.24), CNI version compatibility, manifest using removed fields.
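A deprecation scan with pluto (mentioned in the fix below) looks roughly like this; the directory path and target version are placeholders:
# scan static manifests for APIs removed in the target version
pluto detect-files -d ./manifests --target-versions k8s=v1.29.0
# scan what's actually deployed via Helm
pluto detect-helm -o wide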
Fix: always run kubectl-deprecations or pluto BEFORE upgrading. Read the Kubernetes release notes. Test in staging first. Always.
How to use this playbook
Bookmark it. When you hit a real incident, search this page for keywords from the symptom. We've covered most of the day-to-day stuff.
If you want to actually practice these scenarios in a safe environment, our Kubernetes troubleshooting labs at IT Defined are exactly this — broken clusters with planted issues, you fix them under time pressure. It's the closest thing to real on-call experience without the 3am alerts.
Frequently asked questions
How do I get better at Kubernetes troubleshooting?
Practice on broken clusters. Tools like KillerCoda, KodeKloud, or just intentionally breaking your own cluster. Reading docs only gets you so far.
What's the most common cause of incidents in production Kubernetes?
Honestly, configuration errors. Misconfigured probes, wrong selectors, resource limits set too low. Pure infrastructure failures are rarer than human errors.
Should I memorize all 26 scenarios?
No. Understand the diagnostic patterns. kubectl describe pod, kubectl logs, kubectl get events. Once you internalize how to investigate, the specific scenarios become recognizable patterns.
Is k9s helpful for troubleshooting?
Yes, a lot. It's a TUI for kubectl. Faster navigation, real-time updates. Use it.
About IT Defined
IT Defined is a software training institute in Whitefield, Bangalore, offering hands-on programs in AWS DevOps, Full-Stack MERN, Python, and Cybersecurity. We've trained over 2,000 students with live projects, mock interviews, and placement support.
Visit: itdefined.org | Phone: +91 6363730986 | Email: info@itdefined.org