AWS DevOps Interview Questions and Answers (60 Real Questions from 2026 Hiring)

By IT Defined Team

April 30, 2026 | Interview Questions

60 real AWS DevOps interview questions from 2026 hiring rounds in Bangalore — fundamentals, scenario-based, and hands-on. Categorized by experience level.

By IT Defined Team | April 12, 2026

How to use this guide

We track the interview questions our students are asked across hundreds of interviews each year. This list is the 60 most common as of the 2026 hiring rounds, organized by experience level. The answers here aren't exhaustive: interview answers should take 60-90 seconds, not read like essays, so we've kept them short.

Don't memorize. Understand. Most interviewers can tell when you're regurgitating.

Fundamentals (0-2 years experience) — Questions 1-20

Q1: What is DevOps and why does it matter?

DevOps is a practice combining development and operations to ship software faster and more reliably. It matters because companies that can ship daily outcompete companies that ship monthly. Three pillars: culture (collaboration), automation (CI/CD, IaC), measurement (metrics, observability).

Q2: Explain the difference between IaaS, PaaS, and SaaS.

IaaS gives you virtual machines and networking — you manage everything above (EC2, S3). PaaS gives you a runtime — you push code and it runs (Elastic Beanstalk, Heroku). SaaS gives you a complete app (Gmail, Salesforce). DevOps mostly works with IaaS.

Q3: What's the difference between continuous integration and continuous delivery?

CI is automatically building and testing code on every commit. CD extends that to making releases automatic — either continuous deployment (every passing build goes to prod) or continuous delivery (every passing build is releasable, but releases are gated).

Q4: Walk me through what happens when you type kubectl apply -f deployment.yaml.

kubectl reads the YAML and sends it to the API server, which validates it and stores it in etcd. The Deployment controller notices the new resource and creates a ReplicaSet, and the ReplicaSet controller creates the Pods. The scheduler assigns each Pod to a node. The kubelet on that node pulls the image and starts the container.

Q5: What's the difference between a Deployment and a StatefulSet?

Deployments are for stateless apps. Pod names get random suffixes and pods are interchangeable. StatefulSets are for stateful apps: pods get stable names (web-0, web-1), stable storage that survives pod restarts, and ordered deployment and scaling. Use a StatefulSet for databases, message queues, anything where identity matters.

Q6: What is a Docker image vs a Docker container?

An image is the read-only template — your code plus dependencies plus runtime. A container is a running instance of an image. Image is the class, container is the object.

Q7: How does a multi-stage Docker build help?

Multi-stage builds let you use one image for building (with compilers, dev tools) and a different smaller image for the final runtime. Final image is much smaller — 50MB instead of 1GB — which means faster pulls, smaller attack surface, lower storage cost.

Q8: What's the difference between a public subnet and a private subnet?

A public subnet has a route to an Internet Gateway, so resources in it that have a public IP can be reached from the internet (if security groups allow). A private subnet has no IGW route: resources can reach the internet only via a NAT Gateway, and the internet can't reach them directly.

Q9: Explain IAM users vs roles.

Users are for permanent identities (humans, applications with long-lived keys). Roles are for temporary credentials assumed via STS — used by EC2 instances, Lambda functions, federated identities. Best practice: roles for everything that supports them.

Q10: What's a security group vs a NACL?

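Security groups are stateful (return traffic is allowed automatically), attached to instances/ENIs, allow rules only. NACLs are stateless (return traffic needs its own rule), attached to subnets, allow and deny rules. Most of the time you'll use security groups; NACLs are an occasional extra layer, for example to block a specific IP range for a whole subnet.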

Q11: What is Terraform state and why does it matter?

State is Terraform's record of what infrastructure it manages and what attributes those resources have. Without state, Terraform doesn't know what exists. Lose state = Terraform thinks nothing exists, will try to recreate everything. Always use remote state with locking in production.

Q12: What's the difference between terraform plan and terraform apply?

plan shows you what changes will be made without making them. apply makes the changes. Always plan before apply in production, ideally as part of a code review.

Q13: What's a CI/CD pipeline?

An automated workflow that takes source code from commit to production. Typical stages: build, unit test, integration test, security scan, package, deploy to staging, smoke test, deploy to prod, post-deploy verification.

Q14: How do you handle secrets in CI/CD?

Never commit them to Git. In CI, use the platform's secret store (GitHub secrets, GitLab CI/CD variables). In production, use AWS Secrets Manager or SSM Parameter Store, retrieved at runtime. Rotate regularly. Use OIDC federation instead of long-lived access keys when possible.

Q15: What is observability and how does it differ from monitoring?

Monitoring tells you when something is wrong (predefined metrics and alerts). Observability lets you ask why it's wrong (rich data — metrics, logs, traces — that you can query in unforeseen ways). Three pillars: metrics, logs, traces.

Q16: Explain blue-green deployment.

You have two identical production environments. New version goes to the idle one (green). Smoke test it. Switch the load balancer to point at green. If problems, switch back to blue instantly. Zero-downtime deploys, easy rollbacks. Costs more (2x infrastructure).

Q17: What's the difference between a process and a thread?

A process is an isolated execution context with its own memory. A thread runs inside a process and shares memory with other threads. Containers are processes with isolation features (Linux namespaces and cgroups), which is why process-level debugging skills still apply to them.

Q18: What's the OSI model? Tell me about layer 4 vs layer 7.

OSI is the 7-layer network model. Layer 4 is transport (TCP, UDP) — load balancers at this layer route by IP and port. Layer 7 is application (HTTP, gRPC) — load balancers at this layer can route by URL, headers, etc. ALB is L7, NLB is L4.

Q19: What does CIDR notation mean? What's 10.0.0.0/16?

CIDR is how you write IP ranges. 10.0.0.0/16 means the first 16 bits are fixed (10.0) and the remaining 16 are variable. So this range covers 10.0.0.0 through 10.0.255.255 — 65,536 addresses. /24 would be 256 addresses, /28 would be 16.
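
If the interviewer asks you to verify the arithmetic, a quick check with Python's standard ipaddress module looks like this. It is a minimal sketch; the /24 and /28 blocks are illustrative examples, not taken from any real VPC.

```python
import ipaddress

def address_count(cidr: str) -> int:
    """Total addresses in the block: 2 ** (32 - prefix length) for IPv4."""
    return ipaddress.ip_network(cidr).num_addresses

# The /16 from the question, plus two illustrative smaller blocks.
for cidr in ("10.0.0.0/16", "10.0.1.0/24", "10.0.1.0/28"):
    net = ipaddress.ip_network(cidr)
    print(cidr, net.network_address, net.broadcast_address, address_count(cidr))
```

These are total addresses. In an AWS VPC subnet, five addresses per subnet are reserved, so the usable count is slightly lower.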

Q20: What is GitOps?

An operational model where Git is the source of truth for infrastructure and applications. Changes happen by committing to Git; an operator (ArgoCD, Flux) reconciles the cluster state to match Git. Better audit trail, easier rollbacks, separation between CI and deploys.
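
The reconcile step can be pictured as a diff between what Git declares and what is running. This is a toy sketch, not how ArgoCD or Flux are implemented; the plan_reconcile function and the app names are hypothetical.

```python
def plan_reconcile(desired: dict, live: dict) -> dict:
    """Diff the state declared in Git against the live cluster state."""
    return {
        "create": sorted(k for k in desired if k not in live),
        "update": sorted(k for k in desired if k in live and desired[k] != live[k]),
        # Deleting things that are absent from Git is "pruning";
        # real tools often make it opt-in as a safety measure.
        "delete": sorted(k for k in live if k not in desired),
    }

# Hypothetical desired state (from Git) and live state (from the cluster).
desired = {
    "web": {"image": "web:1.4", "replicas": 3},
    "worker": {"image": "worker:2.0", "replicas": 2},
}
live = {
    "web": {"image": "web:1.3", "replicas": 3},
    "cron": {"image": "cron:0.9", "replicas": 1},
}
print(plan_reconcile(desired, live))
```

A rollback in this model is a Git revert: the desired state changes and the same diff brings the cluster back.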

Scenario-based (2-5 years experience) — Questions 21-40

Q21: A production deployment failed and the rollback also failed. What do you do?

Stop the bleeding first — divert traffic to a healthy environment if possible. Then debug methodically. Check what state the system is in (mid-rollback? partial deploy?). Don't make it worse. Communicate to stakeholders. Once stable, do a proper postmortem. Don't fix and forget.

Q22: How would you design a CI/CD pipeline for a microservices app with 50 services?

Monorepo or polyrepo decision first. If polyrepo, each service has its own pipeline triggered on its repo changes. If monorepo, use path filters. Shared steps (linting, security scans) factored into reusable workflows. Deploy via GitOps (ArgoCD) so pipeline doesn't need cluster access. Coordinate cross-service changes via change windows or feature flags, not pipeline orchestration.

Q23: Your EKS cluster's CPU is at 90% but pods are still being scheduled. What's wrong?

Pods are scheduled based on requests, not actual usage. So if pods request 100m CPU but use 800m, the scheduler thinks there's room when there isn't. Fix: set realistic requests, use VPA recommendations, monitor actual vs requested.
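
A simplified model of the scheduler's fit check makes the gap visible. This sketch assumes a hypothetical 4-vCPU node and invented pod names; the real scheduler considers far more than CPU.

```python
from dataclasses import dataclass

@dataclass
class Pod:
    name: str
    cpu_request_m: int  # what the scheduler sees (millicores)
    cpu_usage_m: int    # what the pod actually consumes (millicores)

def fits(node_allocatable_m: int, running: list, new_pod: Pod) -> bool:
    """Scheduler-style check: compares summed requests, never live usage."""
    requested = sum(p.cpu_request_m for p in running)
    return requested + new_pod.cpu_request_m <= node_allocatable_m

node_allocatable_m = 4000  # hypothetical 4-vCPU node
running = [Pod(f"api-{i}", cpu_request_m=100, cpu_usage_m=800) for i in range(4)]
new_pod = Pod("api-4", cpu_request_m=100, cpu_usage_m=800)

requested = sum(p.cpu_request_m for p in running)
used = sum(p.cpu_usage_m for p in running)
print(f"requested {requested}m, used {used}m of {node_allocatable_m}m")
print("scheduler admits new pod:", fits(node_allocatable_m, running, new_pod))
```

The node is 80% busy by usage but only 10% full by requests, so the new pod is admitted.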

Q24: Postmortem: a Lambda function caused a 4-hour outage. Walk me through how you'd run the postmortem.

Blameless first. Timeline: when did it start, when detected, when mitigated, when resolved. Root cause analysis with 5-whys. Contributing factors (not just the immediate cause). Action items with owners and dates. Share publicly within the company. Don't fire anyone.

Q25: How would you migrate a monolithic app from EC2 to EKS?

Strangler pattern. Start with simplest stateless components. Containerize, deploy alongside monolith, route some traffic to new version. Gradually migrate components. Keep the database where it is initially — that's the hardest move. Don't try to migrate everything at once.

Q26: Your Terraform state file got corrupted. How do you recover?

Hopefully you have S3 versioning enabled — restore previous version. If not, terraform import to rebuild state from existing resources. This is painful for large infrastructure. Lesson learned: enable versioning, take periodic backups.

Q27: A container is using 100% memory and getting OOMKilled. The app developer says "it must be a Kubernetes problem." How do you respond?

It's not a Kubernetes problem. The container is using all the memory it was allocated. Either the app has a memory leak, or the limit is set too low. Profile the application — most languages have heap dump tools. Probably a leak.

Q28: How do you handle disaster recovery for a multi-region deployment?

Define RPO and RTO with the business. RPO is how much data you can lose, RTO is how long downtime can be. Pilot light or warm standby in a second region. Database replication (Aurora Global Database, or cross-region RDS read replicas that you promote on failover). Automated failover with Route 53 health checks. Test the DR plan quarterly. Untested DR is fiction.

Q29: A service has 99.95% uptime. The SLA is 99.9%. Should you celebrate?

Yes, but cautiously. Look at the trend: is uptime improving or deteriorating? Look at the error budget too. At a 99.9% target, the monthly budget is about 43 minutes of downtime, and 99.95% uptime means roughly 22 of those minutes are already spent. Can you afford another incident? Don't get complacent.
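
The budget arithmetic is worth being able to do on a whiteboard. A minimal sketch, assuming a 30-day month:

```python
def downtime_budget_minutes(target: float, days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability target over `days`."""
    return (1 - target) * days * 24 * 60

budget = downtime_budget_minutes(0.999)   # the 99.9% commitment
used = downtime_budget_minutes(0.9995)    # downtime implied by 99.95% uptime
print(f"budget {budget:.1f} min, used {used:.1f} min, left {budget - used:.1f} min")
```

At 99.95% measured uptime, about half of the month's budget is already gone.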

Q30: How would you debug high latency that only appears in production?

Distributed tracing first (X-Ray, Jaeger). Find which service or call is slow. Then drill down: is it CPU, network, database, downstream service? Check recent changes (any deploys?). Compare to historical baselines. Production-only issues are often load-dependent — try to reproduce in staging with realistic load.

Q31: Your team's biggest cost is NAT Gateway. How do you reduce it?

VPC endpoints for AWS services (gateway endpoints for S3 and DynamoDB, interface endpoints for ECR) keep that traffic off the NAT Gateway. An ECR pull-through cache for Docker Hub images. Move chatty workloads to private endpoints. Sometimes VPC peering or PrivateLink is cheaper than NAT for cross-VPC traffic.

Q32: How would you implement zero-downtime database migration?

Backwards-compatible schema changes. Deploy the application that can write to both old and new schema. Migrate data in the background. Switch reads to new schema. Stop writing to old schema. Drop old schema. Each step can be rolled back individually.
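
The steps above can be sketched end to end with an in-memory SQLite database. The users table, its columns, and the sample names are hypothetical; a production backfill would run in batches against a live database rather than in one pass.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, fullname TEXT)")
db.execute("INSERT INTO users (fullname) VALUES ('Asha Rao')")  # written by the old app

# 1. Expand: an additive, backwards-compatible change. Old code keeps working.
db.execute("ALTER TABLE users ADD COLUMN first_name TEXT")
db.execute("ALTER TABLE users ADD COLUMN last_name TEXT")

# 2. Dual write: the new app version fills both the old and the new columns.
def create_user(first: str, last: str) -> None:
    db.execute(
        "INSERT INTO users (fullname, first_name, last_name) VALUES (?, ?, ?)",
        (f"{first} {last}", first, last),
    )

create_user("Ravi", "Kumar")

# 3. Backfill: migrate rows that predate the dual-write release.
def backfill() -> int:
    rows = db.execute(
        "SELECT id, fullname FROM users WHERE first_name IS NULL"
    ).fetchall()
    for user_id, fullname in rows:
        first, _, last = fullname.partition(" ")
        db.execute(
            "UPDATE users SET first_name = ?, last_name = ? WHERE id = ?",
            (first, last, user_id),
        )
    return len(rows)

backfill()

# 4. Switch reads to the new columns. Stopping writes to fullname and
#    dropping it happen later, each as its own release.
print(db.execute("SELECT first_name, last_name FROM users ORDER BY id").fetchall())
```

Because the backfill only touches rows where the new columns are still empty, it is safe to re-run after an interruption.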

Q33: A pipeline takes 45 minutes. How do you speed it up?

Profile first — where is the time going? Common wins: parallelize independent jobs, use better caching (Docker layer cache, dependency cache), use fewer steps, run tests in parallel, use larger runners for build-heavy stages, use incremental builds where possible.

Q34: How would you secure an EKS cluster?

Defense in depth. Cluster: private API endpoint, restricted via security groups. Nodes: Bottlerocket OS, no SSH, SSM only. Network policies on the Kubernetes side. RBAC tightly scoped, no default service account permissions. IRSA for AWS access. Secrets via External Secrets Operator. Image scanning in CI (Trivy). Runtime security (Falco). Regular CIS benchmarks. Audit logs centralized.

Q35: What metrics would you set SLOs on for a web service?

Availability (% of successful requests), latency (p99 response time under threshold), throughput (handling expected load). Maybe quality (error rate). Define each as a percentage over a time window. Tie alerts to error budget burn rate, not raw thresholds.
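
Burn rate is the observed error rate divided by the error rate the SLO allows. A minimal sketch, with hypothetical request counts for the last hour:

```python
def burn_rate(failed: int, total: int, slo: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    if total == 0:
        return 0.0
    error_rate = failed / total
    return error_rate / (1 - slo)

# Hypothetical last-hour counts for a service with a 99.9% availability SLO.
rate = burn_rate(failed=120, total=10_000, slo=0.999)
print(f"burn rate {rate:.1f}x")
```

A sustained burn rate well above 1 means the budget will run out before the window ends, which is a better paging signal than a fixed error-count threshold.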

Q36: How do you handle traffic for a service that has 100x spikes during sales events?

Predictable spikes — pre-scale. Schedule scaling 30 minutes before. Reserved capacity if needed. Unpredictable spikes — aggressive HPA with low scale-up threshold, plus cluster autoscaler or Karpenter for nodes. Caching aggressively. CDN for static. Queue-based architecture for async work to absorb bursts.
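
To see why a low scale-up threshold makes the HPA more aggressive, it helps to know the core scaling formula from the Kubernetes documentation: desired replicas = ceil(current replicas x current metric / target metric). The sketch below applies only that formula with min/max clamping; the real controller also applies a tolerance and stabilization windows, and the numbers are illustrative.

```python
import math

def desired_replicas(current: int, current_metric: float, target_metric: float,
                     min_replicas: int = 1, max_replicas: int = 100) -> int:
    """Core HPA formula, clamped to the configured replica bounds."""
    raw = math.ceil(current * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))

# 10 pods averaging 90% CPU: a 50% target scales out harder than an 80% target.
print(desired_replicas(current=10, current_metric=90, target_metric=50))
print(desired_replicas(current=10, current_metric=90, target_metric=80))
```

New pods still need nodes to land on, which is where cluster autoscaler or Karpenter comes in.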

Q37: Your colleague pushed access keys to GitHub. What do you do?

Rotate the keys immediately: disable them in IAM and generate new ones if needed. Force-pushing the commit out of history is not enough; assume the keys are compromised. Audit CloudTrail for suspicious activity from those keys. Revoke any sessions. Then post-mortem the process: why did this happen, and how do you prevent it?
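
On the prevention side, one common answer is a pre-commit or CI check that scans for key-shaped strings. The sketch below is deliberately simplified: it only matches long-lived IAM access key IDs, which start with AKIA, and uses the placeholder key from AWS documentation. Dedicated secret scanners cover many more patterns.

```python
import re

# Long-lived IAM user access key IDs are "AKIA" followed by 16 characters.
ACCESS_KEY_ID = re.compile(r"\b(AKIA[0-9A-Z]{16})\b")

def find_access_key_ids(text: str) -> list:
    """Return every string in `text` that looks like an access key ID."""
    return ACCESS_KEY_ID.findall(text)

# Placeholder key from AWS documentation, not a real credential.
sample = 'aws_access_key_id = "AKIAIOSFODNN7EXAMPLE"\nregion = "ap-south-1"\n'
print(find_access_key_ids(sample))
```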

Q38: How do you decide between EKS and ECS for a new project?

Team skills. Workload complexity. Multi-cloud requirements. Ecosystem needs (need ArgoCD, Istio? EKS). Operational capacity (small team without K8s experience? ECS). Don't pick based on what's trendy.

Q39: What's your approach to capacity planning?

Historical baseline. Growth projections from product/business. Buffer for spikes. Test at 2x expected peak. Reserved capacity for baseline. Spot or on-demand for burst. Re-evaluate quarterly. Cost vs availability tradeoffs are honest conversations with the business, not unilateral DevOps decisions.

Q40: Walk me through an incident you've handled.

Pick a real incident. Be honest about what went wrong. Explain timeline, what you did, what you learned. Don't pick something you handled perfectly — interviewers want to see how you think under pressure and what you learned.

Hands-on / Senior (5+ years) — Questions 41-60

Q41: Design a multi-account AWS setup for an organization.

AWS Organizations with management account. Separate accounts for prod, non-prod, security, logging, sandbox. SCPs for guardrails. Centralized logging via CloudTrail to dedicated logging account. Identity Center for SSO. Permission Sets per role. Networking — Transit Gateway for connectivity, or separate VPCs per account. Cost allocation tags. Reserved Instance sharing across the org.

Q42: Design a multi-region active-active architecture.

Global Accelerator or Route 53 latency routing for traffic distribution. Aurora Global Database for writes (single primary region with cross-region replicas). Or DynamoDB Global Tables for true multi-master. Stateless app tier in each region. S3 Cross-Region Replication for shared assets. Discuss the consistency tradeoffs honestly.

Q43: Design a CI/CD platform for 200 engineers.

Self-service onboarding. Standardized pipeline templates. Plugin model for team-specific needs. SOC2 controls — approval gates, audit logs, secrets management. Cost tracking per team. Good DX matters — slow pipelines damage productivity at scale.

Q44: Your service hits a sudden 503 surge from one specific region. Walk me through investigation.

Confirm scope — only one region, only some users, all? Check upstream dependencies in that region. Check provider status. CloudFront/Route 53 metrics. AWS service health. If it's one AZ, maybe an AZ failure (route around it). If the whole region, escalate.

Q45: How do you onboard a new engineer to your DevOps team?

Day 1: access, tools, runbooks. Week 1: shadow on-call. Month 1: own a small project end-to-end with mentorship. Month 3: take an on-call rotation. Documented runbooks are critical — if you have to verbally explain everything, that's a smell.

Q46: Design a backup strategy for a regulated industry.

RPO/RTO defined per data class. Automated backups with retention matched to compliance (often 7 years). Cross-region copies. Periodic restore tests — backups you've never restored aren't backups. Air-gapped or immutable backups for ransomware protection. Documentation. Audit trails.

Q47: A team wants to introduce a new database technology you haven't worked with. How do you evaluate?

Start with the actual problem — does the existing solution really not fit? Production criteria: operational maturity, available expertise, support, cost, scaling characteristics, recovery story. Run a small pilot. Avoid resume-driven decisions.

Q48: Build a chaos engineering practice for a 100-engineer org.

Start small — game days in non-prod, predetermined scenarios. Build muscle. Move to prod with conservative experiments. Tools like Chaos Mesh or AWS Fault Injection Simulator. Make it a regular practice, not a one-time event. Buy-in from leadership is critical — chaos that causes a customer-facing outage will end the program.

Q49: How do you measure DevOps team performance?

DORA metrics: deployment frequency, lead time for changes, change failure rate, MTTR. Plus organizational health: on-call burden, learning, employee satisfaction. Optimize for outcomes, not vanity metrics.
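
If asked how you would actually compute these, a rough sketch from deployment records looks like this. The Deploy record and the sample data are hypothetical; real numbers would come from your CI/CD and incident tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional

@dataclass
class Deploy:
    committed: datetime
    deployed: datetime
    failed: bool = False
    restored: Optional[datetime] = None  # when service was restored, if it failed

def dora(deploys: list, window_days: int) -> dict:
    """Compute the four DORA metrics from a non-empty list of deploys."""
    failures = [d for d in deploys if d.failed]
    restore_times = [d.restored - d.deployed for d in failures if d.restored]
    mttr = sum(restore_times, timedelta()) / len(restore_times) if restore_times else None
    return {
        "deploys_per_day": len(deploys) / window_days,
        "median_lead_time": median(d.deployed - d.committed for d in deploys),
        "change_failure_rate": len(failures) / len(deploys),
        "mean_time_to_restore": mttr,
    }

# Hypothetical four-day window with one failed deploy.
t0 = datetime(2026, 4, 1, 9, 0)
day = timedelta(days=1)
deploys = [
    Deploy(t0, t0 + timedelta(hours=2)),
    Deploy(t0 + day, t0 + day + timedelta(hours=4)),
    Deploy(t0 + 2 * day, t0 + 2 * day + timedelta(hours=6), failed=True,
           restored=t0 + 2 * day + timedelta(hours=7)),
    Deploy(t0 + 3 * day, t0 + 3 * day + timedelta(hours=8)),
]
metrics = dora(deploys, window_days=4)
print(metrics)
```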

Q50: Migrate from CloudFormation to Terraform with zero downtime.

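Use Terraformer or Former2 to generate Terraform from existing infra, then carefully validate that the generated code matches reality. Move resource by resource: set DeletionPolicy: Retain on the resource in CloudFormation, terraform import it, confirm that terraform plan shows no changes, then remove it from the CloudFormation stack so the stack lets go of it without deleting it. Don't try to do everything in one go. Months, not days.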

Q51: How do you handle a compromised IAM key in production?

Disable immediately. Generate replacement. Audit CloudTrail for unauthorized actions. Revoke any STS sessions. Notify security team. Force credential rotation across affected services. Post-mortem the process.

Q52: Design observability for a 100-microservices system.

Standardized logging format (JSON), shipped to centralized store. Distributed tracing with OpenTelemetry — every service instrumented, traces correlated by request ID. Metrics with consistent naming convention. Service catalog so engineers can find dashboards. Alert on SLO burn, not raw thresholds. Avoid alert fatigue ruthlessly.
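
A standardized JSON log line with a request ID can be sketched with the standard library alone. The field names (ts, level, service, request_id, message) and the checkout service are assumptions for illustration; in a real system the request ID would come from an incoming header or the tracing context.

```python
import json
import logging
import uuid

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "service": getattr(record, "service", "unknown"),
            "request_id": getattr(record, "request_id", None),
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("checkout")  # hypothetical service name
log.addHandler(handler)
log.setLevel(logging.INFO)

request_id = str(uuid.uuid4())  # stand-in for an ID taken from the request
log.info("payment authorized", extra={"service": "checkout", "request_id": request_id})
```

When every service emits the same fields, one query on request_id in the central store pulls up the whole request path.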

Q53: A senior leader says "we're overspending on AWS by 50%, fix it." Where do you start?

Data first. Cost Explorer breakdowns by service. CUR analysis. Identify top 5 cost drivers. Quick wins (idle resources, snapshots, right-sizing) for fast credibility. Reservations for baseline. Architectural changes for long-term. Build a FinOps practice. Communicate trade-offs.

Q54: Build a platform engineering team — what's the goal?

Reduce cognitive load on developers. Provide paved paths — opinionated, well-supported workflows for common tasks. Self-service for the 80% case, escape hatches for edge cases. Measure platform success by developer productivity, not platform team output.

Q55: How do you debug a network issue between two AWS accounts via Transit Gateway?

Reachability Analyzer first, since it often tells you exactly what's blocked. Check the VPC route tables on both sides and the Transit Gateway route tables (attachment associations and propagations). Security groups, NACLs. Resource Access Manager (RAM) sharing of the TGW. VPC Flow Logs and Transit Gateway Flow Logs for traffic visibility. tcpdump on instances if necessary.

Q56: AI/agent-based DevOps — what do you make of it?

Real shift, not hype. AWS DevOps Agent in 2026 actually works for routine incident triage. Doesn't replace senior engineers — replaces some tier-1 work. Future of DevOps is humans designing systems and reviewing AI outputs, not humans running kubectl all day.

Q57: How do you handle a colleague who pushes directly to main in production?

First conversation, not first escalation. Why are they doing it? Sometimes the process is broken. If process is fine, set up branch protection rules so they can't anymore. Document expectations. Repeated violations become a manager conversation.

Q58: What's your stance on Kubernetes vs serverless?

Both have places. Lambda + Step Functions for event-driven, async, infrequent workloads. Kubernetes for steady-state, complex orchestration, microservices. Use the right tool. The all-Lambda or all-Kubernetes stances are both wrong for most companies.

Q59: How do you approach technical debt in infrastructure code?

Track it. Allocate 20% of capacity to it. Pay it down incrementally. Communicate cost of NOT fixing it (security, cost, velocity) in business terms. Don't let it accumulate to the point of needing a rewrite — that almost never goes well.

Q60: What questions do you have for us?

Always have questions. Good ones: what does on-call look like, what's the team's biggest current challenge, what does success in this role look like in 6 months, how does the team handle production incidents. Bad questions: what's the WFH policy (look it up), can I see your salary band (later).

Final advice

Practice answering these out loud. Reading them isn't the same as saying them under pressure. If you can answer 50 of these confidently in 60 seconds each, you'll do well in most Bangalore DevOps interviews.

Our AWS DevOps program at IT Defined includes mock interview rounds with feedback specifically on these patterns. Even if you go solo — practice with a friend. Verbal practice is the differentiator.

Frequently asked questions

How long should answers be?

60-90 seconds for most questions. Long enough to show depth, short enough to invite follow-ups. Watch the interviewer's body language.

What if I don't know the answer?

Say so honestly, then think out loud about how you'd find out. Interviewers value honesty + problem-solving over fake confidence.

Should I memorize answers?

No. Understand concepts. Memorized answers sound robotic and break under follow-up questions.

About IT Defined

IT Defined is a software training institute in Whitefield, Bangalore, offering hands-on programs in AWS DevOps, Full-Stack MERN, Python, and Cybersecurity. We've trained over 2,000 students with live projects, mock interviews, and placement support.

Visit: itdefined.org  |  Phone: +91 6363730986  |  Email: info@itdefined.org