What topics do DevOps engineer interviews typically cover?
DevOps interviews consistently cover CI/CD pipeline design and tooling (Jenkins, GitHub Actions, GitLab CI), container technology (Docker images, Kubernetes deployments), infrastructure as code (Terraform, CloudFormation), observability (logging, metrics, tracing), and deployment strategies. Behavioral questions about incident response and on-call experience also appear.
DevOps roles attract a wide range of interview styles, but the technical depth converges around a consistent set of topics: continuous integration and delivery pipelines, container orchestration, infrastructure as code, observability, and the cultural principles that underpin DevOps practice. This article covers the specific technical questions and topics that appear in DevOps engineer interviews, with the framing and depth that distinguish candidates with real operational experience.
CI/CD Pipeline Questions
Pipeline Design and Deployment Strategies
"Walk me through a CI/CD pipeline you have built or maintained."
This is a common opening question that tests depth of practical experience. Strong answers describe a real pipeline with specific tools, explain the rationale for key decisions, and address how the pipeline handles failures.
A typical pipeline for a containerized application:
```
Code push to feature branch
  -> Lint and static analysis (eslint, pylint, gosec)
  -> Unit tests
  -> Build Docker image
  -> Push to container registry with commit SHA tag
  -> Integration tests against ephemeral environment
  -> Security scan of image (Trivy, Snyk)

Merge to main
  -> All above steps
  -> Tag image as "candidate"
  -> Deploy to staging environment
  -> Run smoke tests
  -> Manual approval gate (for production)
  -> Deploy to production with blue/green or rolling strategy
```
The interviewer is listening for whether you mention: test coverage gates, artifact versioning, environment-specific configuration management, and rollback strategy.
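The fail-fast ordering of those stages can be sketched in a few lines of Python. This is only an illustration: `run_pipeline` and the stage names are hypothetical and do not correspond to any CI system's API, since real pipelines are declared in the CI tool's own configuration format.

```python
from typing import Callable, List, Tuple

Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order and stop at the first failure (fail fast).

    Returns the names of the stages that were actually executed.
    """
    executed = []
    for name, step in stages:
        executed.append(name)
        if not step():
            # Later stages (build, push, deploy) never run after a failed check.
            break
    return executed


# Hypothetical stages: each callable stands in for a real CI job.
feature_branch_stages: List[Stage] = [
    ("lint", lambda: True),
    ("unit-tests", lambda: False),  # simulate a failing test suite
    ("build-image", lambda: True),
    ("push-image", lambda: True),
]

print(run_pipeline(feature_branch_stages))  # ['lint', 'unit-tests']
```

Putting the cheap checks first keeps feedback fast and ensures no image is built or pushed from code that failed its tests.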
"What is the difference between a rolling deployment, blue/green deployment, and canary release?"
| Strategy | How It Works | Key Benefit | Key Risk |
|---|---|---|---|
| Rolling | Gradually replace old instances with new | No extra infrastructure cost | Both versions run simultaneously during update |
| Blue/Green | Maintain two identical environments; switch traffic | Instant rollback by switching back | Requires double infrastructure cost |
| Canary | Route small percentage of traffic to new version | Validates new version with real traffic before full rollout | Requires traffic splitting infrastructure |
In Kubernetes, the rolling update is the default Deployment strategy. Blue/green is commonly implemented by switching a load balancer target, Service selector, or service mesh route from the old environment to the new one. Canary releases are implemented with Flagger, Argo Rollouts, or native service mesh traffic splitting.
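The decision logic behind a canary rollout can be sketched as a small function. The traffic steps, the 1% error threshold, and the name `next_canary_weight` are assumptions chosen for illustration; tools such as Flagger and Argo Rollouts express the same idea declaratively and with richer metrics.

```python
from typing import List, Optional

# Hypothetical traffic steps: percent of requests sent to the new version.
CANARY_STEPS: List[int] = [5, 25, 50, 100]


def next_canary_weight(current: int, error_rate: float,
                       max_error_rate: float = 0.01) -> Optional[int]:
    """Return the next traffic weight, 0 to roll back, or None when finished."""
    if error_rate > max_error_rate:
        return 0  # roll back: send all traffic to the stable version
    for step in CANARY_STEPS:
        if step > current:
            return step  # promote to the next larger step
    return None  # already at 100 percent, rollout complete


print(next_canary_weight(5, error_rate=0.002))   # 25
print(next_canary_weight(25, error_rate=0.05))   # 0
```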
"What is a pipeline artifact and how do you manage artifact versioning?"
An artifact is the output of a build step—a compiled binary, Docker image, JAR file, or zip package. Artifact versioning ensures that the exact build can be reproduced and traced. Common schemes: semantic versioning (1.4.2), build number, or git commit SHA. Container images should be tagged with immutable identifiers (commit SHA) rather than mutable tags like latest in production systems.
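A minimal sketch of SHA-based image tagging follows. The helper names are hypothetical, and the check only looks at whether the tag resembles a git commit SHA (7 to 40 hex characters); it does not handle digest references such as `@sha256:...`.

```python
import re

# A full or abbreviated git commit SHA: 7 to 40 lowercase hex characters.
SHA_TAG = re.compile(r"^[0-9a-f]{7,40}$")


def image_reference(registry: str, name: str, commit_sha: str) -> str:
    """Build an image reference tagged with the commit that produced it."""
    if not SHA_TAG.match(commit_sha):
        raise ValueError(f"not a git commit SHA: {commit_sha!r}")
    return f"{registry}/{name}:{commit_sha}"


def is_sha_tagged(image: str) -> bool:
    """Return True if the image tag looks like a commit SHA."""
    # Only inspect the part after the last "/", so a registry port
    # such as localhost:5000 is not mistaken for a tag.
    last_segment = image.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return False  # no tag at all resolves to "latest"
    tag = last_segment.split(":", 1)[1]
    return bool(SHA_TAG.match(tag))


print(image_reference("registry.example.com", "web-app", "abc1234"))
print(is_sha_tagged("registry.example.com/web-app:latest"))  # False
```

A check like `is_sha_tagged` could run as a pipeline gate so that a manifest referencing a mutable tag never reaches production.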
Container and Kubernetes Questions
Docker and Kubernetes Fundamentals
"Explain the difference between a Docker image and a Docker container."
An image is a read-only, layered filesystem snapshot built from a Dockerfile. A container is a running instance of an image. Multiple containers can run from the same image simultaneously. Images are immutable; each container adds a writable layer that survives a stop and restart but is discarded when the container is removed, so data that must persist belongs in a volume.
"What is a Kubernetes Pod and how is it different from a container?"
A Pod is the smallest deployable unit in Kubernetes. It contains one or more containers that share a network namespace and can share storage volumes. Containers within a Pod communicate over localhost. The main container runs the application; sidecar containers provide auxiliary functions (log shipping, service mesh proxy, credential injection). Pods are ephemeral: when a Pod dies, its controller (for example a Deployment's ReplicaSet) creates a replacement, which normally receives a different IP.
"Explain the relationship between a Deployment, ReplicaSet, and Pod in Kubernetes."
A Deployment is the high-level object that defines the desired state: which container image to run and how many replicas. A Deployment manages a ReplicaSet, which maintains the specified number of Pod replicas. When you update a Deployment (change the image), Kubernetes creates a new ReplicaSet and gradually scales it up while scaling down the old one. This is the rolling update mechanism.
```yaml
# Simplified Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
        - name: web-app
          image: registry.example.com/web-app:abc123
          resources:
            requests:
              cpu: "100m"
              memory: "128Mi"
            limits:
              cpu: "500m"
              memory: "256Mi"
```
Resource requests and limits appear in most Kubernetes interview discussions because they affect scheduling decisions and cluster stability.
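To see why requests drive scheduling, the following sketch estimates how many replicas with given requests would fit on one node. It is a deliberate simplification: the helper names are hypothetical, only the `m` CPU suffix and the `Ki`/`Mi`/`Gi` memory suffixes are handled, and system-reserved capacity and other workloads on the node are ignored.

```python
def parse_cpu(quantity: str) -> int:
    """Convert a CPU quantity such as "100m" or "2" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(round(float(quantity) * 1000))


_MEMORY_UNITS = {"Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3}


def parse_memory(quantity: str) -> int:
    """Convert a memory quantity such as "128Mi" to bytes."""
    for suffix, factor in _MEMORY_UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[:-len(suffix)]) * factor
    return int(quantity)  # plain number of bytes


def replicas_that_fit(node_cpu: str, node_memory: str,
                      request_cpu: str, request_memory: str) -> int:
    """Replicas that fit on an otherwise empty node, judged by requests only."""
    by_cpu = parse_cpu(node_cpu) // parse_cpu(request_cpu)
    by_memory = parse_memory(node_memory) // parse_memory(request_memory)
    return min(by_cpu, by_memory)


# A node with 2 CPUs and 4Gi of memory, using the requests from the manifest.
print(replicas_that_fit("2", "4Gi", "100m", "128Mi"))  # 20
```

Here CPU is the binding constraint (20 replicas by CPU versus 32 by memory), which is the kind of reasoning interviewers tend to probe when they ask about right-sizing requests.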
"What happens when a Pod's memory limit is exceeded?"
When a container exceeds its memory limit, the Linux OOM (Out of Memory) killer terminates it, and Kubernetes reports the container's termination reason as OOMKilled. The kubelet then restarts the container according to the Pod's restartPolicy (Always by default), with exponential backoff between attempts. Without memory limits, a runaway container can consume all available node memory, causing node instability and cascading failures across unrelated workloads.
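The restart backoff is easy to picture with a short calculation. The sketch below assumes the delay sequence described in the Kubernetes documentation (10s, 20s, 40s, and so on, capped at five minutes); `restart_delays` is a hypothetical helper, not a Kubernetes API.

```python
from typing import List


def restart_delays(restarts: int, base: int = 10, cap: int = 300) -> List[int]:
    """Delay in seconds before each successive restart: doubling, then capped."""
    return [min(base * 2 ** attempt, cap) for attempt in range(restarts)]


print(restart_delays(7))  # [10, 20, 40, 80, 160, 300, 300]
```

A container that keeps getting OOMKilled therefore settles into one restart attempt every five minutes, which is what a Pod in CrashLoopBackOff is doing.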
Infrastructure as Code Questions
Terraform State and Modules
"What is Terraform state and why is it important?"
Terraform state is a JSON file that records the mapping between your configuration and the real resources provisioned in the cloud. Terraform compares state with your configuration and the real infrastructure to work out which changes terraform plan should propose and terraform apply should make. Without state, Terraform cannot know which resources exist or which it manages.
Remote state storage (S3 + DynamoDB for AWS, GCS for Google Cloud) is essential for team workflows:
S3 stores the state file
DynamoDB provides state locking to prevent concurrent modifications
Encryption at rest protects any credentials recorded in state from being exposed
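The locking behaviour can be illustrated with an in-memory stand-in. This is a conceptual sketch only: `LockTable` is hypothetical and merely mimics the conditional write a lock table relies on, where a lock entry can be created only if none exists for that state path.

```python
from typing import Dict


class LockTable:
    """In-memory stand-in for a state lock table (hypothetical)."""

    def __init__(self) -> None:
        self._locks: Dict[str, str] = {}

    def acquire(self, state_path: str, owner: str) -> bool:
        # Conditional write: succeeds only if no lock entry exists yet.
        if state_path in self._locks:
            return False
        self._locks[state_path] = owner
        return True

    def release(self, state_path: str, owner: str) -> None:
        # Only the current holder can release the lock.
        if self._locks.get(state_path) == owner:
            del self._locks[state_path]


locks = LockTable()
print(locks.acquire("prod/network.tfstate", "alice"))  # True
print(locks.acquire("prod/network.tfstate", "bob"))    # False: apply must wait or fail
locks.release("prod/network.tfstate", "alice")
print(locks.acquire("prod/network.tfstate", "bob"))    # True
```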
"What is a Terraform module and when would you create one?"
A module is a reusable, self-contained configuration that accepts inputs and produces outputs. Create a module when:
The same infrastructure pattern is needed in multiple places (e.g., a standard VPC layout)
You want to enforce organization standards (naming conventions, required tags, security defaults)
You want to hide complexity from teams consuming the infrastructure
Public modules from the Terraform Registry (official AWS, Azure, GCP modules) provide a starting point, but organizations typically maintain internal modules with their own conventions.
"What is configuration drift and how do you detect and prevent it?"
Configuration drift is the divergence between the desired state defined in code and the actual state of infrastructure—typically caused by manual changes made outside the IaC workflow. Detection: run terraform plan regularly (or in a scheduled pipeline) and alert on non-empty plans. Prevention: restrict who can modify infrastructure directly (IAM policies, cloud guardrails) and enforce all changes through the IaC workflow.
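A scheduled drift check can be sketched around `terraform plan -detailed-exitcode`, which exits with 0 when there are no changes, 1 on error, and 2 when changes are pending. The function names and the status strings are assumptions for illustration; note that a non-empty plan can also come from unapplied configuration changes, so the check is most meaningful on a branch that has already been applied.

```python
import subprocess


def classify_plan_exit_code(code: int) -> str:
    """Map `terraform plan -detailed-exitcode` exit codes to a status."""
    if code == 0:
        return "in-sync"  # plan succeeded with no changes
    if code == 2:
        return "drift"    # plan succeeded and changes are pending
    return "error"        # 1 or anything else: the plan itself failed


def check_drift(workdir: str) -> str:
    """Run a plan in `workdir`.

    Assumes terraform is on PATH and the directory is already initialised.
    """
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit_code(result.returncode)
```

A scheduled pipeline job could call `check_drift` for each workspace directory and raise an alert whenever the result is anything other than `"in-sync"`.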
Observability Questions
The Three Pillars and Golden Signals
"The four golden signals—latency, traffic, errors, saturation—are the minimum viable monitoring set for any service. If you cannot answer those four questions about a system, you cannot operate it reliably." — Betsy Beyer, editor of Site Reliability Engineering (O'Reilly Media), Google SRE team
"What is the difference between logging, metrics, and tracing? When do you use each?"
| Signal Type | What It Captures | Best For |
|---|---|---|
| Logs | Discrete events with context | Understanding what happened in a specific transaction |
| Metrics | Numeric measurements over time | Alerting on system health, capacity planning |
| Traces | End-to-end request flow across services | Diagnosing latency issues in distributed systems |
The three together constitute the "three pillars of observability." A production incident typically starts with a metric alert, is investigated using logs, and—for microservices—requires tracing to identify which service in the call chain is causing the problem.
"What are the golden signals of monitoring?"
From the Google Site Reliability Engineering book, the four golden signals are:
Latency: how long requests take, distinguishing successful and failed requests
Traffic: how much demand the system is handling (requests per second)
Errors: rate of failed requests
Saturation: how close the system is to capacity (CPU, memory, queue depth)
Interviewers for SRE-adjacent roles often reference this framework and expect candidates to be familiar with it.
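As a rough illustration, the four signals can be computed from a window of request records. The `Request` type, the 95th-percentile choice, and the use of CPU utilisation as the saturation measure are assumptions for this sketch; in practice these values come from a metrics system rather than from raw records in application code.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    duration_ms: float
    status: int


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile; `values` must be non-empty."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def golden_signals(requests: List[Request], window_s: float,
                   cpu_utilisation: float) -> Dict[str, float]:
    """Summarise one time window; assumes at least one successful request."""
    # Latency is computed over successful requests only, so that fast
    # failures do not make the service look healthier than it is.
    ok_durations = [r.duration_ms for r in requests if r.status < 500]
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p95_ms": percentile(ok_durations, 95),
        "traffic_rps": len(requests) / window_s,
        "error_rate": errors / len(requests),
        "saturation": cpu_utilisation,
    }


sample = [Request(100, 200), Request(200, 200), Request(300, 200), Request(900, 500)]
print(golden_signals(sample, window_s=2.0, cpu_utilisation=0.6))
```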
DevOps Culture and Process Questions
Secrets Management and Team Practices
Senior DevOps interviews include questions about team practices:
"How do you manage secrets in a CI/CD pipeline?"
Never store secrets in version control. Common patterns:
Environment variables injected by the CI system at runtime (GitHub Actions secrets, GitLab CI variables)
Integration with a secrets manager (HashiCorp Vault, AWS Secrets Manager) called at runtime
Short-lived credentials via cloud identity (assuming an IAM role in AWS, using Workload Identity in GCP)
The worst pattern is hardcoded credentials in code or configuration files committed to the repository. Secret-scanning tools such as Gitleaks and TruffleHog detect committed secrets.
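Both sides of that advice can be sketched briefly: reading a secret that the CI system injects at runtime, and scanning text for one well-known credential shape. The helper names are hypothetical, and the single regular expression for AWS access key IDs is only a toy; real scanners apply many more rules plus entropy checks.

```python
import os
import re
from typing import List

# AWS access key IDs start with AKIA (long-lived) or ASIA (temporary)
# followed by 16 uppercase letters or digits.
AWS_ACCESS_KEY_ID = re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b")


def find_hardcoded_keys(text: str) -> List[str]:
    """Return anything in `text` that looks like an AWS access key ID."""
    return AWS_ACCESS_KEY_ID.findall(text)


def require_secret(name: str) -> str:
    """Read a secret injected by the CI system; fail fast if it is missing."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"secret {name} is not set")
    return value


# AKIAIOSFODNN7EXAMPLE is the placeholder key used in AWS documentation.
print(find_hardcoded_keys('aws_key = "AKIAIOSFODNN7EXAMPLE"'))
print(find_hardcoded_keys('aws_key = os.environ["AWS_ACCESS_KEY_ID"]'))  # []
```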
DevOps role compensation and certification mix
DevOps is one of the most heavily credentialed disciplines in IT, not because certifications are strictly required, but because the technical surface area is so wide that credentials serve as signal bundles. The table below lists US salary ranges for 2024-2025, drawn from the Robert Half 2024 Technology Salary Guide [1] and Glassdoor aggregated data [2].
| Role | Seniority | US salary range (2024-2025) | Typical cert mix |
|---|---|---|---|
| Junior DevOps Engineer | Entry | $85,000-$115,000 | AWS CCP, AZ-900, Terraform Associate |
| DevOps Engineer | Mid | $115,000-$160,000 | AWS SAA, CKA, Terraform Associate |
| Senior DevOps Engineer | Senior | $150,000-$205,000 | AWS DOP, CKA, CKS |
| Platform Engineer | Senior | $160,000-$215,000 | AWS DOP, CKA, Terraform Associate |
| Site Reliability Engineer | Senior | $160,000-$230,000 | AWS DOP, CKS, specialized vendor certs |
| DevOps Architect | Senior | $175,000-$240,000 | AWS SAP, AWS DOP, CKA |
| Staff SRE | Staff | $220,000-$320,000 | Depth over breadth; often no new certs |
| FAANG L5 SRE | Staff | $300,000-$480,000 TC | System design interviews matter more than certs |
Certification signal value for DevOps interviews
| Certification | Current exam code | Fee | Interview signal |
|---|---|---|---|
| AWS DevOps Engineer Professional | DOP-C02 | $300 | Strong signal for AWS-focused DevOps |
| AWS Solutions Architect Professional | SAP-C02 | $300 | Complementary to DOP; architecture judgment |
| CKA | CKA | $395 | Near-prerequisite for Kubernetes-heavy roles |
| CKAD | CKAD | $395 | Developer-oriented Kubernetes signal |
| CKS | CKS | $395 | Security-focused Kubernetes; senior signal |
| Terraform Associate | 003 | $70.50 | High ROI; widely respected IaC credential |
| HashiCorp Vault Associate | 002 | $70.50 | Secrets management specialty |
| GitHub Actions | N/A | N/A | Community respect; not formal cert |
| Google Professional Cloud DevOps Engineer | PCDE | $200 | GCP-focused equivalent of AWS DOP |
Incident response and on-call expectations
Senior DevOps interviews almost always include questions about on-call philosophy and incident response. Candidates without practical experience can prepare by studying real postmortems published by companies like GitLab, Cloudflare, and Google SRE. Topics that come up repeatedly:
Incident severity classification - SEV-1 through SEV-4 or P0 through P4, with clear response-time SLAs.
Blameless postmortem culture - focus on systemic causes, not individual fault.
Error budgets - the amount of unreliability a service-level objective (SLO) allows per period (a 99.9% availability SLO allows about 43 minutes of downtime in a 30-day month); exhausting the budget triggers a feature freeze.
Paging hygiene - actionable alerts only; non-actionable alerts route to ticket queues, not pages.
Rotation sustainability - on-call rotations of at least one week with compensation or time-off-in-lieu.
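The error budget item above lends itself to a quick worked calculation. The function names are hypothetical, and the sketch assumes a purely time-based availability SLO over a 30-day period; request-based SLOs count failed requests instead of minutes.

```python
def error_budget_minutes(slo: float, period_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO such as 0.999."""
    return (1 - slo) * period_days * 24 * 60


def budget_remaining(slo: float, downtime_minutes: float,
                     period_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    budget = error_budget_minutes(slo, period_days)
    return (budget - downtime_minutes) / budget


print(round(error_budget_minutes(0.999), 1))     # 43.2
print(round(budget_remaining(0.999, 21.6), 2))   # 0.5
```

With a 99.9% SLO, a single 21.6-minute outage consumes half of the month's budget, which is the kind of number that makes a feature-freeze policy concrete in an interview answer.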
"The best SRE teams treat operational load as a design input rather than an afterthought. If your service requires twelve people to run it reliably, the system is undersupplied with automation - no amount of heroism will scale that model. The goal of SRE is to reduce the incremental operational cost of each new service toward zero through automation and well-designed runbooks." - Benjamin Treynor Sloss, VP of Engineering at Google, founding director of Site Reliability Engineering [3].
Platform engineering and the shift from DevOps
The role titled "DevOps Engineer" is slowly being supplanted by "Platform Engineer" in mature organizations. The distinction matters in interviews because the expectations differ.
| Aspect | DevOps Engineer | Platform Engineer |
|---|---|---|
| Customer | Generic team support | Internal developer teams as customers |
| Deliverable | Pipelines, infrastructure per project | Self-service platforms and golden paths |
| Metrics | Deploy frequency, MTTR | Developer onboarding time, platform NPS |
| Primary skill | Automation breadth | Product management + platform engineering |
| Current market trend | Stable | Growing 15-25% YoY |
Candidates interviewing for roles at companies with mature internal platforms (Stripe, Shopify, HashiCorp) should read the Team Topologies book and the CNCF Platform Engineering whitepaper before interviews.
Pipeline-as-code and GitOps specific questions
"What is GitOps and how does it differ from traditional CI/CD?"
GitOps treats the Git repository as the single source of truth for both application code and the desired operational state. Changes are proposed via pull requests, reviewed, and applied automatically by a reconciliation agent (Argo CD or Flux for Kubernetes). The key distinction from traditional CI/CD is that deployments are pulled by the reconciliation agent running in the target environment rather than pushed by the CI server.
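The reconciliation idea can be sketched as a comparison between the desired state declared in Git and the actual state of the cluster. The `reconcile` function and the workload maps are hypothetical simplifications; real agents compare full Kubernetes manifests and run this comparison continuously.

```python
from typing import Dict, List


def reconcile(desired: Dict[str, str], actual: Dict[str, str]) -> List[str]:
    """Return the actions needed to make `actual` match `desired`.

    Both maps go from workload name to image reference.
    """
    actions = []
    for name, image in sorted(desired.items()):
        if name not in actual:
            actions.append(f"create {name} -> {image}")
        elif actual[name] != image:
            actions.append(f"update {name} -> {image}")
    for name in sorted(actual.keys() - desired.keys()):
        # Pruning: the workload exists but is no longer declared in Git.
        actions.append(f"delete {name}")
    return actions


desired_state = {"web": "web:abc1234", "worker": "worker:def5678"}
actual_state = {"web": "web:0000000", "cron": "cron:1111111"}
print(reconcile(desired_state, actual_state))
```

Because the agent keeps running this comparison, a manual change to the cluster shows up as a difference and is reverted to what Git declares.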
"What are the benefits and challenges of trunk-based development?"
Trunk-based development keeps all developers committing to a single main branch, using feature flags rather than long-lived branches to isolate in-progress work. Benefits: faster integration, fewer merge conflicts, continuous delivery alignment. Challenges: requires disciplined feature-flagging, mature test automation, and cultural acceptance of incremental changes.
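A percentage-based feature flag, the mechanism that lets unfinished work live on the main branch, can be sketched as follows. The `is_enabled` helper and the flag name are hypothetical; production systems usually rely on a flag service, but the deterministic bucketing shown here is a common underlying technique.

```python
import hashlib


def is_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """Deterministically place a user in a bucket from 0 to 99.

    The same user always lands in the same bucket for a given flag, so
    raising the rollout percentage only ever adds users.
    """
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_percent


# In-progress code ships dark at 0 percent and is widened gradually.
print(is_enabled("new-checkout", "user-42", 0))    # False
print(is_enabled("new-checkout", "user-42", 100))  # True
```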
"How do you handle database migrations in a CI/CD pipeline?"
The standard pattern: forward-only migrations committed alongside application code, applied automatically before the new application version is deployed, and designed to be backward-compatible with the previous application version (expand-contract pattern). Rollback happens via forward migrations that revert the change, not by rolling back the migration itself.
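The expand step of that pattern can be demonstrated with an in-memory SQLite database. The table and column names are invented for illustration, and the example covers only the expand and backfill phases; the contract step (dropping the old column) would ship as a later migration once no running version reads it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO users (name) VALUES ('Ada Lovelace')")

# Expand: add the new column as nullable so the previous application
# version, which only knows about `name`, keeps working.
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")

# Backfill existing rows so the new version can read the new column.
conn.execute("UPDATE users SET display_name = name WHERE display_name IS NULL")

# The previous application version can still insert without the new column.
conn.execute("INSERT INTO users (name) VALUES ('Grace Hopper')")

rows = conn.execute("SELECT name, display_name FROM users ORDER BY id").fetchall()
print(rows)  # [('Ada Lovelace', 'Ada Lovelace'), ('Grace Hopper', None)]
```

The second row shows why the new application version has to tolerate missing values, or the backfill has to be repeated, until every writer populates the new column.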
References
[1] Robert Half. (2024). 2024 Technology Salary Guide. https://www.roberthalf.com/us/en/insights/salary-guide/technology
[2] Glassdoor. (2024). DevOps Engineer Salary Report. https://www.glassdoor.com/Salaries/devops-engineer-salary-SRCH_KO0,15.htm
[3] Beyer, B., Jones, C., Petoff, J., & Murphy, N. R. (Eds.). (2016). Site Reliability Engineering: How Google Runs Production Systems. O'Reilly Media. ISBN: 978-1491929124
Kim, G., Humble, J., Debois, P., & Willis, J. (2016). The DevOps Handbook. IT Revolution Press. ISBN: 978-1942788003
Humble, J., & Farley, D. (2010). Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation. Addison-Wesley. ISBN: 978-0321601919
Burns, B., Grant, B., Oppenheimer, D., Brewer, E., & Wilkes, J. (2016). "Borg, Omega, and Kubernetes." ACM Queue, 14(1). https://queue.acm.org/detail.cfm?id=2898444
HashiCorp. (2024). "Terraform Best Practices." https://developer.hashicorp.com/terraform/docs/cloud-docs/recommended-practices
Beyer, B., Murphy, N. R., Rensin, D. K., Kawahara, K., & Thorne, S. (2018). The Site Reliability Workbook. O'Reilly Media. ISBN: 978-1492029502
Luksa, M. (2017). Kubernetes in Action. Manning Publications. ISBN: 978-1617293726
Skelton, M., & Pais, M. (2019). Team Topologies: Organizing Business and Technology Teams for Fast Flow. IT Revolution Press. ISBN: 978-1942788812
Cloud Native Computing Foundation. (2023). Platform Engineering Maturity Model Whitepaper. https://www.cncf.io/reports/
Frequently Asked Questions
What is the difference between blue/green and canary deployments?
Blue/green maintains two identical environments and switches all traffic instantly from the old to the new version, enabling instant rollback by switching back. A canary release routes a small percentage of traffic to the new version, validates it with real traffic, then gradually increases the percentage. Blue/green requires double infrastructure; canary requires traffic splitting infrastructure.
What happens when a Kubernetes Pod exceeds its memory limit?
The Linux OOM killer terminates the container, and Kubernetes records an OOMKilled event. If the Pod has a restart policy (the default is Always), Kubernetes restarts the container with exponential backoff. Without memory limits, a runaway container can consume all available node memory and cause cascading failures across unrelated workloads on the same node.
What is Terraform state locking?
State locking prevents two Terraform operations from modifying state simultaneously, which could corrupt it. When using S3 as a remote backend, a DynamoDB table provides locking by recording an entry when state is in use. Any concurrent plan or apply will wait or fail rather than proceeding with potentially stale state.
What are the four golden signals for monitoring?
From the Google SRE book: Latency (how long requests take), Traffic (request rate), Errors (rate of failed requests), and Saturation (how close the system is to capacity). These four metrics provide a useful baseline for production monitoring and alerting because they cover the dimensions most likely to affect user experience.
