DevOps Interview Preparation: Key Topics Covered

In-depth preparation for DevOps interviews focusing on essential tools and strategies in CI/CD and microservices.

What topics do DevOps engineer interviews typically cover?

DevOps interviews consistently cover CI/CD pipeline design and tooling (Jenkins, GitHub Actions, GitLab CI), container technology (Docker images, Kubernetes deployments), infrastructure as code (Terraform, CloudFormation), observability (logging, metrics, tracing), and deployment strategies. Behavioral questions about incident response and on-call experience also appear.


DevOps roles attract a wide range of interview styles, but the technical depth converges around a consistent set of topics: continuous integration and delivery pipelines, container orchestration, infrastructure as code, observability, and the cultural principles that underpin DevOps practice. This article covers the specific technical questions and topics that appear in DevOps engineer interviews, with the framing and depth that distinguish candidates with real operational experience.

CI/CD Pipeline Questions

Pipeline Design and Deployment Strategies

"Walk me through a CI/CD pipeline you have built or maintained."

This is a common opening question that tests depth of practical experience. Strong answers describe a real pipeline with specific tools, explain the rationale for key decisions, and address how the pipeline handles failures.

A typical pipeline for a containerized application:

Code push to feature branch
    -> Lint and static analysis (eslint, pylint, gosec)
    -> Unit tests
    -> Build Docker image
    -> Push to container registry with commit SHA tag
    -> Integration tests against ephemeral environment
    -> Security scan of image (Trivy, Snyk)
    
Merge to main
    -> All above steps
    -> Tag image as "candidate"
    -> Deploy to staging environment
    -> Run smoke tests
    -> Manual approval gate (for production)
    -> Deploy to production with blue/green or rolling strategy

The interviewer is listening for whether you mention: test coverage gates, artifact versioning, environment-specific configuration management, and rollback strategy.

"What is the difference between a rolling deployment, blue/green deployment, and canary release?"

| Strategy | How It Works | Key Benefit | Key Risk |
| --- | --- | --- | --- |
| Rolling | Gradually replace old instances with new | No extra infrastructure cost | Both versions run simultaneously during update |
| Blue/Green | Maintain two identical environments; switch traffic | Instant rollback by switching back | Requires double infrastructure cost |
| Canary | Route small percentage of traffic to new version | Validates new version with real traffic before full rollout | Requires traffic splitting infrastructure |

In Kubernetes, a rolling update is the default strategy for a Deployment. Blue/green is commonly implemented by switching a Service selector or load balancer target from the old environment to the new one. Canary releases rely on weighted routing in a load balancer or service mesh, often managed by Flagger or Argo Rollouts.
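The traffic split behind a canary release can be illustrated with a small sketch. It assumes each request carries a stable ID and that a hash of that ID decides the route; the function names `route` and `canary_share` are hypothetical and do not correspond to any load balancer or service mesh API.

```python
import hashlib


def route(request_id: str, canary_percent: int) -> str:
    """Return 'canary' or 'stable' for a request, deterministically."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    # Hash the request ID into a stable bucket in [0, 100).
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return "canary" if bucket < canary_percent else "stable"


def canary_share(request_ids, canary_percent: int) -> float:
    """Fraction of the given requests that would reach the canary."""
    routed = [route(r, canary_percent) for r in request_ids]
    return routed.count("canary") / len(routed)
```

Hashing on a stable ID keeps a given caller on the same version for the duration of the rollout, which makes canary metrics easier to interpret than a purely random split.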

"What is a pipeline artifact and how do you manage artifact versioning?"

An artifact is the output of a build step—a compiled binary, Docker image, JAR file, or zip package. Artifact versioning ensures that the exact build can be reproduced and traced. Common schemes: semantic versioning (1.4.2), build number, or git commit SHA. Container images should be tagged with immutable identifiers (commit SHA) rather than mutable tags like latest in production systems.
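As a minimal sketch of the commit-SHA tagging scheme, the helper below builds an image reference from a git commit SHA and checks whether a reference looks immutable. The names `image_ref` and `is_immutable_ref`, and the choice to truncate the SHA to 12 characters, are assumptions made for illustration.

```python
import re

# A git commit SHA, abbreviated (7+ hex chars) or full (40 hex chars).
SHA_RE = re.compile(r"^[0-9a-f]{7,40}$")


def image_ref(registry: str, name: str, commit_sha: str) -> str:
    """Build an image reference tagged with a shortened commit SHA."""
    sha = commit_sha.strip().lower()
    if not SHA_RE.match(sha):
        raise ValueError(f"not a git commit SHA: {commit_sha!r}")
    return f"{registry}/{name}:{sha[:12]}"


def is_immutable_ref(ref: str) -> bool:
    """True when the tag looks like a commit SHA rather than a mutable tag."""
    _, _, tag = ref.rpartition(":")
    return bool(SHA_RE.match(tag))
```

A pipeline could call `is_immutable_ref` as a gate before a production deploy so that a reference ending in `latest` is rejected.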

Container and Kubernetes Questions

Docker and Kubernetes Fundamentals

"Explain the difference between a Docker image and a Docker container."

An image is a read-only, layered filesystem snapshot defined by a Dockerfile. A container is a running instance of an image. Multiple containers can run from the same image simultaneously. Images are immutable; each container has a writable layer that is discarded when the container is removed (data written to a mounted volume persists).

"What is a Kubernetes Pod and how is it different from a container?"

A Pod is the smallest deployable unit in Kubernetes: it contains one or more containers that share a network namespace and storage volumes. Containers within a Pod communicate over localhost. The main container runs the application; sidecar containers provide auxiliary functions (log shipping, service mesh proxy, credential injection). Pods are ephemeral: when a Pod dies, its controller (for example a ReplicaSet) creates a replacement, which usually gets a different IP.

"Explain the relationship between a Deployment, ReplicaSet, and Pod in Kubernetes."

A Deployment is the high-level object that defines the desired state: which container image to run and how many replicas. A Deployment manages a ReplicaSet, which maintains the specified number of Pod replicas. When you update a Deployment (change the image), Kubernetes creates a new ReplicaSet and gradually scales it up while scaling down the old one. This is the rolling update mechanism.

# Simplified Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
      - name: web-app
        image: registry.example.com/web-app:abc123
        resources:
          requests:
            cpu: "100m"
            memory: "128Mi"
          limits:
            cpu: "500m"
            memory: "256Mi"

Resource requests and limits appear in most Kubernetes interview discussions because they affect scheduling decisions and cluster stability.
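To make the quantity notation concrete, here is a small sketch that converts CPU and memory strings such as "100m" and "128Mi" into cores and bytes. It handles only a subset of the suffixes Kubernetes accepts, and the function names are hypothetical helpers rather than part of any Kubernetes client library.

```python
# Binary (power-of-two) and decimal suffixes; a subset of what Kubernetes accepts.
_BINARY = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}
_DECIMAL = {"k": 1000, "M": 1000**2, "G": 1000**3, "T": 1000**4}


def parse_cpu(quantity: str) -> float:
    """Convert a CPU quantity to cores: '100m' means 100 millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)


def parse_memory(quantity: str) -> int:
    """Convert a memory quantity to bytes: '128Mi' means 128 * 2**20."""
    for suffix, factor in {**_BINARY, **_DECIMAL}.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


def request_within_limit(request: str, limit: str, parse) -> bool:
    """Check that a request does not exceed its limit."""
    return parse(request) <= parse(limit)
```

With the values from the Deployment above, the CPU request is 0.1 cores against a 0.5-core limit, and the memory request is 128 MiB against a 256 MiB limit.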

"What happens when a Pod's memory limit is exceeded?"

When a container exceeds its memory limit, the Linux OOM (Out of Memory) killer terminates it. Kubernetes records the termination reason as OOMKilled. If the Pod's restartPolicy allows it (the default is Always), the kubelet restarts the container with exponential backoff. Without memory limits, a runaway container can consume all available node memory, causing node instability and cascading failures across unrelated workloads.
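The restart backoff can be sketched as a capped doubling sequence. The defaults below (10 seconds base, 300 seconds cap) follow the values described in the Kubernetes documentation; actual kubelet behavior may differ by version and configuration, and `restart_delays` is a hypothetical helper.

```python
def restart_delays(restarts: int, base: int = 10, cap: int = 300) -> list:
    """Delay in seconds before each successive restart, doubling up to a cap."""
    return [min(base * 2**i, cap) for i in range(restarts)]
```

A container that keeps getting OOM-killed therefore spends most of its time waiting in backoff, which is why repeated OOMKilled restarts usually surface as CrashLoopBackOff.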

Infrastructure as Code Questions

Terraform State and Modules

"What is Terraform state and why is it important?"

Terraform state is a JSON file that records the mapping between your configuration and the real resources provisioned in the cloud. Terraform uses state to determine which changes to propose in terraform plan and make in terraform apply. Without state, Terraform cannot know which resources exist or which it manages.

Remote state storage (S3 + DynamoDB for AWS, GCS for Google Cloud) is essential for team workflows:

  • S3 stores the state file

  • DynamoDB provides state locking to prevent concurrent modifications

  • Encryption at rest protects secrets recorded in the state file (passwords, keys, connection strings) from exposure
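The locking idea can be sketched as a conditional write: a lock entry is created only if none exists for that state file. This is an in-memory illustration, not Terraform's or DynamoDB's actual implementation; `LockTable` and its methods are hypothetical.

```python
class LockTable:
    """In-memory stand-in for a table that supports conditional writes."""

    def __init__(self):
        self._locks = {}

    def acquire(self, state_path: str, owner: str) -> bool:
        # Conditional put: succeed only if no lock entry exists for this state.
        if state_path in self._locks:
            return False
        self._locks[state_path] = owner
        return True

    def release(self, state_path: str, owner: str) -> bool:
        # Only the current holder may release the lock.
        if self._locks.get(state_path) != owner:
            return False
        del self._locks[state_path]
        return True
```

Two engineers running an apply against the same state at the same time would see one `acquire` succeed and the other fail, which is the behavior that protects the state file from concurrent writes.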

"What is a Terraform module and when would you create one?"

A module is a reusable, self-contained configuration that accepts inputs and produces outputs. Create a module when:

  • The same infrastructure pattern is needed in multiple places (e.g., a standard VPC layout)

  • You want to enforce organization standards (naming conventions, required tags, security defaults)

  • You want to hide complexity from teams consuming the infrastructure

Public modules from the Terraform Registry (official AWS, Azure, GCP modules) provide a starting point, but organizations typically maintain internal modules with their own conventions.

"What is configuration drift and how do you detect and prevent it?"

Configuration drift is the divergence between the desired state defined in code and the actual state of infrastructure—typically caused by manual changes made outside the IaC workflow. Detection: run terraform plan regularly (or in a scheduled pipeline) and alert on non-empty plans. Prevention: restrict who can modify infrastructure directly (IAM policies, cloud guardrails) and enforce all changes through the IaC workflow.
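A scheduled drift check might look like the following sketch. It relies on the `-detailed-exitcode` flag of `terraform plan`, which exits with 0 for an empty plan, 1 for an error, and 2 for a non-empty plan. The function names and result labels are assumptions; note also that a non-empty plan can reflect unapplied code changes as well as manual drift.

```python
import subprocess


def interpret_plan_exit_code(code: int) -> str:
    """Map the exit code of `terraform plan -detailed-exitcode` to a status."""
    if code == 0:
        return "in_sync"   # plan succeeded with no changes
    if code == 2:
        return "drift"     # plan succeeded and found changes
    return "error"         # plan itself failed


def check_drift(workdir: str) -> str:
    """Run a plan in workdir and report its status (requires terraform on PATH)."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return interpret_plan_exit_code(result.returncode)
```

A scheduled pipeline job could call `check_drift` for each workspace directory and raise an alert whenever the result is not "in_sync".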

Observability Questions

The Three Pillars and Golden Signals

"The four golden signals—latency, traffic, errors, saturation—are the minimum viable monitoring set for any service. If you cannot answer those four questions about a system, you cannot operate it reliably." — Betsy Beyer, editor of Site Reliability Engineering (O'Reilly Media), Google SRE team

"What is the difference between logging, metrics, and tracing? When do you use each?"

| Signal Type | What It Captures | Best For |
| --- | --- | --- |
| Logs | Discrete events with context | Understanding what happened in a specific transaction |
| Metrics | Numeric measurements over time | Alerting on system health, capacity planning |
| Traces | End-to-end request flow across services | Diagnosing latency issues in distributed systems |

The three together constitute the "three pillars of observability." A production incident typically starts with a metric alert, is investigated using logs, and—for microservices—requires tracing to identify which service in the call chain is causing the problem.

"What are the golden signals of monitoring?"

From the Google Site Reliability Engineering book, the four golden signals are:

  • Latency: how long requests take, distinguishing successful and failed requests

  • Traffic: how much demand the system is handling (requests per second)

  • Errors: rate of failed requests

  • Saturation: how close the system is to capacity (CPU, memory, queue depth)

Interviewers for SRE-adjacent roles often reference this framework and expect candidates to be familiar with it.
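As a rough illustration, the four signals can be computed from a window of request records. The sketch assumes that a status of 500 or above counts as a failure, that latency is reported for successful requests only, and that saturation is measured as in-flight requests against a concurrency ceiling; all names here are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    status: int


def percentile(values, p):
    """Nearest-rank percentile; returns None for an empty list."""
    if not values:
        return None
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def golden_signals(requests, window_seconds, in_flight, max_in_flight):
    """Summarize latency, traffic, errors, and saturation for one window."""
    ok_latencies = [r.latency_ms for r in requests if r.status < 500]
    failed = [r for r in requests if r.status >= 500]
    return {
        "p95_latency_ms_ok": percentile(ok_latencies, 95),
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": len(failed) / len(requests) if requests else 0.0,
        "saturation": in_flight / max_in_flight,
    }
```

Separating latency for successful and failed requests matters because fast failures can otherwise make the latency figure look healthier than it is.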

DevOps Culture and Process Questions

Secrets Management and Team Practices

Senior DevOps interviews include questions about team practices:

"How do you manage secrets in a CI/CD pipeline?"

Never store secrets in version control. Common patterns:

  • Environment variables injected by the CI system at runtime (GitHub Actions secrets, GitLab CI variables)

  • Integration with a secrets manager (HashiCorp Vault, AWS Secrets Manager) called at runtime

  • Short-lived credentials via cloud identity (assuming an IAM role in AWS, using Workload Identity in GCP)

The worst pattern is hardcoded credentials in code or configuration files committed to the repository. Secret-scanning tools such as Gitleaks and TruffleHog scan repositories and commit history for committed secrets.
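The idea behind secret scanning can be shown with a toy pattern matcher. This is a simplified sketch with two illustrative patterns, not how any real scanner is implemented; real scanners use far larger rule sets and entropy checks. `PATTERNS` and `find_secrets` are hypothetical names.

```python
import re

# Two illustrative patterns only; real scanners ship many more rules.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_header": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
}


def find_secrets(text: str) -> list:
    """Return (line_number, pattern_name) for each suspicious line."""
    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((line_no, name))
    return findings
```

Run as a pre-commit hook or an early pipeline stage, a check like this fails the build before a credential reaches the shared repository.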

DevOps role compensation and certification mix

DevOps is one of the most heavily credentialed disciplines in IT - not because certifications are strictly required, but because the technical surface area is so wide that credentials serve as signal bundles. The table below lists US salary ranges for 2024-2025, drawn from the Robert Half 2024 Technology Salary Guide [1] and Glassdoor aggregated data [2].

| Role | Seniority | US salary range (2024-2025) | Typical cert mix |
| --- | --- | --- | --- |
| Junior DevOps Engineer | Entry | $85,000-$115,000 | AWS CCP, AZ-900, Terraform Associate |
| DevOps Engineer | Mid | $115,000-$160,000 | AWS SAA, CKA, Terraform Associate |
| Senior DevOps Engineer | Senior | $150,000-$205,000 | AWS DOP, CKA, CKS |
| Platform Engineer | Senior | $160,000-$215,000 | AWS DOP, CKA, Terraform Associate |
| Site Reliability Engineer | Senior | $160,000-$230,000 | AWS DOP, CKS, specialized vendor certs |
| DevOps Architect | Senior | $175,000-$240,000 | AWS SAP, AWS DOP, CKA |
| Staff SRE | Staff | $220,000-$320,000 | Depth over breadth; often no new certs |
| FAANG L5 SRE | Staff | $300,000-$480,000 TC | System design interviews matter more than certs |

Certification signal value for DevOps interviews

| Certification | Current exam code | Fee | Interview signal |
| --- | --- | --- | --- |
| AWS DevOps Engineer Professional | DOP-C02 | $300 | Strong signal for AWS-focused DevOps |
| AWS Solutions Architect Professional | SAP-C02 | $300 | Complementary to DOP; architecture judgment |
| CKA | CKA | $395 | Near-prerequisite for Kubernetes-heavy roles |
| CKAD | CKAD | $395 | Developer-oriented Kubernetes signal |
| CKS | CKS | $395 | Security-focused Kubernetes; senior signal |
| Terraform Associate | 003 | $70.50 | High ROI; widely respected IaC credential |
| HashiCorp Vault Associate | 002 | $70.50 | Secrets management specialty |
| GitHub Actions | N/A | N/A | Community respect; not formal cert |
| Google Professional Cloud DevOps Engineer | PCDE | $200 | GCP-focused equivalent of AWS DOP |

Incident response and on-call expectations

Senior DevOps interviews almost always include questions about on-call philosophy and incident response. Candidates without practical experience can prepare by studying real postmortems published by companies like GitLab, Cloudflare, and Google SRE.

  • Incident severity classification - SEV-1 through SEV-4 or P0 through P4, with clear response-time SLAs.

  • Blameless postmortem culture - focus on systemic causes, not individual fault.

  • Error budgets - service-level objective (SLO) tied to an allowable downtime per period; budget exhaustion triggers feature-freeze.

  • Paging hygiene - actionable alerts only; non-actionable alerts route to ticket queues, not pages.

  • Rotation sustainability - on-call rotations of at least one week with compensation or time-off-in-lieu.
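The error budget arithmetic is simple enough to work through in code. This sketch assumes a time-based availability SLO measured over a fixed period; the function names are hypothetical.

```python
def error_budget_minutes(slo: float, period_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over a period."""
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction between 0 and 1, e.g. 0.999")
    return (1 - slo) * period_days * 24 * 60


def budget_remaining(slo: float, downtime_minutes: float, period_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative when exhausted)."""
    budget = error_budget_minutes(slo, period_days)
    return (budget - downtime_minutes) / budget
```

A 99.9% SLO over 30 days allows about 43.2 minutes of downtime, so a single 21.6-minute outage consumes half of the budget for that period.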

"The best SRE teams treat operational load as a design input rather than an afterthought. If your service requires twelve people to run it reliably, the system is undersupplied with automation - no amount of heroism will scale that model. The goal of SRE is to reduce the incremental operational cost of each new service toward zero through automation and well-designed runbooks." - Benjamin Treynor Sloss, VP of Engineering at Google, founding director of Site Reliability Engineering [3].


Platform engineering and the shift from DevOps

The role titled "DevOps Engineer" is slowly being supplanted by "Platform Engineer" in mature organizations. The distinction matters in interviews because the expectations differ.

| Aspect | DevOps Engineer | Platform Engineer |
| --- | --- | --- |
| Customer | Generic team support | Internal developer teams as customers |
| Deliverable | Pipelines, infrastructure per project | Self-service platforms and golden paths |
| Metrics | Deploy frequency, MTTR | Developer onboarding time, platform NPS |
| Primary skill | Automation breadth | Product management + platform engineering |
| Current market trend | Stable | Growing 15-25% YoY |

Candidates interviewing for roles at companies with mature internal platforms (Stripe, Shopify, HashiCorp) should read the Team Topologies book and the CNCF Platform Engineering whitepaper before interviews.


Pipeline-as-code and GitOps specific questions

"What is GitOps and how does it differ from traditional CI/CD?"

GitOps treats the Git repository as the single source of truth for both application code and operational state. Changes are proposed via pull requests, reviewed, and applied automatically by a reconciliation agent (ArgoCD or Flux for Kubernetes). The key distinction from traditional CI/CD is that deployments are pulled by the reconciliation agent rather than pushed by the CI server.
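The pull-based model comes down to a reconciliation loop: compare the desired state from Git with the actual state of the cluster and compute the actions needed to converge. The following is a minimal sketch using plain dictionaries; it is not how ArgoCD or Flux are implemented, and `reconcile` and `apply` are hypothetical names.

```python
def reconcile(desired: dict, actual: dict) -> list:
    """Compute the actions that bring actual state in line with desired state."""
    actions = []
    for name, spec in sorted(desired.items()):
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != spec:
            actions.append(("update", name))
    # Anything running that Git no longer declares gets removed.
    for name in sorted(actual.keys() - desired.keys()):
        actions.append(("delete", name))
    return actions


def apply(actions: list, desired: dict, actual: dict) -> dict:
    """Return the new actual state after carrying out the actions."""
    result = dict(actual)
    for verb, name in actions:
        if verb == "delete":
            result.pop(name)
        else:
            result[name] = desired[name]
    return result
```

Because the agent runs this comparison continuously, a manual change to the cluster is treated as drift and reverted on the next pass, which a push-based CI deploy does not do on its own.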

"What are the benefits and challenges of trunk-based development?"

Trunk-based development keeps all developers committing to a single main branch, using feature flags rather than long-lived branches to isolate in-progress work. Benefits: faster integration, fewer merge conflicts, continuous delivery alignment. Challenges: requires disciplined feature-flagging, mature test automation, and cultural acceptance of incremental changes.
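A feature flag in its simplest form is a conditional around the in-progress code path, with the flag off by default. The sketch below uses a hypothetical `new_checkout` flag and an in-process dictionary; production systems usually read flags from a configuration service instead.

```python
# Flags default to off so unfinished work can merge to main safely.
DEFAULT_FLAGS = {"new_checkout": False}


def is_enabled(flag: str, overrides=None) -> bool:
    """Look up a flag, letting per-environment overrides win over defaults."""
    flags = {**DEFAULT_FLAGS, **(overrides or {})}
    return flags.get(flag, False)


def checkout(cart_total_cents: int, overrides=None) -> dict:
    """Route to the new or legacy checkout flow depending on the flag."""
    if is_enabled("new_checkout", overrides):
        return {"flow": "new", "total_cents": cart_total_cents}
    return {"flow": "legacy", "total_cents": cart_total_cents}
```

The new flow ships to production dark and is switched on per environment, for example in staging first, without a long-lived branch.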

"How do you handle database migrations in a CI/CD pipeline?"

The standard pattern: forward-only migrations committed alongside application code, applied automatically before the new application version is deployed, and designed to be backward-compatible with the previous application version (expand-contract pattern). Rollback happens via forward migrations that revert the change, not by rolling back the migration itself.
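The expand step of the expand-contract pattern can be sketched with SQLite from the Python standard library. The `users` table, the `display_name` column, and the function names are hypothetical; the point is that the previous application version keeps working after the migration runs.

```python
import sqlite3


def setup() -> sqlite3.Connection:
    """Create the schema as the previous application version expects it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    conn.execute("INSERT INTO users (name) VALUES ('ada')")
    conn.commit()
    return conn


def migrate_expand(conn: sqlite3.Connection) -> None:
    """Expand: add a nullable column and backfill it; nothing is removed."""
    conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")
    conn.execute("UPDATE users SET display_name = name WHERE display_name IS NULL")
    conn.commit()


def old_app_insert(conn: sqlite3.Connection, name: str) -> None:
    """The previous version writes only the columns it knows about."""
    conn.execute("INSERT INTO users (name) VALUES (?)", (name,))
    conn.commit()


def new_app_read(conn: sqlite3.Connection) -> list:
    """The new version tolerates rows written by the previous version."""
    rows = conn.execute(
        "SELECT COALESCE(display_name, name) FROM users ORDER BY id"
    ).fetchall()
    return [row[0] for row in rows]
```

The contract step, such as dropping the old column, ships as a later forward migration once no running version depends on it.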


References

  • [1] Robert Half. (2024). 2024 Technology Salary Guide. https://www.roberthalf.com/us/en/insights/salary-guide/technology

  • [2] Glassdoor. (2024). DevOps Engineer Salary Report. https://www.glassdoor.com/Salaries/devops-engineer-salary-SRCH_KO0,15.htm

  • [3] Beyer, B., Jones, C., Petoff, J., & Murphy, N. R. (Eds.). (2016). Site Reliability Engineering: How Google Runs Production Systems. O'Reilly Media. ISBN: 978-1491929124

  • Kim, G., Humble, J., Debois, P., & Willis, J. (2016). The DevOps Handbook. IT Revolution Press. ISBN: 978-1942788003

  • Humble, J., & Farley, D. (2010). Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation. Addison-Wesley. ISBN: 978-0321601919

  • Burns, B., Grant, B., Oppenheimer, D., Brewer, E., & Wilkes, J. (2016). "Borg, Omega, and Kubernetes." ACM Queue, 14(1). https://queue.acm.org/detail.cfm?id=2898444

  • HashiCorp. (2024). "Terraform Best Practices." https://developer.hashicorp.com/terraform/docs/cloud-docs/recommended-practices

  • Beyer, B., Murphy, N. R., Rensin, D. K., Kawahara, K., & Thorne, S. (2018). The Site Reliability Workbook. O'Reilly Media. ISBN: 978-1492029502

  • Luksa, M. (2017). Kubernetes in Action. Manning Publications. ISBN: 978-1617293726

  • Skelton, M., & Pais, M. (2019). Team Topologies: Organizing Business and Technology Teams for Fast Flow. IT Revolution Press. ISBN: 978-1942788812. Foundational text for understanding the platform engineering shift and how DevOps roles are evolving toward developer-customer focused platform teams.

  • Cloud Native Computing Foundation. (2023). Platform Engineering Maturity Model Whitepaper. https://www.cncf.io/reports/

Frequently Asked Questions

What is the difference between blue/green and canary deployments?

Blue/green maintains two identical environments and switches all traffic instantly from the old to the new version, enabling instant rollback by switching back. A canary release routes a small percentage of traffic to the new version, validates it with real traffic, then gradually increases the percentage. Blue/green requires double infrastructure; canary requires traffic splitting infrastructure.

What happens when a Kubernetes Pod exceeds its memory limit?

The Linux OOM killer terminates the container, and Kubernetes records an OOMKilled event. If the Pod has a restart policy (the default is Always), Kubernetes restarts the container with exponential backoff. Without memory limits, a runaway container can consume all available node memory and cause cascading failures across unrelated workloads on the same node.

What is Terraform state locking?

State locking prevents two Terraform operations from modifying state simultaneously, which could corrupt it. When using S3 as a remote backend, a DynamoDB table provides locking by recording an entry when state is in use. Any concurrent plan or apply will wait or fail rather than proceeding with potentially stale state.

What are the four golden signals for monitoring?

From the Google SRE book: Latency (how long requests take), Traffic (request rate), Errors (rate of failed requests), and Saturation (how close the system is to capacity). These four metrics provide a useful baseline for production monitoring and alerting because they cover the dimensions most likely to affect user experience.