What does the Google Cloud Professional Cloud DevOps Engineer exam test?
The Professional Cloud DevOps Engineer exam tests your ability to apply site reliability engineering (SRE) principles to GCP environments, build and manage CI/CD pipelines, implement observability solutions, and optimize service reliability. It uniquely combines DevOps pipeline tooling with SRE concepts like SLIs, SLOs, error budgets, and toil reduction, making it one of the more intellectually demanding professional-level GCP exams.
The Google Cloud Professional Cloud DevOps Engineer certification occupies an unusual position in the GCP credential portfolio: it bridges software delivery practices (CI/CD, GitOps, deployment strategies) with operational reliability engineering (SRE methodology, incident management, observability). This combination reflects how modern platform engineering teams actually work -- they own both the delivery pipeline and the production reliability of the services they deploy.
The certification is particularly valuable for platform engineers, SREs, DevOps engineers, and senior developers who are responsible for production GCP environments. According to Dice's 2025 tech salary survey, DevOps engineers with cloud certifications earn an average of $23,000 more annually than non-certified peers [1]. This guide covers every exam domain with sufficient depth to pass the exam, along with hands-on practice priorities and strategic study tips.
Exam Overview
| Attribute | Detail |
|---|---|
| Exam cost | $200 USD |
| Exam duration | 120 minutes |
| Number of questions | 50-60 multiple-choice and multiple-select |
| Validity period | 2 years |
| Delivery | Remote proctored or test center |
| Prerequisites | None (Google recommends 3+ years DevOps/SRE experience) |
| Key skill domains | SRE principles, CI/CD, observability, GKE operations, Terraform |
Exam Domains
| Domain | Title | Approximate Weight |
|---|---|---|
| 1 | Bootstrapping a Google Cloud organization for DevOps | 17% |
| 2 | Building and implementing CI/CD pipelines for a service | 25% |
| 3 | Applying site reliability engineering principles to a service | 22% |
| 4 | Implementing service monitoring strategies | 20% |
| 5 | Optimizing service performance | 16% |
Domain 1: Bootstrapping a Google Cloud Organization for DevOps (17%)
This domain covers establishing the infrastructure and governance foundation that enables DevOps practices at scale.
Infrastructure as Code
Terraform is the dominant IaC tool on GCP and appears heavily in this domain:
- Terraform resource blocks for core GCP services: google_compute_instance, google_container_cluster, google_sql_database_instance
- Remote state management using a Cloud Storage backend, which handles state locking natively (no separate lock table or database is required)
- Workspace-based environment separation: dev, staging, production as separate Terraform workspaces or separate state files
- Module structure for reusable infrastructure components
- Terraform plan and apply in CI/CD: running terraform plan as a pull request check and terraform apply only after review approval
Google Cloud Deployment Manager is also on the exam but is covered at a conceptual level. Terraform is clearly preferred for new infrastructure work in 2025-2026 exam scenarios.
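The "plan as a pull request check" pattern from the list above could be sketched as a Cloud Build configuration attached to a pull request trigger. This is a minimal sketch under stated assumptions: it uses the public `hashicorp/terraform` image from Docker Hub, the `infra` directory is a hypothetical location for the `.tf` files, and the build's service account is assumed to have read access to the state bucket.

```yaml
# cloudbuild.yaml for a pull request trigger: plan only, never apply.
steps:
  - id: terraform-init
    name: 'hashicorp/terraform'   # assumption: public image; pin a version tag in practice
    dir: 'infra'                  # hypothetical directory holding the .tf files
    args: ['init', '-input=false']
  - id: terraform-plan
    name: 'hashicorp/terraform'
    dir: 'infra'
    # -lock-timeout waits briefly if another run currently holds the state lock
    args: ['plan', '-input=false', '-lock-timeout=60s']
```

A separate trigger on the main branch, gated by review approval, would then run `terraform apply` against the same backend.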
GitOps and Configuration Management
GitOps principles treat the Git repository as the single source of truth for desired state. Key concepts:
- Config Connector: a Kubernetes operator that manages GCP resources via Kubernetes custom resources; allows GCP infrastructure to be declared in Git alongside application manifests
- Anthos Config Management: applies configs from a Git repository to GKE clusters automatically; prevents configuration drift
- Policy Controller: admission controller based on OPA (Open Policy Agent) that enforces governance policies on Kubernetes resources
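As an illustration of the Config Connector approach, a Cloud Storage bucket could be declared as a Kubernetes resource and committed next to the application manifests. This is a minimal sketch: the bucket name and namespace are hypothetical, and it assumes Config Connector is already installed and the namespace is bound to a GCP project.

```yaml
apiVersion: storage.cnrm.cloud.google.com/v1beta1
kind: StorageBucket
metadata:
  name: example-artifacts-bucket      # hypothetical; bucket names must be globally unique
  namespace: config-connector-demo    # hypothetical namespace bound to a GCP project
spec:
  location: US
  uniformBucketLevelAccess: true
```

Because the bucket is an ordinary Kubernetes object, the same Git-driven sync that applies application manifests can also reconcile it, which is how drift gets corrected.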
Environment Strategy
- Using separate GCP projects for dev, staging, and production is the standard recommendation for isolation
- Shared VPC with host project enables network consistency across environment projects
- Binary Authorization: requires that container images be signed by trusted authorities before deployment to GKE; enforces that only CI-vetted images reach production
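A Binary Authorization policy that enforces the signed-image requirement might look like the sketch below. `PROJECT_ID` and `ATTESTOR_NAME` are placeholders, and the attestor is assumed to exist already, with the CI pipeline creating attestations for the images it has vetted.

```yaml
# policy.yaml: block any image that lacks an attestation from the CI attestor
globalPolicyEvaluationMode: ENABLE    # exempt Google-maintained system images
defaultAdmissionRule:
  evaluationMode: REQUIRE_ATTESTATION
  enforcementMode: ENFORCED_BLOCK_AND_AUDIT_LOG
  requireAttestationsBy:
    - projects/PROJECT_ID/attestors/ATTESTOR_NAME
```

A file like this can be loaded with `gcloud container binauthz policy import policy.yaml`.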
Domain 2: Building and Implementing CI/CD Pipelines for a Service (25%)
The highest-weighted domain covers the full software delivery lifecycle on GCP.
Cloud Build
Cloud Build is GCP's fully managed CI/CD platform. Key concepts:
- Build configuration: cloudbuild.yaml (or equivalent JSON) defines a sequence of steps, each running in a Docker container
- Cloud Build triggers: connect to source repositories (Cloud Source Repositories, GitHub, Bitbucket) and fire on push, pull request, or tag events
- Substitution variables: parameterize build configs with dynamic values like commit SHA, branch name, and environment target
- Build artifacts: push container images to Artifact Registry; upload build outputs to Cloud Storage
- Private pools: dedicated build workers in your VPC for builds that require access to private resources without public internet exposure
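```yaml
# Example cloudbuild.yaml structure
steps:
  # Build the container image from the repository root
  - name: 'gcr.io/cloud-builders/docker'
    args: ['build', '-t', '$_IMAGE_TAG', '.']
  # Push the image so the deploy step can pull it
  - name: 'gcr.io/cloud-builders/docker'
    args: ['push', '$_IMAGE_TAG']
  # Apply the manifests in k8s/ to the target GKE cluster
  - name: 'gcr.io/cloud-builders/gke-deploy'
    args: ['run', '--filename=k8s/', '--cluster=$_CLUSTER', '--location=$_REGION']
```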
Cloud Deploy
Cloud Deploy is GCP's managed continuous delivery service, introduced to provide structured deployment pipelines with approval gates:
- Delivery pipelines define the sequence of target environments: dev > staging > production
- Releases are immutable: the same artifact is promoted through stages rather than rebuilt
- Rollouts: the deployment of a release to a specific target; can be manual or automatic
- Approval requirements between stages enforce human review for production deployments
- Rollback: Cloud Deploy can roll back to any previous release with a single command
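The pipeline, target, and approval concepts above map onto a small declarative file. The following is a minimal sketch: the pipeline name is hypothetical, `PROJECT_ID`, `REGION`, and `CLUSTER_NAME` are placeholders, and only the production target is shown (the `dev` and `staging` targets would be defined the same way, without the approval requirement).

```yaml
# clouddeploy.yaml
apiVersion: deploy.cloud.google.com/v1
kind: DeliveryPipeline
metadata:
  name: example-pipeline
description: Promotes one release through dev, staging, and production
serialPipeline:
  stages:
    - targetId: dev
    - targetId: staging
    - targetId: production
---
apiVersion: deploy.cloud.google.com/v1
kind: Target
metadata:
  name: production
requireApproval: true    # rollouts to this target wait for a human approval
gke:
  cluster: projects/PROJECT_ID/locations/REGION/clusters/CLUSTER_NAME
```

The file would be registered with `gcloud deploy apply --file=clouddeploy.yaml --region=REGION`.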
Deployment Strategies
The exam tests when to use each deployment strategy:
| Strategy | Description | Use Case |
|---|---|---|
| Rolling update | Gradually replaces old pods with new pods | Default GKE update; minimal downtime |
| Canary deployment | Routes small percentage of traffic to new version | Risk reduction for high-impact changes |
| Blue/green | Runs old and new versions in parallel; switches all traffic at once | Zero-downtime with fast rollback option |
| A/B testing | Routes traffic based on user attributes, not percentage | Feature validation with specific user segments |
For GKE deployments, rolling updates are configured via the Deployment spec (maxSurge, maxUnavailable). Canary deployments in GKE use separate Deployments with different labels, managed by a traffic-splitting mechanism (Istio, Traffic Director, or weighted routes in the Gateway API). Blue/green deployments switch the Service selector between two Deployment label sets.
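To make the rolling-update and blue/green mechanics concrete, here is a minimal sketch. The names, labels, replica count, and port are hypothetical, and the image path uses placeholders.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app-green
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1          # at most one extra pod above the desired count during an update
      maxUnavailable: 0    # never drop below the desired count
  selector:
    matchLabels:
      app: example-app
      version: green
  template:
    metadata:
      labels:
        app: example-app
        version: green
    spec:
      containers:
        - name: example-app
          image: REGION-docker.pkg.dev/PROJECT_ID/REPO/example-app:TAG
          ports:
            - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: example-app
spec:
  selector:
    app: example-app
    version: green    # blue/green cutover: switch this value between "blue" and "green"
  ports:
    - port: 80
      targetPort: 8080
```

Rolling back a blue/green release is then a one-field change to the Service selector, provided the blue Deployment is still running.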
Artifact Registry
Artifact Registry replaced Container Registry as the standard GCP artifact repository:
- Supports Docker images, npm packages, Maven artifacts, Python wheels, and Go modules in the same service
- Regional repositories reduce latency and improve reliability compared to global Container Registry
- Vulnerability scanning: integration with Container Analysis provides CVE scanning on stored images
- Binary Authorization integration: images stored in Artifact Registry can be required to carry a valid attestation (signature) before GKE admits them to production clusters
Domain 3: Applying Site Reliability Engineering Principles to a Service (22%)
The SRE domain is what makes this certification unique among DevOps certifications. The Google SRE book [7] underpins much of this domain.
Service Level Objectives (SLOs) and Service Level Indicators (SLIs)
- SLI: a quantitative measurement of service behavior from the user's perspective. Common SLIs: availability (% of successful requests), latency (% of requests served within a threshold), error rate, throughput
- SLO: a target value for the SLI. Example: 99.9% of requests return 2xx status within 500ms
- SLA: a contractual commitment with consequences for violation; typically less strict than the internal SLO
- Error budget: the amount of unreliability permitted by the SLO. A 99.9% SLO leaves 0.1% of the window as budget, which is about 43.8 minutes of full downtime in an average month. Depleting the error budget typically triggers a feature freeze and a shift to reliability work.
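The error budget arithmetic can be worked through explicitly. Treating the budget as full-outage minutes (a simplification, since request-based SLOs count failed requests rather than minutes):

$$
\text{error budget} = (1 - \text{SLO}) \times \text{window length}
$$

For a 99.9% SLO over a 30-day window:

$$
(1 - 0.999) \times 30 \times 24 \times 60\ \text{min} = 43.2\ \text{min}
$$

Over an average calendar month of $365.25 / 12 \approx 30.44$ days:

$$
(1 - 0.999) \times 30.44 \times 24 \times 60\ \text{min} \approx 43.8\ \text{min}
$$

The two figures differ only in the assumed window length, so it is worth checking which window an exam question specifies.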
Cloud Monitoring supports native SLO configuration for services defined in Service Monitoring. The exam tests the creation of SLOs using the GCP Console and the Cloud Monitoring API.
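As a rough illustration of the API route, the body of a `services.serviceLevelObjectives.create` request might look like the sketch below. It is written as YAML for readability (the API takes the JSON equivalent), the display name is hypothetical, and it assumes a service type for which Cloud Monitoring can derive a basic availability SLI on its own.

```yaml
displayName: "99.9% availability over 30 days"
goal: 0.999
rollingPeriod: 2592000s    # 30 days expressed in seconds
serviceLevelIndicator:
  basicSli:
    availability: {}       # let Cloud Monitoring derive the availability SLI
```

For services without a basic SLI, a request-based or windows-based indicator with explicit metric filters would be needed instead.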
"The error budget is not a license to be unreliable. It is a mathematical framework for making explicit, agreed-upon trade-offs between reliability and feature velocity. When the error budget is depleted, reliability work takes priority over new feature development." -- Site Reliability Engineering book, Google [7]
Toil Reduction
Toil is manual, repetitive, automatable work that scales with service growth. SRE practice aims to keep toil below 50% of each engineer's time.
- Identify toil: manual deployments, repetitive ticket handling, manual scaling interventions
- Automate toil: Cloud Build pipelines for deployments, GKE cluster autoscaler for scaling, alerting runbooks for common incidents
- Track toil: time-tracking per category to measure reduction over time
Postmortem Culture
- Blameless postmortems focus on systemic causes, not individual blame
- Five whys analysis: recursively asking "why?" to identify root causes beyond surface symptoms
- Action items from postmortems must be tracked to completion; unfinished action items indicate accumulating reliability debt
Domain 4: Implementing Service Monitoring Strategies (20%)
Cloud Monitoring
Cloud Monitoring is GCP's managed observability service, based on the Monarch time-series monitoring system internally:
- Metrics: built-in metrics for all GCP services; custom metrics via the Cloud Monitoring API or OpenTelemetry
- Dashboards: pre-built dashboards for GKE, Compute Engine, App Engine; custom dashboards via the metrics explorer
- Alerting policies: define conditions based on metric thresholds, rate-of-change, or absence of metrics; notify via email, PagerDuty, Slack, or Pub/Sub
- Uptime checks: synthetic monitoring from distributed Google locations for availability measurement
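An alerting policy tied to an uptime check could be expressed roughly as follows. This is a sketch of an AlertPolicy resource under stated assumptions: `UPTIME_CHECK_ID`, `PROJECT_ID`, and `CHANNEL_ID` are placeholders, and the alignment period and duration are illustrative values rather than recommendations.

```yaml
displayName: "Uptime check failing"
combiner: OR
conditions:
  - displayName: "Check fails from more than one location"
    conditionThreshold:
      filter: >-
        metric.type="monitoring.googleapis.com/uptime_check/check_passed"
        AND resource.type="uptime_url"
        AND metric.label.check_id="UPTIME_CHECK_ID"
      aggregations:
        - alignmentPeriod: 300s
          perSeriesAligner: ALIGN_NEXT_OLDER
          crossSeriesReducer: REDUCE_COUNT_FALSE   # count checker locations reporting failure
          groupByFields:
            - resource.label.*
      comparison: COMPARISON_GT
      thresholdValue: 1
      duration: 300s
notificationChannels:
  - projects/PROJECT_ID/notificationChannels/CHANNEL_ID
```

Requiring failures from more than one location is a common way to avoid paging on a single checker's network problem.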
Cloud Logging
Cloud Logging is GCP's managed log aggregation service:
- Structured logging: emit JSON logs with severity, timestamp, trace ID, and custom fields; Cloud Logging parses structured fields for filtering and analysis
- Log-based metrics: create counters or distributions from log entries matching a filter; use for alerting on application-level events not exposed as Cloud Monitoring metrics
- Log sinks: export log entries to Cloud Storage (long-term archival), BigQuery (analysis), or Pub/Sub (real-time processing)
- Log buckets: configure retention periods and log analytics mode for SQL-based log analysis
Cloud Trace and Cloud Profiler
- Cloud Trace: distributed tracing service that collects latency data across microservices; integrates with OpenTelemetry for language-agnostic instrumentation
- Cloud Profiler: continuous profiling of CPU usage, heap allocation, and goroutine counts for production services without significant performance overhead
Error Reporting
Cloud Error Reporting automatically groups application errors from Cloud Logging and presents them with occurrence counts, affected users, and first/last seen timestamps. Alerts on new error types can be configured to trigger incident response.
GKE Observability
GKE provides integrated monitoring that uses Cloud Monitoring and Cloud Logging automatically:
- Workload metrics: CPU and memory requests vs. limits, pod restart counts, deployment rollout status
- GKE dataplane observability: network policy logging, connection-level metrics for services
- Managed Prometheus: Google Cloud Managed Service for Prometheus enables Prometheus-compatible monitoring for GKE workloads without running Prometheus infrastructure
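With managed collection enabled, scraping a workload is typically configured through a `PodMonitoring` resource rather than a Prometheus server config. A minimal sketch, assuming a hypothetical workload labelled `app: example-app` that exposes Prometheus metrics on a container port named `metrics`:

```yaml
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: example-app
spec:
  selector:
    matchLabels:
      app: example-app    # hypothetical pod label
  endpoints:
    - port: metrics       # hypothetical named container port
      interval: 30s
```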
Domain 5: Optimizing Service Performance (16%)
Performance Analysis
- Cloud Profiler identifies CPU and memory hotspots in production code without requiring staging reproduction
- BigQuery query plans: the INFORMATION_SCHEMA.JOBS view and the query plan explanation in the BigQuery Console show per-stage slot usage and bytes processed
- GKE horizontal pod autoscaler (HPA): scales pod count based on CPU, memory, or custom metrics from Cloud Monitoring
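A CPU-based HPA, the simplest of the variants listed above, might look like this. The Deployment name, replica bounds, and 60% target are hypothetical values chosen for illustration.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60    # percentage of the pods' CPU requests
```

Utilization is measured against each pod's CPU request, so the target Deployment needs resource requests set for this to work.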
Load Balancing and Traffic Management
- Cloud Load Balancing is GCP's fully managed, globally distributed load-balancing service; it is software-defined rather than a single VM or appliance, so there are no load balancer instances to provision or manage
- Traffic Director: GCP's managed service mesh control plane for internal load balancing and traffic management between microservices
- Cloud CDN: caches content at Google's edge nodes globally; cache hit rate analysis in Cloud Monitoring identifies caching opportunities
Cost and Performance Balance
- Spot VMs (the successor to preemptible VMs): 60-91% cheaper than standard VMs; can be preempted with a 30-second warning; use for fault-tolerant batch workloads and CI build workers
- GKE Autopilot: Google manages node provisioning and scaling; charges per pod resource requests rather than node capacity; typically reduces infrastructure cost for variable workloads
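On GKE, a fault-tolerant batch workload can be steered onto Spot capacity with a node selector. The following is a minimal sketch with a hypothetical Job name, a placeholder image path, and illustrative resource requests; the retry settings are what let the Job tolerate preemption.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example-batch
spec:
  backoffLimit: 5                          # retry pods that fail, including after preemption
  template:
    spec:
      restartPolicy: OnFailure
      nodeSelector:
        cloud.google.com/gke-spot: "true"  # schedule only onto Spot nodes
      terminationGracePeriodSeconds: 25    # finish shutdown inside the 30-second warning
      containers:
        - name: worker
          image: REGION-docker.pkg.dev/PROJECT_ID/REPO/example-batch:TAG
          resources:
            requests:
              cpu: 500m
              memory: 512Mi
```

On Autopilot the resource requests also determine the bill, since charges follow pod requests rather than node capacity.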
Study Tips and Common Exam Traps
Understand SRE concepts deeply. The SRE domain is where candidates with only DevOps tooling experience lose points. Read at minimum chapters 2, 3, and 5 of the Google SRE book (available free at sre.google). Understanding why error budgets exist and how they govern feature velocity is more important than memorizing formulas.
Know Cloud Deploy vs. Cloud Build. Cloud Build handles CI (build, test, push artifact). Cloud Deploy handles CD (promoting artifacts through environment stages with approvals). The exam presents scenarios where candidates must identify which tool is responsible for which phase of delivery.
Distinguish monitoring alert types. Symptom-based alerting (alert when users are experiencing errors or high latency) is preferred over cause-based alerting (alert when CPU is high). The exam tests whether candidates understand that cause-based alerts generate noise; symptom-based alerts are actionable.
Practice with GKE Autopilot. Autopilot-specific behavior (no node management, per-pod billing, restricted host access) appears in scenarios where the correct answer depends on understanding Autopilot's constraints vs. Standard mode.
| Scenario Signal | Likely Correct Answer |
|---|---|
| Need approval gates between environments | Cloud Deploy |
| Need to run unit tests on every commit | Cloud Build trigger |
| Need to enforce only signed images in production | Binary Authorization |
| Need to prevent config drift on GKE cluster | Anthos Config Management |
| Service experiencing high error rate | Check SLO, consult error budget |
| Error budget depleted | Feature freeze, reliability sprint |
| Manual scaling intervention for traffic spike | Implement HPA with Cloud Monitoring custom metrics |
References
[1] Dice. "Tech Salary Report 2025." dice.com. Accessed May 2026.
[2] Google Cloud. "Professional Cloud DevOps Engineer Exam Guide." cloud.google.com/certification/cloud-devops-engineer. Accessed May 2026.
[3] Google Cloud. "Cloud Build Documentation." cloud.google.com/build/docs. Accessed May 2026.
[4] Google Cloud. "Cloud Deploy Documentation." cloud.google.com/deploy/docs. Accessed May 2026.
[5] Google Cloud. "Cloud Monitoring Documentation." cloud.google.com/monitoring/docs. Accessed May 2026.
[6] Tutorials Dojo. "Google Cloud Professional DevOps Engineer Practice Exams." tutorialsdojo.com. Accessed May 2026.
[7] Beyer, B., Jones, C., Petoff, J., Murphy, N.R. "Site Reliability Engineering." O'Reilly Media / Google, 2016. sre.google/sre-book.
[8] Google Cloud. "GKE Documentation: Choosing a GKE mode of operation." cloud.google.com/kubernetes-engine/docs. Accessed May 2026.
