
GCP PCA Drill: Cost-Effective Hadoop Migration - The Managed vs. DIY Trade-off

Jeff Taakey
Author
21+ Year Enterprise Architect | Multi-Cloud Architect & Strategist.
Jeff's Architecture Insights
Go beyond static exam dumps. Jeff’s Insights is engineered to cultivate the mindset of a Production-Ready Architect. We move past ‘correct answers’ to dissect the strategic trade-offs and multi-cloud patterns required to balance reliability, security, and TCO in mission-critical environments.

While preparing for the GCP Professional Cloud Architect (PCA) exam, many candidates get confused by the choice between managed services and manual infrastructure deployment for big data workloads. In the real world, this is fundamentally a decision about balancing operational overhead, cost efficiency, and reliability. Let’s drill into a simulated scenario.

The Scenario
#

GreenFunder is a rapidly growing global fintech startup using Hadoop jobs to power their fraud detection data science pipelines. Recently, they decided to migrate their Hadoop workloads to Google Cloud Platform — but they want to avoid modifying the existing job logic or infrastructure dependencies. The engineering leadership asks you to design the migration approach that minimizes ongoing infrastructure management effort and overall costs without compromising job functionality.

Requirements
#

Migrate Hadoop jobs to Google Cloud without altering the underlying infrastructure setup, while minimizing cost and operational toil.

The Options
#

  • A) Create a Dataproc cluster using standard worker instances.
  • B) Create a Dataproc cluster using preemptible worker instances.
  • C) Manually deploy a Hadoop cluster on Compute Engine using standard instances.
  • D) Manually deploy a Hadoop cluster on Compute Engine using preemptible instances.

Correct Answer
#

B) Create a Dataproc cluster using preemptible worker instances.


The Architect’s Analysis
#


Step-by-Step Winning Logic
#

Dataproc is a fully managed Hadoop and Spark service that abstracts away cluster lifecycle and maintenance, aligning with the SRE principle of reducing toil so teams can focus on reliability. Using preemptible instances for secondary workers drastically cuts compute costs by tapping spare Google Cloud capacity at a fraction of the standard price. Because preemptible secondary workers only process data and do not store HDFS blocks, the cluster simply reschedules affected tasks on the remaining nodes when instances are reclaimed, preserving workload continuity without manual intervention. This combination strikes the best balance between operational simplicity and cost efficiency for existing Hadoop workloads that require no code changes.
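To ground this, here is a minimal sketch of creating such a cluster with the google-cloud-dataproc Python client. The project, region, machine types, and node counts are hypothetical, and the field names reflect the dataproc_v1 API as I recall it, so verify them against the current documentation before relying on this.

```python
# Minimal sketch: Dataproc cluster with preemptible secondary workers.
# Project, region, machine types, and instance counts are illustrative assumptions.
from google.cloud import dataproc_v1

project_id = "greenfunder-prod"   # hypothetical project
region = "us-central1"            # hypothetical region

cluster_client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": "fraud-detection-hadoop",
    "config": {
        "gce_cluster_config": {"zone_uri": f"{region}-a"},
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        # Standard primary workers keep HDFS and baseline capacity stable.
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
        # Secondary workers are preemptible: processing only, no HDFS storage,
        # so a reclaimed node triggers task rescheduling rather than data loss.
        "secondary_worker_config": {
            "num_instances": 4,
            "preemptibility": "PREEMPTIBLE",
        },
    },
}

operation = cluster_client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
print("Cluster created:", operation.result().cluster_name)
```

The key design choice is keeping primary workers on standard VMs, since they hold HDFS, and pushing only stateless secondary capacity onto preemptible instances.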

The Traps (Distractor Analysis)
#

  • Why not A? A Dataproc cluster with standard workers works, but it misses the significant compute savings offered by preemptible VMs.
  • Why not C? Manually deploying a Hadoop cluster on Compute Engine carries heavy tooling and maintenance overhead, which contradicts cloud-native operational best practices and increases operational risk.
  • Why not D? Running preemptible VMs in a self-managed cluster does cut compute cost, but you lose Dataproc's automated cluster management: preemptions then cause task failures with no transparent recovery, increasing both toil and risk.

The Architect Blueprint
#

  • Mermaid Diagram illustrating the flow of the CORRECT solution.
```mermaid
graph TD
    UserAPI([Data Science Team]) -->|Submit Hadoop job| Dataproc[Dataproc Cluster]
    Dataproc --> StandardMaster[Standard Master VMs]
    Dataproc --> PreemptibleWorkers[Preemptible Worker VMs]
    PreemptibleWorkers -->|Preemption events handled automatically| Dataproc
    Dataproc --> GCS["Google Cloud Storage (Job Input/Output)"]
    style Dataproc fill:#4285F4,stroke:#333,color:#fff
    style PreemptibleWorkers fill:#34A853,stroke:#333,color:#fff
```
  • Diagram Note: The data science team submits jobs to a managed Dataproc cluster, which uses cost-effective preemptible worker nodes handled transparently to reduce expenses and operational overhead.
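To make the flow concrete, the sketch below submits an existing Hadoop jar to that cluster through the dataproc_v1 job client. The bucket, jar path, and cluster name are placeholders rather than values from the scenario.

```python
# Minimal sketch: submit an unmodified Hadoop jar to the managed cluster.
# Bucket, jar, and cluster names are placeholders.
from google.cloud import dataproc_v1

project_id = "greenfunder-prod"
region = "us-central1"

job_client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

job = {
    "placement": {"cluster_name": "fraud-detection-hadoop"},
    "hadoop_job": {
        "main_jar_file_uri": "gs://greenfunder-jobs/fraud-scoring.jar",
        "args": ["gs://greenfunder-data/input/", "gs://greenfunder-data/output/"],
    },
}

# Dataproc schedules the job across the standard master/primary workers and
# the preemptible secondary workers; no job code changes are needed.
operation = job_client.submit_job_as_operation(
    request={"project_id": project_id, "region": region, "job": job}
)
print("Job finished with state:", operation.result().status.state.name)
```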

The Decision Matrix (Mandatory for Professional Level)
#

| Option | Est. Complexity | Est. Monthly Cost | Pros | Cons |
| --- | --- | --- | --- | --- |
| A) Dataproc with standard workers | Low | Medium (standard VM pricing) | Fully managed, reliable, low operational toil | Higher cost without preemptible savings |
| B) Dataproc with preemptible workers | Low | Low (up to 70% savings on workers) | Managed service plus significant cost reduction; automatic recovery from preemption | Potential job latency if many preemptions occur |
| C) Manual Hadoop on standard Compute Engine | High | High (full VM cost plus management effort) | Full control over cluster setup | Heavy operational burden; no autoscaling or managed jobs |
| D) Manual Hadoop on preemptible Compute Engine | High | Medium-Low (reduced VM cost) | Lower compute cost | High operational risk from preemption; no management automation |
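To see where the cost column comes from, here is a back-of-envelope sketch that applies the matrix's "up to 70% savings on workers" figure. The rates are relative placeholders, not actual GCP prices.

```python
# Back-of-envelope worker-spend comparison using the table's "up to 70% savings"
# figure. Rates are relative placeholders, not real GCP pricing.
STANDARD_RATE = 1.00     # relative hourly cost of one standard worker VM
PREEMPTIBLE_RATE = 0.30  # roughly 70% cheaper, per the decision matrix
WORKERS = 10             # hypothetical worker count
HOURS_PER_MONTH = 730

standard = WORKERS * STANDARD_RATE * HOURS_PER_MONTH
preemptible = WORKERS * PREEMPTIBLE_RATE * HOURS_PER_MONTH
print(f"Relative monthly worker spend: standard={standard:.0f}, "
      f"preemptible={preemptible:.0f} ({1 - preemptible / standard:.0%} lower)")
```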

Real-World Practitioner Insight
#

Exam Rule
#

For the exam, pick Dataproc whenever you see Hadoop or Spark jobs paired with a requirement to minimize operational overhead; when cost is also emphasized, choose preemptible (secondary) workers.

Real World
#

Many organizations favor managed services like Dataproc for big data workloads so they can focus on data insights rather than infrastructure management. Preemptible workers optimize cost while the managed control plane preserves fault tolerance. If legacy integration or custom tooling is paramount, a self-managed cluster may still be justified, but it comes at significant operational expense.
