
AWS SAP-C02 Drill: High Availability Architecture - The Resilience vs. Complexity Trade-off

Jeff Taakey
21+ Year Enterprise Architect | Multi-Cloud Architect & Strategist.
Jeff's Architecture Insights
Go beyond static exam dumps. Jeff’s Insights is engineered to cultivate the mindset of a Production-Ready Architect. We move past ‘correct answers’ to dissect the strategic trade-offs and multi-cloud patterns required to balance reliability, security, and TCO in mission-critical environments.

While preparing for the AWS SAP-C02, many candidates confuse High Availability with Disaster Recovery patterns. In the real world, this is fundamentally a decision about Acceptable Downtime vs. Infrastructure Cost. Let’s drill into a simulated scenario.

The Scenario

MediTrack Solutions operates a patient appointment scheduling platform that processes real-time booking requests for 200+ medical clinics. The current architecture runs on:

  • A single Amazon EC2 instance (t3.large) hosting the application tier
  • An Amazon ElastiCache for Redis single-node cluster storing session state and appointment availability cache
  • An Amazon RDS for MariaDB single-instance database containing patient records and scheduling data

The platform experienced a 4-hour outage last month when the EC2 instance failed due to underlying hardware issues. During this time, clinics could not accept appointments, resulting in revenue loss and regulatory compliance concerns. The CTO has mandated that the architecture must automatically recover from component failures with minimal downtime (target: < 5 minutes).

Key Requirements

Design an architecture that enables automatic recovery from infrastructure failures across all three tiers (compute, cache, database) while minimizing manual intervention and downtime.

The Options

Select THREE:

  • A) Deploy an Elastic Load Balancer to distribute traffic across multiple EC2 instances; ensure instances are part of an Auto Scaling group with a minimum capacity of 2.
  • B) Deploy an Elastic Load Balancer to distribute traffic across multiple EC2 instances; configure instances in “Unlimited” mode to handle burst traffic.
  • C) Modify the database to create a read replica in the same Availability Zone; promote the read replica to primary during failures.
  • D) Modify the database to use Multi-AZ deployment spanning two Availability Zones.
  • E) Create a replication group for the ElastiCache for Redis cluster; configure the cluster to use an Auto Scaling group with minimum capacity of 2.
  • F) Create a replication group for the ElastiCache for Redis cluster; enable Multi-AZ with automatic failover.

Correct Answer

A, D, F


The Architect’s Analysis


Step-by-Step Winning Logic

This solution addresses all three architectural single points of failure using AWS-native high availability features:

  1. Option A (Compute Tier): An Auto Scaling group (ASG) with a minimum capacity of 2 ensures that if one EC2 instance fails, the ELB redirects traffic to the remaining healthy instance while the ASG launches a replacement. Traffic recovery: typically < 60 seconds (driven by the health check interval); the replacement instance itself takes a few minutes to warm up.

  2. Option D (Database Tier): RDS Multi-AZ provides synchronous replication to a standby instance in a different AZ. Failover is fully automatic (AWS manages DNS CNAME update) with typical failover time of 60-120 seconds. This is superior to manual read replica promotion.

  3. Option F (Cache Tier): ElastiCache Multi-AZ with automatic failover creates a replica node in a different AZ and automatically promotes it during primary node failure. Failover time: typically < 60 seconds. This is the only option that provides automatic recovery for the cache layer.

The Trade-off: You’re paying ~2x infrastructure cost (2+ EC2 instances, Multi-AZ database premium, cache replica) but gaining automated recovery across all tiers without custom scripting or operational runbooks.
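
To see how the three fixes map to actual API calls, here is a minimal boto3 sketch. Every identifier (resource names, subnets, the target group ARN) is a hypothetical placeholder, and the sizing mirrors the scenario rather than a recommendation:

```python
import boto3

REGION = "us-east-1"

# Compute tier (Option A): ASG with min capacity 2 behind a target group.
autoscaling = boto3.client("autoscaling", region_name=REGION)
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="meditrack-app-asg",           # hypothetical name
    LaunchTemplate={"LaunchTemplateName": "meditrack-app", "Version": "$Latest"},
    MinSize=2,
    MaxSize=4,
    DesiredCapacity=2,
    VPCZoneIdentifier="subnet-aaa,subnet-bbb",          # subnets in two AZs
    TargetGroupARNs=["arn:aws:elasticloadbalancing:...:targetgroup/meditrack/abc"],
    HealthCheckType="ELB",        # replace instances the load balancer marks unhealthy
    HealthCheckGracePeriod=120,
)

# Database tier (Option D): convert the existing instance to Multi-AZ.
rds = boto3.client("rds", region_name=REGION)
rds.modify_db_instance(
    DBInstanceIdentifier="meditrack-mariadb",
    MultiAZ=True,
    ApplyImmediately=False,  # apply in the next maintenance window
)

# Cache tier (Option F): replication group with Multi-AZ automatic failover.
elasticache = boto3.client("elasticache", region_name=REGION)
elasticache.create_replication_group(
    ReplicationGroupId="meditrack-sessions",
    ReplicationGroupDescription="Session state and availability cache",
    Engine="redis",
    CacheNodeType="cache.t3.medium",
    NumCacheClusters=2,              # one primary + one replica in another AZ
    MultiAZEnabled=True,
    AutomaticFailoverEnabled=True,
)
```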

The Traps (Distractor Analysis)

  • Why not Option B? “Unlimited” mode for T-series instances allows CPU burst credit accumulation, but has zero impact on availability. This is a performance/cost feature, not a resilience pattern. It doesn’t address instance failure.

  • Why not Option C? Read replicas in RDS require manual promotion to become primary (see the sketch after this list), while the question explicitly requires “automatic recovery.” Same-AZ placement also means no protection against AZ-level failures, and failover would take 10+ minutes (manual intervention plus the promotion process).

  • Why not Option E? ElastiCache does not support Auto Scaling groups for nodes. This is architecturally incorrect—cache clusters use replication groups, not ASGs. This is a distractor designed to confuse EC2 Auto Scaling patterns with cache resilience patterns.
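
For contrast with Multi-AZ’s hands-off failover, this is the manual step Option C would hinge on; the replica identifier is hypothetical:

```python
import boto3

rds = boto3.client("rds")

# Option C requires a human (or custom tooling) to run this during an outage;
# nothing in RDS triggers it for you, and promotion itself takes minutes.
rds.promote_read_replica(DBInstanceIdentifier="meditrack-mariadb-replica")
```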

The Architect Blueprint

```mermaid
graph TB
    Users([Clinic Users])
    subgraph "Multi-AZ Architecture"
        ELB[Elastic Load Balancer]
        subgraph AZ1["Availability Zone 1"]
            EC2_1[EC2 Instance 1<br/>Application Tier]
            RDS_Primary[(RDS MariaDB<br/>Primary)]
            Redis_Primary[ElastiCache Redis<br/>Primary Node]
        end
        subgraph AZ2["Availability Zone 2"]
            EC2_2[EC2 Instance 2+<br/>Application Tier]
            RDS_Standby[(RDS MariaDB<br/>Standby - Sync Replication)]
            Redis_Replica[ElastiCache Redis<br/>Replica Node]
        end
        ASG[Auto Scaling Group<br/>Min: 2, Desired: 2]
    end
    Users --> ELB
    ELB --> EC2_1
    ELB --> EC2_2
    EC2_1 -.Session State.-> Redis_Primary
    EC2_2 -.Session State.-> Redis_Primary
    EC2_1 -.Reads/Writes.-> RDS_Primary
    EC2_2 -.Reads/Writes.-> RDS_Primary
    RDS_Primary -.Synchronous Replication.-> RDS_Standby
    Redis_Primary -.Async Replication.-> Redis_Replica
    ASG -. Manages .-> EC2_1
    ASG -. Manages .-> EC2_2
    style RDS_Primary fill:#3b82f6,stroke:#1e40af,color:#fff
    style RDS_Standby fill:#60a5fa,stroke:#3b82f6,color:#fff
    style Redis_Primary fill:#10b981,stroke:#059669,color:#fff
    style Redis_Replica fill:#6ee7b7,stroke:#10b981,color:#000
    style ELB fill:#f59e0b,stroke:#d97706,color:#fff
```

Diagram Note: Traffic flows through ELB to multiple EC2 instances across two AZs. RDS Multi-AZ synchronously replicates to standby (automatic failover), while ElastiCache replicates asynchronously with automatic promotion on primary failure. Auto Scaling maintains minimum capacity.

The Decision Matrix

| Option | Est. Complexity | Est. Monthly Cost | Pros | Cons |
| --- | --- | --- | --- | --- |
| A (ASG + ELB) | Medium | $180-$250/mo (2x t3.large @ ~$60/mo each + ALB @ ~$25/mo + data transfer) | ✅ Automatic instance replacement<br/>✅ Health check-driven recovery<br/>✅ Horizontal scaling capability<br/>✅ ~60s recovery time | ⚠️ Requires stateless application design<br/>⚠️ 2x compute cost vs. single instance<br/>⚠️ ELB adds latency (~1-5ms) |
| B (Unlimited Mode) | Low | $0 incremental (pricing model change only) | ✅ No architectural changes<br/>✅ Predictable billing for burst workloads | ❌ Zero availability improvement<br/>❌ Does not address failure recovery<br/>❌ Potential cost overruns if burst exceeds expectations |
| C (Read Replica, Same AZ) | High | $85-$100/mo (replica @ ~$85/mo for db.t3.medium) | ✅ Offloads read traffic<br/>✅ Can be promoted manually | ❌ Manual failover (10-15 min)<br/>❌ Same AZ = no AZ failure protection<br/>❌ Promotion causes DNS change delay<br/>❌ Async replication = potential data loss |
| D (RDS Multi-AZ) | Low | $170-$190/mo (2x db.t3.medium @ ~$85/mo base; Multi-AZ adds ~100% premium) | ✅ Automatic failover (60-120s)<br/>✅ Synchronous replication (no data loss)<br/>✅ AWS-managed (zero operational overhead)<br/>✅ AZ-level failure protection | ⚠️ 2x database cost<br/>⚠️ Standby not usable for reads<br/>⚠️ Brief connection interruption during failover |
| E (ElastiCache ASG) | N/A | N/A | None | ❌ Architecturally invalid: ElastiCache does not support ASGs<br/>❌ Confuses EC2 Auto Scaling with cache resilience patterns |
| F (ElastiCache Multi-AZ) | Low | $80-$110/mo (2x cache.t3.medium @ ~$40/mo each + cross-AZ data transfer ~$5-10/mo) | ✅ Automatic failover (~30-60s)<br/>✅ AZ-level failure protection<br/>✅ Minimal application changes (connection string update)<br/>✅ Preserves session state during failover | ⚠️ 2x cache cost<br/>⚠️ Async replication (sub-second lag, potential minimal data loss)<br/>⚠️ With cluster mode disabled, Multi-AZ needs at least one replica |

Total Cost Analysis:

  • Original Architecture: ~$215/mo (1x EC2 t3.large @ $60 + 1x RDS @ $85 + 1x ElastiCache @ $40 + misc. $30)
  • Correct Solution (A+D+F): ~$450-550/mo
  • Cost Premium: ~2.1-2.5x for automated sub-2-minute recovery across all tiers
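
The premium figure is simple arithmetic over the article’s rough estimates; the sketch below just makes the assumptions explicit (illustrative numbers, not live AWS pricing):

```python
# Rough monthly cost model using this article's estimates (not live AWS pricing).
original = {"ec2_t3_large": 60, "rds_db_t3_medium": 85, "elasticache_t3_medium": 40, "misc": 30}

ha = {
    "ec2_t3_large_x2": 2 * 60,   # Option A: ASG minimum of 2
    "alb": 25,                   # Option A: load balancer
    "rds_multi_az": 2 * 85,      # Option D: ~100% premium over single instance
    "elasticache_x2": 2 * 40,    # Option F: primary + replica
    "misc_and_transfer": 60,     # data transfer, cross-AZ replication, monitoring, etc.
}

base, resilient = sum(original.values()), sum(ha.values())
print(f"original ~${base}/mo, HA ~${resilient}/mo, premium ~{resilient / base:.1f}x")
# -> original ~$215/mo, HA ~$455/mo, premium ~2.1x
```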

Real-World Practitioner Insight

Exam Rule

For the SAP-C02 exam, when you see:

  • “Automatic recovery” → Look for Multi-AZ (RDS, ElastiCache) or Auto Scaling
  • “Minimal downtime” → Prioritize AWS-managed failover over manual orchestration
  • “All components must be available” → Address every single point of failure in the stack

Exam Pattern Recognition: AWS Professional exams favor Multi-AZ over read replicas for HA scenarios because Multi-AZ is:

  1. Automatic (no operational burden)
  2. Synchronous (RDS) or low-lag asynchronous (ElastiCache)
  3. Integrated with AWS service health checks

Real World

In production environments, we would likely enhance this further:

  1. Multi-Region Active-Passive: For critical healthcare workloads (HIPAA/compliance), we’d add:

    • Cross-Region Read Replicas for RDS (async, but regional disaster recovery)
    • Global Datastore for ElastiCache (cross-region replication)
    • Route 53 health checks with failover routing (sketched after this list)
  2. Observability Layer: The exam doesn’t mention monitoring, but we’d add:

    • CloudWatch Alarms on Auto Scaling health checks
    • RDS Enhanced Monitoring for failover detection
    • ElastiCache metrics for replication lag
  3. Cost Optimization: For a 200-clinic system with predictable traffic:

    • Use Reserved Instances (1-year commitment) → ~40% cost reduction
    • Implement ElastiCache Reserved Nodes → Additional 30-50% savings
    • Consider Graviton2 instances (t4g family) → 20% better price/performance
  4. Testing Discipline: Unlike the exam scenario, we’d mandate:

    • Monthly chaos engineering (terminate instances/trigger failovers)
    • GameDay exercises with RTO/RPO validation
    • Automated failover testing in staging environments
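
As an illustration of the first enhancement, here is a boto3 sketch of a Route 53 health check wired to a PRIMARY failover record; the domain, hosted zone ID, and caller reference are hypothetical:

```python
import boto3

route53 = boto3.client("route53")

# Health check that probes the primary region's load balancer endpoint.
health = route53.create_health_check(
    CallerReference="meditrack-primary-001",  # arbitrary idempotency token
    HealthCheckConfig={
        "Type": "HTTPS",
        "FullyQualifiedDomainName": "primary.meditrack.example.com",
        "ResourcePath": "/healthz",
        "RequestInterval": 30,
        "FailureThreshold": 3,
    },
)

# PRIMARY failover record: Route 53 answers with the SECONDARY record set
# automatically once this health check reports unhealthy.
route53.change_resource_record_sets(
    HostedZoneId="Z0000000000000",  # hypothetical zone ID
    ChangeBatch={
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "app.meditrack.example.com",
                "Type": "CNAME",
                "TTL": 60,
                "SetIdentifier": "primary",
                "Failover": "PRIMARY",
                "HealthCheckId": health["HealthCheck"]["Id"],
                "ResourceRecords": [{"Value": "primary.meditrack.example.com"}],
            },
        }]
    },
)
```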

The Hidden Constraint: The exam assumes the application is stateless (session stored in Redis). In reality, we’d also verify:

  • Database connection pooling handles DNS updates during RDS failover
  • Application retry logic for transient ElastiCache failover errors (sketched below)
  • ELB deregistration delay matches application graceful shutdown time
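
For the retry-logic bullet, a minimal sketch assuming the redis-py client: during a Multi-AZ failover the primary endpoint briefly refuses connections, so cache calls should back off and retry rather than fail the request. The endpoint hostname is a placeholder:

```python
import time

import redis  # assumes the redis-py client

# Always point at the replication group's primary endpoint, which ElastiCache
# repoints to the promoted replica during failover.
r = redis.Redis(host="meditrack-sessions.xxxxxx.use1.cache.amazonaws.com", port=6379)

def with_retry(op, attempts=5, base_delay=0.2):
    """Retry a cache operation across the brief window of a Multi-AZ failover."""
    for attempt in range(attempts):
        try:
            return op()
        except (redis.ConnectionError, redis.TimeoutError):
            if attempt == attempts - 1:
                raise  # surface the error once failover should have completed
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff

# Usage: session reads keep working through a ~30-60s failover window.
session = with_retry(lambda: r.get("session:abc123"))
```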