While preparing for the AWS SAP-C02, many candidates confuse High Availability with Disaster Recovery patterns. In the real world, this is fundamentally a decision about Acceptable Downtime vs. Infrastructure Cost. Let’s drill into a simulated scenario.
The Scenario #
MediTrack Solutions operates a patient appointment scheduling platform that processes real-time booking requests for 200+ medical clinics. The current architecture runs on:
- A single Amazon EC2 instance (t3.large) hosting the application tier
- An Amazon ElastiCache for Redis single-node cluster storing session state and appointment availability cache
- An Amazon RDS for MariaDB single-instance database containing patient records and scheduling data
The platform experienced a 4-hour outage last month when the EC2 instance failed due to underlying hardware issues. During this time, clinics could not accept appointments, resulting in revenue loss and regulatory compliance concerns. The CTO has mandated that the architecture must automatically recover from component failures with minimal downtime (target: < 5 minutes).
Key Requirements #
Design an architecture that enables automatic recovery from infrastructure failures across all three tiers (compute, cache, database) while minimizing manual intervention and downtime.
The Options #
Select THREE:
- A) Deploy an Elastic Load Balancer to distribute traffic across multiple EC2 instances; ensure instances are part of an Auto Scaling group with a minimum capacity of 2.
- B) Deploy an Elastic Load Balancer to distribute traffic across multiple EC2 instances; configure instances in “Unlimited” mode to handle burst traffic.
- C) Modify the database to create a read replica in the same Availability Zone; promote the read replica to primary during failures.
- D) Modify the database to use Multi-AZ deployment spanning two Availability Zones.
- E) Create a replication group for the ElastiCache for Redis cluster; configure the cluster to use an Auto Scaling group with minimum capacity of 2.
- F) Create a replication group for the ElastiCache for Redis cluster; enable Multi-AZ with automatic failover.
Correct Answer #
A, D, F
The Architect’s Analysis #
Step-by-Step Winning Logic #
This solution addresses all three architectural single points of failure using AWS-native high availability features:
- Option A (Compute Tier): An Auto Scaling group (ASG) with a minimum capacity of 2 ensures that if one EC2 instance fails, the ELB’s health checks stop routing traffic to it and the surviving instance keeps serving requests, typically within 60 seconds. The ASG then launches a replacement instance in the background to restore full capacity (instance launch plus warmup, usually a few minutes).
- Option D (Database Tier): RDS Multi-AZ provides synchronous replication to a standby instance in a different AZ. Failover is fully automatic (AWS manages the DNS CNAME update), with a typical failover time of 60-120 seconds. This is superior to manual read replica promotion.
- Option F (Cache Tier): ElastiCache Multi-AZ with automatic failover creates a replica node in a different AZ and automatically promotes it on primary node failure. Failover time: typically < 60 seconds. This is the only option that provides automatic recovery for the cache layer. A minimal provisioning sketch for all three options follows this list.
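As a rough illustration of how the three controls are wired together, here is a minimal boto3 sketch. Resource names, instance types, subnet IDs, the launch template, and the target group ARN are placeholders, and error handling is omitted; a real deployment would normally go through CloudFormation, CDK, or Terraform rather than ad-hoc API calls.

```python
import boto3

REGION = "us-east-1"  # assumption: single-region deployment

# --- Option A: compute tier in an ASG (min 2) behind an existing load balancer ---
autoscaling = boto3.client("autoscaling", region_name=REGION)
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="meditrack-app-asg",            # placeholder name
    LaunchTemplate={"LaunchTemplateName": "meditrack-app", "Version": "$Latest"},
    MinSize=2,                                           # survive a single-instance failure
    MaxSize=4,
    DesiredCapacity=2,
    VPCZoneIdentifier="subnet-aaa,subnet-bbb",           # two AZs (placeholder subnets)
    TargetGroupARNs=["arn:aws:elasticloadbalancing:placeholder:targetgroup/meditrack/abc"],
    HealthCheckType="ELB",                               # replace instances the ELB marks unhealthy
    HealthCheckGracePeriod=120,
)

# --- Option D: convert the existing RDS instance to Multi-AZ ---
rds = boto3.client("rds", region_name=REGION)
rds.modify_db_instance(
    DBInstanceIdentifier="meditrack-db",                 # placeholder identifier
    MultiAZ=True,
    ApplyImmediately=False,                              # apply in the next maintenance window
)

# --- Option F: Redis replication group with Multi-AZ automatic failover ---
# Note: the existing standalone node cannot simply gain replicas; in practice you
# create a replication group and repoint the application's cache endpoint.
elasticache = boto3.client("elasticache", region_name=REGION)
elasticache.create_replication_group(
    ReplicationGroupId="meditrack-sessions",
    ReplicationGroupDescription="Session/cache tier with automatic failover",
    Engine="redis",
    CacheNodeType="cache.t3.medium",
    NumCacheClusters=2,                                  # one primary + one replica in separate AZs
    AutomaticFailoverEnabled=True,
    MultiAZEnabled=True,
)
```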
The Trade-off: You’re paying ~2x infrastructure cost (2+ EC2 instances, Multi-AZ database premium, cache replica) but gaining automated recovery across all tiers without custom scripting or operational runbooks.
The Traps (Distractor Analysis) #
- Why not Option B? “Unlimited” mode lets T-series instances sustain CPU bursts above their baseline (billing for surplus credits), but it has zero impact on availability. This is a performance/cost feature, not a resilience pattern, and it does nothing to address instance failure.
- Why not Option C? RDS read replicas require manual promotion to become the primary, while the question explicitly requires “automatic recovery.” Same-AZ placement also means no protection against AZ-level failures, and failover would take 10+ minutes (manual intervention plus the promotion process); the sketch below shows what that manual step looks like.
- Why not Option E? ElastiCache does not manage its nodes with Auto Scaling groups. This option is architecturally incorrect: cache resilience is built with replication groups, not ASGs. It is a distractor designed to confuse EC2 Auto Scaling patterns with cache resilience patterns.
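For contrast, this is roughly what the recovery path in Option C would involve: someone (or custom automation you would have to build and maintain) must detect the failure, call the promotion API, and repoint the application. The identifier below is a placeholder; the point is simply that the step is not automatic.

```python
import boto3

rds = boto3.client("rds", region_name="us-east-1")

# Option C's recovery path: an operator (or custom tooling) has to notice the
# outage and explicitly promote the replica; there is no managed failover.
rds.promote_read_replica(
    DBInstanceIdentifier="meditrack-db-replica",  # placeholder replica identifier
)

# Promotion takes minutes, and the application must then be repointed to the
# promoted instance's endpoint, which is why Multi-AZ (Option D) wins here.
```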
The Architect Blueprint #
Diagram Note: Traffic flows through ELB to multiple EC2 instances across two AZs. RDS Multi-AZ synchronously replicates to standby (automatic failover), while ElastiCache replicates asynchronously with automatic promotion on primary failure. Auto Scaling maintains minimum capacity.
The Decision Matrix #
| Option | Est. Complexity | Est. Monthly Cost | Pros | Cons |
|---|---|---|---|---|
| A (ASG + ELB) | Medium | $180-$250/mo (2x t3.large @ ~$60/mo each + ALB @ ~$25/mo + data transfer) | ✅ Automatic instance replacement ✅ Health check-driven recovery ✅ Horizontal scaling capability ✅ ~60s recovery time | ⚠️ Requires stateless application design ⚠️ 2x compute cost vs. single instance ⚠️ ELB adds latency (~1-5ms) |
| B (Unlimited Mode) | Low | $0 incremental (pricing model change only) | ✅ No architectural changes ✅ Predictable billing for burst workloads | ❌ Zero availability improvement ❌ Does not address failure recovery ❌ Potential cost overruns if burst exceeds expectations |
| C (Read Replica - Same AZ) | High | $85-$100/mo (replica @ ~$85/mo for db.t3.medium) | ✅ Offloads read traffic ✅ Can be promoted manually | ❌ Manual failover (10-15 min) ❌ Same AZ = no AZ failure protection ❌ Promotion causes DNS change delay ❌ Async replication = potential data loss |
| D (RDS Multi-AZ) | Low | $170-$190/mo (2x db.t3.medium @ ~$85/mo base; Multi-AZ adds ~100% premium) | ✅ Automatic failover (60-120s) ✅ Synchronous replication (no data loss) ✅ AWS-managed (zero operational overhead) ✅ AZ-level failure protection | ⚠️ 2x database cost ⚠️ Standby not usable for reads ⚠️ Brief connection interruption during failover |
| E (ElastiCache ASG) | N/A | N/A | None | ❌ Architecturally invalid ❌ ElastiCache nodes are not managed by ASGs ❌ Confuses EC2 Auto Scaling with cache resilience patterns |
| F (ElastiCache Multi-AZ) | Low | $80-$110/mo (2x cache.t3.medium @ ~$40/mo each + cross-AZ data transfer ~$5-10/mo) | ✅ Automatic failover (~30-60s) ✅ AZ-level failure protection ✅ Minimal application changes (connection string update) ✅ Preserves session state during failover | ⚠️ 2x cache cost ⚠️ Async replication (sub-second lag, potential minimal data loss) ⚠️ Requires at least one replica per shard (migration from the current single node) |
Total Cost Analysis:
- Original Architecture: ~$215/mo (1x EC2 t3.large @ $60 + 1x RDS @ $85 + 1x ElastiCache @ $40 + misc. $30)
- Correct Solution (A+D+F): ~$430-$550/mo (summing the A, D, and F estimates above)
- Cost Premium: ~2.0-2.6x for automated sub-2-minute recovery across all tiers (see the short cost model below)
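The arithmetic behind those numbers, for anyone who wants to adjust the assumptions (all figures are rough on-demand estimates from the matrix above, not quotes):

```python
# Rough cost model using the estimates above; it only illustrates the ~2x premium.
original = {"ec2_t3_large": 60, "rds_single": 85, "elasticache_single": 40, "misc": 30}

ha = {
    "ec2_asg_plus_alb": (180, 250),     # 2x t3.large + ALB + data transfer
    "rds_multi_az": (170, 190),         # ~100% premium over a single instance
    "elasticache_multi_az": (80, 110),  # primary + replica + cross-AZ transfer
}

baseline = sum(original.values())               # ~$215/mo
low = sum(lo for lo, _ in ha.values())          # ~$430/mo
high = sum(hi for _, hi in ha.values())         # ~$550/mo

print(f"baseline ${baseline}/mo, HA ${low}-${high}/mo, "
      f"premium {low / baseline:.1f}x-{high / baseline:.1f}x")
```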
Real-World Practitioner Insight #
Exam Rule #
For the SAP-C02 exam, when you see:
- “Automatic recovery” → Look for Multi-AZ (RDS, ElastiCache) or Auto Scaling
- “Minimal downtime” → Prioritize AWS-managed failover over manual orchestration
- “All components must be available” → Address every single point of failure in the stack
Exam Pattern Recognition: AWS Professional exams favor Multi-AZ over read replicas for HA scenarios because Multi-AZ is:
- Automatic (no operational burden)
- Synchronous (RDS) or asynchronous with sub-second lag (ElastiCache)
- Integrated with AWS service health checks
Real World #
In production environments, we would likely enhance this further:
- Multi-Region Active-Passive: For critical healthcare workloads (HIPAA/compliance), we’d add the following (a Route 53 failover sketch appears after this list):
- Cross-Region Read Replicas for RDS (async, but regional disaster recovery)
- Global Datastore for ElastiCache (cross-region replication)
- Route 53 health checks with failover routing
- Observability Layer: The exam doesn’t mention monitoring, but we’d add:
- CloudWatch Alarms on Auto Scaling health checks
- RDS Enhanced Monitoring for failover detection
- ElastiCache metrics for replication lag
- Cost Optimization: For a 200-clinic system with predictable traffic:
- Use Reserved Instances (1-year commitment) → ~40% cost reduction
- Implement ElastiCache Reserved Nodes → Additional 30-50% savings
- Consider Graviton2 instances (t4g family) → 20% better price/performance
- Testing Discipline: Unlike the exam scenario, we’d mandate:
- Monthly chaos engineering (terminate instances/trigger failovers)
- GameDay exercises with RTO/RPO validation
- Automated failover testing in staging environments
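To make the multi-region point concrete, here is a hedged sketch of the Route 53 failover-routing piece. Hostnames, the hosted zone ID, and the regional endpoints are placeholders; the health check watches the primary region, and DNS answers shift to the DR endpoint only while that check is failing.

```python
import uuid
import boto3

route53 = boto3.client("route53")

# Health check against the primary region's load balancer (placeholder hostname/path).
hc = route53.create_health_check(
    CallerReference=str(uuid.uuid4()),
    HealthCheckConfig={
        "Type": "HTTPS",
        "FullyQualifiedDomainName": "primary-alb.us-east-1.elb.amazonaws.com",
        "ResourcePath": "/health",
        "RequestInterval": 30,
        "FailureThreshold": 3,
    },
)

# PRIMARY/SECONDARY failover records: traffic moves to the DR region's endpoint
# only when the primary's health check reports unhealthy.
route53.change_resource_record_sets(
    HostedZoneId="ZEXAMPLE123",  # placeholder hosted zone
    ChangeBatch={
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "app.meditrack.example.com",
                    "Type": "CNAME",
                    "SetIdentifier": "primary",
                    "Failover": "PRIMARY",
                    "TTL": 60,
                    "ResourceRecords": [{"Value": "primary-alb.us-east-1.elb.amazonaws.com"}],
                    "HealthCheckId": hc["HealthCheck"]["Id"],
                },
            },
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "app.meditrack.example.com",
                    "Type": "CNAME",
                    "SetIdentifier": "secondary",
                    "Failover": "SECONDARY",
                    "TTL": 60,
                    "ResourceRecords": [{"Value": "dr-alb.us-west-2.elb.amazonaws.com"}],
                },
            },
        ]
    },
)
```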
The Hidden Constraint: The exam assumes the application is stateless (session stored in Redis). In reality, we’d also verify:
- Database connection pooling handles DNS updates during RDS failover
- Application retry logic for transient ElastiCache failover errors (a minimal sketch follows this list)
- ELB deregistration delay matches application graceful shutdown time
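As one example of that retry logic, here is a minimal sketch assuming the redis-py client and an ElastiCache primary endpoint (the hostname and cache key are placeholders). It retries with exponential backoff and a fresh connection across the brief window in which the old primary is gone and the replica has not yet been promoted.

```python
import time

import redis  # redis-py; assumes the application already depends on it

# ElastiCache replication group primary endpoint (placeholder hostname).
POOL = redis.ConnectionPool(
    host="meditrack-sessions.xxxxxx.ng.0001.use1.cache.amazonaws.com",
    port=6379,
    socket_timeout=2,
    socket_connect_timeout=2,
)


def with_failover_retry(operation, attempts=5, base_delay=0.5):
    """Retry a Redis operation across a Multi-AZ failover window.

    Connection errors, timeouts, and read-only errors are treated as transient:
    the primary endpoint's DNS is repointed at the promoted replica, so retrying
    with backoff on a fresh connection is usually enough.
    """
    for attempt in range(attempts):
        try:
            client = redis.Redis(connection_pool=POOL)
            return operation(client)
        except (redis.exceptions.ConnectionError,
                redis.exceptions.TimeoutError,
                redis.exceptions.ReadOnlyError):
            if attempt == attempts - 1:
                raise
            POOL.disconnect()                      # drop sockets to the dead primary
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff


if __name__ == "__main__":
    # Usage: read the availability cache for a clinic (hypothetical key).
    slots = with_failover_retry(lambda r: r.get("clinic:123:availability"))
    print(slots)
```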