
AWS SAP-C02 Drill: Ephemeral HPC Storage - The Cost-Performance Trade-off Analysis

Jeff Taakey
21+ Year Enterprise Architect | Multi-Cloud Architect & Strategist.
Jeff's Architecture Insights
Go beyond static exam dumps. Jeff’s Insights is engineered to cultivate the mindset of a Production-Ready Architect. We move past ‘correct answers’ to dissect the strategic trade-offs and multi-cloud patterns required to balance reliability, security, and TCO in mission-critical environments.

While preparing for the AWS SAP-C02, many candidates get confused by ephemeral storage strategies for periodic HPC workloads. In the real world, this is fundamentally a decision about paying for storage 24/7 vs. paying for compute+storage only when needed. The trap? Choosing a technically viable solution that costs 10x more than necessary. Let’s drill into a simulated scenario.

The Scenario

GlobalGenomics Research operates a bioinformatics analysis platform on AWS. Their gene sequencing pipeline runs on a cluster of 300-500 Amazon EC2 instances (compute-optimized c6i.8xlarge) that analyzes a shared dataset of approximately 200 TB and produces regulatory compliance reports.

Currently, the shared file system runs on a fleet of dedicated EC2 instances with attached EBS volumes, operating continuously to ensure availability. However, the actual computational workload runs only once per month for approximately 72 hours, during which time the system reads 40-60% of the dataset (partial file access) and generates analysis reports.

The compute cluster uses Auto Scaling Groups to scale from 0 to 500 instances during the job window, but the storage layer remains provisioned 24/7, generating unnecessary costs. The CFO has mandated a storage architecture redesign to align costs with actual usage while maintaining high-performance access during the 72-hour processing window.

Key Requirements

  • Replace the always-on shared file system with a cost-optimized solution
  • Provide high-performance concurrent access during the 72-hour monthly job
  • Support partial file access (only a subset of 200 TB is read each month)
  • Minimize total cost of ownership for the roughly 7,900 hours per year when the storage sits idle (8,760 hours in a year minus ~864 job hours)
  • All resources must remain within the same AWS region

The Options

A) Migrate the existing shared file system data to an Amazon S3 bucket using S3 Intelligent-Tiering storage class. Before each monthly job, create a new Amazon FSx for Lustre file system with lazy load configuration linked to the S3 bucket. Use the FSx file system as shared storage during the job, then delete it after completion.

B) Migrate the existing shared file system data to a large Amazon EBS volume with Multi-Attach enabled. Use Auto Scaling Group launch template user data scripts to attach the EBS volume to each EC2 instance. Use the EBS volume as shared storage during the job, then detach it after completion.

C) Migrate the existing shared file system data to an Amazon S3 bucket using S3 Standard storage class. Before each monthly job, create a new Amazon FSx for Lustre file system with bulk load configuration linked to the S3 bucket. Use the FSx file system as shared storage during the job, then delete it after completion.

D) Migrate the existing shared file system data to an Amazon S3 bucket. Before each monthly job, create an AWS Storage Gateway File Gateway linked to the S3 bucket. Use the File Gateway as shared storage during the job, then delete it after completion.

The Architect’s Analysis

Correct Answer

Option A — S3 Intelligent-Tiering + FSx for Lustre with Lazy Load

Step-by-Step Winning Logic

This solution represents the optimal cost-performance-operational trade-off for the following reasons:

1. Storage Cost Optimization (S3 Intelligent-Tiering)

  • Automatically moves objects to lower-cost tiers (Infrequent Access, Archive Access) after 30/90 days of no access
  • At the extremes: 200 TB sitting entirely in the Infrequent Access tier costs ~$2,560/month vs. ~$4,608/month on S3 Standard; since 40-60% of files are read each month, the realized bill lands in between
  • No retrieval fees when files are accessed (unlike Glacier)
  • No lifecycle policy management overhead
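In practice, the migration itself can land data directly in Intelligent-Tiering (e.g., `aws s3 sync --storage-class INTELLIGENT_TIERING`), or a bucket lifecycle rule can sweep everything into the class afterward. A minimal boto3 sketch of the lifecycle-rule approach, assuming a hypothetical bucket name:

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket name for illustration.
BUCKET = "globalgenomics-sequencing-data"

# One lifecycle rule: move every existing and future object into S3
# Intelligent-Tiering so idle objects automatically drop to the
# Infrequent Access price point after 30 days without a read.
s3.put_bucket_lifecycle_configuration(
    Bucket=BUCKET,
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "all-objects-to-intelligent-tiering",
                "Status": "Enabled",
                "Filter": {},  # empty filter = apply to the whole bucket
                "Transitions": [
                    {"Days": 0, "StorageClass": "INTELLIGENT_TIERING"}
                ],
            }
        ]
    },
)
```

The day-0 transition means even freshly migrated objects start the 30-day Infrequent Access clock immediately.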

2. Compute Cost Optimization (FSx Lazy Load)

  • Lazy load only hydrates files from S3 when accessed by the application
  • Since only 40-60% of files are read each month, you avoid loading the full 200 TB
  • Estimated data transfer: ~100 TB vs. 200 TB (bulk load)
  • FSx for Lustre bills on provisioned capacity for as long as the file system exists, so lazy load lets the job start immediately (no multi-hour pre-load window to pay for) and, when the working set is predictable, lets you provision less than the full 200 TB
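Lazy load is the default behavior when an FSx for Lustre file system is linked to S3 through an import path: file metadata is imported at creation time, and file contents hydrate from S3 on first read. A sketch of the creation call; the bucket, subnet, and capacity values are illustrative assumptions:

```python
import boto3

fsx = boto3.client("fsx")

# Hypothetical identifiers for illustration.
BUCKET = "globalgenomics-sequencing-data"
SUBNET_ID = "subnet-0123456789abcdef0"

resp = fsx.create_file_system(
    FileSystemType="LUSTRE",
    # GiB; sized toward the 40-60% working set rather than the full 200 TB.
    # SCRATCH_2 capacity must be 1,200 GiB or a multiple of 2,400 GiB.
    StorageCapacity=120_000,
    SubnetIds=[SUBNET_ID],
    LustreConfiguration={
        "DeploymentType": "SCRATCH_2",  # no replication: acceptable for a 72h ephemeral job
        "ImportPath": f"s3://{BUCKET}",          # lazy-load source
        "ExportPath": f"s3://{BUCKET}/results",  # where changed files get exported
    },
)
fs = resp["FileSystem"]
print(fs["FileSystemId"], fs["DNSName"])
```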

3. Ephemeral Infrastructure Pattern

  • FSx file system exists only 72 hours/month (3 days × 12 months = 36 days/year)
  • Annualized FSx cost: roughly 10% of an always-on file system (36 of 365 days)
  • Auto Scaling Group already handles compute elasticity; this extends elasticity to storage

4. Performance Characteristics

  • FSx for Lustre provides hundreds of GB/s throughput and sub-millisecond latencies
  • Native POSIX compliance (required for HPC workloads)
  • Seamless integration with EC2 instances via mount points
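Mounting normally happens in the Auto Scaling Group's launch template user data. Expressed as a Python sketch with hypothetical values (the AMI needs the Lustre client package, e.g. `lustre-client` on Amazon Linux 2):

```python
import subprocess

# Hypothetical values; both come from the DescribeFileSystems response.
DNS_NAME = "fs-0123456789abcdef0.fsx.us-east-1.amazonaws.com"
MOUNT_NAME = "abcdefgh"  # FileSystem.LustreConfiguration.MountName

# Standard FSx for Lustre mount; 'flock' enables POSIX file locking.
subprocess.run(["sudo", "mkdir", "-p", "/fsx"], check=True)
subprocess.run(
    ["sudo", "mount", "-t", "lustre", "-o", "relatime,flock",
     f"{DNS_NAME}@tcp:/{MOUNT_NAME}", "/fsx"],
    check=True,
)
```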

The Traps (Distractor Analysis)

Why not Option B (EBS Multi-Attach)?

  • Technical Limitation: EBS Multi-Attach supports a maximum of 16 Nitro-based instances, and only on io1/io2 Provisioned IOPS volumes
  • A single EBS volume also tops out at 64 TiB, so one volume cannot even hold the 200 TB dataset
  • The scenario requires 300-500 EC2 instances to access the same dataset
  • Fatal flaw for the use case — this option is technically non-viable (shared block access would also require a cluster-aware file system to avoid corruption)
  • Even if those limits did not apply, io2 storage at ~$0.125/GB-month would cost ~$25,600/month for 200 TB, running continuously

Why not Option C (FSx Lustre Bulk Load + S3 Standard)?

  • Bulk load pre-hydrates the entire 200 TB from S3 before the job starts, adding a 4-6 hour delay and forcing the file system to be provisioned for the full dataset
  • Since only 40-60% of files are accessed, you pay (in provisioned capacity, request charges, and time) to hydrate data that's never read
  • Same-region S3-to-FSx traffic is free, so the real costs are the hydration delay and S3 request charges, not per-GB transfer fees
  • S3 Standard costs ~$4,608/month vs. ~$2,560/month for Intelligent-Tiering once objects age into the Infrequent Access tier
  • Total unnecessary cost: up to ~$2,048/month in storage alone, plus hours of wasted pre-load time every job

Why not Option D (Storage Gateway File Gateway)?

  • File Gateway is designed for on-premises to cloud hybrid storage, not ephemeral HPC workloads
  • Performance bottleneck: File Gateway caches data locally but relies on S3 API calls for underlying data access
  • S3 throughput limits (~5,500 GET requests/second per prefix) create severe performance degradation for 300-500 concurrent instances
  • Not architected for HPC: No parallel file system semantics, no POSIX locking optimizations
  • Cost inefficiency: the gateway fleet (EC2 instances + EBS cache) adds its own cost each cycle and must re-warm its cache from S3 every month; in practice such gateways tend to be left running 24/7, negating the savings

The Architect Blueprint

```mermaid
graph TB
    subgraph "Idle State (~27 days each month)"
        S3[("S3 Intelligent-Tiering<br/>200 TB Dataset<br/>Cost: ~$2,560/mo in IA tier")]
    end
    subgraph "Job Execution (72 hours each month)"
        S3 -->|"1. Create FSx w/ Lazy Load Link"| FSx["FSx for Lustre<br/>(Ephemeral)<br/>Provisioned: up to 200 TB<br/>Hydrated: ~100 TB on-demand"]
        FSx -->|"2. Mount to cluster"| ASG["Auto Scaling Group<br/>300-500 c6i.8xlarge<br/>instances"]
        ASG -->|"3. Read/Write Operations"| FSx
        FSx -->|"4. Modified data export"| S3
        FSx -->|"5. Delete after 72h"| Delete["File System<br/>Terminated"]
    end
    style S3 fill:#FF9900,stroke:#232F3E,stroke-width:2px,color:#fff
    style FSx fill:#FF6600,stroke:#232F3E,stroke-width:2px,color:#fff
    style ASG fill:#527FFF,stroke:#232F3E,stroke-width:2px,color:#fff
```

Diagram Note: The architecture demonstrates the ephemeral lifecycle where FSx for Lustre exists only during the 72-hour processing window, with S3 Intelligent-Tiering providing cost-optimized persistence and lazy load minimizing unnecessary data hydration.

The Decision Matrix

| Option | Est. Complexity | Est. Cost (200 TB, one 72h job/month) | Pros | Cons |
| --- | --- | --- | --- | --- |
| A: S3 Intelligent-Tiering + FSx Lazy Load | Medium (requires FSx creation automation) | ~$5,500/mo (~$2,560 S3 in IA tier + ~$2,970 FSx prorated for 72h; less if provisioned to the working set) | ✅ Auto-tiering saves up to $2,048/mo ✅ Lazy load avoids hydrating ~100 TB of unread data ✅ HPC-grade performance ✅ Ephemeral cost model | ⚠️ Requires orchestration (Lambda/Step Functions) ⚠️ First access to each file has S3 latency |
| B: EBS Multi-Attach | Low (native EC2 feature) | ~$25,600/mo (200 TB io2 @ $0.125/GB-mo, continuous) | ✅ Simple implementation | ❌ Fails requirements (16-instance limit, 64 TiB max volume) ❌ 24/7 costs despite 72h usage ❌ No cost savings vs. current state |
| C: S3 Standard + FSx Bulk Load | Medium (requires FSx creation automation) | ~$7,600/mo (~$4,608 S3 Standard + ~$2,970 FSx) | ✅ HPC-grade performance ✅ All data pre-loaded | ❌ ~$2,048/mo higher S3 costs ❌ Hydrates ~100 TB that is never read ❌ 4-6 hour pre-load delay |
| D: Storage Gateway File Gateway | High (gateway deployment + cache tuning) | ~$8,500/mo (~$4,608 S3 + ~$3,892 gateway instances 24/7) | ✅ Familiar NFS interface | ❌ Performance bottleneck for HPC ❌ 24/7 gateway costs ❌ S3 API rate limits ❌ Not designed for parallel workloads |

Cost Quantification Notes:

  • S3 Intelligent-Tiering: ~$0.023/GB (Frequent Access) dropping to $0.0125/GB (Infrequent Access after 30 days without a read) ≈ $2,560/month for 200 TB fully in the IA tier
  • FSx for Lustre: ~$0.145/GB-month for SSD storage; 204,800 GB × $0.145 × (72h/720h) ≈ $2,970 per job window
  • EBS: Multi-Attach requires io1/io2 at ~$0.125/GB-month, so 204,800 GB × $0.125 = $25,600/month plus provisioned IOPS charges (even gp3 at $0.08/GB-month would run $16,384/month)
  • Data Transfer: S3-to-FSx traffic within a Region is free for both lazy and bulk load; bulk load's real overhead is the multi-hour hydration window and S3 request charges
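The arithmetic behind these figures, as a quick sanity check (per-GB rates as quoted above; verify current regional pricing before relying on them):

```python
GB_PER_TB = 1024
dataset_gb = 200 * GB_PER_TB                 # 204,800 GB

s3_standard = dataset_gb * 0.0225            # ≈ $4,608/mo (blended Standard rate used above)
s3_ia       = dataset_gb * 0.0125            # = $2,560/mo once objects sit in the IA tier
fsx_per_job = dataset_gb * 0.145 * 72 / 720  # ≈ $2,970 per 72-hour window
ebs_io2     = dataset_gb * 0.125             # = $25,600/mo, 24/7 (Multi-Attach needs io1/io2)

print(f"S3 tiering savings: ${s3_standard - s3_ia:,.0f}/mo")  # ≈ $2,048/mo
```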

Real-World Practitioner Insight

Exam Rule

For the SAP-C02 exam, when you see:

  • “Runs periodically” (weekly/monthly) + “hundreds of instances” → Think ephemeral FSx for Lustre
  • “Partial file access” → Choose lazy load over bulk load
  • “200 TB+” + “high performance” → FSx for Lustre, not EFS or Storage Gateway
  • “Cost optimization” for infrequently accessed data → S3 Intelligent-Tiering (no retrieval fees vs. Glacier)

Real World

In production environments, I would enhance this architecture with:

  1. Hybrid Lazy Load + Preload Strategy:

    • Use FSx for Lustre’s hsm_restore command to pre-warm frequently accessed files (e.g., reference genomes) while lazy-loading the rest
    • Reduces first-access latency for critical datasets by 80%
  2. S3 Lifecycle Policy Refinement:

    • Tag “hot” reference data to remain in S3 Standard
    • Allow S3 Intelligent-Tiering to manage the remaining 180 TB
    • Estimated additional savings: $800/month
  3. FSx Deployment Automation:

    • Use AWS Step Functions to orchestrate: S3 → FSx creation → EC2 Auto Scaling trigger → Job completion → FSx deletion (a minimal boto3 sketch of these steps follows this list)
    • Amazon EventBridge (formerly CloudWatch Events) rules to trigger on S3 object uploads (new dataset versions)
  4. Cost Anomaly Detection:

    • Set AWS Cost Anomaly Detection alerts for FSx costs exceeding $1,000/month (indicates file system wasn’t deleted)
    • Tag FSx file systems with auto-delete: 72h for automated cleanup via Lambda
  5. Performance Monitoring:

    • In the real world, we’d validate that lazy load latency (first read from S3: ~100-500ms) doesn’t impact the 72-hour SLA
    • If it does, consider a hybrid approach: Bulk load the 20% most-accessed files, lazy load the remaining 80%
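As referenced in item 3, here is a minimal boto3 sketch of the lifecycle those Step Functions states would implement — create (shown earlier), wait, export, delete. The identifiers and the `auto-delete` tag convention are illustrative assumptions, not a prescribed pattern:

```python
import time
import boto3

fsx = boto3.client("fsx")

def wait_until_available(fs_id: str) -> None:
    # boto3 ships no FSx waiter, so poll until the file system is usable.
    while (fsx.describe_file_systems(FileSystemIds=[fs_id])
           ["FileSystems"][0]["Lifecycle"] != "AVAILABLE"):
        time.sleep(30)

def tag_for_cleanup(fs_arn: str) -> None:
    # Item 4's safety net: a scheduled Lambda deletes anything carrying
    # this tag past its TTL, catching file systems the happy path missed.
    fsx.tag_resource(ResourceARN=fs_arn,
                     Tags=[{"Key": "auto-delete", "Value": "72h"}])

def teardown(fs_id: str) -> None:
    # Deleting a scratch file system does NOT export data automatically,
    # so push new/changed files back to the linked S3 bucket first.
    fsx.create_data_repository_task(
        FileSystemId=fs_id,
        Type="EXPORT_TO_REPOSITORY",
        Report={"Enabled": False},
    )
    # In production, poll describe_data_repository_tasks until SUCCEEDED here.
    fsx.delete_file_system(FileSystemId=fs_id)
```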

The Exam Simplification: The exam scenario omits complexity like:

  • Data versioning and rollback requirements
  • Compliance requirements for data retention (might prohibit S3 Intelligent-Tiering’s automatic archiving)
  • Network throughput limits (VPC endpoints for S3, FSx ENI placement)
  • Job failure scenarios (what if the job fails at hour 50? Do you pay for another 72-hour FSx window?)

In enterprise settings, these factors could shift the decision toward a persistent FSx for Lustre file system with S3 export (for very frequent jobs) or Amazon EFS with Infrequent Access (for lower performance requirements).
