Design HA architectures and DR strategies. Understand RTO, RPO, and the four DR approaches (backup/restore, pilot light, warm standby, multi-site).
High Availability and Disaster Recovery
High Availability (HA) means your application keeps running when parts fail. Disaster Recovery (DR) means you can restore your application when a major failure occurs. Every Solutions Architect must design for both.
Teacher Note: HA is like having a backup power generator in your building — when the main power fails, the generator kicks in automatically within seconds and nobody notices. DR is like having a second building in another city — if the main building burns down, you move operations to the backup site.
Recovery Objectives — Define Your Tolerance
| Metric | Definition | Example |
| --- | --- | --- |
| RTO (Recovery Time Objective) | Maximum acceptable downtime | We can tolerate 1 hour of downtime |
| RPO (Recovery Point Objective) | Maximum acceptable data loss | We can lose at most 15 minutes of transactions |
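To make the two objectives concrete, here is a minimal Python sketch that checks a simple backup-based plan against RTO and RPO targets. The function name `meets_objectives` and the plan figures are illustrative assumptions, not part of any AWS API. It assumes worst-case data loss equals one full backup interval and worst-case downtime equals the restore time.

```python
from datetime import timedelta

def meets_objectives(rto, rpo, backup_interval, restore_time):
    """Return (rto_ok, rpo_ok) for a simple backup-based plan.

    Assumption: worst-case data loss is one full backup interval
    (failure just before the next backup), and worst-case downtime
    is the time needed to restore.
    """
    rto_ok = restore_time <= rto
    rpo_ok = backup_interval <= rpo
    return rto_ok, rpo_ok

# Targets from the examples above: 1 hour RTO, 15 minutes RPO.
rto = timedelta(hours=1)
rpo = timedelta(minutes=15)

# Hypothetical plan: hourly backups and a 45-minute restore.
result = meets_objectives(
    rto, rpo,
    backup_interval=timedelta(hours=1),
    restore_time=timedelta(minutes=45),
)
print(result)  # (True, False): downtime is within target, data loss is not
```

In this hypothetical plan the restore is fast enough, but hourly backups cannot satisfy a 15-minute RPO, so the backup frequency (or the strategy) would have to change.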
Four DR Strategies — Cheapest to Most Expensive
| Strategy | RTO | RPO | Cost | How It Works |
| --- | --- | --- | --- | --- |
| Backup and Restore | Hours to days | Hours | Lowest | Regular snapshots to S3, restore when needed |
| Pilot Light | 10-30 minutes | Minutes | Low | Core services always running, scale up on failure |
| Warm Standby | Minutes | Seconds | Medium | Scaled-down but fully running copy in another region |
| Multi-Site Active-Active | Near zero | Near zero | Highest | Full capacity running in multiple regions simultaneously |
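Choosing a strategy amounts to walking the list from cheapest to most expensive and stopping at the first one that can meet both targets. The sketch below illustrates that idea in Python. The numeric floors (the best RTO and RPO each strategy is assumed to achieve) are rough readings of the ranges above, chosen for illustration only; they are not AWS guarantees, and `cheapest_strategy` is a hypothetical helper.

```python
from datetime import timedelta

# (name, assumed best achievable RTO, assumed best achievable RPO),
# ordered cheapest first. The floors are illustrative assumptions.
STRATEGIES = [
    ("Backup and Restore", timedelta(hours=1), timedelta(hours=1)),
    ("Pilot Light", timedelta(minutes=10), timedelta(minutes=1)),
    ("Warm Standby", timedelta(minutes=1), timedelta(seconds=1)),
    ("Multi-Site Active-Active", timedelta(0), timedelta(0)),
]

def cheapest_strategy(rto, rpo):
    """Return the first (cheapest) strategy whose floors fit the targets."""
    if rto < timedelta(0) or rpo < timedelta(0):
        raise ValueError("RTO and RPO must be non-negative")
    for name, min_rto, min_rpo in STRATEGIES:
        if min_rto <= rto and min_rpo <= rpo:
            return name
    # Not reached for valid input: the last strategy has zero floors.
    raise RuntimeError("no strategy matched")

print(cheapest_strategy(timedelta(hours=4), timedelta(hours=1)))
# Backup and Restore
print(cheapest_strategy(timedelta(hours=1), timedelta(minutes=15)))
# Pilot Light
print(cheapest_strategy(timedelta(minutes=2), timedelta(0)))
# Multi-Site Active-Active
```

The ordering of the list encodes cost, so the function never recommends a more expensive strategy than the targets require.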
High Availability Patterns
Multi-AZ RDS: automatic failover to a standby in another AZ of the same Region, typically in 60-120 seconds. Synchronous replication gives RPO = 0, and RTO is about 2 minutes
ALB + ASG across 3 AZs: instances in us-east-1a, 1b, 1c — survive any AZ failure
S3 with Versioning: recover any previous version of any object
Route 53 Failover: automatically route traffic to DR site when primary health check fails
Aurora Global Database: cross-Region replication (5 Regions here) with typical lag under 1 second; a secondary Region can be promoted in under 1 minute
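Spreading instances across AZs only helps if the surviving zones can carry the full load. As a rough sizing sketch for the ALB + ASG pattern, the Python below computes how many instances each AZ needs so that enough remain after an AZ is lost. It assumes load is spread evenly and that full capacity must be available immediately after the failure; `instances_per_az` is a hypothetical helper, not an AWS API.

```python
import math

def instances_per_az(required, az_count, az_failures_tolerated=1):
    """Instances to run in each AZ so that `required` instances
    remain after losing `az_failures_tolerated` AZs."""
    surviving = az_count - az_failures_tolerated
    if surviving < 1:
        raise ValueError("at least one AZ must survive")
    return math.ceil(required / surviving)

# Hypothetical workload: 6 instances needed at peak, 3 AZs.
per_az = instances_per_az(required=6, az_count=3)
print(per_az, per_az * 3)  # 3 9
```

Under these assumptions, a workload that needs 6 instances runs 9 in total (3 per AZ), so losing any one AZ still leaves 6. Running exactly 6 (2 per AZ) would leave only 4 until Auto Scaling replaces the lost capacity.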
Exam Tip: The exam loves asking for the RIGHT DR strategy for a given RTO/RPO. Pick the cheapest strategy that still meets both targets: 'not willing to lose any data' = multi-site active-active (warm standby only if a few seconds of loss is acceptable). 'Can tolerate 4 hours downtime' = backup and restore. Read the requirements carefully!
1. A company's RPO is 1 hour and RTO is 4 hours. Which is the MOST cost-effective DR strategy?
💡 Backup and Restore has RTO of hours and RPO based on backup frequency. It is the cheapest strategy and matches RTO of 4 hours and RPO of 1 hour (hourly snapshots).
2. A banking application cannot afford to lose ANY data and must recover within 2 minutes of any failure. Which approach meets this requirement?
💡 Multi-AZ RDS uses synchronous replication (RPO=0, no data loss) and automatic failover in 60-120 seconds (RTO under 2 minutes).