
High Availability and Disaster Recovery

Design HA architectures and DR strategies. Understand RTO, RPO, and the four DR approaches (backup/restore, pilot light, warm standby, multi-site).

High Availability (HA) means your application keeps running when parts fail. Disaster Recovery (DR) means you can restore your application when a major failure occurs. Every Solutions Architect must design for both.

Teacher Note: HA is like having a backup power generator in your building — when the main power fails, the generator kicks in automatically within seconds and nobody notices. DR is like having a second building in another city — if the main building burns down, you move operations to the backup site.

Recovery Objectives — Define Your Tolerance

| Metric | Definition | Example |
|---|---|---|
| RTO (Recovery Time Objective) | Maximum acceptable downtime | We can tolerate 1 hour of downtime |
| RPO (Recovery Point Objective) | Maximum acceptable data loss | We can lose at most 15 minutes of transactions |
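
To make the two metrics concrete, here is a minimal sketch that checks a recovery plan against an RTO and RPO. The function names, and the assumption that worst-case data loss equals the backup interval plus the time a backup takes to complete, are illustrative and not part of any AWS API.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta,
                         backup_duration: timedelta = timedelta(0)) -> timedelta:
    """Assumed worst case: the failure hits just before the next backup completes,
    so everything since the start of the last completed backup is lost."""
    return backup_interval + backup_duration


def meets_objectives(rto: timedelta,
                     rpo: timedelta,
                     estimated_recovery_time: timedelta,
                     backup_interval: timedelta,
                     backup_duration: timedelta = timedelta(0)) -> bool:
    """Return True when the plan satisfies both the RTO and the RPO."""
    downtime_ok = estimated_recovery_time <= rto
    data_loss_ok = worst_case_data_loss(backup_interval, backup_duration) <= rpo
    return downtime_ok and data_loss_ok


# Objectives from the table: 1 hour of downtime, 15 minutes of data loss.
rto = timedelta(hours=1)
rpo = timedelta(minutes=15)

# Hypothetical plan A: backups every 15 minutes, 45 minutes to restore.
plan_a_ok = meets_objectives(rto, rpo, timedelta(minutes=45), timedelta(minutes=15))

# Hypothetical plan B: hourly backups, same restore time. This violates the RPO.
plan_b_ok = meets_objectives(rto, rpo, timedelta(minutes=45), timedelta(hours=1))
```

The two checks are independent: a plan can restore quickly and still fail because its backups are too far apart, which is the case for plan B.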

Four DR Strategies — Cheapest to Most Expensive

| Strategy | RTO | RPO | Cost | How It Works |
|---|---|---|---|---|
| Backup and Restore | Hours to days | Hours | Lowest | Regular snapshots to S3, restore when needed |
| Pilot Light | 10-30 minutes | Minutes | Low | Core services always running, scale up on failure |
| Warm Standby | Minutes | Seconds | Medium | Scaled-down but fully running copy in another region |
| Multi-Site Active-Active | Near zero | Near zero | Highest | Full capacity running in multiple regions simultaneously |
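
The selection logic behind this table can be sketched as a small lookup: walk the strategies from cheapest to most expensive and pick the first one that can meet both objectives. The numeric thresholds below are assumptions chosen to stand in for the table's qualitative values ("hours", "minutes", "near zero"); they are not official AWS figures, and the names are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Strategy:
    name: str
    best_rto: timedelta  # assumed best achievable recovery time
    best_rpo: timedelta  # assumed best achievable data-loss window


# Ordered cheapest to most expensive, as in the table.
# The numbers are illustrative assumptions, not AWS guarantees.
STRATEGIES = [
    Strategy("Backup and Restore", timedelta(hours=1), timedelta(hours=1)),
    Strategy("Pilot Light", timedelta(minutes=10), timedelta(minutes=1)),
    Strategy("Warm Standby", timedelta(minutes=1), timedelta(seconds=1)),
    Strategy("Multi-Site Active-Active", timedelta(0), timedelta(0)),
]


def cheapest_strategy(rto: timedelta, rpo: timedelta) -> str:
    """Return the cheapest strategy whose assumed best RTO/RPO fit the requirement."""
    for strategy in STRATEGIES:
        if strategy.best_rto <= rto and strategy.best_rpo <= rpo:
            return strategy.name
    raise ValueError("RTO and RPO must be non-negative durations")


choice_relaxed = cheapest_strategy(timedelta(hours=4), timedelta(hours=1))
choice_strict = cheapest_strategy(timedelta(minutes=5), timedelta(seconds=10))
```

With these assumed thresholds, an RTO of 4 hours and an RPO of 1 hour resolves to Backup and Restore, while an RTO of 5 minutes and an RPO of 10 seconds resolves to Warm Standby.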

High Availability Patterns

  • Multi-AZ RDS: synchronous standby in a second AZ of the same Region, with automatic failover that typically completes in 60-120 seconds (RPO = 0, RTO of about 2 minutes)
  • ALB + ASG across 3 AZs: instances in us-east-1a, us-east-1b, and us-east-1c, so the application survives the failure of any single AZ
  • S3 with Versioning: recover any previous version of any object
  • Route 53 Failover: automatically route traffic to the DR site when the primary health check fails
  • Aurora Global Database: spans multiple Regions (up to 5 secondaries), typically under 1 second of replication lag, and a secondary can be promoted in under 1 minute
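
As an illustration of the Route 53 failover pattern from the list above, the sketch below builds the change batch for a PRIMARY/SECONDARY pair of A records. The helper name, domain, IP addresses, and health check ID are placeholders; the dictionary keys follow the shape that boto3's `change_resource_record_sets` call expects, but treat this as a sketch to verify against the current API reference rather than a finished configuration.

```python
from typing import Optional


def failover_change_batch(domain: str,
                          primary_ip: str,
                          secondary_ip: str,
                          primary_health_check_id: str,
                          ttl: int = 60) -> dict:
    """Build a Route 53 ChangeBatch with one PRIMARY and one SECONDARY failover record."""

    def record(role: str, ip: str, health_check_id: Optional[str] = None) -> dict:
        record_set = {
            "Name": domain,
            "Type": "A",
            "SetIdentifier": f"{role.lower()}-site",  # must be unique per record
            "Failover": role,                         # "PRIMARY" or "SECONDARY"
            "TTL": ttl,                               # a short TTL speeds up failover
            "ResourceRecords": [{"Value": ip}],
        }
        if health_check_id is not None:
            # Route 53 stops answering with this record when the check fails.
            record_set["HealthCheckId"] = health_check_id
        return {"Action": "UPSERT", "ResourceRecordSet": record_set}

    return {
        "Comment": "Failover routing between primary and DR site",
        "Changes": [
            record("PRIMARY", primary_ip, primary_health_check_id),
            record("SECONDARY", secondary_ip),
        ],
    }


# Placeholder values (documentation IP ranges, hypothetical health check ID).
batch = failover_change_batch(
    "app.example.com.", "192.0.2.10", "198.51.100.10", "hc-primary-placeholder"
)

# Applying it would look roughly like this (needs boto3 and AWS credentials):
#   boto3.client("route53").change_resource_record_sets(
#       HostedZoneId="<your hosted zone id>", ChangeBatch=batch)
```

Only the primary record carries a health check here, so traffic moves to the secondary record when that check fails. Whether the DR site also needs its own health check depends on the design.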

Exam Tip: The exam often asks you to choose the right DR strategy for a given RTO/RPO requirement. Pick the cheapest strategy that still meets the requirement: 'not willing to lose any data' points to multi-site active-active (or warm standby if a few seconds of loss is acceptable), while 'can tolerate 4 hours downtime' points to backup and restore. Read the requirements carefully!

Topic Quiz · 2 questions

Test your understanding before moving on

1. A company's RPO is 1 hour and RTO is 4 hours. Which is the MOST cost-effective DR strategy?
💡 Backup and Restore has an RTO measured in hours and an RPO set by the backup frequency. It is the cheapest strategy, and with hourly snapshots it meets an RTO of 4 hours and an RPO of 1 hour.
2. A banking application cannot afford to lose ANY data and must recover within 2 minutes of any failure. Which approach meets this requirement?
💡 Multi-AZ RDS uses synchronous replication (RPO = 0, no data loss) and fails over automatically, typically in 60-120 seconds, which keeps recovery within the 2-minute RTO.