Disaster Recovery 101 for Data/Cloud Teams
After a major cloud incident, the question isn’t whether it can happen again, but how you limit the impact and recover when it does. Two concepts to keep top of mind:
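RPO (Recovery Point Objective): the maximum data loss you can tolerate, expressed as a window of time (for example, the last 15 minutes of writes).
RTO (Recovery Time Objective): the maximum downtime you can tolerate before service is restored.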
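To make the two objectives concrete, here is a minimal Python sketch that checks a backup schedule against RPO and RTO targets. The function names and the example numbers are illustrative assumptions, not part of any cloud provider's API. It assumes each backup captures the state at the moment it starts, so the worst case is a failure just before the next backup finishes.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta,
                         backup_duration: timedelta = timedelta(0)) -> timedelta:
    """Worst-case loss window for periodic backups (simplified assumption).

    If a failure hits just before the next backup completes, the newest
    usable copy reflects the state from one interval plus one backup
    duration ago.
    """
    return backup_interval + backup_duration


def meets_objectives(rpo: timedelta, rto: timedelta,
                     backup_interval: timedelta, restore_time: timedelta,
                     backup_duration: timedelta = timedelta(0)) -> dict:
    """Compare a backup schedule and a measured restore time to RPO/RTO targets."""
    loss = worst_case_data_loss(backup_interval, backup_duration)
    return {
        "worst_case_loss": loss,
        "rpo_ok": loss <= rpo,          # data-loss window fits the RPO
        "rto_ok": restore_time <= rto,  # restore finishes within the RTO
    }


# Hypothetical targets: RPO of 1 hour, RTO of 4 hours, nightly backups,
# and a restore that took 2 hours in the last drill.
result = meets_objectives(
    rpo=timedelta(hours=1),
    rto=timedelta(hours=4),
    backup_interval=timedelta(hours=24),
    restore_time=timedelta(hours=2),
)
print(result)
```

With these example numbers the restore time fits the RTO, but nightly backups miss a one-hour RPO by a wide margin, which is the kind of gap that pushes teams toward more frequent snapshots or continuous replication.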
Common DR strategies (from simplest and cheapest to fastest recovery):
Backup & Restore
Take regular backups (snapshots/exports) and restore them to a secondary environment when disaster strikes.
Lowest cost; highest RTO/RPO.
Pilot Light
Minimal, critical core of the system always running (DBs/config, small app footprint).
Faster than pure backup because essentials are already up.
Warm Standby
The full system is running at reduced capacity.
Scales quickly to production load during a disaster.
Multi-Site / Hot Standby
Two (or more) active production environments.
Very low RTO/RPO, but higher cost/complexity.
Multi-Region / Multi-Cloud
Production-capable stacks in different regions (or clouds) to survive regional/provider failures.
Mental model (left→right = faster recovery, higher cost/complexity):
Backup & Restore → Pilot Light → Warm Standby → Multi-Site → Multi-Region/Multi-Cloud
Tip: Match your DR choice to business SLAs. If RTO/RPO targets are strict, you’ll likely need Warm Standby or Multi-Site/Multi-Region, plus IaC, runbooks, and regular game days.
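One way to picture the tip is as a lookup: walk the strategies from cheapest to most expensive and stop at the first one that can plausibly meet both targets. The Python sketch below does that. The capability numbers in ASSUMED_CAPABILITY are made-up placeholders for illustration only; real values depend on your architecture and should come from your own drills.

```python
from datetime import timedelta
from enum import IntEnum


class Strategy(IntEnum):
    """DR strategies ordered by increasing cost/complexity."""
    BACKUP_AND_RESTORE = 1
    PILOT_LIGHT = 2
    WARM_STANDBY = 3
    MULTI_SITE = 4


# ASSUMPTION: (strategy, achievable RTO, achievable RPO). These figures are
# hypothetical placeholders, not benchmarks; replace them with measured values.
ASSUMED_CAPABILITY = [
    (Strategy.BACKUP_AND_RESTORE, timedelta(hours=24), timedelta(hours=24)),
    (Strategy.PILOT_LIGHT, timedelta(hours=4), timedelta(hours=1)),
    (Strategy.WARM_STANDBY, timedelta(minutes=30), timedelta(minutes=5)),
]


def cheapest_strategy(rto_target: timedelta, rpo_target: timedelta) -> Strategy:
    """Return the cheapest strategy whose assumed RTO and RPO meet both targets.

    Falls back to MULTI_SITE when nothing cheaper is fast enough.
    """
    for strategy, achievable_rto, achievable_rpo in ASSUMED_CAPABILITY:
        if achievable_rto <= rto_target and achievable_rpo <= rpo_target:
            return strategy
    return Strategy.MULTI_SITE


# Relaxed targets are satisfied by the cheapest option; strict ones are not.
relaxed = cheapest_strategy(timedelta(hours=48), timedelta(hours=24))
strict = cheapest_strategy(timedelta(hours=1), timedelta(minutes=10))
print(relaxed.name, strict.name)
```

The point of the sketch is the ordering, not the thresholds: start from the business targets and pay only for as much standby capacity as they require.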
Just remember this graphic:
Backup & Restore → Pilot Light → Warm Standby → Multi-Site
lower cost, slower recovery → higher cost, faster recovery