Failure Domains & Blast Radius
A failure domain is a set of components that fail together due to a shared dependency. Understanding failure domains is fundamental to designing resilient systems: if you don't know what can fail together, you can't place your redundant copies in genuinely independent domains.
# Types of Failure Domains
Electrical Failure Domains
All devices sharing the same power source fail together when power is lost.
PDU A Failure:
[PDU A] --X--> [Server 1 PSU-A] (loses power)
        --X--> [Server 2 PSU-A] (loses power)
        --X--> [Server 3 PSU-A] (loses power)
If each server has only a single PSU, fed from PDU A, all three go down
Solution: Dual PSUs per server, cabled to separate PDUs (A and B)
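As a quick illustration, here is a minimal Python sketch that audits a hypothetical inventory mapping each server to the PDUs feeding its power supplies, and flags servers that a single PDU failure would take offline. The server and PDU names are made up.

```python
# Minimal sketch: audit power-feed redundancy from a hypothetical inventory
# mapping each server to the PDUs its power supplies are cabled to.
from collections import defaultdict

# Hypothetical inventory: server -> PDUs feeding its PSUs
inventory = {
    "server-1": ["pdu-a"],            # single-corded: at risk
    "server-2": ["pdu-a", "pdu-b"],   # dual-corded across feeds: OK
    "server-3": ["pdu-a", "pdu-a"],   # dual PSU, but both on PDU A: still at risk
}

def power_failure_domains(inventory):
    """Group servers by the single PDU whose loss takes them down."""
    at_risk = defaultdict(list)
    for server, pdus in inventory.items():
        distinct_feeds = set(pdus)
        if len(distinct_feeds) < 2:
            # Losing this one PDU removes all power to the server.
            at_risk[next(iter(distinct_feeds))].append(server)
    return dict(at_risk)

print(power_failure_domains(inventory))
# {'pdu-a': ['server-1', 'server-3']}  -> PDU A is a shared failure domain
```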
Network Failure Domains
Devices behind the same network switch or router lose connectivity when that device fails.
ToR Switch Failure:
         [ToR Switch] --X--> (switch fails)
               |
   +-----------+-----------+
   |           |           |
[Server A] [Server B] [Server C]
All servers in the rack lose network connectivity
Solution: Multi-home each server to two different switches (MLAG, NIC bonding)
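The same kind of audit works for network uplinks. The sketch below, with made-up server and switch names, reports which servers would be isolated by any single ToR switch failure:

```python
# Minimal sketch: given a hypothetical map of server uplinks to ToR switches,
# report which servers become isolated if any single switch fails.

uplinks = {
    "server-a": ["tor-1"],             # single-homed
    "server-b": ["tor-1", "tor-2"],    # multi-homed (MLAG / bond across ToRs)
    "server-c": ["tor-1", "tor-1"],    # bonded, but both links on one switch
}

def isolated_by_switch_failure(uplinks):
    """Map each switch to the servers that lose all connectivity if it fails."""
    impact = {}
    switches = {sw for links in uplinks.values() for sw in links}
    for failed in sorted(switches):
        impact[failed] = [
            server for server, links in uplinks.items()
            if all(sw == failed for sw in links)
        ]
    return impact

print(isolated_by_switch_failure(uplinks))
# {'tor-1': ['server-a', 'server-c'], 'tor-2': []}
```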
Geographic Failure Domains
Natural disasters, power grid failures, or network partitions affect entire regions.
Datacenter: us-east-1
AZ-A: [Servers] [Storage] [Network]  \
AZ-B: [Servers] [Storage] [Network]   }-- regional power grid failure
AZ-C: [Servers] [Storage] [Network]  /    affects all three AZs
Mitigation: Multi-region deployment (us-east-1 + us-west-2)
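To make the trade-off concrete, here is a small sketch (with hypothetical region names and capacity numbers) that checks whether a deployment can still serve its required load after losing any one region:

```python
# Minimal sketch: check whether a deployment survives the loss of any single
# region while keeping a required amount of capacity. Numbers are hypothetical.

capacity = {"us-east-1": 60, "us-west-2": 40}   # request capacity per region
required = 50                                    # capacity needed at peak load

def survives_region_loss(capacity, required):
    total = sum(capacity.values())
    report = {}
    for region, cap in capacity.items():
        remaining = total - cap
        report[region] = {"remaining": remaining, "ok": remaining >= required}
    return report

print(survives_region_loss(capacity, required))
# Losing us-east-1 leaves 40 < 50: this layout does not tolerate that region's loss.
```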
# Blast Radius
Blast radius is the scope of impact when a failure occurs. Small blast radius = localized failure. Large blast radius = widespread outage.
Small Blast Radius (Good):
Single server fails --> 1/100 capacity lost
Load balancer redistributes --> Users unaffected
Large Blast Radius (Bad):
Shared database fails --> All services down
No redundancy --> Complete outage
Design principle: Minimize blast radius by avoiding shared single points of failure and distributing load across independent failure domains.
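One way to reason about blast radius is to walk a hard-dependency graph: if a component fails, everything that transitively hard-depends on it fails too. The sketch below uses a small hypothetical service graph:

```python
# Minimal sketch: estimate blast radius from a hypothetical hard-dependency
# graph. A component fails if any of its hard dependencies are down.

depends_on = {
    "web-1": ["shared-db"],
    "web-2": ["shared-db"],
    "web-3": ["shared-db"],
    "shared-db": [],
    "cache": [],
}

def blast_radius(failed, depends_on):
    """Return all components taken down by `failed`, including itself."""
    down = {failed}
    changed = True
    while changed:
        changed = False
        for component, deps in depends_on.items():
            if component not in down and any(d in down for d in deps):
                down.add(component)
                changed = True
    return down

impacted = blast_radius("shared-db", depends_on)
print(sorted(impacted), f"-> {len(impacted) / len(depends_on):.0%} of components affected")
# ['shared-db', 'web-1', 'web-2', 'web-3'] -> 80% of components affected
```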
# Correlated Failures
Correlated failures occur when seemingly independent components fail due to a shared root cause.
Example 1: Configuration Changes
Bad Configuration Pushed to All Servers Simultaneously:
T+0: Deploy config change to entire fleet at once
T+1: Bug in config causes all servers to crash
T+2: Entire service down (100% failure rate)
Problem: All replicas shared the same failure mode (bad config)
Solution: Canary deployments (1% -> 10% -> 100%)
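A canary rollout can be expressed as a simple loop: widen the deployment in stages and abort if the error rate exceeds a budget. In the sketch below, `deploy_to_fraction` and `error_rate` are hypothetical hooks into your deploy tooling and metrics system:

```python
# Minimal sketch of a canary rollout loop with a rollback on bad metrics.
import time

STAGES = [0.01, 0.10, 1.00]   # 1% -> 10% -> 100%
ERROR_BUDGET = 0.02           # abort if error rate exceeds 2%
SOAK_SECONDS = 600            # observe each stage before widening

def canary_rollout(deploy_to_fraction, error_rate):
    for fraction in STAGES:
        deploy_to_fraction(fraction)
        time.sleep(SOAK_SECONDS)              # let metrics accumulate
        if error_rate() > ERROR_BUDGET:
            deploy_to_fraction(0.0)           # roll back: blast radius stays small
            return False
        print(f"stage {fraction:.0%} healthy, widening rollout")
    return True
```

The key property is that a bad config or bad build only ever reaches a small slice of the fleet before the loop stops it.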
Example 2: Software Bugs Under Load
Scenario: Service has 5 replicas. Load increases, triggering memory leak in all replicas simultaneously. All 5 crash within minutes.
Problem: The replicas run identical software and hit the same bug under the same conditions. Redundancy didn't help because the failure mode was correlated.
Mitigation: Gradual rollouts, diverse deployments (different software versions in production), load shedding before hitting resource limits.
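Load shedding is a small amount of code with an outsized effect on blast radius. The sketch below rejects new work once in-flight requests cross a threshold rather than letting every replica march toward the same resource limit; `queue_depth` and `handle` are hypothetical hooks into your server:

```python
# Minimal sketch of load shedding: reject new requests before the process
# reaches the resource limit that would crash every replica at once.

MAX_IN_FLIGHT = 500   # shed above this, well below the point where the leak/OOM bites

def maybe_handle(request, queue_depth, handle):
    if queue_depth() > MAX_IN_FLIGHT:
        # Fail fast with a retryable error instead of letting memory grow
        # until the whole fleet falls over together.
        return {"status": 503, "retry_after_seconds": 5}
    return handle(request)
```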
Example 3: Shared Dependencies
App Servers Depend on Shared Auth Service:
[App 1] --+
[App 2] --+--> [Auth Service] --X--> (auth service crashes)
[App 3] --+
All apps fail authentication checks, causing a user-facing outage
Solution: Degrade gracefully (cached auth tokens, temporary bypass of non-critical checks)
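One possible shape for that graceful degradation: validate tokens against the auth service when it is reachable, and fall back to a short-lived local cache of recently validated tokens when it is not. `remote_validate` is a hypothetical client call:

```python
# Minimal sketch of graceful degradation for auth: if the auth service is
# unreachable, accept tokens validated recently from a local cache instead of
# failing every request.
import time

CACHE_TTL = 300          # seconds a previously validated token stays trusted
_validated = {}          # token -> time of last successful remote validation

def check_token(token, remote_validate):
    try:
        if remote_validate(token):              # normal path: ask the auth service
            _validated[token] = time.time()
            return True
        return False                            # an explicit rejection is honored
    except Exception:
        # Auth service down: degrade instead of taking every app down with it.
        seen = _validated.get(token)
        return seen is not None and time.time() - seen < CACHE_TTL
```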
# Separating Failure Domains
Distribute resources across independent failure domains to prevent correlated failures.
Poor Separation:
Rack 1: [Primary DB] [Replica 1] [Replica 2]
Rack 2: (empty)
Rack power failure = all DB instances down
Good Separation:
Rack 1: [Primary DB]
Rack 2: [Replica 1]
Rack 3: [Replica 2]
Rack failure = 1 instance down, others continue
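Schedulers typically enforce this with anti-affinity rules. Here is a minimal sketch of the placement logic, with hypothetical rack and instance names:

```python
# Minimal sketch of rack anti-affinity: place each member of a replica group
# on a rack that does not already host another member of the same group.

def place_with_anti_affinity(instances, racks, placements=None):
    """placements: rack -> set of instances already on it."""
    placements = placements or {rack: set() for rack in racks}
    group = set(instances)
    for instance in instances:
        # Prefer the least-loaded rack with no member of this group yet.
        candidates = [r for r in racks if not placements[r] & group]
        if not candidates:
            raise RuntimeError("not enough racks for full separation")
        target = min(candidates, key=lambda r: len(placements[r]))
        placements[target].add(instance)
    return placements

print(place_with_anti_affinity(
    ["db-primary", "db-replica-1", "db-replica-2"],
    ["rack-1", "rack-2", "rack-3"],
))
# Each DB instance lands on a different rack.
```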
Cloud Failure Domain Separation
In AWS, distribute across Availability Zones (AZs):
Bad:
us-east-1a: [3 web servers] [database primary + replicas]
us-east-1b: (empty)
us-east-1c: (empty)
AZ failure = total outage
Good:
us-east-1a: [web] [db primary]
us-east-1b: [web] [db replica]
us-east-1c: [web] [db replica]
AZ failure = 33% capacity loss, service continues
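A periodic audit can catch drift back into a single AZ. The sketch below uses boto3 to count running EC2 instances per AZ (assuming credentials and a default region are already configured); the warning logic is an illustrative choice, not a prescribed threshold:

```python
# Minimal sketch: count running EC2 instances per Availability Zone with boto3,
# to spot deployments concentrated in a single AZ.
from collections import Counter
import boto3

def instances_per_az(ec2=None):
    ec2 = ec2 or boto3.client("ec2")
    per_az = Counter()
    paginator = ec2.get_paginator("describe_instances")
    for page in paginator.paginate(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    ):
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                per_az[instance["Placement"]["AvailabilityZone"]] += 1
    return per_az

if __name__ == "__main__":
    counts = instances_per_az()
    print(counts)
    if len(counts) < 2:
        print("WARNING: all instances in one AZ; one AZ failure is a total outage")
```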
# Real-World Example: AWS October 2025 DynamoDB Outage
On October 19-20, 2025, AWS experienced a major outage in us-east-1 when DynamoDB's DNS management system failed. The blast radius was enormous—DynamoDB, EC2, Lambda, ECS, and many other services went down.
Failure domain issue: DynamoDB's DNS management had a self-dependency—DynamoDB control plane used DynamoDB for state management. When DNS failed, DynamoDB couldn't recover without DNS, creating a circular dependency.
Blast radius amplification: EC2 instance launches depended on DynamoDB for lease management. When DynamoDB's DNS failed, EC2 couldn't launch instances. When EC2 couldn't launch, services using EC2 autoscaling (Lambda, ECS, Fargate) also failed.
Lesson: Control plane components shouldn't depend on themselves (or systems they control) for critical functions. Self-dependencies create failure modes where recovery is impossible without manual intervention.
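One practical defense is to check the service dependency graph for cycles before they become an unrecoverable failure mode. The sketch below is generic cycle detection; the example graph is a simplified, hypothetical rendering of the dependency described above, not AWS's actual architecture:

```python
# Minimal sketch: detect circular dependencies (including self-dependencies)
# in a service dependency graph.

def find_cycle(depends_on):
    """Return one dependency cycle as a list of services, or None."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {s: WHITE for s in depends_on}
    stack = []

    def visit(node):
        color[node] = GRAY
        stack.append(node)
        for dep in depends_on.get(node, []):
            if color.get(dep, WHITE) == GRAY:          # back edge: cycle found
                return stack[stack.index(dep):] + [dep]
            if color.get(dep, WHITE) == WHITE:
                found = visit(dep)
                if found:
                    return found
        stack.pop()
        color[node] = BLACK
        return None

    for service in depends_on:
        if color[service] == WHITE:
            found = visit(service)
            if found:
                return found
    return None

# Hypothetical graph illustrating a control plane that stores its state in
# the system it controls.
graph = {
    "dns-management": ["database"],
    "database": ["dns-management"],
    "instance-launch": ["database"],
}
print(find_cycle(graph))
# ['dns-management', 'database', 'dns-management']
```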
# Key Takeaways
- Failure domains are components that fail together due to shared dependencies (power, network, geography)
- Blast radius is the scope of impact; minimize by avoiding shared single points of failure
- Correlated failures occur when seemingly independent replicas fail due to the same root cause (config, code bug, shared dependency)
- Separate failure domains: distribute across racks, AZs, or regions, depending on availability requirements
- Avoid self-dependencies in control planes—systems shouldn't depend on themselves for recovery
- Test failure scenarios to validate blast radius is as expected