
Failure Domains & Blast Radius

A failure domain is a set of components that fail together due to a shared dependency. Understanding failure domains is fundamental to designing resilient systems: if you don't know what can fail together, you can't place redundant components where they will actually survive a failure.

# Types of Failure Domains

## Electrical Failure Domains

All devices sharing the same power source fail together when power is lost.

PDU A Failure:
    [PDU A] --X--> [Server 1 PSU-A]  (loses power)
                   [Server 2 PSU-A]  (loses power)
                   [Server 3 PSU-A]  (loses power)

If the servers have only a single PSU, fed from PDU A, all three go down
Solution: Dual PSUs on separate PDUs (A and B)
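
To make the shared-power dependency concrete, here is a minimal Python sketch (hypothetical server and PDU names) that records which PDUs feed each server's power supplies and reports which servers go dark when one PDU fails:

    # Map each server to the PDUs feeding its power supplies.
    # Single-PSU servers have one entry; dual-PSU servers have two.
    power_feeds = {
        "server-1": ["pdu-a"],            # single PSU: one failure domain
        "server-2": ["pdu-a"],
        "server-3": ["pdu-a", "pdu-b"],   # dual PSU: survives loss of either PDU
    }

    def servers_down(failed_pdu: str) -> list[str]:
        """Servers that lose all power when `failed_pdu` fails."""
        return [
            server
            for server, pdus in power_feeds.items()
            if all(pdu == failed_pdu for pdu in pdus)
        ]

    print(servers_down("pdu-a"))  # ['server-1', 'server-2'] -- server-3 stays up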

## Network Failure Domains

Devices behind the same network switch or router lose connectivity when that device fails.

ToR Switch Failure:
                [ToR Switch] --X--> (switch fails)
                     |
         +-----------+-----------+
         |           |           |
    [Server A]  [Server B]  [Server C]

All servers in the rack lose network connectivity
Solution: Multi-home each server to two switches (MLAG, NIC bonding)
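
The same reasoning applies to uplinks. A minimal sketch (hypothetical server and switch names) of active-backup behavior: a multi-homed server stays reachable as long as any one of its uplink switches is healthy:

    # Each server lists its uplink switches in failover order.
    uplinks = {
        "server-a": ["tor-1"],           # single-homed: tied to tor-1's fate
        "server-b": ["tor-1", "tor-2"],  # multi-homed via MLAG / bonding
    }

    def active_uplink(server, failed_switches):
        """First healthy uplink, or None if the server is unreachable."""
        for switch in uplinks[server]:
            if switch not in failed_switches:
                return switch
        return None

    print(active_uplink("server-a", {"tor-1"}))  # None -> offline
    print(active_uplink("server-b", {"tor-1"}))  # 'tor-2' -> still reachable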

## Geographic Failure Domains

Natural disasters, power grid failures, or network partitions affect entire regions.

Region: us-east-1
    AZ-A: [Servers] [Storage] [Network]
    AZ-B: [Servers] [Storage] [Network]  <-- Regional power grid
    AZ-C: [Servers] [Storage] [Network]      failure affects all AZs

Mitigation: Multi-region deployment (us-east-1 + us-west-2)
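
One way a client can take advantage of a multi-region deployment is simple health-check failover. A minimal sketch, assuming hypothetical per-region endpoints that expose a /health URL:

    import urllib.request

    # Hypothetical regional endpoints, listed in failover preference order.
    REGION_ENDPOINTS = [
        "https://api.us-east-1.example.com/health",
        "https://api.us-west-2.example.com/health",
    ]

    def first_healthy_region(timeout: float = 2.0):
        """Return the first endpoint that answers its health check, else None."""
        for url in REGION_ENDPOINTS:
            try:
                with urllib.request.urlopen(url, timeout=timeout) as resp:
                    if resp.status == 200:
                        return url
            except OSError:
                continue  # region unreachable or slow; try the next one
        return None       # every region is down: blast radius exceeded the plan

In production this decision usually lives in DNS or a global load balancer rather than in each client, but the failure-domain logic is the same.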

# Blast Radius

Blast radius is the scope of impact when a failure occurs. Small blast radius = localized failure. Large blast radius = widespread outage.

Small Blast Radius (Good):
    Single server fails --> 1/100 capacity lost
    Load balancer redistributes --> Users unaffected

Large Blast Radius (Bad):
    Shared database fails --> All services down
    No redundancy --> Complete outage

Design principle: Minimize blast radius by avoiding shared single points of failure and distributing load across independent failure domains.
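
Blast radius can be estimated directly from a dependency graph: the services impacted by a failure are the failed component plus everything that transitively depends on it. A minimal sketch over a hypothetical topology:

    # Who depends on what (service -> direct dependencies). Hypothetical topology.
    DEPENDS_ON = {
        "web":     ["auth", "db"],
        "auth":    ["db"],
        "reports": ["db"],
        "static":  [],        # served independently, no shared dependency
    }

    def blast_radius(failed: str) -> set[str]:
        """Every service that stops working, directly or transitively."""
        impacted = {failed}
        changed = True
        while changed:
            changed = False
            for service, deps in DEPENDS_ON.items():
                if service not in impacted and impacted.intersection(deps):
                    impacted.add(service)
                    changed = True
        return impacted

    print(blast_radius("db"))      # {'db', 'auth', 'web', 'reports'} -- large
    print(blast_radius("static"))  # {'static'}                       -- small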

# Correlated Failures

Correlated failures occur when seemingly independent components fail due to a shared root cause.

Example 1: Configuration Changes

Bad Configuration Pushed to All Servers Simultaneously:
    T+0: Deploy config change to entire fleet at once
    T+1: Bug in config causes all servers to crash
    T+2: Entire service down (100% failure rate)

Problem: All replicas shared the same failure mode (the bad config)
Solution: Canary deployments (1% -> 10% -> 100%)
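
A minimal sketch of that staged rollout, with a stand-in error-rate check between stages (in practice the gate reads real metrics for the canary slice only):

    import time

    STAGES = [0.01, 0.10, 1.00]      # 1% -> 10% -> 100% of the fleet

    def rollout(fleet_size: int, error_rate) -> bool:
        """Deploy in stages; stop as soon as the canary slice looks unhealthy.

        `error_rate` is a callable returning the current error rate for the
        servers that already received the change (a stand-in for real metrics).
        """
        for fraction in STAGES:
            target = int(fleet_size * fraction)
            print(f"deploying to {target}/{fleet_size} servers")
            # ... push the change to `target` servers here ...
            time.sleep(1)                            # soak time before judging
            if error_rate() > 0.01:
                print(f"rolling back; blast radius capped at {target} servers")
                return False
        return True

    # With a healthy fleet the rollout reaches 100%.
    rollout(fleet_size=400, error_rate=lambda: 0.001)

The point of the staging is exactly blast-radius control: a bad config caught at the 1% stage takes out 4 servers instead of 400.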

Example 2: Software Bugs Under Load

Scenario: A service has 5 replicas. Load increases, triggering a memory leak in all replicas simultaneously. All 5 crash within minutes.

Problem: The replicas run identical software and hit the same bug under the same conditions. Redundancy didn't help; this was a correlated failure mode.

Mitigation: Gradual rollouts, diverse deployments (different software versions in production), load shedding before hitting resource limits.
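
Load shedding can be as simple as capping concurrent work below the level that triggers the failure. A minimal sketch (the cap and handler are illustrative, not a real framework API):

    import threading

    MAX_IN_FLIGHT = 100      # set below the load where the leak/crash kicks in
    _in_flight = 0
    _lock = threading.Lock()

    class Overloaded(Exception):
        """Raised instead of accepting work the process cannot afford."""

    def handle(request):
        global _in_flight
        with _lock:
            if _in_flight >= MAX_IN_FLIGHT:
                raise Overloaded("shedding load")   # fail fast, stay alive
            _in_flight += 1
        try:
            return process(request)                 # the real work
        finally:
            with _lock:
                _in_flight -= 1

    def process(request):
        return f"ok: {request}"

Rejecting some requests early keeps every replica alive, which is a far smaller blast radius than all five replicas crashing at once.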

Example 3: Shared Dependencies

App Servers Depend on Shared Auth Service:
    [App 1] --|
    [App 2] --+--> [Auth Service] --X--> (auth crashes)
    [App 3] --|

All apps fail auth checks, user-facing outage
Solution: Degrade gracefully (cached auth tokens, temporary bypass)
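
A minimal sketch of the cached-token fallback (the `fetch_token` callable stands in for the real auth client):

    import time

    _token_cache = {}      # user -> (token, expiry timestamp)
    CACHE_TTL = 300        # seconds we are willing to trust a stale token

    def check_auth(user: str, fetch_token):
        """Ask the auth service; fall back to a recent cached token if it is down."""
        try:
            token = fetch_token(user)                    # normal path
            _token_cache[user] = (token, time.time() + CACHE_TTL)
            return token
        except Exception:                                # auth service unreachable
            cached = _token_cache.get(user)
            if cached and cached[1] > time.time():
                return cached[0]    # degrade gracefully: accept the stale token
            return None             # no safe fallback: deny, but don't crash

Whether a stale token is acceptable is a product and security decision; the point is that the app servers' blast radius no longer equals the auth service's.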

# Separating Failure Domains

Distribute resources across independent failure domains to prevent correlated failures.

Poor Separation:
    Rack 1: [Primary DB] [Replica 1] [Replica 2]
    Rack 2: (empty)

Rack power failure = all DB instances down

Good Separation:
    Rack 1: [Primary DB]
    Rack 2: [Replica 1]
    Rack 3: [Replica 2]

Rack failure = 1 instance down, others continue
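
A scheduler can enforce this separation with a simple anti-affinity rule: always place the next replica on the failure domain that currently holds the fewest. A minimal sketch:

    from collections import Counter

    def place_replicas(replicas, racks):
        """Assign each replica to the rack currently holding the fewest replicas."""
        load = Counter({rack: 0 for rack in racks})
        placement = {}
        for replica in replicas:
            rack = min(racks, key=lambda r: load[r])   # least-loaded rack first
            placement[replica] = rack
            load[rack] += 1
        return placement

    print(place_replicas(["primary", "replica-1", "replica-2"],
                         ["rack-1", "rack-2", "rack-3"]))
    # {'primary': 'rack-1', 'replica-1': 'rack-2', 'replica-2': 'rack-3'}

The same function works for any failure domain: pass AZs or regions instead of racks.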

## Cloud Failure Domain Separation

In AWS, distribute across Availability Zones (AZs):

Bad:
    us-east-1a: [3 web servers] [database primary + replicas]
    us-east-1b: (empty)
    us-east-1c: (empty)

AZ failure = total outage

Good:
    us-east-1a: [web] [db primary]
    us-east-1b: [web] [db replica]
    us-east-1c: [web] [db replica]

AZ failure = ~33% capacity loss; service continues (a replica must be promoted if the primary's AZ is lost)
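
It is worth verifying this spread continuously rather than assuming it; autoscaling and manual changes can quietly concentrate instances in one AZ. A minimal check, assuming you can list each instance's AZ:

    from collections import Counter

    def az_spread_ok(instance_azs, max_share=0.5):
        """True if no single AZ holds more than `max_share` of the instances."""
        counts = Counter(instance_azs)
        worst = max(counts.values()) / len(instance_azs)
        return worst <= max_share

    print(az_spread_ok(["us-east-1a", "us-east-1a", "us-east-1a"]))   # False
    print(az_spread_ok(["us-east-1a", "us-east-1b", "us-east-1c"]))   # True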

# Real-World Example: AWS October 2025 DynamoDB Outage

On October 19-20, 2025, AWS experienced a major outage in us-east-1 when DynamoDB's DNS management system failed. The blast radius was enormous—DynamoDB, EC2, Lambda, ECS, and many other services went down.

Failure domain issue: DynamoDB's DNS management had a self-dependency: the DynamoDB control plane used DynamoDB itself for state management. When DNS failed, restoring it required DynamoDB, and reaching DynamoDB required DNS, a circular dependency that blocked automatic recovery.

Blast radius amplification: EC2 instance launches depended on DynamoDB for lease management. When DynamoDB's DNS failed, EC2 couldn't launch instances. When EC2 couldn't launch, services using EC2 autoscaling (Lambda, ECS, Fargate) also failed.

Lesson: Control plane components shouldn't depend on themselves (or systems they control) for critical functions. Self-dependencies create failure modes where recovery is impossible without manual intervention.
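
Circular dependencies like this can be caught before an outage by checking the dependency graph of control-plane components. A minimal cycle-detection sketch over a hypothetical graph that mirrors the self-dependency described above:

    def find_cycle(depends_on):
        """Return one dependency cycle if present, else None (DFS with a path stack)."""
        visiting, visited = set(), set()

        def dfs(node, path):
            visiting.add(node)
            path.append(node)
            for dep in depends_on.get(node, []):
                if dep in visiting:                      # back edge: cycle found
                    return path[path.index(dep):] + [dep]
                if dep not in visited:
                    found = dfs(dep, path)
                    if found:
                        return found
            visiting.discard(node)
            visited.add(node)
            path.pop()
            return None

        for start in depends_on:
            if start not in visited:
                found = dfs(start, [])
                if found:
                    return found
        return None

    # Hypothetical control-plane graph echoing the outage's structure.
    graph = {
        "dynamodb":       ["dns-management"],
        "dns-management": ["dynamodb"],      # DNS repair needs DynamoDB itself
        "ec2-launch":     ["dynamodb"],
    }
    print(find_cycle(graph))   # ['dynamodb', 'dns-management', 'dynamodb']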

# Key Takeaways

  • Failure domains are components that fail together due to shared dependencies (power, network, geography)
  • Blast radius is the scope of impact; minimize by avoiding shared single points of failure
  • Correlated failures occur when seemingly independent replicas fail from the same root cause (config push, code bug, shared dependency)
  • Separate failure domains: distribute across racks, AZs, or regions, depending on availability requirements
  • Avoid self-dependencies in control planes—systems shouldn't depend on themselves for recovery
  • Test failure scenarios to validate blast radius is as expected