Notable Incidents & Lessons Learned
Real-world outages teach us more than any theoretical discussion. This section summarizes major incidents discussed throughout this chapter and extracts common patterns and lessons.
# AWS DynamoDB DNS Outage (October 19-20, 2025)
What Happened
A race condition in DynamoDB's DNS management system caused all IP addresses for the regional DynamoDB endpoint in us-east-1 to be removed from DNS. This triggered a cascading failure across multiple AWS services.
Timeline
- 11:48 PM Oct 19: DynamoDB DNS failure begins
- 12:38 AM Oct 20: Root cause identified (race condition)
- 2:25 AM: DNS restored
- 4:14 AM: EC2 host restarts initiated to clear queues
- 5:28 AM: EC2 lease management recovering
- 10:36 AM: Network propagation normalized
- 1:50 PM: EC2 fully recovered (14+ hours total)
Services Impacted
- DynamoDB (direct DNS failure)
- EC2 (instance launches failed, depended on DynamoDB for leases)
- Lambda, ECS, Fargate (depend on EC2 autoscaling)
- AWS Management Console, STS (authentication failures)
- NLB (health check cascade failures)
Root Causes
- Self-dependency: DynamoDB's DNS control plane used DynamoDB itself for state management. Circular dependency prevented recovery.
- Race condition: two DNS Enactors applied plans concurrently, and an older plan overwrote the current one (a minimal version-guard sketch follows this list)
- Cascade failure: EC2 depended on DynamoDB, Lambda/ECS depended on EC2
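The internals of DynamoDB's DNS management are not public as code, but a minimal sketch of the kind of guard that prevents a stale plan from overwriting a newer one might look like the following. The `DnsPlan` and `PlanStore` names are illustrative, not AWS's:

```python
import threading
from dataclasses import dataclass

@dataclass
class DnsPlan:
    version: int   # monotonically increasing plan number
    records: dict  # hostname -> list of IP addresses

class PlanStore:
    """Illustrative store that refuses to apply a plan older than the one in effect."""

    def __init__(self):
        self._lock = threading.Lock()
        self._applied = DnsPlan(version=0, records={})

    def apply(self, plan: DnsPlan) -> bool:
        # Compare-and-set on the version: a slow Enactor still holding a stale
        # plan cannot overwrite the record set produced by a newer plan.
        with self._lock:
            if plan.version <= self._applied.version:
                return False  # reject the stale plan instead of clobbering DNS
            self._applied = plan
            return True

store = PlanStore()
store.apply(DnsPlan(version=2, records={"dynamodb.us-east-1.amazonaws.com": ["198.51.100.7"]}))
assert store.apply(DnsPlan(version=1, records={})) is False  # stale plan is rejected
```

The design choice is the same one used by optimistic concurrency control: reject out-of-order writes rather than trusting that writers never race.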
Lessons Learned
- Control plane must never depend on itself (or systems it controls) for recovery
- Use external, simple, reliable systems for bootstrapping (not the same complex system)
- Test control plane failure scenarios: can the system recover without manual intervention?
- Dependency chains create cascade failures—map and isolate critical dependencies
# AWS S3 Outage (February 28, 2017)
What Happened
During routine maintenance, an engineer mistyped a command intended to remove a small number of servers from S3's billing subsystem. The typo removed a much larger set of servers, including critical S3 control plane systems (placement and index subsystems).
Impact
- S3 API unavailable for ~4 hours in us-east-1
- New bucket creation, object uploads failed (control plane down)
- Some GET requests continued working (data plane had cached metadata)
- Cascading impact on services depending on S3 (many AWS services use S3 internally)
Lessons Learned
- Control plane vs data plane separation prevented total failure (some reads survived)
- Human error is inevitable; design systems with safeguards such as confirmation prompts and a dry-run mode (see the sketch after this list)
- Cold-start recovery at scale is slow—S3 took hours to reload metadata for billions of objects
- Test disaster recovery scenarios including full control plane restart
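AWS's actual operational tooling is internal, but as a hypothetical illustration of the safeguards this lesson points to, a capacity-removal command might default to dry-run and enforce a blast-radius cap:

```python
import argparse

def remove_servers(targets):
    # Placeholder for the real decommission call.
    for host in targets:
        print(f"removing {host}")

def main():
    parser = argparse.ArgumentParser(description="decommission capacity (illustrative)")
    parser.add_argument("hosts", nargs="+")
    parser.add_argument("--execute", action="store_true",
                        help="actually remove; default is dry-run")
    parser.add_argument("--max-hosts", type=int, default=5,
                        help="refuse to remove more than this many hosts")
    args = parser.parse_args()

    # Safeguard 1: cap the blast radius; a typo that expands the target set
    # hits this limit instead of taking out a whole subsystem.
    if len(args.hosts) > args.max_hosts:
        parser.error(f"refusing to remove {len(args.hosts)} hosts "
                     f"(limit {args.max_hosts}); raise --max-hosts deliberately")

    # Safeguard 2: dry-run by default; destructive action requires an explicit flag.
    if not args.execute:
        print("DRY RUN: would remove", ", ".join(args.hosts))
        return
    remove_servers(args.hosts)

if __name__ == "__main__":
    main()
```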
# Facebook BGP Outage (October 4, 2021)
What Happened
A routine maintenance command accidentally withdrew all of Facebook's BGP route advertisements. The internet's routers "forgot" how to reach Facebook's IP addresses, making Facebook (and Instagram, WhatsApp) unreachable globally for ~6 hours.
Impact
Before the withdrawal:
- Internet routers knew routes to Facebook's IPs
- User requests reached Facebook's servers
After the withdrawal:
- Internet routers had no routes to Facebook
- User requests failed with "host unreachable"
- Even Facebook's DNS servers were unreachable (their IPs were also BGP-advertised)
Why DNS Didn't Help
DNS resolution requires network connectivity. If BGP routes are gone, you can't query DNS servers to find IP addresses—the DNS servers themselves are unreachable.
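A small sketch makes the dependency concrete. The nameserver address below (192.0.2.53, from the TEST-NET-1 documentation range) stands in for a DNS server whose routes have been withdrawn; the point is that the DNS query is itself a routed network operation, so it fails before any name can be resolved:

```python
import socket

# 192.0.2.53 is in TEST-NET-1 (RFC 5737) and is unreachable by design; it
# stands in here for a DNS server whose BGP routes have been withdrawn.
NAMESERVER = "192.0.2.53"

try:
    # DNS normally uses UDP/53, but opening TCP/53 makes the failure explicit:
    # reaching the nameserver is itself a routed network operation.
    with socket.create_connection((NAMESERVER, 53), timeout=3):
        print("reached nameserver")
except OSError as exc:
    print(f"cannot reach the nameserver, so no names can be resolved: {exc}")
```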
Lessons Learned
- BGP is critical infrastructure—misconfigurations can make entire networks vanish
- BGP changes need safeguards: staged rollouts, automated validation, and a kill switch (a validation sketch follows this list)
- DNS depends on routing; if routing fails, DNS fails too
- Recovery was slow: engineers couldn't access datacenters remotely (the network was down) and had to enter them physically
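Real BGP safeguards live in routers and vendor-specific automation, but as an illustration of the automated-validation idea, a hypothetical pre-flight check could refuse any change that withdraws more than a small fraction of the prefixes currently advertised:

```python
def validate_bgp_change(current_prefixes: set[str],
                        proposed_prefixes: set[str],
                        max_withdraw_fraction: float = 0.05) -> None:
    """Reject a routing change that withdraws an unexpectedly large share of prefixes."""
    if not current_prefixes:
        return
    withdrawn = current_prefixes - proposed_prefixes
    fraction = len(withdrawn) / len(current_prefixes)
    if fraction > max_withdraw_fraction:
        raise RuntimeError(
            f"change withdraws {len(withdrawn)} of {len(current_prefixes)} prefixes "
            f"({fraction:.0%}); exceeds safety limit, requires manual approval")

# Example: withdrawing one of three prefixes passes with a relaxed limit,
# while withdrawing everything (as in the Facebook outage) would be rejected.
current = {"203.0.113.0/24", "198.51.100.0/24", "192.0.2.0/24"}
validate_bgp_change(current, current - {"192.0.2.0/24"}, max_withdraw_fraction=0.5)
# validate_bgp_change(current, set())  # would raise: withdraws all prefixes
```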
# GitHub Split-Brain (October 21, 2018)
What Happened
A 43-second network partition between GitHub's East Coast and West Coast datacenters caused both sides to promote MySQL replicas to primary, resulting in split-brain. Both datacenters accepted writes to the same database, creating conflicting data.
Impact
Normal operation:
- East Coast: MySQL primary (accepts writes)
- West Coast: MySQL replica (serves reads)
During the 43-second partition:
- East Coast: existing primary continued accepting writes
- West Coast: replica promoted to primary, also accepting writes
Result: divergent database state with conflicting writes on both sides
Recovery
GitHub rolled back to East Coast database state, discarding West Coast writes during the partition. They manually reconciled lost data by replaying Git operations from application logs.
Lessons Learned
- Quorum must be enforced strictly—even short partitions (43 seconds) cause split-brain
- Use odd number of nodes (3, 5, 7) to ensure only one side can have majority
- Automated failover needs careful quorum checks; automation without quorum is dangerous (a minimal check is sketched after this list)
- Application-level logs enabled data recovery when database diverged
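GitHub's real failover tooling is considerably more involved, but the quorum rule these lessons describe reduces to a strict-majority check, sketched here:

```python
def can_promote(votes_for_me: int, cluster_size: int) -> bool:
    """A replica may promote itself only if a strict majority of the full cluster agrees."""
    return votes_for_me > cluster_size // 2

# 5-node cluster split 3/2 by a partition: only the 3-node side may promote.
assert can_promote(votes_for_me=3, cluster_size=5) is True
assert can_promote(votes_for_me=2, cluster_size=5) is False

# An even split of a 4-node cluster leaves neither side with a majority,
# which is why odd cluster sizes (3, 5, 7) are preferred.
assert can_promote(votes_for_me=2, cluster_size=4) is False
```

The key detail is that the majority is computed against the full configured cluster size, not against whichever nodes a partitioned side can currently see.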
# Google Cloud Cooling Failure (June 2019)
What Happened
Cooling system failures in Google's europe-west2-a zone caused temperatures to rise. Automated systems shut down servers to prevent hardware damage, impacting services for hours.
Impact
- Compute Engine instances terminated in affected zone
- Persistent disks unavailable (zone-scoped resources)
- Services without multi-zone redundancy experienced outages
Lessons Learned
- Cooling failures are real outages—not just "ops problems"
- Physical infrastructure (power, cooling) creates failure domains
- Distribute workloads across multiple zones for high availability
- Automated shutdowns to protect hardware can cause service outages—design for this scenario
# Cloudflare BGP Leak (June 2019)
What Happened
A configuration error led to a BGP route leak that sent traffic intended for Cloudflare through a small ISP that couldn't handle the volume, disrupting Cloudflare services.
How Anycast Limited Blast Radius
Cloudflare uses anycast: the same IP addresses are advertised via BGP from many datacenters worldwide. Although the route leak affected some regions, other datacenters continued serving traffic, so the blast radius was regional rather than global.
Lessons Learned
- BGP route leaks can redirect traffic to unintended destinations
- Anycast architecture limits blast radius; a failure stays regional instead of going global
- BGP filtering and validation needed to prevent route leaks
# Common Patterns Across Incidents
Self-Dependencies
DynamoDB 2025: the control plane depended on itself (DynamoDB's DNS management used DynamoDB for state), creating a failure mode the system could not recover from on its own.
Cascade Failures
DynamoDB 2025: DynamoDB → EC2 → Lambda/ECS. A single failure cascaded through the dependency chain (a circuit breaker sketch follows).
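A circuit breaker fails fast instead of hammering a dependency that is already down, which keeps one failing service from dragging its callers down with it. A minimal sketch, with illustrative class name and thresholds:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: stop calling a failing dependency for a cool-off period."""

    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (calls allowed)

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: dependency presumed down, failing fast")
            # Cool-off elapsed: allow one trial call (half-open state).
            self.opened_at = None
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # success resets the failure count
        return result
```

Wrapping calls to a dependency (for example, a metadata lookup) in `breaker.call(...)` turns repeated timeouts into immediate, cheap failures that upstream services can handle gracefully.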
Human Error
S3 2017: a typo in a command. Facebook 2021: an accidental BGP withdrawal. Human error is inevitable, so systems must have safeguards.
Insufficient Quorum Checks
GitHub 2018: both datacenters believed they had quorum, which caused split-brain. Strict quorum enforcement prevents this.
Physical Infrastructure Matters
Google 2019: Cooling failure. Software resilience can't protect against physical failures—distribute across physical failure domains.
# Key Takeaways
- Control planes must not depend on themselves or systems they control for recovery
- Cascade failures occur when dependency chains aren't isolated—use circuit breakers, bulkheading
- Human error is constant—design for it with safeguards, confirmations, dry-run modes
- Quorum must be enforced strictly—even short partitions cause split-brain without proper quorum
- Physical infrastructure failures (power, cooling) are real—distribute across datacenters/zones
- BGP misconfigurations can make networks vanish—need staged rollouts and validation
- Control/data plane separation limits blast radius—data plane survives longer than control plane
- Test failure scenarios regularly—untested failover will fail when you need it