Notable Incidents & Lessons Learned
Real-world outages teach us more than any theoretical discussion. This section summarizes major incidents discussed throughout this chapter and extracts common patterns and lessons.
# AWS DynamoDB DNS Outage (October 19-20, 2025)
What Happened
A race condition in DynamoDB's DNS management system caused all IP addresses for the regional DynamoDB endpoint in us-east-1 to be removed from DNS. This triggered a cascading failure across multiple AWS services.
Timeline
- 11:48 PM Oct 19: DynamoDB DNS failure begins
- 12:38 AM Oct 20: Root cause identified (race condition)
- 2:25 AM: DNS restored
- 4:14 AM: EC2 host restarts initiated to clear queues
- 5:28 AM: EC2 lease management recovering
- 10:36 AM: Network propagation normalized
- 1:50 PM: EC2 fully recovered (14+ hours total)
Services Impacted
- DynamoDB (direct DNS failure)
- EC2 (instance launches failed, depended on DynamoDB for leases)
- Lambda, ECS, Fargate (depend on EC2 autoscaling)
- AWS Management Console, STS (authentication failures)
- NLB (health check cascade failures)
Root Causes
- Self-dependency: DynamoDB's DNS control plane used DynamoDB itself for state management. Circular dependency prevented recovery.
- Race condition: two DNS Enactors applied plans concurrently, and an older plan overwrote the current one (a minimal version-guard sketch follows this list)
- Cascade failure: EC2 depended on DynamoDB, Lambda/ECS depended on EC2
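The internals of DynamoDB's DNS management are not public as code, but a minimal sketch of the kind of guard that prevents a stale plan from overwriting a newer one might look like the following. The `DnsPlan` and `PlanStore` names are illustrative, not AWS's:

```python
import threading
from dataclasses import dataclass

@dataclass
class DnsPlan:
    version: int   # monotonically increasing plan number
    records: dict  # hostname -> list of IP addresses

class PlanStore:
    """Illustrative store that refuses to apply a plan older than the one in effect."""

    def __init__(self):
        self._lock = threading.Lock()
        self._applied = DnsPlan(version=0, records={})

    def apply(self, plan: DnsPlan) -> bool:
        # Compare-and-set on the version: a slow Enactor still holding a stale
        # plan cannot overwrite the record set produced by a newer plan.
        with self._lock:
            if plan.version <= self._applied.version:
                return False  # reject the stale plan instead of clobbering DNS
            self._applied = plan
            return True

store = PlanStore()
store.apply(DnsPlan(version=2, records={"dynamodb.us-east-1.amazonaws.com": ["198.51.100.7"]}))
assert store.apply(DnsPlan(version=1, records={})) is False  # stale plan is rejected
```

The design choice is the same one used by optimistic concurrency control: reject out-of-order writes rather than trusting that writers never race.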
Lessons Learned
- Control plane must never depend on itself (or systems it controls) for recovery
- Use external, simple, reliable systems for bootstrapping (not the same complex system)
- Test control plane failure scenarios: can the system recover without manual intervention?
- Dependency chains create cascade failures—map and isolate critical dependencies
# AWS S3 Outage (February 28, 2017)
What Happened
During routine maintenance, an engineer mistyped a command intended to remove a small number of servers from S3's billing subsystem. The typo removed a much larger set of servers, including critical S3 control plane systems (placement and index subsystems).
Impact
- S3 API unavailable for ~4 hours in us-east-1
- New bucket creation, object uploads failed (control plane down)
- Some GET requests continued working (data plane had cached metadata)
- Cascading impact on services depending on S3 (many AWS services use S3 internally)
Lessons Learned
- Control plane vs data plane separation prevented total failure (some reads survived)
- Human error is inevitable; design systems with safeguards such as confirmation prompts and a dry-run mode (see the sketch after this list)
- Cold-start recovery at scale is slow—S3 took hours to reload metadata for billions of objects
- Test disaster recovery scenarios including full control plane restart
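AWS's actual operational tooling is internal, but as a hypothetical illustration of the safeguards this lesson points to, a capacity-removal command might default to dry-run and enforce a blast-radius cap:

```python
import argparse

def remove_servers(targets):
    # Placeholder for the real decommission call.
    for host in targets:
        print(f"removing {host}")

def main():
    parser = argparse.ArgumentParser(description="decommission capacity (illustrative)")
    parser.add_argument("hosts", nargs="+")
    parser.add_argument("--execute", action="store_true",
                        help="actually remove; default is dry-run")
    parser.add_argument("--max-hosts", type=int, default=5,
                        help="refuse to remove more than this many hosts")
    args = parser.parse_args()

    # Safeguard 1: cap the blast radius; a typo that expands the target set
    # hits this limit instead of taking out a whole subsystem.
    if len(args.hosts) > args.max_hosts:
        parser.error(f"refusing to remove {len(args.hosts)} hosts "
                     f"(limit {args.max_hosts}); raise --max-hosts deliberately")

    # Safeguard 2: dry-run by default; destructive action requires an explicit flag.
    if not args.execute:
        print("DRY RUN: would remove", ", ".join(args.hosts))
        return
    remove_servers(args.hosts)

if __name__ == "__main__":
    main()
```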
# Facebook BGP Outage (October 4, 2021)
What Happened
A routine maintenance command accidentally withdrew all of Facebook's BGP route advertisements. The internet's routers "forgot" how to reach Facebook's IP addresses, making Facebook (and Instagram, WhatsApp) unreachable globally for ~6 hours.
Impact
Before the withdrawal:
- Internet routers knew routes to Facebook's IPs
- User requests reached Facebook's servers
After the withdrawal:
- Internet routers had no routes to Facebook
- User requests failed with "host unreachable"
- Even Facebook's DNS servers were unreachable (their IPs were also BGP-advertised)
Why DNS Didn't Help
DNS resolution requires network connectivity. If BGP routes are gone, you can't query DNS servers to find IP addresses—the DNS servers themselves are unreachable.
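A small sketch makes the dependency concrete. The nameserver address below (192.0.2.53, from the TEST-NET-1 documentation range) stands in for a DNS server whose routes have been withdrawn; the point is that the DNS query is itself a routed network operation, so it fails before any name can be resolved:

```python
import socket

# 192.0.2.53 is in TEST-NET-1 (RFC 5737) and is unreachable by design; it
# stands in here for a DNS server whose BGP routes have been withdrawn.
NAMESERVER = "192.0.2.53"

try:
    # DNS normally uses UDP/53, but opening TCP/53 makes the failure explicit:
    # reaching the nameserver is itself a routed network operation.
    with socket.create_connection((NAMESERVER, 53), timeout=3):
        print("reached nameserver")
except OSError as exc:
    print(f"cannot reach the nameserver, so no names can be resolved: {exc}")
```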
Lessons Learned
- BGP is critical infrastructure—misconfigurations can make entire networks vanish
- BGP changes need safeguards: staged rollouts, automated validation, and a kill switch (a validation sketch follows this list)
- DNS depends on routing; if routing fails, DNS fails too
- Recovery was slow: engineers couldn't access datacenters remotely (the network was down) and had to enter them physically
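Real BGP safeguards live in routers and vendor-specific automation, but as an illustration of the automated-validation idea, a hypothetical pre-flight check could refuse any change that withdraws more than a small fraction of the prefixes currently advertised:

```python
def validate_bgp_change(current_prefixes: set[str],
                        proposed_prefixes: set[str],
                        max_withdraw_fraction: float = 0.05) -> None:
    """Reject a routing change that withdraws an unexpectedly large share of prefixes."""
    if not current_prefixes:
        return
    withdrawn = current_prefixes - proposed_prefixes
    fraction = len(withdrawn) / len(current_prefixes)
    if fraction > max_withdraw_fraction:
        raise RuntimeError(
            f"change withdraws {len(withdrawn)} of {len(current_prefixes)} prefixes "
            f"({fraction:.0%}); exceeds safety limit, requires manual approval")

# Example: withdrawing one of three prefixes passes with a relaxed limit,
# while withdrawing everything (as in the Facebook outage) would be rejected.
current = {"203.0.113.0/24", "198.51.100.0/24", "192.0.2.0/24"}
validate_bgp_change(current, current - {"192.0.2.0/24"}, max_withdraw_fraction=0.5)
# validate_bgp_change(current, set())  # would raise: withdraws all prefixes
```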
# GitHub Split-Brain (October 21, 2018)
What Happened
A 43-second network partition between GitHub's East Coast and West Coast datacenters caused both sides to promote MySQL replicas to primary, resulting in split-brain. Both datacenters accepted writes to the same database, creating conflicting data.
Impact
Normal operation:
- East Coast: MySQL primary (accepts writes)
- West Coast: MySQL replica (serves reads)
During the 43-second partition:
- East Coast: existing primary continued accepting writes
- West Coast: replica promoted to primary, also accepting writes
Result: divergent database state with conflicting writes on both sides
Recovery
GitHub rolled back to East Coast database state, discarding West Coast writes during the partition. They manually reconciled lost data by replaying Git operations from application logs.
Lessons Learned
- Quorum must be enforced strictly—even short partitions (43 seconds) cause split-brain
- Use odd number of nodes (3, 5, 7) to ensure only one side can have majority
- Automated failover needs careful quorum checks; automation without quorum is dangerous (a minimal check is sketched after this list)
- Application-level logs enabled data recovery when database diverged
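GitHub's real failover tooling is considerably more involved, but the quorum rule these lessons describe reduces to a strict-majority check, sketched here:

```python
def can_promote(votes_for_me: int, cluster_size: int) -> bool:
    """A replica may promote itself only if a strict majority of the full cluster agrees."""
    return votes_for_me > cluster_size // 2

# 5-node cluster split 3/2 by a partition: only the 3-node side may promote.
assert can_promote(votes_for_me=3, cluster_size=5) is True
assert can_promote(votes_for_me=2, cluster_size=5) is False

# An even split of a 4-node cluster leaves neither side with a majority,
# which is why odd cluster sizes (3, 5, 7) are preferred.
assert can_promote(votes_for_me=2, cluster_size=4) is False
```

The key detail is that the majority is computed against the full configured cluster size, not against whichever nodes a partitioned side can currently see.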
# Google Cloud Cooling Failure (June 2019)
What Happened
Cooling system failures in Google's europe-west2-a zone caused temperatures to rise. Automated systems shut down servers to prevent hardware damage, impacting services for hours.
Impact
- Compute Engine instances terminated in affected zone
- Persistent disks unavailable (zone-scoped resources)
- Services without multi-zone redundancy experienced outages
Lessons Learned
- Cooling failures are real outages—not just "ops problems"
- Physical infrastructure (power, cooling) creates failure domains
- Distribute workloads across multiple zones for high availability
- Automated shutdowns to protect hardware can cause service outages—design for this scenario
# Cloudflare BGP Leak (June 2019)
What Happened
A configuration error led to a BGP route leak that sent traffic intended for Cloudflare through a small ISP that couldn't handle the volume, disrupting Cloudflare services.
How Anycast Limited Blast Radius
Cloudflare uses anycast: the same IP addresses are advertised via BGP from many datacenters worldwide. Although the route leak affected some regions, other datacenters continued serving traffic, so the blast radius was regional rather than global.
Lessons Learned
- BGP route leaks can redirect traffic to unintended destinations
- Anycast architecture limits blast radius; a failure stays regional instead of going global
- BGP filtering and validation needed to prevent route leaks
# Common Patterns Across Incidents
Self-Dependencies
DynamoDB 2025: the control plane depended on itself (DynamoDB's DNS management used DynamoDB for state), creating a failure mode the system could not recover from on its own.
Cascade Failures
DynamoDB 2025: DynamoDB → EC2 → Lambda/ECS. A single failure cascaded through the dependency chain (a circuit breaker sketch follows).
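A circuit breaker fails fast instead of hammering a dependency that is already down, which keeps one failing service from dragging its callers down with it. A minimal sketch, with illustrative class name and thresholds:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: stop calling a failing dependency for a cool-off period."""

    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (calls allowed)

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: dependency presumed down, failing fast")
            # Cool-off elapsed: allow one trial call (half-open state).
            self.opened_at = None
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # success resets the failure count
        return result
```

Wrapping calls to a dependency (for example, a metadata lookup) in `breaker.call(...)` turns repeated timeouts into immediate, cheap failures that upstream services can handle gracefully.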
Human Error
S3 2017: a typo in a command. Facebook 2021: an accidental BGP withdrawal. Human error is inevitable, so systems must have safeguards.
Insufficient Quorum Checks
GitHub 2018: both datacenters believed they had quorum, which caused split-brain. Strict quorum enforcement prevents this.
Physical Infrastructure Matters
Google 2019: Cooling failure. Software resilience can't protect against physical failures—distribute across physical failure domains.
# Key Takeaways
- Control planes must not depend on themselves or systems they control for recovery
- Cascade failures occur when dependency chains aren't isolated—use circuit breakers, bulkheading
- Human error is constant—design for it with safeguards, confirmations, dry-run modes
- Quorum must be enforced strictly—even short partitions cause split-brain without proper quorum
- Physical infrastructure failures (power, cooling) are real—distribute across datacenters/zones
- BGP misconfigurations can make networks vanish—need staged rollouts and validation
- Control/data plane separation limits blast radius—data plane survives longer than control plane
- Test failure scenarios regularly—untested failover will fail when you need it