Chapter 6

Redundancy & Resilient System Design

Systems fail. Hardware fails, networks partition, software has bugs, and humans make mistakes. The difference between a minor glitch and a major outage lies in how systems are designed to handle failures. This chapter explores the principles and patterns for building resilient systems that degrade gracefully, recover automatically, and limit the blast radius of failures.

Resilience isn't just about redundancy—it's about understanding failure modes, separating concerns, managing state carefully, and testing that your failover actually works. We'll examine real-world outages (AWS, GitHub, Facebook, Cloudflare) to understand what went wrong and what we can learn.

This chapter covers practical techniques for designing resilient systems:

  • Failure Domains: How to limit blast radius and prevent correlated failures
  • Redundancy Models: N+1, N+2, 2N, active-active vs active-passive trade-offs
  • Control vs Data Plane: Why separating them prevents cascading failures
  • State Management: Consensus, quorum, and avoiding split-brain scenarios
  • Geographic Distribution: Multi-datacenter patterns and consistency trade-offs
  • Building for Resilience: Dev/QA/prod separation, deployment strategies, dependency management
  • Graceful Degradation: Circuit breakers, retries, timeouts, and health checks
  • Testing Resilience: Chaos engineering and fault injection

By the end of this chapter, you'll understand how to design systems that survive failures—and know what questions to ask when reviewing architecture for resilience.

# Chapter Sections

Failure Domains & Blast Radius

What a failure domain is (electrical, network, geographical). Separating failure domains to limit blast radius. Correlated failures and how to avoid them. Examples from datacenter infrastructure and cloud environments.

Redundancy Models

N+1, N+2, 2N redundancy explained with cost/availability trade-offs. Active-active vs active-passive architectures. When to use which model. Comparison tables and diagrams.
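The cost/availability trade-off between these models can be made concrete with a little arithmetic. The sketch below is illustrative only and rests on two loudly stated assumptions: unit failures are independent (correlated failures break this math), and every model is treated as a pool of interchangeable units of which `n` must be up, which simplifies how 2N is often deployed in practice. The function names and the 0.99 per-unit availability are hypothetical.

```python
from math import comb


def k_of_n_availability(n_total: int, k_required: int, unit_availability: float) -> float:
    """Probability that at least k_required of n_total units are up.

    Assumes independent unit failures; this is a binomial tail sum.
    """
    p = unit_availability
    return sum(
        comb(n_total, up) * p**up * (1 - p) ** (n_total - up)
        for up in range(k_required, n_total + 1)
    )


def redundancy_models(n: int, unit_availability: float) -> dict[str, float]:
    """Compare redundancy models for a service that needs n units to carry load."""
    a = unit_availability
    return {
        "N": k_of_n_availability(n, n, a),          # no spare capacity
        "N+1": k_of_n_availability(n + 1, n, a),    # one spare unit
        "N+2": k_of_n_availability(n + 2, n, a),    # two spare units
        "2N": k_of_n_availability(2 * n, n, a),     # full duplicate set
    }


if __name__ == "__main__":
    # Hypothetical numbers: 4 units needed, each 99% available.
    for model, availability in redundancy_models(4, 0.99).items():
        print(f"{model:>4}: {availability:.9f}")
```

Each additional spare buys less availability than the one before it while costing the same, which is why the choice of model depends on how expensive an outage is relative to idle capacity.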

Control Plane vs Data Plane Separation

What the control plane and data plane are, and why to separate them. Metadata vs data: when control plane failures cascade. Real-world examples: AWS S3 2017 outage (control plane down, data plane partially survived) and AWS DynamoDB October 2025 outage (DNS control plane self-dependency).

State Management & Consensus

Stateless vs stateful services. Quorum and split-brain scenarios. Consensus protocols (Raft/Paxos) at high level. Real-world example: GitHub 2018 split-brain incident and recovery. State replication strategies.
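The link between quorum and split-brain comes down to simple counting: if a leader can only be elected by a strict majority, two sides of a network partition can never both hold one. The sketch below shows that arithmetic under the assumption of a majority-quorum scheme, as used by Raft-style protocols; the function names are hypothetical and this is not any particular library's API.

```python
def majority_quorum(cluster_size: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return cluster_size // 2 + 1


def tolerated_failures(cluster_size: int) -> int:
    """How many nodes can be lost while a majority is still reachable."""
    return cluster_size - majority_quorum(cluster_size)


def partition_can_elect_leader(partition_size: int, cluster_size: int) -> bool:
    """A partition may elect a leader only if it holds a strict majority."""
    return partition_size >= majority_quorum(cluster_size)


if __name__ == "__main__":
    for size in (3, 4, 5):
        print(
            f"cluster={size} quorum={majority_quorum(size)} "
            f"tolerates={tolerated_failures(size)} failures"
        )
```

A 4-node cluster tolerates the same single failure as a 3-node cluster, which is why odd cluster sizes are the usual choice.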

Geographic Distribution

Multi-datacenter architectures. Consistency vs latency (CAP theorem in practice, not theory). Replication strategies (sync/async/semi-sync). Examples: Netflix multi-region, Google Spanner, S3 cross-region replication.

Building Resilient Systems: Practical Guidance

Environment separation (dev/qa/prod). Dependency lifecycle management (DNS, auth, monitoring). Testing strategies and deployment patterns (blue-green, canary, rolling). Service dependencies and circuit breakers. AWS fault isolation boundaries guidance.

Graceful Degradation

Retry strategies and exponential backoff. Circuit breakers in practice (with pseudocode example). Timeout configuration. Health checks that actually work. Bulkheading and failure isolation.
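As a rough illustration of two of these techniques, the sketch below pairs a capped exponential backoff with jitter and a minimal circuit breaker. It is a simplified, single-threaded sketch, not the chapter's own pseudocode: the class and function names, the thresholds, and the injectable `clock` and `rng` parameters are all assumptions made for illustration.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 5.0,
                  rng: Callable[[], float] = random.random) -> float:
    """Jittered delay: a random fraction of min(cap, base * 2**attempt)."""
    return rng() * min(cap, base * 2 ** attempt)


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._failure_threshold = failure_threshold
        self._reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means closed

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self._reset_timeout:
            return "half-open"  # allow a trial call through
        return "open"

    def call(self, fn: Callable[[], T]) -> T:
        if self.state == "open":
            raise CircuitOpenError("failing fast; dependency assumed unhealthy")
        try:
            result = fn()
        except Exception:
            self._failures += 1
            # A failed half-open trial, or too many failures while closed,
            # (re)opens the circuit and restarts the reset timer.
            if self._opened_at is not None or self._failures >= self._failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would typically wrap each outbound request in `breaker.call(...)` and, on an ordinary failure, sleep for `backoff_delay(attempt)` before retrying; a `CircuitOpenError` is a signal to stop retrying and fall back to degraded behavior instead.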

Testing Resilience

Chaos engineering principles. Fault injection examples (network latency, packet loss, process kills). Game days and disaster recovery drills. What to test and when. Building confidence in failover mechanisms.
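Fault injection does not have to start at the network layer. As a minimal sketch, the wrapper below injects latency and failures into an ordinary function call so that timeouts, retries, and fallbacks can be exercised in a test. It is an application-level stand-in, not the API of any chaos engineering tool; `with_faults`, `InjectedFault`, and all parameters are hypothetical names chosen for illustration.

```python
import random
import time
from typing import Any, Callable, Optional


class InjectedFault(RuntimeError):
    """Marks a failure that was injected deliberately, not a real error."""


def with_faults(fn: Callable[..., Any], *, error_rate: float = 0.0,
                added_latency_s: float = 0.0,
                rng: Optional[random.Random] = None,
                sleep: Callable[[float], None] = time.sleep) -> Callable[..., Any]:
    """Wrap fn so calls are delayed and fail with probability error_rate."""
    rng = rng or random.Random()

    def wrapper(*args: Any, **kwargs: Any) -> Any:
        if added_latency_s > 0:
            sleep(added_latency_s)  # simulate a slow dependency
        if rng.random() < error_rate:
            raise InjectedFault("injected failure")  # simulate an outage
        return fn(*args, **kwargs)

    return wrapper


if __name__ == "__main__":
    # Hypothetical dependency call wrapped with 20% failures and 50 ms delay.
    flaky_lookup = with_faults(lambda key: f"value-for-{key}",
                               error_rate=0.2, added_latency_s=0.05,
                               rng=random.Random(42))
    try:
        print(flaky_lookup("user:1"))
    except InjectedFault as exc:
        print(f"caller must handle: {exc}")
```

Passing a seeded `random.Random` keeps a failing experiment reproducible, which matters when a drill uncovers a bug that needs to be replayed.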

Notable Incidents & Lessons Learned

Analysis of major outages: AWS DynamoDB October 2025 (DNS race condition, cascade failures), Facebook BGP October 2021 (routes withdrawn, DNS unreachable), GitHub 2018 (split-brain recovery), Cloudflare BGP leak 2019 (anycast limiting blast radius). Common patterns and lessons.