Redundancy Models

Redundancy is insurance against failure. But how much redundancy is enough? The answer depends on your availability requirements, budget, and failure modes. This section explores common redundancy models and when to use each.

# N+1 Redundancy

Definition: You need N components to handle the load. You deploy N+1, so one can fail without service degradation.

Required Capacity: 4 servers (N=4)
Deployed: 5 servers (N+1)

Normal Operation:
    [S1] [S2] [S3] [S4] [S5]  <- 5 servers, fleet 80% utilized
     25%  25%  25%  25%   0%     share of load (S5 is spare capacity)

One Server Fails:
    [S1] [S2] [S3] [S4] [S5-X]  <- 4 servers remain
     25%  25%  25%  25%   ---     Still enough capacity

Two Servers Fail:
    [S1] [S2] [S3-X] [S4] [S5-X]  <- 3 servers remain
     33%  33%   ---   33%   ---     Degraded (75% of required capacity)

Pros: Cost-effective (only 25% extra capacity in this example). Handles single failures.

Cons: A second concurrent failure causes degradation. No protection against correlated failures (for example, an entire rack losing power).

Best for: Non-critical workloads, budget-constrained environments, stateless services where scaling down temporarily is acceptable.
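The arithmetic behind the N+1 diagrams can be sketched in a few lines. This is a minimal illustration, assuming identical servers that share load evenly; the function name `fleet_status` and its fields are hypothetical, not part of any library.

```python
def fleet_status(required: int, deployed: int, failed: int) -> dict:
    """Summarize a fleet of identical servers after some have failed.

    `required` is N, the number of servers needed to carry the full load.
    """
    if failed < 0 or failed > deployed:
        raise ValueError("failed must be between 0 and deployed")
    remaining = deployed - failed
    # Fraction of the required capacity still available, capped at 1.0.
    capacity = min(1.0, remaining / required)
    # Utilization of the survivors if they share the full load evenly.
    utilization = required / remaining if remaining else float("inf")
    return {
        "remaining": remaining,
        "capacity": capacity,
        "utilization": utilization,
        "degraded": remaining < required,
    }


# N+1 with N=4: zero, one, and two failures.
for failed in range(3):
    print(failed, fleet_status(required=4, deployed=5, failed=failed))
```

A utilization above 1.0 means the surviving servers cannot carry the full load, which is the degraded case in the last diagram.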

# N+2 Redundancy

Definition: Deploy N+2 components, tolerating two concurrent failures.

Required: 4 servers (N=4)
Deployed: 6 servers (N+2)

Normal: 67% utilized (6 servers, need 4)
One failure: 80% utilized (5 servers)
Two failures: 100% utilized (4 servers) <- still sufficient
Three failures: degraded

Pros: Tolerates two failures. Covers the case where one server is offline for maintenance and another fails unexpectedly.

Cons: Higher cost (50% overhead in this example).

Best for: Critical services, environments with frequent maintenance, higher availability requirements (99.9%+).
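To make an availability target concrete, it helps to convert the percentage into a yearly downtime budget. The sketch below assumes a 365-day year; the helper name `downtime_hours_per_year` is hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(availability_percent: float) -> float:
    """Return the yearly downtime allowed by an availability target."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


for target in (99.9, 99.95, 99.99):
    print(f"{target}% -> {downtime_hours_per_year(target):.2f} hours/year")
```

A 99.9% target allows roughly 8.76 hours of downtime per year, and each additional "nine" cuts that budget by a factor of ten.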

# 2N (Dual Redundancy)

Definition: Deploy twice the required capacity (100% redundancy).

Required: 4 servers (N=4)
Deployed: 8 servers (2N)

Normal: 50% utilized (8 servers, need 4)
Four failures: 100% utilized (4 servers remain) <- still sufficient

Pros: Extremely high availability. Can lose an entire failure domain (rack, AZ) and continue, provided capacity is split evenly across domains. Allows aggressive maintenance schedules.

Cons: Expensive (100% capacity overhead). Wasteful if failures are rare.

Best for: Mission-critical infrastructure (payment processing, healthcare systems), SLA requirements of 99.99%+, compliance-driven environments.
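Whether a 2N fleet survives the loss of a failure domain depends on how the servers are placed. The following sketch checks a placement against that requirement; the zone names, server names, and the function `survives_domain_loss` are illustrative assumptions.

```python
from collections import Counter


def survives_domain_loss(placement: dict[str, str], required: int) -> bool:
    """Return True if losing any one failure domain leaves >= `required` servers.

    `placement` maps a server name to its failure domain (rack, AZ, ...).
    """
    per_domain = Counter(placement.values())
    total = len(placement)
    return all(total - count >= required for count in per_domain.values())


# 2N with N=4, split evenly across two hypothetical zones.
balanced = {f"s{i}": ("az-a" if i < 4 else "az-b") for i in range(8)}
# The same 8 servers, but 6 of them share one zone.
skewed = {f"s{i}": ("az-a" if i < 6 else "az-b") for i in range(8)}

print(survives_domain_loss(balanced, required=4))
print(survives_domain_loss(skewed, required=4))
```

The balanced placement passes and the skewed one does not, even though both deploy the same eight servers.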

# Comparison Table

+----------------+---------+------------+--------------+-------------+
| Model          | Extra   | Failures   | Cost         | Typical     |
|                | Capacity| Tolerated  | Overhead     | Availability|
+----------------+---------+------------+--------------+-------------+
| N+1            | 1 unit  | 1          | Low          | 99.9%       |
|                |         |            | (~20-25%)    |             |
+----------------+---------+------------+--------------+-------------+
| N+2            | 2 units | 2          | Medium       | 99.95%      |
|                |         |            | (~40-50%)    |             |
+----------------+---------+------------+--------------+-------------+
| 2N             | N units | Up to N    | High         | 99.99%+     |
|                | (100%)  | (half      | (100%)       |             |
|                |         | fleet)     |              |             |
+----------------+---------+------------+--------------+-------------+
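For a rough feel of how extra units change availability, the fleet can be modeled as a k-of-n system. This sketch assumes every server is up independently with the same probability `p_up` (0.99 here is an arbitrary illustrative value); correlated failures, which the N+1 section warns about, break that assumption. The output is not expected to reproduce the "typical" figures in the table, which depend on many other factors.

```python
from math import comb


def fleet_availability(required: int, deployed: int, p_up: float) -> float:
    """Probability that at least `required` of `deployed` servers are up.

    Assumes each server is up independently with probability `p_up`.
    """
    return sum(
        comb(deployed, k) * p_up**k * (1 - p_up) ** (deployed - k)
        for k in range(required, deployed + 1)
    )


for name, deployed in (("N", 4), ("N+1", 5), ("N+2", 6), ("2N", 8)):
    print(f"{name:>3}: {fleet_availability(4, deployed, p_up=0.99):.6f}")
```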

# Active-Active vs Active-Passive

Beyond capacity planning, how redundant components operate matters for performance and failover speed.

## Active-Active Architecture

All redundant components actively serve traffic simultaneously. Load is distributed across all instances.

Load Balancer
     |
     +----------+----------+
     |          |          |
  [Web 1]   [Web 2]   [Web 3]  <- All serving requests
   Active    Active    Active     Load distributed 33/33/33%

One Fails:
  [Web 1]   [Web 2-X]   [Web 3]  <- Traffic redistributes
   Active      ---      Active     automatically to 1 and 3
   50%                    50%      (50/50% split)

Pros:

- No idle standby: every instance serves traffic (headroom is still needed to absorb a failure)
- Fast failover: the load balancer stops routing to a failed instance as soon as health checks detect it
- Horizontal scaling: add more instances to increase capacity
- Better resource utilization

Cons:

- More complex: requires load balancing and state management
- Session handling complexity (sticky sessions or shared state)
- Harder to guarantee consistency (if stateful)

Best for: Stateless web services, APIs, microservices, read-heavy databases (read replicas).
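The redistribution shown in the diagram can be sketched as an even split across whichever instances pass their health check. This is a simplified model, assuming equal weights; real load balancers offer other policies, and the function name `distribute_load` is hypothetical.

```python
def distribute_load(instances: dict[str, bool]) -> dict[str, float]:
    """Split traffic evenly across healthy instances.

    `instances` maps an instance name to its health-check result.
    Returns each healthy instance's share of traffic (0.0 to 1.0).
    """
    healthy = [name for name, ok in instances.items() if ok]
    if not healthy:
        raise RuntimeError("no healthy instances to route to")
    share = 1 / len(healthy)
    return {name: share for name in healthy}


print(distribute_load({"web1": True, "web2": True, "web3": True}))
print(distribute_load({"web1": True, "web2": False, "web3": True}))
```

With `web2` marked unhealthy, the remaining two instances each take half of the traffic, matching the 50/50 split above.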

## Active-Passive Architecture

One component (the primary) actively serves traffic. The backup (the secondary) stays on standby and takes over only when the primary fails.

Normal Operation:
  [Primary DB]  <-- All writes and reads
     Active
       |
  [Standby DB]  <-- Receiving replication, not serving
     Passive         traffic (warm standby)

Primary Fails:
  [Primary DB-X]
      ---
       |
  [Standby DB]  <-- Promoted to primary
     Active         Now serves all traffic

Pros:

- Simpler: no distributed state or coordination during normal operation
- Guaranteed consistency (single writer)
- Easier to reason about

Cons:

- Wasted capacity: the standby sits idle
- Slower failover: the system must detect the failure, then promote the standby (10-60 seconds is typical)
- Risk of split-brain if both nodes become active
- Doesn't scale capacity (only availability)

Best for: Write-heavy databases, stateful services, systems requiring strong consistency, legacy applications not designed for active-active.
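The detect-then-promote sequence can be sketched as a small heartbeat monitor. This is a toy model under stated assumptions: the class `FailoverMonitor`, the three-miss threshold, and the timing numbers are all illustrative, and a production system would also fence the old primary to avoid split-brain.

```python
from dataclasses import dataclass


@dataclass
class FailoverMonitor:
    """Promote the standby after `threshold` consecutive missed heartbeats."""

    threshold: int = 3
    missed: int = 0
    active: str = "primary"

    def observe(self, heartbeat_ok: bool) -> str:
        """Record one heartbeat check and return the currently active node."""
        if self.active == "standby":
            return self.active  # already failed over; no automatic failback
        self.missed = 0 if heartbeat_ok else self.missed + 1
        if self.missed >= self.threshold:
            # A real system would fence the old primary before promoting.
            self.active = "standby"
        return self.active


def estimated_failover_seconds(interval_s: float, threshold: int, promote_s: float) -> float:
    """Detection time (threshold missed checks) plus promotion time."""
    return interval_s * threshold + promote_s


monitor = FailoverMonitor(threshold=3)
for ok in (True, False, False, False):
    state = monitor.observe(ok)
print(state)
print(estimated_failover_seconds(interval_s=5, threshold=3, promote_s=15))
```

Requiring several consecutive misses trades failover speed for fewer false promotions; shortening the interval or threshold moves the estimate toward the low end of the 10-60 second range.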

## Comparison: Active-Active vs Active-Passive

+---------------------+--------------------+--------------------+
| Characteristic      | Active-Active      | Active-Passive     |
+---------------------+--------------------+--------------------+
| Resource Use        | All instances      | Primary only,      |
|                     | serve traffic      | standby idle       |
+---------------------+--------------------+--------------------+
| Failover Speed      | Fast (after health | 10-60 seconds      |
|                     | check detects it)  | (detect + promote) |
+---------------------+--------------------+--------------------+
| Complexity          | High (state        | Low (single        |
|                     | management)        | active instance)   |
+---------------------+--------------------+--------------------+
| Consistency         | Eventual           | Strong (single     |
|                     | (if stateful)      | writer)            |
+---------------------+--------------------+--------------------+
| Scaling             | Horizontal         | No (for capacity)  |
|                     | (add instances)    | Yes (for HA)       |
+---------------------+--------------------+--------------------+
| Cost Efficiency     | High (all active)  | Low (standby idle) |
+---------------------+--------------------+--------------------+

# When to Use Which Model

Stateless services (web, API): Active-active with N+1 or N+2 redundancy. Load balance across instances, auto-scale based on demand.

Databases (write-heavy): Active-passive with N+1 (primary + standby). Promote standby on failure. Consider N+2 for maintenance window + unexpected failure.

Databases (read-heavy): Active-passive primary for writes + active-active read replicas. Writes go to primary, reads distributed across replicas.

Mission-critical infrastructure: 2N redundancy with active-active (if possible) or active-passive (if consistency required). Deploy across multiple failure domains.
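The guidance above can be condensed into a small decision helper. This is only a sketch of the rules of thumb in this section; the function `recommend` and its three boolean inputs are hypothetical simplifications, and real decisions also weigh budget and SLA.

```python
def recommend(stateless: bool, write_heavy: bool, mission_critical: bool) -> str:
    """Map workload traits to the guidance given in this section."""
    if mission_critical:
        return "2N across multiple failure domains"
    if stateless:
        return "active-active with N+1 or N+2"
    if write_heavy:
        return "active-passive with N+1 (primary + standby)"
    # Stateful and read-heavy: single writer plus distributed reads.
    return "active-passive primary for writes + active-active read replicas"


print(recommend(stateless=True, write_heavy=False, mission_critical=False))
print(recommend(stateless=False, write_heavy=True, mission_critical=False))
print(recommend(stateless=False, write_heavy=False, mission_critical=False))
```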

# Key Takeaways

- N+1 tolerates one failure (cost-effective), N+2 tolerates two (higher availability), 2N tolerates losing half the fleet (mission-critical)
- Active-active uses all capacity and fails over fast, but needs complex state management
- Active-passive wastes standby capacity and fails over more slowly, but is simpler and strongly consistent
- Choose a model based on: availability SLA, budget, failure frequency, consistency requirements
- Stateless services: active-active. Write-heavy databases: active-passive. Read-heavy: hybrid.