Redundancy Models
Redundancy is insurance against failure. But how much redundancy is enough? The answer depends on your availability requirements, budget, and failure modes. This section explores common redundancy models and when to use each.
# N+1 Redundancy
Definition: You need N components to handle the load. You deploy N+1, so one can fail without service degradation.
Required Capacity: 4 servers (N=4)
Deployed: 5 servers (N+1)

Normal Operation:
[S1]  [S2]  [S3]    [S4]  [S5]      <- 5 servers, fleet 80% utilized
25%   25%   25%     25%   0%        Share of load (S5 is spare capacity)

One Server Fails:
[S1]  [S2]  [S3]    [S4]  [S5-X]    <- 4 servers remain
25%   25%   25%     25%   ---       Still enough capacity

Two Servers Fail:
[S1]  [S2]  [S3-X]  [S4]  [S5-X]    <- 3 servers remain
33%   33%   ---     33%   ---       Degraded (75% capacity)
Pros: Cost-effective (only 25% extra capacity in this example). Handles single failures.
Cons: A second concurrent failure causes degradation. No protection against correlated failures (for example, an entire rack losing power).
Best for: Non-critical workloads, budget-constrained environments, stateless services where scaling down temporarily is acceptable.
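The arithmetic in the N+1 example can be reproduced with a small helper. This is a minimal sketch, not a sizing tool: the function name `fleet_status` and its return keys are hypothetical, and it assumes identical servers with load spread evenly across the survivors.

```python
def fleet_status(required: int, deployed: int, failed: int) -> dict:
    """Summarize an N+k fleet after `failed` servers are lost.

    Hypothetical helper: assumes identical servers and evenly spread load.
    """
    if required <= 0 or deployed < 0 or not 0 <= failed <= deployed:
        raise ValueError("need required > 0 and 0 <= failed <= deployed")
    remaining = deployed - failed
    # Fraction of the required load the survivors can carry, capped at 100%.
    capacity = min(1.0, remaining / required)
    # Fleet-wide utilization of the survivors, capped at 100% (0 if none remain).
    utilization = min(1.0, required / remaining) if remaining else 0.0
    return {
        "remaining": remaining,
        "capacity": capacity,
        "utilization": utilization,
        "degraded": remaining < required,
    }


if __name__ == "__main__":
    # N=4, deployed N+1=5: walk through zero, one, and two failures.
    for failed in range(3):
        print(failed, fleet_status(required=4, deployed=5, failed=failed))
```

The same helper covers N+2 and 2N by changing `deployed` to 6 or 8.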
# N+2 Redundancy
Definition: Deploy N+2 components, tolerating two concurrent failures.
Required: 4 servers (N=4)
Deployed: 6 servers (N+2)
Normal: 67% utilized (6 servers, need 4)
One failure: 80% utilized (5 servers)
Two failures: 100% utilized (4 servers) <- still sufficient
Three failures: degraded
Pros: Tolerates two failures. Good for maintenance (take one offline) + unexpected failure scenario.
Cons: Higher cost (50% overhead in this example).
Best for: Critical services, environments with frequent maintenance, higher availability requirements (99.9%+).
# 2N (Dual Redundancy)
Definition: Deploy twice the required capacity (100% redundancy).
Required: 4 servers (N=4)
Deployed: 8 servers (2N)
Normal: 50% utilized (8 servers, need 4)
Four failures: 100% utilized (4 servers remain) <- still sufficient
Pros: Extremely high availability. Can lose entire failure domain (rack, AZ) and continue. Allows aggressive maintenance schedules.
Cons: Expensive (100% capacity overhead). Wasteful if failures are rare.
Best for: Mission-critical infrastructure (payment processing, healthcare systems), SLA requirements of 99.99%+, compliance-driven environments.
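Whether a 2N deployment really survives the loss of a whole failure domain depends on how the servers are spread across domains. The sketch below checks that condition. The function name `survives_domain_loss` and the rack labels are hypothetical, and it assumes identical servers and the loss of exactly one domain at a time.

```python
def survives_domain_loss(servers_per_domain: dict[str, int], required: int) -> bool:
    """Return True if losing any single failure domain still leaves `required` servers.

    Hypothetical helper: domains are racks or AZs, servers are interchangeable.
    """
    if not servers_per_domain:
        return False
    total = sum(servers_per_domain.values())
    # Check every domain; the worst case is losing the largest one.
    return all(total - count >= required for count in servers_per_domain.values())


if __name__ == "__main__":
    # 2N with N=4: eight servers split evenly across two racks.
    print(survives_domain_loss({"rack-a": 4, "rack-b": 4}, required=4))
    # Same eight servers, unevenly placed: losing rack-a leaves only two.
    print(survives_domain_loss({"rack-a": 6, "rack-b": 2}, required=4))
```

The second call illustrates that the server count alone is not sufficient: placement across domains matters as much as the total.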
# Comparison Table
+-------+----------------+----------------------+------------------+----------------------+
| Model | Extra Capacity | Failures Tolerated   | Cost Overhead    | Typical Availability |
+-------+----------------+----------------------+------------------+----------------------+
| N+1   | 1 unit         | 1                    | Low (~20-25%)    | 99.9%                |
+-------+----------------+----------------------+------------------+----------------------+
| N+2   | 2 units        | 2                    | Medium (~40-50%) | 99.95%               |
+-------+----------------+----------------------+------------------+----------------------+
| 2N    | N units (100%) | Up to N (half fleet) | High (100%)      | 99.99%+              |
+-------+----------------+----------------------+------------------+----------------------+
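To see why extra units raise availability, one rough model treats each server as independently up with probability `p` and asks how likely it is that at least N of the deployed servers are up. The sketch below computes that binomial sum. The function name `fleet_availability` and the value `p = 0.99` are illustrative assumptions, and the independence assumption does not hold for correlated failures, so the output is not expected to match the typical figures in the table.

```python
from math import comb


def fleet_availability(required: int, deployed: int, p: float) -> float:
    """Probability that at least `required` of `deployed` servers are up.

    Assumes independent servers, each up with probability `p` (illustrative model).
    """
    if not 0.0 <= p <= 1.0 or required < 0 or deployed < required:
        raise ValueError("need 0 <= p <= 1 and 0 <= required <= deployed")
    return sum(
        comb(deployed, k) * p**k * (1 - p) ** (deployed - k)
        for k in range(required, deployed + 1)
    )


if __name__ == "__main__":
    n, p = 4, 0.99  # assumed per-server availability, for illustration only
    for label, deployed in [("N", n), ("N+1", n + 1), ("N+2", n + 2), ("2N", 2 * n)]:
        print(f"{label:>3}: {fleet_availability(n, deployed, p):.6f}")
```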
# Active-Active vs Active-Passive
Beyond capacity planning, how redundant components operate matters for performance and failover speed.
Active-Active Architecture
All redundant components actively serve traffic simultaneously. Load is distributed across all instances.
             Load Balancer
                   |
        +----------+----------+
        |          |          |
     [Web 1]    [Web 2]    [Web 3]    <- All serving requests
     Active     Active     Active     Load distributed 33/33/33%

One Fails:
     [Web 1]    [Web 2-X]  [Web 3]    <- Traffic redistributes
     Active     ---        Active     automatically to 1 and 3
     50%                   50%        (50/50% split)
Pros:
- No wasted capacity—all resources actively used
- Instant failover—load balancer stops routing to failed instance
- Horizontal scaling—add more instances to increase capacity
- Better resource utilization
Cons:
- More complex—requires load balancing, state management
- Session handling complexity (sticky sessions or shared state)
- Harder to guarantee consistency (if stateful)
Best for: Stateless web services, APIs, microservices, read-heavy databases (read replicas).
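The redistribution shown in the diagram can be sketched as a round-robin balancer that skips instances marked unhealthy. This is a toy model under stated assumptions: the class `RoundRobinBalancer` and the instance names are hypothetical, and health is set by hand rather than by real health checks.

```python
class RoundRobinBalancer:
    """Toy active-active balancer: rotate over instances, skipping unhealthy ones."""

    def __init__(self, instances):
        self.instances = list(instances)
        self.healthy = set(self.instances)
        self._next = 0  # position of the next candidate in the rotation

    def mark_down(self, name):
        # In a real system a health check would trigger this.
        self.healthy.discard(name)

    def mark_up(self, name):
        if name in self.instances:
            self.healthy.add(name)

    def pick(self):
        # Try each instance at most once per call, starting from the rotation point.
        for _ in range(len(self.instances)):
            candidate = self.instances[self._next % len(self.instances)]
            self._next += 1
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy instances")


if __name__ == "__main__":
    lb = RoundRobinBalancer(["web1", "web2", "web3"])
    print([lb.pick() for _ in range(6)])  # spread across all three
    lb.mark_down("web2")
    print([lb.pick() for _ in range(4)])  # only web1 and web3 receive traffic
```

Failover here is only as fast as the step that calls `mark_down`, so in practice the health-check interval bounds how quickly traffic moves away from a failed instance.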
Active-Passive Architecture
One component (primary) actively serves traffic. Backup (secondary) is on standby, takes over only when primary fails.
Normal Operation:
    [Primary DB]      <-- All writes and reads
       Active
         |
    [Standby DB]      <-- Receiving replication, not serving
       Passive            traffic (warm standby)

Primary Fails:
    [Primary DB-X]
        ---
         |
    [Standby DB]      <-- Promoted to primary
       Active             Now serves all traffic
Pros:
- Simpler—no distributed state or coordination during normal operation
- Guaranteed consistency (single writer)
- Easier to reason about
Cons:
- Wasted capacity—standby sits idle
- Slower failover—must detect failure, promote standby (10s-60s typical)
- Risk of split-brain if both become active
- Doesn't scale capacity (just availability)
Best for: Write-heavy databases, stateful services, systems requiring strong consistency, legacy applications not designed for active-active.
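The detect-then-promote sequence behind the failover delay can be sketched as a heartbeat monitor. This is a simplified model: the class `FailoverMonitor`, the 5-second interval, and the threshold of 3 missed heartbeats are all assumptions chosen for illustration, timestamps are passed in instead of read from a clock, and there is no fencing, so it does not address split-brain.

```python
class FailoverMonitor:
    """Toy active-passive monitor: promote the standby after missed heartbeats."""

    def __init__(self, heartbeat_interval: float = 5.0, missed_threshold: int = 3):
        self.heartbeat_interval = heartbeat_interval  # seconds between heartbeats (assumed)
        self.missed_threshold = missed_threshold      # misses before declaring failure (assumed)
        self.active = "primary"
        self.last_heartbeat = 0.0

    @property
    def detection_time(self) -> float:
        # Silence required before the primary is declared failed.
        return self.heartbeat_interval * self.missed_threshold

    def heartbeat(self, now: float) -> None:
        """Record a heartbeat received from the primary at time `now`."""
        self.last_heartbeat = now

    def check(self, now: float) -> str:
        """Return the node that should serve traffic at time `now`."""
        silence = now - self.last_heartbeat
        if self.active == "primary" and silence >= self.detection_time:
            # Promotion is one-way here; failing back is a separate, manual step.
            self.active = "standby"
        return self.active


if __name__ == "__main__":
    monitor = FailoverMonitor()
    monitor.heartbeat(now=0.0)
    print(monitor.check(now=10.0))  # still within the detection window
    print(monitor.check(now=15.0))  # three intervals of silence: standby promoted
```

With these assumed values, detection alone takes 15 seconds, and the time to promote the standby adds to that. Shorter intervals detect failures faster but raise the risk of promoting on a transient network blip.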
Comparison: Active-Active vs Active-Passive
+-----------------+-----------------------------+----------------------------------+
| Characteristic  | Active-Active               | Active-Passive                   |
+-----------------+-----------------------------+----------------------------------+
| Resource Use    | All instances serve traffic | Primary only, standby idle       |
+-----------------+-----------------------------+----------------------------------+
| Failover Speed  | Instant (<1s)               | 10-60 seconds (detect + promote) |
+-----------------+-----------------------------+----------------------------------+
| Complexity      | High (state management)     | Low (single active instance)     |
+-----------------+-----------------------------+----------------------------------+
| Consistency     | Eventual (if stateful)      | Strong (single writer)           |
+-----------------+-----------------------------+----------------------------------+
| Scaling         | Horizontal (add instances)  | No (for capacity), yes (for HA)  |
+-----------------+-----------------------------+----------------------------------+
| Cost Efficiency | High (all active)           | Low (standby idle)               |
+-----------------+-----------------------------+----------------------------------+
# When to Use Which Model
Stateless services (web, API): Active-active with N+1 or N+2 redundancy. Load balance across instances, auto-scale based on demand.
Databases (write-heavy): Active-passive with N+1 (primary + standby). Promote standby on failure. Consider N+2 for maintenance window + unexpected failure.
Databases (read-heavy): Active-passive primary for writes + active-active read replicas. Writes go to primary, reads distributed across replicas.
Mission-critical infrastructure: 2N redundancy with active-active (if possible) or active-passive (if consistency required). Deploy across multiple failure domains.
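The hybrid pattern for read-heavy databases, with writes going to the primary and reads spread across replicas, can be sketched as a small router. The class `ReadWriteRouter` and the node names are hypothetical, and the sketch ignores replication lag, so a read may not see the most recent write.

```python
from itertools import cycle


class ReadWriteRouter:
    """Toy router: writes go to the primary, reads rotate across replicas."""

    def __init__(self, primary, replicas):
        self.primary = primary
        replicas = list(replicas)
        # With no replicas configured, reads fall back to the primary.
        self._replicas = cycle(replicas) if replicas else None

    def route(self, operation: str) -> str:
        if operation not in ("read", "write"):
            raise ValueError(f"unknown operation: {operation!r}")
        if operation == "write" or self._replicas is None:
            return self.primary
        return next(self._replicas)


if __name__ == "__main__":
    router = ReadWriteRouter("primary", ["replica1", "replica2"])
    print(router.route("write"))
    print([router.route("read") for _ in range(3)])
```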
# Key Takeaways
- N+1 tolerates one failure (cost-effective), N+2 tolerates two (higher availability), 2N tolerates losing half the fleet (mission-critical)
- Active-active uses all capacity, instant failover, but complex state management
- Active-passive wastes standby capacity, slower failover, but simpler and strongly consistent
- Choose model based on: availability SLA, budget, failure frequency, consistency requirements
- Stateless services: active-active. Write-heavy databases: active-passive. Read-heavy: hybrid.