State Management & Consensus
Managing state in distributed systems is hard. Stateless services scale easily—just add more instances. Stateful services require coordination, consensus, and careful handling of failures. Understanding when to avoid state and how to manage it when necessary is critical for resilience.
# Stateless vs Stateful Services
Stateless Services
A stateless service doesn't store session or user data locally. Every request is independent, containing all necessary information.
Stateless Web Server:

```
Request 1: GET /user/123 (auth token in header)
  --> Fetch user from database
  --> Return response
  --> No local state retained

Request 2: GET /user/456 (handled by a different server, same behavior)
  --> Fetch user from database
  --> Return response

Any server can handle any request.
```
Advantages:
- Easy to scale: add more servers, load balance randomly
- Easy to replace: crashed server? Spin up new one, no state migration
- No consistency issues: no local state means no conflicting state
Examples: REST APIs, web frontends with JWT tokens, CDNs, proxy servers.
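To make this concrete, here is a minimal sketch of a stateless handler in Go. Every request carries its own auth token, the handler looks the user up in shared storage, and nothing is kept locally between requests; `fetchUser`, the route, and the port are hypothetical stand-ins.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// fetchUser is a hypothetical helper: a real service would query a shared
// database here, which is exactly why any instance can serve any request.
func fetchUser(id, token string) (string, error) {
	return fmt.Sprintf(`{"id": %q}`, id), nil
}

func main() {
	http.HandleFunc("/user/", func(w http.ResponseWriter, r *http.Request) {
		// Everything needed arrives with the request; no session is stored.
		token := r.Header.Get("Authorization")
		id := strings.TrimPrefix(r.URL.Path, "/user/")
		body, err := fetchUser(id, token)
		if err != nil {
			http.Error(w, "lookup failed", http.StatusInternalServerError)
			return
		}
		fmt.Fprint(w, body)
	})
	http.ListenAndServe(":8080", nil)
}
```

Kill this instance mid-traffic and a fresh one behind the load balancer behaves identically, because there is no state to migrate.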
Stateful Services
A stateful service stores session data, cached data, or authoritative data locally. State must be preserved across failures.
Stateful Database:

```
Server A: stores rows 1-1000    (shard A)
Server B: stores rows 1001-2000 (shard B)

A request for row 500 MUST go to Server A;
Server B doesn't have that data.

Failure: if Server A crashes, rows 1-1000 are
unavailable (unless replicated).
```
Challenges:
- Scaling: can't just add servers, need to partition/shard state
- Failure: need replication, failover, or recovery from disk
- Consistency: replicas must agree on state (consensus problem)
Examples: Databases, caches (if used as source of truth), session stores, distributed locks, leader election services.
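To illustrate the routing constraint, here is a small sketch of range-based shard lookup; the shard boundaries and server names are made up for the example.

```go
package main

import "fmt"

// shard describes a contiguous row-ID range owned by one server.
type shard struct {
	lo, hi int    // inclusive range
	server string // owner of that range
}

var shards = []shard{
	{1, 1000, "server-a"},
	{1001, 2000, "server-b"},
}

// serverFor maps a row to the only server that holds it. If that server
// is down and unreplicated, the row is simply unavailable.
func serverFor(row int) (string, bool) {
	for _, s := range shards {
		if row >= s.lo && row <= s.hi {
			return s.server, true
		}
	}
	return "", false
}

func main() {
	srv, _ := serverFor(500)
	fmt.Println(srv) // server-a: requests for row 500 must land here
}
```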
# Quorum and Split-Brain Scenarios
When a network partition occurs, how do you ensure only one group continues operating? Quorum solves this problem; split-brain is what happens when quorum is missing or not enforced.
Quorum Basics
A quorum is the minimum number of nodes required to make decisions, typically a majority: floor(N/2) + 1.
5-Node Cluster (quorum = 3, a majority):

```
Normal operation:
  [A] [B] [C] [D] [E]    <- all 5 nodes connected, quorum exists

Network partition:
  Group 1: [A] [B] [C]   <- 3 nodes: has quorum, can operate
  Group 2: [D] [E]       <- 2 nodes: no quorum, read-only or halt
```

Only Group 1 accepts writes, which prevents split-brain (both groups writing conflicting data).
Why a majority? Two disjoint groups can never both contain a majority of the nodes, so they can never make conflicting decisions.
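The arithmetic is worth seeing directly. A minimal sketch, assuming a fixed cluster size known to every node:

```go
package main

import "fmt"

// quorum returns the majority threshold for a cluster of n nodes:
// integer division gives floor(n/2), so this is floor(n/2) + 1.
func quorum(n int) int {
	return n/2 + 1
}

// canOperate reports whether a partition group still holds a majority
// of the full cluster and may therefore keep accepting writes.
func canOperate(groupSize, clusterSize int) bool {
	return groupSize >= quorum(clusterSize)
}

func main() {
	fmt.Println(quorum(5))        // 3
	fmt.Println(canOperate(3, 5)) // true:  Group 1 keeps operating
	fmt.Println(canOperate(2, 5)) // false: Group 2 goes read-only or halts
}
```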
Split-Brain Scenario
Split-brain occurs when a partition allows both sides to operate independently, creating conflicting state.
Database with 2 Nodes (no quorum check):

```
Normal:             [Primary A] ---- [Replica B]

Network partition:  each node thinks the other is dead
  [Primary A]  <- keeps accepting writes
  [Replica B]  <- promotes itself, accepts writes

Both accept writes to the same keys:
  A: UPDATE user SET email='a@example.com' WHERE id=1
  B: UPDATE user SET email='b@example.com' WHERE id=1
```

When the partition heals: conflict! Which email is correct?
Prevention: Use quorum. With 3 nodes, only the side with 2+ nodes can elect a leader.
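As a sketch of that prevention rule: a replica should promote itself only when it can still reach a majority of the configured cluster, counting itself. The function below is illustrative, not a full failover system.

```go
package main

import "fmt"

// shouldPromote guards a failover decision: promote only if reachable
// peers plus this node form a majority of the configured cluster.
func shouldPromote(reachablePeers, clusterSize int) bool {
	return reachablePeers+1 >= clusterSize/2+1
}

func main() {
	// 2-node cluster: an isolated node never promotes, so split-brain is
	// impossible, but so is automatic failover (the system halts instead).
	fmt.Println(shouldPromote(0, 2)) // false
	// 3-node cluster: the side with 2 nodes promotes; the isolated node waits.
	fmt.Println(shouldPromote(1, 3)) // true
	fmt.Println(shouldPromote(0, 3)) // false
}
```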
# Real-World Example: GitHub Split-Brain (October 2018)
On October 21, 2018, GitHub experienced a network partition between its East Coast and West Coast datacenters, leading to a split-brain scenario in their MySQL database cluster.
What happened: Routine network maintenance caused 43 seconds of connectivity loss between the datacenters. GitHub's failover tooling (orchestrator, which handles MySQL leader election) reacted by promoting replicas on the West Coast to primary, even though the East Coast primary had accepted writes that were not yet replicated.
```
Before partition:
  East Coast:  [MySQL Primary]
  West Coast:  [MySQL Replica]

During the 43-second network partition:
  East Coast:  [MySQL Primary]  (continued accepting writes)
  West Coast:  [MySQL Replica]  (promoted to primary, accepted writes)

Split-brain: both datacenters accepting writes to the same database
```
Impact: When the partition healed, the two datacenters had diverged: each held writes the other lacked, so neither could simply be declared authoritative.
Recovery: GitHub kept the West Coast writes as the ongoing timeline, restored the out-of-date East Coast databases from backup, resynchronized replication, and manually reconciled the orphaned East Coast writes from the partition window by analyzing application logs. Full recovery took more than 24 hours.
Lesson: Quorum must be enforced strictly. Even short network partitions (43 seconds) can cause split-brain if both sides think they have quorum. Use odd number of nodes (3, 5, 7) and majority quorum to prevent this.
# Consensus Protocols (High-Level)
Consensus protocols help distributed systems agree on state despite failures and partitions.
Raft
Raft is a consensus algorithm designed to be understandable. Used in etcd, Consul, CockroachDB.
Key concepts:
- Leader election: Cluster elects one leader. Only leader accepts writes.
- Log replication: Leader appends entries to log, replicates to followers.
- Quorum: Write is committed when majority of nodes acknowledge.
- Leader failure: Followers detect timeout, start new election.
Raft Cluster (5 nodes):

```
[Leader] [Follower] [Follower] [Follower] [Follower]
    |
    +-- client write request
    |
    v
Replicate to followers; wait for 2+ ACKs
(leader + 2 followers = 3 = majority)
    |
    v
Commit when the majority confirms
```
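A sketch of the leader's commit rule under these assumptions: followers are modeled as plain functions, and the leader waits for all responses for simplicity, whereas a real Raft leader commits as soon as a majority (including its own log) has the entry and also tracks terms, log indices, and persistence.

```go
package main

import (
	"fmt"
	"sync"
)

// replicate sends an entry to every follower and counts ACKs.
func replicate(entry string, followers []func(string) bool) int {
	var (
		mu   sync.Mutex
		acks int
		wg   sync.WaitGroup
	)
	for _, f := range followers {
		wg.Add(1)
		go func(f func(string) bool) {
			defer wg.Done()
			if f(entry) {
				mu.Lock()
				acks++
				mu.Unlock()
			}
		}(f)
	}
	wg.Wait()
	return acks
}

func main() {
	healthy := func(string) bool { return true }
	crashed := func(string) bool { return false }

	// 5-node cluster: 1 leader + 4 followers, 2 of them down.
	followers := []func(string) bool{healthy, healthy, crashed, crashed}
	acks := replicate("SET x=1", followers)

	// The leader's own copy counts: 1 + 2 ACKs = 3 = majority of 5.
	if 1+acks >= 5/2+1 {
		fmt.Println("committed")
	}
}
```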
Paxos
Paxos is older and harder to understand, but its safety is formally proven. Used in Google Chubby and, in variant form, Apache Cassandra's lightweight transactions.
Key idea: Multi-phase protocol (prepare, promise, accept, commit) ensures agreement even with failures. More flexible than Raft but harder to implement correctly.
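To make the phases slightly more concrete, here is a sketch of the message types in single-decree Paxos; the field names are illustrative, and a real implementation also needs stable storage and careful handling of competing proposers.

```go
package paxos

// Phase 1a: a proposer asks acceptors to promise not to accept
// anything numbered below N.
type Prepare struct{ N int }

// Phase 1b: an acceptor's promise. If it already accepted a proposal,
// it reports it, and the proposer must adopt that value.
type Promise struct {
	N         int
	AcceptedN int    // 0 if nothing accepted yet
	AcceptedV string // value of the accepted proposal, if any
}

// Phase 2a: the proposer asks acceptors to accept value V at number N.
type Accept struct {
	N int
	V string
}

// Phase 2b: an acceptor's acknowledgment; once a majority of acceptors
// accept the same N, the value is chosen (committed).
type Accepted struct{ N int }
```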
When to Use Consensus
Consensus is expensive (latency, coordination overhead). Use when you need:
- Leader election (who's the primary?)
- Configuration management (distribute config changes consistently)
- Distributed locks (ensure only one process holds lock)
- Strongly consistent metadata (file locations, shard assignments)
Avoid for: High-throughput data plane operations. Use consensus for control plane, not data plane.
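As an example of a control-plane use, here is a distributed lock built on etcd (which runs Raft underneath), using its v3 client's concurrency package; the endpoint and lock key are placeholders.

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// A session is backed by a lease: if this process dies, the lease
	// expires and the lock is released automatically.
	sess, err := concurrency.NewSession(cli)
	if err != nil {
		log.Fatal(err)
	}
	defer sess.Close()

	mu := concurrency.NewMutex(sess, "/locks/job-runner") // placeholder key
	if err := mu.Lock(context.TODO()); err != nil {
		log.Fatal(err)
	}
	log.Println("lock held: at most one process runs this section")
	// ... exclusive work ...
	if err := mu.Unlock(context.TODO()); err != nil {
		log.Fatal(err)
	}
}
```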
# State Replication Strategies
Synchronous Replication
Write completes only after replicas acknowledge. Guarantees consistency but adds latency.
Client Write (synchronous):

```
Primary receives write
        |
        v
Replicate to all replicas (wait for ACKs)
        |
        v
All replicas confirm
        |
        v
Return success to client
```

Latency: bounded by the slowest replica.
Consistency: strong (every replica has the data before the client sees success).
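A minimal sketch of that path, with replicas reduced to an in-memory interface so the control flow stands out; the types are illustrative.

```go
package main

import "fmt"

// Replica is anything that can durably apply a write.
type Replica interface {
	Apply(key, value string) error
}

type memReplica struct{ data map[string]string }

func (m *memReplica) Apply(k, v string) error { m.data[k] = v; return nil }

// syncWrite returns success only after every replica has acknowledged,
// so the client never sees a commit that some replica is missing.
func syncWrite(key, value string, replicas []Replica) error {
	for _, r := range replicas {
		if err := r.Apply(key, value); err != nil {
			return fmt.Errorf("abort: replica failed: %w", err)
		}
	}
	return nil
}

func main() {
	rs := []Replica{
		&memReplica{data: map[string]string{}},
		&memReplica{data: map[string]string{}},
	}
	fmt.Println(syncWrite("user:1:email", "a@example.com", rs)) // <nil>
}
```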
Asynchronous Replication
Write completes on the primary and replicates to replicas in the background. Fast, but risks data loss.
Client Write (asynchronous):

```
Primary receives write
        |
        v
Return success to client (immediately)
        |
        v
Replicate to replicas (background, async)
```

If the primary crashes before replication completes, acknowledged writes are lost.
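Continuing the synchronous sketch above (same `Replica` interface), the asynchronous variant acknowledges first and replicates later:

```go
// asyncWrite acknowledges the client the moment it returns and pushes
// the write to replicas in the background.
func asyncWrite(key, value string, replicas []Replica) {
	go func() {
		for _, r := range replicas {
			// Any failure here happens after the client was told "OK";
			// if the primary crashes first, the write exists nowhere else.
			_ = r.Apply(key, value)
		}
	}()
}
```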
Semi-Synchronous Replication
A compromise: wait for at least one replica to acknowledge (so the data exists in two places, primary plus one replica), then return. This balances latency and durability.
Client Write (semi-synchronous):

```
Primary receives write
        |
        v
Replicate to replicas
        |
        v
Wait for 1+ replica ACK (not all)
        |
        v
Return success to client
        |
        v
Continue async replication to remaining replicas
```
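Continuing the same sketch (the `Replica` interface plus the standard `errors` package), the semi-synchronous variant returns after the first ACK, so every acknowledged write exists on at least two machines:

```go
// semiSyncWrite blocks until one replica ACKs, then returns while the
// remaining replicas catch up asynchronously.
func semiSyncWrite(key, value string, replicas []Replica) error {
	acks := make(chan error, len(replicas)) // buffered: senders never block
	for _, r := range replicas {
		go func(r Replica) { acks <- r.Apply(key, value) }(r)
	}
	for range replicas {
		if err := <-acks; err == nil {
			return nil // first ACK: durable on primary + one replica
		}
	}
	return errors.New("no replica acknowledged: write not durable")
}
```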
# Key Takeaways
- Stateless services scale easily; prefer stateless when possible
- Stateful services require replication, quorum, and careful failure handling
- Quorum (majority) prevents split-brain—two groups can't both have majority
- GitHub 2018: a 43-second network partition caused split-brain because quorum was not enforced strictly
- Consensus protocols (Raft, Paxos) enable agreement despite failures, but add latency
- Use consensus for control plane (leader election, config), not high-throughput data plane
- Replication trade-offs: sync (consistent, slow), async (fast, risky), semi-sync (balanced)