State Management & Consensus
Managing state in distributed systems is hard. Stateless services scale easily—just add more instances. Stateful services require coordination, consensus, and careful handling of failures. Understanding when to avoid state and how to manage it when necessary is critical for resilience.
# Stateless vs Stateful Services
Stateless Services
A stateless service doesn't store session or user data locally. Every request is independent, containing all necessary information.
Stateless Web Server:

```
Request 1: GET /user/123 (auth token in header)
  --> Fetch user from database
  --> Return response
  --> No local state retained

Request 2: GET /user/456 (handled by a different server, same behavior)
  --> Fetch user from database
  --> Return response

Any server can handle any request.
```
Advantages:
- Easy to scale: add more servers, load balance randomly
- Easy to replace: crashed server? Spin up new one, no state migration
- No consistency issues: no local state means no conflicting state
Examples: REST APIs, web frontends with JWT tokens, CDNs, proxy servers.
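To make this concrete, here is a minimal sketch of a stateless handler in Go. Every request carries its own auth token, the handler looks the user up in shared storage, and nothing is kept locally between requests; `fetchUser`, the route, and the port are hypothetical stand-ins.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// fetchUser is a hypothetical helper: a real service would query a shared
// database here, which is exactly why any instance can serve any request.
func fetchUser(id, token string) (string, error) {
	return fmt.Sprintf(`{"id": %q}`, id), nil
}

func main() {
	http.HandleFunc("/user/", func(w http.ResponseWriter, r *http.Request) {
		// Everything needed arrives with the request; no session is stored.
		token := r.Header.Get("Authorization")
		id := strings.TrimPrefix(r.URL.Path, "/user/")
		body, err := fetchUser(id, token)
		if err != nil {
			http.Error(w, "lookup failed", http.StatusInternalServerError)
			return
		}
		fmt.Fprint(w, body)
	})
	http.ListenAndServe(":8080", nil)
}
```

Kill this instance mid-traffic and a fresh one behind the load balancer behaves identically, because there is no state to migrate.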
Stateful Services
A stateful service stores session data, cached data, or authoritative data locally. State must be preserved across failures.
Stateful Database:

```
Server A: stores rows 1-1000    (shard A)
Server B: stores rows 1001-2000 (shard B)

A request for row 500 MUST go to Server A;
Server B doesn't have that data.

Failure: if Server A crashes, rows 1-1000 are
unavailable (unless replicated).
```
Challenges:
- Scaling: can't just add servers, need to partition/shard state
- Failure: need replication, failover, or recovery from disk
- Consistency: replicas must agree on state (consensus problem)
Examples: Databases, caches (if used as source of truth), session stores, distributed locks, leader election services.
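To illustrate the routing constraint, here is a small sketch of range-based shard lookup; the shard boundaries and server names are made up for the example.

```go
package main

import "fmt"

// shard describes a contiguous row-ID range owned by one server.
type shard struct {
	lo, hi int    // inclusive range
	server string // owner of that range
}

var shards = []shard{
	{1, 1000, "server-a"},
	{1001, 2000, "server-b"},
}

// serverFor maps a row to the only server that holds it. If that server
// is down and unreplicated, the row is simply unavailable.
func serverFor(row int) (string, bool) {
	for _, s := range shards {
		if row >= s.lo && row <= s.hi {
			return s.server, true
		}
	}
	return "", false
}

func main() {
	srv, _ := serverFor(500)
	fmt.Println(srv) // server-a: requests for row 500 must land here
}
```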
# Quorum and Split-Brain Scenarios
When a network partition occurs, how do you ensure only one group continues operating? Quorum solves this problem; split-brain is what happens when quorum is missing or not enforced.
Quorum Basics
A quorum is the minimum number of nodes required to make decisions, typically a majority: floor(N/2) + 1.
5-Node Cluster (quorum = 3, a majority):

```
Normal operation:
  [A] [B] [C] [D] [E]    <- all 5 nodes connected, quorum exists

Network partition:
  Group 1: [A] [B] [C]   <- 3 nodes: has quorum, can operate
  Group 2: [D] [E]       <- 2 nodes: no quorum, read-only or halt
```

Only Group 1 accepts writes, which prevents split-brain (both groups writing conflicting data).
Why a majority? Two disjoint groups can never both contain a majority of the nodes, so they can never make conflicting decisions.
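The arithmetic is worth seeing directly. A minimal sketch, assuming a fixed cluster size known to every node:

```go
package main

import "fmt"

// quorum returns the majority threshold for a cluster of n nodes:
// integer division gives floor(n/2), so this is floor(n/2) + 1.
func quorum(n int) int {
	return n/2 + 1
}

// canOperate reports whether a partition group still holds a majority
// of the full cluster and may therefore keep accepting writes.
func canOperate(groupSize, clusterSize int) bool {
	return groupSize >= quorum(clusterSize)
}

func main() {
	fmt.Println(quorum(5))        // 3
	fmt.Println(canOperate(3, 5)) // true:  Group 1 keeps operating
	fmt.Println(canOperate(2, 5)) // false: Group 2 goes read-only or halts
}
```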
Split-Brain Scenario
Split-brain occurs when a partition allows both sides to operate independently, creating conflicting state.
Database with 2 Nodes (no quorum check):

```
Normal:             [Primary A] ---- [Replica B]

Network partition:  each node thinks the other is dead
  [Primary A]  <- keeps accepting writes
  [Replica B]  <- promotes itself, accepts writes

Both accept writes to the same keys:
  A: UPDATE user SET email='a@example.com' WHERE id=1
  B: UPDATE user SET email='b@example.com' WHERE id=1
```

When the partition heals: conflict! Which email is correct?
Prevention: Use quorum. With 3 nodes, only the side with 2+ nodes can elect a leader.
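As a sketch of that prevention rule: a replica should promote itself only when it can still reach a majority of the configured cluster, counting itself. The function below is illustrative, not a full failover system.

```go
package main

import "fmt"

// shouldPromote guards a failover decision: promote only if reachable
// peers plus this node form a majority of the configured cluster.
func shouldPromote(reachablePeers, clusterSize int) bool {
	return reachablePeers+1 >= clusterSize/2+1
}

func main() {
	// 2-node cluster: an isolated node never promotes, so split-brain is
	// impossible, but so is automatic failover (the system halts instead).
	fmt.Println(shouldPromote(0, 2)) // false
	// 3-node cluster: the side with 2 nodes promotes; the isolated node waits.
	fmt.Println(shouldPromote(1, 3)) // true
	fmt.Println(shouldPromote(0, 3)) // false
}
```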
# Real-World Example: GitHub Split-Brain (October 2018)
On October 21, 2018, GitHub experienced a network partition between its East Coast and West Coast datacenters, leading to a split-brain scenario in their MySQL database cluster.
What happened: Routine network maintenance caused 43 seconds of connectivity loss between the datacenters. GitHub's failover tooling (orchestrator, which handles MySQL leader election) reacted by promoting replicas on the West Coast to primary, even though the East Coast primary had accepted writes that were not yet replicated.
```
Before partition:
  East Coast:  [MySQL Primary]
  West Coast:  [MySQL Replica]

During the 43-second network partition:
  East Coast:  [MySQL Primary]  (continued accepting writes)
  West Coast:  [MySQL Replica]  (promoted to primary, accepted writes)

Split-brain: both datacenters accepting writes to the same database
```
Impact: When the partition healed, the two datacenters had diverged: each held writes the other lacked, so neither could simply be declared authoritative.
Recovery: GitHub kept the West Coast writes as the ongoing timeline, restored the out-of-date East Coast databases from backup, resynchronized replication, and manually reconciled the orphaned East Coast writes from the partition window by analyzing application logs. Full recovery took more than 24 hours.
Lesson: Quorum must be enforced strictly. Even short network partitions (43 seconds) can cause split-brain if both sides think they have quorum. Use odd number of nodes (3, 5, 7) and majority quorum to prevent this.
# Consensus Protocols (High-Level)
Consensus protocols help distributed systems agree on state despite failures and partitions.
Raft
Raft is a consensus algorithm designed to be understandable. Used in etcd, Consul, CockroachDB.
Key concepts:
- Leader election: Cluster elects one leader. Only leader accepts writes.
- Log replication: Leader appends entries to log, replicates to followers.
- Quorum: Write is committed when majority of nodes acknowledge.
- Leader failure: Followers detect timeout, start new election.
Raft Cluster (5 nodes):

```
[Leader] [Follower] [Follower] [Follower] [Follower]
    |
    +-- client write request
    |
    v
Replicate to followers; wait for 2+ ACKs
(leader + 2 followers = 3 = majority)
    |
    v
Commit when the majority confirms
```
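A sketch of the leader's commit rule under these assumptions: followers are modeled as plain functions, and the leader waits for all responses for simplicity, whereas a real Raft leader commits as soon as a majority (including its own log) has the entry and also tracks terms, log indices, and persistence.

```go
package main

import (
	"fmt"
	"sync"
)

// replicate sends an entry to every follower and counts ACKs.
func replicate(entry string, followers []func(string) bool) int {
	var (
		mu   sync.Mutex
		acks int
		wg   sync.WaitGroup
	)
	for _, f := range followers {
		wg.Add(1)
		go func(f func(string) bool) {
			defer wg.Done()
			if f(entry) {
				mu.Lock()
				acks++
				mu.Unlock()
			}
		}(f)
	}
	wg.Wait()
	return acks
}

func main() {
	healthy := func(string) bool { return true }
	crashed := func(string) bool { return false }

	// 5-node cluster: 1 leader + 4 followers, 2 of them down.
	followers := []func(string) bool{healthy, healthy, crashed, crashed}
	acks := replicate("SET x=1", followers)

	// The leader's own copy counts: 1 + 2 ACKs = 3 = majority of 5.
	if 1+acks >= 5/2+1 {
		fmt.Println("committed")
	}
}
```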
Paxos
Paxos is older and harder to understand, but its safety is formally proven. Used in Google Chubby and, in variant form, Apache Cassandra's lightweight transactions.
Key idea: Multi-phase protocol (prepare, promise, accept, commit) ensures agreement even with failures. More flexible than Raft but harder to implement correctly.
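To make the phases slightly more concrete, here is a sketch of the message types in single-decree Paxos; the field names are illustrative, and a real implementation also needs stable storage and careful handling of competing proposers.

```go
package paxos

// Phase 1a: a proposer asks acceptors to promise not to accept
// anything numbered below N.
type Prepare struct{ N int }

// Phase 1b: an acceptor's promise. If it already accepted a proposal,
// it reports it, and the proposer must adopt that value.
type Promise struct {
	N         int
	AcceptedN int    // 0 if nothing accepted yet
	AcceptedV string // value of the accepted proposal, if any
}

// Phase 2a: the proposer asks acceptors to accept value V at number N.
type Accept struct {
	N int
	V string
}

// Phase 2b: an acceptor's acknowledgment; once a majority of acceptors
// accept the same N, the value is chosen (committed).
type Accepted struct{ N int }
```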
When to Use Consensus
Consensus is expensive (latency, coordination overhead). Use when you need:
- Leader election (who's the primary?)
- Configuration management (distribute config changes consistently)
- Distributed locks (ensure only one process holds lock)
- Strongly consistent metadata (file locations, shard assignments)
Avoid for: High-throughput data plane operations. Use consensus for control plane, not data plane.
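As an example of a control-plane use, here is a distributed lock built on etcd (which runs Raft underneath), using its v3 client's concurrency package; the endpoint and lock key are placeholders.

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// A session is backed by a lease: if this process dies, the lease
	// expires and the lock is released automatically.
	sess, err := concurrency.NewSession(cli)
	if err != nil {
		log.Fatal(err)
	}
	defer sess.Close()

	mu := concurrency.NewMutex(sess, "/locks/job-runner") // placeholder key
	if err := mu.Lock(context.TODO()); err != nil {
		log.Fatal(err)
	}
	log.Println("lock held: at most one process runs this section")
	// ... exclusive work ...
	if err := mu.Unlock(context.TODO()); err != nil {
		log.Fatal(err)
	}
}
```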
# State Replication Strategies
Synchronous Replication
Write completes only after replicas acknowledge. Guarantees consistency but adds latency.
Client Write (synchronous):

```
Primary receives write
        |
        v
Replicate to all replicas (wait for ACKs)
        |
        v
All replicas confirm
        |
        v
Return success to client
```

Latency: bounded by the slowest replica.
Consistency: strong (every replica has the data before the client sees success).
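A minimal sketch of that path, with replicas reduced to an in-memory interface so the control flow stands out; the types are illustrative.

```go
package main

import "fmt"

// Replica is anything that can durably apply a write.
type Replica interface {
	Apply(key, value string) error
}

type memReplica struct{ data map[string]string }

func (m *memReplica) Apply(k, v string) error { m.data[k] = v; return nil }

// syncWrite returns success only after every replica has acknowledged,
// so the client never sees a commit that some replica is missing.
func syncWrite(key, value string, replicas []Replica) error {
	for _, r := range replicas {
		if err := r.Apply(key, value); err != nil {
			return fmt.Errorf("abort: replica failed: %w", err)
		}
	}
	return nil
}

func main() {
	rs := []Replica{
		&memReplica{data: map[string]string{}},
		&memReplica{data: map[string]string{}},
	}
	fmt.Println(syncWrite("user:1:email", "a@example.com", rs)) // <nil>
}
```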
Asynchronous Replication
Write completes on the primary and replicates to replicas in the background. Fast, but risks data loss.
Client Write (asynchronous):

```
Primary receives write
        |
        v
Return success to client (immediately)
        |
        v
Replicate to replicas (background, async)
```

If the primary crashes before replication completes, acknowledged writes are lost.
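Continuing the synchronous sketch above (same `Replica` interface), the asynchronous variant acknowledges first and replicates later:

```go
// asyncWrite acknowledges the client the moment it returns and pushes
// the write to replicas in the background.
func asyncWrite(key, value string, replicas []Replica) {
	go func() {
		for _, r := range replicas {
			// Any failure here happens after the client was told "OK";
			// if the primary crashes first, the write exists nowhere else.
			_ = r.Apply(key, value)
		}
	}()
}
```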
Semi-Synchronous Replication
A compromise: wait for at least one replica to acknowledge (so the data exists in two places, primary plus one replica), then return. This balances latency and durability.
Client Write (semi-synchronous):

```
Primary receives write
        |
        v
Replicate to replicas
        |
        v
Wait for 1+ replica ACK (not all)
        |
        v
Return success to client
        |
        v
Continue async replication to remaining replicas
```
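Continuing the same sketch (the `Replica` interface plus the standard `errors` package), the semi-synchronous variant returns after the first ACK, so every acknowledged write exists on at least two machines:

```go
// semiSyncWrite blocks until one replica ACKs, then returns while the
// remaining replicas catch up asynchronously.
func semiSyncWrite(key, value string, replicas []Replica) error {
	acks := make(chan error, len(replicas)) // buffered: senders never block
	for _, r := range replicas {
		go func(r Replica) { acks <- r.Apply(key, value) }(r)
	}
	for range replicas {
		if err := <-acks; err == nil {
			return nil // first ACK: durable on primary + one replica
		}
	}
	return errors.New("no replica acknowledged: write not durable")
}
```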
# Key Takeaways
- Stateless services scale easily; prefer stateless when possible
- Stateful services require replication, quorum, and careful failure handling
- Quorum (majority) prevents split-brain—two groups can't both have majority
- GitHub 2018: a 43-second network partition caused split-brain because quorum was not enforced strictly
- Consensus protocols (Raft, Paxos) enable agreement despite failures, but add latency
- Use consensus for control plane (leader election, config), not high-throughput data plane
- Replication trade-offs: sync (consistent, slow), async (fast, risky), semi-sync (balanced)