
Control Plane vs Data Plane Separation

The control plane manages configuration, coordination, and metadata. The data plane handles actual user requests and data processing. Separating them prevents control plane failures from bringing down the data plane—and vice versa. This separation is fundamental to resilient system design.

# What Are Control and Data Planes?

Control Plane

The control plane manages system state, configuration, and coordination. Examples:

  • Cluster membership and leader election
  • Configuration distribution
  • Health checking and monitoring
  • API endpoints for admin operations (create resource, delete resource)
  • Metadata management (file locations, routing tables)

Characteristics: Lower traffic volume, changes infrequently, critical for configuration but not consulted on every request.

Data Plane

The data plane handles user requests and actual data processing. Examples:

  • Serving web requests
  • Reading/writing data to databases or storage
  • Packet forwarding in routers
  • Processing user transactions

Characteristics: High volume traffic, latency-sensitive, must be highly available.

Example: Network Router

Control Plane:
    - BGP protocol (exchange routing information)
    - OSPF protocol (calculate shortest paths)
    - Update routing table when topology changes
    - Low volume, infrequent

Data Plane:
    - Forward packets based on routing table
    - Millions of packets per second
    - Must be fast, latency-sensitive
    - Uses routing table built by control plane
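
To make the split concrete, here is a minimal Go sketch of the router example. The names (`ForwardingTable`, `controlPlaneUpdate`, `dataPlaneForward`) are hypothetical: the control plane publishes a new table when topology changes, and the data plane does per-packet lookups against the last published table without ever waiting on the control plane.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// ForwardingTable maps destination prefixes to outgoing interfaces.
type ForwardingTable map[string]string

// table holds the latest ForwardingTable published by the control plane.
var table atomic.Value

// controlPlaneUpdate recomputes routes (e.g., after a BGP or OSPF event)
// and atomically publishes a new table. It runs rarely, off the hot path.
func controlPlaneUpdate(routes ForwardingTable) {
	table.Store(routes)
}

// dataPlaneForward is the hot path: one lookup per packet, no coordination
// with the control plane and no dependency on it being up.
func dataPlaneForward(dstPrefix string) (string, bool) {
	t, _ := table.Load().(ForwardingTable)
	nextHop, ok := t[dstPrefix]
	return nextHop, ok
}

func main() {
	controlPlaneUpdate(ForwardingTable{
		"10.0.0.0/8":     "eth1",
		"192.168.0.0/16": "eth2",
	})
	fmt.Println(dataPlaneForward("10.0.0.0/8")) // eth1 true
	// Even if controlPlaneUpdate never runs again, forwarding keeps working
	// with the last table it published.
}
```

Publishing the whole table atomically, rather than locking it per packet, is what keeps the hot path independent of how long a control plane recomputation takes.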

# Why Separate Them?

Blast radius isolation: Control plane bug doesn't crash data plane. Data plane overload doesn't starve control plane of resources.

Independent scaling: Data plane needs horizontal scaling (more servers for more requests). Control plane needs coordination/consensus (not horizontally scalable in the same way).

Graceful degradation: If control plane fails, data plane continues serving with last-known-good configuration. Service degrades but doesn't stop.

Coupled (Bad):
    Control Plane and Data Plane in same process
    Control plane crash --> Entire service down

Separated (Good):
    Control Plane fails --> Data plane continues with cached config
                        --> New configs can't be applied
                        --> But existing traffic still works

# Real-World Example: AWS S3 Outage (February 2017)

On February 28, 2017, AWS S3 in us-east-1 experienced a major outage lasting ~4 hours. A typo in a command during routine maintenance took down S3's control plane.

What happened: An engineer intended to remove a small number of servers from the S3 billing subsystem (control plane). The command had a typo, removing a much larger set of servers—including critical control plane systems.

Intended: Remove 10 servers from billing subsystem
Actual: Removed hundreds of servers, including:
    - S3 placement subsystem (assigns objects to storage)
    - S3 index subsystem (metadata about object locations)

Impact:
    - Control plane: Can't create new buckets or upload new objects
    - Data plane: Existing objects could still be read (partially)
                  Some GET requests continued working for hours
                  because the data plane had cached metadata

Partial survival: Because S3's data plane was somewhat separated, some GET requests continued working even with control plane down. Not all requests—some metadata lookups still depended on control plane—but separation prevented total failure.

Recovery complexity: Restarting control plane required reloading massive amounts of state (metadata for billions of objects), which took hours. Systems weren't designed for cold-start at that scale.

Lesson: Separate control and data planes physically (different servers). Design data plane to cache control plane state locally. Test control plane failure scenarios regularly.

# Real-World Example: AWS DynamoDB DNS Outage (October 2025)

On October 19-20, 2025, DynamoDB in us-east-1 suffered a control plane failure that cascaded across multiple AWS services.

Root cause: A race condition in DynamoDB's DNS management (control plane) caused all IP addresses for the regional DynamoDB endpoint to be removed from DNS.

Self-dependency problem: DynamoDB's control plane used DynamoDB itself for state management. When DNS failed, the control plane couldn't query DynamoDB to fix DNS—circular dependency.

DynamoDB DNS Control Plane:
    Uses DynamoDB to store endpoint state
         |
         v
    DNS fails --> Can't reach DynamoDB
         |
         v
    Can't query DynamoDB to fix DNS --> Stuck

Self-dependency created unrecoverable failure mode

Cascade failure: EC2 instance launches depended on DynamoDB for lease management (control plane). When DynamoDB DNS failed, EC2 couldn't launch instances. When EC2 couldn't launch, Lambda/ECS/Fargate (which autoscale on EC2) also failed.

DynamoDB DNS fails
    |
    v
DynamoDB control plane unreachable
    |
    v
EC2 lease management fails (depends on DynamoDB)
    |
    v
EC2 instance launches fail
    |
    v
Lambda/ECS/Fargate can't autoscale --> cascade outage

Lesson: Control plane should never depend on itself (or systems it controls) for critical recovery functions. Use external, simple, reliable systems for control plane bootstrapping (e.g., static config files, separate simple database, not the same database you're managing).
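
A minimal sketch of that lesson, with a hypothetical file path and format (not AWS's actual mechanism): the control plane's recovery path reads bootstrap endpoints from a static local file, so it never has to query the system it manages in order to recover that system.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// loadBootstrapEndpoints parses one "name=ip" pair per line from a static
// file shipped with the binary: no network call, no managed database.
func loadBootstrapEndpoints(path string) (map[string]string, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	endpoints := make(map[string]string)
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blanks and comments
		}
		if name, ip, ok := strings.Cut(line, "="); ok {
			endpoints[strings.TrimSpace(name)] = strings.TrimSpace(ip)
		}
	}
	return endpoints, sc.Err()
}

func main() {
	// Hypothetical file, e.g. /etc/control-plane/bootstrap.conf containing:
	//   dynamodb-internal=10.0.1.15
	eps, err := loadBootstrapEndpoints("/etc/control-plane/bootstrap.conf")
	if err != nil {
		fmt.Println("bootstrap file unavailable:", err)
		return
	}
	fmt.Println("recovery can start from:", eps)
}
```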

# Metadata vs Data Separation

A specific case of control/data plane separation: metadata (where data is) vs actual data.

Distributed Storage Example

Metadata Service (Control Plane):
    - Tracks which nodes store which blocks
    - File: /home/user/data.txt
        Block 1: stored on Node A, Node B (replicas)
        Block 2: stored on Node C, Node D (replicas)

Data Service (Data Plane):
    - Stores actual blocks
    - Serves read/write requests
    - Uses metadata service to find blocks

Separation:
    Metadata failure: Can't find new files, can't write
                      But cached metadata allows reading existing files
    Data node failure: Only blocks on that node unavailable
                       Metadata reroutes to replicas

Best practice: Keep metadata and data on separate servers. Metadata service should be highly available (Raft/Paxos consensus). Data service should be horizontally scalable.
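
As a sketch of how this separation looks from a client's point of view (the `MetadataService` and `DataNode` interfaces here are hypothetical): the client resolves block locations through the metadata service, caches the mapping, and reads blocks from data nodes. If the metadata service is down, files with cached locations remain readable; if a data node is down, the read falls back to another replica.

```go
package storage

import (
	"errors"
	"fmt"
)

// MetadataService is the control plane: it knows where blocks live.
type MetadataService interface {
	// Locate returns, for each block of the file, the data nodes holding replicas.
	Locate(file string) ([][]string, error)
}

// DataNode is the data plane: it stores and serves actual blocks.
type DataNode interface {
	ReadBlock(file string, idx int) ([]byte, error)
}

type Client struct {
	meta  MetadataService
	nodes map[string]DataNode
	cache map[string][][]string // last-known block locations per file
}

// ReadFile resolves block locations via the metadata service, caching them
// so that reads of known files keep working if the metadata service is down.
func (c *Client) ReadFile(file string) ([]byte, error) {
	locs, err := c.meta.Locate(file)
	if err == nil {
		c.cache[file] = locs // remember last-known-good locations
	} else {
		cached, ok := c.cache[file]
		if !ok {
			return nil, errors.New("metadata unavailable and no cached locations")
		}
		locs = cached // metadata service down: use the cached mapping
	}

	var out []byte
	for i, replicas := range locs {
		block, err := c.readAnyReplica(file, i, replicas)
		if err != nil {
			return nil, fmt.Errorf("block %d unreadable on all replicas: %w", i, err)
		}
		out = append(out, block...)
	}
	return out, nil
}

// readAnyReplica tries each replica in turn; a single data node failure only
// affects blocks with no surviving replica.
func (c *Client) readAnyReplica(file string, idx int, replicas []string) ([]byte, error) {
	var lastErr error = errors.New("no replicas listed")
	for _, nodeID := range replicas {
		node, ok := c.nodes[nodeID]
		if !ok {
			lastErr = fmt.Errorf("unknown data node %s", nodeID)
			continue
		}
		b, err := node.ReadBlock(file, idx)
		if err == nil {
			return b, nil
		}
		lastErr = err
	}
	return nil, lastErr
}
```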

# Design Patterns for Separation

Pattern 1: Cached Control Plane State

Data plane caches control plane state locally. Control plane failure doesn't immediately impact data plane.

Data Plane Node:
    - Fetches config from control plane every 60s
    - Caches config locally (in memory or disk)
    - Uses cached config to serve requests
    - If control plane unreachable, continues with cached config

Graceful degradation: No new config updates, but service continues
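
A minimal Go sketch of this pattern, assuming a hypothetical control plane endpoint that serves JSON config: the node refreshes every 60 seconds, and a failed refresh only means it keeps serving with the last config it successfully fetched.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"sync/atomic"
	"time"
)

type Config struct {
	RateLimit int      `json:"rate_limit"`
	Backends  []string `json:"backends"`
}

// current always holds the last-known-good config.
var current atomic.Pointer[Config]

func fetchConfig(url string) (*Config, error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	var c Config
	if err := json.NewDecoder(resp.Body).Decode(&c); err != nil {
		return nil, err
	}
	return &c, nil
}

// refreshLoop polls the control plane; failures are logged, not fatal.
func refreshLoop(url string) {
	for {
		if c, err := fetchConfig(url); err != nil {
			log.Printf("config refresh failed, continuing with cached config: %v", err)
		} else {
			current.Store(c)
		}
		time.Sleep(60 * time.Second)
	}
}

func main() {
	current.Store(&Config{RateLimit: 100}) // bootstrap default (or disk cache)
	go refreshLoop("http://control-plane.internal/config")

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		cfg := current.Load() // never blocks on the control plane
		fmt.Fprintf(w, "serving with rate limit %d\n", cfg.RateLimit)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

In practice the cached config would also be persisted to disk so a restarted node can serve before its first successful fetch.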

Pattern 2: Separate Resource Pools

Run control plane and data plane on separate hardware/VMs. Prevents resource contention.

Control Plane: 3 dedicated servers (small, for coordination)
Data Plane: 100 servers (large, for handling user traffic)

Data plane CPU spike doesn't starve control plane
Control plane memory leak doesn't OOM data plane
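
One small, hedged illustration of enforcing this at deploy time (host names are made up): validate that the control plane and data plane host pools are disjoint before rolling out, so the two planes can never end up competing for the same machines.

```go
package main

import "fmt"

// disjoint reports whether two host pools share any machine; if they do,
// it returns the first shared host it finds.
func disjoint(a, b []string) (string, bool) {
	seen := make(map[string]bool, len(a))
	for _, h := range a {
		seen[h] = true
	}
	for _, h := range b {
		if seen[h] {
			return h, false
		}
	}
	return "", true
}

func main() {
	controlPlaneHosts := []string{"cp-1", "cp-2", "cp-3"}    // small, dedicated pool
	dataPlaneHosts := []string{"dp-001", "dp-002", "dp-003"} // large, scaled-out pool
	if host, ok := disjoint(controlPlaneHosts, dataPlaneHosts); !ok {
		panic(fmt.Sprintf("host %s appears in both pools: the planes would share resources", host))
	}
	fmt.Println("pools are disjoint: a data plane CPU spike cannot starve the control plane")
}
```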

Pattern 3: Read-Only Data Plane

Control plane handles writes (slow, needs coordination). Data plane handles reads (fast, cacheable).

Write Request:
    Client --> Control Plane (validate, coordinate, write)
            --> Replicate to data plane

Read Request:
    Client --> Data Plane (cached, fast, no control plane query)

Control plane failure: Writes fail, reads continue
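
A sketch of this pattern with hypothetical types: writes are validated and coordinated by the control plane and then replicated to the data plane's local copy; reads are served entirely from that copy, so a control plane outage stops writes but not reads.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// DataPlane holds a local, read-only copy replicated from the control plane.
type DataPlane struct {
	mu      sync.RWMutex
	replica map[string]string
}

// apply is how replicated writes arrive from the control plane.
func (dp *DataPlane) apply(key, value string) {
	dp.mu.Lock()
	defer dp.mu.Unlock()
	dp.replica[key] = value
}

// Read never contacts the control plane.
func (dp *DataPlane) Read(key string) (string, bool) {
	dp.mu.RLock()
	defer dp.mu.RUnlock()
	v, ok := dp.replica[key]
	return v, ok
}

// ControlPlane owns the write path: validation, coordination, replication.
type ControlPlane struct {
	up       bool
	replicas []*DataPlane
}

func (cp *ControlPlane) Write(key, value string) error {
	if !cp.up {
		return errors.New("control plane down: writes unavailable")
	}
	if key == "" {
		return errors.New("invalid key") // stand-in for real validation
	}
	for _, dp := range cp.replicas {
		dp.apply(key, value)
	}
	return nil
}

func main() {
	dp := &DataPlane{replica: make(map[string]string)}
	cp := &ControlPlane{up: true, replicas: []*DataPlane{dp}}

	_ = cp.Write("greeting", "hello") // write path while the control plane is up

	cp.up = false                       // simulate control plane failure
	fmt.Println(cp.Write("other", "x")) // writes fail...
	fmt.Println(dp.Read("greeting"))    // ...reads continue from the local replica
}
```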

# Key Takeaways

  • Control plane manages config/metadata; data plane handles user requests
  • Separate them physically (different servers) to prevent failure cascade
  • Data plane should cache control plane state for graceful degradation
  • S3 2017: The data plane partially survived the control plane failure because it had cached metadata
  • DynamoDB 2025: Self-dependency (control plane using DynamoDB) created unrecoverable failure
  • Never have control plane depend on itself or systems it controls for recovery
  • Metadata and data should be separated (different servers, different scaling properties)