Graceful Degradation
Perfect reliability is impossible. Systems will fail. Graceful degradation means your system continues providing reduced functionality rather than failing completely. This section covers patterns for handling failures gracefully: retries, circuit breakers, timeouts, health checks, and bulkheading.
# Retry Strategies
Transient failures (network blips, temporary overload) can be resolved by retrying. But naive retries make things worse.
Naive Retry (Bad)
Pseudocode:

    attempt = 0
    while attempt < MAX_RETRIES:
        result = call_service()
        if result.success:
            return result
        attempt += 1
        # No delay, retry immediately
Problem: if the service is overloaded, immediate retries amplify the load.
All clients retry simultaneously --> thundering herd
Exponential Backoff (Better)
Pseudocode:

    attempt = 0
    base_delay = 100ms   # Initial delay
    max_delay = 60s      # Cap delay at 60 seconds
    while attempt < MAX_RETRIES:
        result = call_service()
        if result.success:
            return result
        delay = min(base_delay * (2 ^ attempt), max_delay)
        sleep(delay)
        attempt += 1
    raise ServiceUnavailableError()
Delays: 100ms, 200ms, 400ms, 800ms, 1600ms, ...
Gives service time to recover, spreads retries over time
Exponential Backoff with Jitter (Best)
Pseudocode:

    attempt = 0
    base_delay = 100ms
    max_delay = 60s
    while attempt < MAX_RETRIES:
        result = call_service()
        if result.success:
            return result
        delay = min(base_delay * (2 ^ attempt), max_delay)
        jittered_delay = delay * (0.5 + random(0, 0.5))
        sleep(jittered_delay)
        attempt += 1
    raise ServiceUnavailableError()
Jitter randomizes the delays, which prevents synchronized retries: if 1000 clients all failed at the same time, their retries are spread out rather than landing in the same instant.
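For concreteness, here is the jittered loop as a minimal runnable Python sketch; `call_with_backoff`, `TransientError`, and the constants are illustrative names for this sketch, not a prescribed API.

    import random
    import time

    class TransientError(Exception):
        """Placeholder for whatever transient failure your client raises."""

    class ServiceUnavailableError(Exception):
        """Raised once all retries are exhausted."""

    MAX_RETRIES = 5
    BASE_DELAY = 0.1   # 100ms initial delay
    MAX_DELAY = 60.0   # cap each delay at 60 seconds

    def call_with_backoff(operation):
        for attempt in range(MAX_RETRIES):
            try:
                return operation()
            except TransientError:
                delay = min(BASE_DELAY * (2 ** attempt), MAX_DELAY)
                # Jitter: sleep between 50% and 100% of the computed delay
                time.sleep(delay * (0.5 + random.random() * 0.5))
        raise ServiceUnavailableError("all retries exhausted")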
When to retry: idempotent operations (GET, PUT with the same data), transient network errors (connection timeouts, 503 Service Unavailable), read operations.
When NOT to retry: non-idempotent operations (a POST that creates a resource would create it twice), 4xx client errors (they won't succeed on retry), operations with side effects (payment processing).
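As a small illustration of these rules, here is a predicate a retry loop could consult before retrying an HTTP call; the retryable codes beyond the 503 mentioned above (429, 502, 504) are a common but assumed choice to adjust for your service.

    IDEMPOTENT_METHODS = {"GET", "HEAD", "PUT", "DELETE"}
    RETRYABLE_STATUS = {429, 502, 503, 504}   # transient / overload responses

    def should_retry(method, status_code):
        # Never retry non-idempotent operations: a duplicate POST could create
        # a duplicate resource or double-charge a payment.
        if method.upper() not in IDEMPOTENT_METHODS:
            return False
        # Other 4xx client errors won't succeed on retry, so only retry
        # the transient/overload responses listed above.
        return status_code in RETRYABLE_STATUS

    # should_retry("GET", 503) -> True; should_retry("POST", 503) -> False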
# Circuit Breaker Pattern
Circuit breakers prevent calling a failing service repeatedly. If a service is down, stop calling it for a period (fail fast), then test recovery.
Circuit Breaker States
States:
- CLOSED: Normal operation, calls go through
- OPEN: Service is failing, reject calls immediately
- HALF_OPEN: Testing whether the service has recovered

State Transitions:
- CLOSED --> OPEN: after N consecutive failures
- OPEN --> HALF_OPEN: after the timeout period elapses
- HALF_OPEN --> CLOSED: if the test request succeeds
- HALF_OPEN --> OPEN: if the test request fails
Circuit Breaker Pseudocode
    class CircuitBreaker:
        state = CLOSED
        failure_count = 0
        success_count = 0
        failure_threshold = 5   # Open after 5 failures
        success_threshold = 2   # Close after 2 successes in half-open
        timeout = 60s           # Time before testing recovery
        last_failure_time = null

        function call(operation):
            if state == OPEN:
                if now() - last_failure_time > timeout:
                    state = HALF_OPEN
                    success_count = 0
                else:
                    raise CircuitOpenError("Service unavailable")

            if state == HALF_OPEN:
                # Allow limited requests through to test recovery
                try:
                    result = operation()
                    success_count += 1
                    if success_count >= success_threshold:
                        state = CLOSED
                        failure_count = 0
                    return result
                except Exception as e:
                    state = OPEN
                    last_failure_time = now()
                    raise e

            if state == CLOSED:
                try:
                    result = operation()
                    failure_count = 0   # Reset on success
                    return result
                except Exception as e:
                    failure_count += 1
                    last_failure_time = now()
                    if failure_count >= failure_threshold:
                        state = OPEN
                    raise e
    # Usage:
    circuit = CircuitBreaker()
    try:
        result = circuit.call(lambda: fetch_user_data(user_id))
    except CircuitOpenError:
        # Fail fast: return cached data or an error to the user
        return cached_user_data(user_id)
Why circuit breakers help: they prevent wasting resources on calls to a dead service, give the service time to recover (stop hammering it), fail fast (the user gets an error immediately instead of waiting for a timeout), and allow monitoring/alerting on circuit state changes.
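To make the monitoring point concrete, here is a compact runnable Python rendering of the pseudocode above with an optional state-change hook; the hook, the constructor parameters, and the use of time.monotonic() are implementation choices for this sketch, not a standard API.

    import time

    CLOSED, OPEN, HALF_OPEN = "CLOSED", "OPEN", "HALF_OPEN"

    class CircuitOpenError(Exception):
        pass

    class CircuitBreaker:
        def __init__(self, failure_threshold=5, success_threshold=2,
                     timeout=60.0, on_state_change=None):
            self.state = CLOSED
            self.failure_count = 0
            self.success_count = 0
            self.failure_threshold = failure_threshold
            self.success_threshold = success_threshold
            self.timeout = timeout
            self.last_failure_time = None
            # Optional hook for monitoring/alerting on state changes
            self.on_state_change = on_state_change or (lambda old, new: None)

        def _set_state(self, new_state):
            old, self.state = self.state, new_state
            if old != new_state:
                self.on_state_change(old, new_state)

        def call(self, operation):
            if self.state == OPEN:
                if time.monotonic() - self.last_failure_time > self.timeout:
                    self._set_state(HALF_OPEN)   # test recovery
                    self.success_count = 0
                else:
                    raise CircuitOpenError("Service unavailable")
            try:
                result = operation()
            except Exception:
                self.failure_count += 1
                self.last_failure_time = time.monotonic()
                # Any failure in HALF_OPEN, or too many in CLOSED, opens the circuit
                if self.state == HALF_OPEN or self.failure_count >= self.failure_threshold:
                    self._set_state(OPEN)
                raise
            if self.state == HALF_OPEN:
                self.success_count += 1
                if self.success_count >= self.success_threshold:
                    self._set_state(CLOSED)
                    self.failure_count = 0
            else:
                self.failure_count = 0   # reset on success in CLOSED
            return result

    # Usage: alert or log whenever the breaker changes state
    breaker = CircuitBreaker(on_state_change=lambda old, new: print(f"circuit: {old} -> {new}"))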
# Timeout Configuration
Timeouts prevent waiting forever for failed operations. But setting them wrong causes problems.
Too Long
Timeout: 60 seconds
- Service is down; it takes 60s to time out
- User waits 60s for an error
- Thread/connection is blocked for 60s
- With many concurrent requests, the thread pool is exhausted --> cascade failure
Too Short
Timeout: 100ms
- Service normally responds in 90ms
- A network spike pushes the response to 150ms
- The timeout fires and the request is aborted
- The service actually succeeded, but the client thinks it failed
- Retries increase load, making the problem worse
Right-Sizing Timeouts
Guideline: Set timeout to p99 latency + buffer. If 99% of requests complete in 500ms, set timeout to 1000ms (2x p99).
Latency Distribution:
p50: 100ms
p95: 300ms
p99: 500ms
p99.9: 2000ms
Timeout Setting:
Conservative: 2x p99 = 1000ms (catches most slow requests)
Aggressive: 1.5x p99 = 750ms (fails fast, may false-positive on slow requests)
Monitor timeout rates: if more than 1% of requests are timing out, either increase the timeout or fix the underlying slowness.
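A minimal sketch of the sizing rule, assuming you have a sample of observed latencies: compute p99 with the standard library and pass roughly 2x that value to a client that accepts a per-request timeout (requests is used here purely as an example of such a client).

    import statistics
    import requests   # example client; any client with a timeout parameter works

    def timeout_from_latencies(latencies_s, multiplier=2.0):
        # statistics.quantiles with n=100 returns 99 cut points; index 98 is p99
        p99 = statistics.quantiles(latencies_s, n=100)[98]
        return p99 * multiplier

    observed = [0.08, 0.09, 0.10, 0.12, 0.30, 0.45, 0.50]   # sample latencies (seconds)
    timeout_s = timeout_from_latencies(observed)             # roughly 2x p99

    response = requests.get("https://example.com/api", timeout=timeout_s)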
# Health Checks That Actually Work
Health checks determine whether a service instance is healthy. Poor health checks cause false positives (marking a healthy instance unhealthy) or false negatives (missing actual failures).
Bad Health Check
    GET /health
        return HTTP 200 OK

Always returns success, even if:
- Database connection is down
- Disk is full
- Memory exhausted
- Critical dependency unavailable

The load balancer thinks the instance is healthy and keeps routing traffic to it, so requests fail and users see errors.
Good Health Check
    GET /health
        check_database_connection()
        check_dependency_connectivity()
        check_disk_space()
        check_memory_available()
        if all_checks_pass:
            return HTTP 200 OK
        else:
            return HTTP 503 Service Unavailable

The load balancer removes unhealthy instances from the pool, so traffic routes only to healthy instances.
Deep vs Shallow Health Checks:
- Shallow (liveness): Is process alive? Can it accept requests? Fast, frequent (every 1s).
- Deep (readiness): Can it handle requests successfully? Check dependencies. Slower, less frequent (every 10s).
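As a sketch of the shallow/deep split, here are two endpoints using Flask (any web framework works); the route names and the check_* helpers are hypothetical stand-ins for your real probes and dependency checks.

    from flask import Flask

    app = Flask(__name__)

    def check_database_connection():
        # Hypothetical: e.g. run `SELECT 1` against the primary database
        return True

    def check_disk_space():
        # Hypothetical: e.g. compare shutil.disk_usage("/").free to a threshold
        return True

    @app.route("/livez")
    def liveness():
        # Shallow check: the process is up and can accept requests
        return "ok", 200

    @app.route("/readyz")
    def readiness():
        # Deep check: verify critical dependencies before accepting traffic
        if check_database_connection() and check_disk_space():
            return "ok", 200
        return "unavailable", 503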
Avoid:
- Health checks that are too expensive (calling every dependency on each check adds load)
- Health checks that modify state (don't write to the DB during a health check)
- Health checks that always succeed (useless) or always fail (removes every instance from the pool)
# Bulkheading: Isolate Failures
Bulkheads (the compartments that keep flooding contained to one section of a ship) isolate failures so they cannot cascade through the whole system.
Thread Pool Bulkheading
Without Bulkheading:
100 threads (shared pool)
Slow dependency exhausts all threads
No threads left for other requests
Entire service down
With Bulkheading:
50 threads for Dependency A
30 threads for Dependency B
20 threads for Dependency C
If Dependency A is slow, it ties up at most its 50 threads
But B and C still have their threads
Partial service degradation, not total failure
Resource isolation: Separate thread pools, connection pools, memory pools per dependency. One dependency failure doesn't starve others.
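A minimal sketch of per-dependency thread pools using Python's concurrent.futures, mirroring the 50/30/20 split above; the dependency names and pool sizes are illustrative.

    from concurrent.futures import ThreadPoolExecutor

    # One bounded pool per dependency: a slow dependency can only occupy its own workers
    pools = {
        "dependency_a": ThreadPoolExecutor(max_workers=50),
        "dependency_b": ThreadPoolExecutor(max_workers=30),
        "dependency_c": ThreadPoolExecutor(max_workers=20),
    }

    def call_dependency(name, fn, *args, timeout=1.0):
        # Submit only to that dependency's pool; if dependency_a stalls,
        # dependency_b and dependency_c still have idle workers.
        future = pools[name].submit(fn, *args)
        return future.result(timeout=timeout)

Note that a timed-out result() still leaves that worker busy until the underlying call returns, which is exactly why each pool is bounded per dependency rather than shared.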
# Key Takeaways
- Retries with exponential backoff + jitter prevent thundering herd
- Circuit breakers fail fast when service is down, give it time to recover
- Circuit breaker states: CLOSED (normal), OPEN (failing), HALF_OPEN (testing recovery)
- Timeouts should be 1.5-2x p99 latency; too short causes false failures, too long blocks resources
- Health checks must validate dependencies, not just "process alive"
- Bulkheading isolates resource pools per dependency to prevent cascade failures
- Fail fast with circuit breakers + short timeouts, then return cached/default data