Graceful Degradation
Perfect reliability is impossible. Systems will fail. Graceful degradation means your system continues providing reduced functionality rather than failing completely. This section covers patterns for handling failures gracefully: retries, circuit breakers, timeouts, health checks, and bulkheading.
# Retry Strategies
Transient failures (network blips, temporary overload) can be resolved by retrying. But naive retries make things worse.
Naive Retry (Bad)
Pseudocode:

    attempt = 0
    while attempt < MAX_RETRIES:
        result = call_service()
        if result.success:
            return result
        attempt += 1
        # No delay, retry immediately
Problem: if the service is overloaded, immediate retries amplify the load.
All clients retry simultaneously --> thundering herd
Exponential Backoff (Better)
Pseudocode:

    attempt = 0
    base_delay = 100ms   # Initial delay
    max_delay = 60s      # Cap delay at 60 seconds
    while attempt < MAX_RETRIES:
        result = call_service()
        if result.success:
            return result
        delay = min(base_delay * (2 ^ attempt), max_delay)
        sleep(delay)
        attempt += 1
    raise ServiceUnavailableError()
Delays: 100ms, 200ms, 400ms, 800ms, 1600ms, ...
Gives service time to recover, spreads retries over time
Exponential Backoff with Jitter (Best)
Pseudocode:

    attempt = 0
    base_delay = 100ms
    max_delay = 60s
    while attempt < MAX_RETRIES:
        result = call_service()
        if result.success:
            return result
        delay = min(base_delay * (2 ^ attempt), max_delay)
        jittered_delay = delay * (0.5 + random(0, 0.5))
        sleep(jittered_delay)
        attempt += 1
    raise ServiceUnavailableError()
Jitter randomizes the delays, which prevents synchronized retries: if 1000 clients all failed at the same time, their retries are spread out rather than landing in the same instant.
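For concreteness, here is the jittered loop as a minimal runnable Python sketch; `call_with_backoff`, `TransientError`, and the constants are illustrative names for this sketch, not a prescribed API.

    import random
    import time

    class TransientError(Exception):
        """Placeholder for whatever transient failure your client raises."""

    class ServiceUnavailableError(Exception):
        """Raised once all retries are exhausted."""

    MAX_RETRIES = 5
    BASE_DELAY = 0.1   # 100ms initial delay
    MAX_DELAY = 60.0   # cap each delay at 60 seconds

    def call_with_backoff(operation):
        for attempt in range(MAX_RETRIES):
            try:
                return operation()
            except TransientError:
                delay = min(BASE_DELAY * (2 ** attempt), MAX_DELAY)
                # Jitter: sleep between 50% and 100% of the computed delay
                time.sleep(delay * (0.5 + random.random() * 0.5))
        raise ServiceUnavailableError("all retries exhausted")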
When to retry: idempotent operations (GET, PUT with the same data), transient network errors (connection timeouts, 503 Service Unavailable), read operations.
When NOT to retry: non-idempotent operations (a POST that creates a resource would create it twice), 4xx client errors (they won't succeed on retry), operations with side effects (payment processing).
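As a small illustration of these rules, here is a predicate a retry loop could consult before retrying an HTTP call; the retryable codes beyond the 503 mentioned above (429, 502, 504) are a common but assumed choice to adjust for your service.

    IDEMPOTENT_METHODS = {"GET", "HEAD", "PUT", "DELETE"}
    RETRYABLE_STATUS = {429, 502, 503, 504}   # transient / overload responses

    def should_retry(method, status_code):
        # Never retry non-idempotent operations: a duplicate POST could create
        # a duplicate resource or double-charge a payment.
        if method.upper() not in IDEMPOTENT_METHODS:
            return False
        # Other 4xx client errors won't succeed on retry, so only retry
        # the transient/overload responses listed above.
        return status_code in RETRYABLE_STATUS

    # should_retry("GET", 503) -> True; should_retry("POST", 503) -> False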
# Circuit Breaker Pattern
Circuit breakers prevent calling a failing service repeatedly. If a service is down, stop calling it for a period (fail fast), then test recovery.
Circuit Breaker States
States:
- CLOSED: Normal operation, calls go through
- OPEN: Service is failing, reject calls immediately
- HALF_OPEN: Testing whether the service has recovered

State Transitions:
- CLOSED --> OPEN: after N consecutive failures
- OPEN --> HALF_OPEN: after the timeout period elapses
- HALF_OPEN --> CLOSED: if the test request succeeds
- HALF_OPEN --> OPEN: if the test request fails
Circuit Breaker Pseudocode
    class CircuitBreaker:
        state = CLOSED
        failure_count = 0
        success_count = 0
        failure_threshold = 5   # Open after 5 failures
        success_threshold = 2   # Close after 2 successes in half-open
        timeout = 60s           # Time before testing recovery
        last_failure_time = null

        function call(operation):
            if state == OPEN:
                if now() - last_failure_time > timeout:
                    state = HALF_OPEN
                    success_count = 0
                else:
                    raise CircuitOpenError("Service unavailable")

            if state == HALF_OPEN:
                # Allow limited requests through to test recovery
                try:
                    result = operation()
                    success_count += 1
                    if success_count >= success_threshold:
                        state = CLOSED
                        failure_count = 0
                    return result
                except Exception as e:
                    state = OPEN
                    last_failure_time = now()
                    raise e

            if state == CLOSED:
                try:
                    result = operation()
                    failure_count = 0   # Reset on success
                    return result
                except Exception as e:
                    failure_count += 1
                    last_failure_time = now()
                    if failure_count >= failure_threshold:
                        state = OPEN
                    raise e
    # Usage:
    circuit = CircuitBreaker()
    try:
        result = circuit.call(lambda: fetch_user_data(user_id))
    except CircuitOpenError:
        # Fail fast: return cached data or an error to the user
        return cached_user_data(user_id)
Why circuit breakers help: they prevent wasting resources on calls to a dead service, give the service time to recover (stop hammering it), fail fast (the user gets an error immediately instead of waiting for a timeout), and allow monitoring/alerting on circuit state changes.
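To make the monitoring point concrete, here is a compact runnable Python rendering of the pseudocode above with an optional state-change hook; the hook, the constructor parameters, and the use of time.monotonic() are implementation choices for this sketch, not a standard API.

    import time

    CLOSED, OPEN, HALF_OPEN = "CLOSED", "OPEN", "HALF_OPEN"

    class CircuitOpenError(Exception):
        pass

    class CircuitBreaker:
        def __init__(self, failure_threshold=5, success_threshold=2,
                     timeout=60.0, on_state_change=None):
            self.state = CLOSED
            self.failure_count = 0
            self.success_count = 0
            self.failure_threshold = failure_threshold
            self.success_threshold = success_threshold
            self.timeout = timeout
            self.last_failure_time = None
            # Optional hook for monitoring/alerting on state changes
            self.on_state_change = on_state_change or (lambda old, new: None)

        def _set_state(self, new_state):
            old, self.state = self.state, new_state
            if old != new_state:
                self.on_state_change(old, new_state)

        def call(self, operation):
            if self.state == OPEN:
                if time.monotonic() - self.last_failure_time > self.timeout:
                    self._set_state(HALF_OPEN)   # test recovery
                    self.success_count = 0
                else:
                    raise CircuitOpenError("Service unavailable")
            try:
                result = operation()
            except Exception:
                self.failure_count += 1
                self.last_failure_time = time.monotonic()
                # Any failure in HALF_OPEN, or too many in CLOSED, opens the circuit
                if self.state == HALF_OPEN or self.failure_count >= self.failure_threshold:
                    self._set_state(OPEN)
                raise
            if self.state == HALF_OPEN:
                self.success_count += 1
                if self.success_count >= self.success_threshold:
                    self._set_state(CLOSED)
                    self.failure_count = 0
            else:
                self.failure_count = 0   # reset on success in CLOSED
            return result

    # Usage: alert or log whenever the breaker changes state
    breaker = CircuitBreaker(on_state_change=lambda old, new: print(f"circuit: {old} -> {new}"))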
# Timeout Configuration
Timeouts prevent waiting forever for failed operations. But setting them wrong causes problems.
Too Long
Timeout: 60 seconds
- Service is down; it takes 60s to time out
- User waits 60s for an error
- Thread/connection is blocked for 60s
- With many concurrent requests, the thread pool is exhausted --> cascade failure
Too Short
Timeout: 100ms
- Service normally responds in 90ms
- A network spike pushes the response to 150ms
- The timeout fires and the request is aborted
- The service actually succeeded, but the client thinks it failed
- Retries increase load, making the problem worse
Right-Sizing Timeouts
Guideline: Set timeout to p99 latency + buffer. If 99% of requests complete in 500ms, set timeout to 1000ms (2x p99).
Latency Distribution:
p50: 100ms
p95: 300ms
p99: 500ms
p99.9: 2000ms
Timeout Setting:
Conservative: 2x p99 = 1000ms (catches most slow requests)
Aggressive: 1.5x p99 = 750ms (fails fast, may false-positive on slow requests)
Monitor timeout rates: if more than 1% of requests are timing out, either increase the timeout or fix the underlying slowness.
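A minimal sketch of the sizing rule, assuming you have a sample of observed latencies: compute p99 with the standard library and pass roughly 2x that value to a client that accepts a per-request timeout (requests is used here purely as an example of such a client).

    import statistics
    import requests   # example client; any client with a timeout parameter works

    def timeout_from_latencies(latencies_s, multiplier=2.0):
        # statistics.quantiles with n=100 returns 99 cut points; index 98 is p99
        p99 = statistics.quantiles(latencies_s, n=100)[98]
        return p99 * multiplier

    observed = [0.08, 0.09, 0.10, 0.12, 0.30, 0.45, 0.50]   # sample latencies (seconds)
    timeout_s = timeout_from_latencies(observed)             # roughly 2x p99

    response = requests.get("https://example.com/api", timeout=timeout_s)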
# Health Checks That Actually Work
Health checks determine whether a service instance is healthy. Poor health checks cause false positives (marking a healthy instance unhealthy) or false negatives (missing actual failures).
Bad Health Check
    GET /health
        return HTTP 200 OK

Always returns success, even if:
- Database connection is down
- Disk is full
- Memory exhausted
- Critical dependency unavailable

The load balancer thinks the instance is healthy and keeps routing traffic to it, so requests fail and users see errors.
Good Health Check
    GET /health
        check_database_connection()
        check_dependency_connectivity()
        check_disk_space()
        check_memory_available()
        if all_checks_pass:
            return HTTP 200 OK
        else:
            return HTTP 503 Service Unavailable

The load balancer removes unhealthy instances from the pool, so traffic routes only to healthy instances.
Deep vs Shallow Health Checks:
- Shallow (liveness): Is process alive? Can it accept requests? Fast, frequent (every 1s).
- Deep (readiness): Can it handle requests successfully? Check dependencies. Slower, less frequent (every 10s).
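As a sketch of the shallow/deep split, here are two endpoints using Flask (any web framework works); the route names and the check_* helpers are hypothetical stand-ins for your real probes and dependency checks.

    from flask import Flask

    app = Flask(__name__)

    def check_database_connection():
        # Hypothetical: e.g. run `SELECT 1` against the primary database
        return True

    def check_disk_space():
        # Hypothetical: e.g. compare shutil.disk_usage("/").free to a threshold
        return True

    @app.route("/livez")
    def liveness():
        # Shallow check: the process is up and can accept requests
        return "ok", 200

    @app.route("/readyz")
    def readiness():
        # Deep check: verify critical dependencies before accepting traffic
        if check_database_connection() and check_disk_space():
            return "ok", 200
        return "unavailable", 503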
Avoid:
- Health checks that are too expensive (calling every dependency on each check adds load)
- Health checks that modify state (don't write to the DB during a health check)
- Health checks that always succeed (useless) or always fail (removes every instance from the pool)
# Bulkheading: Isolate Failures
Bulkheads (the compartments that keep flooding contained to one section of a ship) isolate failures so they cannot cascade through the whole system.
Thread Pool Bulkheading
Without Bulkheading:
100 threads (shared pool)
Slow dependency exhausts all threads
No threads left for other requests
Entire service down
With Bulkheading:
50 threads for Dependency A
30 threads for Dependency B
20 threads for Dependency C
If Dependency A is slow, it ties up at most its 50 threads
But B and C still have their threads
Partial service degradation, not total failure
Resource isolation: Separate thread pools, connection pools, memory pools per dependency. One dependency failure doesn't starve others.
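A minimal sketch of per-dependency thread pools using Python's concurrent.futures, mirroring the 50/30/20 split above; the dependency names and pool sizes are illustrative.

    from concurrent.futures import ThreadPoolExecutor

    # One bounded pool per dependency: a slow dependency can only occupy its own workers
    pools = {
        "dependency_a": ThreadPoolExecutor(max_workers=50),
        "dependency_b": ThreadPoolExecutor(max_workers=30),
        "dependency_c": ThreadPoolExecutor(max_workers=20),
    }

    def call_dependency(name, fn, *args, timeout=1.0):
        # Submit only to that dependency's pool; if dependency_a stalls,
        # dependency_b and dependency_c still have idle workers.
        future = pools[name].submit(fn, *args)
        return future.result(timeout=timeout)

Note that a timed-out result() still leaves that worker busy until the underlying call returns, which is exactly why each pool is bounded per dependency rather than shared.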
# Key Takeaways
- Retries with exponential backoff + jitter prevent thundering herd
- Circuit breakers fail fast when service is down, give it time to recover
- Circuit breaker states: CLOSED (normal), OPEN (failing), HALF_OPEN (testing recovery)
- Timeouts should be 1.5-2x p99 latency; too short causes false failures, too long blocks resources
- Health checks must validate dependencies, not just "process alive"
- Bulkheading isolates resource pools per dependency to prevent cascade failures
- Fail fast with circuit breakers + short timeouts, then return cached/default data