Testing Resilience
You can't know if your failover works until you test it. Many systems have redundancy and circuit breakers on paper, but they fail during real outages because they were never tested. This section covers chaos engineering, fault injection, and game days for building confidence in your resilience mechanisms.
# Chaos Engineering Principles
Chaos engineering is the practice of deliberately injecting failures into production systems to validate resilience. The practice was popularized by Netflix's Chaos Monkey.
Core Principles
- Build a hypothesis: "If we kill one instance, traffic should route to others with <5% error rate increase"
- Inject real-world failures: Instance crashes, network latency, disk full, dependency failures
- Minimize blast radius: Start small (1% traffic, dev environment), expand gradually
- Automate: Run chaos experiments continuously, not just once
- Measure and learn: Did system behave as expected? What broke? Fix it.
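The first principle can be checked mechanically at the end of a run. A minimal sketch, assuming the request and error counts have already been pulled from your metrics system (the numbers here are illustrative):

```shell
#!/usr/bin/env bash
# Compare error rates before and during a chaos experiment and verify
# the hypothesis "error rate increases by less than 5 percentage points".
set -euo pipefail

# Illustrative counts; in practice these come from your metrics system.
baseline_total=10000   baseline_errors=50     # 0.5% baseline error rate
experiment_total=10000 experiment_errors=220  # 2.2% during the experiment

rate() { awk -v e="$1" -v t="$2" 'BEGIN { printf "%.2f", 100 * e / t }'; }

base_rate=$(rate "$baseline_errors" "$baseline_total")
exp_rate=$(rate "$experiment_errors" "$experiment_total")
delta=$(awk -v a="$exp_rate" -v b="$base_rate" 'BEGIN { printf "%.2f", a - b }')

echo "baseline=${base_rate}% experiment=${exp_rate}% delta=${delta}pp"
if awk -v d="$delta" 'BEGIN { exit !(d < 5.0) }'; then
  echo "HYPOTHESIS HOLDS"
else
  echo "HYPOTHESIS VIOLATED"
fi
```

The same comparison can gate an automated experiment: abort and roll back the injection as soon as the delta crosses the threshold.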
Why in Production?
Staging doesn't catch everything: production has real traffic patterns, real data volume, and real dependencies. A staging environment may be too simple, or too different from production, to reveal the failure modes that matter.
Controlled failure is better than unexpected failure: If you inject failures during business hours with engineers watching, you can respond immediately. Real outages happen at 3am on weekends.
# Fault Injection Examples
Different types of failures to test:
Instance/Process Termination
Chaos Monkey (Netflix) randomly terminates EC2 instances in production. This validates that auto-scaling works, that load balancers remove failed instances, and that services handle instance loss gracefully.
Command:
```
# Terminate one instance (pick the ID at random from the group)
aws autoscaling terminate-instance-in-auto-scaling-group \
  --instance-id i-1234567890abcdef0 \
  --should-decrement-desired-capacity false
```
Expected Outcome:
- Load balancer detects failure via health check (10-30s)
- Routes traffic to remaining instances
- Auto-scaling group launches replacement instance
- No user-visible errors
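A probe loop running alongside the experiment can confirm the "no user-visible errors" expectation. The sketch below accepts any health-check command; `stub_check` stands in for a real `curl` probe, and the probe count and interval are placeholders:

```shell
#!/usr/bin/env bash
# Poll a health check at a fixed interval during a chaos experiment and
# report how many probes failed. Pass any command as the check, e.g.:
#   watch_failover 30 1 curl -fsS -o /dev/null https://example.com/health
set -u

watch_failover() {
  local probes="$1" interval="$2"; shift 2
  local failures=0
  for ((i = 1; i <= probes; i++)); do
    if ! "$@" >/dev/null 2>&1; then
      failures=$((failures + 1))
      echo "probe $i: FAIL"
    fi
    sleep "$interval"
  done
  echo "failed $failures of $probes probes"
}

# Demo with a stub check that fails on its third invocation, simulating
# the brief window before the load balancer ejects the dead instance.
n=0
stub_check() { n=$((n + 1)); [ "$n" -ne 3 ]; }
watch_failover 5 0 stub_check
```

Run the loop before, during, and after the kill; the failure count tells you how long the detection window actually was.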
Network Latency Injection
Inject 200ms latency on the path to the database:
```
# Using tc (traffic control) on Linux; requires root. This delays ALL
# egress traffic on eth0; to affect only database traffic, add tc
# filters or run it on the database host.
tc qdisc add dev eth0 root netem delay 200ms
```
Expected Outcome:
- Application queries slow down
- Timeouts may fire if set too aggressively
- Circuit breakers should NOT open (200ms is slow, not a failure)
- User experience degrades but doesn't fail
Cleanup:
```
tc qdisc del dev eth0 root
```
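A forgotten cleanup leaves latency injected indefinitely, so it helps to wrap injection and cleanup in one script with a `trap`. A sketch that defaults to dry-run, printing the commands instead of executing them (real runs need root); `DEV` and `DELAY` are illustrative defaults:

```shell
#!/usr/bin/env bash
# Inject netem delay for a fixed duration, guaranteeing cleanup via trap.
# Defaults to dry-run; set DRY_RUN=0 to execute for real (requires root).
set -euo pipefail

DEV="${DEV:-eth0}"
DELAY="${DELAY:-200ms}"
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

cleanup() { run tc qdisc del dev "$DEV" root; }
trap cleanup EXIT   # cleanup runs when the script exits for any reason

run tc qdisc add dev "$DEV" root netem delay "$DELAY"
run sleep 60        # duration of the experiment
```

Because the cleanup hangs off the `EXIT` trap, the qdisc is removed even if the script is interrupted mid-experiment.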
Packet Loss Injection
Inject 5% packet loss:
```
tc qdisc add dev eth0 root netem loss 5%
```
Expected Outcome:
- TCP retransmits packets (automatic)
- Throughput decreases
- Latency increases (due to retransmits)
- Application should handle gracefully (TCP is reliable)
Dependency Failure
Block access to an external API:
```
# Using iptables to drop packets to the API endpoint; note that
# iptables resolves the hostname to IP addresses when the rule is added
iptables -A OUTPUT -d api.external.com -j DROP
```
Expected Outcome:
- Circuit breaker opens after N failures
- Application returns cached data or degraded response
- Monitoring alerts on circuit breaker state change
- No cascading failures to other services
Cleanup:
```
iptables -D OUTPUT -d api.external.com -j DROP
```
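The "cached data or degraded response" expectation can be sketched as a fetch-with-fallback wrapper. `fetch_quotes` and the cache contents are hypothetical names for this sketch, and the network call is stubbed to fail, as it would while the iptables rule is active:

```shell
#!/usr/bin/env bash
# Serve cached data when a dependency call fails: the degraded path
# that a dependency-failure experiment should exercise.
set -u

CACHE="$(mktemp)"
trap 'rm -f "$CACHE"' EXIT
echo '{"quotes": "stale-but-usable"}' > "$CACHE"

# Hypothetical fetch; a real one might be:
#   curl -fsS --max-time 2 https://api.external.com/quotes
# Stubbed to fail here, as if iptables were dropping the packets.
fetch_quotes() { return 1; }

get_quotes() {
  local out
  if out=$(fetch_quotes); then
    echo "$out"
  else
    echo "dependency down, serving cache" >&2
    cat "$CACHE"
  fi
}

get_quotes
```

The experiment then checks two things: callers still get a well-formed (if stale) response, and the degraded-path log line shows up in monitoring.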
Resource Exhaustion
Fill the disk toward 95%:
```
# Writes ~50GB of zeros; adjust count to hit the target fill level
# for your disk size (fallocate is a faster alternative)
dd if=/dev/zero of=/tmp/fillfile bs=1M count=50000
```
Expected Outcome:
- Disk space alerts fire
- Application handles write failures gracefully
- Log rotation/cleanup kicks in
- Service doesn't crash (proper error handling)
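The "log rotation/cleanup kicks in" expectation implies something is watching disk usage. A minimal sketch of such a guard, with a placeholder threshold and a report in place of a real cleanup action:

```shell
#!/usr/bin/env bash
# Report disk usage for a path and decide whether cleanup should run.
# The threshold and the cleanup action are illustrative.
set -euo pipefail

usage_pct() {
  # df -P keeps output on one line (POSIX format); $5 is "Use%".
  df -P "$1" | awk 'NR == 2 { sub("%", "", $5); print $5 }'
}

check_disk() {
  local path="$1" threshold="$2" pct
  pct=$(usage_pct "$path")
  if [ "$pct" -ge "$threshold" ]; then
    echo "disk at ${pct}%: cleanup needed"
  else
    echo "disk at ${pct}%: ok"
  fi
}

check_disk /tmp 90
```

During the fill experiment, watch for this guard (or your real log-rotation job) to fire before the application starts failing writes.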
CPU Spike:
```
# Requires the 'stress' package; stress-ng accepts the same flags
stress --cpu 8 --timeout 60s
```
Expected Outcome:
- Auto-scaling triggers (if configured)
- Request latency increases but doesn't timeout
- Load balancer distributes to other instances
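While `stress` runs, you can watch the 1-minute load average relative to the core count; a sustained ratio above 1.0 is roughly the signal an auto-scaling policy keys on. A sketch (Linux-specific, reading `/proc/loadavg`):

```shell
#!/usr/bin/env bash
# Compare the 1-minute load average against the CPU count; a sustained
# ratio above 1.0 during 'stress' is what should trip auto-scaling.
set -euo pipefail

cores=$(nproc)
read -r load1 _ < /proc/loadavg

ratio=$(awk -v l="$load1" -v c="$cores" 'BEGIN { printf "%.2f", l / c }')
echo "load1=${load1} cores=${cores} ratio=${ratio}"

if awk -v r="$ratio" 'BEGIN { exit !(r > 1.0) }'; then
  echo "CPU saturated"
else
  echo "CPU headroom available"
fi
```

Sampling this in a loop during the experiment gives you the saturation timeline to compare against when the scaling event actually fired.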
# Game Days and DR Drills
Game days are scheduled events where teams simulate major failures and practice recovery procedures.
Game Day Scenarios
- Datacenter failure: Simulate entire AZ/region down, failover to backup
- Database primary failure: Force failover to replica, validate data consistency
- Network partition: Isolate services, test split-brain handling
- Deployment gone wrong: Deploy bad code, practice rollback procedures
- Cascading failure: Overload one service, watch cascade, test circuit breakers
Game Day Process
1. Plan (1-2 weeks before):
- Define scenario (e.g., "us-east-1 datacenter down")
- Write hypothesis (e.g., "traffic fails over to us-west-2 in <5 min")
- Schedule (business hours, team on standby)
- Notify stakeholders
2. Execute (day of):
- Inject failure (kill datacenter, etc.)
- Monitor dashboards, alerts, user impact
- Document timeline, decisions, actions taken
- Recover when done or if unexpected issues arise
3. Debrief (within 1 week):
- What worked? What didn't?
- Were metrics accurate?
- Did alerts fire as expected?
- Runbooks up to date?
- Action items to improve
4. Fix and Iterate:
- Fix issues discovered
- Update runbooks, dashboards, alerts
- Schedule next game day (quarterly recommended)
Disaster Recovery (DR) Drills
DR drills test your backup/restore procedures. They are critical for compliance (SOC 2, PCI DSS) and for surviving actual disasters.
DR Drill Scenario: Database Corruption
1. Simulate corruption (in DR environment, not prod!)
2. Restore from backup
3. Validate data integrity (checksums, row counts)
4. Measure restore time (RTO - Recovery Time Objective)
5. Measure data loss (RPO - Recovery Point Objective)
Metrics to Validate:
- RTO: How long to restore? (Target: <4 hours)
- RPO: How much data lost? (Target: <1 hour of transactions)
- Data integrity: Checksums match, no corruption
- Application compatibility: App works with restored data
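RTO and RPO fall out of three timestamps recorded during the drill: the last good backup, the failure, and the completed restore. A sketch using GNU `date`, with illustrative timestamps:

```shell
#!/usr/bin/env bash
# Compute RTO and RPO from drill timestamps (illustrative values).
# RTO = restore_done - failure_start; RPO = failure_start - last_backup.
set -euo pipefail

last_backup="2024-06-01T02:00:00Z"    # most recent good backup
failure_start="2024-06-01T02:40:00Z"  # corruption detected
restore_done="2024-06-01T05:10:00Z"   # service healthy on restored data

epoch() { date -u -d "$1" +%s; }      # GNU date

rto_min=$(( ($(epoch "$restore_done") - $(epoch "$failure_start")) / 60 ))
rpo_min=$(( ($(epoch "$failure_start") - $(epoch "$last_backup")) / 60 ))

echo "RTO: ${rto_min} minutes (target: <240)"
echo "RPO: ${rpo_min} minutes (target: <60)"
```

Recording the timestamps during the drill and computing the metrics afterward keeps the numbers honest; estimating them from memory in the debrief does not.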
# What to Test and When
Not all failures need continuous testing. Prioritize based on impact and likelihood.
Continuous (Automated):
- Instance termination (daily)
- Network latency injection (hourly)
- Dependency timeout simulation (daily)
Regular (Scheduled):
- AZ failure (quarterly game day)
- Database failover (monthly)
- Full DR restore (quarterly)
Occasional (Annual):
- Region failure
- Complete datacenter evacuation
- Multi-service cascading failure
# Building Confidence in Failover
Failover mechanisms are only trustworthy if regularly tested. Common issues discovered during testing:
- Stale runbooks: Documented failover process doesn't match current infrastructure
- Untested scripts: Failover script has typo, fails when actually run
- Hidden dependencies: "Backup" datacenter depends on primary for DNS/auth/config
- Capacity assumptions: Backup DC can't handle 100% traffic (overloads)
- Data lag: Replica is 2 hours behind, unacceptable for failover
- Manual steps: Failover requires 10 manual steps, takes 2 hours
Fix these issues by testing regularly. Automated failover is better than manual, but even automated systems need periodic validation.
# Tools for Chaos Engineering
- Chaos Monkey (Netflix): Randomly terminates instances in AWS
- Gremlin: Commercial chaos engineering platform (network, resource, state failures)
- Litmus (CNCF): Kubernetes chaos engineering tool
- Pumba: Docker chaos testing (kill containers, network failures)
- tc (traffic control): Linux built-in tool for network fault injection
- stress / stress-ng: CPU, memory, disk stress testing
# Key Takeaways
- Chaos engineering validates resilience by deliberately injecting failures
- Test in production (with controls) to catch issues staging environments miss
- Fault injection types: instance termination, network latency/loss, dependency failures, resource exhaustion
- Game days simulate major failures (AZ down, DB failover) and practice recovery
- DR drills validate backup/restore procedures and measure RTO/RPO
- Continuous testing (instance kills, latency) + scheduled testing (game days) build confidence
- Untested failover mechanisms will fail when you need them—test regularly