Testing Resilience
You can't know if your failover works until you test it. Many systems have redundancy and circuit breakers on paper, but they fail during real outages because they were never tested. This section covers chaos engineering, fault injection, and game days for building confidence in your resilience mechanisms.
# Chaos Engineering Principles
Chaos engineering is the practice of deliberately injecting failures into production systems to validate resilience. The practice was popularized by Netflix's Chaos Monkey.
Core Principles
- Build a hypothesis: "If we kill one instance, traffic should route to others with <5% error rate increase"
- Inject real-world failures: Instance crashes, network latency, disk full, dependency failures
- Minimize blast radius: Start small (1% traffic, dev environment), expand gradually
- Automate: Run chaos experiments continuously, not just once
- Measure and learn: Did system behave as expected? What broke? Fix it.
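The first principle can be checked mechanically at the end of a run. A minimal sketch, assuming the request and error counts have already been pulled from your metrics system (the numbers here are illustrative):

```shell
#!/usr/bin/env bash
# Compare error rates before and during a chaos experiment and verify
# the hypothesis "error rate increases by less than 5 percentage points".
set -euo pipefail

# Illustrative counts; in practice these come from your metrics system.
baseline_total=10000   baseline_errors=50     # 0.5% baseline error rate
experiment_total=10000 experiment_errors=220  # 2.2% during the experiment

rate() { awk -v e="$1" -v t="$2" 'BEGIN { printf "%.2f", 100 * e / t }'; }

base_rate=$(rate "$baseline_errors" "$baseline_total")
exp_rate=$(rate "$experiment_errors" "$experiment_total")
delta=$(awk -v a="$exp_rate" -v b="$base_rate" 'BEGIN { printf "%.2f", a - b }')

echo "baseline=${base_rate}% experiment=${exp_rate}% delta=${delta}pp"
if awk -v d="$delta" 'BEGIN { exit !(d < 5.0) }'; then
  echo "HYPOTHESIS HOLDS"
else
  echo "HYPOTHESIS VIOLATED"
fi
```

The same comparison can gate an automated experiment: abort and roll back the injection as soon as the delta crosses the threshold.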
Why in Production?
Staging doesn't catch everything: production has real traffic patterns, real data volume, and real dependencies. A staging environment may be too simple, or too different from production, to reveal the failure modes that matter.
Controlled failure is better than unexpected failure: If you inject failures during business hours with engineers watching, you can respond immediately. Real outages happen at 3am on weekends.
# Fault Injection Examples
Different types of failures to test:
Instance/Process Termination
Chaos Monkey (Netflix) randomly terminates EC2 instances in production. This validates that auto-scaling works, that load balancers remove failed instances, and that services handle instance loss gracefully.
Command:
```
# Terminate one instance (pick the ID at random from the group)
aws autoscaling terminate-instance-in-auto-scaling-group \
  --instance-id i-1234567890abcdef0 \
  --should-decrement-desired-capacity false
```
Expected Outcome:
- Load balancer detects failure via health check (10-30s)
- Routes traffic to remaining instances
- Auto-scaling group launches replacement instance
- No user-visible errors
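A probe loop running alongside the experiment can confirm the "no user-visible errors" expectation. The sketch below accepts any health-check command; `stub_check` stands in for a real `curl` probe, and the probe count and interval are placeholders:

```shell
#!/usr/bin/env bash
# Poll a health check at a fixed interval during a chaos experiment and
# report how many probes failed. Pass any command as the check, e.g.:
#   watch_failover 30 1 curl -fsS -o /dev/null https://example.com/health
set -u

watch_failover() {
  local probes="$1" interval="$2"; shift 2
  local failures=0
  for ((i = 1; i <= probes; i++)); do
    if ! "$@" >/dev/null 2>&1; then
      failures=$((failures + 1))
      echo "probe $i: FAIL"
    fi
    sleep "$interval"
  done
  echo "failed $failures of $probes probes"
}

# Demo with a stub check that fails on its third invocation, simulating
# the brief window before the load balancer ejects the dead instance.
n=0
stub_check() { n=$((n + 1)); [ "$n" -ne 3 ]; }
watch_failover 5 0 stub_check
```

Run the loop before, during, and after the kill; the failure count tells you how long the detection window actually was.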
Network Latency Injection
Inject 200ms latency on the path to the database:
```
# Using tc (traffic control) on Linux; requires root. This delays ALL
# egress traffic on eth0; to affect only database traffic, add tc
# filters or run it on the database host.
tc qdisc add dev eth0 root netem delay 200ms
```
Expected Outcome:
- Application queries slow down
- Timeouts may fire if set too aggressively
- Circuit breakers should NOT open (200ms is slow, not a failure)
- User experience degrades but doesn't fail
Cleanup:
```
tc qdisc del dev eth0 root
```
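A forgotten cleanup leaves latency injected indefinitely, so it helps to wrap injection and cleanup in one script with a `trap`. A sketch that defaults to dry-run, printing the commands instead of executing them (real runs need root); `DEV` and `DELAY` are illustrative defaults:

```shell
#!/usr/bin/env bash
# Inject netem delay for a fixed duration, guaranteeing cleanup via trap.
# Defaults to dry-run; set DRY_RUN=0 to execute for real (requires root).
set -euo pipefail

DEV="${DEV:-eth0}"
DELAY="${DELAY:-200ms}"
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

cleanup() { run tc qdisc del dev "$DEV" root; }
trap cleanup EXIT   # cleanup runs when the script exits for any reason

run tc qdisc add dev "$DEV" root netem delay "$DELAY"
run sleep 60        # duration of the experiment
```

Because the cleanup hangs off the `EXIT` trap, the qdisc is removed even if the script is interrupted mid-experiment.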
Packet Loss Injection
Inject 5% packet loss:
```
tc qdisc add dev eth0 root netem loss 5%
```
Expected Outcome:
- TCP retransmits packets (automatic)
- Throughput decreases
- Latency increases (due to retransmits)
- Application should handle gracefully (TCP is reliable)
Dependency Failure
Block access to an external API:
```
# Using iptables to drop packets to the API endpoint; note that
# iptables resolves the hostname to IP addresses when the rule is added
iptables -A OUTPUT -d api.external.com -j DROP
```
Expected Outcome:
- Circuit breaker opens after N failures
- Application returns cached data or degraded response
- Monitoring alerts on circuit breaker state change
- No cascading failures to other services
Cleanup:
```
iptables -D OUTPUT -d api.external.com -j DROP
```
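The "cached data or degraded response" expectation can be sketched as a fetch-with-fallback wrapper. `fetch_quotes` and the cache contents are hypothetical names for this sketch, and the network call is stubbed to fail, as it would while the iptables rule is active:

```shell
#!/usr/bin/env bash
# Serve cached data when a dependency call fails: the degraded path
# that a dependency-failure experiment should exercise.
set -u

CACHE="$(mktemp)"
trap 'rm -f "$CACHE"' EXIT
echo '{"quotes": "stale-but-usable"}' > "$CACHE"

# Hypothetical fetch; a real one might be:
#   curl -fsS --max-time 2 https://api.external.com/quotes
# Stubbed to fail here, as if iptables were dropping the packets.
fetch_quotes() { return 1; }

get_quotes() {
  local out
  if out=$(fetch_quotes); then
    echo "$out"
  else
    echo "dependency down, serving cache" >&2
    cat "$CACHE"
  fi
}

get_quotes
```

The experiment then checks two things: callers still get a well-formed (if stale) response, and the degraded-path log line shows up in monitoring.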
Resource Exhaustion
Fill the disk toward 95%:
```
# Writes ~50GB of zeros; adjust count to hit the target fill level
# for your disk size (fallocate is a faster alternative)
dd if=/dev/zero of=/tmp/fillfile bs=1M count=50000
```
Expected Outcome:
- Disk space alerts fire
- Application handles write failures gracefully
- Log rotation/cleanup kicks in
- Service doesn't crash (proper error handling)
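The "log rotation/cleanup kicks in" expectation implies something is watching disk usage. A minimal sketch of such a guard, with a placeholder threshold and a report in place of a real cleanup action:

```shell
#!/usr/bin/env bash
# Report disk usage for a path and decide whether cleanup should run.
# The threshold and the cleanup action are illustrative.
set -euo pipefail

usage_pct() {
  # df -P keeps output on one line (POSIX format); $5 is "Use%".
  df -P "$1" | awk 'NR == 2 { sub("%", "", $5); print $5 }'
}

check_disk() {
  local path="$1" threshold="$2" pct
  pct=$(usage_pct "$path")
  if [ "$pct" -ge "$threshold" ]; then
    echo "disk at ${pct}%: cleanup needed"
  else
    echo "disk at ${pct}%: ok"
  fi
}

check_disk /tmp 90
```

During the fill experiment, watch for this guard (or your real log-rotation job) to fire before the application starts failing writes.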
CPU Spike:
```
# Requires the 'stress' package; stress-ng accepts the same flags
stress --cpu 8 --timeout 60s
```
Expected Outcome:
- Auto-scaling triggers (if configured)
- Request latency increases but doesn't timeout
- Load balancer distributes to other instances
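While `stress` runs, you can watch the 1-minute load average relative to the core count; a sustained ratio above 1.0 is roughly the signal an auto-scaling policy keys on. A sketch (Linux-specific, reading `/proc/loadavg`):

```shell
#!/usr/bin/env bash
# Compare the 1-minute load average against the CPU count; a sustained
# ratio above 1.0 during 'stress' is what should trip auto-scaling.
set -euo pipefail

cores=$(nproc)
read -r load1 _ < /proc/loadavg

ratio=$(awk -v l="$load1" -v c="$cores" 'BEGIN { printf "%.2f", l / c }')
echo "load1=${load1} cores=${cores} ratio=${ratio}"

if awk -v r="$ratio" 'BEGIN { exit !(r > 1.0) }'; then
  echo "CPU saturated"
else
  echo "CPU headroom available"
fi
```

Sampling this in a loop during the experiment gives you the saturation timeline to compare against when the scaling event actually fired.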
# Game Days and DR Drills
Game days are scheduled events where teams simulate major failures and practice recovery procedures.
Game Day Scenarios
- Datacenter failure: Simulate entire AZ/region down, failover to backup
- Database primary failure: Force failover to replica, validate data consistency
- Network partition: Isolate services, test split-brain handling
- Deployment gone wrong: Deploy bad code, practice rollback procedures
- Cascading failure: Overload one service, watch cascade, test circuit breakers
Game Day Process
1. Plan (1-2 weeks before):
- Define scenario (e.g., "us-east-1 datacenter down")
- Write hypothesis (e.g., "traffic fails over to us-west-2 in <5 min")
- Schedule (business hours, team on standby)
- Notify stakeholders
2. Execute (day of):
- Inject failure (kill datacenter, etc.)
- Monitor dashboards, alerts, user impact
- Document timeline, decisions, actions taken
- Recover when done or if unexpected issues arise
3. Debrief (within 1 week):
- What worked? What didn't?
- Were metrics accurate?
- Did alerts fire as expected?
- Runbooks up to date?
- Action items to improve
4. Fix and Iterate:
- Fix issues discovered
- Update runbooks, dashboards, alerts
- Schedule next game day (quarterly recommended)
Disaster Recovery (DR) Drills
DR drills test your backup/restore procedures. They are critical for compliance (SOC 2, PCI DSS) and for surviving actual disasters.
DR Drill Scenario: Database Corruption
1. Simulate corruption (in DR environment, not prod!)
2. Restore from backup
3. Validate data integrity (checksums, row counts)
4. Measure restore time (RTO - Recovery Time Objective)
5. Measure data loss (RPO - Recovery Point Objective)
Metrics to Validate:
- RTO: How long to restore? (Target: <4 hours)
- RPO: How much data lost? (Target: <1 hour of transactions)
- Data integrity: Checksums match, no corruption
- Application compatibility: App works with restored data
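RTO and RPO fall out of three timestamps recorded during the drill: the last good backup, the failure, and the completed restore. A sketch using GNU `date`, with illustrative timestamps:

```shell
#!/usr/bin/env bash
# Compute RTO and RPO from drill timestamps (illustrative values).
# RTO = restore_done - failure_start; RPO = failure_start - last_backup.
set -euo pipefail

last_backup="2024-06-01T02:00:00Z"    # most recent good backup
failure_start="2024-06-01T02:40:00Z"  # corruption detected
restore_done="2024-06-01T05:10:00Z"   # service healthy on restored data

epoch() { date -u -d "$1" +%s; }      # GNU date

rto_min=$(( ($(epoch "$restore_done") - $(epoch "$failure_start")) / 60 ))
rpo_min=$(( ($(epoch "$failure_start") - $(epoch "$last_backup")) / 60 ))

echo "RTO: ${rto_min} minutes (target: <240)"
echo "RPO: ${rpo_min} minutes (target: <60)"
```

Recording the timestamps during the drill and computing the metrics afterward keeps the numbers honest; estimating them from memory in the debrief does not.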
# What to Test and When
Not all failures need continuous testing. Prioritize based on impact and likelihood.
Continuous (Automated):
- Instance termination (daily)
- Network latency injection (hourly)
- Dependency timeout simulation (daily)
Regular (Scheduled):
- AZ failure (quarterly game day)
- Database failover (monthly)
- Full DR restore (quarterly)
Occasional (Annual):
- Region failure
- Complete datacenter evacuation
- Multi-service cascading failure
# Building Confidence in Failover
Failover mechanisms are only trustworthy if regularly tested. Common issues discovered during testing:
- Stale runbooks: Documented failover process doesn't match current infrastructure
- Untested scripts: Failover script has typo, fails when actually run
- Hidden dependencies: "Backup" datacenter depends on primary for DNS/auth/config
- Capacity assumptions: Backup DC can't handle 100% traffic (overloads)
- Data lag: Replica is 2 hours behind, unacceptable for failover
- Manual steps: Failover requires 10 manual steps, takes 2 hours
Fix these issues by testing regularly. Automated failover is better than manual, but even automated systems need periodic validation.
# Tools for Chaos Engineering
- Chaos Monkey (Netflix): Randomly terminates instances in AWS
- Gremlin: Commercial chaos engineering platform (network, resource, state failures)
- Litmus (CNCF): Kubernetes chaos engineering tool
- Pumba: Docker chaos testing (kill containers, network failures)
- tc (traffic control): Linux built-in tool for network fault injection
- stress / stress-ng: CPU, memory, disk stress testing
# Key Takeaways
- Chaos engineering validates resilience by deliberately injecting failures
- Test in production (with controls) to catch issues staging environments miss
- Fault injection types: instance termination, network latency/loss, dependency failures, resource exhaustion
- Game days simulate major failures (AZ down, DB failover) and practice recovery
- DR drills validate backup/restore procedures and measure RTO/RPO
- Continuous testing (instance kills, latency) + scheduled testing (game days) build confidence
- Untested failover mechanisms will fail when you need them—test regularly