Building Resilient Systems: Practical Guidance
Resilience isn't just about infrastructure redundancy—it's about how you develop, test, and deploy systems. This section covers practical patterns for building resilient systems: environment separation, dependency management, deployment strategies, and leveraging cloud provider fault isolation boundaries.
# Environment Separation: Dev/QA/Prod
Separating environments isolates failures and allows safe testing of changes before production deployment.
Development (Dev):
- Individual developers test code changes
- Rapid iteration, frequent breakage acceptable
- May use mock/stub dependencies
- Data: Synthetic or anonymized
Quality Assurance (QA/Staging):
- Integration testing, load testing, security testing
- Production-like configuration and data volume
- Validates changes before prod deployment
- Data: Anonymized production snapshot or realistic synthetic
Production (Prod):
- Live user traffic
- High availability, monitoring, on-call
- Changes deployed only after QA validation
- Data: Real user data
Why separate: Dev/QA failures don't affect users. QA catches bugs before production. Production stays stable.
Common Pitfalls
- Shared databases: Dev and QA sharing the same database → test data pollution, accidental prod data access
- "Read-only" prod access: Developers with read-only production access can still leak data, and the access invites "just this once" writes
- QA not prod-like: QA with 1/100th prod data volume misses performance issues that only appear at scale
- No staging: Deploying dev → prod directly skips integration testing, increases risk
Best practice: Completely isolated environments. QA should match prod topology (same number of tiers, similar data volume, same configuration). Use IAM/RBAC to enforce separation.
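A minimal application-level sketch of that isolation, assuming a hypothetical settings module and internal hostnames: each environment gets its own configuration object, selected by an APP_ENV variable, so dev code can never silently point at prod resources.

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Settings:
    db_host: str
    auth_url: str
    use_mocks: bool

# One isolated configuration per environment; hostnames are hypothetical.
ENVIRONMENTS = {
    "dev": Settings("db.dev.internal", "https://auth.dev.internal", use_mocks=True),
    "qa": Settings("db.qa.internal", "https://auth.qa.internal", use_mocks=False),
    "prod": Settings("db.prod.internal", "https://auth.prod.internal", use_mocks=False),
}

def load_settings() -> Settings:
    env = os.environ.get("APP_ENV", "dev")  # default to dev, never to prod
    if env not in ENVIRONMENTS:
        raise ValueError(f"Unknown environment: {env}")
    return ENVIRONMENTS[env]
```

Pair this with IAM/RBAC so the dev and QA runtime roles simply cannot read prod credentials, rather than relying on the application to behave.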
# Dependency Lifecycle Management
Dependencies (DNS, authentication, databases, APIs) are often shared across systems. How do you safely change a dependency without breaking everything?
Problem: Shared Dependencies
Shared Auth Service:
[App 1] --|
[App 2] --+--> [Auth Service] (single instance)
[App 3] --|
Auth Service upgrade:
- Changes API contract (new field required)
- App 1, 2, 3 not updated yet
- Deploy auth upgrade --> Apps break (missing field)
Solution: Dependencies should follow their own dev/QA/prod lifecycle, expose versioned APIs, and maintain backward compatibility.
Pattern: Versioned Dependencies
Auth Service v1 and v2 running simultaneously:
[App 1 (old)] --> Auth v1 endpoint (/v1/auth)
[App 2 (new)] --> Auth v2 endpoint (/v2/auth)
[App 3 (old)] --> Auth v1 endpoint
Migration:
1. Deploy Auth v2 alongside v1
2. Update App 2 to use v2 (test, validate)
3. Update App 1 to use v2
4. Update App 3 to use v2
5. Decommission v1 once no clients remain
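One way to run both versions side by side is to mount the old and new contracts in the same service. A sketch using Flask, with hypothetical request fields (tenant_id standing in for the new required field) and a placeholder issue_token helper:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/v1/auth", methods=["POST"])
def auth_v1():
    # Old contract: only username and password are required.
    body = request.get_json(force=True)
    token = issue_token(body["username"], body["password"], tenant=None)
    return jsonify({"token": token})

@app.route("/v2/auth", methods=["POST"])
def auth_v2():
    # New contract: tenant_id is required in addition to credentials.
    body = request.get_json(force=True)
    token = issue_token(body["username"], body["password"], tenant=body["tenant_id"])
    return jsonify({"token": token, "tenant_id": body["tenant_id"]})

def issue_token(username, password, tenant):
    # Placeholder for the real credential check and token generation.
    return f"token-for-{username}"
```

Clients migrate by changing only the path they call, so no single deployment forces every app to move at once.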
DNS example: DNS changes need their own lifecycle. If you repoint example.com from 10.0.1.5 to 10.0.1.10, applications need to handle both IPs during the transition (the TTL period). Test the DNS change in dev/QA before prod.
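On the client side, a sketch of riding out that TTL window: ask DNS for all current addresses and try each one instead of caching a single IP (the port and timeout values are illustrative):

```python
import socket

def connect_any(hostname: str, port: int, timeout: float = 2.0) -> socket.socket:
    """Try every address DNS currently returns (old and new) until one accepts."""
    last_error = None
    for family, socktype, proto, _, addr in socket.getaddrinfo(
        hostname, port, type=socket.SOCK_STREAM
    ):
        try:
            return socket.create_connection(addr[:2], timeout=timeout)
        except OSError as exc:
            last_error = exc  # that IP may already be retired; try the next one
    raise ConnectionError(f"No address for {hostname} accepted a connection") from last_error

# During the TTL window both 10.0.1.5 and 10.0.1.10 may be returned:
# conn = connect_any("example.com", 443)
```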
# Deployment Strategies
How you deploy changes affects blast radius and rollback speed.
Blue-Green Deployment
Run two identical environments (blue and green). Deploy to inactive environment, then switch traffic.
Before Deployment:
Blue (active): v1.0 <-- 100% traffic
Green (idle): (empty or old version)
Deploy v1.1:
Blue (active): v1.0 <-- 100% traffic
Green (idle): v1.1 <-- Deploy new version, test
Switch Traffic:
Blue (idle): v1.0
Green (active): v1.1 <-- 100% traffic switched instantly
Rollback:
If v1.1 has issues, switch traffic back to blue (instant)
Pros: Fast rollback (flip switch), full testing before traffic switch, zero downtime.
Cons: Expensive (2x infrastructure during deployment), database migrations tricky (both versions must handle same schema).
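On AWS, the traffic switch is often a single load-balancer listener update. A sketch using boto3 with hypothetical listener and target-group ARNs (weighted DNS records are another common implementation):

```python
import boto3

elbv2 = boto3.client("elbv2")

# Hypothetical ARNs for the listener and the two target groups.
LISTENER_ARN = "arn:aws:elasticloadbalancing:...:listener/app/example/..."
BLUE_TG = "arn:aws:elasticloadbalancing:...:targetgroup/blue/..."
GREEN_TG = "arn:aws:elasticloadbalancing:...:targetgroup/green/..."

def switch_traffic(target_group_arn: str) -> None:
    """Point 100% of listener traffic at the given target group."""
    elbv2.modify_listener(
        ListenerArn=LISTENER_ARN,
        DefaultActions=[{"Type": "forward", "TargetGroupArn": target_group_arn}],
    )

# switch_traffic(GREEN_TG)  # cut over after validating v1.1
# switch_traffic(BLUE_TG)   # rollback is the same call with the other ARN
```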
Canary Deployment
Gradually roll out to small percentage of traffic, monitor, then expand.
Phase 1: 1% traffic to v1.1
v1.0: 99% traffic
v1.1: 1% traffic <-- Monitor error rates, latency
Phase 2: 10% traffic (if Phase 1 looks good)
v1.0: 90% traffic
v1.1: 10% traffic
Phase 3: 50% traffic
v1.0: 50% traffic
v1.1: 50% traffic
Phase 4: 100% traffic (full rollout)
v1.1: 100% traffic
v1.0: decommissioned
Pros: Catches issues early (small blast radius), gradual validation, cost-effective.
Cons: Slower rollout, requires traffic splitting infrastructure, monitoring complexity.
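The split itself usually lives in a load balancer or service mesh, but the idea reduces to weighted routing. A minimal, infrastructure-agnostic sketch:

```python
import random

def choose_backend(canary_fraction: float) -> str:
    """Route one request: canary_fraction of traffic goes to the new version."""
    return "v1.1-canary" if random.random() < canary_fraction else "v1.0-stable"

# Phase 1: 1% canary. Promote by raising the fraction while metrics stay healthy.
counts = {"v1.0-stable": 0, "v1.1-canary": 0}
for _ in range(10_000):
    counts[choose_backend(0.01)] += 1
print(counts)  # roughly 9,900 stable / 100 canary
```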
Rolling Deployment
Update instances one at a time (or in small batches).
10 instances running v1.0:
Update instance 1 --> v1.1
Update instance 2 --> v1.1
...
Update instance 10 --> v1.1
Mixed version period:
Some instances v1.0, some v1.1 (must be compatible)
Pros: Simple, no extra infrastructure, gradual rollout.
Cons: Rollback slow (must redeploy old version to all), mixed-version compatibility required.
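A sketch of the batch-and-verify loop, with deploy_to and is_healthy as hypothetical hooks for whatever your platform uses (SSH, an agent, or an instance-refresh API):

```python
import time

def rolling_deploy(instances, new_version, batch_size=2, health_timeout=120):
    """Update instances in small batches; halt if a batch fails health checks."""
    for i in range(0, len(instances), batch_size):
        batch = instances[i:i + batch_size]
        for instance in batch:
            deploy_to(instance, new_version)
        deadline = time.time() + health_timeout
        while not all(is_healthy(instance) for instance in batch):
            if time.time() > deadline:
                raise RuntimeError(f"Batch {batch} failed health checks; halting rollout")
            time.sleep(5)

def deploy_to(instance, version):
    print(f"deploying {version} to {instance}")  # placeholder deployment hook

def is_healthy(instance):
    return True  # placeholder: hit the instance's health-check endpoint
```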
# Service Dependencies and Circuit Breakers
In microservice architectures, services depend on each other. How do you prevent one service failure from cascading?
Dependency Graph
[Frontend] --> [API Gateway] --> [Auth Service]
                    |
                    +--> [Product Service] --> [Database]
                    |
                    +--> [Recommendation] --> [ML Model]
Problem: If Product Service is slow (database overloaded), API Gateway waits on timeouts, exhausts its thread pool, and becomes unresponsive. Frontend then fails too because it can't reach API Gateway. That's a cascading failure.
Solution: Circuit breakers (see Graceful Degradation section), timeouts, bulkheading (isolate resources per dependency).
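A minimal circuit breaker sketch (thresholds and timeouts are illustrative): after enough consecutive failures, calls fail fast instead of tying up threads on a sick dependency, and after a cool-down one trial request is let through.

```python
import time

class CircuitBreaker:
    """Open after failure_threshold consecutive failures; retry after reset_timeout seconds."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast instead of waiting on timeouts")
            self.opened_at = None  # half-open: allow one trial call through
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()
            raise
        self.failures = 0
        return result

# breaker = CircuitBreaker()
# product = breaker.call(fetch_product, product_id)  # fetch_product is your real client call
```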
# AWS Fault Isolation Boundaries
AWS provides guidance on designing systems using their fault isolation boundaries. Key takeaways:
Zonal Services (AZ-Scoped)
Some services operate within a single AZ: EBS volumes, EC2 instances, subnet-specific resources.
Design implication: Distribute across multiple AZs. Don't put all EC2 instances in one AZ. Use Auto Scaling Groups with multi-AZ configuration.
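A sketch of that configuration with boto3; the subnet IDs (one per AZ) and launch template name are hypothetical:

```python
import boto3

autoscaling = boto3.client("autoscaling")

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="web-asg",
    MinSize=3,
    MaxSize=9,
    DesiredCapacity=3,
    LaunchTemplate={"LaunchTemplateName": "web-template", "Version": "$Latest"},
    # One subnet per AZ, so the group spreads instances across zones.
    VPCZoneIdentifier="subnet-aaa111,subnet-bbb222,subnet-ccc333",
)
```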
Regional Services (Region-Scoped)
Some services span AZs within a region: S3, DynamoDB, ELB, Lambda. AZ failure doesn't impact these (they're already multi-AZ).
Design implication: Use regional services for HA within region. For cross-region DR, use multi-region patterns (S3 cross-region replication, DynamoDB Global Tables).
Control Plane vs Data Plane
AWS recommends understanding which API calls are control plane (RunInstances, DeleteBucket) vs data plane (GetObject, PutItem).
During outages: Control plane may be impacted, but data plane often continues. Design systems to minimize control plane dependencies during normal operation.
Example: Pre-create resources (instances, buckets) rather than creating on-demand during traffic spikes. Use instance pools instead of creating new instances per request.
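A sketch of the pre-creation idea: requests draw from a warm pool filled ahead of time, so the serving path never issues a control-plane create call (the pool contents are illustrative):

```python
import queue

class WarmPool:
    """Hand out pre-created resources at request time instead of creating them on demand."""

    def __init__(self, resources):
        self._pool = queue.Queue()
        for resource in resources:  # created ahead of time, outside the request path
            self._pool.put(resource)

    def acquire(self, timeout=1.0):
        # Data-plane only: no create calls here, just take one from the pool.
        return self._pool.get(timeout=timeout)

    def release(self, resource):
        self._pool.put(resource)

# pool = WarmPool(["i-0aaa", "i-0bbb", "i-0ccc"])  # instance IDs are hypothetical
# worker = pool.acquire()
```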
# Key Takeaways
- Separate dev/qa/prod environments completely—shared resources create failure correlations
- QA should mirror prod topology and data volume to catch scale-dependent issues
- Dependencies (DNS, auth, DBs) need their own dev/qa/prod lifecycle and versioned APIs
- Deployment strategies: blue-green (fast rollback), canary (gradual validation), rolling (simple)
- Service dependencies require circuit breakers, timeouts, and bulkheading to prevent cascades
- AWS: Use multi-AZ for zonal services, regional services for HA, minimize control plane dependencies
- Test deployment strategies in QA before prod—ensure rollback actually works