Building Resilient Systems: Practical Guidance
Resilience isn't just about infrastructure redundancy—it's about how you develop, test, and deploy systems. This section covers practical patterns for building resilient systems: environment separation, dependency management, deployment strategies, and leveraging cloud provider fault isolation boundaries.
# Environment Separation: Dev/QA/Prod
Separating environments isolates failures and allows safe testing of changes before production deployment.
Development (Dev):
- Individual developers test code changes
- Rapid iteration, frequent breakage acceptable
- May use mock/stub dependencies
- Data: Synthetic or anonymized
Quality Assurance (QA/Staging):
- Integration testing, load testing, security testing
- Production-like configuration and data volume
- Validates changes before prod deployment
- Data: Anonymized production snapshot or realistic synthetic
Production (Prod):
- Live user traffic
- High availability, monitoring, on-call
- Changes deployed only after QA validation
- Data: Real user data
Why separate: Dev/QA failures don't affect users. QA catches bugs before production. Production stays stable.
Common Pitfalls
- Shared databases: Dev and QA sharing the same database → test data pollution, accidental prod data access
- "Read-only" prod access: Developers with read-only production access can still leak data, and the access invites "just this once" writes
- QA not prod-like: QA with 1/100th prod data volume misses performance issues that only appear at scale
- No staging: Deploying dev → prod directly skips integration testing, increases risk
Best practice: Completely isolated environments. QA should match prod topology (same number of tiers, similar data volume, same configuration). Use IAM/RBAC to enforce separation.
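A minimal application-level sketch of that isolation, assuming a hypothetical settings module and internal hostnames: each environment gets its own configuration object, selected by an APP_ENV variable, so dev code can never silently point at prod resources.

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Settings:
    db_host: str
    auth_url: str
    use_mocks: bool

# One isolated configuration per environment; hostnames are hypothetical.
ENVIRONMENTS = {
    "dev": Settings("db.dev.internal", "https://auth.dev.internal", use_mocks=True),
    "qa": Settings("db.qa.internal", "https://auth.qa.internal", use_mocks=False),
    "prod": Settings("db.prod.internal", "https://auth.prod.internal", use_mocks=False),
}

def load_settings() -> Settings:
    env = os.environ.get("APP_ENV", "dev")  # default to dev, never to prod
    if env not in ENVIRONMENTS:
        raise ValueError(f"Unknown environment: {env}")
    return ENVIRONMENTS[env]
```

Pair this with IAM/RBAC so the dev and QA runtime roles simply cannot read prod credentials, rather than relying on the application to behave.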
# Dependency Lifecycle Management
Dependencies (DNS, authentication, databases, APIs) are often shared across systems. How do you safely change a dependency without breaking everything?
Problem: Shared Dependencies
Shared Auth Service:
[App 1] --|
[App 2] --+--> [Auth Service] (single instance)
[App 3] --|
Auth Service upgrade:
- Changes API contract (new field required)
- App 1, 2, 3 not updated yet
- Deploy auth upgrade --> Apps break (missing field)
Solution: Dependencies should follow their own dev/QA/prod lifecycle, expose versioned APIs, and maintain backward compatibility.
Pattern: Versioned Dependencies
Auth Service v1 and v2 running simultaneously:
[App 1 (old)] --> Auth v1 endpoint (/v1/auth)
[App 2 (new)] --> Auth v2 endpoint (/v2/auth)
[App 3 (old)] --> Auth v1 endpoint
Migration:
1. Deploy Auth v2 alongside v1
2. Update App 2 to use v2 (test, validate)
3. Update App 1 to use v2
4. Update App 3 to use v2
5. Decommission v1 once no clients remain
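One way to run both versions side by side is to mount the old and new contracts in the same service. A sketch using Flask, with hypothetical request fields (tenant_id standing in for the new required field) and a placeholder issue_token helper:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/v1/auth", methods=["POST"])
def auth_v1():
    # Old contract: only username and password are required.
    body = request.get_json(force=True)
    token = issue_token(body["username"], body["password"], tenant=None)
    return jsonify({"token": token})

@app.route("/v2/auth", methods=["POST"])
def auth_v2():
    # New contract: tenant_id is required in addition to credentials.
    body = request.get_json(force=True)
    token = issue_token(body["username"], body["password"], tenant=body["tenant_id"])
    return jsonify({"token": token, "tenant_id": body["tenant_id"]})

def issue_token(username, password, tenant):
    # Placeholder for the real credential check and token generation.
    return f"token-for-{username}"
```

Clients migrate by changing only the path they call, so no single deployment forces every app to move at once.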
DNS example: DNS changes need their own lifecycle. If you repoint example.com from 10.0.1.5 to 10.0.1.10, applications need to handle both IPs during the transition (the TTL period). Test the DNS change in dev/QA before prod.
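On the client side, a sketch of riding out that TTL window: ask DNS for all current addresses and try each one instead of caching a single IP (the port and timeout values are illustrative):

```python
import socket

def connect_any(hostname: str, port: int, timeout: float = 2.0) -> socket.socket:
    """Try every address DNS currently returns (old and new) until one accepts."""
    last_error = None
    for family, socktype, proto, _, addr in socket.getaddrinfo(
        hostname, port, type=socket.SOCK_STREAM
    ):
        try:
            return socket.create_connection(addr[:2], timeout=timeout)
        except OSError as exc:
            last_error = exc  # that IP may already be retired; try the next one
    raise ConnectionError(f"No address for {hostname} accepted a connection") from last_error

# During the TTL window both 10.0.1.5 and 10.0.1.10 may be returned:
# conn = connect_any("example.com", 443)
```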
# Deployment Strategies
How you deploy changes affects blast radius and rollback speed.
Blue-Green Deployment
Run two identical environments (blue and green). Deploy to inactive environment, then switch traffic.
Before Deployment:
Blue (active): v1.0 <-- 100% traffic
Green (idle): (empty or old version)
Deploy v1.1:
Blue (active): v1.0 <-- 100% traffic
Green (idle): v1.1 <-- Deploy new version, test
Switch Traffic:
Blue (idle): v1.0
Green (active): v1.1 <-- 100% traffic switched instantly
Rollback:
If v1.1 has issues, switch traffic back to blue (instant)
Pros: Fast rollback (flip switch), full testing before traffic switch, zero downtime.
Cons: Expensive (2x infrastructure during deployment), database migrations tricky (both versions must handle same schema).
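On AWS, the traffic switch is often a single load-balancer listener update. A sketch using boto3 with hypothetical listener and target-group ARNs (weighted DNS records are another common implementation):

```python
import boto3

elbv2 = boto3.client("elbv2")

# Hypothetical ARNs for the listener and the two target groups.
LISTENER_ARN = "arn:aws:elasticloadbalancing:...:listener/app/example/..."
BLUE_TG = "arn:aws:elasticloadbalancing:...:targetgroup/blue/..."
GREEN_TG = "arn:aws:elasticloadbalancing:...:targetgroup/green/..."

def switch_traffic(target_group_arn: str) -> None:
    """Point 100% of listener traffic at the given target group."""
    elbv2.modify_listener(
        ListenerArn=LISTENER_ARN,
        DefaultActions=[{"Type": "forward", "TargetGroupArn": target_group_arn}],
    )

# switch_traffic(GREEN_TG)  # cut over after validating v1.1
# switch_traffic(BLUE_TG)   # rollback is the same call with the other ARN
```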
Canary Deployment
Gradually roll out to small percentage of traffic, monitor, then expand.
Phase 1: 1% traffic to v1.1
v1.0: 99% traffic
v1.1: 1% traffic <-- Monitor error rates, latency
Phase 2: 10% traffic (if Phase 1 looks good)
v1.0: 90% traffic
v1.1: 10% traffic
Phase 3: 50% traffic
v1.0: 50% traffic
v1.1: 50% traffic
Phase 4: 100% traffic (full rollout)
v1.1: 100% traffic
v1.0: decommissioned
Pros: Catches issues early (small blast radius), gradual validation, cost-effective.
Cons: Slower rollout, requires traffic splitting infrastructure, monitoring complexity.
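The split itself usually lives in a load balancer or service mesh, but the idea reduces to weighted routing. A minimal, infrastructure-agnostic sketch:

```python
import random

def choose_backend(canary_fraction: float) -> str:
    """Route one request: canary_fraction of traffic goes to the new version."""
    return "v1.1-canary" if random.random() < canary_fraction else "v1.0-stable"

# Phase 1: 1% canary. Promote by raising the fraction while metrics stay healthy.
counts = {"v1.0-stable": 0, "v1.1-canary": 0}
for _ in range(10_000):
    counts[choose_backend(0.01)] += 1
print(counts)  # roughly 9,900 stable / 100 canary
```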
Rolling Deployment
Update instances one at a time (or in small batches).
10 instances running v1.0:
Update instance 1 --> v1.1
Update instance 2 --> v1.1
...
Update instance 10 --> v1.1
Mixed version period:
Some instances v1.0, some v1.1 (must be compatible)
Pros: Simple, no extra infrastructure, gradual rollout.
Cons: Rollback slow (must redeploy old version to all), mixed-version compatibility required.
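A sketch of the batch-and-verify loop, with deploy_to and is_healthy as hypothetical hooks for whatever your platform uses (SSH, an agent, or an instance-refresh API):

```python
import time

def rolling_deploy(instances, new_version, batch_size=2, health_timeout=120):
    """Update instances in small batches; halt if a batch fails health checks."""
    for i in range(0, len(instances), batch_size):
        batch = instances[i:i + batch_size]
        for instance in batch:
            deploy_to(instance, new_version)
        deadline = time.time() + health_timeout
        while not all(is_healthy(instance) for instance in batch):
            if time.time() > deadline:
                raise RuntimeError(f"Batch {batch} failed health checks; halting rollout")
            time.sleep(5)

def deploy_to(instance, version):
    print(f"deploying {version} to {instance}")  # placeholder deployment hook

def is_healthy(instance):
    return True  # placeholder: hit the instance's health-check endpoint
```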
# Service Dependencies and Circuit Breakers
In microservice architectures, services depend on each other. How do you prevent one service failure from cascading?
Dependency Graph
[Frontend] --> [API Gateway] --> [Auth Service]
                    |
                    +--> [Product Service] --> [Database]
                    |
                    +--> [Recommendation] --> [ML Model]
Problem: If Product Service is slow (database overloaded), API Gateway waits on timeouts, exhausts its thread pool, and becomes unresponsive. Frontend then fails too because it can't reach API Gateway. That's a cascading failure.
Solution: Circuit breakers (see Graceful Degradation section), timeouts, bulkheading (isolate resources per dependency).
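A minimal circuit breaker sketch (thresholds and timeouts are illustrative): after enough consecutive failures, calls fail fast instead of tying up threads on a sick dependency, and after a cool-down one trial request is let through.

```python
import time

class CircuitBreaker:
    """Open after failure_threshold consecutive failures; retry after reset_timeout seconds."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast instead of waiting on timeouts")
            self.opened_at = None  # half-open: allow one trial call through
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()
            raise
        self.failures = 0
        return result

# breaker = CircuitBreaker()
# product = breaker.call(fetch_product, product_id)  # fetch_product is your real client call
```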
# AWS Fault Isolation Boundaries
AWS provides guidance on designing systems using their fault isolation boundaries. Key takeaways:
Zonal Services (AZ-Scoped)
Some services operate within a single AZ: EBS volumes, EC2 instances, subnet-specific resources.
Design implication: Distribute across multiple AZs. Don't put all EC2 instances in one AZ. Use Auto Scaling Groups with multi-AZ configuration.
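A sketch of that configuration with boto3; the subnet IDs (one per AZ) and launch template name are hypothetical:

```python
import boto3

autoscaling = boto3.client("autoscaling")

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="web-asg",
    MinSize=3,
    MaxSize=9,
    DesiredCapacity=3,
    LaunchTemplate={"LaunchTemplateName": "web-template", "Version": "$Latest"},
    # One subnet per AZ, so the group spreads instances across zones.
    VPCZoneIdentifier="subnet-aaa111,subnet-bbb222,subnet-ccc333",
)
```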
Regional Services (Region-Scoped)
Some services span AZs within a region: S3, DynamoDB, ELB, Lambda. AZ failure doesn't impact these (they're already multi-AZ).
Design implication: Use regional services for HA within region. For cross-region DR, use multi-region patterns (S3 cross-region replication, DynamoDB Global Tables).
Control Plane vs Data Plane
AWS recommends understanding which API calls are control plane (RunInstances, DeleteBucket) vs data plane (GetObject, PutItem).
During outages: Control plane may be impacted, but data plane often continues. Design systems to minimize control plane dependencies during normal operation.
Example: Pre-create resources (instances, buckets) rather than creating on-demand during traffic spikes. Use instance pools instead of creating new instances per request.
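A sketch of the pre-creation idea: requests draw from a warm pool filled ahead of time, so the serving path never issues a control-plane create call (the pool contents are illustrative):

```python
import queue

class WarmPool:
    """Hand out pre-created resources at request time instead of creating them on demand."""

    def __init__(self, resources):
        self._pool = queue.Queue()
        for resource in resources:  # created ahead of time, outside the request path
            self._pool.put(resource)

    def acquire(self, timeout=1.0):
        # Data-plane only: no create calls here, just take one from the pool.
        return self._pool.get(timeout=timeout)

    def release(self, resource):
        self._pool.put(resource)

# pool = WarmPool(["i-0aaa", "i-0bbb", "i-0ccc"])  # instance IDs are hypothetical
# worker = pool.acquire()
```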
# Key Takeaways
- Separate dev/qa/prod environments completely—shared resources create failure correlations
- QA should mirror prod topology and data volume to catch scale-dependent issues
- Dependencies (DNS, auth, DBs) need their own dev/qa/prod lifecycle and versioned APIs
- Deployment strategies: blue-green (fast rollback), canary (gradual validation), rolling (simple)
- Service dependencies require circuit breakers, timeouts, and bulkheading to prevent cascades
- AWS: Use multi-AZ for zonal services, regional services for HA, minimize control plane dependencies
- Test deployment strategies in QA before prod—ensure rollback actually works