
Cloud Provider Architecture

Cloud providers abstract datacenter infrastructure into logical constructs like regions, availability zones, and VPCs. Understanding how these map to physical infrastructure helps you design for resilience, predict failure modes, and optimize costs. This section focuses on AWS with cross-references to GCP and Azure equivalents.

# AWS Regions and Availability Zones

AWS organizes infrastructure hierarchically into regions and availability zones (AZs).

Regions

A region is a geographic area (e.g., us-east-1 in Northern Virginia, eu-west-1 in Ireland). Each region is completely independent, with its own datacenters, power grids, and networks.

Isolation: Region failure (extremely rare) doesn't affect other regions. Services in us-east-1 won't be impacted by a us-west-2 outage.

Data residency: Data in one region doesn't replicate to others unless explicitly configured. Important for compliance (GDPR, data sovereignty laws).

Availability Zones (AZs)

An availability zone is one or more discrete datacenters within a region, with redundant power, networking, and connectivity. Each region has multiple AZs (typically 3-6).

Region: us-east-1 (Northern Virginia)
    |
    +-- AZ: us-east-1a (Datacenter facility 1)
    |       Power, network, cooling isolated
    |
    +-- AZ: us-east-1b (Datacenter facility 2)
    |       Separate building, power grid, network path
    |
    +-- AZ: us-east-1c (Datacenter facility 3)
    |       Geographically distant from 1a/1b
    |
    +-- AZ: us-east-1d, us-east-1e, us-east-1f
            Additional AZs for redundancy

AZs connected by high-bandwidth, low-latency links (<2ms)

Physical mapping: AZ names (us-east-1a) are randomized per AWS account. Your us-east-1a might be a different physical datacenter than mine. This prevents all customers from piling into "AZ A" (which would create hot spots).
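To build intuition for this per-account randomization, here is a toy Python model. It assumes a made-up mapping from AZ names to the stable AZ IDs AWS exposes for cross-account coordination (e.g., use1-az1); the shuffle seeded by account ID is purely illustrative, not how AWS actually assigns names:

```python
import random

# Stable physical identifiers (illustrative subset of real AZ ID format).
PHYSICAL_AZS = ["use1-az1", "use1-az2", "use1-az4"]
AZ_NAMES = ["us-east-1a", "us-east-1b", "us-east-1c"]

def az_mapping(account_id: str) -> dict:
    """Toy model: each account gets its own shuffled name -> AZ-ID mapping."""
    ids = PHYSICAL_AZS.copy()
    random.Random(account_id).shuffle(ids)  # deterministic per account
    return dict(zip(AZ_NAMES, ids))

# Two accounts: same names, (likely) different physical AZs behind them.
a = az_mapping("111111111111")
b = az_mapping("222222222222")
```

The point of the sketch: "us-east-1a" is a per-account label, so comparing AZ names across accounts is meaningless; compare AZ IDs instead.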

Failure domain: AZ is the fundamental failure domain in AWS. Power outage, network partition, or cooling failure affects one AZ, not others.

Designing for AZ Failures

Deploy resources across multiple AZs for high availability:

Single AZ (not resilient):
    us-east-1a: [Web Server] [Database Primary]
    us-east-1b: (empty)
    us-east-1c: (empty)

Problem: AZ failure = total outage

Multi-AZ (resilient):
    us-east-1a: [Web Server] [Database Primary]
    us-east-1b: [Web Server] [Database Standby]
    us-east-1c: [Web Server]

AZ failure: traffic shifts to remaining AZs, DB fails over to standby
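The multi-AZ layout above can be sketched as a simple placement function, a minimal illustration (instance names are made up):

```python
from itertools import cycle

def spread(instances, azs):
    """Round-robin placement of instances across AZs."""
    placement = {az: [] for az in azs}
    for inst, az in zip(instances, cycle(azs)):
        placement[az].append(inst)
    return placement

def survivors(placement, failed_az):
    """Instances still serving after one AZ fails."""
    return [i for az, insts in placement.items() if az != failed_az for i in insts]

p = spread(["web1", "web2", "web3"],
           ["us-east-1a", "us-east-1b", "us-east-1c"])
```

With three instances spread over three AZs, losing any one AZ leaves two-thirds of capacity serving; with the single-AZ layout, the same failure leaves zero.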

# VPCs: Virtual Private Clouds

A VPC is a logically isolated network within an AWS region. You define IP ranges (CIDR blocks), subnets, routing tables, and gateways.

VPC: 10.0.0.0/16 in us-east-1
    |
    +-- Subnet: 10.0.1.0/24 in us-east-1a (public)
    |   [EC2 instances with public IPs, route to internet gateway]
    |
    +-- Subnet: 10.0.2.0/24 in us-east-1b (public)
    |
    +-- Subnet: 10.0.10.0/24 in us-east-1a (private)
    |   [Databases, no direct internet access]
    |
    +-- Subnet: 10.0.11.0/24 in us-east-1b (private)

Subnets: Each subnet lives in a single AZ. Spread subnets across AZs for redundancy.

Routing: Route tables control traffic. Public subnets route 0.0.0.0/0 to internet gateway (IGW). Private subnets route via NAT gateway (for outbound) or stay isolated.
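The subnet layout and route decisions above can be sketched with Python's stdlib ipaddress module. The route targets (igw-1234, nat-5678) are hypothetical placeholders; real route tables, like this sketch, resolve overlaps by longest prefix match:

```python
import ipaddress

vpc = ipaddress.ip_network("10.0.0.0/16")
public = ipaddress.ip_network("10.0.1.0/24")    # carved out of the VPC CIDR
private = ipaddress.ip_network("10.0.10.0/24")

# Simplified route tables: (destination, target). IDs are invented.
ROUTES = {
    "public":  [(vpc, "local"), (ipaddress.ip_network("0.0.0.0/0"), "igw-1234")],
    "private": [(vpc, "local"), (ipaddress.ip_network("0.0.0.0/0"), "nat-5678")],
}

def next_hop(subnet_kind, dst_ip):
    """Longest-prefix-match lookup, as a real route table would do."""
    dst = ipaddress.ip_address(dst_ip)
    matches = [(net, tgt) for net, tgt in ROUTES[subnet_kind] if dst in net]
    return max(matches, key=lambda m: m[0].prefixlen)[1]
```

Traffic to another address inside 10.0.0.0/16 stays "local" in both subnet kinds; internet-bound traffic goes to the IGW from public subnets and the NAT gateway from private ones.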

Under the hood: VPCs use encapsulation (similar to VXLAN) to isolate customer traffic on shared physical infrastructure. Your 10.0.1.5 is distinct from another customer's 10.0.1.5, even when both happen to run on the same physical server.
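A toy model of that isolation: if the fabric keys traffic by (tenant network ID, overlay IP) rather than by IP alone, identical addresses never collide. The VNI values and host names below are invented:

```python
# Toy overlay fabric: every endpoint is keyed by (vni, overlay_ip),
# so two tenants can reuse 10.0.1.5 without colliding.
fabric = {}

def attach(vni, overlay_ip, host):
    fabric[(vni, overlay_ip)] = host

attach(vni=1001, overlay_ip="10.0.1.5", host="phys-host-7")  # customer A
attach(vni=2002, overlay_ip="10.0.1.5", host="phys-host-7")  # customer B, same IP
```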

# AWS Storage Services Architecture

Understanding how AWS storage services work internally helps you choose the right service and predict performance/availability.

S3: Object Storage

Architecture: S3 stores objects (files) across multiple AZs by default. Each object is replicated to at least 3 AZs. Metadata and data are separated (control plane vs data plane).

S3 PUT request:
    Client --> S3 API (control plane)
            --> Metadata updated (object name, size, etag)
            --> Data written to 3+ AZs (data plane)
            --> Returns success when 2+ replicas confirmed

S3 GET request:
    Client --> S3 API --> Read from nearest replica
                      --> Return object

11 9's durability (99.999999999%) via replication + erasure coding

Consistency: S3 provides strong read-after-write consistency (as of Dec 2020). After successful PUT, immediate GET returns new data.
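One way to see why a quorum write can give strong read-after-write: if a PUT waits for W acks out of N replicas and a GET consults R replicas with R + W > N, every read set overlaps at least one acked write. This is a toy model of that property, not S3's documented internals:

```python
import random

N, W, R = 3, 2, 2  # replicas, write quorum, read quorum (illustrative)

class Store:
    def __init__(self):
        self.replicas = [{} for _ in range(N)]

    def put(self, key, value, version):
        # Toy simplification: only the W acked replicas are written here;
        # a real system keeps replicating to the rest in the background.
        for i in random.sample(range(N), W):
            self.replicas[i][key] = (version, value)
        return True

    def get(self, key):
        # Read R replicas; highest version wins. Since R + W > N, at least
        # one read replica overlaps the acked write set.
        reads = [self.replicas[i].get(key) for i in random.sample(range(N), R)]
        return max((r for r in reads if r), default=None)

s = Store()
s.put("photo.jpg", b"v1-bytes", version=1)
```

However the random write and read sets fall, get() always sees the latest acknowledged version, which is the overlap argument behind read-after-write consistency.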

Failure mode: An AZ failure doesn't affect S3 availability (data is served from the remaining AZs). Rare control plane issues (like the Oct 2025 DynamoDB incident) can still impact API availability.

EBS: Elastic Block Store

Architecture: EBS volumes are block devices (like virtual disks) attached to EC2 instances. Each EBS volume exists in a single AZ and is replicated within that AZ.

EBS Volume in us-east-1a:
    [EC2 instance in us-east-1a] <--> [EBS volume]
                                          |
                                Replicated within AZ
                                (multiple physical drives)
                                          |
                                      Not visible
                                    across AZs

AZ failure = EBS volumes in that AZ unavailable

Limitation: EBS volumes can't be attached to instances in different AZs. To move data across AZs, create a snapshot (stored in S3, which is multi-AZ) and restore it in the target AZ.
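The snapshot-based move can be sketched as a toy model in which volumes are AZ-scoped but snapshots live in a region-scoped store; all IDs and data here are made up:

```python
snapshots = {}  # region-scoped store (backed by S3 in the real service)

class Volume:
    """AZ-scoped block device."""
    def __init__(self, az, data):
        self.az = az
        self.data = data

def snapshot(vol, snap_id):
    snapshots[snap_id] = bytes(vol.data)  # copy the blocks out of the AZ
    return snap_id

def restore(snap_id, target_az):
    return Volume(target_az, snapshots[snap_id])

src = Volume("us-east-1a", b"db-blocks")
new = restore(snapshot(src, "snap-001"), "us-east-1b")
```

The volume never crosses the AZ boundary; only the snapshot does, which is why snapshots are also the standard path for cross-region copies.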

Use case: Low-latency block storage for databases, boot volumes. Not suitable for multi-AZ shared access (use EFS or S3 instead).

EFS: Elastic File System

Architecture: EFS is a managed NFS-compatible file system that spans multiple AZs. Data is automatically replicated across AZs in a region.

EFS File System in us-east-1:
    [EC2 in us-east-1a] --|
    [EC2 in us-east-1b] --+--> [EFS] (replicated across AZs)
    [EC2 in us-east-1c] --|

All instances mount same file system, see same data
AZ failure: instances in other AZs continue accessing EFS

Use case: Shared file storage across instances/AZs. Home directories, application data, content management. Higher latency than EBS (network filesystem).
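A minimal contrast with EBS: every EFS mount refers to the same backing filesystem, so a write through one mount is immediately visible through the others. This sketch is purely illustrative:

```python
efs = {}  # one region-wide filesystem object

def mount(fs):
    """Every mount is a handle to the same backing store."""
    return fs

m1 = mount(efs)  # EC2 in us-east-1a
m2 = mount(efs)  # EC2 in us-east-1b
m1["app/config.json"] = b"{}"
```

With EBS, by contrast, each instance would hold its own AZ-scoped copy and the write would be visible only locally.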

# Fault Domains in Cloud Environments

Cloud providers define fault domains at multiple levels:

  • Server/instance: Single VM or bare-metal instance. Smallest unit.
  • Rack: Not exposed to customers, but AWS internally tracks rack-level failures.
  • Availability Zone: Exposed to customers. Primary failure domain for design.
  • Region: Largest failure domain. Complete geographic isolation.

AWS guidance: Distribute workloads across AZs for HA within a region. Use multiple regions for disaster recovery (DR) against region-level failures.

# Cross-Cloud Comparison

GCP and Azure have similar concepts with different terminology.

+---------------------+--------------+--------------+---------------+
| Concept             | AWS          | GCP          | Azure         |
+---------------------+--------------+--------------+---------------+
| Geographic Area     | Region       | Region       | Region        |
+---------------------+--------------+--------------+---------------+
| Fault Domain        | Availability | Zone         | Availability  |
| (within region)     | Zone (AZ)    |              | Zone          |
+---------------------+--------------+--------------+---------------+
| Private Network     | VPC          | VPC          | Virtual       |
|                     |              |              | Network (VNet)|
+---------------------+--------------+--------------+---------------+
| Object Storage      | S3           | Cloud        | Blob Storage  |
|                     |              | Storage (GCS)|               |
+---------------------+--------------+--------------+---------------+
| Block Storage       | EBS          | Persistent   | Managed Disks |
|                     |              | Disk         |               |
+---------------------+--------------+--------------+---------------+
| File Storage        | EFS          | Filestore    | Azure Files   |
+---------------------+--------------+--------------+---------------+
| Compute (VMs)       | EC2          | Compute      | Virtual       |
|                     |              | Engine       | Machines      |
+---------------------+--------------+--------------+---------------+

Note: While the names differ, the core concepts are the same. All major clouds provide multi-AZ/zone architectures, private networks, and tiered storage (object/block/file).

# Key Takeaways

  • Regions are geographically isolated; AZs are fault-isolated datacenters within a region
  • AZ names are randomized per account to prevent hot spots
  • Design for AZ failures by spreading resources across multiple AZs
  • VPCs use encapsulation to isolate customer networks on shared infrastructure
  • S3 replicates across AZs (multi-AZ resilient); EBS is single-AZ (snapshot to move data)
  • EFS spans AZs for shared file access across instances
  • Fault domains: instance < rack < AZ < region; design for AZ-level failures at minimum
  • GCP/Azure have equivalent concepts (zones, VNets, object/block/file storage)