Physical Infrastructure
Whether you're deciding where to deploy workloads, purchasing colocation space, or evaluating cloud providers, understanding physical datacenter infrastructure is essential. The physical layout determines your fault domains, dictates your availability guarantees, and constrains your cost structure. Power density limits what hardware you can deploy, cooling capacity affects performance consistency, and the network topology impacts your latency and bandwidth options. This knowledge helps you ask the right questions during vendor selection, design for realistic failure scenarios, and troubleshoot outages that trace back to physical infrastructure.
# Racks, Rows, and Pods
When purchasing colocation space or deploying in a cloud provider, you're buying into this hierarchy. Understanding these levels helps you negotiate contracts, plan capacity, and design fault-tolerant systems:
Datacenter Building
|
+-- Pod (10-40 racks, dedicated power/cooling)
    |
    +-- Row (8-12 racks, hot/cold aisle arrangement)
        |
        +-- Rack (42U standard, ~10-15kW power typical)
            |
            +-- Servers (1U, 2U, or blade chassis)
Rack: The fundamental unit. Standard racks are 42U tall (1U = 1.75 inches). Servers, switches, and PDUs mount vertically. Each rack has power and network connectivity.
Row: Multiple racks arranged in a line, typically with "hot aisle / cold aisle" layout to manage airflow. Cold air flows from the front (cold aisle), hot air exhausts from the back (hot aisle).
Pod: A group of racks (sometimes called a cluster or zone) sharing power distribution, cooling infrastructure, and often network aggregation. Pods create natural fault domains—power or cooling failure in one pod doesn't affect others. When negotiating colocation contracts, confirm your workloads can be distributed across multiple pods. Cloud providers map these to "availability zones."
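To make the hierarchy concrete, here is a back-of-the-envelope capacity sketch in Python using the figures above (42U racks, a 10-15kW rack budget, pods of 10-40 racks). The 2U server size, 0.6kW per-server draw, 12kW rack budget, and 20-rack pod are illustrative assumptions, not vendor specs:

```python
# Rough capacity check: how many servers fit in a rack, and how many
# servers share a pod-level fault domain. All numbers are assumptions.

RACK_UNITS = 42          # standard 42U rack
SERVER_HEIGHT_U = 2      # assumed 2U servers
SERVER_DRAW_KW = 0.6     # assumed average draw per server under load
RACK_POWER_KW = 12.0     # within the typical 10-15kW rack budget
RACKS_PER_POD = 20       # assumed pod size within the 10-40 range

# Space-limited vs power-limited server count per rack
space_limit = RACK_UNITS // SERVER_HEIGHT_U
power_limit = int(RACK_POWER_KW // SERVER_DRAW_KW)
servers_per_rack = min(space_limit, power_limit)

print(f"Space allows {space_limit} servers, power allows {power_limit}")
print(f"Servers per rack: {servers_per_rack} "
      f"({'power' if power_limit < space_limit else 'space'}-limited)")
print(f"Servers lost if one pod fails: {servers_per_rack * RACKS_PER_POD}")
```

With these assumed numbers the rack is power-limited (20 servers), not space-limited (21 slots), which is why per-rack power capacity is one of the first questions to ask a vendor.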
# Power Distribution
Power is your most critical constraint—it determines your density limits, operational costs, and availability guarantees. When buying datacenter space, verify the power architecture. A single power failure can take down hundreds of servers if redundancy isn't properly architected. Ask vendors: What's the power capacity per rack? Do you provide dual feeds? What's the UPS runtime?
Redundant Power Feeds (A/B Power)
  Utility Grid 1            Utility Grid 2
        |                         |
   +----v----+               +----v----+
   |  UPS A  |               |  UPS B  |     <- Uninterruptible Power
   +---------+               +---------+        Supply (battery backup)
        |                         |
   +----v----+               +----v----+
   |  PDU A  |               |  PDU B  |     <- Power Distribution Units
   +---------+               +---------+
        |                         |
   +----v-------------------------v----+
   |         Server (Dual PSU)         |     <- Server has two power
   |    PSU 1 (A)         PSU 2 (B)    |        supplies, one per feed
   +-----------------------------------+
Dual power feeds (A/B power) ensure that a single power failure doesn't bring down servers. Each server has two power supplies (PSUs), each connected to a different PDU backed by a separate UPS and utility feed.
Failure scenario: If PDU A fails, servers continue running on PSU B powered by PDU B. If utility Grid 1 fails, UPS A provides battery power while the datacenter switches to Grid 2 or starts generators.
Real-world outage: On December 22, 2021, AWS experienced a power failure in a single data center within availability zone USE1-AZ4 (us-east-1). The outage began at 4:35 AM PST and power was restored by 5:39 AM, but networking issues persisted for hours. Some physical infrastructure was permanently damaged, resulting in unrecoverable EC2 instances and EBS volumes. (Details)
Key lesson for users: The incident affected customers who deployed workloads entirely within a single availability zone. AWS's recommendation to distribute workloads across multiple AZs proved critical—customers using multiple AZs maintained availability despite the power failure. When evaluating providers, verify what constitutes an "availability zone" and confirm they map to separate physical infrastructure (different pods/power systems).
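On AWS, you can sanity-check this kind of AZ spread programmatically. The sketch below uses boto3 and assumes default credentials and region are configured; note that zone names (us-east-1a) are shuffled per account, while zone IDs (like USE1-AZ4 above) identify the physical zone:

```python
import boto3
from collections import Counter

ec2 = boto3.client("ec2")

# Map account-specific zone names (us-east-1a) to physical zone IDs (use1-az4)
zones = ec2.describe_availability_zones()["AvailabilityZones"]
name_to_id = {z["ZoneName"]: z["ZoneId"] for z in zones}

# Count running instances per physical zone
counts = Counter()
paginator = ec2.get_paginator("describe_instances")
for page in paginator.paginate(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
):
    for reservation in page["Reservations"]:
        for instance in reservation["Instances"]:
            az_name = instance["Placement"]["AvailabilityZone"]
            counts[name_to_id.get(az_name, az_name)] += 1

for zone_id, count in sorted(counts.items()):
    print(f"{zone_id}: {count} running instances")
if len(counts) < 2:
    print("WARNING: all running instances sit in a single physical zone")
```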
# Cooling Systems
Servers generate heat—modern high-density racks dissipate 10-15kW (HPC racks exceed 30kW). Without adequate cooling, your workloads don't just fail; they throttle first, creating subtle performance degradation before outright failures. When evaluating datacenter space, ask: What's the cooling capacity per rack? What happens during cooling system failures? Are there redundant cooling units? In extreme cases, datacenters shut down servers to prevent hardware damage—your uptime depends on cooling infrastructure you can't directly control.
Hot Aisle / Cold Aisle Layout
      Cold Aisle               Hot Aisle          Cold Aisle
           |                       |                   |
[Rack]  [Rack]  [Rack]  [Rack]  [Rack]  [Rack]  [Rack]  [Rack]
   ^       ^       ^       |       |       |       ^       ^
   |       |       |       v       v       v       |       |
 Cold    Cold    Cold     Hot     Hot     Hot    Cold    Cold
  Air     Air     Air     Air     Air     Air     Air     Air
   |       |       |       |       |       |       |       |
   +-------+-------+-------+-------+-------+-------+-------+
                               |
                         Raised Floor
                       (cold air plenum)
                               |
                          CRAC Units
               (Computer Room Air Conditioning)
Airflow: Cold air is pumped under a raised floor, flows up through perforated tiles in the cold aisle, gets pulled through servers (front to back), and exhausts hot air into the hot aisle. Hot air is captured and returned to CRAC (Computer Room Air Conditioning) units for cooling.
Why this matters: Improper airflow (e.g., missing blanking panels in racks, blocked cold aisles) causes hot spots. Servers in hot spots throttle performance or shut down, creating mysterious "node failures" that are actually thermal issues.
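One way to catch thermal issues before they masquerade as node failures is to watch the kernel's thermal sensors on each host. Below is a minimal Linux-only sketch; the sysfs paths are standard on Linux, but the 85°C alert threshold and the set of available sensors are assumptions that vary by platform:

```python
import glob
from pathlib import Path

ALERT_C = 85.0  # assumed alert threshold; tune to your hardware spec


def read_thermal_zones():
    """Yield (zone_type, temp_celsius) from Linux sysfs thermal zones."""
    for zone in sorted(glob.glob("/sys/class/thermal/thermal_zone*")):
        zone_path = Path(zone)
        try:
            zone_type = (zone_path / "type").read_text().strip()
            # temp is reported in millidegrees Celsius
            temp_c = int((zone_path / "temp").read_text().strip()) / 1000.0
        except (OSError, ValueError):
            continue  # sensor may be unreadable; skip it
        yield zone_type, temp_c


if __name__ == "__main__":
    for zone_type, temp_c in read_thermal_zones():
        flag = "  <-- HOT" if temp_c >= ALERT_C else ""
        print(f"{zone_type:20s} {temp_c:6.1f} C{flag}")
```

Running something like this fleet-wide and correlating alerts by rack position is often how "random" node flakiness gets traced back to a blocked cold aisle or a missing blanking panel.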
Real-world outage: On July 19, 2022 (the hottest day on record in London at 40.2°C), Google Cloud experienced a simultaneous failure of multiple redundant cooling systems in a datacenter hosting zone europe-west2-a. Google terminated approximately 35% of VMs in that zone to prevent hardware damage. The outage lasted 18 hours, with a 35-hour "long tail" before full recovery. (Incident Report)
What failed: Multiple redundant cooling systems failed simultaneously under extreme heat. Additionally, human error compounded the issue—engineers inadvertently modified traffic routing to avoid all three zones in europe-west2 (including the unaffected zones), expanding the blast radius beyond the initial cooling failure.
Google's response: (1) Audit cooling equipment and standards across all datacenters globally, (2) Repair and re-test failover automation, (3) Develop methods to progressively decrease thermal load within a datacenter to avoid full shutdowns, (4) Detailed root cause analysis of cooling system failure modes.
Key lesson for users: Even redundant cooling systems can fail under extreme conditions. Ask providers: How are cooling systems tested under thermal stress? What's the protocol during partial cooling failures? Geographic diversity matters—don't assume all zones in a region have independent cooling infrastructure. For workloads requiring extreme availability, distribute across regions.
# Network: Top of Rack (ToR) vs End of Row (EoR)
Network architecture determines your fault isolation, latency, and bandwidth capacity. When evaluating datacenter providers, ask about their network topology—it affects both your availability and performance profile. Two common approaches for connecting racks to the network:
Top of Rack (ToR) Switches
 Rack 1      Rack 2      Rack 3
+------+    +------+    +------+
| ToR  |----| ToR  |----| ToR  |     <- Each rack has its own switch
| SW 1 |    | SW 2 |    | SW 3 |        Uplinks to aggregation layer
+--+---+    +--+---+    +--+---+
   |           |           |
+--+---+    +--+---+    +--+---+
|Server|    |Server|    |Server|
+------+    +------+    +------+
|Server|    |Server|    |Server|
+------+    +------+    +------+
Pros: Superior fault isolation—a ToR switch failure affects only one rack (~40-80 servers). Short cable runs reduce latency. Preferred for workloads requiring predictable failure domains.
Cons: More switches to manage. If buying your own hardware, higher upfront switch costs (offset by lower cable costs).
End of Row (EoR) Switches
 Rack 1         Rack 2         Rack 3
+------+       +------+       +------+
|Server|--+    |Server|--+    |Server|--+
+------+  |    +------+  |    +------+  |
|Server|--+    |Server|--+    |Server|--+
+------+  |    +------+  |    +------+  |
          |              |              |
          +--------------+--------------+
                         |
                    +----v----+
                    |   EoR   |     <- One large switch at end of row
                    |  Switch |        serves entire row
                    +---------+
Pros: Fewer switches to manage. Potentially lower initial deployment cost.
Cons: Larger blast radius—switch failure affects the entire row. Longer cable runs, with latency that varies slightly by rack distance from the switch. If you're buying space in an EoR datacenter, ask how switch redundancy is handled and what the expected impact radius is for switch failures.
Modern approach: ToR dominates modern datacenters due to better fault isolation, especially with leaf-spine topologies (covered in Network Topology). When evaluating providers, confirm they use ToR architecture if your workload requires rack-level fault isolation.
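A quick way to internalize the difference is to compare how many servers share a single switch as a fault domain. The sketch below uses illustrative counts (40 servers per rack, 10 racks per row, within the ranges given earlier), not any specific datacenter's layout:

```python
# Compare the blast radius of a single switch failure under ToR vs EoR.
# Counts are illustrative assumptions, not a specific datacenter's layout.

SERVERS_PER_RACK = 40   # e.g., 40 x 1U servers in a 42U rack
RACKS_PER_ROW = 10      # within the 8-12 racks-per-row range above

tor_blast_radius = SERVERS_PER_RACK                  # one rack per ToR switch
eor_blast_radius = SERVERS_PER_RACK * RACKS_PER_ROW  # whole row per EoR switch

print(f"ToR switch failure: {tor_blast_radius} servers affected")
print(f"EoR switch failure: {eor_blast_radius} servers affected")
```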
# Fault Domains from Physical Layout
Understanding fault domains is crucial for capacity planning and SLA design. These boundaries determine what can fail together and inform your redundancy strategy:
- Rack-level: ToR switch failure, PDU failure, or power feed issue affects one rack (~40-80 servers)
- Row-level: A cooling or airflow failure affecting a hot/cold aisle pair can affect an entire row
- Pod-level: Power distribution failure, cooling plant failure, or network aggregation failure affects an entire pod (hundreds to thousands of servers)
- Datacenter-level: Building power loss, natural disaster, or network partition from the internet
Design implication: When purchasing datacenter space or deploying to cloud providers, map your availability requirements to fault domains. For 99.99% uptime, deploy across multiple pods (availability zones). For mission-critical systems (99.999%+), deploy across multiple datacenters or regions. Ask providers: How do your availability zones map to physical pods? Can I get rack/pod placement guarantees? What's the failure correlation between zones?
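As a rough illustration of why spreading across fault domains pays off: if each domain is independently available with probability A and the system stays up as long as one replica survives, combined availability is 1 - (1 - A)^N. The sketch below treats the 99.9% per-domain figure and the independence assumption as illustrative; real zones carry some correlated failure risk:

```python
# Availability of "at least one replica up" across N independent fault domains.
# Assumes failures are independent and that one surviving replica is enough,
# which real systems and real zones only approximate.

def combined_availability(per_domain: float, n_domains: int) -> float:
    return 1.0 - (1.0 - per_domain) ** n_domains

per_pod = 0.999  # assumed 99.9% availability for a single pod/zone
for n in (1, 2, 3):
    a = combined_availability(per_pod, n)
    downtime_min_per_year = (1.0 - a) * 365 * 24 * 60
    print(f"{n} fault domain(s): {a:.7%} available, "
          f"~{downtime_min_per_year:.2f} min/year downtime")
```

Under these assumptions, one 99.9% pod means roughly 526 minutes of downtime per year, while two independent pods drop that to under a minute, which is why the multi-pod / multi-AZ question belongs in every contract review.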
# Key Takeaways for Datacenter Users
- Fault domains drive availability design: Physical layout creates hierarchical failure boundaries (rack > row > pod > datacenter). Map your SLA requirements to these domains when purchasing space or deploying workloads.
- Power architecture is non-negotiable: Verify dual power feeds (A/B) with separate UPS systems. Ask about per-rack power capacity—it constrains your hardware density and operational costs.
- Cooling is a hidden SLA dependency: Cooling failures cause both performance degradation (throttling) and hard outages (automated shutdowns). Even redundant systems fail under extreme conditions. Ask about cooling capacity, redundancy, and thermal stress testing.
- Network topology affects blast radius: ToR switches provide better fault isolation than EoR. Confirm the network architecture when evaluating providers—it determines what fails together.
- Availability zones vary by provider: Verify what "availability zone" means physically. Do zones map to separate pods with independent power/cooling? Get placement guarantees in writing.
- Real incidents validate your assumptions: Both AWS and Google Cloud have experienced datacenter-level failures in recent years. Design for these scenarios, not theoretical ideals. Distribute across zones/regions based on your uptime requirements.