
Network Topology

Network topology determines how data flows through a datacenter, how much bandwidth is available, and what happens when switches fail. The evolution from traditional 3-tier architectures to modern leaf-spine designs reflects changing traffic patterns and scale requirements.

# Traditional 3-Tier Architecture

The classic datacenter network has three layers: access, aggregation, and core.

                        +-------------+
                        |    Core     |  < Core switches (high-capacity)
                        |  Switches   |    Route traffic between pods
                        +------+------+    and to internet
                          /    |    \
                         /     |     \
                        /      |      \
               +-------+  +----+----+  +-------+
               | Aggr  |  |  Aggr   |  | Aggr  |  < Aggregation layer
               | SW 1  |  |  SW 2   |  | SW 3  |    Aggregate ToR uplinks
               +---+---+  +----+----+  +---+---+
                   / \         / \         / \
                  /   \       /   \       /   \
               +---+ +---+ +---+ +---+ +---+ +---+
               |ToR| |ToR| |ToR| |ToR| |ToR| |ToR|  < Access layer (ToR)
               +---+ +---+ +---+ +---+ +---+ +---+    Connect to servers
                 |     |     |     |     |     |
                 Servers     Servers     Servers

Access Layer: Top of Rack (ToR) switches connect servers within a rack.

Aggregation Layer: Aggregates multiple ToR uplinks, provides Layer 2/3 boundary, often where VLANs terminate.

Core Layer: High-capacity switches that route traffic between aggregation blocks (pods) and provide connectivity to the internet or WAN.

Limitations of 3-Tier

  • Oversubscription: Uplinks from access → aggregation → core create bottlenecks. Typical 4:1 or 8:1 oversubscription means not all servers can use full bandwidth simultaneously [1] (a small worked sketch follows this list).
  • East-West traffic bottleneck: Traffic between servers in different pods must hairpin through aggregation and core layers, adding latency and consuming bandwidth.
  • Spanning Tree Protocol (STP): To prevent loops, STP blocks redundant links, wasting capacity. Active links carry all traffic while backup links sit idle.
  • Scaling complexity: Adding capacity requires upgrading core switches (expensive, disruptive).
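
To make the oversubscription point concrete, here is a small sketch (hypothetical port counts, not taken from any specific product) showing how per-tier ratios compound along the access → aggregation → core path:

```python
# Illustrative sketch: oversubscription compounds across tiers in a 3-tier design.
# All port counts and link speeds below are hypothetical, not vendor specs.

def oversubscription(downstream_gbps: float, upstream_gbps: float) -> float:
    """Ratio of server-facing (downstream) to uplink (upstream) bandwidth."""
    return downstream_gbps / upstream_gbps

# Access tier: 48 x 10 Gbps server ports per ToR, 3 x 40 Gbps uplinks to aggregation.
access = oversubscription(48 * 10, 3 * 40)        # 480 / 120 = 4.0

# Aggregation tier: 12 ToR uplinks of 40 Gbps in, 6 x 40 Gbps uplinks to the core.
aggregation = oversubscription(12 * 40, 6 * 40)   # 480 / 240 = 2.0

# Worst case for cross-pod traffic: the ratios multiply along the path.
end_to_end = access * aggregation                 # 8.0, i.e. 8:1

print(f"access {access:.0f}:1, aggregation {aggregation:.0f}:1, "
      f"end-to-end {end_to_end:.0f}:1")
```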

# Leaf-Spine Architecture

Modern datacenters use leaf-spine topology (also known as Clos topology), which provides non-blocking bandwidth, predictable latency, and simple scaling. Major cloud providers including Google [2], Meta (Facebook) [3], and Microsoft Azure [4] have adopted variants of this architecture for their datacenters.

         Spine Layer (every leaf connects to every spine)
            +-------+    +-------+    +-------+    +-------+
            | Spine |    | Spine |    | Spine |    | Spine |
            |   1   |    |   2   |    |   3   |    |   4   |
            +---+---+    +---+---+    +---+---+    +---+---+
                |   \    /   |   \    /   |   \    /   |
                |    \  /    |    \  /    |    \  /    |
                |     \/     |     \/     |     \/     |
                |     /\     |     /\     |     /\     |
                |    /  \    |    /  \    |    /  \    |
                |   /    \   |   /    \   |   /    \   |
          +--------+  +--------+  +--------+  +--------+
         | Leaf 1 |  | Leaf 2 |  | Leaf 3 |  | Leaf 4 |
         +----+---+  +----+---+  +----+---+  +----+---+
              |           |           |           |
         Servers     Servers     Servers     Servers
         (1 rack)    (1 rack)    (1 rack)    (1 rack)

Leaf switches: Connect directly to servers (typically one leaf per rack, like ToR). Each leaf connects to every spine switch.

Spine switches: Never connect to servers or other spines. They only connect to leaf switches and provide forwarding between leafs.
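
A back-of-the-envelope sizing sketch helps show how the leaf/spine split works. The 32-port switches and the one-uplink-per-spine rule below are assumptions for illustration, not a recommendation:

```python
# Back-of-the-envelope leaf-spine sizing. The 32-port switches and the
# one-uplink-per-spine rule are assumptions for illustration only.

LEAF_PORTS = 32    # ports per leaf switch (all assumed to be the same speed)
SPINE_PORTS = 32   # ports per spine switch

def fabric_capacity(spines: int) -> dict:
    """Size a two-tier leaf-spine fabric with `spines` spine switches."""
    uplinks_per_leaf = spines                      # one link to every spine
    downlinks_per_leaf = LEAF_PORTS - uplinks_per_leaf
    max_leaves = SPINE_PORTS                       # each spine needs a port per leaf
    return {
        "servers_per_rack": downlinks_per_leaf,
        "max_racks": max_leaves,
        "max_servers": downlinks_per_leaf * max_leaves,
        "oversubscription": downlinks_per_leaf / uplinks_per_leaf,
    }

print(fabric_capacity(spines=16))   # 16 servers/rack, 32 racks, 512 servers, 1:1
print(fabric_capacity(spines=8))    # 24 servers/rack, 32 racks, 768 servers, 3:1
```

The trade-off is visible directly: fewer spines free up leaf ports for servers but raise the oversubscription ratio.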

Advantages of Leaf-Spine

  • Non-blocking: With enough spine capacity (total uplink bandwidth per leaf matching its server-facing bandwidth), every server-to-server path can run at full speed; there is no oversubscription bottleneck.
  • Predictable latency: Every inter-rack path is exactly 2 switch hops (leaf → spine → leaf); servers in the same rack communicate through their leaf alone. Paths that cross an additional core/super-spine tier between pods add correspondingly more hops, but still deterministically [2].
  • ECMP (Equal-Cost Multi-Path): Multiple equal-cost paths between leafs (one per spine) enable hash-based load balancing [5][6]; all links stay active, with no STP blocking (see the toy hashing sketch after this list).
  • Horizontal scaling: Add more leafs for more server capacity. Add more spines for more bandwidth. No need to replace existing hardware [3].
  • Failure resilience: Single spine failure reduces bandwidth but doesn't partition network. Traffic automatically redistributes via ECMP.
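
As a toy illustration of the ECMP and failure-resilience points above, the sketch below hashes a flow's 5-tuple to pick a spine. Real switches compute this in hardware with their own hash functions, so treat this purely as a model:

```python
# Toy ECMP path selection: hash a flow's 5-tuple and pick one of the available
# spines. Real switches compute this in hardware with their own hash functions.
import zlib

def pick_spine(flow: tuple, spines: list) -> str:
    """Map a (src_ip, dst_ip, proto, src_port, dst_port) flow to a spine."""
    key = "|".join(str(field) for field in flow).encode()
    return spines[zlib.crc32(key) % len(spines)]

spines = ["spine1", "spine2", "spine3", "spine4"]
flow = ("10.0.1.15", "10.0.7.22", "tcp", 49152, 443)

print(pick_spine(flow, spines))   # every packet of this flow takes the same spine

# If a spine fails it is simply removed from the ECMP group; surviving flows
# re-hash across the remaining spines (less bandwidth, but no partition).
print(pick_spine(flow, [s for s in spines if s != "spine2"]))
```

Because the hash is computed per flow, packets of a single TCP connection stay on one spine and arrive in order.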

Comparison Table

+---------------------+------------------+-------------------+
| Characteristic      | 3-Tier           | Leaf-Spine        |
+---------------------+------------------+-------------------+
| Latency             | Variable (2-6    | Predictable (2    |
|                     | hops depending   | hops)             |
|                     | on path)         |                   |
+---------------------+------------------+-------------------+
| Oversubscription    | Typical 4:1-8:1  | Can be 1:1        |
|                     | at aggregation   | (non-blocking)    |
+---------------------+------------------+-------------------+
| Scaling             | Vertical (larger | Horizontal (add   |
|                     | core switches)   | more leafs/spines)|
+---------------------+------------------+-------------------+
| Redundancy          | STP blocks links | ECMP uses all     |
|                     | (wasted capacity)| links (active)    |
+---------------------+------------------+-------------------+
| Failure Impact      | Core failure can | Spine failure     |
|                     | partition network| reduces bandwidth |
|                     |                  | but doesn't       |
|                     |                  | partition         |
+---------------------+------------------+-------------------+
| Best For            | Traditional      | Cloud, HPC,       |
|                     | enterprise,      | modern datacenters|
|                     | legacy apps      | with E/W traffic  |
+---------------------+------------------+-------------------+

# North-South vs East-West Traffic

Understanding traffic patterns is critical for network design.

North-South Traffic

              Internet / External Clients
                         ^
                         |   North: traffic leaving the datacenter
                         v   South: traffic entering the datacenter
                   +-----+---------+
                  |   Datacenter  |
                  +---------------+

North-South: Traffic between external clients (internet, branch offices) and datacenter servers. Typical in web applications: user requests come in (south), responses go out (north).

East-West Traffic

          +------------------------------------------------+
          |                   Datacenter                   |
          |                                                |
          |  [DB] <--> [App] <--> [Cache] <--> [Storage]   |
          |   ^                                  ^         |
          |   +----------------------------------+         |
          |          East-West (server-to-server)          |
          +------------------------------------------------+

East-West: Traffic between servers within the datacenter. Microservices, database replication, distributed storage, and inter-service communication generate massive E/W traffic.

Modern trend: E/W traffic has grown to dominate (often 70-80% of total traffic) in cloud and microservice architectures [7]. According to Cisco's Global Cloud Index, approximately 76% of datacenter traffic flows east-west, with only 17% flowing north-south [8]. Traditional 3-tier networks weren't designed for this—hence the shift to leaf-spine.
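
How such a split is measured in practice varies, but a minimal sketch is to classify each flow by whether both endpoints fall inside the datacenter's own address space. The prefixes and flow records below are made-up examples:

```python
# Classify flows as east-west (both endpoints inside the datacenter) or
# north-south (one endpoint external). Prefixes and flows are made-up examples.
import ipaddress

DATACENTER_PREFIXES = [ipaddress.ip_network("10.0.0.0/8"),
                       ipaddress.ip_network("192.168.0.0/16")]

def is_internal(ip: str) -> bool:
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in DATACENTER_PREFIXES)

def classify(src: str, dst: str) -> str:
    return "east-west" if is_internal(src) and is_internal(dst) else "north-south"

# (src, dst, bytes) flow records, e.g. as exported by a flow collector.
flows = [("10.0.1.15", "10.0.7.22", 120_000),   # app -> cache
         ("203.0.113.9", "10.0.1.15", 4_000),   # external client -> app
         ("10.0.7.22", "10.0.9.3", 900_000)]    # cache -> storage

totals = {"east-west": 0, "north-south": 0}
for src, dst, nbytes in flows:
    totals[classify(src, dst)] += nbytes

print(totals)   # in this toy sample, east-west bytes dominate
```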

# Oversubscription Ratios

Oversubscription measures the ratio of downstream bandwidth to upstream bandwidth. If 48 servers with 10Gbps NICs connect to a ToR switch with only 40Gbps of uplinks, the oversubscription is 12:1 (480Gbps downstream vs 40Gbps upstream).
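
The same arithmetic as a tiny function, reproducing the 12:1 example above:

```python
# Reproduce the arithmetic from the example above.
def oversubscription_ratio(servers: int, nic_gbps: float, uplink_gbps: float) -> float:
    """Server-facing bandwidth divided by total uplink bandwidth."""
    return (servers * nic_gbps) / uplink_gbps

ratio = oversubscription_ratio(servers=48, nic_gbps=10, uplink_gbps=40)
print(f"{ratio:.0f}:1")   # 12:1 (480 Gbps downstream vs 40 Gbps of uplinks)
```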

Why oversubscribe? Cost. Building a fully non-blocking network (1:1) is expensive. Most workloads don't need every server transmitting at full speed simultaneously.

When it matters: HPC workloads, distributed storage (e.g., Ceph, GPFS), and database clusters often require low oversubscription (2:1 or 1:1) to avoid bottlenecks. Web servers handling mostly external traffic can tolerate higher oversubscription (4:1 or more) [1].

# Layer 2 vs Layer 3 Switching

Layer 2 (L2): Switching based on MAC addresses. All devices in the same L2 domain (VLAN) can communicate directly. Broadcasts flood the entire domain.

Layer 3 (L3): Routing based on IP addresses. Each subnet is isolated; routers forward traffic between subnets. Broadcasts stay within subnets.

Modern approach: L3 to the leaf. In leaf-spine networks, each rack (leaf) is its own L3 subnet. Routing happens at the leaf, minimizing broadcast domains and improving scalability [3][5]. This is sometimes called "routed access" or "IP fabric."
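
As a sketch of what "L3 to the leaf" looks like in an addressing plan, assume (purely for illustration) a /16 aggregate per fabric carved into one /24 per rack, with the leaf holding the first address as that rack's gateway:

```python
# Carve an aggregate prefix into one subnet per rack ("L3 to the leaf").
# The /16 aggregate and /24-per-rack plan are assumptions for illustration.
import ipaddress

FABRIC_AGGREGATE = ipaddress.ip_network("10.20.0.0/16")

def rack_plan(num_racks: int, rack_prefixlen: int = 24):
    """Yield (rack_id, subnet, leaf_gateway) for each rack."""
    subnets = FABRIC_AGGREGATE.subnets(new_prefix=rack_prefixlen)
    for rack_id, subnet in zip(range(1, num_racks + 1), subnets):
        gateway = next(subnet.hosts())   # first usable address, held by the leaf
        yield rack_id, subnet, gateway

for rack_id, subnet, gw in rack_plan(4):
    print(f"rack {rack_id}: {subnet}  gateway {gw}")
# rack 1: 10.20.0.0/24  gateway 10.20.0.1
# rack 2: 10.20.1.0/24  gateway 10.20.1.1
# ...
```

The leaf then advertises its rack prefix upstream, commonly via eBGP as described in RFC 7938 [5].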

# References

[1] Cisco. "Data Center Infrastructure Design Guide - Server Cluster Designs with Ethernet." Cisco recommends 4:1 oversubscription for distribution-to-core links in traditional three-tier architectures. cisco.com

[2] Amin Vahdat et al. "Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google's Datacenter Network." ACM SIGCOMM 2015. Describes Google's Jupiter fabric architecture with Clos topology achieving 1+ Pbps bisection bandwidth. research.google

[3] Alexey Andreyev. "Introducing data center fabric, the next-generation Facebook data center network." Meta Engineering Blog, November 2014. Details Facebook's all-L3 fabric design with ECMP routing. engineering.fb.com

[4] Microsoft Azure. "Azure network architecture." Microsoft Learn documentation on Azure's datacenter network architecture with tiered network devices providing redundancy and high bandwidth. learn.microsoft.com

[5] P. Lapukhov, A. Premji, J. Mitchell (Editors). "Use of BGP for Routing in Large-Scale Data Centers." RFC 7938, August 2016. Describes EBGP routing for Clos/leaf-spine datacenter topologies. IETF RFC 7938

[6] Ruijie Networks. "Research on the Application of Equal Cost Multi-Path (ECMP) Technology in Data Center Networks." Technical article on ECMP load balancing in Fabric architectures widely used in datacenter networks. ruijienetworks.com

[7] Wikipedia. "East-west traffic." Summarizes industry research indicating that 70-80% of datacenter traffic is server-to-server (east-west) communication. en.wikipedia.org

[8] Cisco. "Trends in Data Center Security: Part 1 – Traffic Trends." Cisco Global Cloud Index reports 76% of datacenter traffic flows east-west, 17% north-south. blogs.cisco.com

# Key Takeaways

  • Leaf-spine has replaced 3-tier in modern datacenters for predictable latency and horizontal scaling
  • East-West traffic dominates in cloud/microservice architectures; design accordingly
  • Oversubscription is a cost trade-off; HPC and storage need low ratios, web apps can tolerate higher
  • ECMP in leaf-spine enables all links to be active, unlike STP blocking in 3-tier
  • L3 to the leaf reduces broadcast domains and simplifies network management