Network topology determines how data flows through a datacenter, how much bandwidth is available, and
what happens when switches fail. The evolution from traditional 3-tier architectures to modern leaf-spine
designs reflects changing traffic patterns and scale requirements.
# Traditional 3-Tier Architecture
The classic datacenter network has three layers: access, aggregation, and core.
                    +-----------+
                    |    Core   |   <  Core switches (high-capacity)
                    |  Switches |      Route traffic between pods
                    +-----+-----+      and to internet
                      /   |   \
                /         |         \
          /               |               \
    +---+---+         +---+---+         +---+---+
    |  Aggr |         |  Aggr |         |  Aggr |   <  Aggregation layer
    |  SW 1 |         |  SW 2 |         |  SW 3 |      Aggregate ToR uplinks
    +---+---+         +---+---+         +---+---+
      /   \             /   \             /   \
     /     \           /     \           /     \
  +---+   +---+     +---+   +---+     +---+   +---+
  |ToR|   |ToR|     |ToR|   |ToR|     |ToR|   |ToR|   <  Access layer (ToR)
  +---+   +---+     +---+   +---+     +---+   +---+      Connect to servers
   |||     |||       |||     |||       |||     |||
     Servers           Servers           Servers
Access Layer: Top of Rack (ToR) switches connect servers within a rack.
Aggregation Layer: Aggregates uplinks from multiple ToR switches, provides the Layer 2/3
boundary, and is often where VLANs terminate.
Core Layer: High-capacity switches that route traffic between aggregation blocks
(pods) and provide connectivity to the internet or WAN.
Limitations of 3-Tier
- Oversubscription: Uplinks from access → aggregation → core create bottlenecks. Typical 4:1 or 8:1 oversubscription means not all servers can use full bandwidth simultaneously [1].
- East-West traffic bottleneck: Traffic between servers in different pods must hairpin through the aggregation and core layers, adding latency and consuming uplink bandwidth (see the hop-count sketch after this list).
- Spanning Tree Protocol (STP): To prevent loops, STP blocks redundant links, wasting capacity. Active links carry all traffic while backup links sit idle.
- Scaling complexity: Adding capacity requires upgrading core switches (expensive, disruptive).
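A tiny sketch of that placement sensitivity (assumptions: one ToR per rack, four racks per pod, the classic access/aggregation/core path; hops are counted switch-to-switch, the same convention as the "leaf → spine → leaf = 2 hops" claim later in this section):

```python
def three_tier_hops(src_rack: int, dst_rack: int, racks_per_pod: int = 4) -> int:
    """Switch-to-switch hops between two servers in a classic 3-tier design.

    Same rack:       stays on the shared ToR            -> 0 hops
    Same pod:        ToR -> Aggregation -> ToR          -> 2 hops
    Different pods:  ToR -> Agg -> Core -> Agg -> ToR   -> 4 hops
    """
    if src_rack == dst_rack:
        return 0
    if src_rack // racks_per_pod == dst_rack // racks_per_pod:
        return 2
    return 4

# Latency depends entirely on where the two servers happen to be placed:
print(three_tier_hops(0, 0))   # 0 -- same rack
print(three_tier_hops(0, 3))   # 2 -- same pod
print(three_tier_hops(0, 12))  # 4 -- hairpins through the core
```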
# Leaf-Spine Architecture
Modern datacenters use a leaf-spine topology (a two-tier folded Clos network), which provides non-blocking bandwidth, predictable latency,
and simple scaling. Major cloud providers including Google [2], Meta (Facebook) [3], and Microsoft Azure [4] have adopted
variants of this architecture for their datacenters.
          Spine Layer (every leaf connects to every spine)
      +-------+     +-------+     +-------+     +-------+
      | Spine |     | Spine |     | Spine |     | Spine |
      |   1   |     |   2   |     |   3   |     |   4   |
      +---+---+     +---+---+     +---+---+     +---+---+
          | \         / | \         / | \         / |
          |   \     /   |   \     /   |   \     /   |
          |     \ /     |     \ /     |     \ /     |
          |     / \     |     / \     |     / \     |
          |   /     \   |   /     \   |   /     \   |
          | /         \ | /         \ | /         \ |
      +---+----+    +---+----+    +---+----+    +---+----+
      | Leaf 1 |    | Leaf 2 |    | Leaf 3 |    | Leaf 4 |
      +---+----+    +---+----+    +---+----+    +---+----+
          |             |             |             |
       Servers       Servers       Servers       Servers
      (1 rack)      (1 rack)      (1 rack)      (1 rack)
Leaf switches: Connect directly to servers (typically one leaf per rack, like ToR).
Each leaf connects to every spine switch.
Spine switches: Never connect to servers or other spines. They only connect to leaf
switches and provide forwarding between leafs.
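These two wiring rules are easy to state as code. Here is a minimal sketch (the fabric size of 4 spines and 8 leaves is an arbitrary assumption) that builds the full mesh and checks the two properties the rest of this section relies on: every inter-rack path is leaf → spine → leaf, and the number of equal-cost paths between any pair of leaves equals the spine count.

```python
from itertools import product

NUM_SPINES, NUM_LEAVES = 4, 8                    # hypothetical fabric size
spines = [f"spine{i}" for i in range(NUM_SPINES)]
leaves = [f"leaf{i}" for i in range(NUM_LEAVES)]

# Wiring rule: every leaf connects to every spine, and to nothing else.
neighbors = {leaf: set(spines) for leaf in leaves}
neighbors.update({spine: set(leaves) for spine in spines})

# For any two different leaves, every spine is a shared next hop, so every
# inter-rack path is leaf -> spine -> leaf (2 hops), and the number of
# equal-cost paths -- the ECMP fan-out -- equals the number of spines.
for a, b in product(leaves, leaves):
    if a == b:
        continue
    shared = neighbors[a] & neighbors[b]
    assert shared == set(spines)
    assert len(shared) == NUM_SPINES

print(f"{NUM_LEAVES} leaves; {NUM_SPINES} equal-cost 2-hop paths between any pair")
```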
Advantages of Leaf-Spine
- Non-blocking: With enough spine capacity (each leaf's uplink bandwidth matching its server-facing bandwidth), every server-to-server path can run at full speed. No oversubscription bottleneck.
- Predictable latency: Every inter-rack path is exactly 2 hops (leaf → spine → leaf); traffic between servers on the same leaf never leaves that switch. Paths that climb an additional tier (for example, a core or super-spine layer between fabrics) add hops in an equally predictable way [2].
- ECMP (Equal-Cost Multi-Path): Multiple equal-cost paths between leafs (one per spine) let the fabric load-balance flows across all of them [5][6]. Every link is active; no STP-blocked standby links (see the flow-hashing sketch after this list).
- Horizontal scaling: Add more leafs for more server capacity. Add more spines for more bandwidth. No need to replace existing hardware [3].
- Failure resilience: Single spine failure reduces bandwidth but doesn't partition network. Traffic automatically redistributes via ECMP.
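Flow-level ECMP typically hashes a packet's 5-tuple and uses the result to pick an uplink, so one flow stays on one path (avoiding reordering) while different flows spread across the spines. A toy sketch of that idea, with an illustrative hash choice rather than any vendor's actual algorithm:

```python
import hashlib

def ecmp_uplink(src_ip: str, dst_ip: str, src_port: int, dst_port: int,
                proto: str, num_spines: int = 4) -> int:
    """Pick a spine uplink by hashing the flow's 5-tuple.

    Every packet of a given flow hashes to the same uplink (no reordering),
    while different flows spread across all spines. Illustrative only; real
    switches use their own hardware hash functions and field selections.
    """
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % num_spines

# Two flows between the same pair of servers may take different spines;
# each individual flow stays pinned to one spine.
print(ecmp_uplink("10.0.1.10", "10.0.2.20", 40001, 443, "tcp"))
print(ecmp_uplink("10.0.1.10", "10.0.2.20", 40002, 443, "tcp"))
```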
Comparison Table
| Characteristic   | 3-Tier                                                               | Leaf-Spine                                                          |
|------------------|----------------------------------------------------------------------|---------------------------------------------------------------------|
| Latency          | Variable (0, 2, or 4 switch-to-switch hops, depending on placement)   | Predictable (2 hops, leaf → spine → leaf, for any inter-rack pair)   |
| Oversubscription | Typically 4:1 to 8:1 at the aggregation layer                          | Can be 1:1 (non-blocking)                                           |
| Scaling          | Vertical (larger core switches)                                        | Horizontal (add more leafs/spines)                                  |
| Redundancy       | STP blocks redundant links (wasted capacity)                           | ECMP keeps all links active                                         |
| Failure impact   | Core failure can partition the network                                 | Spine failure reduces bandwidth but does not partition              |
| Best for         | Traditional enterprise, legacy apps                                    | Cloud, HPC, modern datacenters with heavy E/W traffic               |
# North-South vs East-West Traffic
Understanding traffic patterns is critical for network design.
North-South Traffic
    Internet / External Clients
                ^                     N
                |                     ^
                |                     |
        +-------+-------+
        |   Datacenter  |
        +---------------+
North-South: Traffic between external clients (internet, branch offices) and
datacenter servers. Typical in web applications: user requests come in (south), responses go out (north).
East-West Traffic
+--------------------------------------------------+
|                    Datacenter                    |
|                                                  |
|  [DB] <--> [App] <--> [Cache] <--> [Storage]     |
|    ^                                   ^         |
|    +-- East-West (server-to-server) ---+         |
+--------------------------------------------------+
East-West: Traffic between servers within the datacenter. Microservices, database
replication, distributed storage, and inter-service communication generate massive E/W traffic.
Modern trend: E/W traffic has grown to dominate (often 70-80% of total traffic) in
cloud and microservice architectures [7]. According to Cisco's Global Cloud Index, approximately 76% of datacenter
traffic flows east-west, with only 17% flowing north-south [8]. Traditional 3-tier networks weren't designed for this—hence the
shift to leaf-spine.
# Oversubscription Ratios
Oversubscription measures the ratio of downstream bandwidth to upstream bandwidth. If 48 servers with
10Gbps NICs connect to a ToR switch with only 40Gbps of uplinks, the oversubscription is 12:1
(480Gbps downstream vs 40Gbps upstream).
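The arithmetic generalizes to a one-line ratio; a small helper (hypothetical, just for exploring the trade-off with different port counts and uplink speeds):

```python
def oversubscription(num_servers: int, nic_gbps: float,
                     num_uplinks: int, uplink_gbps: float) -> float:
    """Ratio of server-facing (downstream) to uplink (upstream) bandwidth."""
    return (num_servers * nic_gbps) / (num_uplinks * uplink_gbps)

# The example from the text: 48 x 10Gbps servers behind 40Gbps of uplinks.
print(oversubscription(48, 10, 1, 40))   # 12.0  -> 12:1
# A hypothetical leaf with 6 x 100Gbps uplinks for the same servers.
print(oversubscription(48, 10, 6, 100))  # 0.8   -> better than 1:1
```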
Why oversubscribe? Cost. Building a fully non-blocking network (1:1) is expensive.
Most workloads don't need every server transmitting at full speed simultaneously.
When it matters: HPC workloads, distributed storage (e.g., Ceph, GPFS), and database
clusters often require low oversubscription (2:1 or 1:1) to avoid bottlenecks. Web servers handling
mostly external traffic can tolerate higher oversubscription (4:1 or more) [1].
# Layer 2 vs Layer 3 Switching
Layer 2 (L2): Switching based on MAC addresses. All devices in the same L2 domain
(VLAN) can communicate directly. Broadcasts flood the entire domain.
Layer 3 (L3): Routing based on IP addresses. Each subnet is isolated; routers forward
traffic between subnets. Broadcasts stay within subnets.
Modern approach: L3 to the leaf. In leaf-spine networks, each rack (leaf) is its own
L3 subnet. Routing happens at the leaf, minimizing broadcast domains and improving scalability [3][5]. This is
sometimes called "routed access" or "IP fabric."
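A concrete consequence of L3-to-the-leaf is per-rack addressing: each leaf owns one subnet, and anything outside it is routed, not flooded. A minimal sketch using Python's standard ipaddress module (the 10.0.0.0/16 block and the /24-per-rack split are illustrative assumptions):

```python
import ipaddress

# Assumed for illustration: a 10.0.0.0/16 fabric block, one /24 per rack/leaf.
fabric = ipaddress.ip_network("10.0.0.0/16")
rack_subnets = list(fabric.subnets(new_prefix=24))

# Each leaf owns its /24 and advertises it into the fabric (e.g., via BGP,
# as in RFC 7938); broadcasts never leave the rack because the L2 domain
# ends at the leaf.
for rack_id, subnet in enumerate(rack_subnets[:3]):
    gateway = next(subnet.hosts())        # leaf's gateway address for the rack
    print(f"rack {rack_id}: {subnet}, gateway {gateway}")

# Servers in different racks are in different subnets, so traffic between
# them is routed at the leaf rather than flooded across the fabric.
a = ipaddress.ip_address("10.0.0.10")
b = ipaddress.ip_address("10.0.1.10")
print(a in rack_subnets[0], b in rack_subnets[0])   # True False
```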
# References
[1] Cisco. "Data Center Infrastructure Design Guide - Server Cluster Designs with Ethernet."
Cisco recommends 4:1 oversubscription for distribution-to-core links in traditional three-tier architectures.
cisco.com
[2] Amin Vahdat et al. "Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google's Datacenter Network."
ACM SIGCOMM 2015. Describes Google's Jupiter fabric architecture with Clos topology achieving 1+ Pbps bisection bandwidth.
research.google
[3] Alexey Andreyev. "Introducing data center fabric, the next-generation Facebook data center network."
Meta Engineering Blog, November 2014. Details Facebook's all-L3 fabric design with ECMP routing.
engineering.fb.com
[4] Microsoft Azure. "Azure network architecture."
Microsoft Learn documentation on Azure's datacenter network architecture with tiered network devices providing redundancy and high bandwidth.
learn.microsoft.com
[5] P. Lapukhov, A. Premji, J. Mitchell (Editors). "Use of BGP for Routing in Large-Scale Data Centers."
RFC 7938, August 2016. Describes EBGP routing for Clos/leaf-spine datacenter topologies.
IETF RFC 7938
[6] Ruijie Networks. "Research on the Application of Equal Cost Multi-Path (ECMP) Technology in Data Center Networks."
Technical article on ECMP load balancing in Fabric architectures widely used in datacenter networks.
ruijienetworks.com
[7] Multiple Industry Sources. "East-West Traffic Dominance in Modern Data Centers."
Industry research showing 70-80% of datacenter traffic is server-to-server (east-west) communication.
Wikipedia - East-west traffic
[8] Cisco. "Trends in Data Center Security: Part 1 – Traffic Trends."
Cisco Global Cloud Index reports 76% of datacenter traffic flows east-west, 17% north-south.
blogs.cisco.com
# Key Takeaways
- Leaf-spine has replaced 3-tier in modern datacenters for predictable latency and horizontal scaling
- East-West traffic dominates in cloud/microservice architectures; design accordingly
- Oversubscription is a cost trade-off; HPC and storage need low ratios, web apps can tolerate higher
- ECMP in leaf-spine enables all links to be active, unlike STP blocking in 3-tier
- L3 to the leaf reduces broadcast domains and simplifies network management