HPC Networking
High-Performance Computing (HPC) workloads—scientific simulations, machine learning training, large-scale data analytics—demand ultra-low latency and high bandwidth. Traditional TCP/IP networking introduces overhead (CPU cycles, kernel context switches, memory copies) that becomes a bottleneck. RDMA (Remote Direct Memory Access) bypasses these limitations, enabling microsecond latencies and multi-hundred gigabit throughput.
# Why RDMA Matters
Traditional TCP/IP networking involves multiple data copies and CPU intervention:
Traditional TCP/IP Send:

  App calls send()
        |
        v
  [Copy to kernel buffer]      <-- CPU copy #1
        |
        v
  [TCP/IP stack processing]    <-- CPU overhead
        |
        v
  [NIC driver]
        |
        v
  [Copy to NIC memory]         <-- CPU copy #2
        |
        v
  Network transmission

Multiple context switches, two data copies, CPU busy with network I/O
RDMA Send:

  App initiates RDMA write
        |
        v
  [NIC directly reads app memory]   <-- zero CPU copies
        |
        v
  Network transmission

NIC handles everything, CPU free for computation
RDMA benefits:
- Zero-copy: Data moves directly from app memory to NIC to network, no kernel buffers
- Kernel bypass: NIC handles protocol logic, no kernel involvement after setup
- Low CPU usage: CPU freed from network processing, available for actual work
- Low latency: 1-2 microseconds typical (vs 10-100µs for TCP)
- High bandwidth: 100-400 Gbps achievable with modern RDMA NICs
When RDMA matters: Distributed storage (Ceph with RBD, GPFS, WekaFS), MPI applications (parallel scientific computing), machine learning training (multi-GPU clusters with NCCL), low-latency databases (RAMCloud, FaRM).
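The send path above maps directly onto the libibverbs API: the application registers a buffer once, then posts work requests that the NIC executes with no further kernel involvement. Below is a minimal sketch of posting a one-sided RDMA write; queue-pair connection setup (INIT -> RTR -> RTS) and the out-of-band exchange of the peer's buffer address and rkey are omitted, and `remote_addr`/`remote_rkey` are placeholders for values obtained during that exchange.

```c
/* Minimal libibverbs sketch: register memory and post a one-sided RDMA write.
 * QP setup and the out-of-band exchange of remote_addr/remote_rkey are
 * omitted for brevity. Link with -libverbs. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <stdlib.h>

#define BUF_SIZE 4096

void rdma_write_example(struct ibv_qp *qp, struct ibv_cq *cq,
                        struct ibv_pd *pd,
                        uint64_t remote_addr, uint32_t remote_rkey)
{
    /* Register the application buffer: the NIC can now DMA it directly. */
    void *buf = malloc(BUF_SIZE);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, BUF_SIZE,
                                   IBV_ACCESS_LOCAL_WRITE);

    /* Describe the local data (scatter/gather element). */
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = BUF_SIZE,
        .lkey   = mr->lkey,
    };

    /* One-sided write: the remote CPU is never involved. */
    struct ibv_send_wr wr = {
        .wr_id      = 1,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_WRITE,
        .send_flags = IBV_SEND_SIGNALED,
        .wr.rdma    = { .remote_addr = remote_addr, .rkey = remote_rkey },
    };
    struct ibv_send_wr *bad_wr;
    ibv_post_send(qp, &wr, &bad_wr);   /* hand the work request to the NIC */

    /* Completion is reported via the CQ; busy-polling avoids interrupts. */
    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0)
        ;  /* spin until the NIC reports the write finished */
}
```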
# RDMA Technologies Comparison
Three major RDMA technologies exist, each with different transport layers and use cases.
| Characteristic     | InfiniBand                     | RoCE                               | iWARP                        |
|--------------------|--------------------------------|------------------------------------|------------------------------|
| Transport          | Native IB protocol             | Ethernet (v1: Layer 2, v2: UDP/IP) | Ethernet (TCP/IP)            |
| Latency            | 0.5-1 µs (best)                | 1-2 µs                             | 2-5 µs (higher)              |
| Bandwidth          | Up to 400 Gbps (NDR)           | Up to 400 Gbps                     | Up to 100 Gbps               |
| Routing            | IB switches (separate network) | Ethernet switches (standard)       | Ethernet switches (standard) |
| Congestion control | Hardware (lossless)            | PFC (lossless Ethernet)            | TCP (software-based)         |
| Cost               | High (specialized)             | Medium (commodity)                 | Medium (commodity)           |
| Ecosystem          | HPC-focused (Mellanox, Intel)  | Growing (cloud, storage)           | Declining (niche)            |
# InfiniBand
InfiniBand (IB) is a purpose-built high-performance network fabric. It uses dedicated IB switches, cables, and NICs (called HCAs - Host Channel Adapters).
Architecture: InfiniBand networks are typically separate from Ethernet. Compute nodes have two NICs: Ethernet for management/internet, InfiniBand for storage and MPI communication.
Compute Node:
  [Eth NIC] -----> Management network (SSH, monitoring)
  [IB HCA]  -----> High-speed network (MPI, storage I/O)
                        |
                   IB Switch (dedicated fabric)
                        |
                 [Storage nodes with IB HCAs]
Speeds: EDR (100 Gbps), HDR (200 Gbps), NDR (400 Gbps) per port.
Advantages: Lowest latency, highest bandwidth, lossless transport (no packet drops), proven in HPC for decades.
Disadvantages: Expensive (specialized switches/NICs), separate network to manage, limited to HPC/storage use cases (can't route to internet).
Common in: Top500 supercomputers, HPC clusters, high-end storage arrays (NetApp, Pure Storage backend).
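To confirm which of these rates a given HCA port actually negotiated, the port attributes can be read through libibverbs. The sketch below is a minimal example, assuming at least one RDMA device is present; it prints the raw width/speed codes rather than decoding them into EDR/HDR/NDR names (see ibv_query_port(3) for the encoding).

```c
/* Minimal sketch: open the first RDMA device and query port 1's link state.
 * active_width and active_speed are encoded values, not Gbps figures. */
#include <infiniband/verbs.h>
#include <stdio.h>

int main(void)
{
    struct ibv_device **devs = ibv_get_device_list(NULL);
    if (!devs || !devs[0]) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_port_attr port;
    ibv_query_port(ctx, 1, &port);   /* port numbers start at 1 */

    printf("device=%s state=%d active_width=%d active_speed=%d\n",
           ibv_get_device_name(devs[0]),
           port.state, port.active_width, port.active_speed);

    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```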
# RoCE: RDMA over Converged Ethernet
RoCE brings RDMA to Ethernet networks, allowing RDMA over standard Ethernet switches (with caveats). Two versions exist:
RoCE v1: RDMA directly over Ethernet (Layer 2). Non-routable, limited to single subnet. Rarely used today.
RoCE v2: RDMA over UDP/IP (Layer 3). Routable across subnets, dominant version.
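Routability in RoCE v2 comes from the adapter's IP addresses appearing in its GID table (IPv4 addresses show up as IPv4-mapped GIDs of the form ::ffff:a.b.c.d); the sender later picks one of these entries, by index, when addressing a queue pair. A minimal sketch of walking the GID table is shown below; querying port 1 of the first device is an assumption for illustration.

```c
/* Minimal sketch: dump the GID table of port 1 on the first RDMA device.
 * On a RoCE v2 adapter, entries derived from IPv4 addresses appear as
 * IPv4-mapped GIDs (::ffff:a.b.c.d); the chosen index is later supplied
 * as the source GID index when connecting a queue pair. */
#include <infiniband/verbs.h>
#include <stdio.h>

int main(void)
{
    struct ibv_device **devs = ibv_get_device_list(NULL);
    if (!devs || !devs[0])
        return 1;

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_port_attr port;
    ibv_query_port(ctx, 1, &port);

    /* gid_tbl_len reports how many GID entries the port exposes. */
    for (int i = 0; i < port.gid_tbl_len; i++) {
        union ibv_gid gid;
        if (ibv_query_gid(ctx, 1, i, &gid))
            continue;
        printf("gid[%d] = ", i);
        for (int b = 0; b < 16; b++)
            printf("%02x%s", gid.raw[b], (b % 2 && b < 15) ? ":" : "");
        printf("\n");
    }

    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```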
Lossless Ethernet Requirement
RDMA assumes lossless transport (no packet drops). Standard Ethernet drops packets when congested. RoCE requires Priority Flow Control (PFC) to pause senders and prevent drops.
Standard Ethernet (lossy):
Queue full --> Drop packet --> Retransmit (slow)
Lossless Ethernet with PFC:
Queue full --> Send PAUSE frame --> Sender stops temporarily
--> No packet loss
PFC caveat: Misconfigurations can cause head-of-line blocking—one slow flow pauses entire link. Requires careful switch configuration and monitoring.
Advantages: Uses commodity Ethernet infrastructure, converged network (one fabric for RDMA and TCP/IP), lower cost than InfiniBand.
Disadvantages: Slightly higher latency than IB, requires PFC configuration, more complex to debug than IB.
Common in: Cloud storage backends (AWS EBS, Azure Premium Storage), distributed databases, Kubernetes storage (Rook/Ceph).
# iWARP: RDMA over TCP
iWARP runs RDMA over TCP/IP, making it routable over any IP network without special switch configs.
Advantages: Works on any IP network, no need for lossless Ethernet, easier to deploy over WAN.
Disadvantages: Highest latency of RDMA options (TCP overhead), lower performance, declining ecosystem (fewer vendors).
Use case: RDMA over WAN links, environments where lossless Ethernet isn't feasible. Less common today; RoCE v2 has largely displaced it.
# Network Fabric Design for HPC
HPC clusters use specialized network topologies to minimize latency and maximize bandwidth for parallel workloads.
Fat-Tree Topology
              Core Switches (top tier)
             /           |           \
        Aggregation Switches (middle tier)
        /   |   \                 /   |   \
          Leaf Switches (bottom tier - ToR)
        |   |   |   |           |   |   |   |
        Compute Nodes           Compute Nodes
Equal bandwidth at all levels ("fat" pipes going up)
Non-blocking: every node-to-node path has full bandwidth
Used for: MPI applications requiring all-to-all communication, machine learning training with data parallelism.
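A useful back-of-the-envelope check when sizing a non-blocking fat-tree is the standard k-ary construction from k-port switches (Al-Fares et al.): k pods, each with k/2 leaf and k/2 aggregation switches, (k/2)^2 core switches, and k^3/4 hosts in total. The sketch below simply evaluates those formulas for a few common switch radices.

```c
/* Back-of-the-envelope sizing for a non-blocking k-ary fat-tree built
 * from k-port switches (k must be even):
 *   leaf switches = k * k/2, agg switches = k * k/2,
 *   core switches = (k/2)^2, hosts = k^3 / 4                       */
#include <stdio.h>

int main(void)
{
    int radix[] = {16, 32, 64};   /* common switch port counts */

    for (int i = 0; i < 3; i++) {
        int k = radix[i];
        int hosts = k * k * k / 4;
        int core  = (k / 2) * (k / 2);
        int leaf  = k * (k / 2);
        int agg   = k * (k / 2);
        printf("k=%2d: hosts=%6d  leaf=%4d  agg=%4d  core=%4d\n",
               k, hosts, leaf, agg, core);
    }
    return 0;
}
```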
Dragonfly Topology
Groups of switches fully connected internally, sparse connections between groups. Optimizes for cost while maintaining low diameter (few hops between any two nodes).
Used for: Large-scale HPC (thousands to hundreds of thousands of nodes), balances cost and performance.
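The same kind of quick sizing applies to a dragonfly (Kim et al., 2008): with a routers per group, each having p host ports and h global links, at most a*h + 1 groups can be fully connected, giving N = a * p * (a*h + 1) hosts. The sketch below evaluates that bound; the "balanced" configuration a = 2p = 2h is assumed purely for illustration.

```c
/* Dragonfly scaling bound: with a routers per group, p hosts and h global
 * links per router, at most a*h + 1 groups fit, so N = a*p*(a*h + 1).
 * The balanced recommendation a = 2p = 2h is an assumption here. */
#include <stdio.h>

int main(void)
{
    int h_values[] = {4, 8, 16};  /* global links per router (assumed) */

    for (int i = 0; i < 3; i++) {
        int h = h_values[i];
        int p = h;          /* balanced: p = h  */
        int a = 2 * h;      /* balanced: a = 2h */
        long groups = (long)a * h + 1;
        long hosts  = (long)a * p * groups;
        printf("h=%2d  a=%2d  p=%2d  groups=%4ld  max hosts=%8ld\n",
               h, a, p, groups, hosts);
    }
    return 0;
}
```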
# Performance Considerations
Latency-sensitive workloads: Choose InfiniBand or RoCE v2. Avoid extra network hops (use leaf-spine, not 3-tier). Tune OS for low latency (disable CPU frequency scaling, pin threads).
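Thread pinning in particular is cheap to do and keeps the scheduler from migrating a latency-critical polling thread between cores. A minimal Linux sketch using pthread_setaffinity_np follows; core 2 is an arbitrary example.

```c
/* Minimal sketch (Linux): pin the calling thread to one core so a
 * latency-critical RDMA polling loop is not migrated by the scheduler.
 * Core 2 is an arbitrary example; compile with -pthread. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static int pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    /* returns 0 on success, an error number on failure */
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main(void)
{
    if (pin_to_core(2) != 0) {
        fprintf(stderr, "failed to pin thread to core 2\n");
        return 1;
    }
    printf("pinned to core 2; run the polling loop here\n");
    return 0;
}
```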
Bandwidth-intensive workloads: Use RDMA with multi-path (ECMP or bonded interfaces). Ensure storage network is separate from management to avoid contention. Monitor for PFC PAUSE storms (in RoCE).
Mixed workloads: Separate compute and storage networks. Use InfiniBand or RoCE for storage backend, standard Ethernet for client access. Prevents storage I/O from impacting application latency.
# Key Takeaways
- RDMA bypasses kernel, achieves 1-2µs latency vs 10-100µs for TCP/IP
- InfiniBand offers lowest latency but requires dedicated fabric; RoCE v2 uses Ethernet with PFC
- RoCE requires lossless Ethernet (PFC)—misconfig causes head-of-line blocking
- iWARP runs over TCP/IP (higher latency) but works on any network; declining adoption
- HPC clusters use fat-tree or dragonfly topologies for non-blocking bandwidth
- Separate storage and compute networks to prevent I/O contention
- RDMA critical for distributed storage, ML training, and MPI-based parallel computing