
Distributed and Network Filesystems

Network filesystems allow multiple machines to share access to data, enabling collaboration and resource pooling. Understanding their trade-offs is critical: choosing the wrong filesystem for your workload can mean the difference between millisecond latencies and minute-long hangs, between graceful degradation and catastrophic data loss. This section introduces a framework for evaluating network filesystems across three key dimensions: semantics (what guarantees does the FS provide?), architecture (how does it scale and perform?), and failure modes (what breaks and how?). By understanding these dimensions, you'll know what to pay attention to when working with any network filesystem—even ones not covered here.

# Architecture Overview

Before diving into comparisons, let's visualize how these systems are structured. Architecture fundamentally determines scalability, failure modes, and performance characteristics.

NFS - Single Server Architecture

┌──────────┐  ┌──────────┐  ┌──────────┐
│ Client 1 │  │ Client 2 │  │ Client 3 │
└─────┬────┘  └─────┬────┘  └─────┬────┘
      │             │             │
      └─────────────┼─────────────┘
                    │
            ┌───────v────────┐
            │   NFS Server   │  < Single point of failure;
            │  (metadata +   │    all ops bottleneck here
            │     data)      │
            └───────┬────────┘
                    │
            ┌───────v────────┐
            │    Local FS    │  (ext4, XFS, ZFS, etc.)
            │   + Storage    │
            └────────────────┘

Performance: All operations funnel through a single server, limiting throughput to one machine's capabilities. Metadata operations (stat, readdir) and data I/O compete for the same resources, creating contention under mixed workloads.

Resiliency: Server failure makes the entire filesystem unavailable—no failover, no redundancy. NFS remains common where operational simplicity outweighs high-availability requirements.
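
One way to make this contention tangible is to time a burst of stat() calls against the mount from a single client. The sketch below is illustrative only: /mnt/nfs/testdir is a placeholder path, and client-side attribute caching can flatter the numbers well beyond what the server sustains once many clients hit it at once.

```python
# Rough metadata micro-benchmark: time a burst of stat() calls on an NFS mount.
# Illustrative sketch only -- the path is a placeholder, and client-side attribute
# caching (acregmin/acregmax mount options) can make a single client look far
# faster than the server's aggregate metadata capacity.
import os
import time

MOUNT_DIR = "/mnt/nfs/testdir"   # hypothetical directory on the NFS mount

def stat_burst(directory: str) -> float:
    """Return achieved stat() operations per second over one pass of a directory."""
    names = os.listdir(directory)                 # one readdir pass
    start = time.monotonic()
    for name in names:
        os.stat(os.path.join(directory, name))    # one GETATTR per entry (unless cached)
    elapsed = time.monotonic() - start
    return len(names) / elapsed if elapsed > 0 else float("inf")

if __name__ == "__main__":
    print(f"~{stat_burst(MOUNT_DIR):.0f} stat ops/s from this client")
```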

Lustre - Separated Metadata and Data

┌──────────┐  ┌──────────┐  ┌──────────┐
│ Client 1 │  │ Client 2 │  │ Client 3 │
└─────┬────┘  └─────┬────┘  └─────┬────┘
      │             │             │
   metadata     metadata        metadata
      │             │             │
      ├─────────────┼─────────────┤
      │             │             │    Parallel data paths
┌─────v─────┐       │             │    enable high throughput
│    MDS    │       │             │
│(metadata) │     data          data
└───────────┘       │             │
            ┌───────v──────┬──────v───────┬──────────┐
            │    OST 1     │    OST 2     │   OST 3  │  < Object Storage
            │    (data)    │    (data)    │  (data)  │    Targets (OSTs)
            └──────┬───────┴──────┬───────┴─────┬────┘
                   │              │             │
            ┌──────v──────┐┌──────v──────┐┌─────v─────┐
            │ XFS or ZFS  ││ XFS or ZFS  ││ XFS or ZFS│  < Backend FS
            └─────────────┘└─────────────┘└───────────┘

Performance: Separating metadata and data enables massive parallel throughput—clients can hit multiple OSTs simultaneously for striped files, achieving aggregate bandwidth that scales with OST count. Metadata server (MDS) handles namespace operations independently.

Resiliency: MDS remains a single point of failure (unless using HA config), but OST failures only affect files stored on that target. Ideal for HPC workloads with large sequential I/O patterns.
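
The throughput scaling comes from striping: a file is divided into stripe-size chunks laid out round-robin across the OSTs assigned to it, so large sequential reads and writes fan out over many servers. The sketch below is a toy illustration of that mapping; the stripe size and count are arbitrary example values, and real layouts are set per file or directory and inspected with tools such as lfs getstripe.

```python
# Toy illustration of round-robin striping: which object (OST) holds a given byte?
# The stripe size and count here are arbitrary example values, not Lustre defaults;
# a real file's layout is whatever `lfs getstripe` reports for it.

STRIPE_SIZE = 1 << 20    # 1 MiB per stripe (illustrative)
STRIPE_COUNT = 4         # file striped across 4 OSTs (illustrative)

def ost_for_offset(offset: int) -> tuple[int, int]:
    """Map a file offset to (index within the file's OST set, offset inside that object)."""
    stripe_number = offset // STRIPE_SIZE        # which stripe of the file this byte is in
    ost_index = stripe_number % STRIPE_COUNT     # stripes are dealt out round-robin
    object_offset = (stripe_number // STRIPE_COUNT) * STRIPE_SIZE + (offset % STRIPE_SIZE)
    return ost_index, object_offset

for offset in (0, 512 << 10, 3 << 20, 9 << 20):
    idx, obj_off = ost_for_offset(offset)
    print(f"file offset {offset:>10,} -> OST index {idx}, object offset {obj_off:,}")
```

Because consecutive stripes land on different OSTs, one large sequential read can keep several servers busy at once, which is where the aggregate bandwidth figures above come from.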

GPFS - Shared Disk with Distributed Locking

┌──────────┐  ┌──────────┐  ┌──────────┐
│ Client 1 │  │ Client 2 │  │ Client 3 │  < All nodes can be
└─────┬────┘  └─────┬────┘  └─────┬────┘     clients + NSD servers
      │             │             │
      └─────────────┼─────────────┘
           Token-based locking
                    │
      ┌─────────────┼─────────────┐
      │             │             │
┌─────v─────┐ ┌─────v─────┐ ┌─────v─────┐
│NSD Server1│ │NSD Server2│ │NSD Server3│  < Network Shared Disk
└─────┬─────┘ └─────┬─────┘ └─────┬─────┘     servers manage I/O
      │             │             │
      └─────────────┼─────────────┘
                    │
            ┌───────v────────┐
            │ Shared Storage │  (SAN, NVMe-oF, etc.)
            └────────────────┘

Performance: Token-based locking enables cache coherency without constant server round-trips—nodes can cache data as long as they hold the appropriate token. Direct storage access eliminates data forwarding overhead.

Resiliency: No single point of failure—any NSD server can fail and others take over. Token manager is distributed across the cluster. Shared storage itself needs redundancy (RAID, replication). Trade-off: lock contention can hurt metadata-intensive workloads.
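
The token mechanics can be pictured with a toy model: a node may keep data cached while it holds a compatible token, and a conflicting request forces the current holders to flush and surrender theirs. The sketch below is purely conceptual; it models the behavior described above and is not GPFS's actual protocol or any real API.

```python
# Conceptual model of token-based cache coherency (not the GPFS implementation).
# A node may cache a range while it holds a token; granting a conflicting write
# token first revokes the other holders', forcing them to flush or invalidate.

class TokenManager:
    def __init__(self):
        self.readers = set()   # nodes holding a read token
        self.writer = None     # node holding the write token, if any

    def acquire_read(self, node):
        if self.writer and self.writer != node:
            self._revoke(self.writer)        # writer must flush dirty data first
        self.readers.add(node)

    def acquire_write(self, node):
        holders = list(self.readers) + ([self.writer] if self.writer else [])
        for other in holders:
            if other != node:
                self._revoke(other)          # readers drop their cached copies
        self.readers.discard(node)
        self.writer = node

    def _revoke(self, node):
        print(f"revoke token from {node}: flush dirty data / drop cached copy")
        self.readers.discard(node)
        if self.writer == node:
            self.writer = None

mgr = TokenManager()
mgr.acquire_read("node1")    # node1 may now cache the range
mgr.acquire_read("node2")    # shared readers coexist with no extra traffic
mgr.acquire_write("node3")   # conflict: node1 and node2 are revoked first
```

Shared reads stay cheap, while write sharing turns into revocation traffic; that is the lock-contention trade-off noted above.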

WekaFS - Distributed with Coherent Cache

┌──────────┐  ┌──────────┐  ┌──────────┐
│ Client 1 │  │ Client 2 │  │ Client 3 │
│ + Cache  │  │ + Cache  │  │ + Cache  │  < Distributed coherent
└─────┬────┘  └─────┬────┘  └─────┬────┘     cache on every node
      │             │             │
      └─────────────┼─────────────┘
           Cache coherency protocol
                    │
      ┌─────────────┼─────────────┐
      │             │             │
┌─────v─────┐ ┌─────v─────┐ ┌─────v─────┐
│  Backend  │ │  Backend  │ │  Backend  │  < Distributed backend
│  Node 1   │ │  Node 2   │ │  Node 3   │    (NVMe drives)
└───────────┘ └───────────┘ └───────────┘

Performance: Every client has local NVMe/RAM cache, minimizing network round-trips for hot data. Cache coherency protocol ensures consistency without sacrificing speed—reads hit local cache, writes invalidate remote copies. Distributed backend provides parallel access paths.

Resiliency: Fully distributed architecture with no single point of failure. Data is erasure-coded or replicated across backend nodes. Node failures trigger automatic rebalancing. Optimized for modern cloud-native and AI/ML workloads requiring both high throughput and low latency.

VAST - Disaggregated Shared Everything (DASE)

┌──────────┐  ┌──────────┐  ┌──────────┐
│ Client 1 │  │ Client 2 │  │ Client 3 │  < Standard NFS clients
└─────┬────┘  └─────┬────┘  └─────┬────┘     (multipath NFS/RDMA)
      │             │             │
      └─────────────┼─────────────┘
           Enhanced NFS protocol
                    │
      ┌─────────────┼─────────────┐
      │             │             │
┌─────v─────┐ ┌─────v─────┐ ┌─────v─────┐
│  CNode 1  │ │  CNode 2  │ │  CNode 3  │  < Stateless compute
│  (CBox)   │ │  (CBox)   │ │  (CBox)   │    + write cache
└─────┬─────┘ └─────┬─────┘ └─────┬─────┘
      │             │             │
      NVMe-oF fabric (any CNode to any DBox)
      │             │             │
      └─────────────┼─────────────┘
                    │
      ┌─────────────┼─────────────┐
      │             │             │
┌─────v─────┐ ┌─────v─────┐ ┌─────v─────┐
│  DBox 1   │ │  DBox 2   │ │  DBox 3   │  < Storage enclosures
│  SCM+QLC  │ │  SCM+QLC  │ │  SCM+QLC  │    (all-flash)
└───────────┘ └───────────┘ └───────────┘

Performance: VAST's "Disaggregated Shared Everything" (DASE) architecture separates stateless compute nodes (CBoxes) from storage enclosures (DBoxes). Any CNode can access any DBox over NVMe-oF fabric, enabling global data reduction and eliminating isolated data islands. Uses enhanced NFS with multipath and RDMA for client access—simpler than parallel FS clients but with competitive throughput.

Resiliency: No single point of failure—CNodes are stateless and can fail without data loss. DBoxes use Storage Class Memory (SCM) for metadata durability and QLC flash for capacity. Data is erasure-coded across DBoxes. Trade-off: relies on NFS semantics (close-to-open by default), and write cache in CNodes can become a bottleneck under sustained high-throughput writes.

# Filesystem Semantics Comparison

POSIX defines a set of filesystem behaviors that applications rely on—atomic operations, consistency guarantees, and locking primitives. Network filesystems make different trade-offs between full POSIX compliance and performance. Understanding these semantics tells you what your application can safely assume.

┌────────────┬──────────────────┬───────────────────────────┬────────────────────┐
│ Filesystem  POSIX Compliance  Consistency Model          Locking Support    │
├────────────┼──────────────────┼───────────────────────────┼────────────────────┤
│ NFS         Close [1]         Close-to-open [2]          NLM (advisory) [3] │
│                               Relaxed during open        Lockd daemon       │
├────────────┼──────────────────┼───────────────────────────┼────────────────────┤
│ Lustre      Full POSIX        Strict coherency [4]       LDLM [5]           │
│                               Byte-range via locks       Distributed locks  │
├────────────┼──────────────────┼───────────────────────────┼────────────────────┤
│ GPFS        Full POSIX        Strict coherency           Token-based [6]    │
│                               All operations serialized  Byte-range locks   │
├────────────┼──────────────────┼───────────────────────────┼────────────────────┤
│ WekaFS      Full POSIX        Strong consistency [7]     POSIX locks        │
│                               Coherent distributed cache Distributed mgmt   │
├────────────┼──────────────────┼───────────────────────────┼────────────────────┤
│ VAST        NFS-based [8]     Close-to-open default      NFS locks (NLM/v4) │
│                               Strong consistency option  Advisory locking   │
└────────────┴──────────────────┴───────────────────────────┴────────────────────┘

[1] NFSv3 lacks some atomicity guarantees (e.g., O_EXCL create can race); NFSv4 improves this
[2] NFS close-to-open: changes by one client visible to others only after close()+open()
    See deep-dive section for details on edge cases
[3] NLM (Network Lock Manager) provides advisory locks; server crash can lose lock state
[4] Lustre LDLM (Distributed Lock Manager) provides strong cache coherency via lock callbacks
[5] Lustre locks can be revoked under contention, forcing clients to flush dirty data
[6] GPFS tokens grant read/write/lock permissions; distributed algorithm ensures consistency
[7] WekaFS maintains cache coherency across all clients via distributed consensus protocol
[8] VAST uses enhanced NFS (multipath, RDMA) but inherits NFS semantics; can configure stricter modes
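
The "advisory" in footnote [3] deserves emphasis: fcntl()-style byte-range locks only protect processes that actually take them, and nothing stops an unlocked writer. What differs between the filesystems above is who enforces the lock across clients (NLM or NFSv4 locking, Lustre's LDLM, GPFS tokens). A minimal sketch of the standard pattern, using a placeholder path for a shared counter file:

```python
# Advisory byte-range locking with fcntl: the POSIX pattern that the network
# filesystem's lock manager must enforce across clients. "Advisory" means only
# cooperating processes that take the lock are protected; an unlocked writer is
# never blocked. The path is a placeholder.
import fcntl
import os

PATH = "/mnt/shared/counter.dat"   # hypothetical file on a network mount

fd = os.open(PATH, os.O_RDWR | os.O_CREAT, 0o644)
try:
    # Exclusive lock on the first 8 bytes; blocks until no other client holds it.
    fcntl.lockf(fd, fcntl.LOCK_EX, 8, 0, os.SEEK_SET)

    raw = os.pread(fd, 8, 0)
    value = int(raw.decode() or "0")
    os.pwrite(fd, f"{value + 1:08d}".encode(), 0)   # read-modify-write is now safe
    os.fsync(fd)                                    # push the update to the server
finally:
    fcntl.lockf(fd, fcntl.LOCK_UN, 8, 0, os.SEEK_SET)
    os.close(fd)
```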

# Key Implications

If your application requires strict POSIX semantics (e.g., databases with byte-range locking, build systems expecting consistent metadata), NFS's close-to-open model may surprise you. Lustre, GPFS, and WekaFS provide stricter guarantees but at the cost of lock traffic and potential contention. For read-mostly workloads or applications that don't share files between processes, NFS's relaxed model often suffices.
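
To make close-to-open concrete: the writer's close() is the point at which its changes become eligible for others to see, and the reader's open() is the point at which the client revalidates its cache. A reader that keeps a descriptor open across another client's update may keep returning stale data indefinitely. A minimal sketch of the defensive pattern, with a placeholder path:

```python
# Close-to-open consistency in practice (NFS, and VAST's NFS access by default):
# a client is only guaranteed to see another client's writes if the writer
# close()d the file and the reader open()s it afterwards. The path is a placeholder.
import os

STATUS_FILE = "/mnt/nfs/job.status"   # hypothetical file updated by another client

def publish(path: str, payload: bytes) -> None:
    """Writer side: flush and close so peers become eligible to see the update."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, payload)
        os.fsync(fd)    # data reaches the server, not just the local page cache
    finally:
        os.close(fd)    # close() is the "publish" point under close-to-open

def read_fresh(path: str) -> bytes:
    """Reader side: re-open per read so the client revalidates against the server."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.read(fd, 4096)
    finally:
        os.close(fd)
```

The filesystems listed with strict coherency make the re-open unnecessary; that guarantee is what you pay for in lock and callback traffic.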

# Architecture, Scaling & Performance

Architecture determines how systems scale and where bottlenecks emerge. Single-server designs (NFS) are simple but limited by one machine's resources. Distributed designs (Lustre, GPFS, WekaFS, VAST) scale out but introduce complexity. Understanding these trade-offs helps you predict performance for your workload.

┌──────────┬──────────────────┬─────────────────────────────────────┬────────────────────────┐
│ FS        Architecture      Performance Profile                  Scaling Limits         │
├──────────┼──────────────────┼─────────────────────────────────────┼────────────────────────┤
│ NFS       Single server     Good: General-purpose workloads      Single server CPU/IO   │
│           All ops to 1 box  Moderate: Metadata (5-10K ops/s)[8]  Network bandwidth      │
│                             Poor: Highly concurrent small files  Typical: 100s clients  │
├──────────┼──────────────────┼─────────────────────────────────────┼────────────────────────┤
│ Lustre    MDS + OSTs        Excellent: Large files (100+ GB/s)   MDS metadata ops [9]   │
│           Separated meta/             Parallel I/O across OSTs   Single MDS: ~20K ops/s │
│           data              Poor: Small files (metadata bound)   Multi-MDT helps but    │
│                                   Millions of files in one dir   adds complexity        │
│                                                                  Scales to 1000s clients│
├──────────┼──────────────────┼─────────────────────────────────────┼────────────────────────┤
│ GPFS      Shared disk       Good: Mixed workloads                Network fabric         │
│           Distributed       Excellent: Metadata (50K+ ops/s)[10] Token contention under │
│           token mgmt                   Random I/O                heavy write sharing    │
│                             Good: Small and large files          Scales to 1000s clients│
├──────────┼──────────────────┼─────────────────────────────────────┼────────────────────────┤
│ WekaFS    Fully distributed Excellent: Small files, metadata     Designed for massive   │
│           Coherent cache               (millions of ops/s) [11]  scale (1000s clients)  │
│           everywhere                   Low latency (sub-ms)      Requires high-speed    │
│                             GPU-direct, RDMA-capable             network (100GbE+)      │
├──────────┼──────────────────┼─────────────────────────────────────┼────────────────────────┤
│ VAST      DASE: separated   Excellent: AI/ML workloads [12]      Network fabric size    │
│           compute (CBox)               Mixed read/write patterns Write cache fill under │
│           from storage      Good: Global dedup/compression       sustained high I/O     │
│           (DBox) + NVMe-oF  Simpler ops than Lustre/GPFS         Scales to 1000s clients│
└──────────┴──────────────────┴─────────────────────────────────────┴────────────────────────┘

[8]  NetApp NFS performance guide TR-4067; typical enterprise server
[9]  Lustre Operations Manual 2.15; single MDT performance limits
[10] IBM GPFS Performance Tuning Guide; large enterprise deployment
[11] Weka technical documentation; AI/ML workload benchmarks
[12] VAST Data technical documentation; designed for AI-era workloads with high concurrency

# Workload Matching

Architecture dictates performance sweet spots. Match your workload to filesystem strengths:

  • Deep learning training: Millions of small image files > WekaFS's distributed cache and metadata performance excel, while NFS would crawl under metadata load.
  • Genomics pipelines: Terabyte-scale files, sequential I/O > Lustre's parallel data paths across multiple OSTs shine, delivering 100+ GB/s aggregate throughput.
  • Mixed enterprise workloads: VMs, databases, home directories > GPFS or NFS offer good balance for varied access patterns without extreme specialization.
  • HPC checkpointing: Periodic massive writes from thousands of ranks > Lustre (if you can tolerate load shedding) or GPFS (better reliability but lower peak throughput).

# Failure Modes & Overload Behavior

Systems fail. Clients misbehave. Loads spike. How filesystems handle these scenarios determines whether you experience graceful degradation or catastrophic failure. Understanding failure modes is essential for production deployments—it's not if things break, but when and how.

┌──────────┬─────────────────────┬─────────────────────┬────────────────────────┐
│ FS        Read Overload        Write Overload       Server Failure         │
├──────────┼─────────────────────┼─────────────────────┼────────────────────────┤
│ NFS       Slow responses       Clients hang [13]    Hard mount: clients    │
│           Queue on server      RPC timeout          hang until recovery    │
│           Graceful degradation (default: 60s,       (can be hours)         │
│                                 retries = forever)  Soft mount: EIO errors │
├──────────┼─────────────────────┼─────────────────────┼────────────────────────┤
│ Lustre    Client throttling    Load shedding! [14]  MDS fail: all metadata │
│           Read-ahead reduced   OST silently drops   stops (open/stat/mkdir)│
│           Performance degrades writes - app sees    OST fail: only files on│
│                                success but data     that OST affected      │
│                                lost. Check logs!    No transparent failover│
├──────────┼─────────────────────┼─────────────────────┼────────────────────────┤
│ GPFS      Token contention     Write-behind fills   Fast recovery (5-30s)  │
│           Increased latency    Graceful throttling  Distributed recovery   │
│           Caching helps absorb Blocks when full     protocol [15]          │
│                                No data loss         Quorum-based HA        │
├──────────┼─────────────────────┼─────────────────────┼────────────────────────┤
│ WekaFS    Distributed cache    Intelligent          Fast failover (~10s)   │
│           absorbs read bursts  throttling           Distributed replicas   │
│           Maintains low        Maintains latency    No single point of     │
│           latency              targets              failure                │
├──────────┼─────────────────────┼─────────────────────┼────────────────────────┤
│ VAST      All-flash backend    Write cache in       CNode fail: stateless, │
│           handles read bursts  CNodes can fill      clients reconnect [16] │
│           NVMe-oF low latency  Performance drops    DBox fail: erasure-    │
│                                to backend speed     coded, auto-rebuild    │
└──────────┴─────────────────────┴─────────────────────┴────────────────────────┘

[13] NFS hang behavior is infamous in HPC ("the NFS hang of death")
[14] Lustre load shedding: OST silently discards writes under overload; write() succeeds but
     data never reaches disk. Only detectable via OST logs (search for "ENOSPC" or drop messages).
     Catastrophic for applications assuming POSIX write semantics.
[15] GPFS recovery protocol: "Scalable cluster-wide failure recovery" (FAST '08)
[16] VAST CNodes are stateless; failure means clients reconnect to another CNode with no data loss

# Misbehaving Client Scenarios

What happens when a client crashes mid-write while holding locks?

  • NFS: Lock recovery is lease- or notification-based: NFSv4 releases a crashed client's locks once its lease expires (typically 90s), while NFSv3/NLM relies on rpc.statd crash notifications and can leave stale locks until cleared. Other clients may see partial writes if data wasn't flushed before the crash.
  • Lustre: LDLM has lock recovery protocol. If client crashes with dirty data, that data is lost (client-side caching). Lock timeout ~20s, other clients can proceed. Corruption possible if application didn't fsync().
  • GPFS: Token manager detects node failure via heartbeat, forcibly revokes tokens, performs recovery. Strong consistency maintained. Recovery typically completes in seconds for small failures.
  • WekaFS: Distributed consensus detects failure, invalidates client's cache, reassigns ownership. Replica ensures no data loss. Other clients continue with minimal disruption.
  • VAST: NFS-based recovery—server detects client timeout and releases locks. CNodes are stateless, so no server-side dirty data to lose. Client reconnects to another CNode transparently. Data in DBoxes is erasure-coded and unaffected by client or CNode failures.

# Real-World Implications

Lustre's silent data loss: Under sustained write overload, Lustre OSTs can silently drop writes while reporting success to the application. Your write() returns successfully, but data never reaches disk. The only indication is cryptic messages in OST logs. For critical workloads, you must implement application-level checksumming or verification, and monitor OST logs for drop messages. This violates POSIX expectations and has caused data loss in production HPC systems.
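
A minimal defensive pattern, assuming you can afford the extra I/O, is to fsync() so dirty client pages are pushed out and then read the data back against a checksum before trusting the write. The sketch below is illustrative (the chunk size is arbitrary, the read-back roughly doubles the cost, and the cache-drop hint is best effort); it complements, rather than replaces, monitoring of server-side logs.

```python
# Defensive write-then-verify for filesystems where a successful write() is not
# proof the data landed (see the Lustre overload discussion above). Illustrative
# sketch: the chunk size is arbitrary and the read-back roughly doubles the I/O.
import hashlib
import os

CHUNK = 4 << 20   # 4 MiB read-back chunks (arbitrary)

def write_verified(path: str, payload: bytes) -> None:
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, payload)
        os.fsync(fd)           # force dirty client-side pages out to the servers
    finally:
        os.close(fd)

    expected = hashlib.sha256(payload).hexdigest()
    actual = hashlib.sha256()
    fd = os.open(path, os.O_RDONLY)
    try:
        # Best effort: ask the kernel to drop cached pages so the read-back
        # exercises the filesystem rather than the local page cache (advisory).
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
        while True:
            chunk = os.read(fd, CHUNK)
            if not chunk:
                break
            actual.update(chunk)
    finally:
        os.close(fd)

    if actual.hexdigest() != expected:
        raise IOError(f"verification failed for {path}: data did not survive the write")
```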

NFS hangs: A hung server can stall entire application stacks for hours. Use soft mounts for non-critical workloads (operations fail with EIO instead of hanging indefinitely) or implement application-level timeouts. Hard mounts are appropriate only when data integrity matters more than availability.
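
An application-level timeout can only bound how long your code waits; it cannot unstick a thread that is blocked inside the kernel on a hard mount. A sketch of the pattern, with that caveat spelled out and a placeholder path:

```python
# Application-level timeout around a potentially hanging NFS operation.
# Caveat: on a hard mount the worker thread stays blocked in the kernel until the
# server recovers; the timeout only lets the rest of the program react (alert,
# skip, fail over), and a permanently stuck worker can still delay interpreter
# shutdown. The path is a placeholder.
import concurrent.futures
import os
from typing import Optional

NFS_PATH = "/mnt/nfs/data/input.bin"   # hypothetical file on an NFS mount

def stat_with_timeout(path: str, timeout_s: float = 5.0) -> Optional[os.stat_result]:
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(os.stat, path)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return None                    # stat() is still blocked; treat mount as unhealthy
    finally:
        pool.shutdown(wait=False)      # do not block on the (possibly stuck) worker

if __name__ == "__main__":
    result = stat_with_timeout(NFS_PATH)
    print("mount responsive" if result else "mount appears hung; rerouting work")
```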

GPFS, WekaFS, and VAST: Provide better resilience with fast recovery and no silent data loss, but at higher cost. VAST trades some POSIX strictness for operational simplicity (NFS-based client access), while GPFS and WekaFS offer full POSIX compliance at greater complexity. All three require significant infrastructure investment.

⚠ Object Storage: AWS S3 (Not a Filesystem)

S3 and similar object stores (Azure Blob, GCS) are not filesystems—they're key-value stores with an HTTP API. They lack POSIX semantics and behave fundamentally differently. However, they're ubiquitous in cloud environments, and FUSE bridges (s3fs, goofys) attempt to make them appear as filesystems. Understanding the mismatch is critical.

Key differences and common pitfalls:

  • No atomic rename/move: Updating a config file atomically (write to temp, rename over existing) requires copy + delete—two separate operations that can fail independently. Readers may observe inconsistent state. See the sketch after this list.
  • No true "list directory": S3 only lists objects with a prefix. Listing a "directory" with millions of objects is slow (charged per 1000 list requests) and can take minutes. A simple ls that takes milliseconds in POSIX can cost dollars and time in S3.
  • No append operations: You cannot append to an object. For log files, you must download the entire object, append locally, and re-upload. A 100GB log file means 100GB transfer to add one line.
  • No partial updates: Cannot modify byte ranges. Databases doing random writes (e.g., SQLite) must read entire object, modify in memory, write back entirely. Performance catastrophe for DB workloads.
  • Object-level permissions via IAM, not POSIX: No chmod or chown. Permissions managed via IAM policies and bucket ACLs. Multi-user Linux scenarios don't map cleanly.
  • No hardlinks or symlinks: Every object is independent. Can't create efficient directory structures or file aliases.
  • Strong consistency (since Dec 2020): S3 now provides read-after-write consistency for all operations. However, this applies to whole objects, not byte ranges, and it is still very different from POSIX-style coherence, where concurrent readers can observe a writer's changes at byte granularity as they happen.
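
To make the rename pitfall concrete, here is roughly what a "rename" degenerates into against S3, sketched with boto3 (the bucket and key names are placeholders): a server-side copy followed by a delete, two independently failing operations with a window in which readers can still fetch the old key or find both.

```python
# "Rename" on S3 is copy + delete: two independent operations, not an atomic move.
# Sketch with boto3; the bucket and key names are placeholders. If the process
# dies between the two calls, both objects exist; single-call copies are also
# size-limited, so very large objects need multipart copy instead.
import boto3

s3 = boto3.client("s3")

def pseudo_rename(bucket: str, src_key: str, dst_key: str) -> None:
    # Server-side copy; the data does not pass through the client.
    s3.copy_object(
        Bucket=bucket,
        Key=dst_key,
        CopySource={"Bucket": bucket, "Key": src_key},
    )
    # Delete the source only after the copy succeeded.
    s3.delete_object(Bucket=bucket, Key=src_key)

# Contrast with POSIX, where os.rename() atomically replaces the destination.
pseudo_rename("example-bucket", "configs/app.yaml.tmp", "configs/app.yaml")
```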

When S3 makes sense: Immutable data (backups, archives, media files), write-once-read-many workloads, bulk storage where cost matters more than performance. Avoid for: databases, logs with frequent appends, applications expecting POSIX semantics. FUSE bridges help for read-heavy workloads but have severe write limitations—use with caution and test your assumptions.