
Distributed and Network Filesystems

Network filesystems allow multiple machines to share access to data, enabling collaboration and resource pooling. Understanding their trade-offs is critical: choosing the wrong filesystem for your workload can mean the difference between millisecond latencies and minute-long hangs, between graceful degradation and catastrophic data loss. This section introduces a framework for evaluating network filesystems across three key dimensions: semantics (what guarantees does the FS provide?), architecture (how does it scale and perform?), and failure modes (what breaks and how?). By understanding these dimensions, you'll know what to pay attention to when working with any network filesystem—even ones not covered here.

# Architecture Overview

Before diving into comparisons, let's visualize how these systems are structured. Architecture fundamentally determines scalability, failure modes, and performance characteristics.

NFS - Single Server Architecture

┌──────────┐  ┌──────────┐  ┌──────────┐
│ Client 1 │  │ Client 2 │  │ Client 3 │
└─────┬────┘  └─────┬────┘  └─────┬────┘
      │             │             │
      └─────────────┼─────────────┘
                    │
            ┌───────v────────┐
            │   NFS Server   │  < Single point of failure;
            │  (metadata +   │    all ops bottleneck here
            │     data)      │
            └───────┬────────┘
                    │
            ┌───────v────────┐
            │    Local FS    │  (ext4, XFS, ZFS, etc.)
            │   + Storage    │
            └────────────────┘

Performance: All operations funnel through a single server, limiting throughput to one machine's capabilities. Metadata operations (stat, readdir) and data I/O compete for the same resources, creating contention under mixed workloads.

Resiliency: Server failure makes the entire filesystem unavailable—no failover, no redundancy. NFS remains common where operational simplicity outweighs high-availability requirements.
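
One way to make this contention tangible is to time a burst of stat() calls against the mount from a single client. The sketch below is illustrative only: /mnt/nfs/testdir is a placeholder path, and client-side attribute caching can flatter the numbers well beyond what the server sustains once many clients hit it at once.

```python
# Rough metadata micro-benchmark: time a burst of stat() calls on an NFS mount.
# Illustrative sketch only -- the path is a placeholder, and client-side attribute
# caching (acregmin/acregmax mount options) can make a single client look far
# faster than the server's aggregate metadata capacity.
import os
import time

MOUNT_DIR = "/mnt/nfs/testdir"   # hypothetical directory on the NFS mount

def stat_burst(directory: str) -> float:
    """Return achieved stat() operations per second over one pass of a directory."""
    names = os.listdir(directory)                 # one readdir pass
    start = time.monotonic()
    for name in names:
        os.stat(os.path.join(directory, name))    # one GETATTR per entry (unless cached)
    elapsed = time.monotonic() - start
    return len(names) / elapsed if elapsed > 0 else float("inf")

if __name__ == "__main__":
    print(f"~{stat_burst(MOUNT_DIR):.0f} stat ops/s from this client")
```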

Lustre - Separated Metadata and Data

┌──────────┐  ┌──────────┐  ┌──────────┐
│ Client 1 │  │ Client 2 │  │ Client 3 │
└─────┬────┘  └─────┬────┘  └─────┬────┘
      │             │             │
   metadata     metadata        metadata
      │             │             │
      ├─────────────┼─────────────┤
      │             │             │    Parallel data paths
┌─────v─────┐       │             │    enable high throughput
│    MDS    │       │             │
│(metadata) │     data          data
└───────────┘       │             │
            ┌───────v──────┬──────v───────┬──────────┐
            │    OST 1     │    OST 2     │   OST 3  │  < Object Storage
            │    (data)    │    (data)    │  (data)  │    Targets (OSTs)
            └──────┬───────┴──────┬───────┴─────┬────┘
                   │              │             │
            ┌──────v──────┐┌──────v──────┐┌─────v─────┐
            │ XFS or ZFS  ││ XFS or ZFS  ││ XFS or ZFS│  < Backend FS
            └─────────────┘└─────────────┘└───────────┘

Performance: Separating metadata and data enables massive parallel throughput—clients can hit multiple OSTs simultaneously for striped files, achieving aggregate bandwidth that scales with OST count. Metadata server (MDS) handles namespace operations independently.

Resiliency: MDS remains a single point of failure (unless using HA config), but OST failures only affect files stored on that target. Ideal for HPC workloads with large sequential I/O patterns.
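
The throughput scaling comes from striping: a file is divided into stripe-size chunks laid out round-robin across the OSTs assigned to it, so large sequential reads and writes fan out over many servers. The sketch below is a toy illustration of that mapping; the stripe size and count are arbitrary example values, and real layouts are set per file or directory and inspected with tools such as lfs getstripe.

```python
# Toy illustration of round-robin striping: which object (OST) holds a given byte?
# The stripe size and count here are arbitrary example values, not Lustre defaults;
# a real file's layout is whatever `lfs getstripe` reports for it.

STRIPE_SIZE = 1 << 20    # 1 MiB per stripe (illustrative)
STRIPE_COUNT = 4         # file striped across 4 OSTs (illustrative)

def ost_for_offset(offset: int) -> tuple[int, int]:
    """Map a file offset to (index within the file's OST set, offset inside that object)."""
    stripe_number = offset // STRIPE_SIZE        # which stripe of the file this byte is in
    ost_index = stripe_number % STRIPE_COUNT     # stripes are dealt out round-robin
    object_offset = (stripe_number // STRIPE_COUNT) * STRIPE_SIZE + (offset % STRIPE_SIZE)
    return ost_index, object_offset

for offset in (0, 512 << 10, 3 << 20, 9 << 20):
    idx, obj_off = ost_for_offset(offset)
    print(f"file offset {offset:>10,} -> OST index {idx}, object offset {obj_off:,}")
```

Because consecutive stripes land on different OSTs, one large sequential read can keep several servers busy at once, which is where the aggregate bandwidth figures above come from.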

GPFS - Shared Disk with Distributed Locking

┌──────────┐  ┌──────────┐  ┌──────────┐
│ Client 1 │  │ Client 2 │  │ Client 3 │  < All nodes can be
└─────┬────┘  └─────┬────┘  └─────┬────┘     clients + NSD servers
      │             │             │
      └─────────────┼─────────────┘
           Token-based locking
                    │
      ┌─────────────┼─────────────┐
      │             │             │
┌─────v─────┐ ┌─────v─────┐ ┌─────v─────┐
│NSD Server1│ │NSD Server2│ │NSD Server3│  < Network Shared Disk
└─────┬─────┘ └─────┬─────┘ └─────┬─────┘     servers manage I/O
      │             │             │
      └─────────────┼─────────────┘
                    │
            ┌───────v────────┐
            │ Shared Storage │  (SAN, NVMe-oF, etc.)
            └────────────────┘

Performance: Token-based locking enables cache coherency without constant server round-trips—nodes can cache data as long as they hold the appropriate token. Direct storage access eliminates data forwarding overhead.

Resiliency: No single point of failure—any NSD server can fail and others take over. Token manager is distributed across the cluster. Shared storage itself needs redundancy (RAID, replication). Trade-off: lock contention can hurt metadata-intensive workloads.
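
The token mechanics can be pictured with a toy model: a node may keep data cached while it holds a compatible token, and a conflicting request forces the current holders to flush and surrender theirs. The sketch below is purely conceptual; it models the behavior described above and is not GPFS's actual protocol or any real API.

```python
# Conceptual model of token-based cache coherency (not the GPFS implementation).
# A node may cache a range while it holds a token; granting a conflicting write
# token first revokes the other holders', forcing them to flush or invalidate.

class TokenManager:
    def __init__(self):
        self.readers = set()   # nodes holding a read token
        self.writer = None     # node holding the write token, if any

    def acquire_read(self, node):
        if self.writer and self.writer != node:
            self._revoke(self.writer)        # writer must flush dirty data first
        self.readers.add(node)

    def acquire_write(self, node):
        holders = list(self.readers) + ([self.writer] if self.writer else [])
        for other in holders:
            if other != node:
                self._revoke(other)          # readers drop their cached copies
        self.readers.discard(node)
        self.writer = node

    def _revoke(self, node):
        print(f"revoke token from {node}: flush dirty data / drop cached copy")
        self.readers.discard(node)
        if self.writer == node:
            self.writer = None

mgr = TokenManager()
mgr.acquire_read("node1")    # node1 may now cache the range
mgr.acquire_read("node2")    # shared readers coexist with no extra traffic
mgr.acquire_write("node3")   # conflict: node1 and node2 are revoked first
```

Shared reads stay cheap, while write sharing turns into revocation traffic; that is the lock-contention trade-off noted above.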

WekaFS - Distributed with Coherent Cache

┌──────────┐  ┌──────────┐  ┌──────────┐
│ Client 1 │  │ Client 2 │  │ Client 3 │
│ + Cache  │  │ + Cache  │  │ + Cache  │  < Distributed coherent
└─────┬────┘  └─────┬────┘  └─────┬────┘     cache on every node
      │             │             │
      └─────────────┼─────────────┘
           Cache coherency protocol
                    │
      ┌─────────────┼─────────────┐
      │             │             │
┌─────v─────┐ ┌─────v─────┐ ┌─────v─────┐
│  Backend  │ │  Backend  │ │  Backend  │  < Distributed backend
│  Node 1   │ │  Node 2   │ │  Node 3   │    (NVMe drives)
└───────────┘ └───────────┘ └───────────┘

Performance: Every client has local NVMe/RAM cache, minimizing network round-trips for hot data. Cache coherency protocol ensures consistency without sacrificing speed—reads hit local cache, writes invalidate remote copies. Distributed backend provides parallel access paths.

Resiliency: Fully distributed architecture with no single point of failure. Data is erasure-coded or replicated across backend nodes. Node failures trigger automatic rebalancing. Optimized for modern cloud-native and AI/ML workloads requiring both high throughput and low latency.

VAST - Disaggregated Shared Everything (DASE)

┌──────────┐  ┌──────────┐  ┌──────────┐
│ Client 1 │  │ Client 2 │  │ Client 3 │  < Standard NFS clients
└─────┬────┘  └─────┬────┘  └─────┬────┘     (multipath NFS/RDMA)
      │             │             │
      └─────────────┼─────────────┘
           Enhanced NFS protocol
                    │
      ┌─────────────┼─────────────┐
      │             │             │
┌─────v─────┐ ┌─────v─────┐ ┌─────v─────┐
│  CNode 1  │ │  CNode 2  │ │  CNode 3  │  < Stateless compute
│  (CBox)   │ │  (CBox)   │ │  (CBox)   │    + write cache
└─────┬─────┘ └─────┬─────┘ └─────┬─────┘
      │             │             │
      NVMe-oF fabric (any CNode to any DBox)
      │             │             │
      └─────────────┼─────────────┘
                    │
      ┌─────────────┼─────────────┐
      │             │             │
┌─────v─────┐ ┌─────v─────┐ ┌─────v─────┐
│  DBox 1   │ │  DBox 2   │ │  DBox 3   │  < Storage enclosures
│  SCM+QLC  │ │  SCM+QLC  │ │  SCM+QLC  │    (all-flash)
└───────────┘ └───────────┘ └───────────┘

Performance: VAST's "Disaggregated Shared Everything" (DASE) architecture separates stateless compute nodes (CBoxes) from storage enclosures (DBoxes). Any CNode can access any DBox over NVMe-oF fabric, enabling global data reduction and eliminating isolated data islands. Uses enhanced NFS with multipath and RDMA for client access—simpler than parallel FS clients but with competitive throughput.

Resiliency: No single point of failure—CNodes are stateless and can fail without data loss. DBoxes use Storage Class Memory (SCM) for metadata durability and QLC flash for capacity. Data is erasure-coded across DBoxes. Trade-off: relies on NFS semantics (close-to-open by default), and write cache in CNodes can become a bottleneck under sustained high-throughput writes.

# Filesystem Semantics Comparison

POSIX defines a set of filesystem behaviors that applications rely on—atomic operations, consistency guarantees, and locking primitives. Network filesystems make different trade-offs between full POSIX compliance and performance. Understanding these semantics tells you what your application can safely assume.

┌────────────┬──────────────────┬───────────────────────────┬────────────────────┐
│ Filesystem  POSIX Compliance  Consistency Model          Locking Support    │
├────────────┼──────────────────┼───────────────────────────┼────────────────────┤
│ NFS         Close [1]         Close-to-open [2]          NLM (advisory) [3] │
│                               Relaxed during open        Lockd daemon       │
├────────────┼──────────────────┼───────────────────────────┼────────────────────┤
│ Lustre      Full POSIX        Strict coherency [4]       LDLM [5]           │
│                               Byte-range via locks       Distributed locks  │
├────────────┼──────────────────┼───────────────────────────┼────────────────────┤
│ GPFS        Full POSIX        Strict coherency           Token-based [6]    │
│                               All operations serialized  Byte-range locks   │
├────────────┼──────────────────┼───────────────────────────┼────────────────────┤
│ WekaFS      Full POSIX        Strong consistency [7]     POSIX locks        │
│                               Coherent distributed cache Distributed mgmt   │
├────────────┼──────────────────┼───────────────────────────┼────────────────────┤
│ VAST        NFS-based [8]     Close-to-open default      NFS locks (NLM/v4) │
│                               Strong consistency option  Advisory locking   │
└────────────┴──────────────────┴───────────────────────────┴────────────────────┘

[1] NFSv3 lacks some atomicity guarantees (e.g., O_EXCL create can race); NFSv4 improves this
[2] NFS close-to-open: changes by one client visible to others only after close()+open()
    See deep-dive section for details on edge cases
[3] NLM (Network Lock Manager) provides advisory locks; server crash can lose lock state
[4] Lustre LDLM (Distributed Lock Manager) provides strong cache coherency via lock callbacks
[5] Lustre locks can be revoked under contention, forcing clients to flush dirty data
[6] GPFS tokens grant read/write/lock permissions; distributed algorithm ensures consistency
[7] WekaFS maintains cache coherency across all clients via distributed consensus protocol
[8] VAST uses enhanced NFS (multipath, RDMA) but inherits NFS semantics; can configure stricter modes
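
The "advisory" in footnote [3] deserves emphasis: fcntl()-style byte-range locks only protect processes that actually take them, and nothing stops an unlocked writer. What differs between the filesystems above is who enforces the lock across clients (NLM or NFSv4 locking, Lustre's LDLM, GPFS tokens). A minimal sketch of the standard pattern, using a placeholder path for a shared counter file:

```python
# Advisory byte-range locking with fcntl: the POSIX pattern that the network
# filesystem's lock manager must enforce across clients. "Advisory" means only
# cooperating processes that take the lock are protected; an unlocked writer is
# never blocked. The path is a placeholder.
import fcntl
import os

PATH = "/mnt/shared/counter.dat"   # hypothetical file on a network mount

fd = os.open(PATH, os.O_RDWR | os.O_CREAT, 0o644)
try:
    # Exclusive lock on the first 8 bytes; blocks until no other client holds it.
    fcntl.lockf(fd, fcntl.LOCK_EX, 8, 0, os.SEEK_SET)

    raw = os.pread(fd, 8, 0)
    value = int(raw.decode() or "0")
    os.pwrite(fd, f"{value + 1:08d}".encode(), 0)   # read-modify-write is now safe
    os.fsync(fd)                                    # push the update to the server
finally:
    fcntl.lockf(fd, fcntl.LOCK_UN, 8, 0, os.SEEK_SET)
    os.close(fd)
```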

# Key Implications

If your application requires strict POSIX semantics (e.g., databases with byte-range locking, build systems expecting consistent metadata), NFS's close-to-open model may surprise you. Lustre, GPFS, and WekaFS provide stricter guarantees but at the cost of lock traffic and potential contention. For read-mostly workloads or applications that don't share files between processes, NFS's relaxed model often suffices.
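
To make close-to-open concrete: the writer's close() is the point at which its changes become eligible for others to see, and the reader's open() is the point at which the client revalidates its cache. A reader that keeps a descriptor open across another client's update may keep returning stale data indefinitely. A minimal sketch of the defensive pattern, with a placeholder path:

```python
# Close-to-open consistency in practice (NFS, and VAST's NFS access by default):
# a client is only guaranteed to see another client's writes if the writer
# close()d the file and the reader open()s it afterwards. The path is a placeholder.
import os

STATUS_FILE = "/mnt/nfs/job.status"   # hypothetical file updated by another client

def publish(path: str, payload: bytes) -> None:
    """Writer side: flush and close so peers become eligible to see the update."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, payload)
        os.fsync(fd)    # data reaches the server, not just the local page cache
    finally:
        os.close(fd)    # close() is the "publish" point under close-to-open

def read_fresh(path: str) -> bytes:
    """Reader side: re-open per read so the client revalidates against the server."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.read(fd, 4096)
    finally:
        os.close(fd)
```

The filesystems listed with strict coherency make the re-open unnecessary; that guarantee is what you pay for in lock and callback traffic.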

# Architecture, Scaling & Performance

Architecture determines how systems scale and where bottlenecks emerge. Single-server designs (NFS) are simple but limited by one machine's resources. Distributed designs (Lustre, GPFS, WekaFS, VAST) scale out but introduce complexity. Understanding these trade-offs helps you predict performance for your workload.

┌──────────┬──────────────────┬─────────────────────────────────────┬────────────────────────┐
│ FS        Architecture      Performance Profile                  Scaling Limits         │
├──────────┼──────────────────┼─────────────────────────────────────┼────────────────────────┤
│ NFS       Single server     Good: General-purpose workloads      Single server CPU/IO   │
│           All ops to 1 box  Moderate: Metadata (5-10K ops/s)[8]  Network bandwidth      │
│                             Poor: Highly concurrent small files  Typical: 100s clients  │
├──────────┼──────────────────┼─────────────────────────────────────┼────────────────────────┤
│ Lustre    MDS + OSTs        Excellent: Large files (100+ GB/s)   MDS metadata ops [9]   │
│           Separated meta/             Parallel I/O across OSTs   Single MDS: ~20K ops/s │
│           data              Poor: Small files (metadata bound)   Multi-MDT helps but    │
│                                   Millions of files in one dir   adds complexity        │
│                                                                  Scales to 1000s clients│
├──────────┼──────────────────┼─────────────────────────────────────┼────────────────────────┤
│ GPFS      Shared disk       Good: Mixed workloads                Network fabric         │
│           Distributed       Excellent: Metadata (50K+ ops/s)[10] Token contention under │
│           token mgmt                   Random I/O                heavy write sharing    │
│                             Good: Small and large files          Scales to 1000s clients│
├──────────┼──────────────────┼─────────────────────────────────────┼────────────────────────┤
│ WekaFS    Fully distributed Excellent: Small files, metadata     Designed for massive   │
│           Coherent cache               (millions of ops/s) [11]  scale (1000s clients)  │
│           everywhere                   Low latency (sub-ms)      Requires high-speed    │
│                             GPU-direct, RDMA-capable             network (100GbE+)      │
├──────────┼──────────────────┼─────────────────────────────────────┼────────────────────────┤
│ VAST      DASE: separated   Excellent: AI/ML workloads [12]      Network fabric size    │
│           compute (CBox)               Mixed read/write patterns Write cache fill under │
│           from storage      Good: Global dedup/compression       sustained high I/O     │
│           (DBox) + NVMe-oF  Simpler ops than Lustre/GPFS         Scales to 1000s clients│
└──────────┴──────────────────┴─────────────────────────────────────┴────────────────────────┘

[8]  NetApp NFS performance guide TR-4067; typical enterprise server
[9]  Lustre Operations Manual 2.15; single MDT performance limits
[10] IBM GPFS Performance Tuning Guide; large enterprise deployment
[11] Weka technical documentation; AI/ML workload benchmarks
[12] VAST Data technical documentation; designed for AI-era workloads with high concurrency

# Workload Matching

Architecture dictates performance sweet spots. Match your workload to filesystem strengths:

  • Deep learning training: Millions of small image files > WekaFS's distributed cache and metadata performance excel, while NFS would crawl under metadata load.
  • Genomics pipelines: Terabyte-scale files, sequential I/O > Lustre's parallel data paths across multiple OSTs shine, delivering 100+ GB/s aggregate throughput.
  • Mixed enterprise workloads: VMs, databases, home directories > GPFS or NFS offer good balance for varied access patterns without extreme specialization.
  • HPC checkpointing: Periodic massive writes from thousands of ranks > Lustre (if you can tolerate load shedding) or GPFS (better reliability but lower peak throughput).

# Failure Modes & Overload Behavior

Systems fail. Clients misbehave. Loads spike. How filesystems handle these scenarios determines whether you experience graceful degradation or catastrophic failure. Understanding failure modes is essential for production deployments—it's not if things break, but when and how.

┌──────────┬─────────────────────┬─────────────────────┬────────────────────────┐
│ FS        Read Overload        Write Overload       Server Failure         │
├──────────┼─────────────────────┼─────────────────────┼────────────────────────┤
│ NFS       Slow responses       Clients hang [13]    Hard mount: clients    │
│           Queue on server      RPC timeout          hang until recovery    │
│           Graceful degradation (default: 60s,       (can be hours)         │
│                                 retries = forever)  Soft mount: EIO errors │
├──────────┼─────────────────────┼─────────────────────┼────────────────────────┤
│ Lustre    Client throttling    Load shedding! [14]  MDS fail: all metadata │
│           Read-ahead reduced   OST silently drops   stops (open/stat/mkdir)│
│           Performance degrades writes - app sees    OST fail: only files on│
│                                success but data     that OST affected      │
│                                lost. Check logs!    No transparent failover│
├──────────┼─────────────────────┼─────────────────────┼────────────────────────┤
│ GPFS      Token contention     Write-behind fills   Fast recovery (5-30s)  │
│           Increased latency    Graceful throttling  Distributed recovery   │
│           Caching helps absorb Blocks when full     protocol [15]          │
│                                No data loss         Quorum-based HA        │
├──────────┼─────────────────────┼─────────────────────┼────────────────────────┤
│ WekaFS    Distributed cache    Intelligent          Fast failover (~10s)   │
│           absorbs read bursts  throttling           Distributed replicas   │
│           Maintains low        Maintains latency    No single point of     │
│           latency              targets              failure                │
├──────────┼─────────────────────┼─────────────────────┼────────────────────────┤
│ VAST      All-flash backend    Write cache in       CNode fail: stateless, │
│           handles read bursts  CNodes can fill      clients reconnect [16] │
│           NVMe-oF low latency  Performance drops    DBox fail: erasure-    │
│                                to backend speed     coded, auto-rebuild    │
└──────────┴─────────────────────┴─────────────────────┴────────────────────────┘

[13] NFS hang behavior is infamous in HPC ("the NFS hang of death")
[14] Lustre load shedding: OST silently discards writes under overload; write() succeeds but
     data never reaches disk. Only detectable via OST logs (search for "ENOSPC" or drop messages).
     Catastrophic for applications assuming POSIX write semantics.
[15] GPFS recovery protocol: "Scalable cluster-wide failure recovery" (FAST '08)
[16] VAST CNodes are stateless; failure means clients reconnect to another CNode with no data loss

# Misbehaving Client Scenarios

What happens when a client crashes mid-write while holding locks?

  • NFS: Lock recovery is lease- or notification-based: NFSv4 releases a crashed client's locks once its lease expires (typically 90s), while NFSv3/NLM relies on rpc.statd crash notifications and can leave stale locks until cleared. Other clients may see partial writes if data wasn't flushed before the crash.
  • Lustre: LDLM has lock recovery protocol. If client crashes with dirty data, that data is lost (client-side caching). Lock timeout ~20s, other clients can proceed. Corruption possible if application didn't fsync().
  • GPFS: Token manager detects node failure via heartbeat, forcibly revokes tokens, performs recovery. Strong consistency maintained. Recovery typically completes in seconds for small failures.
  • WekaFS: Distributed consensus detects failure, invalidates client's cache, reassigns ownership. Replica ensures no data loss. Other clients continue with minimal disruption.
  • VAST: NFS-based recovery—server detects client timeout and releases locks. CNodes are stateless, so no server-side dirty data to lose. Client reconnects to another CNode transparently. Data in DBoxes is erasure-coded and unaffected by client or CNode failures.

# Real-World Implications

Lustre's silent data loss: Under sustained write overload, Lustre OSTs can silently drop writes while reporting success to the application. Your write() returns successfully, but data never reaches disk. The only indication is cryptic messages in OST logs. For critical workloads, you must implement application-level checksumming or verification, and monitor OST logs for drop messages. This violates POSIX expectations and has caused data loss in production HPC systems.
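
A minimal defensive pattern, assuming you can afford the extra I/O, is to fsync() so dirty client pages are pushed out and then read the data back against a checksum before trusting the write. The sketch below is illustrative (the chunk size is arbitrary, the read-back roughly doubles the cost, and the cache-drop hint is best effort); it complements, rather than replaces, monitoring of server-side logs.

```python
# Defensive write-then-verify for filesystems where a successful write() is not
# proof the data landed (see the Lustre overload discussion above). Illustrative
# sketch: the chunk size is arbitrary and the read-back roughly doubles the I/O.
import hashlib
import os

CHUNK = 4 << 20   # 4 MiB read-back chunks (arbitrary)

def write_verified(path: str, payload: bytes) -> None:
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, payload)
        os.fsync(fd)           # force dirty client-side pages out to the servers
    finally:
        os.close(fd)

    expected = hashlib.sha256(payload).hexdigest()
    actual = hashlib.sha256()
    fd = os.open(path, os.O_RDONLY)
    try:
        # Best effort: ask the kernel to drop cached pages so the read-back
        # exercises the filesystem rather than the local page cache (advisory).
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
        while True:
            chunk = os.read(fd, CHUNK)
            if not chunk:
                break
            actual.update(chunk)
    finally:
        os.close(fd)

    if actual.hexdigest() != expected:
        raise IOError(f"verification failed for {path}: data did not survive the write")
```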

NFS hangs: A hung server can stall entire application stacks for hours. Use soft mounts for non-critical workloads (operations fail with EIO instead of hanging indefinitely) or implement application-level timeouts. Hard mounts are appropriate only when data integrity matters more than availability.
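
An application-level timeout can only bound how long your code waits; it cannot unstick a thread that is blocked inside the kernel on a hard mount. A sketch of the pattern, with that caveat spelled out and a placeholder path:

```python
# Application-level timeout around a potentially hanging NFS operation.
# Caveat: on a hard mount the worker thread stays blocked in the kernel until the
# server recovers; the timeout only lets the rest of the program react (alert,
# skip, fail over), and a permanently stuck worker can still delay interpreter
# shutdown. The path is a placeholder.
import concurrent.futures
import os
from typing import Optional

NFS_PATH = "/mnt/nfs/data/input.bin"   # hypothetical file on an NFS mount

def stat_with_timeout(path: str, timeout_s: float = 5.0) -> Optional[os.stat_result]:
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(os.stat, path)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return None                    # stat() is still blocked; treat mount as unhealthy
    finally:
        pool.shutdown(wait=False)      # do not block on the (possibly stuck) worker

if __name__ == "__main__":
    result = stat_with_timeout(NFS_PATH)
    print("mount responsive" if result else "mount appears hung; rerouting work")
```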

GPFS, WekaFS, and VAST: Provide better resilience with fast recovery and no silent data loss, but at higher cost. VAST trades some POSIX strictness for operational simplicity (NFS-based client access), while GPFS and WekaFS offer full POSIX compliance at greater complexity. All three require significant infrastructure investment.

⚠ Object Storage: AWS S3 (Not a Filesystem)

S3 and similar object stores (Azure Blob, GCS) are not filesystems—they're key-value stores with an HTTP API. They lack POSIX semantics and behave fundamentally differently. However, they're ubiquitous in cloud environments, and FUSE bridges (s3fs, goofys) attempt to make them appear as filesystems. Understanding the mismatch is critical.

Key differences and common pitfalls:

  • No atomic rename/move: Updating a config file atomically (write to temp, rename over existing) requires copy + delete—two separate operations that can fail independently. Readers may observe inconsistent state. See the sketch after this list.
  • No true "list directory": S3 only lists objects with a prefix. Listing a "directory" with millions of objects is slow (charged per 1000 list requests) and can take minutes. A simple ls that takes milliseconds in POSIX can cost dollars and time in S3.
  • No append operations: You cannot append to an object. For log files, you must download the entire object, append locally, and re-upload. A 100GB log file means 100GB transfer to add one line.
  • No partial updates: Cannot modify byte ranges. Databases doing random writes (e.g., SQLite) must read entire object, modify in memory, write back entirely. Performance catastrophe for DB workloads.
  • Object-level permissions via IAM, not POSIX: No chmod or chown. Permissions managed via IAM policies and bucket ACLs. Multi-user Linux scenarios don't map cleanly.
  • No hardlinks or symlinks: Every object is independent. Can't create efficient directory structures or file aliases.
  • Strong consistency (since Dec 2020): S3 now provides read-after-write consistency for all operations. However, this applies to whole objects, not byte ranges, and it is still very different from POSIX-style coherence, where concurrent readers can observe a writer's changes at byte granularity as they happen.
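
To make the rename pitfall concrete, here is roughly what a "rename" degenerates into against S3, sketched with boto3 (the bucket and key names are placeholders): a server-side copy followed by a delete, two independently failing operations with a window in which readers can still fetch the old key or find both.

```python
# "Rename" on S3 is copy + delete: two independent operations, not an atomic move.
# Sketch with boto3; the bucket and key names are placeholders. If the process
# dies between the two calls, both objects exist; single-call copies are also
# size-limited, so very large objects need multipart copy instead.
import boto3

s3 = boto3.client("s3")

def pseudo_rename(bucket: str, src_key: str, dst_key: str) -> None:
    # Server-side copy; the data does not pass through the client.
    s3.copy_object(
        Bucket=bucket,
        Key=dst_key,
        CopySource={"Bucket": bucket, "Key": src_key},
    )
    # Delete the source only after the copy succeeded.
    s3.delete_object(Bucket=bucket, Key=src_key)

# Contrast with POSIX, where os.rename() atomically replaces the destination.
pseudo_rename("example-bucket", "configs/app.yaml.tmp", "configs/app.yaml")
```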

When S3 makes sense: Immutable data (backups, archives, media files), write-once-read-many workloads, bulk storage where cost matters more than performance. Avoid for: databases, logs with frequent appends, applications expecting POSIX semantics. FUSE bridges help for read-heavy workloads but have severe write limitations—use with caution and test your assumptions.