ZFS: A Modern Local Filesystem
ZFS deserves special attention as a local filesystem that combines traditional filesystem responsibilities with volume management, offering features rarely found together: end-to-end checksumming, integrated RAID, atomic snapshots, and transparent compression. Originally developed by Sun Microsystems for Solaris, OpenZFS now runs on Linux, BSD, and other platforms.
# Key Features
- End-to-end checksumming: Every block is checksummed, and the checksum is stored in the parent block pointer rather than next to the data it protects. Silent data corruption (bit rot) is detected and, with redundancy, automatically repaired. Among mainstream filesystems only Btrfs offers comparable protection by default; ext4 and XFS checksum metadata only.
- Copy-on-write (CoW) architecture: Writes never overwrite existing data. New data goes to new blocks, metadata updated atomically. Enables instant snapshots and eliminates write-in-place corruption.
- Integrated volume management: No need for LVM or mdadm. Create storage pools from multiple devices and dynamically allocate datasets (roughly analogous to individual filesystems) within them. Devices can be added online; removal is supported only for certain vdev layouts.
- RAID-Z: ZFS's own RAID implementation. RAID-Z1, RAID-Z2, and RAID-Z3 provide single, double, and triple parity (Z1/Z2 are roughly analogous to RAID-5/6). Avoids the write-hole problem of traditional RAID-5/6 because variable-width stripes eliminate partial-stripe writes.
- Snapshots and clones: Instant, space-efficient snapshots that work like "git for data." Take snapshots trivially, roll a dataset back to any snapshot, and keep a full version history of your filesystem. Writable clones branch off a snapshot for testing. Incremental send/receive transmits only the blocks that changed between two snapshots, making it a powerful tool for replicating datasets across systems, as illustrated below.
ZFS Snapshot Replication: "git for data"

Primary Server (server-a)              Replica Server (server-b)

tank/data                              backup/data
  |
  +-- @monday      ---------->         @monday
  |   (1TB)        full send           (1TB)
  |
  +-- @tuesday     ---------->         @tuesday
  |   (+50GB)      delta only (50GB)
  |
  +-- @wednesday   ---------->         @wednesday
      (+30GB)      delta only (30GB)
Command to send the incremental snapshot:

$ zfs send -i @tuesday tank/data@wednesday | \
    ssh server-b zfs receive backup/data
              ^                 ^
              |                 +-- target snapshot (the one being sent)
              +-- base snapshot (what server-b already has)

Only the delta (30GB of changes) is transmitted, not the full dataset.
Snapshots share unchanged blocks via copy-on-write, so replication is space-efficient on both ends.
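A minimal command sequence for the workflow above, using the illustrative names from the diagram (tank/data on server-a, backup/data on server-b) and assuming a pool named backup already exists on the replica:

$ zfs snapshot tank/data@monday                    # point-in-time snapshot, effectively instant
$ zfs send tank/data@monday | \
    ssh server-b zfs receive backup/data           # initial full send seeds the replica
$ zfs snapshot tank/data@tuesday
$ zfs send -i @monday tank/data@tuesday | \
    ssh server-b zfs receive backup/data           # only blocks changed since @monday are sent
$ zfs list -t snapshot tank/data                   # list snapshots and their space usage

The replica dataset must be unmodified since the last received snapshot; zfs receive -F can force a rollback to that snapshot if it is not.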
The efficiency of ZFS snapshots comes from its copy-on-write architecture. When you create a snapshot, no data is actually copied—instead, the snapshot simply references the existing data blocks. As new writes occur, only the modified blocks are written to new locations, while both the current filesystem and snapshots continue referencing the unchanged blocks. This means snapshots are nearly instantaneous and consume minimal space initially, growing only as the filesystem diverges from the snapshot state.
Copy-on-Write: How Snapshots Share Blocks

Initial state (no snapshot):

  File A:      [Block 1][Block 2][Block 3]

After creating @snap1:

  File A:      [Block 1][Block 2][Block 3]
                   ^        ^        ^
  @snap1 refs: ----+--------+--------+

Modify Block 2 (copy-on-write):

  File A:      [Block 1][Block 4][Block 3]     (Block 4 = new version)
                   ^                 ^
  @snap1 refs: ----+--------+--------+         still references the old blocks
                            |
                            v
                        [Block 2]              (old version, kept for the snapshot)

Result:
  - @snap1 preserves the original state (Blocks 1, 2, 3)
  - The current filesystem uses Blocks 1, 4, 3
  - Only Block 4 is new storage; Blocks 1 and 3 are shared
  - Space used: original data + 1 modified block
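You can watch this sharing happen with standard commands; the dataset and file names below are illustrative. A snapshot's USED column counts only the blocks held exclusively by that snapshot (the superseded old versions), while REFER is the logical size it references.

$ zfs snapshot tank/data@snap1                               # instant: records block references, copies nothing
$ dd if=/dev/urandom of=/tank/data/file.bin bs=1M count=10 conv=notrunc   # rewrite 10MB of an existing file in place
$ zfs list -t snapshot -o name,used,refer tank/data          # @snap1's USED grows by roughly the overwritten 10MB

Until the overwrite, @snap1 consumes almost nothing; afterwards it pins the superseded blocks, exactly as in the diagram.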
- ARC (Adaptive Replacement Cache) and L2ARC: ZFS uses a multi-tier caching system that dramatically improves read and write performance while preserving filesystem guarantees. The ARC is ZFS's primary RAM-based cache and is smarter than simple LRU (Least Recently Used) caching: it balances a recently-used list (MRU) against a frequently-used list (MFU) and adapts the split to the workload, so a large file scanned once won't evict frequently accessed small files. For writes, ZFS buffers data in RAM and coalesces it into transaction groups, turning random application writes into efficient sequential writes. The ZFS Intent Log (ZIL) keeps synchronous writes safe: it records them in a small sequential log and acknowledges the application immediately, while the actual data is written lazily to its final location. L2ARC (Level 2 ARC) extends caching to SSD as a tier between RAM and spinning disks; it is populated with data evicted from ARC, creating a RAM → SSD → HDD hierarchy that suits working sets too large for RAM. The ARC size is tunable but by default consumes a significant fraction of system RAM (up to roughly half). The diagram and example commands below show how the pieces fit together.
- Compression and deduplication: Transparent compression (lz4, zstd) with minimal performance overhead. Deduplication available but RAM-intensive (avoid unless truly needed).
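Compression is enabled per dataset and applies to data written after the change; the dataset name below is illustrative.

$ zfs set compression=lz4 tank/data              # or compression=zstd on recent OpenZFS releases
$ zfs get compression,compressratio tank/data    # compressratio reports the achieved ratio
$ zfs set dedup=on tank/data                     # possible, but RAM-hungry; usually leave it off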
ZFS Multi-Tier Caching: ARC, L2ARC, and ZIL

Read Path:                              Write Path:

Application read                        Application write (sync)
       |                                      |
       v                                      v
+-------------+                         +-----------+
|     ARC     | (RAM)                   |    ZIL    | (Log device or pool)
|  MRU | MFU  |  Hit? Return            +-----------+  Ack immediately
+-------------+                               |
       | Miss                                 v
       v                                (Actual data write happens
+-------------+                          lazily in transaction groups)
|    L2ARC    | (SSD)                         |
+-------------+  Hit? Return                  v
       | Miss                           +-------------+
       v                                | Data blocks |  Coalesced writes
+-------------+                         +-------------+
|   Storage   | (HDD/SSD)
+-------------+  Read from disk
[Block 1][Block 2][Block 3]...
ARC Algorithm:
MRU (Most Recently Used): Data accessed once recently
MFU (Most Frequently Used): Data accessed multiple times
- Scan-resistant: Large sequential scans don't evict hot data
- Adapts split between MRU/MFU based on workload
- Default size: up to 1/2 of system RAM (tunable)
L2ARC (Level 2 ARC):
- Optional SSD cache tier
- Populated with data evicted from ARC
- Indexed in ARC (metadata in RAM points to L2ARC blocks)
- Survives reboots with persistent L2ARC feature
ZIL (ZFS Intent Log):
- Handles synchronous writes (O_SYNC, fsync, etc.)
- Sequential writes to log → fast acknowledgment
- Data later written to final location in transaction groups
- Can use dedicated fast device (SLOG) for performance
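A few commands connect the diagram above to practice. arc_summary ships with OpenZFS packages on most distributions; the device paths and pool name below are illustrative, and a dedicated SLOG only pays off for synchronous-write-heavy workloads (databases, NFS servers).

$ arc_summary | head -n 40                   # ARC size, MRU/MFU balance, hit rates
$ zpool add tank cache /dev/nvme0n1          # attach an SSD as L2ARC
$ zpool add tank log /dev/nvme1n1            # attach a dedicated SLOG device for the ZIL
$ zpool status tank                          # cache and log vdevs show up in the pool layout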
# Trade-offs and Considerations
- RAM requirements: A common rule of thumb is 1GB of RAM per TB of storage for basic usage, and substantially more for deduplication. The ARC can be capped (see the example after this list), but low-RAM systems may still struggle.
- Performance characteristics: Excellent for sequential I/O and large files. Random writes can fragment due to CoW. Throughput sensitive to recordsize setting (default 128KB, tunable per-dataset).
- Licensing: The CDDL license is incompatible with the GPL, so ZFS is not included in the mainline Linux kernel. It requires DKMS modules or distribution-specific packaging; Ubuntu ships ZFS support out of the box.
- Maturity on Linux: OpenZFS on Linux is production-ready and widely deployed, but originally designed for Solaris. Some features (e.g., certain performance tunings) work differently on Linux.
- Use as Lustre backend: Can provide OST storage with checksumming benefits. However, XFS typically offers better raw performance. Choose ZFS for data integrity, XFS for maximum throughput.
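On memory-constrained hosts the ARC ceiling can be lowered via the zfs_arc_max module parameter; the 4 GiB figure below is just an example value. The first command (run as root) takes effect at runtime, the second persists the setting across reboots:

$ echo 4294967296 > /sys/module/zfs/parameters/zfs_arc_max                # cap ARC at 4 GiB now
$ echo "options zfs zfs_arc_max=4294967296" >> /etc/modprobe.d/zfs.conf   # apply at module load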
# Understanding Recordsize
One of the most misunderstood aspects of ZFS is recordsize and how it differs from block size. The recordsize (default 128KB) is the maximum unit of data ZFS will write at once, but crucially, ZFS adapts recordsize to file size. A 4KB file is written as a 4KB record, not 128KB—there's no internal fragmentation or waste. This adaptive behavior surprises many people who expect fixed-size allocation.
Recordsize: Adaptive Allocation
Dataset recordsize = 128KB (default)
File sizes and actual allocation:
1KB file --> Written as 1KB record (not 128KB!)
4KB file --> Written as 4KB record
64KB file --> Written as 64KB record
200KB file --> Written as two 128KB records (files larger than the recordsize use full-size records; with compression the unused tail compresses away)
Key insight: Small files don't waste space, even with large recordsize.
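A quick way to see the adaptive behavior for yourself (pool and dataset names are illustrative; the exact figure du reports varies slightly with pool geometry, but it tracks the file size, not the recordsize):

$ zfs create -o recordsize=1M tank/demo
$ dd if=/dev/urandom of=/tank/demo/small.bin bs=4k count=1        # write a single 4KB file
$ sync
$ du -h /tank/demo/small.bin                                      # on the order of 4K, not 1M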
When recordsize matters: Recordsize primarily impacts I/O efficiency, not space usage. There are two key workload patterns to consider:
Workload Pattern 1: Whole-File Access
Application reads/writes entire files at once
Examples: Media files, archives, backups, ML datasets
Large recordsize (e.g., 1MB) benefits:
✓ Better compression ratio (more data per compression unit)
✓ Fewer metadata operations (one record vs many)
✓ More efficient sequential I/O
✓ Small files still allocate correctly (adaptive)
Recommendation: Use large recordsize (512KB - 1MB)
$ zfs set recordsize=1M tank/media
Workload Pattern 2: Random/Partial Access
Application reads/writes from middle of files
Examples: Databases, random-access data structures, VMs
Problem with large recordsize:
Read 4KB at offset 50KB from file:
- ZFS must read entire 128KB record from disk
- Then return requested 4KB to application
- Wasted I/O if accessing random locations
Write 4KB at offset 50KB:
- ZFS must read 128KB record (read-modify-write)
- Modify 4KB portion
- Write entire 128KB back (copy-on-write)
- Extra I/O amplification
Benefits of matching recordsize to the application's I/O size:
✓ Reads and writes map one-to-one onto ZFS records
✓ No read-modify-write cycles
✓ Predictable I/O patterns
Recommendation: Match recordsize to access pattern
Databases with 8KB pages: $ zfs set recordsize=8K tank/db
VMs with 4KB blocks: $ zfs set recordsize=4K tank/vms
Default 128KB is good for general mixed workloads
Compression interaction: Larger recordsize improves compression ratios because compression algorithms work on whole records: a 1MB record gives the compressor more context than a 128KB record, even for the same data. However, reading any part of a compressed record requires decompressing the entire record, so random access patterns may perform worse with a large recordsize despite the space savings.
# When to Choose ZFS
- Data integrity critical: Archive systems, long-term storage, media libraries where bit rot detection matters. Checksumming catches corruption that most other filesystems silently pass through, and periodic scrubs re-verify every block against its checksum.
- Snapshot workflows: Backup targets benefiting from incremental send/receive. Development environments needing instant rollback. VM/container storage with snapshot-based provisioning.
- Integrated RAID: Avoiding LVM/mdadm complexity. Home servers or NAS appliances where ZFS's all-in-one approach simplifies management (a short example follows this list).
- Avoid when: The system has very limited RAM (under ~4GB for typical workloads), you need maximum random-write performance (databases may prefer XFS/ext4), or policy precludes loading third-party kernel modules.
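As a concrete sketch of the all-in-one approach (device, pool, and dataset names are illustrative): one command builds a double-parity pool, datasets take the place of separate LVM volumes, and a periodic scrub exercises the checksums behind the data-integrity case.

$ zpool create tank raidz2 /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf
$ zfs create -o compression=lz4 tank/archive        # dataset with its own properties, no mkfs or fstab edits
$ zpool scrub tank                                  # re-read and verify every block against its checksum
$ zpool status -v tank                              # scrub progress, plus any detected or repaired errors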