
Filesystem Hacks

Unconventional filesystem techniques that solve real performance and deployment problems. These aren't theoretical edge cases—they're battle-tested approaches used in production systems.

squashfs: Reducing I/O Bottlenecks

squashfs is a compressed read-only filesystem. While commonly used for LiveCDs and snap packages, its most interesting applications exploit its ability to transform thousands of small file operations into efficient block reads.

Technique 1: squashfs on NFS. Compiler Explorer faced a challenging performance problem: serving 2,000+ compiler toolchains over NFS. Each compilation triggered thousands of file accesses (headers, libraries, binaries), and even with aggressive caching, NFS metadata operations created severe latency. Their solution: pack each compiler into a squashfs image, store the images on NFS, and mount them via loop devices. This "launders" away the NFS-ness—the squashfs driver sees a local block device and caches blocks normally, without constant metadata validation round trips. Reference: CEFS blog post

Traditional NFS (slow):                squashfs on NFS (fast):
[Client] ---stat header.h--> [NFS]     [Client] ---read squashfs--> [NFS]
         ---open header.h--> [NFS]              (decompress locally)
         ---read header.h--> [NFS]              [headers cached in RAM]
         ... thousands more ...                 minimal network round trips
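
A minimal sketch of the pattern, assuming squashfs-tools is installed (the compiler path, image name, and NFS mount point are placeholders, not Compiler Explorer's actual layout):

  # Pack one toolchain into a compressed, read-only image
  # (-comp zstd needs zstd support in squashfs-tools; omit it for the gzip default)
  mksquashfs /opt/compilers/gcc-13.2 gcc-13.2.sqfs -comp zstd -all-root

  # Publish the image on the NFS share
  cp gcc-13.2.sqfs /mnt/nfs/images/

  # On each client: loop-mount the image straight from NFS
  sudo mkdir -p /opt/compilers/gcc-13.2
  sudo mount -t squashfs -o loop,ro /mnt/nfs/images/gcc-13.2.sqfs /opt/compilers/gcc-13.2

After the mount, decompressed blocks sit in the client's page cache, so repeated header lookups never leave the machine.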

Technique 2: squashfs in RAM for Python imports. Python's import system stats dozens of potential locations for each module. For large environments (anaconda, deep learning frameworks), this creates a "stat storm" that hammers filesystems. Packing the entire environment into squashfs and mounting it on tmpfs (in RAM) eliminates this overhead—all file metadata is instantly available from memory.
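
A sketch of the same idea for a Python environment (the environment path is a placeholder, and numpy stands in for whatever you actually import); mounting over the original path keeps any hard-coded prefixes inside the environment working:

  # Pack the whole environment into one compressed image
  mksquashfs ~/miniconda3/envs/ml /tmp/ml-env.sqfs -comp zstd

  # Put the image itself in RAM, then loop-mount it over the original path
  cp /tmp/ml-env.sqfs /dev/shm/
  sudo mount -t squashfs -o loop,ro /dev/shm/ml-env.sqfs ~/miniconda3/envs/ml

  # Imports now resolve against in-memory metadata: no stat storm on disk or NFS
  ~/miniconda3/envs/ml/bin/python -c "import numpy"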

tmpfs: Isolating I/O from Performance Testing

tmpfs is a memory-backed filesystem; most distributions already mount an instance at /dev/shm (and often /run). Data lives in RAM and can spill to swap, making it extremely fast but volatile. Beyond the obvious use for /tmp, tmpfs is invaluable for performance testing and fast build directories.

Performance testing workflow: When benchmarking code that does file I/O, use tmpfs to eliminate disk as a variable. If your test is still slow on tmpfs, you know the bottleneck isn't I/O; it's your code. Example: dd if=/dev/zero of=/dev/shm/test bs=1M count=1024 writes at memory speed; pointing a benchmark's output files at tmpfs the same way lets you exercise serialization logic without disk interference.

Build acceleration: Large builds generate enormous numbers of intermediate files (.o objects, .d dependency files). Keeping build directories on tmpfs can dramatically speed up builds, especially on systems with slow disks or network-mounted home directories.
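
A sketch of both uses with a dedicated tmpfs instance; the 8G cap, mount point, benchmark binary, and CMake project are placeholders:

  # Mount a tmpfs with an explicit size cap (writes fail with ENOSPC instead of eating all RAM)
  sudo mkdir -p /mnt/ramdisk
  sudo mount -t tmpfs -o size=8G tmpfs /mnt/ramdisk

  # Benchmark I/O-heavy code against memory instead of disk
  ./my-benchmark --output-dir /mnt/ramdisk

  # Keep build intermediates in RAM (here assuming a CMake project)
  cmake -S . -B /mnt/ramdisk/build && cmake --build /mnt/ramdisk/build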

⚠ ramfs vs tmpfs

There's also ramfs, which is similar to tmpfs but doesn't respect size limits and never swaps. It's slightly faster (no size accounting overhead) but dangerous—accidentally writing too much will OOM-kill your system. Stick with tmpfs unless you absolutely need guaranteed RAM speed and can control size precisely. ramfs is a sharp tool that can cut you.

Loop Devices: Files as Block Devices

Loop devices let you mount a file as if it were a block device. This is fundamental for working with disk images, testing filesystems, and modern package formats.

Common uses: Testing filesystem behavior without dedicating real partitions (dd if=/dev/zero of=test.img bs=1M count=1024 && mkfs.ext4 test.img && mount -o loop test.img /mnt). Creating VM disk images. Mounting ISO images without burning them to physical media. Ubuntu's snap packages are squashfs images mounted as loop devices—when you install a snap, you're creating a loop mount.

File on disk:                  Treated as block device:
+------------------+           +--------------------+
| disk.img         | mount -o  | /dev/loop0         |
| (just a file)    | --------> | (block device)     |
+------------------+   loop    +--------------------+
                                         |
                                         v
                                +------------------+
                                | /mnt/test        |
                                | (filesystem)     |
                                +------------------+

Encrypted filesystems: Loop devices combined with dm-crypt allow file-based encryption. The loop device presents the encrypted file as a block device, dm-crypt decrypts it, and you mount the result—all transparent to applications.
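
A sketch using LUKS via cryptsetup, which on current versions attaches the loop device for you when given a regular file (the container size, names, and mount point are arbitrary):

  # Create a backing file and format it as a LUKS container
  dd if=/dev/zero of=vault.img bs=1M count=512
  sudo cryptsetup luksFormat vault.img

  # Unlock: the decrypted view appears as /dev/mapper/vault
  sudo cryptsetup open vault.img vault

  # From here it is an ordinary block device
  sudo mkfs.ext4 /dev/mapper/vault
  sudo mkdir -p /mnt/vault
  sudo mount /dev/mapper/vault /mnt/vault

  # Lock it again
  sudo umount /mnt/vault
  sudo cryptsetup close vault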

Bind Mounts: Same Directory, Multiple Locations

Bind mounts (mount --bind /source /target) make the same directory tree appear in multiple locations. Unlike symlinks, bind mounts work at the VFS layer and are invisible to applications—programs see normal files, not links.

Chroot and containers: When you create a chroot environment or container, you need to expose certain host directories inside. Docker and systemd-nspawn use bind mounts extensively—mounting /proc, /sys, /dev, and user-specified volumes into the container's filesystem namespace.

NixOS sandboxing: Nix builds packages in isolated environments. The Nix store (/nix/store) is bind-mounted into build sandboxes along with specific dependencies, ensuring builds can't access undeclared dependencies. nix-user-chroot uses bind mounts to run Nix without root privileges.

Before:                        After: mount --bind /data /chroot/data

/data/                         Both paths point to same files:
  file1.txt                    /data/file1.txt --------+
  file2.txt                                            |
                                                       | (same inode)
/chroot/                                               |
  (empty)                      /chroot/data/file1.txt -+
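
The same picture in commands, including the common trick of remounting the bound view read-only without touching the original (paths are the ones from the diagram):

  # Expose /data inside the chroot
  sudo mkdir -p /chroot/data
  sudo mount --bind /data /chroot/data

  # Optionally make the bound view read-only
  sudo mount -o remount,bind,ro /chroot/data

  # Both paths now resolve to the same inode
  stat -c '%i %n' /data/file1.txt /chroot/data/file1.txt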

Sparse Files: Huge Files, Small Space

Sparse files contain "holes"—ranges of zeros that don't consume disk space. When you read a hole, the filesystem returns zeros; when you write to a hole, it allocates real blocks. This is fundamental for virtual machine disk images and database files.

Creating sparse files: dd if=/dev/zero of=sparse.img bs=1 count=0 seek=10G creates a 10GB file that uses almost no disk space. As the VM writes data, blocks are allocated on demand.
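
truncate does the same thing and is easier to remember; either way, ls reports the logical size while du reports the blocks actually allocated:

  truncate -s 10G sparse.img   # equivalent to the dd seek trick
  ls -lh sparse.img            # logical size: 10G
  du -h sparse.img             # allocated: ~0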

Logical view:                  Physical reality:
0GB  [====]  data              Disk: [====]       (only 100MB used)
100MB [    ]  hole (zeros)
...
10GB [====]  data              Disk:      [====]  (only 100MB used)

ls -lh: 10GB                   du -h: 200MB

Gotchas: When sparse files materialize. Sparse files can accidentally expand to full size:

  • cp sparse.img copy.img relies on cp's default --sparse=auto heuristic; use --sparse=always to guarantee holes are preserved (--sparse=never expands the copy to the full 10GB)
  • dd if=sparse.img of=copy.img without conv=sparse writes every block, zeros included, expanding to 10GB
  • Writing a sparse file to a block device (USB drive, SD card) materializes all holes: 10GB logical becomes 10GB physical
  • Some backup tools don't preserve sparseness by default (check rsync --sparse, tar -S); the sparse-preserving invocations are sketched below
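
For reference, the sparse-preserving invocations (the backup host is a placeholder):

  cp --sparse=always sparse.img copy.img      # force holes in the copy
  dd if=sparse.img of=copy.img conv=sparse    # skip writing zero blocks
  rsync --sparse sparse.img backup-host:/vm/  # keep holes over the wire
  tar -Scf backup.tar sparse.img              # -S detects and preserves holes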

FUSE: Filesystems in Userspace

FUSE (Filesystem in Userspace) lets you implement filesystems as regular programs instead of kernel modules. The kernel FUSE module passes filesystem operations (open, read, write) to a userspace daemon, which handles them however it wants.

[Application] ---> [Kernel VFS] ---> [FUSE kernel module]
                                              |
                                              v
                                     [Userspace daemon]
                                              |
                                              v
                                      [Actual storage]

Real-world examples:

  • sshfs: Mount remote directories over SSH. No special server setup is needed: if you can SSH, you can mount. Perfect for development against remote machines (typical usage is sketched after this list).
  • rclone mount: Mount cloud storage (S3, Google Drive, Dropbox) as a local filesystem. Reads and writes go to the cloud transparently.
  • encfs: Encrypted filesystem where encrypted files live in a regular directory and the decrypted view is mounted elsewhere. Simpler than dm-crypt for file-level encryption.
  • Custom filesystems: Expose databases as directories (each row is a file), implement virtual filesystems for specific applications, or create read-only views of complex data structures.
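
Typical sshfs and rclone usage, as referenced in the list above (host, remote name, and paths are placeholders):

  # Mount a remote project tree over SSH, work on it, then unmount
  mkdir -p ~/mnt/project
  sshfs dev@buildhost:/srv/project ~/mnt/project
  ls ~/mnt/project
  fusermount -u ~/mnt/project   # fusermount3 -u on some distros

  # Mount a configured rclone remote in the background
  mkdir -p ~/mnt/bucket
  rclone mount mycloud:backups ~/mnt/bucket --daemon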

Trade-off: FUSE is much easier to develop than kernel modules (write in any language, debug with normal tools, crashes don't panic the kernel), but there's performance overhead from userspace context switches. For many applications, the convenience is worth it.