Essential Linux Filesystem Semantics (Draft)

Linux provides several atomic filesystem operations that are crucial for building robust systems. Understanding these semantics—and their limitations—is essential for reliable software.

# Atomic Rename

The rename() system call (which mv uses when source and destination are on the same filesystem) atomically replaces the target file if it exists. The replacement is atomic from the perspective of other processes: anyone opening the target path gets either the complete old file or the complete new one, never a partial or missing file.

Common use case: Safe configuration file updates. Write new config to a temporary file, then rename it over the existing config. Readers never see a half-written file, even if the writer crashes mid-update.

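// Safe config update pattern (most error handling omitted for brevity)
FILE *tmp = fopen("/etc/myapp/config.tmp", "w");
if (tmp == NULL) { perror("fopen"); return -1; }
fprintf(tmp, "new_config_data");
fflush(tmp);            // stdio buffer -> kernel
fsync(fileno(tmp));     // kernel page cache -> stable storage
fclose(tmp);
rename("/etc/myapp/config.tmp", "/etc/myapp/config");  // Atomic swap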

Important limitation: rename() only works within the same filesystem. You cannot atomically rename across filesystem boundaries (e.g., from /tmp on tmpfs to /var on ext4). The system call will fail with EXDEV (cross-device link).
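The pattern above can be sketched more completely as follows. The helper names `write_all` and `replace_file_atomically` and their parameters are illustrative assumptions, not part of any library. The sketch creates the temporary file in the same directory as the target, which keeps both names on one filesystem and so avoids EXDEV. It also fsyncs the directory afterwards so that the rename itself is durable.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Write all `len` bytes, retrying on partial writes and EINTR. */
static int write_all(int fd, const void *data, size_t len)
{
    const char *p = data;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Hypothetical helper: atomically replace dir/name with `len` bytes of data.
 * Returns 0 on success, -1 on failure with errno set. */
int replace_file_atomically(const char *dir, const char *name,
                            const void *data, size_t len)
{
    char tmp[PATH_MAX], dst[PATH_MAX];
    if (snprintf(tmp, sizeof tmp, "%s/.%s.XXXXXX", dir, name) >= (int)sizeof tmp ||
        snprintf(dst, sizeof dst, "%s/%s", dir, name) >= (int)sizeof dst) {
        errno = ENAMETOOLONG;
        return -1;
    }

    int fd = mkstemp(tmp);                /* unique temp file next to the target */
    if (fd < 0)
        return -1;

    /* Contents must be on stable storage before the name is swapped. */
    int ok = write_all(fd, data, len) == 0 && fsync(fd) == 0;
    int saved = errno;
    if (close(fd) < 0 && ok) { ok = 0; saved = errno; }
    if (ok && rename(tmp, dst) < 0) { ok = 0; saved = errno; }  /* the atomic step */
    if (!ok) {
        unlink(tmp);                      /* do not leave the temp file behind */
        errno = saved;
        return -1;
    }

    /* Make the new directory entry durable as well. */
    int dfd = open(dir, O_RDONLY | O_DIRECTORY);
    if (dfd < 0)
        return -1;
    int rc = fsync(dfd);
    saved = errno;
    close(dfd);
    errno = saved;
    return rc;
}
```

A call such as `replace_file_atomically("/etc/myapp", "config", buf, len)` would mirror the example above. Using mkstemp() instead of a fixed temporary name also keeps two concurrent writers from trampling each other's temp file.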

Code reference: fs/namei.c:vfs_rename() in the kernel.

NFS caveat: On network filesystems, rename() is still atomic on the server, but your data may not have reached the server yet. You must explicitly fflush() + fsync() before renaming to ensure the file contents are on stable storage. See the NFS close-to-open deep dive for the full protocol breakdown and RPC costs.

# Unlink with Open File Descriptors

On POSIX systems, unlinking (deleting) a file doesn't immediately remove it from disk if processes still have it open. The file's directory entry is removed, making it invisible to new lookups, but the inode and data blocks remain until the last file descriptor is closed.

This provides a powerful pattern for temporary files with automatic cleanup:

int fd = open("/tmp/workfile", O_CREAT | O_EXCL | O_RDWR, 0600);  // O_EXCL: refuse a pre-existing file or symlink
unlink("/tmp/workfile");  // Remove directory entry immediately
// File still accessible via 'fd'
// ... work with file ...
close(fd);  // Now file is truly deleted

Benefits: If your process crashes, the file is automatically cleaned up when the kernel closes all file descriptors during process termination. No orphaned temp files.

You can see these "deleted" files in /proc/<pid>/fd/. They appear with a "(deleted)" suffix when you use ls -l on the symlinks.
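To make the behavior concrete, here is a minimal sketch that unlinks a file and then keeps using it. The function name `unlinked_file_roundtrip` and the `/tmp/unlink-demo-` prefix are made up for illustration. It uses mkstemp() to pick a unique name, then checks via fstat() that the link count has dropped to zero while the descriptor still reads and writes normally.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical demo: create a temp file, unlink it, then write `msg` and read
 * it back through the still-open descriptor. Stores the link count observed
 * after unlink() in *nlink_after. Returns bytes read back, or -1 on error. */
ssize_t unlinked_file_roundtrip(const char *msg, char *out, size_t outsz,
                                nlink_t *nlink_after)
{
    char path[] = "/tmp/unlink-demo-XXXXXX";
    int fd = mkstemp(path);               /* unique name, created exclusively */
    if (fd < 0)
        return -1;
    if (unlink(path) < 0) {               /* the name is gone; the inode lives on */
        close(fd);
        return -1;
    }

    struct stat st;
    ssize_t n = -1;
    if (fstat(fd, &st) == 0) {
        *nlink_after = st.st_nlink;       /* no directory entry refers to it now */
        size_t len = strlen(msg);
        if (write(fd, msg, len) == (ssize_t)len)
            n = pread(fd, out, outsz, 0); /* read back from offset 0 */
    }
    close(fd);                            /* last reference: storage is released */
    return n;
}
```

While the function is between unlink() and close(), the descriptor would show up under /proc/<pid>/fd/ with the "(deleted)" suffix described above.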

Code reference: fs/namei.c:vfs_unlink() removes the directory entry. The inode's i_nlink counter tracks hard links, while i_count tracks in-kernel references such as open files. Actual deletion occurs in iput_final() (fs/inode.c), once the link count is zero and the last reference is dropped.

# O_TMPFILE: Unnamed Inodes

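Modern Linux (3.11+) provides O_TMPFILE, which creates an unnamed inode with no directory entry from the start. This is cleaner than the open-then-unlink pattern because there is no window in which the file has a name. The underlying filesystem must support it; otherwise open() fails with EOPNOTSUPP.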

int fd = open("/tmp", O_TMPFILE | O_RDWR, 0600);  // path names a directory; needs _GNU_SOURCE
// File exists on disk but has no name, completely invisible
// ... work with file ...
// Optional: link it into the filesystem
char path[64];
snprintf(path, sizeof path, "/proc/self/fd/%d", fd);
linkat(AT_FDCWD, path, AT_FDCWD, "/tmp/result", AT_SYMLINK_FOLLOW);
close(fd);

Use cases: Build files atomically (write to unnamed temp, link into place when complete; note that linkat() fails with EEXIST rather than replacing an existing name), secure temporary storage (the file is never visible in directory listings).
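The "write unnamed, then link into place" use case might look like the sketch below. `publish_file` is a hypothetical helper, and the fallback to a named mkstemp() file when O_TMPFILE is unavailable is an added assumption about how one might handle unsupported filesystems, not something the kernel does for you. It also assumes /proc is mounted.

```c
#define _GNU_SOURCE                       /* for O_TMPFILE */
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: write `len` bytes and make them appear at dir/name
 * only once complete. Fails with EEXIST if the name is already taken.
 * Returns 0 on success, -1 on failure with errno set. */
int publish_file(const char *dir, const char *name, const void *data, size_t len)
{
    char dst[PATH_MAX], src[PATH_MAX];
    int named = 0;                        /* 1 if we fell back to a named temp file */
    if (snprintf(dst, sizeof dst, "%s/%s", dir, name) >= (int)sizeof dst) {
        errno = ENAMETOOLONG;
        return -1;
    }

    int fd = open(dir, O_TMPFILE | O_WRONLY, 0644);
    if (fd >= 0) {
        snprintf(src, sizeof src, "/proc/self/fd/%d", fd);
    } else if (errno == EOPNOTSUPP || errno == EISDIR) {
        /* O_TMPFILE unavailable: fall back to a named temp file. */
        if (snprintf(src, sizeof src, "%s/.%s.XXXXXX", dir, name) >= (int)sizeof src) {
            errno = ENAMETOOLONG;
            return -1;
        }
        fd = mkstemp(src);
        named = 1;
    }
    if (fd < 0)
        return -1;

    int rc = -1;
    ssize_t n = write(fd, data, len);
    if (n >= 0 && (size_t)n != len)
        errno = EIO;                      /* sketch: treat a short write as failure */
    else if (n >= 0 && fsync(fd) == 0)
        rc = linkat(AT_FDCWD, src, AT_FDCWD, dst, named ? 0 : AT_SYMLINK_FOLLOW);

    int saved = errno;
    if (named)
        unlink(src);                      /* drop the temporary name, if any */
    close(fd);
    errno = saved;
    return rc;
}
```

Because the final step is a link rather than a rename, a second `publish_file()` to the same name fails instead of overwriting. When replacement is wanted, the rename() pattern from the first section is the better fit.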

# File Locking

Linux provides two primary file locking mechanisms: flock() and POSIX fcntl() locks. They have different semantics and use cases.

flock() - BSD-style locks:

Locks the entire file (no byte-range locking)
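Associated with the open file description (shared with dup()ed descriptors and forked children)
Released on explicit LOCK_UN, or when the last descriptor referring to that open file description is closed
Simple, easy to reason about
Over NFS, emulated with fcntl() byte-range locks since Linux 2.6.12; older kernels do not lock over NFS at all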

int fd = open("datafile", O_RDWR);
flock(fd, LOCK_EX);  // Exclusive lock
// ... critical section ...
flock(fd, LOCK_UN);  // Unlock
close(fd);
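One consequence of tying the lock to the open file description is that two independent open() calls on the same file conflict, even inside a single process. The sketch below shows this with a non-blocking attempt; the function name `flock_conflicts_across_opens` is illustrative only, and the behavior described assumes a local filesystem.

```c
#define _DEFAULT_SOURCE                   /* for flock() */
#include <errno.h>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/* Hypothetical demo: returns 1 if a second, independent open() of `path` is
 * refused an exclusive flock() while the first one holds it, 0 if the second
 * lock is granted, -1 on unexpected errors. */
int flock_conflicts_across_opens(const char *path)
{
    int a = open(path, O_RDWR | O_CREAT, 0600);
    int b = open(path, O_RDWR);           /* separate open file description */
    int result = -1;

    if (a >= 0 && b >= 0 && flock(a, LOCK_EX) == 0) {
        if (flock(b, LOCK_EX | LOCK_NB) == 0)
            result = 0;                   /* granted: no conflict observed */
        else if (errno == EWOULDBLOCK)
            result = 1;                   /* refused: `a` already holds the lock */
    }
    if (b >= 0)
        close(b);
    if (a >= 0)
        close(a);                         /* closing `a` releases its lock */
    return result;
}
```

LOCK_NB turns the second request into a try-lock. Without it, the call would block forever, since the same thread holds the conflicting lock.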

fcntl() - POSIX locks:

Byte-range locking (can lock portions of a file)
Associated with process, not file descriptor
Released when any file descriptor to the file is closed by that process
More complex semantics, easier to misuse
Better NFS support (NLM protocol with NFSv3; built into the protocol itself in NFSv4)

struct flock fl;
fl.l_type = F_WRLCK;    // Write lock
fl.l_whence = SEEK_SET;
fl.l_start = 0;         // Offset
fl.l_len = 0;           // 0 means lock entire file
fcntl(fd, F_SETLKW, &fl);  // Block until lock acquired
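The "released when any file descriptor is closed" rule is the classic fcntl() trap, and it is easy to demonstrate. In the sketch below, both function names are hypothetical. A forked child probes the lock with F_GETLK, because F_GETLK never reports locks held by the calling process itself.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

/* Ask from a child process whether a write lock on `path` would be blocked.
 * Returns 1 if another process holds a conflicting lock, 0 if not, -1 on error. */
static int locked_by_other(const char *path)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
        int fd = open(path, O_RDWR);
        if (fd < 0 || fcntl(fd, F_GETLK, &fl) < 0)
            _exit(2);
        _exit(fl.l_type == F_UNLCK ? 0 : 1);
    }
    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status) || WEXITSTATUS(status) > 1)
        return -1;
    return WEXITSTATUS(status);
}

/* Hypothetical demo: take a whole-file write lock, then open and close a
 * second descriptor for the same file. Records whether other processes saw
 * the lock before and after that close(). Returns 0 on success, -1 on error. */
int fcntl_lock_lost_on_close(const char *path, int *before, int *after)
{
    /* l_start = 0 and l_len = 0 (zero-initialized): lock the whole file. */
    struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;
    if (fcntl(fd, F_SETLK, &fl) < 0) {
        close(fd);
        return -1;
    }

    *before = locked_by_other(path);

    int other = open(path, O_RDONLY);     /* e.g. a library peeking at the file */
    if (other >= 0)
        close(other);                     /* drops this process's locks on the file */

    *after = locked_by_other(path);
    close(fd);
    return (other >= 0 && *before >= 0 && *after >= 0) ? 0 : -1;
}
```

On a local filesystem, `*before` should come back as 1 and `*after` as 0: the lock disappears even though `fd`, the descriptor used to take it, is still open.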

Advisory vs Mandatory: Both flock() and fcntl() locks are advisory on Linux: they don't prevent processes from reading or writing a locked file unless those processes also take the lock. Mandatory locking (for fcntl() locks only) used to exist but was unreliable, and support was removed entirely in Linux 5.15.

NFS behavior: File locking over NFS is complex. flock() may be emulated using fcntl() locks. Lock state can be lost if the NFS server reboots. Modern NFSv4 has better lock recovery, but edge cases remain.

# Sparse Files and Hole Punching

Sparse files contain "holes"—regions that logically contain zeros but don't consume disk space. The filesystem only allocates blocks for written data.

Creating a sparse file:

int fd = open("sparse", O_CREAT | O_WRONLY, 0644);
lseek(fd, 1024*1024*1024, SEEK_SET);  // Seek 1 GiB ahead
write(fd, "x", 1);                     // Write single byte
close(fd);
// File appears as 1 GiB + 1 byte but typically uses only one filesystem block
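The gap between apparent size and actual usage can be checked with fstat(): st_size is the logical length, while st_blocks counts 512-byte units actually allocated. The following is a small sketch; the helper name `sparse_file_usage` is made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create a file with a single byte at `offset`, then
 * report its logical size and the bytes actually allocated on disk.
 * Returns 0 on success, -1 on error. */
int sparse_file_usage(const char *path, off_t offset,
                      off_t *logical, off_t *allocated)
{
    int fd = open(path, O_CREAT | O_TRUNC | O_WRONLY, 0644);
    if (fd < 0)
        return -1;

    struct stat st;
    int rc = -1;
    if (pwrite(fd, "x", 1, offset) == 1 && fstat(fd, &st) == 0) {
        *logical = st.st_size;                   /* offset + 1 */
        *allocated = (off_t)st.st_blocks * 512;  /* st_blocks is in 512-byte units */
        rc = 0;
    }
    close(fd);
    return rc;
}
```

This is the same comparison that `ls -l` (logical size) and `du` (allocated size) make from the shell. The exact allocated figure depends on the filesystem's block size and allocation strategy.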

The fallocate() system call provides several operations (support varies by filesystem; unsupported modes fail with EOPNOTSUPP):

Preallocate space: fallocate(fd, 0, offset, len) - allocate blocks without writing zeros, useful for reserving space for a database file
Punch holes: fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, offset, len) - deallocate blocks in the middle of a file, creating sparseness
Zero range: fallocate(fd, FALLOC_FL_ZERO_RANGE, offset, len) - write zeros efficiently (may or may not deallocate blocks)
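As an illustration of the hole-punching mode, the sketch below writes 3 MiB of non-zero data, punches out the middle 1 MiB, and confirms that the range reads back as zeros while the file size stays the same. The function name `punch_middle` and the 1 MiB chunk size are arbitrary choices for the example; the chunk is assumed to be a multiple of the filesystem block size.

```c
#define _GNU_SOURCE                       /* for fallocate() and FALLOC_FL_* */
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

enum { CHUNK = 1 << 20 };                 /* 1 MiB */

/* Hypothetical demo: returns 0 if the punched range reads as zeros and the
 * file size is unchanged, -1 otherwise (errno is EOPNOTSUPP where the
 * filesystem cannot punch holes). Block counts are in 512-byte units. */
int punch_middle(const char *path, off_t *blocks_before, off_t *blocks_after)
{
    static char buf[CHUNK];
    struct stat st;
    int rc = -1;
    int fd = open(path, O_CREAT | O_TRUNC | O_RDWR, 0600);
    if (fd < 0)
        return -1;

    memset(buf, 0xAB, sizeof buf);
    for (int i = 0; i < 3; i++)
        if (write(fd, buf, sizeof buf) != (ssize_t)sizeof buf)
            goto out;
    if (fsync(fd) < 0 || fstat(fd, &st) < 0)
        goto out;
    *blocks_before = st.st_blocks;

    /* Deallocate the middle chunk; KEEP_SIZE is required with PUNCH_HOLE. */
    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, CHUNK, CHUNK) < 0)
        goto out;
    if (fstat(fd, &st) < 0 || st.st_size != 3 * CHUNK)
        goto out;
    *blocks_after = st.st_blocks;

    if (pread(fd, buf, sizeof buf, CHUNK) != (ssize_t)sizeof buf)
        goto out;
    rc = 0;
    for (size_t i = 0; i < sizeof buf; i++)
        if (buf[i] != 0) {                /* the hole must read back as zeros */
            rc = -1;
            break;
        }
out:
    {
        int saved = errno;
        close(fd);
        errno = saved;
    }
    return rc;
}
```

On filesystems that support the operation, `*blocks_after` would normally be smaller than `*blocks_before` by roughly the punched amount, though the exact numbers are filesystem-specific.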

Finding holes efficiently: SEEK_HOLE and SEEK_DATA (added in Linux 3.1) allow seeking to the next hole or data region without reading the entire file.

off_t data = lseek(fd, 0, SEEK_DATA);  // Find first data; -1 with ENXIO if there is none
off_t hole = lseek(fd, data, SEEK_HOLE);  // Find next hole after that data (end of file counts as a hole)
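Repeating those two calls in a loop walks every data region in the file. The following is a minimal sketch; the helper name `list_data_regions` and its output format are illustrative assumptions. Note that a filesystem without native support may simply report the whole file as one data region.

```c
#define _GNU_SOURCE                       /* for SEEK_DATA / SEEK_HOLE */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: walk the data regions of `fd`, print each one, and
 * add up their lengths in *data_bytes. Moves the file offset as a side
 * effect. Returns the number of regions found, or -1 on error. */
int list_data_regions(int fd, off_t *data_bytes)
{
    int regions = 0;
    off_t pos = 0;
    *data_bytes = 0;

    for (;;) {
        off_t data = lseek(fd, pos, SEEK_DATA);
        if (data < 0)                     /* ENXIO: no data at or after pos */
            return errno == ENXIO ? regions : -1;

        off_t hole = lseek(fd, data, SEEK_HOLE);  /* end of file counts as a hole */
        if (hole < 0)
            return -1;

        printf("data: [%lld, %lld)\n", (long long)data, (long long)hole);
        *data_bytes += hole - data;
        regions++;
        pos = hole;                       /* continue after this region */
    }
}
```

A sparse-aware copy loop would follow the same shape, copying only the `[data, hole)` ranges and seeking over everything else in the destination.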

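Tools like cp --sparse=always and rsync --sparse copy sparse files without inflating them. Depending on the tool and version, they find holes either through SEEK_HOLE/SEEK_DATA or by scanning the data for runs of zeros and skipping those writes.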