Essential Linux Filesystem Semantics
Linux provides several atomic filesystem operations that are crucial for building robust systems. Understanding these semantics—and their limitations—is essential for reliable software.
# Atomic Rename
The rename() system call (which mv uses when source and destination are on the same filesystem) atomically replaces a
target file if it exists. The replacement is atomic from the perspective of other processes: a lookup of the target name
always finds either the old file or the new one, never a partial or missing file.
Common use case: Safe configuration file updates. Write new config to a temporary file, then rename it over the existing config. Readers never see a half-written file, even if the writer crashes mid-update.
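// Safe config update pattern (needs <stdio.h>, <stdlib.h>, <unistd.h>)
FILE *tmp = fopen("/etc/myapp/config.tmp", "w");
if (tmp == NULL) { perror("fopen"); exit(EXIT_FAILURE); }
fprintf(tmp, "new_config_data");
fflush(tmp);          // Flush stdio buffers to the kernel
fsync(fileno(tmp));   // Flush kernel buffers to stable storage
fclose(tmp);
// Atomic swap: readers see either the old or the new config
if (rename("/etc/myapp/config.tmp", "/etc/myapp/config") != 0) perror("rename");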
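A fuller version of the pattern above, with error checking, might look like the following sketch. The function name atomic_replace and its parameters are hypothetical. It assumes the temporary file can live in the same directory as the target, and it additionally calls fsync() on the directory, which is what makes the new name itself durable across a power loss.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: atomically replace dir/name with `text`.
 * Returns 0 on success, -1 on failure with errno set. */
int atomic_replace(const char *dir, const char *name, const char *text)
{
    char tmp_path[4096], final_path[4096];
    int saved;
    int n1 = snprintf(tmp_path, sizeof tmp_path, "%s/.%s.tmp", dir, name);
    int n2 = snprintf(final_path, sizeof final_path, "%s/%s", dir, name);
    if (n1 < 0 || n2 < 0 ||
        (size_t)n1 >= sizeof tmp_path || (size_t)n2 >= sizeof final_path) {
        errno = ENAMETOOLONG;
        return -1;
    }

    /* Same directory as the target, so rename() stays on one filesystem. */
    int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    size_t left = strlen(text);
    while (left > 0) {
        ssize_t n = write(fd, text, left);
        if (n < 0) {
            if (errno == EINTR)
                continue;            /* interrupted: retry the write */
            goto fail;
        }
        text += n;
        left -= (size_t)n;
    }
    if (fsync(fd) != 0)              /* file data reaches stable storage */
        goto fail;
    if (close(fd) != 0) {
        fd = -1;
        goto fail;
    }
    fd = -1;
    if (rename(tmp_path, final_path) != 0)   /* the atomic swap */
        goto fail;

    /* Flush the directory so the renamed entry is durable too. */
    int dfd = open(dir, O_RDONLY);
    if (dfd < 0)
        return -1;
    int rc = fsync(dfd);
    saved = errno;
    close(dfd);
    errno = saved;
    return rc;

fail:
    saved = errno;                   /* keep the original error code */
    if (fd >= 0)
        close(fd);
    unlink(tmp_path);                /* do not leave a stale temp file */
    errno = saved;
    return -1;
}
```

A caller would use it as atomic_replace("/etc/myapp", "config", "new_config_data"). The hidden ".name.tmp" naming is only an illustration; concurrent writers would need unique temporary names.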
Important limitation: rename() only works within a single mounted filesystem.
You cannot atomically rename across filesystem boundaries (e.g., from /tmp on tmpfs to
/var on ext4). The system call fails with EXDEV (cross-device link). In that case mv falls back to copying and deleting, which is not atomic.
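One way to handle this limitation is sketched below. Both function names are hypothetical. The st_dev comparison is only a heuristic pre-check (bind mounts of one filesystem share a device ID yet rename() between them still fails), so checking errno for EXDEV after the call remains the authoritative test.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>

/* Heuristic pre-check: 1 if both paths report the same device ID,
 * 0 if they differ, -1 if stat() fails on either path. */
int same_device(const char *a, const char *b)
{
    struct stat sa, sb;
    if (stat(a, &sa) != 0 || stat(b, &sb) != 0)
        return -1;
    return sa.st_dev == sb.st_dev;
}

/* Returns 0 on success, 1 if rename() failed with EXDEV (the caller must
 * fall back to copy-then-rename inside the destination directory),
 * and -1 for any other error. */
int try_rename(const char *from, const char *to)
{
    if (rename(from, to) == 0)
        return 0;
    return errno == EXDEV ? 1 : -1;
}
```

In practice the simplest design is to avoid the problem entirely by creating the temporary file in the destination's own directory.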
Code reference: fs/namei.c:vfs_rename() in the kernel.
# Unlink with Open File Descriptors
On POSIX systems, unlinking (deleting) a file doesn't immediately remove it from disk if processes still have it open. The file's directory entry is removed, making it invisible to new lookups, but the inode and data blocks remain until the last file descriptor is closed.
This provides a powerful pattern for temporary files with automatic cleanup:
int fd = open("/tmp/workfile", O_CREAT | O_EXCL | O_RDWR, 0600); // O_EXCL: fail rather than reuse an existing file
unlink("/tmp/workfile"); // Remove directory entry immediately
// File still accessible via 'fd'
// ... work with file ...
close(fd); // Now file is truly deleted
Benefits: If your process crashes, the file is automatically cleaned up when the kernel closes all file descriptors during process termination. No orphaned temp files.
You can see these "deleted" files in /proc/<pid>/fd/. They appear with a
"(deleted)" suffix when you use ls -l on the symlinks.
Code reference: fs/namei.c:vfs_unlink() removes the directory entry. The inode's
i_nlink counter tracks hard links, while i_count tracks in-kernel references such as open files.
The inode is evicted from fs/inode.c:iput_final() once the link count is zero and the last reference is dropped.
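The behaviour described in this section can be observed from user space. The sketch below uses hypothetical helper names and assumes /proc is mounted: it unlinks a freshly created file, reports its link count through fstat(), and checks whether the /proc/self/fd symlink carries the "(deleted)" suffix.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Creates `path`, unlinks it at once, and returns the still-open
 * descriptor, or -1 on error. */
int open_unlinked(const char *path)
{
    int fd = open(path, O_CREAT | O_EXCL | O_RDWR, 0600);
    if (fd < 0)
        return -1;
    if (unlink(path) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Hard-link count of the open file (0 once unlinked), or -1 on error. */
long link_count(int fd)
{
    struct stat st;
    return fstat(fd, &st) == 0 ? (long)st.st_nlink : -1;
}

/* 1 if /proc/self/fd/<fd> points at a name ending in " (deleted)",
 * 0 if it does not, -1 on error (for example, /proc not mounted). */
int proc_shows_deleted(int fd)
{
    char link[64], target[4096];
    snprintf(link, sizeof link, "/proc/self/fd/%d", fd);
    ssize_t n = readlink(link, target, sizeof target - 1);
    if (n < 0)
        return -1;
    target[n] = '\0';
    const char *suffix = " (deleted)";
    size_t sl = strlen(suffix);
    return (size_t)n >= sl && strcmp(target + n - sl, suffix) == 0;
}
```

Reads and writes through the returned descriptor keep working until it is closed, even though link_count() reports zero.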
# O_TMPFILE: Unnamed Inodes
Modern Linux (3.11+) provides O_TMPFILE, which creates an unnamed inode with no directory
entry from the start. This is cleaner than the open-then-unlink pattern.
int fd = open("/tmp", O_TMPFILE | O_RDWR, 0600);
// File exists on disk but has no name, completely invisible
// ... work with file ...
// Optional: link it into the filesystem (requires /proc to be mounted)
char path[64];
snprintf(path, sizeof path, "/proc/self/fd/%d", fd); // bounded, unlike sprintf
linkat(AT_FDCWD, path, AT_FDCWD, "/tmp/result", AT_SYMLINK_FOLLOW);
close(fd);
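Use cases: Build files atomically (write to unnamed temp, link into place when complete), secure temporary storage (file is never visible in directory listings). Note that linkat() fails with EEXIST if the target name already exists, so unlike rename() it cannot replace an existing file.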
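The write-then-link use case could be sketched as follows. The helper name publish_via_tmpfile is hypothetical, and the code assumes that the filesystem holding `dir` supports O_TMPFILE and that /proc is mounted; callers should be ready to fall back to a named temporary file when open() fails.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* exposes O_TMPFILE in <fcntl.h> */
#endif
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: write `len` bytes to an unnamed file in `dir`,
 * then give it the name `final_path`. Returns 0 on success, -1 on
 * error with errno set. */
int publish_via_tmpfile(const char *dir, const char *final_path,
                        const char *data, size_t len)
{
    int saved;
    int fd = open(dir, O_TMPFILE | O_WRONLY, 0644);
    if (fd < 0)
        return -1;   /* e.g. EOPNOTSUPP or EISDIR when unsupported */

    while (len > 0) {
        ssize_t n = write(fd, data, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            goto fail;
        }
        data += n;
        len -= (size_t)n;
    }
    if (fsync(fd) != 0)
        goto fail;

    /* The file becomes visible only now, already complete. */
    char proc_path[64];
    snprintf(proc_path, sizeof proc_path, "/proc/self/fd/%d", fd);
    if (linkat(AT_FDCWD, proc_path, AT_FDCWD, final_path,
               AT_SYMLINK_FOLLOW) != 0)
        goto fail;   /* EEXIST if final_path already exists */

    return close(fd);

fail:
    saved = errno;
    close(fd);       /* the unnamed inode is freed automatically */
    errno = saved;
    return -1;
}
```

If any step fails, nothing needs cleaning up: closing the descriptor discards the unnamed file.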
# File Locking
Linux provides two primary file locking mechanisms: flock() and POSIX fcntl() locks.
They have different semantics and use cases.
flock() - BSD-style locks:
- Locks the entire file (no byte-range locking)
- Associated with open file description (shared by forked children)
- Released by an explicit LOCK_UN, or when every file descriptor referring to that open file description is closed
- Simple, easy to reason about
- May not work correctly over NFS (depends on server implementation)
int fd = open("datafile", O_RDWR);
flock(fd, LOCK_EX); // Exclusive lock
// ... critical section ...
flock(fd, LOCK_UN); // Unlock
close(fd);
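The snippet above blocks until the lock is granted. A non-blocking variant using LOCK_NB is sketched below; the helper name try_lock_exclusive is hypothetical. It also illustrates that the lock belongs to the open file description: a second open() of the same file conflicts, even within one process.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/* Hypothetical helper: open `path` and try to take an exclusive flock()
 * without blocking. Returns the locked descriptor, or -1 with errno set
 * (EWOULDBLOCK if another open file description holds the lock). */
int try_lock_exclusive(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;
    if (flock(fd, LOCK_EX | LOCK_NB) != 0) {
        int saved = errno;   /* close() may overwrite errno */
        close(fd);
        errno = saved;
        return -1;
    }
    return fd;
}
```

Closing the returned descriptor releases the lock, provided no dup()ed or forked copies of it remain open.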
fcntl() - POSIX locks:
- Byte-range locking (can lock portions of a file)
- Associated with process, not file descriptor
- Released when any file descriptor to the file is closed by that process
- More complex semantics, easier to misuse
- Better NFS support (via the NLM side protocol on NFSv2/v3; locking is part of the protocol itself in NFSv4)
struct flock fl;
fl.l_type = F_WRLCK;      // Write lock
fl.l_whence = SEEK_SET;
fl.l_start = 0;           // Offset
fl.l_len = 0;             // 0 means lock entire file
fcntl(fd, F_SETLKW, &fl); // Block until lock acquired
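The byte-range and "released on any close" semantics listed above can be demonstrated with the sketch below. Both helper names are hypothetical. Because fcntl() locks held by one process never conflict with each other, the check forks a child and asks from there whether the range is locked.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Set (F_RDLCK / F_WRLCK) or clear (F_UNLCK) a lock on bytes
 * [start, start + len) without blocking; len == 0 means "to end of file". */
int set_range_lock(int fd, short type, off_t start, off_t len)
{
    struct flock fl = {0};
    fl.l_type = type;
    fl.l_whence = SEEK_SET;
    fl.l_start = start;
    fl.l_len = len;
    return fcntl(fd, F_SETLK, &fl);
}

/* Forks a child that asks, via F_GETLK, whether it could write-lock the
 * range. Returns 1 if another process (here: the parent) holds a
 * conflicting lock, 0 if the range is free, -1 on error. */
int range_locked_by_parent(const char *path, off_t start, off_t len)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        struct flock fl = {0};
        int fd = open(path, O_RDWR);
        if (fd < 0)
            _exit(2);
        fl.l_type = F_WRLCK;
        fl.l_whence = SEEK_SET;
        fl.l_start = start;
        fl.l_len = len;
        if (fcntl(fd, F_GETLK, &fl) != 0)
            _exit(2);
        _exit(fl.l_type == F_UNLCK ? 0 : 1);
    }
    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status) == 2 ? -1 : WEXITSTATUS(status);
}
```

After set_range_lock(fd1, F_WRLCK, 0, 10), the child sees bytes 0-9 as locked and bytes 10-19 as free. If the parent then opens the same file a second time and closes that second descriptor, the child sees the lock on bytes 0-9 as gone, although fd1 is still open. Library code that opens and closes files behind the caller's back can drop locks this way.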
Advisory vs Mandatory: Both flock() and fcntl() locks are advisory
on Linux: they don't prevent processes from reading/writing locked files unless those processes
also check for locks. Mandatory locking (fcntl() locks only, opt-in per mount and per file) was unreliable, and support for it was removed in Linux 5.15.
NFS behavior: File locking over NFS is complex. Since Linux 2.6.12, the NFS client emulates flock()
with fcntl() byte-range locks on the whole file, so the two lock types can conflict there. Lock state can be lost if the NFS server reboots. Modern NFSv4 has
better lock recovery, but edge cases remain.
# Sparse Files and Hole Punching
Sparse files contain "holes"—regions that logically contain zeros but don't consume disk space. The filesystem only allocates blocks for written data.
Creating a sparse file:
int fd = open("sparse", O_CREAT | O_WRONLY, 0644);
lseek(fd, 1024*1024*1024, SEEK_SET); // Seek 1GB ahead
write(fd, "x", 1); // Write single byte
close(fd);
// File appears as 1 GiB (plus one byte) but typically occupies a single filesystem block
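To see the difference between apparent and allocated size, one can compare st_size with st_blocks from stat(). The sketch below uses hypothetical helper names and relies on Linux reporting st_blocks in 512-byte units.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Creates a file with a single byte at offset `size`, leaving a hole
 * in front of it. Returns 0 on success, -1 on error. */
int make_sparse(const char *path, off_t size)
{
    int fd = open(path, O_CREAT | O_TRUNC | O_WRONLY, 0644);
    if (fd < 0)
        return -1;
    int ok = lseek(fd, size, SEEK_SET) == size && write(fd, "x", 1) == 1;
    if (close(fd) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Reports the apparent size (st_size) and the space actually
 * allocated (st_blocks * 512). Returns 0 on success, -1 on error. */
int file_sizes(const char *path, off_t *apparent, off_t *allocated)
{
    struct stat st;
    if (stat(path, &st) != 0)
        return -1;
    *apparent = st.st_size;
    *allocated = (off_t)st.st_blocks * 512;
    return 0;
}
```

This is the same distinction that ls -l (apparent size) and du (allocated size) show on the command line.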
The fallocate() system call provides several operations:
- Preallocate space: fallocate(fd, 0, offset, len) allocates blocks without writing zeros, useful for reserving space for a database file
- Punch holes: fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, offset, len) deallocates blocks in the middle of a file, creating sparseness
- Zero range: fallocate(fd, FALLOC_FL_ZERO_RANGE, offset, len) writes zeros efficiently (may or may not deallocate blocks)
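As an illustration of the hole-punching operation, the sketch below wraps the call and adds a helper that verifies the punched range reads back as zeros. Both function names are hypothetical, and the call fails with EOPNOTSUPP on filesystems that do not implement hole punching.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* exposes fallocate() and FALLOC_FL_* in <fcntl.h> */
#endif
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Deallocates [offset, offset + len) while keeping the file size.
 * Returns 0 on success, -1 on error with errno set. */
int punch_hole(int fd, off_t offset, off_t len)
{
    return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                     offset, len);
}

/* 1 if every byte in [offset, offset + len) reads as zero, 0 if not,
 * -1 on a read error or if the range extends past end of file. */
int range_is_zero(int fd, off_t offset, off_t len)
{
    char buf[4096];
    while (len > 0) {
        size_t want = len < (off_t)sizeof buf ? (size_t)len : sizeof buf;
        ssize_t n = pread(fd, buf, want, offset);
        if (n <= 0)
            return -1;
        for (ssize_t i = 0; i < n; i++)
            if (buf[i] != 0)
                return 0;
        offset += n;
        len -= n;
    }
    return 1;
}
```

Aligning offset and len to the filesystem block size lets whole blocks be freed; partial blocks at the edges are zeroed but stay allocated.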
Finding holes efficiently: SEEK_HOLE and SEEK_DATA (added in Linux 3.1)
allow seeking to the next hole or data region without reading the entire file.
off_t data = lseek(fd, 0, SEEK_DATA);    // Find first data
off_t hole = lseek(fd, data, SEEK_HOLE); // Find next hole after that data
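Repeating those two calls walks every data region of a file. The sketch below, with hypothetical function names, sums the sizes of the data regions. lseek() reports ENXIO once no data remains past the given offset, which ends the loop. On filesystems without hole support the whole file is reported as a single data region.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* exposes SEEK_DATA and SEEK_HOLE */
#endif
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Sums the lengths of all data regions of `fd`. Moves the file offset.
 * Returns the byte count, or -1 on error. */
off_t count_data_bytes(int fd)
{
    off_t pos = 0, total = 0;
    for (;;) {
        off_t data = lseek(fd, pos, SEEK_DATA);
        if (data < 0) {
            if (errno == ENXIO)
                break;               /* no more data after pos */
            return -1;
        }
        /* Every file has an implicit hole at its end, so this succeeds. */
        off_t hole = lseek(fd, data, SEEK_HOLE);
        if (hole < 0)
            return -1;
        total += hole - data;
        pos = hole;
    }
    return total;
}

/* Convenience wrapper that opens `path` read-only. */
off_t count_data_bytes_in(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    off_t n = count_data_bytes(fd);
    close(fd);
    return n;
}
```

A copy tool built on this loop would read and write only the [data, hole) ranges and seek over everything else.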
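Tools like cp --sparse=always and rsync --sparse copy sparse files without inflating them.
They create holes in the destination by detecting runs of zero bytes, and recent versions of GNU cp also use SEEK_HOLE and SEEK_DATA to skip holes in the source.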