IO Modes: Buffered vs Direct IO
Understanding how data flows from application to disk—and the role of buffering—is critical for performance optimization and correctness.
# Buffered IO (Default)
By default, file IO goes through the page cache. Reads pull data from the cache if available;
writes update cache pages and return immediately (unless O_SYNC is specified).
Read path:
- Application calls read()
- VFS layer checks page cache for requested offset
- Cache hit: Copy from cache to user buffer, return immediately
- Cache miss: Trigger read-ahead, wait for IO, populate cache, copy to user buffer
Code: mm/filemap.c:generic_file_read_iter()
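A minimal user-space sketch of that path, assuming an existing file named datafile: a plain open()/pread() with no special flags is buffered IO, so re-reading the same range is normally served straight from the page cache.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    int fd = open("datafile", O_RDONLY);   /* buffered: no O_DIRECT */
    if (fd < 0) { perror("open"); return 1; }

    char buf[4096];
    /* First read of this range likely misses the cache: the kernel issues
     * device IO (plus read-ahead), populates the cache, and copies to buf. */
    if (pread(fd, buf, sizeof buf, 0) < 0) { perror("pread"); return 1; }

    /* Reading the same range again is normally a cache hit: a memory copy
     * from the page cache with no device IO. */
    (void)pread(fd, buf, sizeof buf, 0);

    close(fd);
    return 0;
}
```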
Write path:
- Application calls write()
- VFS layer finds or allocates page cache pages for the write range
- Copy data from user buffer to cache pages
- Mark pages dirty
- Return to application (write appears complete)
- Later: Writeback thread flushes dirty pages to disk
Code: mm/filemap.c:generic_perform_write()
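A matching sketch of the write side, again assuming a hypothetical datafile: write() returns as soon as the data has been copied into the page cache, well before it reaches disk.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    int fd = open("datafile", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    const char *msg = "hello, page cache\n";
    /* Copies msg into page cache pages, marks them dirty, and returns.
     * Writeback threads flush the dirty pages to disk later. */
    if (write(fd, msg, strlen(msg)) < 0) { perror("write"); return 1; }

    /* Here the data may still exist only in memory; a crash could lose it.
     * See the synchronous IO flags below for forcing it to stable storage. */
    close(fd);
    return 0;
}
```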
Performance characteristics:
- Excellent for sequential workloads (read-ahead hides latency)
- Repeated access to same data is very fast (cache hits)
- Writes are batched and coalesced, improving throughput
- No special alignment requirements
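Applications can nudge the read-ahead and cache-retention behavior from user space with posix_fadvise(). A sketch, assuming a hypothetical datafile that is scanned once and not reused:

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    int fd = open("datafile", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* Hint a sequential scan: the kernel may enlarge its read-ahead
     * window for this file. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    char buf[1 << 16];
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0) {
        /* process buf[0..n) */
    }

    /* The data won't be reused: ask the kernel to drop the now-clean
     * pages rather than let them crowd the page cache. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    close(fd);
    return 0;
}
```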
# Direct IO (O_DIRECT)
Opening a file with O_DIRECT bypasses the page cache: reads and writes move data directly between the application's buffer and the storage device.
Requirements:
- User buffer must be aligned to filesystem block size (usually 4KB)
- IO offset must be aligned to filesystem block size
- IO length must be a multiple of filesystem block size
- Violating alignment causes EINVAL errors
```c
#define _GNU_SOURCE                               /* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

void *buf;
posix_memalign(&buf, 4096, 4096);                 /* 4KB-aligned buffer */
int fd = open("datafile", O_DIRECT | O_RDONLY);   /* bypass the page cache */
pread(fd, buf, 4096, 0);                          /* read 4KB at offset 0 */
```
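Hard-coding 4KB is a common but fragile assumption. On Linux 6.1 and later, statx() with the STATX_DIOALIGN mask reports the alignment a particular file actually requires for O_DIRECT; a sketch, assuming the same datafile and headers that define STATX_DIOALIGN:

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>

int main(void) {
    struct statx stx;
    /* STATX_DIOALIGN asks the filesystem for the minimum buffer and offset
     * alignment it accepts for O_DIRECT on this file (Linux >= 6.1). */
    if (statx(AT_FDCWD, "datafile", 0, STATX_DIOALIGN, &stx) != 0) {
        perror("statx");
        return 1;
    }
    if (stx.stx_dio_mem_align == 0) {
        printf("O_DIRECT not supported on this file\n");
    } else {
        printf("buffer alignment: %u, offset alignment: %u\n",
               stx.stx_dio_mem_align, stx.stx_dio_offset_align);
    }
    return 0;
}
```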
When to use Direct IO:
- Databases: Manage their own caching (e.g., InnoDB buffer pool, PostgreSQL shared buffers). Page cache would be redundant "double buffering."
- Large sequential reads: Streaming large files where data won't be reused. Avoids polluting page cache.
- Low-latency requirements: Eliminate cache management overhead for predictable latency.
Pitfalls:
- Small random reads/writes perform poorly (no buffering or merging)
- Application must handle alignment complexities
- Can't leverage kernel read-ahead
- Mixing direct and buffered IO on the same file causes coherency issues
Code path: fs/direct-io.c and filesystem-specific DIO implementations.
# Synchronous IO Flags
Several flags control write durability:
- O_SYNC: Writes block until data and metadata are on stable storage. Expensive but ensures durability.
- O_DSYNC: Like O_SYNC but doesn't wait for metadata updates (e.g., file size, modification time) unless they are necessary for reading the data back.
- fsync(fd): System call to flush all dirty data and metadata for a file to disk. Blocks until complete.
- fdatasync(fd): Like fsync() but skips metadata updates when possible (similar to O_DSYNC).
- sync_file_range(): Partial sync that flushes a specific byte range. Doesn't wait for metadata or guarantee ordering.
Performance implications: Synchronous writes serialize IO and force disk flushes, destroying write batching. Use sparingly—only when durability is critical (e.g., database transaction commits).
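As an illustration of that commit pattern, the sketch below appends a record to a hypothetical journal.log with an ordinary buffered write and pays the flush cost only once, at the commit point, via fdatasync():

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    int fd = open("journal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) { perror("open"); return 1; }

    const char *record = "commit record\n";   /* placeholder payload */
    if (write(fd, record, strlen(record)) < 0) { perror("write"); return 1; }

    /* fdatasync() blocks until the data (and any metadata needed to read it
     * back, such as the new file size) reaches stable storage, but skips
     * timestamp updates, making it cheaper than fsync() on many filesystems. */
    if (fdatasync(fd) != 0) { perror("fdatasync"); return 1; }

    close(fd);
    return 0;
}
```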
# Interaction with Filesystems
Different filesystems handle buffered vs direct IO differently:
- ext4: Supports both modes. Direct IO requires extent-based files (the default in ext4).
- XFS: Excellent direct IO support. Direct IO writes bypass page cache but may still update metadata.
- ZFS: Direct IO support varies by implementation. OpenZFS on Linux accepts O_DIRECT, but writes still pass through ZFS's internal cache (the ARC) before reaching vdevs.
- NFS: Direct IO still involves network round-trips. Can reduce client-side caching but doesn't eliminate all buffering.