
Deep Dive: NFS Close-to-Open Consistency

On a local filesystem, when process A writes to a file and process B reads it, B sees A's changes immediately. The kernel's page cache is shared—there's only one copy of the truth. NFS breaks this assumption. Each client has its own page cache, and there is no protocol to invalidate remote caches on every write. Instead, NFS offers a weaker guarantee called close-to-open consistency: changes made by one client become visible to other clients only after the writer calls close() and the reader calls open().

This section explores exactly what that guarantee means at the protocol level, what it costs in terms of network round-trips, what it does not guarantee, and how it affects the common atomic rename pattern.

# The Close-to-Open Contract

Close-to-open (CTO) provides two guarantees:

  • On close: All modified data and metadata are flushed from the client's cache to the server before close() returns.
  • On open: The client revalidates its cached copy against the server before open() returns. If the file changed on the server, the client purges its stale cache.

Together, these ensure a sequential handoff: if client A closes, then client B opens the same file, B is guaranteed to see everything A wrote. This is weaker than POSIX—which guarantees visibility after every write()—but strong enough for many workflows where files are written by one process and later consumed by another.
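
The handoff can be sketched in Python with raw os-level calls (the helper names and path are mine, not part of any NFS API): if writer() runs to completion on one client before reader() starts on another, CTO guarantees reader() sees the written bytes.

```python
import os

def writer(path: str, data: bytes) -> None:
    """Client A: write, then close. Under CTO, close() flushes all
    dirty pages to the server (WRITE + COMMIT) before returning."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
    finally:
        os.close(fd)          # CTO guarantee 1: flush on close

def reader(path: str) -> bytes:
    """Client B: open() revalidates cached attributes against the
    server, so the subsequent read reflects the last close()."""
    fd = os.open(path, os.O_RDONLY)   # CTO guarantee 2: revalidate on open
    try:
        return os.read(fd, 1 << 16)
    finally:
        os.close(fd)
```

On a local filesystem this ordering is trivially satisfied; on NFS it is exactly the contract you are allowed to rely on, and nothing more.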

# Protocol Mechanics

To understand the cost of close-to-open, we need to look at the NFS RPCs involved. Let's trace what happens when client A writes a file and client B reads it afterward.

What happens on close()

When the application calls close(), the NFS client must push all dirty pages to the server. This involves two types of RPCs:

Client A                                     NFS Server
   |                                             |
   |--- WRITE (offset=0, count=8192) ----------->|   Dirty page 1
   |<-- OK --------------------------------------|
   |--- WRITE (offset=8192, count=8192) -------->|   Dirty page 2
   |<-- OK --------------------------------------|
   |--- WRITE (offset=16384, count=4096) ------->|   Dirty page 3 (partial)
   |<-- OK --------------------------------------|
   |                                             |
   |--- COMMIT (offset=0, count=20480) --------->|   Flush to stable storage
   |<-- OK --------------------------------------|
   |                                             |
   close() returns to application

WRITE transfers data from the client's page cache to the server. The server may buffer these in its own RAM. COMMIT asks the server to flush the written data to stable storage (disk). Without COMMIT, a server crash could lose buffered writes. The NFS client sends COMMIT at close time so that close-to-open consistency includes durability—not just visibility.

Cost: One WRITE RPC per dirty page (or group of pages, depending on wsize), plus one COMMIT. For a file with N dirty pages, that's N+1 round-trips at minimum [1]. This is why closing a large dirty file over NFS can be slow—the latency is hidden inside close(), and applications that don't check the return value of close() may miss write errors entirely.
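
Because the flush happens inside close(), that is where deferred write errors surface. A small Python sketch (the wrapper name is mine) that refuses to ignore them:

```python
import os

def careful_close(fd: int) -> None:
    """Over NFS, a failed WRITE or COMMIT during the close-time flush is
    reported by close() itself, not by the earlier write() calls.
    Treating close() as infallible silently drops those errors."""
    try:
        os.close(fd)
    except OSError as exc:
        raise RuntimeError("deferred write error surfaced at close()") from exc
```

In C the equivalent is simply checking close()'s return value for -1 and inspecting errno.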

What happens on open()

When client B calls open(), the NFS client checks whether its cached attributes for the file are still valid:

Client B                                     NFS Server
   |                                             |
   |--- GETATTR (file handle) ------------------>|   Check mtime, size, change attr
   |<-- mtime=T2, size=20480 --------------------|
   |                                             |
   | (cached mtime=T1 != T2, purge page cache)   |
   |                                             |
   open() returns to application
   |                                             |
   |--- READ (offset=0, count=32768) ----------->|   First read fetches fresh data
   |<-- 20480 bytes of data ---------------------|

The GETATTR RPC fetches the file's current attributes from the server. If the mtime or change attribute differs from what the client has cached, the client invalidates its page cache. The actual data isn't fetched until the application calls read()—open just ensures the cache is known to be stale.

Cost: One GETATTR round-trip on every open(). This is the price of CTO consistency. For workloads that open many small files (e.g., a build system traversing source trees), these GETATTRs can become a significant bottleneck.

# The Attribute Cache

Outside of open(), the NFS client does not revalidate on every operation. Instead, it caches file attributes for a tunable duration. During this window, stat() calls return cached data without contacting the server, and read() calls may serve stale page cache contents.

Mount option          Default    Effect
----------------------------------------------------------------------
acregmin                3s       Min time to cache regular file attrs
acregmax               60s       Max time to cache regular file attrs
acdirmin               30s       Min time to cache directory attrs
acdirmax               60s       Max time to cache directory attrs
actimeo               (none)     Sets all four values at once
noac                  (off)      Disables attribute caching entirely

The actual cache duration is adaptive between min and max: the client uses a heuristic based on how recently the file was modified. A file that hasn't changed in a long time gets cached closer to max; a frequently-changing file gets cached closer to min.
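
The adaptive behavior can be sketched as a simple doubling heuristic. This is a loose model of the idea, not the kernel's actual code; the constants are the default mount values and the function name is mine.

```python
ACREGMIN, ACREGMAX = 3, 60  # default mount values, in seconds

def next_timeout(current: int, attrs_changed: bool) -> int:
    """Sketch of an adaptive attribute-cache timeout: each revalidation
    that finds the file unchanged lets the cache window grow toward
    acregmax; an observed change snaps it back to acregmin."""
    if attrs_changed:
        return ACREGMIN
    return min(current * 2, ACREGMAX)
```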

Important: The attribute cache governs operations between open() and close(), and operations that never pass through open() at all (like stat()). At the open/close boundaries themselves, the CTO guarantee takes precedence. But if your application polls a file with stat() without reopening it, you'll see stale attributes for up to acregmax seconds.

noac: The Nuclear Option

Mounting with noac disables the attribute cache, forcing a GETATTR on every operation. This gives much stronger consistency—close to POSIX behavior—but at significant cost. Every stat(), read(), and directory listing generates a network round-trip. This can reduce metadata performance by 10-100x depending on workload, and puts heavy load on the server. Use only when correctness requirements absolutely demand it and you can absorb the performance hit.

# What Close-to-Open Does NOT Guarantee

CTO is a narrow contract. Several common expectations from local filesystems break on NFS:

  • No mid-file consistency: If clients A and B both have the same file open, A's writes are not visible to B until A closes and B reopens. During the open window, B may read stale data from its local page cache. There is no invalidation protocol between clients.
  • No mmap() consistency: Memory-mapped regions bypass close-to-open entirely. The NFS client does not flush mmap() dirty pages on close() unless msync() is called first. Readers using mmap() may never see updates from other clients, even after reopening.
  • No write ordering between files: If client A writes to file X then writes to file Y, another client may see Y's changes before X's changes. Each file's close-to-open is independent—there is no cross-file ordering guarantee.
  • No directory consistency: Directory listings (readdir()) are subject to the attribute cache. A newly created file may not appear in ls output on another client for up to acdirmax seconds, even though opening the file by name would work (since open() forces a LOOKUP RPC).
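
For the mmap() case, the fix is an explicit msync() while the mapping is still live. A Python sketch (the helper name is mine; mmap's flush() is backed by msync):

```python
import mmap
import os

def update_mapped(path: str, data: bytes) -> None:
    """Modify a file through a memory mapping and explicitly write the
    dirty pages back. Without the flush(), the NFS client may never
    send the mapped changes to the server before close()."""
    fd = os.open(path, os.O_RDWR)
    try:
        with mmap.mmap(fd, len(data)) as m:
            m[:len(data)] = data
            m.flush()        # msync: push dirty mapped pages to the server
    finally:
        os.close(fd)
```

In C the same step is msync(addr, length, MS_SYNC) before munmap().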

# Atomic Rename over NFS

The atomic rename pattern—write to a temp file, then rename() over the target—is the standard way to update files safely on local filesystems. On NFS, rename() is still atomic on the server: the RENAME RPC is a single server-side operation. But there's a catch.

The problem is that when you call rename(), your data might not have reached the server yet. The NFS client's page cache may still hold dirty pages from your writes. The RENAME RPC tells the server to swap directory entries—but if the data pages haven't been flushed, the renamed file may be empty or incomplete from the perspective of other clients.

Let's trace what goes wrong, and then what goes right.

Wrong: rename() without flushing

Client A                                     NFS Server
   |                                             |
   | write("config.tmp", data)                   |
   |   (data stays in local page cache)          |   Server has nothing yet!
   |                                             |
   |--- RENAME config.tmp -> config ------------>|   Server renames the empty/
   |<-- OK --------------------------------------|   partial file
   |                                             |
   Client B opens "config" -- sees empty or partial data

Because rename() doesn't trigger a flush of the source file's dirty pages, the server may rename a file whose data hasn't arrived yet. On a local filesystem this can't happen—both write() and rename() operate on the same page cache. On NFS, the page cache is local to the client, and the RENAME RPC operates on the server.

Correct: flush + fsync before rename

Client A                                     NFS Server
   |                                             |
   | write("config.tmp", data)                   |
   |   (data in local page cache)                |
   |                                             |
   | fflush(fp)   -- push to kernel buffers      |
   | fsync(fd)    -- triggers NFS WRITE+COMMIT   |
   |                                             |
   |--- WRITE (data) --------------------------->|   Data reaches server
   |<-- OK --------------------------------------|
   |--- COMMIT --------------------------------->|   Data on stable storage
   |<-- OK --------------------------------------|
   |                                             |
   | close(fd)    -- no dirty pages remain       |
   |                                             |
   |--- RENAME config.tmp -> config ------------>|   Server renames the
   |<-- OK --------------------------------------|   complete file
   |                                             |
   Client B opens "config" -- GETATTR, sees new mtime
   Client B reads  "config" -- gets complete data

The correct sequence is: fflush() (push userspace buffers to kernel) → fsync() (trigger WRITE+COMMIT to server) → close() → rename(). The fsync() is the critical step: it forces the NFS client to send WRITE RPCs for all dirty pages followed by a COMMIT, ensuring the data is on the server's stable storage before we rename.

You might wonder: doesn't close() already flush dirty pages? It does; that's the CTO guarantee. The difference is error handling. If the flush inside close() fails, the error is reported only by close() itself, after the descriptor is gone and with no way to retry, and code that ignores close()'s return value never sees it at all. Calling fsync() first surfaces any WRITE or COMMIT failure while the file is still open, so you proceed to the rename only once the data is known to be durable on the server. That gives you a clean recovery path: if the data never became durable, the rename never happens, and the old file is still intact.
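
Here is the whole pattern in Python. It uses unbuffered os-level I/O, so the stdio fflush() step doesn't apply; the helper name and the fixed .tmp suffix are mine (production code often picks a unique temp name in the same directory).

```python
import os

def atomic_publish(target: str, data: bytes) -> None:
    """NFS-safe atomic update: write a temp file, force it to the
    server's stable storage, then rename over the target."""
    tmp = target + ".tmp"
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)   # dirties the local page cache only
        os.fsync(fd)         # WRITE x N + COMMIT: data reaches stable storage
    finally:
        os.close(fd)         # nothing left to flush
    os.rename(tmp, target)   # single atomic RENAME RPC on the server
```

Readers that open target under close-to-open semantics see either the old contents or the complete new contents, never a partial file.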

Cost of the correct pattern

Operation           RPCs generated             Round-trips
------------------------------------------------------------------
write()             (none -- buffered locally)  0
fflush()            (none -- kernel buffers)    0
fsync()             WRITE x N + COMMIT          N + 1
close()             (nothing to flush)          0  [2]
rename()            RENAME                      1
------------------------------------------------------------------
Total                                           N + 2

For a small config file (one page), that's 3 round-trips: WRITE, COMMIT, RENAME. At typical datacenter NFS latencies (0.5-2ms per RPC), this adds 1.5-6ms versus near-zero on a local filesystem. For large files, the WRITE RPCs dominate, but they can be pipelined by the NFS client to overlap multiple WRITEs in flight.

[1] Modern NFS clients pipeline WRITEs, sending multiple RPCs concurrently up to the wsize window. The actual wall-clock time depends on network bandwidth and server throughput, not just round-trip count.
[2] close() generates zero RPCs here because fsync() already flushed everything. Without the preceding fsync(), close() would generate the same WRITE+COMMIT sequence.
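
The round-trip arithmetic can be expressed as a back-of-envelope model. It counts serial RPCs only; since real clients pipeline WRITEs (footnote [1]), treat the result as an upper bound.

```python
def rename_pattern_cost_ms(dirty_pages: int, rtt_ms: float) -> float:
    """Serial wall-clock cost of the fsync+rename pattern:
    N WRITEs + 1 COMMIT + 1 RENAME = N + 2 round-trips."""
    return (dirty_pages + 2) * rtt_ms
```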

# NFSv4 Delegations

NFSv4 introduced delegations—a mechanism where the server grants a client exclusive (or read) access to a file. While a client holds a delegation, it can cache aggressively without revalidating on every open(), because the server guarantees no other client is modifying the file. If another client tries to access the file, the server issues a callback to recall the delegation, forcing the holder to flush and relinquish its cached state.

Delegations effectively give you stronger-than-CTO consistency for uncontended files while reducing RPC overhead (no GETATTRs needed on open while the delegation is held). When contention occurs, the delegation is recalled and you fall back to standard CTO behavior. This makes NFSv4 significantly better than v3 for workloads where files are typically accessed by a single client at a time.

# Practical Implications

Common symptoms of CTO surprises

  • Stale reads: Application on client B reads old data even though client A has written new data. Usually caused by B not reopening the file, or by the attribute cache serving stale attrs between opens.
  • Empty files after rename: Atomic rename pattern used without fsync(). The rename succeeds but the file data hasn't reached the server.
  • Build failures on NFS: Distributed builds that write object files on one client and read them on another can hit CTO windows. The compiler writes foo.o on client A; a linker on client B that reopens the file will normally see it, but one that holds the file open across the update, or polls it with stat(), can get stale attributes from the cache and read old (or empty) content. Two processes on the same client share one page cache and do not have this problem.
  • "File not found" after create: Directory attribute cache causes readdir() to return stale listings. The file exists (and can be opened by name), but doesn't appear in ls yet.

Tuning and mount options

Option                Effect                           Performance cost
---------------------------------------------------------------------------
actimeo=0             Revalidate attrs on every op     High: GETATTR on every
                                                       stat/read/readdir
noac                  Disable attribute cache +        Very high: all metadata
                      disable client-side caching      ops hit the server
sync                  Synchronous writes (no           Extreme: every write()
                      client-side buffering)           becomes WRITE+COMMIT
proto=rdma            Use RDMA transport               Lower latency per RPC,
                                                       helps all patterns
nconnect=N            Multiple TCP connections          Better throughput for
                      (Linux 5.3+)                     parallel workloads

The general trade-off is always the same: stricter consistency means more RPCs, which means higher latency and more server load. Start with the default CTO behavior, design your application to work within its guarantees, and only tighten mount options when you've identified a specific correctness need.

Debugging

When you suspect CTO-related issues:

  • nfsstat -c — shows client-side RPC statistics. Look at GETATTR, WRITE, and COMMIT counts. A sudden spike in GETATTRs may indicate cache invalidation storms.
  • mountstats — per-mount NFS statistics including average RPC latency, retransmissions, and cache hit rates.
  • rpcdebug -m nfs -s all — enables kernel-level NFS client debug logging (verbose; use sparingly). Shows individual RPC calls with timing.
  • strace -e trace=%file — on the application process, traces file-related system calls (open, close, read, write, fsync). NFS operations appear as regular filesystem calls to the application, so this reveals which syscalls trigger RPCs.