Deep Dive: NFS Close-to-Open Consistency
On a local filesystem, when process A writes to a file and process B reads it, B sees A's
changes immediately. The kernel's page cache is shared—there's only one copy of the
truth. NFS breaks this assumption. Each client has its own page cache, and
there is no protocol to invalidate remote caches on every write. Instead, NFS offers a
weaker guarantee called close-to-open consistency: changes made by one
client become visible to other clients only after the writer calls close()
and the reader calls open().
This section explores exactly what that guarantee means at the protocol level, what it costs in terms of network round-trips, what it does not guarantee, and how it affects the common atomic rename pattern.
# The Close-to-Open Contract
Close-to-open (CTO) provides two guarantees:
- On close: All modified data and metadata are flushed from the client's cache to the server before close() returns.
- On open: The client revalidates its cached copy against the server before open() returns. If the file changed on the server, the client purges its stale cache.
Together, these ensure a sequential handoff: if client A closes, then client B
opens the same file, B is guaranteed to see everything A wrote. This is weaker than
POSIX—which guarantees visibility after every write()—but
strong enough for many workflows where files are written by one process and later
consumed by another.
# Protocol Mechanics
To understand the cost of close-to-open, we need to look at the NFS RPCs involved. Let's trace what happens when client A writes a file and client B reads it afterward.
What happens on close()
When the application calls close(), the NFS client must push all dirty
pages to the server. This involves two types of RPCs:
Client A                                       NFS Server
|                                             |
|--- WRITE (offset=0, count=8192) ----------->| Dirty page 1
|<-- OK --------------------------------------|
|--- WRITE (offset=8192, count=8192) -------->| Dirty page 2
|<-- OK --------------------------------------|
|--- WRITE (offset=16384, count=4096) ------->| Dirty page 3 (partial)
|<-- OK --------------------------------------|
|                                             |
|--- COMMIT (offset=0, count=20480) --------->| Flush to stable storage
|<-- OK --------------------------------------|
|                                             |
close() returns to application
WRITE transfers data from the client's page cache to the server. The server may buffer these in its own RAM. COMMIT asks the server to flush the written data to stable storage (disk). Without COMMIT, a server crash could lose buffered writes. The NFS client sends COMMIT at close time so that close-to-open consistency includes durability—not just visibility.
Cost: One WRITE RPC per dirty page (or group of pages, depending on
wsize), plus one COMMIT. For a file with N dirty pages, that's N+1
round-trips at minimum [1]. This is why closing a large dirty file over NFS can be
slow—the latency is hidden inside close(), and applications that
don't check the return value of close() may miss write errors entirely.
What happens on open()
When client B calls open(), the NFS client checks whether its cached
attributes for the file are still valid:
Client B                                       NFS Server
|                                             |
|--- GETATTR (file handle) ------------------>| Check mtime, size, change attr
|<-- mtime=T2, size=20480 --------------------|
|                                             |
| (cached mtime=T1 != T2, purge page cache)   |
|                                             |
open() returns to application
|                                             |
|--- READ (offset=0, count=32768) ----------->| First read fetches fresh data
|<-- 20480 bytes of data ---------------------|
The GETATTR RPC fetches the file's current attributes from the server. If the
mtime or change attribute differs from what the client has cached, the client
invalidates its page cache. The actual data isn't fetched until the application calls
read()—open just ensures the cache is known to be stale.
Cost: One GETATTR round-trip on every open(). This is the
price of CTO consistency. For workloads that open many small files (e.g., a build system
traversing source trees), these GETATTRs can become a significant bottleneck.
# The Attribute Cache
Outside of open(), the NFS client does not revalidate on every
operation. Instead, it caches file attributes for a tunable duration. During this
window, stat() calls return cached data without contacting the server,
and read() calls may serve stale page cache contents.
Mount option   Default   Effect
----------------------------------------------------------------------
acregmin       3s        Min time to cache regular file attrs
acregmax       60s       Max time to cache regular file attrs
acdirmin       30s       Min time to cache directory attrs
acdirmax       60s       Max time to cache directory attrs
actimeo        (none)    Sets all four values at once
noac           (off)     Disables attribute caching entirely
The actual cache duration is adaptive between min and max: the client uses a heuristic based on how recently the file was modified. A file that hasn't changed in a long time gets cached closer to max; a frequently-changing file gets cached closer to min.
Important: The attribute cache only affects operations between
open() and close(), and operations that don't go through
open() at all (like stat()). The CTO guarantee overrides
the attribute cache at the open/close boundaries. But if your application polls a file
with stat() without reopening it, you'll see stale attributes for up to
acregmax seconds.
noac: The Nuclear Option
Mounting with noac disables the attribute cache, forcing a GETATTR
on every operation. This gives much stronger consistency—close to POSIX
behavior—but at significant cost. Every stat(),
read(), and directory listing generates a network round-trip.
This can reduce metadata performance by 10-100x depending on workload, and
puts heavy load on the server. Use only when correctness requirements absolutely
demand it and you can absorb the performance hit.
# What Close-to-Open Does NOT Guarantee
CTO is a narrow contract. Several common expectations from local filesystems break on NFS:
- No mid-file consistency: If clients A and B both have the same file open, A's writes are not visible to B until A closes and B reopens. During the open window, B may read stale data from its local page cache. There is no invalidation protocol between clients.
- No mmap() consistency: Memory-mapped regions bypass close-to-open entirely. The NFS client does not flush mmap() dirty pages on close() unless msync() is called first. Readers using mmap() may never see updates from other clients, even after reopening.
- No write ordering between files: If client A writes to file X then writes to file Y, another client may see Y's changes before X's changes. Each file's close-to-open is independent—there is no cross-file ordering guarantee.
- No directory consistency: Directory listings (readdir()) are subject to the attribute cache. A newly created file may not appear in ls output on another client for up to acdirmax seconds, even though opening the file by name would work (since open() forces a LOOKUP RPC).
# Atomic Rename over NFS
The atomic rename pattern—write to a temp file,
then rename() over the target—is the standard way to update files
safely on local filesystems. On NFS, rename() is still atomic on the
server: the RENAME RPC is a single server-side operation. But there's a catch.
The problem is that when you call rename(), your data might not have
reached the server yet. The NFS client's page cache may still hold dirty pages from
your writes. The RENAME RPC tells the server to swap directory entries—but if the
data pages haven't been flushed, the renamed file may be empty or incomplete from the
perspective of other clients.
Let's trace what goes wrong, and then what goes right.
Wrong: rename() without flushing
Client A NFS Server
| |
| write("config.tmp", data) |
| (data stays in local page cache) | Server has nothing yet!
| |
|--- RENAME config.tmp -> config ------------>| Server renames the empty/
|<-- OK --------------------------------------| partial file
| |
Client B opens "config" -- sees empty or partial data
Because rename() doesn't trigger a flush of the source file's dirty
pages, the server may rename a file whose data hasn't arrived yet. On a local
filesystem this can't happen—both write() and rename()
operate on the same page cache. On NFS, the page cache is local to the client, and the
RENAME RPC operates on the server.
Correct: flush + fsync before rename
Client A NFS Server
| |
| write("config.tmp", data) |
| (data in local page cache) |
| |
| fflush(fp) -- push to kernel buffers |
| fsync(fd) -- triggers NFS WRITE+COMMIT |
| |
|--- WRITE (data) --------------------------->| Data reaches server
|<-- OK --------------------------------------|
|--- COMMIT --------------------------------->| Data on stable storage
|<-- OK --------------------------------------|
| |
| close(fd) -- no dirty pages remain |
| |
|--- RENAME config.tmp -> config ------------>| Server renames the
|<-- OK --------------------------------------| complete file
| |
Client B opens "config" -- GETATTR, sees new mtime
Client B reads "config" -- gets complete data
The correct sequence is: fflush() (push userspace buffers to kernel)
→ fsync() (trigger WRITE+COMMIT to server) →
close() → rename(). The fsync() is the
critical step—it forces the NFS client to send WRITE RPCs for all dirty pages
followed by a COMMIT, ensuring the data is on the server's stable storage before we
rename.
You might wonder: doesn't close() already flush dirty pages? It does—that's
the CTO guarantee. But if you close() and then rename(),
you're relying on the file being fully flushed to the server between those two calls.
A crash between close() and rename() could leave you with
the temp file flushed but not renamed. The fsync() before
close() ensures durability on the server before you proceed to
the rename, giving you a clean recovery path: if the rename never happens, the old
file is still intact.
Cost of the correct pattern
Operation     RPCs generated                    Round-trips
------------------------------------------------------------------
write()       (none -- buffered locally)        0
fflush()      (none -- kernel buffers)          0
fsync()       WRITE x N + COMMIT                N + 1
close()       (nothing to flush)                0 [2]
rename()      RENAME                            1
------------------------------------------------------------------
Total                                           N + 2
For a small config file (one page), that's 3 round-trips: WRITE, COMMIT, RENAME. At typical datacenter NFS latencies (0.5-2ms per RPC), this adds 1.5-6ms versus near-zero on a local filesystem. For large files, the WRITE RPCs dominate, but they can be pipelined by the NFS client to overlap multiple WRITEs in flight.
[1] Modern NFS clients pipeline WRITEs, sending multiple RPCs concurrently up to the
wsize window. The actual wall-clock time depends on network bandwidth and
server throughput, not just round-trip count.
[2] close() generates zero RPCs here because fsync() already
flushed everything. Without the preceding fsync(), close()
would generate the same WRITE+COMMIT sequence.
# NFSv4 Delegations
NFSv4 introduced delegations—a mechanism where the server grants
a client exclusive (or read) access to a file. While a client holds a delegation, it
can cache aggressively without revalidating on every open(), because the
server guarantees no other client is modifying the file. If another client tries to
access the file, the server issues a callback to recall the delegation,
forcing the holder to flush and relinquish its cached state.
Delegations effectively give you stronger-than-CTO consistency for uncontended files while reducing RPC overhead (no GETATTRs needed on open while the delegation is held). When contention occurs, the delegation is recalled and you fall back to standard CTO behavior. This makes NFSv4 significantly better than v3 for workloads where files are typically accessed by a single client at a time.
# Practical Implications
Common symptoms of CTO surprises
- Stale reads: Application on client B reads old data even though client A has written new data. Usually caused by B not reopening the file, or by the attribute cache serving stale attrs between opens.
- Empty files after rename: Atomic rename pattern used without fsync(). The rename succeeds but the file data hasn't reached the server.
- Build failures on NFS: Build systems that write object files and immediately read them from another process can hit CTO windows. The compiler writes foo.o and closes it, but if the linker opens it before the attribute cache expires, it may read stale cached attributes and see old (or empty) content—even though both processes share the same kernel NFS client.
- "File not found" after create: Directory attribute cache causes readdir() to return stale listings. The file exists (and can be opened by name), but doesn't appear in ls yet.
Tuning and mount options
Option Effect Performance cost
---------------------------------------------------------------------------
actimeo=0 Revalidate attrs on every op High: GETATTR on every
stat/read/readdir
noac Disable attribute cache + Very high: all metadata
disable client-side caching ops hit the server
sync Synchronous writes (no Extreme: every write()
client-side buffering) becomes WRITE+COMMIT
proto=rdma Use RDMA transport Lower latency per RPC,
helps all patterns
nconnect=N Multiple TCP connections Better throughput for
(Linux 5.3+) parallel workloads
The general trade-off is always the same: stricter consistency means more RPCs, which means higher latency and more server load. Start with the default CTO behavior, design your application to work within its guarantees, and only tighten mount options when you've identified a specific correctness need.
Debugging
When you suspect CTO-related issues:
- nfsstat -c — shows client-side RPC statistics. Look at GETATTR, WRITE, and COMMIT counts. A sudden spike in GETATTRs may indicate cache invalidation storms.
- mountstats — per-mount NFS statistics including average RPC latency, retransmissions, and cache hit rates.
- rpcdebug -m nfs -s all — enables kernel-level NFS client debug logging (verbose; use sparingly). Shows individual RPC calls with timing.
- strace -e trace=%file — on the application process, traces file-related system calls (open, close, read, write, fsync). NFS operations appear as regular filesystem calls to the application, so this reveals which syscalls trigger RPCs.