8.2. Lock Ranges
The protocol allows a lock owner to request a lock with a byte range
and then either upgrade or unlock a sub-range of the initial lock.
It is expected that this will be an uncommon type of request. In any
case, servers or server filesystems may not be able to support sub-
range lock semantics. In the event that a server receives a locking
request that represents a sub-range of current locking state for the
lock owner, the server is allowed to return the error
NFS4ERR_LOCK_RANGE to signify that it does not support sub-range lock
operations. Therefore, the client should be prepared to receive this
error and, if appropriate, report the error to the requesting
application.
The client is discouraged from combining multiple independent locking
ranges that happen to be adjacent into a single request since the
server may not support sub-range requests and for reasons related to
the recovery of file locking state in the event of server failure.
As discussed in the section "Server Failure and Recovery" below, the
server may employ certain optimizations during recovery that work
effectively only when the client's behavior during lock recovery is
similar to the client's locking behavior prior to server failure.
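As an illustration of the sub-range rule above, the following is a minimal sketch of how a server might decide whether to return NFS4ERR_LOCK_RANGE. The function name, the representation of held locks as (offset, length) pairs, and the use of strings for status codes are assumptions made for this example only and are not part of the protocol.

```python
NFS4_OK = "NFS4_OK"                        # illustrative string, not a wire value
NFS4ERR_LOCK_RANGE = "NFS4ERR_LOCK_RANGE"  # illustrative string, not a wire value


def check_lock_request(held, offset, length, supports_subrange):
    """Return the status for a lock request from one lock owner.

    held: (offset, length) ranges the same lock owner already holds.
    supports_subrange: whether this hypothetical server can split ranges.
    """
    req_end = offset + length
    for h_off, h_len in held:
        h_end = h_off + h_len
        inside = h_off <= offset and req_end <= h_end
        exact = (offset, length) == (h_off, h_len)
        # A request strictly inside a held range is a sub-range request.
        if inside and not exact and not supports_subrange:
            return NFS4ERR_LOCK_RANGE
    return NFS4_OK
```

A request that matches a held range exactly, or that does not fall inside any held range, is passed through unchanged in this sketch.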
8.3. Upgrading and Downgrading Locks
If a client has a write lock on a record, it can request an atomic
downgrade of the lock to a read lock via the LOCK request, by setting
the type to READ_LT. If the server supports atomic downgrade, the
request will succeed. If not, it will return NFS4ERR_LOCK_NOTSUPP.
The client should be prepared to receive this error, and if
appropriate, report the error to the requesting application.
If a client has a read lock on a record, it can request an atomic
upgrade of the lock to a write lock via the LOCK request by setting
the type to WRITE_LT or WRITEW_LT. If the server does not support
atomic upgrade, it will return NFS4ERR_LOCK_NOTSUPP. If the upgrade
can be achieved without an existing conflict, the request will
succeed. Otherwise, the server will return either NFS4ERR_DENIED or
NFS4ERR_DEADLOCK. The error NFS4ERR_DEADLOCK is returned if the
client issued the LOCK request with the type set to WRITEW_LT and the
server has detected a deadlock. The client should be prepared to
receive such errors and, if appropriate, report the error to the
requesting application.
8.4. Blocking Locks
Some clients require the support of blocking locks. The NFS version
4 protocol must not rely on a callback mechanism and therefore is
unable to notify a client when a previously denied lock has been
granted. Clients have no choice but to continually poll for the
lock. This presents a fairness problem. Two new lock types are
added, READW and WRITEW, and are used to indicate to the server that
the client is requesting a blocking lock. The server should maintain
an ordered list of pending blocking locks. When the conflicting lock
is released, the server may wait the lease period for the first
waiting client to re-request the lock. After the lease period
expires the next waiting client request is allowed the lock. Clients
are required to poll at an interval sufficiently small that they are
likely to acquire the lock in a timely manner. The server is not
required to maintain a list of pending blocked locks as it is used to
increase fairness and not correct operation. Because of the
unordered nature of crash recovery, storing of lock state to stable
storage would be required to guarantee ordered granting of blocking
locks.
Servers may also note the lock types and delay returning denial of
the request to allow extra time for a conflicting lock to be
released, allowing a successful return. In this way, clients can
avoid the burden of needlessly frequent polling for blocking locks.
The server should take care in the length of delay in the event the
client retransmits the request.
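The ordered list of pending blocking locks described above could be sketched as follows. This is only one possible reading of the fairness rule: each elapsed lease period after the conflicting lock is released passes the turn to the next waiter. The class and method names are hypothetical, and time is passed in explicitly so the behavior is easy to follow.

```python
class PendingLockQueue:
    """Hypothetical server-side list of pending blocking-lock requests."""

    def __init__(self, lease_period):
        self.lease_period = lease_period
        self.waiters = []          # lock owners, oldest denial first
        self.released_at = None    # when the conflicting lock was released

    def denied(self, owner):
        """Record a READW/WRITEW request that had to be denied."""
        if owner not in self.waiters:
            self.waiters.append(owner)

    def lock_released(self, now):
        """Note the time at which the conflicting lock went away."""
        self.released_at = now

    def may_grant(self, owner, now):
        """Decide whether a re-request from `owner` gets the lock now."""
        if self.released_at is None:
            return False                   # conflicting lock still held
        self.denied(owner)                 # newcomers join the tail
        # Each elapsed lease period passes the turn to the next waiter.
        turn = int((now - self.released_at) // self.lease_period)
        if self.waiters.index(owner) <= turn:
            self.waiters.remove(owner)
            self.released_at = None        # the lock is held again
            return True
        return False
```

Because the list only improves fairness, a server that loses it (for example, across a restart) still operates correctly; waiting clients simply compete by polling.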
8.5. Lease Renewal
The purpose of a lease is to allow a server to remove stale locks
that are held by a client that has crashed or is otherwise
unreachable. It is not a mechanism for cache consistency and lease
renewals may not be denied if the lease interval has not expired.
The following events cause implicit renewal of all of the leases for
a given client (i.e., all those sharing a given clientid). Each of
these is a positive indication that the client is still active and
that the associated state held at the server, for the client, is
still valid.
o An OPEN with a valid clientid.
o Any operation made with a valid stateid (CLOSE, DELEGPURGE,
DELEGRETURN, LOCK, LOCKU, OPEN, OPEN_CONFIRM, OPEN_DOWNGRADE,
READ, RENEW, SETATTR, WRITE). This does not include the special
stateids of all bits 0 or all bits 1.
Note that if the client had restarted or rebooted, the client
would not be making these requests without issuing the
SETCLIENTID/SETCLIENTID_CONFIRM sequence. The use of the
SETCLIENTID/SETCLIENTID_CONFIRM sequence (one that changes the
client verifier) notifies the server to drop the locking state
associated with the client. SETCLIENTID/SETCLIENTID_CONFIRM never
renews a lease.
If the server has rebooted, the stateids (NFS4ERR_STALE_STATEID
error) or the clientid (NFS4ERR_STALE_CLIENTID error) will not be
valid hence preventing spurious renewals.
This approach allows for low overhead lease renewal which scales
well. In the typical case no extra RPC calls are required for lease
renewal and in the worst case one RPC is required every lease period
(i.e., a RENEW operation). The number of locks held by the client is
not a factor since all state for the client is involved with the
lease renewal action.
Since all operations that create a new lease also renew existing
leases, the server must maintain a common lease expiration time for
all valid leases for a given client. This lease time can then be
easily updated upon implicit lease renewal actions.
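A minimal sketch of the common lease expiration time follows. It assumes a single expiration value per clientid, which is why the cost of renewal does not depend on the number of locks held. The class name and the explicit `now` parameter are illustrative choices, not protocol elements.

```python
class LeaseTable:
    """Hypothetical per-server table: one expiration time per clientid."""

    def __init__(self, lease_period):
        self.lease_period = lease_period
        self.expiry = {}  # clientid -> common lease expiration time

    def renew(self, clientid, now):
        """Called for any operation that implicitly renews the lease."""
        self.expiry[clientid] = now + self.lease_period

    def is_expired(self, clientid, now):
        """An unknown clientid is treated as having no valid lease."""
        return now >= self.expiry.get(clientid, float("-inf"))
```

For example, with a 90 second lease, a renewal at time 80 moves the single expiration time for that client to 170, regardless of how many locks the client holds.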
8.6. Crash Recovery
The important requirement in crash recovery is that both the client
and the server know when the other has failed. Additionally, it is
required that a client sees a consistent view of data across server
restarts or reboots. All READ and WRITE operations that may have
been queued within the client or network buffers must wait until the
client has successfully recovered the locks protecting the READ and
WRITE operations.
8.6.1. Client Failure and Recovery
In the event that a client fails, the server may recover the client's
locks when the associated leases have expired. Conflicting locks
from another client may only be granted after this lease expiration.
If the client is able to restart or reinitialize within the lease
period the client may be forced to wait the remainder of the lease
period before obtaining new locks.
To minimize client delay upon restart, lock requests are associated
with an instance of the client by a client supplied verifier. This
verifier is part of the initial SETCLIENTID call made by the client.
The server returns a clientid as a result of the SETCLIENTID
operation. The client then confirms the use of the clientid with
SETCLIENTID_CONFIRM. The clientid in combination with an opaque
owner field is then used by the client to identify the lock owner for
OPEN. This chain of associations is then used to identify all locks
for a particular client.
Since the verifier will be changed by the client upon each
initialization, the server can compare a new verifier to the verifier
associated with currently held locks and determine that they do not
match. This signifies the client's new instantiation and subsequent
loss of locking state. As a result, the server is free to release
all locks held which are associated with the old clientid which was
derived from the old verifier.
Note that the verifier must have the same uniqueness properties as
the verifier for the COMMIT operation.
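The verifier comparison described above might be sketched as follows. This is a simplification: it collapses SETCLIENTID and SETCLIENTID_CONFIRM into one step and keeps state in a plain dictionary keyed by the client's id string. The function name and data layout are assumptions for illustration only.

```python
def handle_setclientid(clients, id_string, verifier):
    """Register a client instance; return locks released by a restart.

    clients: dict mapping id_string -> {"verifier": bytes, "locks": set}.
    """
    record = clients.get(id_string)
    released = set()
    if record is not None and record["verifier"] != verifier:
        # A changed verifier signals a new client instantiation, so the
        # locking state of the old instance can be dropped.
        released = record["locks"]
        record = None
    if record is None:
        clients[id_string] = {"verifier": verifier, "locks": set()}
    return released
```

Repeating the call with an unchanged verifier leaves existing locks in place; only a new verifier causes the old instance's locks to be released.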
8.6.2. Server Failure and Recovery
If the server loses locking state (usually as a result of a restart
or reboot), it must allow clients time to discover this fact and re-
establish the lost locking state. The client must be able to re-
establish the locking state without having the server deny valid
requests because the server has granted conflicting access to another
client. Likewise, if there is the possibility that clients have not
yet re-established their locking state for a file, the server must
disallow READ and WRITE operations for that file. The duration of
this recovery period is equal to the duration of the lease period.
A client can determine that server failure (and thus loss of locking
state) has occurred, when it receives one of two errors. The
NFS4ERR_STALE_STATEID error indicates a stateid invalidated by a
reboot or restart. The NFS4ERR_STALE_CLIENTID error indicates a
clientid invalidated by reboot or restart. When either of these is
received, the client must establish a new clientid (see the section
"Client ID") and re-establish the locking state as discussed below.
The period of special handling of locking and READs and WRITEs, equal
in duration to the lease period, is referred to as the "grace
period". During the grace period, clients recover locks and the
associated state by reclaim-type locking requests (i.e., LOCK
requests with reclaim set to true and OPEN operations with a claim
type of CLAIM_PREVIOUS). During the grace period, the server must
reject READ and WRITE operations and non-reclaim locking requests
(i.e., other LOCK and OPEN operations) with an error of
NFS4ERR_GRACE.
If the server can reliably determine that granting a non-reclaim
request will not conflict with reclamation of locks by other clients,
the NFS4ERR_GRACE error does not have to be returned and the non-
reclaim client request can be serviced. For the server to be able to
service READ and WRITE operations during the grace period, it must
again be able to guarantee that no possible conflict could arise
between an impending reclaim locking request and the READ or WRITE
operation. If the server is unable to offer that guarantee, the
NFS4ERR_GRACE error must be returned to the client.
For a server to provide simple, valid handling during the grace
period, the easiest method is to simply reject all non-reclaim
locking requests and READ and WRITE operations by returning the
NFS4ERR_GRACE error. However, a server may keep information about
granted locks in stable storage. With this information, the server
could determine if a regular lock or READ or WRITE operation can be
safely processed.
For example, if a count of locks on a given file is available in
stable storage, the server can track reclaimed locks for the file and
when all reclaims have been processed, non-reclaim locking requests
may be processed. This way the server can ensure that non-reclaim
locking requests will not conflict with potential reclaim requests.
With respect to I/O requests, if the server is able to determine that
there are no outstanding reclaim requests for a file by information
from stable storage or another similar mechanism, the processing of
I/O requests could proceed normally for the file.
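The lock-count example above can be sketched as follows, assuming the server persisted a per-file count of granted locks before the restart. The class name and the dictionary representation are hypothetical; a real server would also need to handle clients that never reclaim before the grace period ends.

```python
class GraceTracker:
    """Hypothetical tracker of locks still expected to be reclaimed."""

    def __init__(self, persisted_lock_counts):
        # file -> number of locks recorded in stable storage before restart
        self.outstanding = dict(persisted_lock_counts)

    def reclaimed(self, file):
        """Account for one successful reclaim on `file`."""
        if self.outstanding.get(file, 0) > 0:
            self.outstanding[file] -= 1

    def non_reclaim_allowed(self, file):
        """True once every recorded lock on `file` has been reclaimed."""
        return self.outstanding.get(file, 0) == 0
```

A file with no recorded locks can be serviced immediately in this sketch, while a file with recorded locks returns NFS4ERR_GRACE for non-reclaim requests until its count reaches zero.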
To reiterate, for a server that allows non-reclaim lock and I/O
requests to be processed during the grace period, it MUST determine
that no lock subsequently reclaimed will be rejected and that no lock
subsequently reclaimed would have prevented any I/O operation
processed during the grace period.
Clients should be prepared for the return of NFS4ERR_GRACE errors for
non-reclaim lock and I/O requests. In this case the client should
employ a retry mechanism for the request. A delay (on the order of
several seconds) between retries should be used to avoid overwhelming
the server. Further discussion of the general issue is included in
[Floyd]. The client must account for the server that is able to
perform I/O and non-reclaim locking requests within the grace period
as well as those that can not do so.
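A client-side retry loop for NFS4ERR_GRACE might look like the sketch below. The `send` callable, the default delay of five seconds, and the attempt limit are assumptions chosen for the example; status codes are represented as strings for readability.

```python
import time


def retry_on_grace(send, delay=5.0, max_attempts=20, sleep=time.sleep):
    """Call `send()` until it returns something other than NFS4ERR_GRACE.

    send: hypothetical callable that issues the request and returns a
          status string.
    sleep: injectable so the loop can be exercised without waiting.
    """
    status = "NFS4ERR_GRACE"
    for attempt in range(max_attempts):
        status = send()
        if status != "NFS4ERR_GRACE":
            return status
        if attempt < max_attempts - 1:
            sleep(delay)  # several seconds, to avoid overwhelming the server
    return status
```

The same loop works against a server that services the request during the grace period, since the first call then returns a status other than NFS4ERR_GRACE.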
A reclaim-type locking request outside the server's grace period can
only succeed if the server can guarantee that no conflicting lock or
I/O request has been granted since reboot or restart.
A server may, upon restart, establish a new value for the lease
period. Therefore, clients should, once a new clientid is
established, refetch the lease_time attribute and use it as the basis
for lease renewal for the lease associated with that server.
However, the server must establish, for this restart event, a grace
period at least as long as the lease period for the previous server
instantiation. This allows the client state obtained during the
previous server instance to be reliably re-established.
8.6.3. Network Partitions and Recovery
If the duration of a network partition is greater than the lease
period provided by the server, the server will not have received a
lease renewal from the client. If this occurs, the server may free
all locks held for the client. As a result, all stateids held by the
client will become invalid or stale. Once the client is able to
reach the server after such a network partition, all I/O submitted by
the client with the now invalid stateids will fail with the server
returning the error NFS4ERR_EXPIRED. Once this error is received,
the client will suitably notify the application that held the lock.
As a courtesy to the client or as an optimization, the server may
continue to hold locks on behalf of a client for which recent
communication has extended beyond the lease period. If the server
receives a lock or I/O request that conflicts with one of these
courtesy locks, the server must free the courtesy lock and grant the
new request.
When a network partition is combined with a server reboot, there are
edge conditions that place requirements on the server in order to
avoid silent data corruption following the server reboot. Two of
these edge conditions are known, and are discussed below.
The first edge condition has the following scenario:
1. Client A acquires a lock.
2. Client A and server experience mutual network partition, such
that client A is unable to renew its lease.
3. Client A's lease expires, so server releases lock.
4. Client B acquires a lock that would have conflicted with that
of Client A.
5. Client B releases the lock.
6. Server reboots.
7. Network partition between client A and server heals.
8. Client A issues a RENEW operation, and gets back a
9. Client A reclaims its lock within the server's grace period.
Thus, at the final step, the server has erroneously granted client
A's lock reclaim. If client B modified the object the lock was
protecting, client A will experience object corruption.
The second known edge condition follows:
1. Client A acquires a lock.
2. Server reboots.
3. Client A and server experience mutual network partition, such
that client A is unable to reclaim its lock within the grace
period.
4. Server's reclaim grace period ends. Client A has no locks
recorded on server.
5. Client B acquires a lock that would have conflicted with that
of Client A.
6. Client B releases the lock.
7. Server reboots a second time.
8. Network partition between client A and server heals.
9. Client A issues a RENEW operation, and gets back a
10. Client A reclaims its lock within the server's grace period.
As with the first edge condition, the final step of the scenario of
the second edge condition has the server erroneously granting client
A's lock reclaim.
Solving the first and second edge conditions requires that the server
either assume after it reboots that an edge condition occurred, and
thus return NFS4ERR_NO_GRACE for all reclaim attempts, or that the
server record some information in stable storage. The amount of information
the server records in stable storage is in inverse proportion to how
harsh the server wants to be whenever the edge conditions occur. The
server that is completely tolerant of all edge conditions will record
in stable storage every lock that is acquired, removing the lock
record from stable storage only when the lock is unlocked by the
client and the lock's lockowner advances the sequence number such
that the lock release is not the last stateful event for the
lockowner's sequence. For the two aforementioned edge conditions,
the harshest a server can be, and still support a grace period for
reclaims, requires that the server record in stable storage some
minimal information. For example, a server implementation could,
for each client, save in stable storage a record containing:
o the client's id string
o a boolean that indicates if the client's lease expired or if there
was administrative intervention (see the section, Server
Revocation of Locks) to revoke a record lock, share reservation,
or delegation.
o a timestamp that is updated the first time after a server boot or
reboot the client acquires record locking, share reservation, or
delegation state on the server. The timestamp need not be updated
on subsequent lock requests until the server reboots.
The server implementation would also record in the stable storage the
timestamps from the two most recent server reboots.
Assuming the above record keeping, for the first edge condition,
after the server reboots, the record that client A's lease expired
means that another client could have acquired a conflicting record
lock, share reservation, or delegation. Hence the server must reject
a reclaim from client A with the error NFS4ERR_NO_GRACE.
For the second edge condition, after the server reboots for a second
time, the record that the client had an unexpired record lock, share
reservation, or delegation established before the server's previous
incarnation means that the server must reject a reclaim from client A
with the error NFS4ERR_NO_GRACE.
Regardless of the level and approach to record keeping, the server
MUST implement one of the following strategies (which apply to
reclaims of share reservations, record locks, and delegations):
1. Reject all reclaims with NFS4ERR_NO_GRACE. This is superharsh,
but necessary if the server does not want to record lock state
in stable storage.
2. Record sufficient state in stable storage such that all known
edge conditions involving server reboot, including the two
noted in this section, are detected. False positives are
acceptable. Note that at this time, it is not known if there
are other edge conditions.
In the event, after a server reboot, the server determines that
there is unrecoverable damage or corruption to the stable
storage, then for all clients and/or locks affected, the server
MUST return NFS4ERR_NO_GRACE.
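One way to read the minimal record keeping described above is sketched below. The record fields mirror the three items listed for each client, and `previous_boot_time` stands for the older of the two most recent reboot timestamps. The names, and the exact comparison used for the second edge condition, are assumptions of this sketch rather than requirements of the protocol.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClientRecord:
    """Hypothetical per-client record kept in stable storage."""
    id_string: str
    lease_expired_or_revoked: bool
    first_state_time: float  # first state acquisition after a server boot


def reclaim_allowed(record: Optional[ClientRecord],
                    previous_boot_time: float,
                    storage_ok: bool = True) -> bool:
    """Return False when the server should answer NFS4ERR_NO_GRACE."""
    if not storage_ok or record is None:
        return False  # damaged or missing records: reject (false positive)
    if record.lease_expired_or_revoked:
        return False  # first edge condition: another client may have locked
    if record.first_state_time < previous_boot_time:
        return False  # second edge condition: state predates previous boot
    return True
```

Rejecting a reclaim that would in fact have been safe is acceptable here, since false positives are explicitly permitted.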
A mandate for the client's handling of the NFS4ERR_NO_GRACE error is
outside the scope of this specification, since the strategies for
such handling are very dependent on the client's operating
environment. However, one potential approach is described below.
When the client receives NFS4ERR_NO_GRACE, it could examine the
change attribute of the objects the client is trying to reclaim state
for, and use that to determine whether to re-establish the state via
normal OPEN or LOCK requests. This is acceptable provided the
client's operating environment allows it. In other words, the client
implementor is advised to document this behavior for users. The
client could also inform the application that its record lock or
share reservations (whether they were delegated or not) have been
lost, such as via a UNIX signal, a GUI pop-up window, etc. See the
section, "Data Caching and Revocation" for a discussion of what the
client should do for dealing with unreclaimed delegations on client
For further discussion of revocation of locks see the section "Server
Revocation of Locks".
8.7. Recovery from a Lock Request Timeout or Abort
In the event a lock request times out, a client may decide to not
retry the request. The client may also abort the request when the
process for which it was issued is terminated (e.g., in UNIX due to a
signal). It is possible though that the server received the request
and acted upon it. This would change the state on the server without
the client being aware of the change. It is paramount that the
client re-synchronize state with server before it attempts any other
operation that takes a seqid and/or a stateid with the same
lock_owner. This is straightforward to do without a special re-
synchronize operation.
Since the server maintains the last lock request and response
received on the lock_owner, for each lock_owner, the client should
cache the last lock request it sent such that the lock request did
not receive a response. From this, the next time the client does a
lock operation for the lock_owner, it can send the cached request, if
there is one, and if the request was one that established state
(e.g., a LOCK or OPEN operation), the server will return the cached
result or, if it never saw the request, perform it. The client can
follow up with a request to remove the state (e.g., a LOCKU or CLOSE
operation). With this approach, the sequencing and stateid
information on the client and server for the given lock_owner will
re-synchronize and in turn the lock state will re-synchronize.
8.8. Server Revocation of Locks
At any point, the server can revoke locks held by a client and the
client must be prepared for this event. When the client detects that
its locks have been or may have been revoked, the client is
responsible for validating the state information between itself and
the server. Validating locking state for the client means that it
must verify or reclaim state for each lock currently held.
The first instance of lock revocation is upon server reboot or re-
initialization. In this instance the client will receive an error
(NFS4ERR_STALE_STATEID or NFS4ERR_STALE_CLIENTID) and the client will
proceed with normal crash recovery as described in the previous
section.
The second lock revocation event is the inability to renew the lease
before expiration. While this is considered a rare or unusual event,
the client must be prepared to recover. Both the server and client
will be able to detect the failure to renew the lease and are capable
of recovering without data corruption. For the server, it tracks the
last renewal event serviced for the client and knows when the lease
will expire. Similarly, the client must track operations which will
renew the lease period. Using the time that each such request was
sent and the time that the corresponding reply was received, the
client should bound the time that the corresponding renewal could
have occurred on the server and thus determine if it is possible that
a lease period expiration could have occurred.
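The client-side bound can be illustrated with a small sketch. Since the renewal took effect on the server at some point between the time the request was sent and the time the reply was received, the conservative assumption is that it took effect at the moment of sending. The function name and parameters are illustrative, and clock drift is ignored.

```python
def lease_may_have_expired(renewal_sent_at, lease_period, now):
    """Conservative client-side check for possible lease expiration.

    renewal_sent_at: client clock time when the last successfully answered
                     renewing request was sent.
    """
    # The server renewed no earlier than the send time, so the lease
    # cannot have expired before renewal_sent_at + lease_period.
    return now >= renewal_sent_at + lease_period
```

When this returns True, the client cannot rule out that the server has already expired the lease and should treat its locks as needing validation.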
The third lock revocation event can occur as a result of
administrative intervention within the lease period. While this is
considered a rare event, it is possible that the server's
administrator has decided to release or revoke a particular lock held
by the client. As a result of revocation, the client will receive an
error of NFS4ERR_ADMIN_REVOKED. In this instance the client may
assume that only the lock_owner's locks have been lost. The client
notifies the lock holder appropriately. The client may not assume
the lease period has been renewed as a result of a failed operation.
When the client determines the lease period may have expired, the
client must mark all locks held for the associated lease as
"unvalidated". This means the client has been unable to re-establish
or confirm the appropriate lock state with the server. As described
in the previous section on crash recovery, there are scenarios in
which the server may grant conflicting locks after the lease period
has expired for a client. When it is possible that the lease period
has expired, the client must validate each lock currently held to
ensure that a conflicting lock has not been granted. The client may
accomplish this task by issuing an I/O request, either a pending I/O
or a zero-length read, specifying the stateid associated with the
lock in question. If the response to the request is success, the
client has validated all of the locks governed by that stateid and
re-established the appropriate state between itself and the server.
If the I/O request is not successful, then one or more of the locks
associated with the stateid was revoked by the server and the client
must notify the owner.
8.9. Share Reservations
A share reservation is a mechanism to control access to a file. It
is a separate and independent mechanism from record locking. When a
client opens a file, it issues an OPEN operation to the server
specifying the type of access required (READ, WRITE, or BOTH) and the
type of access to deny others (deny NONE, READ, WRITE, or BOTH). If
the OPEN fails the client will fail the application's open request.
Pseudo-code definition of the semantics:
    if (request.access == 0)
        return (NFS4ERR_INVAL)
    else if ((request.access & file_state.deny) ||
             (request.deny & file_state.access))
        return (NFS4ERR_DENIED)
This checking of share reservations on OPEN is done with no exception
for an existing OPEN for the same open_owner.
The constants used for the OPEN and OPEN_DOWNGRADE operations for the
access and deny fields are as follows:
const OPEN4_SHARE_ACCESS_READ = 0x00000001;
const OPEN4_SHARE_ACCESS_WRITE = 0x00000002;
const OPEN4_SHARE_ACCESS_BOTH = 0x00000003;
const OPEN4_SHARE_DENY_NONE = 0x00000000;
const OPEN4_SHARE_DENY_READ = 0x00000001;
const OPEN4_SHARE_DENY_WRITE = 0x00000002;
const OPEN4_SHARE_DENY_BOTH = 0x00000003;
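For readers who prefer something executable, the share reservation check can be sketched in Python using the constants above. The function name and the use of strings for status codes are assumptions of this example; `file_access` and `file_deny` stand for the union of the access and deny bits of the existing opens on the file.

```python
OPEN4_SHARE_ACCESS_READ = 0x00000001
OPEN4_SHARE_ACCESS_WRITE = 0x00000002
OPEN4_SHARE_ACCESS_BOTH = 0x00000003
OPEN4_SHARE_DENY_NONE = 0x00000000
OPEN4_SHARE_DENY_READ = 0x00000001
OPEN4_SHARE_DENY_WRITE = 0x00000002
OPEN4_SHARE_DENY_BOTH = 0x00000003


def check_share(req_access, req_deny, file_access, file_deny):
    """Return a status string for an OPEN against existing share state."""
    if req_access == 0:
        return "NFS4ERR_INVAL"   # an OPEN must request some access
    if (req_access & file_deny) or (req_deny & file_access):
        return "NFS4ERR_DENIED"  # conflicts with an existing reservation
    return "NFS4_OK"
```

For instance, a request for WRITE access conflicts with a file whose current state denies WRITE, and a request that denies READ conflicts with a file already open for READ.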
8.10. OPEN/CLOSE Operations
To provide correct share semantics, a client MUST use the OPEN
operation to obtain the initial filehandle and indicate the desired
access and what if any access to deny. Even if the client intends to
use a stateid of all 0's or all 1's, it must still obtain the
filehandle for the regular file with the OPEN operation so the
appropriate share semantics can be applied. For clients that do not
have a deny mode built into their open programming interfaces, deny
equal to NONE should be used.
The OPEN operation with the CREATE flag also subsumes the CREATE
operation for regular files as used in previous versions of the NFS
protocol. This allows a create with a share to be done atomically.
The CLOSE operation removes all share reservations held by the
lock_owner on that file. If record locks are held, the client SHOULD
release all locks before issuing a CLOSE. The server MAY free all
outstanding locks on CLOSE but some servers may not support the CLOSE
of a file that still has record locks held. The server MUST return
failure, NFS4ERR_LOCKS_HELD, if any locks would exist after the
CLOSE.
The LOOKUP operation will return a filehandle without establishing
any lock state on the server. Without a valid stateid, the server
will assume the client has the least access. For example, a file
opened with deny READ/WRITE cannot be accessed using a filehandle
obtained through LOOKUP because it would not have a valid stateid
(i.e., using a stateid of all bits 0 or all bits 1).
8.10.1. Close and Retention of State Information
Since a CLOSE operation requests deallocation of a stateid, dealing
with retransmission of the CLOSE may pose special difficulties,
since the state information, which normally would be used to
determine the state of the open file being designated, might be
deallocated, resulting in an NFS4ERR_BAD_STATEID error.
Servers may deal with this problem in a number of ways. To provide
the greatest degree of assurance that the protocol is being used
properly, a server should, rather than deallocate the stateid, mark
it as close-pending, and retain the stateid with this status, until
later deallocation. In this way, a retransmitted CLOSE can be
recognized since the stateid points to state information with this
distinctive status, so that it can be handled without error.
When adopting this strategy, a server should retain the state
information until the earliest of:
o Another validly sequenced request for the same lockowner, that is
not a retransmission.
o The time that a lockowner is freed by the server due to a period
with no activity.
o All locks for the client are freed as a result of a SETCLIENTID.
Servers may avoid this complexity, at the cost of less complete
protocol error checking, by simply responding NFS4_OK in the event of
a CLOSE for a deallocated stateid, on the assumption that this case
must be caused by a retransmitted close. When adopting this
approach, it is desirable to at least log an error when returning a
no-error indication in this situation. If the server maintains a
reply-cache mechanism, it can verify the CLOSE is indeed a
retransmission and avoid error logging in most cases.
8.11. Open Upgrade and Downgrade
When an OPEN is done for a file and the lockowner for which the open
is being done already has the file open, the result is to upgrade the
open file status maintained on the server to include the access and
deny bits specified by the new OPEN as well as those for the existing
OPEN. The result is that there is one open file, as far as the
protocol is concerned, and it includes the union of the access and
deny bits for all of the OPEN requests completed. Only a single
CLOSE will be done to reset the effects of both OPENs. Note that the
client, when issuing the OPEN, may not know that the same file is in
fact being opened. The above only applies if both OPENs result in
the OPENed object being designated by the same filehandle.
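The union of access and deny bits described above amounts to a bitwise OR of the two opens. The sketch below represents an open as an (access, deny) pair; the function name and that representation are assumptions made for illustration.

```python
def upgrade_open(existing, new):
    """Merge a second OPEN by the same lockowner on the same filehandle.

    existing, new: (access, deny) bit pairs as used by OPEN.
    Returns the (access, deny) pair for the single resulting open file.
    """
    access = existing[0] | new[0]
    deny = existing[1] | new[1]
    return (access, deny)
```

For example, an existing open with READ access and deny NONE, merged with a new open requesting WRITE access and deny WRITE, yields an open with access BOTH and deny WRITE.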
When the server chooses to export multiple filehandles corresponding
to the same file object and returns different filehandles on two
different OPENs of the same file object, the server MUST NOT "OR"
together the access and deny bits and coalesce the two open files.
Instead the server must maintain separate OPENs with separate
stateids and will require separate CLOSEs to free them.
When multiple open files on the client are merged into a single open
file object on the server, the close of one of the open files (on the
client) may necessitate change of the access and deny status of the
open file on the server. This is because the union of the access and
deny bits for the remaining opens may be smaller (i.e., a proper
subset) than previously. The OPEN_DOWNGRADE operation is used to
make the necessary change and the client should use it to update the
server so that share reservation requests by other clients are
handled properly.
8.12. Short and Long Leases
When determining the time period for the server lease, the usual
lease tradeoffs apply. Short leases are good for fast server
recovery at a cost of increased RENEW or READ (with zero length)
requests. Longer leases are certainly kinder and gentler to servers
trying to handle very large numbers of clients. The number of RENEW
requests drops in inverse proportion to the lease time. The
disadvantages of long leases are slower recovery after server failure
(the server must wait for the leases to expire and the grace period to
elapse before granting new lock requests) and increased file
contention (if a client fails to transmit an unlock request, then the
server must wait for lease expiration before granting new locks).
Long leases are usable if the server is able to store lease state in
non-volatile memory. Upon recovery, the server can reconstruct the
lease state from its non-volatile memory and continue operation with
its clients and therefore long leases would not be an issue.
8.13. Clocks, Propagation Delay, and Calculating Lease Expiration
To avoid the need for synchronized clocks, lease times are granted by
the server as a time delta. However, there is a requirement that the
client and server clocks do not drift excessively over the duration
of the lock. There is also the issue of propagation delay across the
network which could easily be several hundred milliseconds as well as
the possibility that requests will be lost and need to be
retransmitted.
To take propagation delay into account, the client should subtract it
from lease times (e.g., if the client estimates the one-way
propagation delay as 200 msec, then it can assume that the lease is
already 200 msec old when it gets it). In addition, it will take
another 200 msec to get a response back to the server. So the client
must send a lock renewal or write data back to the server 400 msec
before the lease would expire.
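The arithmetic above can be written out as a short sketch. Times are integers in milliseconds on the client's clock; the function name is hypothetical, and the estimate of one-way delay is supplied by the caller.

```python
def latest_renewal_send_time(reply_received_at_ms, lease_period_ms,
                             one_way_delay_ms):
    """Latest client time at which a renewal should be sent.

    The lease is assumed to be one one-way delay old on receipt, and the
    renewal needs another one-way delay to reach the server.
    """
    return reply_received_at_ms + lease_period_ms - 2 * one_way_delay_ms
```

With a 90 second lease received at time 0 and an estimated one-way delay of 200 msec, the renewal should be sent no later than 89,600 msec, that is, 400 msec before the nominal expiration.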
The server's lease period configuration should take into account the
network distance of the clients that will be accessing the server's
resources. It is expected that the lease period will take into
account the network propagation delays and other network delay
factors for the client population. Since the protocol does not allow
for an automatic method to determine an appropriate lease period, the
server's administrator may have to tune the lease period.
8.14. Migration, Replication and State
When responsibility for handling a given file system is transferred
to a new server (migration) or the client chooses to use an alternate
server (e.g., in response to server unresponsiveness) in the context
of file system replication, the appropriate handling of state shared
between the client and server (i.e., locks, leases, stateids, and
clientids) is as described below. The handling differs between
migration and replication. For related discussion of file server
state and its recovery, see the sections under "File Locking and
Share Reservations".
If a server replica or a server immigrating a filesystem agrees to, or
is expected to, accept opaque values from the client that originated
from another server, then it is a wise implementation practice for
the servers to encode the "opaque" values in network byte order.
This way, servers acting as replicas or immigrating filesystems will
be able to parse values like stateids, directory cookies,
filehandles, etc. even if their native byte order is different from
other servers cooperating in the replication and migration of the
filesystem.
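As a minimal sketch of the practice described above, a server could
serialize the fields of an "opaque" value in network byte order. The
field layout here (a 32-bit boot sequence followed by a 64-bit counter)
is purely hypothetical; the protocol does not define the contents of
these values.

```python
import struct

# Hypothetical layout, for illustration only: 32-bit boot sequence
# followed by a 64-bit counter. ">" selects big-endian (network) byte
# order with no padding, so every cooperating server parses it alike.
_LAYOUT = ">IQ"

def encode_opaque(boot_seq, counter):
    """Pack the two fields into a 12-byte opaque value."""
    return struct.pack(_LAYOUT, boot_seq, counter)

def decode_opaque(blob):
    """Recover (boot_seq, counter) regardless of native byte order."""
    return struct.unpack(_LAYOUT, blob)
```

Because the byte order is fixed by the format string and not by the
host, a replica with a different native byte order decodes the same
field values.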
8.14.1. Migration and State
In the case of migration, the servers involved in the migration of a
filesystem SHOULD transfer all server state from the original to the
new server. This must be done in a way that is transparent to the
client. This state transfer will ease the client's transition when a
filesystem migration occurs. If the servers are successful in
transferring all state, the client will continue to use stateids
assigned by the original server. Therefore the new server must
recognize these stateids as valid. This holds true for the clientid
as well. Since responsibility for an entire filesystem is
transferred with a migration event, there is no possibility that
conflicts will arise on the new server as a result of the transfer of
locks.
As part of the transfer of information between servers, leases would
be transferred as well. The leases being transferred to the new
server will typically have a different expiration time from those for
the same client, previously on the old server. To maintain the
property that all leases on a given server for a given client expire
at the same time, the server should advance the expiration time to
the later of the leases being transferred or the leases already
present. This allows the client to maintain lease renewal of both
classes without special effort.
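The rule of advancing to the later expiration time might look like the
following sketch. The dictionaries mapping a client identifier to an
expiration time are an assumed representation, not something the
protocol prescribes.

```python
def merge_transferred_leases(present, transferred):
    """Return per-client expiration times after a migration.

    present:     client id -> expiry already held on the new server
    transferred: client id -> expiry arriving from the old server
    (both mappings are hypothetical bookkeeping structures)
    """
    merged = dict(present)
    for client, expiry in transferred.items():
        # Keep a single expiry per client: the later of the two.
        merged[client] = max(expiry, merged.get(client, expiry))
    return merged
```

A client with leases in both sets ends up with one expiration time, so
a single renewal schedule covers both classes of lease.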
The servers may choose not to transfer the state information upon
migration. However, this choice is discouraged. In this case, when
the client presents state information from the original server, the
client must be prepared to receive either NFS4ERR_STALE_CLIENTID or
NFS4ERR_STALE_STATEID from the new server. The client should then
recover its state information as it normally would in response to a
server failure. The new server must take care to allow for the
recovery of state information as it would in the event of server
restart.
8.14.2. Replication and State
Since client switch-over in the case of replication is not under
server control, the handling of state is different. In this case,
leases, stateids and clientids do not have validity across a
transition from one server to another. The client must re-establish
its locks on the new server. This can be compared to the re-
establishment of locks by means of reclaim-type requests after a
server reboot. The difference is that the server has no provision to
distinguish requests reclaiming locks from those obtaining new locks
or to defer the latter. Thus, a client re-establishing a lock on the
new server (by means of a LOCK or OPEN request), may have the
requests denied due to a conflicting lock. Since replication is
intended for read-only use of filesystems, such denial of locks
should not pose large difficulties in practice. When an attempt to
re-establish a lock on a new server is denied, the client should
treat the situation as if its original lock had been revoked.
8.14.3. Notification of Migrated Lease
In the case of lease renewal, the client may not be submitting
requests for a filesystem that has been migrated to another server.
This can occur because of the implicit lease renewal mechanism. The
client renews leases for all filesystems when submitting a request to
any one filesystem at the server.
In order for the client to schedule renewal of leases that may have
been relocated to the new server, the client must find out about
lease relocation before those leases expire. To accomplish this, all
operations which implicitly renew leases for a client (i.e., OPEN,
CLOSE, READ, WRITE, RENEW, LOCK, LOCKT, LOCKU), will return the error
NFS4ERR_LEASE_MOVED if responsibility for any of the leases to be
renewed has been transferred to a new server. This condition will
continue until the client receives an NFS4ERR_MOVED error and the
server receives the subsequent GETATTR(fs_locations) for an access to
each filesystem for which a lease has been moved to a new server.
When a client receives an NFS4ERR_LEASE_MOVED error, it should
perform an operation on each filesystem associated with the server in
question. When the client receives an NFS4ERR_MOVED error, the
client can follow the normal process to obtain the new server
information (through the fs_locations attribute) and perform renewal
of those leases on the new server. If the server has not had state
transferred to it transparently, the client will receive either
NFS4ERR_STALE_CLIENTID or NFS4ERR_STALE_STATEID from the new server,
as described above, and the client can then recover state information
as it does in the event of server failure.
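The client-side reaction to NFS4ERR_LEASE_MOVED can be sketched as
below. The two callables, probe and fetch_fs_locations, are
hypothetical stand-ins for whatever the client uses to issue an
operation on a filesystem and to read the fs_locations attribute; the
status values are plain strings here and not the protocol's numeric
codes.

```python
NFS4_OK = "NFS4_OK"              # stand-in status values for this sketch
NFS4ERR_MOVED = "NFS4ERR_MOVED"

def handle_lease_moved(filesystems, probe, fetch_fs_locations):
    """After NFS4ERR_LEASE_MOVED, touch every filesystem on the server.

    probe(fs) performs some operation on fs and returns its status;
    fetch_fs_locations(fs) returns the new location for a moved fs.
    Returns a mapping of moved filesystem -> new location, which the
    caller can then use to renew leases on the new server.
    """
    relocated = {}
    for fs in filesystems:
        if probe(fs) == NFS4ERR_MOVED:
            relocated[fs] = fetch_fs_locations(fs)
    return relocated
```

Filesystems that answer normally are left alone; only those reporting
NFS4ERR_MOVED trigger a lookup of the new location.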
8.14.4. Migration and the Lease_time Attribute
In order that the client may appropriately manage its leases in the
case of migration, the destination server must establish proper
values for the lease_time attribute.
When state is transferred transparently, that state should include
the correct value of the lease_time attribute. The lease_time
attribute on the destination server must never be less than that on
the source since this would result in premature expiration of leases
granted by the source server. Upon migration in which state is
transferred transparently, the client is under no obligation to re-
fetch the lease_time attribute and may continue to use the value
previously fetched (on the source server).
If state has not been transferred transparently (i.e., the client
sees a real or simulated server reboot), the client should fetch the
value of lease_time on the new (i.e., destination) server, and use it
for subsequent locking requests. However, the server must respect a
grace period at least as long as the lease_time on the source server,
in order to ensure that clients have ample time to reclaim their
locks before potentially conflicting non-reclaimed locks are granted.
The means by which the new server obtains the value of lease_time on
the old server is left to the server implementations. It is not
specified by the NFS version 4 protocol.
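The two constraints on the destination server can be summarized in a
short sketch. The function and parameter names are assumptions, and how
the destination learns the source's lease_time is, as noted above,
outside the protocol.

```python
def destination_timers(source_lease, dest_lease, dest_grace, state_transferred):
    """Return (lease_time, grace_period) the destination should honor.

    All times are in seconds; the names are hypothetical.
    """
    if state_transferred:
        # Transparent transfer: lease_time must not shrink, or leases
        # granted by the source would expire prematurely.
        return max(dest_lease, source_lease), dest_grace
    # No transfer: clients see a reboot, so the grace period must be
    # at least as long as the lease_time on the source.
    return dest_lease, max(dest_grace, source_lease)
```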
9. Client-Side Caching
Client-side caching of data, of file attributes, and of file names is
essential to providing good performance with the NFS protocol.
Providing distributed cache coherence is a difficult problem and
previous versions of the NFS protocol have not attempted it.
Instead, several NFS client implementation techniques have been used
to reduce the problems that a lack of coherence poses for users.
These techniques have not been clearly defined by earlier protocol
specifications and it is often unclear what is valid or invalid
client behavior.
The NFS version 4 protocol uses many techniques similar to those that
have been used in previous protocol versions. The NFS version 4
protocol does not provide distributed cache coherence. However, it
defines a more limited set of caching guarantees to allow locks and
share reservations to be used without destructive interference from
client side caching.
In addition, the NFS version 4 protocol introduces a delegation
mechanism which allows many decisions normally made by the server to
be made locally by clients. This mechanism provides efficient
support of the common cases where sharing is infrequent or where
sharing is read-only.
9.1. Performance Challenges for Client-Side Caching
Caching techniques used in previous versions of the NFS protocol have
been successful in providing good performance. However, several
scalability challenges can arise when those techniques are used with
very large numbers of clients. This is particularly true when
clients are geographically distributed, which typically increases
the latency for cache revalidation requests.
The previous versions of the NFS protocol repeat their file data
cache validation requests at the time the file is opened. This
behavior can have serious performance drawbacks. A common case is
one in which a file is accessed by only a single client, so that
sharing is infrequent.
In this case, repeated reference to the server to find that no
conflicts exist is expensive. A better option with regards to
performance is to allow a client that repeatedly opens a file to do
so without reference to the server. This is done until potentially
conflicting operations from another client actually occur.
A similar situation arises in connection with file locking. Sending
file lock and unlock requests to the server as well as the read and
write requests necessary to make data caching consistent with the
locking semantics (see the section "Data Caching and File Locking")
can severely limit performance. When locking is used to provide
protection against infrequent conflicts, a large penalty is incurred.
This penalty may discourage the use of file locking by applications.
The NFS version 4 protocol provides more aggressive caching
strategies with the following design goals:
o Compatibility with a large range of server semantics.
o Provide the same caching benefits as previous versions of the NFS
protocol when unable to provide the more aggressive model.
o Requirements for aggressive caching are organized so that a large
portion of the benefit can be obtained even when not all of the
requirements can be met.
The appropriate requirements for the server are discussed in later
sections in which specific forms of caching are covered (see the
section "Open Delegation").
9.2. Delegation and Callbacks
Recallable delegation of server responsibilities for a file to a
client improves performance by avoiding repeated requests to the
server in the absence of inter-client conflict. With the use of a
"callback" RPC from server to client, a server recalls delegated
responsibilities when another client engages in sharing of a
delegated file.
A delegation is passed from the server to the client, specifying the
object of the delegation and the type of delegation. There are
different types of delegations but each type contains a stateid to be
used to represent the delegation when performing operations that
depend on the delegation. This stateid is similar to those
associated with locks and share reservations but differs in that the
stateid for a delegation is associated with a clientid and may be
used on behalf of all the open_owners for the given client. A
delegation is made to the client as a whole and not to any specific
process or thread of control within it.
Because callback RPCs may not work in all environments (due to
firewalls, for example), correct protocol operation does not depend
on them. Preliminary testing of callback functionality by means of a
CB_NULL procedure determines whether callbacks can be supported. The
CB_NULL procedure checks the continuity of the callback path. A
server makes a preliminary assessment of callback availability to a
given client and avoids delegating responsibilities until it has
determined that callbacks are supported. Because the granting of a
delegation is always conditional upon the absence of conflicting
access, clients must not assume that a delegation will be granted and
they must always be prepared for OPENs to be processed without any
delegations being granted.
Once granted, a delegation behaves in most ways like a lock. There
is an associated lease that is subject to renewal together with all
of the other leases held by that client.
Unlike locks, an operation by a second client to a delegated file
will cause the server to recall a delegation through a callback.
On recall, the client holding the delegation must flush modified
state (such as modified data) to the server and return the
delegation. The conflicting request will not receive a response
until the recall is complete. The recall is considered complete when
the client returns the delegation or the server times out on the
recall and revokes the delegation as a result of the timeout.
Following the resolution of the recall, the server has the
information necessary to grant or deny the second client's request.
At the time the client receives a delegation recall, it may have
substantial state that needs to be flushed to the server. Therefore,
the server should allow sufficient time for the delegation to be
returned since it may involve numerous RPCs to the server. If the
server is able to determine that the client is diligently flushing
state to the server as a result of the recall, the server may extend
the usual time allowed for a recall. However, the time allowed for
recall completion should not be unbounded.
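One possible way for a server to decide when a recall is resolved is
sketched below. The extension policy (each observed flush restarts the
base timeout, capped by a hard limit) is an assumption made for
illustration; the protocol only requires that the allowed time be
bounded.

```python
def recall_outcome(events, base_timeout, max_timeout):
    """Resolve a delegation recall.

    events: list of (time, kind) sorted by time, measured from the
            moment the recall was sent; kind is "flush" or "return".
    Returns ("returned", t) or ("revoked", deadline).
    """
    deadline = base_timeout
    for t, kind in events:
        if t > deadline:
            break  # nothing arrived in time; the recall has timed out
        if kind == "return":
            return ("returned", t)
        if kind == "flush":
            # The client is making progress: extend, but never past
            # the hard bound.
            deadline = min(max_timeout, t + base_timeout)
    return ("revoked", deadline)
```

A client that keeps flushing gets more time, yet the recall is always
resolved by max_timeout at the latest, after which the conflicting
request can be answered.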
An example of this is when responsibility to mediate opens on a given
file is delegated to a client (see the section "Open Delegation").
The server will not know what opens are in effect on the client.
Without this knowledge the server will be unable to determine if the
access and deny state for the file allows any particular open until
the delegation for the file has been returned.
A client failure or a network partition can result in failure to
respond to a recall callback. In this case, the server will revoke
the delegation which in turn will render useless any modified state
still on the client.
9.2.1. Delegation Recovery
There are three situations that delegation recovery must deal with:
o Client reboot or restart
o Server reboot or restart
o Network partition (full or callback-only)
In the event the client reboots or restarts, the failure to renew
leases will result in the revocation of record locks and share
reservations. Delegations, however, may be treated a bit
differently.
There will be situations in which delegations will need to be
reestablished after a client reboots or restarts. The reason for
this is the client may have file data stored locally and this data
was associated with the previously held delegations. The client will
need to reestablish the appropriate file state on the server.
To allow for this type of client recovery, the server MAY extend the
period for delegation recovery beyond the typical lease expiration
period. This implies that requests from other clients that conflict
with these delegations will need to wait. Because the normal recall
process may require significant time for the client to flush changed
state to the server, other clients need to be prepared for delays that
occur because of a conflicting delegation. This longer interval
would increase the window for clients to reboot and consult stable
storage so that the delegations can be reclaimed. For open
delegations, such delegations are reclaimed using OPEN with a claim
type of CLAIM_DELEGATE_PREV. (See the sections on "Data Caching and
Revocation" and "Operation 18: OPEN" for discussion of open
delegation and the details of OPEN respectively).
A server MAY support a claim type of CLAIM_DELEGATE_PREV, but if it
does, it MUST NOT remove delegations upon SETCLIENTID_CONFIRM, and
instead MUST, for a period of time no less than that of the value of
the lease_time attribute, maintain the client's delegations to allow
time for the client to issue CLAIM_DELEGATE_PREV requests. The
server that supports CLAIM_DELEGATE_PREV MUST support the DELEGPURGE
operation.
When the server reboots or restarts, delegations are reclaimed (using
the OPEN operation with CLAIM_PREVIOUS) in a similar fashion to
record locks and share reservations. However, there is a slight
semantic difference. In the normal case if the server decides that a
delegation should not be granted, it performs the requested action
(e.g., OPEN) without granting any delegation. For reclaim, the
server grants the delegation but a special designation is applied so
that the client treats the delegation as having been granted but
recalled by the server. Because of this, the client has the duty to
write all modified state to the server and then return the
delegation. This process of handling delegation reclaim reconciles
three principles of the NFS version 4 protocol:
o Upon reclaim, a client reporting resources assigned to it by an
earlier server instance must be granted those resources.
o The server has unquestionable authority to determine whether
delegations are to be granted and, once granted, whether they are
to be continued.
o The use of callbacks is not to be depended upon until the client
has proven its ability to receive them.
When a network partition occurs, delegations are subject to freeing
by the server when the lease renewal period expires. This is similar
to the behavior for locks and share reservations. For delegations,
however, the server may extend the period in which conflicting
requests are held off. Eventually the occurrence of a conflicting
request from another client will cause revocation of the delegation.
A loss of the callback path (e.g., by later network configuration
change) will have the same effect. A recall request will fail and
revocation of the delegation will result.
A client normally finds out about revocation of a delegation when it
uses a stateid associated with a delegation and receives the error
NFS4ERR_EXPIRED. It also may find out about delegation revocation
after a client reboot when it attempts to reclaim a delegation and
receives that same error. Note that in the case of a revoked write
open delegation, there are issues because data may have been modified
by the client whose delegation is revoked and separately by other
clients. See the section "Revocation Recovery for Write Open
Delegation" for a discussion of such issues. Note also that when
delegations are revoked, information about the revoked delegation
will be written by the server to stable storage (as described in the
section "Crash Recovery"). This is done to deal with the case in
which a server reboots after revoking a delegation but before the
client holding the revoked delegation is notified about the
revocation.
9.3. Data Caching
When applications share access to a set of files, they need to be
implemented so as to take account of the possibility of conflicting
access by another application. This is true whether the applications
in question execute on different clients or reside on the same
client.
Share reservations and record locks are the facilities the NFS
version 4 protocol provides to allow applications to coordinate
access by providing mutual exclusion facilities. The NFS version 4
protocol's data caching must be implemented such that it does not
invalidate the assumptions that those using these facilities depend
upon.
9.3.1. Data Caching and OPENs
In order to avoid invalidating the sharing assumptions that
applications rely on, NFS version 4 clients should not provide cached
data to applications or modify it on behalf of an application when it
would not be valid to obtain or modify that same data via a READ or
WRITE operation.
Furthermore, in the absence of open delegation (see the section "Open
Delegation") two additional rules apply. Note that these rules are
obeyed in practice by many NFS version 2 and version 3 clients.
o First, cached data present on a client must be revalidated after
doing an OPEN. Revalidating means that the client fetches the
change attribute from the server, compares it with the cached
change attribute, and if different, declares the cached data (as
well as the cached attributes) as invalid. This is to ensure that
the data for the OPENed file is still correctly reflected in the
client's cache. This validation must be done at least when the
client's OPEN operation includes DENY=WRITE or BOTH thus
terminating a period in which other clients may have had the
opportunity to open the file with WRITE access. Clients may
choose to do the revalidation more often (i.e., at OPENs
specifying DENY=NONE) to parallel the NFS version 3 protocol's
practice for the benefit of users assuming this degree of cache
revalidation.
Since the change attribute is updated for data and metadata
modifications, some client implementors may be tempted to use the
time_modify attribute and not change to validate cached data, so
that metadata changes do not spuriously invalidate clean data.
The implementor is cautioned against this approach. The change
attribute is guaranteed to change for each update to the file,
whereas time_modify is guaranteed to change only at the
granularity of the time_delta attribute. Use by the client's data
cache validation logic of time_modify and not change runs the risk
of the client incorrectly marking stale data as valid.
o Second, modified data must be flushed to the server before closing
a file OPENed for write. This is complementary to the first rule.
If the data is not flushed at CLOSE, the revalidation done after
a client OPENs a file is unable to achieve its purpose. The other
aspect to flushing the data before close is that the data must be
committed to stable storage, at the server, before the CLOSE
operation is requested by the client. In the case of a server
reboot or restart and a CLOSEd file, it may not be possible to
retransmit the data to be written to the file. Hence, this
requirement.
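The revalidation step in the first rule above can be sketched as
follows. The CacheEntry structure is a hypothetical client-side record;
only the comparison of the change attribute comes from the text.

```python
from dataclasses import dataclass

@dataclass
class CacheEntry:
    change: int               # change attribute seen when data was cached
    data_valid: bool = True
    attrs_valid: bool = True

def revalidate_after_open(entry, server_change):
    """Compare the cached change attribute with the value fetched from
    the server after OPEN; invalidate data and attributes on mismatch."""
    if entry.change != server_change:
        entry.data_valid = False
        entry.attrs_valid = False
    return entry.data_valid
```

The comparison is for inequality only. The sketch deliberately uses the
change attribute and not time_modify, for the reason given in the first
rule.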
9.3.2. Data Caching and File Locking
For those applications that choose to use file locking instead of
share reservations to exclude inconsistent file access, there is an
analogous set of constraints that apply to client side data caching.
These rules are effective only if the file locking is used in a way
that corresponds to the actual READ and WRITE operations executed,
as opposed to file locking that is based on pure convention. For
example, it is possible to manipulate
a two-megabyte file by dividing the file into two one-megabyte
regions and protecting access to the two regions by file locks on
bytes zero and one. A lock for write on byte zero of the file would
represent the right to do READ and WRITE operations on the first
region. A lock for write on byte one of the file would represent the
right to do READ and WRITE operations on the second region. As long
as all applications manipulating the file obey this convention, they
will work on a local filesystem. However, they may not work with the
NFS version 4 protocol unless clients refrain from data caching.
The rules for data caching in the file locking environment are:
o First, when a client obtains a file lock for a particular region,
the data cache corresponding to that region (if any cached data
exists) must be revalidated. If the change attribute indicates
that the file may have been updated since the cached data was
obtained, the client must flush or invalidate the cached data for
the newly locked region. A client might choose to invalidate all
of the non-modified cached data that it has for the file, but the only
requirement for correct operation is to invalidate all of the data
in the newly locked region.
o Second, before releasing a write lock for a region, all modified
data for that region must be flushed to the server. The modified
data must also be written to stable storage.
Note that flushing data to the server and the invalidation of cached
data must reflect the actual byte ranges locked or unlocked.
Rounding these up or down to reflect client cache block boundaries
will cause problems if not carefully done. For example, writing a
modified block when only half of that block is within an area being
unlocked may cause invalid modification to the region outside the
unlocked area. This, in turn, may be part of a region locked by
another client. Clients can avoid this situation by synchronously
performing portions of write operations that overlap that portion
(initial or final) that is not a full block. Similarly, invalidating
a locked area which is not an integral number of full buffer blocks
would require the client to read one or two partial blocks from the
server if the revalidation procedure shows that the data which the
client possesses may not be valid.
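A client that caches in fixed-size blocks could clip its writes to the
locked range as in the sketch below. The block-indexed cache and the
half-open [lock_start, lock_end) range are assumptions of the example.

```python
def flush_ranges(dirty_blocks, block_size, lock_start, lock_end):
    """Return (offset, length) writes for the dirty cache blocks,
    clipped so that no byte outside [lock_start, lock_end) is written."""
    writes = []
    for block in sorted(dirty_blocks):
        start = max(block * block_size, lock_start)
        end = min((block + 1) * block_size, lock_end)
        if start < end:  # skip blocks entirely outside the lock
            writes.append((start, end - start))
    return writes
```

For a lock covering bytes 1000 through 5999 and 4096-byte blocks, the
first block is written only from offset 1000 and the second only up to
offset 6000, so data belonging to a neighboring lock is never touched.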
The data that is written to the server as a prerequisite to the
unlocking of a region must be written, at the server, to stable
storage. The client may accomplish this either with synchronous
writes or by following asynchronous writes with a COMMIT operation.
This is required because retransmission of the modified data after a
server reboot might conflict with a lock held by another client.
A client implementation may choose to accommodate applications which
use record locking in non-standard ways (e.g., using a record lock as
a global semaphore) by flushing to the server more data upon a LOCKU
than is covered by the locked range. This may include modified data
within files other than the one for which the unlocks are being done.
In such cases, the client must not interfere with applications whose
READs and WRITEs are being done only within the bounds of record
locks which the application holds. For example, an application locks
a single byte of a file and proceeds to write that single byte. A
client that chose to handle a LOCKU by flushing all modified data to
the server could validly write that single byte in response to an
unrelated unlock. However, it would not be valid to write the entire
block in which that single written byte was located since it includes
an area that is not locked and might be locked by another client.
Client implementations can avoid this problem by dividing files with
modified data into those for which all modifications are done to
areas covered by an appropriate record lock and those for which there
are modifications not covered by a record lock. Any writes done for
the former class of files must not include areas not locked and thus
not modified on the client.
9.3.3. Data Caching and Mandatory File Locking
Client side data caching needs to respect mandatory file locking when
it is in effect. The presence of mandatory file locking for a given
file is indicated when the client gets back NFS4ERR_LOCKED from a
READ or WRITE on a file it has an appropriate share reservation for.
When mandatory locking is in effect for a file, the client must check
for an appropriate file lock for data being read or written. If a
lock exists for the range being read or written, the client may
satisfy the request using the client's validated cache. If an
appropriate file lock is not held for the range of the read or write,
the read or write request must not be satisfied by the client's cache
and the request must be sent to the server for processing. When a
read or write request partially overlaps a locked region, the request
should be subdivided into multiple pieces with each region (locked or
not) treated appropriately.
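The subdivision of a partially locked request might be done as in the
following sketch. It assumes the client tracks the locks it holds as
non-overlapping half-open (start, end) byte ranges; that representation
is an assumption of the example.

```python
def split_by_locks(req_start, req_end, held_locks):
    """Split [req_start, req_end) into (start, end, may_use_cache) pieces.

    held_locks: non-overlapping (start, end) ranges locked by this client.
    Pieces under a lock may be satisfied from the validated cache; the
    remaining pieces must be sent to the server.
    """
    pieces = []
    pos = req_start
    for lock_start, lock_end in sorted(held_locks):
        if lock_end <= pos or lock_start >= req_end:
            continue  # this lock does not touch the remaining request
        if lock_start > pos:
            pieces.append((pos, lock_start, False))
            pos = lock_start
        covered_end = min(lock_end, req_end)
        pieces.append((pos, covered_end, True))
        pos = covered_end
    if pos < req_end:
        pieces.append((pos, req_end, False))
    return pieces
```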
9.3.4. Data Caching and File Identity
When clients cache data, the file data needs to be organized
according to the filesystem object to which the data belongs. For
NFS version 3 clients, the typical practice has been to assume for
the purpose of caching that distinct filehandles represent distinct
filesystem objects. The client then has the choice to organize and
maintain the data cache on this basis.
In the NFS version 4 protocol, there is now the possibility to have
significant deviations from a "one filehandle per object" model
because a filehandle may be constructed on the basis of the object's
pathname. Therefore, clients need a reliable method to determine if
two filehandles designate the same filesystem object. If clients
were simply to assume that all distinct filehandles denote distinct
objects and proceed to do data caching on this basis, caching
inconsistencies would arise between the distinct client side objects
which mapped to the same server side object.
By providing a method to differentiate filehandles, the NFS version 4
protocol alleviates a potential functional regression in comparison
with the NFS version 3 protocol. Without this method, caching
inconsistencies within the same client could occur and this has not
been present in previous versions of the NFS protocol. Note that it
is possible to have such inconsistencies with applications executing
on multiple clients but that is not the issue being addressed here.
For the purposes of data caching, the following steps allow an NFS
version 4 client to determine whether two distinct filehandles denote
the same server side object:
o If GETATTR directed to two filehandles returns different values of
the fsid attribute, then the filehandles represent distinct
objects.
o If GETATTR for any file with an fsid that matches the fsid of the
two filehandles in question returns a unique_handles attribute
with a value of TRUE, then the two objects are distinct.
o If GETATTR directed to the two filehandles does not return the
fileid attribute for both of the handles, then it cannot be
determined whether the two objects are the same. Therefore,
operations which depend on that knowledge (e.g., client side data
caching) cannot be done reliably.
o If GETATTR directed to the two filehandles returns different
values for the fileid attribute, then they are distinct objects.
o Otherwise they are the same object.
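The steps above can be sketched as a single decision function. The
dictionary representation of GETATTR results and the three-valued
return are assumptions of the example; the inputs are taken to be
attributes for two filehandles already known to differ.

```python
def same_object(attrs_a, attrs_b, unique_handles):
    """Decide whether two distinct filehandles denote the same object.

    attrs_a, attrs_b: GETATTR results as dicts with "fsid" and,
                      if the server returned it, "fileid".
    unique_handles:   value of the unique_handles attribute for the fsid.
    Returns True (same), False (distinct), or None (cannot be determined).
    """
    if attrs_a["fsid"] != attrs_b["fsid"]:
        return False
    if unique_handles:
        return False
    if "fileid" not in attrs_a or "fileid" not in attrs_b:
        return None  # caching that depends on identity is not reliable
    if attrs_a["fileid"] != attrs_b["fileid"]:
        return False
    return True
```

A result of None corresponds to the case in which client side data
caching cannot be done reliably for the two filehandles.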