Tech-invite3GPPspaceIETF RFCsSIP
in Index   Prev   Next

RFC 3530

Network File System (NFS) version 4 Protocol

Pages: 275
Obsoletes:  3010
Obsoleted by:  7530
Part 4 of 8 – Pages 76 to 102
First   Prev   Next

ToP   noToC   RFC3530 - Page 76   prevText

8.2. Lock Ranges

The protocol allows a lock owner to request a lock with a byte range and then either upgrade or unlock a sub-range of the initial lock. It is expected that this will be an uncommon type of request. In any case, servers or server filesystems may not be able to support sub- range lock semantics. In the event that a server receives a locking request that represents a sub-range of current locking state for the lock owner, the server is allowed to return the error NFS4ERR_LOCK_RANGE to signify that it does not support sub-range lock operations. Therefore, the client should be prepared to receive this error and, if appropriate, report the error to the requesting application. The client is discouraged from combining multiple independent locking ranges that happen to be adjacent into a single request since the server may not support sub-range requests and for reasons related to the recovery of file locking state in the event of server failure. As discussed in the section "Server Failure and Recovery" below, the server may employ certain optimizations during recovery that work effectively only when the client's behavior during lock recovery is similar to the client's locking behavior prior to server failure.

8.3. Upgrading and Downgrading Locks

If a client has a write lock on a record, it can request an atomic downgrade of the lock to a read lock via the LOCK request, by setting the type to READ_LT. If the server supports atomic downgrade, the request will succeed. If not, it will return NFS4ERR_LOCK_NOTSUPP. The client should be prepared to receive this error, and if appropriate, report the error to the requesting application.
ToP   noToC   RFC3530 - Page 77
   If a client has a read lock on a record, it can request an atomic
   upgrade of the lock to a write lock via the LOCK request by setting
   the type to WRITE_LT or WRITEW_LT.  If the server does not support
   atomic upgrade, it will return NFS4ERR_LOCK_NOTSUPP.  If the upgrade
   can be achieved without an existing conflict, the request will
   succeed.  Otherwise, the server will return either NFS4ERR_DENIED or
   NFS4ERR_DEADLOCK.  The error NFS4ERR_DEADLOCK is returned if the
   client issued the LOCK request with the type set to WRITEW_LT and the
   server has detected a deadlock.  The client should be prepared to
   receive such errors and if appropriate, report the error to the
   requesting application.

8.4. Blocking Locks

Some clients require the support of blocking locks. The NFS version 4 protocol must not rely on a callback mechanism and therefore is unable to notify a client when a previously denied lock has been granted. Clients have no choice but to continually poll for the lock. This presents a fairness problem. Two new lock types are added, READW and WRITEW, and are used to indicate to the server that the client is requesting a blocking lock. The server should maintain an ordered list of pending blocking locks. When the conflicting lock is released, the server may wait the lease period for the first waiting client to re-request the lock. After the lease period expires the next waiting client request is allowed the lock. Clients are required to poll at an interval sufficiently small that it is likely to acquire the lock in a timely manner. The server is not required to maintain a list of pending blocked locks as it is used to increase fairness and not correct operation. Because of the unordered nature of crash recovery, storing of lock state to stable storage would be required to guarantee ordered granting of blocking locks. Servers may also note the lock types and delay returning denial of the request to allow extra time for a conflicting lock to be released, allowing a successful return. In this way, clients can avoid the burden of needlessly frequent polling for blocking locks. The server should take care in the length of delay in the event the client retransmits the request.

8.5. Lease Renewal

The purpose of a lease is to allow a server to remove stale locks that are held by a client that has crashed or is otherwise unreachable. It is not a mechanism for cache consistency and lease renewals may not be denied if the lease interval has not expired.
ToP   noToC   RFC3530 - Page 78
   The following events cause implicit renewal of all of the leases for
   a given client (i.e., all those sharing a given clientid).  Each of
   these is a positive indication that the client is still active and
   that the associated state held at the server, for the client, is
   still valid.

   o  An OPEN with a valid clientid.

   o  Any operation made with a valid stateid (CLOSE, DELEGPURGE,
      READ, RENEW, SETATTR, WRITE).  This does not include the special
      stateids of all bits 0 or all bits 1.

      Note that if the client had restarted or rebooted, the client
      would not be making these requests without issuing the
      SETCLIENTID/SETCLIENTID_CONFIRM sequence.  The use of the
      SETCLIENTID/SETCLIENTID_CONFIRM sequence (one that changes the
      client verifier) notifies the server to drop the locking state
      associated with the client.  SETCLIENTID/SETCLIENTID_CONFIRM never
      renews a lease.

      If the server has rebooted, the stateids (NFS4ERR_STALE_STATEID
      error) or the clientid (NFS4ERR_STALE_CLIENTID error) will not be
      valid hence preventing spurious renewals.

   This approach allows for low overhead lease renewal which scales
   well.  In the typical case no extra RPC calls are required for lease
   renewal and in the worst case one RPC is required every lease period
   (i.e., a RENEW operation).  The number of locks held by the client is
   not a factor since all state for the client is involved with the
   lease renewal action.

   Since all operations that create a new lease also renew existing
   leases, the server must maintain a common lease expiration time for
   all valid leases for a given client.  This lease time can then be
   easily updated upon implicit lease renewal actions.

8.6. Crash Recovery

The important requirement in crash recovery is that both the client and the server know when the other has failed. Additionally, it is required that a client sees a consistent view of data across server restarts or reboots. All READ and WRITE operations that may have been queued within the client or network buffers must wait until the client has successfully recovered the locks protecting the READ and WRITE operations.
ToP   noToC   RFC3530 - Page 79

8.6.1. Client Failure and Recovery

In the event that a client fails, the server may recover the client's locks when the associated leases have expired. Conflicting locks from another client may only be granted after this lease expiration. If the client is able to restart or reinitialize within the lease period the client may be forced to wait the remainder of the lease period before obtaining new locks. To minimize client delay upon restart, lock requests are associated with an instance of the client by a client supplied verifier. This verifier is part of the initial SETCLIENTID call made by the client. The server returns a clientid as a result of the SETCLIENTID operation. The client then confirms the use of the clientid with SETCLIENTID_CONFIRM. The clientid in combination with an opaque owner field is then used by the client to identify the lock owner for OPEN. This chain of associations is then used to identify all locks for a particular client. Since the verifier will be changed by the client upon each initialization, the server can compare a new verifier to the verifier associated with currently held locks and determine that they do not match. This signifies the client's new instantiation and subsequent loss of locking state. As a result, the server is free to release all locks held which are associated with the old clientid which was derived from the old verifier. Note that the verifier must have the same uniqueness properties of the verifier for the COMMIT operation.

8.6.2. Server Failure and Recovery

If the server loses locking state (usually as a result of a restart or reboot), it must allow clients time to discover this fact and re- establish the lost locking state. The client must be able to re- establish the locking state without having the server deny valid requests because the server has granted conflicting access to another client. Likewise, if there is the possibility that clients have not yet re-established their locking state for a file, the server must disallow READ and WRITE operations for that file. The duration of this recovery period is equal to the duration of the lease period. A client can determine that server failure (and thus loss of locking state) has occurred, when it receives one of two errors. The NFS4ERR_STALE_STATEID error indicates a stateid invalidated by a reboot or restart. The NFS4ERR_STALE_CLIENTID error indicates a
ToP   noToC   RFC3530 - Page 80
   clientid invalidated by reboot or restart.  When either of these are
   received, the client must establish a new clientid (See the section
   "Client ID") and re-establish the locking state as discussed below.

   The period of special handling of locking and READs and WRITEs, equal
   in duration to the lease period, is referred to as the "grace
   period".  During the grace period, clients recover locks and the
   associated state by reclaim-type locking requests (i.e., LOCK
   requests with reclaim set to true and OPEN operations with a claim
   type of CLAIM_PREVIOUS).  During the grace period, the server must
   reject READ and WRITE operations and non-reclaim locking requests
   (i.e., other LOCK and OPEN operations) with an error of

   If the server can reliably determine that granting a non-reclaim
   request will not conflict with reclamation of locks by other clients,
   the NFS4ERR_GRACE error does not have to be returned and the non-
   reclaim client request can be serviced.  For the server to be able to
   service READ and WRITE operations during the grace period, it must
   again be able to guarantee that no possible conflict could arise
   between an impending reclaim locking request and the READ or WRITE
   operation.  If the server is unable to offer that guarantee, the
   NFS4ERR_GRACE error must be returned to the client.

   For a server to provide simple, valid handling during the grace
   period, the easiest method is to simply reject all non-reclaim
   locking requests and READ and WRITE operations by returning the
   NFS4ERR_GRACE error.  However, a server may keep information about
   granted locks in stable storage.  With this information, the server
   could determine if a regular lock or READ or WRITE operation can be
   safely processed.

   For example, if a count of locks on a given file is available in
   stable storage, the server can track reclaimed locks for the file and
   when all reclaims have been processed, non-reclaim locking requests
   may be processed.  This way the server can ensure that non-reclaim
   locking requests will not conflict with potential reclaim requests.
   With respect to I/O requests, if the server is able to determine that
   there are no outstanding reclaim requests for a file by information
   from stable storage or another similar mechanism, the processing of
   I/O requests could proceed normally for the file.

   To reiterate, for a server that allows non-reclaim lock and I/O
   requests to be processed during the grace period, it MUST determine
   that no lock subsequently reclaimed will be rejected and that no lock
   subsequently reclaimed would have prevented any I/O operation
   processed during the grace period.
ToP   noToC   RFC3530 - Page 81
   Clients should be prepared for the return of NFS4ERR_GRACE errors for
   non-reclaim lock and I/O requests.  In this case the client should
   employ a retry mechanism for the request.  A delay (on the order of
   several seconds) between retries should be used to avoid overwhelming
   the server.  Further discussion of the general issue is included in
   [Floyd].  The client must account for the server that is able to
   perform I/O and non-reclaim locking requests within the grace period
   as well as those that can not do so.

   A reclaim-type locking request outside the server's grace period can
   only succeed if the server can guarantee that no conflicting lock or
   I/O request has been granted since reboot or restart.

   A server may, upon restart, establish a new value for the lease
   period.  Therefore, clients should, once a new clientid is
   established, refetch the lease_time attribute and use it as the basis
   for lease renewal for the lease associated with that server.
   However, the server must establish, for this restart event, a grace
   period at least as long as the lease period for the previous server
   instantiation.  This allows the client state obtained during the
   previous server instance to be reliably re-established.

8.6.3. Network Partitions and Recovery

If the duration of a network partition is greater than the lease period provided by the server, the server will have not received a lease renewal from the client. If this occurs, the server may free all locks held for the client. As a result, all stateids held by the client will become invalid or stale. Once the client is able to reach the server after such a network partition, all I/O submitted by the client with the now invalid stateids will fail with the server returning the error NFS4ERR_EXPIRED. Once this error is received, the client will suitably notify the application that held the lock. As a courtesy to the client or as an optimization, the server may continue to hold locks on behalf of a client for which recent communication has extended beyond the lease period. If the server receives a lock or I/O request that conflicts with one of these courtesy locks, the server must free the courtesy lock and grant the new request. When a network partition is combined with a server reboot, there are edge conditions that place requirements on the server in order to avoid silent data corruption following the server reboot. Two of these edge conditions are known, and are discussed below.
ToP   noToC   RFC3530 - Page 82
   The first edge condition has the following scenario:

      1. Client A acquires a lock.

      2. Client A and server experience mutual network partition, such
         that client A is unable to renew its lease.

      3. Client A's lease expires, so server releases lock.

      4. Client B acquires a lock that would have conflicted with that
         of Client A.

      5. Client B releases the lock

      6. Server reboots

      7. Network partition between client A and server heals.

      8. Client A issues a RENEW operation, and gets back a

      9. Client A reclaims its lock within the server's grace period.

   Thus, at the final step, the server has erroneously granted client
   A's lock reclaim.  If client B modified the object the lock was
   protecting, client A will experience object corruption.

   The second known edge condition follows:

      1. Client A acquires a lock.

      2. Server reboots.

      3. Client A and server experience mutual network partition, such
         that client A is unable to reclaim its lock within the grace

      4. Server's reclaim grace period ends.  Client A has no locks
         recorded on server.

      5. Client B acquires a lock that would have conflicted with that
         of Client A.

      6. Client B releases the lock.

      7. Server reboots a second time.

      8. Network partition between client A and server heals.
ToP   noToC   RFC3530 - Page 83
      9. Client A issues a RENEW operation, and gets back a

     10. Client A reclaims its lock within the server's grace period.

   As with the first edge condition, the final step of the scenario of
   the second edge condition has the server erroneously granting client
   A's lock reclaim.

   Solving the first and second edge conditions requires that the server
   either assume after it reboots that edge condition occurs, and thus
   return NFS4ERR_NO_GRACE for all reclaim attempts, or that the server
   record some information stable storage.  The amount of information
   the server records in stable storage is in inverse proportion to how
   harsh the server wants to be whenever the edge conditions occur.  The
   server that is completely tolerant of all edge conditions will record
   in stable storage every lock that is acquired, removing the lock
   record from stable storage only when the lock is unlocked by the
   client and the lock's lockowner advances the sequence number such
   that the lock release is not the last stateful event for the
   lockowner's sequence.  For the two aforementioned edge conditions,
   the harshest a server can be, and still support a grace period for
   reclaims, requires that the server record in stable storage
   information some minimal information.  For example, a server
   implementation could, for each client, save in stable storage a
   record containing:

   o  the client's id string

   o  a boolean that indicates if the client's lease expired or if there
      was administrative intervention (see the section, Server
      Revocation of Locks) to revoke a record lock, share reservation,
      or delegation

   o  a timestamp that is updated the first time after a server boot or
      reboot the client acquires record locking, share reservation, or
      delegation state on the server.  The timestamp need not be updated
      on subsequent lock requests until the server reboots.

   The server implementation would also record in the stable storage the
   timestamps from the two most recent server reboots.

   Assuming the above record keeping, for the first edge condition,
   after the server reboots, the record that client A's lease expired
   means that another client could have acquired a conflicting record
   lock, share reservation, or delegation.  Hence the server must reject
   a reclaim from client A with the error NFS4ERR_NO_GRACE.
ToP   noToC   RFC3530 - Page 84
   For the second edge condition, after the server reboots for a second
   time, the record that the client had an unexpired record lock, share
   reservation, or delegation established before the server's previous
   incarnation means that the server must reject a reclaim from client A
   with the error NFS4ERR_NO_GRACE.

   Regardless of the level and approach to record keeping, the server
   MUST implement one of the following strategies (which apply to
   reclaims of share reservations, record locks, and delegations):

      1. Reject all reclaims with NFS4ERR_NO_GRACE.  This is superharsh,
         but necessary if the server does not want to record lock state
         in stable storage.

      2. Record sufficient state in stable storage such that all known
         edge conditions involving server reboot, including the two
         noted in this section, are detected.  False positives are
         acceptable.  Note that at this time, it is not known if there
         are other edge conditions.

         In the event, after a server reboot, the server determines that
         there is unrecoverable damage or corruption to the the stable
         storage, then for all clients and/or locks affected, the server
         MUST return NFS4ERR_NO_GRACE.

   A mandate for the client's handling of the NFS4ERR_NO_GRACE error is
   outside the scope of this specification, since the strategies for
   such handling are very dependent on the client's operating
   environment.  However, one potential approach is described below.

   When the client receives NFS4ERR_NO_GRACE, it could examine the
   change attribute of the objects the client is trying to reclaim state
   for, and use that to determine whether to re-establish the state via
   normal OPEN or LOCK requests.  This is acceptable provided the
   client's operating environment allows it.  In otherwords, the client
   implementor is advised to document for his users the behavior.  The
   client could also inform the application that its record lock or
   share reservations (whether they were delegated or not) have been
   lost, such as via a UNIX signal, a GUI pop-up window, etc.  See the
   section, "Data Caching and Revocation" for a discussion of what the
   client should do for dealing with unreclaimed delegations on client

   For further discussion of revocation of locks see the section "Server
   Revocation of Locks".
ToP   noToC   RFC3530 - Page 85

8.7. Recovery from a Lock Request Timeout or Abort

In the event a lock request times out, a client may decide to not retry the request. The client may also abort the request when the process for which it was issued is terminated (e.g., in UNIX due to a signal). It is possible though that the server received the request and acted upon it. This would change the state on the server without the client being aware of the change. It is paramount that the client re-synchronize state with server before it attempts any other operation that takes a seqid and/or a stateid with the same lock_owner. This is straightforward to do without a special re- synchronize operation. Since the server maintains the last lock request and response received on the lock_owner, for each lock_owner, the client should cache the last lock request it sent such that the lock request did not receive a response. From this, the next time the client does a lock operation for the lock_owner, it can send the cached request, if there is one, and if the request was one that established state (e.g., a LOCK or OPEN operation), the server will return the cached result or if never saw the request, perform it. The client can follow up with a request to remove the state (e.g., a LOCKU or CLOSE operation). With this approach, the sequencing and stateid information on the client and server for the given lock_owner will re-synchronize and in turn the lock state will re-synchronize.

8.8. Server Revocation of Locks

At any point, the server can revoke locks held by a client and the client must be prepared for this event. When the client detects that its locks have been or may have been revoked, the client is responsible for validating the state information between itself and the server. Validating locking state for the client means that it must verify or reclaim state for each lock currently held. The first instance of lock revocation is upon server reboot or re- initialization. In this instance the client will receive an error (NFS4ERR_STALE_STATEID or NFS4ERR_STALE_CLIENTID) and the client will proceed with normal crash recovery as described in the previous section. The second lock revocation event is the inability to renew the lease before expiration. While this is considered a rare or unusual event, the client must be prepared to recover. Both the server and client will be able to detect the failure to renew the lease and are capable of recovering without data corruption. For the server, it tracks the last renewal event serviced for the client and knows when the lease will expire. Similarly, the client must track operations which will
ToP   noToC   RFC3530 - Page 86
   renew the lease period.  Using the time that each such request was
   sent and the time that the corresponding reply was received, the
   client should bound the time that the corresponding renewal could
   have occurred on the server and thus determine if it is possible that
   a lease period expiration could have occurred.

   The third lock revocation event can occur as a result of
   administrative intervention within the lease period.  While this is
   considered a rare event, it is possible that the server's
   administrator has decided to release or revoke a particular lock held
   by the client.  As a result of revocation, the client will receive an
   error of NFS4ERR_ADMIN_REVOKED.  In this instance the client may
   assume that only the lock_owner's locks have been lost.  The client
   notifies the lock holder appropriately.  The client may not assume
   the lease period has been renewed as a result of failed operation.

   When the client determines the lease period may have expired, the
   client must mark all locks held for the associated lease as
   "unvalidated".  This means the client has been unable to re-establish
   or confirm the appropriate lock state with the server.  As described
   in the previous section on crash recovery, there are scenarios in
   which the server may grant conflicting locks after the lease period
   has expired for a client.  When it is possible that the lease period
   has expired, the client must validate each lock currently held to
   ensure that a conflicting lock has not been granted.  The client may
   accomplish this task by issuing an I/O request, either a pending I/O
   or a zero-length read, specifying the stateid associated with the
   lock in question.  If the response to the request is success, the
   client has validated all of the locks governed by that stateid and
   re-established the appropriate state between itself and the server.

   If the I/O request is not successful, then one or more of the locks
   associated with the stateid was revoked by the server and the client
   must notify the owner.

8.9. Share Reservations

A share reservation is a mechanism to control access to a file. It is a separate and independent mechanism from record locking. When a client opens a file, it issues an OPEN operation to the server specifying the type of access required (READ, WRITE, or BOTH) and the type of access to deny others (deny NONE, READ, WRITE, or BOTH). If the OPEN fails the client will fail the application's open request. Pseudo-code definition of the semantics: if (request.access == 0) return (NFS4ERR_INVAL)
ToP   noToC   RFC3530 - Page 87
      if ((request.access & file_state.deny)) ||
            (request.deny & file_state.access))
                    return (NFS4ERR_DENIED)

   This checking of share reservations on OPEN is done with no exception
   for an existing OPEN for the same open_owner.

   The constants used for the OPEN and OPEN_DOWNGRADE operations for the
   access and deny fields are as follows:

   const OPEN4_SHARE_ACCESS_READ   = 0x00000001;
   const OPEN4_SHARE_ACCESS_WRITE  = 0x00000002;
   const OPEN4_SHARE_ACCESS_BOTH   = 0x00000003;

   const OPEN4_SHARE_DENY_NONE     = 0x00000000;
   const OPEN4_SHARE_DENY_READ     = 0x00000001;
   const OPEN4_SHARE_DENY_WRITE    = 0x00000002;
   const OPEN4_SHARE_DENY_BOTH     = 0x00000003;

8.10. OPEN/CLOSE Operations

To provide correct share semantics, a client MUST use the OPEN operation to obtain the initial filehandle and indicate the desired access and what if any access to deny. Even if the client intends to use a stateid of all 0's or all 1's, it must still obtain the filehandle for the regular file with the OPEN operation so the appropriate share semantics can be applied. For clients that do not have a deny mode built into their open programming interfaces, deny equal to NONE should be used. The OPEN operation with the CREATE flag, also subsumes the CREATE operation for regular files as used in previous versions of the NFS protocol. This allows a create with a share to be done atomically. The CLOSE operation removes all share reservations held by the lock_owner on that file. If record locks are held, the client SHOULD release all locks before issuing a CLOSE. The server MAY free all outstanding locks on CLOSE but some servers may not support the CLOSE of a file that still has record locks held. The server MUST return failure, NFS4ERR_LOCKS_HELD, if any locks would exist after the CLOSE. The LOOKUP operation will return a filehandle without establishing any lock state on the server. Without a valid stateid, the server will assume the client has the least access. For example, a file
ToP   noToC   RFC3530 - Page 88
   opened with deny READ/WRITE cannot be accessed using a filehandle
   obtained through LOOKUP because it would not have a valid stateid
   (i.e., using a stateid of all bits 0 or all bits 1).

8.10.1. Close and Retention of State Information

Since a CLOSE operation requests deallocation of a stateid, dealing with retransmission of the CLOSE, may pose special difficulties, since the state information, which normally would be used to determine the state of the open file being designated, might be deallocated, resulting in an NFS4ERR_BAD_STATEID error. Servers may deal with this problem in a number of ways. To provide the greatest degree assurance that the protocol is being used properly, a server should, rather than deallocate the stateid, mark it as close-pending, and retain the stateid with this status, until later deallocation. In this way, a retransmitted CLOSE can be recognized since the stateid points to state information with this distinctive status, so that it can be handled without error. When adopting this strategy, a server should retain the state information until the earliest of: o Another validly sequenced request for the same lockowner, that is not a retransmission. o The time that a lockowner is freed by the server due to period with no activity. o All locks for the client are freed as a result of a SETCLIENTID. Servers may avoid this complexity, at the cost of less complete protocol error checking, by simply responding NFS4_OK in the event of a CLOSE for a deallocated stateid, on the assumption that this case must be caused by a retransmitted close. When adopting this approach, it is desirable to at least log an error when returning a no-error indication in this situation. If the server maintains a reply-cache mechanism, it can verify the CLOSE is indeed a retransmission and avoid error logging in most cases.

8.11. Open Upgrade and Downgrade

When an OPEN is done for a file and the lockowner for which the open is being done already has the file open, the result is to upgrade the open file status maintained on the server to include the access and deny bits specified by the new OPEN as well as those for the existing OPEN. The result is that there is one open file, as far as the protocol is concerned, and it includes the union of the access and
ToP   noToC   RFC3530 - Page 89
   deny bits for all of the OPEN requests completed.  Only a single
   CLOSE will be done to reset the effects of both OPENs.  Note that the
   client, when issuing the OPEN, may not know that the same file is in
   fact being opened.  The above only applies if both OPENs result in
   the OPENed object being designated by the same filehandle.

   When the server chooses to export multiple filehandles corresponding
   to the same file object and returns different filehandles on two
   different OPENs of the same file object, the server MUST NOT "OR"
   together the access and deny bits and coalesce the two open files.
   Instead the server must maintain separate OPENs with separate
   stateids and will require separate CLOSEs to free them.

   When multiple open files on the client are merged into a single open
   file object on the server, the close of one of the open files (on the
   client) may necessitate change of the access and deny status of the
   open file on the server.  This is because the union of the access and
   deny bits for the remaining opens may be smaller (i.e., a proper
   subset) than previously.  The OPEN_DOWNGRADE operation is used to
   make the necessary change and the client should use it to update the
   server so that share reservation requests by other clients are
   handled properly.

8.12. Short and Long Leases

When determining the time period for the server lease, the usual lease tradeoffs apply. Short leases are good for fast server recovery at a cost of increased RENEW or READ (with zero length) requests. Longer leases are certainly kinder and gentler to servers trying to handle very large numbers of clients. The number of RENEW requests drop in proportion to the lease time. The disadvantages of long leases are slower recovery after server failure (the server must wait for the leases to expire and the grace period to elapse before granting new lock requests) and increased file contention (if client fails to transmit an unlock request then server must wait for lease expiration before granting new locks). Long leases are usable if the server is able to store lease state in non-volatile memory. Upon recovery, the server can reconstruct the lease state from its non-volatile memory and continue operation with its clients and therefore long leases would not be an issue.

8.13. Clocks, Propagation Delay, and Calculating Lease Expiration

To avoid the need for synchronized clocks, lease times are granted by the server as a time delta. However, there is a requirement that the client and server clocks do not drift excessively over the duration of the lock. There is also the issue of propagation delay across the
ToP   noToC   RFC3530 - Page 90
   network which could easily be several hundred milliseconds as well as
   the possibility that requests will be lost and need to be

   To take propagation delay into account, the client should subtract it
   from lease times (e.g., if the client estimates the one-way
   propagation delay as 200 msec, then it can assume that the lease is
   already 200 msec old when it gets it).  In addition, it will take
   another 200 msec to get a response back to the server.  So the client
   must send a lock renewal or write data back to the server 400 msec
   before the lease would expire.

   The server's lease period configuration should take into account the
   network distance of the clients that will be accessing the server's
   resources.  It is expected that the lease period will take into
   account the network propagation delays and other network delay
   factors for the client population.  Since the protocol does not allow
   for an automatic method to determine an appropriate lease period, the
   server's administrator may have to tune the lease period.

8.14. Migration, Replication and State

When responsibility for handling a given file system is transferred to a new server (migration) or the client chooses to use an alternate server (e.g., in response to server unresponsiveness) in the context of file system replication, the appropriate handling of state shared between the client and server (i.e., locks, leases, stateids, and clientids) is as described below. The handling differs between migration and replication. For related discussion of file server state and recover of such see the sections under "File Locking and Share Reservations". If server replica or a server immigrating a filesystem agrees to, or is expected to, accept opaque values from the client that originated from another server, then it is a wise implementation practice for the servers to encode the "opaque" values in network byte order. This way, servers acting as replicas or immigrating filesystems will be able to parse values like stateids, directory cookies, filehandles, etc. even if their native byte order is different from other servers cooperating in the replication and migration of the filesystem.

8.14.1. Migration and State

In the case of migration, the servers involved in the migration of a filesystem SHOULD transfer all server state from the original to the new server. This must be done in a way that is transparent to the client. This state transfer will ease the client's transition when a
ToP   noToC   RFC3530 - Page 91
   filesystem migration occurs.  If the servers are successful in
   transferring all state, the client will continue to use stateids
   assigned by the original server.  Therefore the new server must
   recognize these stateids as valid.  This holds true for the clientid
   as well.  Since responsibility for an entire filesystem is
   transferred with a migration event, there is no possibility that
   conflicts will arise on the new server as a result of the transfer of

   As part of the transfer of information between servers, leases would
   be transferred as well.  The leases being transferred to the new
   server will typically have a different expiration time from those for
   the same client, previously on the old server.  To maintain the
   property that all leases on a given server for a given client expire
   at the same time, the server should advance the expiration time to
   the later of the leases being transferred or the leases already
   present.  This allows the client to maintain lease renewal of both
   classes without special effort.

   The servers may choose not to transfer the state information upon
   migration.  However, this choice is discouraged.  In this case, when
   the client presents state information from the original server, the
   client must be prepared to receive either NFS4ERR_STALE_CLIENTID or
   NFS4ERR_STALE_STATEID from the new server.  The client should then
   recover its state information as it normally would in response to a
   server failure.  The new server must take care to allow for the
   recovery of state information as it would in the event of server

8.14.2. Replication and State

Since client switch-over in the case of replication is not under server control, the handling of state is different. In this case, leases, stateids and clientids do not have validity across a transition from one server to another. The client must re-establish its locks on the new server. This can be compared to the re- establishment of locks by means of reclaim-type requests after a server reboot. The difference is that the server has no provision to distinguish requests reclaiming locks from those obtaining new locks or to defer the latter. Thus, a client re-establishing a lock on the new server (by means of a LOCK or OPEN request), may have the requests denied due to a conflicting lock. Since replication is intended for read-only use of filesystems, such denial of locks should not pose large difficulties in practice. When an attempt to re-establish a lock on a new server is denied, the client should treat the situation as if his original lock had been revoked.
ToP   noToC   RFC3530 - Page 92

8.14.3. Notification of Migrated Lease

In the case of lease renewal, the client may not be submitting requests for a filesystem that has been migrated to another server. This can occur because of the implicit lease renewal mechanism. The client renews leases for all filesystems when submitting a request to any one filesystem at the server. In order for the client to schedule renewal of leases that may have been relocated to the new server, the client must find out about lease relocation before those leases expire. To accomplish this, all operations which implicitly renew leases for a client (i.e., OPEN, CLOSE, READ, WRITE, RENEW, LOCK, LOCKT, LOCKU), will return the error NFS4ERR_LEASE_MOVED if responsibility for any of the leases to be renewed has been transferred to a new server. This condition will continue until the client receives an NFS4ERR_MOVED error and the server receives the subsequent GETATTR(fs_locations) for an access to each filesystem for which a lease has been moved to a new server. When a client receives an NFS4ERR_LEASE_MOVED error, it should perform an operation on each filesystem associated with the server in question. When the client receives an NFS4ERR_MOVED error, the client can follow the normal process to obtain the new server information (through the fs_locations attribute) and perform renewal of those leases on the new server. If the server has not had state transferred to it transparently, the client will receive either NFS4ERR_STALE_CLIENTID or NFS4ERR_STALE_STATEID from the new server, as described above, and the client can then recover state information as it does in the event of server failure.

8.14.4. Migration and the Lease_time Attribute

In order that the client may appropriately manage its leases in the case of migration, the destination server must establish proper values for the lease_time attribute. When state is transferred transparently, that state should include the correct value of the lease_time attribute. The lease_time attribute on the destination server must never be less than that on the source since this would result in premature expiration of leases granted by the source server. Upon migration in which state is transferred transparently, the client is under no obligation to re- fetch the lease_time attribute and may continue to use the value previously fetched (on the source server). If state has not been transferred transparently (i.e., the client sees a real or simulated server reboot), the client should fetch the value of lease_time on the new (i.e., destination) server, and use it
ToP   noToC   RFC3530 - Page 93
   for subsequent locking requests.  However the server must respect a
   grace period at least as long as the lease_time on the source server,
   in order to ensure that clients have ample time to reclaim their
   locks before potentially conflicting non-reclaimed locks are granted.
   The means by which the new server obtains the value of lease_time on
   the old server is left to the server implementations.  It is not
   specified by the NFS version 4 protocol.

9. Client-Side Caching

Client-side caching of data, of file attributes, and of file names is essential to providing good performance with the NFS protocol. Providing distributed cache coherence is a difficult problem and previous versions of the NFS protocol have not attempted it. Instead, several NFS client implementation techniques have been used to reduce the problems that a lack of coherence poses for users. These techniques have not been clearly defined by earlier protocol specifications and it is often unclear what is valid or invalid client behavior. The NFS version 4 protocol uses many techniques similar to those that have been used in previous protocol versions. The NFS version 4 protocol does not provide distributed cache coherence. However, it defines a more limited set of caching guarantees to allow locks and share reservations to be used without destructive interference from client side caching. In addition, the NFS version 4 protocol introduces a delegation mechanism which allows many decisions normally made by the server to be made locally by clients. This mechanism provides efficient support of the common cases where sharing is infrequent or where sharing is read-only.

9.1. Performance Challenges for Client-Side Caching

Caching techniques used in previous versions of the NFS protocol have been successful in providing good performance. However, several scalability challenges can arise when those techniques are used with very large numbers of clients. This is particularly true when clients are geographically distributed which classically increases the latency for cache revalidation requests. The previous versions of the NFS protocol repeat their file data cache validation requests at the time the file is opened. This behavior can have serious performance drawbacks. A common case is one in which a file is only accessed by a single client. Therefore, sharing is infrequent.
ToP   noToC   RFC3530 - Page 94
   In this case, repeated reference to the server to find that no
   conflicts exist is expensive.  A better option with regards to
   performance is to allow a client that repeatedly opens a file to do
   so without reference to the server.  This is done until potentially
   conflicting operations from another client actually occur.

   A similar situation arises in connection with file locking.  Sending
   file lock and unlock requests to the server as well as the read and
   write requests necessary to make data caching consistent with the
   locking semantics (see the section "Data Caching and File Locking")
   can severely limit performance.  When locking is used to provide
   protection against infrequent conflicts, a large penalty is incurred.
   This penalty may discourage the use of file locking by applications.

   The NFS version 4 protocol provides more aggressive caching
   strategies with the following design goals:

   o  Compatibility with a large range of server semantics.

   o  Provide the same caching benefits as previous versions of the NFS
      protocol when unable to provide the more aggressive model.

   o  Requirements for aggressive caching are organized so that a large
      portion of the benefit can be obtained even when not all of the
      requirements can be met.

   The appropriate requirements for the server are discussed in later
   sections in which specific forms of caching are covered.  (see the
   section "Open Delegation").

9.2. Delegation and Callbacks

Recallable delegation of server responsibilities for a file to a client improves performance by avoiding repeated requests to the server in the absence of inter-client conflict. With the use of a "callback" RPC from server to client, a server recalls delegated responsibilities when another client engages in sharing of a delegated file. A delegation is passed from the server to the client, specifying the object of the delegation and the type of delegation. There are different types of delegations but each type contains a stateid to be used to represent the delegation when performing operations that depend on the delegation. This stateid is similar to those associated with locks and share reservations but differs in that the stateid for a delegation is associated with a clientid and may be
ToP   noToC   RFC3530 - Page 95
   used on behalf of all the open_owners for the given client.  A
   delegation is made to the client as a whole and not to any specific
   process or thread of control within it.

   Because callback RPCs may not work in all environments (due to
   firewalls, for example), correct protocol operation does not depend
   on them.  Preliminary testing of callback functionality by means of a
   CB_NULL procedure determines whether callbacks can be supported.  The
   CB_NULL procedure checks the continuity of the callback path.  A
   server makes a preliminary assessment of callback availability to a
   given client and avoids delegating responsibilities until it has
   determined that callbacks are supported.  Because the granting of a
   delegation is always conditional upon the absence of conflicting
   access, clients must not assume that a delegation will be granted and
   they must always be prepared for OPENs to be processed without any
   delegations being granted.

   Once granted, a delegation behaves in most ways like a lock.  There
   is an associated lease that is subject to renewal together with all
   of the other leases held by that client.

   Unlike locks, an operation by a second client to a delegated file
   will cause the server to recall a delegation through a callback.

   On recall, the client holding the delegation must flush modified
   state (such as modified data) to the server and return the
   delegation.  The conflicting request will not receive a response
   until the recall is complete.  The recall is considered complete when
   the client returns the delegation or the server times out on the
   recall and revokes the delegation as a result of the timeout.
   Following the resolution of the recall, the server has the
   information necessary to grant or deny the second client's request.

   At the time the client receives a delegation recall, it may have
   substantial state that needs to be flushed to the server.  Therefore,
   the server should allow sufficient time for the delegation to be
   returned since it may involve numerous RPCs to the server.  If the
   server is able to determine that the client is diligently flushing
   state to the server as a result of the recall, the server may extend
   the usual time allowed for a recall.  However, the time allowed for
   recall completion should not be unbounded.

   An example of this is when responsibility to mediate opens on a given
   file is delegated to a client (see the section "Open Delegation").
   The server will not know what opens are in effect on the client.
   Without this knowledge the server will be unable to determine if the
   access and deny state for the file allows any particular open until
   the delegation for the file has been returned.
ToP   noToC   RFC3530 - Page 96
   A client failure or a network partition can result in failure to
   respond to a recall callback.  In this case, the server will revoke
   the delegation which in turn will render useless any modified state
   still on the client.

9.2.1. Delegation Recovery

There are three situations that delegation recovery must deal with: o Client reboot or restart o Server reboot or restart o Network partition (full or callback-only) In the event the client reboots or restarts, the failure to renew leases will result in the revocation of record locks and share reservations. Delegations, however, may be treated a bit differently. There will be situations in which delegations will need to be reestablished after a client reboots or restarts. The reason for this is the client may have file data stored locally and this data was associated with the previously held delegations. The client will need to reestablish the appropriate file state on the server. To allow for this type of client recovery, the server MAY extend the period for delegation recovery beyond the typical lease expiration period. This implies that requests from other clients that conflict with these delegations will need to wait. Because the normal recall process may require significant time for the client to flush changed state to the server, other clients need be prepared for delays that occur because of a conflicting delegation. This longer interval would increase the window for clients to reboot and consult stable storage so that the delegations can be reclaimed. For open delegations, such delegations are reclaimed using OPEN with a claim type of CLAIM_DELEGATE_PREV. (See the sections on "Data Caching and Revocation" and "Operation 18: OPEN" for discussion of open delegation and the details of OPEN respectively). A server MAY support a claim type of CLAIM_DELEGATE_PREV, but if it does, it MUST NOT remove delegations upon SETCLIENTID_CONFIRM, and instead MUST, for a period of time no less than that of the value of the lease_time attribute, maintain the client's delegations to allow time for the client to issue CLAIM_DELEGATE_PREV requests. The server that supports CLAIM_DELEGATE_PREV MUST support the DELEGPURGE operation.
ToP   noToC   RFC3530 - Page 97
   When the server reboots or restarts, delegations are reclaimed (using
   the OPEN operation with CLAIM_PREVIOUS) in a similar fashion to
   record locks and share reservations.  However, there is a slight
   semantic difference.  In the normal case if the server decides that a
   delegation should not be granted, it performs the requested action
   (e.g., OPEN) without granting any delegation.  For reclaim, the
   server grants the delegation but a special designation is applied so
   that the client treats the delegation as having been granted but
   recalled by the server.  Because of this, the client has the duty to
   write all modified state to the server and then return the
   delegation.  This process of handling delegation reclaim reconciles
   three principles of the NFS version 4 protocol:

   o  Upon reclaim, a client reporting resources assigned to it by an
      earlier server instance must be granted those resources.

   o  The server has unquestionable authority to determine whether
      delegations are to be granted and, once granted, whether they are
      to be continued.

   o  The use of callbacks is not to be depended upon until the client
      has proven its ability to receive them.

   When a network partition occurs, delegations are subject to freeing
   by the server when the lease renewal period expires.  This is similar
   to the behavior for locks and share reservations.  For delegations,
   however, the server may extend the period in which conflicting
   requests are held off.  Eventually the occurrence of a conflicting
   request from another client will cause revocation of the delegation.
   A loss of the callback path (e.g., by later network configuration
   change) will have the same effect.  A recall request will fail and
   revocation of the delegation will result.

   A client normally finds out about revocation of a delegation when it
   uses a stateid associated with a delegation and receives the error
   NFS4ERR_EXPIRED.  It also may find out about delegation revocation
   after a client reboot when it attempts to reclaim a delegation and
   receives that same error.  Note that in the case of a revoked write
   open delegation, there are issues because data may have been modified
   by the client whose delegation is revoked and separately by other
   clients.  See the section "Revocation Recovery for Write Open
   Delegation" for a discussion of such issues.  Note also that when
   delegations are revoked, information about the revoked delegation
   will be written by the server to stable storage (as described in the
   section "Crash Recovery").  This is done to deal with the case in
   which a server reboots after revoking a delegation but before the
   client holding the revoked delegation is notified about the
ToP   noToC   RFC3530 - Page 98

9.3. Data Caching

When applications share access to a set of files, they need to be implemented so as to take account of the possibility of conflicting access by another application. This is true whether the applications in question execute on different clients or reside on the same client. Share reservations and record locks are the facilities the NFS version 4 protocol provides to allow applications to coordinate access by providing mutual exclusion facilities. The NFS version 4 protocol's data caching must be implemented such that it does not invalidate the assumptions that those using these facilities depend upon.

9.3.1. Data Caching and OPENs

In order to avoid invalidating the sharing assumptions that applications rely on, NFS version 4 clients should not provide cached data to applications or modify it on behalf of an application when it would not be valid to obtain or modify that same data via a READ or WRITE operation. Furthermore, in the absence of open delegation (see the section "Open Delegation") two additional rules apply. Note that these rules are obeyed in practice by many NFS version 2 and version 3 clients. o First, cached data present on a client must be revalidated after doing an OPEN. Revalidating means that the client fetches the change attribute from the server, compares it with the cached change attribute, and if different, declares the cached data (as well as the cached attributes) as invalid. This is to ensure that the data for the OPENed file is still correctly reflected in the client's cache. This validation must be done at least when the client's OPEN operation includes DENY=WRITE or BOTH thus terminating a period in which other clients may have had the opportunity to open the file with WRITE access. Clients may choose to do the revalidation more often (i.e., at OPENs specifying DENY=NONE) to parallel the NFS version 3 protocol's practice for the benefit of users assuming this degree of cache revalidation. Since the change attribute is updated for data and metadata modifications, some client implementors may be tempted to use the time_modify attribute and not change to validate cached data, so that metadata changes do not spuriously invalidate clean data. The implementor is cautioned in this approach. The change attribute is guaranteed to change for each update to the file,
ToP   noToC   RFC3530 - Page 99
      whereas time_modify is guaranteed to change only at the
      granularity of the time_delta attribute.  Use by the client's data
      cache validation logic of time_modify and not change runs the risk
      of the client incorrectly marking stale data as valid.

   o  Second, modified data must be flushed to the server before closing
      a file OPENed for write.  This is complementary to the first rule.
      If the data is not flushed at CLOSE, the revalidation done after
      client OPENs as file is unable to achieve its purpose.  The other
      aspect to flushing the data before close is that the data must be
      committed to stable storage, at the server, before the CLOSE
      operation is requested by the client.  In the case of a server
      reboot or restart and a CLOSEd file, it may not be possible to
      retransmit the data to be written to the file.  Hence, this

9.3.2. Data Caching and File Locking

For those applications that choose to use file locking instead of share reservations to exclude inconsistent file access, there is an analogous set of constraints that apply to client side data caching. These rules are effective only if the file locking is used in a way that matches in an equivalent way the actual READ and WRITE operations executed. This is as opposed to file locking that is based on pure convention. For example, it is possible to manipulate a two-megabyte file by dividing the file into two one-megabyte regions and protecting access to the two regions by file locks on bytes zero and one. A lock for write on byte zero of the file would represent the right to do READ and WRITE operations on the first region. A lock for write on byte one of the file would represent the right to do READ and WRITE operations on the second region. As long as all applications manipulating the file obey this convention, they will work on a local filesystem. However, they may not work with the NFS version 4 protocol unless clients refrain from data caching. The rules for data caching in the file locking environment are: o First, when a client obtains a file lock for a particular region, the data cache corresponding to that region (if any cached data exists) must be revalidated. If the change attribute indicates that the file may have been updated since the cached data was obtained, the client must flush or invalidate the cached data for the newly locked region. A client might choose to invalidate all of non-modified cached data that it has for the file but the only requirement for correct operation is to invalidate all of the data in the newly locked region.
ToP   noToC   RFC3530 - Page 100
   o  Second, before releasing a write lock for a region, all modified
      data for that region must be flushed to the server.  The modified
      data must also be written to stable storage.

   Note that flushing data to the server and the invalidation of cached
   data must reflect the actual byte ranges locked or unlocked.
   Rounding these up or down to reflect client cache block boundaries
   will cause problems if not carefully done.  For example, writing a
   modified block when only half of that block is within an area being
   unlocked may cause invalid modification to the region outside the
   unlocked area.  This, in turn, may be part of a region locked by
   another client.  Clients can avoid this situation by synchronously
   performing portions of write operations that overlap that portion
   (initial or final) that is not a full block.  Similarly, invalidating
   a locked area which is not an integral number of full buffer blocks
   would require the client to read one or two partial blocks from the
   server if the revalidation procedure shows that the data which the
   client possesses may not be valid.

   The data that is written to the server as a prerequisite to the
   unlocking of a region must be written, at the server, to stable
   storage.  The client may accomplish this either with synchronous
   writes or by following asynchronous writes with a COMMIT operation.
   This is required because retransmission of the modified data after a
   server reboot might conflict with a lock held by another client.

   A client implementation may choose to accommodate applications which
   use record locking in non-standard ways (e.g., using a record lock as
   a global semaphore) by flushing to the server more data upon an LOCKU
   than is covered by the locked range.  This may include modified data
   within files other than the one for which the unlocks are being done.
   In such cases, the client must not interfere with applications whose
   READs and WRITEs are being done only within the bounds of record
   locks which the application holds.  For example, an application locks
   a single byte of a file and proceeds to write that single byte.  A
   client that chose to handle a LOCKU by flushing all modified data to
   the server could validly write that single byte in response to an
   unrelated unlock.  However, it would not be valid to write the entire
   block in which that single written byte was located since it includes
   an area that is not locked and might be locked by another client.
   Client implementations can avoid this problem by dividing files with
   modified data into those for which all modifications are done to
   areas covered by an appropriate record lock and those for which there
   are modifications not covered by a record lock.  Any writes done for
   the former class of files must not include areas not locked and thus
   not modified on the client.
ToP   noToC   RFC3530 - Page 101

9.3.3. Data Caching and Mandatory File Locking

Client side data caching needs to respect mandatory file locking when it is in effect. The presence of mandatory file locking for a given file is indicated when the client gets back NFS4ERR_LOCKED from a READ or WRITE on a file it has an appropriate share reservation for. When mandatory locking is in effect for a file, the client must check for an appropriate file lock for data being read or written. If a lock exists for the range being read or written, the client may satisfy the request using the client's validated cache. If an appropriate file lock is not held for the range of the read or write, the read or write request must not be satisfied by the client's cache and the request must be sent to the server for processing. When a read or write request partially overlaps a locked region, the request should be subdivided into multiple pieces with each region (locked or not) treated appropriately.

9.3.4. Data Caching and File Identity

When clients cache data, the file data needs to be organized according to the filesystem object to which the data belongs. For NFS version 3 clients, the typical practice has been to assume for the purpose of caching that distinct filehandles represent distinct filesystem objects. The client then has the choice to organize and maintain the data cache on this basis. In the NFS version 4 protocol, there is now the possibility to have significant deviations from a "one filehandle per object" model because a filehandle may be constructed on the basis of the object's pathname. Therefore, clients need a reliable method to determine if two filehandles designate the same filesystem object. If clients were simply to assume that all distinct filehandles denote distinct objects and proceed to do data caching on this basis, caching inconsistencies would arise between the distinct client side objects which mapped to the same server side object. By providing a method to differentiate filehandles, the NFS version 4 protocol alleviates a potential functional regression in comparison with the NFS version 3 protocol. Without this method, caching inconsistencies within the same client could occur and this has not been present in previous versions of the NFS protocol. Note that it is possible to have such inconsistencies with applications executing on multiple clients but that is not the issue being addressed here. For the purposes of data caching, the following steps allow an NFS version 4 client to determine whether two distinct filehandles denote the same server side object:
ToP   noToC   RFC3530 - Page 102
   o  If GETATTR directed to two filehandles returns different values of
      the fsid attribute, then the filehandles represent distinct

   o  If GETATTR for any file with an fsid that matches the fsid of the
      two filehandles in question returns a unique_handles attribute
      with a value of TRUE, then the two objects are distinct.

   o  If GETATTR directed to the two filehandles does not return the
      fileid attribute for both of the handles, then it cannot be
      determined whether the two objects are the same.  Therefore,
      operations which depend on that knowledge (e.g., client side data
      caching) cannot be done reliably.

   o  If GETATTR directed to the two filehandles returns different
      values for the fileid attribute, then they are distinct objects.

   o  Otherwise they are the same object.

(page 102 continued on part 5)

Next Section