RFC 3530

Network File System (NFS) version 4 Protocol

Pages: 275
Obsoletes: 3010
Obsoleted by: 7530

Part 3 of 8 – Pages 47 to 76

noToC RFC3530 - Page 47 prevText

5.8.  Interpreting owner and owner_group

   The recommended attributes "owner" and "owner_group" (and also users
   and groups within the "acl" attribute) are represented in terms of a
   UTF-8 string.  To avoid a representation that is tied to a particular
   underlying implementation at the client or server, the use of the
   UTF-8 string has been chosen.  Note that section 6.1 of [RFC2624]
   provides additional rationale.  It is expected that the client and
   server will have their own local representation of owner and
   owner_group that is used for local storage or presentation to the end
   user.  Therefore, it is expected that when these attributes are
   transferred between the client and server that the local
   representation is translated to a syntax of the form
   "user@dns_domain".  This will allow for a client and server that do
   not use the same local representation the ability to translate to a
   common syntax that can be interpreted by both.

   Similarly, security principals may be represented in different ways
   by different security mechanisms.  Servers normally translate these
   representations into a common format, generally that used by local
   storage, to serve as a means of identifying the users corresponding
   to these security principals.  When these local identifiers are
   translated to the form of the owner attribute, associated with files
   created by such principals they identify, in a common format, the
   users associated with each corresponding set of security principals.

   The translation used to interpret owner and group strings is not
   specified as part of the protocol.  This allows various solutions to
   be employed.  For example, a local translation table may be consulted
   that maps between a numeric id to the user@dns_domain syntax.  A name
   service may also be used to accomplish the translation.  A server may
   provide a more general service, not limited by any particular
   translation (which would only translate a limited set of possible
   strings) by storing the owner and owner_group attributes in local
   storage without any translation or it may augment a translation
   method by storing the entire string for attributes for which no
   translation is available while using the local representation for
   those cases in which a translation is available.

   Servers that do not provide support for all possible values of the
   owner and owner_group attributes, should return an error
   (NFS4ERR_BADOWNER) when a string is presented that has no
   translation, as the value to be set for a SETATTR of the owner,
   owner_group, or acl attributes.  When a server does accept an owner
   or owner_group value as valid on a SETATTR (and similarly for the
   owner and group strings in an acl), it is promising to return that
   same string when a corresponding GETATTR is done.  Configuration
   changes and ill-constructed name translations (those that contain

noToC RFC3530 - Page 48

   aliasing) may make that promise impossible to honor.  Servers should
   make appropriate efforts to avoid a situation in which these
   attributes have their values changed when no real change to ownership
   has occurred.

   The "dns_domain" portion of the owner string is meant to be a DNS
   domain name.  For example, user@ietf.org.  Servers should accept as
   valid a set of users for at least one domain.  A server may treat
   other domains as having no valid translations.  A more general
   service is provided when a server is capable of accepting users for
   multiple domains, or for all domains, subject to security
   constraints.

   In the case where there is no translation available to the client or
   server, the attribute value must be constructed without the "@".
   Therefore, the absence of the @ from the owner or owner_group
   attribute signifies that no translation was available at the sender
   and that the receiver of the attribute should not use that string as
   a basis for translation into its own internal format.  Even though
   the attribute value can not be translated, it may still be useful.
   In the case of a client, the attribute string may be used for local
   display of ownership.

   To provide a greater degree of compatibility with previous versions
   of NFS (i.e., v2 and v3), which identified users and groups by 32-bit
   unsigned uid's and gid's, owner and group strings that consist of
   decimal numeric values with no leading zeros can be given a special
   interpretation by clients and servers which choose to provide such
   support.  The receiver may treat such a user or group string as
   representing the same user as would be represented by a v2/v3 uid or
   gid having the corresponding numeric value.  A server is not
   obligated to accept such a string, but may return an NFS4ERR_BADOWNER
   instead.  To avoid this mechanism being used to subvert user and
   group translation, so that a client might pass all of the owners and
   groups in numeric form, a server SHOULD return an NFS4ERR_BADOWNER
   error when there is a valid translation for the user or owner
   designated in this way.  In that case, the client must use the
   appropriate name@domain string and not the special form for
   compatibility.

   The owner string "nobody" may be used to designate an anonymous user,
   which will be associated with a file created by a security principal
   that cannot be mapped through normal means to the owner attribute.

noToC RFC3530 - Page 49

5.9.  Character Case Attributes

   With respect to the case_insensitive and case_preserving attributes,
   each UCS-4 character (which UTF-8 encodes) has a "long descriptive
   name" [RFC1345] which may or may not included the word "CAPITAL" or
   "SMALL".  The presence of SMALL or CAPITAL allows an NFS server to
   implement unambiguous and efficient table driven mappings for case
   insensitive comparisons, and non-case-preserving storage.  For
   general character handling and internationalization issues, see the
   section "Internationalization".

5.10.  Quota Attributes

   For the attributes related to filesystem quotas, the following
   definitions apply:

   quota_avail_soft
         The value in bytes which represents the amount of additional
         disk space that can be allocated to this file or directory
         before the user may reasonably be warned.  It is understood
         that this space may be consumed by allocations to other files
         or directories though there is a rule as to which other files
         or directories.

   quota_avail_hard
         The value in bytes which represent the amount of additional
         disk space beyond the current allocation that can be allocated
         to this file or directory before further allocations will be
         refused.  It is understood that this space may be consumed by
         allocations to other files or directories.

   quota_used
         The value in bytes which represent the amount of disc space
         used by this file or directory and possibly a number of other
         similar files or directories, where the set of "similar" meets
         at least the criterion that allocating space to any file or
         directory in the set will reduce the "quota_avail_hard" of
         every other file or directory in the set.

         Note that there may be a number of distinct but overlapping
         sets of files or directories for which a quota_used value is
         maintained (e.g., "all files with a given owner", "all files
         with a given group owner", etc.).

         The server is at liberty to choose any of those sets but should
         do so in a repeatable way.  The rule may be configured per-
         filesystem or may be "choose the set with the smallest quota".

noToC RFC3530 - Page 50

5.11.  Access Control Lists

   The NFS version 4 ACL attribute is an array of access control entries
   (ACE).  Although, the client can read and write the ACL attribute,
   the NFSv4 model is the server does all access control based on the
   server's interpretation of the ACL.  If at any point the client wants
   to check access without issuing an operation that modifies or reads
   data or metadata, the client can use the OPEN and ACCESS operations
   to do so.  There are various access control entry types, as defined
   in the Section "ACE type".  The server is able to communicate which
   ACE types are supported by returning the appropriate value within the
   aclsupport attribute.  Each ACE covers one or more operations on a
   file or directory as described in the Section "ACE Access Mask".  It
   may also contain one or more flags that modify the semantics of the
   ACE as defined in the Section "ACE flag".

   The NFS ACE attribute is defined as follows:

         typedef uint32_t        acetype4;
         typedef uint32_t        aceflag4;
         typedef uint32_t        acemask4;

         struct nfsace4 {
                 acetype4        type;
                 aceflag4        flag;
                 acemask4        access_mask;
                 utf8str_mixed   who;
         };

   To determine if a request succeeds, each nfsace4 entry is processed
   in order by the server.  Only ACEs which have a "who" that matches
   the requester are considered.  Each ACE is processed until all of the
   bits of the requester's access have been ALLOWED.  Once a bit (see
   below) has been ALLOWED by an ACCESS_ALLOWED_ACE, it is no longer
   considered in the processing of later ACEs.  If an ACCESS_DENIED_ACE
   is encountered where the requester's access still has unALLOWED bits
   in common with the "access_mask" of the ACE, the request is denied.
   However, unlike the ALLOWED and DENIED ACE types, the ALARM and AUDIT
   ACE types do not affect a requester's access, and instead are for
   triggering events as a result of a requester's access attempt.

   Therefore, all AUDIT and ALARM ACEs are processed until end of the
   ACL.  When the ACL is fully processed, if there are bits in
   requester's mask that have not been considered whether the server
   allows or denies the access is undefined.  If there is a mode
   attribute on the file, then this cannot happen, since the mode's
   MODE4_*OTH bits will map to EVERYONE@ ACEs that unambiguously specify
   the requester's access.

noToC RFC3530 - Page 51

   The NFS version 4 ACL model is quite rich.  Some server platforms may
   provide access control functionality that goes beyond the UNIX-style
   mode attribute, but which is not as rich as the NFS ACL model.  So
   that users can take advantage of this more limited functionality, the
   server may indicate that it supports ACLs as long as it follows the
   guidelines for mapping between its ACL model and the NFS version 4
   ACL model.

   The situation is complicated by the fact that a server may have
   multiple modules that enforce ACLs.  For example, the enforcement for
   NFS version 4 access may be different from the enforcement for local
   access, and both may be different from the enforcement for access
   through other protocols such as SMB.  So it may be useful for a
   server to accept an ACL even if not all of its modules are able to
   support it.

   The guiding principle in all cases is that the server must not accept
   ACLs that appear to make the file more secure than it really is.

5.11.1.  ACE type

   Type         Description
   _____________________________________________________
   ALLOW        Explicitly grants the access defined in
                acemask4 to the file or directory.

   DENY         Explicitly denies the access defined in
                acemask4 to the file or directory.

   AUDIT        LOG (system dependent) any access
                attempt to a file or directory which
                uses any of the access methods specified
                in acemask4.

   ALARM        Generate a system ALARM (system
                dependent) when any access attempt is
                made to a file or directory for the
                access methods specified in acemask4.

   A server need not support all of the above ACE types.  The bitmask
   constants used to represent the above definitions within the

   aclsupport attribute are as follows:

      const ACL4_SUPPORT_ALLOW_ACL    = 0x00000001;
      const ACL4_SUPPORT_DENY_ACL     = 0x00000002;
      const ACL4_SUPPORT_AUDIT_ACL    = 0x00000004;
      const ACL4_SUPPORT_ALARM_ACL    = 0x00000008;

noToC RFC3530 - Page 52

   The semantics of the "type" field follow the descriptions provided
   above.

   The constants used for the type field (acetype4) are as follows:

      const ACE4_ACCESS_ALLOWED_ACE_TYPE      = 0x00000000;
      const ACE4_ACCESS_DENIED_ACE_TYPE       = 0x00000001;
      const ACE4_SYSTEM_AUDIT_ACE_TYPE        = 0x00000002;
      const ACE4_SYSTEM_ALARM_ACE_TYPE        = 0x00000003;

   Clients should not attempt to set an ACE unless the server claims
   support for that ACE type.  If the server receives a request to set
   an ACE that it cannot store, it MUST reject the request with
   NFS4ERR_ATTRNOTSUPP.  If the server receives a request to set an ACE
   that it can store but cannot enforce, the server SHOULD reject the
   request with NFS4ERR_ATTRNOTSUPP.

   Example: suppose a server can enforce NFS ACLs for NFS access but
   cannot enforce ACLs for local access.  If arbitrary processes can run
   on the server, then the server SHOULD NOT indicate ACL support.  On
   the other hand, if only trusted administrative programs run locally,
   then the server may indicate ACL support.

5.11.2.  ACE Access Mask

   The access_mask field contains values based on the following:

   Access                 Description
   _______________________________________________________________
   READ_DATA              Permission to read the data of the file
   LIST_DIRECTORY         Permission to list the contents of a
                          directory
   WRITE_DATA             Permission to modify the file's data
   ADD_FILE               Permission to add a new file to a
                          directory
   APPEND_DATA            Permission to append data to a file
   ADD_SUBDIRECTORY       Permission to create a subdirectory to a
                          directory
   READ_NAMED_ATTRS       Permission to read the named attributes
                          of a file
   WRITE_NAMED_ATTRS      Permission to write the named attributes
                          of a file
   EXECUTE                Permission to execute a file
   DELETE_CHILD           Permission to delete a file or directory
                          within a directory
   READ_ATTRIBUTES        The ability to read basic attributes
                          (non-acls) of a file
   WRITE_ATTRIBUTES       Permission to change basic attributes

noToC RFC3530 - Page 53

                          (non-acls) of a file
   DELETE                 Permission to Delete the file
   READ_ACL               Permission to Read the ACL
   WRITE_ACL              Permission to Write the ACL
   WRITE_OWNER            Permission to change the owner
   SYNCHRONIZE            Permission to access file locally at the
                          server with synchronous reads and writes

   The bitmask constants used for the access mask field are as follows:

   const ACE4_READ_DATA            = 0x00000001;
   const ACE4_LIST_DIRECTORY       = 0x00000001;
   const ACE4_WRITE_DATA           = 0x00000002;
   const ACE4_ADD_FILE             = 0x00000002;
   const ACE4_APPEND_DATA          = 0x00000004;
   const ACE4_ADD_SUBDIRECTORY     = 0x00000004;
   const ACE4_READ_NAMED_ATTRS     = 0x00000008;
   const ACE4_WRITE_NAMED_ATTRS    = 0x00000010;
   const ACE4_EXECUTE              = 0x00000020;
   const ACE4_DELETE_CHILD         = 0x00000040;
   const ACE4_READ_ATTRIBUTES      = 0x00000080;
   const ACE4_WRITE_ATTRIBUTES     = 0x00000100;
   const ACE4_DELETE               = 0x00010000;
   const ACE4_READ_ACL             = 0x00020000;
   const ACE4_WRITE_ACL            = 0x00040000;
   const ACE4_WRITE_OWNER          = 0x00080000;
   const ACE4_SYNCHRONIZE          = 0x00100000;

   Server implementations need not provide the granularity of control
   that is implied by this list of masks.  For example, POSIX-based
   systems might not distinguish APPEND_DATA (the ability to append to a
   file) from WRITE_DATA (the ability to modify existing contents); both
   masks would be tied to a single "write" permission.  When such a
   server returns attributes to the client, it would show both
   APPEND_DATA and WRITE_DATA if and only if the write permission is
   enabled.

   If a server receives a SETATTR request that it cannot accurately
   implement, it should error in the direction of more restricted
   access.  For example, suppose a server cannot distinguish overwriting
   data from appending new data, as described in the previous paragraph.
   If a client submits an ACE where APPEND_DATA is set but WRITE_DATA is
   not (or vice versa), the server should reject the request with
   NFS4ERR_ATTRNOTSUPP.  Nonetheless, if the ACE has type DENY, the
   server may silently turn on the other bit, so that both APPEND_DATA
   and WRITE_DATA are denied.

noToC RFC3530 - Page 54

5.11.3.  ACE flag

   The "flag" field contains values based on the following descriptions.

   ACE4_FILE_INHERIT_ACE
      Can be placed on a directory and indicates that this ACE should be
      added to each new non-directory file created.

   ACE4_DIRECTORY_INHERIT_ACE
      Can be placed on a directory and indicates that this ACE should be
      added to each new directory created.

   ACE4_INHERIT_ONLY_ACE
      Can be placed on a directory but does not apply to the directory,
      only to newly created files/directories as specified by the above
      two flags.

   ACE4_NO_PROPAGATE_INHERIT_ACE
      Can be placed on a directory.  Normally when a new directory is
      created and an ACE exists on the parent directory which is marked
      ACL4_DIRECTORY_INHERIT_ACE, two ACEs are placed on the new
      directory.  One for the directory itself and one which is an
      inheritable ACE for newly created directories.  This flag tells
      the server to not place an ACE on the newly created directory
      which is inheritable by subdirectories of the created directory.

   ACE4_SUCCESSFUL_ACCESS_ACE_FLAG

   ACL4_FAILED_ACCESS_ACE_FLAG
      The ACE4_SUCCESSFUL_ACCESS_ACE_FLAG (SUCCESS) and
      ACE4_FAILED_ACCESS_ACE_FLAG (FAILED) flag bits relate only to
      ACE4_SYSTEM_AUDIT_ACE_TYPE (AUDIT) and ACE4_SYSTEM_ALARM_ACE_TYPE
      (ALARM) ACE types.  If during the processing of the file's ACL,
      the server encounters an AUDIT or ALARM ACE that matches the
      principal attempting the OPEN, the server notes that fact, and the
      presence, if any, of the SUCCESS and FAILED flags encountered in
      the AUDIT or ALARM ACE.  Once the server completes the ACL
      processing, and the share reservation processing, and the OPEN
      call, it then notes if the OPEN succeeded or failed.  If the OPEN
      succeeded, and if the SUCCESS flag was set for a matching AUDIT or
      ALARM, then the appropriate AUDIT or ALARM event occurs.  If the
      OPEN failed, and if the FAILED flag was set for the matching AUDIT
      or ALARM, then the appropriate AUDIT or ALARM event occurs.
      Clearly either or both of the SUCCESS or FAILED can be set, but if
      neither is set, the AUDIT or ALARM ACE is not useful.

noToC RFC3530 - Page 55

      The previously described processing applies to that of the ACCESS
      operation as well.  The difference being that "success" or
      "failure" does not mean whether ACCESS returns NFS4_OK or not.
      Success means whether ACCESS returns all requested and supported
      bits.  Failure means whether ACCESS failed to return a bit that
      was requested and supported.

   ACE4_IDENTIFIER_GROUP
      Indicates that the "who" refers to a GROUP as defined under UNIX.

   The bitmask constants used for the flag field are as follows:

   const ACE4_FILE_INHERIT_ACE             = 0x00000001;
   const ACE4_DIRECTORY_INHERIT_ACE        = 0x00000002;
   const ACE4_NO_PROPAGATE_INHERIT_ACE     = 0x00000004;
   const ACE4_INHERIT_ONLY_ACE             = 0x00000008;
   const ACE4_SUCCESSFUL_ACCESS_ACE_FLAG   = 0x00000010;
   const ACE4_FAILED_ACCESS_ACE_FLAG       = 0x00000020;
   const ACE4_IDENTIFIER_GROUP             = 0x00000040;

   A server need not support any of these flags.  If the server supports
   flags that are similar to, but not exactly the same as, these flags,
   the implementation may define a mapping between the protocol-defined
   flags and the implementation-defined flags.  Again, the guiding
   principle is that the file not appear to be more secure than it
   really is.

   For example, suppose a client tries to set an ACE with
   ACE4_FILE_INHERIT_ACE set but not ACE4_DIRECTORY_INHERIT_ACE.  If the
   server does not support any form of ACL inheritance, the server
   should reject the request with NFS4ERR_ATTRNOTSUPP.  If the server
   supports a single "inherit ACE" flag that applies to both files and
   directories, the server may reject the request (i.e., requiring the
   client to set both the file and directory inheritance flags).  The
   server may also accept the request and silently turn on the
   ACE4_DIRECTORY_INHERIT_ACE flag.

5.11.4.  ACE who

   There are several special identifiers ("who") which need to be
   understood universally, rather than in the context of a particular
   DNS domain.  Some of these identifiers cannot be understood when an
   NFS client accesses the server, but have meaning when a local process
   accesses the file.  The ability to display and modify these
   permissions is permitted over NFS, even if none of the access methods
   on the server understands the identifiers.

noToC RFC3530 - Page 56

   Who                    Description
   _______________________________________________________________
   "OWNER"                The owner of the file.
   "GROUP"                The group associated with the file.
   "EVERYONE"             The world.
   "INTERACTIVE"          Accessed from an interactive terminal.
   "NETWORK"              Accessed via the network.
   "DIALUP"               Accessed as a dialup user to the server.
   "BATCH"                Accessed from a batch job.
   "ANONYMOUS"            Accessed without any authentication.
   "AUTHENTICATED"        Any authenticated user (opposite of
                          ANONYMOUS)
   "SERVICE"              Access from a system service.

   To avoid conflict, these special identifiers are distinguish by an
   appended "@" and should appear in the form "xxxx@" (note: no domain
   name after the "@").  For example: ANONYMOUS@.

5.11.5.  Mode Attribute

   The NFS version 4 mode attribute is based on the UNIX mode bits.  The
   following bits are defined:

      const MODE4_SUID = 0x800;  /* set user id on execution */
      const MODE4_SGID = 0x400;  /* set group id on execution */
      const MODE4_SVTX = 0x200;  /* save text even after use */
      const MODE4_RUSR = 0x100;  /* read permission: owner */
      const MODE4_WUSR = 0x080;  /* write permission: owner */
      const MODE4_XUSR = 0x040;  /* execute permission: owner */
      const MODE4_RGRP = 0x020;  /* read permission: group */
      const MODE4_WGRP = 0x010;  /* write permission: group */
      const MODE4_XGRP = 0x008;  /* execute permission: group */
      const MODE4_ROTH = 0x004;  /* read permission: other */
      const MODE4_WOTH = 0x002;  /* write permission: other */
      const MODE4_XOTH = 0x001;  /* execute permission: other */

   Bits MODE4_RUSR, MODE4_WUSR, and MODE4_XUSR apply to the principal
   identified in the owner attribute.  Bits MODE4_RGRP, MODE4_WGRP, and

   MODE4_XGRP apply to the principals identified in the owner_group
   attribute.  Bits MODE4_ROTH, MODE4_WOTH, MODE4_XOTH apply to any
   principal that does not match that in the owner group, and does not
   have a group matching that of the owner_group attribute.

   The remaining bits are not defined by this protocol and MUST NOT be
   used.  The minor version mechanism must be used to define further bit
   usage.

noToC RFC3530 - Page 57

   Note that in UNIX, if a file has the MODE4_SGID bit set and no
   MODE4_XGRP bit set, then READ and WRITE must use mandatory file
   locking.

5.11.6.  Mode and ACL Attribute

   The server that supports both mode and ACL must take care to
   synchronize the MODE4_*USR, MODE4_*GRP, and MODE4_*OTH bits with the
   ACEs which have respective who fields of "OWNER@", "GROUP@", and
   "EVERYONE@" so that the client can see semantically equivalent access
   permissions exist whether the client asks for owner, owner_group and
   mode attributes, or for just the ACL.

   Because the mode attribute includes bits (e.g., MODE4_SVTX) that have
   nothing to do with ACL semantics, it is permitted for clients to
   specify both the ACL attribute and mode in the same SETATTR
   operation.  However, because there is no prescribed order for
   processing the attributes in a SETATTR, the client must ensure that
   ACL attribute, if specified without mode, would produce the desired
   mode bits, and conversely, the mode attribute if specified without
   ACL, would produce the desired "OWNER@", "GROUP@", and "EVERYONE@"
   ACEs.

5.11.7.  mounted_on_fileid

   UNIX-based operating environments connect a filesystem into the
   namespace by connecting (mounting) the filesystem onto the existing
   file object (the mount point, usually a directory) of an existing
   filesystem.  When the mount point's parent directory is read via an
   API like readdir(), the return results are directory entries, each
   with a component name and a fileid.  The fileid of the mount point's
   directory entry will be different from the fileid that the stat()
   system call returns.  The stat() system call is returning the fileid
   of the root of the mounted filesystem, whereas readdir() is returning
   the fileid stat() would have returned before any filesystems were
   mounted on the mount point.

   Unlike NFS version 3, NFS version 4 allows a client's LOOKUP request
   to cross other filesystems.  The client detects the filesystem
   crossing whenever the filehandle argument of LOOKUP has an fsid
   attribute different from that of the filehandle returned by LOOKUP.
   A UNIX-based client will consider this a "mount point crossing".
   UNIX has a legacy scheme for allowing a process to determine its
   current working directory.  This relies on readdir() of a mount
   point's parent and stat() of the mount point returning fileids as
   previously described.  The mounted_on_fileid attribute corresponds to
   the fileid that readdir() would have returned as described
   previously.

noToC RFC3530 - Page 58

   While the NFS version 4 client could simply fabricate a fileid
   corresponding to what mounted_on_fileid provides (and if the server
   does not support mounted_on_fileid, the client has no choice), there
   is a risk that the client will generate a fileid that conflicts with
   one that is already assigned to another object in the filesystem.
   Instead, if the server can provide the mounted_on_fileid, the
   potential for client operational problems in this area is eliminated.

   If the server detects that there is no mounted point at the target
   file object, then the value for mounted_on_fileid that it returns is
   the same as that of the fileid attribute.

   The mounted_on_fileid attribute is RECOMMENDED, so the server SHOULD
   provide it if possible, and for a UNIX-based server, this is
   straightforward.  Usually, mounted_on_fileid will be requested during
   a READDIR operation, in which case it is trivial (at least for UNIX-
   based servers) to return mounted_on_fileid since it is equal to the
   fileid of a directory entry returned by readdir().  If
   mounted_on_fileid is requested in a GETATTR operation, the server
   should obey an invariant that has it returning a value that is equal
   to the file object's entry in the object's parent directory, i.e.,
   what readdir() would have returned.  Some operating environments
   allow a series of two or more filesystems to be mounted onto a single
   mount point.  In this case, for the server to obey the aforementioned
   invariant, it will need to find the base mount point, and not the
   intermediate mount points.

6.  Filesystem Migration and Replication

   With the use of the recommended attribute "fs_locations", the NFS
   version 4 server has a method of providing filesystem migration or
   replication services.  For the purposes of migration and replication,
   a filesystem will be defined as all files that share a given fsid
   (both major and minor values are the same).

   The fs_locations attribute provides a list of filesystem locations.
   These locations are specified by providing the server name (either
   DNS domain or IP address) and the path name representing the root of
   the filesystem.  Depending on the type of service being provided, the
   list will provide a new location or a set of alternate locations for
   the filesystem.  The client will use this information to redirect its
   requests to the new server.

6.1.  Replication

   It is expected that filesystem replication will be used in the case
   of read-only data.  Typically, the filesystem will be replicated on
   two or more servers.  The fs_locations attribute will provide the

noToC RFC3530 - Page 59

   list of these locations to the client.  On first access of the
   filesystem, the client should obtain the value of the fs_locations
   attribute.  If, in the future, the client finds the server
   unresponsive, the client may attempt to use another server specified
   by fs_locations.

   If applicable, the client must take the appropriate steps to recover
   valid filehandles from the new server.  This is described in more
   detail in the following sections.

6.2.  Migration

   Filesystem migration is used to move a filesystem from one server to
   another.  Migration is typically used for a filesystem that is
   writable and has a single copy.  The expected use of migration is for
   load balancing or general resource reallocation.  The protocol does
   not specify how the filesystem will be moved between servers.  This
   server-to-server transfer mechanism is left to the server
   implementor.  However, the method used to communicate the migration
   event between client and server is specified here.

   Once the servers participating in the migration have completed the
   move of the filesystem, the error NFS4ERR_MOVED will be returned for
   subsequent requests received by the original server.  The
   NFS4ERR_MOVED error is returned for all operations except PUTFH and
   GETATTR.  Upon receiving the NFS4ERR_MOVED error, the client will
   obtain the value of the fs_locations attribute.  The client will then
   use the contents of the attribute to redirect its requests to the
   specified server.  To facilitate the use of GETATTR, operations such
   as PUTFH must also be accepted by the server for the migrated file
   system's filehandles.  Note that if the server returns NFS4ERR_MOVED,
   the server MUST support the fs_locations attribute.

   If the client requests more attributes than just fs_locations, the
   server may return fs_locations only.  This is to be expected since
   the server has migrated the filesystem and may not have a method of
   obtaining additional attribute data.

   The server implementor needs to be careful in developing a migration
   solution.  The server must consider all of the state information
   clients may have outstanding at the server.  This includes but is not
   limited to locking/share state, delegation state, and asynchronous
   file writes which are represented by WRITE and COMMIT verifiers.  The
   server should strive to minimize the impact on its clients during and
   after the migration process.

noToC RFC3530 - Page 60

6.3.  Interpretation of the fs_locations Attribute

   The fs_location attribute is structured in the following way:

   struct fs_location {
           utf8str_cis     server<>;
           pathname4       rootpath;
   };

   struct fs_locations {
           pathname4       fs_root;
           fs_location     locations<>;
   };

   The fs_location struct is used to represent the location of a
   filesystem by providing a server name and the path to the root of the
   filesystem.  For a multi-homed server or a set of servers that use
   the same rootpath, an array of server names may be provided.  An
   entry in the server array is an UTF8 string and represents one of a
   traditional DNS host name, IPv4 address, or IPv6 address.  It is not
   a requirement that all servers that share the same rootpath be listed
   in one fs_location struct.  The array of server names is provided for
   convenience.  Servers that share the same rootpath may also be listed
   in separate fs_location entries in the fs_locations attribute.

   The fs_locations struct and attribute then contains an array of
   locations.  Since the name space of each server may be constructed
   differently, the "fs_root" field is provided.  The path represented
   by fs_root represents the location of the filesystem in the server's
   name space.  Therefore, the fs_root path is only associated with the
   server from which the fs_locations attribute was obtained.  The
   fs_root path is meant to aid the client in locating the filesystem at
   the various servers listed.

   As an example, there is a replicated filesystem located at two
   servers (servA and servB).  At servA the filesystem is located at
   path "/a/b/c".  At servB the filesystem is located at path "/x/y/z".
   In this example the client accesses the filesystem first at servA
   with a multi-component lookup path of "/a/b/c/d".  Since the client
   used a multi-component lookup to obtain the filehandle at "/a/b/c/d",
   it is unaware that the filesystem's root is located in servA's name
   space at "/a/b/c".  When the client switches to servB, it will need
   to determine that the directory it first referenced at servA is now
   represented by the path "/x/y/z/d" on servB.  To facilitate this, the
   fs_locations attribute provided by servA would have a fs_root value
   of "/a/b/c" and two entries in fs_location.  One entry in fs_location
   will be for itself (servA) and the other will be for servB with a

noToC RFC3530 - Page 61

   path of "/x/y/z".  With this information, the client is able to
   substitute "/x/y/z" for the "/a/b/c" at the beginning of its access
   path and construct "/x/y/z/d" to use for the new server.

   See the section "Security Considerations" for a discussion on the
   recommendations for the security flavor to be used by any GETATTR
   operation that requests the "fs_locations" attribute.

6.4.  Filehandle Recovery for Migration or Replication

   Filehandles for filesystems that are replicated or migrated generally
   have the same semantics as for filesystems that are not replicated or
   migrated.  For example, if a filesystem has persistent filehandles
   and it is migrated to another server, the filehandle values for the
   filesystem will be valid at the new server.

   For volatile filehandles, the servers involved likely do not have a
   mechanism to transfer filehandle format and content between
   themselves.  Therefore, a server may have difficulty in determining
   if a volatile filehandle from an old server should return an error of
   NFS4ERR_FHEXPIRED.  Therefore, the client is informed, with the use
   of the fh_expire_type attribute, whether volatile filehandles will
   expire at the migration or replication event.  If the bit
   FH4_VOL_MIGRATION is set in the fh_expire_type attribute, the client
   must treat the volatile filehandle as if the server had returned the
   NFS4ERR_FHEXPIRED error.  At the migration or replication event in
   the presence of the FH4_VOL_MIGRATION bit, the client will not
   present the original or old volatile filehandle to the new server.
   The client will start its communication with the new server by
   recovering its filehandles using the saved file names.

7.  NFS Server Name Space

7.1.  Server Exports

   On a UNIX server the name space describes all the files reachable by
   pathnames under the root directory or "/".  On a Windows NT server
   the name space constitutes all the files on disks named by mapped
   disk letters.  NFS server administrators rarely make the entire
   server's filesystem name space available to NFS clients.  More often
   portions of the name space are made available via an "export"
   feature.  In previous versions of the NFS protocol, the root
   filehandle for each export is obtained through the MOUNT protocol;
   the client sends a string that identifies the export of name space
   and the server returns the root filehandle for it.  The MOUNT
   protocol supports an EXPORTS procedure that will enumerate the
   server's exports.

noToC RFC3530 - Page 62

7.2.  Browsing Exports

   The NFS version 4 protocol provides a root filehandle that clients
   can use to obtain filehandles for these exports via a multi-component
   LOOKUP.  A common user experience is to use a graphical user
   interface (perhaps a file "Open" dialog window) to find a file via
   progressive browsing through a directory tree.  The client must be
   able to move from one export to another export via single-component,
   progressive LOOKUP operations.

   This style of browsing is not well supported by the NFS version 2 and
   3 protocols.  The client expects all LOOKUP operations to remain
   within a single server filesystem.  For example, the device attribute
   will not change.  This prevents a client from taking name space paths
   that span exports.

   An automounter on the client can obtain a snapshot of the server's
   name space using the EXPORTS procedure of the MOUNT protocol.  If it
   understands the server's pathname syntax, it can create an image of
   the server's name space on the client.  The parts of the name space
   that are not exported by the server are filled in with a "pseudo
   filesystem" that allows the user to browse from one mounted
   filesystem to another.  There is a drawback to this representation of
   the server's name space on the client: it is static.  If the server
   administrator adds a new export the client will be unaware of it.

7.3.  Server Pseudo Filesystem

   NFS version 4 servers avoid this name space inconsistency by
   presenting all the exports within the framework of a single server
   name space.  An NFS version 4 client uses LOOKUP and READDIR
   operations to browse seamlessly from one export to another.  Portions
   of the server name space that are not exported are bridged via a
   "pseudo filesystem" that provides a view of exported directories
   only.  A pseudo filesystem has a unique fsid and behaves like a
   normal, read only filesystem.

   Based on the construction of the server's name space, it is possible
   that multiple pseudo filesystems may exist.  For example,

   /a         pseudo filesystem
   /a/b       real filesystem
   /a/b/c     pseudo filesystem
   /a/b/c/d   real filesystem

   Each of the pseudo filesystems are considered separate entities and
   therefore will have a unique fsid.

noToC RFC3530 - Page 63

7.4.  Multiple Roots

   The DOS and Windows operating environments are sometimes described as
   having "multiple roots".  Filesystems are commonly represented as
   disk letters.  MacOS represents filesystems as top level names.  NFS
   version 4 servers for these platforms can construct a pseudo file
   system above these root names so that disk letters or volume names
   are simply directory names in the pseudo root.

7.5.  Filehandle Volatility

   The nature of the server's pseudo filesystem is that it is a logical
   representation of filesystem(s) available from the server.
   Therefore, the pseudo filesystem is most likely constructed
   dynamically when the server is first instantiated.  It is expected
   that the pseudo filesystem may not have an on disk counterpart from
   which persistent filehandles could be constructed.  Even though it is
   preferable that the server provide persistent filehandles for the
   pseudo filesystem, the NFS client should expect that pseudo file
   system filehandles are volatile.  This can be confirmed by checking
   the associated "fh_expire_type" attribute for those filehandles in
   question.  If the filehandles are volatile, the NFS client must be
   prepared to recover a filehandle value (e.g., with a multi-component
   LOOKUP) when receiving an error of NFS4ERR_FHEXPIRED.

7.6.  Exported Root

   If the server's root filesystem is exported, one might conclude that
   a pseudo-filesystem is not needed.  This would be wrong.  Assume the
   following filesystems on a server:

         /       disk1  (exported)
         /a      disk2  (not exported)
         /a/b    disk3  (exported)

   Because disk2 is not exported, disk3 cannot be reached with simple
   LOOKUPs.  The server must bridge the gap with a pseudo-filesystem.

7.7.  Mount Point Crossing

   The server filesystem environment may be constructed in such a way
   that one filesystem contains a directory which is 'covered' or
   mounted upon by a second filesystem.  For example:

         /a/b            (filesystem 1)
         /a/b/c/d        (filesystem 2)

noToC RFC3530 - Page 64

   The pseudo filesystem for this server may be constructed to look
   like:

         /               (place holder/not exported)
         /a/b            (filesystem 1)
         /a/b/c/d        (filesystem 2)

   It is the server's responsibility to present the pseudo filesystem
   that is complete to the client.  If the client sends a lookup request
   for the path "/a/b/c/d", the server's response is the filehandle of
   the filesystem "/a/b/c/d".  In previous versions of the NFS protocol,
   the server would respond with the filehandle of directory "/a/b/c/d"
   within the filesystem "/a/b".

   The NFS client will be able to determine if it crosses a server mount
   point by a change in the value of the "fsid" attribute.

7.8.  Security Policy and Name Space Presentation

   The application of the server's security policy needs to be carefully
   considered by the implementor.  One may choose to limit the
   viewability of portions of the pseudo filesystem based on the
   server's perception of the client's ability to authenticate itself
   properly.  However, with the support of multiple security mechanisms
   and the ability to negotiate the appropriate use of these mechanisms,
   the server is unable to properly determine if a client will be able
   to authenticate itself.  If, based on its policies, the server
   chooses to limit the contents of the pseudo filesystem, the server
   may effectively hide filesystems from a client that may otherwise
   have legitimate access.

   As suggested practice, the server should apply the security policy of
   a shared resource in the server's namespace to the components of the
   resource's ancestors.  For example:

         /
         /a/b
         /a/b/c

   The /a/b/c directory is a real filesystem and is the shared resource.
   The security policy for /a/b/c is Kerberos with integrity.  The
   server should apply the same security policy to /, /a, and /a/b.
   This allows for the extension of the protection of the server's
   namespace to the ancestors of the real shared resource.

noToC RFC3530 - Page 65

   For the case of the use of multiple, disjoint security mechanisms in
   the server's resources, the security for a particular object in the
   server's namespace should be the union of all security mechanisms of
   all direct descendants.

8.  File Locking and Share Reservations

   Integrating locking into the NFS protocol necessarily causes it to be
   stateful.  With the inclusion of share reservations the protocol
   becomes substantially more dependent on state than the traditional
   combination of NFS and NLM [XNFS].  There are three components to
   making this state manageable:

   o  Clear division between client and server

   o  Ability to reliably detect inconsistency in state between client
      and server

   o  Simple and robust recovery mechanisms

   In this model, the server owns the state information.  The client
   communicates its view of this state to the server as needed.  The
   client is also able to detect inconsistent state before modifying a
   file.

   To support Win32 share reservations it is necessary to atomically
   OPEN or CREATE files.  Having a separate share/unshare operation
   would not allow correct implementation of the Win32 OpenFile API.  In
   order to correctly implement share semantics, the previous NFS
   protocol mechanisms used when a file is opened or created (LOOKUP,
   CREATE, ACCESS) need to be replaced.  The NFS version 4 protocol has
   an OPEN operation that subsumes the NFS version 3 methodology of
   LOOKUP, CREATE, and ACCESS.  However, because many operations require
   a filehandle, the traditional LOOKUP is preserved to map a file name
   to filehandle without establishing state on the server.  The policy
   of granting access or modifying files is managed by the server based
   on the client's state.  These mechanisms can implement policy ranging
   from advisory only locking to full mandatory locking.

8.1.  Locking

   It is assumed that manipulating a lock is rare when compared to READ
   and WRITE operations.  It is also assumed that crashes and network
   partitions are relatively rare.  Therefore it is important that the
   READ and WRITE operations have a lightweight mechanism to indicate if
   they possess a held lock.  A lock request contains the heavyweight
   information required to establish a lock and uniquely define the lock
   owner.

noToC RFC3530 - Page 66

   The following sections describe the transition from the heavy weight
   information to the eventual stateid used for most client and server
   locking and lease interactions.

8.1.1.  Client ID

   For each LOCK request, the client must identify itself to the server.

   This is done in such a way as to allow for correct lock
   identification and crash recovery.  A sequence of a SETCLIENTID
   operation followed by a SETCLIENTID_CONFIRM operation is required to
   establish the identification onto the server.  Establishment of
   identification by a new incarnation of the client also has the effect
   of immediately breaking any leased state that a previous incarnation
   of the client might have had on the server, as opposed to forcing the
   new client incarnation to wait for the leases to expire.  Breaking
   the lease state amounts to the server removing all lock, share
   reservation, and, where the server is not supporting the
   CLAIM_DELEGATE_PREV claim type, all delegation state associated with
   same client with the same identity.  For discussion of delegation
   state recovery, see the section "Delegation Recovery".

   Client identification is encapsulated in the following structure:

         struct nfs_client_id4 {
                 verifier4     verifier;
                 opaque        id<NFS4_OPAQUE_LIMIT>;
         };

   The first field, verifier is a client incarnation verifier that is
   used to detect client reboots.  Only if the verifier is different
   from that which the server has previously recorded the client (as
   identified by the second field of the structure, id) does the server
   start the process of canceling the client's leased state.

   The second field, id is a variable length string that uniquely
   defines the client.

   There are several considerations for how the client generates the id
   string:

   o  The string should be unique so that multiple clients do not
      present the same string.  The consequences of two clients
      presenting the same string range from one client getting an error
      to one client having its leased state abruptly and unexpectedly
      canceled.

noToC RFC3530 - Page 67

   o  The string should be selected so the subsequent incarnations
      (e.g., reboots) of the same client cause the client to present the
      same string.  The implementor is cautioned against an approach
      that requires the string to be recorded in a local file because
      this precludes the use of the implementation in an environment
      where there is no local disk and all file access is from an NFS
      version 4 server.

   o  The string should be different for each server network address
      that the client accesses, rather than common to all server network
      addresses.  The reason is that it may not be possible for the
      client to tell if the same server is listening on multiple network
      addresses.  If the client issues SETCLIENTID with the same id
      string to each network address of such a server, the server will
      think it is the same client, and each successive SETCLIENTID will
      cause the server to begin the process of removing the client's
      previous leased state.

   o  The algorithm for generating the string should not assume that the
      client's network address won't change.  This includes changes
      between client incarnations and even changes while the client is
      stilling running in its current incarnation.  This means that if
      the client includes just the client's and server's network address
      in the id string, there is a real risk, after the client gives up
      the network address, that another client, using a similar
      algorithm for generating the id string, will generate a
      conflicting id string.

   Given the above considerations, an example of a well generated id
   string is one that includes:

   o  The server's network address.

   o  The client's network address.

   o  For a user level NFS version 4 client, it should contain
      additional information to distinguish the client from other user
      level clients running on the same host, such as a process id or
      other unique sequence.

   o  Additional information that tends to be unique, such as one or
      more of:

      -  The client machine's serial number (for privacy reasons, it is
         best to perform some one way function on the serial number).

      -  A MAC address.

noToC RFC3530 - Page 68

      -  The timestamp of when the NFS version 4 software was first
         installed on the client (though this is subject to the
         previously mentioned caution about using information that is
         stored in a file, because the file might only be accessible
         over NFS version 4).

      -  A true random number.  However since this number ought to be
         the same between client incarnations, this shares the same
         problem as that of the using the timestamp of the software
         installation.

   As a security measure, the server MUST NOT cancel a client's leased
   state if the principal established the state for a given id string is
   not the same as the principal issuing the SETCLIENTID.

   Note that SETCLIENTID and SETCLIENTID_CONFIRM has a secondary purpose
   of establishing the information the server needs to make callbacks to
   the client for purpose of supporting delegations.  It is permitted to
   change this information via SETCLIENTID and SETCLIENTID_CONFIRM
   within the same incarnation of the client without removing the
   client's leased state.

   Once a SETCLIENTID and SETCLIENTID_CONFIRM sequence has successfully
   completed, the client uses the shorthand client identifier, of type
   clientid4, instead of the longer and less compact nfs_client_id4
   structure.  This shorthand client identifier (a clientid) is assigned
   by the server and should be chosen so that it will not conflict with
   a clientid previously assigned by the server.  This applies across
   server restarts or reboots.  When a clientid is presented to a server
   and that clientid is not recognized, as would happen after a server
   reboot, the server will reject the request with the error
   NFS4ERR_STALE_CLIENTID.  When this happens, the client must obtain a
   new clientid by use of the SETCLIENTID operation and then proceed to
   any other necessary recovery for the server reboot case (See the
   section "Server Failure and Recovery").

   The client must also employ the SETCLIENTID operation when it
   receives a NFS4ERR_STALE_STATEID error using a stateid derived from
   its current clientid, since this also indicates a server reboot which
   has invalidated the existing clientid (see the next section
   "lock_owner and stateid Definition" for details).

   See the detailed descriptions of SETCLIENTID and SETCLIENTID_CONFIRM
   for a complete specification of the operations.

noToC RFC3530 - Page 69

8.1.2.  Server Release of Clientid

   If the server determines that the client holds no associated state
   for its clientid, the server may choose to release the clientid.  The
   server may make this choice for an inactive client so that resources
   are not consumed by those intermittently active clients.  If the
   client contacts the server after this release, the server must ensure
   the client receives the appropriate error so that it will use the
   SETCLIENTID/SETCLIENTID_CONFIRM sequence to establish a new identity.
   It should be clear that the server must be very hesitant to release a
   clientid since the resulting work on the client to recover from such
   an event will be the same burden as if the server had failed and
   restarted.  Typically a server would not release a clientid unless
   there had been no activity from that client for many minutes.

   Note that if the id string in a SETCLIENTID request is properly
   constructed, and if the client takes care to use the same principal
   for each successive use of SETCLIENTID, then, barring an active
   denial of service attack, NFS4ERR_CLID_INUSE should never be
   returned.

   However, client bugs, server bugs, or perhaps a deliberate change of
   the principal owner of the id string (such as the case of a client
   that changes security flavors, and under the new flavor, there is no
   mapping to the previous owner) will in rare cases result in
   NFS4ERR_CLID_INUSE.

   In that event, when the server gets a SETCLIENTID for a client id
   that currently has no state, or it has state, but the lease has
   expired, rather than returning NFS4ERR_CLID_INUSE, the server MUST
   allow the SETCLIENTID, and confirm the new clientid if followed by
   the appropriate SETCLIENTID_CONFIRM.

8.1.3.  lock_owner and stateid Definition

   When requesting a lock, the client must present to the server the
   clientid and an identifier for the owner of the requested lock.
   These two fields are referred to as the lock_owner and the definition
   of those fields are:

   o  A clientid returned by the server as part of the client's use of
      the SETCLIENTID operation.

   o  A variable length opaque array used to uniquely define the owner
      of a lock managed by the client.

      This may be a thread id, process id, or other unique value.

noToC RFC3530 - Page 70

   When the server grants the lock, it responds with a unique stateid.
   The stateid is used as a shorthand reference to the lock_owner, since
   the server will be maintaining the correspondence between them.

   The server is free to form the stateid in any manner that it chooses
   as long as it is able to recognize invalid and out-of-date stateids.
   This requirement includes those stateids generated by earlier
   instances of the server.  From this, the client can be properly
   notified of a server restart.  This notification will occur when the
   client presents a stateid to the server from a previous
   instantiation.

   The server must be able to distinguish the following situations and
   return the error as specified:

   o  The stateid was generated by an earlier server instance (i.e.,
      before a server reboot).  The error NFS4ERR_STALE_STATEID should
      be returned.

   o  The stateid was generated by the current server instance but the
      stateid no longer designates the current locking state for the
      lockowner-file pair in question (i.e., one or more locking
      operations has occurred).  The error NFS4ERR_OLD_STATEID should be
      returned.

      This error condition will only occur when the client issues a
      locking request which changes a stateid while an I/O request that
      uses that stateid is outstanding.

   o  The stateid was generated by the current server instance but the
      stateid does not designate a locking state for any active
      lockowner-file pair.  The error NFS4ERR_BAD_STATEID should be
      returned.

      This error condition will occur when there has been a logic error
      on the part of the client or server.  This should not happen.

   One mechanism that may be used to satisfy these requirements is for
   the server to,

   o  divide the "other" field of each stateid into two fields:

      -  A server verifier which uniquely designates a particular server
         instantiation.

      -  An index into a table of locking-state structures.

noToC RFC3530 - Page 71

   o  utilize the "seqid" field of each stateid, such that seqid is
      monotonically incremented for each stateid that is associated with
      the same index into the locking-state table.

   By matching the incoming stateid and its field values with the state
   held at the server, the server is able to easily determine if a
   stateid is valid for its current instantiation and state.  If the
   stateid is not valid, the appropriate error can be supplied to the
   client.

8.1.4.  Use of the stateid and Locking

   All READ, WRITE and SETATTR operations contain a stateid.  For the
   purposes of this section, SETATTR operations which change the size
   attribute of a file are treated as if they are writing the area
   between the old and new size (i.e., the range truncated or added to
   the file by means of the SETATTR), even where SETATTR is not
   explicitly mentioned in the text.

   If the lock_owner performs a READ or WRITE in a situation in which it
   has established a lock or share reservation on the server (any OPEN
   constitutes a share reservation) the stateid (previously returned by
   the server) must be used to indicate what locks, including both
   record locks and share reservations, are held by the lockowner.  If
   no state is established by the client, either record lock or share
   reservation, a stateid of all bits 0 is used.  Regardless whether a
   stateid of all bits 0, or a stateid returned by the server is used,
   if there is a conflicting share reservation or mandatory record lock
   held on the file, the server MUST refuse to service the READ or WRITE
   operation.

   Share reservations are established by OPEN operations and by their
   nature are mandatory in that when the OPEN denies READ or WRITE
   operations, that denial results in such operations being rejected
   with error NFS4ERR_LOCKED.  Record locks may be implemented by the
   server as either mandatory or advisory, or the choice of mandatory or
   advisory behavior may be determined by the server on the basis of the
   file being accessed (for example, some UNIX-based servers support a
   "mandatory lock bit" on the mode attribute such that if set, record
   locks are required on the file before I/O is possible).  When record
   locks are advisory, they only prevent the granting of conflicting
   lock requests and have no effect on READs or WRITEs.  Mandatory
   record locks, however, prevent conflicting I/O operations.  When they
   are attempted, they are rejected with NFS4ERR_LOCKED.  When the
   client gets NFS4ERR_LOCKED on a file it knows it has the proper share
   reservation for, it will need to issue a LOCK request on the region

noToC RFC3530 - Page 72

   of the file that includes the region the I/O was to be performed on,
   with an appropriate locktype (i.e., READ*_LT for a READ operation,
   WRITE*_LT for a WRITE operation).

   With NFS version 3, there was no notion of a stateid so there was no
   way to tell if the application process of the client sending the READ
   or WRITE operation had also acquired the appropriate record lock on
   the file.  Thus there was no way to implement mandatory locking.
   With the stateid construct, this barrier has been removed.

   Note that for UNIX environments that support mandatory file locking,
   the distinction between advisory and mandatory locking is subtle.  In
   fact, advisory and mandatory record locks are exactly the same in so
   far as the APIs and requirements on implementation.  If the mandatory
   lock attribute is set on the file, the server checks to see if the
   lockowner has an appropriate shared (read) or exclusive (write)
   record lock on the region it wishes to read or write to.  If there is
   no appropriate lock, the server checks if there is a conflicting lock
   (which can be done by attempting to acquire the conflicting lock on
   the behalf of the lockowner, and if successful, release the lock
   after the READ or WRITE is done), and if there is, the server returns
   NFS4ERR_LOCKED.

   For Windows environments, there are no advisory record locks, so the
   server always checks for record locks during I/O requests.

   Thus, the NFS version 4 LOCK operation does not need to distinguish
   between advisory and mandatory record locks.  It is the NFS version 4
   server's processing of the READ and WRITE operations that introduces
   the distinction.

   Every stateid other than the special stateid values noted in this
   section, whether returned by an OPEN-type operation (i.e., OPEN,
   OPEN_DOWNGRADE), or by a LOCK-type operation (i.e., LOCK or LOCKU),
   defines an access mode for the file (i.e., READ, WRITE, or READ-
   WRITE) as established by the original OPEN which began the stateid
   sequence, and as modified by subsequent OPENs and OPEN_DOWNGRADEs
   within that stateid sequence.  When a READ, WRITE, or SETATTR which
   specifies the size attribute, is done, the operation is subject to
   checking against the access mode to verify that the operation is
   appropriate given the OPEN with which the operation is associated.

   In the case of WRITE-type operations (i.e., WRITEs and SETATTRs which
   set size), the server must verify that the access mode allows writing
   and return an NFS4ERR_OPENMODE error if it does not.  In the case, of
   READ, the server may perform the corresponding check on the access
   mode, or it may choose to allow READ on opens for WRITE only, to
   accommodate clients whose write implementation may unavoidably do

noToC RFC3530 - Page 73

   reads (e.g., due to buffer cache constraints).  However, even if
   READs are allowed in these circumstances, the server MUST still check
   for locks that conflict with the READ (e.g., another open specify
   denial of READs).  Note that a server which does enforce the access
   mode check on READs need not explicitly check for conflicting share
   reservations since the existence of OPEN for read access guarantees
   that no conflicting share reservation can exist.

   A stateid of all bits 1 (one) MAY allow READ operations to bypass
   locking checks at the server.  However, WRITE operations with a
   stateid with bits all 1 (one) MUST NOT bypass locking checks and are
   treated exactly the same as if a stateid of all bits 0 were used.

   A lock may not be granted while a READ or WRITE operation using one
   of the special stateids is being performed and the range of the lock
   request conflicts with the range of the READ or WRITE operation.  For
   the purposes of this paragraph, a conflict occurs when a shared lock
   is requested and a WRITE operation is being performed, or an
   exclusive lock is requested and either a READ or a WRITE operation is
   being performed.  A SETATTR that sets size is treated similarly to a
   WRITE as discussed above.

8.1.5.  Sequencing of Lock Requests

   Locking is different than most NFS operations as it requires "at-
   most-one" semantics that are not provided by ONCRPC.  ONCRPC over a
   reliable transport is not sufficient because a sequence of locking
   requests may span multiple TCP connections.  In the face of
   retransmission or reordering, lock or unlock requests must have a
   well defined and consistent behavior.  To accomplish this, each lock
   request contains a sequence number that is a consecutively increasing
   integer.  Different lock_owners have different sequences.  The server
   maintains the last sequence number (L) received and the response that
   was returned.  The first request issued for any given lock_owner is
   issued with a sequence number of zero.

   Note that for requests that contain a sequence number, for each
   lock_owner, there should be no more than one outstanding request.

   If a request (r) with a previous sequence number (r < L) is received,
   it is rejected with the return of error NFS4ERR_BAD_SEQID.  Given a
   properly-functioning client, the response to (r) must have been
   received before the last request (L) was sent.  If a duplicate of
   last request (r == L) is received, the stored response is returned.
   If a request beyond the next sequence (r == L + 2) is received, it is
   rejected with the return of error NFS4ERR_BAD_SEQID.  Sequence
   history is reinitialized whenever the SETCLIENTID/SETCLIENTID_CONFIRM
   sequence changes the client verifier.

noToC RFC3530 - Page 74

   Since the sequence number is represented with an unsigned 32-bit
   integer, the arithmetic involved with the sequence number is mod
   2^32.  For an example of modulo arithmetic involving sequence numbers
   see [RFC793].

   It is critical the server maintain the last response sent to the
   client to provide a more reliable cache of duplicate non-idempotent
   requests than that of the traditional cache described in [Juszczak].
   The traditional duplicate request cache uses a least recently used
   algorithm for removing unneeded requests.  However, the last lock
   request and response on a given lock_owner must be cached as long as
   the lock state exists on the server.

   The client MUST monotonically increment the sequence number for the
   CLOSE, LOCK, LOCKU, OPEN, OPEN_CONFIRM, and OPEN_DOWNGRADE
   operations.  This is true even in the event that the previous
   operation that used the sequence number received an error.  The only
   exception to this rule is if the previous operation received one of
   the following errors: NFS4ERR_STALE_CLIENTID, NFS4ERR_STALE_STATEID,
   NFS4ERR_BAD_STATEID, NFS4ERR_BAD_SEQID, NFS4ERR_BADXDR,
   NFS4ERR_RESOURCE, NFS4ERR_NOFILEHANDLE.

8.1.6.  Recovery from Replayed Requests

   As described above, the sequence number is per lock_owner.  As long
   as the server maintains the last sequence number received and follows
   the methods described above, there are no risks of a Byzantine router
   re-sending old requests.  The server need only maintain the
   (lock_owner, sequence number) state as long as there are open files
   or closed files with locks outstanding.

   LOCK, LOCKU, OPEN, OPEN_DOWNGRADE, and CLOSE each contain a sequence
   number and therefore the risk of the replay of these operations
   resulting in undesired effects is non-existent while the server
   maintains the lock_owner state.

8.1.7.  Releasing lock_owner State

   When a particular lock_owner no longer holds open or file locking
   state at the server, the server may choose to release the sequence
   number state associated with the lock_owner.  The server may make
   this choice based on lease expiration, for the reclamation of server
   memory, or other implementation specific details.  In any event, the
   server is able to do this safely only when the lock_owner no longer
   is being utilized by the client.  The server may choose to hold the
   lock_owner state in the event that retransmitted requests are
   received.  However, the period to hold this state is implementation
   specific.

noToC RFC3530 - Page 75

   In the case that a LOCK, LOCKU, OPEN_DOWNGRADE, or CLOSE is
   retransmitted after the server has previously released the lock_owner
   state, the server will find that the lock_owner has no files open and
   an error will be returned to the client.  If the lock_owner does have
   a file open, the stateid will not match and again an error is
   returned to the client.

8.1.8.  Use of Open Confirmation

   In the case that an OPEN is retransmitted and the lock_owner is being
   used for the first time or the lock_owner state has been previously
   released by the server, the use of the OPEN_CONFIRM operation will
   prevent incorrect behavior.  When the server observes the use of the
   lock_owner for the first time, it will direct the client to perform
   the OPEN_CONFIRM for the corresponding OPEN.  This sequence
   establishes the use of an lock_owner and associated sequence number.
   Since the OPEN_CONFIRM sequence connects a new open_owner on the
   server with an existing open_owner on a client, the sequence number
   may have any value.  The OPEN_CONFIRM step assures the server that
   the value received is the correct one.  See the section "OPEN_CONFIRM
   - Confirm Open" for further details.

   There are a number of situations in which the requirement to confirm
   an OPEN would pose difficulties for the client and server, in that
   they would be prevented from acting in a timely fashion on
   information received, because that information would be provisional,
   subject to deletion upon non-confirmation.  Fortunately, these are
   situations in which the server can avoid the need for confirmation
   when responding to open requests.  The two constraints are:

   o  The server must not bestow a delegation for any open which would
      require confirmation.

   o  The server MUST NOT require confirmation on a reclaim-type open
      (i.e., one specifying claim type CLAIM_PREVIOUS or
      CLAIM_DELEGATE_PREV).

   These constraints are related in that reclaim-type opens are the only
   ones in which the server may be required to send a delegation.  For
   CLAIM_NULL, sending the delegation is optional while for
   CLAIM_DELEGATE_CUR, no delegation is sent.

   Delegations being sent with an open requiring confirmation are
   troublesome because recovering from non-confirmation adds undue
   complexity to the protocol while requiring confirmation on reclaim-
   type opens poses difficulties in that the inability to resolve

noToC RFC3530 - Page 76

   the status of the reclaim until lease expiration may make it
   difficult to have timely determination of the set of locks being
   reclaimed (since the grace period may expire).

   Requiring open confirmation on reclaim-type opens is avoidable
   because of the nature of the environments in which such opens are
   done.  For CLAIM_PREVIOUS opens, this is immediately after server
   reboot, so there should be no time for lockowners to be created,
   found to be unused, and recycled.  For CLAIM_DELEGATE_PREV opens, we
   are dealing with a client reboot situation.  A server which supports
   delegation can be sure that no lockowners for that client have been
   recycled since client initialization and thus can ensure that
   confirmation will not be required.

(page 76 continued on part 4)