
RFC 3530

Network File System (NFS) version 4 Protocol

Pages: 275
Obsoletes:  3010
Obsoleted by:  7530
Part 3 of 8 – Pages 47 to 76

5.8. Interpreting owner and owner_group

   The recommended attributes "owner" and "owner_group" (and also users
   and groups within the "acl" attribute) are represented in terms of a
   UTF-8 string.  To avoid a representation that is tied to a particular
   underlying implementation at the client or server, the use of the
   UTF-8 string has been chosen.  Note that section 6.1 of [RFC2624]
   provides additional rationale.  It is expected that the client and
   server will have their own local representation of owner and
   owner_group that is used for local storage or presentation to the end
   user.  Therefore, it is expected that when these attributes are
   transferred between the client and server, the local representation
   is translated to a syntax of the form "user@dns_domain".  This allows
   a client and server that do not use the same local representation to
   translate to a common syntax that can be interpreted by both.

   Similarly, security principals may be represented in different ways
   by different security mechanisms.  Servers normally translate these
   representations into a common format, generally that used by local
   storage, to serve as a means of identifying the users corresponding
   to these security principals.  When these local identifiers are
   translated to the form of the owner attribute associated with files
   created by such principals, they identify, in a common format, the
   users associated with each corresponding set of security principals.

   The translation used to interpret owner and group strings is not
   specified as part of the protocol.  This allows various solutions to
   be employed.  For example, a local translation table may be consulted
   that maps between a numeric id and the user@dns_domain syntax.  A
   name service may also be used to accomplish the translation.

   A server may provide a more general service, not limited by any
   particular translation (which would only translate a limited set of
   possible strings), by storing the owner and owner_group attributes in
   local storage without any translation.  Alternatively, it may augment
   a translation method by storing the entire string for attributes for
   which no translation is available while using the local
   representation for those cases in which a translation is available.

   Servers that do not provide support for all possible values of the
   owner and owner_group attributes should return an error
   (NFS4ERR_BADOWNER) when a string is presented that has no
   translation, as the value to be set for a SETATTR of the owner,
   owner_group, or acl attributes.  When a server does accept an owner
   or owner_group value as valid on a SETATTR (and similarly for the
   owner and group strings in an acl), it is promising to return that
   same string when a corresponding GETATTR is done.  Configuration
   changes and ill-constructed name translations (those that contain
   aliasing) may make that promise impossible to honor.  Servers should
   make appropriate efforts to avoid a situation in which these
   attributes have their values changed when no real change to ownership
   has occurred.

   The "dns_domain" portion of the owner string is meant to be a DNS
   domain name.  For example, user@ietf.org.  Servers should accept as
   valid a set of users for at least one domain.  A server may treat
   other domains as having no valid translations.  A more general
   service is provided when a server is capable of accepting users for
   multiple domains, or for all domains, subject to security
   constraints.

   In the case where there is no translation available to the client or
   server, the attribute value must be constructed without the "@".
   Therefore, the absence of the @ from the owner or owner_group
   attribute signifies that no translation was available at the sender
   and that the receiver of the attribute should not use that string as
   a basis for translation into its own internal format.  Even though
   the attribute value cannot be translated, it may still be useful.
   In the case of a client, the attribute string may be used for local
   display of ownership.

   To provide a greater degree of compatibility with previous versions
   of NFS (i.e., v2 and v3), which identified users and groups by 32-bit
   unsigned uid's and gid's, owner and group strings that consist of
   decimal numeric values with no leading zeros can be given a special
   interpretation by clients and servers which choose to provide such
   support.  The receiver may treat such a user or group string as
   representing the same user as would be represented by a v2/v3 uid or
   gid having the corresponding numeric value.  A server is not
   obligated to accept such a string, but may return an NFS4ERR_BADOWNER
   instead.  To avoid this mechanism being used to subvert user and
   group translation, so that a client might pass all of the owners and
   groups in numeric form, a server SHOULD return an NFS4ERR_BADOWNER
   error when there is a valid translation for the user or group
   designated in this way.  In that case, the client must use the
   appropriate name@domain string and not the special form for
   compatibility.
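
   The three shapes of owner string described above (name@domain, a
   string with no "@", and the optional numeric compatibility form) can
   be told apart with a few lines of code.  The following is only a
   minimal sketch: the function name, the returned labels, the decision
   to accept a lone "0", and the treatment of an empty name or domain
   part as malformed are all assumptions made for illustration and are
   not defined by the protocol.

```python
import re

# Decimal digits with no leading zeros.  Accepting a lone "0" is an
# assumption of this sketch, not something the text spells out.
_NUMERIC = re.compile(r"0|[1-9][0-9]*")

UINT32_MAX = 0xFFFFFFFF  # v2/v3 uids and gids are 32-bit unsigned values


def classify_owner(value: str) -> str:
    """Classify an owner/owner_group string (labels are hypothetical)."""
    if "@" in value:
        name, _, domain = value.rpartition("@")
        # Both parts must be present for the user@dns_domain syntax.
        return "name@domain" if name and domain else "malformed"
    if _NUMERIC.fullmatch(value) and int(value) <= UINT32_MAX:
        # A receiver MAY treat this like a v2/v3 uid or gid.
        return "numeric"
    # No "@": the sender had no translation; do not translate further.
    return "untranslated"
```

   A receiver that does not offer the numeric compatibility form would
   simply treat the "numeric" case as it treats "untranslated", or
   reject it with NFS4ERR_BADOWNER.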

   The owner string "nobody" may be used to designate an anonymous user,
   which will be associated with a file created by a security principal
   that cannot be mapped through normal means to the owner attribute.

5.9. Character Case Attributes

   With respect to the case_insensitive and case_preserving attributes,
   each UCS-4 character (which UTF-8 encodes) has a "long descriptive
   name" [RFC1345] which may or may not include the word "CAPITAL" or
   "SMALL".  The presence of SMALL or CAPITAL allows an NFS server to
   implement unambiguous and efficient table driven mappings for case
   insensitive comparisons, and non-case-preserving storage.  For
   general character handling and internationalization issues, see the
   section "Internationalization".

5.10. Quota Attributes

   For the attributes related to filesystem quotas, the following
   definitions apply:

   quota_avail_soft
      The value in bytes which represents the amount of additional disk
      space that can be allocated to this file or directory before the
      user may reasonably be warned.  It is understood that this space
      may be consumed by allocations to other files or directories
      though there is a rule as to which other files or directories.

   quota_avail_hard
      The value in bytes which represents the amount of additional disk
      space beyond the current allocation that can be allocated to this
      file or directory before further allocations will be refused.  It
      is understood that this space may be consumed by allocations to
      other files or directories.

   quota_used
      The value in bytes which represents the amount of disk space used
      by this file or directory and possibly a number of other similar
      files or directories, where the set of "similar" meets at least
      the criterion that allocating space to any file or directory in
      the set will reduce the "quota_avail_hard" of every other file or
      directory in the set.

      Note that there may be a number of distinct but overlapping sets
      of files or directories for which a quota_used value is
      maintained (e.g., "all files with a given owner", "all files with
      a given group owner", etc.).  The server is at liberty to choose
      any of those sets but should do so in a repeatable way.  The rule
      may be configured per-filesystem or may be "choose the set with
      the smallest quota".

5.11. Access Control Lists

   The NFS version 4 ACL attribute is an array of access control entries
   (ACE).  Although the client can read and write the ACL attribute, the
   NFSv4 model is that the server does all access control based on the
   server's interpretation of the ACL.  If at any point the client wants
   to check access without issuing an operation that modifies or reads
   data or metadata, the client can use the OPEN and ACCESS operations
   to do so.  There are various access control entry types, as defined
   in the Section "ACE type".  The server is able to communicate which
   ACE types are supported by returning the appropriate value within the
   aclsupport attribute.  Each ACE covers one or more operations on a
   file or directory as described in the Section "ACE Access Mask".  It
   may also contain one or more flags that modify the semantics of the
   ACE as defined in the Section "ACE flag".

   The NFS ACE attribute is defined as follows:

   typedef uint32_t        acetype4;
   typedef uint32_t        aceflag4;
   typedef uint32_t        acemask4;

   struct nfsace4 {
           acetype4        type;
           aceflag4        flag;
           acemask4        access_mask;
           utf8str_mixed   who;
   };

   To determine if a request succeeds, each nfsace4 entry is processed
   in order by the server.  Only ACEs which have a "who" that matches
   the requester are considered.  Each ACE is processed until all of the
   bits of the requester's access have been ALLOWED.  Once a bit (see
   below) has been ALLOWED by an ACCESS_ALLOWED_ACE, it is no longer
   considered in the processing of later ACEs.  If an ACCESS_DENIED_ACE
   is encountered where the requester's access still has unALLOWED bits
   in common with the "access_mask" of the ACE, the request is denied.
   However, unlike the ALLOWED and DENIED ACE types, the ALARM and AUDIT
   ACE types do not affect a requester's access, and instead are for
   triggering events as a result of a requester's access attempt.
   Therefore, all AUDIT and ALARM ACEs are processed until end of the
   ACL.

   When the ACL is fully processed, if there are bits in the requester's
   mask that have not been considered, whether the server allows or
   denies the access is undefined.  If there is a mode attribute on the
   file, then this cannot happen, since the mode's MODE4_*OTH bits will
   map to EVERYONE@ ACEs that unambiguously specify the requester's
   access.

   The NFS version 4 ACL model is quite rich.  Some server platforms may
   provide access control functionality that goes beyond the UNIX-style
   mode attribute, but which is not as rich as the NFS ACL model.  So
   that users can take advantage of this more limited functionality, the
   server may indicate that it supports ACLs as long as it follows the
   guidelines for mapping between its ACL model and the NFS version 4
   ACL model.

   The situation is complicated by the fact that a server may have
   multiple modules that enforce ACLs.  For example, the enforcement for
   NFS version 4 access may be different from the enforcement for local
   access, and both may be different from the enforcement for access
   through other protocols such as SMB.  So it may be useful for a
   server to accept an ACL even if not all of its modules are able to
   support it.

   The guiding principle in all cases is that the server must not accept
   ACLs that appear to make the file more secure than it really is.
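
   The ACE processing order described earlier in this section can be
   sketched as a short loop.  This is an illustration only, under
   simplifying assumptions: "who" matching is reduced to membership in a
   caller-supplied set of strings (the parameter requester_whos is
   hypothetical), the ACE4_IDENTIFIER_GROUP flag is ignored, and AUDIT
   and ALARM ACEs are skipped because they do not affect the decision.
   A real server would still have to walk the whole ACL to trigger
   those events.

```python
from dataclasses import dataclass
from typing import List, Optional, Set

ACE4_ACCESS_ALLOWED_ACE_TYPE = 0x00000000
ACE4_ACCESS_DENIED_ACE_TYPE = 0x00000001


@dataclass
class NfsAce4:
    """Python stand-in for struct nfsace4."""
    type: int
    flag: int
    access_mask: int
    who: str


def check_access(acl: List[NfsAce4], requester_whos: Set[str],
                 requested: int) -> Optional[bool]:
    """Return True (allowed), False (denied) or None (undefined)."""
    remaining = requested  # requested bits that are not yet ALLOWED
    for ace in acl:
        if remaining == 0:
            break  # every requested bit has been ALLOWED
        if ace.who not in requester_whos:
            continue  # only ACEs whose "who" matches are considered
        if ace.type == ACE4_ACCESS_ALLOWED_ACE_TYPE:
            remaining &= ~ace.access_mask
        elif ace.type == ACE4_ACCESS_DENIED_ACE_TYPE:
            if remaining & ace.access_mask:
                return False  # a still-unALLOWED bit is denied
    # Leftover bits mean the ACL alone does not decide the request.
    return True if remaining == 0 else None
```

   Returning None for leftover bits mirrors the statement that the
   outcome is undefined in that case; a server with a mode attribute
   would not reach it, since EVERYONE@ ACEs cover every requester.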

5.11.1. ACE type

   Type      Description
   _____________________________________________________
   ALLOW     Explicitly grants the access defined in
             acemask4 to the file or directory.

   DENY      Explicitly denies the access defined in
             acemask4 to the file or directory.

   AUDIT     LOG (system dependent) any access
             attempt to a file or directory which
             uses any of the access methods specified
             in acemask4.

   ALARM     Generate a system ALARM (system
             dependent) when any access attempt is
             made to a file or directory for the
             access methods specified in acemask4.

   A server need not support all of the above ACE types.  The bitmask
   constants used to represent the above definitions within the
   aclsupport attribute are as follows:

      const ACL4_SUPPORT_ALLOW_ACL            = 0x00000001;
      const ACL4_SUPPORT_DENY_ACL             = 0x00000002;
      const ACL4_SUPPORT_AUDIT_ACL            = 0x00000004;
      const ACL4_SUPPORT_ALARM_ACL            = 0x00000008;

   The semantics of the "type" field follow the descriptions provided
   above.

   The constants used for the type field (acetype4) are as follows:

      const ACE4_ACCESS_ALLOWED_ACE_TYPE      = 0x00000000;
      const ACE4_ACCESS_DENIED_ACE_TYPE       = 0x00000001;
      const ACE4_SYSTEM_AUDIT_ACE_TYPE        = 0x00000002;
      const ACE4_SYSTEM_ALARM_ACE_TYPE        = 0x00000003;

   Clients should not attempt to set an ACE unless the server claims
   support for that ACE type.  If the server receives a request to set
   an ACE that it cannot store, it MUST reject the request with
   NFS4ERR_ATTRNOTSUPP.  If the server receives a request to set an ACE
   that it can store but cannot enforce, the server SHOULD reject the
   request with NFS4ERR_ATTRNOTSUPP.

   Example: suppose a server can enforce NFS ACLs for NFS access but
   cannot enforce ACLs for local access.  If arbitrary processes can run
   on the server, then the server SHOULD NOT indicate ACL support.  On
   the other hand, if only trusted administrative programs run locally,
   then the server may indicate ACL support.
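
   A client that wants to follow the advice above can compare the
   server's aclsupport value with the type of each ACE before sending a
   SETATTR.  The sketch below assumes the natural pairing of each
   acetype4 constant with the like-named ACL4_SUPPORT_* bit; the
   function name is hypothetical.

```python
ACL4_SUPPORT_ALLOW_ACL = 0x00000001
ACL4_SUPPORT_DENY_ACL = 0x00000002
ACL4_SUPPORT_AUDIT_ACL = 0x00000004
ACL4_SUPPORT_ALARM_ACL = 0x00000008

ACE4_ACCESS_ALLOWED_ACE_TYPE = 0x00000000
ACE4_ACCESS_DENIED_ACE_TYPE = 0x00000001
ACE4_SYSTEM_AUDIT_ACE_TYPE = 0x00000002
ACE4_SYSTEM_ALARM_ACE_TYPE = 0x00000003

# Assumed pairing of ACE type to aclsupport bit.
_SUPPORT_BIT_FOR_TYPE = {
    ACE4_ACCESS_ALLOWED_ACE_TYPE: ACL4_SUPPORT_ALLOW_ACL,
    ACE4_ACCESS_DENIED_ACE_TYPE: ACL4_SUPPORT_DENY_ACL,
    ACE4_SYSTEM_AUDIT_ACE_TYPE: ACL4_SUPPORT_AUDIT_ACL,
    ACE4_SYSTEM_ALARM_ACE_TYPE: ACL4_SUPPORT_ALARM_ACL,
}


def server_supports_ace_type(aclsupport: int, ace_type: int) -> bool:
    """True if the aclsupport bitmask advertises the given ACE type."""
    bit = _SUPPORT_BIT_FOR_TYPE.get(ace_type)
    return bit is not None and bool(aclsupport & bit)
```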

5.11.2. ACE Access Mask

   The access_mask field contains values based on the following:

   Access                 Description
   _______________________________________________________________
   READ_DATA              Permission to read the data of the file
   LIST_DIRECTORY         Permission to list the contents of a
                          directory
   WRITE_DATA             Permission to modify the file's data
   ADD_FILE               Permission to add a new file to a
                          directory
   APPEND_DATA            Permission to append data to a file
   ADD_SUBDIRECTORY       Permission to create a subdirectory to a
                          directory
   READ_NAMED_ATTRS       Permission to read the named attributes
                          of a file
   WRITE_NAMED_ATTRS      Permission to write the named attributes
                          of a file
   EXECUTE                Permission to execute a file
   DELETE_CHILD           Permission to delete a file or directory
                          within a directory
   READ_ATTRIBUTES        The ability to read basic attributes
                          (non-acls) of a file
   WRITE_ATTRIBUTES       Permission to change basic attributes
                          (non-acls) of a file
   DELETE                 Permission to Delete the file
   READ_ACL               Permission to Read the ACL
   WRITE_ACL              Permission to Write the ACL
   WRITE_OWNER            Permission to change the owner
   SYNCHRONIZE            Permission to access file locally at the
                          server with synchronous reads and writes

   The bitmask constants used for the access mask field are as follows:

   const ACE4_READ_DATA            = 0x00000001;
   const ACE4_LIST_DIRECTORY       = 0x00000001;
   const ACE4_WRITE_DATA           = 0x00000002;
   const ACE4_ADD_FILE             = 0x00000002;
   const ACE4_APPEND_DATA          = 0x00000004;
   const ACE4_ADD_SUBDIRECTORY     = 0x00000004;
   const ACE4_READ_NAMED_ATTRS     = 0x00000008;
   const ACE4_WRITE_NAMED_ATTRS    = 0x00000010;
   const ACE4_EXECUTE              = 0x00000020;
   const ACE4_DELETE_CHILD         = 0x00000040;
   const ACE4_READ_ATTRIBUTES      = 0x00000080;
   const ACE4_WRITE_ATTRIBUTES     = 0x00000100;
   const ACE4_DELETE               = 0x00010000;
   const ACE4_READ_ACL             = 0x00020000;
   const ACE4_WRITE_ACL            = 0x00040000;
   const ACE4_WRITE_OWNER          = 0x00080000;
   const ACE4_SYNCHRONIZE          = 0x00100000;

   Server implementations need not provide the granularity of control
   that is implied by this list of masks.  For example, POSIX-based
   systems might not distinguish APPEND_DATA (the ability to append to a
   file) from WRITE_DATA (the ability to modify existing contents); both
   masks would be tied to a single "write" permission.  When such a
   server returns attributes to the client, it would show both
   APPEND_DATA and WRITE_DATA if and only if the write permission is
   enabled.

   If a server receives a SETATTR request that it cannot accurately
   implement, it should err in the direction of more restricted
   access.  For example, suppose a server cannot distinguish overwriting
   data from appending new data, as described in the previous paragraph.
   If a client submits an ACE where APPEND_DATA is set but WRITE_DATA is
   not (or vice versa), the server should reject the request with
   NFS4ERR_ATTRNOTSUPP.  Nonetheless, if the ACE has type DENY, the
   server may silently turn on the other bit, so that both APPEND_DATA
   and WRITE_DATA are denied.
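
   The APPEND_DATA/WRITE_DATA case above can be expressed as a small
   normalization step on the server.  This is a sketch for a server
   that has a single "write" permission; the function name and the
   exception class (standing in for an NFS4ERR_ATTRNOTSUPP reply) are
   hypothetical.

```python
ACE4_WRITE_DATA = 0x00000002
ACE4_APPEND_DATA = 0x00000004
ACE4_ACCESS_DENIED_ACE_TYPE = 0x00000001

_WRITE_PAIR = ACE4_WRITE_DATA | ACE4_APPEND_DATA


class AttrNotSupp(Exception):
    """Hypothetical stand-in for returning NFS4ERR_ATTRNOTSUPP."""


def normalize_write_bits(ace_type: int, access_mask: int) -> int:
    """Return the mask a single-write-bit server can store, or raise."""
    present = access_mask & _WRITE_PAIR
    if present in (0, _WRITE_PAIR):
        return access_mask  # both bits agree: representable as-is
    if ace_type == ACE4_ACCESS_DENIED_ACE_TYPE:
        # Denying both bits only makes access more restricted.
        return access_mask | _WRITE_PAIR
    # Any other type with a split pair cannot be stored accurately.
    raise AttrNotSupp("cannot separate APPEND_DATA from WRITE_DATA")
```

   Silently widening an ALLOW ACE would grant more than the client
   asked for, which is why only the DENY case is adjusted in place.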

5.11.3. ACE flag

   The "flag" field contains values based on the following descriptions.

   ACE4_FILE_INHERIT_ACE
      Can be placed on a directory and indicates that this ACE should be
      added to each new non-directory file created.

   ACE4_DIRECTORY_INHERIT_ACE
      Can be placed on a directory and indicates that this ACE should be
      added to each new directory created.

   ACE4_INHERIT_ONLY_ACE
      Can be placed on a directory but does not apply to the directory,
      only to newly created files/directories as specified by the above
      two flags.

   ACE4_NO_PROPAGATE_INHERIT_ACE
      Can be placed on a directory.  Normally when a new directory is
      created and an ACE exists on the parent directory which is marked
      ACE4_DIRECTORY_INHERIT_ACE, two ACEs are placed on the new
      directory.  One for the directory itself and one which is an
      inheritable ACE for newly created directories.  This flag tells
      the server to not place an ACE on the newly created directory
      which is inheritable by subdirectories of the created directory.

   ACE4_SUCCESSFUL_ACCESS_ACE_FLAG

   ACE4_FAILED_ACCESS_ACE_FLAG
      The ACE4_SUCCESSFUL_ACCESS_ACE_FLAG (SUCCESS) and
      ACE4_FAILED_ACCESS_ACE_FLAG (FAILED) flag bits relate only to
      ACE4_SYSTEM_AUDIT_ACE_TYPE (AUDIT) and ACE4_SYSTEM_ALARM_ACE_TYPE
      (ALARM) ACE types.  If during the processing of the file's ACL,
      the server encounters an AUDIT or ALARM ACE that matches the
      principal attempting the OPEN, the server notes that fact, and the
      presence, if any, of the SUCCESS and FAILED flags encountered in
      the AUDIT or ALARM ACE.  Once the server completes the ACL
      processing, and the share reservation processing, and the OPEN
      call, it then notes if the OPEN succeeded or failed.  If the OPEN
      succeeded, and if the SUCCESS flag was set for a matching AUDIT or
      ALARM, then the appropriate AUDIT or ALARM event occurs.  If the
      OPEN failed, and if the FAILED flag was set for the matching AUDIT
      or ALARM, then the appropriate AUDIT or ALARM event occurs.
      Clearly either or both of the SUCCESS or FAILED can be set, but if
      neither is set, the AUDIT or ALARM ACE is not useful.

      The previously described processing applies to that of the ACCESS
      operation as well.  The difference is that "success" or
      "failure" does not mean whether ACCESS returns NFS4_OK or not.
      Success means whether ACCESS returns all requested and supported
      bits.  Failure means whether ACCESS failed to return a bit that
      was requested and supported.

   ACE4_IDENTIFIER_GROUP
      Indicates that the "who" refers to a GROUP as defined under UNIX.

   The bitmask constants used for the flag field are as follows:

   const ACE4_FILE_INHERIT_ACE             = 0x00000001;
   const ACE4_DIRECTORY_INHERIT_ACE        = 0x00000002;
   const ACE4_NO_PROPAGATE_INHERIT_ACE     = 0x00000004;
   const ACE4_INHERIT_ONLY_ACE             = 0x00000008;
   const ACE4_SUCCESSFUL_ACCESS_ACE_FLAG   = 0x00000010;
   const ACE4_FAILED_ACCESS_ACE_FLAG       = 0x00000020;
   const ACE4_IDENTIFIER_GROUP             = 0x00000040;

   A server need not support any of these flags.  If the server supports
   flags that are similar to, but not exactly the same as, these flags,
   the implementation may define a mapping between the protocol-defined
   flags and the implementation-defined flags.  Again, the guiding
   principle is that the file not appear to be more secure than it
   really is.

   For example, suppose a client tries to set an ACE with
   ACE4_FILE_INHERIT_ACE set but not ACE4_DIRECTORY_INHERIT_ACE.  If the
   server does not support any form of ACL inheritance, the server
   should reject the request with NFS4ERR_ATTRNOTSUPP.  If the server
   supports a single "inherit ACE" flag that applies to both files and
   directories, the server may reject the request (i.e., requiring the
   client to set both the file and directory inheritance flags).  The
   server may also accept the request and silently turn on the
   ACE4_DIRECTORY_INHERIT_ACE flag.
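
   For a server that does support the inheritance flags, the effect of
   the flag descriptions on a newly created object can be sketched as
   follows.  This is one plausible reading and not a normative
   algorithm: the tuple layout, the function name, and the choice to
   clear inheritance bits on the effective ACE and to mark the
   propagated copy with ACE4_INHERIT_ONLY_ACE are assumptions.  Parent
   ACEs that carry only ACE4_FILE_INHERIT_ACE are not propagated to new
   directories here, because the text does not say how to treat them.

```python
from typing import List, Tuple

ACE4_FILE_INHERIT_ACE = 0x00000001
ACE4_DIRECTORY_INHERIT_ACE = 0x00000002
ACE4_NO_PROPAGATE_INHERIT_ACE = 0x00000004
ACE4_INHERIT_ONLY_ACE = 0x00000008

_INHERIT_BITS = (ACE4_FILE_INHERIT_ACE | ACE4_DIRECTORY_INHERIT_ACE |
                 ACE4_NO_PROPAGATE_INHERIT_ACE | ACE4_INHERIT_ONLY_ACE)

# (type, flag, access_mask, who), mirroring struct nfsace4.
Ace = Tuple[int, int, int, str]


def inherited_aces(parent_acl: List[Ace], child_is_dir: bool) -> List[Ace]:
    """Compute the ACEs a new child receives from its parent directory."""
    out: List[Ace] = []
    for ace_type, flag, mask, who in parent_acl:
        effective_flag = flag & ~_INHERIT_BITS
        if not child_is_dir:
            if flag & ACE4_FILE_INHERIT_ACE:
                out.append((ace_type, effective_flag, mask, who))
            continue
        if flag & ACE4_DIRECTORY_INHERIT_ACE:
            # First ACE: applies to the new directory itself.
            out.append((ace_type, effective_flag, mask, who))
            if not flag & ACE4_NO_PROPAGATE_INHERIT_ACE:
                # Second ACE: inheritable by the new directory's children.
                keep = flag & (ACE4_FILE_INHERIT_ACE |
                               ACE4_DIRECTORY_INHERIT_ACE)
                out.append((ace_type,
                            effective_flag | keep | ACE4_INHERIT_ONLY_ACE,
                            mask, who))
    return out
```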

5.11.4. ACE who

   There are several special identifiers ("who") which need to be
   understood universally, rather than in the context of a particular
   DNS domain.  Some of these identifiers cannot be understood when an
   NFS client accesses the server, but have meaning when a local process
   accesses the file.  The ability to display and modify these
   permissions is permitted over NFS, even if none of the access methods
   on the server understands the identifiers.

   Who                    Description
   _______________________________________________________________
   "OWNER"                The owner of the file.
   "GROUP"                The group associated with the file.
   "EVERYONE"             The world.
   "INTERACTIVE"          Accessed from an interactive terminal.
   "NETWORK"              Accessed via the network.
   "DIALUP"               Accessed as a dialup user to the server.
   "BATCH"                Accessed from a batch job.
   "ANONYMOUS"            Accessed without any authentication.
   "AUTHENTICATED"        Any authenticated user (opposite of
                          ANONYMOUS)
   "SERVICE"              Access from a system service.

   To avoid conflict, these special identifiers are distinguished by an
   appended "@" and should appear in the form "xxxx@" (note: no domain
   name after the "@").  For example: ANONYMOUS@.

5.11.5. Mode Attribute

   The NFS version 4 mode attribute is based on the UNIX mode bits.  The
   following bits are defined:

      const MODE4_SUID = 0x800;  /* set user id on execution */
      const MODE4_SGID = 0x400;  /* set group id on execution */
      const MODE4_SVTX = 0x200;  /* save text even after use */
      const MODE4_RUSR = 0x100;  /* read permission: owner */
      const MODE4_WUSR = 0x080;  /* write permission: owner */
      const MODE4_XUSR = 0x040;  /* execute permission: owner */
      const MODE4_RGRP = 0x020;  /* read permission: group */
      const MODE4_WGRP = 0x010;  /* write permission: group */
      const MODE4_XGRP = 0x008;  /* execute permission: group */
      const MODE4_ROTH = 0x004;  /* read permission: other */
      const MODE4_WOTH = 0x002;  /* write permission: other */
      const MODE4_XOTH = 0x001;  /* execute permission: other */

   Bits MODE4_RUSR, MODE4_WUSR, and MODE4_XUSR apply to the principal
   identified in the owner attribute.  Bits MODE4_RGRP, MODE4_WGRP, and
   MODE4_XGRP apply to the principals identified in the owner_group
   attribute.  Bits MODE4_ROTH, MODE4_WOTH, MODE4_XOTH apply to any
   principal that does not match that in the owner attribute, and does
   not have a group matching that of the owner_group attribute.

   The remaining bits are not defined by this protocol and MUST NOT be
   used.  The minor version mechanism must be used to define further bit
   usage.

   Note that in UNIX, if a file has the MODE4_SGID bit set and no
   MODE4_XGRP bit set, then READ and WRITE must use mandatory file
   locking.
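
   As an illustration of how the mode bits combine, the sketch below
   renders the nine permission bits in the familiar "rwx" notation and
   tests for the SGID-without-XGRP combination mentioned in the note
   above.  Both function names are hypothetical, and the rendering
   ignores MODE4_SUID, MODE4_SGID, and MODE4_SVTX.

```python
MODE4_SUID = 0x800
MODE4_SGID = 0x400
MODE4_SVTX = 0x200
MODE4_RUSR = 0x100
MODE4_WUSR = 0x080
MODE4_XUSR = 0x040
MODE4_RGRP = 0x020
MODE4_WGRP = 0x010
MODE4_XGRP = 0x008
MODE4_ROTH = 0x004
MODE4_WOTH = 0x002
MODE4_XOTH = 0x001

_PERMISSION_BITS = [
    (MODE4_RUSR, "r"), (MODE4_WUSR, "w"), (MODE4_XUSR, "x"),
    (MODE4_RGRP, "r"), (MODE4_WGRP, "w"), (MODE4_XGRP, "x"),
    (MODE4_ROTH, "r"), (MODE4_WOTH, "w"), (MODE4_XOTH, "x"),
]


def describe_mode(mode: int) -> str:
    """Render owner/group/other permission bits, e.g. 'rwxr-xr--'."""
    return "".join(ch if mode & bit else "-"
                   for bit, ch in _PERMISSION_BITS)


def implies_mandatory_locking(mode: int) -> bool:
    """SGID set with group-execute clear, per the UNIX convention."""
    return bool(mode & MODE4_SGID) and not (mode & MODE4_XGRP)
```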

5.11.6. Mode and ACL Attribute

   A server that supports both mode and ACL must take care to
   synchronize the MODE4_*USR, MODE4_*GRP, and MODE4_*OTH bits with the
   ACEs which have respective who fields of "OWNER@", "GROUP@", and
   "EVERYONE@" so that the client can see that semantically equivalent
   access permissions exist whether the client asks for owner,
   owner_group and mode attributes, or for just the ACL.

   Because the mode attribute includes bits (e.g., MODE4_SVTX) that have
   nothing to do with ACL semantics, it is permitted for clients to
   specify both the ACL attribute and mode in the same SETATTR
   operation.  However, because there is no prescribed order for
   processing the attributes in a SETATTR, the client must ensure that
   the ACL attribute, if specified without mode, would produce the
   desired mode bits, and conversely, the mode attribute, if specified
   without ACL, would produce the desired "OWNER@", "GROUP@", and
   "EVERYONE@" ACEs.

5.11.7. mounted_on_fileid

   UNIX-based operating environments connect a filesystem into the
   namespace by connecting (mounting) the filesystem onto the existing
   file object (the mount point, usually a directory) of an existing
   filesystem.  When the mount point's parent directory is read via an
   API like readdir(), the return results are directory entries, each
   with a component name and a fileid.  The fileid of the mount point's
   directory entry will be different from the fileid that the stat()
   system call returns.  The stat() system call is returning the fileid
   of the root of the mounted filesystem, whereas readdir() is returning
   the fileid stat() would have returned before any filesystems were
   mounted on the mount point.

   Unlike NFS version 3, NFS version 4 allows a client's LOOKUP request
   to cross other filesystems.  The client detects the filesystem
   crossing whenever the filehandle argument of LOOKUP has an fsid
   attribute different from that of the filehandle returned by LOOKUP.
   A UNIX-based client will consider this a "mount point crossing".
   UNIX has a legacy scheme for allowing a process to determine its
   current working directory.  This relies on readdir() of a mount
   point's parent and stat() of the mount point returning fileids as
   previously described.  The mounted_on_fileid attribute corresponds to
   the fileid that readdir() would have returned as described
   previously.

   While the NFS version 4 client could simply fabricate a fileid
   corresponding to what mounted_on_fileid provides (and if the server
   does not support mounted_on_fileid, the client has no choice), there
   is a risk that the client will generate a fileid that conflicts with
   one that is already assigned to another object in the filesystem.
   Instead, if the server can provide the mounted_on_fileid, the
   potential for client operational problems in this area is eliminated.

   If the server detects that there is no mount point at the target
   file object, then the value for mounted_on_fileid that it returns is
   the same as that of the fileid attribute.

   The mounted_on_fileid attribute is RECOMMENDED, so the server SHOULD
   provide it if possible, and for a UNIX-based server, this is
   straightforward.  Usually, mounted_on_fileid will be requested during
   a READDIR operation, in which case it is trivial (at least for UNIX-
   based servers) to return mounted_on_fileid since it is equal to the
   fileid of a directory entry returned by readdir().  If
   mounted_on_fileid is requested in a GETATTR operation, the server
   should obey an invariant that has it returning a value that is equal
   to the file object's entry in the object's parent directory, i.e.,
   what readdir() would have returned.  Some operating environments
   allow a series of two or more filesystems to be mounted onto a single
   mount point.  In this case, for the server to obey the aforementioned
   invariant, it will need to find the base mount point, and not the
   intermediate mount points.

6. Filesystem Migration and Replication

   With the use of the recommended attribute "fs_locations", the NFS
   version 4 server has a method of providing filesystem migration or
   replication services.  For the purposes of migration and replication,
   a filesystem will be defined as all files that share a given fsid
   (both major and minor values are the same).

   The fs_locations attribute provides a list of filesystem locations.
   These locations are specified by providing the server name (either
   DNS domain or IP address) and the path name representing the root of
   the filesystem.  Depending on the type of service being provided, the
   list will provide a new location or a set of alternate locations for
   the filesystem.  The client will use this information to redirect its
   requests to the new server.

6.1. Replication

   It is expected that filesystem replication will be used in the case
   of read-only data.  Typically, the filesystem will be replicated on
   two or more servers.  The fs_locations attribute will provide the
   list of these locations to the client.  On first access of the
   filesystem, the client should obtain the value of the fs_locations
   attribute.  If, in the future, the client finds the server
   unresponsive, the client may attempt to use another server specified
   by fs_locations.

   If applicable, the client must take the appropriate steps to recover
   valid filehandles from the new server.  This is described in more
   detail in the following sections.

6.2. Migration

   Filesystem migration is used to move a filesystem from one server to
   another.  Migration is typically used for a filesystem that is
   writable and has a single copy.  The expected use of migration is for
   load balancing or general resource reallocation.  The protocol does
   not specify how the filesystem will be moved between servers.  This
   server-to-server transfer mechanism is left to the server
   implementor.  However, the method used to communicate the migration
   event between client and server is specified here.

   Once the servers participating in the migration have completed the
   move of the filesystem, the error NFS4ERR_MOVED will be returned for
   subsequent requests received by the original server.  The
   NFS4ERR_MOVED error is returned for all operations except PUTFH and
   GETATTR.  Upon receiving the NFS4ERR_MOVED error, the client will
   obtain the value of the fs_locations attribute.  The client will then
   use the contents of the attribute to redirect its requests to the
   specified server.  To facilitate the use of GETATTR, operations such
   as PUTFH must also be accepted by the server for the migrated file
   system's filehandles.  Note that if the server returns NFS4ERR_MOVED,
   the server MUST support the fs_locations attribute.

   If the client requests more attributes than just fs_locations, the
   server may return fs_locations only.  This is to be expected since
   the server has migrated the filesystem and may not have a method of
   obtaining additional attribute data.

   The server implementor needs to be careful in developing a migration
   solution.  The server must consider all of the state information
   clients may have outstanding at the server.  This includes but is not
   limited to locking/share state, delegation state, and asynchronous
   file writes which are represented by WRITE and COMMIT verifiers.  The
   server should strive to minimize the impact on its clients during and
   after the migration process.

6.3. Interpretation of the fs_locations Attribute

The fs_location attribute is structured in the following way: struct fs_location { utf8str_cis server<>; pathname4 rootpath; }; struct fs_locations { pathname4 fs_root; fs_location locations<>; }; The fs_location struct is used to represent the location of a filesystem by providing a server name and the path to the root of the filesystem. For a multi-homed server or a set of servers that use the same rootpath, an array of server names may be provided. An entry in the server array is an UTF8 string and represents one of a traditional DNS host name, IPv4 address, or IPv6 address. It is not a requirement that all servers that share the same rootpath be listed in one fs_location struct. The array of server names is provided for convenience. Servers that share the same rootpath may also be listed in separate fs_location entries in the fs_locations attribute. The fs_locations struct and attribute then contains an array of locations. Since the name space of each server may be constructed differently, the "fs_root" field is provided. The path represented by fs_root represents the location of the filesystem in the server's name space. Therefore, the fs_root path is only associated with the server from which the fs_locations attribute was obtained. The fs_root path is meant to aid the client in locating the filesystem at the various servers listed. As an example, there is a replicated filesystem located at two servers (servA and servB). At servA the filesystem is located at path "/a/b/c". At servB the filesystem is located at path "/x/y/z". In this example the client accesses the filesystem first at servA with a multi-component lookup path of "/a/b/c/d". Since the client used a multi-component lookup to obtain the filehandle at "/a/b/c/d", it is unaware that the filesystem's root is located in servA's name space at "/a/b/c". 
When the client switches to servB, it will need to determine that the directory it first referenced at servA is now represented by the path "/x/y/z/d" on servB. To facilitate this, the fs_locations attribute provided by servA would have a fs_root value of "/a/b/c" and two entries in fs_location. One entry in fs_location will be for itself (servA) and the other will be for servB with a
   path of "/x/y/z".  With this information, the client is able to
   substitute "/x/y/z" for the "/a/b/c" at the beginning of its access
   path and construct "/x/y/z/d" to use for the new server.
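
The substitution described above can be sketched as follows. This is only an illustration: the function names split_path and translate_path are hypothetical, and paths are handled as "/"-separated strings rather than as pathname4 component arrays.

```python
def split_path(path):
    """Split a "/"-separated path into its non-empty components."""
    return [component for component in path.split("/") if component]


def translate_path(access_path, fs_root, new_rootpath):
    """Replace the fs_root prefix of access_path with new_rootpath.

    access_path  -- path the client used at the original server
    fs_root      -- fs_root value from the fs_locations attribute
    new_rootpath -- rootpath of the fs_location entry chosen by the client
    """
    access = split_path(access_path)
    root = split_path(fs_root)
    if access[:len(root)] != root:
        raise ValueError("access path does not lie below fs_root")
    remainder = access[len(root):]
    return "/" + "/".join(split_path(new_rootpath) + remainder)


# Example from the text: servA exports at /a/b/c, servB at /x/y/z.
new_path = translate_path("/a/b/c/d", "/a/b/c", "/x/y/z")
```

Comparing whole components rather than raw string prefixes avoids treating "/a/b/cd" as if it were below "/a/b/c".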

   See the section "Security Considerations" for a discussion on the
   recommendations for the security flavor to be used by any GETATTR
   operation that requests the "fs_locations" attribute.

6.4. Filehandle Recovery for Migration or Replication

Filehandles for filesystems that are replicated or migrated generally have the same semantics as for filesystems that are not replicated or migrated. For example, if a filesystem has persistent filehandles and it is migrated to another server, the filehandle values for the filesystem will be valid at the new server. For volatile filehandles, the servers involved likely do not have a mechanism to transfer filehandle format and content between themselves. Therefore, a server may have difficulty in determining if a volatile filehandle from an old server should return an error of NFS4ERR_FHEXPIRED. Therefore, the client is informed, with the use of the fh_expire_type attribute, whether volatile filehandles will expire at the migration or replication event. If the bit FH4_VOL_MIGRATION is set in the fh_expire_type attribute, the client must treat the volatile filehandle as if the server had returned the NFS4ERR_FHEXPIRED error. At the migration or replication event in the presence of the FH4_VOL_MIGRATION bit, the client will not present the original or old volatile filehandle to the new server. The client will start its communication with the new server by recovering its filehandles using the saved file names.
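
A client-side check of the FH4_VOL_MIGRATION bit might look like the sketch below. The bit value is taken from the fh_expire_type constants defined elsewhere in this specification; the function name plan_filehandle_recovery and the saved_names mapping are hypothetical.

```python
# Bit value of FH4_VOL_MIGRATION among the fh_expire_type constants.
FH4_VOL_MIGRATION = 0x00000004


def plan_filehandle_recovery(fh_expire_type, saved_names):
    """Return the saved names that must be looked up again at the new server.

    saved_names maps an old filehandle (bytes) to the path name the client
    saved for it.  If FH4_VOL_MIGRATION is set, every old filehandle is
    treated as if NFS4ERR_FHEXPIRED had been returned, so all names are
    recovered by name.  Otherwise nothing needs to be recovered here.
    """
    if fh_expire_type & FH4_VOL_MIGRATION:
        return sorted(saved_names.values())
    return []
```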

7. NFS Server Name Space

7.1. Server Exports

On a UNIX server the name space describes all the files reachable by pathnames under the root directory or "/". On a Windows NT server the name space constitutes all the files on disks named by mapped disk letters. NFS server administrators rarely make the entire server's filesystem name space available to NFS clients. More often portions of the name space are made available via an "export" feature. In previous versions of the NFS protocol, the root filehandle for each export is obtained through the MOUNT protocol; the client sends a string that identifies the export of name space and the server returns the root filehandle for it. The MOUNT protocol supports an EXPORTS procedure that will enumerate the server's exports.

7.2. Browsing Exports

The NFS version 4 protocol provides a root filehandle that clients can use to obtain filehandles for these exports via a multi-component LOOKUP. A common user experience is to use a graphical user interface (perhaps a file "Open" dialog window) to find a file via progressive browsing through a directory tree. The client must be able to move from one export to another export via single-component, progressive LOOKUP operations. This style of browsing is not well supported by the NFS version 2 and 3 protocols. The client expects all LOOKUP operations to remain within a single server filesystem. For example, the device attribute will not change. This prevents a client from taking name space paths that span exports. An automounter on the client can obtain a snapshot of the server's name space using the EXPORTS procedure of the MOUNT protocol. If it understands the server's pathname syntax, it can create an image of the server's name space on the client. The parts of the name space that are not exported by the server are filled in with a "pseudo filesystem" that allows the user to browse from one mounted filesystem to another. There is a drawback to this representation of the server's name space on the client: it is static. If the server administrator adds a new export the client will be unaware of it.

7.3. Server Pseudo Filesystem

NFS version 4 servers avoid this name space inconsistency by presenting all the exports within the framework of a single server name space. An NFS version 4 client uses LOOKUP and READDIR operations to browse seamlessly from one export to another. Portions of the server name space that are not exported are bridged via a "pseudo filesystem" that provides a view of exported directories only. A pseudo filesystem has a unique fsid and behaves like a normal, read only filesystem.

Based on the construction of the server's name space, it is possible that multiple pseudo filesystems may exist. For example,

         /a              pseudo filesystem
         /a/b            real filesystem
         /a/b/c          pseudo filesystem
         /a/b/c/d        real filesystem

Each of the pseudo filesystems is considered a separate entity and therefore will have a unique fsid.

7.4. Multiple Roots

The DOS and Windows operating environments are sometimes described as having "multiple roots". Filesystems are commonly represented as disk letters. MacOS represents filesystems as top level names. NFS version 4 servers for these platforms can construct a pseudo file system above these root names so that disk letters or volume names are simply directory names in the pseudo root.

7.5. Filehandle Volatility

The nature of the server's pseudo filesystem is that it is a logical representation of filesystem(s) available from the server. Therefore, the pseudo filesystem is most likely constructed dynamically when the server is first instantiated. It is expected that the pseudo filesystem may not have an on disk counterpart from which persistent filehandles could be constructed. Even though it is preferable that the server provide persistent filehandles for the pseudo filesystem, the NFS client should expect that pseudo file system filehandles are volatile. This can be confirmed by checking the associated "fh_expire_type" attribute for those filehandles in question. If the filehandles are volatile, the NFS client must be prepared to recover a filehandle value (e.g., with a multi-component LOOKUP) when receiving an error of NFS4ERR_FHEXPIRED.

7.6. Exported Root

If the server's root filesystem is exported, one might conclude that a pseudo-filesystem is not needed. This would be wrong. Assume the following filesystems on a server:

         /       disk1  (exported)
         /a      disk2  (not exported)
         /a/b    disk3  (exported)

Because disk2 is not exported, disk3 cannot be reached with simple LOOKUPs. The server must bridge the gap with a pseudo-filesystem.
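
One way a server might work out which directories need pseudo-filesystem entries is sketched below. This is not a mechanism defined by the protocol; the mount table layout (mount point mapped to an "exported" flag) and the function names are assumptions made for illustration.

```python
def components(path):
    """Split a "/"-separated path into its non-empty components."""
    return [c for c in path.split("/") if c]


def containing_mount(path, mounts):
    """Return the longest mount point that is a prefix of path, or None."""
    parts = components(path)
    best = None
    for mount in mounts:
        mount_parts = components(mount)
        if parts[:len(mount_parts)] == mount_parts:
            if best is None or len(mount_parts) > len(components(best)):
                best = mount
    return best


def pseudo_directories(mounts):
    """List ancestor directories of exports that are not themselves exported.

    mounts maps each mount point to True (exported) or False (not exported).
    """
    pseudo = set()
    for mount, exported in mounts.items():
        if not exported:
            continue
        parts = components(mount)
        for depth in range(len(parts)):          # proper ancestors only
            ancestor = "/" + "/".join(parts[:depth])
            owner = containing_mount(ancestor, mounts)
            if owner is None or not mounts[owner]:
                pseudo.add(ancestor)
    return sorted(pseudo)


# The example above: only /a has to be bridged.
bridged = pseudo_directories({"/": True, "/a": False, "/a/b": True})
```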

7.7. Mount Point Crossing

The server filesystem environment may be constructed in such a way that one filesystem contains a directory which is 'covered' or mounted upon by a second filesystem. For example:

         /a/b            (filesystem 1)
         /a/b/c/d        (filesystem 2)

   The pseudo filesystem for this server may be constructed to look
   like:

         /               (place holder/not exported)
         /a/b            (filesystem 1)
         /a/b/c/d        (filesystem 2)

   It is the server's responsibility to present a complete pseudo
   filesystem to the client.  If the client sends a lookup request
   for the path "/a/b/c/d", the server's response is the filehandle of
   the filesystem "/a/b/c/d".  In previous versions of the NFS protocol,
   the server would respond with the filehandle of directory "/a/b/c/d"
   within the filesystem "/a/b".

   The NFS client will be able to determine if it crosses a server mount
   point by a change in the value of the "fsid" attribute.
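
As an illustration, a client could detect crossings by comparing the fsid of each directory with that of its parent while walking a path. The representation of fsid as a (major, minor) tuple and the function name mount_crossings are assumptions of this sketch.

```python
def mount_crossings(walk):
    """Return the paths at which the fsid differs from the previous step.

    walk is a list of (path, fsid) pairs in lookup order, where fsid is a
    (major, minor) tuple obtained with GETATTR at each step.
    """
    crossings = []
    for (_, previous_fsid), (path, fsid) in zip(walk, walk[1:]):
        if fsid != previous_fsid:
            crossings.append(path)
    return crossings
```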

7.8. Security Policy and Name Space Presentation

The application of the server's security policy needs to be carefully considered by the implementor. One may choose to limit the viewability of portions of the pseudo filesystem based on the server's perception of the client's ability to authenticate itself properly. However, with the support of multiple security mechanisms and the ability to negotiate the appropriate use of these mechanisms, the server is unable to properly determine if a client will be able to authenticate itself. If, based on its policies, the server chooses to limit the contents of the pseudo filesystem, the server may effectively hide filesystems from a client that may otherwise have legitimate access.

As suggested practice, the server should apply the security policy of a shared resource in the server's namespace to the components of the resource's ancestors. For example:

         /
         /a/b
         /a/b/c

The /a/b/c directory is a real filesystem and is the shared resource. The security policy for /a/b/c is Kerberos with integrity. The server should apply the same security policy to /, /a, and /a/b. This allows for the extension of the protection of the server's namespace to the ancestors of the real shared resource.

   For the case of the use of multiple, disjoint security mechanisms in
   the server's resources, the security for a particular object in the
   server's namespace should be the union of all security mechanisms of
   all direct descendants.
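
The union rule can be sketched as below. The flavor labels ("krb5i", "sys") and the function names are illustrative only. Taking the union over every shared resource below a directory gives the same result as applying the direct-descendant rule recursively.

```python
def is_ancestor_or_self(ancestor, path):
    """True if ancestor equals path or is one of its parent directories."""
    a = [c for c in ancestor.split("/") if c]
    p = [c for c in path.split("/") if c]
    return p[:len(a)] == a


def effective_security(path, resource_policies):
    """Union of the mechanisms of all shared resources at or below path.

    resource_policies maps the path of each real shared resource to the
    set of security mechanisms configured for it.
    """
    mechanisms = set()
    for resource, resource_mechanisms in resource_policies.items():
        if is_ancestor_or_self(path, resource):
            mechanisms |= resource_mechanisms
    return mechanisms
```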

8. File Locking and Share Reservations

Integrating locking into the NFS protocol necessarily causes it to be stateful. With the inclusion of share reservations the protocol becomes substantially more dependent on state than the traditional combination of NFS and NLM [XNFS]. There are three components to making this state manageable:

   o  Clear division between client and server

   o  Ability to reliably detect inconsistency in state between client
      and server

   o  Simple and robust recovery mechanisms

In this model, the server owns the state information. The client communicates its view of this state to the server as needed. The client is also able to detect inconsistent state before modifying a file.

To support Win32 share reservations it is necessary to atomically OPEN or CREATE files. Having a separate share/unshare operation would not allow correct implementation of the Win32 OpenFile API. In order to correctly implement share semantics, the previous NFS protocol mechanisms used when a file is opened or created (LOOKUP, CREATE, ACCESS) need to be replaced. The NFS version 4 protocol has an OPEN operation that subsumes the NFS version 3 methodology of LOOKUP, CREATE, and ACCESS. However, because many operations require a filehandle, the traditional LOOKUP is preserved to map a file name to filehandle without establishing state on the server. The policy of granting access or modifying files is managed by the server based on the client's state. These mechanisms can implement policy ranging from advisory only locking to full mandatory locking.

8.1. Locking

It is assumed that manipulating a lock is rare when compared to READ and WRITE operations. It is also assumed that crashes and network partitions are relatively rare. Therefore it is important that the READ and WRITE operations have a lightweight mechanism to indicate if they possess a held lock. A lock request contains the heavyweight information required to establish a lock and uniquely define the lock owner.
   The following sections describe the transition from the heavy weight
   information to the eventual stateid used for most client and server
   locking and lease interactions.

8.1.1. Client ID

For each LOCK request, the client must identify itself to the server. This is done in such a way as to allow for correct lock identification and crash recovery. A sequence of a SETCLIENTID operation followed by a SETCLIENTID_CONFIRM operation is required to establish the identification onto the server. Establishment of identification by a new incarnation of the client also has the effect of immediately breaking any leased state that a previous incarnation of the client might have had on the server, as opposed to forcing the new client incarnation to wait for the leases to expire. Breaking the lease state amounts to the server removing all lock, share reservation, and, where the server is not supporting the CLAIM_DELEGATE_PREV claim type, all delegation state associated with the same client with the same identity. For discussion of delegation state recovery, see the section "Delegation Recovery".

Client identification is encapsulated in the following structure:

         struct nfs_client_id4 {
                 verifier4     verifier;
                 opaque        id<NFS4_OPAQUE_LIMIT>;
         };

The first field, verifier, is a client incarnation verifier that is used to detect client reboots. Only if the verifier is different from that which the server has previously recorded for the client (as identified by the second field of the structure, id) does the server start the process of canceling the client's leased state.

The second field, id, is a variable length string that uniquely defines the client.

There are several considerations for how the client generates the id string:

   o  The string should be unique so that multiple clients do not
      present the same string.  The consequences of two clients
      presenting the same string range from one client getting an error
      to one client having its leased state abruptly and unexpectedly
      canceled.

   o  The string should be selected so the subsequent incarnations
      (e.g., reboots) of the same client cause the client to present the
      same string.  The implementor is cautioned against an approach
      that requires the string to be recorded in a local file because
      this precludes the use of the implementation in an environment
      where there is no local disk and all file access is from an NFS
      version 4 server.

   o  The string should be different for each server network address
      that the client accesses, rather than common to all server network
      addresses.  The reason is that it may not be possible for the
      client to tell if the same server is listening on multiple network
      addresses.  If the client issues SETCLIENTID with the same id
      string to each network address of such a server, the server will
      think it is the same client, and each successive SETCLIENTID will
      cause the server to begin the process of removing the client's
      previous leased state.

   o  The algorithm for generating the string should not assume that the
      client's network address won't change.  This includes changes
      between client incarnations and even changes while the client is
      still running in its current incarnation.  This means that if
      the client includes just the client's and server's network address
      in the id string, there is a real risk, after the client gives up
      the network address, that another client, using a similar
      algorithm for generating the id string, will generate a
      conflicting id string.

   Given the above considerations, an example of a well generated id
   string is one that includes:

   o  The server's network address.

   o  The client's network address.

   o  For a user level NFS version 4 client, it should contain
      additional information to distinguish the client from other user
      level clients running on the same host, such as a process id or
      other unique sequence.

   o  Additional information that tends to be unique, such as one or
      more of:

      -  The client machine's serial number (for privacy reasons, it is
         best to perform some one way function on the serial number).

      -  A MAC address.
      -  The timestamp of when the NFS version 4 software was first
         installed on the client (though this is subject to the
         previously mentioned caution about using information that is
         stored in a file, because the file might only be accessible
         over NFS version 4).

      -  A true random number.  However since this number ought to be
         the same between client incarnations, this shares the same
         problem as that of using the timestamp of the software
         installation.
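
   A minimal sketch of an id string built along these lines is shown
   below.  The field order, the "/" separator, the use of SHA-256 as
   the one way function, and the name make_client_id are all
   assumptions of the example; the protocol does not prescribe any of
   them.

```python
import hashlib

# Upper bound on the id field, from the NFS4_OPAQUE_LIMIT constant.
NFS4_OPAQUE_LIMIT = 1024


def make_client_id(server_addr, client_addr, serial_number, instance=""):
    """Build an id string from the ingredients listed above.

    server_addr   -- network address of the server being contacted
    client_addr   -- network address of the client
    serial_number -- machine serial number; only a one way hash is used
    instance      -- optional extra discriminator for user level clients
    """
    hashed_serial = hashlib.sha256(serial_number.encode("utf-8")).hexdigest()
    parts = [server_addr, client_addr, hashed_serial]
    if instance:
        parts.append(instance)
    ident = "/".join(parts).encode("utf-8")
    if len(ident) > NFS4_OPAQUE_LIMIT:
        raise ValueError("id string exceeds NFS4_OPAQUE_LIMIT")
    return ident
```

   Because every input is stable across reboots, the same string is
   produced by each incarnation of the client, while a different server
   address yields a different string.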

   As a security measure, the server MUST NOT cancel a client's leased
   state if the principal that established the state for a given id
   string is not the same as the principal issuing the SETCLIENTID.

   Note that SETCLIENTID and SETCLIENTID_CONFIRM have a secondary
   purpose of establishing the information the server needs to make
   callbacks to the client for the purpose of supporting delegations.
   It is permitted to change this information via SETCLIENTID and
   SETCLIENTID_CONFIRM within the same incarnation of the client without
   removing the client's leased state.

   Once a SETCLIENTID and SETCLIENTID_CONFIRM sequence has successfully
   completed, the client uses the shorthand client identifier, of type
   clientid4, instead of the longer and less compact nfs_client_id4
   structure.  This shorthand client identifier (a clientid) is assigned
   by the server and should be chosen so that it will not conflict with
   a clientid previously assigned by the server.  This applies across
   server restarts or reboots.  When a clientid is presented to a server
   and that clientid is not recognized, as would happen after a server
   reboot, the server will reject the request with the error
   NFS4ERR_STALE_CLIENTID.  When this happens, the client must obtain a
   new clientid by use of the SETCLIENTID operation and then proceed to
   any other necessary recovery for the server reboot case (See the
   section "Server Failure and Recovery").

   The client must also employ the SETCLIENTID operation when it
   receives a NFS4ERR_STALE_STATEID error using a stateid derived from
   its current clientid, since this also indicates a server reboot which
   has invalidated the existing clientid (see the next section
   "lock_owner and stateid Definition" for details).

   See the detailed descriptions of SETCLIENTID and SETCLIENTID_CONFIRM
   for a complete specification of the operations.

8.1.2. Server Release of Clientid

If the server determines that the client holds no associated state for its clientid, the server may choose to release the clientid. The server may make this choice for an inactive client so that resources are not consumed by those intermittently active clients. If the client contacts the server after this release, the server must ensure the client receives the appropriate error so that it will use the SETCLIENTID/SETCLIENTID_CONFIRM sequence to establish a new identity. It should be clear that the server must be very hesitant to release a clientid since the resulting work on the client to recover from such an event will be the same burden as if the server had failed and restarted. Typically a server would not release a clientid unless there had been no activity from that client for many minutes. Note that if the id string in a SETCLIENTID request is properly constructed, and if the client takes care to use the same principal for each successive use of SETCLIENTID, then, barring an active denial of service attack, NFS4ERR_CLID_INUSE should never be returned. However, client bugs, server bugs, or perhaps a deliberate change of the principal owner of the id string (such as the case of a client that changes security flavors, and under the new flavor, there is no mapping to the previous owner) will in rare cases result in NFS4ERR_CLID_INUSE. In that event, when the server gets a SETCLIENTID for a client id that currently has no state, or it has state, but the lease has expired, rather than returning NFS4ERR_CLID_INUSE, the server MUST allow the SETCLIENTID, and confirm the new clientid if followed by the appropriate SETCLIENTID_CONFIRM.
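
The server's choices when a SETCLIENTID arrives for a known id string can be summarized in a rough sketch. The ClientRecord type, its fields, and the returned action strings are hypothetical. The complete rules are given in the detailed description of SETCLIENTID.

```python
from dataclasses import dataclass


@dataclass
class ClientRecord:
    """What the server remembers for one id string (hypothetical layout)."""
    verifier: bytes
    principal: str
    has_state: bool
    lease_expired: bool


def setclientid_action(record, verifier, principal):
    """Decide how to treat a SETCLIENTID for an id string.

    record is the existing ClientRecord for the id string, or None.
    """
    if record is None:
        return "establish new clientid"
    if record.principal != principal:
        # A different principal may take over the id string only when
        # there is no state or the lease has expired.
        if not record.has_state or record.lease_expired:
            return "establish new clientid"
        return "NFS4ERR_CLID_INUSE"
    if record.verifier != verifier:
        # Same principal, new verifier: the client has rebooted.
        return "cancel leased state"
    return "update callback information"
```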

8.1.3. lock_owner and stateid Definition

When requesting a lock, the client must present to the server the clientid and an identifier for the owner of the requested lock. These two fields are referred to as the lock_owner and the definitions of those fields are:

   o  A clientid returned by the server as part of the client's use of
      the SETCLIENTID operation.

   o  A variable length opaque array used to uniquely define the owner
      of a lock managed by the client.

      This may be a thread id, process id, or other unique value.

   When the server grants the lock, it responds with a unique stateid.
   The stateid is used as a shorthand reference to the lock_owner, since
   the server will be maintaining the correspondence between them.

   The server is free to form the stateid in any manner that it chooses
   as long as it is able to recognize invalid and out-of-date stateids.
   This requirement includes those stateids generated by earlier
   instances of the server.  From this, the client can be properly
   notified of a server restart.  This notification will occur when the
   client presents a stateid to the server from a previous
   instantiation.

   The server must be able to distinguish the following situations and
   return the error as specified:

   o  The stateid was generated by an earlier server instance (i.e.,
      before a server reboot).  The error NFS4ERR_STALE_STATEID should
      be returned.

   o  The stateid was generated by the current server instance but the
      stateid no longer designates the current locking state for the
      lockowner-file pair in question (i.e., one or more locking
      operations have occurred).  The error NFS4ERR_OLD_STATEID should
      be returned.

      This error condition will only occur when the client issues a
      locking request which changes a stateid while an I/O request that
      uses that stateid is outstanding.

   o  The stateid was generated by the current server instance but the
      stateid does not designate a locking state for any active
      lockowner-file pair.  The error NFS4ERR_BAD_STATEID should be
      returned.

      This error condition will occur when there has been a logic error
      on the part of the client or server.  This should not happen.

   One mechanism that may be used to satisfy these requirements is for
   the server to,

   o  divide the "other" field of each stateid into two fields:

      -  A server verifier which uniquely designates a particular server
         instantiation.

      -  An index into a table of locking-state structures.
   o  utilize the "seqid" field of each stateid, such that seqid is
      monotonically incremented for each stateid that is associated with
      the same index into the locking-state table.

   By matching the incoming stateid and its field values with the state
   held at the server, the server is able to easily determine if a
   stateid is valid for its current instantiation and state.  If the
   stateid is not valid, the appropriate error can be supplied to the
   client.
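
   The mechanism outlined above might be sketched as follows.  The
   Stateid and StateTable types are hypothetical.  The sketch also
   ignores seqid wraparound, and it treats a seqid newer than the
   server's current value as NFS4ERR_BAD_STATEID, which is an
   assumption of the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stateid:
    seqid: int      # incremented for each change to the locking state
    verifier: int   # server instance that generated the stateid
    index: int      # slot in the locking-state table


class StateTable:
    def __init__(self, verifier):
        self.verifier = verifier      # identifies this server instance
        self.current_seqid = {}       # table index -> current seqid

    def classify(self, stateid):
        """Return NFS4_OK or the error for an incoming stateid."""
        if stateid.verifier != self.verifier:
            return "NFS4ERR_STALE_STATEID"   # from an earlier instance
        current = self.current_seqid.get(stateid.index)
        if current is None:
            return "NFS4ERR_BAD_STATEID"     # no active state at that slot
        if stateid.seqid < current:
            return "NFS4ERR_OLD_STATEID"     # superseded by a later change
        if stateid.seqid > current:
            return "NFS4ERR_BAD_STATEID"     # never issued by this server
        return "NFS4_OK"
```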

8.1.4. Use of the stateid and Locking

All READ, WRITE and SETATTR operations contain a stateid. For the purposes of this section, SETATTR operations which change the size attribute of a file are treated as if they are writing the area between the old and new size (i.e., the range truncated or added to the file by means of the SETATTR), even where SETATTR is not explicitly mentioned in the text. If the lock_owner performs a READ or WRITE in a situation in which it has established a lock or share reservation on the server (any OPEN constitutes a share reservation) the stateid (previously returned by the server) must be used to indicate what locks, including both record locks and share reservations, are held by the lockowner. If no state is established by the client, either record lock or share reservation, a stateid of all bits 0 is used. Regardless whether a stateid of all bits 0, or a stateid returned by the server is used, if there is a conflicting share reservation or mandatory record lock held on the file, the server MUST refuse to service the READ or WRITE operation. Share reservations are established by OPEN operations and by their nature are mandatory in that when the OPEN denies READ or WRITE operations, that denial results in such operations being rejected with error NFS4ERR_LOCKED. Record locks may be implemented by the server as either mandatory or advisory, or the choice of mandatory or advisory behavior may be determined by the server on the basis of the file being accessed (for example, some UNIX-based servers support a "mandatory lock bit" on the mode attribute such that if set, record locks are required on the file before I/O is possible). When record locks are advisory, they only prevent the granting of conflicting lock requests and have no effect on READs or WRITEs. Mandatory record locks, however, prevent conflicting I/O operations. When they are attempted, they are rejected with NFS4ERR_LOCKED. 
When the client gets NFS4ERR_LOCKED on a file it knows it has the proper share reservation for, it will need to issue a LOCK request on the region
   of the file that includes the region the I/O was to be performed on,
   with an appropriate locktype (i.e., READ*_LT for a READ operation,
   WRITE*_LT for a WRITE operation).

   With NFS version 3, there was no notion of a stateid so there was no
   way to tell if the application process of the client sending the READ
   or WRITE operation had also acquired the appropriate record lock on
   the file.  Thus there was no way to implement mandatory locking.
   With the stateid construct, this barrier has been removed.

   Note that for UNIX environments that support mandatory file locking,
   the distinction between advisory and mandatory locking is subtle.  In
   fact, advisory and mandatory record locks are exactly the same in so
   far as the APIs and requirements on implementation.  If the mandatory
   lock attribute is set on the file, the server checks to see if the
   lockowner has an appropriate shared (read) or exclusive (write)
   record lock on the region it wishes to read or write to.  If there is
   no appropriate lock, the server checks if there is a conflicting lock
   (which can be done by attempting to acquire the conflicting lock on
   the behalf of the lockowner, and if successful, release the lock
   after the READ or WRITE is done), and if there is, the server returns
   NFS4ERR_LOCKED.

   For Windows environments, there are no advisory record locks, so the
   server always checks for record locks during I/O requests.

   Thus, the NFS version 4 LOCK operation does not need to distinguish
   between advisory and mandatory record locks.  It is the NFS version 4
   server's processing of the READ and WRITE operations that introduces
   the distinction.
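
   The server-side check made during READ and WRITE processing can be
   sketched as below.  The RecordLock type, the owner strings, and the
   function names are hypothetical, and share reservations are left out
   of the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecordLock:
    owner: str        # stands in for the lock_owner
    offset: int
    length: int
    exclusive: bool   # True for a write lock, False for a read lock


def overlaps(lock, offset, length):
    """True if the byte range of lock intersects [offset, offset+length)."""
    return lock.offset < offset + length and offset < lock.offset + lock.length


def check_io(op, owner, offset, length, locks, mandatory):
    """Return NFS4_OK or NFS4ERR_LOCKED for a READ or WRITE.

    With advisory locking the held locks have no effect on I/O.  With
    mandatory locking, a WRITE conflicts with any overlapping lock held
    by another owner, and a READ conflicts with an overlapping exclusive
    lock held by another owner.
    """
    if not mandatory:
        return "NFS4_OK"
    for lock in locks:
        if lock.owner == owner or not overlaps(lock, offset, length):
            continue
        if op == "WRITE" or lock.exclusive:
            return "NFS4ERR_LOCKED"
    return "NFS4_OK"
```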

   Every stateid other than the special stateid values noted in this
   section, whether returned by an OPEN-type operation (i.e., OPEN,
   OPEN_DOWNGRADE), or by a LOCK-type operation (i.e., LOCK or LOCKU),
   defines an access mode for the file (i.e., READ, WRITE, or READ-
   WRITE) as established by the original OPEN which began the stateid
   sequence, and as modified by subsequent OPENs and OPEN_DOWNGRADEs
   within that stateid sequence.  When a READ, WRITE, or SETATTR which
   specifies the size attribute, is done, the operation is subject to
   checking against the access mode to verify that the operation is
   appropriate given the OPEN with which the operation is associated.

   In the case of WRITE-type operations (i.e., WRITEs and SETATTRs which
   set size), the server must verify that the access mode allows writing
   and return an NFS4ERR_OPENMODE error if it does not.  In the case of
   READ, the server may perform the corresponding check on the access
   mode, or it may choose to allow READ on opens for WRITE only, to
   accommodate clients whose write implementation may unavoidably do
   reads (e.g., due to buffer cache constraints).  However, even if
   READs are allowed in these circumstances, the server MUST still check
   for locks that conflict with the READ (e.g., another open specifying
   denial of READs).  Note that a server which does enforce the access
   mode check on READs need not explicitly check for conflicting share
   reservations since the existence of OPEN for read access guarantees
   that no conflicting share reservation can exist.

   A stateid of all bits 1 (one) MAY allow READ operations to bypass
   locking checks at the server.  However, WRITE operations with a
   stateid with bits all 1 (one) MUST NOT bypass locking checks and are
   treated exactly the same as if a stateid of all bits 0 were used.
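
   The treatment of the two special stateids can be illustrated with a
   short sketch.  It assumes the stateid is handled as its 16-byte
   encoded form (a 4-byte seqid followed by a 12-byte "other" field).
   The constant and function names are hypothetical.

```python
ALL_ZEROS_STATEID = bytes(16)        # all bits 0: no state established
ALL_ONES_STATEID = b"\xff" * 16      # all bits 1: READ may bypass checks


def bypasses_lock_checks(stateid, op, server_allows_bypass=True):
    """True if the operation may skip locking checks at the server.

    Only a READ carrying the all-ones stateid can qualify, and only if
    the server chooses to allow it (the bypass is a MAY).  A WRITE with
    the all-ones stateid is handled as if all bits were 0.
    """
    return (op == "READ"
            and stateid == ALL_ONES_STATEID
            and server_allows_bypass)
```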

   A lock may not be granted while a READ or WRITE operation using one
   of the special stateids is being performed and the range of the lock
   request conflicts with the range of the READ or WRITE operation.  For
   the purposes of this paragraph, a conflict occurs when a shared lock
   is requested and a WRITE operation is being performed, or an
   exclusive lock is requested and either a READ or a WRITE operation is
   being performed.  A SETATTR that sets size is treated similarly to a
   WRITE as discussed above.

8.1.5. Sequencing of Lock Requests

Locking is different than most NFS operations as it requires "at-most-one" semantics that are not provided by ONCRPC. ONCRPC over a reliable transport is not sufficient because a sequence of locking requests may span multiple TCP connections. In the face of retransmission or reordering, lock or unlock requests must have a well defined and consistent behavior. To accomplish this, each lock request contains a sequence number that is a consecutively increasing integer. Different lock_owners have different sequences. The server maintains the last sequence number (L) received and the response that was returned. The first request issued for any given lock_owner is issued with a sequence number of zero.

Note that for requests that contain a sequence number, for each lock_owner, there should be no more than one outstanding request.

If a request (r) with a previous sequence number (r < L) is received, it is rejected with the return of error NFS4ERR_BAD_SEQID. Given a properly-functioning client, the response to (r) must have been received before the last request (L) was sent. If a duplicate of last request (r == L) is received, the stored response is returned. If a request beyond the next sequence (r == L + 2) is received, it is rejected with the return of error NFS4ERR_BAD_SEQID. Sequence history is reinitialized whenever the SETCLIENTID/SETCLIENTID_CONFIRM sequence changes the client verifier.

   Since the sequence number is represented with an unsigned 32-bit
   integer, the arithmetic involved with the sequence number is mod
   2^32.  For an example of modulo arithmetic involving sequence numbers
   see [RFC793].
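
   The server's handling of sequence numbers for one lock_owner could
   be sketched as follows.  The class name LockOwnerSequence and the
   execute callback are hypothetical, and the sketch assumes that the
   first request for a lock_owner carries sequence number zero, as
   stated above.

```python
SEQID_MODULUS = 2 ** 32   # sequence numbers are unsigned 32-bit integers


class LockOwnerSequence:
    """Last sequence number (L) and cached response for one lock_owner."""

    def __init__(self):
        self.last_seqid = None
        self.last_response = None

    def handle(self, seqid, execute):
        """Process request number seqid; execute() performs the operation."""
        if self.last_seqid is not None and seqid == self.last_seqid:
            return self.last_response            # replay: r == L
        if self.last_seqid is None:
            expected = 0
        else:
            expected = (self.last_seqid + 1) % SEQID_MODULUS
        if seqid != expected:
            return "NFS4ERR_BAD_SEQID"           # r < L or r beyond L + 1
        self.last_response = execute()
        self.last_seqid = seqid
        return self.last_response
```

   A retransmitted request is answered from the cached response without
   being executed a second time, which is what gives the at-most-once
   behavior.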

   It is critical the server maintain the last response sent to the
   client to provide a more reliable cache of duplicate non-idempotent
   requests than that of the traditional cache described in [Juszczak].
   The traditional duplicate request cache uses a least recently used
   algorithm for removing unneeded requests.  However, the last lock
   request and response on a given lock_owner must be cached as long as
   the lock state exists on the server.

   The client MUST monotonically increment the sequence number for the
   CLOSE, LOCK, LOCKU, OPEN, OPEN_CONFIRM, and OPEN_DOWNGRADE
   operations.  This is true even in the event that the previous
   operation that used the sequence number received an error.  The only
   exception to this rule is if the previous operation received one of
   the following errors: NFS4ERR_STALE_CLIENTID, NFS4ERR_STALE_STATEID,
   NFS4ERR_BAD_STATEID, NFS4ERR_BAD_SEQID, NFS4ERR_BADXDR,
   NFS4ERR_RESOURCE, NFS4ERR_NOFILEHANDLE.
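
   On the client side, the rule above amounts to the following sketch.
   The function name next_seqid is hypothetical and the status values
   are represented as strings for simplicity.

```python
# Errors after which the client does not advance the sequence number.
NO_INCREMENT_ERRORS = frozenset({
    "NFS4ERR_STALE_CLIENTID",
    "NFS4ERR_STALE_STATEID",
    "NFS4ERR_BAD_STATEID",
    "NFS4ERR_BAD_SEQID",
    "NFS4ERR_BADXDR",
    "NFS4ERR_RESOURCE",
    "NFS4ERR_NOFILEHANDLE",
})


def next_seqid(seqid, previous_status):
    """Sequence number for the next operation of the same lock_owner."""
    if previous_status in NO_INCREMENT_ERRORS:
        return seqid
    return (seqid + 1) % 2 ** 32
```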

8.1.6. Recovery from Replayed Requests

As described above, the sequence number is per lock_owner. As long as the server maintains the last sequence number received and follows the methods described above, there are no risks of a Byzantine router re-sending old requests. The server need only maintain the (lock_owner, sequence number) state as long as there are open files or closed files with locks outstanding. LOCK, LOCKU, OPEN, OPEN_DOWNGRADE, and CLOSE each contain a sequence number and therefore the risk of the replay of these operations resulting in undesired effects is non-existent while the server maintains the lock_owner state.

8.1.7. Releasing lock_owner State

When a particular lock_owner no longer holds open or file locking state at the server, the server may choose to release the sequence number state associated with the lock_owner. The server may make this choice based on lease expiration, for the reclamation of server memory, or other implementation specific details. In any event, the server is able to do this safely only when the lock_owner no longer is being utilized by the client. The server may choose to hold the lock_owner state in the event that retransmitted requests are received. However, the period to hold this state is implementation specific.
   In the case that a LOCK, LOCKU, OPEN_DOWNGRADE, or CLOSE is
   retransmitted after the server has previously released the lock_owner
   state, the server will find that the lock_owner has no files open and
   an error will be returned to the client.  If the lock_owner does have
   a file open, the stateid will not match and again an error is
   returned to the client.
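
   The two outcomes described above can be illustrated with a small
   sketch.  The LockOwnerTable class and its method names are
   hypothetical, and the returned strings are placeholders: the text
   does not name the specific error codes, so none are assumed here.

```python
class LockOwnerTable:
    """Hypothetical server table mapping lock_owner -> open stateids."""

    def __init__(self):
        self._open_stateids = {}

    def record_open(self, owner, stateid):
        self._open_stateids.setdefault(owner, set()).add(stateid)

    def record_close(self, owner, stateid):
        self._open_stateids.get(owner, set()).discard(stateid)

    def release_idle_owners(self):
        """Drop state for lock_owners that hold no open files."""
        idle = [o for o, ids in self._open_stateids.items() if not ids]
        for owner in idle:
            del self._open_stateids[owner]
        return idle

    def check_retransmission(self, owner, stateid):
        """Classify a retransmitted LOCK, LOCKU, OPEN_DOWNGRADE, or CLOSE."""
        stateids = self._open_stateids.get(owner)
        if not stateids:
            # State was released, or the owner has nothing open.
            return "error: lock_owner has no files open"
        if stateid not in stateids:
            # Owner has since opened a file, but with a different stateid.
            return "error: stateid does not match"
        return "ok"
```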

8.1.8. Use of Open Confirmation

In the case that an OPEN is retransmitted and the lock_owner is being used for the first time or the lock_owner state has been previously released by the server, the use of the OPEN_CONFIRM operation will prevent incorrect behavior. When the server observes the use of the lock_owner for the first time, it will direct the client to perform the OPEN_CONFIRM for the corresponding OPEN. This sequence establishes the use of a lock_owner and associated sequence number. Since the OPEN_CONFIRM sequence connects a new open_owner on the server with an existing open_owner on a client, the sequence number may have any value. The OPEN_CONFIRM step assures the server that the value received is the correct one. See the section "OPEN_CONFIRM - Confirm Open" for further details.

There are a number of situations in which the requirement to confirm an OPEN would pose difficulties for the client and server, in that they would be prevented from acting in a timely fashion on information received, because that information would be provisional, subject to deletion upon non-confirmation. Fortunately, these are situations in which the server can avoid the need for confirmation when responding to open requests. The two constraints are:

   o  The server must not bestow a delegation for any open which would require confirmation.

   o  The server MUST NOT require confirmation on a reclaim-type open (i.e., one specifying claim type CLAIM_PREVIOUS or CLAIM_DELEGATE_PREV).

These constraints are related in that reclaim-type opens are the only ones in which the server may be required to send a delegation. For CLAIM_NULL, sending the delegation is optional while for CLAIM_DELEGATE_CUR, no delegation is sent.

Delegations being sent with an open requiring confirmation are troublesome because recovering from non-confirmation adds undue complexity to the protocol, while requiring confirmation on reclaim-type opens poses difficulties in that the inability to resolve the status of the reclaim until lease expiration may make it difficult to have timely determination of the set of locks being reclaimed (since the grace period may expire).

   Requiring open confirmation on reclaim-type opens is avoidable
   because of the nature of the environments in which such opens are
   done.  For CLAIM_PREVIOUS opens, this is immediately after server
   reboot, so there should be no time for lockowners to be created,
   found to be unused, and recycled.  For CLAIM_DELEGATE_PREV opens, we
   are dealing with a client reboot situation.  A server which supports
   delegation can be sure that no lockowners for that client have been
   recycled since client initialization and thus can ensure that
   confirmation will not be required.
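
   The two constraints on open confirmation can be summarized in a
   short sketch.  The function names and the owner_known flag (true
   when the server already holds sequence state for the lock_owner) are
   assumptions for illustration; whether a delegation is actually
   granted remains a server policy decision beyond this check.

```python
# Claim types for which the server MUST NOT require confirmation.
RECLAIM_CLAIMS = frozenset({"CLAIM_PREVIOUS", "CLAIM_DELEGATE_PREV"})


def needs_confirmation(claim_type, owner_known):
    """Return True if the server should ask for OPEN_CONFIRM."""
    if claim_type in RECLAIM_CLAIMS:
        return False
    # A lock_owner seen for the first time (or whose state was
    # released) must be confirmed.
    return not owner_known


def may_grant_delegation(claim_type, owner_known):
    """A delegation must not accompany an open that needs confirmation."""
    return not needs_confirmation(claim_type, owner_known)
```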


