9.4. Open Delegation
When a file is being OPENed, the server may delegate further handling
of opens and closes for that file to the opening client. Any such
delegation is recallable, since the circumstances that allowed for
the delegation are subject to change. In particular, if the server
receives a conflicting OPEN from another client, the server must
recall the delegation before deciding whether the OPEN from the other
client may be granted. Making a delegation is up to the server, and
clients should not assume that any particular OPEN either will or
will not result in an open delegation. The following is a typical
set of conditions that servers might use in deciding whether OPEN
should be delegated:
o The client must be able to respond to the server's callback
requests. The server will use the CB_NULL procedure for a test of
callback ability.
o The client must have responded properly to previous recalls.
o There must be no current open conflicting with the requested
delegation.
o There should be no current delegation that conflicts with the
delegation being requested.
o The probability of future conflicting open requests should be low
based on the recent history of the file.
o There must be no server-specific semantics of OPEN/CLOSE that
would make the required handling incompatible with the prescribed
handling that the delegated client would apply (see below).
There are two types of open delegations, read and write. A read open
delegation allows a client to handle, on its own, requests to open a
file for reading that do not deny read access to others. Multiple
read open delegations may be outstanding simultaneously and do not
conflict. A write open delegation allows the client to handle, on
its own, all opens. Only one write open delegation may exist for a
given file at a given time and it is inconsistent with any read open
delegations.
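
The compatibility rules for the two delegation types can be sketched as a small server-side check. This is a minimal illustration only; the names `DelegType` and `may_grant` are hypothetical and do not come from the protocol, and `outstanding` is assumed to list the delegations currently held by other clients for the same file.

```python
from enum import Enum


class DelegType(Enum):
    READ = "read"
    WRITE = "write"


def may_grant(requested, outstanding):
    """Return True if `requested` conflicts with nothing in `outstanding`.

    `outstanding` is an iterable of DelegType values held by other clients.
    """
    outstanding = list(outstanding)
    if DelegType.WRITE in outstanding:
        # Only one write delegation may exist, and it excludes read ones.
        return False
    if requested is DelegType.WRITE:
        # A write delegation is inconsistent with any read delegation.
        return len(outstanding) == 0
    # Multiple read delegations may be outstanding simultaneously.
    return True
```

A real server would combine this check with the other conditions listed earlier (callback reachability, recall history, conflicting opens) before deciding to delegate.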
When a client has a read open delegation, it may not make any changes
to the contents or attributes of the file but it is assured that no
other client may do so. When a client has a write open delegation,
it may modify the file data since no other client will be accessing
the file's data. The client holding a write delegation may only
affect file attributes which are intimately connected with the file
data: size, time_modify, change.
When a client has an open delegation, it does not send OPENs or
CLOSEs to the server but updates the appropriate status internally.
For a read open delegation, opens that cannot be handled locally
(opens for write or that deny read access) must be sent to the
server.
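
As a sketch of the rule above, a client might decide whether an open can be handled locally as follows. The function name `handled_locally` and the use of plain strings for the delegation type are assumptions made for illustration; the constants mirror the share access and deny bit values of the protocol.

```python
# Share reservation bits (values as defined by the protocol).
OPEN4_SHARE_ACCESS_READ = 0x1
OPEN4_SHARE_ACCESS_WRITE = 0x2
OPEN4_SHARE_DENY_READ = 0x1
OPEN4_SHARE_DENY_WRITE = 0x2


def handled_locally(deleg_type, access, deny):
    """Return True if the open need not be sent to the server.

    deleg_type is "read", "write", or None (no delegation held).
    """
    if deleg_type == "write":
        # A write open delegation covers all opens.
        return True
    if deleg_type == "read":
        wants_write = bool(access & OPEN4_SHARE_ACCESS_WRITE)
        denies_read = bool(deny & OPEN4_SHARE_DENY_READ)
        # Opens for write, or opens that deny read access, go to the server.
        return not wants_write and not denies_read
    return False
```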
When an open delegation is made, the response to the OPEN contains an
open delegation structure which specifies the following:
o the type of delegation (read or write)
o space limitation information to control flushing of data on close
(write open delegation only, see the section "Open Delegation and
Data Caching")
o an nfsace4 specifying read and write permissions
o a stateid to represent the delegation for READ and WRITE
The delegation stateid is separate and distinct from the stateid for
the OPEN proper. The standard stateid, unlike the delegation
stateid, is associated with a particular lock_owner and will continue
to be valid after the delegation is recalled and the file remains
open.
When a request internal to the client is made to open a file and open
delegation is in effect, it will be accepted or rejected solely on
the basis of the following conditions. Any requirement for other
checks to be made by the delegate should result in open delegation
being denied so that the checks can be made by the server itself.
o The access and deny bits for the request and the file as described
in the section "Share Reservations".
o The read and write permissions as determined below.
The nfsace4 passed with delegation can be used to avoid frequent
ACCESS calls. The permission check should be as follows:
o If the nfsace4 indicates that the open may be done, then it should
be granted without reference to the server.
o If the nfsace4 indicates that the open may not be done, then an
ACCESS request must be sent to the server to obtain the definitive
answer.
The server may return an nfsace4 that is more restrictive than the
actual ACL of the file. This includes an nfsace4 that specifies
denial of all access. Note that some common practices such as
mapping the traditional user "root" to the user "nobody" may make it
incorrect to return the actual ACL of the file in the delegation
response.
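
The permission check described above might look like the following sketch. It is deliberately simplified: it assumes the nfsace4 has already been matched against the requesting user, and it models the ACCESS round trip as a caller-supplied function `send_access`, which is a hypothetical stand-in rather than a real API. The constant values follow the protocol's ACE definitions.

```python
ACE4_ACCESS_ALLOWED_ACE_TYPE = 0x0
ACE4_READ_DATA = 0x1
ACE4_WRITE_DATA = 0x2


def open_permitted(ace_type, access_mask, wanted_mask, send_access):
    """Decide whether a delegated open is permitted.

    send_access(wanted_mask) -> bool represents an ACCESS request to the
    server and is only invoked when the delegated ACE does not allow the
    open.
    """
    allowed_locally = (
        ace_type == ACE4_ACCESS_ALLOWED_ACE_TYPE
        and (access_mask & wanted_mask) == wanted_mask
    )
    if allowed_locally:
        # Granted without reference to the server.
        return True
    # The ACE may be more restrictive than the real ACL, so a negative
    # local result is not final; the server gives the definitive answer.
    return send_access(wanted_mask)
```

Note the asymmetry: a local "yes" is final, while a local "no" only means the server must be asked.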
The use of delegation together with various other forms of caching
creates the possibility that no server authentication will ever be
performed for a given user since all of the user's requests might be
satisfied locally. Where the client is depending on the server for
authentication, the client should be sure authentication occurs for
each user by use of the ACCESS operation. This should be the case
even if an ACCESS operation would not be required otherwise. As
mentioned before, the server may enforce frequent authentication by
returning an nfsace4 denying all access with every open delegation.
9.4.1. Open Delegation and Data Caching
OPEN delegation allows much of the message overhead associated with
the opening and closing of files to be eliminated. An open when an open
delegation is in effect does not require that a validation message be
sent to the server. The continued endurance of the "read open
delegation" provides a guarantee that no OPEN for write and thus no
write has occurred. Similarly, when closing a file opened for write,
if a write open delegation is in effect, the data written does not
have to be flushed to the server until the open delegation is
recalled. The continued endurance of the open delegation provides a
guarantee that no open and thus no read or write has been done by
another client.

For the purposes of open delegation, READs and WRITEs done without an
OPEN are treated as the functional equivalents of a corresponding
type of OPEN. This refers to the READs and WRITEs that use the
special stateids consisting of all zero bits or all one bits.
Therefore, READs or WRITEs with a special stateid done by another
client will force the server to recall a write open delegation. A
WRITE with a special stateid done by another client will force a
recall of read open delegations.

With delegations, a client is able to avoid writing data to the
server when the CLOSE of a file is serviced. The file close system
call is the usual point at which the client is notified of a lack of
stable storage for the modified file data generated by the
application. At the close, file data is written to the server and
through normal accounting the server is able to determine if the
available filesystem space for the data has been exceeded (i.e.,
server returns NFS4ERR_NOSPC or NFS4ERR_DQUOT). This accounting
includes quotas. The introduction of delegations requires that an
alternative method be in place for the same type of communication to
occur between client and server.

In the delegation response, the server provides either the limit of
the size of the file or the number of modified blocks and associated
block size. The server must ensure that the client will be able to
flush data to the server of a size equal to that provided in the
original delegation. The server must make this assurance for all
outstanding delegations. Therefore, the server must be careful in
its management of available space for new or modified data taking
into account available filesystem space and any applicable quotas.
The server can recall delegations as a result of managing the
available filesystem space.
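
A client-side check against the space limit returned with a write delegation could be sketched as below. The `SpaceLimit` class and `may_defer_flush` function are hypothetical names; the sketch assumes that exactly one of the two limit forms (file size, or block count with block size) is present and that the client tracks the number of distinct modified blocks itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpaceLimit:
    filesize: Optional[int] = None         # limit on the file size, in bytes
    num_blocks: Optional[int] = None       # limit on modified blocks
    bytes_per_block: Optional[int] = None  # block size for num_blocks


def may_defer_flush(limit, file_size, modified_blocks):
    """Return True if modified data may stay cached past a close."""
    if limit.filesize is not None:
        return file_size <= limit.filesize
    if limit.num_blocks is not None:
        return modified_blocks <= limit.num_blocks
    raise ValueError("space limit carries neither form of limit")
```

When the function returns False, the cautious choice for the client is to flush modified data to the server at close, as it would without a delegation.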
The client should abide by the server's state space limits for
delegations. If the client exceeds the stated limits for the
delegation, the server's behavior is undefined.

Based on server conditions, quotas or available filesystem space, the
server may grant write open delegations with very restrictive space
limitations. The limitations may be defined in a way that will
always force modified data to be flushed to the server on close.

With respect to authentication, flushing modified data to the server
after a CLOSE has occurred may be problematic. For example, the user
of the application may have logged off the client and unexpired
authentication credentials may not be present. In this case, the
client may need to take special care to ensure that local unexpired
credentials will in fact be available. This may be accomplished by
tracking the expiration time of credentials and flushing data well in
advance of their expiration or by making private copies of
credentials to assure their availability when needed.

9.4.2. Open Delegation and File Locks
When a client holds a write open delegation, lock operations may be
performed locally. This includes those required for mandatory file
locking. This can be done since the delegation implies that there
can be no conflicting locks. Similarly, all of the revalidations
that would normally be associated with obtaining locks and the
flushing of data associated with the releasing of locks need not be
done.

When a client holds a read open delegation, lock operations are not
performed locally. All lock operations, including those requesting
non-exclusive locks, are sent to the server for resolution.

9.4.3. Handling of CB_GETATTR
The server needs to employ special handling for a GETATTR where the
target is a file that has a write open delegation in effect. The
reason for this is that the client holding the write delegation may
have modified the data and the server needs to reflect this change to
the second client that submitted the GETATTR. Therefore, the client
holding the write delegation needs to be interrogated. The server
will use the CB_GETATTR operation. The only attributes that the
server can reliably query via CB_GETATTR are size and change.

Since CB_GETATTR is being used to satisfy another client's GETATTR
request, the server only needs to know if the client holding the
delegation has a modified version of the file. If the client's copy
of the delegated file is not modified (data or size), the server can
satisfy the second client's GETATTR request from the attributes
stored locally at the server. If the file is modified, the server
only needs to know about this modified state. If the server
determines that the file is currently modified, it will respond to
the second client's GETATTR as if the file had been modified locally
at the server.

Since the form of the change attribute is determined by the server
and is opaque to the client, the client and server need to agree on a
method of communicating the modified state of the file. For the size
attribute, the client will report its current view of the file size.
For the change attribute, the handling is more involved.
For the client, the following steps will be taken when receiving a
write delegation:
o The value of the change attribute will be obtained from the server
and cached. Let this value be represented by c.
o The client will create a value greater than c that will be used
for communicating that modified data is held at the client. Let this
value be represented by d.
o When the client is queried via CB_GETATTR for the change
attribute, it checks to see if it holds modified data. If the
file is modified, the value d is returned for the change attribute
value. If this file is not currently modified, the client returns
the value c for the change attribute.
For simplicity of implementation, the client MAY for each CB_GETATTR
return the same value d. This is true even if, between successive
CB_GETATTR operations, the client again modifies the file's data
or metadata in its cache. The client can return the same value
because the only requirement is that the client be able to indicate
to the server that the client holds modified data. Therefore, the
value of d may always be c + 1.
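
The client-side steps can be sketched as follows, using the simplification d = c + 1. The class and method names are hypothetical, and the sketch ignores the corner case in which c is already the largest representable value.

```python
class WriteDelegationState:
    """Client-side bookkeeping for the change attribute (illustrative)."""

    def __init__(self, change_from_server):
        self.c = change_from_server      # value cached when delegation began
        self.d = change_from_server + 1  # value reported once data is modified
        self.modified = False

    def mark_modified(self):
        # Called whenever cached data or metadata for the file is modified.
        self.modified = True

    def cb_getattr_change(self):
        # The same d is returned on every query while modified data is held.
        return self.d if self.modified else self.c
```

Repeated modifications do not change the reported value; the client only signals that it holds modified data, not how many times the data changed.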
While the change attribute is opaque to the client in the sense that
it has no idea what units of time, if any, the server is counting
change with, it is not opaque in that the client has to treat it as
an unsigned integer, and the server has to be able to see the results
of the client's changes to that integer. Therefore, the server MUST
encode the change attribute in network order when sending it to the
client. The client MUST decode it from network order to its native
order when receiving it, and the client MUST encode it in network order
when sending it to the server. For this reason, change is defined as
an unsigned integer rather than an opaque array of octets.
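
To illustrate the byte-order requirement, the following sketch encodes and decodes a change value as a 64-bit unsigned integer in network (big-endian) order using Python's `struct` module. The helper names are hypothetical.

```python
import struct


def encode_change(value):
    """Encode a change value as 8 bytes in network byte order."""
    return struct.pack("!Q", value)


def decode_change(data):
    """Decode 8 network-order bytes into a native integer."""
    (value,) = struct.unpack("!Q", data)
    return value
```

Because both sides agree on the integer encoding, a client can compute c + 1 on the decoded value and the server will see exactly that result after decoding.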
For the server, the following steps will be taken when providing a
write delegation:
o Upon providing a write delegation, the server will cache a copy of
the change attribute in the data structure it uses to record the
delegation. Let this value be represented by sc.
o When a second client sends a GETATTR operation on the same file to
the server, the server obtains the change attribute from the first
client. Let this value be cc.
o If the value cc is equal to sc, the file is not modified and the
server returns the current values for change, time_metadata, and
time_modify (for example) to the second client.
o If the value cc is NOT equal to sc, the file is currently modified
at the first client and most likely will be modified at the server
at a future time. The server then uses its current time to
construct attribute values for time_metadata and time_modify. A
new value of sc, which we will call nsc, is computed by the
server, such that nsc >= sc + 1. The server then returns the
constructed time_metadata, time_modify, and nsc values to the
requester. The server replaces sc in the delegation record with
nsc. To prevent the possibility of time_modify, time_metadata,
and change from appearing to go backward (which would happen if
the client holding the delegation fails to write its modified data
to the server before the delegation is revoked or returned), the
server SHOULD update the file's metadata record with the
constructed attribute values. For reasons of reasonable
performance, committing the constructed attribute values to stable
storage is OPTIONAL.
As discussed earlier in this section, the client MAY return the
same cc value on subsequent CB_GETATTR calls, even if the file was
modified in the client's cache yet again between successive
CB_GETATTR calls. Therefore, the server must assume that the file
has been modified yet again, and MUST take care to ensure that the
new nsc it constructs and returns is greater than the previous nsc
it returned. An example implementation's delegation record would
satisfy this mandate by including a boolean field (let us call it
"modified") that is set to false when the delegation is granted,
and an sc value set at the time of grant to the change attribute
value. The modified field would be set to true the first time cc
!= sc, and would stay true until the delegation is returned or
revoked. The processing for constructing nsc, time_modify, and
time_metadata would use this pseudo code:
    if (!modified) {
        do CB_GETATTR for change and size;
        if (cc != sc)
            modified = TRUE;
    } else {
        do CB_GETATTR for size;
    }
    if (modified) {
        sc = sc + 1;
        time_modify = time_metadata = current_time;
        update sc, time_modify, time_metadata into file's metadata;
    }
    return to client (that sent GETATTR) the attributes
    it requested, but make sure size comes from what
    CB_GETATTR returned. Do not update the file's metadata
    with the client's modified size.
o In the case that the file attribute size is different than the
server's current value, the server treats this as a modification
regardless of the value of the change attribute retrieved via
CB_GETATTR and responds to the second client as in the last step.
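
The server-side steps, including the size comparison from the last item, can be put into runnable form as a sketch. Everything here is illustrative: `DelegationRecord`, `handle_getattr`, the dictionary used for file metadata, and the `cb_getattr(want_change)` callable (returning a `(change, size)` pair, with change set to None when not requested) are assumptions, not part of the protocol.

```python
class DelegationRecord:
    def __init__(self, change):
        self.sc = change       # change value cached at grant time
        self.modified = False  # becomes True once the client reports changes


def handle_getattr(record, file_meta, cb_getattr, now):
    """Build the GETATTR reply for a second client."""
    if not record.modified:
        cc, size = cb_getattr(True)
        if cc != record.sc:
            record.modified = True
    else:
        # The change value is no longer informative; only size is needed.
        _, size = cb_getattr(False)
    if size != file_meta["size"]:
        # A differing size counts as a modification regardless of change.
        record.modified = True
    if record.modified:
        record.sc += 1  # every reply sees a strictly larger change value
        file_meta["change"] = record.sc
        file_meta["time_modify"] = file_meta["time_metadata"] = now
    reply = dict(file_meta)
    reply["size"] = size  # reported from the client; not stored on the server
    return reply
```

Incrementing `sc` on every call once `modified` is set is what guarantees that each new value returned is greater than the previous one, even when the client keeps returning the same value.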
This methodology resolves issues of clock differences between client
and server and other scenarios where the use of CB_GETATTR breaks
down.
It should be noted that the server is under no obligation to use
CB_GETATTR and therefore the server MAY simply recall the delegation
to avoid its use.
9.4.4. Recall of Open Delegation
The following events necessitate recall of an open delegation:
o Potentially conflicting OPEN request (or READ/WRITE done with
"special" stateid)
o SETATTR issued by another client
o REMOVE request for the file
o RENAME request for the file as either source or target of the
RENAME
Whether a RENAME of a directory in the path leading to the file
results in recall of an open delegation depends on the semantics of
the server filesystem. If that filesystem denies such RENAMEs when a
file is open, the recall must be performed to determine whether the
file in question is, in fact, open.
In addition to the situations above, the server may choose to recall
open delegations at any time if resource constraints make it
advisable to do so. Clients should always be prepared for the
possibility of recall.
When a client receives a recall for an open delegation, it needs to
update state on the server before returning the delegation. These
same updates must be done whenever a client chooses to return a
delegation voluntarily. The following items of state need to be
dealt with:
o If the file associated with the delegation is no longer open and
no previous CLOSE operation has been sent to the server, a CLOSE
operation must be sent to the server.
o If a file has other open references at the client, then OPEN
operations must be sent to the server. The appropriate stateids
will be provided by the server for subsequent use by the client
since the delegation stateid will no longer be valid. These OPEN
requests are done with the claim type of CLAIM_DELEGATE_CUR. This
will allow the presentation of the delegation stateid so that the
client can establish the appropriate rights to perform the OPEN.
(See the section "Operation 18: OPEN" for details.)
o If there are granted file locks, the corresponding LOCK operations
need to be performed. This applies to the write open delegation
case only.
o For a write open delegation, if at the time of recall the file is
not open for write, all modified data for the file must be flushed
to the server. If the delegation had not existed, the client
would have done this data flush before the CLOSE operation.
o For a write open delegation when a file is still open at the time
of recall, any modified data for the file needs to be flushed to
the server.
o With the write open delegation in place, it is possible that the
file was truncated during the duration of the delegation. For
example, the truncation could have occurred as a result of an OPEN
UNCHECKED with a size attribute value of zero. Therefore, if a
truncation of the file has occurred and this operation has not
been propagated to the server, the truncation must occur before
any modified data is written to the server.
In the case of write open delegation, file locking imposes some
additional requirements. To precisely maintain the associated
invariant, it is required to flush any modified data in any region
for which a write lock was released while the write delegation was in
effect. However, because the write open delegation implies no other
locking by other clients, a simpler implementation is to flush all
modified data for the file (as described just above) if any write
lock has been released while the write open delegation was in effect.
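
As a rough sketch, a client could assemble the work to be done before returning a delegation as an ordered list of operations. The function name, its parameters, and the string labels are hypothetical; "TRUNCATE" stands for whatever request propagates the truncation. The only ordering taken directly from the text is that truncation precedes the writing of modified data and that the data flush precedes the CLOSE; the remaining order, and ending with DELEGRETURN, is an assumption of this sketch.

```python
def plan_delegation_return(deleg_type, open_refs, close_sent,
                           truncated, has_dirty_data, lock_count):
    """Return the list of operations to send, in order."""
    ops = []
    # Re-establish opens that are still referenced at the client.
    ops += ["OPEN(CLAIM_DELEGATE_CUR)"] * open_refs
    if deleg_type == "write":
        # Locally granted locks exist only under a write delegation.
        ops += ["LOCK"] * lock_count
        if truncated:
            ops.append("TRUNCATE")  # must precede any data written back
        if has_dirty_data:
            ops.append("WRITE")     # flush all modified data
    if open_refs == 0 and not close_sent:
        ops.append("CLOSE")
    ops.append("DELEGRETURN")
    return ops
```

Flushing all modified data here also covers the simpler locking option described above, in which a released write lock triggers a flush of the whole file.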
An implementation need not wait until delegation recall (or deciding
to voluntarily return a delegation) to perform any of the above
actions, if implementation considerations (e.g., resource
availability constraints) make that desirable. Generally, however,
the fact that the actual open state of the file may continue to
change makes it not worthwhile to send information about opens and
closes to the server, except as part of delegation return. Only in
the case of closing the open that resulted in obtaining the
delegation would clients be likely to do this early, since, in that
case, the close once done will not be undone. Regardless of the
client's choices on scheduling these actions, all must be performed
before the delegation is returned, including (when applicable) the
close that corresponds to the open that resulted in the delegation.
These actions can be performed either in previous requests or in
previous operations in the same COMPOUND request.

9.4.5. Clients that Fail to Honor Delegation Recalls
A client may fail to respond to a recall for various reasons, such as
a failure of the callback path from server to the client. The client
may be unaware of a failure in the callback path. This lack of
awareness could result in the client finding out long after the
failure that its delegation has been revoked, and another client has
modified the data for which the client had a delegation. This is
especially a problem for the client that held a write delegation.

The server also has a dilemma in that the client that fails to
respond to the recall might also be sending other NFS requests,
including those that renew the lease before the lease expires.
Without returning an error for those lease renewing operations, the
server leads the client to believe that the delegation it has is in
force.

This difficulty is solved by the following rules:
o When the callback path is down, the server MUST NOT revoke the
delegation if one of the following occurs:
- The client has issued a RENEW operation and the server has
returned an NFS4ERR_CB_PATH_DOWN error. The server MUST renew
the lease for any record locks and share reservations the
client has that the server has known about (as opposed to those
locks and share reservations the client has established but not
yet sent to the server, due to the delegation). The server
SHOULD give the client a reasonable time to return its
delegations to the server before revoking the client's
delegations.
- The client has not issued a RENEW operation for some period of
time after the server attempted to recall the delegation. This
period of time MUST NOT be less than the value of the
lease_time attribute.
o When the client holds a delegation, it cannot rely on operations,
except for RENEW, that take a stateid, to renew delegation leases
across callback path failures. The client that wants to keep
delegations in force across callback path failures must use RENEW
to do so.
9.4.6. Delegation Revocation
At the point a delegation is revoked, if there are associated opens
on the client, the applications holding these opens need to be
notified. This notification usually occurs by returning errors for
READ/WRITE operations or when a close is attempted for the open file.
If no opens exist for the file at the point the delegation is
revoked, then notification of the revocation is unnecessary.
However, if there is modified data present at the client for the
file, the user of the application should be notified. Unfortunately,
it may not be possible to notify the user since active applications
may not be present at the client. See the section "Revocation
Recovery for Write Open Delegation" for additional details.
9.5. Data Caching and Revocation
When locks and delegations are revoked, the assumptions upon which
successful caching depend are no longer guaranteed. For any locks or
share reservations that have been revoked, the corresponding owner
needs to be notified. This notification includes applications with a
file open that has a corresponding delegation which has been revoked.
Cached data associated with the revocation must be removed from the
client. In the case of modified data existing in the client's cache,
that data must be removed from the client without it being written to
the server. As mentioned, the assumptions made by the client are no
longer valid at the point when a lock or delegation has been revoked.
For example, another client may have been granted a conflicting lock
after the revocation of the lock at the first client. Therefore, the
data within the lock range may have been modified by the other
client. Obviously, the first client is unable to guarantee to the
application what has occurred to the file in the case of revocation.
Notification to a lock owner will in many cases consist of simply
returning an error on the next and all subsequent READs/WRITEs to the
open file or on the close. Where the methods available to a client
make such notification impossible because errors for certain
operations may not be returned, more drastic action such as signals
or process termination may be appropriate. The justification for
this is that an invariant on which an application depends may be
violated. Depending on how errors are typically treated for the
client operating environment, further levels of notification
including logging, console messages, and GUI pop-ups may be
appropriate.

9.5.1. Revocation Recovery for Write Open Delegation
Revocation recovery for a write open delegation poses the special
issue of modified data in the client cache while the file is not
open. In this situation, any client which does not flush modified
data to the server on each close must ensure that the user receives
appropriate notification of the failure as a result of the
revocation. Since such situations may require human action to
correct problems, notification schemes in which the appropriate user
or administrator is notified may be necessary. Logging and console
messages are typical examples.

If there is modified data on the client, it must not be flushed
normally to the server. A client may attempt to provide a copy of
the file data as modified during the delegation under a different
name in the filesystem name space to ease recovery. Note that when
the client can determine that the file has not been modified by any
other client, or when the client has a complete cached copy of the
file in question, such a saved copy of the client's view of the file
may be of particular value for recovery. In other cases, recovery
using a copy of the file based partially on the client's cached data
and partially on the server copy as modified by other clients will
be anything but straightforward, so clients may avoid saving file
contents in these situations or mark the results specially to warn
users of possible problems.

Saving of such modified data in delegation revocation situations may
be limited to files of a certain size or might be used only when
sufficient disk space is available within the target filesystem.
Such saving may also be restricted to situations when the client has
sufficient buffering resources to keep the cached copy available
until it is properly stored to the target filesystem.

9.6. Attribute Caching
The attributes discussed in this section do not include named
attributes. Individual named attributes are analogous to files and
caching of the data for these needs to be handled just as data
caching is for ordinary files. Similarly, LOOKUP results from an
OPENATTR directory are to be cached on the same basis as any other
pathnames and similarly for directory contents.
Clients may cache file attributes obtained from the server and use
them to avoid subsequent GETATTR requests. Such caching is write
through in that modification to file attributes is always done by
means of requests to the server and should not be done locally and
cached. The exceptions to this are modifications to attributes that
are intimately connected with data caching. Therefore, extending a
file by writing data to the local data cache is reflected immediately
in the size as seen on the client without this change being
immediately reflected on the server. Normally such changes are not
propagated directly to the server but when the modified data is
flushed to the server, analogous attribute changes are made on the
server. When open delegation is in effect, the modified attributes
may be returned to the server in the response to a CB_RECALL call.
The result of local caching of attributes is that the attribute
caches maintained on individual clients will not be coherent.
Changes made in one order on the server may be seen in a different
order on one client and in a third order on a different client.
The typical filesystem application programming interfaces do not
provide means to atomically modify or interrogate attributes for
multiple files at the same time. The following rules provide an
environment where the potential incoherences mentioned above can be
reasonably managed. These rules are derived from the practice of
previous NFS protocols.
o All attributes for a given file (per-fsid attributes excepted) are
cached as a unit at the client so that no non-serializability can
arise within the context of a single file.
o An upper time boundary is maintained on how long a client cache
entry can be kept without being refreshed from the server.
o When operations are performed that change attributes at the
server, the updated attribute set is requested as part of the
containing RPC. This includes directory operations that update
attributes indirectly. This is accomplished by following the
modifying operation with a GETATTR operation and then using the
results of the GETATTR to update the client's cached attributes.
Note that if the full set of attributes to be cached is requested by
READDIR, the results can be cached by the client on the same basis as
attributes obtained via GETATTR.
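
The first two rules (caching attributes as a unit and bounding the age of an entry) can be illustrated with a small cache sketch. The class `AttrCache`, its `fetch` callable standing in for a GETATTR round trip, and the injectable `clock` are all hypothetical names chosen for this example.

```python
import time


class AttrCache:
    """Per-file attribute cache with an upper bound on entry age."""

    def __init__(self, max_age, fetch, clock=time.monotonic):
        self.max_age = max_age  # seconds an entry may go unrefreshed
        self.fetch = fetch      # fetch(fh) -> dict of attributes
        self.clock = clock
        self.entries = {}       # fh -> (attrs, time stored)

    def update(self, fh, attrs):
        # Store the whole attribute set at once, e.g. from a GETATTR that
        # followed a modifying operation in the same request.
        self.entries[fh] = (dict(attrs), self.clock())
        return self.entries[fh][0]

    def get(self, fh):
        entry = self.entries.get(fh)
        if entry is None or self.clock() - entry[1] > self.max_age:
            return self.update(fh, self.fetch(fh))
        return entry[0]
```

Replacing the entry as a whole, rather than patching single attributes, is what keeps the cached view of one file internally consistent.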
A client may validate its cached version of attributes for a file by
fetching both the change and time_access attributes and assuming
that if the change attribute has the same value as it did when the
attributes were cached, then no attributes other than time_access
have changed. The reason why time_access is also fetched is because
many servers operate in environments where the operation that updates
change does not update time_access. For example, POSIX file
semantics do not update access time when a file is modified by the
write system call. Therefore, the client that wants a current
time_access value should fetch it with change during the attribute
cache validation processing and update its cached time_access.

The client may maintain a cache of modified attributes for those
attributes intimately connected with data of modified regular files
(size, time_modify, and change). Other than those three attributes,
the client MUST NOT maintain a cache of modified attributes.
Instead, attribute changes are immediately sent to the server.

In some operating environments, the equivalent to time_access is
expected to be implicitly updated by each read of the content of the
file object. If an NFS client is caching the content of a file
object, whether it is a regular file, directory, or symbolic link,
the client SHOULD NOT update the time_access attribute (via SETATTR
or a small READ or READDIR request) on the server with each read that
is satisfied from cache. The reason is that this can defeat the
performance benefits of caching content, especially since an explicit
SETATTR of time_access may alter the change attribute on the server.
If the change attribute changes, clients that are caching the content
will think the content has changed, and will re-read unmodified data
from the server. Nor is the client encouraged to maintain a modified
version of time_access in its cache, since this would mean that the
client will either eventually have to write the access time to the
server with bad performance effects, or it would never update the
server's time_access, thereby resulting in a situation where an
application that caches access time between a close and open of the
same file observes the access time oscillating between the past and
present. The time_access attribute always means the time of last
access to a file by a read that was satisfied by the server. This
way clients will tend to see only time_access changes that go forward
in time.

9.7. Data and Metadata Caching and Memory Mapped Files
Some operating environments include the capability for an application
to map a file's content into the application's address space. Each
time the application accesses a memory location that corresponds to a
block that has not been loaded into the address space, a page fault
occurs and the file is read (or if the block does not exist in the
file, the block is allocated and then instantiated in the
application's address space).
As long as each memory mapped access to the file requires a page
fault, the relevant attributes of the file that are used to detect
access and modification (time_access, time_metadata, time_modify, and
change) will be updated. However, in many operating environments,
when page faults are not required these attributes will not be
updated on reads or updates to the file via memory access (regardless
of whether the file is a local file or is being accessed remotely). A
client or server MAY fail to update attributes of a file that is
being accessed via memory mapped I/O. This has several implications:
o If there is an application on the server that has memory mapped a
file that a client is also accessing, the client may not be able
to get a consistent value of the change attribute to determine
whether its cache is stale or not. A server that knows that the
file is memory mapped could always pessimistically return updated
values for change so as to force the application to always get the
most up to date data and metadata for the file. However, due to
the negative performance implications of this, such behavior is
OPTIONAL.
o If the memory mapped file is not being modified on the server, and
instead is just being read by an application via the memory mapped
interface, the client will not see an updated time_access
attribute. However, in many operating environments, neither will
any process running on the server. Thus NFS clients are at no
disadvantage with respect to local processes.
o If there is another client that is memory mapping the file, and if
that client is holding a write delegation, the same set of issues
as discussed in the previous two bullet items apply. So, when a
server does a CB_GETATTR to a file that the client has modified in
its cache, the response from CB_GETATTR will not necessarily be
accurate. As discussed earlier, the client's obligation is to
report that the file has been modified since the delegation was
granted, not whether it has been modified again between successive
CB_GETATTR calls, and the server MUST assume that any file the
client has modified in cache has been modified again between
successive CB_GETATTR calls. Depending on the nature of the
client's memory management system, it may not be possible to meet
even this weak obligation. A client MAY return stale information in
CB_GETATTR whenever the file is memory mapped.
o The mixture of memory mapping and file locking on the same file is
problematic. Consider the following scenario, where the page size
on each client is 8192 bytes.
- Client A memory maps first page (8192 bytes) of file X
- Client B memory maps first page (8192 bytes) of file X
- Client A write locks first 4096 bytes
- Client B write locks second 4096 bytes
- Client A, via a STORE instruction, modifies part of its locked
region.
- Simultaneous to client A, client B issues a STORE on part of
its locked region.
Here the challenge is for each client to resynchronize to get a
correct view of the first page. In many operating environments, the
virtual memory management systems on each client only know a page is
modified, not that a subset of the page corresponding to the
respective lock regions has been modified. So it is not possible for
each client to do the right thing, which is to only write to the
server that portion of the page that is locked. For example, if
client A simply writes out the page, and then client B writes out the
page, client A's data is lost.
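The lost update described above can be reproduced with a small
simulation. This is only an illustrative sketch: the server page and
the two client mappings are modeled as in-memory byte arrays, and the
function names (write_back_whole_page, write_back_locked_range, run)
are hypothetical and not part of the protocol.

```python
PAGE_SIZE = 8192
HALF = PAGE_SIZE // 2


def write_back_whole_page(server_page: bytearray, client_page: bytes) -> None:
    """Naive write-back: the client only knows that the page is dirty."""
    server_page[:] = client_page


def write_back_locked_range(server_page: bytearray, client_page: bytes,
                            start: int, length: int) -> None:
    """Write back only the byte range covered by the client's own lock."""
    server_page[start:start + length] = client_page[start:start + length]


def run(whole_page: bool) -> bytes:
    server = bytearray(PAGE_SIZE)        # first page of file X, all zeros
    page_a = bytearray(server)           # client A maps the page
    page_b = bytearray(server)           # client B maps the page
    page_a[0:4] = b"AAAA"                # A stores into its locked first half
    page_b[HALF:HALF + 4] = b"BBBB"      # B stores into its locked second half
    if whole_page:
        write_back_whole_page(server, page_a)
        # B's copy still holds stale zeros in A's half, so A's store is lost.
        write_back_whole_page(server, page_b)
    else:
        write_back_locked_range(server, page_a, 0, HALF)
        write_back_locked_range(server, page_b, HALF, HALF)
    return bytes(server)
```

Calling run(True) models both clients flushing the full page, and
run(False) models the behavior the text calls "the right thing", which
many virtual memory systems cannot provide because they track
modification per page rather than per locked range.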
Moreover, if mandatory locking is enabled on the file, then we have a
different problem. When clients A and B issue the STORE
instructions, the resulting page faults require a record lock on the
entire page. Each client then tries to extend its locked range to
the entire page, which results in a deadlock.
Communicating the NFS4ERR_DEADLOCK error to a STORE instruction is
difficult at best.
If a client is locking the entire memory mapped file, there is no
problem with advisory or mandatory record locking, at least until the
client unlocks a region in the middle of the file.
Given the above issues the following are permitted:
- Clients and servers MAY deny memory mapping a file they know there
are record locks for.
- Clients and servers MAY deny a record lock on a file they know is
memory mapped.
- A client MAY deny memory mapping a file that it knows requires
mandatory locking for I/O. If mandatory locking is enabled after
the file is opened and mapped, the client MAY deny the application
further access to its mapped file.
9.8. Name Caching
The results of LOOKUP and READDIR operations may be cached to avoid
the cost of subsequent LOOKUP operations. Just as in the case of
attribute caching, inconsistencies may arise among the various client
caches. To mitigate the effects of these inconsistencies and given
the context of typical filesystem APIs, an upper time boundary is
maintained on how long a client name cache entry can be kept without
verifying that the entry has not been made invalid by a directory
change operation performed by another client.
When a client is not making changes to a directory for which there
exist name cache entries, the client needs to periodically fetch
attributes for that directory to ensure that it is not being
modified. After determining that no modification has occurred, the
expiration time for the associated name cache entries may be updated
to be the current time plus the name cache staleness bound.
When a client is making changes to a given directory, it needs to
determine whether there have been changes made to the directory by
other clients. It does this by using the change attribute as
reported before and after the directory operation in the associated
change_info4 value returned for the operation. The server is able to
communicate to the client whether the change_info4 data is provided
atomically with respect to the directory operation. If the change
values are provided atomically, the client is then able to compare
the pre-operation change value with the change value in the client's
name cache. If the comparison indicates that the directory was
updated by another client, the name cache associated with the
modified directory is purged from the client. If the comparison
indicates no modification, the name cache can be updated on the
client to reflect the directory operation and the associated timeout
extended. The post-operation change value needs to be saved as the
basis for future change_info4 comparisons.
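The comparison described above might be sketched as follows. The
ChangeInfo class mirrors the three change_info4 fields; everything
else (DirNameCache, apply_dir_op, the STALENESS_BOUND value, and the
choice to purge when the values are not atomic) is a hypothetical
client-side structure chosen for illustration, not something the
protocol mandates.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class ChangeInfo:
    """Mirrors change_info4: atomic flag plus before/after change values."""
    atomic: bool
    before: int
    after: int


@dataclass
class DirNameCache:
    change: int                                  # change value the cache is based on
    expires: float                               # expiration time of the entries
    names: Dict[str, str] = field(default_factory=dict)  # name -> filehandle


STALENESS_BOUND = 30.0  # seconds; hypothetical name cache staleness bound


def apply_dir_op(cache: DirNameCache, cinfo: ChangeInfo, now: float,
                 update: Callable[[Dict[str, str]], None]) -> bool:
    """Reconcile the name cache after this client's own directory operation.

    Returns True if the cached names were kept and updated, False if purged.
    """
    if cinfo.atomic and cinfo.before == cache.change:
        update(cache.names)                      # reflect our own operation
        cache.expires = now + STALENESS_BOUND    # extend the timeout
        kept = True
    else:
        # Another client changed the directory, or atomicity is unknown.
        cache.names.clear()
        kept = False
    cache.change = cinfo.after                   # basis for future comparisons
    return kept
```

A caller would pass the change_info4 returned by an operation such as
CREATE or REMOVE together with a small function that adds or removes
the affected name.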
As demonstrated by the scenario above, name caching requires that the
client revalidate name cache data by inspecting the change attribute
of a directory at the point when the name cache item was cached.
This requires that the server update the change attribute for
directories when the contents of the corresponding directory are
modified. For a client to use the change_info4 information
appropriately and correctly, the server must report the pre- and
post-operation change attribute values atomically. When the server is
unable to report the before and after values atomically with respect
to the directory operation, the server must indicate that fact in the
change_info4 return value. When the information is not atomically
reported, the client should not assume that other clients have not
changed the directory.
9.9. Directory Caching
The results of READDIR operations may be used to avoid subsequent
READDIR operations. Just as in the cases of attribute and name
caching, inconsistencies may arise among the various client caches.
To mitigate the effects of these inconsistencies, and given the
context of typical filesystem APIs, the following rules should be
followed:
o Cached READDIR information for a directory which is not obtained in
a single READDIR operation must always be a consistent snapshot of
directory contents. This is determined by using a GETATTR before
the first READDIR and after the last READDIR that contributes to
the cache.
o An upper time boundary is maintained to indicate the length of time
a directory cache entry is considered valid before the client must
revalidate the cached information.
The revalidation technique parallels that discussed in the case of
name caching. When the client is not changing the directory in
question, checking the change attribute of the directory with GETATTR
is adequate. The lifetime of the cache entry can be extended at
these checkpoints. When a client is modifying the directory, the
client needs to use the change_info4 data to determine whether there
are other clients modifying the directory. If it is determined that
no other client modifications are occurring, the client may update
its directory cache to reflect its own changes.
As demonstrated previously, directory caching requires that the
client revalidate directory cache data by inspecting the change
attribute of a directory at the point when the directory was cached.
This requires that the server update the change attribute for
directories when the contents of the corresponding directory are
modified. For a client to use the change_info4 information
appropriately and correctly, the server must report the pre- and
post-operation change attribute values atomically.
When the server is unable to report the before and after values atomically with respect to the directory operation, the server must indicate that fact in the change_info4 return value. When the information is not atomically reported, the client should not assume that other clients have not changed the directory.
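The consistent-snapshot rule for multi-request READDIR results could
be implemented along the following lines. This is a sketch under
stated assumptions: getattr_change and readdir_chunks are hypothetical
callables standing in for a GETATTR of the directory's change
attribute and for the sequence of READDIR requests, and the retry
limit is an arbitrary choice.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def read_directory_snapshot(
    getattr_change: Callable[[], int],
    readdir_chunks: Callable[[], Iterable[List[str]]],
    max_attempts: int = 3,
) -> Optional[Tuple[int, List[str]]]:
    """Return (change, entries) for a consistent snapshot, or None.

    The change attribute is sampled before the first READDIR and after
    the last one; the entries are cached only if both samples agree.
    """
    for _ in range(max_attempts):
        before = getattr_change()
        entries: List[str] = []
        for chunk in readdir_chunks():       # one list per READDIR reply
            entries.extend(chunk)
        after = getattr_change()
        if before == after:
            return after, entries            # safe to cache with this change value
    return None                              # directory kept changing; do not cache
```

The returned change value is what the client would later compare
against GETATTR results or change_info4 data when revalidating the
cached directory.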
10. Minor Versioning
To address the requirement of an NFS protocol that can evolve as the
need arises, the NFS version 4 protocol contains the rules and
framework to allow for future minor changes or versioning.
The base assumption with respect to minor versioning is that any
future accepted minor version must follow the IETF process and be
documented in a standards track RFC. Therefore, each minor version
number will correspond to an RFC. Minor version zero of the NFS
version 4 protocol is represented by this RFC. The COMPOUND
procedure will support the encoding of the minor version being
requested by the client.
The following items represent the basic rules for the development of
minor versions. Note that a future minor version may decide to
modify or add to the following rules as part of the minor version
definition.
1. Procedures are not added or deleted
To maintain the general RPC model, NFS version 4 minor versions
will not add to or delete procedures from the NFS program.
2. Minor versions may add operations to the COMPOUND and
CB_COMPOUND procedures.
The addition of operations to the COMPOUND and CB_COMPOUND
procedures does not affect the RPC model.
2.1 Minor versions may append attributes to GETATTR4args, bitmap4,
and GETATTR4res.
This allows for the expansion of the attribute model to allow
for future growth or adaptation.
2.2 Minor version X must append any new attributes after the last
documented attribute.
Since attribute results are specified as an opaque array of
per-attribute XDR encoded results, the complexity of adding new
attributes in the midst of the current definitions will be too
burdensome.
3. Minor versions must not modify the structure of an existing
operation's arguments or results.
Again the complexity of handling multiple structure definitions
for a single operation is too burdensome. New operations should
be added instead of modifying existing structures for a minor
version.
This rule does not preclude the following adaptations in a minor
version.
o adding bits to flag fields such as new attributes to GETATTR's
bitmap4 data type
o adding bits to existing attributes like ACLs that have flag
words
o extending enumerated types (including NFS4ERR_*) with new
values
4. Minor versions may not modify the structure of existing
attributes.
5. Minor versions may not delete operations.
This prevents the potential reuse of a particular operation
"slot" in a future minor version.
6. Minor versions may not delete attributes.
7. Minor versions may not delete flag bits or enumeration values.
8. Minor versions may declare an operation as mandatory to NOT
implement.
Specifying an operation as "mandatory to not implement" is
equivalent to obsoleting an operation. For the client, it means
that the operation should not be sent to the server. For the
server, an NFS error can be returned as opposed to "dropping"
the request as an XDR decode error. This approach allows for
the obsolescence of an operation while maintaining its structure
so that a future minor version can reintroduce the operation.
8.1 Minor versions may declare attributes mandatory to NOT
implement.
8.2 Minor versions may declare flag bits or enumeration values as
mandatory to NOT implement.
9. Minor versions may downgrade features from mandatory to
recommended, or recommended to optional.
10. Minor versions may upgrade features from optional to recommended
or recommended to mandatory.
11. A client and server that support minor version X must support
minor versions 0 (zero) through X-1 as well.
12. No new features may be introduced as mandatory in a minor
version.
This rule allows for the introduction of new functionality and
forces the use of implementation experience before designating a
feature as mandatory.
13. A client MUST NOT attempt to use a stateid, filehandle, or
similar returned object from the COMPOUND procedure with minor
version X for another COMPOUND procedure with minor version Y,
where X != Y.
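Two of the rules above lend themselves to a short illustration: rule
11 (a server supporting minor version X also accepts 0 through X-1)
and rule 13 (objects are not carried across minor versions). The
sketch below is not a definitive implementation. SERVER_MAX_MINOR,
check_minor_version, Tagged, and use_in_compound are hypothetical
names, and the reply is modeled as a plain dictionary. The error used
for an unsupported version, with its zero-length result array,
follows the NFS4ERR_MINOR_VERS_MISMATCH entry in the Error
Definitions section.

```python
from dataclasses import dataclass
from typing import Optional

SERVER_MAX_MINOR = 0  # hypothetical: highest minor version this server implements


def check_minor_version(minorversion: int,
                        server_max: int = SERVER_MAX_MINOR) -> Optional[dict]:
    """Return None if the COMPOUND may proceed, else an error reply."""
    if 0 <= minorversion <= server_max:
        return None                      # rule 11: versions 0..X are all accepted
    # Unsupported minor version: reply carries no operation results.
    return {"status": "NFS4ERR_MINOR_VERS_MISMATCH", "resarray": []}


@dataclass(frozen=True)
class Tagged:
    """A stateid or filehandle remembered with the minor version that returned it."""
    minor: int
    value: bytes


def use_in_compound(obj: Tagged, compound_minor: int) -> bytes:
    """Client-side guard for rule 13: never mix minor versions."""
    if obj.minor != compound_minor:
        raise ValueError("object was obtained under a different minor version")
    return obj.value
```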
11. Internationalization
The primary area in which NFS version 4 needs to deal with
internationalization, or I18N, is file names and other strings as
used within the protocol. The choice of string
representation must allow reasonable name/string access to clients
which use various languages. The UTF-8 encoding of the UCS as
defined by [ISO10646] allows for this type of access and follows the
policy described in "IETF Policy on Character Sets and Languages",
[RFC2277].
[RFC3454], otherwise known as "stringprep", documents a framework for
using Unicode/UTF-8 in networking protocols, so as "to increase the
likelihood that string input and string comparison work in ways that
make sense for typical users throughout the world." A protocol must
define a profile of stringprep "in order to fully specify the
processing options." The remainder of this Internationalization
section defines the NFS version 4 stringprep profiles. Much of the
terminology used for the remainder of this section comes from
stringprep.
There are three UTF-8 string types defined for NFS version 4:
utf8str_cs, utf8str_cis, and utf8str_mixed. Separate profiles are
defined for each. Each profile defines the following, as required by
stringprep:
o The intended applicability of the profile
o The character repertoire that is the input and output to
stringprep (which is Unicode 3.2 for the referenced version of
stringprep)
o The mapping tables from stringprep used (as described in section 3
of stringprep)
o Any additional mapping tables specific to the profile
o The Unicode normalization used, if any (as described in section 4
of stringprep)
o The tables from stringprep listing characters that are
prohibited as output (as described in section 5 of stringprep)
o The bidirectional string testing used, if any (as described in
section 6 of stringprep)
o Any additional characters that are prohibited as output specific
to the profile
Stringprep discusses Unicode characters, whereas NFS version 4
renders UTF-8 characters. Since there is a one-to-one mapping from
UTF-8 to Unicode, wherever the remainder of this document refers to
Unicode, the reader should assume UTF-8.
Much of the text for the profiles comes from [RFC3454].
11.1. Stringprep profile for the utf8str_cs type
Every use of the utf8str_cs type definition in the NFS version 4
protocol specification follows the profile named nfs4_cs_prep.
11.1.1. Intended applicability of the nfs4_cs_prep profile
The utf8str_cs type is a case sensitive string of UTF-8 characters.
Its primary use in NFS Version 4 is for naming components and
pathnames. Components and pathnames are stored on the server's
filesystem. Two valid distinct UTF-8 strings might be the same after
processing via the utf8str_cs profile. If the strings are two names
inside a directory, the NFS version 4 server will need to either:
o disallow the creation of a second name if its post-processed form
collides with that of an existing name, or
o allow the creation of the second name, but arrange so that after
post-processing, the second name is different than the
post-processed form of the first name.
11.1.2. Character repertoire of nfs4_cs_prep
The nfs4_cs_prep profile uses Unicode 3.2, as defined in stringprep's
Appendix A.1.
11.1.3. Mapping used by nfs4_cs_prep
The nfs4_cs_prep profile specifies mapping using the following tables
from stringprep:
Table B.1
Table B.2 is normally not part of the nfs4_cs_prep profile as it is
primarily for dealing with case-insensitive comparisons. However, if
the NFS version 4 file server supports the case_insensitive
filesystem attribute, and if case_insensitive is true, the NFS
version 4 server MUST use Table B.2 (in addition to Table B.1) when
processing utf8str_cs strings, and the NFS version 4 client MUST
assume Table B.2 (in addition to Table B.1) is being used.
If the case_preserving attribute is present and set to false, then
the NFS version 4 server MUST use Table B.2 to map case when
processing utf8str_cs strings. Whether the server maps from lower to
upper case or the upper to lower case is an implementation
dependency.
11.1.4. Normalization used by nfs4_cs_prep
The nfs4_cs_prep profile does not specify a normalization form. A later revision of this specification may specify a particular normalization form. Therefore, the server and client can expect that they may receive unnormalized characters within protocol requests and responses. If the operating environment requires normalization, then the implementation must normalize utf8str_cs strings within the protocol before presenting the information to an application (at the client) or local filesystem (at the server).
11.1.5. Prohibited output for nfs4_cs_prep
The nfs4_cs_prep profile specifies prohibiting using the following
tables from stringprep:
Table C.3
Table C.4
Table C.5
Table C.6
Table C.7
Table C.8
Table C.9
11.1.6. Bidirectional output for nfs4_cs_prep
The nfs4_cs_prep profile does not specify any checking of
bidirectional strings.
11.2. Stringprep profile for the utf8str_cis type
Every use of the utf8str_cis type definition in the NFS version 4
protocol specification follows the profile named nfs4_cis_prep.
11.2.1. Intended applicability of the nfs4_cis_prep profile
The utf8str_cis type is a case insensitive string of UTF-8 characters.
Its primary use in NFS Version 4 is for naming NFS servers.
11.2.2. Character repertoire of nfs4_cis_prep
The nfs4_cis_prep profile uses Unicode 3.2, as defined in stringprep's
Appendix A.1.
11.2.3. Mapping used by nfs4_cis_prep
The nfs4_cis_prep profile specifies mapping using the following tables
from stringprep:
Table B.1
Table B.2
11.2.4. Normalization used by nfs4_cis_prep
The nfs4_cis_prep profile specifies using Unicode normalization form KC, as described in stringprep.
11.2.5. Prohibited output for nfs4_cis_prep
The nfs4_cis_prep profile specifies prohibiting using the following
tables from stringprep:
Table C.1.2
Table C.2.2
Table C.3
Table C.4
Table C.5
Table C.6
Table C.7
Table C.8
Table C.9
11.2.6. Bidirectional output for nfs4_cis_prep
The nfs4_cis_prep profile specifies checking bidirectional strings as
described in stringprep's section 6.
11.3. Stringprep profile for the utf8str_mixed type
Every use of the utf8str_mixed type definition in the NFS version 4
protocol specification follows the profile named nfs4_mixed_prep.
11.3.1. Intended applicability of the nfs4_mixed_prep profile
The utf8str_mixed type is a string of UTF-8 characters, with a prefix
that is case sensitive, a separator equal to '@', and a suffix that is
a fully qualified domain name. Its primary use in NFS Version 4 is
for naming principals identified in an Access Control Entry.
11.3.2. Character repertoire of nfs4_mixed_prep
The nfs4_mixed_prep profile uses Unicode 3.2, as defined in
stringprep's Appendix A.1.
11.3.3. Mapping used by nfs4_mixed_prep
For the prefix and the separator of a utf8str_mixed string, the
nfs4_mixed_prep profile specifies mapping using the following table
from stringprep:
Table B.1
For the suffix of a utf8str_mixed string, the nfs4_mixed_prep profile
specifies mapping using the following tables from stringprep:
Table B.1
Table B.2
11.3.4. Normalization used by nfs4_mixed_prep
The nfs4_mixed_prep profile specifies using Unicode normalization
form KC, as described in stringprep.
11.3.5. Prohibited output for nfs4_mixed_prep
The nfs4_mixed_prep profile specifies prohibiting using the following
tables from stringprep:
Table C.1.2
Table C.2.2
Table C.3
Table C.4
Table C.5
Table C.6
Table C.7
Table C.8
Table C.9
11.3.6. Bidirectional output for nfs4_mixed_prep
The nfs4_mixed_prep profile specifies checking bidirectional strings
as described in stringprep's section 6.
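As an illustration of how the steps of the nfs4_mixed_prep profile fit
together (map, normalize, prohibit, check bidirectional strings), the
following sketch uses the stringprep tables shipped in the Python
standard library. It is not a reference implementation. The function
name mirrors the profile name, but splitting at the last '@' and
applying the bidirectional check to the whole string rather than to
each part are assumptions made here for brevity.

```python
import stringprep
from unicodedata import ucd_3_2_0 as ucd  # stringprep is defined over Unicode 3.2

# Prohibited-output tables listed for the profile: C.1.2, C.2.2, C.3 to C.9.
PROHIBITED = (
    stringprep.in_table_c12, stringprep.in_table_c22, stringprep.in_table_c3,
    stringprep.in_table_c4, stringprep.in_table_c5, stringprep.in_table_c6,
    stringprep.in_table_c7, stringprep.in_table_c8, stringprep.in_table_c9,
)


def _map(part: str, fold_case: bool) -> str:
    """Apply Table B.1 (map to nothing) and optionally Table B.2 (case fold)."""
    out = []
    for ch in part:
        if stringprep.in_table_b1(ch):
            continue
        out.append(stringprep.map_table_b2(ch) if fold_case else ch)
    return "".join(out)


def nfs4_mixed_prep(s: str) -> str:
    prefix, sep, suffix = s.rpartition("@")   # assumption: last '@' is the separator
    if not sep:
        raise ValueError("missing '@' separator")
    mapped = _map(prefix, fold_case=False) + "@" + _map(suffix, fold_case=True)
    normalized = ucd.normalize("NFKC", mapped)
    for ch in normalized:
        if any(in_table(ch) for in_table in PROHIBITED):
            raise ValueError("prohibited character U+%04X" % ord(ch))
    # Bidirectional check from stringprep section 6 (whole string: assumption).
    rand_al = [stringprep.in_table_d1(ch) for ch in normalized]
    if any(rand_al):
        if any(stringprep.in_table_d2(ch) for ch in normalized):
            raise ValueError("mixed RandALCat and LCat characters")
        if not (rand_al[0] and rand_al[-1]):
            raise ValueError("RandALCat string must start and end with RandALCat")
    return normalized
```

For example, a principal written as "Alice@EXAMPLE.COM" keeps the case
of its prefix while the domain suffix is folded to lower case.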
11.4. UTF-8 Related Errors
Where the client sends an invalid UTF-8 string, the server should
return an NFS4ERR_INVAL error. This includes cases in which
inappropriate prefixes are detected and where the count includes
trailing bytes that do not constitute a full UCS character.
Where the client supplied string is valid UTF-8 but contains
characters that are not supported by the server as a value for that
string (e.g., names containing characters that have more than two
octets on a filesystem that supports Unicode characters only), the
server should return an NFS4ERR_BADCHAR error.
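The first two cases above can be sketched as a small classification
routine. This is a hedged illustration only: classify_utf8_string is
a hypothetical helper, and the max_code_point limit of 0xFFFF is an
assumed stand-in for a server filesystem that stores only 16-bit
characters.

```python
def classify_utf8_string(raw: bytes, max_code_point: int = 0xFFFF) -> str:
    """Return the status a server might use for a client-supplied string."""
    try:
        # Strict decoding rejects invalid prefixes and truncated sequences.
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return "NFS4ERR_INVAL"
    # Valid UTF-8, but possibly outside what this (hypothetical) server stores.
    if any(ord(ch) > max_code_point for ch in text):
        return "NFS4ERR_BADCHAR"
    return "NFS4_OK"
```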
Where a UTF-8 string is used as a file name, and the filesystem,
while supporting all of the characters within the name, does not
allow that particular name to be used, the server should return the
error NFS4ERR_BADNAME. This includes situations in which the server
filesystem imposes a normalization constraint on name strings, but
will also include such situations as filesystem prohibitions of "."
and ".." as file names for certain operations, and other such
constraints.
12. Error Definitions
NFS error numbers are assigned to failed operations within a compound
request. A compound request contains a number of NFS operations that
have their results encoded in sequence in a compound reply. The
results of successful operations will consist of an NFS4_OK status
followed by the encoded results of the operation. If an NFS
operation fails, an error status will be entered in the reply and the
compound request will be terminated.
A description of each defined error follows:
NFS4_OK Indicates the operation completed successfully.
NFS4ERR_ACCESS Permission denied. The caller does not have the
correct permission to perform the requested
operation. Contrast this with NFS4ERR_PERM,
which restricts itself to owner or privileged
user permission failures.
NFS4ERR_ATTRNOTSUPP An attribute specified is not supported by the
server. Does not apply to the GETATTR
operation.
NFS4ERR_ADMIN_REVOKED Due to administrator intervention, the
lockowner's record locks, share reservations,
and delegations have been revoked by the
server.
NFS4ERR_BADCHAR A UTF-8 string contains a character which is
not supported by the server in the context in
which it is being used.
NFS4ERR_BAD_COOKIE READDIR cookie is stale.
NFS4ERR_BADHANDLE Illegal NFS filehandle. The filehandle failed
internal consistency checks.
NFS4ERR_BADNAME A name string in a request consists of valid
UTF-8 characters supported by the server but
the name is not supported by the server as a
valid name for the current operation.
NFS4ERR_BADOWNER An owner, owner_group, or ACL attribute value
can not be translated to local representation.
NFS4ERR_BADTYPE An attempt was made to create an object of a
type not supported by the server.
NFS4ERR_BAD_RANGE The range for a LOCK, LOCKT, or LOCKU operation
is not appropriate to the allowable range of
offsets for the server.
NFS4ERR_BAD_SEQID The sequence number in a locking request is
neither the next expected number nor the last
number processed.
NFS4ERR_BAD_STATEID A stateid generated by the current server
instance, but which does not designate any
locking state (either current or superseded)
for a current lockowner-file pair, was used.
NFS4ERR_BADXDR The server encountered an XDR decoding error
while processing an operation.
NFS4ERR_CLID_INUSE The SETCLIENTID operation has found that a
client id is already in use by another client.
NFS4ERR_DEADLOCK The server has been able to determine a file
locking deadlock condition for a blocking lock
request.
NFS4ERR_DELAY The server initiated the request, but was not
able to complete it in a timely fashion. The
client should wait and then try the request
with a new RPC transaction ID. For example,
this error should be returned from a server
that supports hierarchical storage and receives
a request to process a file that has been
migrated. In this case, the server should start
the immigration process and respond to the client
with this error. This error may also occur
when a necessary delegation recall makes
processing a request in a timely fashion
impossible.
NFS4ERR_DENIED An attempt to lock a file is denied. Since
this may be a temporary condition, the client
is encouraged to retry the lock request until
the lock is accepted.
NFS4ERR_DQUOT Resource (quota) hard limit exceeded. The
user's resource limit on the server has been
exceeded.
NFS4ERR_EXIST File exists. The file specified already exists.
NFS4ERR_EXPIRED A lease has expired that is being used in the
current operation.
NFS4ERR_FBIG File too large. The operation would have caused
a file to grow beyond the server's limit.
NFS4ERR_FHEXPIRED The filehandle provided is volatile and has
expired at the server.
NFS4ERR_FILE_OPEN The operation can not be successfully processed
because a file involved in the operation is
currently open.
NFS4ERR_GRACE The server is in its recovery or grace period
which should match the lease period of the
server.
NFS4ERR_INVAL Invalid argument or unsupported argument for an
operation. Two examples are attempting a
READLINK on an object other than a symbolic
link or specifying a value for an enum field
that is not defined in the protocol (e.g.,
nfs_ftype4).
NFS4ERR_IO I/O error. A hard error (for example, a disk
error) occurred while processing the requested
operation.
NFS4ERR_ISDIR Is a directory. The caller specified a
directory in a non-directory operation.
NFS4ERR_LEASE_MOVED A lease being renewed is associated with a
filesystem that has been migrated to a new
server.
NFS4ERR_LOCKED A read or write operation was attempted on a
locked file.
NFS4ERR_LOCK_NOTSUPP Server does not support atomic upgrade or
downgrade of locks.
NFS4ERR_LOCK_RANGE A lock request is operating on a sub-range of a
current lock for the lock owner and the server
does not support this type of request.
NFS4ERR_LOCKS_HELD A CLOSE was attempted and file locks would
exist after the CLOSE.
NFS4ERR_MINOR_VERS_MISMATCH
The server has received a request that
specifies an unsupported minor version. The
server must return a COMPOUND4res with a zero
length operations result array.
NFS4ERR_MLINK Too many hard links.
NFS4ERR_MOVED The filesystem which contains the current
filehandle object has been relocated or
migrated to another server. The client may
obtain the new filesystem location by obtaining
the "fs_locations" attribute for the current
filehandle. For further discussion, refer to
the section "Filesystem Migration or
Relocation".
NFS4ERR_NAMETOOLONG The filename in an operation was too long.
NFS4ERR_NOENT No such file or directory. The file or
directory name specified does not exist.
NFS4ERR_NOFILEHANDLE The logical current filehandle value (or, in
the case of RESTOREFH, the saved filehandle
value) has not been set properly. This may be
a result of a malformed COMPOUND operation
(i.e., no PUTFH or PUTROOTFH before an
operation that requires the current filehandle
be set).
NFS4ERR_NO_GRACE A reclaim of client state has fallen outside of
the grace period of the server. As a result,
the server can not guarantee that conflicting
state has not been provided to another client.
NFS4ERR_NOSPC No space left on device. The operation would
have caused the server's filesystem to exceed
its limit.
NFS4ERR_NOTDIR Not a directory. The caller specified a non-
directory in a directory operation.
NFS4ERR_NOTEMPTY An attempt was made to remove a directory that
was not empty.
NFS4ERR_NOTSUPP Operation is not supported.
NFS4ERR_NOT_SAME This error is returned by the VERIFY operation
to signify that the attributes compared were
not the same as provided in the client's
request.
NFS4ERR_NXIO I/O error. No such device or address.
NFS4ERR_OLD_STATEID A stateid which designates the locking state
for a lockowner-file at an earlier time was
used.
NFS4ERR_OPENMODE The client attempted a READ, WRITE, LOCK or
SETATTR operation not sanctioned by the stateid
passed (e.g., writing to a file opened only for
read).
NFS4ERR_OP_ILLEGAL An illegal operation value has been specified
in the argop field of a COMPOUND or CB_COMPOUND
procedure.
NFS4ERR_PERM Not owner. The operation was not allowed
because the caller is either not a privileged
user (root) or not the owner of the target of
the operation.
NFS4ERR_RECLAIM_BAD The reclaim provided by the client does not
match any of the server's state consistency
checks and is bad.
NFS4ERR_RECLAIM_CONFLICT
The reclaim provided by the client has
encountered a conflict and can not be provided.
Potentially indicates a misbehaving client.
NFS4ERR_RESOURCE For the processing of the COMPOUND procedure,
the server may exhaust available resources and
can not continue processing operations within
the COMPOUND procedure. This error will be
returned from the server in those instances of
resource exhaustion related to the processing
of the COMPOUND procedure.
NFS4ERR_RESTOREFH The RESTOREFH operation does not have a saved
filehandle (identified by SAVEFH) to operate
upon.
NFS4ERR_ROFS Read-only filesystem. A modifying operation was
attempted on a read-only filesystem.
NFS4ERR_SAME This error is returned by the NVERIFY operation
to signify that the attributes compared were
the same as provided in the client's request.
NFS4ERR_SERVERFAULT An error occurred on the server which does not
map to any of the legal NFS version 4 protocol
error values. The client should translate this
into an appropriate error. UNIX clients may
choose to translate this to EIO.
NFS4ERR_SHARE_DENIED An attempt to OPEN a file with a share
reservation has failed because of a share
conflict.
NFS4ERR_STALE Invalid filehandle. The filehandle given in the
arguments was invalid. The file referred to by
that filehandle no longer exists or access to
it has been revoked.
NFS4ERR_STALE_CLIENTID A clientid not recognized by the server was
used in a locking or SETCLIENTID_CONFIRM
request.
NFS4ERR_STALE_STATEID A stateid generated by an earlier server
instance was used.
NFS4ERR_SYMLINK The current filehandle provided for a LOOKUP is
not a directory but a symbolic link. Also used
if the final component of the OPEN path is a
symbolic link.
NFS4ERR_TOOSMALL The encoded response to a READDIR request
exceeds the size limit set by the initial
request.
NFS4ERR_WRONGSEC The security mechanism being used by the client
for the operation does not match the server's
security policy. The client should change the
security mechanism being used and retry the
operation.
NFS4ERR_XDEV Attempt to do an operation between different
fsids.
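The termination rule stated at the start of this section, under which
processing of a compound request stops at the first failed operation,
can be sketched as follows. This is a simplified model for
illustration: run_compound is a hypothetical name, operations are
represented as callables returning a (status, result) pair, and
statuses are plain strings rather than encoded nfsstat4 values.

```python
from typing import Any, Callable, List, Tuple

Op = Tuple[str, Callable[[], Tuple[str, Any]]]


def run_compound(ops: List[Op]) -> Tuple[str, List[Tuple[str, str, Any]]]:
    """Evaluate operations in order, stopping after the first failure.

    Returns the overall status and the per-operation results encoded so
    far, including the entry for the failed operation.
    """
    results: List[Tuple[str, str, Any]] = []
    status = "NFS4_OK"
    for name, op in ops:
        status, result = op()
        results.append((name, status, result))
        if status != "NFS4_OK":
            break                      # later operations are not evaluated
    return status, results
```

With three operations where the second one fails, the reply holds two
results and carries the status of the failed operation.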