2.7. Minor Versioning
To address the requirement of an NFS protocol that can evolve as the need arises, the NFSv4.1 protocol contains the rules and framework to allow for future minor changes or versioning. The base assumption with respect to minor versioning is that any future accepted minor version will be documented in one or more Standards Track RFCs. Minor version 0 of the NFSv4 protocol is represented by [30], and minor version 1 is represented by this RFC. The COMPOUND and CB_COMPOUND procedures support the encoding of the minor version being requested by the client.
The following items represent the basic rules for the development of minor versions. Note that a future minor version may modify or add to the following rules as part of the minor version definition.
1. Procedures are not added or deleted.
To maintain the general RPC model, NFSv4 minor versions will not
add to or delete procedures from the NFS program.
2. Minor versions may add operations to the COMPOUND and
CB_COMPOUND procedures.
The addition of operations to the COMPOUND and CB_COMPOUND
procedures does not affect the RPC model.
* Minor versions may append attributes to the bitmap4 that
represents sets of attributes and to the fattr4 that
represents sets of attribute values.
This allows for the expansion of the attribute model to allow
for future growth or adaptation.
* Minor version X must append any new attributes after the last
documented attribute.
Since attribute results are specified as an opaque array of
per-attribute, XDR-encoded results, the complexity of adding
new attributes in the midst of the current definitions would
be too burdensome.
3. Minor versions must not modify the structure of an existing
operation's arguments or results.
Again, the complexity of handling multiple structure definitions
for a single operation is too burdensome. New operations should
be added instead of modifying existing structures for a minor
version.
This rule does not preclude the following adaptations in a minor
version:
* adding bits to flag fields, such as new attributes to
GETATTR's bitmap4 data type, and providing corresponding
variants of opaque arrays, such as a notify4 used together
with such bitmaps
* adding bits to existing attributes like ACLs that have flag
words
* extending enumerated types (including NFS4ERR_*) with new
values
* adding cases to a switched union
4. Minor versions must not modify the structure of existing
attributes.
5. Minor versions must not delete operations.
This prevents the potential reuse of a particular operation
"slot" in a future minor version.
6. Minor versions must not delete attributes.
7. Minor versions must not delete flag bits or enumeration values.
8. Minor versions may declare an operation MUST NOT be implemented.
Specifying that an operation MUST NOT be implemented is
equivalent to obsoleting an operation. For the client, it means
that the operation MUST NOT be sent to the server. For the
server, an NFS error can be returned as opposed to "dropping"
the request as an XDR decode error. This approach allows for
the obsolescence of an operation while maintaining its structure
so that a future minor version can reintroduce the operation.
* Minor versions may declare that an attribute MUST NOT be
implemented.
* Minor versions may declare that a flag bit or enumeration
value MUST NOT be implemented.
9. Minor versions may downgrade features from REQUIRED to
RECOMMENDED, or RECOMMENDED to OPTIONAL.
10. Minor versions may upgrade features from OPTIONAL to
RECOMMENDED, or RECOMMENDED to REQUIRED.
11. A client and server that support minor version X SHOULD support
minor versions zero through X-1 as well.
12. Except for infrastructural changes, a minor version must not
introduce REQUIRED new features.
This rule allows for the introduction of new functionality and
forces the use of implementation experience before designating a
feature as REQUIRED. On the other hand, some classes of
features are infrastructural and have broad effects. Allowing
infrastructural features to be RECOMMENDED or OPTIONAL
complicates implementation of the minor version.
13. A client MUST NOT attempt to use a stateid, filehandle, or
similar returned object from the COMPOUND procedure with minor
version X for another COMPOUND procedure with minor version Y,
where X != Y.
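The append-only rule for attributes (rule 2 above) can be illustrated with a minimal sketch. It assumes the usual bitmap4 layout in which attribute n occupies bit (n mod 32) of word (n / 32), with bit 0 taken as the least significant bit. The helper names and the attribute numbers are hypothetical and are not defined by this specification.

```python
def attrs_to_bitmap4(attr_numbers):
    """Encode attribute numbers as a list of 32-bit words.

    Attribute n is placed in word n // 32 at bit n % 32 (bit 0 is
    assumed to be the least significant bit of each word).
    """
    words = []
    for n in attr_numbers:
        word, bit = divmod(n, 32)
        # Grow the array only as far as the highest attribute requires.
        while len(words) <= word:
            words.append(0)
        words[word] |= 1 << bit
    return words


def bitmap4_to_attrs(words):
    """Decode a bitmap4-style word list back into attribute numbers."""
    return [w * 32 + b
            for w, value in enumerate(words)
            for b in range(32)
            if value & (1 << b)]


# Hypothetical attribute numbers, chosen only for illustration.
OLD_ATTRS = [0, 1, 4]
NEW_ATTR = 70  # appended after every attribute the older version knows

old_bitmap = attrs_to_bitmap4(OLD_ATTRS)
new_bitmap = attrs_to_bitmap4(OLD_ATTRS + [NEW_ATTR])

# The words an older implementation understands are unchanged.
assert new_bitmap[:len(old_bitmap)] == old_bitmap
```

Because the new attribute only sets a bit beyond the previously documented range, the bit positions of existing attributes never shift, which is the property the rule is meant to preserve.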
2.8. Non-RPC-Based Security Services
As described in Section 2.2.1.1.1.1, NFSv4.1 relies on RPC for
identification, authentication, integrity, and privacy. NFSv4.1
itself provides or enables additional security services as described
in the next several subsections.
2.8.1. Authorization
Authorization to access a file object via an NFSv4.1 operation is
ultimately determined by the NFSv4.1 server. A client can
predetermine its access to a file object via the OPEN (Section 18.16)
and the ACCESS (Section 18.1) operations.
Principals with appropriate access rights can modify the
authorization on a file object via the SETATTR (Section 18.30)
operation. Attributes that affect access rights include mode, owner,
owner_group, acl, dacl, and sacl. See Section 5.
2.8.2. Auditing
NFSv4.1 provides auditing on a per-file object basis, via the acl and
sacl attributes as described in Section 6. It is outside the scope
of this specification to specify audit log formats or management
policies.
2.8.3. Intrusion Detection
NFSv4.1 provides alarm control on a per-file object basis, via the
acl and sacl attributes as described in Section 6. Alarms may serve
as the basis for intrusion detection. It is outside the scope of
this specification to specify heuristics for detecting intrusion via
alarms.
2.9. Transport Layers
2.9.1. REQUIRED and RECOMMENDED Properties of Transports
NFSv4.1 works over Remote Direct Memory Access (RDMA) and non-RDMA-
based transports with the following attributes:
o The transport supports reliable delivery of data, which NFSv4.1
requires but neither NFSv4.1 nor RPC has facilities for ensuring
[34].
o The transport delivers data in the order it was sent. Ordered
delivery simplifies detection of transmit errors, and simplifies
the sending of arbitrarily sized requests and responses via the
record marking protocol [3].
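To show why ordered delivery matters for record marking, the following is a minimal sketch of ONC RPC record marking over a byte stream. It assumes the commonly described framing: each fragment is preceded by a 4-byte big-endian header whose high bit marks the last fragment and whose low 31 bits give the fragment length. The function names are hypothetical.

```python
import io
import struct

LAST_FRAGMENT = 0x80000000   # high bit of the header: final fragment
LENGTH_MASK = 0x7FFFFFFF     # low 31 bits of the header: fragment length


def frame_record(payload, max_fragment=LENGTH_MASK):
    """Split one record into record-marked fragments."""
    if not 0 < max_fragment <= LENGTH_MASK:
        raise ValueError("max_fragment must fit in 31 bits")
    out = bytearray()
    offset = 0
    while True:
        chunk = payload[offset:offset + max_fragment]
        offset += len(chunk)
        last = offset >= len(payload)
        header = len(chunk) | (LAST_FRAGMENT if last else 0)
        out += struct.pack(">I", header) + chunk
        if last:
            return bytes(out)


def read_record(stream):
    """Reassemble one record from an ordered, reliable byte stream."""
    record = bytearray()
    while True:
        header = stream.read(4)
        if len(header) != 4:
            raise EOFError("stream ended inside a fragment header")
        (word,) = struct.unpack(">I", header)
        length = word & LENGTH_MASK
        body = stream.read(length)
        if len(body) != length:
            raise EOFError("stream ended inside a fragment body")
        record += body
        if word & LAST_FRAGMENT:
            return bytes(record)


framed = frame_record(b"hello world", max_fragment=4)
assert read_record(io.BytesIO(framed)) == b"hello world"
```

The reader relies entirely on bytes arriving in the order they were sent; a reordered or missing fragment would corrupt the record, and nothing in this framing would detect it.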
Where an NFSv4.1 implementation supports operation over the IP
network protocol, any transport used between NFS and IP MUST be among
the IETF-approved congestion control transport protocols. At the
time this document was written, the only two transports that had the
above attributes were TCP and the Stream Control Transmission
Protocol (SCTP). To enhance the possibilities for interoperability,
an NFSv4.1 implementation MUST support operation over the TCP
transport protocol.
Even if NFSv4.1 is used over a non-IP network protocol, it is
RECOMMENDED that the transport support congestion control.
It is permissible for a connectionless transport to be used under
NFSv4.1; however, reliable and in-order delivery of data combined
with congestion control by the connectionless transport is REQUIRED.
As a consequence, UDP by itself MUST NOT be used as an NFSv4.1
transport. NFSv4.1 assumes that a client transport address and
server transport address used to send data over a transport together
constitute a connection, even if the underlying transport eschews the
concept of a connection.
2.9.2. Client and Server Transport Behavior
If a connection-oriented transport (e.g., TCP) is used, the client
and server SHOULD use long-lived connections for at least three
reasons:
1. This will prevent the weakening of the transport's congestion
control mechanisms via short-lived connections.
2. This will improve performance for the WAN environment by
eliminating the need for connection setup handshakes.
3. The NFSv4.1 callback model differs from NFSv4.0, and requires the
client and server to maintain a client-created backchannel (see
Section 2.10.3.1) for the server to use.
In order to reduce congestion, if a connection-oriented transport is
used, and the request is not the NULL procedure:
o A requester MUST NOT retry a request unless the connection the
request was sent over was lost before the reply was received.
o A replier MUST NOT silently drop a request, even if the request is
a retry. (The silent drop behavior of RPCSEC_GSS [4] does not
apply because this behavior happens at the RPCSEC_GSS layer, a
lower layer in the request processing.) Instead, the replier
SHOULD return an appropriate error (see Section 2.10.6.1), or it
MAY disconnect the connection.
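The two rules above can be sketched as simple predicates. This is an illustration only; the type, field, and function names are hypothetical, and a False result from may_retry means the retry is prohibited by these rules, while True only means these rules do not forbid it.

```python
from dataclasses import dataclass


@dataclass
class PendingRequest:
    procedure: str          # "NULL" or "COMPOUND" (hypothetical labels)
    connection_lost: bool   # connection dropped before the reply arrived
    reply_received: bool


def may_retry(request, connection_oriented=True):
    """Return False when the requester rule above forbids a retry."""
    if request.reply_received:
        return False  # nothing left to retry
    if not connection_oriented or request.procedure == "NULL":
        return True   # the rule applies only to non-NULL requests
    # Retry only if the original connection was lost before the reply.
    return request.connection_lost


# Actions a replier may take for a request it cannot process normally.
# Silently dropping the request is deliberately absent from this set.
REPLIER_ACTIONS = {"reply", "error", "disconnect"}


def replier_action_allowed(action):
    """Return True if the replier rule above permits the action."""
    return action in REPLIER_ACTIONS
```

For example, a COMPOUND whose connection is still up yields may_retry(...) == False, so the requester keeps waiting rather than retransmitting.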
When sending a reply, the replier MUST send the reply to the same
full network address (e.g., if using an IP-based transport, the
source port of the requester is part of the full network address)
from which the requester sent the request. If using a connection-
oriented transport, replies MUST be sent on the same connection from
which the request was received.
If a connection is dropped after the replier receives the request but
before the replier sends the reply, the replier might have a pending
reply. If a connection is established with the same source and
destination full network address as the dropped connection, then the
replier MUST NOT send the reply until the requester retries the
request. The reason for this prohibition is that the requester MAY
retry a request over a different connection (provided that connection
is associated with the original request's session).
When using RDMA transports, there are other reasons for not
tolerating retries over the same connection:
o RDMA transports use "credits" to enforce flow control, where a
credit is a right to a peer to transmit a message. If one peer
were to retransmit a request (or reply), it would consume an
additional credit. If the replier retransmitted a reply, it would
certainly result in an RDMA connection loss, since the requester
would typically only post a single receive buffer for each
request. If the requester retransmitted a request, the additional
credit consumed on the server might lead to RDMA connection
failure unless the client accounted for it and decreased its
available credit, leading to wasted resources.
o RDMA credits present a new issue to the reply cache in NFSv4.1.
The reply cache may be used when a connection within a session is
lost, such as after the client reconnects. Credit information is
a dynamic property of the RDMA connection, and stale values must
not be replayed from the cache. This implies that the reply cache
contents must not be blindly used when replies are sent from it,
and credit information appropriate to the channel must be
refreshed by the RPC layer.
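One way to read the second point is that the reply cache should hold only the NFS-level reply, with the credit value filled in at send time from the connection actually used. The sketch below assumes that reading; the class and function names are hypothetical, and real RPC-over-RDMA headers carry more than a single credit field.

```python
from dataclasses import dataclass


@dataclass
class RdmaConnection:
    credits: int  # credit value currently valid on this connection


@dataclass(frozen=True)
class CachedReply:
    body: bytes  # NFS-level reply only; deliberately holds no credit value


def send_cached_reply(cached, connection):
    """Build the message to transmit for a reply replayed from the cache.

    The credit value is taken from the connection in use now, never from
    the connection on which the reply was first generated.
    """
    return {"credits": connection.credits, "body": cached.body}


old_connection = RdmaConnection(credits=8)   # lost before the reply arrived
cached = CachedReply(body=b"reply-bytes")
new_connection = RdmaConnection(credits=3)   # established after reconnect

message = send_cached_reply(cached, new_connection)
assert message["credits"] == 3
```

Keeping credits out of the cached object makes it impossible to replay a stale value by accident.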
In addition, as described in Section 2.10.6.2, while a session is active, the NFSv4.1 requester MUST NOT stop waiting for a reply.
2.9.3. Ports
Historically, NFSv3 servers have listened over TCP port 2049. The registered port 2049 [35] for the NFS protocol should be the default configuration. NFSv4.1 clients SHOULD NOT use the RPC binding protocols as described in [36].
2.10. Session
NFSv4.1 clients and servers MUST support and MUST use the session feature as described in this section.
2.10.1. Motivation and Overview
Previous versions and minor versions of NFS have suffered from the following:
o Lack of support for Exactly Once Semantics (EOS). This includes
lack of support for EOS through server failure and recovery.
o Limited callback support, including no support for sending
callbacks through firewalls, and races between replies to normal
requests and callbacks.
o Limited trunking over multiple network paths.
o Requiring machine credentials for fully secure operation.
Through the introduction of a session, NFSv4.1 addresses the above shortfalls with practical solutions:
o EOS is enabled by a reply cache with a bounded size, making it
feasible to keep the cache in persistent storage and enable EOS
through server failure and recovery. One reason that previous
revisions of NFS did not support EOS was because some EOS
approaches often limited parallelism. As will be explained in
Section 2.10.6, NFSv4.1 supports both EOS and unlimited
parallelism.
o The NFSv4.1 client (defined in Section 1.6, Paragraph 2) creates
transport connections and provides them to the server to use for
sending callback requests, thus solving the firewall issue
(Section 18.34). Races between responses from client requests and
callbacks caused by the requests are detected via the session's
sequencing properties that are a consequence of EOS
(Section 2.10.6.3).
o The NFSv4.1 client can associate an arbitrary number of
connections with the session, and thus provide trunking
(Section 2.10.5).
o The NFSv4.1 client and server produce a session key, independent
of client and server machine credentials, that can be used to
compute a digest for protecting critical session management
operations (Section 2.10.8.3).
o The NFSv4.1 client can also create secure RPCSEC_GSS contexts for
use by the session's backchannel that do not require the server to
authenticate to a client machine principal (Section 2.10.8.2).
A session is a dynamically created, long-lived server object created
by a client and used over time from one or more transport
connections. Its function is to maintain the server's state relative
to the connection(s) belonging to a client instance. This state is
entirely independent of the connection itself, and indeed the state
exists whether or not the connection exists. A client may have one
or more sessions associated with it so that client-associated state
may be accessed using any of the sessions associated with that
client's client ID, when connections are associated with those
sessions. When no connections are associated with any of a client
ID's sessions for an extended time, such objects as locks, opens,
delegations, layouts, etc. are subject to expiration. The session
serves as an object representing a means of access by a client to the
associated client state on the server, independent of the physical
means of access to that state.
A single client may create multiple sessions. A single session MUST
NOT serve multiple clients.
2.10.2. NFSv4 Integration
Sessions are part of NFSv4.1 and not NFSv4.0. Normally, a major
infrastructure change such as sessions would require a new major
version number to an Open Network Computing (ONC) RPC program like
NFS. However, because NFSv4 encapsulates its functionality in a
single procedure, COMPOUND, and because COMPOUND can support an
arbitrary number of operations, sessions have been added to NFSv4.1
with little difficulty. COMPOUND includes a minor version number
field, and for NFSv4.1 this minor version is set to 1. When the
NFSv4 server processes a COMPOUND with the minor version set to 1, it
expects a different set of operations than it does for NFSv4.0.
NFSv4.1 defines the SEQUENCE operation, which is required for every COMPOUND that operates over an established session, with the exception of some session administration operations, such as DESTROY_SESSION (Section 18.37).
2.10.2.1. SEQUENCE and CB_SEQUENCE
In NFSv4.1, when the SEQUENCE operation is present, it MUST be the first operation in the COMPOUND procedure. The primary purpose of SEQUENCE is to carry the session identifier. The session identifier associates all other operations in the COMPOUND procedure with a particular session. SEQUENCE also contains required information for maintaining EOS (see Section 2.10.6). Session-enabled NFSv4.1 COMPOUND requests thus have the form:
   +-----+--------------+-----------+------------+-----------+----
   | tag | minorversion | numops    |SEQUENCE op | op + args | ...
   |     |   (== 1)     | (limited) |  + args    |           |
   +-----+--------------+-----------+------------+-----------+----
and the replies have the form:
   +------------+-----+--------+-------------------------------+--//
   |last status | tag | numres |status + SEQUENCE op + results |  //
   +------------+-----+--------+-------------------------------+--//
           //-----------------------+----
           // status + op + results | ...
           //-----------------------+----
A CB_COMPOUND procedure request and reply has a similar form to COMPOUND, but instead of a SEQUENCE operation, there is a CB_SEQUENCE operation. CB_COMPOUND also has an additional field called "callback_ident", which is superfluous in NFSv4.1 and MUST be ignored by the client. CB_SEQUENCE has the same information as SEQUENCE, and also includes other information needed to resolve callback races (Section 2.10.6.3).
2.10.2.2. Client ID and Session Association
Each client ID (Section 2.4) can have zero or more active sessions. A client ID and associated session are required to perform file access in NFSv4.1. Each time a session is used (whether by a client sending a request to the server or the client replying to a callback request from the server), the state leased to its associated client ID is automatically renewed.
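The relationship in the preceding paragraph, where any use of any session renews the lease held by its client ID, can be sketched as follows. The class and its fields are hypothetical bookkeeping for illustration, not structures defined by this specification, and times are plain numbers of seconds.

```python
class ClientRecord:
    """Server-side record for one client ID (hypothetical structure)."""

    def __init__(self, client_id, lease_time):
        self.client_id = client_id
        self.lease_time = lease_time   # lease duration in seconds
        self.lease_expiry = None       # set on first session use
        self.sessions = set()          # zero or more session IDs

    def add_session(self, session_id):
        """Associate another session with this client ID."""
        self.sessions.add(session_id)

    def use_session(self, session_id, now):
        """Record use of a session, in either direction, and renew the lease."""
        if session_id not in self.sessions:
            raise KeyError("session is not associated with this client ID")
        self.lease_expiry = now + self.lease_time
        return self.lease_expiry


client = ClientRecord(client_id=1, lease_time=90)
client.add_session("A")
client.add_session("B")
client.use_session("A", now=0)    # lease now runs until t = 90
client.use_session("B", now=60)   # a different session renews the same lease
assert client.lease_expiry == 150
```

The lease is a property of the client ID, so it does not matter which of the client's sessions carried the request or callback reply.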
State (which can consist of share reservations, locks, delegations, and layouts (Section 1.7.4)) is tied to the client ID. Client state is not tied to any individual session. Successive state changing operations from a given state owner MAY go over different sessions, provided the session is associated with the same client ID. A callback MAY arrive over a different session than that of the request that originally acquired the state pertaining to the callback. For example, if session A is used to acquire a delegation, a request to recall the delegation MAY arrive over session B if both sessions are associated with the same client ID. Sections 2.10.8.1 and 2.10.8.2 discuss the security considerations around callbacks.
2.10.3. Channels
A channel is not a connection. A channel represents the direction ONC RPC requests are sent.
Each session has one or two channels: the fore channel and the backchannel. Because there are at most two channels per session, and because each channel has a distinct purpose, channels are not assigned identifiers.
The fore channel is used for ordinary requests from the client to the server, and carries COMPOUND requests and responses. A session always has a fore channel.
The backchannel is used for callback requests from server to client, and carries CB_COMPOUND requests and responses. Whether or not there is a backchannel is a decision made by the client; however, many features of NFSv4.1 require a backchannel. NFSv4.1 servers MUST support backchannels.
Each session has resources for each channel, including separate reply caches (see Section 2.10.6.1). Note that even the backchannel requires a reply cache (or, at least, a slot table in order to detect retries) because some callback operations are nonidempotent.
2.10.3.1. Association of Connections, Channels, and Sessions
Each channel is associated with zero or more transport connections (whether of the same transport protocol or different transport protocols). A connection can be associated with one channel or both channels of a session; the client and server negotiate whether a connection will carry traffic for one channel or both channels via the CREATE_SESSION (Section 18.36) and the BIND_CONN_TO_SESSION (Section 18.34) operations. When a session is created via CREATE_SESSION, the connection that transported the CREATE_SESSION request is automatically associated with the fore channel, and optionally the backchannel. If the client specifies no state protection (Section 18.35) when the session is created, then when SEQUENCE is transmitted on a different connection, the connection is automatically associated with the fore channel of the session specified in the SEQUENCE operation.
A connection's association with a session is not exclusive. A connection associated with the channel(s) of one session may be simultaneously associated with the channel(s) of other sessions including sessions associated with other client IDs.
It is permissible for connections of multiple transport types to be associated with the same channel. For example, both TCP and RDMA connections can be associated with the fore channel. In the event an RDMA and non-RDMA connection are associated with the same channel, the maximum number of slots SHOULD be at least one more than the total number of RDMA credits (Section 2.10.6.1). This way, if all RDMA credits are used, the non-RDMA connection can have at least one outstanding request. If a server supports multiple transport types, it MUST allow a client to associate connections from each transport to a channel.
It is permissible for a connection of one type of transport to be associated with the fore channel, and a connection of a different type to be associated with the backchannel.
2.10.4. Server Scope
Servers each specify a server scope value in the form of an opaque string eir_server_scope returned as part of the results of an EXCHANGE_ID operation. The purpose of the server scope is to allow a group of servers to indicate to clients that a set of servers sharing the same server scope value has arranged to use compatible values of otherwise opaque identifiers. Thus, the identifiers generated by one server of that set may be presented to another of that same scope.
The use of such compatible values does not imply that a value generated by one server will always be accepted by another. In most cases, it will not. However, a server will not accept a value generated by another inadvertently. When it does accept it, it will be because it is recognized as valid and carrying the same meaning as on another server of the same scope.
When servers are of the same server scope, this compatibility of values applies to the following identifiers:
o Filehandle values. A filehandle value accepted by two servers of
the same server scope denotes the same object. A WRITE operation
sent to one server is reflected immediately in a READ sent to the
other, and locks obtained on one server conflict with those
requested on the other.
o Session ID values. A session ID value accepted by two servers of
the same server scope denotes the same session.
o Client ID values. A client ID value accepted as valid by two
servers of the same server scope is associated with two clients
with the same client owner and verifier.
o State ID values. A state ID value is recognized as valid when the
corresponding client ID is recognized as valid. If the same
stateid value is accepted as valid on two servers of the same
scope and the client IDs on the two servers represent the same
client owner and verifier, then the two stateid values designate
the same set of locks and are for the same file.
o Server owner values. When the server scope values are the same,
server owner values may be validly compared. In cases where the
server scope values are different, server owner values are treated
as different even if they contain all identical bytes.
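The last item in the list above implies that a server owner comparison is only meaningful once the server scopes are known to match. A minimal sketch of that comparison follows; the ServerIdentity type and function name are hypothetical, with fields modeled on the eir_server_scope and eir_server_owner results of EXCHANGE_ID.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerIdentity:
    scope: bytes        # eir_server_scope
    so_major_id: bytes  # eir_server_owner.so_major_id
    so_minor_id: int    # eir_server_owner.so_minor_id


def server_owners_equal(a, b):
    """Compare server owners, honoring the server scope rule.

    Owners from different scopes are treated as different even when
    their bytes are identical.
    """
    if a.scope != b.scope:
        return False
    return (a.so_major_id, a.so_minor_id) == (b.so_major_id, b.so_minor_id)


first = ServerIdentity(b"scope-1", b"owner", 7)
same_scope = ServerIdentity(b"scope-1", b"owner", 7)
other_scope = ServerIdentity(b"scope-2", b"owner", 7)

assert server_owners_equal(first, same_scope)
assert not server_owners_equal(first, other_scope)
```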
The coordination among servers required to provide such compatibility
can be quite minimal, and limited to a simple partition of the ID
space. The recognition of common values requires additional
implementation, but this can be tailored to the specific situations
in which that recognition is desired.
Clients will have occasion to compare the server scope values of
multiple servers under a number of circumstances, each of which will
be discussed under the appropriate functional section:
o When server owner values received in response to EXCHANGE_ID
operations sent to multiple network addresses are compared for the
purpose of determining the validity of various forms of trunking,
as described in Section 2.10.5.
o When network or server reconfiguration causes the same network
address to possibly be directed to different servers, with the
necessity for the client to determine when lock reclaim should be
attempted, as described in Section 8.4.2.1.
o When file system migration causes the transfer of responsibility
for a file system between servers and the client needs to
determine whether state has been transferred with the file system
(as described in Section 11.7.7) or whether the client needs to
reclaim state on a similar basis as in the case of server restart,
as described in Section 8.4.2.
When two replies from EXCHANGE_ID, each from two different server
network addresses, have the same server scope, there are a number of
ways a client can validate that the common server scope is due to two
servers cooperating in a group.
o If both EXCHANGE_ID requests were sent with RPCSEC_GSS
authentication and the server principal is the same for both
targets, the equality of server scope is validated. It is
RECOMMENDED that two servers intending to share the same server
scope also share the same principal name.
o The client may accept the appearance of the second server in the
fs_locations or fs_locations_info attribute for a relevant file
system. For example, if there is a migration event for a
particular file system or there are locks to be reclaimed on a
particular file system, the attributes for that particular file
system may be used. The client sends the GETATTR request to the
first server for the fs_locations or fs_locations_info attribute
with RPCSEC_GSS authentication. It may need to do this in advance
of the need to verify the common server scope. If the client
successfully authenticates the reply to GETATTR, and the GETATTR
request and reply containing the fs_locations or fs_locations_info
attribute refers to the second server, then the equality of server
scope is supported. A client may choose to limit the use of this
form of support to information relevant to the specific file
system involved (e.g., a file system being migrated).
2.10.5. Trunking
Trunking is the use of multiple connections between a client and
server in order to increase the speed of data transfer. NFSv4.1
supports two types of trunking: session trunking and client ID
trunking.
NFSv4.1 servers MUST support both forms of trunking within the
context of a single server network address and MUST support both
forms within the context of the set of network addresses used to
access a single server. NFSv4.1 servers in a clustered configuration
MAY allow network addresses for different servers to use client ID
trunking.
Clients may use either form of trunking as long as they do not, when
trunking between different server network addresses, violate the
servers' mandates as to the kinds of trunking to be allowed (see
below). With regard to callback channels, the client MUST allow the server to choose among all callback channels valid for a given client ID and MUST support trunking when the connections supporting the backchannel allow session or client ID trunking to be used for callbacks.
Session trunking is essentially the association of multiple connections, each with potentially different target and/or source network addresses, to the same session. When the target network addresses (server addresses) of the two connections are the same, the server MUST support such session trunking. When the target network addresses are different, the server MAY indicate such support using the data returned by the EXCHANGE_ID operation (see below).
Client ID trunking is the association of multiple sessions to the same client ID. Servers MUST support client ID trunking for two target network addresses whenever they allow session trunking for those same two network addresses. In addition, a server MAY, by presenting the same major server owner ID (Section 2.5) and server scope (Section 2.10.4), allow an additional case of client ID trunking. When two servers return the same major server owner and server scope, it means that the two servers are cooperating on locking state management, which is a prerequisite for client ID trunking.
Distinguishing when the client is allowed to use session and client ID trunking requires understanding how the results of the EXCHANGE_ID (Section 18.35) operation identify a server. Suppose a client sends EXCHANGE_IDs over two different connections, each with a possibly different target network address, but each EXCHANGE_ID operation has the same value in the eia_clientowner field. If the same NFSv4.1 server is listening over each connection, then each EXCHANGE_ID result MUST return the same values of eir_clientid, eir_server_owner.so_major_id, and eir_server_scope. The client can then treat each connection as referring to the same server (subject to verification; see Section 2.10.5.1 later in this section), and it can use each connection to trunk requests and replies. The client's choice is whether session trunking or client ID trunking applies.
Session Trunking. If the eia_clientowner argument is the same in two
different EXCHANGE_ID requests, and the eir_clientid,
eir_server_owner.so_major_id, eir_server_owner.so_minor_id, and
eir_server_scope results match in both EXCHANGE_ID results, then
the client is permitted to perform session trunking. If the
client has no session mapping to the tuple of eir_clientid,
eir_server_owner.so_major_id, eir_server_scope, and
eir_server_owner.so_minor_id, then it creates the session via a
CREATE_SESSION operation over one of the connections, which
associates the connection to the session. If there is a session
for the tuple, the client can send BIND_CONN_TO_SESSION to
associate the connection to the session.
Of course, if the client does not desire to use session trunking,
it is not required to do so. It can invoke CREATE_SESSION on the
connection. This will result in client ID trunking as described
below. It can also decide to drop the connection if it does not
choose to use trunking.
Client ID Trunking. If the eia_clientowner argument is the same in
two different EXCHANGE_ID requests, and the eir_clientid,
eir_server_owner.so_major_id, and eir_server_scope results match
in both EXCHANGE_ID results, then the client is permitted to
perform client ID trunking (regardless of whether the
eir_server_owner.so_minor_id results match). The client can
associate each connection with different sessions, where each
session is associated with the same server.
The client completes the act of client ID trunking by invoking
CREATE_SESSION on each connection, using the same client ID that
was returned in eir_clientid. These invocations create two
sessions and also associate each connection with its respective
session. The client is free to decline to use client ID trunking
by simply dropping the connection at this point.
When doing client ID trunking, locking state is shared across
sessions associated with that same client ID. This requires the
server to coordinate state across sessions.
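The matching conditions for the two forms of trunking can be summarized in a short sketch. It covers only the comparison of EXCHANGE_ID arguments and results; the verification described in Section 2.10.5.1 is a separate step and is not modeled. The ExchangeIdResult type and the function name are hypothetical, and the eia_clientowner argument is stored alongside the results purely for convenience.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExchangeIdResult:
    eia_clientowner: bytes   # argument sent by the client
    eir_clientid: int
    so_major_id: bytes       # eir_server_owner.so_major_id
    so_minor_id: int         # eir_server_owner.so_minor_id
    eir_server_scope: bytes


def permitted_trunking(a, b):
    """Return the set of trunking forms the two results permit."""
    if a.eia_clientowner != b.eia_clientowner:
        return set()
    common_match = (
        a.eir_clientid == b.eir_clientid
        and a.so_major_id == b.so_major_id
        and a.eir_server_scope == b.eir_server_scope
    )
    if not common_match:
        return set()
    # Client ID trunking does not depend on so_minor_id.
    permitted = {"client_id"}
    # Session trunking additionally requires matching so_minor_id.
    if a.so_minor_id == b.so_minor_id:
        permitted.add("session")
    return permitted


r1 = ExchangeIdResult(b"owner", 42, b"major", 1, b"scope")
r2 = ExchangeIdResult(b"owner", 42, b"major", 1, b"scope")
r3 = ExchangeIdResult(b"owner", 42, b"major", 2, b"scope")

assert permitted_trunking(r1, r2) == {"client_id", "session"}
assert permitted_trunking(r1, r3) == {"client_id"}
```

A client that receives an empty set from this comparison has no basis for trunking over the two connections.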
The client should be prepared for the possibility that
eir_server_owner values may be different on subsequent EXCHANGE_ID
requests made to the same network address, as a result of various
sorts of reconfiguration events. When this happens and the changes
result in the invalidation of previously valid forms of trunking, the
client should cease to use those forms, either by dropping
connections or by adding sessions. For a discussion of lock reclaim
as it relates to such reconfiguration events, see Section 8.4.2.1.
2.10.5.1. Verifying Claims of Matching Server Identity
When two servers over two connections claim matching or partially
matching eir_server_owner, eir_server_scope, and eir_clientid values,
the client does not have to trust the servers' claims. The client
may verify these claims before trunking traffic in the following
ways:
o For session trunking, clients SHOULD reliably verify if
connections between different network paths are in fact associated
with the same NFSv4.1 server and usable on the same session, and
servers MUST allow clients to perform reliable verification. When
a client ID is created, the client SHOULD specify that
BIND_CONN_TO_SESSION is to be verified according to the SP4_SSV or
SP4_MACH_CRED (Section 18.35) state protection options. For
SP4_SSV, reliable verification depends on a shared secret (the
SSV) that is established via the SET_SSV (Section 18.47)
operation.
When a new connection is associated with the session (via the
BIND_CONN_TO_SESSION operation, see Section 18.34), if the client
specified SP4_SSV state protection for the BIND_CONN_TO_SESSION
operation, the client MUST send the BIND_CONN_TO_SESSION with
RPCSEC_GSS protection, using integrity or privacy, and an
RPCSEC_GSS handle created with the GSS SSV mechanism
(Section 2.10.9).
If the client mistakenly tries to associate a connection to a
session of a wrong server, the server will either reject the
attempt because it is not aware of the session identifier of the
BIND_CONN_TO_SESSION arguments, or it will reject the attempt
because the RPCSEC_GSS authentication fails. Even if the server
mistakenly or maliciously accepts the connection association
attempt, the RPCSEC_GSS verifier it computes in the response will
not be verified by the client, so the client will know it cannot
use the connection for trunking the specified session.
If the client specified SP4_MACH_CRED state protection, the
BIND_CONN_TO_SESSION operation will use RPCSEC_GSS integrity or
privacy, using the same credential that was used when the client
ID was created. Mutual authentication via RPCSEC_GSS assures the
client that the connection is associated with the correct session
of the correct server.
o For client ID trunking, the client has at least two options for
verifying that the same client ID obtained from two different
EXCHANGE_ID operations came from the same server. The first
option is to use RPCSEC_GSS authentication when sending each
EXCHANGE_ID operation. Each time an EXCHANGE_ID is sent with
RPCSEC_GSS authentication, the client notes the principal name of
the GSS target. If the EXCHANGE_ID results indicate that client
ID trunking is possible, and the GSS targets' principal names are
the same, the servers are the same and client ID trunking is
allowed.
The second option for verification is to use SP4_SSV protection.
When the client sends EXCHANGE_ID, it specifies SP4_SSV
protection. The first EXCHANGE_ID the client sends always has to
be confirmed by a CREATE_SESSION call. The client then sends
SET_SSV. Later, the client sends EXCHANGE_ID to a second
destination network address different from the one the first
EXCHANGE_ID was sent to. The client checks that each EXCHANGE_ID
reply has the same eir_clientid, eir_server_owner.so_major_id, and
eir_server_scope. If so, the client verifies the claim by sending
a CREATE_SESSION operation to the second destination address,
protected with RPCSEC_GSS integrity using an RPCSEC_GSS handle
returned by the second EXCHANGE_ID. If the server accepts the
CREATE_SESSION request, and if the client verifies the RPCSEC_GSS
verifier and integrity codes, then the client has proof the second
server knows the SSV, and thus the two servers are cooperating for
the purposes of specifying server scope and client ID trunking.
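The comparison step of the second option can be sketched as follows. This is a minimal illustration, not part of the protocol definition: the ExchangeIdReply class and the claims_client_id_trunking function are hypothetical names, and only the three fields named above are modeled. A matching result is still only a claim; the RPCSEC_GSS-protected CREATE_SESSION described above is what verifies it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExchangeIdReply:
    """Hypothetical holder for the EXCHANGE_ID results used in the check."""
    eir_clientid: int
    so_major_id: bytes       # eir_server_owner.so_major_id
    eir_server_scope: bytes


def claims_client_id_trunking(first: ExchangeIdReply,
                              second: ExchangeIdReply) -> bool:
    """Return True if both replies claim the same client ID, major ID, and scope.

    This only detects the claim; it does not verify it.
    """
    return (
        first.eir_clientid == second.eir_clientid
        and first.so_major_id == second.so_major_id
        and first.eir_server_scope == second.eir_server_scope
    )
```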
2.10.6. Exactly Once Semantics
Via the session, NFSv4.1 offers exactly once semantics (EOS) for
requests sent over a channel. EOS is supported on both the fore
channel and backchannel.
Each COMPOUND or CB_COMPOUND request that is sent with a leading
SEQUENCE or CB_SEQUENCE operation MUST be executed by the receiver
exactly once. This requirement holds regardless of whether the
request is sent with reply caching specified (see
Section 2.10.6.1.3). The requirement holds even if the requester is
sending the request over a session created between a pNFS data client
and pNFS data server. To understand the rationale for this
requirement, divide the requests into three classifications:
o Non-idempotent requests.
o Idempotent modifying requests.
o Idempotent non-modifying requests.
An example of a non-idempotent request is RENAME. Obviously, if a
replier executes the same RENAME request twice, and the first
execution succeeds, the re-execution will fail. If the replier
returns the result from the re-execution, this result is incorrect.
Therefore, EOS is required for non-idempotent requests.
An example of an idempotent modifying request is a COMPOUND request containing a WRITE operation. Repeated execution of the same WRITE has the same effect as execution of that WRITE a single time. Nevertheless, enforcing EOS for WRITEs and other idempotent modifying requests is necessary to avoid data corruption.
Suppose a client sends WRITE A to a noncompliant server that does not enforce EOS, and receives no response, perhaps due to a network partition. The client reconnects to the server and re-sends WRITE A. Now, the server has outstanding two instances of A. The server can be in a situation in which it executes and replies to the retry of A, while the first A is still waiting in the server's internal I/O system for some resource. Upon receiving the reply to the second attempt of WRITE A, the client believes its WRITE is done so it is free to send WRITE B, which overlaps the byte-range of A. When the original A is dispatched from the server's I/O system and executed (thus the second time A will have been written), then what has been written by B can be overwritten and thus corrupted.
An example of an idempotent non-modifying request is a COMPOUND containing SEQUENCE, PUTFH, READLINK, and nothing else. The re-execution of such a request will not cause data corruption or produce an incorrect result. Nonetheless, to keep the implementation simple, the replier MUST enforce EOS for all requests, whether or not idempotent and non-modifying.
Note that true and complete EOS is not possible unless the server persists the reply cache in stable storage, and unless the server is somehow implemented to never require a restart (indeed, if such a server exists, the distinction between a reply cache kept in stable storage versus one that is not is one without meaning). See Section 2.10.6.5 for a discussion of persistence in the reply cache.
Regardless, even if the server does not persist the reply cache, EOS improves robustness and correctness over previous versions of NFS because the legacy duplicate request/reply caches were based on the ONC RPC transaction identifier (XID). Section 2.10.6.1 explains the shortcomings of the XID as a basis for a reply cache and describes how NFSv4.1 sessions improve upon the XID.
2.10.6.1. Slot Identifiers and Reply Cache
The RPC layer provides a transaction ID (XID), which, while required to be unique, is not convenient for tracking requests for two reasons. First, the XID is only meaningful to the requester; it cannot be interpreted by the replier except to test for equality with previously sent requests. When consulting an RPC-based duplicate request cache, the opaqueness of the XID requires a computationally expensive lookup (often via a hash that includes XID and source address). NFSv4.1 requests use a non-opaque slot ID, which is an index into a slot table, which is far more efficient. Second, because RPC requests can be executed by the replier in any order, there is no bound on the number of requests that may be outstanding at any time. To achieve perfect EOS, using ONC RPC would require storing all replies in the reply cache. XIDs are 32 bits; storing over four billion (2^32) replies in the reply cache is not practical. In practice, previous versions of NFS have chosen to store a fixed number of replies in the cache, and to use a least recently used (LRU) approach to replacing cache entries with new entries when the cache is full. In NFSv4.1, the number of outstanding requests is bounded by the size of the slot table, and a sequence ID per slot is used to tell the replier when it is safe to delete a cached reply.
In the NFSv4.1 reply cache, when the requester sends a new request, it selects a slot ID in the range 0..N, where N is the replier's current maximum slot ID granted to the requester on the session over which the request is to be sent. The value of N starts out as equal to ca_maxrequests - 1 (Section 18.36), but can be adjusted by the response to SEQUENCE or CB_SEQUENCE as described later in this section. The slot ID must be unused by any of the requests that the requester has already active on the session. "Unused" here means the requester has no outstanding request for that slot ID.
A slot contains a sequence ID and the cached reply corresponding to the request sent with that sequence ID. The sequence ID is a 32-bit unsigned value, and is therefore in the range 0..0xFFFFFFFF (2^32 - 1). The first time a slot is used, the requester MUST specify a sequence ID of one (Section 18.36). Each time a slot is reused, the request MUST specify a sequence ID that is one greater than that of the previous request on the slot. If the previous sequence ID was 0xFFFFFFFF, then the next request for the slot MUST have the sequence ID set to zero (i.e., ((2^32 - 1) + 1) mod 2^32).
The sequence ID accompanies the slot ID in each request. It is for the critical check at the replier: it is used to efficiently determine whether a request using a certain slot ID is a retransmit or a new, never-before-seen request. It is not feasible for the requester to assert that it is retransmitting to implement this, because for any given request the requester cannot know whether the replier has seen it unless the replier actually replies. Of course, if the requester has seen the reply, the requester would not retransmit.
The replier compares each received request's sequence ID with the last one previously received for that slot ID, to see if the new request is:
o A new request, in which the sequence ID is one greater than that
previously seen in the slot (accounting for sequence wraparound).
The replier proceeds to execute the new request, and the replier
MUST increase the slot's sequence ID by one.
o A retransmitted request, in which the sequence ID is equal to that
currently recorded in the slot. If the original request has
executed to completion, the replier returns the cached reply. See
Section 2.10.6.2 for direction on how the replier deals with
retries of requests that are still in progress.
o A misordered retry, in which the sequence ID is less than
(accounting for sequence wraparound) that previously seen in the
slot. The replier MUST return NFS4ERR_SEQ_MISORDERED (as the
result from SEQUENCE or CB_SEQUENCE).
o A misordered new request, in which the sequence ID exceeds by two
or more (accounting for sequence wraparound) that previously seen in
the slot. Note that because the sequence ID MUST wrap around to
zero once it reaches 0xFFFFFFFF, a misordered new request and a
misordered retry cannot be distinguished. Thus, the replier MUST
return NFS4ERR_SEQ_MISORDERED (as the result from SEQUENCE or
CB_SEQUENCE).
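The four cases above reduce to a comparison modulo 2^32. The following is a minimal sketch of that comparison only; the function names and the string return values are illustrative assumptions, and handling of requests still in progress (Section 2.10.6.2) is not modeled.

```python
SEQ_MODULUS = 2**32  # sequence IDs are 32-bit unsigned values


def next_seqid(seqid: int) -> int:
    """Return the sequence ID that follows seqid, wrapping 0xFFFFFFFF to 0."""
    return (seqid + 1) % SEQ_MODULUS


def classify_request(slot_seqid: int, request_seqid: int) -> str:
    """Classify a request against the sequence ID recorded in its slot."""
    if request_seqid == next_seqid(slot_seqid):
        return "new"            # execute, then record request_seqid in the slot
    if request_seqid == slot_seqid:
        return "retransmit"     # answer from the reply cache
    # A misordered retry and a misordered new request cannot be told
    # apart once wraparound is taken into account, so both map here.
    return "NFS4ERR_SEQ_MISORDERED"
```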
Unlike the XID, the slot ID is always within a specific range; this
has two implications. The first implication is that for a given
session, the replier need only cache the results of a limited number
of COMPOUND requests. The second implication derives from the first,
which is that unlike XID-indexed reply caches (also known as
duplicate request caches - DRCs), the slot ID-based reply cache
cannot be overflowed. Through use of the sequence ID to identify
retransmitted requests, the replier does not need to actually cache
the request itself, reducing the storage requirements of the reply
cache further. These facilities make it practical to maintain all
the required entries for an effective reply cache.
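To illustrate why a slot-indexed cache cannot overflow, the following sketch keeps exactly one sequence ID and one reply per slot. The SlotTable and Slot classes and the execute callback are hypothetical, replies are modeled as plain strings, and execution is assumed to be synchronous, so in-progress retries are not covered.

```python
from typing import Callable, List, Optional


class Slot:
    def __init__(self) -> None:
        self.seqid = 0                      # first request on a slot uses 1
        self.reply: Optional[str] = None    # only the reply is kept, not the request


class SlotTable:
    """Fixed-size reply cache: one cached reply per slot, never more."""

    def __init__(self, max_requests: int) -> None:
        self.slots: List[Slot] = [Slot() for _ in range(max_requests)]

    def handle(self, slot_id: int, seqid: int,
               execute: Callable[[], str]) -> str:
        if not 0 <= slot_id < len(self.slots):
            return "NFS4ERR_BADSLOT"
        slot = self.slots[slot_id]
        if seqid == slot.seqid and slot.reply is not None:
            return slot.reply               # retry: replay without re-executing
        if seqid == (slot.seqid + 1) % 2**32:
            slot.reply = execute()          # new request overwrites the old reply
            slot.seqid = seqid
            return slot.reply
        return "NFS4ERR_SEQ_MISORDERED"
```

However many requests are sent, the table never holds more than max_requests replies, and a retried non-idempotent request such as RENAME is answered from the slot rather than executed a second time.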
The slot ID, sequence ID, and session ID therefore take over the
traditional role of the XID and source network address in the
replier's reply cache implementation. This approach is considerably
more portable and completely robust -- it is not subject to the
reassignment of ports as clients reconnect over IP networks. In
addition, the RPC XID is not used in the reply cache, enhancing
robustness of the cache in the face of any rapid reuse of XIDs by the
requester. While the replier does not care about the XID for the
purposes of reply cache management (but the replier MUST return the
same XID that was in the request), nonetheless there are
considerations for the XID in NFSv4.1 that are the same as all other
previous versions of NFS. The RPC XID remains in each message and
needs to be formulated in NFSv4.1 requests as in any other ONC RPC
request. The reasons include:
o The RPC layer retains its existing semantics and implementation.
o The requester and replier must be able to interoperate at the RPC
layer, prior to the NFSv4.1 decoding of the SEQUENCE or
CB_SEQUENCE operation.
o If an operation is being used that does not start with SEQUENCE or
CB_SEQUENCE (e.g., BIND_CONN_TO_SESSION), then the RPC XID is
needed for correct operation to match the reply to the request.
o The SEQUENCE or CB_SEQUENCE operation may generate an error. If
so, the embedded slot ID, sequence ID, and session ID (if present)
in the request will not be in the reply, and the requester has
only the XID to match the reply to the request.
Given that well-formulated XIDs continue to be required, this raises
the question: why do SEQUENCE and CB_SEQUENCE replies have a session
ID, slot ID, and sequence ID? Having the session ID in the reply
means that the requester does not have to use the XID to look up the
session ID, which would be necessary if the connection were
associated with multiple sessions. Having the slot ID and sequence
ID in the reply means that the requester does not have to use the XID
to look up the slot ID and sequence ID. Furthermore, since the XID
is only 32 bits, it is too small to guarantee the re-association of a
reply with its request [37]; having session ID, slot ID, and sequence
ID in the reply allows the client to validate that the reply in fact
belongs to the matched request.
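A requester-side check along these lines might look like the sketch below. The SequenceIds tuple, the outstanding table keyed by XID, and the match_reply function are assumptions made for illustration; the text above does not prescribe any particular data structure.

```python
from typing import Dict, NamedTuple


class SequenceIds(NamedTuple):
    session_id: bytes
    slot_id: int
    sequence_id: int


def match_reply(outstanding: Dict[int, SequenceIds], xid: int,
                reply_ids: SequenceIds) -> bool:
    """Accept a reply only if its XID is known AND its IDs match the request.

    The XID locates the candidate request; the session ID, slot ID, and
    sequence ID carried in the reply confirm that the match is genuine.
    """
    sent = outstanding.get(xid)
    return sent is not None and sent == reply_ids
```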
The SEQUENCE (and CB_SEQUENCE) operation also carries a
"highest_slotid" value, which carries additional requester slot usage
information. The requester MUST always indicate the slot ID
representing the outstanding request with the highest-numbered slot
value. The requester should in all cases provide the most
conservative value possible, although it can be increased somewhat
above the actual instantaneous usage to maintain some minimum or
optimal level. This provides a way for the requester to yield unused
request slots back to the replier, which in turn can use the
information to reallocate resources.
The replier responds with both a new target highest_slotid and an
enforced highest_slotid, described as follows:
o The target highest_slotid is an indication to the requester of the
highest_slotid the replier wishes the requester to be using. This
permits the replier to withdraw (or add) resources from a
requester that has been found to not be using them, in order to
more fairly share resources among a varying level of demand from
other requesters. The requester must always comply with the
replier's value updates, since they indicate newly established
hard limits on the requester's access to session resources.
However, because of request pipelining, the requester may have
active requests in flight reflecting prior values; therefore, the
replier must not immediately require the requester to comply.
o The enforced highest_slotid indicates the highest slot ID the
requester is permitted to use on a subsequent SEQUENCE or
CB_SEQUENCE operation. The replier's enforced highest_slotid
SHOULD be no less than the highest_slotid the requester indicated
in the SEQUENCE or CB_SEQUENCE arguments.
A requester can be intransigent with respect to lowering its
highest_slotid argument to a Sequence operation, i.e., the
requester continues to ignore the target highest_slotid in the
response to a Sequence operation, and continues to set its
highest_slotid argument to be higher than the target
highest_slotid. This can be considered particularly egregious
behavior when the replier knows there are no outstanding requests
with slot IDs higher than its target highest_slotid. When faced
with such intransigence, the replier is free to take more forceful
action, and MAY reply with a new enforced highest_slotid that is
less than its previous enforced highest_slotid. Thereafter, if
the requester continues to send requests with a highest_slotid
that is greater than the replier's new enforced highest_slotid,
the server MAY return NFS4ERR_BAD_HIGH_SLOT, unless the slot ID in
the request is greater than the new enforced highest_slotid and
the request is a retry.
The replier SHOULD retain the slots it wants to retire until the
requester sends a request with a highest_slotid less than or equal
to the replier's new enforced highest_slotid.
The requester can also be intransigent with respect to sending
non-retry requests that have a slot ID that exceeds the replier's
highest_slotid. Once the replier has forcibly lowered the
enforced highest_slotid, the requester is only allowed to send
retries on slots that exceed the replier's highest_slotid. If a
request is received with a slot ID that is higher than the new
enforced highest_slotid, and the sequence ID is one higher than
what is in the slot's reply cache, then the server can both retire
the slot and return NFS4ERR_BADSLOT (however, the server MUST NOT
do one and not the other). The reason it is safe to retire the
slot is because by using the next sequence ID, the requester is
indicating it has received the previous reply for the slot.
o The requester SHOULD use the lowest available slot when sending a
new request. This way, the replier may be able to retire slot
entries faster. However, where the replier is actively adjusting
its granted highest_slotid, it will not be able to use only the
receipt of the slot ID and highest_slotid in the request. Neither
the slot ID nor the highest_slotid used in a request may reflect
the replier's current idea of the requester's session limit,
because the request may have been sent from the requester before
the update was received. Therefore, in the downward adjustment
case, the replier may have to retain a number of reply cache
entries at least as large as the old value of maximum requests
outstanding, until it can infer that the requester has seen a
reply containing the new granted highest_slotid. The replier can
infer that the requester has seen such a reply when it receives a
new request with the same slot ID as the request replied to and
the next higher sequence ID.
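On the requester side, choosing the lowest free slot and reporting the highest slot in use can be sketched as follows. The choose_slot function and its arguments are hypothetical; max_slot_id stands for whatever limit the requester currently believes the replier has granted, and no padding of highest_slotid above actual usage is attempted.

```python
from typing import Optional, Set, Tuple


def choose_slot(in_use: Set[int],
                max_slot_id: int) -> Optional[Tuple[int, int]]:
    """Pick the lowest free slot in 0..max_slot_id.

    Returns (slot_id, highest_slotid), where highest_slotid is the
    highest-numbered slot with an outstanding request once the new
    request is counted, or None if every permitted slot is busy.
    """
    for slot_id in range(max_slot_id + 1):
        if slot_id not in in_use:
            highest = max(in_use | {slot_id})
            return slot_id, highest
    return None
```

Filling slots from the bottom keeps highest_slotid as small as the current load allows, which is what lets the replier retire the upper slots.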
2.10.6.1.1. Caching of SEQUENCE and CB_SEQUENCE Replies
When a SEQUENCE or CB_SEQUENCE operation is successfully executed,
its reply MUST always be cached. Specifically, session ID, sequence
ID, and slot ID MUST be cached in the reply cache. The reply from
SEQUENCE also includes the highest slot ID, target highest slot ID,
and status flags. Instead of caching these values, the server MAY
re-compute the values from the current state of the fore channel,
session, and/or client ID as appropriate. Similarly, the reply from
CB_SEQUENCE includes a highest slot ID and target highest slot ID.
The client MAY re-compute the values from the current state of the
session as appropriate.
Regardless of whether or not a replier is re-computing highest slot
ID, target slot ID, and status on replies to retries, the requester
MUST NOT assume that the values are being re-computed whenever it
receives a reply after a retry is sent, since it has no way of
knowing whether the reply it has received was sent by the replier in
response to the retry or is a delayed response to the original
request. Therefore, it may be the case that highest slot ID, target
slot ID, or status bits may reflect the state of affairs when the
request was first executed. Although acting based on such delayed
information is valid, it may cause the receiver of the reply to do
unneeded work. Requesters MAY choose to send additional requests to
get the current state of affairs or use the state of affairs reported
by subsequent requests, in preference to acting immediately on data
that might be out of date.
2.10.6.1.2. Errors from SEQUENCE and CB_SEQUENCE
Any time SEQUENCE or CB_SEQUENCE returns an error, the sequence ID of the slot MUST NOT change. The replier MUST NOT modify the reply cache entry for the slot whenever an error is returned from SEQUENCE or CB_SEQUENCE.
2.10.6.1.3. Optional Reply Caching
On a per-request basis, the requester can choose to direct the replier to cache the reply to all operations after the first operation (SEQUENCE or CB_SEQUENCE) via the sa_cachethis or csa_cachethis fields of the arguments to SEQUENCE or CB_SEQUENCE. The reason it would not direct the replier to cache the entire reply is that the request is composed entirely of idempotent operations [34]. Caching the reply may offer little benefit. If the reply is too large (see Section 2.10.6.4), it may not be cacheable anyway. Even if the reply to an idempotent request is small enough to cache, unnecessarily caching the reply slows down the server and increases RPC latency.
Whether or not the requester requests the reply to be cached has no effect on the slot processing. If the results of SEQUENCE or CB_SEQUENCE are NFS4_OK, then the slot's sequence ID MUST be incremented by one. If a requester does not direct the replier to cache the reply, the replier MUST do one of the following:
o The replier can cache the entire original reply. Even though sa_cachethis or csa_cachethis is FALSE, the replier is always free to cache. It may choose this approach in order to simplify implementation.
o The replier enters into its reply cache a reply consisting of the original results to the SEQUENCE or CB_SEQUENCE operation, and with the next operation in COMPOUND or CB_COMPOUND having the error NFS4ERR_RETRY_UNCACHED_REP. Thus, if the requester later retries the request, it will get NFS4ERR_RETRY_UNCACHED_REP. If a replier receives a retried Sequence operation where the reply to the COMPOUND or CB_COMPOUND was not cached, then the replier,
* MAY return NFS4ERR_RETRY_UNCACHED_REP in reply to a Sequence operation if the Sequence operation is not the first operation (granted, a requester that does so is in violation of the NFSv4.1 protocol).
* MUST NOT return NFS4ERR_RETRY_UNCACHED_REP in reply to a Sequence operation if the Sequence operation is the first operation.
o If the second operation is an illegal operation, or an operation
that was legal in a previous minor version of NFSv4 and MUST NOT
be supported in the current minor version (e.g., SETCLIENTID), the
replier MUST NOT ever return NFS4ERR_RETRY_UNCACHED_REP. Instead
the replier MUST return NFS4ERR_OP_ILLEGAL or NFS4ERR_BADXDR or
NFS4ERR_NOTSUPP as appropriate.
o If the second operation can result in another error status, the
replier MAY return a status other than NFS4ERR_RETRY_UNCACHED_REP,
provided the operation is not executed in such a way that the
state of the replier is changed. Examples of such an error status
include: NFS4ERR_NOTSUPP returned for an operation that is legal
but not REQUIRED in the current minor versions, and thus not
supported by the replier; NFS4ERR_SEQUENCE_POS; and
NFS4ERR_REQ_TOO_BIG.
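The second of the two choices, storing a stub in place of the full reply, can be sketched as below. This is an illustration under simplifying assumptions: operations and results are modeled as plain tuples and strings, reply_to_cache is a hypothetical name, SEQUENCE is assumed to have returned NFS4_OK, and the exceptions listed above for illegal or unsupported second operations are not modeled.

```python
from typing import List, Tuple

OpResult = Tuple[str, str]  # (operation name, status)


def reply_to_cache(results: List[OpResult], ops: List[str],
                   cachethis: bool) -> List[OpResult]:
    """Return what the replier stores in the slot after SEQUENCE succeeds.

    ops[0] and results[0] are the SEQUENCE (or CB_SEQUENCE) operation
    and its result.
    """
    if cachethis or len(ops) < 2:
        return list(results)            # full reply is cached
    # Keep the real SEQUENCE result, then mark the second operation so a
    # later retry learns that the original reply was not retained.
    return [results[0], (ops[1], "NFS4ERR_RETRY_UNCACHED_REP")]
```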
The discussion above assumes that the retried request matches the
original one. Section 2.10.6.1.3.1 discusses what the replier might
do, and MUST do when original and retried requests do not match.
Since the replier may only cache a small amount of the information
that would be required to determine whether this is a case of a false
retry, the replier may send to the client any of the following
responses:
o The cached reply to the original request (if the replier has
cached it in its entirety and the users of the original request
and retry match).
o A reply that consists only of the Sequence operation with the
error NFS4ERR_FALSE_RETRY.
o A reply consisting of the response to Sequence with the status
NFS4_OK, together with the second operation as it appeared in the
retried request with an error of NFS4ERR_RETRY_UNCACHED_REP or
other error as described above.
o A reply that consists of the response to Sequence with the status
NFS4_OK, together with the second operation as it appeared in the
original request with an error of NFS4ERR_RETRY_UNCACHED_REP or
other error as described above.
2.10.6.1.3.1. False Retry
If a requester sent a Sequence operation with a slot ID and sequence
ID that are in the reply cache but the replier detected that the
retried request is not the same as the original request, including a
retry that has different operations or different arguments in the
operations from the original and a retry that uses a different
principal in the RPC request's credential field that translates to a different user, then this is a false retry. When the replier detects a false retry, it is permitted (but not always obligated) to return NFS4ERR_FALSE_RETRY in response to the Sequence operation.
Translations of particularly privileged user values to other users due to the lack of appropriately secure credentials, as configured on the replier, should be applied before determining whether the users are the same or different. If the replier determines the users are different between the original request and a retry, then the replier MUST return NFS4ERR_FALSE_RETRY.
If an operation of the retry is an illegal operation, or an operation that was legal in a previous minor version of NFSv4 and MUST NOT be supported in the current minor version (e.g., SETCLIENTID), the replier MAY return NFS4ERR_FALSE_RETRY (and MUST do so if the users of the original request and retry differ). Otherwise, the replier MAY return NFS4ERR_OP_ILLEGAL or NFS4ERR_BADXDR or NFS4ERR_NOTSUPP as appropriate. Note that this handling contrasts with how the replier deals with retried requests that have no cached reply. The difference is due to NFS4ERR_FALSE_RETRY being a valid error for only Sequence operations, whereas NFS4ERR_RETRY_UNCACHED_REP is a valid error for all operations except illegal operations and operations that MUST NOT be supported in the current minor version of NFSv4.
2.10.6.2. Retry and Replay of Reply
A requester MUST NOT retry a request, unless the connection it used to send the request disconnects. The requester can then reconnect and re-send the request, or it can re-send the request over a different connection that is associated with the same session.
If the requester is a server wanting to re-send a callback operation over the backchannel of a session, the requester of course cannot reconnect because only the client can associate connections with the backchannel. The server can re-send the request over another connection that is bound to the same session's backchannel. If there is no such connection, the server MUST indicate that the session has no backchannel by setting the SEQ4_STATUS_CB_PATH_DOWN_SESSION flag bit in the response to the next SEQUENCE operation from the client. The client MUST then associate a connection with the session (or destroy the session).
Note that it is not fatal for a requester to retry without a disconnect between the request and retry. However, the retry does consume resources, especially with RDMA, where each request, retry or not, consumes a credit. Retries for no reason, especially retries sent shortly after the previous attempt, are a poor use of network bandwidth and defeat the purpose of a transport's inherent congestion control system.
A requester MUST wait for a reply to a request before using the slot for another request. If it does not wait for a reply, then the requester does not know what sequence ID to use for the slot on its next request. For example, suppose a requester sends a request with sequence ID 1, and does not wait for the response. The next time it uses the slot, it sends the new request with sequence ID 2. If the replier has not seen the request with sequence ID 1, then the replier is not expecting sequence ID 2, and rejects the requester's new request with NFS4ERR_SEQ_MISORDERED (as the result from SEQUENCE or CB_SEQUENCE).
RDMA fabrics do not guarantee that the memory handles (Steering Tags) within each RPC/RDMA "chunk" [8] are valid on a scope outside that of a single connection. Therefore, handles used by the direct operations become invalid after connection loss. The server must ensure that any RDMA operations that must be replayed from the reply cache use the newly provided handle(s) from the most recent request.
A retry might be sent while the original request is still in progress on the replier. The replier SHOULD deal with the issue by returning NFS4ERR_DELAY as the reply to the SEQUENCE or CB_SEQUENCE operation, but implementations MAY return NFS4ERR_SEQ_MISORDERED. Since errors from SEQUENCE and CB_SEQUENCE are never recorded in the reply cache, this approach allows the results of the execution of the original request to be properly recorded in the reply cache (assuming that the requester specified the reply to be cached).
2.10.6.3. Resolving Server Callback Races
It is possible for server callbacks to arrive at the client before the reply from related fore channel operations. For example, a client may have been granted a delegation to a file it has opened, but the reply to the OPEN (informing the client of the granting of the delegation) may be delayed in the network. If a conflicting operation arrives at the server, it will recall the delegation using the backchannel, which may be on a different transport connection, perhaps even a different network, or even a different session associated with the same client ID.
The presence of a session between the client and server alleviates this issue. When a session is in place, each client request is uniquely identified by its { session ID, slot ID, sequence ID } triple. By the rules under which slot entries (reply cache entries) are retired, the server has knowledge whether the client has "seen"
each of the server's replies. The server can therefore provide
sufficient information to the client to allow it to disambiguate
between an erroneous or conflicting callback race condition.
For each client operation that might result in some sort of server
callback, the server SHOULD "remember" the { session ID, slot ID,
sequence ID } triple of the client request until the slot ID
retirement rules allow the server to determine that the client has,
in fact, seen the server's reply. Until the time the { session ID,
slot ID, sequence ID } request triple can be retired, any recalls of
the associated object MUST carry an array of these referring
identifiers (in the CB_SEQUENCE operation's arguments), for the
benefit of the client. After this time, it is not necessary for the
server to provide this information in related callbacks, since it is
certain that a race condition can no longer occur.
The CB_SEQUENCE operation that begins each server callback carries a
list of "referring" { session ID, slot ID, sequence ID } triples. If
the client finds the request corresponding to the referring session
ID, slot ID, and sequence ID to be currently outstanding (i.e., the
server's reply has not been seen by the client), it can determine
that the callback has raced the reply, and act accordingly. If the
client does not find the request corresponding to the referring
triple to be outstanding (including the case of a session ID
referring to a destroyed session), then there is no race with respect
to this triple. The server SHOULD limit the referring triples to
requests that refer to just those that apply to the objects referred
to in the CB_COMPOUND procedure.
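The client-side test for a race can be sketched as a set membership check. The Triple alias and the callback_raced_reply function are hypothetical names, and the client is assumed to keep a set of triples for requests whose replies it has not yet seen.

```python
from typing import Iterable, Set, Tuple

Triple = Tuple[bytes, int, int]  # (session ID, slot ID, sequence ID)


def callback_raced_reply(referring: Iterable[Triple],
                         outstanding: Set[Triple]) -> bool:
    """Return True if any referring triple names a still-outstanding request.

    A triple that is not outstanding (including one that refers to a
    destroyed session) indicates no race with respect to that triple.
    """
    return any(triple in outstanding for triple in referring)
```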
The client must not simply wait forever for the expected server reply
to arrive before responding to the CB_COMPOUND that won the race,
because it is possible that it will be delayed indefinitely. The
client should assume the likely case that the reply will arrive
within the average round-trip time for COMPOUND requests to the
server, and wait that period of time. If that period of time
expires, it can respond to the CB_COMPOUND with NFS4ERR_DELAY. There
are other scenarios under which callbacks may race replies. Among
them are pNFS layout recalls as described in Section 12.5.5.2.
2.10.6.4. COMPOUND and CB_COMPOUND Construction Issues
Very large requests and replies may pose both buffer management
issues (especially with RDMA) and reply cache issues. When the
session is created (Section 18.36), for each channel (fore and back),
the client and server negotiate the maximum-sized request they will
send or process (ca_maxrequestsize), the maximum-sized reply they
will return or process (ca_maxresponsesize), and the maximum-sized
reply they will store in the reply cache (ca_maxresponsesize_cached).
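The three negotiated limits can be pictured as a small per-channel record that each side consults before sending or replying. The sketch below is illustrative: the ChannelLimits type, the two helper functions, and the byte values are assumptions, and the error names it returns are the statuses this section assigns to each exceeded limit.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChannelLimits:
    """Per-channel sizes agreed at session creation, in bytes."""
    ca_maxrequestsize: int
    ca_maxresponsesize: int
    ca_maxresponsesize_cached: int


def request_status(limits: ChannelLimits, request_size: int) -> Optional[str]:
    """Return an error name if the request is too large, else None."""
    if request_size > limits.ca_maxrequestsize:
        return "NFS4ERR_REQ_TOO_BIG"
    return None


def reply_status(limits: ChannelLimits, reply_size: int,
                 cachethis: bool) -> Optional[str]:
    """Return an error name if the reply cannot be sent or cached, else None."""
    if reply_size > limits.ca_maxresponsesize:
        return "NFS4ERR_REP_TOO_BIG"
    if cachethis and reply_size > limits.ca_maxresponsesize_cached:
        return "NFS4ERR_REP_TOO_BIG_TO_CACHE"
    return None


# Usage with made-up limits for a fore channel.
fore = ChannelLimits(ca_maxrequestsize=16384,
                     ca_maxresponsesize=65536,
                     ca_maxresponsesize_cached=4096)
too_big_request = request_status(fore, 20000)
too_big_to_cache = reply_status(fore, 8192, cachethis=True)
uncached_ok = reply_status(fore, 8192, cachethis=False)
```

The sketch only classifies sizes; which operation in the COMPOUND carries the resulting status is a separate decision left to the replier.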
If a request exceeds ca_maxrequestsize, the reply will have the status
NFS4ERR_REQ_TOO_BIG. A replier MAY return NFS4ERR_REQ_TOO_BIG as the
status for the first operation (SEQUENCE or CB_SEQUENCE) in the
request (which means that no operations in the request executed and
that the state of the slot in the reply cache is unchanged), or it
MAY opt to return it on a subsequent operation in the same COMPOUND
or CB_COMPOUND request (which means that at least one operation did
execute and that the state of the slot in the reply cache does
change). The replier SHOULD set NFS4ERR_REQ_TOO_BIG on the operation
that exceeds ca_maxrequestsize.
If a reply exceeds ca_maxresponsesize, the reply will have the status
NFS4ERR_REP_TOO_BIG. A replier MAY return NFS4ERR_REP_TOO_BIG as the
status for the first operation (SEQUENCE or CB_SEQUENCE) in the
request, or it MAY opt to return it on a subsequent operation (in the
same COMPOUND or CB_COMPOUND reply). A replier MAY return
NFS4ERR_REP_TOO_BIG in the reply to SEQUENCE or CB_SEQUENCE, even if
the response would still exceed ca_maxresponsesize.
If sa_cachethis or csa_cachethis is TRUE, then the replier MUST cache
a reply except if an error is returned by the SEQUENCE or CB_SEQUENCE
operation (see Section 2.10.6.1.2). If the reply exceeds
ca_maxresponsesize_cached (and sa_cachethis or csa_cachethis is
TRUE), then the server MUST return NFS4ERR_REP_TOO_BIG_TO_CACHE.
Even if NFS4ERR_REP_TOO_BIG_TO_CACHE (or any other error for that
matter) is returned on an operation other than the first operation
(SEQUENCE or CB_SEQUENCE), then the reply MUST be cached if
sa_cachethis or csa_cachethis is TRUE. For example, if a COMPOUND
has eleven operations, including SEQUENCE, the fifth operation is a
RENAME, and the tenth operation is a READ for one million bytes, the
server may return NFS4ERR_REP_TOO_BIG_TO_CACHE on the tenth
operation. Since the server executed several operations, especially
the non-idempotent RENAME, the client's request to cache the reply
needs to be honored in order for the correct operation of exactly
once semantics. If the client retries the request, the server will
have cached a reply that contains results for ten of the eleven
requested operations, with the tenth operation having a status of
NFS4ERR_REP_TOO_BIG_TO_CACHE.
A client needs to take care that when sending operations that change
the current filehandle (except for PUTFH, PUTPUBFH, PUTROOTFH, and
RESTOREFH), it not exceed the maximum reply buffer before the GETFH
operation. Otherwise, the client will have to retry the operation
that changed the current filehandle, in order to obtain the desired
filehandle. For the OPEN operation (see Section 18.16), retry is not
always available as an option. The following guidelines for the
handling of filehandle-changing operations are advised:
o Within the same COMPOUND procedure, a client SHOULD send GETFH
immediately after a current filehandle-changing operation. A
client MUST send GETFH after a current filehandle-changing
operation that is also non-idempotent (e.g., the OPEN operation),
unless the operation is RESTOREFH. RESTOREFH is an exception,
because even though it is non-idempotent, the filehandle RESTOREFH
produced originated from an operation that is either idempotent
(e.g., PUTFH, LOOKUP), or non-idempotent (e.g., OPEN, CREATE). If
the origin is non-idempotent, then because the client MUST send
GETFH after the origin operation, the client can recover if
RESTOREFH returns an error.
o A server MAY return NFS4ERR_REP_TOO_BIG or
NFS4ERR_REP_TOO_BIG_TO_CACHE (if sa_cachethis is TRUE) on a
filehandle-changing operation if the reply would be too large on
the next operation.
o A server SHOULD return NFS4ERR_REP_TOO_BIG or
NFS4ERR_REP_TOO_BIG_TO_CACHE (if sa_cachethis is TRUE) on a
filehandle-changing, non-idempotent operation if the reply would
be too large on the next operation, especially if the operation is
OPEN.
o A server MAY return NFS4ERR_UNSAFE_COMPOUND to a non-idempotent
current filehandle-changing operation, if it looks at the next
operation (in the same COMPOUND procedure) and finds it is not
GETFH. The server SHOULD do this if it is unable to determine in
advance whether the total response size would exceed
ca_maxresponsesize_cached or ca_maxresponsesize.
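The GETFH guidelines above lend themselves to a simple check over the list of operation names in a COMPOUND. The following is a minimal sketch under stated assumptions: the NEEDS_GETFH set contains only the non-idempotent, filehandle-changing operations that the text names as examples, both function names are hypothetical, and exempting RESTOREFH on the server side as well is a choice made for this example.

```python
from typing import List, Optional

# Illustrative set: non-idempotent operations that change the current
# filehandle, taken from the examples in the text. RESTOREFH is left out
# because the client rule exempts it (an assumption on the server side).
NEEDS_GETFH = {"OPEN", "CREATE"}


def _next_op(ops: List[str], i: int) -> Optional[str]:
    """Return the operation following index i, or None at the end."""
    return ops[i + 1] if i + 1 < len(ops) else None


def missing_getfh(ops: List[str]) -> List[int]:
    """Client side: indexes of operations that lack a required GETFH."""
    return [i for i, op in enumerate(ops)
            if op in NEEDS_GETFH and _next_op(ops, i) != "GETFH"]


def server_lookahead(ops: List[str], i: int) -> Optional[str]:
    """Server side: optionally reject operation i before executing it."""
    if ops[i] in NEEDS_GETFH and _next_op(ops, i) != "GETFH":
        return "NFS4ERR_UNSAFE_COMPOUND"
    return None


# Usage: OPEN followed directly by READ versus OPEN followed by GETFH.
ops_bad = ["SEQUENCE", "PUTFH", "OPEN", "READ"]
ops_good = ["SEQUENCE", "PUTFH", "OPEN", "GETFH", "READ"]
client_problems = missing_getfh(ops_bad)
server_verdict = server_lookahead(ops_bad, 2)
```

A client could run such a check before sending, so that a compliant server never has reason to return NFS4ERR_UNSAFE_COMPOUND.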
2.10.6.5. Persistence
Since the reply cache is bounded, it is practical for the reply cache
to persist across server restarts. The replier MUST persist the
following information if it agreed to persist the session (when the
session was created; see Section 18.36):
o The session ID.
o The slot table including the sequence ID and cached reply for each
slot.
The above are sufficient for a replier to provide EOS semantics for
any requests that were sent and executed before the server restarted.
If the replier is a client, then there is no need for it to persist
any more information, unless the client will be persisting all other
state across client restart, in which case, the server will never see
any NFSv4.1-level protocol manifestation of a client restart. If the
replier is a server, with just the slot table and session ID
persisting, any requests the client retries after the server restart
will return the results that are cached in the reply cache, and any
new requests (i.e., the sequence ID is one greater than the slot's
sequence ID) MUST be rejected with NFS4ERR_DEADSESSION (returned by
SEQUENCE). Such a session is considered dead. A server MAY re-
animate a session after a server restart so that the session will
accept new requests as well as retries. To re-animate a session, the
server needs to persist additional information through server
restart:
o The client ID. This is a prerequisite to let the client create
more sessions associated with the same client ID as the re-
animated session.
o The client ID's sequence ID that is used for creating sessions
(see Sections 18.35 and 18.36). This is a prerequisite to let the
client create more sessions.
o The principal that created the client ID. This allows the server
to authenticate the client when it sends EXCHANGE_ID.
o The SSV, if SP4_SSV state protection was specified when the client
ID was created (see Section 18.35). This lets the client create
new sessions, and associate connections with the new and existing
sessions.
o The properties of the client ID as defined in Section 18.35.
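The restart behavior described in this section can be sketched as a decision on the sequence ID presented to a recovered slot. This is an illustrative sketch: the Slot and PersistedSession types, the reanimated flag, and the "REPLAY" and "EXECUTE" labels are assumptions made for the example. The lookup assumes a valid slot ID, and slot-table bounds checking is omitted.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class Slot:
    sequence_id: int
    cached_reply: Optional[bytes] = None


@dataclass
class PersistedSession:
    """State a server kept across restart for one persistent session."""
    session_id: bytes
    slots: Dict[int, Slot] = field(default_factory=dict)
    # True only when the server also persisted the client ID, its
    # session-creation sequence ID, the creating principal, the SSV (if
    # SP4_SSV was used), and the client ID properties.
    reanimated: bool = False


def sequence_after_restart(s: PersistedSession, slot_id: int,
                           sequence_id: int) -> Tuple[str, Optional[bytes]]:
    """Decide how to treat SEQUENCE on a session recovered after restart."""
    slot = s.slots[slot_id]
    if sequence_id == slot.sequence_id:
        return "REPLAY", slot.cached_reply       # retry: return cached reply
    if sequence_id == slot.sequence_id + 1:
        if not s.reanimated:
            return "NFS4ERR_DEADSESSION", None   # dead session: no new work
        return "EXECUTE", None                   # placeholder for normal path
    return "NFS4ERR_SEQ_MISORDERED", None


# Usage: a session recovered with only its ID and slot table.
session = PersistedSession(b"S1", {0: Slot(7, b"cached")})
retry = sequence_after_restart(session, 0, 7)
new_request = sequence_after_restart(session, 0, 8)
```

With only the minimal state persisted, retries still get exactly once semantics while new work is refused; persisting the additional items is what lets the same function take the "EXECUTE" branch.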
A persistent reply cache places certain demands on the server. The
execution of the sequence of operations (starting with SEQUENCE) and
placement of its results in the persistent cache MUST be atomic. If
a client retries a sequence of operations that was previously
executed on the server, the only acceptable outcomes are either the
original cached reply or an indication that the client ID or session
has been lost (indicating a catastrophic loss of the reply cache or a
session that has been deleted because the client failed to use the
session for an extended period of time).
A server could fail and restart in the middle of a COMPOUND procedure
that contains one or more non-idempotent or idempotent-but-modifying
operations. This creates an even greater challenge for atomic
execution and placement of results in the reply cache. One way to
view the problem is as a single transaction consisting of each
operation in the COMPOUND followed by storing the result in
persistent storage, then finally a transaction commit. If there is a
failure before the transaction is committed, then the server rolls
back the transaction. If the server itself fails, then when it
restarts, its recovery logic could roll back the transaction before
starting the NFSv4.1 server.
While the description of the implementation for atomic execution of
the request and caching of the reply is beyond the scope of this
document, an example implementation for NFSv2 [38] is described in
[39].
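The single-transaction view described above can be illustrated with a small sketch that uses SQLite as a stand-in for the server's persistent storage. It is not the implementation described in [39]. The table layout, the execute_compound function, and the representation of operations as SQL statements are all assumptions made for the example, and an in-memory database is used only so that the sketch is self-contained.

```python
import sqlite3


def execute_compound(conn, session_id, slot_id, sequence_id, ops):
    """Apply every operation and store the reply in one transaction.

    `ops` is a list of (sql, params) pairs standing in for the
    state-modifying operations of a COMPOUND. Either all of their effects
    and the cached reply are committed together, or none are.
    """
    reply = "ok:%d" % len(ops)
    with conn:  # commits on success, rolls back if an exception escapes
        for sql, params in ops:
            conn.execute(sql, params)
        conn.execute(
            "INSERT OR REPLACE INTO reply_cache VALUES (?, ?, ?, ?)",
            (session_id, slot_id, sequence_id, reply),
        )
    return reply


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (name TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE reply_cache (session_id TEXT, slot_id INTEGER, "
             "sequence_id INTEGER, reply TEXT, "
             "PRIMARY KEY (session_id, slot_id))")
conn.execute("INSERT INTO files VALUES ('a')")
conn.commit()

rename = ("UPDATE files SET name = ? WHERE name = ?", ("b", "a"))
broken = ("INSERT INTO no_such_table VALUES (1)", ())

try:
    execute_compound(conn, "S1", 0, 1, [rename, broken])
except sqlite3.OperationalError:
    pass  # simulated mid-COMPOUND failure; the rename was rolled back

names_after_failure = [r[0] for r in conn.execute("SELECT name FROM files")]
execute_compound(conn, "S1", 0, 1, [rename])
names_after_success = [r[0] for r in conn.execute("SELECT name FROM files")]
cached = conn.execute("SELECT reply FROM reply_cache").fetchall()
```

After the simulated failure neither the rename nor a cached reply is visible; after the successful run both appear together, which is the property the persistent reply cache requires.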