

RFC 8166

Remote Direct Memory Access Transport for Remote Procedure Call Version 1

Pages: 55
Proposed STD
Obsoletes:  5666
Part 2 of 3 – Pages 23 to 44

4.  RPC-over-RDMA in Operation

   Every RPC-over-RDMA version 1 message has a header that includes a
   copy of the message's transaction ID, data for managing RDMA flow-
   control credits, and lists of RDMA segments describing chunks.  All
   RPC-over-RDMA header content is contained in the Transport stream;
   thus, it MUST be XDR encoded.

   RPC message layout is unchanged from that described in [RFC5531]
   except for the possible reduction of data items that are moved by
   separate operations.

   The RPC-over-RDMA protocol passes RPC messages without regard to
   their type (CALL or REPLY).  Apart from restrictions imposed by ULBs,
   each endpoint of a connection MAY send RDMA_MSG or RDMA_NOMSG message
   header types at any time (subject to credit limits).

4.1.  XDR Protocol Definition

   This section contains a description of the core features of the RPC-
   over-RDMA version 1 protocol, expressed in the XDR language
   [RFC4506].

   This description is provided in a way that makes it simple to extract
   into ready-to-compile form.  The reader can apply the following shell
   script to this document to produce a machine-readable XDR description
   of the RPC-over-RDMA version 1 protocol.


   grep '^ *///' | sed 's?^ */// ??' | sed 's?^ *///$??'

   That is, if the above script is stored in a file called "extract.sh"
   and this document is in a file called "spec.txt", then the reader can
   do the following to extract an XDR description file:


   sh extract.sh < spec.txt > rpcrdma_corev1.x
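
   For environments without a POSIX shell, the same extraction can be
   sketched in Python; the function name extract_xdr is illustrative and
   not part of this specification:

```python
import re

def extract_xdr(spec_text: str) -> str:
    """Mirror the grep/sed pipeline above: keep only lines carrying
    the '///' sentinel, then strip leading spaces, the sentinel, and
    one following space (a bare '///' line becomes an empty line)."""
    out = []
    for line in spec_text.splitlines():
        m = re.match(r'^ *///( ?)(.*)$', line)
        if m is not None:
            out.append(m.group(2))
    return "\n".join(out) + "\n"
```

   Feeding this document's text through the function yields the same
   XDR description as the shell pipeline.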


4.1.1.  Code Component License

   Code components extracted from this document must include the
   following license text.  When the extracted XDR code is combined with
   other complementary XDR code, which itself has an identical license,
   only a single copy of the license text need be preserved.

   /// /*
   ///  * Copyright (c) 2010-2017 IETF Trust and the persons
   ///  * identified as authors of the code.  All rights reserved.
   ///  *
   ///  * The authors of the code are:
   ///  * B. Callaghan, T. Talpey, and C. Lever
   ///  *
   ///  * Redistribution and use in source and binary forms, with
   ///  * or without modification, are permitted provided that the
   ///  * following conditions are met:
   ///  *
   ///  * - Redistributions of source code must retain the above
   ///  *   copyright notice, this list of conditions and the
   ///  *   following disclaimer.
   ///  *
   ///  * - Redistributions in binary form must reproduce the above
   ///  *   copyright notice, this list of conditions and the
   ///  *   following disclaimer in the documentation and/or other
   ///  *   materials provided with the distribution.
   ///  *
   ///  * - Neither the name of Internet Society, IETF or IETF
   ///  *   Trust, nor the names of specific contributors, may be
   ///  *   used to endorse or promote products derived from this
   ///  *   software without specific prior written permission.
   ///  *
   ///  */

4.1.2.  RPC-over-RDMA Version 1 XDR

   XDR data items defined in this section encode the Transport Header
   Stream in each RPC-over-RDMA version 1 message.  Comments identify
   items that cannot be changed in subsequent versions.


   /// /*
   ///  * Plain RDMA segment (Section 3.4.3)
   ///  */
   /// struct xdr_rdma_segment {
   ///    uint32 handle;           /* Registered memory handle */
   ///    uint32 length;           /* Length of the chunk in bytes */
   ///    uint64 offset;           /* Chunk virtual address or offset */
   /// };
   /// /*
   ///  * RDMA read segment (Section 3.4.5)
   ///  */
   /// struct xdr_read_chunk {
   ///    uint32 position;        /* Position in XDR stream */
   ///    struct xdr_rdma_segment target;
   /// };
   /// /*
   ///  * Read list (Section 4.3.1)
   ///  */
   /// struct xdr_read_list {
   ///         struct xdr_read_chunk entry;
   ///         struct xdr_read_list  *next;
   /// };
   /// /*
   ///  * Write chunk (Section 3.4.6)
   ///  */
   /// struct xdr_write_chunk {
   ///         struct xdr_rdma_segment target<>;
   /// };
   /// /*
   ///  * Write list (Section 4.3.2)
   ///  */
   /// struct xdr_write_list {
   ///         struct xdr_write_chunk entry;
   ///         struct xdr_write_list  *next;
   /// };
   /// /*
   ///  * Chunk lists (Section 4.3)
   ///  */
   /// struct rpc_rdma_header {
   ///    struct xdr_read_list   *rdma_reads;
   ///    struct xdr_write_list  *rdma_writes;
   ///    struct xdr_write_chunk *rdma_reply;
   ///    /* rpc body follows */
   /// };
   /// struct rpc_rdma_header_nomsg {
   ///    struct xdr_read_list   *rdma_reads;
   ///    struct xdr_write_list  *rdma_writes;
   ///    struct xdr_write_chunk *rdma_reply;
   /// };
   /// /* Not to be used */
   /// struct rpc_rdma_header_padded {
   ///    uint32                 rdma_align;
   ///    uint32                 rdma_thresh;
   ///    struct xdr_read_list   *rdma_reads;
   ///    struct xdr_write_list  *rdma_writes;
   ///    struct xdr_write_chunk *rdma_reply;
   ///    /* rpc body follows */
   /// };
   /// /*
   ///  * Error handling (Section 4.5)
   ///  */
   /// enum rpc_rdma_errcode {
   ///    ERR_VERS = 1,       /* Value fixed for all versions */
   ///    ERR_CHUNK = 2
   /// };
   /// /* Structure fixed for all versions */
   /// struct rpc_rdma_errvers {
   ///    uint32 rdma_vers_low;
   ///    uint32 rdma_vers_high;
   /// };
   /// union rpc_rdma_error switch (rpc_rdma_errcode err) {
   ///    case ERR_VERS:
   ///      rpc_rdma_errvers range;
   ///    case ERR_CHUNK:
   ///      void;
   /// };
   /// /*
   ///  * Procedures (Section 4.2.4)
   ///  */
   /// enum rdma_proc {
   ///    RDMA_MSG = 0,     /* Value fixed for all versions */
   ///    RDMA_NOMSG = 1,   /* Value fixed for all versions */
   ///    RDMA_MSGP = 2,    /* Not to be used */
   ///    RDMA_DONE = 3,    /* Not to be used */
   ///    RDMA_ERROR = 4    /* Value fixed for all versions */
   /// };
   /// /* The position of the proc discriminator field is
   ///  * fixed for all versions */
   /// union rdma_body switch (rdma_proc proc) {
   ///    case RDMA_MSG:
   ///      rpc_rdma_header rdma_msg;
   ///    case RDMA_NOMSG:
   ///      rpc_rdma_header_nomsg rdma_nomsg;
   ///    case RDMA_MSGP:   /* Not to be used */
   ///      rpc_rdma_header_padded rdma_msgp;
   ///    case RDMA_DONE:   /* Not to be used */
   ///      void;
   ///    case RDMA_ERROR:
   ///      rpc_rdma_error rdma_error;
   /// };
   /// /*
   ///  * Fixed header fields (Section 4.2)
   ///  */
   /// struct rdma_msg {
   ///    uint32    rdma_xid;      /* Position fixed for all versions */
   ///    uint32    rdma_vers;     /* Position fixed for all versions */
   ///    uint32    rdma_credit;   /* Position fixed for all versions */
   ///    rdma_body rdma_body;
   /// };
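
   Because all header content is XDR encoded, every field is big-endian
   on the wire.  As an illustration only (the helper name and argument
   values are not from this specification), the four fixed words of an
   rdma_msg could be packed like so:

```python
import struct

# rdma_proc discriminator values from the XDR definition above
RDMA_MSG, RDMA_NOMSG, RDMA_ERROR = 0, 1, 4

def encode_fixed_header(xid: int, credit: int, proc: int) -> bytes:
    """Pack the four fixed 32-bit words: rdma_xid, rdma_vers (always 1
    for this protocol version), rdma_credit, and the rdma_body
    discriminator.  XDR integers are big-endian, hence '>' format."""
    return struct.pack(">IIII", xid, 1, credit, proc)
```

   The chunk lists (or the error body) then follow these 16 bytes in
   the Transport stream.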


4.2.  Fixed Header Fields

   The RPC-over-RDMA header begins with four fixed 32-bit fields that
   control the RDMA interaction.

   The first three words are individual fields in the rdma_msg
   structure.  The fourth word is the first word of the rdma_body union,
   which acts as the discriminator for the switched union.  The contents
   of this field are described in Section 4.2.4.
   These four fields must remain with the same meanings and in the same
   positions in all subsequent versions of the RPC-over-RDMA protocol.

4.2.1.  Transaction ID (XID)

   The XID generated for the RPC Call and Reply messages.  Having the
   XID at a fixed location in the header makes it easy for the receiver
   to establish context as soon as each RPC-over-RDMA message arrives.
   This XID MUST be the same as the XID in the RPC message.  The
   receiver MAY perform its processing based solely on the XID in the
   RPC-over-RDMA header, and thereby ignore the XID in the RPC message,
   if it so chooses.

4.2.2.  Version Number

   For RPC-over-RDMA version 1, this field MUST contain the value one
   (1).  Rules regarding changes to this transport protocol version
   number can be found in Section 7.

4.2.3.  Credit Value

   When sent with an RPC Call message, the requested credit value is
   provided.  When sent with an RPC Reply message, the granted credit
   value is returned.  Further discussion of how the credit value is
   determined can be found in Section 3.3.

4.2.4.  Procedure Number

   RDMA_MSG = 0         indicates that chunk lists and a Payload stream
                        follow.  The format of the chunk lists is
                        discussed below.

   RDMA_NOMSG = 1       indicates that after the chunk lists there is no
                        Payload stream.  In this case, the chunk lists
                        provide information to allow the Responder to
                        transfer the Payload stream using explicit RDMA
                        operations.

   RDMA_MSGP = 2        is reserved.

   RDMA_DONE = 3        is reserved.

   RDMA_ERROR = 4       is used to signal an encoding error in the RPC-
                        over-RDMA header.

   An RDMA_MSG procedure conveys the Transport stream and the Payload
   stream via an RDMA Send operation.  The Transport stream contains the
   four fixed fields followed by the Read and Write lists and the Reply
   chunk, though any or all three MAY be marked as not present.  The
   Payload stream then follows, beginning with its XID field.  If a Read
   or Write chunk list is present, a portion of the Payload stream has
   been reduced and is conveyed via separate operations.

   An RDMA_NOMSG procedure conveys the Transport stream via an RDMA Send
   operation.  The Transport stream contains the four fixed fields
   followed by the Read and Write chunk lists and the Reply chunk.
   Though any of these MAY be marked as not present, one MUST be present
   and MUST hold the Payload stream for this RPC-over-RDMA message.  If
   a Read or Write chunk list is present, a portion of the Payload
   stream has been excised and is conveyed via separate operations.

   An RDMA_ERROR procedure conveys the Transport stream via an RDMA Send
   operation.  The Transport stream contains the four fixed fields
   followed by formatted error information.  No Payload stream is
   conveyed in this type of RPC-over-RDMA message.

   A Requester MUST NOT send an RPC-over-RDMA header with the RDMA_ERROR
   procedure.  A Responder MUST silently discard RDMA_ERROR procedures.

   The Transport stream and Payload stream can be constructed in
   separate buffers.  However, the total length of the gathered buffers
   cannot exceed the inline threshold.
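
   That length constraint can be sketched as a simple check.  The
   threshold value is connection-specific; 1024 bytes below is only an
   example, not a value defined by this specification:

```python
def fits_inline(transport_stream: bytes, payload_stream: bytes,
                inline_threshold: int) -> bool:
    """The Transport and Payload streams may live in separate buffers
    that are gathered into one RDMA Send; the limit applies to their
    combined length, not to either buffer alone."""
    return len(transport_stream) + len(payload_stream) <= inline_threshold
```

   A sender that fails this check must instead move the Payload stream
   via chunks (RDMA_NOMSG) rather than in the Send itself.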

4.3.  Chunk Lists

   The chunk lists in an RPC-over-RDMA version 1 header are three XDR
   optional-data fields that follow the fixed header fields in RDMA_MSG
   and RDMA_NOMSG procedures.  Read Section 4.19 of [RFC4506] carefully
   to understand how optional-data fields work.  Examples of XDR-encoded
   chunk lists are provided in Section 4.7 as an aid to understanding.

   Often, an RPC-over-RDMA message has no associated chunks.  In this
   case, the Read list, Write list, and Reply chunk are all marked "not
   present".

4.3.1.  Read List

   Each RDMA_MSG or RDMA_NOMSG procedure has one "Read list".  The Read
   list is a list of zero or more RDMA read segments, provided by the
   Requester, that are grouped by their Position fields into Read
   chunks.  Each Read chunk advertises the location of argument data the
   Responder is to pull from the Requester.  The Requester has reduced
   the data items in these chunks from the call's Payload stream.
   A Requester may transmit the Payload stream of an RPC Call message
   using a Position Zero Read chunk.  If the RPC Call message has no
   argument data that is DDP-eligible and the Position Zero Read chunk
   is not being used, the Requester leaves the Read list empty.

   Responders MUST leave the Read list empty in all replies.

Matching Read Chunks to Arguments

   When reducing a DDP-eligible argument data item, a Requester records
   the XDR stream offset of that data item in the Read chunk's Position
   field.  The Responder can then tell unambiguously where that chunk is
   to be reinserted into the received Payload stream to form a complete
   RPC Call message.
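
   A sketch of this reduction step (the helper and field names are
   illustrative; a real implementation tracks XDR offsets while
   encoding rather than splicing byte strings after the fact):

```python
def reduce_payload(payload: bytes, item_offset: int, item_len: int):
    """Remove a DDP-eligible data item from the Payload stream and
    record its XDR stream offset as the Read chunk's Position, so the
    Responder knows where to reinsert the pulled data."""
    chunk_data = payload[item_offset:item_offset + item_len]
    reduced = payload[:item_offset] + payload[item_offset + item_len:]
    read_chunk = {"position": item_offset, "data": chunk_data}
    return reduced, read_chunk
```

   The Responder performs the inverse: it pulls the chunk via RDMA Read
   and splices it back into the received Payload stream at Position.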

4.3.2.  Write List

   Each RDMA_MSG or RDMA_NOMSG procedure has one "Write list".  The
   Write list is a list of zero or more Write chunks, provided by the
   Requester.  Each Write chunk is an array of plain segments; thus, the
   Write list is a list of counted arrays.

   If an RPC Reply message has no possible DDP-eligible result data
   items, the Requester leaves the Write list empty.  When a Requester
   provides a Write list, the Responder MUST push data corresponding to
   DDP-eligible result data items to Requester memory referenced in the
   Write list.  The Responder removes these data items from the reply's
   Payload stream.

Matching Write Chunks to Results

   A Requester constructs the Write list for an RPC transaction before
   the Responder has formulated its reply.  When there is only one DDP-
   eligible result data item, the Requester inserts only a single Write
   chunk in the Write list.  If the returned Write chunk is not an
   unused Write chunk, the Requester knows with certainty which result
   data item is contained in it.

   When a Requester has provided multiple Write chunks, the Responder
   fills in each Write chunk with one DDP-eligible result until there
   are either no more DDP-eligible results or no more Write chunks.

   The Requester might not be able to predict in advance which DDP-
   eligible data item goes in which chunk.  Thus, the Requester is
   responsible for allocating and registering Write chunks large enough
   to accommodate the largest result data item that might be associated
   with each chunk in the Write list.
   As a Requester decodes a reply Payload stream, it is clear from the
   contents of the RPC Reply message which Write chunk contains which
   result data item.

Unused Write Chunks

   There are occasions when a Requester provides a non-empty Write chunk
   but the Responder is not able to use it.  For example, a ULP may
   define a union result where some arms of the union contain a DDP-
   eligible data item while other arms do not.  The Responder is
   required to use Requester-provided Write chunks in this case, but if
   the Responder returns a result that uses an arm of the union that has
   no DDP-eligible data item, that Write chunk remains unconsumed.

   If there is a subsequent DDP-eligible result data item in the RPC
   Reply message, it MUST be placed in that unconsumed Write chunk.
   Therefore, the Requester MUST provision each Write chunk so it can be
   filled with the largest DDP-eligible data item that can be placed in
   it.

   If this is the last or only Write chunk available and it remains
   unconsumed, the Responder MUST return this Write chunk as an unused
   Write chunk (see Section 3.4.6).  The Responder sets the segment
   count to a value matching the Requester-provided Write chunk, but
   returns only empty segments in that Write chunk.

   Unused Write chunks, or unused bytes in Write chunk segments, are
   returned to the RPC consumer as part of RPC completion.  Even if a
   Responder indicates that a Write chunk is not consumed, the Responder
   may have written data into one or more segments before choosing not
   to return that data item.  The Requester MUST NOT assume that the
   memory regions backing a Write chunk have not been modified.

Empty Write Chunks

   To force a Responder to return a DDP-eligible result inline, a
   Requester employs the following mechanism:

   o  When there is only one DDP-eligible result item in an RPC Reply
      message, the Requester provides an empty Write list.

   o  When there are multiple DDP-eligible result data items and a
      Requester prefers that a data item is returned inline, the
      Requester provides an empty Write chunk for that item (see
      Section 3.4.6).  The Responder MUST return the corresponding
      result data item inline and MUST return an empty Write chunk in
      that Write list position in the RPC Reply message.
   As always, a Requester and Responder must prepare for a Long Reply to
   be used if the resulting RPC Reply might be too large to be conveyed
   in an RDMA Send.

4.3.3.  Reply Chunk

   Each RDMA_MSG or RDMA_NOMSG procedure has one "Reply chunk" slot.  A
   Requester MUST provide a Reply chunk whenever the maximum possible
   size of the RPC Reply message's Transport and Payload streams is
   larger than the inline threshold for messages from Responder to
   Requester.  Otherwise, the Requester marks the Reply chunk as not
   present.

   If the Transport stream and Payload stream together are smaller than
   the reply inline threshold, the Responder MAY return the RPC Reply
   message as a Short message rather than using the Requester-provided
   Reply chunk.

   When a Requester provides a Reply chunk in an RPC Call message, the
   Responder MUST copy that chunk into the Transport header of the RPC
   Reply message.  As with Write chunks, the Responder modifies the
   copied Reply chunk in the RPC Reply message to reflect the actual
   amount of data that is being returned in the Reply chunk.

4.4.  Memory Registration

   The cost of registering and invalidating memory can be a significant
   proportion of the cost of an RPC-over-RDMA transaction.  Thus, an
   important implementation consideration is how to minimize
   registration activity without exposing system memory needlessly.

4.4.1.  Registration Longevity

   Data transferred via RDMA Read and Write can reside in a memory
   allocation not in the control of the RPC-over-RDMA transport.  These
   memory allocations can persist outside the bounds of an RPC
   transaction.  They are registered and invalidated as needed, as part
   of each RPC transaction.

   The Requester endpoint must ensure that memory regions associated
   with each RPC transaction are protected from Responder access before
   allowing upper-layer access to the data contained in them.  Moreover,
   the Requester must not access these memory regions while the
   Responder has access to them.
   This includes memory regions that are associated with canceled RPCs.
   A Responder cannot know that the Requester is no longer waiting for a
   reply, and it might proceed to read or even update memory that the
   Requester might have released for other use.

4.4.2.  Communicating DDP-Eligibility

   The interface by which a ULP implementation communicates the
   eligibility of a data item to its local RPC-over-RDMA endpoint is
   not described by this specification.

   Depending on the implementation and constraints imposed by ULBs, it
   is possible to implement reduction transparently to upper layers.
   Such implementations may lead to inefficiencies, either because they
   require the RPC layer to perform expensive registration and
   invalidation of memory "on the fly", or they may require using RDMA
   chunks in RPC Reply messages, along with the resulting additional
   handshaking with the RPC-over-RDMA peer.

   However, these issues are internal and generally confined to the
   local interface between RPC and its upper layers, one in which
   implementations are free to innovate.  The only requirement, beyond
   constraints imposed by the ULB, is that the resulting RPC-over-RDMA
   protocol sent to the peer be valid for the upper layer.

4.4.3.  Registration Strategies

   The choice of which memory registration strategies to employ is left
   to Requester and Responder implementers.  To support the widest array
   of RDMA implementations, as well as the most general steering tag
   scheme, an Offset field is included in each RDMA segment.

   While zero-based offset schemes are available in many RDMA
   implementations, their use by RPC requires individual registration of
   each memory region.  For such implementations, this can be a
   significant overhead.  By providing an offset in each chunk, many
   pre-registration or region-based registrations can be readily
   supported.

4.5.  Error Handling

   A receiver performs basic validity checks on the RPC-over-RDMA header
   and chunk contents before it passes the RPC message to the RPC layer.
   If an incoming RPC-over-RDMA message is not as long as a minimal size
   RPC-over-RDMA header (28 bytes), the receiver cannot trust the value
   of the XID field; therefore, it MUST silently discard the message
   before performing any parsing.  If other errors are detected in the
   RPC-over-RDMA header of an RPC Call message, a Responder MUST send an
   RDMA_ERROR message back to the Requester.  If errors are detected in
   the RPC-over-RDMA header of an RPC Reply message, a Requester MUST
   silently discard the message.
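
   The 28-byte minimum corresponds to the four fixed header words plus
   three "not present" markers for the chunk lists (16 + 12 bytes; see
   Section 4.7).  A receiver-side sketch of the initial check, with an
   illustrative function name:

```python
import struct

MIN_HEADER = 28  # four fixed words + three empty chunk-list markers

def check_header(msg: bytes):
    """Return the fixed header fields, or None when the message is
    shorter than a minimal header and must be silently discarded:
    the XID cannot be trusted, so no error reply is possible."""
    if len(msg) < MIN_HEADER:
        return None
    xid, vers, credit = struct.unpack(">III", msg[:12])
    return {"xid": xid, "vers": vers, "credit": credit}
```

   Only after this check can the receiver decide between replying with
   RDMA_ERROR (on a Call) or discarding the message (on a Reply).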

   To form an RDMA_ERROR procedure:

   o  The rdma_xid field MUST contain the same XID that was in the
      rdma_xid field in the failing request;

   o  The rdma_vers field MUST contain the same version that was in the
      rdma_vers field in the failing request;

   o  The rdma_proc field MUST contain the value RDMA_ERROR; and

   o  The rdma_err field contains a value that reflects the type of
      error that occurred, as described below.

   An RDMA_ERROR procedure indicates a permanent error.  Receipt of this
   procedure completes the RPC transaction associated with XID in the
   rdma_xid field.  A receiver MUST silently discard an RDMA_ERROR
   procedure that it cannot decode.
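
   The rules above can be sketched as an encoder.  Names are
   illustrative; the credit value is determined as described in
   Section 3.3 and appears here as a plain parameter:

```python
import struct

RDMA_ERROR = 4
ERR_VERS, ERR_CHUNK = 1, 2

def encode_rdma_error(xid: int, vers: int, credit: int, err: int,
                      vers_low: int = 0, vers_high: int = 0) -> bytes:
    """Echo the failing request's rdma_xid and rdma_vers, set
    rdma_proc to RDMA_ERROR, then encode the rpc_rdma_error union:
    ERR_VERS carries the supported version range; ERR_CHUNK is void."""
    msg = struct.pack(">IIII", xid, vers, credit, RDMA_ERROR)
    msg += struct.pack(">I", err)
    if err == ERR_VERS:
        msg += struct.pack(">II", vers_low, vers_high)
    return msg
```

   Note that the ERR_CHUNK arm adds nothing after the discriminator,
   exactly as the void arm in the XDR union of Section 4.1.2 implies.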

4.5.1.  Header Version Mismatch

   When a Responder detects an RPC-over-RDMA header version that it does
   not support (currently this document defines only version 1), it MUST
   reply with an RDMA_ERROR procedure and set the rdma_err value to
   ERR_VERS, also providing the low and high inclusive version numbers
   it does, in fact, support.

4.5.2.  XDR Errors

   A receiver might encounter an XDR parsing error that prevents it from
   processing the incoming Transport stream.  Examples of such errors
   include an invalid value in the rdma_proc field; an RDMA_NOMSG
   message where the Read list, Write list, and Reply chunk are marked
   not present; or the value of the rdma_xid field does not match the
   value of the XID field in the accompanying RPC message.  If the
   rdma_vers field contains a recognized value, but an XDR parsing error
   occurs, the Responder MUST reply with an RDMA_ERROR procedure and set
   the rdma_err value to ERR_CHUNK.

   When a Responder receives a valid RPC-over-RDMA header but the
   Responder's ULP implementation cannot parse the RPC arguments in the
   RPC Call message, the Responder SHOULD return an RPC Reply message
   with status GARBAGE_ARGS, using an RDMA_MSG procedure.  This type of
   parsing failure might be due to mismatches between chunk sizes or
   offsets and the contents of the Payload stream, for example.
4.5.3.  Responder RDMA Operational Errors

   In RPC-over-RDMA version 1, the Responder initiates RDMA Read and
   Write operations that target the Requester's memory.  Problems might
   arise as the Responder attempts to use Requester-provided resources
   for RDMA operations.  For example:

   o  Usually, chunks can be validated only by using their contents to
      perform data transfers.  If chunk contents are invalid (e.g., a
      memory region is no longer registered or a chunk length exceeds
      the end of the registered memory region), a Remote Access Error
      occurs.

   o  If a Requester's Receive buffer is too small, the Responder's Send
      operation completes with a Local Length Error.

   o  If the Requester-provided Reply chunk is too small to accommodate
      a large RPC Reply message, a Remote Access Error occurs.  A
      Responder might detect this problem before attempting to write
      past the end of the Reply chunk.

   RDMA operational errors are typically fatal to the connection.  To
   avoid a retransmission loop and repeated connection loss that
   deadlocks the connection, once the Requester has re-established a
   connection, the Responder should send an RDMA_ERROR reply with an
   rdma_err value of ERR_CHUNK to indicate that no RPC-level reply is
   possible for that XID.

4.5.4.  Other Operational Errors

   While a Requester is constructing an RPC Call message, an
   unrecoverable problem might occur that prevents the Requester from
   posting further RDMA Work Requests on behalf of that message.  As
   with other transports, if a Requester is unable to construct and
   transmit an RPC Call message, the associated RPC transaction fails.

   After a Requester has received a reply, if it is unable to invalidate
   a memory region due to an unrecoverable problem, the Requester MUST
   close the connection to protect that memory from Responder access
   before the associated RPC transaction is complete.

   While a Responder is constructing an RPC Reply message or error
   message, an unrecoverable problem might occur that prevents the
   Responder from posting further RDMA Work Requests on behalf of that
   message.  If a Responder is unable to construct and transmit an RPC
   Reply or RPC-over-RDMA error message, the Responder MUST close the
   connection to signal to the Requester that a reply was lost.
4.5.5.  RDMA Transport Errors

   The RDMA connection and physical link provide some degree of error
   detection and retransmission.  iWARP's Marker PDU Aligned (MPA) layer
   (when used over TCP), the Stream Control Transmission Protocol
   (SCTP), as well as the InfiniBand [IBARCH] link layer all provide
   Cyclic Redundancy Check (CRC) protection of the RDMA payload, and
   CRC-class protection is a general attribute of such transports.

   Additionally, the RPC layer itself can accept errors from the
   transport and recover via retransmission.  RPC recovery can handle
   complete loss and re-establishment of a transport connection.

   The details of reporting and recovery from RDMA link-layer errors are
   described in specific link-layer APIs and operational specifications
   and are outside the scope of this protocol specification.  See
   Section 8 for further discussion of the use of RPC-level integrity
   schemes to detect errors.

4.6.  Protocol Elements No Longer Supported

   The following protocol elements are no longer supported in RPC-over-
   RDMA version 1.  Related enum values and structure definitions remain
   in the RPC-over-RDMA version 1 protocol for backwards compatibility.

4.6.1.  RDMA_MSGP

   The specification of RDMA_MSGP in Section 3.9 of [RFC5666] is
   incomplete.  To fully specify RDMA_MSGP would require:

   o  Updating the definition of DDP-eligibility to include data items
      that may be transferred, with padding, via RDMA_MSGP procedures

   o  Adding full operational descriptions of the alignment and
      threshold fields

   o  Discussing how alignment preferences are communicated between two
      peers without using CCP

   o  Describing the treatment of RDMA_MSGP procedures that convey Read
      or Write chunks

   The RDMA_MSGP message type is beneficial only when the padded data
   payload is at the end of an RPC message's argument or result list.
   This is not typical for NFSv4 COMPOUND RPCs, which often include a
   GETATTR operation as the final element of the compound operation.
   Without a full specification of RDMA_MSGP, there has been no fully
   implemented prototype of it.  Without a complete prototype of
   RDMA_MSGP support, it is difficult to assess whether this protocol
   element has benefit or can even be made to work interoperably.

   Therefore, senders MUST NOT send RDMA_MSGP procedures.  When
   receiving an RDMA_MSGP procedure, Responders SHOULD reply with an
   RDMA_ERROR procedure, setting the rdma_err field to ERR_CHUNK;
   Requesters MUST silently discard the message.

4.6.2.  RDMA_DONE

   Because no implementation of RPC-over-RDMA version 1 uses the Read-
   Read transfer model, there is never a need to send an RDMA_DONE
   procedure.

   Therefore, senders MUST NOT send RDMA_DONE messages.  Receivers MUST
   silently discard RDMA_DONE messages.

4.7.  XDR Examples

   RPC-over-RDMA chunk lists are complex data types.  In this section,
   illustrations are provided to help readers grasp how chunk lists are
   represented inside an RPC-over-RDMA header.

   A plain segment is the simplest component, being made up of a 32-bit
   handle (H), a 32-bit length (L), and 64 bits of offset (OO).  Once
   flattened into an XDR stream, plain segments appear as

      HLOO

   An RDMA read segment has an additional 32-bit position field (P).
   RDMA read segments appear as

      PHLOO

   A Read chunk is a list of RDMA read segments.  Each RDMA read segment
   is preceded by a 32-bit word containing a one if a segment follows or
   a zero if there are no more segments in the list.  In XDR form, this
   would look like

      1 PHLOO 1 PHLOO 1 PHLOO 0

   where P would hold the same value for each RDMA read segment
   belonging to the same Read chunk.
   The Read list is also a list of RDMA read segments.  In XDR form,
   this would look like a Read chunk, except that the P values could
   vary across the list.  An empty Read list is encoded as a single
   32-bit zero.

   One Write chunk is a counted array of plain segments.  In XDR form,
   the count would appear as the first 32-bit word, followed by an HLOO
   for each element of the array.  For instance, a Write chunk with
   three elements would look like

      3 HLOO HLOO HLOO

   The Write list is a list of counted arrays.  In XDR form, this is a
   combination of optional-data and counted arrays.  To represent a
   Write list containing a Write chunk with three segments and a Write
   chunk with two segments, XDR would encode

      1 3 HLOO HLOO HLOO 1 2 HLOO HLOO 0

   An empty Write list is encoded as a single 32-bit zero.

   The Reply chunk is a Write chunk.  However, since it is an optional-
   data field, there is a 32-bit field in front of it that contains a
   one if the Reply chunk is present or a zero if it is not.  After
   encoding, a Reply chunk with two segments would look like

      1 2 HLOO HLOO

   Frequently, a Requester does not provide any chunks.  In that case,
   after the four fixed fields in the RPC-over-RDMA header, there are
   simply three 32-bit fields that contain zero.
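
   The layouts above can be reproduced mechanically.  A sketch of
   encoders for the Read list and Write list follows; the function
   names are illustrative, and segments are (handle, length, offset)
   triples:

```python
import struct

def encode_segment(handle: int, length: int, offset: int) -> bytes:
    """A plain segment flattens to HLOO: two 32-bit words and one
    64-bit offset, all big-endian."""
    return struct.pack(">IIQ", handle, length, offset)

def encode_read_list(entries) -> bytes:
    """Each (position, segment) entry is preceded by a 32-bit one
    ('an entry follows'); a final 32-bit zero ends the list, so an
    empty Read list is a single zero word."""
    out = b""
    for position, (h, l, o) in entries:
        out += struct.pack(">II", 1, position) + encode_segment(h, l, o)
    return out + struct.pack(">I", 0)

def encode_write_list(chunks) -> bytes:
    """A list of counted arrays: each Write chunk is a 32-bit one, a
    32-bit segment count, then the segments; a final zero ends it."""
    out = b""
    for chunk in chunks:
        out += struct.pack(">II", 1, len(chunk))
        for h, l, o in chunk:
            out += encode_segment(h, l, o)
    return out + struct.pack(">I", 0)
```

   Encoding a Write list of one three-segment chunk and one two-segment
   chunk reproduces the "1 3 HLOO HLOO HLOO 1 2 HLOO HLOO 0" layout
   shown above.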

5.  RPC Bind Parameters

   In setting up a new RDMA connection, the first action by a Requester
   is to obtain a transport address for the Responder.  The means used
   to obtain this address, and to open an RDMA connection, is dependent
   on the type of RDMA transport and is the responsibility of each RPC
   protocol binding and its local implementation.

   RPC services normally register with a portmap or rpcbind service
   [RFC1833], which associates an RPC Program number with a service
   address.  This policy is no different with RDMA transports.  However,
   a different and distinct service address (port number) might
   sometimes be required for ULP operation with RPC-over-RDMA.
   When mapped atop the iWARP transport [RFC5040] [RFC5041], which uses
   IP port addressing due to its layering on TCP and/or SCTP, port
   mapping is trivial and consists merely of issuing the port in the
   connection process.  The NFS/RDMA protocol service address has been
   assigned port 20049 by IANA, for both iWARP/TCP and iWARP/SCTP

   When mapped atop InfiniBand [IBARCH], which uses a service endpoint
   naming scheme based on a Group Identifier (GID), a translation MUST
   be employed.  One such translation is described in Annexes A3
   (Application Specific Identifiers), A4 (Sockets Direct Protocol
   (SDP)), and A11 (RDMA IP CM Service) of [IBARCH], which is
   appropriate for translating IP port addressing to the InfiniBand
   network.  Therefore, in this case, IP port addressing may be readily
   employed by the upper layer.

   When a mapping standard or convention exists for IP ports on an RDMA
   interconnect, there are several possibilities for each upper layer to

   o  One possibility is to have the Responder register its mapped IP
      port with the rpcbind service under the netid (or netids) defined
      here.  An RPC-over-RDMA-aware Requester can then resolve its
      desired service to a mappable port and proceed to connect.  This
      is the most flexible and compatible approach, for those upper
      layers that are defined to use the rpcbind service.

   o  A second possibility is to have the Responder's portmapper
      register itself on the RDMA interconnect at a "well-known" service
      address (on UDP or TCP, this corresponds to port 111).  A
      Requester could connect to this service address and use the
      portmap protocol to obtain a service address in response to a
      program number, e.g., an iWARP port number or an InfiniBand GID.

   o  Alternately, the Requester could simply connect to the mapped
      well-known port for the service itself, if it is appropriately
      defined.  By convention, the NFS/RDMA service, when operating atop
      such an InfiniBand fabric, uses the same 20049 assignment as for
      iWARP.

   Historically, different RPC protocols have taken different approaches
   to their port assignment.  Therefore, the specific method is left to
   each RPC-over-RDMA-enabled ULB and is not addressed in this document.
   In Section 9, this specification defines two new netid values, to be
   used for registration of upper layers atop iWARP [RFC5040] [RFC5041]
   and (when a suitable port translation service is available)
   InfiniBand [IBARCH].  Additional RDMA-capable networks MAY define
   their own netids, or if they provide a port translation, they MAY
   share the one defined in this document.

6.  ULB Specifications

   A ULP is typically defined independently of any particular RPC
   transport.  An Upper-Layer Binding (ULB) specification provides
   guidance that helps the ULP interoperate correctly and efficiently
   over a particular transport.  For RPC-over-RDMA version 1, a ULB may
   provide:

   o  A taxonomy of XDR data items that are eligible for DDP

   o  Constraints on which upper-layer procedures may be reduced and on
      how many chunks may appear in a single RPC request

   o  A method for determining the maximum size of the reply Payload
      stream for all procedures in the ULP

   o  An rpcbind port assignment for operation of the RPC Program and
      Version on an RPC-over-RDMA transport

   Each RPC Program and Version tuple that utilizes RPC-over-RDMA
   version 1 needs to have a ULB specification.

6.1.  DDP-Eligibility

   A ULB designates some XDR data items as eligible for DDP.  As an
   RPC-over-RDMA message is formed, DDP-eligible data items can be
   removed from the Payload stream and placed directly in the receiver's
   memory.

   An XDR data item should be considered for DDP-eligibility if there is
   a clear benefit to moving the contents of the item directly from the
   sender's memory to the receiver's memory.  Criteria for DDP-
   eligibility include:

   o  The XDR data item is frequently sent or received, and its size is
      often much larger than typical inline thresholds.

   o  If the XDR data item is a result, its maximum size must be
      predictable in advance by the Requester.
   o  Transport-level processing of the XDR data item is not needed.
      For example, the data item is an opaque byte array, which requires
      no XDR encoding and decoding of its content.

   o  The content of the XDR data item is sensitive to address
      alignment.  For example, a data copy operation would be required
      on the receiver to enable the message to be parsed correctly, or
      to enable the data item to be accessed.

   o  The XDR data item does not contain DDP-eligible data items.

   In addition to defining the set of data items that are DDP-eligible,
   a ULB may also limit the use of chunks to particular upper-layer
   procedures.  If more than one data item in a procedure is DDP-
   eligible, the ULB may also limit the number of chunks that a
   Requester can provide for a particular upper-layer procedure.

   Senders MUST NOT reduce data items that are not DDP-eligible.  Such
   data items MAY, however, be moved as part of a Position Zero Read
   chunk or a Reply chunk.
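
   The sender-side rule above can be sketched as a simple split: only
   DDP-eligible items may be reduced (moved by separate RDMA
   operations); everything else stays in the Payload stream, which may
   itself later travel in a Position Zero Read chunk or a Reply chunk.
   The data-item representation and function name are illustrative
   assumptions, not part of the protocol.

```python
def plan_reduction(data_items, ddp_eligible):
    """Split (name, size) data items into those moved via explicit
    chunks and those that MUST remain in the XDR Payload stream."""
    reduced, payload = [], []
    for name, size in data_items:
        if name in ddp_eligible:
            reduced.append((name, size))   # moved by a separate RDMA operation
        else:
            payload.append((name, size))   # stays in the Payload stream
    return reduced, payload
```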

   The programming interface by which an upper-layer implementation
   indicates the DDP-eligibility of a data item to the RPC transport is
   not described by this specification.  The only requirements are that
   the receiver can re-assemble the transmitted RPC-over-RDMA message
   into a valid XDR stream, and that DDP-eligibility rules specified by
   the ULB are respected.

   There is no provision to express DDP-eligibility within the XDR
   language.  The only definitive specification of DDP-eligibility is a
   ULB.

   In general, a DDP-eligibility violation occurs when:

   o  A Requester reduces a non-DDP-eligible argument data item.  The
      Responder MUST NOT process this RPC Call message and MUST report
      the violation as described in Section 4.5.2.

   o  A Responder reduces a non-DDP-eligible result data item.  The
      Requester MUST terminate the pending RPC transaction and report an
      appropriate permanent error to the RPC consumer.

   o  A Responder does not reduce a DDP-eligible result data item into
      an available Write chunk.  The Requester MUST terminate the
      pending RPC transaction and report an appropriate permanent error
      to the RPC consumer.
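
   The two Requester-side cases above (the second and third bullets)
   can be sketched as a check applied to a received reply; the item
   representation, parameter names, and return values are assumptions
   for illustration only.

```python
def check_reply_reduction(result_items, ddp_eligible, write_chunk_available):
    """Check a received reply's (name, was_reduced) result items.
    Returns None if the reply is acceptable, or a reason string if the
    pending RPC transaction must be terminated with a permanent error
    reported to the RPC consumer."""
    for name, was_reduced in result_items:
        if was_reduced and name not in ddp_eligible:
            return "non-DDP-eligible result was reduced"
        if (not was_reduced and name in ddp_eligible
                and write_chunk_available):
            return "DDP-eligible result not reduced into available Write chunk"
    return None
```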
6.2.  Maximum Reply Size

   A Requester provides resources for both an RPC Call message and its
   matching RPC Reply message.  A Requester forms the RPC Call message
   itself; thus, the Requester can compute the exact resources needed.

   A Requester must allocate resources for the RPC Reply message (an
   RPC-over-RDMA credit, a Receive buffer, and possibly a Write list and
   Reply chunk) before the Responder has formed the actual reply.  To
   accommodate all possible replies for the procedure in the RPC Call
   message, a Requester must allocate reply resources based on the
   maximum possible size of the expected RPC Reply message.

   If there are procedures in the ULP for which there is no clear reply
   size maximum, the ULB needs to specify a dependable means for
   determining the maximum.
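
   The allocation decision described above can be sketched as follows:
   before sending a Call, the Requester sizes its Receive buffer and
   decides whether a Reply chunk must be registered, based on the
   maximum possible reply size.  The function and field names are
   illustrative assumptions.

```python
def plan_reply_resources(max_reply_size, inline_threshold):
    """Return a reply-resource plan.  A reply that can never exceed the
    inline threshold needs only a Receive buffer; a potentially larger
    reply also needs a Reply chunk registered in advance, because the
    Responder's actual reply size is unknown at allocation time."""
    if max_reply_size <= inline_threshold:
        return {"receive_buffer": inline_threshold, "reply_chunk": None}
    return {"receive_buffer": inline_threshold,
            "reply_chunk": max_reply_size}
```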

6.3.  Additional Considerations

   There may be other details provided in a ULB.

   o  A ULB may recommend inline threshold values or other transport-
      related parameters for RPC-over-RDMA version 1 connections bearing
      that ULP.

   o  A ULP may provide a means to communicate these transport-related
      parameters between peers.  Note that RPC-over-RDMA version 1 does
      not specify any mechanism for changing any transport-related
      parameter after a connection has been established.

   o  Multiple ULPs may share a single RPC-over-RDMA version 1
      connection when their ULBs allow the use of RPC-over-RDMA version
      1 and the rpcbind port assignments for the Protocols allow
      connection sharing.  In this case, the same transport parameters
      (such as inline threshold) apply to all Protocols using that
      connection.

   Each ULB needs to be designed to allow correct interoperation without
   regard to the transport parameters actually in use.  Furthermore,
   implementations of ULPs must be designed to interoperate correctly
   regardless of the connection parameters in effect on a connection.

6.4.  ULP Extensions

   An RPC Program and Version tuple may be extensible.  For instance,
   there may be a minor versioning scheme that is not reflected in the
   RPC version number, or the ULP may allow additional features to be
   specified after the original RPC Program specification was ratified.
   ULBs are provided for interoperable RPC Programs and Versions by
   extending existing ULBs to reflect the changes made necessary by each
   addition to the existing XDR.
