
RFC 7609

IBM's Shared Memory Communications over RDMA (SMC-R) Protocol

Pages: 143
Informational
Part 6 of 6 – Pages 129 to 143

Appendix B. Socket API Considerations

A key design goal for SMC-R is to require no application changes for exploitation. It is confined to socket applications using stream (i.e., TCP) sockets over IPv4 or IPv6. Because the switch to the SMC-R protocol occurs after a TCP connection is established, no changes are required in a socket address family or in the IP addresses and ports that the socket applications are using. Existing socket APIs that allow applications to retrieve local and remote socket address structures for an established TCP connection (for example, getsockname() and getpeername()) will continue to function as they have before. Existing DNS setup and APIs for resolving hostnames to IP addresses and vice versa also continue to function without any changes. In general, all of the usual socket APIs that are used for TCP communications (send APIs, recv APIs, etc.) will continue to function as they do today, even if SMC-R is used as the underlying protocol.
   Each SMC-R-enabled implementation does, however, need to pay special
   attention to any socket APIs that have a reliance on the underlying
   TCP and IP protocols and also ensure that their behavior in an SMC-R
   environment is reasonable and minimizes impact on the application.
   While the basic socket API set is fairly similar across different
   operating systems, there is more variability when it comes to
   advanced socket API options.  Each implementation needs to perform a
   detailed analysis of its API options, any possible impact that SMC-R
   may have, and any resultant implications.  As part of that step, a
   discussion or review with other implementations supporting SMC-R
   would be useful to ensure consistent implementation.
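
The address-retrieval calls mentioned above can be exercised with a short sketch. It uses an ordinary loopback TCP connection and Python's standard socket module; nothing in it is specific to SMC-R, which is the point: an application written this way would not need to change if SMC-R were used underneath. The function name loopback_address_views is a hypothetical helper introduced for illustration.

```python
import socket

def loopback_address_views():
    """Open a TCP connection on loopback and collect each end's address view."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
        listener.listen(1)
        with socket.create_connection(listener.getsockname()) as client:
            server, _ = listener.accept()
            with server:
                return {
                    "client_local": client.getsockname(),
                    "client_remote": client.getpeername(),
                    "server_local": server.getsockname(),
                    "server_remote": server.getpeername(),
                }

views = loopback_address_views()
```

Each end's local address is the other end's remote address, exactly as for any TCP connection.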

B.1. setsockopt() / getsockopt() Considerations

These APIs allow socket applications to manipulate socket, transport (TCP/UDP), and IP-level options associated with a given socket. Typically, a platform restricts the number of IP options available to stream (TCP) socket applications, given their connection-oriented nature. The general guideline here is to continue processing these APIs in a manner that allows for application compatibility. Some options will be relevant to the SMC-R protocol and will require special processing "under the covers". For example, the ability to manipulate TCP send and receive buffer sizes is still valid for SMC-R. However, other options may have no meaning for SMC-R. For example, if an application enabled the TCP_NODELAY socket option to disable Nagle's algorithm, it should have no real effect on SMC-R communications, as there is no notion of Nagle's algorithm with this new protocol. But the implementation must accept the TCP_NODELAY option as it does today and save it so that it can be later extracted via getsockopt() processing. Note that any TCP or IP-level options will still have an effect on any TCP/IP packets flowing for an SMC-R connection (i.e., as part of TCP/IP connection establishment and TCP/IP connection termination packet flows). Under the covers, manipulation of the TCP options will also include the SMC-layer setting, as well as reading the SMC-R experimental option before and after completion of the three-way TCP handshake.
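
As an illustration of the compatibility guideline above, the following sketch sets TCP_NODELAY and a receive buffer size on a plain TCP socket and reads both back. It shows only the application-visible contract (an accepted option can later be retrieved via getsockopt()); how an SMC-R implementation stores or applies the values internally is not shown. The helper name set_and_read_back and the buffer size of 65536 are arbitrary choices for the example.

```python
import socket

def set_and_read_back(sock, level, option, value):
    """Set a socket option, then return what getsockopt() reports for it."""
    sock.setsockopt(level, option, value)
    return sock.getsockopt(level, option)

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    # TCP_NODELAY has no effect on SMC-R data flow, but must still be accepted
    # and remembered so that a later getsockopt() returns it.
    nodelay = set_and_read_back(s, socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    # Buffer sizes remain meaningful under SMC-R. The value reported back is
    # platform dependent and may differ from the value requested.
    rcvbuf = set_and_read_back(s, socket.SOL_SOCKET, socket.SO_RCVBUF, 65536)
```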

Appendix C. Rendezvous Error Scenarios

This section discusses error scenarios for setting up and managing SMC-R links.

C.1. SMC Decline during CLC Negotiation

A peer to the SMC-R CLC negotiation can send an SMC Decline in lieu of any expected CLC message to decline SMC and force the TCP connection back to the IP fabric. There can be several reasons for an SMC Decline during the CLC negotiation, including the following:

   o  RNIC went down

   o  SMC-R forbidden by local policy

   o  subnet (IPv4) or prefix (IPv6) doesn't match

   o  lack of resources to perform SMC-R

In all cases, when an SMC Decline is sent in lieu of an expected CLC message, no confirmation is required, and the TCP connection immediately falls back to using the IP fabric.

To prevent ambiguity between CLC messages and application data, an SMC Decline cannot "chase" another CLC message. An SMC Decline can only be sent in lieu of an expected CLC message. For example, if the client sends an SMC Proposal and then its RNIC goes down, it must wait for the SMC Accept from the server and then reply to the SMC Accept with an SMC Decline.

This "no chase" rule means that if this TCP connection is not a first contact between RoCE peers, a server cannot send an SMC Decline after sending an SMC Accept -- it can only either break the TCP connection or fail over if a problem arises in the RoCE fabric after it has sent the SMC Accept. Similarly, once the client sends an SMC Confirm on a TCP connection that isn't a first contact, it is committed to SMC-R for this TCP connection and cannot fall back to IP.
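
The "no chase" rule can be sketched as a small turn-taking check. This is a minimal model under stated assumptions: it covers only the three-message CLC exchange (Proposal, Accept, Confirm), it ignores the additional first contact window during LLC negotiation, and the names Clc, next_expected_send, and may_send_decline are hypothetical.

```python
from enum import Enum

class Clc(Enum):
    PROPOSAL = "SMC Proposal"
    ACCEPT = "SMC Accept"
    CONFIRM = "SMC Confirm"

def next_expected_send(role, sent, received):
    """Return the CLC message this role is due to send now, or None if it must wait."""
    if role == "client":
        if not sent:
            return Clc.PROPOSAL
        if sent == [Clc.PROPOSAL] and received == [Clc.ACCEPT]:
            return Clc.CONFIRM
    elif role == "server":
        if not sent and received == [Clc.PROPOSAL]:
            return Clc.ACCEPT
    return None

def may_send_decline(role, sent, received):
    """An SMC Decline may only take the place of the next expected CLC message."""
    return next_expected_send(role, sent, received) is not None
```

For example, a client whose RNIC fails after it has sent the SMC Proposal gets False until the SMC Accept arrives, and True afterwards.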

C.2. SMC Decline during LLC Negotiation

For a TCP connection that represents a first contact between RoCE peers, it is possible for SMC to fall back to IP during the LLC negotiation. This is possible until the first contact SMC-R link is confirmed. For example, see Figure 42. After a first contact SMC-R link is confirmed, fallback to IP is no longer possible. This translates to the following rule: a first contact peer can send an SMC Decline at any time during LLC negotiation until it has successfully sent its CONFIRM LINK (request or response) flow. After that point, it cannot fall back to IP.

       Host X -- Server                           Host Y -- Client
    +-------------------+                      +-------------------+
    | Peer ID = PS1     |                      |   Peer ID = PC1   |
    |            +------+                      +------+            |
    |       QP 8 |RNIC 1|    SMC-R Link 1      |RNIC 2|  QP 64     |
    | RKey X |   |MAC MA|<-------------------->|MAC MB|   |        |
    |        |   |GID GA|   attempted setup    |GID GB|   | RKey Y2|
    |       \/   +------+                      +------+  \/        |
    |+--------+         |                      |        +--------+ |
    || RMB    |         |                      |        | RMB    | |
    |+--------+         |                      |        +--------+ |
    |       /\   +------+                      +------+  /\        |
    |        |   |RNIC 3|                      |RNIC 4|   | RKey W2|
    |        |   |MAC MC|                      |MAC MD|   |        |
    |       QP 9 |GID GC|                      |GID GD|  QP 65     |
    |            +------+                      +------+            |
    +-------------------+                      +-------------------+

          SYN / SYN-ACK / ACK TCP three-way handshake with TCP option
         <--------------------------------------------------------->

            SMC Proposal / SMC Accept / SMC Confirm exchange
         <-------------------------------------------------------->

           CONFIRM LINK(request, Link 1)
         .........................................................>

                           CONFIRM LINK(response, Link 1)
                              X...................................
                                :
                                : RoCE write failure
                                :.................................>

           SMC Decline(PC1, reason code)
          <--------------------------------------------------------

              Connection data flows over IP fabric
          <------------------------------------------------------->

                          Legend:
                   ------------   TCP/IP and CLC flows
                   ............   RoCE (LLC) flows

               Figure 42: SMC Decline during LLC Negotiation

C.3. The SMC Decline Window

Because SMC-R does not support fallback to IP for a TCP connection that is already using RDMA, there are specific rules on when the SMC Decline CLC message, which signals a fallback to IP because of an error or problem with the RoCE fabric, can be sent during TCP connection setup. There is a "point of no return" after which a connection cannot fall back to IP, and RoCE errors that occur after this point require the connection to be broken with a RST flow in the IP fabric.

For a first contact, that point of no return is after the ADD LINK LLC message has been successfully sent for the second SMC-R link. Specifically, the server cannot fall back to IP after receiving either (1) a positive write completion indication for the ADD LINK request or (2) the ADD LINK response from the client, whichever comes first. The client cannot fall back to IP after sending a negative ADD LINK response, receiving a positive write complete on a positive ADD LINK response, or receiving a CONFIRM LINK for the second SMC-R link from the server, whichever comes first.

For a subsequent contact, that point of no return is after the last send of the CLC negotiation completes. This, in combination with the rule that error "chasers" are not allowed during CLC negotiation, means that the server cannot send an SMC Decline after sending an SMC Accept, and the client cannot send an SMC Decline after sending an SMC Confirm.
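
The point-of-no-return rules in this section can be written down as a lookup of "committing" events per role. This is an illustrative sketch only; the event names and the function can_fall_back are hypothetical, and LAST_CLC_SEND_COMPLETE stands for the peer's own final CLC send (SMC Accept for the server, SMC Confirm for the client).

```python
from enum import Enum, auto

class Event(Enum):
    ADD_LINK_REQUEST_WRITE_COMPLETE = auto()            # server, first contact
    ADD_LINK_RESPONSE_RECEIVED = auto()                 # server, first contact
    NEGATIVE_ADD_LINK_RESPONSE_SENT = auto()            # client, first contact
    POSITIVE_ADD_LINK_RESPONSE_WRITE_COMPLETE = auto()  # client, first contact
    CONFIRM_LINK_SECOND_LINK_RECEIVED = auto()          # client, first contact
    LAST_CLC_SEND_COMPLETE = auto()                     # either role, subsequent contact

# Events after which the given (role, first_contact) peer can no longer decline.
POINT_OF_NO_RETURN = {
    ("server", True): {Event.ADD_LINK_REQUEST_WRITE_COMPLETE,
                       Event.ADD_LINK_RESPONSE_RECEIVED},
    ("client", True): {Event.NEGATIVE_ADD_LINK_RESPONSE_SENT,
                       Event.POSITIVE_ADD_LINK_RESPONSE_WRITE_COMPLETE,
                       Event.CONFIRM_LINK_SECOND_LINK_RECEIVED},
    ("server", False): {Event.LAST_CLC_SEND_COMPLETE},
    ("client", False): {Event.LAST_CLC_SEND_COMPLETE},
}

def can_fall_back(role, first_contact, events_seen):
    """True while none of the committing events has occurred ("whichever comes first")."""
    return not (POINT_OF_NO_RETURN[(role, first_contact)] & set(events_seen))
```

Once can_fall_back returns False, a RoCE error has to be handled by breaking the TCP connection rather than by sending an SMC Decline.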

C.4. Out-of-Sync Conditions during SMC-R Negotiation

The SMC Accept CLC message contains a first contact flag that indicates to the client whether the server believes it is setting up a new link group or using an existing link group. This flag is used to detect an out-of-sync condition between the client and the server.

The scenario for such a condition is as follows: there is a single existing SMC-R link between the peers. After the client sends the SMC Proposal CLC message, the existing SMC-R link between the client and the server fails. The client cannot chase the SMC Proposal CLC message with an SMC Decline CLC message in this case, because the client does not yet know that the server would have wanted to choose the SMC-R link that just crashed. The QP that failed recovers before the server returns its SMC Accept CLC message. This means that there is a QP but no SMC-R link.

Since the server had not yet learned of the SMC-R link failure when it sent the SMC Accept CLC message, it attempts to reuse the SMC-R link that just failed. This means that the server would not set the first contact flag, indicating to the client that the server thinks it is reusing an SMC-R link. However, the client does not have an SMC-R link that matches the server's specification. Because the first contact flag is off, the client realizes it is out of sync with the server and sends an SMC Decline to cause the connection to fall back to IP.
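
The client-side check described in this section reduces to a short decision. The sketch below is an assumption-laden model: link identifiers are represented as opaque strings, and the function name client_action_on_accept and the returned action strings are hypothetical.

```python
def client_action_on_accept(first_contact_flag, client_links, link_named_by_server):
    """Decide what the client does on receiving an SMC Accept.

    first_contact_flag   -- flag carried in the SMC Accept
    client_links         -- identifiers of SMC-R links the client currently has
    link_named_by_server -- the link the server's specification refers to
    """
    if first_contact_flag:
        return "set up new link group"
    if link_named_by_server in client_links:
        return "reuse existing link"
    # Flag is off, yet the client has no matching link: the peers are out of sync.
    return "send SMC Decline"
```

In the scenario above, the client's link set is empty when the SMC Accept arrives with the flag off, so the result is an SMC Decline.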

C.5. Timeouts during CLC Negotiation

Because the SMC-R negotiation flows as TCP data, there are built-in timeouts and retransmits at the TCP layer for individual messages. Implementations also must protect the overall TCP/CLC handshake with a timer or timers to prevent connections from hanging indefinitely due to SMC-R processing. This can be done with individual timers for individual CLC messages or an overall timer for the entire exchange, which may include the TCP handshake and the CLC handshake under one timer or separate timers. This decision is implementation dependent. If the TCP and/or CLC handshakes time out, the TCP connection must be terminated as it would be in a legacy IP environment when connection setup doesn't complete in a timely manner. Because the CLC flows are TCP messages, if they cannot be sent and received in a timely fashion, the TCP connection is not healthy and would not work if fallback to IP were attempted.
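
One of the options described above, a single overall timer for the whole exchange, can be sketched as a deadline object that is consulted before each blocking step. The class name HandshakeDeadline and the injectable clock parameter are illustrative choices, not part of the protocol; the budget value is implementation dependent.

```python
import time

class HandshakeDeadline:
    """One overall time budget covering the TCP and CLC handshakes."""

    def __init__(self, budget_seconds, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + budget_seconds

    def remaining(self):
        """Seconds left; suitable as a per-receive timeout for the next CLC message."""
        return max(0.0, self._expires_at - self._clock())

    def expired(self):
        """True once the budget is used up; the TCP connection must then be terminated."""
        return self._clock() >= self._expires_at
```

A caller might pass remaining() to its socket timeout before waiting for each CLC message and close the connection, without attempting fallback, once expired() is true.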

C.6. Protocol Errors during CLC Negotiation

Protocol errors occur during CLC negotiation when a message is received that is not expected. For example, a peer that is expecting a CLC message but instead receives application data has experienced a protocol error; this also indicates a likely software error, as the two sides are out of sync. When application data is expected, this data is not parsed to ensure that it's not a CLC message.

When a peer is expecting a CLC negotiation message, any parsing error except a bad enumerated value in that message must be treated as application data. The CLC negotiation messages are designed with beginning and ending eye catchers to help verify that a CLC negotiation message is actually the expected message. If other parsing errors in an expected CLC message occur, such as incorrect length fields or incorrectly formatted fields, the message must be treated as application data.

All protocol errors, with the exception of bad enumerated values, must result in termination of the TCP connection. No fallback to IP is allowed in the case of a protocol error, because if the protocols are out of sync, mismatched, or corrupted, then data and security integrity cannot be ensured.
   The exception to this rule is enumerated values -- for example, the
   QP MTU values on SMC Accept and SMC Confirm.  If a reserved value is
   received, the proper error response is to send an SMC Decline and
   fall back to IP; this is because the use of a reserved enumerated
   value indicates that the other partner likely has additional support
   that the receiving partner does not have.  This indicated mismatch of
   SMC-R capabilities is not an integrity problem but indicates that
   SMC-R cannot be used for this connection.
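
The two outcomes described in this section (terminate on a protocol error, decline on a reserved enumerated value) can be summarized in a small classifier. The sketch assumes that a separate parser has already produced the four boolean checks; the names Outcome and classify_expected_clc are hypothetical.

```python
from enum import Enum

class Outcome(Enum):
    PROCEED = "continue CLC negotiation"
    DECLINE = "send SMC Decline and fall back to IP"
    TERMINATE = "terminate the TCP connection"

def classify_expected_clc(eye_catchers_ok, length_ok, fields_ok, enum_values_ok):
    """Map parse results for an expected CLC message to the required reaction."""
    if not (eye_catchers_ok and length_ok and fields_ok):
        # Treated as application data arriving where a CLC message was expected:
        # a protocol error, so no fallback to IP is allowed.
        return Outcome.TERMINATE
    if not enum_values_ok:
        # Reserved enumerated value (for example, a QP MTU value): a capability
        # mismatch rather than an integrity problem.
        return Outcome.DECLINE
    return Outcome.PROCEED
```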

C.7. Timeouts during LLC Negotiation

Whenever a peer sends an LLC message to which a reply is expected, it sets a timer after the send posts to wait for the reply. An expected response may be a reply flavor of the LLC message (for example, a CONFIRM LINK reply) or a new LLC message (for example, an ADD LINK CONTINUATION expected from the server by the client if there are more RKeys to be communicated).

On LLC flows that are part of a first contact setup of a link group, the value of the timer is implementation dependent but should be long enough to allow the other peer to have a write complete timeout and 2-3 retransmits of an SMC Decline on the TCP fabric. For LLC flows that are maintaining the link group and are not part of a first contact setup of a link group, the timers may be shorter. Upon receipt of an expected reply, the timer is cancelled. If a timer pops without a reply having been received, the sender must initiate a recovery action.

During first contact processing, failure of an LLC verification timer is a "should-not-occur" that indicates a problem with one of the endpoints; this is because if there is a "routine" failure in the RoCE fabric that causes an LLC verification send to fail, the sender will get a write completion failure and will then send an SMC Decline to the partner. The only time an LLC verification timer will expire on a first contact is when the sender thinks the send succeeded but it actually didn't. Because of the reliably connected nature of QP connections on the RoCE fabric, this indicates a problem with one of the peers, not with the RoCE fabric.

After the reliably connected queue pair for the first SMC-R link in a link group is set up on initial contact, the client sets a timer to wait for a RoCE verification message from the server that the QP is actually connected and usable. If the server experiences a failure sending its QP confirmation message, it will send an SMC Decline, which should arrive at the client before the client's verification timer expires. If the client's timer expires without receiving either an SMC Decline or a RoCE message confirmation from the server, there is a problem with either the server or the TCP fabric. In either case, the client must break the TCP connection and clean up the SMC-R link.

   There are two scenarios in which the client's response to the QP
   verification message fails to reach the server.  The main difference
   is whether or not the client has successfully completed the send of
   the CONFIRM LINK response.

   In the normal case of a problem with the RoCE path, the client will
   learn of the failure by getting a write completion failure, before
   the server's timer expires.  In this case, the client sends an SMC
   Decline CLC message to the server, and the TCP connection falls back
   to IP.

   If the client's send of the confirmation message receives a positive
   return code but for some reason still does not reach the server, or
   the client's SMC Decline CLC message fails to reach the server after
   the client fails to send its RoCE confirmation message, then the
   server's timer will time out and the server must break the TCP
   connection by sending a RST.  This is expected to be a very rare
   case, because if the client cannot send its CONFIRM LINK response LLC
   message, the client should get a negative return code and initiate
   fallback to IP.  A client receiving a positive return code on a send
   that fails to reach the server should also be an extremely rare case.
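
The timer life cycle described in this section (arm after the send posts, cancel on the expected reply, run a recovery action if it pops) can be sketched as follows. The class LlcReplyTimer, its method names, and the polling style are assumptions made for the example; a real implementation would more likely hook into its own event loop or timer facility.

```python
class LlcReplyTimer:
    """Tracks one outstanding LLC message for which a reply is expected."""

    def __init__(self, timeout, on_expiry, clock):
        self._timeout = timeout       # implementation-dependent value
        self._on_expiry = on_expiry   # recovery action chosen by the caller
        self._clock = clock           # injectable so the sketch is deterministic
        self._deadline = None         # None means nothing is outstanding

    def send_posted(self):
        """Arm the timer once the send of the LLC message has been posted."""
        self._deadline = self._clock() + self._timeout

    def reply_received(self):
        """Cancel the timer when the expected reply arrives."""
        self._deadline = None

    def poll(self):
        """Run the recovery action once if the timer has popped."""
        if self._deadline is not None and self._clock() >= self._deadline:
            self._deadline = None
            self._on_expiry()
            return True
        return False
```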

C.7.1. Recovery Actions for LLC Timeouts and Failures

   The following list describes recovery actions for LLC timeouts.  A
   write completion failure or other indication of send failure for an
   LLC command is treated the same as a timeout.

   LLC message: CONFIRM LINK from server (first contact, first link in
   the link group)

      Timer waits for: CONFIRM LINK reply from client.

      Recovery action: Break the TCP connection by sending a RST, and
      clean up the link.  The server should have received an SMC Decline
      from the client by now if the client had an LLC send failure.

   LLC message: CONFIRM LINK from server (first contact, second link in
   the link group)

      Timer waits for: CONFIRM LINK reply from client.

      Recovery action: The second link was not successfully set up.
      Send a DELETE LINK to the client.  Connection data cannot flow in
      the first link in the link group, until the reply to this DELETE
      LINK is received, to prevent the peers from being out of sync on
      the state of the link group.

   LLC message: CONFIRM LINK from server (not first contact)

      Timer waits for: CONFIRM LINK reply from client.

      Recovery action: Clean up the new link, and set a timer to retry.
      Send a DELETE LINK to the client, in case the client has a longer
      timer interval, so the client can stop waiting.

   LLC message: CONFIRM LINK reply from client (first contact)

      Timer waits for: ADD LINK from server.

      Recovery action: Clean up the SMC-R link, and break the TCP
      connection by sending a RST over the IP fabric.  There is a
      problem with the server.  If the server had a send failure, it
      should have sent an SMC Decline by now.

   LLC message: ADD LINK from server (first contact)

      Timer waits for: ADD LINK reply from client.

      Recovery action: Break the TCP connection with a RST, and clean up
      RoCE resources.  The connection is past the point where the server
      can fall back to IP, and if the client had a send problem it
      should have sent an SMC Decline by now.

   LLC message: ADD LINK from server (not first contact)

      Timer waits for: ADD LINK reply from client.

      Recovery action: Clean up resources (QP, RKeys, etc.) for the new
      link, and treat the link over which the ADD LINK was sent as if it
      had failed.  If there is another link available to resend the
      ADD LINK and the link group still needs another link, retry the
      ADD LINK over another link in the link group.

   LLC message: ADD LINK reply from client (and there are more RKeys to
   be communicated)

      Timer waits for: ADD LINK CONTINUATION from server.

      Recovery action: Treat the same as ADD LINK timer failure.
   LLC message: ADD LINK reply or ADD LINK CONTINUATION reply from
   client (and there are no more RKeys to be communicated, for the
   second link in a first contact scenario)

      Timer waits for: CONFIRM LINK from the server, over the new link.

      Recovery action: The setup of the new link failed.  Send a
      DELETE LINK to the server.  Do not consider the socket opened to
      the client application until receiving confirmation from the
      server in the form of a DELETE LINK request for this link and
      sending the reply (to prevent the partners from being out of sync
      on the state of the link group).

      Set a timer to send another ADD LINK to the server if there is
      still an unused RNIC on the client side.

   LLC message: ADD LINK reply or ADD LINK CONTINUATION reply from
   client (and there are no more RKeys to be communicated)

      Timer waits for: CONFIRM LINK from the server, over the new link.

      Recovery action: Send a DELETE LINK to the server for the new
      link, then clean up any resource allocated for the new link and
      set a timer to send an ADD LINK to the server if there is still an
      unused RNIC on the client side.  The setup of the new link failed,
      but the link over which the ADD LINK exchange occurred is
      unaffected.

   LLC message: ADD LINK CONTINUATION from server

      Timer waits for: ADD LINK CONTINUATION reply from client.

      Recovery action: Treat the same as ADD LINK timer failure.

   LLC message: ADD LINK CONTINUATION reply from client (first contact,
   and RMB count fields indicate that the server owes more ADD LINK
   CONTINUATION messages)

      Timer waits for: ADD LINK CONTINUATION from server.

      Recovery action: Clean up the SMC-R link, and break the TCP
      connection by sending a RST.  There is a problem with the server.

      If the server had a send failure, it should have sent an
      SMC Decline by now.
   LLC message: ADD LINK CONTINUATION reply from client (not first
   contact, and RMB count fields indicate that the server owes more
   ADD LINK CONTINUATION messages)

      Timer waits for: ADD LINK CONTINUATION from server.

      Recovery action: Treat as if client detected link failure on the
      link that the ADD LINK exchange is using.  Send a DELETE LINK to
      the server over another active link if one exists; otherwise,
      clean up the link group.

   LLC message: DELETE LINK from client

      Timer waits for: DELETE LINK request from server.

      Recovery action: If the scope of the request is to delete a single
      link, the surviving link over which the client sent the
      DELETE LINK is no longer usable either.  If this is the last link
      in the link group, end TCP connections over the link group by
      sending RST packets.  If there are other surviving links in the
      link group, resend over a surviving link.  Also send a DELETE LINK
      over a surviving link for the link over which the client attempted
      to send the initial DELETE LINK message.  If the scope of the
      request is to delete the entire link group, try resending on other
      links in the link group until success is achieved.  If all sends
      fail, tear down the link group and any TCP connections that exist
      on it.

   LLC message: DELETE LINK from server (scope: entire link group)

      Timer waits for: Confirmation from the adapter that the message
      was delivered.

      Recovery action: Tear down the link group and any TCP connections
      that exist on it.

   LLC message: DELETE LINK from server (scope: single link)

      Timer waits for: DELETE LINK reply from client.

      Recovery action: The link over which the server sent the
      DELETE LINK is no longer usable either.  If this is the last link
      in the link group, end TCP connections over the link group by
      sending RST packets.  If there are other surviving links in the
      link group, resend over a surviving link.  Also send a DELETE LINK
      over a surviving link for the link over which the server attempted
      to send the initial DELETE LINK message.  If the scope of the
      request is to delete the entire link group, try resending on other
      links in the link group until success is achieved.  If all sends
      fail, tear down the link group and any TCP connections that exist
      on it.

   LLC message: CONFIRM RKEY from client

      Timer waits for: CONFIRM RKEY reply from server.

      Recovery action: Perform normal client procedures for detection of
      failed link.  The link over which the message was sent has failed.

   LLC message: CONFIRM RKEY from server

      Timer waits for: CONFIRM RKEY reply from client.

      Recovery action: Perform normal server procedures for detection of
      failed link.  The link over which the message was sent has failed.

   LLC message: TEST LINK from client

      Timer waits for: TEST LINK reply from server.

      Recovery action: Perform normal client procedures for detection of
      failed link.  The link over which the message was sent has failed.

   LLC message: TEST LINK from server

      Timer waits for: TEST LINK reply from client.

      Recovery action: Perform normal server procedures for detection of
      failed link.  The link over which the message was sent has failed.
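
An implementation might encode the timeout list above as a lookup table. The sketch below covers only a subset of the entries, paraphrases the recovery actions as strings, and uses hypothetical names (RECOVERY, recovery_action); it also reflects the rule that a send failure is treated the same as a timeout.

```python
TREAT_LINK_AS_FAILED = "treat the link that carried the message as failed"

# (LLC message, sender, context) -> paraphrased recovery action. Subset only.
RECOVERY = {
    ("CONFIRM LINK", "server", "first contact, first link"):
        "break the TCP connection with a RST and clean up the link",
    ("CONFIRM LINK", "server", "first contact, second link"):
        "send DELETE LINK to the client; hold data on the first link until the reply",
    ("CONFIRM LINK", "server", "not first contact"):
        "clean up the new link, set a retry timer, send DELETE LINK to the client",
    ("CONFIRM RKEY", "client", None): TREAT_LINK_AS_FAILED,
    ("CONFIRM RKEY", "server", None): TREAT_LINK_AS_FAILED,
    ("TEST LINK", "client", None): TREAT_LINK_AS_FAILED,
    ("TEST LINK", "server", None): TREAT_LINK_AS_FAILED,
}

def recovery_action(message, sender, context=None, cause="timeout"):
    """Look up the recovery action; a send failure is handled like a timeout."""
    if cause not in ("timeout", "send failure"):
        raise ValueError(f"unknown cause: {cause!r}")
    return RECOVERY[(message, sender, context)]
```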

   The following list describes recovery actions for invalid LLC
   messages.  These could be misformatted or contain out-of-sync data.

   LLC message received: CONFIRM LINK from server

      What it indicates: Incorrect link information.

      Recovery action: Protocol error.  The link must be brought down by
      sending a DELETE LINK for the link over another link in the link
      group if one exists.  If this is a first contact, fall back to IP
      by sending an SMC Decline to the server.
   LLC message received: ADD LINK

      What it indicates: Undefined enumerated MTU value.

      Recovery action: Send a negative ADD LINK reply with reason
      code x'2'.

   LLC message received: ADD LINK reply from client

      What it indicates: Client-side link information that would result
      in a parallel link being set up.

      Recovery action: Parallel links are not permitted.  Delete the
      link by sending a DELETE LINK to the client over another link in
      the link group.

   LLC message received: Any link group command from the server, except
   DELETE LINK for the entire link group

      What it indicates: Client has sent a DELETE LINK for the link on
      which the message was received.

      Recovery action: Ignore the LLC message.  Worst case: the server
      will time out.  Best case: the DELETE LINK crosses with the
      command from the server, and the server realizes it failed.

   LLC message received: ADD LINK CONTINUATION from server or ADD LINK
   CONTINUATION reply from client

      What it indicates: Number of RMBs provided doesn't match count
      given on initial ADD LINK or ADD LINK reply message.

      Recovery action: Protocol error.  Treat as if detected link
      outage.

   LLC message received: DELETE LINK from client

      What it indicates: Link indicated doesn't exist.

      Recovery action: If the link is in the process of being cleaned
      up, assume timing window and ignore message.  Otherwise, send a
      DELETE LINK reply with reason code 1.

   LLC message received: DELETE LINK from server

      What it indicates: Link indicated doesn't exist.

      Recovery action: Send a DELETE LINK reply with reason code 1.
   LLC message received: CONFIRM RKEY from either client or server

      What it indicates: No RKey provided for one or more of the links
      in the link group.

      Recovery action: Treat as if detected failure of the link(s) for
      which no RKey was provided.

   LLC message received: DELETE RKEY

      What it indicates: Specified RKey doesn't exist.

      Recovery action: Send a negative DELETE RKEY response.

   LLC message received: TEST LINK reply

      What it indicates: User data doesn't match what was sent in the
      TEST LINK request.

      Recovery action: Treat as if detected that the link has gone down.
      This is a protocol error.

   LLC message received: Unknown LLC type with high-order bits of opcode
   equal to b'10'

      What it indicates: This is an optional LLC message that the
      receiver does not support.

      Recovery action: Ignore (silently discard) the message.

   LLC message received: Any unambiguously incorrect or out-of-sync LLC
   message

      What it indicates: Link is out of sync.

      Recovery action: Treat as if detected that the link has gone down.
      Note that an unsupported or unknown LLC opcode whose two
      high-order bits are b'10' is not an error and must be silently
      discarded.  Any other unknown or unsupported LLC opcode is an
      error.
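
The handling of unknown LLC opcodes stated above can be checked with a two-bit mask. This sketch assumes the opcode occupies a single byte, so the "two high-order bits" are bits 7 and 6; the function names are hypothetical.

```python
def is_optional_opcode(opcode):
    """True when the two high-order bits of a one-byte opcode are b'10'."""
    if not 0 <= opcode <= 0xFF:
        raise ValueError("opcode must fit in one byte")
    return (opcode >> 6) == 0b10

def handle_unknown_opcode(opcode):
    """Reaction to an LLC opcode that the receiver does not recognize."""
    if is_optional_opcode(opcode):
        return "silently discard"
    return "treat link as down"
```

Under this assumption, opcodes x'80' through x'BF' fall in the optional range.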

C.8. Failure to Add Second SMC-R Link to a Link Group

When there is any failure in setting up the second SMC-R link in an SMC-R link group, including confirmation timer expiration, the SMC-R link group is allowed to continue without available failover. However, this situation is extremely undesirable, and the server must endeavor to correct it as soon as it can.
   The server peer in the SMC-R link group must set a timer to drive it
   to retry setup of a failed additional SMC-R link.  The server will
   immediately retry the SMC-R link setup when the first of the
   following events occurs:

   o  The retry timer expires.

   o  A new RNIC becomes available to the server, on the same LAN as the
      SMC-R link group.

   o  An ADD LINK LLC request message is received from the client; this
      indicates the availability of a new RNIC on the client side.
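
The three retry triggers listed above combine with a logical OR, since the server retries on whichever occurs first. A minimal sketch, with hypothetical parameter names and the retry deadline supplied by the caller:

```python
def should_retry_second_link(now, retry_deadline,
                             new_server_rnic_on_lan, add_link_from_client):
    """True as soon as any one of the retry triggers has occurred.

    now, retry_deadline     -- timestamps from the same clock
    new_server_rnic_on_lan  -- a new RNIC became available on the link group's LAN
    add_link_from_client    -- an ADD LINK LLC request arrived from the client
    """
    timer_expired = now >= retry_deadline
    return timer_expired or new_server_rnic_on_lan or add_link_from_client
```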

Authors' Addresses

   Mike Fox
   IBM
   3039 Cornwallis Rd.
   Research Triangle Park, NC  27709
   United States

   Email: mjfox@us.ibm.com


   Constantinos (Gus) Kassimis
   IBM
   3039 Cornwallis Rd.
   Research Triangle Park, NC  27709
   United States

   Email: kassimis@us.ibm.com


   Jerry Stevens
   IBM
   3039 Cornwallis Rd.
   Research Triangle Park, NC  27709
   United States

   Email: sjerry@us.ibm.com