
RFC 7609

 
 
 

IBM's Shared Memory Communications over RDMA (SMC-R) Protocol

Part 4 of 6, p. 60 to 91

 


4.  SMC-R Memory-Sharing Architecture

4.1.  RMB Element Allocation Considerations

   Each TCP connection using SMC-R must be allocated an RMBE by each
   SMC-R peer.  Each endpoint performs this allocation independently so
   that it can select an RMBE that best matches the characteristics of
   its TCP socket endpoint.  The RMBE associated with a TCP socket
   endpoint must have a receive buffer that is at least as large as the
   TCP receive buffer size in effect for that connection.  That size is
   either specified explicitly by the application using setsockopt() or
   taken implicitly from the system-configured default value.  This
   allows the SMC-R peer to RDMA-write a full receive buffer's worth of
   data on a given data flow.  Because each RMB must have fixed-length
   RMBEs, an SMC-R endpoint may need to maintain multiple RMBs of
   various sizes for SMC-R connections on a given SMC-R link so that it
   can select the RMBE that most closely fits a connection.
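
   As a rough illustration of the selection described above, the sketch
   below picks the smallest fixed RMBE receive-buffer size that still
   covers the TCP receive buffer size.  The pool sizes and the name
   select_rmbe_size are hypothetical; this document does not prescribe
   particular sizes or a selection algorithm.

```python
from typing import Optional, Sequence


def select_rmbe_size(rcvbuf_size: int, rmbe_sizes: Sequence[int]) -> Optional[int]:
    """Return the smallest RMBE receive-buffer size >= rcvbuf_size, or None."""
    candidates = [size for size in rmbe_sizes if size >= rcvbuf_size]
    return min(candidates) if candidates else None


# Hypothetical RMB pools, identified by their fixed RMBE receive-buffer size.
POOL_SIZES = (16 * 1024, 64 * 1024, 256 * 1024)

# A socket with a 48K receive buffer is served from the 64K pool.
chosen = select_rmbe_size(48 * 1024, POOL_SIZES)
```

   A result of None would mean that no existing RMB fits and that a new
   RMB with larger elements has to be created first.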

4.2.  RMB and RMBE Format

   An RMB is a virtual memory buffer whose backing real memory is
   pinned.  The RMB is subdivided into a whole number of equal-sized RMB
   Elements (RMBEs).  Each RMBE begins with a 4-byte eye catcher for
   diagnostic and service purposes, followed by the receive data buffer.
   The contents of this diagnostic eye catcher are implementation
   dependent and should be used by the local SMC-R peer to check for
   overlay errors by verifying an intact eye catcher with every RMBE
   access.

   The RMBE is a wrapping receive buffer for receiving RDMA writes from
   the peer.  Cursors, as described below, are exchanged between peers
   to manage and track RDMA writes and local data reads from the RMBE
   for a TCP connection.

4.3.  RMBE Control Information

   RMBE control information consists of consumer cursors, producer
   cursors, wrap counts, CDC message sequence numbers, control flags
   such as urgent data and "writer blocked" indicators, and TCP
   connection information such as termination flags.  This information
   is exchanged between SMC-R peers using CDC messages, which are passed
   using RoCE SendMsg.  A TCP/IP stack implementing SMC-R must receive
   and store this information in its internal data structures, as it is
   used to manage the RMBE and its data buffer.

   The format and contents of the CDC message are described in detail in
   Appendix A.4 ("Connection Data Control (CDC) Message Format").  The
   following is a high-level description of what this control
   information contains.

   o  Connection state flags such as sending done, connection closed,
      failover data validation, and abnormal close.

   o  A sequence number that is managed by the sender.  This sequence
      number starts at 1, is increased each send, and wraps to 0.  This
      sequence number tracks the CDC message sent and is not related to
      the number of bytes sent.  It is used for failover data
      validation.

   o  Producer cursor: a wrapping offset into the receiver's RMBE data
      area.  Set by the peer that is writing into the RMBE, it points to
      where the writing peer will write the next byte of data into an
      RMBE.  This cursor is accompanied by a wrap sequence number to
      help the RMBE owner (the receiver) identify full window size
      wrapping writes.  Note that this cursor must account for (i.e.,
      skip over) the RMBE eye catcher that is in the beginning of the
      data area.

   o  Consumer cursor: a wrapping offset into the receiver's RMBE data
      area.  Set by the owner of the RMBE (the peer that is reading from
      it), this cursor points to the offset of the next byte of data to
      be consumed by the peer in its own RMBE.  The sender cannot write
      beyond this cursor into the receiver's RMBE without causing data
      loss.  Like the producer cursor, this is accompanied by a wrap
      count to help the writer identify full window size wrapping reads.
      Note that this cursor must account for (i.e., skip over) the RMBE
      eye catcher that is in the beginning of the data area.

   o  Data flags such as urgent data, writer blocked indicator, and
      cursor update requests.
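
   The cursor handling described above can be sketched as follows.  The
   sketch assumes that valid offsets run from 4 (the first byte after
   the eye catcher) to 4 + buf_len - 1; the Cursor type and the advance
   helper are illustrative names, not part of the protocol, and the
   finite width of the wrap field on the wire is not modeled.

```python
from dataclasses import dataclass

EYE_CATCHER_LEN = 4  # bytes reserved at the start of every RMBE


@dataclass(frozen=True)
class Cursor:
    offset: int = EYE_CATCHER_LEN  # wrapping offset into the RMBE data area
    wrap: int = 0                  # wrap sequence number


def advance(cursor: Cursor, nbytes: int, buf_len: int) -> Cursor:
    """Move a cursor forward by nbytes, wrapping back to the first data byte.

    Assumes valid offsets are EYE_CATCHER_LEN .. EYE_CATCHER_LEN + buf_len - 1.
    """
    if not 0 <= nbytes <= buf_len:
        raise ValueError("cannot advance by more than one buffer length")
    linear = (cursor.offset - EYE_CATCHER_LEN) + nbytes
    wraps, rel = divmod(linear, buf_len)
    return Cursor(offset=EYE_CATCHER_LEN + rel, wrap=cursor.wrap + wraps)
```

   Because the wrap number is incremented whenever the offset passes
   the end of the buffer, a write that fills the whole buffer leaves
   the offset unchanged but is still distinguishable from no write at
   all.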

4.4.  Use of RMBEs

4.4.1.  Initializing and Accessing RMBEs

   The RMBE eye catcher is initialized by the RMB owner prior to
   assigning it to a specific TCP connection and communicating its RMBE
   index to the SMC-R partner.  After an RMBE index is communicated to
   the SMC-R partner, the RMBE can only be referenced in "read-only
   mode" by the owner, and all updates to it are performed by the remote
   SMC-R partner via RDMA write operations.

   Initialization of an RMBE must include the following:

   o  Zeroing out the entire RMBE receive buffer, which helps minimize
      data integrity issues (e.g., data from a previous connection
      somehow being presented to the current connection).

   o  Setting the beginning RMBE eye catcher.  This eye catcher plays an
      important role in helping detect accidental overlays of the RMBE.
      The RMB owner should always validate these eye catchers before
      each new reference to the RMBE.  If the eye catchers are found to
      be corrupted, the local host must reset the TCP connection
      associated with this RMBE and log the appropriate diagnostic
      information.
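
   A minimal sketch of these two initialization steps and of the
   eye-catcher check follows.  The eye-catcher value b"RMBE" and all
   function names are assumptions made for illustration; the actual
   contents are implementation dependent.

```python
EYE_CATCHER = b"RMBE"  # hypothetical value; contents are implementation dependent


class RmbeCorruptedError(Exception):
    """Raised when the eye catcher at the start of an RMBE has been overlaid."""


def init_rmbe(buf_len: int) -> bytearray:
    """Return a zeroed RMBE with the eye catcher set at the front."""
    rmbe = bytearray(len(EYE_CATCHER) + buf_len)  # bytearray(n) is zero-filled
    rmbe[: len(EYE_CATCHER)] = EYE_CATCHER
    return rmbe


def check_eye_catcher(rmbe: bytearray) -> None:
    """Validate the eye catcher before a new reference to the RMBE.

    On failure the caller is expected to reset the associated TCP
    connection and log diagnostic information.
    """
    if bytes(rmbe[: len(EYE_CATCHER)]) != EYE_CATCHER:
        raise RmbeCorruptedError("RMBE eye catcher overlaid")
```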

4.4.2.  RMB Element Reuse and Conflict Resolution

   RMB elements can be reused once their associated TCP and SMC-R
   connections are terminated.  Under normal and abnormal SMC-R
   connection termination processing, both SMC-R peers must explicitly
   acknowledge that they are done using an RMBE before that element can
   be freed and reassigned to another SMC-R connection instance.  For
   more details on SMC-R connection termination, refer to Section 4.8.

   However, there are some error scenarios where this two-way explicit
   acknowledgment may not be completed.  In these scenarios, an RMBE
   owner may choose to reassign this RMBE to a new SMC-R connection
   instance on this SMC-R link group.  When this occurs, the partner
   SMC-R peer must detect this condition during SMC-R Rendezvous
   processing when presented with an RMBE that it believes is already in
   use for a different SMC-R connection.  In this case, the SMC-R peer
   must abort the existing SMC-R connection associated with this RMBE.
   The abort processing resets the TCP connection (if it is still
   active), but it must not attempt to perform any RDMA writes to this
   RMBE and must also ignore any data sitting in the local RMBE
   associated with the existing connection.  It then proceeds to free up
   the local RMBE and notify the local application that the connection
   is being abnormally reset.

   The remote SMC-R peer then proceeds to normal processing for this new
   SMC-R connection.

4.5.  SMC-R Protocol Considerations

   The following sections describe considerations for the SMC-R protocol
   as compared to TCP.

4.5.1.  SMC-R Protocol Optimized Window Size Updates

   An SMC-R receiver host sends its consumer cursor information to the
   sender to convey the progress that the receiving application has made
   in consuming the sent data.  The difference between the writer's
   producer cursor and the associated receiver's consumer cursor
   indicates the window size available for the sender to write into.
   This is somewhat similar to TCP window update processing and
   therefore has some similar considerations, such as silly window
   syndrome avoidance.  TCP has an optimization that minimizes the
   overhead of very small, unproductive window size updates caused by
   suboptimal socket applications that consume very small amounts of
   data on every receive() invocation.  For SMC-R, the receiver only
   updates its consumer cursor via a unique CDC message under the
   following conditions:

   o  The current window size (from a sender's perspective) is less than
      half of the receive buffer space, and the consumer cursor update
      will result in a minimum increase in the window size of 10% of the
      receive buffer space.  Some examples:

      a. Receive buffer size: 64K, current window size (from a sender's
         perspective): 50K.  No need to update the consumer cursor.
         Plenty of space is available for the sender.

      b. Receive buffer size: 64K, current window size (from a sender's
         perspective): 30K, current window size from a receiver's
         perspective: 31K.  No need to update the consumer cursor; even
         though the sender's window size is < 1/2 of the 64K, the window
         update would only increase that by 1K, which is < 1/10th of the
         64K buffer size.

      c. Receive buffer size: 64K, current window size (from a sender's
         perspective): 30K, current window size from a receiver's
         perspective: 64K.  The receiver updates the consumer cursor
         (sender's window size is < 1/2 of the 64K; the window update
         would increase that by > 6.4K).

   o  The receiver must always include a consumer cursor update whenever
      it sends a CDC message to the partner for another flow (i.e., send
      flow in the opposite direction).  This allows the window size
      update to be delivered with no additional overhead.  This is
      somewhat similar to TCP DelayAck processing and quite effective
      for request/response data patterns.

   o  If a peer has set the B-bit in a CDC message, then any consumption
      of data by the receiver causes a CDC message to be sent, updating
      the consumer cursor until a CDC message with that bit cleared is
      received from the peer.

   o  The optimized window size updates are overridden when the sender
      sets the Consumer Cursor Update Requested flag in a CDC message to
      the receiver.  When this indicator is on, the consumer must send a
      consumer cursor update immediately when data is consumed by the
      local application or if the cursor has not been updated for a
      while (i.e., local copy of the consumer cursor does not match the
      last consumer cursor value sent to the partner).  This allows the
      sender to perform optional diagnostics for detecting a stalled
      receiver application (data has been sent but not consumed).  It is
      recommended that the Consumer Cursor Update Requested flag only be
      sent for diagnostic procedures, as it may result in non-optimal
      data path performance.
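
   The conditions above can be combined into a single decision, as in
   the sketch below.  It assumes the function is evaluated after the
   local application has consumed data; the function and parameter
   names are illustrative only.  The assignments at the end reproduce
   examples a, b, and c.

```python
def should_send_consumer_update(
    buf_size: int,
    sender_window: int,
    receiver_window: int,
    *,
    sending_cdc_anyway: bool = False,
    writer_blocked: bool = False,
    update_requested: bool = False,
) -> bool:
    """Decide whether the receiver reports its consumer cursor now.

    sender_window is the window as the sender currently sees it (based on
    the last consumer cursor sent); receiver_window is the window the
    sender would see after an update.
    """
    if sending_cdc_anyway:
        return True  # piggyback on a CDC message for the opposite flow
    gain = receiver_window - sender_window
    if gain <= 0:
        return False  # nothing new to report
    if writer_blocked or update_requested:
        return True  # B-bit set or Consumer Cursor Update Requested
    # Window below half of the buffer AND update gains at least 10% of it.
    return sender_window * 2 < buf_size and gain * 10 >= buf_size


K = 1024
example_a = should_send_consumer_update(64 * K, 50 * K, 64 * K)  # False
example_b = should_send_consumer_update(64 * K, 30 * K, 31 * K)  # False
example_c = should_send_consumer_update(64 * K, 30 * K, 64 * K)  # True
```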

4.5.2.  Small Data Sends

   The SMC-R protocol makes no special provisions for handling small
   data segments sent across a stream socket.  Data is always sent if
   sufficient window space is available.  In contrast to the TCP Nagle
   algorithm, there are no special provisions in SMC-R for coalescing
   small data segments.

   An implementation of SMC-R can be configured to optimize its sending
   processing by coalescing outbound data for a given SMC-R connection
   so that it can reduce the number of RDMA write operations it
   performs, in a fashion similar to Nagle's algorithm.  However, any
   such coalescing would require a timer on the sending host that would
   ensure that data was eventually sent.  Also, the sending host would
   have to opt out of this processing if Nagle's algorithm had been
   disabled (programmatically or via system configuration).

4.5.3.  TCP Keepalive Processing

   TCP keepalive processing allows applications to direct the local
   TCP/IP host to periodically "test" the viability of an idle TCP
   connection.  Since SMC-R connections have a TCP representation along
   with an SMC-R representation, there are unique keepalive processing
   considerations:

   o  SMC-R-layer keepalive processing: If keepalive is enabled for an
      SMC-R connection, the local host maintains a keepalive timer that
      reflects how long an SMC-R connection has been idle.  The local
      host also maintains a timestamp of last activity for each SMC-R
      link (for any SMC-R connection on that link).  When it is
      determined that an SMC-R connection has been idle longer than the
      keepalive interval, the host checks to see whether or not the
      SMC-R link has been idle for a duration longer than the keepalive
      timeout.  If both conditions are met, the local host then performs
      a TEST LINK LLC command to test the viability of the SMC-R link
      over the RoCE fabric (RC-QPs).  If a TEST LINK LLC command
      response is received within a reasonable amount of time, then the
      link is considered viable, and all connections using this link are
      considered viable as well.  If, however, a response is not
      received in a reasonable amount of time or there's a failure in
      sending the TEST LINK LLC command, then this is considered a
      failure in the SMC-R link, and failover processing to an alternate
      SMC-R link must be triggered.  If no alternate SMC-R link exists
      in the SMC-R link group, then all of the SMC-R connections on this
      link are abnormally terminated by resetting the TCP connections
      represented by these SMC-R connections.  Given that multiple SMC-R
      connections can share the same SMC-R link, implementing an SMC-R
      link-level probe using the TEST LINK LLC command will help reduce
      the amount of unproductive keepalive traffic for SMC-R
      connections; as long as some SMC-R connections on a given SMC-R
      link are active (i.e., have had I/O activity within the keepalive
      interval), then there is no need to perform additional link
      viability testing.

   o  TCP-layer keepalive processing: Traditional TCP "keepalive"
      packets are not as relevant for SMC-R connections, given that the
      TCP path is not used for these connections once the SMC-R
      Rendezvous processing is completed.  All SMC-R connections by
      default have associated TCP connections that are idle.  Are TCP
      keepalive probes still needed for these connections?  There are
      two main scenarios to consider:

      1. TCP keepalives that are used to determine whether or not the
         peer TCP endpoint is still active.  This is not needed for
         SMC-R connections, as the SMC-R-level keepalives mentioned
         above will determine whether or not the remote endpoint
         connections are still active.

      2. TCP keepalives that are used to ensure that TCP connections
         traversing an intermediate proxy maintain an active state.  For
         example, stateful firewalls typically maintain state
         representing every valid TCP connection that traverses the
         firewall.  These types of firewalls are known to expire idle
         connections by removing their state in the firewall to conserve
         memory.  TCP keepalives are often used in this scenario to
         prevent firewalls from timing out otherwise idle connections.
         When using SMC-R, both endpoints must reside in the same
         Layer 2 network (i.e., the same subnet).  As a result,
         firewalls cannot be injected in the path between two SMC-R
         endpoints.  However, other intermediate proxies, such as
         TCP/IP-layer load balancers, may be injected in the path of two
         SMC-R endpoints.  These types of load balancers also maintain
         connection state so that they can forward TCP connection
         traffic to the appropriate cluster endpoint.  When using SMC-R,
         these TCP connections will appear to be completely idle, making
         them susceptible to potential timeouts at the load-balancing
         proxy.  As a result, for this scenario, TCP keepalives may
         still be relevant.
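
   The SMC-R-layer keepalive decision described in the first bullet can
   be sketched as shown below.  The sketch assumes that the keepalive
   interval and the keepalive timeout are the same configured value and
   that timestamps are plain seconds; all names are hypothetical.

```python
from enum import Enum


class LinkAction(Enum):
    NONE = "none"                            # link and its connections are viable
    FAILOVER = "failover"                    # move connections to an alternate link
    RESET_CONNECTIONS = "reset_connections"  # no alternate link in the link group


def needs_test_link(now: float, conn_last_activity: float,
                    link_last_activity: float, keepalive_interval: float) -> bool:
    """Probe only if both the connection and its link have been idle too long."""
    conn_idle = now - conn_last_activity
    link_idle = now - link_last_activity
    return conn_idle > keepalive_interval and link_idle > keepalive_interval


def after_test_link(response_ok: bool, alternate_link_exists: bool) -> LinkAction:
    """Map the outcome of a TEST LINK LLC command to the required action."""
    if response_ok:
        return LinkAction.NONE
    if alternate_link_exists:
        return LinkAction.FAILOVER
    return LinkAction.RESET_CONNECTIONS
```

   Keeping the last-activity timestamp per link rather than only per
   connection is what lets one busy connection suppress probes for all
   idle connections sharing that link.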

   The following are the TCP-level keepalive processing requirements for
   SMC-R-enabled hosts:

   o  SMC-R peers should allow TCP keepalives to flow on the TCP path of
      SMC-R connections based on existing TCP keepalive configuration
      and programming options.  However, it is strongly recommended that
      platforms provide the ability to specify very granular keepalive
      timers (for example, single-digit-second timers) and that they
      consider providing a configuration option that limits the minimum
      keepalive timer that will be used for TCP-layer keepalives on
      SMC-R connections.  This is important to minimize the number of
      TCP keepalive packets transmitted in the network for SMC-R
      connections.

   o  SMC-R peers must always respond to inbound TCP-layer keepalives
      (by sending ACKs for these packets) even if the connection is
      using SMC-R.  Typically, once a TCP connection has completed the
      SMC-R Rendezvous processing and is using SMC-R for data flows, no
      new inbound TCP segments are expected on that TCP connection,
      other than TCP termination segments (FIN, RST, etc.).  TCP
      keepalives are the one exception that must be supported.  Also,
      since TCP keepalive probes do not carry any application-layer
      data, this has no adverse impact on the application's inbound data
      stream.

4.6.  TCP Connection Failover between SMC-R Links

   A peer may change which SMC-R link within a link group it sends its
   writes over in the event of a link failure.  Because each peer
   chooses the link for its own writes on a specific TCP connection,
   each peer performs this failover independently of the other.

4.6.1.  Validating Data Integrity

   Even though RoCE is a reliable transport, there is a small subset of
   failure modes that could cause unrecoverable loss of data.  When an
   RNIC acknowledges receipt of an RDMA write to its peer, that creates
   a write completion event to the sending peer, which allows the sender
   to release any buffers it is holding for that write.  In normal
   operation and in most failures, this operation is reliable.

   However, there are failure modes possible in which a receiving RNIC
   has acknowledged an RDMA write but then was not able to place the
   received data into its host memory -- for example, a sudden,
   disorderly failure of the interface between the RNIC and the host.
   While rare, these types of events must be guarded against to ensure
   data integrity.  The process for switching SMC-R links during
   failover, as described in this section, guards against this
   possibility and is mandatory.

   Each peer must track the current state of the CDC sequence numbers
   for a TCP connection.  The sender must keep track of the sequence
   number of the CDC message that described the last write acknowledged
   by the peer RNIC, or Sequence Sent (SS).  In other words, SS
   describes the last write that the sender believes its peer has
   successfully received.  The receiver must keep track of the sequence
   number of the CDC message that described the last write that it has
   successfully received (i.e., the data has been successfully placed
   into an RMBE), or Sequence Received (SR).

   When an RNIC fails and the sender changes SMC-R links, the sender
   must first send a CDC message with the F-bit (failover validation
   indicator; see Appendix A.4) set over the new SMC-R link.  This is
   the failover data validation message.  The sequence number in this
   CDC message is equal to SS.  The CDC message key, the length, and the
   SMC-R alert token are the only other fields in this CDC message that
   are significant.  No reply is expected from this validation message,
   and once the sender has sent it, the sender may resume sending on the
   new SMC-R link as described in Section 4.6.2.

   Upon receipt of the failover validation message, the receiver must
   verify that its SR value for the TCP connection is equal to or
   greater than the sequence number in the failover validation message.
   If so, no further action is required, and the TCP connection resumes
   on the new SMC-R link.  If SR is less than the sequence number value
   in the validation message, data has been lost, and the receiver must
   immediately reset the TCP connection.
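
   The receiver-side check can be sketched as below.  For simplicity,
   the sketch compares plain integers and deliberately ignores
   wraparound of the CDC sequence number, which a real implementation
   would have to handle; the function name is hypothetical.

```python
def failover_validation_ok(sequence_received: int, validation_seq: int) -> bool:
    """Return True if the connection may resume on the new SMC-R link.

    sequence_received is the receiver's SR; validation_seq is the sequence
    number (the sender's SS) carried in the failover validation message.
    Assumption: sequence-number wraparound is not modeled here.
    """
    return sequence_received >= validation_seq


# The receiver placed everything the sender believes was delivered: resume.
resume = failover_validation_ok(sequence_received=42, validation_seq=42)

# The sender saw an acknowledgment for write 43 that never reached host
# memory: data has been lost, so the TCP connection must be reset.
lost = not failover_validation_ok(sequence_received=42, validation_seq=43)
```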

4.6.2.  Resuming the TCP Connection on a New SMC-R Link

   When a connection is moved to a new SMC-R link and the failover
   validation message has been sent, the sender can immediately resume
   normal transmission.  In order to preserve the application message
   stream, the sender must replay any RDMA writes (and their associated
   CDC messages) that were in progress or failed when the previous SMC-R
   link failed, before sending new data on the new SMC-R link.  The
   sender has two options for accomplishing this:

   o  Preserve the sequence numbers "as is": Retry all failed and
      pending operations as they were originally done, including
      reposting all associated RDMA write operations and their
      associated CDC messages without making any changes.  Then resume
      sending new data using new sequence numbers.

   o  Combine pending messages and possibly add new data: Combine failed
      and pending messages into a single new write with a new sequence
      number.  This allows the sender to combine pending messages into
      fewer operations.  As a further optimization, this write can also
      include new data, as long as all failed and pending data are also
      included.  If this approach is taken, the sequence number must be
      increased beyond the last failed or pending sequence number.
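
   The two replay options might be modeled as follows.  A pending write
   is represented here as a (sequence number, payload) pair kept in its
   original send order; this representation and the function names are
   assumptions made for illustration, and sequence-number wraparound is
   again ignored.

```python
from typing import List, Tuple

Write = Tuple[int, bytes]  # (CDC sequence number, payload); hypothetical shape


def replay_as_is(pending: List[Write]) -> List[Write]:
    """Option 1: repost every failed or pending write unchanged."""
    return list(pending)


def replay_combined(pending: List[Write], new_data: bytes = b"") -> List[Write]:
    """Option 2: merge all pending writes (plus optional new data) into one.

    The merged write gets a sequence number beyond the last pending one.
    """
    if not pending:
        return []  # nothing to replay; new data follows the normal send path
    payload = b"".join(data for _, data in pending) + new_data
    next_seq = max(seq for seq, _ in pending) + 1
    return [(next_seq, payload)]


pending = [(7, b"abc"), (8, b"def")]
combined = replay_combined(pending, new_data=b"ghi")
```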

4.7.  RMB Data Flows

   The following sections describe the RDMA wire flows for the SMC-R
   protocol after a TCP connection has switched into SMC-R mode (i.e.,
   SMC-R Rendezvous processing is complete and a pair of RMB elements
   has been assigned and communicated by the SMC-R peers).  The ladder
   diagrams below include the following:

   o  RMBE control information kept by each peer.  Only a subset of the
      information is depicted, specifically only the fields that reflect
      the stream of data written by Host A and read by Host B.

   o  Time line 0-x, which shows the wire flows in a time-relative
      fashion.

   o  Note that RMBE control information is only shown in a time
      interval if its value changed (otherwise, assume that the value is
      unchanged from the previously depicted value).

   o  The local copy of the producer cursors and consumer cursors that
      is maintained by each host is not depicted in these figures.  Note
      that the cursor values in the diagram reflect the necessity of
      skipping over the eye catcher in the RMBE data area.  They start
      and wrap at 4, not 0.

4.7.1.  Scenario 1: Send Flow, Window Size Unconstrained

            SMC Host A                             SMC Host B
           RMBE A Info                            RMBE B Info
       (Consumer Cursors)                      (Producer Cursors)
   Cursor   Wrap Seq# Time               Time Cursor   Wrap Seq#  Flags
   4        0         0                  0    4        0          0
   0        0         1 ---------------> 1    0        0          0
                        RDMA-WR Data
                          (4:1003)
   4        0         2 ...............> 2    1004     0          0
                        CDC Message

        Figure 16: Scenario 1: Send Flow, Window Size Unconstrained

   Scenario assumptions:

   o  Kernel implementation.

   o  New SMC-R connection; no data has been sent on the connection.

   o  Host A: Application issues send for 1000 bytes to Host B.

   o  Host B: RMBE receive buffer size is 10,000; application has issued
      a recv for 10,000 bytes.

   Flow description:

   1. The application issues a send() for 1000 bytes; the SMC-R layer
      copies data into a kernel send buffer.  It then schedules an RDMA
      write operation to move the data into the peer's RMBE receive
      buffer, at relative position 4-1003 (to skip the 4-byte
      eye catcher in the RMBE data area).  Note that no immediate data
      or alert (i.e., interrupt) is provided to Host B for this RDMA
      operation.

   2. Host A sends a CDC message to update the producer cursor to
      byte 1004.  This CDC message will deliver an interrupt to Host B.
      At this point, the SMC-R layer can return control back to the
      application.  Host B, once notified of the completion of the
      previous RDMA operation, locates the RMBE associated with the RMBE
      alert token that was included in the message and proceeds to
      perform normal receive-side processing, waking up the suspended
      application read thread, copying the data into the application's
      receive buffer, etc.  It will use the producer cursor as an
      indicator of how much data is available to be delivered to the
      local application.  After this processing is complete, the SMC-R
      layer will also update its local consumer cursor to match the
      producer cursor (i.e., indicating that all data has been
      consumed).  Note that a message to the peer updating the consumer
      cursor is not needed at this time, as the window size is
      unconstrained (> 1/2 of the receive buffer size).  The window size
      is calculated by taking the difference between the producer cursor
      and the consumer cursor in the RMBEs (10,000 - 1004 = 8996).
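
   The offsets used in this scenario follow directly from the 4-byte
   eye catcher, as the short sketch below shows.  It assumes the write
   does not wrap; the helper name rdma_write is illustrative and does
   not perform any real RDMA operation.

```python
from typing import Tuple

EYE_CATCHER_LEN = 4  # cursors start at 4, not 0


def rdma_write(producer_cursor: int, nbytes: int) -> Tuple[Tuple[int, int], int]:
    """Return the (first, last) offsets written and the new producer cursor.

    Assumption: the write fits before the end of the buffer (no wrap).
    """
    first = producer_cursor
    last = producer_cursor + nbytes - 1
    return (first, last), last + 1


# Host A sends 1000 bytes on a new connection.
written, producer = rdma_write(EYE_CATCHER_LEN, 1000)
# written == (4, 1003); producer == 1004
```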

4.7.2.  Scenario 2: Send/Receive Flow, Window Size Unconstrained

             SMC Host A                             SMC Host B
            RMBE A Info                            RMBE B Info
        (Consumer Cursors)                      (Producer Cursors)
    Cursor   Wrap Seq# Time               Time Cursor   Wrap Seq#  Flags
    4        0         0                  0    4        0          0
    0        0         1 ---------------> 1    0        0          0
                         RDMA-WR Data
                           (4:1003)
    4        0         2 ...............> 2    1004     0          0
                         CDC Message

    0        0         3 <--------------  3    1004     0          0
                         RDMA-WR Data
                           (4:503)
    1004     0         4 <..............  4    1004     0          0
                          CDC Message

    Figure 17: Scenario 2: Send/Receive Flow, Window Size Unconstrained

   Scenario assumptions:

   o  New SMC-R connection; no data has been sent on the connection.

   o  Host A: Application issues send for 1000 bytes to Host B.

   o  Host B: RMBE receive buffer size is 10,000; application has
      already issued a recv for 10,000 bytes.  Once the receive is
      completed, the application sends a 500-byte response to Host A.

   Flow description:

   1. The application issues a send() for 1000 bytes; the SMC-R layer
      copies data into a kernel send buffer.  It then schedules an RDMA
      write operation to move the data into the peer's RMBE receive
      buffer, at relative position 4-1003.  Note that no immediate data
      or alert (i.e., interrupt) is provided to Host B for this RDMA
      operation.

   2. Host A sends a CDC message to update the producer cursor to
      byte 1004.  This CDC message will deliver an interrupt to Host B.
      At this point, the SMC-R layer can return control back to the
      application.

   3. Host B, once notified of the receipt of the previous CDC message,
      locates the RMBE associated with the RMBE alert token and proceeds
      to perform normal receive-side processing, waking up the suspended
      application read thread, copying the data into the application's
      receive buffer, etc.  After this processing is complete, the SMC-R
      layer will also update its local consumer cursor to match the
      producer cursor (i.e., indicating that all data has been
      consumed).  Note that an update of the consumer cursor to the peer
      is not needed at this time, as the window size is unconstrained
      (> 1/2 of the receive buffer size).  The application then performs
      a send() for 500 bytes to Host A.  The SMC-R layer will copy the
      data into a kernel buffer and then schedule an RDMA write into the
      partner's RMBE receive buffer.  Note that this RDMA write
      operation includes no immediate data or notification to Host A.

   4. Host B sends a CDC message to update the partner's RMBE control
      information with the latest producer cursor (set to 504 and not
      shown in the diagram above) and to also inform the peer that the
      consumer cursor value is now 1004.  It also updates the local
      current consumer cursor and the last sent consumer cursor to 1004.
      This CDC message includes notification, since we are updating our
      producer cursor; this requires attention by the peer host.

4.7.3.  Scenario 3: Send Flow, Window Size Constrained

             SMC Host A                             SMC Host B
            RMBE A Info                            RMBE B Info
        (Consumer Cursors)                      (Producer Cursors)
    Cursor   Wrap Seq# Time               Time Cursor   Wrap Seq#  Flags
    4        0         0                  0    4        0          0
    4        0         1 ---------------> 1    4        0          0
                         RDMA-WR Data
                           (4:3003)
    4        0         2 ...............> 2    3004     0          0
                         CDC Message
    4        0         3                  3    3004     0          0
    4        0         4 ---------------> 4    3004     0          0
                         RDMA-WR Data
                           (3004:7003)
    4        0         5 ...............> 5    7004     0          0
                         CDC Message
    7004     0         6 <..............  6    7004     0          0
                         CDC Message

         Figure 18: Scenario 3: Send Flow, Window Size Constrained

   Scenario assumptions:

   o  New SMC-R connection; no data has been sent on this connection.

   o  Host A: Application issues send for 3000 bytes to Host B and then
      another send for 4000 bytes.

   o  Host B: RMBE receive buffer size is 10,000.  Application has
      already issued a recv for 10,000 bytes.

   Flow description:

   1. The application issues a send() for 3000 bytes; the SMC-R layer
      copies data into a kernel send buffer.  It then schedules an RDMA
      write operation to move the data into the peer's RMBE receive
      buffer, at relative position 4-3003.  Note that no immediate data
      or alert (i.e., interrupt) is provided to Host B for this RDMA
      operation.

   2. Host A sends a CDC message to update its producer cursor to
      byte 3004.  This CDC message will deliver an interrupt to Host B.
      At this point, the SMC-R layer can return control back to the
      application.

   3. Host B, once notified of the receipt of the previous CDC message,
      locates the RMBE associated with the RMBE alert token and proceeds
      to perform normal receive-side processing, waking up the suspended
      application read thread, copying the data into the application's
      receive buffer, etc.  After this processing is complete, the SMC-R
      layer will also update its local consumer cursor to match the
      producer cursor (i.e., indicating that all data has been
      consumed).  It will not, however, update the partner with this
      information, as the window size is not constrained
      (10,000 - 3000 = 7000 bytes of available space).  The application
      on Host B also issues a new recv() for 10,000 bytes.

   4. On Host A, the application issues a send() for 4000 bytes.  The
      SMC-R layer copies the data into a kernel buffer and schedules an
      async RDMA write into the peer's RMBE receive buffer at relative
      position 3004-7003.  Note that no alert is provided to Host B for
      this flow.

   5. Host A sends a CDC message to update the producer cursor to
      byte 7004.  This CDC message will deliver an interrupt to Host B.
      At this point, the SMC-R layer can return control back to the
      application.

   6. Host B, once notified of the receipt of the previous CDC message,
      locates the RMBE associated with the RMBE alert token and proceeds
      to perform normal receive-side processing, waking up the suspended
      application read thread, copying the data into the application's
      receive buffer, etc.  After this processing is complete, the SMC-R
      layer will also update its local consumer cursor to match the
      producer cursor (i.e., indicating that all data has been
      consumed).  It will then determine whether or not it needs to
      update the consumer cursor to the peer.  The available window size
      is now 3000 (10,000 - (producer cursor - last sent consumer
      cursor)), which is < 1/2 of the receive buffer size
      (10,000/2 = 5000), and the advance of the window size is > 10% of
      the window size (1000).  Therefore, a CDC message is issued to
      update the consumer cursor to Host A.
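
   The decision made in steps 3 and 6 can be sketched in a few lines of
   Python.  This is only an illustration under stated assumptions, not
   part of the protocol definition: the function and parameter names are
   hypothetical, cursors are treated as plain byte offsets with no wrap
   handling, and the 10% threshold is assumed to be taken against the
   receive buffer size, as the value 1000 in step 6 suggests.

```python
def consumer_update_needed(buf_size, producer, last_sent_consumer,
                           local_consumer):
    """Decide whether to send a CDC message with a new consumer cursor.

    Assumption: cursors are simple, non-wrapping byte offsets.
    """
    # Window as the peer currently sees it, based on the last consumer
    # cursor that was actually sent to the peer.
    window_seen_by_peer = buf_size - (producer - last_sent_consumer)
    # How far the window would open if the update were sent now.
    advance = local_consumer - last_sent_consumer
    return window_seen_by_peer < buf_size / 2 and advance > buf_size / 10


# Step 3: 7000 bytes of window remain, so no update is sent.
step3 = consumer_update_needed(10_000, 3004, 4, 3004)
# Step 6: only 3000 bytes remain and the advance is 7000 bytes.
step6 = consumer_update_needed(10_000, 7004, 4, 7004)
```

   With the values from Figure 18, step3 evaluates to False and step6
   evaluates to True, matching the flow description.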

4.7.4.  Scenario 4: Large Send, Flow Control, Full Window Size Writes

             SMC Host A                             SMC Host B
            RMBE A Info                            RMBE B Info
        (Consumer Cursors)                      (Producer Cursors)
    Cursor   Wrap Seq# Time               Time Cursor   Wrap Seq#  Flags
    1004     1         0                  0    1004     1          0
    1004     1         1 ---------------> 1    1004     1          0
                         RDMA-WR Data
                           (1004:9999)
    1004     1         2 ---------------> 2    1004     1          0
                         RDMA-WR Data
                           (4:1003)
    1004     1         3 ...............> 3    1004     2          Wrt
                         CDC Message                               Blk

    1004     2         4 <............... 4    1004     2          Wrt
                         CDC Message                               Blk

    1004     2         5 ---------------> 5    1004     2          Wrt
                         RDMA-WR Data                              Blk
                           (1004:9999)
    1004     2         6 ---------------> 6    1004     2          Wrt
                         RDMA-WR Data                              Blk
                          (4:1003)
    1004     2         7 ...............> 7    1004     3          Wrt
                         CDC Message                               Blk

    1004     3         8 <............... 8    1004     3          Wrt
                         CDC Message                               Blk

             Figure 19: Scenario 4: Large Send, Flow Control,
                          Full Window Size Writes

   Scenario assumptions:

   o  Kernel implementation.

   o  Existing SMC-R connection, Host B's receive window size is fully
      open (peer consumer cursor = peer producer cursor).

   o  Host A: Application issues send for 20,000 bytes to Host B.

   o  Host B: RMBE receive buffer size is 10,000; application has issued
      a recv for 10,000 bytes.

   Flow description:

   1. The application issues a send() for 20,000 bytes; the SMC-R layer
      copies data into a kernel send buffer (assumes that send buffer
      space of 20,000 is available for this connection).  It then
      schedules an RDMA write operation to move the data into the peer's
      RMBE receive buffer, at relative position 1004-9999.  Note that no
      immediate data or alert (i.e., interrupt) is provided to Host B
      for this RDMA operation.

   2. Host A then schedules an RDMA write operation to fill the
      remaining 1000 bytes of available space in the peer's RMBE receive
      buffer, at relative position 4-1003.  Note that no immediate data
      or alert (i.e., interrupt) is provided to Host B for this RDMA
      operation.  Also note that an implementation of SMC-R may optimize
      this processing by combining steps 1 and 2 into a single
      RDMA write operation (with two different data sources).

   3. Host A sends a CDC message to update the producer cursor to
      byte 1004.  Since the entire receive buffer space is filled, the
      producer writer blocked flag (the "Wrt Blk" indicator (flag) in
      Figure 19) is set and the producer cursor wrap sequence number
      (the producer "Wrap Seq#" in Figure 19) is incremented.  This CDC
      message will deliver an interrupt to Host B.  At this point, the
      SMC-R layer can return control back to the application.

   4. Host B, once notified of the receipt of the previous CDC message,
      locates the RMBE associated with the RMBE alert token and proceeds
      to perform normal receive-side processing, waking up the suspended
      application read thread, copying the data into the application's
      receive buffer, etc.  In this scenario, Host B notices that the
      producer cursor has not been advanced (same value as the consumer
      cursor); however, it notices that the producer cursor wrap
      sequence number is different from its local value (1), indicating
      that a full window of new data is available.  All of the data in
      the receive buffer can be processed, with the first segment
      (1004-9999) followed by the second segment (4-1003).  Because the
      producer writer blocked indicator was set, Host B schedules a CDC
      message to update its latest information to the peer: consumer
      cursor (1004), consumer cursor wrap sequence number (the current
      value of 2 is used).

   5. Host A, upon receipt of the CDC message, locates the TCP
      connection associated with the alert token and, upon examining the
      control information provided, notices that Host B has consumed all
      of the data (based on the consumer cursor and the consumer cursor
      wrap sequence number) and initiates the next RDMA write to fill
      the receive buffer at offset 1004-9999.

   6. Host A then moves the next 1000 bytes into the beginning of the
      receive buffer (4-1003) by scheduling an RDMA write operation.
      Note that at this point there are still 8 bytes remaining to be
      written.

   7. Host A then sends a CDC message to set the producer writer blocked
      indicator and to increment the producer cursor wrap sequence
      number (3).

   8. Host B, upon notification, completes the same processing as step 4
      above, including sending a CDC message to update the peer to
      indicate that all data has been consumed.  At this point, Host A
      can write the final 8 bytes to Host B's RMBE into
      positions 1004-1011 (not shown).
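
   The way Host A splits the 20,000-byte send into wrapping RDMA writes
   can be sketched as follows.  This is a rough illustration, not a
   definitive implementation: the helper names are hypothetical, the
   RMBE is assumed to be 10,000 bytes long with offsets 0-3 occupied by
   the eye catcher, and each round assumes that the peer has already
   consumed everything so that the full data area is free.

```python
EYE_CATCHER_LEN = 4  # assumed: the first 4 bytes of the RMBE are reserved


def plan_window_writes(cursor, nbytes, rmbe_len=10_000):
    """Plan the RDMA writes for at most one full window of data.

    Returns (segments, new_cursor, wrapped, remaining), where each
    segment is an inclusive (first, last) offset pair in the peer's RMBE.
    """
    capacity = rmbe_len - EYE_CATCHER_LEN
    count = min(nbytes, capacity)

    segments = []
    first_len = min(count, rmbe_len - cursor)  # bytes up to end of RMBE
    if first_len:
        segments.append((cursor, cursor + first_len - 1))
    rest = count - first_len                   # bytes that wrap around
    if rest:
        segments.append((EYE_CATCHER_LEN, EYE_CATCHER_LEN + rest - 1))

    new_cursor = cursor + count
    wrapped = new_cursor >= rmbe_len           # wrap seq# must be bumped
    if wrapped:
        new_cursor = EYE_CATCHER_LEN + (new_cursor - rmbe_len)
    return segments, new_cursor, wrapped, nbytes - count


def plan_send(cursor, nbytes):
    """Plan successive windows until every byte has been scheduled."""
    rounds = []
    while nbytes:
        segments, cursor, wrapped, nbytes = plan_window_writes(cursor, nbytes)
        rounds.append((segments, cursor, wrapped))
    return rounds


rounds = plan_send(1004, 20_000)
```

   For the scenario above, this yields two full-window rounds, each with
   segments (1004, 9999) and (4, 1003), followed by a final round that
   writes the remaining 8 bytes to (1004, 1011).  In a real flow, each
   round after the first would wait for the peer's consumer cursor
   update before being issued.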

4.7.5.  Scenario 5: Send Flow, Urgent Data, Window Size Open

             SMC Host A                             SMC Host B
            RMBE A Info                            RMBE B Info
        (Consumer Cursors)                      (Producer Cursors)
    Cursor   Wrap Seq# Time               Time Cursor   Wrap Seq#  Flag
    1000     1         0                  0    1000     1          0
    1000     1         1 ---------------> 1    1000     1          0
                         RDMA-WR Data
                           (1000:1499)
    1000     1         2 ...............> 2    1500     1          UrgP
                         CDC Message                               UrgA

    1500     1         3 <............... 3    1500     1          UrgP
                         CDC Message                               UrgA

    1500     1         4 ---------------> 4    1500     1          UrgP
                         RDMA-WR Data                              UrgA
                           (1500:2499)
    1500     1         5 ...............> 5    2500     1          0
                         CDC Message

      Figure 20: Scenario 5: Send Flow, Urgent Data, Window Size Open

   Scenario assumptions:

   o  Kernel implementation.

   o  Existing SMC-R connection; window size open (unconstrained); all
      data has been consumed by receiver.

   o  Host A: Application issues send for 500 bytes with urgent data
      indicator (out of band) to Host B, then sends 1000 bytes of
      normal data.

   o  Host B: RMBE receive buffer size is 10,000; application has issued
      a recv for 10,000 bytes and is also monitoring the socket for
      urgent data.

   Flow description:

   1. The application issues a send() for 500 bytes of urgent data; the
      SMC-R layer copies data into a kernel send buffer.  It then
      schedules an RDMA write operation to move the data into the peer's
      RMBE receive buffer, at relative position 1000-1499.  Note that no
      immediate data or alert (i.e., interrupt) is provided to Host B
      for this RDMA operation.

   2. Host A sends a CDC message to update its producer cursor to
      byte 1500 and to turn on the producer Urgent Data Pending (UrgP)
      and Urgent Data Present (UrgA) flags.  This CDC message will
      deliver an interrupt to Host B.  At this point, the SMC-R layer
      can return control back to the application.

   3. Host B, once notified of the receipt of the previous CDC message,
      locates the RMBE associated with the RMBE alert token, notices
      that the Urgent Data Pending flag is on, and proceeds with out-of-
      band socket API notification -- for example, satisfying any
      outstanding select() or poll() requests on the socket by
      indicating that urgent data is pending (i.e., by setting the
      exception bit on).  The urgent data present indicator allows
      Host B to also determine the position of the urgent data (the
      producer cursor points 1 byte beyond the last byte of urgent
      data).  Host B can then perform normal receive-side processing
      (including specific urgent data processing), copying the data into
      the application's receive buffer, etc.  Host B then sends a CDC
      message to update the partner's RMBE control area with its latest
      consumer cursor (1500).  Note that this CDC message must occur,
      regardless of the current local window size that is available.
      The partner host (Host A) cannot initiate any additional RDMA
      writes until it receives acknowledgment that the urgent data has
      been processed (or at least processed/remembered at the SMC-R
      layer).

   4. Upon receipt of the message, Host A wakes up, sees that the peer
      consumed all data up to and including the last byte of urgent
      data, and now resumes sending any pending data.  In this case, the
      application had previously issued a send for 1000 bytes of normal
      data, which would have been copied in the send buffer, and control
      would have been returned to the application.  Host A now initiates
      an RDMA write to move that data to the peer's receive buffer at
      position 1500-2499.

   5. Host A then sends a CDC message to update its producer cursor
      value (2500) and to turn off the Urgent Data Pending and Urgent
      Data Present flags.  Host B wakes up, processes the new data
      (resumes application, copies data into the application receive
      buffer), and then proceeds to update the local current consumer
      cursor (2500).  Given that the window size is unconstrained, there
      is no need for a consumer cursor update in the peer's RMBE.
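
   The receiver-side handling of the two urgent flags can be sketched as
   below.  The names are hypothetical and the sketch makes simplifying
   assumptions: it ignores cursor wrap and models only the decisions
   described in step 3, not the actual socket API notification or data
   copy.

```python
from typing import NamedTuple, Optional


class UrgentAction(NamedTuple):
    notify_oob: bool               # signal select()/poll() exception
    urgent_offset: Optional[int]   # offset of the last urgent byte
    must_ack_urgent: bool          # consumer cursor update is mandatory


def on_cdc(producer_cursor, urg_pending, urg_present):
    """Derive the urgent-data actions from a received CDC message."""
    # With UrgA set, the producer cursor points 1 byte beyond the last
    # byte of urgent data (wrap handling omitted in this sketch).
    offset = producer_cursor - 1 if urg_present else None
    # Once the urgent data is present, the consumer cursor update must
    # be sent regardless of the local window size.
    return UrgentAction(urg_pending, offset, urg_present)


step2 = on_cdc(1500, urg_pending=True, urg_present=True)
step5 = on_cdc(2500, urg_pending=False, urg_present=False)
```

   Here step2 reports an urgent byte at offset 1499 and a mandatory
   consumer cursor update, while step5 carries no urgent indication.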

4.7.6.  Scenario 6: Send Flow, Urgent Data, Window Size Closed

             SMC Host A                             SMC Host B
            RMBE A Info                            RMBE B Info
        (Consumer Cursors)                      (Producer Cursors)
    Cursor   Wrap Seq# Time               Time Cursor   Wrap Seq#  Flag
    1000     1         0                  0    1000     2          Wrt
                                                                   Blk

    1000     1         1 ...............> 1    1000     2          Wrt
                         CDC Message                               Blk
                                                                   UrgP

    1000     2         2 <............... 2    1000     2          Wrt
                         CDC Message                               Blk
                                                                   UrgP

    1000     2         3 ---------------> 3    1000     2          Wrt
                         RDMA-WR Data                              Blk
                           (1000:1499)                             UrgP

    1000     2         4 ...............> 4    1500     2          UrgP
                         CDC Message                               UrgA

    1500     2         5 <............... 5    1500     2          UrgP
                         CDC Message                               UrgA

    1500     2         6 ---------------> 6    1500     2          UrgP
                         RDMA-WR Data                              UrgA
                           (1500:2499)
    1500     2         7 ...............> 7    2500     2          0
                         CDC Message

     Figure 21: Scenario 6: Send Flow, Urgent Data, Window Size Closed

   Scenario assumptions:

   o  Kernel implementation.

   o  Existing SMC-R connection; window size closed; writer is blocked.

   o  Host A: Application issues send for 500 bytes with urgent data
      indicator (out of band) to Host B, then sends 1000 bytes of
      normal data.

   o  Host B: RMBE receive buffer size is 10,000; application has no
      outstanding recv() (for normal data) and is monitoring the socket
      for urgent data.

   Flow description:

   1. The application issues a send() for 500 bytes of urgent data; the
      SMC-R layer copies data into a kernel send buffer (if available).
      Since the writer is blocked (window size closed), it cannot send
      the data immediately.  It then sends a CDC message to notify the
      peer of the Urgent Data Pending (UrgP) indicator (the writer
      blocked indicator remains on as well).  This serves as a signal to
      Host B that urgent data is pending in the stream.  Control is also
      returned to the application at this point.

   2. Host B, once notified of the receipt of the previous CDC message,
      locates the RMBE associated with the RMBE alert token, notices
      that the Urgent Data Pending flag is on, and proceeds with out-of-
      band socket API notification -- for example, satisfying any
      outstanding select() or poll() requests on the socket by
      indicating that urgent data is pending (i.e., by setting the
      exception bit on).  At this point, it is expected that the
      application will enter urgent data mode processing, expeditiously
      processing all normal data (by issuing recv API calls) so that it
      can get to the urgent data byte.  Whether the application has this
      urgent mode processing or not, at some point, the application will
      consume some or all of the pending data in the receive buffer.
      When this occurs, Host B will also send a CDC message to update
      its consumer cursor and consumer cursor wrap sequence number to
      the peer.  In the example above, a full window's worth of data was
      consumed.

   3. Host A, once awakened by the message, will notice that the window
      size is now open on this connection (based on the consumer cursor
      and the consumer cursor wrap sequence number, which now matches
      the producer cursor wrap sequence number) and resume sending of
      the urgent data segment by scheduling an RDMA write into relative
      position 1000-1499.

   4. Host A then sends a CDC message to advance its producer cursor
      (1500) and to also notify Host B of the Urgent Data Present (UrgA)
      indicator (and turn off the writer blocked indicator).  This
      signals to Host B that the urgent data is now in the local receive
      buffer and that the producer cursor points 1 byte beyond the last
      byte of urgent data.

   5. Host B wakes up, processes the urgent data, and, once the urgent
      data is consumed, sends a CDC message to update its consumer
      cursor (1500).

   6. Host A wakes up, sees that Host B has consumed the sequence number
      associated with the urgent data, and then initiates the next RDMA
      write operation to move the 1000 bytes associated with the next
      send() of normal data into the peer's receive buffer at
      position 1500-2499.  Note that the send API would have likely
      completed earlier in the process by copying the 1000 bytes into a
      send buffer and returning back to the application, even though we
      could not send any new data until the urgent data was processed
      and acknowledged by Host B.

   7. Host A sends a CDC message to advance its producer cursor to 2500
      and to reset the Urgent Data Pending and Urgent Data Present
      flags.  Host B wakes up and processes the inbound data.

4.8.  Connection Termination

   Just as SMC-R connections are established using a combination of TCP
   connection establishment flows and SMC-R protocol flows, the
   termination of SMC-R connections also uses a similar combination of
   SMC-R protocol termination flows and normal TCP connection
   termination flows.  The following sections describe the SMC-R
   protocol normal and abnormal connection termination flows.

4.8.1.  Normal SMC-R Connection Termination Flows

   Normal SMC-R connection termination flows are triggered via the
   normal stream socket API semantics, namely by the application issuing
   a close() or shutdown() API.  Most applications, after consuming all
   incoming data and after sending any outbound data, will then issue a
   close() API to indicate that they are done both sending and receiving
   data.  Some applications, typically a small percentage, make use of
   the shutdown() API, which allows them to indicate that the
   application is done sending data, receiving data, or both sending and
   receiving data.  The main use of this API is in scenarios where a TCP
   application wants to alert its partner endpoint that it is done
   sending data but is still receiving data on its socket (shutdown for
   write).  Issuing shutdown() for both sending and receiving data is
   really no different than issuing a close() and can therefore be
   treated in a similar fashion.  Shutdown for read is typically not a
   very useful operation and in normal circumstances does not trigger
   any network flows to notify the partner TCP endpoint of this
   operation.

   These same trigger points will be used by the SMC-R layer to initiate
   SMC-R connection termination flows.  The main design point for SMC-R
   normal connection termination flows is to use the SMC-R protocol to
   first shut down the SMC-R connection and free up any SMC-R RDMA
   resources, and then allow the normal TCP connection termination
   protocol (i.e., FIN processing) to drive cleanup of the TCP
   connection.  This design point is very important in ensuring that
   RDMA resources such as the RMBEs are only freed and reused when both
   SMC-R endpoints are completely done with their RDMA write operations
   to the partner's RMBE.

                                      1
                            +-----------------+
            |-------------->|     CLOSED      |<-------------|
        3D  |               |                 |              |  4D
            |               +-----------------+              |
            |                       |                        |
            |                     2 |                        |
            |                       V                        |
    +----------------+     +-----------------+     +----------------+
    |AppFinCloseWait |     |     ACTIVE      |     |PeerFinCloseWait|
    |                |     |                 |     |                |
    +----------------+     +-----------------+     +----------------+
            |                   |         |                   |
            |     Active Close  | 3A | 4A |  Passive Close    |
            |                   V    |    V                   |
            |       +--------------+ | +-------------+        |
            |--<----|PeerCloseWait1| | |AppCloseWait1|--->----|
        3C  |       |              | | |             |        |  4C
            |       +--------------+ | +-------------+        |
            |             |          |         |              |
            |             | 3B       |     4B  |              |
            |             V          |         V              |
            |       +--------------+ | +-------------+        |
            |--<----|PeerCloseWait2| | |AppCloseWait2|--->----|
                    |              | | |             |
                    +--------------+ | +-------------+
                                     |
                                     |

                    Figure 22: SMC-R Connection States

   Figure 22 describes the states that an SMC-R connection typically
   goes through.  Note that there are variations to these states that
   can occur when an SMC-R connection is abnormally terminated, similar
   in a way to when a TCP connection is reset.  The following are the
   high-level state transitions for an SMC-R connection:

   1. An SMC-R connection begins in the Closed state.  This state is
      meant to reflect an RMBE that is not currently in use (was
      previously in use but no longer is, or was never allocated).

   2. An SMC-R connection progresses to the Active state once the SMC-R
      Rendezvous processing has successfully completed, RMB element
      indices have been exchanged, and SMC-R links have been activated.
      In this state, the TCP connection is fully established, rendezvous
      processing has been completed, and SMC-R peers can begin the
      exchange of data via RDMA.

   3. Active close processing (on the SMC-R peer that is initiating the
      connection termination).

      A. When an application on one of the SMC-R connection peers issues
         a close(), a shutdown() for write, or a shutdown() for both
         read and write, the SMC-R layer on that host will initiate
         SMC-R connection termination processing.  First, if a close()
         or shutdown(both) is issued, it will check to see that there's
         no data in the local RMB element that has not been read by the
         application.  If unread data is detected, the SMC-R connection
         must be abnormally reset; for more details on this, refer to
         Section 4.8.2 ("Abnormal SMC-R Connection Termination Flows").
         If no unread data is pending, it then checks to see whether or
         not any outstanding data is waiting to be written to the peer,
         or if any outstanding RDMA writes for this SMC-R connection
         have not yet completed.  If either of these two scenarios is
         true, an indicator that this connection is in a pending close
         state is saved in internal data structures representing this
         SMC-R connection, and control is returned to the application.
         If all data to be written to the partner has completed, this
         peer will send a CDC message to notify the peer of either the
         PeerConnectionClosed indicator (close or shutdown for both was
         issued) or the PeerDoneWriting indicator.  This will provide an
         interrupt to inform the partner SMC-R peer that the connection
         is terminating.  At this point, the local side of the SMC-R
         connection transitions to the PeerCloseWait1 state, and control
         can be returned to the application.  If this process could not
         be completed synchronously (the pending close condition
         mentioned above), it is completed when all RDMA writes for data
         and control cursors have been completed.

      B. At some point, the SMC-R peer application (passive close) will
         consume all incoming data, realize that the partner is done
         sending data on this connection, and proceed to initiate its
         own close of the connection once it has completed sending all
         data from its end.  The partner application can initiate this
         connection termination processing via close() or shutdown()
         APIs.  If the application does so by issuing a shutdown() for
         write, then the partner SMC-R layer will send a CDC message to
         notify the peer (the active close side) of the PeerDoneWriting
         indicator.  When the "active close" SMC-R peer wakes up as a
         result of the previous CDC message, it will notice that the
         PeerDoneWriting indicator is now on and transition to the
         PeerCloseWait2 state.  This state indicates that the peer is
         done sending data and may still be reading data.  At this
         point, the "active close" peer will also need to ensure that
         any outstanding recv() calls for this socket are woken up and
         remember that no more data is forthcoming on this connection
         (in case the local connection was shutdown() for write only).

      C. This flow is a common transition from 3A or 3B above.  When the
         SMC-R peer (passive close) consumes all data and updates all
         necessary cursors to the peer, and the application closes its
         socket (close or shutdown for both), it will send a CDC message
         to the peer (the active close side) with the
         PeerConnectionClosed indicator set.  At this point, the
         connection can transition back to the Closed state if the local
         application has already closed (or issued shutdown for both)
         the socket.  Once in the Closed state, the RMBE can now be
         safely reused for a new SMC-R connection.  When the
         PeerConnectionClosed indicator is turned on, the SMC-R peer is
         indicating that it is done updating the partner's RMBE.

      D. Conditional state: If the local application has not yet issued
         a close() or shutdown(both), we need to wait until the
         application does so.  Once it does, the local host will send a
         CDC message to notify the peer of the PeerConnectionClosed
         indicator and then transition to the Closed state.

   4. Passive close processing (on the SMC-R peer that receives an
      indication that the partner is closing the connection).

      A. Upon receipt of a CDC message, the SMC-R layer will detect that
         the PeerConnectionClosed indicator or PeerDoneWriting indicator
         is on.  If any outstanding recv() calls are pending, they are
         completed with an indicator that the partner has closed the
         connection (zero-length data presented to the application).  If
         there is any pending data to be written and
         PeerConnectionClosed is on, then an SMC-R connection reset must
         be performed.  The connection then enters the AppCloseWait1
         state on the passive close side waiting for the local
         application to initiate its own close processing.

      B. If the local application issues a shutdown() for writing, then
         the SMC-R layer will send a CDC message to notify the partner
         of the PeerDoneWriting indicator and then transition the local
         side of the SMC-R connection to the AppCloseWait2 state.

      C. When the application issues a close() or shutdown() for both,
         the local SMC-R peer will send a message informing the peer of
         the PeerConnectionClosed indicator and transition to the Closed
         state if the remote peer has also sent the local peer the
         PeerConnectionClosed indicator.  If the peer has not sent the
         PeerConnectionClosed indicator, we transition into the
         PeerFinCloseWait state.

      D. The local SMC-R connection stays in this state until the peer
         sends the PeerConnectionClosed indicator in a CDC message.
         When the indicator is received, we transition to the Closed
         state and are then free to reuse this RMBE.
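
   The normal termination transitions 3A-3D and 4A-4D can be sketched as
   a small state machine.  This is an illustrative model only: the class
   and event names are hypothetical, it starts from the Active state,
   and it leaves out the sending of CDC messages, the unread-data and
   pending-write checks, and all abnormal termination handling.  Events
   that the text above does not describe for a given state only update
   the bookkeeping flags, which is an assumption of this sketch.

```python
from enum import Enum, auto


class State(Enum):
    CLOSED = auto()
    ACTIVE = auto()
    PEER_CLOSE_WAIT1 = auto()
    PEER_CLOSE_WAIT2 = auto()
    APP_FIN_CLOSE_WAIT = auto()
    APP_CLOSE_WAIT1 = auto()
    APP_CLOSE_WAIT2 = auto()
    PEER_FIN_CLOSE_WAIT = auto()


class Event(Enum):
    APP_CLOSE = auto()           # close() or shutdown() for both
    APP_SHUTDOWN_WRITE = auto()  # shutdown() for write
    PEER_DONE_WRITING = auto()   # indicator received in a CDC message
    PEER_CONN_CLOSED = auto()    # indicator received in a CDC message


class Connection:
    def __init__(self):
        self.state = State.ACTIVE
        self.local_closed = False  # local app issued close/shutdown(both)
        self.peer_closed = False   # PeerConnectionClosed was received

    def handle(self, ev):
        if ev is Event.APP_CLOSE:
            self.local_closed = True
        if ev is Event.PEER_CONN_CLOSED:
            self.peer_closed = True

        s = self.state
        if s is State.ACTIVE:
            if ev in (Event.APP_CLOSE, Event.APP_SHUTDOWN_WRITE):
                self.state = State.PEER_CLOSE_WAIT1            # 3A
            else:
                self.state = State.APP_CLOSE_WAIT1             # 4A
        elif s in (State.PEER_CLOSE_WAIT1, State.PEER_CLOSE_WAIT2):
            if ev is Event.PEER_DONE_WRITING:
                self.state = State.PEER_CLOSE_WAIT2            # 3B
            elif ev is Event.PEER_CONN_CLOSED:                 # 3C
                self.state = (State.CLOSED if self.local_closed
                              else State.APP_FIN_CLOSE_WAIT)
        elif s is State.APP_FIN_CLOSE_WAIT:
            if ev is Event.APP_CLOSE:
                self.state = State.CLOSED                      # 3D
        elif s in (State.APP_CLOSE_WAIT1, State.APP_CLOSE_WAIT2):
            if ev is Event.APP_SHUTDOWN_WRITE:
                self.state = State.APP_CLOSE_WAIT2             # 4B
            elif ev is Event.APP_CLOSE:                        # 4C
                self.state = (State.CLOSED if self.peer_closed
                              else State.PEER_FIN_CLOSE_WAIT)
        elif s is State.PEER_FIN_CLOSE_WAIT:
            if ev is Event.PEER_CONN_CLOSED:
                self.state = State.CLOSED                      # 4D
        return self.state


def run(events):
    """Feed a sequence of events to a fresh connection."""
    conn = Connection()
    for ev in events:
        conn.handle(ev)
    return conn.state
```

   For example, run([Event.APP_CLOSE, Event.PEER_DONE_WRITING,
   Event.PEER_CONN_CLOSED]) follows the active close path through
   PeerCloseWait1 and PeerCloseWait2 and ends in the Closed state.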

   Note that each SMC-R peer needs to provide some logic that will
   prevent being stranded in a termination state indefinitely.  For
   example, if an Active Close SMC-R peer is in a PeerCloseWait (1 or 2)
   state waiting for the remote SMC-R peer to update its connection
   termination status, it needs to provide a timer that will prevent it
   from waiting in that state indefinitely should the remote SMC-R peer
   not respond to this termination request.  This could occur in error
   scenarios -- for example, if the remote SMC-R peer suffered a failure
   prior to being able to respond to the termination request or the
   remote application is not responding to this connection termination
   request by closing its own socket.  This latter scenario is similar
   to the TCP FINWAIT2 state, which has been known to sometimes cause
   issues when remote TCP/IP hosts lose track of established connections
   and neglect to close them.  Even though the TCP standards do not
   mandate a timeout from the TCP FINWAIT2 state, most TCP/IP
   implementations assign a timeout for this state.  A similar timeout
   will be required for SMC-R connections.  When this timeout occurs,
   the local SMC-R peer performs TCP reset processing for this
   connection.  However, no additional RDMA writes to the partner RMBE
   can occur at this point (we have already indicated that we are done
   updating the peer's RMBE).  After the TCP connection is reset, the
   RMBE can be returned to the free pool for reallocation.  See
   Section 4.4.2 for more details.
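
   A minimal sketch of the timeout check described above follows.  The
   function name, the state names as strings, and the timeout value used
   in the example are all hypothetical; this document does not define a
   specific timeout, and an implementation may need to guard other
   termination states as well.

```python
import time

# States in which the local peer waits for the remote peer (per the
# example in the text above).
PEER_WAIT_STATES = {"PeerCloseWait1", "PeerCloseWait2"}


def close_wait_expired(state, entered_at, timeout_s, now=None):
    """Return True if the connection has waited too long for the peer.

    entered_at and now are readings of a monotonic clock, in seconds.
    """
    if now is None:
        now = time.monotonic()
    return state in PEER_WAIT_STATES and now - entered_at >= timeout_s


# Hypothetical 60-second timeout, checked 61 seconds after entry.
expired = close_wait_expired("PeerCloseWait2", 100.0, 60.0, now=161.0)
```

   When such a check returns True, the local peer would perform TCP
   reset processing for the connection, as described above.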

   Also note that it is possible to have two SMC-R endpoints initiate an
   Active close concurrently.  In that scenario, the flows above still
   apply; however, both endpoints follow the active close path (path 3).

4.8.2.  Abnormal SMC-R Connection Termination Flows

   Abnormal SMC-R connection termination can occur for a variety of
   reasons, including the following:

   o  The TCP connection associated with an SMC-R connection is reset.
      In TCP, either endpoint can send a RST segment to abort an
      existing TCP connection when error conditions are detected for the
      connection or the application overtly requests that the connection
      be reset.

   o  Normal SMC-R connection termination processing has unexpectedly
      stalled for a given connection.  When the stall is detected
      (connection termination timeout condition), an abnormal SMC-R
      connection termination flow is initiated.

   In these scenarios, it is very important that resources associated
   with the affected SMC-R connections are properly cleaned up to ensure
   that there are no orphaned resources and that resources can reliably
   be reused for new SMC-R connections.  Given that SMC-R relies heavily
   on the RDMA write processing, special care needs to be taken to
   ensure that an RMBE is no longer being used by an SMC-R peer before
   logically reassigning that RMBE to a new SMC-R connection.

   When an SMC-R peer initiates a TCP connection reset, it also
   initiates an SMC-R abnormal connection flow at the same time.  The
   SMC-R peers explicitly signal their intent to abnormally terminate an
   SMC-R connection and await explicit acknowledgment that the peer has
   received this notification and has also completed abnormal connection
   termination on its end.  Note that TCP connection reset processing
   can occur in parallel to these flows.

                            +-----------------+
            |-------------->|     CLOSED      |<-------------|
            |               |                 |              |
            |               +-----------------+              |
            |                                                |
            |                                                |
            |                                                |
            |           +-----------------------+            |
            |           |     Any state         |            |
            |1B         | (before setting       |          2B|
            |           |  PeerConnectionClosed |            |
            |           |  indicator in         |            |
            |           |  peer's RMBE)         |            |
            |           +-----------------------+            |
            |         1A        |         |      2A          |
            |     Active Abort  |         |  Passive Abort   |
            |                   V         V                  |
            |       +--------------+   +--------------+      |
            |-------|PeerAbortWait |   | Process Abort|------|
                    |              |   |              |
                    +--------------+   +--------------+

      Figure 23: SMC-R Abnormal Connection Termination State Diagram

   Figure 23 above shows the SMC-R abnormal connection termination state
   diagram:

   1. Active abort designates the SMC-R peer that is initiating the TCP
      RST processing.  At the time that the TCP RST is sent, the active
      abort side must also do the following:

      A. Send the PeerConnAbort indicator to the partner in a CDC
         message, and then transition to the PeerAbortWait state.
         During this state, it will monitor this SMC-R connection
         waiting for the peer to send its corresponding PeerConnAbort
         indicator but will ignore any other activity in this connection
         (i.e., new incoming data).  It will also generate an
         appropriate error to any socket API calls issued against this
         socket (e.g., ECONNABORTED, ECONNRESET).

      B. Once the peer sends the PeerConnAbort indicator to the local
         host, the local host can transition this SMC-R connection to
         the Closed state and reuse this RMBE.  Note that the SMC-R peer
         that goes into the active abort state must provide some
         protection against staying in that state indefinitely should
         the remote SMC-R peer not respond by sending its own
         PeerConnAbort indicator to the local host.  While this should
         be a rare scenario, it could occur if the remote SMC-R peer
         (passive abort) suffered a failure right after the local SMC-R
         peer (active abort) sent the PeerConnAbort indicator.  To
         protect against these types of failures, a timer can be set
         after entering the PeerAbortWait state; if that timer expires
         before the peer has sent its own PeerConnAbort indicator (to
         the active abort side), this RMBE can be returned to the free
         pool for possible reallocation.  See Section 4.4.2 for more
         details.

   2. Passive abort designates the SMC-R peer that receives an SMC-R
      abort from its partner, signaled by the PeerConnAbort indicator
      that the partner sends in a CDC message.  Upon receiving this
      indicator, the local peer must do the following:

      A. Using the appropriate error codes, indicate to the socket
         application that this connection has been aborted, and then
         purge all in-flight data for this connection that is waiting to
         be read or waiting to be sent.

      B. Send a CDC message carrying the PeerConnAbort indicator to the
         peer and, once that is completed, transition this RMBE to
         the Closed state.
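
   The transitions in Figure 23 can be sketched as a small state
   machine.  This is an illustrative model only: the class and method
   names, the string used for the CDC indicator, the placeholder
   in-flight data, and the specific error codes chosen for each side
   are assumptions made for the example.

```python
from enum import Enum, auto


class State(Enum):
    ACTIVE = auto()            # stands in for "any state" before the abort
    PEER_ABORT_WAIT = auto()
    PROCESS_ABORT = auto()
    CLOSED = auto()


class AbortingConnection:
    """Sketch of the Figure 23 transitions for one SMC-R connection."""

    def __init__(self, send_cdc):
        self.state = State.ACTIVE
        self.send_cdc = send_cdc        # hypothetical CDC send callable
        self.socket_error = None        # error surfaced to socket API calls
        self.pending_data = [b"unread", b"unsent"]  # placeholder data

    def local_abort(self):
        """Path 1A: sent together with the TCP RST by the active side."""
        if self.state is not State.ACTIVE:
            return
        self.send_cdc("PeerConnAbort")
        self.socket_error = "ECONNABORTED"
        self.state = State.PEER_ABORT_WAIT

    def on_cdc(self, indicator):
        """Handle an incoming CDC indicator from the peer."""
        if indicator != "PeerConnAbort":
            # Only the abort indicator is modeled; in PeerAbortWait any
            # other activity on the connection is ignored.
            return
        if self.state is State.PEER_ABORT_WAIT:
            self.state = State.CLOSED             # path 1B
        elif self.state is State.ACTIVE:
            self.state = State.PROCESS_ABORT      # path 2A
            self.socket_error = "ECONNRESET"
            self.pending_data.clear()             # purge in-flight data
            self.send_cdc("PeerConnAbort")
            self.state = State.CLOSED             # path 2B

    def on_abort_timeout(self):
        """Protection against waiting in PeerAbortWait indefinitely."""
        if self.state is State.PEER_ABORT_WAIT:
            self.state = State.CLOSED
```

   If both endpoints call local_abort() before either indicator
   arrives, each one is in PeerAbortWait when the other's indicator is
   delivered, so both close through path 1B.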

   If an SMC-R peer receives a TCP RST for a given SMC-R connection, it
   also initiates SMC-R abnormal connection termination processing if it
   has not already been notified (via the PeerConnAbort indicator) that
   the partner is severing the connection.  It is possible to have two
   SMC-R endpoints concurrently be in an active abort role for a given
   connection.  In that scenario, the flows above still apply but both
   endpoints take the active abort path (path 1).

4.8.3.  Other SMC-R Connection Termination Conditions

   The following are additional conditions that have implications for
   SMC-R connection termination:

   o  An SMC-R peer being gracefully shut down.  If an SMC-R peer
      supports a graceful shutdown operation, it should attempt to
      terminate all SMC-R connections as part of shutdown processing.
      This could be accomplished via LLC DELETE LINK requests on all
      active SMC-R links.

   o  Abnormal termination of an SMC-R peer.  In this case, there may
      be no opportunity for the host to perform any SMC-R cleanup
      processing, so it is up to the remote peer to
      detect a RoCE communications failure with the failing host.  This
      could trigger SMC-R link switchover, but that would also generate
      RoCE errors, causing the remote host to eventually terminate all
      existing SMC-R connections to this peer.

   o  Loss of RoCE connectivity between two SMC-R peers.  If two peers
      are no longer reachable across any links in their SMC-R link
      group, then both peers perform a TCP reset for the connections,
      generate an error to the local applications, and free up all QP
      resources associated with the link group.
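
   The loss-of-connectivity case can be illustrated with a short
   sketch.  The LinkGroup model and handle_link_failure() helper below
   are hypothetical; they only show that a single link failure leaves
   room for switchover, while the loss of every link in the link group
   leads to resetting all connections and releasing the QP resources.

```python
from dataclasses import dataclass, field


@dataclass
class LinkGroup:
    """Hypothetical model: link name -> reachable flag, plus connections."""
    links: dict = field(default_factory=dict)
    connections: list = field(default_factory=list)
    qps_allocated: bool = True


def handle_link_failure(group, failed_link):
    """Mark one link down; tear everything down only if no link remains.

    Returns the connections that were reset (an empty list when a
    surviving link still allows switchover).
    """
    group.links[failed_link] = False
    if any(group.links.values()):
        return []                       # a surviving link can take over
    reset = list(group.connections)     # TCP reset and error to each app
    group.connections.clear()
    group.qps_allocated = False         # free QP resources for the group
    return reset
```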

5.  Security Considerations

5.1.  VLAN Considerations

   The concepts and access control of virtual LANs (VLANs) must be
   extended to also cover the RoCE network traffic flowing across the
   Ethernet.

   The RoCE VLAN configuration and access permissions must mirror the IP
   VLAN configuration and access permissions over the Converged Enhanced
   Ethernet fabric.  This means that hosts, routers, and switches that
   have access to specific VLANs on the IP fabric must also have the
   same VLAN access across the RoCE fabric.  In other words, the SMC-R
   connectivity will follow the same virtual network access permissions
   as normal TCP/IP traffic.

5.2.  Firewall Considerations

   As mentioned above, the RoCE fabric inherits the same VLAN
   topology/access as the IP fabric.  RoCE is a Layer 2 protocol that
   requires both endpoints to reside in the same Layer 2 network (i.e.,
   VLAN).  RoCE traffic cannot traverse multiple VLANs, as there is no
   support for routing RoCE traffic beyond a single VLAN.  As a result,
   SMC-R communications will also be confined to peers that are members
   of the same VLAN.  IP-based firewalls are typically inserted between
   VLANs (or physical LANs) and rely on normal IP routing to insert
   themselves in the data path.  Since RoCE (and by extension SMC-R) is
   not routable beyond the local VLAN, there is no ability to insert a
   firewall in the network path of two SMC-R peers.

5.3.  Host-Based IP Filters

   Because SMC-R maintains the TCP three-way handshake for connection
   setup before switching to RoCE out of band, existing IP filters that
   control connection setup flows remain effective in an SMC-R
   environment.  IP filters that operate on traffic flowing in an active
   TCP connection are not supported, because the connection data does
   not flow over IP.

5.4.  Intrusion Detection Services

   Similar to IP filters, intrusion detection services that operate on
   TCP connection setups are compatible with SMC-R with no changes
   required.  However, once the TCP connection has switched to RoCE out
   of band, packets are not available for examination.

5.5.  IP Security (IPsec)

   IP security is not compatible with SMC-R, because there are no IP
   packets on which to operate.  TCP connections that require IP
   security must opt out of SMC-R.
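
   The constraints in Sections 5.2, 5.3, and 5.5 can be read as a
   pre-check that a host might apply before choosing SMC-R for a
   connection.  The function below is a hypothetical sketch of such a
   check; its name, parameters, and reason strings are illustrative
   and are not defined by this protocol.

```python
def smc_r_opt_out_reasons(local_vlan, peer_vlan, requires_ipsec,
                          filters_active_traffic):
    """Return the reasons a connection should stay on plain TCP/IP.

    An empty list means none of the modeled constraints applies.
    """
    reasons = []
    if local_vlan != peer_vlan:
        # RoCE is not routable beyond a single VLAN.
        reasons.append("peers are not in the same VLAN")
    if requires_ipsec:
        # There are no IP packets for IPsec to operate on.
        reasons.append("connection requires IPsec")
    if filters_active_traffic:
        # Connection data does not flow over IP after the switch.
        reasons.append("IP filter inspects in-connection traffic")
    return reasons
```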

5.6.  TLS/SSL

   Transport Layer Security/Secure Socket Layer (TLS/SSL) is preserved
   in an SMC-R environment.  The TLS/SSL layer resides above the SMC-R
   layer, and outgoing connection data is encrypted before being passed
   down to the SMC-R layer for RDMA write.  Similarly, incoming
   connection data goes through the SMC-R layer encrypted and is
   decrypted by the TLS/SSL layer as it is today.

   The TLS/SSL handshake messages flow over the TCP connection after the
   connection has switched to SMC-R, and so they are exchanged using
   RDMA writes by the SMC-R layer, transparently to the TLS/SSL layer.

6.  IANA Considerations

   The scarcity of TCP option codes available for assignment is
   understood, and this architecture uses experimental TCP options
   following the conventions of [RFC6994] ("Shared Use of Experimental
   TCP Options").

   TCP ExID 0xE2D4C3D9 has been registered with IANA as a TCP Experiment
   Identifier.  See Section 3.1.
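
   As an illustration of how the ExID is carried, the sketch below
   encodes and decodes an experimental TCP option in the layout that
   [RFC6994] gives for a 32-bit ExID (Kind, Length, ExID).  The helper
   names are hypothetical, and the default kind value of 254 is an
   assumption made for the example; [RFC6994] permits kinds 253 and
   254.

```python
import struct

SMC_R_EXID = 0xE2D4C3D9      # ExID registered for this protocol
EXPERIMENTAL_KIND = 254      # assumption: RFC 6994 allows 253 or 254


def build_exid_option(exid=SMC_R_EXID, kind=EXPERIMENTAL_KIND):
    """Encode Kind, Length, and a 32-bit ExID in network byte order."""
    length = 2 + 4           # kind byte + length byte + 4-byte ExID
    return struct.pack("!BBI", kind, length, exid)


def parse_exid_option(raw):
    """Return (kind, exid), or None if raw is not a 6-byte ExID option."""
    if len(raw) != 6:
        return None
    kind, length, exid = struct.unpack("!BBI", raw)
    if kind not in (253, 254) or length != 6:
        return None
    return kind, exid
```

   A receiver would compare the parsed ExID against 0xE2D4C3D9 before
   treating the option as an SMC-R capability indication.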

   If this protocol achieves wide acceptance, a discrete option code may
   be requested by subsequent versions of this protocol.

7.  Normative References

   [RFC793]   Postel, J., "Transmission Control Protocol", STD 7,
              RFC 793, DOI 10.17487/RFC0793, September 1981,
              <http://www.rfc-editor.org/info/rfc793>.

   [RFC6994]  Touch, J., "Shared Use of Experimental TCP Options",
              RFC 6994, DOI 10.17487/RFC6994, August 2013,
              <http://www.rfc-editor.org/info/rfc6994>.

   [RoCE]     InfiniBand, "RDMA over Converged Ethernet specification",
              <https://cw.infinibandta.org/wg/Members/documentRevision/
              download/7149>.

