
RFC 7609

 
 
 

IBM's Shared Memory Communications over RDMA (SMC-R) Protocol

Part 2 of 6, p. 11 to 26
2.  Link Architecture

   An SMC-R link is based on reliably connected queue pairs (QPs) that
   form a "logical point-to-point link" between the two SMC-R peers over
   a RoCE fabric.  An SMC-R link extends from SMC-R peer to SMC-R peer,
   where typically each peer would be a TCP/IP stack and would reside on
   separate hosts.

                            ,,.--..,_
     +----+             _-``         `-,           +-----+
     |QP 8|            -   RoCE         ',         |QP 64|
     |    |          /     VLAN M         .        |     |
     +----+--------+/                     \+-------+-----+
      | RNIC 1     |    SMC-R Link         | RNIC 2     |
      |            |<--------------------->|            |
      +------------+ ,                    /+------------+
              MAC A (GID A)             MAC B (GID B)
                       .                .`
                        `',          ,-`
                           ``''--''``

                       Figure 1: SMC-R Link Overview

   Figure 1 illustrates an overview of the basic concepts of SMC-R peer-
   to-peer connectivity; this is called the SMC-R link.  The SMC-R link
   forms a logical point-to-point connection between two SMC-R peers via
   RoCE.  The SMC-R link is defined and identified by the following
   attributes:

      SMC-R link = RC QPs
         (source VMAC GID QP + target VMAC GID QP + VLAN ID)

   The SMC-R link can optionally be associated with a VLAN ID.  If VLANs
   are in use for the associated IP (LAN) connection, then the VLAN
   attribute is carried over on the SMC-R link.  When VLANs are in use,
   each SMC-R link group is associated with a single and specific VLAN.
   The RoCE fabric is the same physical Ethernet LAN used for standard
   TCP/IP-over-Ethernet communications, with switches as described in
   Section 1.1.1.
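
   The identifying attributes listed above can be modeled as a small
   immutable record.  This is only an illustrative sketch: the names
   Endpoint and SmcrLinkId and their field names are hypothetical and
   are not defined by the architecture.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Endpoint:
    """One end of an SMC-R link: an RC queue pair on an RNIC."""
    mac: str  # (virtual) MAC address of the RNIC
    gid: str  # RoCE GID of the RNIC
    qp: int   # queue pair number


@dataclass(frozen=True)
class SmcrLinkId:
    """Hypothetical key built from the attributes that identify a link."""
    source: Endpoint
    target: Endpoint
    vlan_id: Optional[int] = None  # None when no VLAN is in use


# The link drawn in Figure 1 (QP 8 on RNIC 1, QP 64 on RNIC 2).
link = SmcrLinkId(
    source=Endpoint(mac="MAC A", gid="GID A", qp=8),
    target=Endpoint(mac="MAC B", gid="GID B", qp=64),
)
```

   Because the record is frozen, it is hashable and could serve as a
   dictionary key when tracking the links known to a stack.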

   An SMC-R link is designed to support multiple TCP connections between
   the same two peers.  An SMC-R link is intended to be long lived,
   while the underlying TCP connections can dynamically come and go.
   The associated RMBs can also be dynamically added and removed from
   the link as needed.  The first TCP connection between the peers
   establishes the SMC-R link.  Subsequent TCP connections then use the
   previously established link.  When the last TCP connection
   terminates, the link can then be terminated, typically after an
   implementation-defined idle timeout period has elapsed.  The TCP
   server is responsible for initiating and terminating the SMC-R link.
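
   The lifetime rule described above (the link outlives individual TCP
   connections and may be terminated once it has been idle long enough)
   can be sketched as a simple counter with an idle timestamp.  The
   class name, method names, and the use of a caller-supplied clock
   value are assumptions made for illustration; the timeout itself is
   implementation defined.

```python
from typing import Optional


class LinkLifecycle:
    """Tracks how many TCP connections use a link and when it went idle."""

    def __init__(self, idle_timeout: float) -> None:
        self.idle_timeout = idle_timeout         # implementation-defined value
        self.connections = 0                     # TCP connections using the link
        self.idle_since: Optional[float] = None  # when the last one ended

    def connection_opened(self) -> None:
        self.connections += 1
        self.idle_since = None  # the link is in use again

    def connection_closed(self, now: float) -> None:
        if self.connections == 0:
            raise ValueError("no open connections on this link")
        self.connections -= 1
        if self.connections == 0:
            self.idle_since = now  # start the idle timer

    def may_terminate(self, now: float) -> bool:
        """True once the link has been idle for the full timeout period."""
        return (self.idle_since is not None
                and now - self.idle_since >= self.idle_timeout)
```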

2.1.  Remote Memory Buffers (RMBs)

   Figure 2 shows the hosts -- Hosts X and Y -- and their associated
   RMBs within each host.  With the SMC-R link, and the associated RKeys
   and RDMA virtual addresses, each SMC-R-enabled TCP/IP stack can
   remotely access its peer's RMBs using RDMA.  The RKeys and virtual
   addresses are exchanged during the rendezvous processing when the
   link is established.  The combination of the RKey and the virtual
   address is the RToken.  Note that the SMC-R link ends at the QP
   providing access to the RMB (via the link + RToken).

          Host X                                     Host Y
     +-------------------+        ,.--.,_       +-------------------+
     |                   |     .'`       '.     |                   |
     | Protection        |   ,'            `,   |    Protection     |
     | Domain X          |  /                \  |    Domain Y       |
     |            +------+ /                  \ +------+            |
     |       QP 8 |RNIC 1| |   SMC-R Link     | |RNIC 2|  QP 64     |
     |        |   |      |<-------------------->|      |   |        |
     |        |   |      ||                    ||      |   |        |
     |        |   +------+|    VLAN A          |+------+   |        |
     |        |          ||                    ||          |        |
     |        |          | |   RoCE           | |          |        |
     |        |RToken X  | \                  / |RToken Y  |        |
     |        |          |  \                /  |          |        |
     |        V          |   `.            ,'   |          V        |
     | +--------+        |     '._       ,'     |        +--------+ |
     | |        |        |        `''-'``       |        |        | |
     | | RMB    |        |                      |        | RMB    | |
     | |        |        |                      |        |        | |
     | +--------+        |                      |        +--------+ |
     +-------------------+                      +-------------------+

                       Figure 2: SMC-R Link and RMBs

   An SMC-R link can support multiple RMBs that are independently
   managed by each peer.  The number and the size of RMBs are managed by
   the peers based on the host's unique memory management requirements;
   however, the maximum number of RMBs that can be associated with a link
   group on one peer is 255.  The QP has a single protection domain, but
   each RMB has a unique RToken.  All RTokens must be exchanged with the
   peer.

   Each peer manages the RMBs in its local memory for its remote SMC-R
   peer by sharing access to the RMBs via RTokens with its peers.  The
   remote peer writes into the RMBs via RDMA, and the local peer (RMB
   owner) then reads from the RMBs.

   When two peers decide to use SMC-R for a given TCP connection, they
   each allocate a local RMB element for the TCP connection and
   communicate the location of this local RMB element during rendezvous
   processing.  To that end, RMB elements are created in pairs, with one
   RMB element allocated locally on each peer of the SMC-R link.

                  ---  +------------+---------------+
                  /\   |Eye Catcher |               |
                   |   +------------+               |
                   |   |                            |
         RMB Element 1 |                            |
                   |   |   Receive Buffer           |
                   |   |                            |
                   |   |                            |
                  \/   |                            |
                  ---  +------------+---------------+
                  /\   |Eye Catcher |               |
                   |   +------------+               |
                   |   |                            |
         RMB Element 2 |                            |
                   |   |   Receive Buffer           |
                   |   |                            |
                   |   |                            |
                  \/   |                            |
                  ---  +----------------------------+
                       |            .               |
                       |            .               |
                       |            .               |
                       |            .               |
                       |    (up to 255 elements)    |
                       +----------------------------+

                           Figure 3: RMB Format

   Figure 3 illustrates the basic format of an RMB.  The RMB is a
   virtual memory buffer whose backing real memory is pinned, which can
   support up to 255 TCP connections to exactly one remote SMC-R peer.
   Each RMB is therefore associated with the SMC-R links within a link
   group for the two peers and a specific RoCE Protection Domain.  Other
   than the two peers identified by the SMC-R link, no other SMC-R peers
   can have RDMA access to an RMB; this requires a unique Protection
   Domain for every SMC-R link.  This is critical to ensure integrity of
   SMC-R communications.

   RMBs are subdivided into multiple elements for efficiency, with each
   RMB Element (RMBE) associated with a single TCP connection.
   Therefore, multiple TCP connections across an SMC-R link group can
   share the same memory for RDMA purposes, reducing the overhead of
   having to register additional memory with the RNIC for every new TCP
   connection.  The number of elements in an RMB and the size of each
   RMBE are entirely governed by the owning peer, subject to the SMC-R
   architecture rules; however, all RMB elements within a given RMB must
   be the same size.  Each peer can decide the level of resource-sharing
   that is desirable across TCP connections based on local constraints,
   such as available system memory.  An RMB element is identified to the
   remote SMC-R peer via an RMB Element Token, which consists of the
   following:

   o  RMB RToken: The combination of the RKey and virtual address
      provided by the RNIC that identifies the start of the RMB for RDMA
      operations.

   o  RMB Index: Identifies the RMB element index in the RMB.  Used to
      locate a specific RMB element within an RMB.  Valid value range is
      1-255.

   o  RMB Element Length: The length of the RMB element's eye catcher
      plus the length of the receive buffer.  This length is equal for
      all RMB elements in a given RMB.  This length can be variable
      across different RMBs.
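
   The three parts of the RMB Element Token listed above can be
   sketched as follows.  The class and field names are hypothetical.
   The element_address() helper additionally ASSUMES that elements are
   laid out back to back with index 1 at the start of the RMB; the text
   above only states that the index is used to locate the element.

```python
from dataclasses import dataclass

MAX_RMB_ELEMENTS = 255  # valid RMB index range is 1-255


@dataclass(frozen=True)
class RToken:
    rkey: int             # RKey provided by the RNIC
    virtual_address: int  # RDMA virtual address of the start of the RMB


@dataclass(frozen=True)
class RmbElementToken:
    rtoken: RToken
    index: int            # which element within the RMB (1-255)
    element_length: int   # eye catcher plus receive buffer, in bytes

    def __post_init__(self) -> None:
        if not 1 <= self.index <= MAX_RMB_ELEMENTS:
            raise ValueError("RMB index must be in the range 1-255")
        if self.element_length <= 0:
            raise ValueError("RMB element length must be positive")

    def element_address(self) -> int:
        # ASSUMPTION: equal-sized elements stored contiguously, 1-based.
        return self.rtoken.virtual_address + (self.index - 1) * self.element_length
```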

   Multiple RMBs can be associated with an SMC-R link group, and each
   peer in an SMC-R link group manages allocation of its RMBs.  RMB
   allocation can be asymmetric.  For example, Server X can allocate two
   RMBs to an SMC-R link group while Server Y allocates five.  This
   provides maximum implementation flexibility to allow hosts to
   optimize RMB management for their own local requirements.  The
   maximum number of RMBs that can be allocated on one peer to a link
   group is 255.  If more RMBs are required, the peer may fall back to
   IP for subsequent connections or, if the peer is the server, create a
   parallel link group.

   One use case for multiple RMBs is multiple receive buffer sizes.
   Since every element in an RMB must be the same size, multiple RMBs
   with different element sizes can be allocated if varying receive
   buffer sizes are required.

   Also, since the maximum number of TCP connections whose receive
   buffers can be allocated to an RMB is 255, multiple RMBs may be
   required to provide capacity for large numbers of TCP connections
   between two peers.

   Separately from the RMB, the TCP/IP stack that owns each RMB
   maintains control data for each RMB element within its local control
   structures.  The control data contains flags for maintaining the
   state of the TCP data (for example, urgent data indicator) and, most
   importantly, the following two cursors, which are illustrated below
   in Figure 4:

   o  The peer producer cursor: This is a wrapping offset into the
      RMB element's receive buffer that points to the next byte of data
      to be written by the remote peer.  This cursor is provided by the
      remote peer in a Connection Data Control (CDC) message, which is
      sent using RoCE SendMsg processing, and tells the local peer how
      far it can consume data in the RMBE buffer.

   o  The peer consumer cursor: This is a wrapping offset into the
      remote peer's RMB element's receive buffer that points to the next
      byte of data to be consumed by the remote peer in its own RMBE.
      The local peer cannot write into the remote peer's RMBE beyond
      this point without causing data loss.  This cursor is also
      provided by the peer using a Connection Data Control message.

   Each TCP connection peer maintains its cursors for a TCP connection's
   RMBE in its local control structures.  In other words, the peer who
   writes into a remote peer's RMBE provides its producer cursor to the
   peer whose RMBE it has written into.  The peer who reads from its
   RMBE provides its consumer cursor to the writing peer.  In this
   manner, the reads and writes between peers are kept coordinated.

   For example, referring to Figure 4, Peer B writes the data shown as
   the hatched area into the receive buffer of Peer A's RMBE.  After
   that write
   completes, Peer B uses a CDC message to update its producer cursor to
   Peer A, to indicate to Peer A how much data is available for Peer A
   to consume.  The CDC message that Peer B sends to Peer A wakes up
   Peer A and notifies it that there is data to be consumed.

   Similarly, when Peer A consumes data written by Peer B, it uses a CDC
   message to update its consumer cursor to Peer B to let Peer B know
   how much data it has consumed, so Peer B knows how much space is
   available for further writes.  If Peer B were to write enough data to
   Peer A that it would wrap the RMBE receive buffer and exceed the
   consumer cursor, data loss would result.
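
   The interplay of the two cursors can be illustrated with a toy ring
   buffer.  This is a single-process sketch, not the wire protocol: one
   object stands in for both peers, no CDC messages are exchanged, and
   the class and method names are hypothetical.  To tell a full buffer
   from an empty one, the sketch ASSUMES running byte counts are kept
   and the wrapping cursors are derived from them.

```python
from typing import Optional


class RmbeReceiveBuffer:
    """Toy model of one RMBE receive buffer and its two cursors."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.buf = bytearray(size)
        self.produced = 0  # total bytes written by the remote peer
        self.consumed = 0  # total bytes read by the owning peer

    @property
    def producer_cursor(self) -> int:
        """Wrapping offset of the next byte to be written."""
        return self.produced % self.size

    @property
    def consumer_cursor(self) -> int:
        """Wrapping offset of the next byte to be consumed."""
        return self.consumed % self.size

    def writable(self) -> int:
        """Bytes the writer may add without passing the consumer cursor."""
        return self.size - (self.produced - self.consumed)

    def write(self, data: bytes) -> None:
        # Refuse a write that would wrap past unconsumed data (data loss).
        if len(data) > self.writable():
            raise BufferError("write would exceed the consumer cursor")
        for byte in data:
            self.buf[self.produced % self.size] = byte
            self.produced += 1

    def read(self, n: Optional[int] = None) -> bytes:
        available = self.produced - self.consumed
        n = available if n is None else min(n, available)
        out = bytes(self.buf[(self.consumed + i) % self.size] for i in range(n))
        self.consumed += n
        return out
```

   In a real exchange, the writer would learn the consumer cursor (and
   the reader the producer cursor) only from CDC messages, so each side
   works from a possibly stale but always safe view of the other.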

   Note that this is a simplistic description of the control flows, and
   they are optimized to minimize the number of CDC messages required,
   as described in Section 4.7 ("RMB Data Flows").

      Peer A's RMBE Control Info            Peer B's RMBE Control Info
     +--------------------------+          +--------------------------+
     |                          |          |                          |
      /----Peer producer cursor |    +-----+-Peer consumer cursor     |
    /|                          |    |     |                          |
   | +--------------------------+    |     +--------------------------+
   |  Peer A's RMBE                  |
   | +--------------------------+    |
   | |            +------------------+
   | |            |             |
   | |            \/            |
   | |             +------------|
   | |-------------+/////////// |
   | |//RDMA data written by ///|
   | |/// Peer B that is ////// |
   | |/available to be consumed/|
   | |///////////////////////// |
   | |///////// +---------------|
   | |----------+/\             |
   | |            |             |
    \|            |             |
     \           /              |
     |\---------/               |
     |                          |
     |                          |

                          Figure 4: RMBE Cursors

   Additional flags and indicators are communicated between peers.  In
   all cases, these flags and indicators are updated by the peer using
   CDC messages, which are sent using RoCE SendMsg.  More details on
   these additional flags and indicators are described in Section 4.3
   ("RMBE Control Information").

2.2.  SMC-R Link Groups

   SMC-R links are logically grouped together to form an SMC-R link
   group.  A link group supports multiple links between the same two
   peers in order to provide:

   o  Resilience: Provides transparent and dynamic switching of the link
      used by existing TCP connections during link failures, typically
      hardware related.  TCP traffic using the failing link can be
      switched to an active link within the link group, thereby avoiding
      disruptions to application workloads.

   o  Link utilization: Provides an active/active link usage model
      allowing TCP traffic to be balanced across the links, which
      increases bandwidth and also avoids hardware imbalances and
      bottlenecks.  Note that both adapter and switch utilization can
      become potential resource constraint issues.

   SMC-R link group support is required.  Resilience is not optional.
   However, the user can elect to provision a single RNIC (on one or
   both hosts).

   Multiple links that are formed between the same two peers fall into
   two distinct categories:

   1. Equal Links: Links providing equal access to the same RMB(s) at
      both endpoints, whereby all TCP connections associated with the
      links must have the same VLAN ID and have the same TCP server and
      TCP client roles or relationship.

   2. Unequal Links: Links providing access to unique, unrelated and
      isolated RMB(s) (i.e., for unique VLANs or unique and isolated
      application workloads, etc.) or having unique TCP server or client
      roles.

   Links that are logically grouped together forming an SMC-R link group
   must be equal links.

2.2.1.  Link Group Types

   Equal links within a link group also have another "Link Group Type"
   attribute based on the link's associated underlying physical path.
   The following SMC-R link types are defined:

   1. Single link: the only active link within a link group

   2. Parallel link: not allowed -- SMC-R links having the same physical
      RNIC at both hosts

   3. Asymmetric link: links that have unique RNIC adapters at one host
      but share a single adapter at the peer host

   4. Symmetric link: links that have unique RNIC adapters at both hosts

   These link group types are further explained in the following figures
   and descriptions.
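
   As a rough illustration, the four link types can be derived from
   which RNICs a candidate link would use compared with the links
   already in the group.  The function name and the representation of a
   link as a (local RNIC, remote RNIC) pair are assumptions made for
   this sketch.  Combinations not described in this section (such as a
   candidate that shares each of its RNICs with a different existing
   link) are simply reported as asymmetric here.

```python
from typing import Iterable, Tuple

# Hypothetical representation: (local RNIC name, remote RNIC name).
Link = Tuple[str, str]


def classify_new_link(existing: Iterable[Link], new: Link) -> str:
    """Classify a candidate link against the links already in the group."""
    links = list(existing)
    if not links:
        return "single"      # the only link in the group
    if new in links:
        return "parallel"    # same RNIC at both hosts: not allowed
    shares_local = any(local == new[0] for local, _ in links)
    shares_remote = any(remote == new[1] for _, remote in links)
    if shares_local or shares_remote:
        return "asymmetric"  # unique RNIC at one host, shared at the other
    return "symmetric"       # unique RNICs at both hosts
```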

   Figure 2 above shows the single-link case.  The single link
   illustrated in Figure 2 also establishes the SMC-R link group.  Link
   groups are supposed to have multiple links, but when only one RNIC is
   available at both hosts then only a single link can be created.  This
   is expected to be a transient case.

   Figure 5 shows the symmetric-link case.  Both hosts have unique and
   redundant RNIC adapters.  This configuration meets the objectives for
   providing full RoCE redundancy required to provide the level of
   resilience required for high availability for SMC-R.  While this
   configuration is not required, it is a strongly recommended "best
   practice" for the exploitation of SMC-R.  Single and asymmetric links
   must be supported but are intended to provide for short-term
   transient conditions -- for example, during a temporary outage or
   recycle of an RNIC.

          Host X                                     Host Y
     +-------------------+                      +-------------------+
     |                   |                      |                   |
     | Protection        |                      |    Protection     |
     | Domain X          |                      |    Domain Y       |
     |            +------+                      +------+            |
     |       QP 8 |RNIC 1|     SMC-R Link 1     |RNIC 2|  QP 64     |
     |RToken X|   |      |<-------------------->|      |   |        |
     |        |   |      |                      |      |   |RToken Y|
     |       \/   +------+                      +------+  \/        |
     |+--------+         |                      |        +--------+ |
     ||        |         |                      |        |        | |
     || RMB    |         |                      |        | RMB    | |
     ||        |         |                      |        |        | |
     |+--------+         |                      |        +--------+ |
     |       /\   +------+                      +------+  /\        |
     |RToken Z|   |      |     SMC-R Link 2     |      |   |RToken W|
     |        |   |RNIC 3|<-------------------->|RNIC 4|   |        |
     |       QP 9 |      |                      |      |  QP 65     |
     |            +------+                      +------+            |
     +-------------------+                      +-------------------+

                      Figure 5: Symmetric SMC-R Links

          Host X                                     Host Y
     +-------------------+                      +-------------------+
     |                   |                      |                   |
     | Protection        |                      |    Protection     |
     | Domain X          |                      |    Domain Y       |
     |            +------+                      +------+            |
     |       QP 8 |RNIC 1|     SMC-R Link 1     |RNIC 2|  QP 64     |
     |RToken X|   |      |<-------------------->|      |   |        |
     |        |   |      |                   .->|      |   |RToken Y|
     |       \/   +------+                 .`   +------+  \/        |
     |+--------+         |               .`     |        +--------+ |
     ||        |         |             .`       |        |        | |
     || RMB    |         |           .`         |        | RMB    | |
     ||        |         |         .`SMC-R      |        |        | |
     |+--------+         |       .` Link 2      |        +--------+ |
     |       /\   +------+     .`               +------+            |
     |RToken Z|   |      |   .`                 |      |down or     |
     |        |   |RNIC 3|<-`                   |RNIC 4|unavailable |
     |       QP 9 |      |                      |      |            |
     |            +------+                      +------+            |
     +-------------------+                      +-------------------+

                     Figure 6: Asymmetric SMC-R Links

   In the example provided by Figure 6, Host X has two RNICs but Host Y
   only has one RNIC because RNIC 4 is not available.  This
   configuration allows for the creation of an asymmetric link.  While
   an asymmetric link will provide some resilience (for example, when
   RNIC 1 fails), ideally each host should provide two redundant RNICs.
   This should be a transient case, and when RNIC 4 becomes available,
   this configuration must transition to a symmetric-link configuration.
   This transition is accomplished by first creating the new symmetric
   link and then deleting the asymmetric link with reason code
   "Asymmetric link no longer needed" specified in the DELETE LINK LLC
   message.

          Host X                                     Host Y
     +-------------------+                      +-------------------+
     |                   |                      |                   |
     | Protection        |                      |    Protection     |
     | Domain X          |                      |    Domain Y       |
     |            +------+  SMC-R Link 1        +------+            |
     |       QP 8 |RNIC 1|<-------------------->|RNIC 2|  QP 64     |
     |RToken X|   |      |                      |      |   |        |
     |        |   |      |<-------------------->|      |   |RToken Y|
     |       \/   +------+  SMC-R Link 2        +------+  \/        |
     |+--------+   QP 9  |                      | QP 65  +--------+ |
     ||        |    |    |                      |  |     |        | |
     || RMB    |<-- +    |                      |  +---->| RMB    | |
     ||        |         |                      |        |        | |
     |+--------+         |                      |        +--------+ |
     |            +------+                      +------+            |
     |     down or|      |                      |      |down or     |
     | unavailable|RNIC 3|                      |RNIC 4|unavailable |
     |            |      |                      |      |            |
     |            +------+                      +------+            |
     +-------------------+                      +-------------------+

              Figure 7: SMC-R Parallel Links (Not Supported)

   Figure 7 shows parallel links, which are two links in the link group
   that use the same hardware.  This configuration is not permitted.
   Because SMC-R multiplexes multiple TCP connections over an SMC-R link
   and both links are using the exact same hardware, there is no
   additional redundancy or capacity benefit obtained from this
   configuration.  In addition to providing no real benefit, this
   configuration adds the unnecessary overhead of additional queue
   pairs, generation of additional RKeys, etc.

2.2.2.  Maximum Number of Links in Link Group

   The SMC-R protocol defines a maximum of eight symmetric SMC-R links
   within a single SMC-R link group.  This allows for support for up to
   eight unique physical paths between peer hosts.  However, in terms of
   meeting the basic requirements for redundancy, support for at least
   two symmetric links must be implemented.  Supporting more than two
   links also simplifies implementation for practical matters relating
   to dynamically adding and removing links -- for example, starting a
   third SMC-R link prior to taking down one of the two existing links.
   Recall that all links within a link group must have equal access to
   all associated RMBs.

   The SMC-R protocol allows an implementation to assign an
   implementation-specific and appropriate value for maximum symmetric
   links.  The implementation value must not exceed the architecture
   limit of 8; also, the value must not be lower than 2, because the
   SMC-R protocol requires redundancy.  This does not mean that two
   RNICs are physically required to enable SMC-R connectivity, but at
   least two RNICs for redundancy are strongly recommended.

   The SMC-R peers exchange their implementation maximum link values
   during the link group establishment using the defined maximum link
   value in the CONFIRM LINK LLC command.  Once the initial exchange
   completes, the value is set for the life of the link group.  The
   maximum link value can be provided by both the server and client.
   The server must supply a value, whereas the client maximum link value
   is optional.  When the client does not supply a value, it indicates
   that the client accepts the server-supplied maximum value.  If the
   client provides a value, it cannot exceed the server-supplied maximum
   value.  If the client passes a lower value, this lower value then
   becomes the final negotiated maximum number of symmetric links for
   this link group.  Again, the minimum value is 2.
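
   The negotiation rules in the preceding paragraphs reduce to a short
   function.  The function name and the choice to raise ValueError on
   an out-of-range value are assumptions for illustration; the actual
   values are carried in the CONFIRM LINK LLC command.

```python
from typing import Optional

MIN_LINKS = 2       # redundancy requires at least two symmetric links
ARCH_MAX_LINKS = 8  # architecture limit


def negotiate_max_links(server_max: int, client_max: Optional[int] = None) -> int:
    """Return the negotiated maximum number of symmetric links."""
    if not MIN_LINKS <= server_max <= ARCH_MAX_LINKS:
        raise ValueError("server maximum must be between 2 and 8")
    if client_max is None:
        # The client accepts the server-supplied maximum.
        return server_max
    if not MIN_LINKS <= client_max <= server_max:
        raise ValueError("client maximum must be between 2 and the server value")
    # A lower (or equal) client value becomes the final negotiated maximum.
    return client_max
```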

   During run time, the client must never request that the server add a
   symmetric link to a link group that would exceed the negotiated
   maximum link value.  Likewise, the server must never attempt to add a
   symmetric link to a link group that would exceed the negotiated
   maximum value.

   In terms of counting the number of active links within a link group,
   the initial (or the only/last) link is always counted as 1.
   Then, as additional links are added, they are either symmetric or
   asymmetric links.

   With regard to enforcing the maximum link rules, asymmetric links
   are an exception having a unique set of rules:

   o  At most one asymmetric link is allowed per link group.

   o  Asymmetric links must not be counted in the maximum symmetric-link
      count calculation.  When tracking the current count or enforcing
      the negotiated maximum number of links, an asymmetric link is not
      to be counted.
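
   A minimal sketch of the counting rules above, assuming a
   hypothetical LinkGroupCounts helper that tracks counted links
   (the initial link plus any symmetric links) separately from the
   single permitted asymmetric link:

```python
from dataclasses import dataclass


@dataclass
class LinkGroupCounts:
    negotiated_max: int  # negotiated maximum number of symmetric links
    counted: int = 1     # the initial link is always counted as 1
    asymmetric: int = 0  # never counted against negotiated_max

    def can_add(self, kind: str) -> bool:
        if kind == "asymmetric":
            return self.asymmetric == 0  # at most one per link group
        if kind == "symmetric":
            return self.counted < self.negotiated_max
        raise ValueError(f"unknown link kind: {kind!r}")

    def add(self, kind: str) -> None:
        if not self.can_add(kind):
            raise RuntimeError(f"adding a {kind} link would break the link rules")
        if kind == "asymmetric":
            self.asymmetric += 1
        else:
            self.counted += 1
```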

2.2.3.  Forming and Managing Link Groups

   SMC-R link groups are self-defining.  The first SMC-R link in a link
   group is created using TCP option flows on the TCP three-way
   handshake followed by CLC message flows over the TCP connection.
   Subsequent SMC-R links in the link group are created by sending LLC
   messages over an SMC-R link that already exists in the link group.
   Once an SMC-R link group is created, no additional SMC-R links in
   that group are created using TCP and CLC negotiation.  Because
   subsequent SMC-R links are created exclusively by sending LLC
   messages over an existing SMC-R link in a link group, the membership
   of SMC-R links in a link group is self-defining.

   This architecture does not define a specific identifier for an SMC-R
   link group.  This identification may be useful for network management
   and may be assigned in a platform-specific manner, or in an extension
   to this architecture.

   In each SMC-R link group, one peer is the server for all TCP
   connections and the other peer is the client.  If there are
   additional TCP connections between the peers that use SMC-R and have
   the client and server roles reversed, another SMC-R link group is set
   up between them with the opposite client-server relationship.

   This is required because there are specific responsibilities divided
   between the client and server in the management of an SMC-R link
   group.

   In this architecture, the decision of whether to use an existing
   SMC-R link group or create a new SMC-R link group for a TCP
   connection is made exclusively by the server.

   Management of the links in an SMC-R link group is also a server
   responsibility.  The server is responsible for adding and deleting
   links in a link group.  The client may request that the server take
   certain actions, but the final responsibility is the server's.

2.2.4.  SMC-R Link Identifiers

   This architecture defines multiple identifiers to identify SMC-R
   links and peers.

   o  Link number: This is a 1-byte value that identifies an SMC-R link
      within a link group.  Both the server and the client use this
      number to distinguish an SMC-R link from other links within the
      same link group.  It is only unique within a link group.  In order
      to prevent timing windows that may occur when a server creates a
      new link while the client is still cleaning up a previously
      existing link, link numbers cannot be reused until the entire link
      numbering space has been exhausted.

   o  Link user ID: This is an architecturally opaque 4-byte value that
      a peer uses to uniquely define an SMC-R link within its own space.
      This means that a link user ID is unique within one peer only.
      Each peer defines its own link user ID for a link.  The peers
      exchange this information once during link setup, and it is never
      used architecturally again.  The purpose of this identifier is for
      network management, display, and debugging.  For example, an
      operator on a client could provide the operator on the server with
      the server's link user ID when asking the server's operator to
      check on the operation of a link that the client is having trouble
      with.

   o  Peer ID: The SMC-R peer ID uniquely identifies a specific instance
      of a specific TCP/IP stack.  It is required because in clustered
      and load-balancing environments, an IP address does not uniquely
      identify a TCP/IP stack.  An RNIC's MAC/GID also doesn't uniquely
      or reliably identify a TCP/IP stack, because RNICs can go up and
      down and even be redeployed to other TCP/IP stacks in a
      multiple-partitioned or virtualized environment.  The peer ID is
      not only unique per TCP/IP stack but is also unique per instance
      of a TCP/IP stack, meaning that if a TCP/IP stack is restarted,
      its peer ID changes.
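
   The link number reuse rule in the first bullet above can be sketched
   as an allocator that always advances through the numbering space and
   only returns to a released number after wrapping.  The class name is
   hypothetical, and the default range of 1-255 is an ASSUMPTION; this
   section only states that the link number is a 1-byte value.

```python
class LinkNumberAllocator:
    """Hands out link numbers without reusing one before wrapping."""

    def __init__(self, first: int = 1, last: int = 255) -> None:
        # ASSUMPTION: usable range; only "1-byte value" is given above.
        self.first = first
        self.last = last
        self._next = first
        self.in_use = set()

    def allocate(self) -> int:
        span = self.last - self.first + 1
        for _ in range(span):
            candidate = self._next
            # Always move forward, so a just-released number is not
            # handed out again until the whole space has been cycled.
            self._next = self.first + (self._next - self.first + 1) % span
            if candidate not in self.in_use:
                self.in_use.add(candidate)
                return candidate
        raise RuntimeError("all link numbers are in use")

    def release(self, number: int) -> None:
        self.in_use.discard(number)
```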

2.3.  SMC-R Resilience and Load Balancing

   The SMC-R multilink architecture provides resilience for network high
   availability via failover capability to an alternate RoCE adapter.

   The SMC-R multilink architecture does not define primary, secondary,
   or alternate roles for the links.  Instead, there are multiple active
   links representing multiple redundant RoCE paths over the same LAN.

   Assignment of TCP connections to links is unidirectional and
   asymmetric.  This means that the client and server may each choose a
   separate link for their RDMA writes associated with a specific TCP
   connection.

   If a hardware failure occurs or a QP failure associated with an
   individual link occurs, then the TCP connections that were associated
   with the failing link are dynamically and transparently switched to
   use another available link.  The server or the client can detect a
   failure, immediately move their TCP connections, and then notify
   their peer via the DELETE LINK LLC command.  While the client can
   notify the server of an apparent link failure with the DELETE LINK
   LLC command, the server performs the actual link deletion.

   The movement of TCP connections to another link can be accomplished
   with minimal coordination between the peers.  The TCP connection
   movement is also transparent to, and non-disruptive to, the TCP
   socket application workloads for most failure scenarios.  After a
   failure, the surviving links and all associated hardware must handle
   the link group's workload.

   As each SMC-R peer begins to move active TCP connections to another
   link, all current RDMA write operations must be allowed to complete.
   The moving peer then sends a signal to verify receipt of the last
   successful write by its peer.  If this verification fails, the TCP
   connection must be reset.  Once this verification is complete, all
   writes that failed may then be retried, in order, over the new link.
   Any data writes or CDC messages for which the sender did not receive
   write completion must be replayed before any subsequent data or CDC
   write operations are sent.  LLC messages are not retried over the new
   link, because they are dependent on a known link configuration, which
   has just changed because of the failure.  The initiator of an LLC
   message exchange that fails will be responsible for retrying once the
   link group configuration stabilizes.
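
   The replay rules above can be sketched as a function that decides
   what to resend over the new link.  Everything named here is
   hypothetical: the Pending record, its seq ordering field, and the use
   of ConnectionResetError to stand for resetting the TCP connection.
   The sketch only selects and orders operations; it performs no RDMA.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Pending:
    seq: int         # hypothetical per-connection ordering number
    kind: str        # "data", "cdc", or "llc"
    completed: bool  # True if a write completion was received


def replay_plan(pending: List[Pending], last_write_verified: bool) -> List[Pending]:
    """Return the operations to replay, in order, over the new link."""
    if not last_write_verified:
        # Verification of the last successful write failed.
        raise ConnectionResetError("TCP connection must be reset")
    # Replay data writes and CDC messages that never completed.
    # LLC messages are not retried over the new link.
    to_replay = [p for p in pending if not p.completed and p.kind != "llc"]
    return sorted(to_replay, key=lambda p: p.seq)
```

   New data or CDC writes for the connection would be queued behind the
   returned list so that replayed operations are sent first.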

   When a new link becomes available and is re-added to the link group,
   each peer is free to rebalance its current TCP connections as needed
   or only assign new TCP connections to the newly added link.  Both the
   server and client are free to manage TCP connections across the link
   group as needed.  TCP connection movement does not have to be
   stimulated by a link failure.

   The SMC-R architecture also defines orderly versus disorderly
   failover.  The type of failover is communicated in the LLC
   DELETE LINK command and is simply a means to indicate that the link
   has terminated (disorderly) or link termination is imminent
   (orderly).  The orderly link deletion could be initiated via operator
   command or programmatically to bring down an idle link.  For example,

Top      Up      ToC       Page 26 
   an operator command could initiate orderly shutdown of an adapter for
   service.  Implementation of the two types is based on implementation
   requirements and is beyond the scope of the SMC-R architecture.


