RFC 7609

Informational
Pages: 143

IBM's Shared Memory Communications over RDMA (SMC-R) Protocol

Independent Submission                                            M. Fox
Request for Comments: 7609                                   C. Kassimis
Category: Informational                                       J. Stevens
ISSN: 2070-1721                                                      IBM
                                                             August 2015


     IBM's Shared Memory Communications over RDMA (SMC-R) Protocol

Abstract

   This document describes IBM's Shared Memory Communications over RDMA
   (SMC-R) protocol.  This protocol provides Remote Direct Memory Access
   (RDMA) communications to TCP endpoints in a manner that is
   transparent to socket applications.  It further provides for dynamic
   discovery of partner RDMA capabilities and dynamic setup of RDMA
   connections, as well as transparent high availability and load
   balancing when redundant RDMA network paths are available.  It
   maintains many of the traditional TCP/IP qualities of service such as
   filtering that enterprise users demand, as well as TCP socket
   semantics such as urgent data.

Status of This Memo

   This document is not an Internet Standards Track specification; it is
   published for informational purposes.

   This is a contribution to the RFC Series, independently of any other
   RFC stream.  The RFC Editor has chosen to publish this document at
   its discretion and makes no statement about its value for
   implementation or deployment.  Documents approved for publication by
   the RFC Editor are not a candidate for any level of Internet
   Standard; see Section 2 of RFC 5741.

   Information about the current status of this document, any errata,
   and how to provide feedback on it may be obtained at
   http://www.rfc-editor.org/info/rfc7609.

Copyright Notice

   Copyright (c) 2015 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.

Table of Contents

   1. Introduction
      1.1. Protocol Overview
           1.1.1. Hardware Requirements
      1.2. Definition of Common Terms
      1.3. Conventions Used in This Document
   2. Link Architecture
      2.1. Remote Memory Buffers (RMBs)
      2.2. SMC-R Link Groups
           2.2.1. Link Group Types
           2.2.2. Maximum Number of Links in Link Group
           2.2.3. Forming and Managing Link Groups
           2.2.4. SMC-R Link Identifiers
      2.3. SMC-R Resilience and Load Balancing
   3. SMC-R Rendezvous Architecture
      3.1. TCP Options
      3.2. Connection Layer Control (CLC) Messages
      3.3. LLC Messages
      3.4. CDC Messages
      3.5. Rendezvous Flows
           3.5.1. First Contact
                  3.5.1.1. Pre-negotiation of TCP Options
                  3.5.1.2. Client Proposal
                  3.5.1.3. Server Acceptance
                  3.5.1.4. Client Confirmation
                  3.5.1.5. Link (QP) Confirmation
                  3.5.1.6. Second SMC-R Link Setup
                           3.5.1.6.1. Client Processing of ADD LINK
                                      LLC Message from Server
                           3.5.1.6.2. Server Processing of ADD LINK
                                      Reply LLC Message from Client
                           3.5.1.6.3. Exchange of RKeys on
                                      Second SMC-R Link
                           3.5.1.6.4. Aborting SMC-R and
                                      Falling Back to IP
           3.5.2. Subsequent Contact
                  3.5.2.1. SMC-R Proposal
                  3.5.2.2. SMC-R Acceptance
                  3.5.2.3. SMC-R Confirmation
                  3.5.2.4. TCP Data Flow Race with SMC
                           Confirm CLC Message
           3.5.3. First Contact Variation: Creating a
                  Parallel Link Group
           3.5.4. Normal SMC-R Link Termination
           3.5.5. Link Group Management Flows
                  3.5.5.1. Adding and Deleting Links in an
                           SMC-R Link Group
                           3.5.5.1.1. Server-Initiated ADD
                                      LINK Processing
                           3.5.5.1.2. Client-Initiated ADD
                                      LINK Processing
                           3.5.5.1.3. Server-Initiated DELETE
                                      LINK Processing
                           3.5.5.1.4. Client-Initiated DELETE
                                      LINK Request
                  3.5.5.2. Managing Multiple RKeys over
                           Multiple SMC-R Links in a Link Group
                           3.5.5.2.1. Adding a New RMB to an
                                      SMC-R Link Group
                           3.5.5.2.2. Deleting an RMB from an
                                      SMC-R Link Group
                           3.5.5.2.3. Adding a New SMC-R Link to a
                                      Link Group with Multiple RMBs
                  3.5.5.3. Serialization of LLC Exchanges,
                           and Collisions
                           3.5.5.3.1. Collisions with ADD
                                      LINK / CONFIRM LINK Exchange
                           3.5.5.3.2. Collisions during
                                      DELETE LINK Exchange
                           3.5.5.3.3. Collisions during
                                      CONFIRM RKEY Exchange
   4. SMC-R Memory-Sharing Architecture
      4.1. RMB Element Allocation Considerations
      4.2. RMB and RMBE Format
      4.3. RMBE Control Information
      4.4. Use of RMBEs
           4.4.1. Initializing and Accessing RMBEs
           4.4.2. RMB Element Reuse and Conflict Resolution
      4.5. SMC-R Protocol Considerations
           4.5.1. SMC-R Protocol Optimized Window Size Updates
           4.5.2. Small Data Sends
           4.5.3. TCP Keepalive Processing
      4.6. TCP Connection Failover between SMC-R Links
           4.6.1. Validating Data Integrity
           4.6.2. Resuming the TCP Connection on a New SMC-R Link
      4.7. RMB Data Flows
           4.7.1. Scenario 1: Send Flow, Window Size Unconstrained
           4.7.2. Scenario 2: Send/Receive Flow, Window Size
                  Unconstrained
           4.7.3. Scenario 3: Send Flow, Window Size Constrained
           4.7.4. Scenario 4: Large Send, Flow Control, Full
                  Window Size Writes
           4.7.5. Scenario 5: Send Flow, Urgent Data, Window
                  Size Unconstrained
           4.7.6. Scenario 6: Send Flow, Urgent Data, Window
                  Size Closed
      4.8. Connection Termination
           4.8.1. Normal SMC-R Connection Termination Flows
           4.8.2. Abnormal SMC-R Connection Termination Flows
           4.8.3. Other SMC-R Connection Termination Conditions
   5. Security Considerations
      5.1. VLAN Considerations
      5.2. Firewall Considerations
      5.3. Host-Based IP Filters
      5.4. Intrusion Detection Services
      5.5. IP Security (IPsec)
      5.6. TLS/SSL
   6. IANA Considerations
   7. Normative References
   Appendix A. Formats
     A.1. TCP Option
     A.2. CLC Messages
          A.2.1. Peer ID Format
          A.2.2. SMC Proposal CLC Message Format
          A.2.3. SMC Accept CLC Message Format
          A.2.4. SMC Confirm CLC Message Format
          A.2.5. SMC Decline CLC Message Format
     A.3. LLC Messages
          A.3.1. CONFIRM LINK LLC Message Format
          A.3.2. ADD LINK LLC Message Format
          A.3.3. ADD LINK CONTINUATION LLC Message Format
          A.3.4. DELETE LINK LLC Message Format
          A.3.5. CONFIRM RKEY LLC Message Format
          A.3.6. CONFIRM RKEY CONTINUATION LLC Message Format
          A.3.7. DELETE RKEY LLC Message Format
          A.3.8. TEST LINK LLC Message Format
     A.4. Connection Data Control (CDC) Message Format
   Appendix B. Socket API Considerations
     B.1. setsockopt() / getsockopt() Considerations
   Appendix C. Rendezvous Error Scenarios
     C.1. SMC Decline during CLC Negotiation
     C.2. SMC Decline during LLC Negotiation
     C.3. The SMC Decline Window
     C.4. Out-of-Sync Conditions during SMC-R Negotiation
     C.5. Timeouts during CLC Negotiation
     C.6. Protocol Errors during CLC Negotiation
     C.7. Timeouts during LLC Negotiation
          C.7.1. Recovery Actions for LLC Timeouts and Failures
     C.8. Failure to Add Second SMC-R Link to a Link Group
   Authors' Addresses

1.  Introduction

   This document specifies IBM's Shared Memory Communications over RDMA
   (SMC-R) protocol.  SMC-R is a protocol for Remote Direct Memory
   Access (RDMA) communication between TCP socket endpoints.  SMC-R runs
   over networks that support RDMA over Converged Ethernet (RoCE).  It
   is designed to permit existing TCP applications to benefit from RDMA
   without requiring modifications to the applications or predefinition
   of RDMA partners.

   SMC-R provides dynamic discovery of the RDMA capabilities of TCP
   peers and automatic setup of RDMA connections that those peers can
   use.  SMC-R also provides transparent high availability and
   load-balancing capabilities that are demanded by enterprise
   installations but are missing from current RDMA protocols.  If
   redundant RoCE-capable hardware such as RDMA-capable Network
   Interface Cards (RNICs) and RoCE-capable switches is present, SMC-R
   can load-balance over that redundant hardware and can also
   non-disruptively move TCP traffic from failed paths to surviving
   paths, all seamlessly to the application and the sockets layer.
   Because SMC-R preserves socket semantics and the TCP three-way
   handshake, many TCP qualities of service such as filtering, load
   balancing, and Secure Sockets Layer (SSL) encryption are preserved,
   as are TCP features such as urgent data.

   Because of the dynamic discovery and setup of SMC-R connectivity
   between peers, no RDMA connection manager (RDMA-CM) is required.
   This also means that support for Unreliable Datagram (UD) Queue Pairs
   (QPs) is not required.

   It is recommended that the SMC-R services be implemented in kernel
   space, which enables optimizations such as resource-sharing between
   connections across multiple processes and also permits applications
   using SMC-R to spawn multiple processes (e.g., fork) without losing
   SMC-R functionality.  A user-space implementation is compatible with
   this architecture, but it may not support spawned processes (e.g.,
   fork), which limits sharing and resource optimization to TCP
   connections that originate from the same process.  This might be an
   appropriate design choice if the use case is a system that hosts a
   large single process application that creates many TCP connections to
   a peer host, or in implementations where a kernel-space
   implementation is not possible or introduces excessive overhead for
   "kernel space to user space" context switches.

1.1.  Protocol Overview

   SMC-R defines the concept of the SMC-R link, which is a logical
   point-to-point link using reliably connected queue pairs between
   TCP/IP stack peers over a RoCE fabric.  An SMC-R link is bound to a
   specific hardware path, meaning a specific RNIC on each peer.  SMC-R
   links are created and maintained by an SMC-R layer, which may reside
   in kernel space or user space, depending upon operating system and
   implementation requirements.  The SMC-R layer resides below the
   sockets layer and directs data traffic for TCP connections between
   connected peers over the RoCE fabric using RDMA rather than over a
   TCP connection.  The TCP/IP stack, with its requirements for
   fragmentation, packetization, etc., is bypassed, and the application
   data is moved between peers using RDMA.

   Multiple SMC-R links between the same two TCP/IP stack peers are also
   supported.  A set of SMC-R links called a link group can be logically
   bonded together to provide redundant connectivity.  If there is
   redundant hardware -- for example, two RNICs on each peer -- separate
   SMC-R links are created between the peers to exploit that redundant
   hardware.  The link group architecture with redundant links provides
   load balancing and increased bandwidth, as well as seamless failover.
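
   The bonding, load-balancing, and failover behavior described above
   can be sketched as follows.  This is an illustrative Python model
   and not part of the protocol: the class names, the least-loaded
   placement policy, and the RNIC labels are assumptions made for the
   example.

```python
from dataclasses import dataclass, field


@dataclass
class Link:
    """One SMC-R link, bound to a specific RNIC on each peer."""
    link_id: int
    local_rnic: str
    remote_rnic: str
    active: bool = True
    connections: set = field(default_factory=set)


class LinkGroup:
    """Hypothetical model of a link group between two peers."""

    def __init__(self, links):
        self.links = list(links)

    def _least_loaded(self):
        candidates = [link for link in self.links if link.active]
        if not candidates:
            raise RuntimeError("no active SMC-R link in link group")
        # Assumed policy: the active link with the fewest connections wins;
        # ties go to the lowest link id.
        return min(candidates, key=lambda k: (len(k.connections), k.link_id))

    def assign(self, conn_id):
        """Place a TCP connection on a link and return that link's id."""
        link = self._least_loaded()
        link.connections.add(conn_id)
        return link.link_id

    def fail_link(self, link_id):
        """Mark a link failed and move its connections to surviving links."""
        failed = next(k for k in self.links if k.link_id == link_id)
        failed.active = False
        moved = {c: self.assign(c) for c in sorted(failed.connections)}
        failed.connections.clear()
        return moved


group = LinkGroup([Link(1, "rnic-a1", "rnic-b1"),
                   Link(2, "rnic-a2", "rnic-b2")])
placement = {c: group.assign(c) for c in ("conn-1", "conn-2", "conn-3")}
moved = group.fail_link(1)  # connections on link 1 move to link 2
```

   In this sketch, a failure with no surviving link raises an error
   instead of moving the connections.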

   Each SMC-R link group is associated with one or more Remote Memory
   Buffers (RMBs), which are areas of memory that are available for
   SMC-R peers to write into using RDMA writes.  Multiple
   TCP connections between peers may be multiplexed over a single SMC-R
   link, in which case the SMC-R layer manages the partitioning of the
   RMBs between the TCP connections.  This multiplexing reduces the RDMA
   resources, such as QPs and RMBs, that are required to support
   multiple connections between peers, and it also reduces the
   processing and delays related to setting up QPs, pinning memory, and
   other RDMA setup tasks when new TCP connections are created.  In a
   kernel-space SMC-R implementation in which the RMBs reside in kernel
   storage, this sharing and optimization works across multiple
   processes executing on the same host.  In a user-space SMC-R
   implementation in which the RMBs reside in user space, this sharing
   and optimization is limited to multiple TCP connections created by a
   single process, as separate RMBs and QPs will be required for each
   process.
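
   As a rough illustration of how an SMC-R layer might partition an
   RMB between multiplexed TCP connections, the sketch below hands out
   one element per connection and reuses elements as connections end.
   The equal-sized elements, the first-free allocation order, and all
   names are assumptions for the example and are not taken from this
   specification.

```python
class Rmb:
    """Hypothetical RMB split into equal-sized elements."""

    def __init__(self, size, element_size):
        if size % element_size:
            raise ValueError("RMB size must be a multiple of element size")
        self.element_size = element_size
        self.free = list(range(size // element_size))  # unused element indexes
        self.owner = {}  # connection id -> element index

    def allocate(self, conn_id):
        """Assign one element to a TCP connection; return its byte offset."""
        if conn_id in self.owner:
            raise ValueError(f"{conn_id} already owns an element")
        if not self.free:
            raise RuntimeError("RMB is full")
        index = self.free.pop(0)
        self.owner[conn_id] = index
        return index * self.element_size

    def release(self, conn_id):
        """Return a connection's element to the free list for reuse."""
        self.free.append(self.owner.pop(conn_id))


rmb = Rmb(size=256 * 1024, element_size=64 * 1024)
offsets = [rmb.allocate(c) for c in ("conn-1", "conn-2")]
rmb.release("conn-1")
reused = rmb.allocate("conn-3")
```

   Because the buffer and its registration already exist, adding a
   connection in this model is a bookkeeping step and needs no new QP
   setup or memory pinning.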

   SMC-R also introduces a rendezvous protocol that is used to
   dynamically discover the RDMA capabilities of TCP connection partners
   and exchange credentials necessary to exploit that capability if
   present.  TCP connections are set up using the normal TCP three-way
   handshake [RFC793], with the addition of a new TCP option that
   indicates SMC-R capability.  If both partners indicate SMC-R
   capability, then at the completion of the three-way TCP handshake the
   SMC-R layers in each peer take control of the TCP connection and use
   it to exchange additional Connection Layer Control (CLC) messages to
   negotiate SMC-R credentials such as QP information; addressability
   over the RoCE fabric; RMB buffer sizes; and keys and addresses for
   accessing RMBs over RDMA.  If at any time during this negotiation a
   failure or decline occurs, the TCP connection falls back to using the
   IP fabric.
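
   The decision described above can be summarized in a short sketch.
   The function and its parameters are hypothetical; in particular,
   "clc_exchange" stands in for the whole CLC negotiation and is
   assumed to return True on success and False on a decline.

```python
from enum import Enum


class Transport(Enum):
    SMC_R = "connection data moves over RDMA"
    IP = "connection data stays on the TCP connection"


def choose_transport(client_opt, server_opt, clc_exchange):
    """Pick the data path once the TCP three-way handshake completes.

    client_opt, server_opt: whether each side sent the SMC-R TCP option.
    clc_exchange: callable that runs the CLC negotiation (assumed helper).
    """
    if not (client_opt and server_opt):
        return Transport.IP  # at least one partner is not SMC-R capable
    try:
        negotiated = clc_exchange()
    except OSError:  # includes TimeoutError
        negotiated = False  # a failure during negotiation
    return Transport.SMC_R if negotiated else Transport.IP


def timed_out():
    raise TimeoutError("no CLC reply from peer")


both_capable = choose_transport(True, True, lambda: True)
one_sided = choose_transport(True, False, lambda: True)
declined = choose_transport(True, True, lambda: False)
failed = choose_transport(True, True, timed_out)
```

   Every path other than a fully successful negotiation ends on the IP
   fabric, so in this sketch a failure is not reported to the caller
   as an error.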

   If the SMC-R negotiation succeeds and either a new SMC-R link is set
   up or an existing SMC-R link is chosen for the TCP connection, then
   the SMC-R layers open the sockets to the applications and the
   applications use the sockets as normal.  The SMC-R layer intercepts
   the socket reads and writes and moves the TCP connection data over
   the SMC-R link, "out of band" to the TCP connection, which remains
   open and idle over the IP fabric, except for termination flows and
   possible keepalive flows.  Regular TCP sequence numbering methods are
   used for the TCP flows that do occur; data flowing over RDMA does not
   use or affect TCP sequence numbers.

   This architecture does not support fallback of active SMC-R
   connections to IP.  Once connection data has completed the switch to
   RDMA, a TCP connection cannot be switched back to IP and will reset
   if RDMA becomes unusable.
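
   The one-way nature of the switch to RDMA can be expressed as a
   small transition table.  The state and event names below are
   invented for illustration; only the transitions themselves follow
   the text above.

```python
from enum import Enum, auto


class State(Enum):
    NEGOTIATING = auto()
    IP = auto()
    RDMA = auto()
    RESET = auto()


class Event(Enum):
    NEGOTIATION_OK = auto()
    NEGOTIATION_FAILED = auto()
    RDMA_UNUSABLE = auto()


# There is deliberately no entry that leads from RDMA back to IP.
TRANSITIONS = {
    (State.NEGOTIATING, Event.NEGOTIATION_OK): State.RDMA,
    (State.NEGOTIATING, Event.NEGOTIATION_FAILED): State.IP,
    (State.RDMA, Event.RDMA_UNUSABLE): State.RESET,
}


def step(state, event):
    """Return the next state; events with no entry leave the state as is."""
    return TRANSITIONS.get((state, event), state)


after_failure = step(step(State.NEGOTIATING, Event.NEGOTIATION_OK),
                     Event.RDMA_UNUSABLE)
```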

   The SMC-R protocol defines the format of the RMBs that are used to
   receive TCP connection data written over RDMA, as well as the
   semantics for managing and writing to these buffers using Connection
   Data Control (CDC) messages.

   Finally, SMC-R defines Link Layer Control (LLC) messages that are
   exchanged over the RoCE fabric between peer SMC-R layers to manage
   the SMC-R links and link groups.  These include messages to test and
   confirm connectivity over an SMC-R link, add and delete SMC-R links
   to or from the link group, and exchange RMB addressability
   information.

1.1.1.  Hardware Requirements

   SMC-R does not require full Converged Enhanced Ethernet switch
   functionality.  SMC-R functions over standard Ethernet fabrics,
   provided that the endpoints have RNICs and that IEEE 802.3x Global
   Pause Frame is supported and enabled in the switch fabric.

   While SMC-R as specified in this document is designed to operate over
   RoCE fabrics, adjustments to the rendezvous methods could enable it
   to run over other RDMA fabrics, such as InfiniBand [RoCE] and iWARP.

1.2.  Definition of Common Terms

   This section provides definitions of terms that have a specific
   meaning to the SMC-R protocol and are used throughout this document.

   SMC-R Link

      An SMC-R link is a logical point-to-point connection over the RoCE
      fabric via specific physical adapters (Media Access Control /
      Global Identifier (MAC/GID)).  The link is formed during the
      "first contact" sequence of the TCP/IP three-way handshake
      sequence that occurs over the IP fabric.  During this handshake,
      an RDMA reliably connected queue pair (RC-QP) connection is formed
      between the two peer SMC hosts and is defined as the SMC-R link.
      The SMC-R link can then support multiple TCP connections between
      the two peers.  An SMC-R link is associated with a single LAN (or
      VLAN) segment and is not routable.

   SMC-R Link Group

      An SMC-R link group is a group of SMC-R links between the same two
      SMC-R peers, typically with each link over unique RoCE adapters.
      Each link in the link group has equal characteristics, such as the
      same VLAN ID (if VLANs are in use), access to the same RMB(s), and
      access to the same TCP server/client.

   SMC-R Peer

      The SMC-R peer is the peer software stack within the peer
      operating system with respect to the Shared Memory Communications
      (messaging) protocol.

   SMC-R Rendezvous

      SMC-R Rendezvous is the SMC-R peer discovery and handshake
      sequence that occurs transparently over the IP (Ethernet) fabric
      during and immediately after the TCP connection three-way
      handshake.  The peers exchange SMC-R capabilities and credentials
      using an experimental TCP option and CLC messages.

   RoCE SendMsg

      RoCE SendMsg is a send operation posted to a reliably connected
      queue pair with inline data, for the purpose of transferring
      control information between peers.

   TCP Client

      The TCP client is the TCP socket-based peer that initiates a TCP
      connection.

   TCP Server

      The TCP server is the TCP socket-based peer that accepts a TCP
      connection.

   CLC Messages

      The SMC-R protocol defines a set of Connection Layer Control
      messages that flow over the TCP connection and are used to manage
      SMC-R link rendezvous at TCP connection setup time.  This
      mechanism is analogous to SSL setup messages.

   LLC Commands

      The SMC-R protocol defines a set of RoCE Link Layer Control
      commands that flow over the RoCE fabric using RoCE SendMsg and are
      used to manage SMC-R links, SMC-R link groups, and SMC-R link
      group RMB expansion and contraction.

   CDC Message

      The SMC-R protocol defines a Connection Data Control message that
      flows over the RoCE fabric using RoCE SendMsg and is used to
      manage the SMC-R connection data.  This message provides
      information about data being transferred over the out-of-band RDMA
      connection, such as data cursors, sequence numbers, and data flags
      (for example, urgent data).  The receipt of this message also
      provides an interrupt to inform the receiver that it has received
      RDMA data.

   RMB

      A Remote (RDMA) Memory Buffer is a fixed or pinned buffer
      allocated in each of the peer hosts for a TCP (via SMC-R)
      connection.  The RMB is registered to the RNIC and allows remote
      access by the remote peer using RDMA semantics.  Each host is
      passed the peer's RMB-specific access information (RMB Key (RKey)
      and RMB element offset) during the SMC-R Rendezvous process.  The
      host stores socket application user data directly into the peer's
      RMB using RDMA over RoCE.

   RToken

      The RToken is the combination of an RMB's RKey and RDMA virtual
      address.  An RToken provides RMB addressability information to an
      RDMA peer.
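
      As a small worked example of how an RToken and an RMB element
      offset might combine into the target of an RDMA write, consider
      the sketch below.  The class, the method, and the sample values
      are hypothetical, and adding the offset to the virtual address
      is an assumption about how the two pieces are used together.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RToken:
    rkey: int             # RMB key (RKey)
    virtual_address: int  # RDMA virtual address of the RMB

    def target(self, rmbe_offset, position=0):
        """Return (rkey, remote address) for a write into one element."""
        return self.rkey, self.virtual_address + rmbe_offset + position


token = RToken(rkey=0x1234, virtual_address=0x7F0000000000)
rkey, address = token.target(rmbe_offset=0x10000, position=0x80)
```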

   RMBE

      The Remote Memory Buffer Element (RMBE) is an area of an RMB that
      is allocated to a specific TCP connection.  The RMBE contains data
      for the TCP connection.  The RMBE represents the TCP receive
      buffer, whereby the remote peer writes into the RMBE and the local
      peer reads from the local RMBE.  The alert token resolves to a
      specific RMBE.
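
      The receive-buffer role of an RMBE can be modeled as follows.
      This sketch assumes a wrapping buffer tracked by two running
      byte counters; the names and the wrapping behavior are
      assumptions for illustration, and the remote RDMA write is
      simulated by an ordinary method call.

```python
class Rmbe:
    """Toy receive buffer: the remote peer writes, the local peer reads."""

    def __init__(self, size):
        self.buf = bytearray(size)
        self.produced = 0  # total bytes written by the remote peer
        self.consumed = 0  # total bytes read by the local peer

    def free_space(self):
        return len(self.buf) - (self.produced - self.consumed)

    def remote_write(self, data):
        """Stand-in for an RDMA write by the remote peer."""
        if len(data) > self.free_space():
            raise BufferError("write would overrun unread data")
        for byte in data:
            self.buf[self.produced % len(self.buf)] = byte
            self.produced += 1

    def local_read(self, count):
        """Read up to count unread bytes, following the wrap point."""
        count = min(count, self.produced - self.consumed)
        size = len(self.buf)
        out = bytes(self.buf[(self.consumed + i) % size] for i in range(count))
        self.consumed += count
        return out


rmbe = Rmbe(8)
rmbe.remote_write(b"abcdef")
first = rmbe.local_read(4)
rmbe.remote_write(b"ghijkl")  # wraps past the end of the buffer
second = rmbe.local_read(8)
```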

   Alert Token

      The SMC-R alert token is a 4-byte value that uniquely identifies
      the TCP connection over an SMC-R connection.  The alert token
      allows the SMC peer to quickly identify the target TCP connection
      that now has new work.  The format of the token is defined by the
      owning SMC-R endpoint and is considered opaque to the remote peer.
      However, the token should not simply be an index to an RMBE; it
      should reference a TCP connection and be able to be validated to
      avoid reading data from stale connections.
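
      One way to make a token that can be validated, rather than a
      bare index, is to pair a slot number with a generation counter
      that changes each time the slot is reused.  The layout below
      (16-bit slot, 16-bit generation) is purely an assumption for
      illustration; the protocol leaves the format to the owning
      endpoint.

```python
import struct


class AlertTokens:
    """Hypothetical token table: 16-bit slot plus 16-bit generation."""

    def __init__(self):
        self.slots = {}  # slot -> (generation, connection id)

    def issue(self, slot, conn_id):
        """Bind a slot to a connection and return a 4-byte token."""
        generation = (self.slots.get(slot, (0, None))[0] + 1) & 0xFFFF
        self.slots[slot] = (generation, conn_id)
        return struct.pack("!HH", slot, generation)

    def resolve(self, token):
        """Return the connection for a token, or None if it is stale."""
        slot, generation = struct.unpack("!HH", token)
        current = self.slots.get(slot)
        if current is None or current[0] != generation:
            return None
        return current[1]


tokens = AlertTokens()
old = tokens.issue(7, "conn-a")
new = tokens.issue(7, "conn-b")  # the slot is reused by a later connection
```

      Here the token issued for the earlier connection no longer
      resolves once its slot has been reassigned, which is the stale
      case the definition warns about.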

   RNIC

      The RDMA-capable Network Interface Card (RNIC) is an Ethernet NIC
      that supports RDMA semantics and verbs using RoCE.

   First Contact

      "First contact" describes an SMC-R negotiation to set up the first
      link in a link group.

   Subsequent Contact

      "Subsequent contact" describes an SMC-R negotiation between peers
      who are using an already-existing SMC-R link group.

1.3.  Conventions Used in This Document

   In the rendezvous flow diagrams, dashed lines (----) are used to
   indicate flows over the TCP/IP fabric and dotted lines (....) are
   used to indicate flows over the RoCE fabric.

   In the data transfer ladder diagrams, dashed lines (----) are used to
   indicate RDMA write operations and dotted lines (....) are used to
   indicate CDC messages, which are RDMA messages with inline data that
   contain control information for the connection.


