
RFC 5044


Marker PDU Aligned Framing for TCP Specification

Part 3 of 3, p. 45 to 74


Appendix A.  Optimized MPA-Aware TCP Implementations

   This appendix is for information only and is NOT part of the
   standard.

   This appendix offers implementation guidance to implementers of
   Optimized MPA-aware TCP.  It is intended for those implementations
   that want to send/receive as much traffic as possible in an aligned
   and zero-copy fashion.

                   | +-----------+ +-----------------+ |
                   | | Optimized | | Other Protocols | |
                   | |  MPA/TCP  | +-----------------+ |
                   | +-----------+        ||           |
                   |         \\     --- socket API --- |
                   |          \\          ||           |
                   |           \\      +-----+         |
                   |            \\     | TCP |         |
                   |             \\    +-----+         |
                   |              \\    //             |
                   |             +-------+             |
                   |             |  IP   |             |
                   |             +-------+             |

                Figure 11: Optimized MPA/TCP Implementation

   Figure 11 above shows a block diagram of a potential
   implementation.  The network sub-system in the diagram can support
   traditional sockets-based connections using the normal API as shown
   on the right side of the diagram.  Connections for DDP/MPA/TCP are
   run using the facilities shown on the left side of the diagram.

   The DDP/MPA/TCP connections can be started using the facilities shown
   on the left side using some suitable API, or they can be initiated
   using the facilities shown on the right side and transitioned to the
   left side at the point in the connection setup where MPA goes to
   "Full MPA/DDP Operation Phase" as described in Section 7.1.2.

   The optimized MPA/TCP implementations (left side of diagram and
   described below) are only applicable to MPA.  All other TCP
   applications continue to use the standard TCP stacks and interfaces
   shown in the right side of the diagram.

A.1.  Optimized MPA/TCP Transmitters

   The various TCP RFCs allow considerable choice in segmenting a TCP
   stream.  In order to optimize FPDU recovery at the MPA receiver, an
   optimized MPA/TCP implementation uses additional segmentation rules.

   To provide optimum performance, an optimized MPA/TCP transmit side
   implementation should be enabled to:

   *   With an EMSS large enough to contain the FPDU(s), segment the
       outgoing TCP stream such that the first octet of every TCP
       segment begins with an FPDU.  Multiple FPDUs may be packed into a
       single TCP segment as long as they are entirely contained in the
       TCP segment.

   *   Report the current EMSS from the TCP to the MPA transmit layer.
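   As an informal illustration (this specification contains no code), the
   segmentation rules above can be sketched in Python.  The function and
   parameter names below are hypothetical, and the sketch assumes every
   FPDU already fits within the EMSS:

```python
def pack_fpdus(fpdu_lengths, emss):
    """Pack FPDUs into TCP segments so that every segment begins with
    an FPDU and no FPDU is split across segments (hypothetical sketch)."""
    segments = []            # each entry: FPDU lengths carried by one segment
    current, used = [], 0
    for length in fpdu_lengths:
        if length > emss:
            raise ValueError("FPDU larger than EMSS; see the exceptions below")
        if used + length > emss:      # FPDU would be split: start a new segment
            segments.append(current)
            current, used = [], 0
        current.append(length)
        used += length
    if current:
        segments.append(current)
    return segments
```

   For example, with an EMSS of 1460, FPDUs of 1000, 400, 200, and 1400
   octets yield three segments: the first two FPDUs are packed together,
   and the remaining two each begin a segment of their own.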

   There are exceptions to the above rule.  Once an ULPDU is provided to
   MPA, the MPA/TCP sender transmits it or fails the connection; it
   cannot be repudiated.  As a result, during changes in MTU and EMSS,
   or when TCP's Receive Window size (RWIN) becomes too small, it may be
   necessary to send FPDUs that do not conform to the segmentation rule
   above.

   A possible, but less desirable, alternative is to use IP
   fragmentation on accepted FPDUs to deal with MTU reductions or
   extremely small EMSS.

   Even when alignment with TCP segments is lost, the sender still
   formats the FPDU according to FPDU format as shown in Figure 2.

   On a retransmission, TCP does not necessarily preserve original TCP
   segmentation boundaries.  This can lead to the loss of FPDU Alignment
   and containment within a TCP segment during TCP retransmissions.  An
   optimized MPA/TCP sender should try to preserve original TCP
   segmentation boundaries on a retransmission.

A.2.  Effects of Optimized MPA/TCP Segmentation

   Optimized MPA/TCP senders will fill TCP segments to the EMSS with a
   single FPDU when a DDP message is large enough.  Since the DDP
   message may not exactly fit into TCP segments, a "message tail" often
   occurs that results in an FPDU that is smaller than a single TCP
   segment.  Additionally, some DDP messages may be considerably shorter
   than the EMSS.  If a small FPDU is sent in a single TCP segment, the
   result is a "short" TCP segment.

   Applications expected to see strong advantages from Direct Data
   Placement include transaction-based applications and throughput
   applications.  Request/response protocols typically send one FPDU per
   TCP segment and then wait for a response.  Under these conditions,
   these "short" TCP segments are an appropriate and expected effect of
   the segmentation.

   Another possibility is that the application might be sending multiple
   messages (FPDUs) to the same endpoint before waiting for a response.
   In this case, the segmentation policy would tend to reduce the
   available connection bandwidth by under-filling the TCP segments.

   Standard TCP implementations often utilize the Nagle [RFC896]
   algorithm to ensure that segments are filled to the EMSS whenever the
   round-trip latency is large enough that the source stream can fully
   fill segments before ACKs arrive.  The algorithm does this by
   delaying the transmission of TCP segments until a ULP can fill a
   segment, or until an ACK arrives from the far side.  The algorithm
   thus allows for smaller segments when latencies are shorter to keep
   the ULP's end-to-end latency to reasonable levels.

   The Nagle algorithm is not mandatory to use [RFC1122].

   When used with optimized MPA/TCP stacks, Nagle and similar algorithms
   can result in the "packing" of multiple FPDUs into TCP segments.

   If a "message tail", small DDP messages, or the start of a larger DDP
   message are available, MPA may pack multiple FPDUs into TCP segments.
   When this is done, the TCP segments can be more fully utilized, but,
   due to the size constraints of FPDUs, segments may not be filled to
   the EMSS.  A dynamic MULPDU that informs DDP of the size of the
   remaining TCP segment space makes filling the TCP segment more
   effective.

       Note that MPA receivers do more processing of a TCP segment that
       contains multiple FPDUs; this may affect the performance of some
       receiver implementations.
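   A dynamic MULPDU report of this kind might look like the following
   Python sketch (hypothetical names; it accounts only for the
   ULPDU_Length and CRC fields, ignoring Markers and pad octets):

```python
MPA_ULPDU_LEN = 2   # octets of ULPDU_Length field per FPDU
MPA_CRC = 4         # octets of CRC per FPDU

def advertised_mulpdu(emss, used_octets):
    """Largest ULPDU that DDP could hand to MPA such that the resulting
    FPDU still fits in the space left in the current TCP segment.
    Simplified sketch: ignores Markers and pad."""
    space = emss - used_octets
    avail = space - MPA_ULPDU_LEN - MPA_CRC
    return max(avail, 0)
```

   With an EMSS of 1460 and 1400 octets already packed, DDP would be
   told that only a 54-octet ULPDU can still be added to the segment.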

   It is up to the ULP to decide if Nagle is useful with DDP/MPA.  Note
   that many of the applications expected to take advantage of MPA/DDP
   prefer to avoid the extra delays caused by Nagle.  In such scenarios,
   it is anticipated there will be minimal opportunity for packing at
   the transmitter and receivers may choose to optimize their
   performance for this anticipated behavior.

   Therefore, the application is expected to set TCP parameters such
   that it can trade off latency and wire efficiency.  Implementations
   should provide a connection option that disables Nagle for MPA/TCP
   similar to the way the TCP_NODELAY socket option is provided for a
   traditional sockets interface.

   When latency is not critical, the application is expected to leave Nagle
   enabled.  In this case, the TCP implementation may pack any available
   FPDUs into TCP segments so that the segments are filled to the EMSS.
   If the amount of data available is not enough to fill the TCP segment
   when it is prepared for transmission, TCP can send the segment partly
   filled, or use the Nagle algorithm to wait for the ULP to post more
   data.

A.3.  Optimized MPA/TCP Receivers

   When an MPA receive implementation and the MPA-aware receive side TCP
   implementation support handling out-of-order ULPDUs, the TCP receive
   implementation performs the following functions:

   1)  The implementation passes incoming TCP segments to MPA as soon as
       they have been received and validated, even if not received in
       order.  The TCP layer commits to keeping each segment before it
       can be passed to the MPA.  This means that the segment must have
       passed the TCP, IP, and lower layer data integrity validation
       (i.e., checksum), must be in the receive window, must be part of
       the same epoch (if timestamps are used to verify this), and must
       have passed any other checks required by TCP RFCs.

       This is not to imply that the data must be completely ordered
       before use.  An implementation can accept out-of-order segments,
       SACK them [RFC2018], and pass them to MPA immediately, before the
       reception of the segments needed to fill in the gaps.  MPA
       expects to utilize these segments when they are complete FPDUs or
       can be combined into complete FPDUs to allow the passing of
       ULPDUs to DDP when they arrive, independent of ordering.  DDP
       uses the passed ULPDU to "place" the DDP segments (see [DDP] for
       more details).

       Since MPA performs a CRC calculation and other checks on received
       FPDUs, the MPA/TCP implementation ensures that any TCP segments
       that duplicate data already received and processed (as can happen
       during TCP retries) do not overwrite already received and
       processed FPDUs.  This avoids the possibility that duplicate data
       may corrupt already validated FPDUs.
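   One way to keep duplicate data from overwriting validated FPDUs is to
   track which TCP sequence ranges have already been processed, as in
   this hypothetical sketch (a real stack must also handle 32-bit
   sequence-number wrap, which is omitted here):

```python
class ProcessedRanges:
    """Track TCP sequence ranges already validated and passed to MPA,
    so retransmitted duplicates cannot overwrite them (sketch)."""
    def __init__(self):
        self.ranges = []                 # disjoint (start, end) pairs

    def filter_new(self, seq, length):
        """Return the sub-ranges of [seq, seq+length) not yet processed."""
        new = [(seq, seq + length)]
        for s, e in self.ranges:
            next_new = []
            for ns, ne in new:
                if ne <= s or ns >= e:   # no overlap with a processed range
                    next_new.append((ns, ne))
                else:                    # trim away the overlapping part
                    if ns < s:
                        next_new.append((ns, s))
                    if ne > e:
                        next_new.append((e, ne))
            new = next_new
        return new

    def mark(self, seq, length):
        self.ranges.append((seq, seq + length))
        self.ranges.sort()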

   2)  The implementation provides a mechanism to indicate the ordering
       of TCP segments as the sender transmitted them.  One possible
       mechanism might be attaching the TCP sequence number to each
       segment.

   3)  The implementation also provides a mechanism to indicate when a
       given TCP segment (and the prior TCP stream) is complete.  One
       possible mechanism might be to utilize the leading (left) edge of
       the TCP Receive Window.

       MPA uses the ordering and completion indications to inform DDP
       when a ULPDU is complete; MPA Delivers the FPDU to DDP.  DDP uses
       the indications to "deliver" its messages to the DDP consumer
       (see [DDP] for more details).

       DDP on MPA utilizes the above two mechanisms to establish the
       Delivery semantics that DDP's consumers agree to.  These
       semantics are described fully in [DDP].  These include
       requirements on DDP's consumer to respect ownership of buffers
       prior to the time that DDP delivers them to the Consumer.

   The use of SACK [RFC2018] significantly improves network utilization
   and performance and is therefore recommended.  When combined with the
   out-of-order passing of segments to MPA and DDP, significant
   buffering and copying of received data can be avoided.

A.4.  Re-Segmenting Middleboxes and Non-Optimized MPA/TCP Senders

   Since MPA senders often start FPDUs on TCP segment boundaries, a
   receiving optimized MPA/TCP implementation may be able to optimize
   the reception of data in various ways.

   However, MPA receivers MUST NOT depend on FPDU Alignment on TCP
   segment boundaries.

   Some MPA senders may be unable to conform to the sender requirements
   because their implementation of TCP is not designed with MPA in mind.
   Even for optimized MPA/TCP senders, the network may contain
   "middleboxes" which modify the TCP stream by changing the
   segmentation.  This is generally interoperable with TCP and its users
   and MPA must be no exception.

   The presence of Markers in MPA (when enabled) allows an optimized
   MPA/TCP receiver to recover the FPDUs despite these obstacles,
   although it may be necessary to utilize additional buffering at the
   receiver to do so.

   Some of the cases that a receiver may have to contend with are listed
   below as a reminder to the implementer:

   *   A single aligned and complete FPDU, either in order or out of
       order:  This can be passed to DDP as soon as validated, and
       Delivered when ordering is established.

   *   Multiple FPDUs in a TCP segment, aligned and fully contained,
       either in order or out of order:  These can be passed to DDP as
       soon as validated, and Delivered when ordering is established.

   *   Incomplete FPDU: The receiver should buffer until the remainder
       of the FPDU arrives.  If the remainder of the FPDU is already
       available, this can be passed to DDP as soon as validated, and
       Delivered when ordering is established.

   *   Unaligned FPDU start: The partial FPDU must be combined with its
       preceding portion(s).  If the preceding parts are already
       available, and the whole FPDU is present, this can be passed to
       DDP as soon as validated, and Delivered when ordering is
       established.  If the whole FPDU is not available, the receiver
       should buffer until the remainder of the FPDU arrives.

   *   Combinations of unaligned or incomplete FPDUs (and potentially
       other complete FPDUs) in the same TCP segment:  If any FPDU is
       present in its entirety, or can be completed with portions
       already available, it can be passed to DDP as soon as validated,
       and Delivered when ordering is established.
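   The cases above amount to walking a segment's payload against the
   receiver's notion of any FPDU still in progress.  The following
   hypothetical Python sketch labels each piece of a segment's payload;
   the names and the simplified bookkeeping are illustrative only:

```python
def split_payload(payload_len, carry_needed, next_fpdu_lens):
    """Label what the receiver can do with each piece of one TCP
    segment's payload (sketch).  carry_needed is the octet count still
    owed to an FPDU begun in an earlier segment; next_fpdu_lens lists
    the lengths of FPDUs that start in this segment."""
    actions, remaining = [], payload_len
    if carry_needed:                       # unaligned start
        take = min(carry_needed, remaining)
        actions.append(("complete-prior" if take == carry_needed
                        else "still-partial", take))
        remaining -= take
    for length in next_fpdu_lens:
        if remaining == 0:
            break
        take = min(length, remaining)
        actions.append(("complete" if take == length
                        else "partial", take))      # incomplete FPDU tail
        remaining -= take
    return actions
```

   A "complete" or "complete-prior" piece can go to DDP once validated;
   "partial" and "still-partial" pieces must be buffered.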

A.5.  Receiver Implementation

   Transport & Network Layer Reassembly Buffers:

   The use of reassembly buffers (either TCP reassembly buffers or IP
   fragmentation reassembly buffers) is implementation dependent.  When
   MPA is enabled, reassembly buffers are needed if out-of-order packets
   arrive and Markers are not enabled.  Buffers are also needed if FPDU
   alignment is lost or if IP fragmentation occurs.  This is because the
   incoming out-of-order segment may not contain enough information for
   MPA to process all of the FPDU.  For cases where a re-segmenting
   middlebox is present, or where the TCP sender is not optimized, the
    presence of Markers significantly reduces the amount of buffering
    needed.

   Recovery from IP fragmentation is transparent to the MPA Consumers.

A.5.1.  Network Layer Reassembly Buffers

   The MPA/TCP implementation should set the IP Don't Fragment bit at
   the IP layer.  Thus, upon a path MTU change, intermediate devices
   drop the IP datagram if it is too large and reply with an ICMP
   message that tells the source TCP that the path MTU has changed.
   This causes TCP to emit segments conformant with the new path MTU
   size.  Thus, IP fragments under most conditions should never occur at
   the receiver.  But it is possible.

   There are several options for implementation of network layer
   reassembly buffers:

   1.  drop any IP fragments, and reply with an ICMP message according
       to [RFC792] (fragmentation needed and DF set) to tell the Remote
       Peer to resize its TCP segment.

   2.  support an IP reassembly buffer, but have it of limited size
       (possibly the same size as the local link's MTU).  The end node
       would normally never Advertise a path MTU larger than the local
       link MTU.  It is recommended that a dropped IP fragment cause an
       ICMP message to be generated according to RFC 792.

   3.  multiple IP reassembly buffers, of effectively unlimited size.

    4.  support an IP reassembly buffer for the largest IP datagram (64
        KB).

   5.  support for a large IP reassembly buffer that could span multiple
       IP datagrams.

   An implementation should support at least 2 or 3 above, to avoid
   dropping packets that have traversed the entire fabric.

   There is no end-to-end ACK for IP reassembly buffers, so there is no
   flow control on the buffer.  The only end-to-end ACK is a TCP ACK,
   which can only occur when a complete IP datagram is delivered to TCP.
   Because of this, under worst case, pathological scenarios, the
   largest IP reassembly buffer is the TCP receive window (to buffer
   multiple IP datagrams that have all been fragmented).

   Note that if the Remote Peer does not implement re-segmentation of
   the data stream upon receiving the ICMP reply updating the path MTU,
   it is possible to halt forward progress because the opposite peer
   would continue to retransmit using a transport segment size that is
   too large.  This deadlock scenario is no different than if the fabric
   MTU (not last-hop MTU) was reduced after connection setup, and the
   remote node's behavior is not compliant with [RFC1122].

A.5.2.  TCP Reassembly Buffers

   A TCP reassembly buffer is also needed.  TCP reassembly buffers are
   needed if FPDU Alignment is lost when using TCP with MPA or when the
   MPA FPDU spans multiple TCP segments.  Buffers are also needed if
   Markers are disabled and out-of-order packets arrive.

   Since lost FPDU Alignment often means that FPDUs are incomplete, an
   MPA on TCP implementation must have a reassembly buffer large enough
   to recover an FPDU that is less than or equal to the MTU of the
   locally attached link (this should be the largest possible Advertised
   TCP path MTU).  If the MTU is smaller than 140 octets, a buffer of at
   least 140 octets long is needed to support the minimum FPDU size.
   The 140 octets allow for the minimum MULPDU of 128, 2 octets of pad,
   2 of ULPDU_Length, 4 of CRC, and space for a possible Marker.  As
   usual, additional buffering is likely to provide better performance.
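   The 140-octet minimum follows directly from the accounting in the
   paragraph above, as this small sketch (hypothetical names) shows:

```python
MIN_MULPDU = 128   # minimum MULPDU
PAD = 2            # pad octets cited above
ULPDU_LEN = 2      # ULPDU_Length field
CRC = 4            # CRC field
MARKER = 4         # space for a possible Marker

def min_reassembly_buffer(mtu):
    """Smallest TCP reassembly buffer able to hold a minimum-size FPDU,
    per the 140-octet figure above (sketch)."""
    min_fpdu = MIN_MULPDU + PAD + ULPDU_LEN + CRC + MARKER   # 140 octets
    return max(mtu, min_fpdu)
```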

   Note that if the TCP segments were not stored, it would be possible
   to deadlock the MPA algorithm.  If the path MTU is reduced, FPDU
   Alignment requires the source TCP to re-segment the data stream to
   the new path MTU.  The source MPA will detect this condition and
   reduce the MPA segment size, but any FPDUs already posted to the
   source TCP will be re-segmented and lose FPDU Alignment.  If the
   destination does not support a TCP reassembly buffer, these segments
   can never be successfully transmitted and the protocol deadlocks.

   When a complete FPDU is received, processing continues normally.

Appendix B.  Analysis of MPA over TCP Operations

   This appendix is for information only and is NOT part of the
   standard.

   This appendix is an analysis of MPA on TCP and why it is useful to
   integrate MPA with TCP (with modifications to typical TCP
   implementations) to reduce overall system buffering and overhead.

   One of MPA's high-level goals is to provide enough information, when
   combined with the Direct Data Placement Protocol [DDP], to enable
   out-of-order placement of DDP payload into the final Upper Layer
   Protocol (ULP) Buffer.  Note that DDP separates the act of placing
   data into a ULP Buffer from that of notifying the ULP that the ULP
   Buffer is available for use.  In DDP terminology, the former is
   defined as "Placement", and the latter is defined as "Delivery".  MPA
   supports in-order Delivery of the data to the ULP, including support
   for Direct Data Placement in the final ULP Buffer location when TCP
   segments arrive out of order.  Effectively, the goal is to use the

   pre-posted ULP Buffers as the TCP receive buffer, where the
   reassembly of the ULP Protocol Data Unit (PDU) by TCP (with MPA and
   DDP) is done in place, in the ULP Buffer, with no data copies.

   This appendix walks through the advantages and disadvantages of the
   TCP sender modifications proposed by MPA:

   1) that MPA prefers that the TCP sender do Header Alignment, where
      a TCP segment should begin with an MPA Framing Protocol Data Unit
      (FPDU) (if there is payload present).

   2) that there be an integral number of FPDUs in a TCP segment (under
      conditions where the path MTU is not changing).

   This appendix concludes that the scaling advantages of FPDU Alignment
   are strong, based primarily on fairly drastic TCP receive buffer
   reduction requirements and simplified receive handling.  The analysis
   also shows that there is little effect on TCP wire behavior.

B.1.  Assumptions

B.1.1.  MPA Is Layered beneath DDP

   MPA is an adaptation layer between DDP and TCP.  DDP requires
   preservation of DDP segment boundaries and a CRC32c digest covering
   the DDP header and data.  MPA adds these features to the TCP stream
   so that DDP over TCP has the same basic properties as DDP over SCTP.

B.1.2.  MPA Preserves DDP Message Framing

   MPA was designed as a framing layer specifically for DDP and was not
   intended as a general-purpose framing layer for any other ULP using
   TCP.

   A framing layer allows ULPs using it to receive indications from the
   transport layer only when complete ULPDUs are present.  As a framing
   layer, MPA is not aware of the content of the DDP PDU, only that it
   has received and, if necessary, reassembled a complete PDU for
   Delivery to the DDP.

B.1.3.  The Size of the ULPDU Passed to MPA Is Less Than EMSS under
        Normal Conditions

   To make reception of a complete DDP PDU on every received segment
   possible, DDP passes to MPA a PDU that is no larger than the EMSS of
   the underlying fabric.  Each FPDU that MPA creates contains
   sufficient information for the receiver to directly place the ULP
   payload in the correct location in the correct receive buffer.

   Edge cases when this condition does not occur are dealt with, but do
   not need to be on the fast path.

B.1.4.  Out-of-Order Placement but NO Out-of-Order Delivery

   DDP receives complete DDP PDUs from MPA.  Each DDP PDU contains the
   information necessary to place its ULP payload directly in the
   correct location in host memory.

   Because each DDP segment is self-describing, it is possible for DDP
   segments received out of order to have their ULP payload placed
   immediately in the ULP receive buffer.

   Data delivery to the ULP is guaranteed to be in the order the data
   was sent.  DDP only indicates data delivery to the ULP after TCP has
   acknowledged the complete byte stream.
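   The split between out-of-order Placement and in-order Delivery can be
   pictured with the following hypothetical sketch, which keys messages
   by a simple sequence number (real DDP uses TCP byte-stream state, not
   this abstraction):

```python
class DeliveryTracker:
    """Place DDP messages as they arrive (any order), but Deliver only
    the contiguous in-order prefix (hypothetical sketch)."""
    def __init__(self):
        self.placed = set()
        self.next_to_deliver = 0

    def place(self, msg_seq):
        self.placed.add(msg_seq)        # out-of-order Placement is fine

    def deliverable(self):
        """Messages that can now be Delivered, in order."""
        out = []
        while self.next_to_deliver in self.placed:
            out.append(self.next_to_deliver)
            self.next_to_deliver += 1
        return out
```

   Placing message 1 before message 0 delivers nothing; once message 0
   arrives, both become deliverable at once, preserving order.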

B.2.  The Value of FPDU Alignment

   Significant receiver optimizations can be achieved when Header
   Alignment and complete FPDUs are the common case.  The optimizations
   allow utilizing significantly fewer buffers on the receiver and less
   computation per FPDU.  The net effect is the ability to build a
   "flow-through" receiver that enables TCP-based solutions to scale to
   10G and beyond in an economical way.  The optimizations are
   especially relevant to hardware implementations of receivers that
   process multiple protocol layers -- Data Link Layer (e.g., Ethernet),
   Network and Transport Layer (e.g., TCP/IP), and even some ULP on top
   of TCP (e.g., MPA/DDP).  As network speed increases, there is an
   increasing desire to use a hardware-based receiver in order to
   achieve an efficient high performance solution.

   A TCP receiver, under worst-case conditions, has to allocate buffers
   (BufferSizeTCP) whose capacities are a function of the bandwidth-
   delay product.  Thus:

       BufferSizeTCP = K * bandwidth [octets/second] * Delay [seconds].

   Where bandwidth is the end-to-end bandwidth of the connection, delay
   is the round-trip delay of the connection, and K is an
   implementation-dependent constant.

   Thus, BufferSizeTCP scales with the end-to-end bandwidth (10x more
   buffers for a 10x increase in end-to-end bandwidth).  As this
   buffering approach may scale poorly for hardware or software
   implementations alike, several approaches allow reduction in the
   amount of buffering required for high-speed TCP communication.

   The MPA/DDP approach is to enable the ULP's Buffer to be used as the
   TCP receive buffer.  If the application pre-posts a sufficient amount
   of buffering, and each TCP segment has sufficient information to
   place the payload into the right application buffer, when an out-of-
   order TCP segment arrives it could potentially be placed directly in
   the ULP Buffer.  However, placement can only be done when a complete
   FPDU with the placement information is available to the receiver, and
   the FPDU contents contain enough information to place the data into
   the correct ULP Buffer (e.g., there is a DDP header available).

   For the case when the FPDU is not aligned with the TCP segment, it
   may take, on average, 2 TCP segments to assemble one FPDU.
   Therefore, the receiver has to allocate BufferSizeNAF (Buffer Size,
   Non-Aligned FPDU) octets:

       BufferSizeNAF = K1* EMSS * number_of_connections + K2 * EMSS

   Where K1 and K2 are implementation-dependent constants and EMSS is
   the effective maximum segment size.

   For example, a 1 GB/sec link with 10,000 connections and an EMSS of
   1500 B would require 15 MB of memory.  Often the number of
   connections used scales with the network speed, aggravating the
   situation for higher speeds.

   FPDU Alignment would allow the receiver to allocate BufferSizeAF
   (Buffer Size, Aligned FPDU) octets:

       BufferSizeAF = K2 * EMSS

   for the same conditions.  An FPDU Aligned receiver may require memory
   in the range of ~100s of KB -- which is feasible for an on-chip
   memory and enables a "flow-through" design, in which the data flows
   through the network interface card (NIC) and is placed directly in
   the destination buffer.  Assuming most of the connections support
   FPDU Alignment, the receiver buffers no longer scale with the number
   of connections.

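   The two buffer-sizing formulas above can be restated directly in
   code.  With K1 = 1 and the K2 term set aside, the sketch reproduces
   the 15 MB example (10,000 connections at an EMSS of 1500 octets):

```python
def buffer_size_naf(k1, k2, emss, connections):
    """BufferSizeNAF = K1 * EMSS * number_of_connections + K2 * EMSS."""
    return k1 * emss * connections + k2 * emss

def buffer_size_af(k2, emss):
    """BufferSizeAF = K2 * EMSS -- independent of the connection count."""
    return k2 * emss
```

   The difference between the two is the per-connection term, which is
   exactly what FPDU Alignment eliminates.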
   Additional optimizations can be achieved in a balanced I/O sub-system
   -- where the system interface of the network controller provides
   ample bandwidth as compared with the network bandwidth.  For almost
   twenty years this has been the case and the trend is expected to
   continue.  While Ethernet speeds have scaled by 1000 (from 10
   megabit/sec to 10 gigabit/sec), I/O bus bandwidth of volume CPU
   architectures has scaled from ~2 MB/sec to ~2 GB/sec (PC-XT bus to
   PCI-X DDR).  Under these conditions, the FPDU Alignment approach
   allows BufferSizeAF to be indifferent to network speed.  It is
   primarily a function of the local processing time for a given frame.

   Thus, when the FPDU Alignment approach is used, receive buffering is
   expected to scale gracefully (i.e., less than linear scaling) as
   network speed is increased.

B.2.1.  Impact of Lack of FPDU Alignment on the Receiver Computational
        Load and Complexity

   The receiver must perform IP and TCP processing, and then perform
   FPDU CRC checks, before it can trust the FPDU header placement
   information.  For simplicity of the description, the assumption is
   that an FPDU is carried in no more than 2 TCP segments.  In reality,
   with no FPDU Alignment, an FPDU can be carried by more than 2 TCP
   segments (e.g., if the path MTU was reduced).

   +---||---------------+    +--------||--------+   +----------||----+
   |   TCP Seg X-1      |    |     TCP Seg X    |   |  TCP Seg X+1   |
   +---||---------------+    +--------||--------+   +----------||----+
                   FPDU #N-1                  FPDU #N

     Figure 12: Non-Aligned FPDU Freely Placed in TCP Octet Stream

   The receiver algorithm for processing TCP segments (e.g., TCP segment
   #X in Figure 12) carrying non-aligned FPDUs (in order or out of
   order) includes:

        1.  Data Link Layer processing (whole frame) -- typically
            including a CRC calculation.

        2.  Network Layer processing (assuming not an IP fragment, the
            whole Data Link Layer frame contains one IP datagram.  IP
            fragments should be reassembled in a local buffer.  This is
            not a performance optimization goal.)

        3.  Transport Layer processing -- TCP protocol processing, header
            and checksum checks.

            a.  Classify incoming TCP segment using the 5 tuple (IP SRC,
                IP DST, TCP SRC Port, TCP DST Port, protocol).

        4.  Find FPDU message boundaries.

           a.  Get MPA state information for the connection.

               If the TCP segment is in order, use the receiver-managed
               MPA state information to calculate where the previous
               FPDU message (#N-1) ends in the current TCP segment X.
               (previously, when the MPA receiver processed the first
               part of FPDU #N-1, it calculated the number of bytes
                remaining to complete FPDU #N-1 by using the MPA Length
                field.)

                   Get the stored partial CRC for FPDU #N-1.

                   Complete CRC calculation for FPDU #N-1 data (first
                       portion of TCP segment #X).

                   Check CRC calculation for FPDU #N-1.

                   If no FPDU CRC errors, placement is allowed.

                   Locate the local buffer for the first portion of
                       FPDU#N-1, CopyData(local buffer of first portion
                       of FPDU #N-1, host buffer address, length).

                   Compute host buffer address for second portion of
                       FPDU #N-1.

                   CopyData (local buffer of second portion of FPDU #N-
                       1, host buffer address for second portion,

                   Calculate the octet offset into the TCP segment for
                       the next FPDU #N.

                    Start calculation of CRC for available data for
                        FPDU #N.

                   Store partial CRC results for FPDU #N.

                    Store local buffer address of first portion of FPDU
                        #N.

                   No further action is possible on FPDU #N, before it
                       is completely received.

               If the TCP segment is out of order, the receiver must
               buffer the data until at least one complete FPDU is
               received.  Typically, buffering for more than one TCP
               segment per connection is required.  Use the MPA-based
               Markers to calculate where FPDU boundaries are.

                   When a complete FPDU is available, a similar
                   procedure to the in-order algorithm above is used.
                   There is additional complexity, though, because when
                   the missing segment arrives, this TCP segment must be
                   run through the CRC engine after the CRC is
                   calculated for the missing segment.
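   The partial-CRC bookkeeping described above amounts to a running CRC
   carried across segment boundaries.  The pattern looks like this
   hypothetical Python sketch; zlib.crc32 stands in for the CRC32c
   (Castagnoli) polynomial that MPA actually uses, since only the
   incremental structure matters here:

```python
import zlib

def crc_begin(data):
    """Start a running CRC over the first portion of an FPDU
    (zlib.crc32 as a stand-in for CRC32c)."""
    return zlib.crc32(data)

def crc_continue(partial, data):
    """Fold the next portion (e.g., from TCP segment #X) into the
    stored partial CRC."""
    return zlib.crc32(data, partial)
```

   Feeding two portions through crc_begin/crc_continue yields the same
   value as a single pass over the concatenated data, which is what lets
   the receiver resume the check when the rest of the FPDU arrives.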

   If we assume FPDU Alignment, the following diagram and the algorithm
   below apply.  Note that when using MPA, the receiver is assumed to
   actively detect presence or loss of FPDU Alignment for every TCP
   segment received.

      +--------------------------+      +--------------------------+
   +--|--------------------------+   +--|--------------------------+
   |  |       TCP Seg X          |   |  |         TCP Seg X+1      |
   +--|--------------------------+   +--|--------------------------+
      +--------------------------+      +--------------------------+
                FPDU #N                          FPDU #N+1

      Figure 13: Aligned FPDU Placed Immediately after TCP Header

   The receiver algorithm for FPDU Aligned frames (in order or out of
   order) includes:

       1)  Data Link Layer processing (whole frame) -- typically
           including a CRC calculation.

       2)  Network Layer processing (assuming not an IP fragment, the
           whole Data Link Layer frame contains one IP datagram.  IP
           fragments should be reassembled in a local buffer.  This is
           not a performance optimization goal.)

       3)  Transport Layer processing -- TCP protocol processing, header
           and checksum checks.

           a.  Classify incoming TCP segment using the 5 tuple (IP SRC,
               IP DST, TCP SRC Port, TCP DST Port, protocol).

        4)  Check for Header Alignment (described in detail in Section
            6).  Header Alignment is assumed for the rest of the
            algorithm.

           a.  If the header is not aligned, see the algorithm defined
               in the prior section.

        5)  Whether the TCP segment is in order or out of order, the
            MPA header is at the beginning of the current TCP payload.
            Get the FPDU length from the FPDU header.

       6)  Calculate CRC over FPDU.

       7)  Check CRC calculation for FPDU #N.

       8)  If no FPDU CRC errors, placement is allowed.

       9)  CopyData(TCP segment #X, host buffer address, length).

       10) Loop to #5 until all the FPDUs in the TCP segment are
           consumed in order to handle FPDU packing.
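
   Steps 5 through 10 above can be sketched as the following loop over
   one header-aligned TCP payload.  The sketch is illustrative only: it
   assumes a 2-octet ULPDU_Length header and a trailing 4-octet CRC,
   omits Markers and pad octets, and uses zlib.crc32 as a stand-in for
   the CRC32c that MPA actually specifies.

```python
import zlib

def place_fpdus(tcp_payload, place):
    """Steps 5-10 for one header-aligned TCP segment (Markers and pad
    octets omitted for brevity; zlib.crc32 stands in for MPA's CRC32c).

    tcp_payload -- segment payload starting with a ULPDU_Length header
    place       -- callback performing CopyData(fpdu) into the ULP buffer
    Returns the number of FPDUs placed; stops on a CRC error.
    """
    off, placed = 0, 0
    while off + 2 <= len(tcp_payload):                   # step 5: header
        ulpdu_len = int.from_bytes(tcp_payload[off:off + 2], "big")
        fpdu_len = 2 + ulpdu_len + 4                     # hdr + ULPDU + CRC
        if off + fpdu_len > len(tcp_payload):
            break                                        # partial FPDU tail
        fpdu = tcp_payload[off:off + fpdu_len]
        crc = int.from_bytes(fpdu[-4:], "big")
        if zlib.crc32(fpdu[:-4]) != crc:                 # steps 6-7: CRC
            break                                        # step 8: no placement
        place(fpdu)                                      # step 9: CopyData
        placed += 1
        off += fpdu_len                                  # step 10: packing loop
    return placed
```

   The loop naturally handles FPDU packing: each iteration consumes one
   complete FPDU until only a partial FPDU (or nothing) remains.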

   Implementation note: In both cases, the receiver has to classify the
   incoming TCP segment and associate it with one of the flows it
   maintains.  In the case of no FPDU Alignment, the receiver is forced
   to classify incoming traffic before it can calculate the FPDU CRC.
   In the case of FPDU Alignment, the operations order is left to the
   implementer.

   The FPDU Aligned receiver algorithm is significantly simpler.  There
   is no need to locally buffer portions of FPDUs.  Accessing state
   information is also substantially simplified -- the normal case does
   not require retrieving information to find out where an FPDU starts
   and ends or retrieval of a partial CRC before the CRC calculation can
   commence.  This avoids adding internal latencies, having multiple
   data passes through the CRC machine, or scheduling multiple commands
   for moving the data to the host buffer.

   The aligned FPDU approach is useful for in-order and out-of-order
   reception.  The receiver can use the same mechanisms for data storage
   in both cases, and only needs to account for when all the TCP
   segments have arrived to enable Delivery.  The Header Alignment,
   along with the high probability that at least one complete FPDU is
   found with every TCP segment, allows the receiver to perform data
   placement for out-of-order TCP segments with no need for intermediate
   buffering.  Essentially, the TCP receive buffer has been eliminated
   and TCP reassembly is done in place within the ULP Buffer.

   In case FPDU Alignment is not found, the receiver should follow the
   algorithm for non-aligned FPDU reception, which may be slower and
   less efficient.

B.2.2.  FPDU Alignment Effects on TCP Wire Protocol

   In an optimized MPA/TCP implementation, TCP exposes its EMSS to MPA.
   MPA uses the EMSS to calculate its MULPDU, which it then exposes to
   DDP, its ULP.  DDP uses the MULPDU to segment its payload so that
   each FPDU sent by MPA fits completely into one TCP segment.  This has
   no impact on wire protocol, and exposing this information is already
   supported on many TCP implementations, including all modern flavors
   of BSD networking, through the TCP_MAXSEG socket option.
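
   A minimal sketch of this interaction follows, assuming a per-FPDU
   overhead of a 2-octet length header, a 4-octet CRC, and one 4-octet
   Marker per 512 octets.  The normative MULPDU computation in the main
   specification also accounts for padding and exact Marker placement;
   this is a simplified illustration.

```python
import math
import socket

def mulpdu_from_emss(emss):
    """Derive a MULPDU from the EMSS.  Simplified assumption: 2-octet
    length header, 4-octet CRC, and one 4-octet Marker per 512 octets
    of segment; not the normative computation."""
    overhead = 2 + 4 + 4 * math.ceil(emss / 512)
    mulpdu = emss - overhead
    return mulpdu - (mulpdu % 4)   # keep the ULP payload 4-octet aligned

def query_emss(sock):
    """Read the current maximum segment size from a connected TCP
    socket via the TCP_MAXSEG option mentioned above."""
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_MAXSEG)
```

   DDP would then segment its payload to at most mulpdu_from_emss(emss)
   octets so that every FPDU fits in a single TCP segment.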

   In the common case, the ULP (i.e., DDP over MPA) messages provided to
   the TCP layer are segmented to MULPDU size.  It is assumed that the
   ULP message size is bounded by MULPDU, such that a single ULP message
   can be encapsulated in a single TCP segment.  Therefore, in the
   common case, there is no increase in the number of TCP segments
   emitted.  For smaller ULP messages, the sender can also apply
   packing, i.e., the sender packs as many complete FPDUs as possible
   into one TCP segment.  The requirement to always have a complete FPDU
   may increase the number of TCP segments emitted.  Typically, a ULP
   message size varies from a few bytes to multiple EMSSs (e.g., 64
   Kbytes).  In some cases, the ULP may post more than one message at a
   time for transmission, giving the sender an opportunity for packing.
   In the case where more than one FPDU is available for transmission
   and the FPDUs are encapsulated into a TCP segment and there is no
   room in the TCP segment to include the next complete FPDU, another
   TCP segment is sent.  In this corner case, some of the TCP segments
   are not full size.  In the worst-case scenario, the ULP may choose an
   FPDU size that is EMSS/2 +1 and has multiple messages available for
   transmission.  For this poor choice of FPDU size, the average TCP
   segment size is therefore about 1/2 of the EMSS and the number of TCP
   segments emitted is approaching 2x of what is possible without the
   requirement to encapsulate an integer number of complete FPDUs in
   every TCP segment.  This is a dynamic situation that only lasts for
   the duration where the sender ULP has multiple non-optimal messages
   for transmission and this causes a minor impact on the wire
   utilization.

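   The packing behavior and the EMSS/2 + 1 worst case above can be
   illustrated numerically (hypothetical sizes; the model simply packs
   as many complete FPDUs as fit into each segment):

```python
import math

def segments_emitted(emss, fpdu_len, n_fpdus):
    """TCP segments needed when each segment must carry an integer
    number of complete FPDUs, packing as many as fit (illustrative
    model only)."""
    per_seg = max(1, emss // fpdu_len)     # complete FPDUs per segment
    return math.ceil(n_fpdus / per_seg)

emss = 1460
worst = emss // 2 + 1                      # the poor FPDU-size choice above
assert segments_emitted(emss, worst, 100) == 100       # one FPDU per segment
assert segments_emitted(emss, emss // 4, 100) == 25    # four pack per segment
```

   With EMSS = 1460, 100 FPDUs of 731 octets need 100 segments, while an
   unconstrained byte stream of the same payload would need about 51,
   approaching the 2x bound; FPDUs of 365 octets pack four per segment.
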
   However, it is not expected that requiring FPDU Alignment will have a
   measurable impact on wire behavior of most applications.  Throughput
   applications with large I/Os are expected to take full advantage of
   the EMSS.  Another class of applications with many small outstanding
   buffers (as compared to EMSS) is expected to use packing when
   applicable.  Transaction-oriented applications are also optimal.

   TCP retransmission is another area that can affect sender behavior.
   TCP supports retransmission of the exact, originally transmitted
   segment (see [RFC793], Sections 2.6 and 3.7 (under "Managing the
   Window") and [RFC1122]).  In the unlikely event
   that part of the original segment has been received and acknowledged
   by the Remote Peer (e.g., a re-segmenting middlebox, as documented in
   Appendix A.4, Re-Segmenting Middleboxes and Non-Optimized MPA/TCP
   Senders), a better available bandwidth utilization may be possible by
   retransmitting only the missing octets.  If an optimized MPA/TCP
   retransmits complete FPDUs, there may be some marginal bandwidth
   loss.

   Another area where a change in the TCP segment number may have impact
   is that of slow start and congestion avoidance.  Slow-start
   exponential increase is measured in segments per round trip, as the
   algorithm focuses on the overhead per segment at the source for
   congestion that eventually results in dropped segments.  Slow-start
   exponential bandwidth growth for optimized MPA/TCP is similar to any
   TCP implementation.  Congestion avoidance allows for a linear growth
   in available bandwidth when recovering after a packet drop.  Similar
   to the analysis for slow start, optimized MPA/TCP doesn't change the
   behavior of the algorithm.  Therefore, the average size of the
   segment versus EMSS is not a major factor in the assessment of the
   bandwidth growth for a sender.  Both slow start and congestion
   avoidance for an optimized MPA/TCP will behave similarly to any TCP
   sender and allow an optimized MPA/TCP to enjoy the theoretical
   performance limits of the algorithms.
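
   As a cross-check of the argument above, both algorithms can be
   modeled in a few lines; the window is counted in segments, so
   nothing in the model depends on how large each MPA/TCP segment is
   (textbook model, hypothetical function name):

```python
def cwnd_after(rtts, ssthresh, cwnd=1):
    """Congestion window (in segments) after a number of round trips:
    exponential growth below ssthresh (slow start), then linear growth
    (congestion avoidance).  The window is counted in segments, so the
    trajectory is independent of the segment size a sender chooses."""
    for _ in range(rtts):
        cwnd = cwnd * 2 if cwnd < ssthresh else cwnd + 1
    return cwnd
```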

   In summary, the ULP messages generated at the sender (e.g., the
   amount of messages grouped for every transmission request) and
   message size distribution has the most significant impact over the
   number of TCP segments emitted.  The worst-case effect for certain
   ULPs (with average message size of EMSS/2+1 to EMSS) is bounded by an
   increase of up to 2x in the number of TCP segments and acknowledges.
   In reality, the effect is expected to be marginal.

Appendix C.  IETF Implementation Interoperability with RDMA Consortium

   This appendix is for information only and is NOT part of the
   standard.

   This appendix covers methods of making MPA implementations
   interoperate with both IETF and RDMA Consortium versions of the
   protocols.

   The RDMA Consortium created early specifications of the MPA/DDP/RDMA
   protocols, and some manufacturers created implementations of those
   protocols before the IETF versions were finalized.  These protocols
   are very similar to the IETF versions making it possible for
   implementations to be created or modified to support either set of
   specifications.

   For those interested, the RDMA Consortium protocol documents (draft-
   culley-iwarp-mpa-v1.0.pdf [RDMA-MPA], draft-shah-iwarp-ddp-v1.0.pdf
   [RDMA-DDP], and draft-recio-iwarp-rdmac-v1.0.pdf [RDMA-RDMAC]) can be
   obtained from the RDMA Consortium web site.

   In this section, implementations of MPA/DDP/RDMA that conform to the
   RDMAC specifications are called RDMAC RNICs.  Implementations of
   MPA/DDP/RDMA that conform to the IETF RFCs are called IETF RNICs.

   Without the exchange of MPA Request/Reply Frames, there is no
   standard mechanism for enabling RDMAC RNICs to interoperate with IETF
   RNICs.  Even if a ULP uses a well-known port to start an IETF RNIC
   immediately in RDMA mode (i.e., without exchanging the MPA
   Request/Reply messages), there is no reason to believe an IETF RNIC
   will interoperate with an RDMAC RNIC because of the differences in
   the version number in the DDP and RDMAP headers on the wire.

   Therefore, the ULP or other supporting entity at the RDMAC RNIC must
   implement MPA Request/Reply Frames on behalf of the RNIC in order to
   negotiate the connection parameters.  The following section describes
   the results following the exchange of the MPA Request/Reply Frames
   before the conversion from streaming to RDMA mode.

C.1.  Negotiated Parameters

   Three types of RNICs are considered:

   Upgraded RDMAC RNIC - an RNIC implementing the RDMAC protocols that
   has a ULP or other supporting entity that exchanges the MPA
   Request/Reply Frames in streaming mode before the conversion to RDMA
   mode.

   Non-permissive IETF RNIC - an RNIC implementing the IETF protocols
   that is not capable of implementing the RDMAC protocols.  Such an
   RNIC can only interoperate with other IETF RNICs.

   Permissive IETF RNIC - an RNIC implementing the IETF protocols that
   is capable of implementing the RDMAC protocols on a per-connection

   The Permissive IETF RNIC is recommended for those implementers that
   want maximum interoperability with other RNIC implementations.

   The values used by these three RNIC types for the MPA, DDP, and RDMAP
   versions as well as MPA Markers and CRC are summarized in Figure 14.

     +----------------++-----------+-----------+-----------+-----------+
     | RNIC TYPE      || DDP/RDMAP |    MPA    |    MPA    |    MPA    |
     |                ||  Version  | Revision  |  Markers  |    CRC    |
     +----------------++-----------+-----------+-----------+-----------+
     | RDMAC          ||     0     |     0     |     1     |     1     |
     +----------------++-----------+-----------+-----------+-----------+
     | IETF           ||     1     |     1     |  0 or 1   |  0 or 1   |
     | Non-permissive ||           |           |           |           |
     +----------------++-----------+-----------+-----------+-----------+
     | IETF           ||  1 or 0   |  1 or 0   |  0 or 1   |  0 or 1   |
     | Permissive     ||           |           |           |           |
     +----------------++-----------+-----------+-----------+-----------+

           Figure 14: Connection Parameters for the RNIC Types
            for MPA Markers and MPA CRC, enabled=1, disabled=0.

   It is assumed there is no mixing of versions allowed between MPA,
   DDP, and RDMAP.  The RNIC either generates the RDMAC protocols on the
   wire (version is zero) or uses the IETF protocols (version is one).

   During the exchange of the MPA Request/Reply Frames, each peer
   provides its MPA Revision, Marker preference (M: 0=disabled,
   1=enabled), and CRC preference.  The MPA Revision provided in the MPA
   Request Frame and the MPA Reply Frame may differ.

   From the information in the MPA Request/Reply Frames, each side sets
   the Version field (V: 0=RDMAC, 1=IETF) of the DDP/RDMAP protocols as
   well as the state of the Markers for each half connection.  Between
   DDP and RDMAP, no mixing of versions is allowed.  Moreover, the DDP
   and RDMAP version MUST be identical in the two directions.  The RNIC
   either generates the RDMAC protocols on the wire (version is zero) or
   uses the IETF protocols (version is one).

   In the following sections, the figures do not discuss CRC negotiation
   because there is no interoperability issue for CRCs.  Since the RDMAC
   RNIC will always request CRC use, according to the IETF MPA
   specification, both peers MUST generate and check CRCs.

C.2.  RDMAC RNIC and Non-Permissive IETF RNIC

   Figure 15 shows that a Non-permissive IETF RNIC cannot interoperate
   with an RDMAC RNIC, despite the fact that both peers exchange MPA
   Request/Reply Frames.  For a Non-permissive IETF RNIC, the MPA
   negotiation has no effect on the DDP/RDMAP version and it is unable
   to interoperate with the RDMAC RNIC.

   The rows in the figure show the state of the Marker field in the MPA
   Request Frame sent by the MPA Initiator.  The columns show the state
   of the Marker field in the MPA Reply Frame sent by the MPA Responder.
   Each type of RNIC is shown as an Initiator and a Responder.  The
   connection results are shown in the lower right corner, at the
   intersection of the different RNIC types, where V=0 is the RDMAC
   DDP/RDMAP version, V=1 is the IETF DDP/RDMAP version, M=0 means MPA
   Markers are disabled, and M=1 means MPA Markers are enabled.  The
   negotiated Marker state is shown as X/Y, for the receive direction of
   the Initiator/Responder.

          |   MPA                     ||          MPA          |
          | CONNECT                   ||       Responder       |
          |   MODE  +-----------------++-------+---------------+
          |         |   RNIC          || RDMAC |     IETF      |
          |         |   TYPE          ||       | Non-permissive|
          |         |          +------++-------+-------+-------+
          |         |          |MARKER|| M=1   | M=0   |  M=1  |
          |         |   RDMAC  | M=1  || V=0   | close | close |
          |         |          |      || M=1/1 |       |       |
          |         +----------+------++-------+-------+-------+
          |   MPA   |          | M=0  || close | V=1   | V=1   |
          |Initiator|   IETF   |      ||       | M=0/0 | M=0/1 |
          |         |Non-perms.+------++-------+-------+-------+
          |         |          | M=1  || close | V=1   | V=1   |
          |         |          |      ||       | M=1/0 | M=1/1 |

           Figure 15: MPA Negotiation between an RDMAC RNIC and
                      a Non-Permissive IETF RNIC

C.2.1.  RDMAC RNIC Initiator

   If the RDMAC RNIC is the MPA Initiator, its ULP sends an MPA Request
   Frame with Rev field set to zero and the M and C bits set to one.
   Because the Non-permissive IETF RNIC cannot dynamically downgrade the
   version number it uses for DDP and RDMAP, it would send an MPA Reply
   Frame with the Rev field equal to one and then gracefully close the
   connection.

C.2.2.  Non-Permissive IETF RNIC Initiator

   If the Non-permissive IETF RNIC is the MPA Initiator, it sends an MPA
   Request Frame with Rev field equal to one.  The ULP or supporting
   entity for the RDMAC RNIC responds with an MPA Reply Frame that has
   the Rev field equal to zero and the M bit set to one.  The Non-
   permissive IETF RNIC will gracefully close the connection after it
   reads the incompatible Rev field in the MPA Reply Frame.

C.2.3.  RDMAC RNIC and Permissive IETF RNIC

   Figure 16 shows that a Permissive IETF RNIC can interoperate with an
   RDMAC RNIC regardless of its Marker preference.  The figure uses the
   same format as shown with the Non-permissive IETF RNIC.

          |   MPA                     ||          MPA          |
          | CONNECT                   ||       Responder       |
          |   MODE  +-----------------++-------+---------------+
          |         |   RNIC          || RDMAC |     IETF      |
          |         |   TYPE          ||       |  Permissive   |
          |         |          +------++-------+-------+-------+
          |         |          |MARKER|| M=1   | M=0   | M=1   |
          |         |   RDMAC  | M=1  || V=0   | N/A   | V=0   |
          |         |          |      || M=1/1 |       | M=1/1 |
          |         +----------+------++-------+-------+-------+
          |   MPA   |          | M=0  || V=0   | V=1   | V=1   |
          |Initiator|   IETF   |      || M=1/1 | M=0/0 | M=0/1 |
          |         |Permissive+------++-------+-------+-------+
          |         |          | M=1  || V=0   | V=1   | V=1   |
          |         |          |      || M=1/1 | M=1/0 | M=1/1 |

           Figure 16: MPA Negotiation between an RDMAC RNIC and
                         a Permissive IETF RNIC

   A truly Permissive IETF RNIC will recognize an RDMAC RNIC from the
   Rev field of the MPA Req/Rep Frames and then adjust its receive
   Marker state and DDP/RDMAP version to accommodate the RDMAC RNIC.  As
   a result, as an MPA Responder, the Permissive IETF RNIC will never
   return an MPA Reply Frame with the M bit set to zero.  This case is
   shown as a not applicable (N/A) in Figure 16.

C.2.4.  RDMAC RNIC Initiator

   When the RDMAC RNIC is the MPA Initiator, its ULP or other supporting
   entity prepares an MPA Request message and sets the revision to zero
   and the M bit and C bit to one.

   The Permissive IETF Responder receives the MPA Request message and
   checks the revision field.  Since it is capable of generating RDMAC
   DDP/RDMAP headers, it sends an MPA Reply message with revision set to
   zero and the M and C bits set to one.  The Responder must inform its
   ULP that it is generating version zero DDP/RDMAP messages.

C.2.5.  Permissive IETF RNIC Initiator

   If the Permissive IETF RNIC is the MPA Initiator, it prepares the MPA
   Request Frame setting the Rev field to one.  Regardless of the value
   of the M bit in the MPA Request Frame, the ULP or other supporting
   entity for the RDMAC RNIC will create an MPA Reply Frame with Rev
   equal to zero and the M bit set to one.

   When the Initiator reads the Rev field of the MPA Reply Frame and
   finds that its peer is an RDMAC RNIC, it must inform its ULP that it
   should generate version zero DDP/RDMAP messages and enable MPA
   Markers and CRC.

C.3.  Non-Permissive IETF RNIC and Permissive IETF RNIC

   For completeness, Figure 17 below shows the results of MPA
   negotiation between a Non-permissive IETF RNIC and a Permissive IETF
   RNIC.  The important point from this figure is that an IETF RNIC
   cannot detect whether its peer is a Permissive or Non-permissive

      |   MPA                     ||              MPA              |
      | CONNECT                   ||            Responder          |
      |   MODE  +-----------------++---------------+---------------+
      |         |   RNIC          ||     IETF      |     IETF      |
      |         |   TYPE          || Non-permissive|  Permissive   |
      |         |          +------++-------+-------+-------+-------+
      |         |          |MARKER|| M=0   | M=1   | M=0   | M=1   |
      |         |          | M=0  || V=1   | V=1   | V=1   | V=1   |
      |         |   IETF   |      || M=0/0 | M=0/1 | M=0/0 | M=0/1 |
      |         |Non-perms.+------++-------+-------+-------+-------+
      |         |          | M=1  || V=1   | V=1   | V=1   | V=1   |
      |         |          |      || M=1/0 | M=1/1 | M=1/0 | M=1/1 |
      |   MPA   +----------+------++-------+-------+-------+-------+
      |Initiator|          | M=0  || V=1   | V=1   | V=1   | V=1   |
      |         |   IETF   |      || M=0/0 | M=0/1 | M=0/0 | M=0/1 |
      |         |Permissive+------++-------+-------+-------+-------+
      |         |          | M=1  || V=1   | V=1   | V=1   | V=1   |
      |         |          |      || M=1/0 | M=1/1 | M=1/0 | M=1/1 |

    Figure 17: MPA negotiation between a Non-permissive IETF RNIC and a
                           Permissive IETF RNIC.
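
   The outcomes tabulated in Figures 15 through 17 can be condensed
   into a single decision function.  This is an illustrative encoding
   of the figures (hypothetical names), not part of the protocol:

```python
def mpa_negotiate(init_type, init_m, resp_type, resp_m):
    """Connection outcome for the RNIC pairings in Figures 15-17.

    Types: 'rdmac' (upgraded RDMAC), 'nonperm' (Non-permissive IETF),
    'perm' (Permissive IETF).  init_m / resp_m are each side's Marker
    preference (ignored for RDMAC RNICs, which always set M=1).
    Returns ('close',) or (version, markers_init_rx, markers_resp_rx).
    """
    rdmac_init = init_type == 'rdmac'
    rdmac_resp = resp_type == 'rdmac'
    if (rdmac_init and resp_type == 'nonperm') or \
       (rdmac_resp and init_type == 'nonperm'):
        return ('close',)          # Rev mismatch, graceful close (Fig 15)
    if rdmac_init or rdmac_resp:
        return (0, 1, 1)           # RDMAC protocols, Markers on (Fig 16)
    return (1, init_m, resp_m)     # IETF protocols, per-direction M (Fig 17)
```

   For example, a Non-permissive IETF Initiator with M=1 facing a
   Permissive IETF Responder with M=0 yields V=1 with Markers M=1/0,
   matching the corresponding cell of Figure 17.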

Normative References

   [iSCSI]      Satran, J., Meth, K., Sapuntzakis, C., Chadalapaka, M.,
                and E. Zeidner, "Internet Small Computer Systems
                Interface (iSCSI)", RFC 3720, April 2004.

   [RFC1191]    Mogul, J. and S. Deering, "Path MTU discovery", RFC
                1191, November 1990.

   [RFC2018]    Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP
                 Selective Acknowledgment Options", RFC 2018,
                 October 1996.

   [RFC2119]    Bradner, S., "Key words for use in RFCs to Indicate
                Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC2401]    Kent, S. and R. Atkinson, "Security Architecture for the
                Internet Protocol", RFC 2401, November 1998.

   [RFC3723]    Aboba, B., Tseng, J., Walker, J., Rangan, V., and F.
                Travostino, "Securing Block Storage Protocols over IP",
                RFC 3723, April 2004.

   [RFC793]     Postel, J., "Transmission Control Protocol", STD 7, RFC
                793, September 1981.

   [RDMASEC]    Pinkerton, J. and E. Deleganes, "Direct Data Placement
                Protocol (DDP) / Remote Direct Memory Access Protocol
                (RDMAP) Security", RFC 5042, October 2007.

Informative References

   [APPL]       Bestler, C. and L. Coene, "Applicability of Remote
                Direct Memory Access Protocol (RDMA) and Direct Data
                Placement (DDP)", RFC 5045, October 2007.

   [CRCTCP]     Stone, J. and C. Partridge, "When the CRC and TCP
                checksum disagree", ACM SIGCOMM, September 2000.

   [DAT-API]    DAT Collaborative, "kDAPL (Kernel Direct Access
                Programming Library) and uDAPL (User Direct Access
                 Programming Library)".

   [DDP]        Shah, H., Pinkerton, J., Recio, R., and P. Culley,
                "Direct Data Placement over Reliable Transports", RFC
                5041, October 2007.

   [iSER]       Ko, M., Chadalapaka, M., Hufferd, J., Elzur, U., Shah,
                H., and P. Thaler, "Internet Small Computer System
                Interface (iSCSI) Extensions for Remote Direct Memory
                 Access (RDMA)", RFC 5046, October 2007.

   [IT-API]     The Open Group, "Interconnect Transport API (IT-API)",
                Version 2.1.

   [NFSv4CHAN]  Williams, N., "On the Use of Channel Bindings to Secure
                Channels", Work in Progress, June 2006.

   [RDMA-DDP]   "Direct Data Placement over Reliable Transports (Version
                 1.0)", RDMA Consortium, October 2002.

   [RDMA-MPA]   "Marker PDU Aligned Framing for TCP Specification
                 (Version 1.0)", RDMA Consortium, October 2002.

   [RDMA-RDMAC] "An RDMA Protocol Specification (Version 1.0)", RDMA
                 Consortium, October 2002.

   [RDMAP]      Recio, R., Culley, P., Garcia, D., Hilland, J., and B.
                Metzler, "A Remote Direct Memory Access Protocol
                Specification", RFC 5040, October 2007.

   [RFC792]     Postel, J., "Internet Control Message Protocol", STD 5,
                RFC 792, September 1981.

   [RFC896]     Nagle, J., "Congestion control in IP/TCP internetworks",
                RFC 896, January 1984.

   [RFC1122]    Braden, R., "Requirements for Internet Hosts -
                Communication Layers", STD 3, RFC 1122, October 1989.

   [RFC4960]    Stewart, R., Ed., "Stream Control Transmission
                Protocol", RFC 4960, September 2007.

   [RFC4296]    Bailey, S. and T. Talpey, "The Architecture of Direct
                Data Placement (DDP) and Remote Direct Memory Access
                (RDMA) on Internet Protocols", RFC 4296, December 2005.

   [RFC4297]    Romanow, A., Mogul, J., Talpey, T., and S. Bailey,
                "Remote Direct Memory Access (RDMA) over IP Problem
                Statement", RFC 4297, December 2005.

   [RFC4301]    Kent, S. and K. Seo, "Security Architecture for the
                Internet Protocol", RFC 4301, December 2005.

   [VERBS-RMDA] "RDMA Protocol Verbs Specification", RDMA Consortium
                 standard, April 2003.


Contributors

   Dwight Barron
   Hewlett-Packard Company
   20555 SH 249
   Houston, TX 77070-2698 USA
   Phone: 281-514-2769

   Jeff Chase
   Department of Computer Science
   Duke University
   Durham, NC 27708-0129 USA
   Phone: +1 919 660 6559

   Ted Compton
   EMC Corporation
   Research Triangle Park, NC 27709 USA
   Phone: 919-248-6075

   Dave Garcia
   24100 Hutchinson Rd.
   Los Gatos, CA  95033
   Phone: 831 247 4464

   Hari Ghadia
   Gen10 Technology, Inc.
   1501 W Shady Grove Road
   Grand Prairie, TX 75050
   Phone: (972) 301 3630

   Howard C. Herbert
   Intel Corporation
   MS CH7-404
   5000 West Chandler Blvd.
   Chandler, AZ 85226
   Phone: 480-554-3116

   Jeff Hilland
   Hewlett-Packard Company
   20555 SH 249
   Houston, TX 77070-2698 USA
   Phone: 281-514-9489

   Mike Ko
   650 Harry Rd.
   San Jose, CA 95120
   Phone: (408) 927-2085

   Mike Krause
   Hewlett-Packard Corporation, 43LN
   19410 Homestead Road
   Cupertino, CA 95014 USA
   Phone: +1 (408) 447-3191

   Dave Minturn
   Intel Corporation
   MS JF1-210
   5200 North East Elam Young Parkway
   Hillsboro, Oregon  97124
   Phone: 503-712-4106

   Jim Pinkerton
   Microsoft, Inc.
   One Microsoft Way
   Redmond, WA 98052 USA

   Hemal Shah
   Broadcom Corporation
   5300 California Avenue
   Irvine, CA 92617 USA
   Phone: +1 (949) 926-6941

   Allyn Romanow
   Cisco Systems
   170 W Tasman Drive
   San Jose, CA 95134 USA
   Phone: +1 408 525 8836

   Tom Talpey
   Network Appliance
   1601 Trapelo Road #16
   Waltham, MA  02451 USA
   Phone: +1 (781) 768-5329

   Patricia Thaler
   16215 Alton Parkway
   Irvine, CA 92618
   Phone: 916 570 2707

   Jim Wendt
   Hewlett Packard Corporation
   8000 Foothills Boulevard MS 5668
   Roseville, CA 95747-5668 USA
   Phone: +1 916 785 5198

   Jim Williams
   Emulex Corporation
   580 Main Street
   Bolton, MA 01740 USA
   Phone: +1 978 779 7224

Authors' Addresses

   Paul R. Culley
   Hewlett-Packard Company
   20555 SH 249
   Houston, TX 77070-2698 USA
   Phone: 281-514-5543

   Uri Elzur
   5300 California Avenue
   Irvine, CA 92617, USA
   Phone: 949.926.6432

   Renato J Recio
   Internal Zip 9043
   11400 Burnett Road
   Austin, Texas 78759
   Phone: 512-838-3685

   Stephen Bailey
   Sandburst Corporation
   600 Federal Street
   Andover, MA 01810 USA
   Phone: +1 978 689 1614

   John Carrier
   Cray Inc.
   411 First Avenue S, Suite 600
   Seattle, WA 98104-2860
   Phone: 206-701-2090

Full Copyright Statement

   Copyright (C) The IETF Trust (2007).

   This document is subject to the rights, licenses and restrictions
   contained in BCP 78, and except as set forth therein, the authors
   retain all their rights.

   This document and the information contained herein are provided on an
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
   THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
   OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
   THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Intellectual Property

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights.  Information
   on the procedures with respect to rights in RFC documents can be
   found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at
   http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard.  Please address the information to the IETF at
   ietf-ipr@ietf.org.