
RFC 7323

TCP Extensions for High Performance

Pages: 49
Proposed Standard
Obsoletes:  1323

Internet Engineering Task Force (IETF)                         D. Borman
Request for Comments: 7323                           Quantum Corporation
Obsoletes: 1323                                                B. Braden
Category: Standards Track              University of Southern California
ISSN: 2070-1721                                              V. Jacobson
                                                            Google, Inc.
                                                   R. Scheffenegger, Ed.
                                                            NetApp, Inc.
                                                          September 2014


                  TCP Extensions for High Performance

Abstract

   This document specifies a set of TCP extensions to improve
   performance over paths with a large bandwidth * delay product and to
   provide reliable operation over very high-speed paths.  It defines
   the TCP Window Scale (WS) option and the TCP Timestamps (TS) option
   and their semantics.  The Window Scale option is used to support
   larger receive windows, while the Timestamps option can be used for
   at least two distinct mechanisms, Protection Against Wrapped
   Sequences (PAWS) and Round-Trip Time Measurement (RTTM), that are
   also described herein.

   This document obsoletes RFC 1323 and describes changes from it.

Status of This Memo

   This is an Internet Standards Track document.

   This document is a product of the Internet Engineering Task Force
   (IETF).  It represents the consensus of the IETF community.  It has
   received public review and has been approved for publication by the
   Internet Engineering Steering Group (IESG).  Further information on
   Internet Standards is available in Section 2 of RFC 5741.

   Information about the current status of this document, any errata,
   and how to provide feedback on it may be obtained at
   http://www.rfc-editor.org/info/rfc7323.

Copyright Notice

   Copyright (c) 2014 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
       1.1.  TCP Performance
       1.2.  TCP Reliability
       1.3.  Using TCP options
       1.4.  Terminology
   2.  TCP Window Scale Option
       2.1.  Introduction
       2.2.  Window Scale Option
       2.3.  Using the Window Scale Option
       2.4.  Addressing Window Retraction
   3.  TCP Timestamps Option
       3.1.  Introduction
       3.2.  Timestamps Option
   4.  The RTTM Mechanism
       4.1.  Introduction
       4.2.  Updating the RTO Value
       4.3.  Which Timestamp to Echo
   5.  PAWS - Protection Against Wrapped Sequences
       5.1.  Introduction
       5.2.  The PAWS Mechanism
       5.3.  Basic PAWS Algorithm
       5.4.  Timestamp Clock
       5.5.  Outdated Timestamps
       5.6.  Header Prediction
       5.7.  IP Fragmentation
       5.8.  Duplicates from Earlier Incarnations of Connection
   6.  Conclusions and Acknowledgments
   7.  Security Considerations
       7.1.  Privacy Considerations
   8.  IANA Considerations
   9.  References
       9.1.  Normative References
       9.2.  Informative References
   Appendix A.  Implementation Suggestions
   Appendix B.  Duplicates from Earlier Connection Incarnations
       B.1.  System Crash with Loss of State
       B.2.  Closing and Reopening a Connection
   Appendix C.  Summary of Notation
   Appendix D.  Event Processing Summary
   Appendix E.  Timestamps Edge Cases
   Appendix F.  Window Retraction Example
   Appendix G.  RTO Calculation Modification
   Appendix H.  Changes from RFC 1323

1. Introduction

   The TCP protocol [RFC0793] was designed to operate reliably over
   almost any transmission medium regardless of transmission rate,
   delay, corruption, duplication, or reordering of segments.  Over the
   years, advances in networking technology have resulted in ever-higher
   transmission speeds, and the fastest paths are well beyond the domain
   for which TCP was originally engineered.

   This document defines a set of modest extensions to TCP to extend the
   domain of its application to match the increasing network capability.
   It is an update to and obsoletes [RFC1323], which in turn is based
   upon and obsoletes [RFC1072] and [RFC1185].

   Changes between [RFC1323] and this document are detailed in
   Appendix H.  These changes are partly due to errata in [RFC1323], and
   partly due to the improved understanding of how the involved
   components interact.

   For brevity, the full discussions of the merits and history behind
   the TCP options defined within this document have been omitted.
   [RFC1323] should be consulted for reference.  It is recommended that
   a modern TCP stack implement and make use of the extensions described
   in this document.

1.1. TCP Performance

   TCP performance problems arise when the bandwidth * delay product is
   large.  A network having such paths is referred to as a "long, fat
   network" (LFN).

   There are two fundamental performance problems with basic TCP over
   LFN paths:

   (1)  Window Size Limit

        The TCP header uses a 16-bit field to report the receive window
        size to the sender.  Therefore, the largest window that can be
        used is 2^16 = 64 KiB.  For LFN paths where the bandwidth *
        delay product exceeds 64 KiB, the receive window limits the
        maximum throughput of the TCP connection over the path, i.e.,
        the amount of unacknowledged data that TCP can send in order to
        keep the pipeline full.

        To circumvent this problem, Section 2 of this memo defines a TCP
        option, "Window Scale", to allow windows larger than 2^16.  This
        option defines an implicit scale factor, which is used to
        multiply the window size value found in a TCP header to obtain
        the true window size.

        It must be noted that the use of large receive windows increases
        the chance of too quickly wrapping sequence numbers, as
        described below in Section 1.2, (1).

   (2)  Recovery from Losses

        Packet losses in an LFN can have a catastrophic effect on
        throughput.

        To generalize the Fast Retransmit / Fast Recovery mechanism to
        handle multiple packets dropped per window, Selective
        Acknowledgments are required.  Unlike the normal cumulative
        acknowledgments of TCP, Selective Acknowledgments give the
        sender a complete picture of which segments are queued at the
        receiver and which have not yet arrived.

        Selective Acknowledgments and their use are specified in
        separate documents, "TCP Selective Acknowledgment Options"
        [RFC2018], "An Extension to the Selective Acknowledgement (SACK)
        Option for TCP" [RFC2883], and "A Conservative Loss Recovery
        Algorithm Based on Selective Acknowledgment (SACK) for TCP"
        [RFC6675], and are not further discussed in this document.
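
   The window size limit in item (1) can be made concrete with a little
   arithmetic.  The sketch below is illustrative only: the function
   names and the example path (1 Gbit/s, 100 ms round-trip time) are
   assumptions chosen for this example, not values from this document.
   It computes the bandwidth * delay product of the path and the
   throughput ceiling imposed by an unscaled 16-bit window.

```python
def bandwidth_delay_product_bytes(bandwidth_bps: float, rtt_seconds: float) -> float:
    """Bytes that must be in flight to keep the path full."""
    return bandwidth_bps * rtt_seconds / 8


def max_throughput_bps(window_bytes: int, rtt_seconds: float) -> float:
    """Throughput ceiling when at most one window can be sent per round trip."""
    return window_bytes * 8 / rtt_seconds


# Largest value that fits in the 16-bit window field of the TCP header.
UNSCALED_MAX_WINDOW = 2**16 - 1

# Hypothetical path: 1 Gbit/s with a 100 ms round-trip time.
bdp = bandwidth_delay_product_bytes(1e9, 0.100)
limit = max_throughput_bps(UNSCALED_MAX_WINDOW, 0.100)

print(f"bandwidth * delay product: {bdp:.0f} bytes")
print(f"ceiling with an unscaled window: {limit / 1e6:.2f} Mbit/s")
```

   Under these assumptions, the path needs roughly 12.5 million bytes
   in flight to stay full, while an unscaled window caps the connection
   at a little over 5 Mbit/s.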

1.2. TCP Reliability

   An especially serious kind of error may result from an accidental
   reuse of TCP sequence numbers in data segments.  TCP reliability
   depends upon the existence of a bound on the lifetime of a segment:
   the "Maximum Segment Lifetime" or MSL.

   Duplication of sequence numbers might happen in either of two ways:

   (1)  Sequence number wrap-around on the current connection

        A TCP sequence number contains 32 bits.  At a high enough
        transfer rate of large volumes of data (at least 4 GiB in the
        same session), the 32-bit sequence space may be "wrapped"
        (cycled) within the time that a segment is delayed in queues.

   (2)  Earlier incarnation of the connection

        Suppose that a connection terminates, either by a proper close
        sequence or due to a host crash, and the same connection (i.e.,
        using the same pair of port numbers) is immediately reopened.  A
        delayed segment from the terminated connection could fall within
        the current window for the new incarnation and be accepted as
        valid.

   Duplicates from earlier incarnations, case (2), are avoided by
   enforcing the current fixed MSL of the TCP specification, as
   explained in Section 5.8 and Appendix B.  In addition, the
   randomizing of ephemeral ports can also help to probabilistically
   reduce the chances of duplicates from earlier connections.  However,
   case (1), avoiding the reuse of sequence numbers within the same
   connection, requires an upper bound on MSL that depends upon the
   transfer rate, and at high enough rates, a dedicated mechanism is
   required.

   A possible fix for the problem of cycling the sequence space would be
   to increase the size of the TCP sequence number field.  For example,
   the sequence number field (and also the acknowledgment field) could
   be expanded to 64 bits.  This could be done either by changing the
   TCP header or by means of an additional option.

   Section 5 presents a different mechanism, which we call PAWS, to
   extend TCP reliability to transfer rates well beyond the foreseeable
   upper limit of network bandwidths.  PAWS uses the TCP Timestamps
   option defined in Section 3.2 to protect against old duplicates from
   the same connection.
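
   To see why case (1) depends on the transfer rate, the following
   sketch estimates how long it takes to cycle through the 32-bit
   sequence space at a few sustained rates.  The rates are hypothetical
   examples, and the 120-second MSL is an assumption taken from the
   2-minute value given in [RFC0793]; the helper name is likewise made
   up for this illustration.

```python
SEQ_SPACE_BYTES = 2**32   # a TCP sequence number contains 32 bits
MSL_SECONDS = 120         # assumption: the 2-minute MSL of RFC 793


def seq_wrap_seconds(rate_bps: float) -> float:
    """Time to send 2^32 bytes at a sustained rate given in bits per second."""
    return SEQ_SPACE_BYTES * 8 / rate_bps


for label, rate in [("100 Mbit/s", 100e6), ("1 Gbit/s", 1e9), ("10 Gbit/s", 10e9)]:
    seconds = seq_wrap_seconds(rate)
    relation = "shorter" if seconds < MSL_SECONDS else "longer"
    print(f"{label}: sequence space cycles in {seconds:.1f} s "
          f"({relation} than the assumed MSL)")
```

   At the slowest of these rates the sequence space outlasts the
   assumed MSL; at the faster ones it does not, which is the situation
   that calls for a dedicated mechanism.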

1.3. Using TCP options

   The extensions defined in this document all use TCP options.

   When [RFC1323] was published, there was concern that some buggy TCP
   implementation might crash on the first appearance of an option on a
   non-<SYN> segment.  However, bugs like that can lead to
   denial-of-service (DoS) attacks against a TCP.  Research has shown
   that most TCP implementations will properly handle unknown options on
   non-<SYN> segments ([Medina04], [Medina05]).  But it is still prudent
   to be conservative in what you send, and avoiding buggy TCP
   implementations is not the only reason for negotiating TCP options on
   <SYN> segments.

   The Window Scale option negotiates fundamental parameters of the TCP
   session.  Therefore, it is only sent during the initial handshake.
   Furthermore, the Window Scale option will be sent in a <SYN,ACK>
   segment only if the corresponding option was received in the initial
   <SYN> segment.

   The Timestamps option may appear in any data or <ACK> segment, adding
   10 bytes (up to 12 bytes including padding) to the 20-byte TCP
   header.  It is required that this TCP option will be sent on all
   non-<SYN> segments after an exchange of options on the <SYN> segments
   has indicated that both sides understand this extension.

   Research has shown that the use of the Timestamps option to take
   additional RTT samples within each RTT has little effect on the
   ultimate retransmission timeout value [Allman99].  However, there are
   other uses of the Timestamps option, such as the Eifel mechanism
   ([RFC3522], [RFC4015]) and PAWS (see Section 5), which improve
   overall TCP security and performance.  The extra header bandwidth
   used by this option should be evaluated for the gains in performance
   and security in an actual deployment.

   Appendix A contains a recommended layout of the options in TCP
   headers to achieve reasonable data field alignment.

   Finally, we observe that most of the mechanisms defined in this
   document are important for LFNs and/or very high-speed networks.  For
   low-speed networks, it might be a performance optimization to NOT use
   these mechanisms.  A TCP vendor concerned about optimal performance
   over low-speed paths might consider turning these extensions off for
   low-speed paths, or allow a user or installation manager to disable
   them.

1.4. Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

   In this document, these words will appear with that interpretation
   only when in UPPER CASE.  Lower case uses of these words are not to
   be interpreted as carrying [RFC2119] significance.

2. TCP Window Scale Option

2.1. Introduction

   The window scale extension expands the definition of the TCP window
   to 30 bits and then uses an implicit scale factor to carry this
   30-bit value in the 16-bit window field of the TCP header (SEG.WND in
   [RFC0793]).  The exponent of the scale factor is carried in a TCP
   option, Window Scale.  This option is sent only in a <SYN> segment (a
   segment with the SYN bit on), hence the window scale is fixed in each
   direction when a connection is opened.

   The maximum receive window, and therefore the scale factor, is
   determined by the maximum receive buffer space.  In a typical modern
   implementation, this maximum buffer space is set by default but can
   be overridden by a user program before a TCP connection is opened.
   This determines the scale factor, and therefore no new user interface
   is needed for window scaling.

2.2. Window Scale Option

   The three-byte Window Scale option MAY be sent in a <SYN> segment by
   a TCP.  It has two purposes: (1) indicate that the TCP is prepared to
   both send and receive window scaling, and (2) communicate the
   exponent of a scale factor to be applied to its receive window.
   Thus, a TCP that is prepared to scale windows SHOULD send the option,
   even if its own scale factor is 1 and the exponent 0.  The scale
   factor is limited to a power of two and encoded logarithmically, so
   it may be implemented by binary shift operations.  The maximum scale
   exponent is limited to 14 for a maximum permissible receive window
   size of 1 GiB (2^(14+16)).

   TCP Window Scale option (WSopt):

   Kind: 3

   Length: 3 bytes

          +---------+---------+---------+
          | Kind=3  |Length=3 |shift.cnt|
          +---------+---------+---------+
               1         1         1

   This option is an offer, not a promise; both sides MUST send Window
   Scale options in their <SYN> segments to enable window scaling in
   either direction.  If window scaling is enabled, then the TCP that
   sent this option will right-shift its true receive-window values by
   'shift.cnt' bits for transmission in SEG.WND.  The value 'shift.cnt'
   MAY be zero (offering to scale, while applying a scale factor of 1 to
   the receive window).

   This option MAY be sent in an initial <SYN> segment (i.e., a segment
   with the SYN bit on and the ACK bit off).  If a Window Scale option
   was received in the initial <SYN> segment, then this option MAY be
   sent in the <SYN,ACK> segment.  A Window Scale option in a segment
   without a SYN bit MUST be ignored.

   The window field in a segment where the SYN bit is set (i.e., a <SYN>
   or <SYN,ACK>) MUST NOT be scaled.
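
   The three-byte layout described in this section can be sketched as
   follows.  This is a minimal illustration, not a reference
   implementation: the function names are hypothetical, and the choice
   to reject a shift count above 14 when encoding is an assumption of
   this sketch (handling of received values above 14 is covered in
   Section 2.3).

```python
import struct

WSOPT_KIND = 3   # Kind field of the Window Scale option
WSOPT_LEN = 3    # total option length in bytes


def encode_wsopt(shift_cnt: int) -> bytes:
    """Build the three option bytes: Kind, Length, shift.cnt."""
    if not 0 <= shift_cnt <= 14:
        raise ValueError("shift.cnt must be in the range 0..14")
    return struct.pack("!BBB", WSOPT_KIND, WSOPT_LEN, shift_cnt)


def decode_wsopt(option: bytes) -> int:
    """Return shift.cnt from a three-byte Window Scale option."""
    kind, length, shift_cnt = struct.unpack("!BBB", option)
    if kind != WSOPT_KIND or length != WSOPT_LEN:
        raise ValueError("not a Window Scale option")
    return shift_cnt


option = encode_wsopt(7)
print(option.hex(), decode_wsopt(option))
```

   A shift count of zero is still a valid offer: it signals willingness
   to scale while leaving the sender's own receive window unscaled.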

2.3. Using the Window Scale Option

   A model implementation of window scaling is as follows, using the
   notation of [RFC0793]:

   o  The connection state is augmented by two window shift counters,
      Snd.Wind.Shift and Rcv.Wind.Shift, to be applied to the incoming
      and outgoing window fields, respectively.

   o  If a TCP receives a <SYN> segment containing a Window Scale
      option, it SHOULD send its own Window Scale option in the
      <SYN,ACK> segment.

   o  The Window Scale option MUST be sent with shift.cnt = R, where R
      is the value that the TCP would like to use for its receive
      window.

   o  Upon receiving a <SYN> segment with a Window Scale option
      containing shift.cnt = S, a TCP MUST set Snd.Wind.Shift to S and
      MUST set Rcv.Wind.Shift to R; otherwise, it MUST set both
      Snd.Wind.Shift and Rcv.Wind.Shift to zero.

   o  The window field (SEG.WND) in the header of every incoming
      segment, with the exception of <SYN> segments, MUST be
      left-shifted by Snd.Wind.Shift bits before updating SND.WND:

         SND.WND = SEG.WND << Snd.Wind.Shift

      (assuming the other conditions of [RFC0793] are met, and using the
      "C" notation "<<" for left-shift).

   o  The window field (SEG.WND) of every outgoing segment, with the
      exception of <SYN> segments, MUST be right-shifted by
      Rcv.Wind.Shift bits:

         SEG.WND = RCV.WND >> Rcv.Wind.Shift

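   The model implementation above can be sketched in a few lines.  The
   class and method names below are hypothetical, and the sketch covers
   only the shift bookkeeping; it omits the rest of TCP state handling.
   Clamping the outgoing value to 16 bits is an extra safeguard assumed
   here, not a step listed in the model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WindowScaleState:
    snd_wind_shift: int = 0   # Snd.Wind.Shift: applied to incoming SEG.WND
    rcv_wind_shift: int = 0   # Rcv.Wind.Shift: applied to outgoing SEG.WND

    def on_syn(self, peer_shift: Optional[int], own_shift: int) -> None:
        """Handle a received <SYN>; peer_shift is None if it had no WSopt."""
        if peer_shift is None:
            # No option from the peer: scaling is off in both directions.
            self.snd_wind_shift = 0
            self.rcv_wind_shift = 0
        else:
            self.snd_wind_shift = peer_shift   # S from the peer's option
            self.rcv_wind_shift = own_shift    # R from our own option

    def snd_wnd_from_segment(self, seg_wnd: int) -> int:
        """SND.WND = SEG.WND << Snd.Wind.Shift (non-<SYN> segments only)."""
        return seg_wnd << self.snd_wind_shift

    def seg_wnd_for_segment(self, rcv_wnd: int) -> int:
        """SEG.WND = RCV.WND >> Rcv.Wind.Shift, clamped to the 16-bit field."""
        return min(rcv_wnd >> self.rcv_wind_shift, 0xFFFF)


state = WindowScaleState()
state.on_syn(peer_shift=7, own_shift=3)
print(state.snd_wnd_from_segment(1000), state.seg_wnd_for_segment(80000))
```

   Keeping both counters at zero when the peer sent no option means the
   same code path handles scaled and unscaled connections.
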
   TCP determines if a data segment is "old" or "new" by testing whether
   its sequence number is within 2^31 bytes of the left edge of the
   window, and if it is not, discarding the data as "old".  To ensure
   that new data is never mistakenly considered old and vice versa, the
   left edge of the sender's window has to be at most 2^31 away from the
   right edge of the receiver's window.  The same is true of the
   sender's right edge and receiver's left edge.  Since the right and
   left edges of either the sender's or receiver's window differ by the
   window size, and since the sender and receiver windows can be out of
   phase by at most the window size, the above constraints imply that
   two times the maximum window size must be less than 2^31, or

                             max window < 2^30

   Since the max window is 2^S (where S is the scaling shift count)
   times at most 2^16 - 1 (the maximum unscaled window), the maximum
   window is guaranteed to be < 2^30 if S <= 14.  Thus, the shift count
   MUST be limited to 14 (which allows windows of 2^30 = 1 GiB).  If a
   Window Scale option is received with a shift.cnt value larger than
   14, the TCP SHOULD log the error but MUST use 14 instead of the
   specified value.  This is safe as a sender can always choose to only
   partially use any signaled receive window.  If the receiver is
   scaling by a factor larger than 14 and the sender is only scaling by
   14, then the receive window used by the sender will appear smaller
   than it is in reality.

   The scale factor applies only to the window field as transmitted in
   the TCP header; each TCP using extended windows will maintain the
   window values locally as 32-bit numbers.  For example, the
   "congestion window" computed by slow start and congestion avoidance
   (see [RFC5681]) is not affected by the scale factor, so window
   scaling will not introduce quantization into the congestion window.
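
   The limits discussed in this section can be checked numerically.
   The sketch below is an illustration under stated assumptions: the
   helper names are hypothetical, and picking the smallest shift count
   that covers a given receive buffer is one plausible policy, not a
   rule from this document.  Clamping a received shift count to 14
   follows the requirement above.

```python
MAX_SHIFT = 14              # largest permitted shift count
MAX_UNSCALED = 2**16 - 1    # largest value of the 16-bit window field


def shift_for_buffer(buffer_bytes: int) -> int:
    """Smallest shift count whose scaled 16-bit window covers the buffer."""
    shift = 0
    while shift < MAX_SHIFT and (MAX_UNSCALED << shift) < buffer_bytes:
        shift += 1
    return shift


def clamp_received_shift(shift_cnt: int) -> int:
    """Use 14 in place of any larger shift.cnt received from the peer."""
    return min(shift_cnt, MAX_SHIFT)


# The largest scaled window stays below 2^30, as the text requires.
largest_window = MAX_UNSCALED << MAX_SHIFT
print(largest_window, largest_window < 2**30)
print(shift_for_buffer(1 << 20), clamp_received_shift(15))
```

   Because the sender may always use less than the signaled window,
   clamping an oversized shift count only makes the peer's window look
   smaller than it really is.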

2.4. Addressing Window Retraction

   When a non-zero scale factor is in use, there are instances when a
   retracted window can be offered -- see Appendix F for a detailed
   example.  The end of the window will be on a boundary based on the
   granularity of the scale factor being used.  If the sequence number
   is then updated by a number of bytes smaller than that granularity,
   the TCP will have to either advertise a new window that is beyond
   what it previously advertised (and perhaps beyond the buffer) or will
   have to advertise a smaller window, which will cause the TCP window
   to shrink.  Implementations MUST ensure that they handle a shrinking
   window, as specified in Section 4.2.2.16 of [RFC1122].

   For the receiver, this implies that:

   1)  The receiver MUST honor, as in window, any segment that would
       have been in window for any <ACK> sent by the receiver.

   2)  When window scaling is in effect, the receiver SHOULD track the
       actual maximum window sequence number (which is likely to be
       greater than the window announced by the most recent <ACK>, if
       more than one segment has arrived since the application consumed
       any data in the receive buffer).

   On the sender side:

   3)  The initial transmission MUST be within the window announced by
       the most recent <ACK>.

   4)  On first retransmission, or if the sequence number is out of
       window by less than 2^Rcv.Wind.Shift, then do normal
       retransmission(s) without regard to the receiver window as long
       as the original segment was in window when it was sent.

   5)  Subsequent retransmissions MAY only be sent if they are within
       the window announced by the most recent <ACK>.
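
   The retraction effect can be reproduced with small numbers.  The
   values below (a shift count of 3 and a 100-byte receive buffer) are
   hypothetical and far smaller than anything realistic; they are
   chosen only to make the rounding visible.  This sketch is separate
   from the example in Appendix F.

```python
def advertised_right_edge(rcv_nxt: int, rcv_wnd: int, shift: int) -> int:
    """Right window edge as the sender reconstructs it from SEG.WND."""
    seg_wnd = rcv_wnd >> shift            # value placed in the TCP header
    return rcv_nxt + (seg_wnd << shift)   # sender undoes the shift


SHIFT = 3          # granularity of 2^3 = 8 bytes
BUFFER_END = 100   # hypothetical buffer covering sequence numbers 0..99

edges = []
for rcv_nxt in (0, 4, 5):   # bytes received, none yet read by the application
    edge = advertised_right_edge(rcv_nxt, BUFFER_END - rcv_nxt, SHIFT)
    edges.append(edge)
    print(f"RCV.NXT={rcv_nxt}: advertised right edge {edge}")

# Item 2) above: the receiver tracks the highest edge it ever advertised.
max_advertised_edge = max(edges)
```

   In this run, the advertised right edge moves from 96 to 100 and then
   back to 93, so a receiver following item 2) would keep honoring
   segments up to sequence number 100.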

3. TCP Timestamps Option

3.1. Introduction

   The Timestamps option is introduced to address some of the issues
   mentioned in Sections 1.1 and 1.2.  The Timestamps option is
   specified in a symmetrical manner, so that Timestamp Value (TSval)
   timestamps are carried in both data and <ACK> segments and are echoed
   in Timestamp Echo Reply (TSecr) fields carried in returning <ACK> or
   data segments.  Originally used primarily for timestamping individual
   segments, the properties of the Timestamps option allow for taking
   time measurements (Section 4) as well as additional uses (Section 5).

   It is necessary to remember that there is a distinction between the
   Timestamps option conveying timestamp information and the use of that
   information.  In particular, the RTTM mechanism must be viewed
   independently from updating the Retransmission Timeout (RTO) (see
   Section 4.2).  In this case, the sample granularity also needs to be
   taken into account.  Other mechanisms, such as PAWS or Eifel, are not
   built upon the timestamp information itself but are based on the
   intrinsic property of monotonically non-decreasing values.

   The Timestamps option is important when large receive windows are
   used to allow the use of the PAWS mechanism (see Section 5).
   Furthermore, the option may be useful for all TCPs, since it
   simplifies the sender and allows the use of additional optimizations
   such as Eifel ([RFC3522], [RFC4015]) and others ([RFC6817],
   [Kuzmanovic03], [Kuehlewind10]).

3.2. Timestamps Option

   TCP is a symmetric protocol, allowing data to be sent at any time in
   either direction, and therefore timestamp echoing may occur in either
   direction.  For simplicity and symmetry, we specify that timestamps
   always be sent and echoed in both directions.  For efficiency, we
   combine the timestamp and timestamp reply fields into a single TCP
   Timestamps option.

   TCP Timestamps option (TSopt):

   Kind: 8

   Length: 10 bytes

          +-------+-------+---------------------+---------------------+
          |Kind=8 |  10   |   TS Value (TSval)  |TS Echo Reply (TSecr)|
          +-------+-------+---------------------+---------------------+
              1       1              4                     4

   The Timestamps option carries two four-byte timestamp fields.  The
   TSval field contains the current value of the timestamp clock of the
   TCP sending the option.

   The TSecr field is valid if the ACK bit is set in the TCP header.  If
   the ACK bit is not set in the outgoing TCP header, the sender of that
   segment SHOULD set the TSecr field to zero.  When the ACK bit is set
   in an outgoing segment, the sender MUST echo, in the TSecr field, a
   recently received TSval sent by the remote TCP in the TSval field of
   a Timestamps option.  The exact rules on which TSval MUST be echoed
   are given in Section 4.3.  When the ACK bit is not set, the receiver
   MUST ignore the value of the TSecr field.

   A TCP MAY send the TSopt in an initial <SYN> segment (i.e., segment
   containing a SYN bit and no ACK bit), and MAY send a TSopt in
   <SYN,ACK> only if it received a TSopt in the initial <SYN> segment
   for the connection.

   Once TSopt has been successfully negotiated, that is, both <SYN> and
   <SYN,ACK> contain TSopt, the TSopt MUST be sent in every non-<RST>
   segment for the duration of the connection, and SHOULD be sent in an
   <RST> segment (see Section 5.2 for details).  The TCP SHOULD remember
   this state by setting a flag, referred to as Snd.TS.OK, to one.  If a
   non-<RST> segment is received without a TSopt, a TCP SHOULD silently
   drop the segment.  A TCP MUST NOT abort a TCP connection because any
   segment lacks an expected TSopt.
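
   The ten-byte option layout and the TSecr rules tied to the ACK bit
   can be sketched as follows.  The function names are hypothetical,
   and returning None for an invalid TSecr is a convention of this
   sketch rather than anything this document prescribes.

```python
import struct
from typing import Optional, Tuple

TSOPT_KIND = 8    # Kind field of the Timestamps option
TSOPT_LEN = 10    # total option length in bytes


def encode_tsopt(tsval: int, tsecr: int, ack_bit_set: bool) -> bytes:
    """Build Kind, Length, TSval, TSecr in network byte order."""
    if not ack_bit_set:
        tsecr = 0   # TSecr SHOULD be zero when the ACK bit is not set
    return struct.pack("!BBII", TSOPT_KIND, TSOPT_LEN,
                       tsval & 0xFFFFFFFF, tsecr & 0xFFFFFFFF)


def decode_tsopt(option: bytes, ack_bit_set: bool) -> Tuple[int, Optional[int]]:
    """Return (TSval, TSecr); TSecr is None when the ACK bit is not set."""
    kind, length, tsval, tsecr = struct.unpack("!BBII", option)
    if kind != TSOPT_KIND or length != TSOPT_LEN:
        raise ValueError("not a Timestamps option")
    # Without the ACK bit, the receiver MUST ignore the TSecr field.
    return tsval, (tsecr if ack_bit_set else None)


option = encode_tsopt(tsval=1000, tsecr=2000, ack_bit_set=True)
print(option.hex(), decode_tsopt(option, ack_bit_set=True))
```

   The masks keep both values within 32 bits, so a timestamp clock that
   is tracked in a wider integer still encodes into the four-byte
   fields.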

   Implementations are strongly encouraged to follow the above rules for
   handling a missing Timestamps option and the order of precedence
   mentioned in Section 5.3 when deciding on the acceptance of a
   segment.

   If a receiver chooses to accept a segment without an expected
   Timestamps option, it must be clear that undetectable data corruption
   may occur.

   Such a TCP receiver may experience undetectable wrapped-sequence
   effects, such as data (payload) corruption or session stalls.  In
   order to maintain the integrity of the payload data, in particular on
   high-speed networks, it is paramount to follow the described
   processing rules.

   However, it has been mentioned that under some circumstances, the
   above guidelines are too strict, and some paths sporadically suppress
   the Timestamps option, while maintaining payload integrity.  A path
   behaving in this manner should be deemed unacceptable, but it has
   been noted that some implementations relax the acceptance rules as a
   workaround and allow TCP to run across such paths [RE-1323BIS].

   If a TSopt is received on a connection where TSopt was not negotiated
   in the initial three-way handshake, the TSopt MUST be ignored and the
   packet processed normally.

   In the case of crossing <SYN> segments where one <SYN> contains a
   TSopt and the other doesn't, both sides MAY send a TSopt in the
   <SYN,ACK> segment.

   TSopt is required for the two mechanisms described in Sections 4 and
   5.  There are also other mechanisms that rely on the presence of the
   TSopt, e.g., [RFC3522].  If a TCP stopped sending TSopt at any time
   during an established session, it interferes with these mechanisms.
   This update to [RFC1323] describes explicitly the previous assumption
   (see Section 5.2) that each TCP segment must have a TSopt, once
   negotiated.
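
   The rules in this section on the presence or absence of a TSopt can
   be summarized as a small decision function.  This is a sketch under
   assumptions: the names are hypothetical, the function looks only at
   whether the option is present, and it does not perform the PAWS
   check of Section 5 or any other acceptance test.

```python
from enum import Enum


class Action(Enum):
    PROCESS = "process normally"
    PROCESS_IGNORING_TSOPT = "process normally, ignoring the TSopt"
    DROP = "silently drop"


def tsopt_disposition(snd_ts_ok: bool, has_tsopt: bool, is_rst: bool) -> Action:
    """Decide how the presence of a TSopt affects a received segment."""
    if not snd_ts_ok:
        # TSopt was not negotiated: an unexpected option MUST be ignored.
        return Action.PROCESS_IGNORING_TSOPT if has_tsopt else Action.PROCESS
    if has_tsopt or is_rst:
        # The silent-drop rule applies only to non-<RST> segments.
        return Action.PROCESS
    # Negotiated, non-<RST>, option missing: SHOULD silently drop.
    return Action.DROP


print(tsopt_disposition(snd_ts_ok=True, has_tsopt=False, is_rst=False).value)
```

   No branch aborts the connection, in line with the requirement that a
   missing TSopt must never cause a TCP to abort.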

