Internet Engineering Task Force (IETF) D. Borman
Request for Comments: 7323 Quantum Corporation
Obsoletes: 1323 B. Braden
Category: Standards Track University of Southern California
ISSN: 2070-1721 V. Jacobson
R. Scheffenegger, Ed.
September 2014
TCP Extensions for High Performance
This document specifies a set of TCP extensions to improve
performance over paths with a large bandwidth * delay product and to
provide reliable operation over very high-speed paths. It defines
the TCP Window Scale (WS) option and the TCP Timestamps (TS) option
and their semantics. The Window Scale option is used to support
larger receive windows, while the Timestamps option can be used for
at least two distinct mechanisms, Protection Against Wrapped
Sequences (PAWS) and Round-Trip Time Measurement (RTTM), that are
also described herein.
This document obsoletes RFC 1323 and describes changes from it.
Status of This Memo
This is an Internet Standards Track document.
This document is a product of the Internet Engineering Task Force
(IETF). It represents the consensus of the IETF community. It has
received public review and has been approved for publication by the
Internet Engineering Steering Group (IESG). Further information on
Internet Standards is available in Section 2 of RFC 5741.
Information about the current status of this document, any errata,
and how to provide feedback on it may be obtained at
Copyright (c) 2014 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.
The TCP protocol [RFC0793] was designed to operate reliably over
almost any transmission medium regardless of transmission rate,
delay, corruption, duplication, or reordering of segments. Over the
years, advances in networking technology have resulted in ever-higher
transmission speeds, and the fastest paths are well beyond the domain
for which TCP was originally engineered.
This document defines a set of modest extensions to TCP to extend the
domain of its application to match the increasing network capability.
It is an update to and obsoletes [RFC1323], which in turn is based
upon and obsoletes [RFC1072] and [RFC1185].
Changes between [RFC1323] and this document are detailed in
Appendix H. These changes are partly due to errata in [RFC1323], and
partly due to the improved understanding of how the involved
For brevity, the full discussions of the merits and history behind
the TCP options defined within this document have been omitted.
[RFC1323] should be consulted for reference. It is recommended that
a modern TCP stack implement and make use of the extensions
described in this document.
1.1. TCP Performance
TCP performance problems arise when the bandwidth * delay product is
large. A network having such paths is referred to as a "long, fat
network" (LFN).
There are two fundamental performance problems with basic TCP over
LFN paths:
(1) Window Size Limit
The TCP header uses a 16-bit field to report the receive window
size to the sender. Therefore, the largest window that can be
used is 2^16 = 64 KiB. For LFN paths where the bandwidth *
delay product exceeds 64 KiB, the receive window limits the
maximum throughput of the TCP connection over the path, i.e.,
the amount of unacknowledged data that TCP can send in order to
keep the pipeline full.
To circumvent this problem, Section 2 of this memo defines a TCP
option, "Window Scale", to allow windows larger than 2^16. This
option defines an implicit scale factor, which is used to
multiply the window size value found in a TCP header to obtain
the true window size.
It must be noted that the use of large receive windows increases
the chance of too quickly wrapping sequence numbers, as
described below in Section 1.2, (1).
(2) Recovery from Losses
Packet losses in an LFN can have a catastrophic effect on
To generalize the Fast Retransmit / Fast Recovery mechanism to
handle multiple packets dropped per window, Selective
Acknowledgments are required. Unlike the normal cumulative
acknowledgments of TCP, Selective Acknowledgments give the
sender a complete picture of which segments are queued at the
receiver and which have not yet arrived.
Selective Acknowledgments and their use are specified in
separate documents, "TCP Selective Acknowledgment Options"
[RFC2018], "An Extension to the Selective Acknowledgement (SACK)
Option for TCP" [RFC2883], and "A Conservative Loss Recovery
Algorithm Based on Selective Acknowledgment (SACK) for TCP"
[RFC6675], and are not further discussed in this document.
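To make the window size limit of item (1) above concrete, the following is a minimal sketch of the throughput ceiling imposed by an unscaled window. It assumes that at most one full window can be in flight per round trip; the function name and the sample RTT values are illustrative and do not come from this document.

```python
# Illustrative sketch: throughput ceiling imposed by the receive window.
# Assumption: at most one window of unacknowledged data per round trip.

UNSCALED_MAX_WINDOW = 2**16 - 1  # largest value of the 16-bit window field


def max_throughput_bps(window_bytes: int, rtt_seconds: float) -> float:
    """Return the upper bound on throughput in bits per second."""
    return window_bytes * 8 / rtt_seconds


if __name__ == "__main__":
    # Sample round-trip times (hypothetical values chosen for illustration).
    for rtt_ms in (1, 10, 100):
        bps = max_throughput_bps(UNSCALED_MAX_WINDOW, rtt_ms / 1000)
        print(f"RTT {rtt_ms:>3} ms: at most {bps / 1e6:.2f} Mbit/s")
```

With a 100 ms round-trip time, the unscaled window caps a connection at roughly 5.24 Mbit/s regardless of the capacity of the path.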
1.2. TCP Reliability
An especially serious kind of error may result from an accidental
reuse of TCP sequence numbers in data segments. TCP reliability
depends upon the existence of a bound on the lifetime of a segment:
the "Maximum Segment Lifetime" or MSL.
Duplication of sequence numbers might happen in either of two ways:
(1) Sequence number wrap-around on the current connection
A TCP sequence number contains 32 bits. At a high enough
transfer rate of large volumes of data (at least 4 GiB in the
same session), the 32-bit sequence space may be "wrapped"
(cycled) within the time that a segment is delayed in queues.
(2) Earlier incarnation of the connection
Suppose that a connection terminates, either by a proper close
sequence or due to a host crash, and the same connection (i.e.,
using the same pair of port numbers) is immediately reopened. A
delayed segment from the terminated connection could fall within
the current window for the new incarnation and be accepted as
Duplicates from earlier incarnations, case (2), are avoided by
enforcing the current fixed MSL of the TCP specification, as
explained in Section 5.8 and Appendix B. In addition, the
randomizing of ephemeral ports can also help to probabilistically
reduce the chances of duplicates from earlier connections. However,
case (1), avoiding the reuse of sequence numbers within the same
connection, requires an upper bound on MSL that depends upon the
transfer rate, and at high enough rates, a dedicated mechanism is
A possible fix for the problem of cycling the sequence space would be
to increase the size of the TCP sequence number field. For example,
the sequence number field (and also the acknowledgment field) could
be expanded to 64 bits. This could be done either by changing the
TCP header or by means of an additional option.
Section 5 presents a different mechanism, which we call PAWS, to
extend TCP reliability to transfer rates well beyond the foreseeable
upper limit of network bandwidths. PAWS uses the TCP Timestamps
option defined in Section 3.2 to protect against old duplicates from
the same connection.
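As a rough illustration of the wrap-around concern in case (1), the sketch below estimates how long a sender needs to cycle through the entire 32-bit sequence space at a sustained rate. The function name and the sample rates are assumptions made for illustration only.

```python
# Illustrative sketch: time needed to cycle the 32-bit sequence space.
# Assumption: the sender transmits continuously at the given rate.

SEQ_SPACE_BYTES = 2**32  # one sequence number per byte of data


def seconds_to_wrap(rate_bps: float) -> float:
    """Return the time in seconds to send 2^32 bytes at rate_bps."""
    return SEQ_SPACE_BYTES * 8 / rate_bps


if __name__ == "__main__":
    # Sample link rates (hypothetical values chosen for illustration).
    for label, rate in (("100 Mbit/s", 100e6), ("1 Gbit/s", 1e9), ("10 Gbit/s", 10e9)):
        print(f"{label:>10}: sequence space cycles in {seconds_to_wrap(rate):.1f} s")
```

At 1 Gbit/s the sequence space cycles in about 34 seconds, which shows why a bound on segment lifetime alone stops being sufficient as rates increase.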
1.3. Using TCP options
The extensions defined in this document all use TCP options.
When [RFC1323] was published, there was concern that some buggy TCP
implementation might crash on the first appearance of an option on a
non-<SYN> segment. However, bugs like that can lead to denial-of-
service (DoS) attacks against a TCP. Research has shown that most
TCP implementations will properly handle unknown options on non-<SYN>
segments ([Medina04], [Medina05]). But it is still prudent to be
conservative in what you send, and avoiding buggy TCP implementations
is not the only reason for negotiating TCP options on <SYN> segments.
The Window Scale option negotiates fundamental parameters of the TCP
session. Therefore, it is only sent during the initial handshake.
Furthermore, the Window Scale option will be sent in a <SYN,ACK>
segment only if the corresponding option was received in the initial
The Timestamps option may appear in any data or <ACK> segment, adding
10 bytes (up to 12 bytes including padding) to the 20-byte TCP
header. It is required that this TCP option will be sent on all
non-<SYN> segments after an exchange of options on the <SYN> segments
has indicated that both sides understand this extension.
Research has shown that the use of the Timestamps option to take
additional RTT samples within each RTT has little effect on the
ultimate retransmission timeout value [Allman99]. However, there are
other uses of the Timestamps option, such as the Eifel mechanism
([RFC3522], [RFC4015]) and PAWS (see Section 5), which improve
overall TCP security and performance. The extra header bandwidth
used by this option should be evaluated for the gains in performance
and security in an actual deployment.
Appendix A contains a recommended layout of the options in TCP
headers to achieve reasonable data field alignment.
Finally, we observe that most of the mechanisms defined in this
document are important for LFNs and/or very high-speed networks. For
low-speed networks, it might be a performance optimization to NOT use
these mechanisms. A TCP vendor concerned about optimal performance
over low-speed paths might consider turning these extensions off for
low-speed paths, or allow a user or installation manager to disable
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in [RFC2119].
In this document, these words will appear with that interpretation
only when in UPPER CASE. Lower case uses of these words are not to
be interpreted as carrying [RFC2119] significance.
2. TCP Window Scale Option
The window scale extension expands the definition of the TCP window
to 30 bits and then uses an implicit scale factor to carry this
30-bit value in the 16-bit window field of the TCP header (SEG.WND in
[RFC0793]). The exponent of the scale factor is carried in a TCP
option, Window Scale. This option is sent only in a <SYN> segment (a
segment with the SYN bit on), hence the window scale is fixed in each
direction when a connection is opened.
The maximum receive window, and therefore the scale factor, is
determined by the maximum receive buffer space. In a typical modern
implementation, this maximum buffer space is set by default but can
be overridden by a user program before a TCP connection is opened.
This determines the scale factor, and therefore no new user interface
is needed for window scaling.
2.2. Window Scale Option
The three-byte Window Scale option MAY be sent in a <SYN> segment by
a TCP. It has two purposes: (1) indicate that the TCP is prepared to
both send and receive window scaling, and (2) communicate the
exponent of a scale factor to be applied to its receive window.
Thus, a TCP that is prepared to scale windows SHOULD send the option,
even if its own scale factor is 1 and the exponent 0. The scale
factor is limited to a power of two and encoded logarithmically, so
it may be implemented by binary shift operations. The maximum scale
exponent is limited to 14 for a maximum permissible receive window
size of 1 GiB (2^(14+16)).
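One way an implementation might derive the exponent from its receive buffer size is sketched below. The selection policy (the smallest shift that lets the buffer size fit in the 16-bit window field) and the function name are assumptions; this document only requires that the exponent not exceed 14.

```python
# Illustrative sketch: pick a window scale exponent for a receive buffer.
# Assumption: use the smallest shift that makes the buffer size fit in
# the 16-bit window field, capped at the maximum exponent of 14.

MAX_SHIFT = 14
MAX_UNSCALED_WINDOW = 0xFFFF


def choose_shift_count(recv_buffer_bytes: int) -> int:
    """Return a shift count in the range 0..14 for the given buffer size."""
    shift = 0
    while shift < MAX_SHIFT and (recv_buffer_bytes >> shift) > MAX_UNSCALED_WINDOW:
        shift += 1
    return shift
```

Using the smallest workable shift keeps the window granularity as fine as possible, since the low-order 'shift.cnt' bits of the true window are lost when it is placed in the header.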
TCP Window Scale option (WSopt):
Kind: 3
Length: 3 bytes
+---------+---------+---------+
| Kind=3  |Length=3 |shift.cnt|
+---------+---------+---------+
     1         1         1
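The three-byte layout above can be sketched as a pair of encode and decode helpers. The function names are hypothetical, and rejecting exponents above 14 on the sending side is a design choice of this sketch.

```python
# Illustrative sketch: encode and decode the three-byte Window Scale option.
# Function names are hypothetical and not part of any real TCP stack.

KIND_WINDOW_SCALE = 3
LENGTH_WINDOW_SCALE = 3
MAX_SHIFT = 14


def encode_wsopt(shift_cnt: int) -> bytes:
    """Build the option bytes: Kind=3, Length=3, shift.cnt."""
    if not 0 <= shift_cnt <= MAX_SHIFT:
        raise ValueError("shift.cnt must be in the range 0..14")
    return bytes([KIND_WINDOW_SCALE, LENGTH_WINDOW_SCALE, shift_cnt])


def decode_wsopt(option: bytes) -> int:
    """Return shift.cnt from a three-byte Window Scale option."""
    if len(option) != LENGTH_WINDOW_SCALE:
        raise ValueError("Window Scale option must be 3 bytes long")
    if option[0] != KIND_WINDOW_SCALE or option[1] != LENGTH_WINDOW_SCALE:
        raise ValueError("not a Window Scale option")
    return option[2]
```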
This option is an offer, not a promise; both sides MUST send Window
Scale options in their <SYN> segments to enable window scaling in
either direction. If window scaling is enabled, then the TCP that
sent this option will right-shift its true receive-window values by
'shift.cnt' bits for transmission in SEG.WND. The value 'shift.cnt'
MAY be zero (offering to scale, while applying a scale factor of 1 to
the receive window).
This option MAY be sent in an initial <SYN> segment (i.e., a segment
with the SYN bit on and the ACK bit off). If a Window Scale option
was received in the initial <SYN> segment, then this option MAY be
sent in the <SYN,ACK> segment. A Window Scale option in a segment
without a SYN bit MUST be ignored.
The window field in a segment where the SYN bit is set (i.e., a <SYN>
or <SYN,ACK>) MUST NOT be scaled.
2.3. Using the Window Scale Option
A model implementation of window scaling is as follows, using the
notation of [RFC0793]:
o The connection state is augmented by two window shift counters,
Snd.Wind.Shift and Rcv.Wind.Shift, to be applied to the incoming
and outgoing window fields, respectively.
o If a TCP receives a <SYN> segment containing a Window Scale
option, it SHOULD send its own Window Scale option in the
o The Window Scale option MUST be sent with shift.cnt = R, where R
is the value that the TCP would like to use for its receive
o Upon receiving a <SYN> segment with a Window Scale option
containing shift.cnt = S, a TCP MUST set Snd.Wind.Shift to S and
MUST set Rcv.Wind.Shift to R; otherwise, it MUST set both
Snd.Wind.Shift and Rcv.Wind.Shift to zero.
o The window field (SEG.WND) in the header of every incoming
segment, with the exception of <SYN> segments, MUST be left-
shifted by Snd.Wind.Shift bits before updating SND.WND:
SND.WND = SEG.WND << Snd.Wind.Shift
(assuming the other conditions of [RFC0793] are met, and using the
"C" notation "<<" for left-shift).
o The window field (SEG.WND) of every outgoing segment, with the
exception of <SYN> segments, MUST be right-shifted by
Rcv.Wind.Shift bits:
SEG.WND = RCV.WND >> Rcv.Wind.Shift
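The model implementation above can be sketched as a small state object. The class and method names are hypothetical; only the two shift counters and the two shift operations come from the list above.

```python
# Illustrative sketch of the window scaling model described above.
# Class and method names are hypothetical.

from dataclasses import dataclass
from typing import Optional


@dataclass
class WindowScaleState:
    snd_wind_shift: int = 0  # Snd.Wind.Shift: applied to incoming SEG.WND
    rcv_wind_shift: int = 0  # Rcv.Wind.Shift: applied to outgoing RCV.WND

    def on_syn_received(self, peer_shift: Optional[int], my_shift: int) -> None:
        """Fix both shifts once the peer's <SYN> has been seen.

        peer_shift is S from the peer's option, or None if the <SYN>
        carried no Window Scale option. my_shift is R, the value this
        TCP sent in its own option.
        """
        if peer_shift is not None:
            self.snd_wind_shift = peer_shift
            self.rcv_wind_shift = my_shift
        else:
            self.snd_wind_shift = 0
            self.rcv_wind_shift = 0

    def send_window(self, seg_wnd: int, syn: bool = False) -> int:
        """SND.WND = SEG.WND << Snd.Wind.Shift (unscaled on <SYN>)."""
        return seg_wnd if syn else seg_wnd << self.snd_wind_shift

    def window_field(self, rcv_wnd: int, syn: bool = False) -> int:
        """SEG.WND = RCV.WND >> Rcv.Wind.Shift (unscaled on <SYN>)."""
        return rcv_wnd if syn else rcv_wnd >> self.rcv_wind_shift
```

For example, after `on_syn_received(peer_shift=7, my_shift=3)`, an incoming window field of 1000 yields a send window of 128000 bytes, and a receive window of 80000 bytes is transmitted as 10000.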
TCP determines if a data segment is "old" or "new" by testing whether
its sequence number is within 2^31 bytes of the left edge of the
window, and if it is not, discarding the data as "old". To ensure
that new data is never mistakenly considered old and vice versa, the
left edge of the sender's window has to be at most 2^31 away from the
right edge of the receiver's window. The same is true of the
sender's right edge and receiver's left edge. Since the right and
left edges of either the sender's or receiver's window differ by the
window size, and since the sender and receiver windows can be out of
phase by at most the window size, the above constraints imply that
two times the maximum window size must be less than 2^31, or
max window < 2^30
Since the max window is 2^S (where S is the scaling shift count)
times at most 2^16 - 1 (the maximum unscaled window), the maximum
window is guaranteed to be < 2^30 if S <= 14. Thus, the shift count
MUST be limited to 14 (which allows windows of 2^30 = 1 GiB). If a
Window Scale option is received with a shift.cnt value larger than
14, the TCP SHOULD log the error but MUST use 14 instead of the
specified value. This is safe as a sender can always choose to only
partially use any signaled receive window. If the receiver is
scaling by a factor larger than 14 and the sender is only scaling by
14, then the receive window used by the sender will appear smaller
than it is in reality.
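The bound and the handling of an out-of-range shift.cnt can be sketched as follows. The function name and the use of the standard `logging` module are assumptions of this sketch.

```python
# Illustrative sketch: limit a received shift.cnt to 14 and check the
# bound on the maximum window. The function name is hypothetical.

import logging

MAX_SHIFT = 14
MAX_UNSCALED_WINDOW = 2**16 - 1


def effective_shift(received_shift_cnt: int) -> int:
    """Return the shift count to use for a received Window Scale option."""
    if received_shift_cnt > MAX_SHIFT:
        logging.warning(
            "received shift.cnt %d exceeds %d; using %d instead",
            received_shift_cnt, MAX_SHIFT, MAX_SHIFT,
        )
        return MAX_SHIFT
    return received_shift_cnt


# With S <= 14 the largest scaled window stays below 2^30.
assert MAX_UNSCALED_WINDOW << MAX_SHIFT < 2**30
```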
The scale factor applies only to the window field as transmitted in
the TCP header; each TCP using extended windows will maintain the
window values locally as 32-bit numbers. For example, the
"congestion window" computed by slow start and congestion avoidance
(see [RFC5681]) is not affected by the scale factor, so window
scaling will not introduce quantization into the congestion window.
2.4. Addressing Window Retraction
When a non-zero scale factor is in use, there are instances when a
retracted window can be offered -- see Appendix F for a detailed
example. The end of the window will be on a boundary based on the
granularity of the scale factor being used. If the sequence number
is then updated by a number of bytes smaller than that granularity,
the TCP will have to either advertise a new window that is beyond
what it previously advertised (and perhaps beyond the buffer) or will
have to advertise a smaller window, which will cause the TCP window
to shrink. Implementations MUST ensure that they handle a shrinking
window, as specified in Section 4.2.2.16 of [RFC1122].
For the receiver, this implies that:
1) The receiver MUST honor, as in window, any segment that would
have been in window for any <ACK> sent by the receiver.
2) When window scaling is in effect, the receiver SHOULD track the
actual maximum window sequence number (which is likely to be
greater than the window announced by the most recent <ACK>, if
more than one segment has arrived since the application consumed
any data in the receive buffer).
On the sender side:
3) The initial transmission MUST be within the window announced by
the most recent <ACK>.
4) On first retransmission, or if the sequence number is out of
window by less than 2^Rcv.Wind.Shift, then do normal
retransmission(s) without regard to the receiver window as long
as the original segment was in window when it was sent.
5) Subsequent retransmissions MAY only be sent if they are within
the window announced by the most recent <ACK>.
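The following sketch shows how a retracted window can arise from scale granularity and how a receiver might track the maximum window sequence number as suggested in item 2. The names and sample numbers are hypothetical, and sequence number wrap-around is ignored for simplicity.

```python
# Illustrative sketch: window retraction caused by scale granularity.
# Names are hypothetical; sequence number wrap-around is ignored.

from typing import Optional


def advertised_right_edge(rcv_nxt: int, rcv_wnd: int, shift: int) -> int:
    """Right edge of the window as the peer reconstructs it from SEG.WND."""
    seg_wnd = rcv_wnd >> shift          # value carried in the header
    return rcv_nxt + (seg_wnd << shift)  # value the peer computes


class ReceiverWindowTracker:
    """Remember the largest right edge ever advertised in an <ACK>."""

    def __init__(self, shift: int) -> None:
        self.shift = shift
        self.max_right_edge: Optional[int] = None

    def on_ack_sent(self, rcv_nxt: int, rcv_wnd: int) -> int:
        edge = advertised_right_edge(rcv_nxt, rcv_wnd, self.shift)
        if self.max_right_edge is None or edge > self.max_right_edge:
            self.max_right_edge = edge
        return edge

    def accepts(self, seq: int, rcv_nxt: int) -> bool:
        """Honor any sequence number that was in window for some <ACK>."""
        return self.max_right_edge is not None and rcv_nxt <= seq < self.max_right_edge
```

With a shift of 3 (granularity of 8 bytes), advertising 32 bytes at sequence number 1000 gives a right edge of 1032. If one byte then arrives and the application has not read it, the next <ACK> advertises 31 >> 3 = 3 units, or 24 bytes, from 1001, which gives a right edge of 1025. The window has retracted by 7 bytes, yet the tracker still accepts data up to 1032.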
3. TCP Timestamps Option
The Timestamps option is introduced to address some of the issues
mentioned in Sections 1.1 and 1.2. The Timestamps option is
specified in a symmetrical manner, so that Timestamp Value (TSval)
timestamps are carried in both data and <ACK> segments and are echoed
in Timestamp Echo Reply (TSecr) fields carried in returning <ACK> or
data segments. Originally used primarily for timestamping individual
segments, the properties of the Timestamps option allow for taking
time measurements (Section 4) as well as additional uses (Section 5).
It is necessary to remember that there is a distinction between the
Timestamps option conveying timestamp information and the use of that
information. In particular, the RTTM mechanism must be viewed
independently from updating the Retransmission Timeout (RTO) (see
Section 4.2). In this case, the sample granularity also needs to be
taken into account. Other mechanisms, such as PAWS or Eifel, are not
built upon the timestamp information itself but are based on the
intrinsic property of monotonically non-decreasing values.
The Timestamps option is important when large receive windows are
used to allow the use of the PAWS mechanism (see Section 5).
Furthermore, the option may be useful for all TCPs, since it
simplifies the sender and allows the use of additional optimizations
such as Eifel ([RFC3522], [RFC4015]) and others ([RFC6817],
3.2. Timestamps Option
TCP is a symmetric protocol, allowing data to be sent at any time in
either direction, and therefore timestamp echoing may occur in either
direction. For simplicity and symmetry, we specify that timestamps
always be sent and echoed in both directions. For efficiency, we
combine the timestamp and timestamp reply fields into a single TCP
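TCP Timestamps option (TSopt):
Kind: 8
Length: 10 bytes
+-------+-------+---------------------+---------------------+
|Kind=8 |  10   |   TS Value (TSval)  |TS Echo Reply (TSecr)|
+-------+-------+---------------------+---------------------+
    1       1              4                     4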
The Timestamps option carries two four-byte timestamp fields. The
TSval field contains the current value of the timestamp clock of the
TCP sending the option.
The TSecr field is valid if the ACK bit is set in the TCP header. If
the ACK bit is not set in the outgoing TCP header, the sender of that
segment SHOULD set the TSecr field to zero. When the ACK bit is set
in an outgoing segment, the sender MUST echo in the TSecr field a
recently received TSval, i.e., a value that the remote TCP sent in
the TSval field of a Timestamps option. The exact rules on which
TSval MUST be echoed are given in
Section 4.3. When the ACK bit is not set, the receiver MUST ignore
the value of the TSecr field.
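The option layout and the ACK-bit rules for TSecr can be sketched with Python's `struct` module. The function names are hypothetical; the field sizes and the zeroing and ignoring of TSecr follow the text above.

```python
# Illustrative sketch: encode and decode the ten-byte Timestamps option.
# Function names are hypothetical.

import struct
from typing import Optional, Tuple

KIND_TIMESTAMPS = 8
LENGTH_TIMESTAMPS = 10
_TSOPT = struct.Struct("!BBII")  # network byte order, no padding: 10 bytes


def encode_tsopt(tsval: int, tsecr: int, ack_bit_set: bool) -> bytes:
    """Build the option; TSecr is set to zero when the ACK bit is not set."""
    if not ack_bit_set:
        tsecr = 0
    return _TSOPT.pack(
        KIND_TIMESTAMPS, LENGTH_TIMESTAMPS, tsval & 0xFFFFFFFF, tsecr & 0xFFFFFFFF
    )


def decode_tsopt(option: bytes, ack_bit_set: bool) -> Tuple[int, Optional[int]]:
    """Return (TSval, TSecr); TSecr is None when it must be ignored."""
    if len(option) != LENGTH_TIMESTAMPS:
        raise ValueError("Timestamps option must be 10 bytes long")
    kind, length, tsval, tsecr = _TSOPT.unpack(option)
    if kind != KIND_TIMESTAMPS or length != LENGTH_TIMESTAMPS:
        raise ValueError("not a Timestamps option")
    return tsval, (tsecr if ack_bit_set else None)
```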
A TCP MAY send the TSopt in an initial <SYN> segment (i.e., segment
containing a SYN bit and no ACK bit), and MAY send a TSopt in
<SYN,ACK> only if it received a TSopt in the initial <SYN> segment
for the connection.
Once TSopt has been successfully negotiated, that is, both <SYN> and
<SYN,ACK> contain TSopt, the TSopt MUST be sent in every non-<RST>
segment for the duration of the connection, and SHOULD be sent in an
<RST> segment (see Section 5.2 for details). The TCP SHOULD remember
this state by setting a flag, referred to as Snd.TS.OK, to one. If a
non-<RST> segment is received without a TSopt, a TCP SHOULD silently
drop the segment. A TCP MUST NOT abort a TCP connection because any
segment lacks an expected TSopt.
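The negotiation and the handling of a missing TSopt described above can be sketched as a simple decision function. The enum, the function names, and the boolean parameters are hypothetical; the sketch follows the SHOULD-level recommendation to drop silently and never aborts the connection.

```python
# Illustrative sketch: Snd.TS.OK negotiation and the check for a missing
# TSopt on an incoming segment. All names are hypothetical.

from enum import Enum


class Action(Enum):
    PROCESS = "process"  # continue with normal segment processing
    DROP = "drop"        # silently discard; the connection stays open


def negotiate_snd_ts_ok(syn_has_tsopt: bool, synack_has_tsopt: bool) -> bool:
    """Snd.TS.OK is set only if both <SYN> and <SYN,ACK> carried a TSopt."""
    return syn_has_tsopt and synack_has_tsopt


def check_incoming_segment(snd_ts_ok: bool, is_rst: bool, has_tsopt: bool) -> Action:
    """Decide what to do with a segment based on the presence of TSopt."""
    if snd_ts_ok and not is_rst and not has_tsopt:
        return Action.DROP
    return Action.PROCESS
```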
Implementations are strongly encouraged to follow the above rules for
handling a missing Timestamps option and the order of precedence
mentioned in Section 5.3 when deciding on the acceptance of a
If a receiver chooses to accept a segment without an expected
Timestamps option, it must be clear that undetectable data corruption
Such a TCP receiver may experience undetectable wrapped-sequence
effects, such as data (payload) corruption or session stalls. In
order to maintain the integrity of the payload data, in particular on
high-speed networks, it is paramount to follow the described
However, it has been mentioned that under some circumstances, the
above guidelines are too strict, and some paths sporadically suppress
the Timestamps option, while maintaining payload integrity. A path
behaving in this manner should be deemed unacceptable, but it has
been noted that some implementations relax the acceptance rules as a
workaround and allow TCP to run across such paths [RE-1323BIS].
If a TSopt is received on a connection where TSopt was not negotiated
in the initial three-way handshake, the TSopt MUST be ignored and the
packet processed normally.
In the case of crossing <SYN> segments where one <SYN> contains a
TSopt and the other doesn't, both sides MAY send a TSopt in the
TSopt is required for the two mechanisms described in Sections 4 and
5. There are also other mechanisms that rely on the presence of the
TSopt, e.g., [RFC3522]. If a TCP stopped sending TSopt at any time
during an established session, it interferes with these mechanisms.
This update to [RFC1323] describes explicitly the previous assumption
(see Section 5.2) that each TCP segment must have a TSopt, once