Network Working Group S. Floyd Request for Comments: 2883 ACIRI Category: Standards Track J. Mahdavi Novell M. Mathis Pittsburgh Supercomputing Center M. Podolsky UC Berkeley July 2000 An Extension to the Selective Acknowledgement (SACK) Option for TCP Status of this Memo This document specifies an Internet standards track protocol for the Internet community, and requests discussion and suggestions for improvements. Please refer to the current edition of the "Internet Official Protocol Standards" (STD 1) for the standardization state and status of this protocol. Distribution of this memo is unlimited. Copyright Notice Copyright (C) The Internet Society (2000). All Rights Reserved.
AbstractThis note defines an extension of the Selective Acknowledgement (SACK) Option [RFC2018] for TCP. RFC 2018 specified the use of the SACK option for acknowledging out-of-sequence data not covered by TCP's cumulative acknowledgement field. This note extends RFC 2018 by specifying the use of the SACK option for acknowledging duplicate packets. This note suggests that when duplicate packets are received, the first block of the SACK option field can be used to report the sequence numbers of the packet that triggered the acknowledgement. This extension to the SACK option allows the TCP sender to infer the order of packets received at the receiver, allowing the sender to infer when it has unnecessarily retransmitted a packet. A TCP sender could then use this information for more robust operation in an environment of reordered packets [BPS99], ACK loss, packet replication, and/or early retransmit timeouts.
RFC 2018 is used by the TCP data receiver to acknowledge non-contiguous blocks of data not covered by the Cumulative Acknowledgement field. However, RFC 2018 does not specify the use of the SACK option when duplicate segments are received. This note specifies the use of the SACK option when acknowledging the receipt of a duplicate packet [F99]. We use the term D-SACK (for duplicate-SACK) to refer to a SACK block that reports a duplicate segment. This document does not make any changes to TCP's use of the cumulative acknowledgement field, or to the TCP receiver's decision of *when* to send an acknowledgement packet. This document only concerns the contents of the SACK option when an acknowledgement is sent. This extension is compatible with current implementations of the SACK option in TCP. That is, if one of the TCP end-nodes does not implement this D-SACK extension and the other TCP end-node does, we believe that this use of the D-SACK extension by one of the end nodes will not introduce problems. The use of D-SACK does not require separate negotiation between a TCP sender and receiver that have already negotiated SACK capability. The absence of separate negotiation for D-SACK means that the TCP receiver could send D-SACK blocks when the TCP sender does not understand this extension to SACK. In this case, the TCP sender will simply discard any D-SACK blocks, and process the other SACK blocks in the SACK option field as it normally would.
RFC 2018 is as follows: +--------+--------+ | Kind=5 | Length | +--------+--------+--------+--------+ | Left Edge of 1st Block | +--------+--------+--------+--------+ | Right Edge of 1st Block | +--------+--------+--------+--------+ | | / . . . / | | +--------+--------+--------+--------+ | Left Edge of nth Block | +--------+--------+--------+--------+ | Right Edge of nth Block | +--------+--------+--------+--------+ The Selective Acknowledgement (SACK) option in the TCP header contains a number of SACK blocks, where each block specifies the left and right edge of a block of data received at the TCP receiver. In particular, a block represents a contiguous sequence space of data received and queued at the receiver, where the "left edge" of the block is the first sequence number of the block, and the "right edge" is the sequence number immediately following the last sequence number of the block. RFC 2018 implies that the first SACK block specify the segment that triggered the acknowledgement. From RFC 2018, when the data receiver chooses to send a SACK option, "the first SACK block ... MUST specify the contiguous block of data containing the segment which triggered this ACK, unless that segment advanced the Acknowledgment Number field in the header." However, RFC 2018 does not address the use of the SACK option when acknowledging a duplicate segment. For example, RFC 2018 specifies that "each block represents received bytes of data that are contiguous and isolated". RFC 2018 further specifies that "if sent at all, SACK options SHOULD be included in all ACKs which do not ACK the highest sequence number in the data receiver's queue." RFC 2018 does not specify the use of the SACK option when a duplicate segment is received, and the cumulative acknowledgement field in the ACK acknowledges all of the data in the data receiver's queue.
RFC 2018. The guidelines for reporting duplicate segments are summarized below: (1) A D-SACK block is only used to report a duplicate contiguous sequence of data received by the receiver in the most recent packet. (2) Each duplicate contiguous sequence of data received is reported in at most one D-SACK block. (I.e., the receiver sends two identical D-SACK blocks in subsequent packets only if the receiver receives two duplicate segments.) (3) The left edge of the D-SACK block specifies the first sequence number of the duplicate contiguous sequence, and the right edge of the D-SACK block specifies the sequence number immediately following the last sequence in the duplicate contiguous sequence. (4) If the D-SACK block reports a duplicate contiguous sequence from a (possibly larger) block of data in the receiver's data queue above the cumulative acknowledgement, then the second SACK block in that SACK option should specify that (possibly larger) block of data. (5) Following the SACK blocks described above for reporting duplicate segments, additional SACK blocks can be used for reporting additional blocks of data, as specified in RFC 2018. Note that because each duplicate segment is reported in only one ACK packet, information about that duplicate segment will be lost if that ACK packet is dropped in the network.
RFC 2018. Because of several lost ACK packets, the sender then retransmits a data packet. The receiver receives the duplicate packet, and reports it in the first, D-SACK block: Transmitted Received ACK Sent Segment Segment (Including SACK Blocks) 3000-3499 3000-3499 3500 (ACK dropped) 3500-3999 3500-3999 4000 (ACK dropped) 4000-4499 (data packet dropped) 4500-4999 4500-4999 4000, SACK=4500-5000 (ACK dropped) 3000-3499 3000-3499 4000, SACK=3000-3500, 4500-5000 ---------
Section 4 above apply to reporting partial as well as full duplicate segments. This section gives examples of these guidelines when reporting partial duplicate segments. When the SACK option is used for reporting partial duplicate segments, the first D-SACK block reports the first duplicate sub- segment. If the data packet being acknowledged contains multiple partial duplicate sub-segments, then only the first such duplicate sub-segment is reported in the SACK option. We illustrate this with the examples below.
Transmitted Received ACK Sent Segment Segment (Including SACK Blocks) 500-999 500-999 1000 1000-1499 (delayed) 1500-1999 (data packet dropped) 2000-2499 2000-2499 1000, SACK=2000-2500 1000-2000 1000-1499 1500, SACK=2000-2500 1000-2000 2500, SACK=1000-1500 ---------
4.2.2. Example 5: Two non-contiguous duplicate subsegments covered by the cumulative acknowledgement.After the sender increases its packet size from 500 bytes to 1500 bytes, the receiver receives a packet containing two non-contiguous duplicate 500-byte subsegments which are less than the cumulative acknowledgement field. The receiver reports the first such duplicate segment in a single D-SACK block. Transmitted Received ACK Sent Segment Segment (Including SACK Blocks) 500-999 500-999 1000 1000-1499 (delayed) 1500-1999 (data packet dropped) 2000-2499 (delayed) 2500-2999 (data packet dropped) 3000-3499 3000-3499 1000, SACK=3000-3500 1000-2499 1000-1499 1500, SACK=3000-3500 2000-2499 1500, SACK=2000-2500, 3000-3500 1000-2499 2500, SACK=1000-1500, 3000-3500 ---------
Transmitted Received ACK Sent Segment Segment (Including SACK Blocks) 500-999 500-999 1000 1000-1499 (data packet dropped) 1500-1999 (delayed) 2000-2499 (data packet dropped) 2500-2999 (delayed) 3000-3499 (data packet dropped) 3500-3999 3500-3999 1000, SACK=3500-4000 1000-1499 (data packet dropped) 1500-2999 1500-1999 1000, SACK=1500-2000, 3500-4000 2000-2499 1000, SACK=2000-2500, 1500-2000, 3500-4000 1500-2999 1000, SACK=1500-2000, 1500-3000, --------- 3500-4000 RFC1323] specifies an algorithm for Protection Against Wrapped Sequence Numbers (PAWS). PAWS gives a method for distinguishing between sequence numbers for new data, and sequence numbers from a previous cycle through the sequence number space. Duplicate segments might be detected by PAWS as belonging to a previous cycle through the sequence number space. RFC 1323 specifies that for such packets, the receiver should do the following: Send an acknowledgement in reply as specified in RFC 793 page 69, and drop the segment. Since PAWS still requires sending an ACK, there is no harmful interaction between PAWS and the use of D-SACK. The D-SACK block can be included in the SACK option of the ACK, as outlined in Section 4, independently of the use of PAWS by the TCP receiver, and independently of the determination by PAWS of the validity or invalidity of the data segment. TCP senders receiving D-SACK blocks should be aware that a segment reported as a duplicate segment could possibly have been from a prior cycle through the sequence number space. This is independent of the use of PAWS by the TCP data receiver. We do not anticipate that this will present significant problems for senders using D-SACK information.
SCWA99]. In order for the sender to check that the first (D)SACK block of an acknowledgement in fact acknowledges duplicate data, the sender should compare the sequence space in the first SACK block to the cumulative ACK which is carried IN THE SAME PACKET. If the SACK sequence space is less than this cumulative ACK, it is an indication that the segment identified by the SACK block has been received more than once by the receiver. An implementation MUST NOT compare the sequence space in the SACK block to the TCP state variable snd.una (which carries the total cumulative ACK), as this may result in the wrong conclusion if ACK packets are reordered. If the sequence space in the first SACK block is greater than the cumulative ACK, then the sender next compares the sequence space in the first SACK block with the sequence space in the second SACK block, if there is one. This comparison can determine if the first SACK block is reporting duplicate data that lies above the cumulative ACK. TCP implementations which follow RFC 2581 [RFC2581] could see duplicate packets in each of the following four situations. This document does not specify what action a TCP implementation should take in these cases. The extension to the SACK option simply enables the sender to detect each of these cases. Note that these four conditions are not an exhaustive list of possible cases for duplicate packets, but are representative of the most common/likely cases. Subsequent documents will describe experimental proposals for sender responses to the detection of unnecessary retransmits due to reordering, lost ACKS, or early retransmit timeouts.
Transmitted Received ACK Sent Segment Segment (Including SACK Blocks) 500-999 500-999 1000 1000-1499 (delayed) 1500-1999 1500-1999 1000, SACK=1500-2000 2000-2499 2000-2499 1000, SACK=1500-2500 2500-2999 2500-2999 1000, SACK=1500-3000 1000-1499 1000-1499 3000 1000-1499 3000, SACK=1000-1500 --------- In this case, an ACK containing a SACK block which is lower than its ACK field and identical to a previously retransmitted segment is indicative of a significant reordering followed by a false (unnecessary) retransmission. WITHOUT D-SACK: With the use of D-SACK illustrated above, the sender knows that either the first transmission of segment 1000-1499 was delayed in the network, or the first transmission of segment 1000-1499 was dropped and the second transmission of segment 1000-1499 was duplicated. Given that no other segments have been duplicated in the network, this second option can be considered unlikely. Without the use of D-SACK, the sender would only know that either the first transmission of segment 1000-1499 was delayed in the network, or that either one of the data segments or the final ACK was duplicated in the network. Thus, the use of D-SACK allows the sender to more reliably infer that the first transmission of segment 1000-1499 was not dropped. [AP99], [L99], and [LK00] note that the sender could unambiguously detect an unnecessary retransmit with the use of the timestamp option. [LK00] proposes a timestamp-based algorithm that minimizes the penalty for an unnecessary retransmit. [AP99] proposes a heuristic for detecting an unnecessary retransmit in an environment with neither timestamps nor SACK. [L99] also proposes a two-bit field as an alternate to the timestamp option for unambiguously marking the first three retransmissions of a packet. A similar idea was proposed in [ISO8073]. RESEARCH ISSUES: The use of D-SACK allows the sender to detect some cases (e.g., when no ACK packets have been lost) when a a Fast Retransmit was due to packet reordering instead of packet loss. This allows the TCP sender
to adjust the duplicate acknowledgment threshold, to prevent such unnecessary Fast Retransmits in the future. Coupled with this, when the sender determines, after the fact, that it has made an unnecessary window reduction, the sender has the option of "undoing" that reduction in the congestion window by resetting ssthresh to the value of the old congestion window, and slow-starting until the congestion window has reached that point. Any proposal for "undoing" a reduction in the congestion window would have to address the possibility that the TCP receiver could be lying in its reports of received packets [SCWA99]. BPK97], this ability to distinguish between dropped data packets and dropped ACK packets would be particularly useful. In this case, the connection could implement congestion control for the return (ACK) path independently from the congestion control on the forward (data) path.
some data or ACK packet had been replicated in the network, or that an early retransmission timeout had occurred (or that the receiver is lying). RESEARCH ISSUES: After the sender determines that an unnecessary (i.e., early) retransmit timeout has occurred, the sender could adjust parameters for setting the RTO, to prevent more unnecessary retransmit timeouts. Coupled with this, when the sender determines, after the fact, that it has made an unnecessary window reduction, the sender has the option of "undoing" that reduction in the congestion window. [AP99] Mark Allman and Vern Paxson, On Estimating End-to-End Network Path Properties, SIGCOMM 99, August 1999. URL "http://www.acm.org/sigcomm/sigcomm99/papers/session7- 3.html". [BPS99] J.C.R. Bennett, C. Partridge, and N. Shectman, Packet Reordering is Not Pathological Network Behavior, IEEE/ACM Transactions on Networking, Vol. 7, No. 6, December 1999, pp. 789-798. [BPK97] Hari Balakrishnan, Venkata Padmanabhan, and Randy H. Katz, The Effects of Asymmetry on TCP Performance, Third ACM/IEEE Mobicom Conference, Budapest, Hungary, Sep 1997. URL "http://www.cs.berkeley.edu/~padmanab/ index.html#Publications". [F99] Floyd, S., Re: TCP and out-of-order delivery, Message ID <199902030027.QAA06775@owl.ee.lbl.gov> to the end-to-end- interest mailing list, February 1999. URL "http://www.aciri.org/floyd/notes/TCP_Feb99.email".
[ISO8073] ISO/IEC, Information-processing systems - Open Systems Interconnection - Connection Oriented Transport Protocol Specification, Internation Standard ISO/IEC 8073, December 1988. [L99] Reiner Ludwig, A Case for Flow Adaptive Wireless links, Technical Report UCB//CSD-99-1053, May 1999. URL "http://iceberg.cs.berkeley.edu/papers/Ludwig- FlowAdaptive/". [LK00] Reiner Ludwig and Randy H. Katz, The Eifel Algorithm: Making TCP Robust Against Spurious Retransmissions, SIGCOMM Computer Communication Review, V. 30, N. 1, January 2000. URL "http://www.acm.org/sigcomm/ccr/archive/ccr-toc/ccr- toc-2000.html". [RFC1323] Jacobson, V., Braden, R. and D. Borman, "TCP Extensions for High Performance", RFC 1323, May 1992. [RFC2018] Mathis, M., Mahdavi, J., Floyd, S. and A. Romanow, "TCP Selective Acknowledgement Options", RFC 2018, April 1996. [RFC2581] Allman, M., Paxson,V. and W. Stevens, "TCP Congestion Control", RFC 2581, April 1999. [SCWA99] Stefan Savage, Neal Cardwell, David Wetherall, Tom Anderson, TCP Congestion Control with a Misbehaving Receiver, ACM Computer Communications Review, pp. 71-78, V. 29, N. 5, October, 1999. URL "http://www.acm.org/sigcomm/ccr/archive/ccr-toc/ccr-toc- 99.html".
http://www.aciri.org/floyd/ Jamshid Mahdavi Novell Phone: 1-408-967-3806 EMail: email@example.com Matt Mathis Pittsburgh Supercomputing Center Phone: 412 268-3319 EMail: firstname.lastname@example.org URL: http://www.psc.edu/~mathis/ Matthew Podolsky UC Berkeley Electrical Engineering & Computer Science Dept. Phone: 510-649-8914 EMail: email@example.com URL: http://www.eecs.berkeley.edu/~podolsky
Full Copyright Statement Copyright (C) The Internet Society (2000). All Rights Reserved. This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works. However, this document itself may not be modified in any way, such as by removing the copyright notice or references to the Internet Society or other Internet organizations, except as needed for the purpose of developing Internet standards in which case the procedures for copyrights defined in the Internet Standards process must be followed, or as required to translate it into languages other than English. The limited permissions granted above are perpetual and will not be revoked by the Internet Society or its successors or assigns. This document and the information contained herein is provided on an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Acknowledgement Funding for the RFC Editor function is currently provided by the Internet Society.