
RFC 7323

TCP Extensions for High Performance

Pages: 49
Proposed Standard
Obsoletes:  1323
Part 3 of 3 – Pages 27 to 49

Top   ToC   RFC7323 - Page 27   prevText

6. Conclusions and Acknowledgments

   This memo presented a set of extensions to TCP to provide efficient
   operation over large bandwidth * delay product paths and reliable
   operation over very high-speed paths.  These extensions are designed
   to provide compatible interworking with TCP stacks that do not
   implement the extensions.

   These mechanisms are implemented using TCP options for scaled windows
   and timestamps.  The timestamps are used for two distinct mechanisms:
   RTTM and PAWS.

   The Window Scale option was originally suggested by Mike St. Johns of
   USAF/DCA.  The present form of the option was suggested by Mike
   Karels of UC Berkeley in response to a more cumbersome scheme defined
   by Van Jacobson.  Lixia Zhang helped formulate the PAWS mechanism
   description in [RFC1185].

   Finally, much of this work originated as the result of discussions
   within the End-to-End Task Force on the theoretical limitations of
   transport protocols in general and TCP in particular.  Task force
   members and others on the end2end-interest list have made valuable
   contributions by pointing out flaws in the algorithms and the
   documentation.  Continued discussion and development since the
   publication of [RFC1323] originally occurred in the IETF TCP Large
   Windows Working Group, later on in the End-to-End Task Force, and
   most recently in the IETF TCP Maintenance Working Group.  The authors
   are grateful for all these contributions.

7. Security Considerations

   The TCP sequence space is a fixed size, and as the window becomes
   larger, it becomes easier for an attacker to generate forged packets
   that can fall within the TCP window and be accepted as valid
   segments.  While use of timestamps and PAWS can help to mitigate
   this, when using PAWS, if an attacker is able to forge a packet that
   is acceptable to the TCP connection, a timestamp that is in the
   future would cause valid segments to be dropped due to PAWS checks.
   Hence, implementers should take care to not open the TCP window
   drastically beyond the requirements of the connection.

   See [RFC5961] for mitigation strategies to blind in-window attacks.

   A naive implementation that derives the timestamp clock value
   directly from a system uptime clock may unintentionally leak this
   information to an attacker.  This does not directly compromise any of
   the mechanisms described in this document.  However, this may be
   valuable information to a potential attacker.  It is therefore
   RECOMMENDED to generate a random, per-connection offset to be used
   with the clock source when generating the Timestamps option value
   (see Section 5.4).  By carefully choosing this random offset, further
   improvements as described in [RFC6191] are possible.
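
   The offset recommendation above can be illustrated with a minimal
   sketch.  The function names, the 1 ms tick, and the use of Python's
   secrets module are assumptions made for illustration; only the
   relation Snd.TSclock = my.TSclock + Snd.TSoffset comes from this
   document.  The sketch shows a plain random offset and does not
   attempt the further refinements of [RFC6191].

```python
import secrets
import time

MASK32 = 0xFFFFFFFF  # TSval is a 32-bit unsigned field


def my_tsclock(tick_seconds=0.001):
    """System-wide timestamp clock (my.TSclock); 1 ms tick is assumed."""
    return int(time.monotonic() / tick_seconds) & MASK32


def new_ts_offset():
    """Random Snd.TSoffset, chosen once when the connection is created."""
    return secrets.randbits(32)


def snd_tsclock(clock_value, ts_offset):
    """Snd.TSclock = my.TSclock + Snd.TSoffset, wrapped to 32 bits."""
    return (clock_value + ts_offset) & MASK32
```

   Because the sum wraps modulo 2^32, the peer sees a clock that ticks
   at the normal rate but whose absolute value no longer tracks the
   system clock it was derived from.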

   Expanding the TCP window beyond 64 KiB for IPv6 allows Jumbograms
   [RFC2675] to be used when the local network supports packets larger
   than 64 KiB.  When larger TCP segments are used, the TCP checksum
   becomes weaker.

   Mechanisms to protect the TCP header from modification should also
   protect the TCP options.

   Middleboxes and TCP options:

      Some middleboxes have been known to remove the TCP options
      described in this document from TCP segments [Honda11].
      Middleboxes that remove TCP options described in this document
      from the <SYN> segment interfere with the selection of parameters
      appropriate for the session.  Removing any of these options in a
      <SYN,ACK> segment will leave the end hosts in a state that
      destroys the proper operation of the protocol.

      *  If a Window Scale option is removed from a <SYN,ACK> segment,
         the end hosts will not negotiate the window scaling factor
         correctly.  Middleboxes must not remove or modify the Window
         Scale option from <SYN,ACK> segments.

      *  If a stateful firewall uses the window field to detect whether
         a received segment is inside the current window, and does not
         support the Window Scale option, it will not be able to
         correctly determine whether or not a packet is in the window.
         These middleboxes must also support the Window Scale option
         and apply the scale factor when processing segments.  If the
         window scale factor cannot be determined, they must not do
         window-based processing.

      *  If the Timestamps option is removed from the <SYN> or <SYN,ACK>
         segments, high speed connections that need PAWS would not have
         that protection.  Successful negotiation of the Timestamps
         option enforces a stricter verification of incoming segments at
         the receiver.  If the Timestamps option was removed from a
         subsequent data segment after a successful negotiation (e.g.,
         as part of resegmentation), the segment is discarded by the
         receiver without further processing.  Middleboxes should not
         remove the Timestamps option.

      *  It must be noted that [RFC1323] doesn't address the case of the
         Timestamps option being dropped or selectively omitted after
         being negotiated, and that the update in this document may
         cause some broken middlebox behavior to be detected
         (potentially unresponsive TCP sessions).
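
   As an illustration of the stateful-firewall item above, the following
   sketch shows one way a middlebox might track the negotiated shift
   counts and refuse window-based processing when they are unknown.  The
   class and method names are hypothetical.  The sketch assumes the
   rules from the Window Scale option specification earlier in this
   document: scaling is in effect only if both handshake segments
   carried the option, and a shift count above 14 is treated as 14.

```python
MAX_SHIFT = 14  # largest shift count permitted for the Window Scale option


class FlowScaleState:
    """Hypothetical per-flow record kept by a stateful firewall."""

    def __init__(self):
        self.syn_shift = None      # shift.cnt seen in the SYN, if any
        self.synack_shift = None   # shift.cnt seen in the SYN,ACK, if any
        self.handshake_seen = False

    def observe_handshake(self, syn_shift, synack_shift):
        """Record the shift counts (None when the option was absent)."""
        self.syn_shift = syn_shift
        self.synack_shift = synack_shift
        self.handshake_seen = True

    def true_window(self, seg_wnd, from_initiator):
        """Return the unscaled window, or None if it cannot be determined."""
        if not self.handshake_seen:
            return None  # scale factor unknown: no window-based processing
        if self.syn_shift is None or self.synack_shift is None:
            return seg_wnd  # scaling is off unless both sides sent the option
        # Each side's own shift count scales the windows that side advertises.
        shift = self.syn_shift if from_initiator else self.synack_shift
        return seg_wnd << min(shift, MAX_SHIFT)
```

   A caller would treat a None result as "do not apply window checks to
   this flow" rather than guessing a scale factor.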

   Implementations that depend on PAWS could provide a mechanism for the
   application to determine whether or not PAWS is in use on the
   connection and choose to terminate the connection if that protection
   doesn't exist.  This is not just to protect the connection against
   middleboxes that might remove the Timestamps option, but also against
   remote hosts that do not have Timestamp support.

7.1. Privacy Considerations

   The TCP options described in this document do not expose individual
   user's data.  However, a naive implementation simply using the system
   clock as a source for the Timestamps option will reveal
   characteristics of the TCP, potentially allowing more targeted
   attacks.  It is therefore RECOMMENDED to generate a random,
   per-connection offset to be used with the clock source when
   generating the Timestamps option value (see Section 5.4).
   Furthermore, the combination, relative ordering, and padding of the
   TCP options described in Sections 2.2 and 3.2 will reveal additional
   clues to allow the fingerprinting of the system.

8. IANA Considerations

   The described TCP options are well known from the superseded
   [RFC1323].  IANA has updated the "TCP Option Kind Numbers" table
   under "TCP Parameters" to list this document (RFC 7323) as the
   reference for "Window Scale" and "Timestamps".

9. References

9.1. Normative References

   [RFC0793]  Postel, J., "Transmission Control Protocol", STD 7, RFC
              793, September 1981.

   [RFC1191]  Mogul, J. and S. Deering, "Path MTU discovery", RFC 1191,
              November 1990.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

9.2. Informative References

   [Allman99] Allman, M. and V. Paxson, "On Estimating End-to-End
              Network Path Properties", Proceedings of the ACM SIGCOMM
              Technical Symposium, Cambridge, MA, September 1999,
              <http://aciri.org/mallman/papers/estimation-la.pdf>.

   [Floyd05]  Floyd, S., "Subject: Re: [tcpm] RFC 1323: Timestamps
              option", message to the TCPM mailing list, 26 January
              2007, <http://www.ietf.org/mail-archive/web/tcpm/current/
              msg02508.html>.

   [Garlick77]
              Garlick, L., Rom, R., and J. Postel, "Issues in Reliable
              Host-to-Host Protocols", Proceedings of the Second
              Berkeley Workshop on Distributed Data Management and
              Computer Networks, March 1977,
              <http://www.rfc-editor.org/ien/ien12.txt>.

   [Honda11]  Honda, M., Nishida, Y., Raiciu, C., Greenhalgh, A.,
              Handley, M., and H. Tokuda, "Is it Still Possible to
              Extend TCP?", Proceedings of the ACM Internet Measurement
              Conference (IMC) '11, November 2011.

   [Jacobson88a]
              Jacobson, V., "Congestion Avoidance and Control", SIGCOMM
              '88, Stanford, CA, August 1988,
              <http://ee.lbl.gov/papers/congavoid.pdf>.

   [Jacobson90a]
              Jacobson, V., "4BSD Header Prediction", ACM Computer
              Communication Review, April 1990.

   [Jacobson90c]
              Jacobson, V., "Subject: modified TCP congestion avoidance
              algorithm", message to the End2End-Interest mailing list,
              30 April 1990, <ftp://ftp.isi.edu/end2end/
              end2end-interest-1990.mail>.

   [Karn87]   Karn, P. and C. Partridge, "Estimating Round-Trip Times in
              Reliable Transport Protocols", Proceedings of SIGCOMM '87,
              August 1987.

   [Kuehlewind10]
              Kuehlewind, M. and B. Briscoe, "Chirping for Congestion
              Control - Implementation Feasibility", November 2010,
              <http://bobbriscoe.net/projects/netsvc_i-f/
              chirp_pfldnet10.pdf>.

   [Kuzmanovic03]
              Kuzmanovic, A. and E. Knightly, "TCP-LP: Low-Priority
              Service via End-Point Congestion Control", 2003,
              <www.cs.northwestern.edu/~akuzma/doc/TCP-LP-ToN.pdf>.

   [Ludwig00] Ludwig, R. and K. Sklower, "The Eifel Retransmission
              Timer", ACM SIGCOMM Computer Communication Review Volume
              30 Issue 3, July 2000,
              <http://ccr.sigcomm.org/archive/2000/july00/
              LudwigFinal.pdf>.

   [Martin03] Martin, D., "Subject: [Tsvwg] RFC 1323.bis", message to
              the TSVWG mailing list, 30 September 2003,
              <http://www.ietf.org/mail-archive/web/tsvwg/current/
              msg04435.html>.

   [Medina04] Medina, A., Allman, M., and S. Floyd, "Measuring
              Interactions Between Transport Protocols and Middleboxes",
              Proceedings of the ACM SIGCOMM/USENIX Internet Measurement
              Conference, October 2004,
              <http://www.icir.net/tbit/tbit-Aug2004.pdf>.

   [Medina05] Medina, A., Allman, M., and S. Floyd, "Measuring the
              Evolution of Transport Protocols in the Internet", ACM
              Computer Communication Review Volume 35, No. 2, April
              2005,
              <http://icir.net/floyd/papers/TCPevolution-Mar2005.pdf>.

   [RE-1323BIS]
              Oppermann, A., "Subject: Re: [tcpm] I-D Action: draft-
              ietf.tcpm-1323bis-13.txt", message to the TCPM mailing
              list, 01 June 2013, <http://www.ietf.org/
              mail-archive/web/tcpm/current/msg08001.html>.

   [RFC1072]  Jacobson, V. and R. Braden, "TCP extensions for long-delay
              paths", RFC 1072, October 1988.

   [RFC1122]  Braden, R., "Requirements for Internet Hosts -
              Communication Layers", STD 3, RFC 1122, October 1989.

   [RFC1185]  Jacobson, V., Braden, B., and L. Zhang, "TCP Extension for
              High-Speed Paths", RFC 1185, October 1990.

   [RFC1323]  Jacobson, V., Braden, B., and D. Borman, "TCP Extensions
              for High Performance", RFC 1323, May 1992.

   [RFC1981]  McCann, J., Deering, S., and J. Mogul, "Path MTU Discovery
              for IP version 6", RFC 1981, August 1996.

   [RFC2018]  Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP
              Selective Acknowledgment Options", RFC 2018, October 1996.

   [RFC2675]  Borman, D., Deering, S., and R. Hinden, "IPv6 Jumbograms",
              RFC 2675, August 1999.

   [RFC2883]  Floyd, S., Mahdavi, J., Mathis, M., and M. Podolsky, "An
              Extension to the Selective Acknowledgement (SACK) Option
              for TCP", RFC 2883, July 2000.

   [RFC3522]  Ludwig, R. and M. Meyer, "The Eifel Detection Algorithm
              for TCP", RFC 3522, April 2003.

   [RFC4015]  Ludwig, R. and A. Gurtov, "The Eifel Response Algorithm
              for TCP", RFC 4015, February 2005.

   [RFC4821]  Mathis, M. and J. Heffner, "Packetization Layer Path MTU
              Discovery", RFC 4821, March 2007.

   [RFC4963]  Heffner, J., Mathis, M., and B. Chandler, "IPv4 Reassembly
              Errors at High Data Rates", RFC 4963, July 2007.

   [RFC5681]  Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
              Control", RFC 5681, September 2009.

   [RFC5961]  Ramaiah, A., Stewart, R., and M. Dalal, "Improving TCP's
              Robustness to Blind In-Window Attacks", RFC 5961, August
              2010.

   [RFC6191]  Gont, F., "Reducing the TIME-WAIT State Using TCP
              Timestamps", BCP 159, RFC 6191, April 2011.

   [RFC6298]  Paxson, V., Allman, M., Chu, J., and M. Sargent,
              "Computing TCP's Retransmission Timer", RFC 6298, June
              2011.

   [RFC6528]  Gont, F. and S. Bellovin, "Defending against Sequence
              Number Attacks", RFC 6528, February 2012.

   [RFC6675]  Blanton, E., Allman, M., Wang, L., Jarvinen, I., Kojo, M.,
              and Y. Nishida, "A Conservative Loss Recovery Algorithm
              Based on Selective Acknowledgment (SACK) for TCP", RFC
              6675, August 2012.

   [RFC6691]  Borman, D., "TCP Options and Maximum Segment Size (MSS)",
              RFC 6691, July 2012.

   [RFC6817]  Shalunov, S., Hazel, G., Iyengar, J., and M. Kuehlewind,
              "Low Extra Delay Background Transport (LEDBAT)", RFC 6817,
              December 2012.

Appendix A. Implementation Suggestions

   TCP Option Layout

      The following layout is recommended for sending options on
      non-<SYN> segments to achieve maximum feasible alignment of 32-bit
      and 64-bit machines.

                   +--------+--------+--------+--------+
                   |   NOP  |  NOP   |  TSopt |   10   |
                   +--------+--------+--------+--------+
                   |          TSval timestamp          |
                   +--------+--------+--------+--------+
                   |          TSecr timestamp          |
                   +--------+--------+--------+--------+

   Interaction with the TCP Urgent Pointer

      The TCP Urgent Pointer, like the TCP window, is a 16-bit value.
      Some of the original discussion for the TCP Window Scale option
      included proposals to increase the Urgent Pointer to 32 bits.  As
      it turns out, this is unnecessary.  There are two observations
      that should be made:

      (1)  With IP version 4, the largest amount of TCP data that can be
           sent in a single packet is 65495 bytes (64 KiB - 1 - size of
           fixed IP and TCP headers).

      (2)  Updates to the Urgent Pointer while the user is in "urgent
           mode" are invisible to the user.

      This means that if the Urgent Pointer points beyond the end of the
      TCP data in the current segment, then the user will remain in
      urgent mode until the next TCP segment arrives.  That segment will
      update the Urgent Pointer to a new offset, and the user will never
      have left urgent mode.

      Thus, to properly implement the Urgent Pointer, the sending TCP
      only has to check for overflow of the 16-bit Urgent Pointer field
      before filling it in.  If it does overflow, then a value of 65535
      should be inserted into the Urgent Pointer.

      The same technique applies to IP version 6, except in the case of
      IPv6 Jumbograms.  When IPv6 Jumbograms are supported, [RFC2675]
      requires additional steps for dealing with the Urgent Pointer;
      these steps are described in Section 5.2 of [RFC2675].
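
   The two suggestions in this appendix can be sketched in a few lines.
   The function names are hypothetical.  The option kind values (1 for
   NOP, 8 for Timestamps) and the Timestamps length of 10 are the
   registered values.  The clamp covers the non-Jumbogram case only.

```python
import struct

TCPOPT_NOP = 1           # No-Operation option kind
TCPOPT_TIMESTAMP = 8     # Timestamps option kind
TCPOLEN_TIMESTAMP = 10   # Timestamps option length in bytes


def pack_timestamp_option(tsval, tsecr):
    """Build the 12-byte NOP, NOP, TSopt block in network byte order."""
    return struct.pack(
        "!BBBBII",
        TCPOPT_NOP, TCPOPT_NOP, TCPOPT_TIMESTAMP, TCPOLEN_TIMESTAMP,
        tsval & 0xFFFFFFFF, tsecr & 0xFFFFFFFF,
    )


def urgent_pointer_field(urgent_offset):
    """Clamp an urgent offset to the 16-bit Urgent Pointer field."""
    return min(urgent_offset, 65535)
```

   The two leading NOPs pad the 10-byte option to 12 bytes, so TSval and
   TSecr each start on a 4-byte boundary within the options area.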

Appendix B. Duplicates from Earlier Connection Incarnations

   There are two cases to be considered: (1) a system crashing (and
   losing connection state) and restarting, and (2) the same connection
   being closed and reopened without a loss of host state.  These will
   be described in the following two sections.

B.1. System Crash with Loss of State

   TCP's quiet time of one MSL upon system startup handles the loss of
   connection state in a system crash/restart.  For an explanation, see,
   for example, "Knowing When to Keep Quiet" in the TCP protocol
   specification [RFC0793].  The MSL that is required here does not
   depend upon the transfer speed.  The current TCP MSL of 2 minutes
   seemed acceptable as an operational compromise, when many host
   systems used to take this long to boot after a crash.  Current host
   systems can boot considerably faster.

   The Timestamps option may be used to ease the MSL requirements (or to
   provide additional security against data corruption).  If timestamps
   are being used and if the timestamp clock can be guaranteed to be
   monotonic over a system crash/restart, i.e., if the first value of
   the sender's timestamp clock after a crash/restart can be guaranteed
   to be greater than the last value before the restart, then a quiet
   time is unnecessary.

   To dispense totally with the quiet time would require that the host
   clock be synchronized to a time source that is stable over the crash/
   restart period, with an accuracy of one timestamp clock tick or
   better.  We can back off from this strict requirement to take
   advantage of approximate clock synchronization.  Suppose that the
   clock is always resynchronized to within N timestamp clock ticks and
   that booting (extended with a quiet time, if necessary) takes more
   than N ticks.  This will guarantee monotonicity of the timestamps,
   which can then be used to reject old duplicates even without an
   enforced MSL.

B.2. Closing and Reopening a Connection

   When a TCP connection is closed, a delay of 2*MSL in TIME-WAIT state
   ties up the socket pair for 4 minutes (see Section 3.5 of [RFC0793]).
   Applications built upon TCP that close one connection and open a new
   one (e.g., an FTP data transfer connection using Stream mode) must
   choose a new socket pair each time.  The TIME-WAIT delay serves two
   different purposes:

   (a)  Implement the full-duplex reliable close handshake of TCP.

        The proper time to delay the final close step is not really
        related to the MSL; it depends instead upon the RTO for the FIN
        segments and, therefore, upon the RTT of the path.  (It could be
        argued that the side that is sending a FIN knows what degree of
        reliability it needs, and therefore it should be able to
        determine the length of the TIME-WAIT delay for the FIN's
        recipient.  This could be accomplished with an appropriate TCP
        option in FIN segments.)

        Although there is no formal upper bound on RTT, common network
        engineering practice makes an RTT greater than 1 minute very
        unlikely.  Thus, the 4-minute delay in TIME-WAIT state works
        satisfactorily to provide a reliable full-duplex TCP close.
        Note again that this is independent of MSL enforcement and
        network speed.

        The TIME-WAIT state could cause an indirect performance problem
        if an application needed to repeatedly close one connection and
        open another at a very high frequency, since the number of
        available TCP ports on a host is less than 2^16.  However, high
        network speeds are not the major contributor to this problem;
        the RTT is the limiting factor in how quickly connections can be
        opened and closed.  Therefore, this problem will be no worse at
        high transfer speeds.

   (b)  Allow old duplicate segments to expire.

        To replace this function of TIME-WAIT state, a mechanism would
        have to operate across connections.  PAWS is defined strictly
        within a single connection; the last timestamp (TS.Recent) is
        kept in the connection control block and discarded when a
        connection is closed.

        An additional mechanism could be added to the TCP, a per-host
        cache of the last timestamp received from any connection.  This
        value could then be used in the PAWS mechanism to reject old
        duplicate segments from earlier incarnations of the connection,
        if the timestamp clock can be guaranteed to have ticked at least
        once since the old connection was open.  This would require that
        the TIME-WAIT delay plus the RTT together must be at least one
        tick of the sender's timestamp clock.  Such an extension is not
        part of the proposal of this RFC.

        Note that this is a variant on the mechanism proposed by
        Garlick, Rom, and Postel [Garlick77], which required each host
        to maintain connection records containing the highest sequence
        numbers on every connection.  Using timestamps instead, it is
        only necessary to keep one quantity per remote host, regardless
        of the number of simultaneous connections to that host.

Appendix C. Summary of Notation

   The following notation has been used in this document.

   Options

      WSopt:            TCP Window Scale option
      TSopt:            TCP Timestamps option

   Option Fields

      shift.cnt:        Window scale byte in WSopt
      TSval:            32-bit Timestamp Value field in TSopt
      TSecr:            32-bit Timestamp Reply field in TSopt

   Option Fields in Current Segment

      SEG.TSval:        TSval field from TSopt in current segment
      SEG.TSecr:        TSecr field from TSopt in current segment
      SEG.WSopt:        8-bit value in WSopt

   Clock Values

      my.TSclock:       System-wide source of 32-bit timestamp values
      my.TSclock.rate:  Period of my.TSclock (1 ms to 1 sec)
      Snd.TSoffset:     An offset for randomizing Snd.TSclock
      Snd.TSclock:      my.TSclock + Snd.TSoffset

   Per-Connection State Variables

      TS.Recent:        Latest received Timestamp
      Last.ACK.sent:    Last ACK field sent
      Snd.TS.OK:        1-bit flag
      Snd.WS.OK:        1-bit flag
      Rcv.Wind.Shift:   Receive window scale exponent
      Snd.Wind.Shift:   Send window scale exponent
      Start.Time:       Snd.TSclock value when the segment being timed
                        was sent (used by code from before RFC 1323).

   Procedure

      Update_SRTT(m)    Procedure to update the smoothed RTT and RTT
                        variance estimates, using the rules of
                        [Jacobson88a], given m, a new RTT measurement

   Send Sequence Variables

      SND.UNA:          Send unacknowledged
      SND.NXT:          Send next
      SND.WND:          Send window
      ISS:              Initial send sequence number

   Receive Sequence Variables

      RCV.NXT:          Receive next
      RCV.WND:          Receive window
      IRS:              Initial receive sequence number

Appendix D. Event Processing Summary

   This appendix attempts to specify the algorithms unambiguously by
   presenting modifications to the Event Processing rules in Section 3.9
   of RFC 793.  The change bars ("|") indicate lines that are different
   from RFC 793.

   OPEN Call

      ...

      An initial send sequence number (ISS) is selected.  Send a <SYN>
 |    segment of the form:
 |
 |            <SEQ=ISS><CTL=SYN><TSval=Snd.TSclock><WSopt=Rcv.Wind.Shift>

      ...

   SEND Call

      CLOSED STATE (i.e., TCB does not exist)

         ...

      LISTEN STATE

         If active and the foreign socket is specified, then change the
         connection from passive to active, select an ISS.  Send a SYN
 |       segment containing the options: <TSval=Snd.TSclock> and
 |       <WSopt=Rcv.Wind.Shift>.  Set SND.UNA to ISS, SND.NXT to ISS+1.
         Enter SYN-SENT state. ...

      SYN-SENT STATE
      SYN-RECEIVED STATE

         ...

      ESTABLISHED STATE
      CLOSE-WAIT STATE

         Segmentize the buffer and send it with a piggybacked
         acknowledgment (acknowledgment value = RCV.NXT).  ...

         If the urgent flag is set ...

 |       If the Snd.TS.OK flag is set, then include the TCP Timestamps
 |       option <TSval=Snd.TSclock,TSecr=TS.Recent> in each data
 |       segment.
 |
 |       Scale the receive window for transmission in the segment
 |       header:
 |
 |               SEG.WND = (RCV.WND >> Rcv.Wind.Shift).
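
   The right shift above, together with the matching left shift that the
   receiving side applies to SEG.WND, can be sketched as follows.  The
   function names are hypothetical, and clamping the result to the
   16-bit window field is an assumption of this sketch rather than a
   step spelled out in this appendix.

```python
def scale_for_send(rcv_wnd, rcv_wind_shift):
    """SEG.WND = RCV.WND >> Rcv.Wind.Shift, limited to the 16-bit field."""
    return min(rcv_wnd >> rcv_wind_shift, 0xFFFF)


def true_window(seg_wnd, snd_wind_shift):
    """TrueWindow = SEG.WND << Snd.Wind.Shift, as computed on receipt."""
    return seg_wnd << snd_wind_shift
```

   With a shift of 7, a 1300-byte receive window is sent as 10 and
   recovered by the peer as 1280: the low-order bits are lost, so the
   peer never sees a window larger than the one actually offered.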

   SEGMENT ARRIVES

      ...

      If the state is LISTEN then

         first check for an RST

            ...

         second check for an ACK

            ...

         third check for a SYN

            If the SYN bit is set, check the security.  If the ...

               ...

            If the SEG.PRC is less than the TCB.PRC then continue.

 |          Check for a Window Scale option (WSopt); if one is found,
 |          save SEG.WSopt in Snd.Wind.Shift and set Snd.WS.OK flag on.
 |          Otherwise, set both Snd.Wind.Shift and Rcv.Wind.Shift to
 |          zero and clear Snd.WS.OK flag.
 |
 |          Check for a TSopt option; if one is found, save SEG.TSval in
 |          the variable TS.Recent and turn on the Snd.TS.OK bit.

            Set RCV.NXT to SEG.SEQ+1, IRS is set to SEG.SEQ and any
            other control or text should be queued for processing later.
            ISS should be selected and a SYN segment sent of the form:

                    <SEQ=ISS><ACK=RCV.NXT><CTL=SYN,ACK>

 |           If the Snd.WS.OK bit is on, include a WSopt
 |           <WSopt=Rcv.Wind.Shift> in this segment.  If the Snd.TS.OK
 |           bit is on, include a TSopt <TSval=Snd.TSclock,
 |           TSecr=TS.Recent> in this segment.  Last.ACK.sent is set to
 |           RCV.NXT.

            SND.NXT is set to ISS+1 and SND.UNA to ISS.  The connection
            state should be changed to SYN-RECEIVED.  Note that any
            other incoming control or data (combined with SYN) will be
            processed in the SYN-RECEIVED state, but processing of SYN
            and ACK should not be repeated.  If the listen was not fully
            specified (i.e., the foreign socket was not fully
            specified), then the unspecified fields should be filled in
            now.

         fourth other text or control

            ...

      If the state is SYN-SENT then

         first check the ACK bit

            ...

         ...

         fourth check the SYN bit

            ...

            If the SYN bit is on and the security/compartment and
            precedence are acceptable then, RCV.NXT is set to SEG.SEQ+1,
            IRS is set to SEG.SEQ.  SND.UNA should be advanced to equal
            SEG.ACK (if there is an ACK), and any segments on the
            retransmission queue which are thereby acknowledged should
            be removed.

 |          Check for a Window Scale option (WSopt); if it is found,
 |          save SEG.WSopt in Snd.Wind.Shift; otherwise, set both
 |          Snd.Wind.Shift and Rcv.Wind.Shift to zero.
 |
 |          Check for a TSopt option; if one is found, save SEG.TSval in
 |          variable TS.Recent and turn on the Snd.TS.OK bit in the
 |          connection control block.  If the ACK bit is set, use
 |          Snd.TSclock - SEG.TSecr as the initial RTT estimate.

            If SND.UNA > ISS (our SYN has been ACKed), change the
            connection state to ESTABLISHED, form an <ACK> segment:

                    <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK>

 |          and send it.  If the Snd.TS.OK bit is on, include a TSopt
 |          option <TSval=Snd.TSclock,TSecr=TS.Recent> in this <ACK>
 |          segment.  Last.ACK.sent is set to RCV.NXT.

            Data or controls that were queued for transmission may be
            included.  If there are other controls or text in the
            segment, then continue processing at the sixth step below
            where the URG bit is checked; otherwise, return.

            Otherwise, enter SYN-RECEIVED, form a <SYN,ACK> segment:

                    <SEQ=ISS><ACK=RCV.NXT><CTL=SYN,ACK>

 |          and send it.  If the Snd.TS.OK bit is on, include a TSopt
 |          option <TSval=Snd.TSclock,TSecr=TS.Recent> in this segment.
 |          If the Snd.WS.OK bit is on, include a WSopt option
 |          <WSopt=Rcv.Wind.Shift> in this segment.  Last.ACK.sent is
 |          set to RCV.NXT.

            If there are other controls or text in the segment, queue
            them for processing after the ESTABLISHED state has been
            reached, return.

         fifth, if neither of the SYN or RST bits is set then drop the
         segment and return.

      Otherwise

      first check the sequence number

         SYN-RECEIVED STATE
         ESTABLISHED STATE
         FIN-WAIT-1 STATE
         FIN-WAIT-2 STATE
         CLOSE-WAIT STATE
         CLOSING STATE
         LAST-ACK STATE
         TIME-WAIT STATE

            Segments are processed in sequence.  Initial tests on
            arrival are used to discard old duplicates, but further
            processing is done in SEG.SEQ order.  If a segment's
            contents straddle the boundary between old and new, only the
            new parts should be processed.

 |          Rescale the received window field:
 |
 |                TrueWindow = SEG.WND << Snd.Wind.Shift,
 |
 |          and use "TrueWindow" in place of SEG.WND in the following
 |          steps.
 |
 |          Check whether the segment contains a Timestamps option and
 |          if bit Snd.TS.OK is on.  If so:
 |
 |             If SEG.TSval < TS.Recent and the RST bit is off:
 |
 |                If the connection has been idle more than 24 days,
 |                save SEG.TSval in variable TS.Recent, else the segment
 |                is not acceptable; follow the steps below for an
 |                unacceptable segment.
 |
 |             If SEG.TSval >= TS.Recent and SEG.SEQ <= Last.ACK.sent,
 |             then save SEG.TSval in variable TS.Recent.

            There are four cases for the acceptability test for an
            incoming segment:

               ...

            If an incoming segment is not acceptable, an acknowledgment
            should be sent in reply (unless the RST bit is set; if so
            drop the segment and return):

                    <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK>

 |          Last.ACK.sent is set to SEG.ACK of the acknowledgment.  If
 |          the Snd.TS.OK bit is on, include the Timestamps option
 |          <TSval=Snd.TSclock,TSecr=TS.Recent> in this <ACK> segment.
            Set Last.ACK.sent to SEG.ACK and send the <ACK> segment.
            After sending the acknowledgment, drop the unacceptable
            segment and return.
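
   The timestamp test in the sequence-number check above can be sketched
   as follows.  The function names and the return convention are
   hypothetical.  Timestamps and sequence numbers are compared modulo
   2^32, and "SEG.TSval >= TS.Recent" is taken here to mean "not less
   than".  A True result only means the segment passed the timestamp
   test; the ordinary acceptability test still follows.

```python
MASK32 = 0xFFFFFFFF
HALF = 0x80000000
IDLE_LIMIT_SECONDS = 24 * 24 * 60 * 60  # 24 days


def ts_lt(s, t):
    """True if 32-bit timestamp s is earlier than t, modulo 2**32."""
    return 0 < ((t - s) & MASK32) < HALF


def seq_leq(a, b):
    """True if sequence number a <= b, modulo 2**32."""
    return ((b - a) & MASK32) < HALF


def paws_check(seg_tsval, seg_seq, rst, ts_recent, last_ack_sent,
               idle_seconds):
    """Return (passes_timestamp_test, new_ts_recent)."""
    if ts_lt(seg_tsval, ts_recent) and not rst:
        if idle_seconds > IDLE_LIMIT_SECONDS:
            return True, seg_tsval   # TS.Recent is stale: refresh it
        return False, ts_recent      # old duplicate: not acceptable
    if not ts_lt(seg_tsval, ts_recent) and seq_leq(seg_seq, last_ack_sent):
        return True, seg_tsval       # segment covers Last.ACK.sent
    return True, ts_recent
```

   The modular comparison is what lets the test keep working after the
   32-bit timestamp clock wraps.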

      ...

      fifth check the ACK field,

         if the ACK bit is off drop the segment and return

         if the ACK bit is on

            ...

            ESTABLISHED STATE

               If SND.UNA < SEG.ACK <= SND.NXT then, set SND.UNA <-
 |             SEG.ACK.  Also compute a new estimate of round-trip time.
 |             If Snd.TS.OK bit is on, use Snd.TSclock - SEG.TSecr;
 |             otherwise, use the elapsed time since the first segment
 |             in the retransmission queue was sent.  Any segments on
               the retransmission queue that are thereby entirely
               acknowledged...

      ...

      seventh, process the segment text,

         ESTABLISHED STATE
         FIN-WAIT-1 STATE
         FIN-WAIT-2 STATE

            ...

            Send an acknowledgment of the form:

                    <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK>

 |          If the Snd.TS.OK bit is on, include the Timestamps option
 |          <TSval=Snd.TSclock,TSecr=TS.Recent> in this <ACK> segment.
 |          Set Last.ACK.sent to SEG.ACK of the acknowledgment, and send
 |          it.  This acknowledgment should be piggybacked on a segment
            being transmitted if possible without incurring undue delay.

            ...
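
   The round-trip measurement used in the ACK processing step of this
   appendix (Snd.TSclock - SEG.TSecr) can be sketched as follows.  The
   function names and the tick_seconds parameter are hypothetical.  The
   subtraction is done modulo 2^32 so that a wrap of the timestamp clock
   between sending and receiving does not produce a negative sample.

```python
MASK32 = 0xFFFFFFFF


def rtt_sample_ticks(snd_tsclock, seg_tsecr):
    """Snd.TSclock - SEG.TSecr in timestamp clock ticks, modulo 2**32."""
    return (snd_tsclock - seg_tsecr) & MASK32


def rtt_sample_seconds(snd_tsclock, seg_tsecr, tick_seconds):
    """Convert the tick difference to seconds for the RTO computation."""
    return rtt_sample_ticks(snd_tsclock, seg_tsecr) * tick_seconds
```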

Appendix E. Timestamps Edge Cases

   While the rules laid out for when to calculate RTTM produce the
   correct results most of the time, there are some edge cases where an
   incorrect RTTM can be calculated.  All of these situations involve
   the loss of segments.  It is felt that these scenarios are rare, and
   that if they should happen, they will cause a single RTTM measurement
   to be inflated, which mitigates its effects on RTO calculations.

   [Martin03] cites two similar cases when the returning <ACK> is lost,
   and before the retransmission timer fires, another returning <ACK>
   segment arrives, which acknowledges the data.  In this case, the RTTM
   calculated will be inflated:

      clock
       tc=1   <A, TSval=1> ------------------->

       tc=2   (lost) <---- <ACK(A), TSecr=1, win=n>
              (RTTM would have been 1)

              (receive window opens, window update is sent)
       tc=5          <---- <ACK(A), TSecr=1, win=m>
              (RTTM is calculated at 4)

   One thing to note about this situation is that it is somewhat bounded
   by RTO + RTT, limiting how far off the RTTM calculation will be.
   While more complex scenarios can be constructed that produce larger
   inflations (e.g., retransmissions are lost), those scenarios involve
   multiple segment losses, and the connection will have other more
   serious operational problems than using an inflated RTTM in the RTO
   calculation.

Appendix F. Window Retraction Example

   Consider an established TCP connection using a scale factor of 128,
   Snd.Wind.Shift=7 and Rcv.Wind.Shift=7, that is running with a very
   small window because the receiver is bottlenecked and both ends are
   doing small reads and writes.

   Consider the ACKs coming back:

   SEG.ACK  SEG.WIN  computed SND.WIN  receiver's actual window
   1000     2       1256               1300

   The sender writes 40 bytes and receiver ACKs:

   1040     2       1296               1300

   The sender writes 5 additional bytes and the receiver has a problem.
   Two choices:

   1045     2       1301               1300   - BEYOND BUFFER

   1045     1       1173               1300   - RETRACTED WINDOW

   This is a general problem and can happen any time the sender does a
   write that is smaller than the window scale factor.

   In most stacks, it is at least partially obscured when the window
   size is larger than some small number of segments because the stacks
   prefer to announce windows that are an integral number of segments,
   rounded up to the next scale factor.  This plus silly window
   suppression tends to cause less frequent, larger window updates.  If
   the window was rounded down to a segment size, there is more
   opportunity to advance the window, the BEYOND BUFFER case above,
   rather than retracting it.
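
   The numbers in this example can be checked with a short sketch.  It
   assumes that the "computed SND.WIN" column is the right window edge
   as seen by the sender, SEG.ACK + (SEG.WIN << 7), which matches every
   row of the example.  The function names are hypothetical.

```python
RCV_WIND_SHIFT = 7  # Rcv.Wind.Shift from the example; scale factor 2**7 = 128


def advertised_edge(seg_ack, seg_wnd, shift=RCV_WIND_SHIFT):
    """Right window edge the sender computes from an <ACK>."""
    return seg_ack + (seg_wnd << shift)


def receiver_choices(seg_ack, buffer_edge, shift=RCV_WIND_SHIFT):
    """Return the (SEG.WIN, computed edge) pairs for rounding up and rounding down."""
    free = buffer_edge - seg_ack          # bytes the receiver can really accept
    rounded_down = free >> shift
    exact = (rounded_down << shift) == free
    rounded_up = rounded_down if exact else rounded_down + 1
    return (
        (rounded_up, advertised_edge(seg_ack, rounded_up, shift)),
        (rounded_down, advertised_edge(seg_ack, rounded_down, shift)),
    )


previous_edge = advertised_edge(1040, 2)  # edge announced by the previous <ACK>
(up_wnd, up_edge), (down_wnd, down_edge) = receiver_choices(1045, 1300)

beyond_buffer = up_edge > 1300            # rounding up overshoots the real buffer
retracted = down_edge < previous_edge     # rounding down pulls the edge back
```

   With a 5-byte write, neither rounding direction gives an edge that
   both stays within the buffer and does not move backward.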

Appendix G. RTO Calculation Modification

   Taking multiple RTT samples per window would shorten the history
   calculated by the RTO mechanism in [RFC6298], and the algorithm below
   aims to maintain a similar history as originally intended by
   [RFC6298].

   It is roughly known how many samples a congestion window worth of
   data will yield, not accounting for ACK compression and ACK losses.
   Such events will result in more history of the path being reflected
   in the final value for RTO, and are not critical.  This modification
   will ensure that a similar amount of time is taken into account for
   the RTO estimation, regardless of how many samples are taken per
   window:

      ExpectedSamples = ceiling(FlightSize / (SMSS * 2))

      alpha' = alpha / ExpectedSamples

      beta' = beta / ExpectedSamples

   Note that the factor 2 in ExpectedSamples is due to "Delayed ACKs".

   In the algorithm of [RFC6298], use alpha' and beta' in place of
   alpha and beta:

      RTTVAR <- (1 - beta') * RTTVAR + beta' * |SRTT - R'|

      SRTT <- (1 - alpha') * SRTT + alpha' * R'

      (for each sample R')
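
   As an illustration, the modified update might be sketched as
   follows.  The function names are hypothetical, alpha = 1/8 and
   beta = 1/4 are the gains suggested in [RFC6298], and clamping
   ExpectedSamples to at least 1 is an assumption added here so that an
   empty FlightSize does not cause a division by zero.

```python
import math

ALPHA = 1 / 8  # gain for SRTT suggested in [RFC6298]
BETA = 1 / 4   # gain for RTTVAR suggested in [RFC6298]


def scaled_gains(flight_size, smss, alpha=ALPHA, beta=BETA):
    """Return (alpha', beta') for the expected number of samples per window."""
    # The factor 2 reflects delayed ACKs; max(1, ...) is an added guard.
    expected_samples = max(1, math.ceil(flight_size / (smss * 2)))
    return alpha / expected_samples, beta / expected_samples


def update_rtt(srtt, rttvar, r, alpha_p, beta_p):
    """Apply one RTT sample r, updating RTTVAR before SRTT."""
    rttvar = (1 - beta_p) * rttvar + beta_p * abs(srtt - r)
    srtt = (1 - alpha_p) * srtt + alpha_p * r
    return srtt, rttvar


# Ten full-sized segments in flight give five expected samples per window.
alpha_p, beta_p = scaled_gains(flight_size=14600, smss=1460)
srtt, rttvar = update_rtt(srtt=100.0, rttvar=10.0, r=120.0,
                          alpha_p=alpha_p, beta_p=beta_p)
```

   Each individual sample then moves SRTT and RTTVAR by a fraction of
   what a single per-window sample would, so the history covered stays
   roughly the same.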

Appendix H. Changes from RFC 1323

   Several important updates and clarifications to the specification in
   RFC 1323 are made in this document.  The technical changes are
   summarized below:

   (a)  A wrong reference to SND.WND was corrected to SEG.WND in
        Section 2.3.

   (b)  Section 2.4 was added describing the unavoidable window
        retraction issue and explicitly describing the mitigation steps
        necessary.

   (c)  In Section 3.2, the wording how the Timestamps option
        negotiation is to be performed was updated with RFC2119 wording.
        Further, a number of paragraphs were added to clarify the
        expected behavior with a compliant implementation using TSopt,
        as RFC 1323 left room for interpretation -- e.g., potential late
        enablement of TSopt.

   (d)  The description of which TSecr values can be used to update the
        measured RTT has been clarified.  Specifically, with timestamps,
        the Karn algorithm [Karn87] is disabled.  The Karn algorithm
        disables all RTT measurements during retransmission, since it is
        ambiguous whether the <ACK> is for the original segment, or the
        retransmitted segment.  With timestamps, that ambiguity is
        removed since the TSecr in the <ACK> will contain the TSval from
        whichever data segment made it to the destination.

   (e)  RTTM update processing explicitly excludes segments not updating
        SND.UNA.  The original text could be interpreted to allow taking
        RTT samples when SACK acknowledges some new, non-continuous
        data.

   (f)  In RFC 1323, Section 3.4, step (2) of the algorithm to control
        which timestamp is echoed was incorrect in two regards:

        (1)  It failed to update TS.Recent for a retransmitted segment
             that resulted from a lost <ACK>.

        (2)  It failed if SEG.LEN = 0.

        In the new algorithm, the case of SEG.TSval >= TS.Recent is
        included for consistency with the PAWS test.

   (g)  It is now recommended that the Timestamps option be included in
        <RST> segments if the incoming segment contained a Timestamps
        option.

   (h)  <RST> segments are explicitly excluded from PAWS processing.

   (i)  Added text to clarify the precedence between regular TCP
        [RFC0793] and this document's Timestamps option / PAWS
        processing.  Discussion about combined acceptability checks is
        ongoing.

   (j)  Snd.TSoffset and Snd.TSclock variables have been added.
        Snd.TSclock is the sum of my.TSclock and Snd.TSoffset.  This
        allows the starting points for timestamp values to be randomized
        on a per-connection basis.  Setting Snd.TSoffset to zero yields
        the same results as [RFC1323].  Text was added to guide
        implementers to the proper selection of these offsets, as
        entirely random offsets for each new connection will conflict
        with PAWS.

   (k)  Appendix A has been expanded with information about the TCP
        Urgent Pointer.  An earlier revision contained text around the
        TCP MSS option, which was split off into [RFC6691].

   (l)  One correction was made to the Event Processing Summary in
        Appendix D.  In SEND CALL/ESTABLISHED STATE, RCV.WND is used to
        fill in the SEG.WND value, not SND.WND.

   (m)  Appendix G was added to exemplify how an RTO calculation might
        be updated to properly take the much higher RTT sampling
        frequency enabled by the Timestamps option into account.

   Editorial changes to the document that do not affect the
   implementation or function of the mechanisms described in this
   document include:

   (a)  Removed much of the discussion in Section 1 to streamline the
        document.  However, detailed examples and discussions in
        Sections 2, 3, and 5 are kept as guidelines for implementers.

   (b)  Added short text noting that the use of WS increases the chances
        of sequence number wrap, so the PAWS mechanism is required in
        certain environments.

   (c)  Removed references to "new" options, as the options were
        introduced in [RFC1323] already.  Changed the text in
        Section 1.3 to specifically address TS and WS options.

   (d)  Section 1.4 was added for [RFC2119] wording.  Normative text was
        updated with the appropriate phrases.

   (e)  Added < > brackets to mark specific types of segments, and
        replaced most occurrences of "packet" with "segment", where TCP
        segments are referred to.

   (f)  Updated the text in Section 3 to take into account what has been
        learned since [RFC1323].

   (g)  Removed some unused references.

   (h)  Removed the list of changes between [RFC1323] and prior
        versions.  These changes are mentioned in Appendix C of
        [RFC1323].

   (i)  Moved "Changes from RFC 1323" to the end of the appendices for
        easier lookup.  In addition, the entries were split into a
        technical and an editorial part, and sorted to roughly
        correspond with the sections in the text where they apply.

Authors' Addresses

   David Borman
   Quantum Corporation
   Mendota Heights, MN  55120
   USA

   EMail: david.borman@quantum.com


   Bob Braden
   University of Southern California
   4676 Admiralty Way
   Marina del Rey, CA  90292
   USA

   EMail: braden@isi.edu


   Van Jacobson
   Google, Inc.
   1600 Amphitheatre Parkway
   Mountain View, CA  94043
   USA

   EMail: vanj@google.com


   Richard Scheffenegger (editor)
   NetApp, Inc.
   Am Euro Platz 2
   Vienna,   1120
   Austria

   EMail: rs@netapp.com