between the transport and network layers. Network equipment vendors
need to be assured that any new interface is stable enough (on decade
timescales) to build into firmware and hardware, and operating-system
vendors will not use a new interface unless it is likely to be widely
deployed.
Network components can be involved in congestion control in two ways:
first, they can implicitly optimize their functions, such as queue
management and scheduling strategies, in order to support the
operation of end-to-end congestion control. Second, network
components can participate in congestion control via explicit
signaling mechanisms. Explicit signaling mechanisms, whether in-band
or out-of-band, require a communication between network components
and end-systems. Signals realized within or over the IP layer are
only meaningful to network components that process IP packets. This
always includes routers and potentially also middleboxes, but not
pure link layer devices. The following section distinguishes clearly
between the term "network component" and the term "router"; the term
"router" is used whenever the processing of IP packets is explicitly
required. One fundamental challenge of network-supported congestion
control is that typically not all network components along a path are
routers (cf. Section 3.1.3).
The first (optimizing) category of implicit mechanisms can be
implemented in any network component that processes and stores
packets. Various approaches have been proposed and also deployed,
such as different AQM techniques. Even though these implicit
techniques are known to improve network performance during congestion
phases, they are still only partly deployed in the Internet. This
may be due to the fact that finding optimal and robust
parameterizations for these mechanisms is a non-trivial problem.
Indeed, the problem with various AQM schemes is the difficulty in
identifying correct values of the parameters that affect the
performance of the queuing scheme (due to variation in the number of
sources, the capacity, and the feedback delay) [Firoiu00] [Hollot01]
[Zhang03]. Many AQM schemes (RED, REM, BLUE, and PI-Controller, but
also Adaptive Virtual Queue (AVQ)) do not define a systematic rule
for setting their parameters.
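To make the parameterization problem concrete, the following sketch
shows the classic RED marking rule (a minimal model in Python; the
threshold, probability, and weight values are purely illustrative, and
refinements such as RED's count-based probability adjustment are
omitted):

   import random

   MIN_TH, MAX_TH = 5.0, 15.0   # queue thresholds (packets)
   MAX_P = 0.02                 # maximum marking probability
   W_Q = 0.002                  # EWMA weight for average queue size

   avg_q = 0.0

   def on_packet_arrival(instant_queue_len):
       """Return True if the arriving packet should be marked/dropped."""
       global avg_q
       # Exponentially weighted moving average of the queue size.
       avg_q = (1 - W_Q) * avg_q + W_Q * instant_queue_len
       if avg_q < MIN_TH:
           return False                  # no congestion indication
       if avg_q >= MAX_TH:
           return True                   # mark/drop every packet
       # Linear ramp between the two thresholds.
       p = MAX_P * (avg_q - MIN_TH) / (MAX_TH - MIN_TH)
       return random.random() < p

How the scheme behaves hinges entirely on MIN_TH, MAX_TH, MAX_P, and
W_Q, and no systematic rule for choosing them exists -- which is
precisely the deployment obstacle described above.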
The second class of approaches uses explicit signaling. By using
explicit feedback from the network, connection endpoints can obtain
more accurate information about the current network characteristics
on the path. This allows endpoints to make more precise decisions
that can better control congestion.
Explicit feedback techniques fall into three broad categories:
- Explicit congestion feedback: one-bit Explicit Congestion
Notification (ECN) [RFC3168] or proposals for more than one bit
- Explicit per-datagram rate feedback: the eXplicit Control Protocol
(XCP) [Katabi02] [Falk07], or the Rate Control Protocol (RCP)
- Explicit rate feedback: by means of in-band signaling, such as by
Quick-Start [RFC4782], or by means of out-of-band signaling, e.g.,
Congestion Avoidance with Distributed Proportional
Control/Performance Transparency Protocol (CADPC/PTP) [Welzl03].
Explicit router feedback can address some of the inherent
shortcomings of TCP. For instance, XCP was developed to overcome the
inefficiency and instability that TCP suffers from when the per-flow
bandwidth-delay product increases. By decoupling resource
utilization/congestion control from fairness control, XCP achieves
equal bandwidth allocation, high utilization, a small standing queue
size, and near-zero packet drops, with both steady and highly varying
traffic. Importantly, XCP does not maintain any per-flow state in
routers and requires few CPU cycles per packet, hence making it
potentially applicable in high-speed routers. However, XCP is still
subject to research: as [Andrew05] has pointed out, XCP is locally
stable but globally unstable when the maximum RTT of a flow is much
larger than the mean RTT. This instability can be removed by
changing the update strategy for the estimation interval, but this
makes the system vulnerable to erroneous RTT advertisements. The
authors of [Pap02] have shown that when flows with different RTTs are
applied, XCP sometimes discriminates among heterogeneous traffic
flows, even if XCP generally equalizes rates among different flows.
[Low05] provides a complete characterization of the XCP equilibrium
properties.
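The decoupling can be made concrete with a sketch of XCP's efficiency
controller (our own simplification following [Katabi02]; the constants
are the stability values proposed there):

   # Simplified XCP efficiency controller, run once per control
   # interval d (the average RTT), following [Katabi02].
   ALPHA, BETA = 0.4, 0.226

   def aggregate_feedback(capacity_bps, input_rate_bps, queue_bits, d):
       """Desired change in aggregate traffic (bits) over interval d."""
       spare_bps = capacity_bps - input_rate_bps
       # Fill spare capacity and drain the persistent queue.
       return ALPHA * d * spare_bps - BETA * queue_bits

   # A separate fairness controller then apportions this aggregate
   # among packets: positive feedback is shared equally per flow,
   # negative feedback in proportion to each flow's rate -- all
   # without per-flow state in the router.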
Several other explicit router feedback schemes have been developed
with different design objectives. For instance, RCP uses per-packet
feedback similar to XCP. But unlike XCP, RCP focuses on the
reduction of flow completion times [Dukki06], taking an optimistic
approach to flows likely to arrive in the next RTT and tolerating
larger instantaneous queue sizes [Dukki05]. XCP, on the other hand,
gives very poor flow completion times for short flows.
Both implicit and explicit router support should be considered in the
context of the end-to-end argument [Saltzer84], which is one of the
key design principles of the Internet. It suggests that functions
that can be realized both in the end-systems and in the network
should be implemented in the end-systems. This principle ensures
that the network provides a general service and that it remains as
simple as possible (any additional complexity is placed above the IP
layer, i.e., at the edges) so as to ensure evolvability, reliability,
and robustness. Furthermore, the fate-sharing principle ([Clark88],
"Design Philosophy of the DARPA Internet Protocols") mandates that an
end-to-end Internet protocol design should not rely on the
maintenance of any per-flow state (i.e., information about the state
of the end-to-end communication) inside the network and that the
network state (e.g., routing state) maintained by the Internet shall
minimize its interaction with the states maintained at the
end-systems.
However, as discussed in [Moors02] for instance, congestion control
cannot be realized as a pure end-to-end function only. Congestion is
an inherent network phenomenon and can only be resolved efficiently
by some cooperation of end-systems and the network. Congestion
control in today's Internet protocols follows the end-to-end design
principle insofar as only minimal feedback from the network is used,
e.g., packet loss and delay. The end-systems only decide how to
react and how to avoid congestion. The crux is that on the one hand,
there would be substantial benefit by further assistance from the
network, but, on the other hand, such network support could lead to
duplication of functions, which might even harmfully interact with
end-to-end protocol mechanisms. The different requirements of
applications (cf. the fairness discussion in Section 2.3) call for a
variety of different congestion control approaches, but putting such
per-flow behavior inside the network should be avoided, as such a
design would clearly be at odds with the end-to-end and fate-sharing
principles.
The end-to-end and fate-sharing principles are generally regarded as
the key ingredients for ensuring a scalable and survivable network
design. In order to ensure that new congestion control mechanisms
are scalable, violating these principles must therefore be avoided.
For instance, protocols like XCP and RCP seem not to require flow
state in the network, but this is only the case if the network trusts
i) the receiver not to lie when feeding back the network's delta to
the requested rate; ii) the source not to lie when declaring its
rate; and iii) the source not to cheat when setting its rate in
response to the feedback [Katabi04].
Solving these problems for non-cooperative environments like the
public Internet requires flow state, at least on a sampled basis.
However, because flows can create new identifiers whenever they want,
sampling does not provide a deterrent -- a flow can simply cheat
until it is discovered and then switch to a whitewashed identifier
[Feldman04], and continue cheating until it is discovered again.
However, holding flow state in the network only seems to solve these
policing problems in single autonomous system settings. A
multi-domain system would seem to require a completely different
protocol structure, as the information required for policing is only
seen as packets leave the internetwork, but the networks where
packets enter will also want to police compliance.
Even if a new protocol structure were found, it seems unlikely that
network flow state could be avoided, given that the network's
per-packet flow-rate instructions would need to be compared against
variations in the actual flow rate, which is inherently not a
per-packet metric.
These issues have been outstanding ever since integrated services
(IntServ) was identified as unscalable in 1997 [RFC2208]. All
subsequent attempts to involve network elements in limiting flow
rates (XCP, RCP, etc.) will run up against the same open issue if
anyone attempts to standardize them for use on the public Internet.
In general, network support of congestion control raises many issues
that have not been completely solved yet.
3.1.1. Performance and Robustness
Congestion control is subject to some tradeoffs: on the one hand, it
must allow high link utilizations and fair resource sharing, but on
the other hand, the algorithms must also be robust.
Router support can help to improve performance, but it can also
result in additional complexity and more control loops. This
requires a careful design of the algorithms in order to ensure
stability and avoid, e.g., oscillations. A further challenge is the
fact that feedback information may be imprecise. For instance,
severe congestion can delay feedback signals. Also, in-network
measurement of parameters such as RTTs or data rates may contain
estimation errors. Even though there has been significant progress
in providing fundamental theoretical models for such effects,
research has not completely explored the whole problem space yet.
Open questions are:
- How much can network elements theoretically improve performance in
the complete range of communication scenarios that exist in the
Internet without damaging or impacting end-to-end mechanisms
already in place?
- Is it possible to design robust congestion control mechanisms that
offer significant benefits with minimum additional risks, even if
Internet traffic patterns will change in the future?
- What is the minimum support that is needed from the network in
order to achieve significantly better performance than with end-
to-end mechanisms and the current IP header limitations that
provide at most unary ECN signals?
3.1.2. Granularity of Network Component Functions
There are several degrees of freedom concerning the involvement of
network entities, ranging from a few additional functions in
network management procedures on the one end to additional per-packet
processing on the other end of the solution space. Furthermore,
different amounts of state can be kept in routers (no per-flow state,
partial per-flow state, soft state, or hard state). The additional
router processing is a challenge for Internet scalability and could
also increase end-to-end latencies.
Although there are many research proposals that do not require
per-flow state and thus do not cause a large processing overhead,
there are no known full solutions (i.e., including anti-cheating)
that do not require per-flow processing. Also, scalability issues
could be caused, for instance, by synchronization mechanisms for
state information among parallel processing entities, which are,
e.g., used in high-speed router hardware designs.
Open questions are:
- What granularity of router processing can be realized without
affecting Internet scalability?
- How can additional processing efforts be kept to a minimum?
3.1.3. Information Acquisition
In order to support congestion control, network components have to
obtain at least a subset of the following information. Obtaining
that information can be a complex task.
1. Capacity of (outgoing) links
Link characteristics depend on the realization of lower protocol
layers. Routers operating at the IP layer do not necessarily know
the link layer network topology and link capacities, and these are
not always constant (e.g., on shared wireless links or bandwidth-
on-demand links). Depending on the network technology, there can
be queues or bottlenecks that are not directly visible at the IP
layer.
Difficulties also arise when using IP-in-IP tunnels [RFC2003],
IPsec tunnels [RFC4301], IP encapsulated in the Layer Two
Tunneling Protocol (L2TP) [RFC2661], Generic Routing Encapsulation
(GRE) [RFC1701] [RFC2784], the Point-to-Point Tunneling Protocol
(PPTP) [RFC2637], or Multiprotocol Label Switching (MPLS)
[RFC3031] [RFC3032]. In these cases, link information could be
determined by cross-layer information exchange, but this requires
interfaces capable of processing link layer technology specific
information. An alternative could be online measurements, but
this can cause significant additional network overhead. It is an
open research question as to how much, if any, online traffic
measurement would be acceptable (at run-time). Encapsulation and
decapsulation of explicit congestion information have been
specified for IP-in-IP tunnelling [RFC6040] and for MPLS-in-MPLS
or MPLS-in-IP [RFC5129].
2. Traffic carried over (outgoing) links
Accurate online measurement of data rates is challenging when
traffic is bursty. For instance, measuring a "current link load"
requires defining the right measurement interval / sampling
interval. This is a challenge for proposals that require
knowledge, e.g., about the current link utilization.
3. Internal buffer statistics
Some proposals use buffer statistics such as a virtual queue
length to trigger feedback. However, network components can
include multiple distributed buffer stages that make it difficult
to obtain such metrics.
Open questions are:
- Can and should this information be made available, e.g., by
additional interfaces or protocols?
- Which information is so important to higher-layer controllers that
machine architecture research should focus on designing to expose it?
3.1.4. Feedback Signaling
Explicit notification mechanisms can be realized either by in-band
signaling (notifications piggybacked along with the data traffic) or
by out-of-band signaling [Sarola07]. The latter case requires
additional protocols and a secure binding between the signals and the
packets they refer to. Out-of-band signaling can be further
subdivided into path-coupled and path-decoupled approaches.
Open questions concerning feedback signaling include:
- At which protocol layer should the feedback signaling occur
(IP/network layer assisted, transport layer assisted, hybrid
solutions, shim layer, intermediate sub-layer, etc.)? Should the
feedback signaling be path-coupled or path-decoupled?
- What is the optimal frequency of feedback (only in case of
congestion events, per RTT, per packet, etc.)?
- What direction should feedback take (from network resource via
receiver to sender, or directly back to sender)?
3.2. Challenge 2: Corruption Loss
It is common for congestion control mechanisms to interpret packet
loss as a sign of congestion. This is appropriate when packets are
dropped in routers because of a queue that overflows, but there are
other possible reasons for packet drops. In particular, in wireless
networks, packets can be dropped because of corruption loss,
rendering the typical reaction of a congestion control mechanism
inappropriate. Such non-congestive loss can be prevalent in these
networks, either because the wireless link cannot be conditioned to
properly control its error rate or because of transient wireless link
interruption in areas of poor coverage.
TCP over wireless and satellite is a topic that has been investigated
for a long time [Krishnan04]. There are some proposals where the
congestion control mechanism would react as if a packet had not been
dropped in the presence of corruption (cf. TCP HACK [Balan01]), but
discussions in the IETF have shown (see, for instance, the discussion
that occurred in April 2003 on the Datagram Congestion Control
Protocol (DCCP) working group list) that there is no agreement that
this type of reaction is appropriate. For
instance, it has been said that congestion can manifest itself as
corruption on shared wireless links, and it is questionable whether a
source that sends packets that are continuously impaired by link
noise should keep sending at a high rate because it has lost the
integrity of the feedback loop.
Generally, two questions must be addressed when designing a
congestion control mechanism that takes corruption loss into account:
1. How is corruption detected?
2. What should be the reaction?
In addition to question 1 above, it may be useful to consider
detecting the reason for corruption, but this has not yet been done
to the best of our knowledge.
Corruption detection can be done using an in-band or out-of-band
signaling mechanism, much in the same way as described for
Challenge 1. Additionally, implicit detection can be considered:
link layers sometimes retransmit erroneous frames, which can cause
the end-to-end delay to increase -- but, from the perspective of a
sender at the transport layer, there are many other possible reasons
for such an effect.
Header checksums provide another implicit detection possibility: if a
checksum covers only the necessary header fields and this checksum
does not show an error, errors in the payload can then be found using
a second checksum. Such error detection
is possible with UDP-Lite and DCCP; it was found to work well over a
General Packet Radio Service (GPRS) network in a study [Chester04]
and poorly over a WiFi network in another study [Rossi06] [Welzl08].
Note that while UDP-Lite and DCCP enable the detection of corruption,
the specifications of these protocols do not foresee any specific
reaction to it for the time being.
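Partial checksum coverage of this kind can be configured where the
operating system implements UDP-Lite. The sketch below assumes a
Linux host; the protocol number and socket options are Linux-specific
and are defined manually here because Python's socket module does not
export them on all platforms:

   import socket

   IPPROTO_UDPLITE = 136     # from <netinet/in.h> on Linux
   SOL_UDPLITE = 136
   UDPLITE_SEND_CSCOV = 10   # sender-side checksum coverage
   UDPLITE_RECV_CSCOV = 11   # minimum coverage accepted on receive

   s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, IPPROTO_UDPLITE)
   # Checksum only the first 20 bytes (UDP-Lite header plus
   # application headers); bit errors elsewhere in the payload are
   # delivered rather than causing the datagram to be dropped.
   s.setsockopt(SOL_UDPLITE, UDPLITE_SEND_CSCOV, 20)
   s.setsockopt(SOL_UDPLITE, UDPLITE_RECV_CSCOV, 20)
   s.sendto(b"corruption-tolerant payload", ("192.0.2.1", 5004))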
The idea of having a transport endpoint detect corruption and react
(or not) accordingly poses a number of interesting questions
regarding cross-layer interactions. As IP is designed to operate over
arbitrary link layers, it is difficult to
design a congestion control mechanism on top of it that appropriately
reacts to corruption -- especially as the specific data link layers
that are in use along an end-to-end path are typically unknown to
entities at the transport layer.
While the IETF has not yet specified how a congestion control
mechanism should react to corruption, proposals exist in the
literature, e.g., [Tickoo05]. For instance, TCP Westwood [Mascolo01]
sets the congestion window equal to the measured bandwidth at the
time of congestion in response to three DupACKs or a timeout. This
measurement is obtained by counting and filtering the ACK rate. This
setting provides a significant goodput improvement in noisy channels
because the "blind" halving of the congestion window in standard TCP
is avoided, i.e., the window is not reduced by too much.
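A minimal sketch of this reaction (our own rendering of the mechanism
described in [Mascolo01]; the bandwidth estimator, an ACK-rate
filter, is reduced to an input parameter here):

   def on_three_dupacks(bwe_bps, rtt_min_s, mss_bytes):
       """TCP Westwood-style reaction to three DupACKs.

       Instead of blindly halving the window, set ssthresh to the
       estimated bandwidth-delay product (in segments) and continue
       from there.
       """
       ssthresh = max(2, int(bwe_bps * rtt_min_s / (8 * mss_bytes)))
       cwnd = ssthresh
       return cwnd, ssthresh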
Open questions concerning corruption loss include:
- How should corruption loss be detected?
- How should a source react when it is known that corruption has
occurred?
- Can an ECN-capable flow infer that loss must be due to corruption
just from lack of explicit congestion notifications around a loss
episode [Tickoo05]? Or could this inference be dangerous, given
the transport does not know whether all queues on the path are
ECN-capable or not?
3.3. Challenge 3: Packet Size
TCP does not take packet size into account when responding to losses
or ECN. Over the past years, the performance of TCP congestion avoidance
algorithms has been extensively studied. The well-known "square root
formula" provides an estimation of the performance of the TCP
congestion avoidance algorithm for TCP Reno [RFC2581]. [Padhye98]
enhances the model to account for timeouts, receiver window, and
delayed ACKs.
For the sake of the present discussion, we will assume that the TCP
throughput is expressed using the simplified formula. Using this
formula, the TCP throughput B is proportional to the segment size and
inversely proportional to the RTT and the square root of the drop
probability:

               S       1
    B ~ C * ----- * -------
             RTT    sqrt(p)

    where:

    C   is a constant
    S   is the TCP segment size (in bytes)
    RTT is the end-to-end round-trip time of the TCP
        connection (in seconds)
    p   is the packet drop probability
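Expressed as code (a direct transcription of the simplified formula;
the value C = sqrt(3/2), commonly derived for the periodic-loss model
of TCP Reno, is an assumption of this sketch):

   import math

   def tcp_throughput_bps(segment_bytes, rtt_s, p, c=math.sqrt(1.5)):
       """Simplified 'square root formula' for TCP Reno throughput."""
       return 8 * c * segment_bytes / (rtt_s * math.sqrt(p))

   # Example: 1460-byte segments, 100 ms RTT, 1% loss
   # -> roughly 1.4 Mbit/s.
   print(tcp_throughput_bps(1460, 0.100, 0.01))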
Neglecting the fact that the TCP rate depends linearly on the packet
size, choosing the ideal packet size is a tradeoff between high throughput
(the larger a packet, the smaller the relative header overhead) and
low packet latency (the smaller a packet, the shorter the time that
is needed until it is filled with data). Since TCP is not well suited
to streaming media applications (reliable in-order delivery and
congestion control can cause arbitrarily long delays), this tradeoff
has usually not been considered for TCP
applications. Therefore, the influence of the packet size on the
sending rate has not typically been seen as a significant issue,
given there are still few paths through the Internet that support
packets larger than the 1500 bytes common with Ethernet.
The situation is already different for the Datagram Congestion
Control Protocol (DCCP) [RFC4340], which has been designed to enable
unreliable but congestion-controlled datagram transmission, avoiding
the arbitrary delays associated with TCP. DCCP is intended for
applications such as streaming media that can benefit from control
over the tradeoffs between delay and reliable in-order delivery.
DCCP provides for a choice of modular congestion control mechanisms.
DCCP uses Congestion Control Identifiers (CCIDs) to specify the
congestion control mechanism. Three profiles are currently defined:
- DCCP Congestion Control ID 2 (CCID 2) [RFC4341]: TCP-like
Congestion Control. CCID 2 sends data using a close approximation
of TCP's congestion control as well as incorporating a variant of
Selective Acknowledgment (SACK) [RFC2018] [RFC3517]. CCID 2 is
suitable for senders that can adapt to the abrupt changes in the
congestion window typical of TCP's AIMD congestion control, and
particularly useful for senders that would like to take advantage
of the available bandwidth in an environment with rapidly changing
conditions.
- DCCP Congestion Control ID 3 (CCID 3) [RFC4342]: TCP-Friendly Rate
Control (TFRC) [RFC5348] is a congestion control mechanism
designed for unicast flows operating in a best-effort Internet
environment. When competing for bandwidth, it achieves a throughput
similar to that of TCP flows but with a much lower variation of
throughput over time than TCP, making it more suitable for
applications such as
streaming media where a relatively smooth sending rate is of
importance. CCID 3 is appropriate for flows that would prefer to
minimize abrupt changes in the sending rate, including streaming
media applications with small or moderate receiver buffering delays.
- DCCP Congestion Control ID 4 (CCID 4) [RFC5622]: TFRC Small
Packets (TFRC-SP) [RFC4828], a variant of the TFRC mechanism, has
been designed for applications that exchange small packets. The
objective of TFRC-SP is to achieve the same bandwidth in bits per
second as a TCP flow using packets of up to 1500 bytes. TFRC-SP
enforces a minimum interval of 10 ms between data packets to
prevent a single flow from sending small packets arbitrarily
frequently. CCID 4 has been designed to be used either by
applications that use a small fixed segment size, or by
applications that change their sending rate by varying the segment
size. Because CCID 4 is intended for applications that use a
fixed small segment size, or that vary their segment size in
response to congestion, the transmit rate derived from the TCP
throughput equation is reduced by a factor that accounts for the
packet header size, as specified in [RFC4828].
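The header-size reduction factor can be sketched as follows (our own
simplification of the calculation in [RFC4828]; the 40-byte header
allowance follows that specification, while the helper names are
ours):

   H = 40  # assumed bytes of header per packet [RFC4828]

   def tfrc_sp_rate_bps(x_bps_1500, segment_bytes):
       """Allowed TFRC-SP rate for a given actual segment size.

       x_bps_1500 is the rate that the TFRC throughput equation
       yields for a flow using 1500-byte packets.  The factor
       s/(s + H) charges the small-packet flow for its header
       overhead, so it obtains the same goodput in bits per second --
       not the same packet rate -- as the reference TCP flow.
       """
       s = segment_bytes
       return x_bps_1500 * s / (s + H)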
The resulting open questions are:
- How does TFRC-SP operate under various network conditions?
- How can congestion control be designed so as to scale with packet
size (dependency of congestion algorithm on packet size)?
Today, many network resources are designed so that packet processing
cannot be overloaded even for incoming loads at the maximum bit rate
of the line. If packet processing can handle sustained load r
[packet per second] and the minimum packet size is h [bit] (i.e.,
frame, packet, and transport headers with no payload), then a line
rate of x [bit per second] will never be able to overload packet
processing as long as x =< r*h.
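As a simple numeric check of this rule (the figures below are purely
illustrative):

   def packet_processing_safe(x_bps, r_pps, h_bits):
       """True if packet processing cannot be overloaded even by a
       line full of minimum-size packets, i.e., x =< r*h."""
       return x_bps <= r_pps * h_bits

   # A 10 Gbit/s line, a forwarding engine handling 20 Mpps, and
   # 84-byte minimum Ethernet frames (672 bits on the wire):
   # 20 Mpps * 672 bits = 13.4 Gbit/s >= 10 Gbit/s, so processing
   # cannot be overloaded.
   print(packet_processing_safe(10e9, 20e6, 672))   # True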
However, realistic equipment is often designed to only cope with a
near-worst-case workload with a few larger packets in the mix, rather
than the worst-case scenario of all minimum-size packets. In this
case, x = r*(h + e) for some small value of e. Therefore, packet
congestion is not impossible for runs of small packets (e.g., TCP
ACKs or denial-of-service (DoS) attacks with TCP SYNs or small UDP
datagrams). But absent such anomalous workloads, equipment vendors
at the 2008 ICCRG meeting believed that equipment could still be
designed so that any congestion should be due to bit overload and not
packet overload.
This observation raises additional open issues:
- Can bit congestion remain prevalent?
Being able to assume that congestion is generally due to excess
bits and not excess packets is a useful simplifying assumption in
the design of congestion control protocols. Can we rely on this
assumption for the future? An alternative view is that in-network
processing will become commonplace, so that per-packet processing
will as likely be the bottleneck as per-bit transmission [Shin08].
Over the last three decades, performance gains have mainly been
achieved through increased packet rates and not bigger packets.
But if bigger maximum segment sizes do become more prevalent, tiny
segments (e.g., ACKs) will not stop being widely used -- leading
to a widening range of packet sizes.
The open question is thus whether or not packet processing rates
(r) will keep up with growth in transmission rates (x). A
superficial look at Moore's Law-type trends would suggest that
processing (r) will continue to outstrip growth in transmission
(x). But predictions based on actual knowledge of technology
futures would be useful. Another open question is whether there
are likely to be more small packets in the average packet mix. If
the answers to either of these questions predict that packet
congestion could become prevalent, congestion control protocols
will have to be more complicated.
- Confusable causes of loss
There is a considerable body of research on how to distinguish
whether packet drops are due to transmission corruption or to
congestion. But the full list of confusable causes of loss is
longer and includes transmission corruption loss, congestion loss
(bit congestion and packet congestion), and policing loss.
If congestion is due to excess bits, the bit rate should be
reduced. If congestion is due to excess packets, the packet rate
can be reduced without reducing the bit rate -- by using larger
packets. However, if the transport cannot tell which of these
causes led to a specific packet drop, its only safe response is to
reduce the bit rate. This is why the Internet would be more
complicated if packet congestion were prevalent, as reducing the
bit rate normally also reduces the packet rate, while reducing the
packet rate does not necessarily reduce the bit rate.
Given that distinguishing between corruption loss and congestion is
already an open issue (Section 3.2), if that problem is ever
solved, a further open issue would be whether to standardize a
solution that distinguishes all the above causes of loss, and not
just two of them.
Nonetheless, even if we find a way for network equipment to
explicitly distinguish which sort of loss has occurred, we will
never be able to assume that such a smart AQM solution is deployed
at every congestible resource throughout the Internet -- at every
higher-layer device like firewalls, proxies, and servers; and at
every lower-layer device like low-end hubs, DSLAMs, Wireless LAN
(WLAN) cards, cellular base-stations, and so on. Thus, transport
protocols will always have to cope with packet drops due to
unpredictable causes, so we should always treat AQM as an
optimization, given that it will never be ubiquitous throughout the
Internet.
- What does a congestion notification on a packet of a certain size
mean?
The open issue here is whether a loss or explicit congestion mark
should be interpreted as a single congestion event irrespective of
the size of the packet lost or marked, or whether the strength of
the congestion notification is weighted by the size of the packet.
This issue is discussed at length in [Bri10], along with other
aspects of packet size and congestion control.
[Bri10] makes the strong recommendation that network equipment
should drop or mark packets with a probability independent of each
specific packet's size, while congestion controls should respond
to dropped or marked packets in proportion to the packet's size.
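A congestion control that responds in proportion to packet size in
effect counts congestion in bytes rather than in packets. The sketch
below (our own illustration of this recommendation, not an algorithm
specified in [Bri10]) shows the corresponding book-keeping:

   def congestion_volume_bytes(events):
       """Sum congestion signals weighted by packet size.

       'events' is an iterable of (packet_size_bytes, signalled)
       pairs, where 'signalled' is True for a lost or ECN-marked
       packet.  A size-proportional controller responds to this byte
       count rather than to the number of signalled packets.
       """
       return sum(size for size, signalled in events if signalled)

   # A lost 1500-byte packet then counts 25 times as much as a lost
   # 60-byte ACK.
   print(congestion_volume_bytes([(1500, True), (60, True),
                                  (1500, False)]))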
- Packet size and congestion control protocol design
If the above recommendation is correct -- that the packet size of
a congestion notification should be taken into account when the
transport reads, and not when the network writes, the notification
-- it opens up a significant problem of protocol engineering and
re-engineering. Indeed, TCP does not take packet size into
account when responding to losses or ECN. At present, this is not
a pressing problem because use of 1500 byte data segments is very
prevalent for TCP, and the incidence of alternative maximum
segment sizes is not large. However, we should design the
Internet's protocols so they will scale with packet size. So, an
open issue is whether we should evolve TCP to be sensitive to
packet size, or expect new protocols to take over.
As we continue to standardize new congestion control protocols, we
must then face the issue of how they should account for packet
size. It is still an open research issue to establish whether TCP
was correct in not taking packet size into account. If it is
determined that TCP was wrong in this respect, we should
discourage future protocol designs from following TCP's example.
For example, as explained above, the small-packet variant of TCP-
friendly rate control (TFRC-SP [RFC4828]) is an experimental
protocol that aims to take packet size into account. Whatever
packet size it uses, it ensures that its rate approximately equals
that of a TCP using 1500 byte segments. This raises the further
question of whether TCP with 1500 byte segments will be a suitable
long-term gold standard, or whether we need a more thorough review
of what it means for a congestion control mechanism to scale with
packet size.
3.4. Challenge 4: Flow Startup
The beginning of data transmissions imposes some further, unique
challenges: when a connection to a new destination is established,
the end-systems have hardly any information about the characteristics
of the path in between and the available bandwidth. In this flow
startup situation, there is no obvious choice as to how to start to
send. A similar problem also occurs after relatively long idle
times, since the congestion control state then no longer reflects
current information about the state of the network (flow restart
problem).
Van Jacobson [Jacobson88] suggested using the slow-start mechanism
both for the flow startup and the flow restart, and this is today's
standard solution [RFC2581] [RFC5681]. Per [RFC5681], the slow-start
algorithm is used when the congestion window (cwnd) < slow-start
threshold (ssthresh), whose initial value is set arbitrarily high
(e.g., to the size of the largest possible advertised window) and
reduced in response to congestion. During slow-start, TCP increments
the cwnd by at most Sender Maximum Segment Size (MSS) bytes for each
ACK received that cumulatively acknowledges new data. Slow-start
ends when cwnd exceeds ssthresh or when congestion is observed.
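In code, this standard behavior can be sketched as follows (a per-ACK
model of the [RFC5681] window updates only; delayed ACKs, byte
counting, and loss recovery are omitted):

   def on_ack(cwnd, ssthresh, mss, newly_acked_bytes):
       """One RFC 5681 window update per ACK acknowledging new data."""
       if cwnd < ssthresh:
           # Slow-start: at most one MSS per ACK, i.e., roughly a
           # doubling of cwnd per RTT.
           cwnd += min(mss, newly_acked_bytes)
       else:
           # Congestion avoidance: roughly one MSS per RTT.
           cwnd += max(1, mss * mss // cwnd)
       return cwnd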
However, the slow-start is not optimal in many situations. First, it
can take quite a long time until a sender can fully utilize the
available bandwidth on a path. Second, the exponential increase may
be too aggressive and cause multiple packet loss if large congestion
windows are reached (slow-start overshooting). Finally, the slow-
start does not ensure that new flows converge quickly to a reasonable
share of resources, particularly when the new flows compete with
long-lived flows and come out of slow-start early (slow-start vs
overshoot tradeoff). This convergence problem may even worsen if
more aggressive congestion control variants are widely used.
The slow-start and its interaction with the congestion avoidance
phase were largely designed by intuition [Jacobson88]. So far, little
theory has been developed to understand the flow startup problem and
its implications for congestion control stability and fairness. There
is also no established methodology to evaluate whether new flow
startup mechanisms are appropriate or not.
As a consequence, it is a non-trivial task to address the
shortcomings of the slow-start algorithm. Several experimental
enhancements have been proposed, such as congestion window validation
[RFC2861] and limited slow-start [RFC3742]. There are also ongoing
research activities, focusing, e.g., on bandwidth estimation
techniques, delay-based congestion control, or rate-pacing
mechanisms. However, any alternative end-to-end flow startup
approach has to cope with the inherent problem that there is little
or no information about the path at the beginning of a data
transfer. This uncertainty could be reduced by more expressive
feedback signaling (cf. Section 3.1). For instance, a source could
learn the path characteristics faster with the Quick-Start mechanism
[RFC4782]. But even if the source knew exactly what rate it should
aim for, it would still not necessarily be safe to jump straight to
that rate. The end-system still does not know how a change in its
own rate will affect the path, which also might become congested in
less than one RTT. Further research would be useful to understand
the effect of decreasing the uncertainty by explicit feedback
separately from control theoretic stability questions. Furthermore,
flow startup also raises fairness questions. For instance, it is
unclear whether it could be reasonable to use a faster startup when
an end-system detects that a path is currently not congested.
In summary, there are several topics for further research concerning
flow startup:
- Better theoretical understanding of the design and evaluation of
flow startup mechanisms, concerning their impact on congestion
risk, stability, and fairness.
- Evaluating whether it may be appropriate to allow alternative
starting schemes, e.g., to allow higher initial rates under
certain constraints [Chu10]; this also requires refining the
definition of fairness for startup situations.
- Better theoretical models for the effects of decreasing
uncertainty by additional network feedback, particularly if the
path characteristics are very dynamic.
3.5. Challenge 5: Multi-Domain Congestion Control
Transport protocols such as TCP operate over the Internet, which is
divided into autonomous systems. These systems are characterized by
their heterogeneity, as IP networks are realized by a multitude of
network technologies.
3.5.1. Multi-Domain Transport of Explicit Congestion Notification
Different conditions and their variations lead to correlation effects
between policers that regulate traffic against certain conformance
criteria.
With the advent of techniques allowing for early detection of
congestion, packet loss is no longer the sole metric of congestion.
ECN (Explicit Congestion Notification) conveys congestion information
via packet markings set by active queue management techniques, aiming
to prevent packet losses (packet loss and the number of packets
marked give an indication of the level of congestion).
Using TCP ACKs to feed back that information allows the hosts to
realign their transmission rate and thus encourages them to
efficiently use the network. In IP, ECN uses the two least
significant bits of the (former) IPv4 Type of Service (TOS) octet or
the (former) IPv6 Traffic Class octet [RFC2474] [RFC3260]. Further,
ECN in TCP uses two bits in the TCP header that were previously
defined as reserved [RFC793].
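For reference, the four codepoints that these two IP header bits
encode, and the marking behavior of an ECN-capable queue, can be
sketched as follows (names follow [RFC3168]):

   # The two ECN bits in the IP header encode four codepoints.
   ECN_CODEPOINTS = {
       0b00: "Not-ECT",  # transport is not ECN-capable
       0b10: "ECT(0)",   # ECN-Capable Transport
       0b01: "ECT(1)",   # ECN-Capable Transport (alternate)
       0b11: "CE",       # Congestion Experienced
   }

   def mark_on_congestion(ecn_bits):
       """Set CE instead of dropping, but only on ECT packets."""
       return 0b11 if ecn_bits in (0b10, 0b01) else ecn_bits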
ECN [RFC3168] is an example of a congestion feedback mechanism from
the network toward hosts. The congestion-based feedback scheme,
however, has limitations when applied on an inter-domain basis.
Indeed, Sections 8 and 19 of [RFC3168] detail the implications of two
problems:
1. non-compliance: a network erasing a Congestion Experienced (CE)
codepoint introduced earlier on the path, and
2. subversion: a network changing a Not ECN-Capable Transport
(Not-ECT) codepoint into an ECT codepoint.
Both of these problems could allow an attacking network to cause
excess congestion in an upstream network, even if the transports were
behaving correctly. There are to date two possible solutions to the
non-compliance problem (number 1 above): the ECN-nonce [RFC3540] and
the [CONEX] work item inspired by the re-ECN incentive system
[Bri09]. Nevertheless, accidental rather than malicious erasure of
ECN is an issue for IPv6, where the absence of an IPv6 header checksum
implies that corruption of ECN could have more impact than in the
IPv4 case.
Fragmentation is another issue: the ECN-nonce cannot protect against
misbehaving receivers that conceal marked fragments; thus, some
protection is lost in situations where path MTU discovery is
disabled. Note also that the ECN-nonce would not protect against the
subversion issue (number 2 above) because, by definition, a Not-ECT
packet comes from a source without ECN enabled, and therefore without
the ECN-nonce enabled. So, there is still room for improvement on
the ECN mechanism when operating in multi-domain networks.
Operational/deployment experience is nevertheless required to
determine the extent of these problems. The second problem is mainly
related to deployment and usage practices and does not seem to result
in any specific research challenge.
Another controversial solution in a multi-domain environment may be
the TCP rate controller (TRC), a traffic conditioner that regulates
the TCP flow at the ingress node in each domain by controlling packet
drops and delays of the packets in a flow. The outgoing traffic from
a TRC-controlled domain is shaped in such a way that no packets are
dropped at the policer. However, the TRC interferes with the end-to-
end TCP model, and thus it would interfere with past and future
diversity of TCP implementations (violating the end-to-end
principle). In particular, the TRC embeds the flow rate equality
view of fairness in the network, and would prevent evolution to forms
of fairness based on congestion-volume (Section 2.3).
3.5.2. Multi-Domain Exchange of Topology or Explicit Rate Information
Security is a challenge for multi-domain exchange of explicit rate
signals, whether in-band or out-of-band. At domain boundaries,
authentication and authorization issues can arise whenever congestion
control information is exchanged. From this perspective, the
Internet does not so far have any security architecture for this
purpose.
The future evolution of Internet inter-domain operation has to show
whether more multi-domain information exchange can be effectively
realized. This is of particular importance for congestion control
schemes that make use of explicit per-datagram rate feedback (e.g.,
RCP or XCP) or explicit rate feedback that uses in-band congestion
signaling (e.g., Quick-Start) or out-of-band signaling (e.g.,
CADPC/PTP). Explicit signaling exchanges at the inter-domain level
that result in local domain triggers are currently absent from the
Internet. From this perspective, security issues resulting from
limited trust between different administrative units result in policy
enforcement that exacerbates the difficulty encountered when explicit
feedback congestion control information is exchanged between domains.
Note that even though authentication mechanisms could be extended for
this purpose (by recognizing that explicit rate schemes such as RCP
or XCP have the same inter-domain security requirements and structure
as IntServ), they suffer from the same scalability problems as
identified in [RFC2208]. Indeed, in-band rate signaling or out-of-
band per-flow traffic specification signaling (like in the Resource
Reservation Protocol (RSVP)) results in similar scalability issues
(see Section 3.1).
Also, many autonomous systems only exchange some limited amount of
information about their internal state (topology hiding principle),
even though having more precise information could be highly
beneficial for congestion control. Indeed, revealing the internal
network structure is highly sensitive in multi-domain network
operations and thus also a concern when it comes to the deployability
of congestion control schemes. For instance, a network-assisted
congestion control scheme with explicit signaling could reveal more
information about the internal network dimensioning than TCP does
today.
3.5.3. Multi-Domain Pseudowires
Extending pseudowires across multiple domains poses specific issues.
Pseudowires (PWs) [RFC3985] may carry non-TCP data flows (e.g., Time-
Division Multiplexing (TDM) traffic or Constant Bit Rate (CBR) ATM
traffic) over a multi-domain IP network. Structure-Agnostic TDM over
Packet (SAToP) [RFC4553], Circuit Emulation Service over Packet
Switched Network (CESoPSN) [RFC5086], and TDM over IP (TDMoIP)
[RFC5087] are not responsive to congestion control as discussed in
[RFC2914] (see also [RFC5033]). The same observation applies to ATM
circuit emulating services (CESs) interconnecting CBR equipment
(e.g., Private Branch Exchanges (PBX)) across a Packet Switched
Network (PSN).
Moreover, it is not possible to simply reduce the flow rate of a TDM
PW or an ATM PW when facing packet loss. Providers can rate-control
corresponding incoming traffic, but they may not be able to detect
that PWs carry TDM or CBR ATM traffic (mechanisms for characterizing
the traffic's temporal properties may not necessarily be supported).
This can be illustrated with the following example.
. . .
S1 --- E1 --- . .
. | . .
. === E5 === E7 ---
. | . . |
S2 --- E2 --- . . |
. . . | |
........... . | v
. ----- R --->
........... . | ^
. . . | |
S3 --- E3 --- . . |
. | . . |
. === E6 === E8 ---
. | . .
S4 --- E4 --- . .
. . .
\---- P1 ---/ \---------- P2 ----------/
Sources S1, S2, S3, and S4 are originating TDM over IP traffic. P1
provider edges E1, E2, E3, and E4 are rate-limiting such traffic.
The Service Level Agreement (SLA) of provider P1 with transit
provider P2 is such that the latter assumes a best-effort (BE)
traffic pattern and
that the distribution shows the typical properties of common BE
traffic (elastic, non-real time, non-interactive).
The problem arises for transit provider P2 because it is not able to
detect that IP packets are carrying constant-bit-rate service traffic
for which the only useful congestion control mechanism would rely on
implicit or explicit admission control, meaning self-blocking or
enforced blocking, respectively.
Assuming P1 providers are rate-limiting BE traffic, a transit P2
provider router R may be subject to serious congestion as all TDM PWs
cross the same router. TCP-friendly traffic (e.g., each flow within
another PW) would follow TCP's AIMD algorithm of reducing the sending
rate by half, in response to each packet drop. Nevertheless, the PWs
carrying TDM traffic could take all the available capacity while
other more TCP-friendly or generally congestion-responsive traffic
reduced itself to nothing. Note here that the situation may simply
occur because S4 suddenly turns on additional TDM channels.
It is neither possible nor desirable to assume that edge routers will
soon have the ability to detect the responsiveness of the carried
traffic, but it is still important for transit providers to be able
to police a fair, robust, responsive, and efficient congestion
control technique in order to avoid impacting congestion-responsive
Internet traffic. However, we must not require only certain specific
responses to congestion to be embedded within the network, which
would harm evolvability. So designing the corresponding mechanisms
in the data and control planes still requires further investigation.
3.6. Challenge 6: Precedence for Elastic Traffic
Traffic initiated by so-called elastic applications adapts to the
available bandwidth using feedback about the state of the network.
For elastic applications, the transport dynamically adjusts the data
traffic sending rate to different network conditions. Examples
encompass short-lived elastic traffic including HTTP and instant-
messaging traffic, as well as long file transfers with FTP and
applications targeted by [LEDBAT]. In brief, elastic data
applications can show extremely different requirements and traffic
characteristics.
The idea to distinguish several classes of best-effort traffic types
is rather old, since it would be beneficial to address the relative
delay sensitivities of different elastic applications. The notion of
traffic precedence was already introduced in [RFC791], and it was
broadly defined as "An independent measure of the importance of this
datagram". For instance, low-precedence traffic should experience
lower average throughput than higher-precedence traffic. Several
questions arise here: What is the meaning of "relative"? What is the
role of the transport layer?
The preferential treatment of higher-precedence traffic combined with
appropriate congestion control mechanisms is still an open issue that
may, depending on the proposed solution, impact the precedence
awareness of both the host and the network, and thereby congestion
control.
[RFC2990] points out that the interactions between congestion control
and DiffServ [RFC2475] remained unaddressed until recently.
Recently, a study and a potential solution have been proposed that
introduce Guaranteed TFRC (gTFRC) [Lochin06]. gTFRC is an adaptation
of TCP-Friendly Rate Control providing throughput guarantees for
unicast flows over the DiffServ/Assured Forwarding (AF) class. The
purpose of gTFRC is to distinguish the guaranteed part from the best-
effort part of the traffic resulting from AF conditioning. The
proposed congestion control has been specified and tested inside
DCCP/CCID 3 for DiffServ/AF networks [Lochin07] [Jourjon08].
Nevertheless, there is still work to be performed regarding lower-
precedence traffic -- data transfers that are useful, yet not
important enough to warrant significantly impairing other traffic.
Examples of applications that could make use of such traffic are web
caches and web browsers (e.g., for pre-fetching) as well as peer-to-
peer applications. There are proposals for achieving low precedence
on a pure end-to-end basis (e.g., TCP Low Priority (TCP-LP)
[Kuzmanovic03]), and there is a specification for achieving it via
router mechanisms [RFC3662]. It seems, however, that network-based
lower-precedence mechanisms are not yet a common service on the
Internet. Since early 2010, end-to-end mechanisms for lower
precedence, e.g., [Shal10], have become common -- at least where the
background traffic competes with other traffic in the user's own
queues (e.g., in a
home router). But it is less clear whether users will be willing to
make their background traffic yield to other people's foreground
traffic, unless the appropriate incentives are created.
There is an issue over how to reconcile two divergent views of the
relation between traffic class precedence and congestion control.
One view considers that congestion signals (losses or explicit
notifications) in one traffic class are independent of those in
another. The other relates marking of the classes together within
the active queue management (AQM) mechanism [Gibbens02]. In the
independent case, using a higher-precedence class of traffic gives a
higher scheduling precedence and generally lower congestion level.
In the linked case, using a higher-precedence class of traffic still
gives higher scheduling precedence, but results in a higher level of
congestion. This higher congestion level reflects the extra
congestion higher-precedence traffic causes to both classes combined.
The linked case separates scheduling precedence from rate control.
The end-to-end congestion control algorithm can separately choose to
take a higher rate by responding less to the higher level of
congestion. This second approach could become prevalent if weighted
congestion controls were common. However, it is an open issue how
the two approaches might co-exist or how one might evolve into the
other.
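To illustrate the linked case, a congestion control could weight its
response in the style of MulTCP, where a flow of weight w behaves
roughly like w standard TCP flows (a sketch under our own
assumptions; the weight and update rules are illustrative, not a
standardized scheme):

   def weighted_aimd_update(cwnd_segments, w, congestion_signalled):
       """One RTT of a MulTCP-style weighted AIMD (illustrative).

       A flow of weight w increases w times faster and gives up a
       proportionally smaller share on congestion, so it sustains
       roughly w times the rate of a standard flow at the same
       congestion level.
       """
       if congestion_signalled:
           return cwnd_segments * (1 - 1 / (2 * w))
       return cwnd_segments + w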
3.7. Challenge 7: Misbehaving Senders and Receivers
In the current Internet architecture, congestion control depends on
parties acting against their own interests. It is not in a
receiver's interest to honestly return feedback about congestion on
the path, effectively requesting a slower transfer. It is not in the
sender's interest to reduce its rate in response to congestion if it
can rely on others to do so. Additionally, networks may have
strategic reasons to make other networks appear congested.
Numerous strategies for subverting congestion control have already been
identified. The IETF has particularly focused on misbehaving TCP
receivers that could confuse a compliant sender into assigning
excessive network and/or server resources to that receiver (e.g.,
[Savage99], [RFC3540]). But, although such strategies are worryingly
powerful, they do not yet seem common (however, evidence of attack
prevalence is itself a research requirement).
A growing proportion of Internet traffic comes from applications
designed not to use congestion control at all, or worse, applications
that add more forward error correction as they experience more
losses. Some believe the Internet was designed to allow such
freedom, so it can hardly be called misbehavior. But others consider
it misbehavior to abuse this freedom [RFC3714], given one person's
freedom can constrain the freedom of others (congestion represents
this conflict of interests). Indeed, leaving freedom unchecked might
result in congestion collapse in parts of the Internet.
Proportionately large volumes of unresponsive voice traffic could
represent such a threat, particularly for countries with less
generous provisioning [RFC3714]. Also, Internet video on demand
services that transfer much greater data rates without congestion
control are becoming popular. In general, it is recommended that
such UDP applications use some form of congestion control [RFC5405].
Note that the problem is not just misbehavior driven by a self-
interested desire for more bandwidth. Indeed, congestion control may
be attacked by someone who makes no gain for themselves, other than
the satisfaction of harming others (see the Security Considerations
sections of the relevant congestion control specifications).
Open research questions resulting from these considerations are:
- By design, new congestion control protocols need to enable one end
to check the other for protocol compliance. How would such
mechanisms be designed?
- Which congestion control primitives could safely satisfy more
demanding applications (smoother than TFRC, faster than high-speed
TCPs), so that application developers and users do not turn off
congestion control to get the rate they expect and need?
Note also that self-restraint could disappear from the Internet. So,
it may no longer be sufficient to rely on developers/users
voluntarily submitting themselves to congestion control. As a
consequence, mechanisms to enforce fairness (see Sections 2.3, 3.4,
and 3.5) need to have more emphasis within the research agenda.