Evol_cc] [RFC3426]. Given there are many good reasons why larger path maximum transmission units (PMTUs) would help solve a number of scaling issues, we do not want to create any bias against large packets that is greater than their true cost.

The same bit rate of packets will contribute the same to bit congestion of a link irrespective of whether it is sent as fewer larger packets or more smaller packets. A protocol design that caused larger packets to be more likely to be dropped than smaller ones would be dangerous in both of the following cases:

Malicious transports: A queue that gives an advantage to small packets can be used to amplify the force of a flooding attack. By sending a flood of small packets, the attacker can get the queue to discard more large-packet traffic, allowing more attack traffic to get through to cause further damage. Such a queue allows attack traffic to have a disproportionately large effect on regular traffic without the attacker having to do much work.

Non-malicious transports: Even if an application designer is not actually malicious, if over time it is noticed that small packets tend to go faster, designers will act in their own interest and use smaller packets. Queues that give advantage to small packets create an evolutionary pressure for applications or transports to send at the same bit rate but break their data stream down into tiny segments to reduce their drop rate. Encouraging a high volume of tiny packets might in turn unnecessarily overload a completely unrelated part of the system, perhaps more limited by header processing than bandwidth.

Imagine that two unresponsive flows arrive at a bit-congestible transmission link each with the same bit rate, say 1 Mbps, but one consists of 1,500 B and the other 60 B packets, which are 25x smaller. Consider a scenario where gentle RED [gentle_RED] is used,
along with the variant of RED we advise against, i.e., where the RED algorithm is configured to adjust the drop probability of packets in proportion to each packet's size (byte-mode packet drop). In this case, RED aims to drop 25x more of the larger packets than the smaller ones. Thus, for example, if RED drops 25% of the larger packets, it will aim to drop 1% of the smaller packets (but, in practice, it may drop more as congestion increases; see Appendix B.4 of [RFC4828]). Even though both flows arrive with the same bit rate, the bit rate the RED queue aims to pass to the line will be 750 kbps for the flow of larger packets but 990 kbps for the smaller packets (because of rate variations, it will actually be a little less than this target). Note that, although the byte-mode drop variant of RED amplifies small-packet attacks, tail-drop queues amplify small-packet attacks even more (see Security Considerations in Section 6). Wherever possible, neither should be used.
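The arithmetic of the two-flow example above can be checked with a short sketch. It is illustrative only: the function names and the choice of 25% as the drop probability for a maximum-size packet are assumptions made for this example, and the model ignores the rate variations and congestion-dependent effects mentioned in the text.

```python
# Illustrative sketch only; names and parameters are assumptions for this
# example, not part of any RED specification.

def byte_mode_drop_prob(base_prob: float, packet_size: int, max_size: int) -> float:
    """Drop probability scaled in proportion to packet size (byte-mode drop)."""
    return base_prob * packet_size / max_size


def target_passed_kbps(arrival_kbps: float, drop_prob: float) -> float:
    """Bit rate the queue aims to pass to the line for an unresponsive flow."""
    return arrival_kbps * (1.0 - drop_prob)


MAX_SIZE = 1500    # bytes, size of the larger packets
BASE_PROB = 0.25   # assumed drop probability for a maximum-size packet

for size in (1500, 60):
    p = byte_mode_drop_prob(BASE_PROB, size, MAX_SIZE)
    rate = target_passed_kbps(1000.0, p)  # both flows arrive at 1 Mbps
    print(f"{size:>5} B packets: drop {p:.0%}, target {rate:.0f} kbps")
```

Under these assumptions the sketch reproduces the 750 kbps and 990 kbps targets quoted above, even though both flows offer the same load in bits.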
flows with different packet sizes. However, in order to do this, the queuing algorithm has to make assumptions about the transport, which become embedded in the network. Specifically:

o  The queuing algorithm has to assume how aggressively the transport will respond to congestion (see Section 4.2.4). If the network assumes the transport responds as aggressively as TCP NewReno, it will be wrong for Compound TCP and differently wrong for Cubic TCP, etc. To achieve equal bit rates, each transport then has to guess what assumption the network made, and work out how to replace this assumed aggressiveness with its own aggressiveness.

o  Also, if the network biases congestion notification by packet size, it has to assume a baseline packet size -- all proposed algorithms use the local MTU (for example, see the byte-mode loss probability formula in Table 1). Then if the non-Reno transports mentioned above are trying to reverse engineer what the network assumed, they also have to guess the MTU of the congested link.

Even though reducing the drop probability of small packets (e.g., RED's byte-mode drop) helps ensure TCP flows with different packet sizes will achieve similar bit rates, we argue that this correction should be made to any future transport protocols based on TCP, not to the network in order to fix one transport, no matter how predominant it is. Effectively, favouring small packets is reverse engineering of network equipment around one particular transport protocol (TCP), contrary to the excellent advice in [RFC3426], which asks designers to question "Why are you proposing a solution at this layer of the protocol stack, rather than at another layer?"

In contrast, if the network never takes packet size into account, the transport can be certain it will never need to guess any assumptions that the network has made.
And the network passes two pieces of information to the transport that are sufficient in all cases: i) congestion notification on the packet and ii) the size of the packet. Both are available for the transport to combine (by taking packet size into account when responding to congestion) or not. Appendix B checks that these two pieces of information are sufficient for all relevant scenarios. When the network does not take packet size into account, it allows transport protocols to choose whether or not to take packet size into account. However, if the network were to bias congestion notification by packet size, transport protocols would have no choice; those that did not take into account packet size themselves would unwittingly become dependent on packet size, and those that already took packet size into account would end up taking it into account twice.
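As a minimal sketch of how a transport might combine these two pieces of information, the following hypothetical class keeps both a per-packet and a per-byte view of congestion feedback. The class and method names are assumptions for illustration; no real transport stack is being described. A transport that ignores packet size would use the packet fraction, while one that takes packet size into account would use the byte fraction.

```python
from dataclasses import dataclass


@dataclass
class CongestionFeedback:
    """Hypothetical per-flow tally of congestion notifications."""
    packets_seen: int = 0
    packets_marked: int = 0
    bytes_seen: int = 0
    bytes_marked: int = 0

    def on_packet(self, size_bytes: int, congestion_marked: bool) -> None:
        """Record one packet and whether it carried a congestion signal."""
        self.packets_seen += 1
        self.bytes_seen += size_bytes
        if congestion_marked:
            self.packets_marked += 1
            # Adding the size of each marked packet weights the signal by
            # packet size using only additions.
            self.bytes_marked += size_bytes

    def packet_fraction(self) -> float:
        """Fraction of packets marked, ignoring packet size."""
        return self.packets_marked / self.packets_seen if self.packets_seen else 0.0

    def byte_fraction(self) -> float:
        """Fraction of bytes marked, taking packet size into account."""
        return self.bytes_marked / self.bytes_seen if self.bytes_seen else 0.0


feedback = CongestionFeedback()
for size, marked in [(1500, True), (60, False), (60, False), (60, True)]:
    feedback.on_packet(size, marked)
print(feedback.packet_fraction(), feedback.byte_fraction())
```

Because the network's notification is independent of packet size in this model, either view can be derived at the host without guessing what the network assumed.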
the size of the packet in question. Nonetheless, in general, if we wanted networks to do size-dependent drop, we would need universal deployment of (packet-size dependent) AQM code, which is currently unrealistic.

A host transport cannot know whether any particular drop was a deliberate signal from an AQM or a sign of a queue shedding packets due to buffer exhaustion. Therefore, because the network cannot universally do size-dependent drop, it should not do it at all.

Whereas universality is desirable in the network, diversity is desirable between different transport-layer protocols -- some, like standards track TCP congestion control [RFC5681], may not choose to make their rate response proportionate to the size of each dropped packet, while others will (e.g., TCP-Friendly Rate Control for Small Packets (TFRC-SP) [RFC4828]).

Taking packet size into account at the transport rather than in the network ensures that neither the network nor the transport needs to do a multiply operation -- multiplication by packet size is effectively achieved as a repeated add when the transport adds to its count of marked bytes as each congestion event is fed to it. Also, the work to do the biasing is spread over many hosts, rather than concentrated in just the congested network element. These aren't principled reasons in themselves, but they are a happy consequence of the other principled reasons.

[RED93] proposed two options for the RED active queue management algorithm: packet mode and byte mode. Packet mode measured the queue length in packets and dropped (or marked) individual packets with a probability independent of their size. Byte mode measured the queue length in bytes and marked an individual packet with probability in proportion to its size (relative to the maximum packet size). 
In the paper's outline of further work, it was stated that no recommendation had been made on whether the queue size should be measured in bytes or packets, but noted that the difference could be significant.
When RED was recommended for general deployment in 1998 [RFC2309], the two modes were mentioned, implying that the choice between them was a question of performance, and referring to a 1997 email [pktByteEmail] for advice on tuning. A later addendum to this email introduced the insight that there are in fact two orthogonal choices:

o  whether to measure queue length in bytes or packets (Section 4.1), and

o  whether the drop probability of an individual packet should depend on its own size (Section 4.2).

The rest of this section is structured accordingly.

It is now well understood that queues for bit-congestible resources should be measured in bytes, and queues for packet-congestible resources should be measured in packets [pktByteEmail].

Congestion in some legacy bit-congestible buffers is only measured in packets not bytes. In such cases, the operator has to take into account a typical mix of packet sizes when setting the thresholds. Any AQM algorithm on such a buffer will be oversensitive to high proportions of small packets, e.g., a DoS attack, and under-sensitive to high proportions of large packets. However, there is no need to make allowances for the possibility of such a legacy in future protocol design. This is safe because any under-sensitivity during unusual traffic mixes cannot lead to congestion collapse given that the buffer will eventually revert to tail drop, which discards proportionately more large packets.
queuing and transmission are in fixed MTU-size units. Therefore, the queue length in packets is a good model of congestion of the link. o More commonly, hardware with fixed-size packet buffers transmits packets to the line without padding. This implies a hybrid forwarding system with transmission congestion dependent on the size of packets but queue congestion dependent on the number of packets, irrespective of their size. Nonetheless, there would be no queue at all unless the line had become congested -- the root cause of any congestion is too many bytes arriving for the line. Therefore, the AQM should measure the queue length as the sum of all the packet sizes in bytes that are queued up waiting to be serviced by the line, irrespective of whether each packet is held in a fixed-size buffer. In the (unlikely) first case where use of padding means the queue should be measured in packets, further confusion is likely because the fixed buffers are rarely all one size. Typically, pools of different-sized buffers are provided (Cisco uses the term 'buffer carving' for the process of dividing up memory into these pools [IOSArch]). Usually, if the pool of small buffers is exhausted, arriving small packets can borrow space in the pool of large buffers, but not vice versa. However, there is no need to consider all this complexity, because the root cause of any congestion is still line overload -- buffer consumption is only the symptom. Therefore, the length of the queue should be measured as the sum of the bytes in the queue that will be transmitted to the line, including any padding. In the (unusual) case of transmission with padding, this means the sum of the sizes of the small buffers queued plus the sum of the sizes of the large buffers queued. We will return to borrowing of fixed-size buffers when we discuss biasing the drop/marking probability of a specific packet because of its size in Section 4.2.1. 
But here, we can repeat the simple rule for how to measure the length of queues of fixed buffers: no matter how complicated the buffering scheme is, ultimately a transmission line is nearly always bit-congestible so the number of bytes queued up waiting for the line measures how congested the line is, and it is rarely important to measure how congested the buffering system is.
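The rule can be sketched as follows. This is a simplified model under stated assumptions: the pool sizes (256 B and 1,536 B) are hypothetical, each packet is placed in the smallest buffer that fits, and borrowing between pools is ignored. The point is only that the queue length is the number of bytes that will be sent to the line, which includes padding only when padding is actually transmitted.

```python
from typing import Iterable, Sequence


def buffer_size_for(packet_size: int, pool_sizes: Sequence[int]) -> int:
    """Return the smallest fixed buffer size that can hold the packet."""
    for size in sorted(pool_sizes):
        if packet_size <= size:
            return size
    raise ValueError(f"packet of {packet_size} B exceeds the largest buffer")


def queue_length_bytes(packet_sizes: Iterable[int],
                       pool_sizes: Sequence[int],
                       padded: bool) -> int:
    """Bytes queued for the line.

    Without padding, this is the sum of the packet sizes, however the
    packets are buffered. With padding, each packet occupies the line for
    the full size of its buffer, so buffer sizes are summed instead.
    """
    if padded:
        return sum(buffer_size_for(s, pool_sizes) for s in packet_sizes)
    return sum(packet_sizes)


POOLS = (256, 1536)        # hypothetical small and large buffer pools
queue = [60, 60, 1500]     # packet sizes currently queued, in bytes

print(queue_length_bytes(queue, POOLS, padded=False))  # common case
print(queue_length_bytes(queue, POOLS, padded=True))   # unusual padded case
```

Neither result depends on how many buffers are in use, which reflects the observation that buffer consumption is only a symptom of line overload.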
limited resources are usually bit-congestible if energy is primarily required for transmission rather than header processing, but it is rare for a link protocol to build a queue as it approaches maximum power.

Nonetheless, AQM algorithms do not require a queue in order to work. For instance, spectrum congestion can be modelled by signal quality using the target bit-energy-to-noise-density ratio. And, to model radio power exhaustion, transmission-power levels can be measured and compared to the maximum power available. [ECNFixedWireless] proposes a practical and theoretically sound way to combine congestion notification for different bit-congestible resources at different layers along an end-to-end path, whether wireless or wired, and whether with or without queues.

In wireless protocols that use request to send / clear to send (RTS / CTS) control, such as some variants of IEEE 802.11, it is reasonable to base an AQM on the time spent waiting for transmission opportunities (TXOPs) even though the wireless spectrum is usually regarded as congested by bits (for a given coding scheme). This is because requests for TXOPs queue up as the spectrum gets congested by all the bits being transferred. So the time that TXOPs are queued directly reflects bit congestion of the spectrum.

The email [pktByteEmail] referred to by [RFC2309] advised that most scarce resources in the Internet were bit-congestible, which is still believed to be true (Section 1.1). But it went on to offer advice that is updated by this memo. It said that drop probability should depend on the size of the packet being considered for drop if the resource is bit-congestible, but not if it is packet-congestible. The argument continued that if packet drops were inflated by packet size (byte-mode dropping), "a flow's fraction of the packet drops is then a good indication of that flow's fraction of the link bandwidth in bits per second". 
This was consistent with a referenced policing mechanism being worked on at the time for detecting unusually high bandwidth flows, eventually published in 1999 [pBox]. However, the problem could and should have been solved by making the policing mechanism count the volume of bytes randomly dropped, not the number of packets.
A few months before RFC 2309 was published, an addendum was added to the above archived email referenced from the RFC, in which the final paragraph seemed to partially retract what had previously been said. It clarified that the question of whether the probability of dropping/marking a packet should depend on its size was not related to whether the resource itself was bit-congestible, but a completely orthogonal question. However, the only example given had the queue measured in packets but packet drop depended on the size of the packet in question. No example was given the other way round.

In 2000, Cnodder et al. [REDbyte] pointed out that there was an error in the part of the original 1993 RED algorithm that aimed to distribute drops uniformly, because it didn't correctly take into account the adjustment for packet size. They recommended an algorithm called RED_4 to fix this. But they also recommended a further change, RED_5, to adjust the drop rate dependent on the square of the relative packet size. This was indeed consistent with one implied motivation behind RED's byte-mode drop -- that we should reverse engineer the network to improve the performance of dominant end-to-end congestion control mechanisms. This memo makes different recommendations in Section 2.

By 2003, a further change had been made to the adjustment for packet size, this time in the RED algorithm of the ns2 simulator. Instead of taking each packet's size relative to a 'maximum packet size', it was taken relative to a 'mean packet size', intended to be a static value representative of the 'typical' packet size on the link. We have not been able to find a justification in the literature for this change; however, Eddy and Allman conducted experiments [REDbias] that assessed how sensitive RED was to this parameter, amongst other things. This changed algorithm can often lead to drop probabilities of greater than 1 (which gives a hint that there is probably a mistake in the theory somewhere). 
On 10-Nov-2004, this variant of byte-mode packet drop was made the default in the ns2 simulator. It seems unlikely that byte-mode drop has ever been implemented in production networks (Appendix A); therefore, any conclusions based on ns2 simulations that use RED without disabling byte-mode drop are unlikely to reflect the behaviour of RED in production networks.
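The way a 'mean packet size' reference can push the drop probability above 1 can be seen with a small sketch. This is not the ns2 code; it is a simplified model that assumes the per-packet probability is the base probability multiplied by the ratio of packet size to a reference size. The 500 B mean and the 0.5 base probability are assumed values chosen for illustration.

```python
def scaled_drop_prob(base_prob: float, packet_size: int, reference_size: int) -> float:
    """Simplified byte-mode scaling: base probability times relative packet size."""
    return base_prob * packet_size / reference_size


BASE_PROB = 0.5      # assumed base drop probability under heavy congestion
PACKET_SIZE = 1500   # bytes

# Relative to the maximum packet size the ratio never exceeds 1, so the
# result cannot exceed the base probability.
relative_to_max = scaled_drop_prob(BASE_PROB, PACKET_SIZE, reference_size=1500)

# Relative to an assumed 500 B mean, a full-size packet is scaled by 3, so
# the result is no longer a valid probability.
relative_to_mean = scaled_drop_prob(BASE_PROB, PACKET_SIZE, reference_size=500)

print(relative_to_max, relative_to_mean)
```

Any packet larger than the configured mean is scaled up in this model, so the value exceeds 1 whenever the base probability is greater than the ratio of mean size to packet size.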
But also, queues with fixed-size buffers reduce the probability that small packets will be dropped if (and only if) they allow small packets to borrow buffers from the pools for larger packets (see Section 4.1.1). Borrowing effectively makes the maximum queue size for small packets greater than that for large packets, because more buffers can be used by small packets while fewer will fit large packets. Incidentally, the bias towards small packets from buffer borrowing is nothing like as large as that of RED's byte-mode drop.

Nonetheless, fixed-buffer memory with tail drop is still prone to lock out large packets, purely because of the tail-drop aspect. So, fixed-size packet buffers should be augmented with a good AQM algorithm and packet-mode drop. If an AQM is too complicated to implement with multiple fixed buffer pools, the minimum necessary to prevent large-packet lockout is to ensure that smaller packets never use the last available buffer in any of the pools for larger packets.

RFC5348]), which is called TFRC-SP [RFC4828]. Essentially, it proposes a rate equation that inflates the flow rate by the ratio of a typical TCP segment size (1,500 B including TCP header) over the actual segment size [PktSizeEquCC]. (There are also other important differences of detail relative to TFRC, such as using virtual packets [CCvarPktSize] to avoid responding to multiple losses per round trip and using a minimum inter-packet interval.)

Section 4.5.1 of the TFRC-SP specification discusses the implications of operating in an environment where queues have been configured to drop smaller packets with proportionately lower probability than larger ones. But it only discusses TCP operating in such an environment, only mentioning TFRC-SP briefly when discussing how to define fairness with TCP. And it only discusses the byte-mode dropping version of RED as it was before Cnodder et al. 
pointed out that it didn't sufficiently bias towards small packets to make TCP independent of packet size.
So the TFRC-SP specification doesn't address the issue of whether the network or the transport _should_ handle fairness between different packet sizes. In Appendix B.4 of RFC 4828, it discusses the possibility of both TFRC-SP and some network buffers duplicating each other's attempts to deliberately bias towards small packets. But the discussion is not conclusive, instead reporting simulations of many of the possibilities in order to assess performance but not recommending any particular course of action.

The paper originally proposing TFRC with virtual packets (VP-TFRC) [CCvarPktSize] proposed that there should perhaps be two variants to cater for the different variants of RED. However, as the TFRC-SP authors point out, there is no way for a transport to know whether some queues on its path have deployed RED with byte-mode packet drop (except if an exhaustive survey found that no one has deployed it! -- see Appendix A). Incidentally, VP-TFRC also proposed that byte-mode RED dropping should really square the packet-size compensation factor (like that of Cnodder's RED_5, but apparently unaware of it).

Pre-congestion notification [RFC5670] is an IETF technology to use a virtual queue for AQM marking for packets within one Diffserv class in order to give early warning prior to any real queuing. The PCN-marking algorithms have been designed not to take into account packet size when forwarding through queues. Instead, the general principle has been to take the sizes of marked packets into account when monitoring the fraction of marking at the edge of the network, as recommended here.

RFC5562] [RFC5690]. In both cases, they note that the case for these two TCP changes would be weaker if RED were biased against dropping small packets. We argue here that these two proposals are a safer and more principled way to achieve TCP performance improvements than reverse engineering RED to benefit TCP. 
Although there are no known proposals, it would also be possible and perfectly valid to make control packets robust against drop by requesting a scheduling class with lower drop probability, which would be achieved by re-marking to a Diffserv code point [RFC2474] within the same behaviour aggregate. Although not brought to the IETF, a simple proposal from Wischik [DupTCP] suggests that the first three packets of every TCP flow should be routinely duplicated after a short delay. It shows that this would greatly improve the chances of short flows completing
quickly, but it would hardly increase traffic levels on the Internet, because Internet bytes have always been concentrated in the large flows. It further shows that the performance of many typical applications depends on completion of long serial chains of short messages. It argues that, given most of the value people get from the Internet is concentrated within short flows, this simple expedient would greatly increase the value of the best-effort Internet at minimal cost. A similar but more extensive approach has been evaluated on Google servers [GentleAggro].

The proposals discussed in this sub-section are experimental approaches that are not yet in wide operational use, but they are existence proofs that transports can make themselves robust against loss of control packets. The examples are all TCP-based, but applications over non-TCP transports could mitigate loss of control packets by making similar use of Diffserv, data duplication, FEC, etc.

+-----------+-----------------+-----------------+-------------------+
| transport |  RED_1 (packet- |  RED_4 (linear  | RED_5 (square     |
|        cc |    mode drop)   | byte-mode drop) | byte-mode drop)   |
+-----------+-----------------+-----------------+-------------------+
|    TCP or |    s/sqrt(p)    |    sqrt(s/p)    |     1/sqrt(p)     |
|      TFRC |                 |                 |                   |
|   TFRC-SP |    1/sqrt(p)    |   1/sqrt(s*p)   |   1/(s*sqrt(p))   |
+-----------+-----------------+-----------------+-------------------+

Table 2: Dependence of flow bit rate per RTT on packet size, s, and drop probability, p, when there is network and/or transport bias towards small packets to varying degrees

Table 2 aims to summarise the potential effects of all the advice from different sources. Each column shows a different possible AQM behaviour in different queues in the network, using the terminology of Cnodder et al. outlined earlier (RED_1 is basic RED with packet-mode drop). Each row shows a different transport behaviour: TCP [RFC5681] and TFRC [RFC5348] on the top row with TFRC-SP [RFC4828] below. 
Each cell shows how the bits per round trip of a flow depends on packet size, s, and drop probability, p. In order to declutter the formulae to focus on packet-size dependence, they are all given per round trip, which removes any RTT term. Let us assume that the goal is for the bit rate of a flow to be independent of packet size. Suppressing all inessential details, the table shows that this should either be achievable by not altering the TCP transport in a RED_5 network, or using the small packet TFRC-SP
transport (or similar) in a network without any byte-mode dropping RED (top right and bottom left). Top left is the 'do nothing' scenario, while bottom right is the 'do both' scenario in which the bit rate would become far too biased towards small packets. Of course, if any form of byte-mode dropping RED has been deployed on a subset of queues that congest, each path through the network will present a different hybrid scenario to its transport.

Whatever the case, we can see that the linear byte-mode drop column in the middle would considerably complicate the Internet. Even if one believes the network should be doing the biasing, linear byte-mode drop is a half-way house that doesn't bias enough towards small packets. Section 2 recommends that _all_ bias in network equipment towards small packets should be turned off -- if indeed any equipment vendors have implemented it -- leaving packet-size bias solely as the preserve of the transport layer (solely the leftmost, packet-mode drop column).

In practice, it seems that no deliberate bias towards small packets has been implemented for production networks. Of the 19% of vendors who responded to a survey of 84 equipment vendors, none had implemented byte-mode drop in RED (see Appendix A for details).
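The cells of Table 2 can be reproduced with a short sketch. It rests on a simplifying assumption made for this illustration: s is the packet size relative to the maximum, RED_4 multiplies the drop probability by s, and RED_5 multiplies it by s squared. The function and dictionary names are hypothetical.

```python
import math

# Assumed factor by which each AQM variant scales drop probability p for a
# packet of relative size s (illustrative model, not vendor code).
AQM_SCALING = {
    "RED_1": lambda s: 1.0,     # packet-mode drop: no size dependence
    "RED_4": lambda s: s,       # linear byte-mode drop
    "RED_5": lambda s: s * s,   # square byte-mode drop
}


def relative_rate(transport: str, aqm: str, s: float, p: float) -> float:
    """Bits per round trip (up to a constant) for one cell of Table 2."""
    effective_p = p * AQM_SCALING[aqm](s)
    if transport == "TCP":        # TCP or TFRC: proportional to s/sqrt(p)
        return s / math.sqrt(effective_p)
    if transport == "TFRC-SP":    # proportional to 1/sqrt(p)
        return 1.0 / math.sqrt(effective_p)
    raise ValueError(f"unknown transport: {transport}")


P = 0.01
for transport in ("TCP", "TFRC-SP"):
    for aqm in ("RED_1", "RED_4", "RED_5"):
        large = relative_rate(transport, aqm, s=1.0, p=P)
        small = relative_rate(transport, aqm, s=0.04, p=P)  # 60 B vs 1,500 B
        print(f"{transport:8} {aqm}: small/large rate ratio = {small / large:.2f}")
```

In this model, only the top-right and bottom-left combinations give a ratio of 1, which corresponds to the two cells where bit rate is independent of packet size.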