Network Working Group
Request for Comments: 3819
BCP: 89
Category: Best Current Practice
July 2004

P. Karn, Ed., Qualcomm
C. Bormann, Universitaet Bremen TZI
G. Fairhurst, University of Aberdeen
D. Grossman, Motorola, Inc.
R. Ludwig, Ericsson Research
J. Mahdavi, Novell
G. Montenegro, Sun Microsystems Laboratories, Europe
J. Touch, USC/ISI
L. Wood, Cisco Systems

Advice for Internet Subnetwork Designers

Status of this Memo

This document specifies an Internet Best Current Practices for the Internet Community, and requests discussion and suggestions for improvements. Distribution of this memo is unlimited.

Copyright Notice

Copyright (C) The Internet Society (2004).
Abstract

This document provides advice to the designers of digital communication equipment, link-layer protocols, and packet-switched local networks (collectively referred to as subnetworks), who wish to support the Internet protocols but may be unfamiliar with the Internet architecture and the implications of their design choices on the performance and efficiency of the Internet.
Table of Contents

   1.  Introduction and Overview
   2.  Maximum Transmission Units (MTUs) and IP Fragmentation
       2.1.  Choosing the MTU in Slow Networks
   3.  Framing on Connection-Oriented Subnetworks
   4.  Connection-Oriented Subnetworks
   5.  Broadcasting and Discovery
   6.  Multicasting
   7.  Bandwidth on Demand (BoD) Subnets
   8.  Reliability and Error Control
       8.1.  TCP vs Link-Layer Retransmission
       8.2.  Recovery from Subnetwork Outages
       8.3.  CRCs, Checksums and Error Detection
       8.4.  How TCP Works
       8.5.  TCP Performance Characteristics
             8.5.1.  The Formulae
             8.5.2.  Assumptions
             8.5.3.  Analysis of Link-Layer Effects on TCP Performance
   9.  Quality-of-Service (QoS) Considerations
   10. Fairness vs Performance
   11. Delay Characteristics
   12. Bandwidth Asymmetries
   13. Buffering, Flow and Congestion Control
   14. Compression
   15. Packet Reordering
   16. Mobility
   17. Routing
   18. Security Considerations
   19. Contributors
   20. Informative References
   21. Contributors' Addresses
   22. Authors' Addresses
   23. Full Copyright Statement

1. Introduction and Overview

IP, the Internet Protocol [RFC791] [RFC2460], is the core protocol of the Internet. IP defines a simple "connectionless" packet-switched network. The success of the Internet is largely attributed to IP's simplicity, the "end-to-end principle" [SRC81] on which the Internet is based, and the resulting ease of carrying IP on a wide variety of subnetworks, not necessarily designed with IP in mind.

A subnetwork refers to any network operating immediately below the IP layer to connect two or more systems using IP (i.e., end hosts or routers). In its simplest form, this may be a direct connection between the IP systems (e.g., using a length of cable or a wireless medium).
This document defines a subnetwork as a Layer-2 network, which is a network that does not rely upon the services of IP routers to forward packets between parts of the subnetwork. However, IP routers may bridge frames at Layer 2 between parts of a subnetwork. Sometimes, it is convenient to aggregate a group of such subnetworks into a single logical subnetwork. IP routing protocols (e.g., OSPF, IS-IS, and PIM) can be configured to support this aggregation, but typically present a Layer-3 subnetwork rather than a Layer-2 subnetwork. This may also result in a specific packet passing several times over the same Layer-2 subnetwork via an intermediate Layer-3 gateway (router). Because that aggregation requires Layer-3 components, its issues are beyond the scope of this document.

However, while many subnetworks carry IP, they do not necessarily do so with maximum efficiency or with minimum complexity and cost, nor do they implement certain features to efficiently support newer Internet features of increasing importance, such as multicasting or quality of service.

With the explosive growth of the Internet, IP packets comprise an increasingly large fraction of the traffic carried by the world's telecommunications networks. It therefore makes sense to optimize both existing and new subnetwork technologies for IP as much as possible.

Optimizing a subnetwork for IP involves three complementary considerations:

   1. Providing functionality sufficient to carry IP.

   2. Eliminating unnecessary functions that increase cost or complexity.

   3. Choosing subnetwork parameters that maximize the performance of the Internet protocols.

Because IP is so simple, consideration 2 is more of an issue than consideration 1. That is to say, subnetwork designers make many more errors of commission than errors of omission.
However, certain enhancements to Internet features, such as multicasting and quality-of-service, benefit significantly from support given by the underlying subnetworks beyond that necessary to carry "traditional" unicast, best-effort IP.
A major consideration in the efficient design of any layered communication network is the appropriate layer(s) in which to implement a given function. This issue was first addressed in the seminal paper, "End-to-End Arguments in System Design" [SRC81]. That paper argued that many functions can be implemented properly *only* on an end-to-end basis, i.e., at the highest protocol layers, outside the subnetwork. These functions include ensuring the reliable delivery of data and the use of cryptography to provide confidentiality and message integrity.

Such functions cannot be provided solely by the concatenation of hop-by-hop services; duplicating these functions at the lower protocol layers (i.e., within the subnetwork) can be needlessly redundant or even harmful to cost and performance.

However, partial duplication of functionality in a lower layer can *sometimes* be justified by performance, security, or availability considerations. Examples include link-layer retransmission to improve the performance of an unusually lossy channel (e.g., mobile radio), link-level encryption intended to thwart traffic analysis, and redundant transmission links to improve availability, increase throughput, or guarantee performance for certain classes of traffic. Duplication of protocol functions should be done only with an understanding of system-level implications, including possible interactions with higher-layer mechanisms.

The original architecture of the Internet was influenced by the end-to-end principle [SRC81], and has been, in our view, part of the reason for the Internet's success.

The remainder of this document discusses the various subnetwork design issues that the authors consider relevant to efficient IP support.

[DIX82] (not IEEE 802.3 [IEEE8023]) header, which lacks a length
field to indicate the true data length when the packet is padded to a minimum of 60 bytes. This is not a problem for uncompressed IP because each IP packet carries its own length field. If optional header compression [RFC1144] [RFC2507] [RFC2508] [RFC3095] is used, however, the link framing is required to indicate the frame length, because the length is needed to reconstruct the original header.

In IP version 4 (the version now in widespread use), fragmentation can occur at either the sending host or in an intermediate router, and fragments can be further fragmented at subsequent routers if necessary.

In IP version 6 [RFC2460], fragmentation can occur only at the sending host; it cannot occur in a router (called "router fragmentation" in this document).

Both IPv4 and IPv6 provide a "path MTU discovery" procedure [RFC1191] [RFC1435] [RFC1981] that allows the sending host to avoid fragmentation by discovering the minimum MTU along a given path and reducing its packet sizes accordingly. This procedure is optional in IPv4 and IPv6.

Path MTU discovery is widely deployed, but it sometimes encounters problems. Some routers fail to generate the ICMP messages that convey path MTU information to the sender, and sometimes the ICMP messages are blocked by overly restrictive firewalls. The result can be a "Path MTU Black Hole" [RFC2923] [RFC1435].

The Path MTU Discovery procedure, the persistence of path MTU black holes, and the deletion of router fragmentation in IPv6 reflect a consensus of the Internet technical community that router fragmentation is best avoided. This requires that subnetworks support MTUs that are "reasonably" large. All IPv4 end hosts are required to accept and reassemble IP packets of size 576 bytes [RFC791], but such a small value would clearly be inefficient. Because IPv6 omits fragmentation by routers, [RFC2460] specifies a larger minimum MTU of 1280 bytes.
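The path MTU discovery procedure described above can be pictured with a small simulation. This is only a sketch under simplifying assumptions: the path is a plain list of link MTUs, every probe is sent with fragmentation prohibited, and the first link that is too small always reports its MTU back to the sender. The function name and the example path are hypothetical.

```python
def discover_path_mtu(hop_mtus, first_size):
    """Return (path_mtu, probes_sent) for a path given as a list of link MTUs.

    Hypothetical model: each probe is sent with fragmentation prohibited, and
    the first link that is too small always reports its MTU to the sender.
    """
    size = first_size
    probes = 0
    while True:
        probes += 1
        # Find the first link along the path that cannot carry the probe.
        too_small = next((mtu for mtu in hop_mtus if mtu < size), None)
        if too_small is None:
            return size, probes      # the probe reached the destination
        size = too_small             # shrink to the reported MTU and retry


path = [1500, 1476, 1500, 1280]      # illustrative link MTUs, in bytes
print(discover_path_mtu(path, 1500))
```

In a black hole, the report from the constraining link never arrives, so a sender following this loop would keep sending packets that are silently dropped.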
Any subnetwork with an internal packet payload smaller than 1280 bytes must implement a mechanism that performs fragmentation/reassembly of IP packets to/from subnetwork frames if it is to support IPv6. If a subnetwork cannot directly support a "reasonable" MTU with native framing mechanisms, it should internally fragment. That is, it should transparently break IP packets into internal data elements and reassemble them at the other end of the subnetwork.
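A minimal sketch of such transparent internal fragmentation follows. It assumes that the subnetwork delivers the pieces in order and without loss, and it leaves out the per-piece headers a real adaptation layer would need; the function names and the 48-byte unit size are illustrative only.

```python
def to_subnet_units(ip_packet, unit_payload):
    """Split one IP packet into pieces that each fit one subnetwork data unit."""
    return [ip_packet[i:i + unit_payload]
            for i in range(0, len(ip_packet), unit_payload)]


def from_subnet_units(pieces):
    """Rebuild the IP packet at the far end (assumes in-order, lossless delivery)."""
    return b"".join(pieces)


packet = bytes(1280)                    # a minimum-size IPv6 MTU worth of data
pieces = to_subnet_units(packet, 48)    # 48-byte internal payload, as an example
print(len(pieces), from_subnet_units(pieces) == packet)
```

The IP layer on either side sees only whole 1280-byte packets, which is what makes the fragmentation transparent.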
This leaves the question of what is a "reasonable" MTU. Ethernet (10 and 100 Mb/s) has an MTU of 1500 bytes, and because of the ubiquity of Ethernet few Internet paths currently have MTUs larger than this value. This severely limits the utility of larger MTUs provided by other subnetworks. Meanwhile, larger MTUs are increasingly desirable on high-speed subnetworks to reduce the per-packet processing overhead in host computers, and implementers are encouraged to provide them even though they may not be usable when Ethernet is also in the path. Various "tunneling" schemes, such as GRE [RFC2784] or IP Security in tunnel mode [RFC2406], treat IP as a subnetwork for IP. Since tunneling adds header overhead, it can trigger fragmentation, even when the same physical subnetworks (e.g., Ethernet) are used on both sides of the host performing IPsec encapsulation. Tunneling has made it more difficult to avoid router fragmentation and has increased the incidence of path MTU black holes [RFC2401] [RFC2923]. Larger subnetwork MTUs may help to alleviate this problem.
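The effect of tunnel overhead on a 1500-byte path can be checked with simple arithmetic. The sketch below assumes GRE over IPv4 with a 20-byte outer IPv4 header (no options) and the 4-byte base GRE header without optional fields; other tunnel types add different amounts.

```python
IPV4_HEADER = 20   # bytes, outer IPv4 header without options (assumption)
GRE_HEADER = 4     # bytes, base GRE header without optional fields (assumption)


def encapsulated_size(inner_len):
    """Size of the outer packet after GRE-over-IPv4 encapsulation."""
    return inner_len + IPV4_HEADER + GRE_HEADER


def largest_inner_packet(link_mtu):
    """Largest inner packet that still fits the link without fragmentation."""
    return link_mtu - IPV4_HEADER - GRE_HEADER


print(encapsulated_size(1500))      # exceeds a 1500-byte Ethernet MTU
print(largest_inner_packet(1500))
```

A full-sized 1500-byte packet entering the tunnel no longer fits a 1500-byte link, so it must be fragmented or the sender must learn the smaller path MTU.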
the transmission of a lower-priority packet when a higher-priority packet arrives in the queue. However, the link resources used to send the aborted packet are lost, and overall throughput will decrease.

Another way to limit delay is to implement a link-level multiplexing scheme that allows several packets to be in progress simultaneously, with transmission priority given to segments of higher-priority IP packets. For links using the Point-To-Point Protocol (PPP) [RFC1661], multi-class multilink [RFC2686] [RFC2687] [RFC2689] provides such a facility.

ATM (asynchronous transfer mode), where SNDUs are fragmented and interleaved across smaller 53-byte ATM cells, is another example of this technique. However, ATM is generally used on high-speed links where the store-and-forward delays are already minimal, and it introduces a significant (~9%) increase in overhead, because a 5-byte header is added to each 48-byte cell payload.

A third example is the Data-Over-Cable Service Interface Specification (DOCSIS) with typical upstream bandwidths of 2.56 Mb/s or 5.12 Mb/s. To reduce the impact of a 1500-byte MTU in DOCSIS 1.0 [DOCSIS1], a data link layer fragmentation mechanism is specified in DOCSIS 1.1 [DOCSIS2]. To accommodate the installed base, DOCSIS 1.1 must be backward compatible with DOCSIS 1.0 cable modems, which generally do not support fragmentation. Where DOCSIS 1.0 and DOCSIS 1.1 co-exist, the unfragmented large data packets from DOCSIS 1.0 cable modems may affect the quality of service for voice packets from DOCSIS 1.1 cable modems. In this case, it has been shown in [DOCSIS3] that the use of bandwidth allocation algorithms can mitigate this effect.

To summarize, there is a fundamental tradeoff between efficiency and latency in the design of a subnetwork, and the designer should keep this tradeoff in mind.
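The tradeoff can be made concrete with two small calculations: the time a full-sized packet occupies a slow link (and so can delay a higher-priority packet queued behind it), and the share of each ATM cell taken by its header. The link rates and sizes are the ones quoted above; the function names are illustrative.

```python
def serialization_delay_ms(packet_bytes, link_bps):
    """Time to clock one packet onto the link, in milliseconds."""
    return packet_bytes * 8 * 1000 / link_bps


def atm_header_fraction(header=5, payload=48):
    """Fraction of each ATM cell consumed by the cell header."""
    return header / (header + payload)


for rate in (2_560_000, 5_120_000):             # DOCSIS upstream rates, bits/s
    print(rate, serialization_delay_ms(1500, rate))

print(round(atm_header_fraction() * 100, 1))    # percent of each 53-byte cell
```

A 1500-byte packet holds a 2.56 Mb/s link for several milliseconds, while splitting traffic into 53-byte cells shortens that wait at the cost of roughly 9% header overhead.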
and 4. Asynchronous Transfer Mode (ATM) networks carrying an asynchronous stream of fixed-sized "cells".

The Internet community has defined packet framing methods for all these subnetworks. The Point-To-Point Protocol (PPP) [RFC1661], which uses a variant of HDLC, is applicable to bit-synchronous, octet-synchronous, and octet-asynchronous links (i.e., examples 1-3 above). PPP is one preferred framing method for IP, since a large number of systems interoperate with PPP. ATM has its own framing methods, described in [RFC2684] [RFC2364].

At high speeds, a subnetwork should provide a framed interface capable of carrying asynchronous, variable-length IP datagrams. The maximum packet size supported by this interface is discussed above in the MTU/Fragmentation section. The subnetwork may implement this facility in any convenient manner.

IP packet boundaries need not coincide with any framing or synchronization mechanisms internal to the subnetwork. When the subnetwork implements variable sized data units, the most straightforward approach is to place exactly one IP packet into each subnetwork data unit (SNDU), and to rely on the subnetwork's existing ability to delimit SNDUs to also delimit IP packets. A good example is Ethernet. However, some subnetworks have SNDUs of one or more fixed sizes, as dictated by switching, forward error correction and/or interleaving considerations. Examples of such subnetworks include ATM, with a single cell payload size of 48 octets plus a 5-octet header, and IS-95 digital cellular, with two "rate sets" of four fixed frame sizes each that may be selected on 20 millisecond boundaries.

Because IP packets are of variable length, they may not necessarily fit into an integer multiple of fixed-sized SNDUs. An "adaptation layer" is needed to convert IP packets into SNDUs while marking the boundary between each IP packet in some manner.

There are several approaches to this problem.
The first is to encode each IP packet into one or more SNDUs with no SNDU containing pieces of more than one IP packet, and to pad out the last SNDU of the packet as needed. Bits in a control header added to each SNDU indicate where the data segment belongs in the IP packet. If the subnetwork provides in-order, at-most-once delivery, the header can be as simple as a pair of bits indicating whether the SNDU is the first and/or the last in the IP packet. Alternatively, for subnetworks that do not reorder the fragments of an SNDU, only the last SNDU of the packet could be marked, as this would implicitly indicate the next SNDU as the first in a new IP packet. The AAL5 (ATM Adaptation Layer 5) scheme used with ATM is an example of this approach, though it adds other features, including a payload length field and a payload CRC.

In AAL5, the ATM User-User Indication, which is encoded in the Payload Type field of an ATM cell, indicates the last cell of a packet. The packet trailer is located at the end of the SNDU and contains the packet length and a CRC.

Another framing technique is to insert per-segment overhead to indicate the presence of a segment option. When present, the option carries a pointer to the end of the packet. This differs from AAL5 in that it permits another packet to follow within the same segment. MPEG-2 Transport Streams [EN301192] [ISO13818] support this style of fragmentation, and may either use padding (limiting each MPEG transport stream packet to carry only part of one IP packet), or allow a second IP packet to start in the same Transport Stream packet (no padding).

A third approach is to insert a special flag sequence into the data stream between each IP packet, and to pack the resulting data stream into SNDUs without regard to SNDU boundaries. This may have implications when frames are lost. The flag sequence can also pad unused space at the end of an SNDU. If the special flag appears in the user data, it is escaped to an alternate sequence (usually larger than a flag) to avoid being misinterpreted as a flag. The HDLC-based framing schemes used in PPP are all examples of this approach.

All three adaptation schemes introduce overhead; how much depends on the distribution of IP packet sizes, the size(s) of the SNDUs, and in the HDLC-like approaches, the content of the IP packet (since flag-like sequences occurring in the packet must be escaped, which expands them). The designer must also weigh implementation complexity and performance in the choice and design of an adaptation layer.
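The flag-and-escape approach can be sketched with the octet-stuffing rule used by PPP in HDLC-like framing: a 0x7E flag delimits frames, and a flag or escape octet inside the data is sent as 0x7D followed by the octet XORed with 0x20. The sketch is deliberately incomplete: it omits the address, control, and checksum fields and the escaping of control characters, and it does not handle malformed frames. The function names are illustrative.

```python
FLAG = 0x7E   # frame delimiter
ESC = 0x7D    # escape octet


def frame(payload):
    """Delimit one packet with flags, escaping flag-like octets in the data."""
    out = bytearray([FLAG])
    for octet in payload:
        if octet in (FLAG, ESC):
            out += bytes([ESC, octet ^ 0x20])   # two octets replace one
        else:
            out.append(octet)
    out.append(FLAG)
    return bytes(out)


def deframe(framed):
    """Recover the packet from one well-formed frame."""
    out = bytearray()
    octets = iter(framed[1:-1])                 # strip the two flags
    for octet in octets:
        if octet == ESC:
            octet = next(octets) ^ 0x20         # undo the escape
        out.append(octet)
    return bytes(out)


data = bytes([0x01, 0x7E, 0x7D, 0x02])
print(frame(data).hex(), deframe(frame(data)) == data)
```

The four-octet payload grows to eight octets on the wire, which illustrates why the overhead of this scheme depends on the packet contents.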
[RFC793], the Transmission Control Protocol, running atop IP on an end-to-end basis.

Connection-oriented subnetworks can be (and are widely) used to carry IP, but often with considerable complexity. Subnetworks consisting of few nodes can simply open a permanent connection between each pair of nodes. This is frequently done with ATM. However, the number of connections increases as the square of the number of nodes, so this is clearly impractical for large subnetworks. A "shim" layer between IP and the subnetwork is therefore required to manage connections. This is one of the most common functions of a Subnetwork Dependent Convergence Function (SNDCF) sublayer between IP and a subnetwork.

SNDCFs typically open subnetwork connections as needed when an IP packet is queued for transmission and close them after an idle timeout. There is no relation between subnetwork connections and any connections that may exist at higher layers (e.g., TCP).

Because Internet traffic is typically bursty and transaction-oriented, it is often difficult to pick an optimal idle timeout. If the timeout is too short, subnetwork connections are opened and closed rapidly, possibly over-stressing the subnetwork connection management system (especially if it was designed for voice traffic call holding times). If the timeout is too long, subnetwork connections are idle much of the time, wasting any resources dedicated to them by the subnetwork.

Purely connectionless subnets (such as Ethernet), which have no state and dynamically share resources, are optimal for supporting best-effort IP, which is stateless and dynamically shares resources. Connection-oriented packet networks (such as ATM and Frame Relay), which have state and dynamically share resources, are less optimal, since best-effort IP does not benefit from the overhead of creating and maintaining state. Connection-oriented circuit-switched networks (including the PSTN and ISDN) have state and statically allocate resources for a call, and thus require state creation and maintenance overhead, but do not benefit from the efficiencies of the statistically multiplexed sharing of capacity inherent in IP.

In any event, if an SNDCF that opens and closes subnet connections is used to support IP, care should be taken to make sure that connection processing in the subnet can keep up with relatively short holding times.

[MYR95], ATM).
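Two aspects of the connection-oriented discussion above lend themselves to a quick calculation: how fast a full mesh of permanent connections grows, and how the idle timeout of an SNDCF changes the number of connection setups for a given traffic pattern. The sketch below is an assumption-laden toy model with hypothetical names; the arrival times are invented for illustration, and a single destination is assumed.

```python
def full_mesh_connections(nodes):
    """Permanent connections needed to link every pair of nodes."""
    return nodes * (nodes - 1) // 2


def count_connection_setups(arrival_times, idle_timeout):
    """Count connection openings for packets to one destination.

    A connection is opened when a packet arrives and none is open; it is
    closed once no packet has been sent for longer than idle_timeout.
    """
    setups = 0
    last_packet = None
    for t in arrival_times:
        if last_packet is None or t - last_packet > idle_timeout:
            setups += 1              # previous connection already timed out
        last_packet = t
    return setups


print(full_mesh_connections(10), full_mesh_connections(1000))

arrivals = [0, 1, 2, 30, 31, 90]     # seconds; a bursty, illustrative pattern
print(count_connection_setups(arrivals, idle_timeout=10))
print(count_connection_setups(arrivals, idle_timeout=60))
```

With the short timeout the connection is set up once per burst; with the long timeout it is set up once but sits idle for most of the interval, which is the waste described above.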
Switched subnetworks handle broadcast by copying broadcast packets, providing each interface that supports one or more systems (hosts or routers) with a copy of each packet.
Several Internet protocols for IPv4 make use of broadcast capabilities, including link-layer address lookup (ARP), auto-configuration (RARP, BOOTP, DHCP), and routing (RIP).

A lack of broadcast capability can impede the performance of these protocols, or render them inoperable (e.g., DHCP). ARP-like link address lookup can be provided by a centralized database, but at the expense of potentially higher response latency and the need for nodes to have explicit knowledge of the ARP server address. Shared links should support native, link-layer subnet broadcast.

A corresponding set of IPv6 protocols uses multicasting (see next section) instead of broadcasting to provide similar functions with improved scaling in large networks.

[RFC1112] [RFC3376] [RFC2710].

Multicast is an option in IPv4, but a standard feature of IPv6. IPv4 multicast is currently used by multimedia, teleconferencing, gaming, and file distribution (web, peer-to-peer sharing) applications, as well as by some key network and host protocols (e.g., RIPv2, OSPF, NTP). IPv6 additionally relies on multicast for network configuration (DHCP-like autoconfiguration) and link-layer address discovery [RFC2461] (replacing ARP). In the case of IPv6, this can allow autoconfiguration and address discovery to span across routers, whereas the IPv4 broadcast-based services cannot without ad-hoc router support [RFC1812].

Multicast-enabled IP routers organize each multicast group into a spanning tree, and route multicast packets by making copies of each multicast packet and forwarding the copies to each output interface that includes at least one downstream member of the multicast group.

Multicasting is considerably more efficient when a subnetwork explicitly supports it. For example, a router relaying a multicast packet onto an Ethernet segment need send only one copy of the packet, no matter how many members of the multicast group are connected to the segment.
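On Ethernet, that single copy is sent to a Layer-2 multicast address derived from the IP group address. A minimal sketch of the IPv4 mapping of [RFC1112] follows: the low-order 23 bits of the group address are placed into the Ethernet address 01-00-5E-00-00-00. The function name is illustrative, and the sketch covers IPv4 only.

```python
import ipaddress

ETHERNET_MULTICAST_BASE = 0x01005E000000   # 01-00-5E-00-00-00


def ipv4_multicast_to_mac(group):
    """Map an IPv4 multicast group address to its Ethernet multicast address."""
    addr = ipaddress.IPv4Address(group)
    if not addr.is_multicast:
        raise ValueError(f"{group} is not an IPv4 multicast address")
    low23 = int(addr) & 0x7FFFFF            # keep only the low-order 23 bits
    mac = ETHERNET_MULTICAST_BASE | low23
    return ":".join(f"{(mac >> shift) & 0xFF:02x}" for shift in range(40, -1, -8))


print(ipv4_multicast_to_mac("224.1.2.3"))
# Two different groups that share the same low-order 23 bits:
print(ipv4_multicast_to_mac("224.0.0.1"), ipv4_multicast_to_mac("225.128.0.1"))
```

Because only 23 of the 28 variable bits of a group address are carried over, 32 IP groups share each Ethernet address, so a receiver may still have to discard unwanted groups in software.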
Without native multicast support, routers and switches on shared links would need to use broadcast with software filters, such that every multicast packet sent incurs software overhead for every node on the subnetwork, even if a node is not a member of the multicast group. Alternately, the router would transmit a separate copy to every member of the multicast group on the segment, as is done on multicast-incapable switched subnets.
Subnetworks using shared channels (e.g., radio LANs, Ethernets) are especially suitable for native multicasting, and their designers should make every effort to support it. This involves designating a section of the subnetwork's own address space for multicasting. On these networks, multicast is basically broadcast on the medium, with Layer-2 receiver filters.

Subnet interfaces also need to be designed to accept packets addressed to some number of multicast addresses, in addition to the unicast packets specifically addressed to them. The number of multicast addresses that needs to be supported by a host depends on the requirements of the associated host; at least several dozen will meet most current needs.

On low-speed networks, the multicast address recognition function may be readily implemented in host software, but on high-speed networks, it should be implemented in subnetwork hardware. This hardware need not be complete; for example, many Ethernet interfaces implement a "hashing" function where the IP layer receives all of the multicast (and unicast) traffic to which the associated host subscribes, plus some small fraction of multicast traffic to which the host does not subscribe. Host/router software then has to discard the unwanted packets that pass the Layer-2 multicast address filter [RFC1112].

There does not need to be a one-to-one mapping between a Layer-2 multicast address and an IP multicast address. An address overlap may significantly degrade the filtering capability of a receiver's hardware multicast address filter. A subnetwork supporting only broadcast should use this service for multicast and must rely on software filtering.

Switched subnetworks must also provide a mechanism for copying multicast packets to ensure the packets reach at least all members of a multicast group. One option is to "flood" multicast packets in the same manner as broadcast.
This can lead to unnecessary transmissions on some subnetwork links (notably non-multicast-aware Ethernet switches). Some subnetworks therefore allow multicast filter tables to control which links receive packets belonging to a specific group. To configure this automatically requires access to Layer-3 group membership information (e.g., IGMP [RFC3376], or MLD [RFC2710]). Various implementation options currently exist to provide a subnet node with a list of mappings of multicast addresses to ports/interfaces. These employ a range of approaches, including signaling from end hosts (e.g., IEEE 802 GARP/GMRP [802.1p]), signaling from switches (e.g., CGMP [CGMP] and RGMP [RFC3488]), interception and proxy of IP group membership packets (e.g., IGMP/MLD Proxy [MAGMA-PROXY]), and enabling Layer-2 devices to snoop/inspect/peek into forwarded Layer-3 protocol headers (e.g., IGMP, MLD, PIM) so that they may infer Layer-3 multicast group membership [MAGMA-SNOOP]. These approaches differ in their complexity, flexibility, and ability to support new protocols.

[AR02]). One item that has been studied is TCP's retransmission timer [KY02]. BoD
systems can cause spurious timeouts when adjusting from a relatively high data rate to a relatively low data rate. In this case, TCP's transmitted data takes longer to get through the network than predicted by the TCP sender's computed retransmission timeout. Therefore, the TCP sender is prone to resending a segment prematurely.
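The effect can be illustrated with the standard TCP retransmission timer calculation (smoothed round-trip time plus four times its variation, as specified in RFC 6298). The sketch below is a simplification: it ignores clock granularity and timer backoff, the 1-second floor is the RFC 6298 recommendation, and the round-trip times are invented for illustration. The class and function names are hypothetical.

```python
class RtoEstimator:
    """Simplified retransmission timeout estimator (no clock granularity)."""

    ALPHA = 1 / 8   # gain for the smoothed RTT
    BETA = 1 / 4    # gain for the RTT variation

    def __init__(self, min_rto=1.0):
        self.srtt = None
        self.rttvar = None
        self.min_rto = min_rto

    def update(self, rtt):
        """Feed one round-trip time sample, in seconds."""
        if self.srtt is None:
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            # RTTVAR is updated first, using the previous SRTT.
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - rtt)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt

    @property
    def rto(self):
        """Current timeout; valid only after at least one sample."""
        return max(self.min_rto, self.srtt + 4 * self.rttvar)


def fires_spuriously(estimator, actual_rtt):
    """True if the timer expires before the acknowledgment can arrive."""
    return actual_rtt > estimator.rto


est = RtoEstimator()
for _ in range(20):
    est.update(0.2)                   # steady 200 ms RTT at the high data rate

print(est.rto)
print(fires_spuriously(est, 1.6))     # RTT after an assumed eightfold rate drop
```

After a long run of stable samples the variation term decays and the timeout settles at its floor, so a sudden eightfold increase in transit time exceeds it even though no packet was lost.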