Network Working Group P. Karn, Ed.
Request for Comments: 3819 Qualcomm
BCP: 89 C. Bormann
Category: Best Current Practice Universitaet Bremen TZI
University of Aberdeen
Sun Microsystems Laboratories, Europe
July 2004
Advice for Internet Subnetwork Designers
Status of this Memo
This document specifies an Internet Best Current Practices for the
Internet Community, and requests discussion and suggestions for
improvements. Distribution of this memo is unlimited.
Copyright (C) The Internet Society (2004).
This document provides advice to the designers of digital
communication equipment, link-layer protocols, and packet-switched
local networks (collectively referred to as subnetworks), who wish to
support the Internet protocols but may be unfamiliar with the
Internet architecture and the implications of their design choices on
the performance and efficiency of the Internet.
This document defines a subnetwork as a layer 2 network, which is a
network that does not rely upon the services of IP routers to forward
packets between parts of the subnetwork. However, IP routers may
bridge frames at Layer 2 between parts of a subnetwork. Sometimes,
it is convenient to aggregate a group of such subnetworks into a
single logical subnetwork. IP routing protocols (e.g., OSPF, IS-IS,
and PIM) can be configured to support this aggregation, but typically
present a layer-3 subnetwork rather than a layer-2 subnetwork. This
may also result in a specific packet passing several times over the
same layer-2 subnetwork via an intermediate layer-3 gateway (router).
Because that aggregation requires layer-3 components, issues thereof
are beyond the scope of this document.
However, while many subnetworks carry IP, they do not necessarily do
so with maximum efficiency or minimum complexity or cost, nor do they
implement certain features needed to efficiently support newer
Internet capabilities of increasing importance, such as multicasting
or quality of service.
With the explosive growth of the Internet, IP packets comprise an
increasingly large fraction of the traffic carried by the world's
telecommunications networks. It therefore makes sense to optimize
both existing and new subnetwork technologies for IP as much as
possible.
Optimizing a subnetwork for IP involves three complementary
considerations:
1. Providing functionality sufficient to carry IP.
2. Eliminating unnecessary functions that increase cost or
complexity.
3. Choosing subnetwork parameters that maximize the performance of
the Internet protocols.
Because IP is so simple, consideration 2 is more of an issue than
consideration 1. That is to say, subnetwork designers make many more
errors of commission than errors of omission. However, certain
enhancements to Internet features, such as multicasting and quality-
of-service, benefit significantly from support given by the
underlying subnetworks beyond that necessary to carry "traditional"
unicast, best-effort IP.
A major consideration in the efficient design of any layered
communication network is the appropriate layer(s) in which to
implement a given function. This issue was first addressed in the
seminal paper, "End-to-End Arguments in System Design" [SRC81]. That
paper argued that many functions can be implemented properly *only*
on an end-to-end basis, i.e., at the highest protocol layers, outside
the subnetwork. These functions include ensuring the reliable
delivery of data and the use of cryptography to provide
confidentiality and message integrity.
Such functions cannot be provided solely by the concatenation of
hop-by-hop services; duplicating these functions at the lower
protocol layers (i.e., within the subnetwork) can be needlessly
redundant or even harmful to cost and performance.
However, partial duplication of functionality in a lower layer can
*sometimes* be justified by performance, security, or availability
considerations. Examples include link-layer retransmission to
improve the performance of an unusually lossy channel, e.g., mobile
radio, link-level encryption intended to thwart traffic analysis, and
redundant transmission links to improve availability, increase
throughput, or to guarantee performance for certain classes of
traffic. Duplication of protocol functions should be done only with
an understanding of system-level implications, including possible
interactions with higher-layer mechanisms.
The original architecture of the Internet was influenced by the
end-to-end principle [SRC81], which has been, in our view, part of
the reason for the Internet's success.
The remainder of this document discusses the various subnetwork
design issues that the authors consider relevant to efficient IP
support.
2. Maximum Transmission Units (MTUs) and IP Fragmentation
IPv4 packets (datagrams) vary in size, from 20 bytes (the size of the
IPv4 header alone) to a maximum of 65535 bytes. Subnetworks need not
support maximum-sized (64KB) IP packets, as IP provides a scheme that
breaks packets that are too large for a given subnetwork into
fragments that travel as independent IP packets and are reassembled
at the destination. The maximum packet size supported by a
subnetwork is known as its Maximum Transmission Unit (MTU).
Subnetworks may, but are not required to, indicate the length of each
packet they carry. One example is Ethernet with the widely used DIX
[DIX82] (not IEEE 802.3 [IEEE8023]) header, which lacks a length
field to indicate the true data length when the packet is padded to a
minimum of 60 bytes. This is not a problem for uncompressed IP
because each IP packet carries its own length field.
If optional header compression [RFC1144] [RFC2507] [RFC2508]
[RFC3095] is used, however, it is required that the link framing
indicate frame length because that is needed for the reconstruction
of the original header.
In IP version 4 (the version now in widespread use), fragmentation
can occur at either the sending host or in an intermediate router,
and fragments can be further fragmented at subsequent routers if
necessary.
In IP version 6 [RFC2460], fragmentation can occur only at the
sending host; it cannot occur in a router (called "router
fragmentation" in this document).
Both IPv4 and IPv6 provide a "path MTU discovery" procedure [RFC1191]
[RFC1435] [RFC1981] that allows the sending host to avoid
fragmentation by discovering the minimum MTU along a given path and
reducing its packet sizes accordingly. This procedure is optional in
both IPv4 and IPv6.
Path MTU discovery is widely deployed, but it sometimes encounters
problems. Some routers fail to generate the ICMP messages that
convey path MTU information to the sender, and sometimes the ICMP
messages are blocked by overly restrictive firewalls. The result can
be a "Path MTU Black Hole" [RFC2923] [RFC1435].
The Path MTU Discovery procedure, the persistence of path MTU black
holes, and the deletion of router fragmentation in IPv6 reflect a
consensus of the Internet technical community that router
fragmentation is best avoided. This requires that subnetworks
support MTUs that are "reasonably" large. All IPv4 end hosts are
required to accept and reassemble IP packets of size 576 bytes
[RFC791], but such a small value would clearly be inefficient.
Because IPv6 omits fragmentation by routers, [RFC2460] specifies a
larger minimum MTU of 1280 bytes. Any subnetwork with an internal
packet payload smaller than 1280 bytes must implement a mechanism
that performs fragmentation/reassembly of IP packets to/from
subnetwork frames if it is to support IPv6.
If a subnetwork cannot directly support a "reasonable" MTU with
native framing mechanisms, it should internally fragment. That is,
it should transparently break IP packets into internal data elements
and reassemble them at the other end of the subnetwork.
This leaves the question of what is a "reasonable" MTU. Ethernet (10
and 100 Mb/s) has an MTU of 1500 bytes, and because of the ubiquity
of Ethernet few Internet paths currently have MTUs larger than this
value. This severely limits the utility of larger MTUs provided by
other subnetworks. Meanwhile, larger MTUs are increasingly desirable
on high-speed subnetworks to reduce the per-packet processing
overhead in host computers, and implementers are encouraged to
provide them even though they may not be usable when Ethernet is also
in the path.
Various "tunneling" schemes, such as GRE [RFC2784] or IP Security in
tunnel mode [RFC2406], treat IP as a subnetwork for IP. Since
tunneling adds header overhead, it can trigger fragmentation, even
when the same physical subnetworks (e.g., Ethernet) are used on both
sides of the host performing IPsec encapsulation. Tunneling has made
it more difficult to avoid router fragmentation and has increased the
incidence of path MTU black holes [RFC2401] [RFC2923]. Larger
subnetwork MTUs may help to alleviate this problem.
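As a rough illustration of the header overhead involved, the
following Python sketch computes the largest inner packet that can
cross a tunnel without fragmentation. It assumes a 20-byte outer
IPv4 header without options and the basic 4-byte GRE header with no
optional fields; the function names are illustrative and are not
taken from any specification.

```python
IPV4_HEADER = 20   # bytes; assumed outer IPv4 header without options
GRE_HEADER = 4     # bytes; assumed basic GRE header, no optional fields


def inner_mtu(link_mtu: int, encap_overhead: int) -> int:
    """Largest inner packet that fits in one outer packet."""
    return link_mtu - encap_overhead


def needs_fragmentation(inner_len: int, link_mtu: int,
                        encap_overhead: int) -> bool:
    """True if the encapsulated packet would exceed the link MTU."""
    return inner_len + encap_overhead > link_mtu


GRE_OVERHEAD = IPV4_HEADER + GRE_HEADER

if __name__ == "__main__":
    # A full-sized 1500-byte packet no longer fits once encapsulated.
    print(inner_mtu(1500, GRE_OVERHEAD))                  # 1476
    print(needs_fragmentation(1500, 1500, GRE_OVERHEAD))  # True
```

A host that sends 1500-byte packets into such a tunnel therefore
either relies on Path MTU discovery to learn the smaller value or
causes fragmentation at the encapsulating node.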
2.1. Choosing the MTU in Slow Networks
In slow networks, the largest possible packet may take a considerable
amount of time to send. This is known as channelisation or
serialisation delay. Total end-to-end interactive response time
should not exceed the well-known human factors limit of 100 to 200
ms. This includes all sources of delay: electromagnetic propagation
delay, queuing delay, and the store-and-forward (serialisation) time,
i.e., the time to transmit a packet at link speed.
At low link speeds, store-and-forward delays can dominate total
end-to-end delay; these are in turn directly influenced by the
maximum transmission unit (MTU) size. Even when an interactive
packet is given a higher queuing priority, it may have to wait for a
large bulk transfer packet to finish transmission. This worst-case
wait can be set by an appropriate choice of MTU.
For example, if the MTU is set to 1500 bytes, then an MTU-sized
packet will take about 8 milliseconds to send on a T1 (1.536 Mb/s)
link. But if the link speed is 19.2 kb/s, then the transmission time
becomes 625 ms -- well above our 100-200 ms limit. A 256-byte MTU
would lower this delay to a little over 100 ms. However, care should
be taken not to lower the MTU excessively, as this will increase
header overhead and trigger frequent router fragmentation (if Path
MTU discovery is not in use). This is likely to be the case with
multicast, where Path MTU discovery is ineffective.
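The arithmetic behind these figures can be reproduced with a short
Python sketch. The function names are illustrative; the calculation
counts only the time to clock the packet onto the link and ignores
link-layer framing overhead.

```python
def serialization_delay_ms(packet_bytes: int, link_bps: int) -> float:
    """Time to transmit one packet at link speed, in milliseconds."""
    return packet_bytes * 8 / link_bps * 1000.0


def largest_mtu_for_delay(link_bps: int, max_delay_ms: int) -> int:
    """Largest packet (bytes) whose transmission time stays within the bound."""
    return link_bps * max_delay_ms // 8000


if __name__ == "__main__":
    for mtu, bps in ((1500, 1_536_000), (1500, 19_200), (256, 19_200)):
        print(mtu, bps, round(serialization_delay_ms(mtu, bps), 1))
```

Under these assumptions, a 19.2 kb/s link needs an MTU of no more
than 240 bytes to keep the worst-case wait behind a single bulk
packet within 100 ms.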
One way to limit delay for interactive traffic without imposing a
small MTU is to give priority to this traffic and to preempt (abort)
the transmission of a lower-priority packet when a higher priority
packet arrives in the queue. However, the link resources used to
send the aborted packet are lost, and overall throughput will
decrease.
Another way to limit delay is to implement a link-level multiplexing
scheme that allows several packets to be in progress simultaneously,
with transmission priority given to segments of higher-priority IP
packets. For links using the Point-To-Point Protocol (PPP)
[RFC1661], multi-class multilink [RFC2686] [RFC2687] [RFC2689]
provides such a facility.
ATM (asynchronous transfer mode), where SNDUs are fragmented and
interleaved across smaller 53-byte ATM cells, is another example of
this technique. However, ATM is generally used on high-speed links
where the store-and-forward delays are already minimal, and it
introduces a significant (~9%) increase in overhead because a 5-byte
header is added to each 48-byte cell payload.
A third example is the Data-Over-Cable Service Interface
Specification (DOCSIS) with typical upstream bandwidths of 2.56 Mb/s
or 5.12 Mb/s. To reduce the impact of a 1500-byte MTU in DOCSIS 1.0
[DOCSIS1], a data link layer fragmentation mechanism is specified in
DOCSIS 1.1 [DOCSIS2]. To accommodate the installed base, DOCSIS 1.1
must be backward compatible with DOCSIS 1.0 cable modems, which
generally do not support fragmentation. Under the co-existence of
DOCSIS 1.0 and DOCSIS 1.1, the unfragmented large data packets from
DOCSIS 1.0 cable modems may affect the quality of service for voice
packets from DOCSIS 1.1 cable modems. In this case, it has been
shown in [DOCSIS3] that the use of bandwidth allocation algorithms
can mitigate this effect.
To summarize, there is a fundamental tradeoff between efficiency and
latency in the design of a subnetwork, and the designer should keep
this tradeoff in mind.
3. Framing on Non-Packet Networks
IP requires that subnetworks mark the beginning and end of each
variable-length, asynchronous IP packet. Some examples of links and
subnetworks that do not provide this as an intrinsic feature include:
1. leased lines carrying a synchronous bit stream;
2. ISDN B-channels carrying a synchronous octet stream;
3. dialup telephone modems carrying an asynchronous octet stream;
4. Asynchronous Transfer Mode (ATM) networks carrying an
asynchronous stream of fixed-sized "cells".
The Internet community has defined packet framing methods for all
these subnetworks. The Point-To-Point Protocol (PPP) [RFC1661],
which uses a variant of HDLC, is applicable to bit-synchronous,
octet-synchronous, and octet-asynchronous links (i.e., examples 1-3
above). PPP is one preferred framing method for IP, since a large
number of systems interoperate with PPP. ATM has its own framing
methods, described in [RFC2684] [RFC2364].
At high speeds, a subnetwork should provide a framed interface
capable of carrying asynchronous, variable-length IP datagrams. The
maximum packet size supported by this interface is discussed above in
the MTU/Fragmentation section. The subnetwork may implement this
facility in any convenient manner.
IP packet boundaries need not coincide with any framing or
synchronization mechanisms internal to the subnetwork. When the
subnetwork implements variable sized data units, the most
straightforward approach is to place exactly one IP packet into each
subnetwork data unit (SNDU), and to rely on the subnetwork's existing
ability to delimit SNDUs to also delimit IP packets. A good example
is Ethernet. However, some subnetworks have SNDUs of one or more
fixed sizes, as dictated by switching, forward error correction
and/or interleaving considerations. Examples of such subnetworks
include ATM, with a single cell payload size of 48 octets plus a
5-octet header, and IS-95 digital cellular, with two "rate sets" of
four fixed frame sizes each that may be selected on 20 millisecond
boundaries.
Because IP packets are of variable length, they may not necessarily
fit into an integer multiple of fixed-sized SNDUs. An "adaptation
layer" is needed to convert IP packets into SNDUs while marking the
boundary between each IP packet in some manner.
There are several approaches to this problem. The first is to encode
each IP packet into one or more SNDUs with no SNDU containing pieces
of more than one IP packet, and to pad out the last SNDU of the
packet as needed. Bits in a control header added to each SNDU
indicate where the data segment belongs in the IP packet. If the
subnetwork provides in-order, at-most-once delivery, the header can
be as simple as a pair of bits indicating whether the SNDU is the
first and/or the last in the IP packet. Alternatively, for
subnetworks that do not reorder the fragments of an SNDU, only the
last SNDU of the packet could be marked, as this would implicitly
indicate the next SNDU as the first in a new IP packet. The AAL5
(ATM Adaptation Layer 5) scheme used with ATM is an example of this
approach, though it adds other features, including a payload length
field and a payload CRC.
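A minimal Python sketch of this first approach follows. It assumes
in-order delivery, carries a first/last flag pair with each
fixed-size SNDU, and pads the final SNDU with zeros. Because the
padding hides the true length, the reassembler here trims each
packet using the IPv4 Total Length field (octets 2-3 of the header).
The function names and the tuple representation of an SNDU are
illustrative only.

```python
import struct


def segment(packet: bytes, sndu_payload: int):
    """Split one IP packet into (first, last, payload) SNDUs of fixed size."""
    chunks = [packet[i:i + sndu_payload]
              for i in range(0, len(packet), sndu_payload)]
    sndus = []
    for i, chunk in enumerate(chunks):
        first = i == 0
        last = i == len(chunks) - 1
        # Pad the final SNDU so that every SNDU has the same size.
        sndus.append((first, last, chunk.ljust(sndu_payload, b"\x00")))
    return sndus


def reassemble(sndus):
    """Rebuild IP packets from an in-order sequence of SNDUs."""
    packets = []
    buf = None
    for first, last, payload in sndus:
        if first:
            buf = bytearray()
        if buf is None:
            continue  # start of packet was lost; skip to the next "first"
        buf += payload
        if last:
            # The IPv4 Total Length field removes the padding.
            (total_len,) = struct.unpack_from("!H", buf, 2)
            packets.append(bytes(buf[:total_len]))
            buf = None
    return packets
```

The sketch assumes an SNDU payload of at least 4 octets, so that the
Total Length field always lies within the reassembled data.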
In AAL5, the ATM User-User Indication, which is encoded in the
Payload Type field of an ATM cell, indicates the last cell of a
packet. The packet trailer is located at the end of the SNDU and
contains the packet length and a CRC.
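The cost of this padding and per-cell header can be estimated with a
short Python sketch. It assumes the 8-octet AAL5 trailer defined in
the AAL5 specification (a value not stated in this document) and
ignores any additional encapsulation header placed before the IP
packet; the function names are illustrative.

```python
CELL_PAYLOAD = 48   # octets of payload per ATM cell
CELL_HEADER = 5     # octets of header per ATM cell
AAL5_TRAILER = 8    # octets; assumed trailer size (includes length and CRC)


def aal5_cells(packet_len: int) -> int:
    """Cells needed for one packet plus trailer, rounding up to whole cells."""
    return -(-(packet_len + AAL5_TRAILER) // CELL_PAYLOAD)


def aal5_wire_octets(packet_len: int) -> int:
    """Total octets sent on the link, including cell headers and padding."""
    return aal5_cells(packet_len) * (CELL_PAYLOAD + CELL_HEADER)


def aal5_efficiency(packet_len: int) -> float:
    """Fraction of transmitted octets that belong to the IP packet."""
    return packet_len / aal5_wire_octets(packet_len)
```

Under these assumptions a 40-octet packet fits exactly into one
cell, while a 41-octet packet needs two, which roughly halves the
efficiency for that packet.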
Another framing technique is to insert per-segment overhead to
indicate the presence of a segment option. When present, the option
carries a pointer to the end of the packet. This differs from AAL5
in that it permits another packet to follow within the same segment.
MPEG-2 Transport Streams [EN301192] [ISO13818] support this style of
fragmentation, and may either use padding (limiting each MPEG
transport stream packet to carry only part of one IP packet), or
allow a second IP packet to start in the same Transport Stream packet.
A third approach is to insert a special flag sequence into the data
stream between each IP packet, and to pack the resulting data stream
into SNDUs without regard to SNDU boundaries. This may have
implications when frames are lost. The flag sequence can also pad
unused space at the end of an SNDU. If the special flag appears in
the user data, it is escaped to an alternate sequence (usually larger
than a flag) to avoid being misinterpreted as a flag. The HDLC-based
framing schemes used in PPP are all examples of this approach.
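The following Python sketch illustrates the flag-and-escape
technique using the octet values of PPP's octet-stuffed HDLC-like
framing (flag 0x7E, escape 0x7D, escaped octets XORed with 0x20).
It is a simplification: it omits the frame check sequence, the
address and control fields, and the optional escaping of control
characters, and the function names are illustrative.

```python
FLAG = 0x7E      # frame delimiter
ESCAPE = 0x7D    # escape octet
XOR_MASK = 0x20  # applied to an escaped octet


def stuff(payload: bytes) -> bytes:
    """Frame one packet: escape flag-like octets, add delimiting flags."""
    out = bytearray([FLAG])
    for octet in payload:
        if octet in (FLAG, ESCAPE):
            out += bytes([ESCAPE, octet ^ XOR_MASK])  # two octets for one
        else:
            out.append(octet)
    out.append(FLAG)
    return bytes(out)


def unstuff(stream: bytes):
    """Recover the packets from a stream of flag-delimited frames."""
    frames = []
    current = bytearray()
    escaped = False
    for octet in stream:
        if octet == FLAG:
            if current:
                frames.append(bytes(current))
            current = bytearray()
            escaped = False
        elif octet == ESCAPE:
            escaped = True
        else:
            current.append(octet ^ XOR_MASK if escaped else octet)
            escaped = False
    return frames
```

Each escaped octet occupies two octets on the link, which is why the
overhead of this approach depends on the packet contents.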
All three adaptation schemes introduce overhead; how much depends on
the distribution of IP packet sizes, the size(s) of the SNDUs, and in
the HDLC-like approaches, the content of the IP packet (since flag-
like sequences occurring in the packet must be escaped, which expands
them). The designer must also weigh implementation complexity and
performance in the choice and design of an adaptation layer.
4. Connection-Oriented Subnetworks
IP has no notion of a "connection"; it is a purely connectionless
protocol. When a connection is required by an application, it is
usually provided by TCP [RFC793], the Transmission Control Protocol,
running atop IP on an end-to-end basis.
Connection-oriented subnetworks can be (and are widely) used to carry
IP, but often with considerable complexity. Subnetworks consisting
of few nodes can simply open a permanent connection between each pair
of nodes. This is frequently done with ATM. However, the number of
connections increases as the square of the number of nodes, so this
is clearly impractical for large subnetworks. A "shim" layer between
IP and the subnetwork is therefore required to manage connections.
This is one of the most common functions of a Subnetwork Dependent
Convergence Function (SNDCF) sublayer between IP and a subnetwork.
SNDCFs typically open subnetwork connections as needed when an IP
packet is queued for transmission and close them after an idle
timeout. There is no relation between subnetwork connections and any
connections that may exist at higher layers (e.g., TCP).
Because Internet traffic is typically bursty and transaction-
oriented, it is often difficult to pick an optimal idle timeout. If
the timeout is too short, subnetwork connections are opened and
closed rapidly, possibly over-stressing the subnetwork connection
management system (especially if it was designed for voice traffic
call holding times). If the timeout is too long, subnetwork
connections are idle much of the time, wasting any resources
dedicated to them by the subnetwork.
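The effect of the idle timeout can be explored with a small Python
sketch of an on-demand connection manager. The class, its methods,
and the use of a caller-supplied clock value are all illustrative
assumptions; a real SNDCF would also signal the subnetwork to set up
and tear down each connection.

```python
class OnDemandConnections:
    """Opens a connection when a packet is queued; closes it when idle."""

    def __init__(self, idle_timeout: float):
        self.idle_timeout = idle_timeout
        self.last_used = {}   # next hop -> time the last packet was sent
        self.opens = 0        # number of connection setups performed

    def expire(self, now: float) -> None:
        """Close every connection that has been idle for the timeout."""
        idle = [hop for hop, t in self.last_used.items()
                if now - t >= self.idle_timeout]
        for hop in idle:
            del self.last_used[hop]   # stand-in for connection teardown

    def send(self, next_hop: str, now: float) -> None:
        """Send one packet, opening a connection first if none exists."""
        self.expire(now)
        if next_hop not in self.last_used:
            self.opens += 1           # stand-in for connection setup
        self.last_used[next_hop] = now


def count_opens(send_times, idle_timeout: float) -> int:
    """Number of connection setups for a burst pattern to one peer."""
    mgr = OnDemandConnections(idle_timeout)
    for t in send_times:
        mgr.send("peer", t)
    return mgr.opens
```

For packets sent at times 0, 1, 2, 10, and 11, a timeout of 0.5
causes a setup for every packet, a timeout of 5 causes two setups,
and a timeout of 20 causes one setup but holds the connection idle
for most of the interval.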
Purely connectionless subnets (such as Ethernet), which have no state
and dynamically share resources, are optimal for supporting best-
effort IP, which is stateless and dynamically shares resources.
Connection-oriented packet networks (such as ATM and Frame Relay),
which have state and dynamically share resources, are less optimal,
since best-effort IP does not benefit from the overhead of creating
and maintaining state. Connection-oriented circuit-switched networks
(including the PSTN and ISDN) have state and statically allocate
resources for a call, and thus require state creation and maintenance
overhead, but do not benefit from the efficiencies of the statistical
sharing of capacity inherent in IP.
In any event, if an SNDCF that opens and closes subnet connections is
used to support IP, care should be taken to make sure that connection
processing in the subnet can keep up with relatively short holding
times.
5. Broadcasting and Discovery
Subnetworks fall into two categories: point-to-point and shared. A
point-to-point subnet has exactly two endpoint components (hosts or
routers); a shared link has more than two endpoint components, using
either an inherently broadcast medium (e.g., Ethernet, radio) or a
switching layer hidden from the network layer (e.g., switched
Ethernet, Myrinet [MYR95], ATM). Switched subnetworks handle
broadcast by copying broadcast packets, providing each interface that
supports one or more systems (hosts or routers) with a copy of each
packet.
Several Internet protocols for IPv4 make use of broadcast
capabilities, including link-layer address lookup (ARP), auto-
configuration (RARP, BOOTP, DHCP), and routing (RIP).
A lack of broadcast capability can impede the performance of these
protocols, or render them inoperable (e.g., DHCP). ARP-like link
address lookup can be provided by a centralized database, but at the
expense of potentially higher response latency and the need for nodes
to have explicit knowledge of the ARP server address. Shared links
should support native, link-layer subnet broadcast.
A corresponding set of IPv6 protocols uses multicasting (see next
section) instead of broadcasting to provide similar functions with
improved scaling in large networks.
6. Multicasting
The Internet model includes "multicasting", where IP packets are sent
to all the members of a multicast group [RFC1112] [RFC3376]
[RFC2710]. Multicast is an option in IPv4, but a standard feature of
IPv6. IPv4 multicast is currently used by multimedia,
teleconferencing, gaming, and file distribution (web, peer-to-peer
sharing) applications, as well as by some key network and host
protocols (e.g., RIPv2, OSPF, NTP). IPv6 additionally relies on
multicast for network configuration (DHCP-like autoconfiguration) and
link-layer address discovery [RFC2461] (replacing ARP). In the case
of IPv6, this can allow autoconfiguration and address discovery to
span across routers, whereas the IPv4 broadcast-based services cannot
without ad-hoc router support [RFC1812].
Multicast-enabled IP routers organize each multicast group into a
spanning tree, and route multicast packets by making copies of each
multicast packet and forwarding the copies to each output interface
that includes at least one downstream member of the multicast group.
Multicasting is considerably more efficient when a subnetwork
explicitly supports it. For example, a router relaying a multicast
packet onto an Ethernet segment need send only one copy of the
packet, no matter how many members of the multicast group are
connected to the segment. Without native multicast support, routers
and switches on shared links would need to use broadcast with
software filters, such that every multicast packet sent incurs
software overhead for every node on the subnetwork, even if a node is
not a member of the multicast group. Alternatively, the router would
transmit a separate copy to every member of the multicast group on
the segment, as is done on multicast-incapable switched subnets.
Subnetworks using shared channels (e.g., radio LANs, Ethernets) are
especially suitable for native multicasting, and their designers
should make every effort to support it. This involves designating a
section of the subnetwork's own address space for multicasting. On
these networks, multicast is basically broadcast on the medium, with
Layer-2 receiver filters.
Subnet interfaces also need to be designed to accept packets
addressed to some number of multicast addresses, in addition to the
unicast packets specifically addressed to them. The number of
multicast addresses that needs to be supported by a host depends on
the requirements of the associated host; at least several dozen will
meet most current needs.
On low-speed networks, the multicast address recognition function may
be readily implemented in host software, but on high-speed networks,
it should be implemented in subnetwork hardware. This hardware need
not be complete; for example, many Ethernet interfaces implement a
"hashing" function where the IP layer receives all of the multicast
(and unicast) traffic to which the associated host subscribes, plus
some small fraction of multicast traffic to which the host does not
subscribe. Host/router software then has to discard the unwanted
packets that pass the Layer-2 multicast address filter [RFC1112].
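The division of labour between an imperfect hardware filter and
exact software filtering can be sketched in Python as follows. The
64-entry table and the use of zlib.crc32 to select a table entry are
assumptions made for illustration; actual interfaces differ in table
size and in how the hash is derived from the destination address.

```python
import zlib


class HashedMulticastFilter:
    """Imperfect hash filter in 'hardware', exact filter in 'software'."""

    def __init__(self, table_bits: int = 64):
        self.table_bits = table_bits
        self.table = 0          # one bit per hash bucket
        self.joined = set()     # exact list kept by host software

    def bucket(self, mac: bytes) -> int:
        # Hypothetical hash: CRC-32 of the address, reduced to a bucket.
        return zlib.crc32(mac) % self.table_bits

    def join(self, mac: bytes) -> None:
        self.joined.add(mac)
        self.table |= 1 << self.bucket(mac)

    def hardware_accepts(self, mac: bytes) -> bool:
        """True if the frame passes the hash filter (may be a false match)."""
        return bool((self.table >> self.bucket(mac)) & 1)

    def deliver(self, mac: bytes) -> bool:
        """True only if the frame passes both hardware and software checks."""
        return self.hardware_accepts(mac) and mac in self.joined
```

Any address that hashes to the same bucket as a joined address
passes the hardware check and has to be discarded in software, which
is the small fraction of unwanted traffic described above.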
There does not need to be a one-to-one mapping between a Layer-2
multicast address and an IP multicast address. An address overlap
may significantly degrade the filtering capability of a receiver's
hardware multicast address filter. A subnetwork supporting only
broadcast should use this service for multicast and must rely on
software filtering.
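Ethernet provides a concrete case of such an address overlap. The
IPv4 mapping defined in [RFC1112] copies only the low-order 23 bits
of the group address into the Ethernet multicast address, so 32 IP
groups share each Layer-2 address. The Python sketch below shows
the mapping; the function name is illustrative.

```python
import ipaddress


def ipv4_multicast_to_mac(group: str) -> str:
    """Map an IPv4 multicast group to its Ethernet multicast address."""
    addr = ipaddress.IPv4Address(group)
    if not addr.is_multicast:
        raise ValueError(f"{group} is not an IPv4 multicast address")
    low23 = int(addr) & 0x7FFFFF            # keep the low-order 23 bits
    mac = (0x01005E << 24) | low23          # prefix 01-00-5E, high bit zero
    return ":".join(f"{(mac >> shift) & 0xFF:02x}"
                    for shift in range(40, -8, -8))


if __name__ == "__main__":
    for group in ("224.1.1.1", "225.1.1.1", "224.129.1.1"):
        print(group, ipv4_multicast_to_mac(group))
```

All three groups in the usage example map to 01:00:5e:01:01:01, so a
host that joins only one of them still receives frames for the
others and must discard them above Layer 2.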
Switched subnetworks must also provide a mechanism for copying
multicast packets to ensure the packets reach at least all members of
a multicast group. One option is to "flood" multicast packets in the
same manner as broadcast. This can lead to unnecessary transmissions
on some subnetwork links (notably non-multicast-aware Ethernet
switches). Some subnetworks therefore allow multicast filter tables
to control which links receive packets belonging to a specific group.
To configure this automatically requires access to Layer-3 group
membership information (e.g., IGMP [RFC3376], or MLD [RFC2710]).
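A minimal Python sketch of such a filter table follows. It assumes
that membership reports and leaves have already been extracted from
the Layer-3 signalling, and it floods any group with no known
members. The class and method names are illustrative, and a real
switch would also need to keep forwarding multicast traffic toward
multicast routers.

```python
class MulticastFilterTable:
    """Per-group list of ports; unknown groups are flooded like broadcast."""

    def __init__(self, ports):
        self.ports = set(ports)
        self.members = {}   # group address -> set of member ports

    def report(self, group: str, port: int) -> None:
        """Record that a member of the group is reachable via the port."""
        self.members.setdefault(group, set()).add(port)

    def leave(self, group: str, port: int) -> None:
        """Remove the port; forget the group once no members remain."""
        ports = self.members.get(group)
        if ports is None:
            return
        ports.discard(port)
        if not ports:
            del self.members[group]

    def output_ports(self, group: str, in_port: int):
        """Ports that should receive a copy of a packet for the group."""
        targets = self.members.get(group, self.ports)
        return sorted(targets - {in_port})
```

Flooding groups with no known members is a conservative choice made
for this sketch: it wastes capacity but does not drop traffic for
members the switch has not yet learned about.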
Various implementation options currently exist to provide a subnet
node with a list of mappings of multicast addresses to
ports/interfaces. These employ a range of approaches, including
signaling from end hosts (e.g., IEEE 802 GARP/GMRP [802.1p]),
signaling from switches (e.g., CGMP [CGMP] and RGMP [RFC3488]),
interception and proxy of IP group membership packets (e.g., IGMP/MLD
Proxy [MAGMA-PROXY]), and enabling Layer-2 devices to
snoop/inspect/peek into forwarded Layer-3 protocol headers (e.g.,
IGMP, MLD, PIM) so that they may infer Layer-3 multicast group
membership [MAGMA-SNOOP]. These approaches differ in their
complexity, flexibility, and ability to support new protocols.
7. Bandwidth on Demand (BoD) Subnets
Some subnets allow a number of subnet nodes to share a channel
efficiently by assigning transmission opportunities dynamically.
Transmission opportunities are requested by a subnet node when it has
packets to send. The subnet schedules and grants transmission
opportunities sufficient to allow the transmitting subnet node to
send one or more packets (or packet fragments). We call these
subnets Bandwidth on Demand (BoD) subnets. Examples of BoD subnets
include Demand Assignment Multiple Access (DAMA) satellite and
terrestrial wireless networks, IEEE 802.11 point coordination
function (PCF) mode, and DOCSIS. A connection-oriented network (such
as the PSTN, ATM or Frame Relay) reserves resources on a much longer
timescale, and is therefore not a BoD subnet in our taxonomy.
The design parameters for BoD are similar to those in connection-
oriented subnetworks, although the implementations may vary
significantly. In BoD, the user typically requests access to the
shared channel for some duration. Access may be allocated for a
period of time at a specific rate, for a certain number of packets,
or until the user releases the channel. Access may be coordinated
through a central management entity or with a distributed algorithm
amongst the users. Examples of the resource that may be shared
include a terrestrial wireless hop, an upstream channel in a cable
television system, a satellite uplink, and an end-to-end satellite
channel.
Long-delay BoD subnets pose problems similar to connection-oriented
subnets in anticipating traffic. While connection-oriented subnets
hold idle channels open expecting new data to arrive, BoD subnets
request channel access based on buffer occupancy (or expected buffer
occupancy) on the sending port. Poor performance will likely result
if the sender does not anticipate additional traffic arriving at that
port during the time it takes to grant a transmission request. It is
recommended that the algorithm have the capability to extend a hold
on the channel for data that has arrived after the original request
was generated (this may be done by piggybacking new requests on user
data).
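The piggybacking idea can be sketched in Python as follows. The
sender tracks how much of its buffer is already covered by
outstanding requests and, whenever it transmits in a granted
opportunity, attaches a request for any data that arrived in the
meantime. The class and its byte-count interface are hypothetical
and do not correspond to any particular BoD protocol.

```python
class BodSender:
    """Tracks buffer occupancy and the part already requested."""

    def __init__(self):
        self.queued = 0      # bytes waiting in the send buffer
        self.requested = 0   # bytes covered by outstanding requests

    def enqueue(self, nbytes: int) -> None:
        self.queued += nbytes

    def new_request(self) -> int:
        """Bytes to request now: whatever is queued but not yet requested."""
        uncovered = self.queued - self.requested
        self.requested = self.queued
        return uncovered

    def on_grant(self, granted: int):
        """Use a grant; return (bytes sent, piggybacked request size)."""
        sent = min(granted, self.queued)
        self.queued -= sent
        self.requested = max(0, self.requested - sent)
        # Data that arrived during the grant delay is requested here,
        # without waiting for a separate request opportunity.
        return sent, self.new_request()
```

In this sketch, 500 bytes that arrive while a 1000-byte request is
pending are requested in the same transmission that uses the first
grant, rather than after a further request cycle.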
There is a wide variety of BoD protocols available. However, there
has been relatively little comprehensive research on the interactions
between BoD mechanisms and Internet protocol performance. Research
on some specific mechanisms is available (e.g., [AR02]). One item
that has been studied is TCP's retransmission timer [KY02]. BoD
systems can cause spurious timeouts when adjusting from a relatively
high data rate to a relatively low data rate. In this case, TCP's
transmitted data takes longer to get through the network than
predicted by the TCP sender's computed retransmission timeout.
Therefore, the TCP sender is prone to resending a segment