Network Working Group G. Huston Request for Comments: 3221 Internet Architecture Board Category: Informational December 2001 Commentary on Inter-Domain Routing in the Internet Status of this Memo This memo provides information for the Internet community. It does not specify an Internet standard of any kind. Distribution of this memo is unlimited. Copyright Notice Copyright (C) The Internet Society (2001). All Rights Reserved.
AbstractThis document examines the various longer term trends visible within the characteristics of the Internet's BGP table and identifies a number of operational practices and protocol factors that contribute to these trends. The potential impacts of these practices and protocol properties on the scaling properties of the inter-domain routing space are examined. This document is the outcome of a collaborative exercise on the part of the Internet Architecture Board. 1. Introduction................................................. 2 2. Network Scaling and Inter-Domain Routing ................... 2 3. Measurements of the total size of the BGP Table ............ 4 4. Related Measurements derived from BGP Table ................ 7 5. Current State of inter-AS routing in the Internet .......... 11 6. Future Requirements for the Exterior Routing System ........ 14 7. Architectural Approaches to a scalable Exterior Routing Protocol........................................... 15 8. Directions for Further Activity ............................ 21 9. Security Considerations .................................... 22 10. References ................................................. 23 11. Acknowledgements ........................................... 24 12. Author's Address ........................................... 24 13. Full Copyright Statement ................................... 25
This network-wide connectivity is described in the routing table used by the BGP4 protocol (referred to as the Routing Information Base, or RIB). Each entry in the table refers to a distinct route. The attributes of the route, together with local policy constraints, are used to determine the best path from the local AS to the AS that is originating the route. Determining the 'best path' in this case is determining which routing advertisement and associated next hop address is the most preferred by the local AS. Within each local BGP-speaking router this preferred route is then loaded into the local RIB (Loc-RIB). This information is coupled with information obtained from the local instance of the interior routing protocol to form a Forwarding Information Base (or FIB), for use by the local router's forwarding engine. The BGP routing system is not aware of finer level of topology of the network on a link-by-link basis within the local AS or within any remote AS. From this perspective BGP can be seen as an inter-AS connectivity maintenance protocol, as distinct from a link-level topology management protocol, and the BGP routing table can be viewed as a description of the current connectivity of the Internet using an AS as the basic element of connectivity computation. There is an associated dimension of policy determination within the routing table. If an AS advertises a route to a neighboring AS, the local AS is offering to accept traffic from the neighboring AS which is ultimately destined to addresses described by the advertised routing entry. If the local AS does not originate the route, then the inference is that the local AS is willing to undertake the role of transit provider for this traffic on behalf of some third party. Similarly, an AS may or may not choose to accept a route from a neighbor. Accepting a route implies that under some circumstances, as determined by the local route selection parameters, the local AS will use the neighboring AS to reach addresses spanned by the route. The BGP routing domain is intended to maintain a coherent view of the connectivity of the inter-AS domain, where connectivity is expressed as a preference for 'shortest paths' to reach any destination address as modulated by the connectivity policies expressed by each AS, and coherence is expressed as a global constraint that none of the paths contains loops or dead ends. The elements of the BGP routing domain are routing entries, expressed as a span of addresses. All addresses advertised within each routing entry share a common origin AS and a common connectivity policy. The total size of the BGP table is therefore a metric of the number of distinct routes within the Internet, where each route describes a contiguous set of addresses that share a common origin AS and a common reachability policy.
When the scaling properties of the Internet were studied in the early 1990s two critical factors identified in the study were, not surprisingly, routing and addressing . As more devices connect to the Internet they consume addresses, and the associated function of maintaining reachability information for these addresses, with an assumption of an associated growth in the number of distinct provider networks and the number of distinct connectivity policies, implies ever larger routing tables. The work in studying the limitations of the 32 bit IPv4 address space produced a number of outcomes, including the specification of IPv6 , as well as the refinement of techniques of network address translation  intended to allow some degree of transparent interaction between two networks using different address realms. Growth in the routing system is not directly addressed by these approaches, as the routing space is the cross product of the complexity of the inter-AS topology of the network, multiplied by the number of distinct connectivity policies multiplied by the degree of fragmentation of the address space. For example, use of NAT may reduce the pressure on the number of public addresses required by a single connected network, but it does not necessarily imply that the network's connectivity policies can be subsumed within the aggregated policy of a single upstream provider. When an AS advertises a block of addresses into the exterior routing space this entry is generally carried across the entire exterior routing domain of the Internet. To measure the common characteristics of the global routing table, it is necessary to establish a point in the default-free part of the exterior routing domain and examine the BGP routing table that is visible at that point. 5]. This effort was resumed in 1994 by Erik-Jan Bos at Surfnet in the Netherlands, who commenced measuring the size of the BGP table at hourly intervals in 1994. This measurement technique was adopted by the author in 1997, using a measurement point located at the edge of AS 1221 at Telstra in Australia, again using an hourly interval for the measurement. The initial measurements were of the number of routing entries contained within the set of selected best paths. These measurements were expanded to include the number of AS numbers, number of AS paths, and a set of measurements relating to the prefix size of routing table entries.
This data contains a view of the dynamics of the Internet's routing table growth that spans some 13 years in total and includes a very detailed view spanning the most recent seven years . Looking at just the total size of the BGP routing table over this period, it is possible to identify four distinct phases of inter-AS routing practice in the Internet. 5]. A concerted effort was undertaken in 1994 and 1995 to deploy CIDR routing in the Internet, based on encouraging deployment of the CIDR-capable version of the BGP protocol, BGP4 . The intention of CIDR was one of hierarchical provider address aggregation, where a network provider was allocated an address block from an address registry, and the provider announced this entire block into the exterior routing domain as a single entry with a single routing policy. Customers of the provider were encouraged to use a sub-allocation from the provider's address block, and these smaller routing elements were aggregated by the provider and not directly passed into the exterior routing domain. During 1994 the
size of the routing table remained relatively constant at some 20,000 entries as the growth in the number of providers announcing address blocks was matched by a corresponding reduction in the number of address announcements as a result of CIDR aggregation.
BGP table size data for the first half of 2001 shows different trends at various measurement points in the Internet. Some measurement points where the local AS has a relative larger number of more specific routes show a steady state for the first half of 2001 with no appreciable growth, while other measurement points where the local AS has had a lower number of more specific routes initially show a continuation of table size growth. There are a number of commonly observed discontinuities in the data for 2001, corresponding to events where a significant number of more specific entries have been replaced by an encompassing aggregate prefix. 8] While the protocol modifications are relatively straightforward, the major responsibility rests with the operations community to devise a transition plan that will allow gradual transition into this larger AS number space.
9] This explanation also supports the observation of smaller address fragments supporting distinct policies in the BGP table, as such small address blocks may encompass arbitrarily large networks located behind one or more NAT gateways. There are alternative explanations of this difference between the growth of the table and the growth of address space, including a trend towards discrete exterior routing policies being applied to finer address blocks. 11]. This trend towards finer-grained entries in the routing table is potentially cause for concern, as it implies the increasing spread
of traffic over greater numbers of increasingly smaller forwarding table entries. This, in turn, has implications for the design of high speed core routers, particularly when extensive use is made of a small number of very high speed cached forwarding entries within the switching subsystem of a router's design. A similar observation can be made regarding the number of addresses advertised per AS. In December 1999 each AS advertised an average of 161,900 addresses (equivalent to a prefix length /14.69, and in January 2001 this average has fallen to 115,800 addresses, an equivalent prefix length of /15.18. This points to increasingly finer levels of routing detail being announced into the global routing domain. This, in turn, supports the observation that the efficiencies of hierarchical routing structures are no longer being fully realized within the deployed Internet. Instead, increasingly finer levels of routing detail are being announced globally in the BGP tables. The most likely cause of this trend of finer levels of routing granularity is an increasingly dense interconnection mesh, where more networks are moving from a single-homed connection with hierarchical addressing and routing into multi-homed connections without any hierarchical structure. The spur for this increasingly dense connectivity mesh in the Internet may well be the declining unit costs of communications bearer services coupled with a common perception that richer sets of adjacencies yields greater levels of service resilience.
damping, nevertheless there is still a very high level of announcements and withdrawals of these entries in this particular area of the routing table when viewed using a perspective of route updates per prefix length. Given that the numbers of these small prefixes are growing rapidly, there is cause for some concern that the total level of BGP flux, in terms of the number of announcements and withdrawals per second may be increasing, despite the pressures from flap damping. This concern is coupled with the observation that, in terms of BGP stability under scaling pressure, it is not the absolute size of the BGP table that is of prime importance, but the rate of dynamic path re-computations that occur in the wake of announcements and withdrawals. Withdrawals are of particular concern due to the number of transient intermediate states that the BGP distance vector algorithm explores in processing a withdrawal. Current experimental observations indicate a typical convergence time of some 2 minutes to propagate a route withdrawal across the BGP domain.  An increase in the density of the BGP mesh, coupled with an increase in the rate of such dynamic changes, does have serious implications in maintaining the overall stability of the BGP system as it continues to grow. The registry allocation policies also have had some impact on the routing table prefix distribution. The original registry practice was to use a minimum allocation unit of a /19, and the 10,000 prefix entries in the /17 to /19 range are a consequence of this policy decision. More recently, the allocation policy now allows for a minimum allocation unit of a /20 prefix, and the /20 prefix is used by some 4,300 entries as of January 2001, and in relative terms is one of the fastest growing prefix sets. The number of entries corresponding to very small address blocks (smaller than a /24), while small in number as a proportion of the total BGP routing table, is the fastest growing in relative terms. The number of /25 through /32 prefixes in the routing table is growing faster, in terms of percentage change, than any other area of the routing table. If prefix length filtering were in widespread use, the practice of announcing a very small address block with a distinct routing policy would have no particular beneficial outcome, as the address block would not be passed throughout the global BGP routing domain and the propagation of the associated policy would be limited in scope. The growth of the number of these small address blocks, and the diversity of AS paths associated with these routing entries, points to a relatively limited use of prefix length filtering in today's Internet. In the absence of any corrective pressure in the form of widespread adoption of prefix length filtering, the very rapid growth of global announcements of very small address blocks is likely to continue. In percentage terms, the set of prefixes spanning /25 to /32 show the largest growth rates.
It may be appropriate to consider how to operate an Internet with a BGP routing table that has millions of small entries, rather than the expectation of a hierarchical routing space with at most tens of thousands of larger entries in the global routing table.
In the distance vector protocol a routing node gathers information from its neighbors, applies local policy to this information and then distributes this updated information to its neighbors. In this model the nature of the local policy applied to the routing information is not necessarily visible to the node's neighbors, and the process of converting received route advertisements into advertised route advertisements uses a local policy process whose policy rules are not visible externally. This scenario can be described as 'policy opaque'. The side effect of such an environment is that a third party cannot remotely compute which routes a network may accept and which may be re-advertised to each neighbor. In link state protocols a routing node effectively broadcasts its local adjacencies, and the policies it has with respect to these adjacencies, to all nodes within the link state domain. Every node can perform an identical computation upon this set of adjacencies and associated policies in order to compute the local forwarding table. The essential attribute of this environment is that the routing node has to announce its routing policies, in order to allow a remote node to compute which routes will be accepted from which neighbor, and which routes will be advertised to each neighbor and what, if any, attributes are placed on the advertisement. Within an interior routing domain the local policies are in effect metrics of each link and these polices can be announced within the routing domain without any consequent impact. In the exterior routing domain it is not the case that interconnection policies between networks are always fully transparent. Various permutations of supplier / customer relationships and peering relationships have associated policy qualifications that are not publicly announced for business competitive reasons. The current diversity of interconnection arrangements appears to be predicated on policy opaqueness, and to mandate a change to a model of open interconnection policies may be contrary to operational business imperatives. An inter-domain routing tool should be able to support models of interconnection where the policy associated with the interconnection is not visible to any third party. If the architectural choice is a constrained one between distance vector and link state, then this consideration would appear to favor the continued use of a distance vector approach to inter-domain routing. This choice, in turn, has implications on the convergence properties and stability of the inter-domain routing environment. If there is a broader spectrum of choice, the considerations of policy-opaqueness would still apply.
point). If explicit traffic engineering were undertaken within the inter-domain space, it is highly likely that the current structure would be altered. Instead of the downstream element attempting to constrain the path choices of an upstream element, a probable approach is the downstream element placing a number of advisory constraints on the upstream elements, and the upstream elements using a combination of these advisory constraints, dynamic information relating to path service characteristics and local policies to make an egress choice. From the perspective of the inter-domain routing environment, such measures offer the potential to remove the advertisement of specific routes for traffic engineering purposes. However, there is a need to adding traffic engineering information into advertised route blocks, requiring the definition of the syntax and semantics of traffic engineering attributes that can be attached to route objects.
section 6 of this document, one approach to the longer term requirements may be to preserve a number of attributes of the current BGP protocol, while refine other aspects of the protocol to improve its scaling and convergence properties. A minimal set of alterations could retain the Autonomous System concept to allow for boundaries of information summarization, as well as retaining the approach of associating each prefix advertisement with an originating AS. The concept of policy opaqueness would also be retained in such an approach, implying that each AS accepts a set of route advertisements, applies local policy constraints, and re-advertises those advertisements permitted by the local policy constraints. It could be feasible to consider alterations to the distance vector path selection algorithm, particularly as it relates to intermediate states during processing of a route withdrawal. It is also feasible to consider the use of compound route attributes, allowing a route object to include an
aggregate route, and a number of specifics of the aggregate route, and attach attributes that may apply to the aggregate or a specific address prefix. Such route attributes could be used to support multi-homing and inter-domain traffic engineering mechanisms. The overall intent of this approach is to address the major requirements in the inter-domain routing space without using an increasing set of globally propagated specific route objects. A potential applied research topic is to consider the feasibility of de-coupling the requirements of inter-domain connectivity management with the applications of policy constraints and the issues of sender- and/or receiver-managed traffic engineering requirements. Such an approach may use a link-state protocol as a means of maintaining a consistent view of the topology of inter-domain network, and then use some form of overlay protocol to negotiate policy requirements of each AS, and use a further overlay to support inter-domain traffic engineering requirements. The underlying assumption of such an approach is that by dividing up the functional role of inter-domain routing into distinct components each component will have superior scaling and convergence properties which in turn to result in superior properties for the entire routing system. Obviously, this assumption requires some testing. Research topics with potential longer term application include the approach of drawing a distinction between a network's identity, a network's location relative to other networks, and a feasible path between a source and destination network that satisfies various policy and traffic engineering constraints. Again the intent of such an approach would be to divide the current routing function into a number of distinct scalable components.
A protocol design should therefore consider how to minimize the damage to the overall routing computation that can be caused by a single or small set of misbehaving routers. The routing system itself needs to be resilient against accidental or malicious advertisements of a route object by a route server not entitled to generate such an advertisement. This implies several things, including the need for cryptographic validation of announcements, cryptographic protection of various critical routing messages and an accurate and trusted database of routing assignments via which authorization can be checked.  Bradner, S., "The Internet Standards Process -- Revision 3", BCP 9, RFC 2026, October 1996.  Clark, D., Chapin, L., Cerf, V., Braden, R. and R. Hobby, "Towards the Future Internet Architecture", RFC 1287, December 1991.  Deering, S. and R. Hinden, "Internet Protocol, Version 6 (IPv6) Specification, RFC 2460, December 1998.  Srisuresh, P. and K. Egevang, "Traditional IP Network Address Translator (Traditional NAT)", RFC 3022, January 2001.  Fuller, V., Li, T., Yu, J. and K. Varadhan, "Classless Inter- Domain Routing (CIDR): an Address Assignment and Aggregation Strategy", RFC 1519, September 1993.  Huston, G., "The BGP Routing Table", The Internet Protocol Journal, vol. 4, No. 1, March 2001.  Rekhter, Y. and T. Li, "A Border Gateway Protocol 4 (BGP-4)", RFC 1771, March 1995.  Vohara, Q. and E. Chen, "BGP support for four-octet AS number space", Work in Progress.  Hain, T., "Architectural Implications of NAT", RFC 2993, November 2000.  Labovitz, C., Ahuja, A., Bose, A. and J. Jahanian, "Delayed Internet Routing Convergence", Proceedings ACM SIGCOMM 2000, August 2000.
 Lothberg, P., personal communication, December 2000.
Acknowledgement Funding for the RFC Editor function is currently provided by the Internet Society.