Internet Engineering Task Force (IETF) A. Sajassi, Ed. Request for Comments: 8365 Cisco Category: Standards Track J. Drake, Ed. ISSN: 2070-1721 Juniper N. Bitar Nokia R. Shekhar Juniper J. Uttaro AT&T W. Henderickx Nokia March 2018 A Network Virtualization Overlay Solution Using Ethernet VPN (EVPN)
AbstractThis document specifies how Ethernet VPN (EVPN) can be used as a Network Virtualization Overlay (NVO) solution and explores the various tunnel encapsulation options over IP and their impact on the EVPN control plane and procedures. In particular, the following encapsulation options are analyzed: Virtual Extensible LAN (VXLAN), Network Virtualization using Generic Routing Encapsulation (NVGRE), and MPLS over GRE. This specification is also applicable to Generic Network Virtualization Encapsulation (GENEVE); however, some incremental work is required, which will be covered in a separate document. This document also specifies new multihoming procedures for split-horizon filtering and mass withdrawal. It also specifies EVPN route constructions for VXLAN/NVGRE encapsulations and Autonomous System Border Router (ASBR) procedures for multihoming of Network Virtualization Edge (NVE) devices. Status of This Memo This is an Internet Standards Track document. This document is a product of the Internet Engineering Task Force (IETF). It represents the consensus of the IETF community. It has received public review and has been approved for publication by the Internet Engineering Steering Group (IESG). Further information on Internet Standards is available in Section 2 of RFC 7841. Information about the current status of this document, any errata, and how to provide feedback on it may be obtained at https://www.rfc-editor.org/info/rfc8365.
Copyright Notice Copyright (c) 2018 IETF Trust and the persons identified as the document authors. All rights reserved. This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.
1. Introduction ....................................................4 2. Requirements Notation and Conventions ...........................5 3. Terminology .....................................................5 4. EVPN Features ...................................................7 5. Encapsulation Options for EVPN Overlays .........................8 5.1. VXLAN/NVGRE Encapsulation ..................................8 5.1.1. Virtual Identifiers Scope ...........................9 5.1.2. Virtual Identifiers to EVI Mapping .................11 5.1.3. Constructing EVPN BGP Routes .......................13 5.2. MPLS over GRE .............................................15 6. EVPN with Multiple Data-Plane Encapsulations ...................15 7. Single-Homing NVEs - NVE Residing in Hypervisor ................16 7.1. Impact on EVPN BGP Routes & Attributes for VXLAN/NVGRE ....16 7.2. Impact on EVPN Procedures for VXLAN/NVGRE Encapsulations ..17 8. Multihoming NVEs - NVE Residing in ToR Switch ..................18 8.1. EVPN Multihoming Features .................................18 8.1.1. Multihomed ES Auto-Discovery .......................18 8.1.2. Fast Convergence and Mass Withdrawal ...............18 8.1.3. Split-Horizon ......................................19 8.1.4. Aliasing and Backup Path ...........................19 8.1.5. DF Election ........................................20 8.2. Impact on EVPN BGP Routes and Attributes ..................20 8.3. Impact on EVPN Procedures .................................20 8.3.1. Split Horizon ......................................21 8.3.2. Aliasing and Backup Path ...........................22 8.3.3. Unknown Unicast Traffic Designation ................22 9. Support for Multicast ..........................................23 10. Data-Center Interconnections (DCIs) ...........................24 10.1. DCI Using GWs ............................................24 10.2. DCI Using ASBRs ..........................................24 10.2.1. ASBR Functionality with Single-Homing NVEs ........25 10.2.2. ASBR Functionality with Multihoming NVEs ..........26 11. Security Considerations .......................................28 12. IANA Considerations ...........................................29 13. References ....................................................29 13.1. Normative References .....................................29 13.2. Informative References ...................................30 Acknowledgements ..................................................32 Contributors ......................................................32 Authors' Addresses ................................................33
RFC7432] can be used as a Network Virtualization Overlay (NVO) solution and explores the various tunnel encapsulation options over IP and their impact on the EVPN control plane and procedures. In particular, the following encapsulation options are analyzed: Virtual Extensible LAN (VXLAN) [RFC7348], Network Virtualization using Generic Routing Encapsulation (NVGRE) [RFC7637], and MPLS over Generic Routing Encapsulation (GRE) [RFC4023]. This specification is also applicable to Generic Network Virtualization Encapsulation (GENEVE) [GENEVE]; however, some incremental work is required, which will be covered in a separate document [EVPN-GENEVE]. This document also specifies new multihoming procedures for split-horizon filtering and mass withdrawal. It also specifies EVPN route constructions for VXLAN/NVGRE encapsulations and Autonomous System Border Router (ASBR) procedures for multihoming of Network Virtualization Edge (NVE) devices. In the context of this document, an NVO is a solution to address the requirements of a multi-tenant data center, especially one with virtualized hosts, e.g., Virtual Machines (VMs) or virtual workloads. The key requirements of such a solution, as described in [RFC7364], are the following: - Isolation of network traffic per tenant - Support for a large number of tenants (tens or hundreds of thousands) - Extension of Layer 2 (L2) connectivity among different VMs belonging to a given tenant segment (subnet) across different Points of Delivery (PoDs) within a data center or between different data centers - Allowing a given VM to move between different physical points of attachment within a given L2 segment The underlay network for NVO solutions is assumed to provide IP connectivity between NVO endpoints.
This document describes how EVPN can be used as an NVO solution and explores applicability of EVPN functions and procedures. In particular, it describes the various tunnel encapsulation options for EVPN over IP and their impact on the EVPN control plane as well as procedures for two main scenarios: (a) single-homing NVEs - when an NVE resides in the hypervisor, and (b) multihoming NVEs - when an NVE resides in a Top-of-Rack (ToR) device. The possible encapsulation options for EVPN overlays that are analyzed in this document are: - VXLAN and NVGRE - MPLS over GRE Before getting into the description of the different encapsulation options for EVPN over IP, it is important to highlight the EVPN solution's main features, how those features are currently supported, and any impact that the encapsulation has on those features. RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here. RFC7432] and [RFC7365]. VXLAN: Virtual Extensible LAN GRE: Generic Routing Encapsulation NVGRE: Network Virtualization using Generic Routing Encapsulation GENEVE: Generic Network Virtualization Encapsulation PoD: Point of Delivery NV: Network Virtualization
NVO: Network Virtualization Overlay NVE: Network Virtualization Edge VNI: VXLAN Network Identifier VSID: Virtual Subnet Identifier (for NVGRE) I-SID: Service Instance Identifier EVPN: Ethernet VPN EVI: EVPN Instance. An EVPN instance spanning the Provider Edge (PE) devices participating in that EVPN MAC-VRF: A Virtual Routing and Forwarding table for Media Access Control (MAC) addresses on a PE IP-VRF: A Virtual Routing and Forwarding table for Internet Protocol (IP) addresses on a PE ES: Ethernet Segment. When a customer site (device or network) is connected to one or more PEs via a set of Ethernet links, then that set of links is referred to as an 'Ethernet segment'. Ethernet Segment Identifier (ESI): A unique non-zero identifier that identifies an Ethernet segment is called an 'Ethernet Segment Identifier'. Ethernet Tag: An Ethernet tag identifies a particular broadcast domain, e.g., a VLAN. An EVPN instance consists of one or more broadcast domains. PE: Provider Edge Single-Active Redundancy Mode: When only a single PE, among all the PEs attached to an ES, is allowed to forward traffic to/from that ES for a given VLAN, then the Ethernet segment is defined to be operating in Single-Active redundancy mode. All-Active Redundancy Mode: When all PEs attached to an Ethernet segment are allowed to forward known unicast traffic to/from that ES for a given VLAN, then the ES is defined to be operating in All-Active redundancy mode. PIM-SM: Protocol Independent Multicast - Sparse-Mode
PIM-SSM: Protocol Independent Multicast - Source-Specific Multicast BIDIR-PIM: Bidirectional PIM RFC7432] was originally designed to support the requirements detailed in [RFC7209] and therefore has the following attributes which directly address control-plane scaling and ease of deployment issues. 1. Control-plane information is distributed with BGP and broadcast and multicast traffic is sent using a shared multicast tree or with ingress replication. 2. Control-plane learning is used for MAC (and IP) addresses instead of data-plane learning. The latter requires the flooding of unknown unicast and Address Resolution Protocol (ARP) frames; whereas, the former does not require any flooding. 3. Route Reflector (RR) is used to reduce a full mesh of BGP sessions among PE devices to a single BGP session between a PE and the RR. Furthermore, RR hierarchy can be leveraged to scale the number of BGP routes on the RR. 4. Auto-discovery via BGP is used to discover PE devices participating in a given VPN, PE devices participating in a given redundancy group, tunnel encapsulation types, multicast tunnel types, multicast members, etc. 5. All-Active multihoming is used. This allows a given Customer Edge (CE) device to have multiple links to multiple PEs, and traffic to/from that CE fully utilizes all of these links. 6. When a link between a CE and a PE fails, the PEs for that EVI are notified of the failure via the withdrawal of a single EVPN route. This allows those PEs to remove the withdrawing PE as a next hop for every MAC address associated with the failed link. This is termed "mass withdrawal". 7. BGP route filtering and constrained route distribution are leveraged to ensure that the control-plane traffic for a given EVI is only distributed to the PEs in that EVI.
8. When an IEEE 802.1Q [IEEE.802.1Q] interface is used between a CE and a PE, each of the VLAN IDs (VIDs) on that interface can be mapped onto a bridge table (for up to 4094 such bridge tables). All these bridge tables may be mapped onto a single MAC-VRF (in case of VLAN-aware bundle service). 9. VM Mobility mechanisms ensure that all PEs in a given EVI know the ES with which a given VM, as identified by its MAC and IP addresses, is currently associated. 10. RTs are used to allow the operator (or customer) to define a spectrum of logical network topologies including mesh, hub and spoke, and extranets (e.g., a VPN whose sites are owned by different enterprises), without the need for proprietary software or the aid of other virtual or physical devices. Because the design goal for NVO is millions of instances per common physical infrastructure, the scaling properties of the control plane for NVO are extremely important. EVPN and the extensions described herein, are designed with this level of scalability in mind. RFC7348]. In this scenario, the ingress VTEP does not include an inner VLAN tag on the encapsulated frame, and the egress VTEP discards the frames with an inner VLAN tag. This mode of operation in [RFC7348] maps to VLAN-Based Service in [RFC7432], where a tenant VID gets mapped to an EVI.
VXLAN also provides an option of including an inner VLAN tag in the encapsulated frame, if explicitly configured at the VTEP. This mode of operation can map to VLAN Bundle Service in [RFC7432] because all the tenant's tagged frames map to a single bridge table / MAC-VRF, and the inner VLAN tag is not used for lookup by the disposition PE when performing VXLAN decapsulation as described in Section 6 of [RFC7348]. [RFC7637] encapsulation is based on GRE encapsulation, and it mandates the inclusion of the optional GRE Key field, which carries the VSID. There is a one-to-one mapping between the VSID and the tenant VID, as described in [RFC7637]. The inclusion of an inner VLAN tag is prohibited. This mode of operation in [RFC7637] maps to VLAN Based Service in [RFC7432]. As described in the next section, there is no change to the encoding of EVPN routes to support VXLAN or NVGRE encapsulation, except for the use of the BGP Encapsulation Extended Community to indicate the encapsulation type (e.g., VXLAN or NVGRE). However, there is potential impact to the EVPN procedures depending on where the NVE is located (i.e., in hypervisor or ToR) and whether multihoming capabilities are required. Figure 1. Assume there are three network operators: one for each of the DC1, DC2, and WAN networks. The Gateways at the edge of the data centers are responsible for translating the VNIs between the values used in each of the DCNs and the values used in the WAN.
+--------------+ | | +---------+ | WAN | +---------+ +----+ | +---+ +----+ +----+ +---+ | +----+ |NVE1|--| | | |WAN | |WAN | | | |--|NVE3| +----+ |IP |GW |--|Edge| |Edge|--|GW | IP | +----+ +----+ |Fabric +---+ +----+ +----+ +---+ Fabric | +----+ |NVE2|--| | | | | |--|NVE4| +----+ +---------+ +--------------+ +---------+ +----+ |<------ DC 1 ------> <------ DC2 ------>| Figure 1: Data-Center Interconnect with Gateway Section 10.2. +--------------+ | | +---------+ | WAN | +---------+ +----+ | | +----+ +----+ | | +----+ |NVE1|--| | |ASBR| |ASBR| | |--|NVE3| +----+ |IP Fabric|---| | | |--|IP Fabric| +----+ +----+ | | +----+ +----+ | | +----+ |NVE2|--| | | | | |--|NVE4| +----+ +---------+ +--------------+ +---------+ +----+ |<------ DC 1 -----> <---- DC2 ------>| Figure 2: Data-Center Interconnect with ASBR
RFC7432], where two options existed for mapping broadcast domains (represented by VLAN IDs) to an EVI, when the EVPN control plane is used in conjunction with VXLAN (or NVGRE encapsulation), there are also two options for mapping broadcast domains represented by VXLAN VNIs (or NVGRE VSIDs) to an EVI: Option 1: A Single Broadcast Domain per EVI In this option, a single Ethernet broadcast domain (e.g., subnet) represented by a VNI is mapped to a unique EVI. This corresponds to the VLAN-Based Service in [RFC7432], where a tenant-facing interface, logical interface (e.g., represented by a VID), or physical interface gets mapped to an EVI. As such, a BGP Route Distinguisher (RD) and Route Target (RT) are needed per VNI on every NVE. The advantage of this model is that it allows the BGP RT constraint mechanisms to be used in order to limit the propagation and import of routes to only the NVEs that are interested in a given VNI. The disadvantage of this model may be the provisioning overhead if the RD and RT are not derived automatically from the VNI. In this option, the MAC-VRF table is identified by the RT in the control plane and by the VNI in the data plane. In this option, the specific MAC-VRF table corresponds to only a single bridge table. Option 2: Multiple Broadcast Domains per EVI In this option, multiple subnets, each represented by a unique VNI, are mapped to a single EVI. For example, if a tenant has multiple segments/subnets each represented by a VNI, then all the VNIs for that tenant are mapped to a single EVI; for example, the EVI in this case represents the tenant and not a subnet. This corresponds to the VLAN-aware bundle service in [RFC7432]. The advantage of this model is that it doesn't require the provisioning of an RD/RT per VNI. However, this is a moot point when compared to Option 1 where auto- derivation is used. The disadvantage of this model is that routes would be imported by NVEs that may not be interested in a given VNI. In this option, the MAC-VRF table is identified by the RT in the control plane; a specific bridge table for that MAC-VRF is identified by the <RT, Ethernet Tag ID> in the control plane. In this option, the VNI in the data plane is sufficient to identify a specific bridge table.
RFC7432], and RT can be auto-derived as described next. Since a Gateway PE as depicted in Figure 1 participates in both the DCN and WAN BGP sessions, it is important that, when RT values are auto-derived from VNIs, there be no conflict in RT spaces between DCNs and WANs, assuming that both are operating within the same Autonomous System (AS). Also, there can be scenarios where both VXLAN and NVGRE encapsulations may be needed within the same DCN, and their corresponding VNIs are administered independently, which means VNI spaces can overlap. In order to avoid conflict in RT spaces, the 6-byte RT values with 2-octet AS number for DCNs can be auto-derived as follow: 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Global Administrator | Local Administrator | +-----------------------------------------------+---------------+ | Local Administrator (Cont.) | +-------------------------------+ 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Global Administrator |A| TYPE| D-ID | Service ID | +-----------------------------------------------+---------------+ | Service ID (Cont.) | +-------------------------------+ The 6-octet RT field consists of two sub-fields: - Global Administrator sub-field: 2 octets. This sub-field contains an AS number assigned by IANA <https://www.iana.org/assignments/ as-numbers/>. - Local Administrator sub-field: 4 octets * A: A single-bit field indicating if this RT is auto-derived 0: auto-derived 1: manually derived
* Type: A 3-bit field that identifies the space in which the other 3 bytes are defined. The following spaces are defined: 0 : VID (802.1Q VLAN ID) 1 : VXLAN 2 : NVGRE 3 : I-SID 4 : EVI 5 : dual-VID (QinQ VLAN ID) * D-ID: A 4-bit field that identifies domain-id. The default value of domain-id is zero, indicating that only a single numbering space exist for a given technology. However, if more than one number space exists for a given technology (e.g., overlapping VXLAN spaces), then each of the number spaces need to be identified by its corresponding domain-id starting from 1. * Service ID: This 3-octet field is set to VNI, VSID, I-SID, or VID. It should be noted that RT auto-derivation is applicable for 2-octet AS numbers. For 4-octet AS numbers, the RT needs to be manually configured because 3-octet VNI fields cannot be fit within the 2-octet local administrator field.
For the VLAN-Based Service (a single VNI per MAC-VRF), the Ethernet Tag field in the MAC/IP Advertisement, Ethernet A-D per EVI, and IMET route MUST be set to zero just as in the VLAN-Based Service in [RFC7432]. For the VLAN-Aware Bundle Service (multiple VNIs per MAC-VRF with each VNI associated with its own bridge table), the Ethernet Tag field in the MAC Advertisement, Ethernet A-D per EVI, and IMET route MUST identify a bridge table within a MAC-VRF; the set of Ethernet Tags for that EVI needs to be configured consistently on all PEs within that EVI. For locally assigned VNIs, the value advertised in the Ethernet Tag field MUST be set to a VID just as in the VLAN-aware bundle service in [RFC7432]. Such setting must be done consistently on all PE devices participating in that EVI within a given domain. For global VNIs, the value advertised in the Ethernet Tag field SHOULD be set to a VNI as long as it matches the existing semantics of the Ethernet Tag, i.e., it identifies a bridge table within a MAC-VRF and the set of VNIs are configured consistently on each PE in that EVI. In order to indicate which type of data-plane encapsulation (i.e., VXLAN, NVGRE, MPLS, or MPLS in GRE) is to be used, the BGP Encapsulation Extended Community defined in [RFC5512] is included with all EVPN routes (i.e., MAC Advertisement, Ethernet A-D per EVI, Ethernet A-D per ESI, IMET, and Ethernet Segment) advertised by an egress PE. Five new values have been assigned by IANA to extend the list of encapsulation types defined in [RFC5512]; they are listed in Section 11. The MPLS encapsulation tunnel type, listed in Section 11, is needed in order to distinguish between an advertising node that only supports non-MPLS encapsulations and one that supports MPLS and non-MPLS encapsulations. An advertising node that only supports MPLS encapsulation does not need to advertise any encapsulation tunnel types; i.e., if the BGP Encapsulation Extended Community is not present, then either MPLS encapsulation or a statically configured encapsulation is assumed. The Next Hop field of the MP_REACH_NLRI attribute of the route MUST be set to the IPv4 or IPv6 address of the NVE. The remaining fields in each route are set as per [RFC7432]. Note that the procedure defined here -- to use the MPLS Label field to carry the VNI in the presence of a Tunnel Encapsulation Extended Community specifying the use of a VNI -- is aligned with the procedures described in Section 22.214.171.124 of [TUNNEL-ENCAP] ("When a Valid VNI has not been Signaled").
RFC4023] defines the standard for using MPLS over GRE encapsulation, which can be used for this purpose. However, when MPLS over GRE is used in conjunction with EVPN, it is recommended that the GRE key field be present and be used to provide a 32-bit entropy value only if the P nodes can perform Equal-Cost Multipath (ECMP) hashing based on the GRE key; otherwise, the GRE header SHOULD NOT include the GRE key field. The Checksum and Sequence Number fields MUST NOT be included, and the corresponding C and S bits in the GRE header MUST be set to zero. A PE capable of supporting this encapsulation SHOULD advertise its EVPN routes along with the Tunnel Encapsulation Extended Community indicating MPLS over GRE encapsulation as described in the previous section. RFC5512] allows each NVE in a given EVI to know each of the encapsulations supported by each of the other NVEs in that EVI. That is, each of the NVEs in a given EVI may support multiple data-plane encapsulations. An ingress NVE can send a frame to an egress NVE only if the set of encapsulations advertised by the egress NVE forms a non-empty intersection with the set of encapsulations supported by the ingress NVE; it is at the discretion of the ingress NVE which encapsulation to choose from this intersection. (As noted in Section 5.1.3, if the BGP Encapsulation extended community is not present, then the default MPLS encapsulation or a locally configured encapsulation is assumed.) When a PE advertises multiple supported encapsulations, it MUST advertise encapsulations that use the same EVPN procedures including procedures associated with split-horizon filtering described in Section 8.3.1. For example, VXLAN and NVGRE (or MPLS and MPLS over GRE) encapsulations use the same EVPN procedures; thus, a PE can advertise both of them and can support either of them or both of them simultaneously. However, a PE MUST NOT advertise VXLAN and MPLS encapsulations together because (a) the MPLS field of EVPN routes is
set to either an MPLS label or a VNI, but not both and (b) some EVPN procedures (such as split-horizon filtering) are different for VXLAN/ NVGRE and MPLS encapsulations. An ingress node that uses shared multicast trees for sending broadcast or multicast frames MAY maintain distinct trees for each different encapsulation type. It is the responsibility of the operator of a given EVI to ensure that all of the NVEs in that EVI support at least one common encapsulation. If this condition is violated, it could result in service disruption or failure. The use of the BGP Encapsulation Extended Community provides a method to detect when this condition is violated, but the actions to be taken are at the discretion of the operator and are outside the scope of this document. RFC7365], the RD must be a unique value per EVI or per NVE as described in [RFC7432]. In other words, whenever there is more than one administrative domain for global VNI, a unique RD must be used; or, whenever the VNI value has local significance, a unique RD must be used. Therefore, it is recommended to use a unique RD as described in [RFC7432] at all times.
When the NVEs reside on the hypervisor, the EVPN BGP routes and attributes associated with multihoming are no longer required. This reduces the required routes and attributes to the following subset of four out of the total of eight listed in Section 7 of [RFC7432]: - MAC/IP Advertisement Route - Inclusive Multicast Ethernet Tag Route - MAC Mobility Extended Community - Default Gateway Extended Community However, as noted in Section 8.6 of [RFC7432], in order to enable a single-homing ingress NVE to take advantage of fast convergence, Aliasing, and Backup Path when interacting with multihomed egress NVEs attached to a given ES, the single-homing ingress NVE should be able to receive and process routes that are Ethernet A-D per ES and Ethernet A-D per EVI. Section 10.1 of [RFC7432]. 2. Advertising locally learned MAC addresses in BGP using the MAC/IP Advertisement routes. 3. Performing remote learning using BGP per Section 9.2 of [RFC7432]. 4. Discovering other NVEs and constructing the multicast tunnels using the IMET routes. 5. Handling MAC address mobility events per the procedures of Section 15 in [RFC7432]. However, as noted in Section 8.6 of [RFC7432], in order to enable a single-homing ingress NVE to take advantage of fast convergence, Aliasing, and Backup Path when interacting with multihomed egress NVEs attached to a given ES, a single-homing ingress NVE should implement the ingress node processing of routes that are Ethernet A-D per ES and Ethernet A-D per EVI as defined in Sections 8.2 ("Fast Convergence") and 8.4 ("Aliasing and Backup Path") of [RFC7432].