RFC 8014

An Architecture for Data-Center Network Virtualization over Layer 3 (NVO3)

Pages: 35
Informational

Internet Engineering Task Force (IETF)                          D. Black
Request for Comments: 8014                                      Dell EMC
Category: Informational                                        J. Hudson
ISSN: 2070-1721                                               L. Kreeger
                                                             M. Lasserre
                                                             Independent
                                                               T. Narten
                                                                     IBM
                                                           December 2016


                          An Architecture for
         Data-Center Network Virtualization over Layer 3 (NVO3)

Abstract

   This document presents a high-level architecture for building
   data-center Network Virtualization over Layer 3 (NVO3) networks.  The
   architecture is described in terms of the major components of an
   overall system.  An important goal is to
   divide the space into individual smaller components that can be
   implemented independently with clear inter-component interfaces and
   interactions.  It should be possible to build and implement
   individual components in isolation and have them interoperate with
   other independently implemented components.  That way, implementers
   have flexibility in implementing individual components and can
   optimize and innovate within their respective components without
   requiring changes to other components.

Status of This Memo

   This document is not an Internet Standards Track specification; it is
   published for informational purposes.

   This document is a product of the Internet Engineering Task Force
   (IETF).  It represents the consensus of the IETF community.  It has
   received public review and has been approved for publication by the
   Internet Engineering Steering Group (IESG).  Not all documents
   approved by the IESG are a candidate for any level of Internet
   Standard; see Section 2 of RFC 7841.

   Information about the current status of this document, any errata,
   and how to provide feedback on it may be obtained at
   http://www.rfc-editor.org/info/rfc8014.

Copyright Notice

   Copyright (c) 2016 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Terminology
   3.  Background
     3.1.  VN Service (L2 and L3)
       3.1.1.  VLAN Tags in L2 Service
       3.1.2.  Packet Lifetime Considerations
     3.2.  Network Virtualization Edge (NVE) Background
     3.3.  Network Virtualization Authority (NVA) Background
     3.4.  VM Orchestration Systems
   4.  Network Virtualization Edge (NVE)
     4.1.  NVE Co-located with Server Hypervisor
     4.2.  Split-NVE
       4.2.1.  Tenant VLAN Handling in Split-NVE Case
     4.3.  NVE State
     4.4.  Multihoming of NVEs
     4.5.  Virtual Access Point (VAP)
   5.  Tenant System Types
     5.1.  Overlay-Aware Network Service Appliances
     5.2.  Bare Metal Servers
     5.3.  Gateways
       5.3.1.  Gateway Taxonomy
         5.3.1.1.  L2 Gateways (Bridging)
         5.3.1.2.  L3 Gateways (Only IP Packets)
     5.4.  Distributed Inter-VN Gateways
     5.5.  ARP and Neighbor Discovery
   6.  NVE-NVE Interaction
   7.  Network Virtualization Authority (NVA)
     7.1.  How an NVA Obtains Information
     7.2.  Internal NVA Architecture
     7.3.  NVA External Interface
   8.  NVE-NVA Protocol
     8.1.  NVE-NVA Interaction Models
     8.2.  Direct NVE-NVA Protocol
     8.3.  Propagating Information Between NVEs and NVAs
   9.  Federated NVAs
     9.1.  Inter-NVA Peering
   10. Control Protocol Work Areas
   11. NVO3 Data-Plane Encapsulation
   12. Operations, Administration, and Maintenance (OAM)
   13. Summary
   14. Security Considerations
   15. Informative References
   Acknowledgements
   Authors' Addresses

1.  Introduction

   This document presents a high-level architecture for building
   data-center Network Virtualization over Layer 3 (NVO3) networks.  The
   architecture is described in terms of the major components of an
   overall system.  An important goal is to divide the
   space into smaller individual components that can be implemented
   independently with clear inter-component interfaces and interactions.
   It should be possible to build and implement individual components in
   isolation and have them interoperate with other independently
   implemented components.  That way, implementers have flexibility in
   implementing individual components and can optimize and innovate
   within their respective components without requiring changes to other
   components.

   The motivation for overlay networks is given in "Problem Statement:
   Overlays for Network Virtualization" [RFC7364].  "Framework for Data
   Center (DC) Network Virtualization" [RFC7365] provides a framework
   for discussing overlay networks generally and the various components
   that must work together in building such systems.  This document
   differs from the framework document in that it doesn't attempt to
   cover all possible approaches within the general design space.
   Rather, it describes one particular approach that the NVO3 WG has
   focused on.

2.  Terminology

   This document uses the same terminology as [RFC7365].  In addition,
   the following terms are used:

   NV Domain:  A Network Virtualization Domain is an administrative
      construct that defines a Network Virtualization Authority (NVA),
      the set of Network Virtualization Edges (NVEs) associated with
      that NVA, and the set of virtual networks the NVA manages and
      supports.  NVEs are associated with a (logically centralized) NVA,
      and an NVE supports communication for any of the virtual networks
      in the domain.

   NV Region:  A region over which information about a set of virtual
      networks is shared.  The degenerate case of a single NV Domain
      corresponds to an NV Region corresponding to that domain.  The
      more interesting case occurs when two or more NV Domains share
      information about part or all of a set of virtual networks that
      they manage.  Two NVAs share information about particular virtual
      networks for the purpose of supporting connectivity between
      tenants located in different NV Domains.  NVAs can share
      information about an entire NV Domain, or just individual virtual
      networks.

   Tenant System Interface (TSI):  The interface to a Virtual Network
      (VN) as presented to a Tenant System (TS, see [RFC7365]).  The TSI
      logically connects to the NVE via a Virtual Access Point (VAP).
      To the Tenant System, the TSI is like a Network Interface Card
      (NIC); the TSI presents itself to a Tenant System as a normal
      network interface.

   VLAN:  Unless stated otherwise, the terms "VLAN" and "VLAN Tag" are
      used in this document to denote a Customer VLAN (C-VLAN)
      [IEEE.802.1Q]; the terms are used interchangeably to improve
      readability.

3.  Background

   Overlay networks are an approach for providing network virtualization
   services to a set of Tenant Systems (TSs) [RFC7365].  With overlays,
   data traffic between tenants is tunneled across the underlying data
   center's IP network.  The use of tunnels provides a number of
   benefits by decoupling the network as viewed by tenants from the
   underlying physical network across which they communicate.
   Additional discussion of some NVO3 use cases can be found in
   [USECASES].

   Tenant Systems connect to Virtual Networks (VNs), with each VN having
   associated attributes defining properties of the network (such as the
   set of members that connect to it).  Tenant Systems connected to a
   virtual network typically communicate freely with other Tenant
   Systems on the same VN, but communication between Tenant Systems on
   one VN and those external to the VN (whether on another VN or
   connected to the Internet) is carefully controlled and governed by
   policy.  The NVO3 architecture does not impose any restrictions on
   the application of policy controls, even within a VN.

   A Network Virtualization Edge (NVE) [RFC7365] is the entity that
   implements the overlay functionality.  An NVE resides at the boundary
   between a Tenant System and the overlay network as shown in Figure 1.
   An NVE creates and maintains local state about each VN for which it
   is providing service on behalf of a Tenant System.

       +--------+                                             +--------+
       | Tenant +--+                                     +----| Tenant |
       | System |  |                                    (')   | System |
       +--------+  |          ................         (   )  +--------+
                   |  +-+--+  .              .  +--+-+  (_)
                   |  | NVE|--.              .--| NVE|   |
                   +--|    |  .              .  |    |---+
                      +-+--+  .              .   +--+-+
                      /       .              .
                     /        .  L3 Overlay  .   +--+-++--------+
       +--------+   /         .    Network   .   | NVE|| Tenant |
       | Tenant +--+          .              .- -|    || System |
       | System |             .              .   +--+-++--------+
       +--------+             ................
                                     |
                                   +----+
                                   | NVE|
                                   |    |
                                   +----+
                                     |
                                     |
                           =====================
                             |               |
                         +--------+      +--------+
                         | Tenant |      | Tenant |
                         | System |      | System |
                         +--------+      +--------+


                  Figure 1: NVO3 Generic Reference Model

   The following subsections describe key aspects of an overlay system
   in more detail.  Section 3.1 describes the service model (Ethernet
   vs. IP) provided to Tenant Systems.  Section 3.2 describes NVEs in
   more detail.  Section 3.3 introduces the Network Virtualization
   Authority, from which NVEs obtain information about virtual networks.
   Section 3.4 provides background on Virtual Machine (VM) orchestration
   systems and their use of virtual networks.

3.1.  VN Service (L2 and L3)

   A VN provides either Layer 2 (L2) or Layer 3 (L3) service to
   connected tenants.  For L2 service, VNs transport Ethernet frames,
   and a Tenant System is provided with a service that is analogous to
   being connected to a specific L2 C-VLAN.  L2 broadcast frames are
   generally delivered to all (and multicast frames delivered to a
   subset of) the other Tenant Systems on the VN.  To a Tenant System,
   it appears as if it is connected to a regular L2 Ethernet link.
   Within the NVO3 architecture, tenant frames are tunneled to remote
   NVEs based on the Media Access Control (MAC) addresses of the frame
   headers as originated by the Tenant System.  On the underlay, NVO3
   packets are forwarded between NVEs based on the outer addresses of
   tunneled packets.

   For L3 service, VNs are routed networks that transport IP datagrams,
   and a Tenant System is provided with a service that supports only IP
   traffic.  Within the NVO3 architecture, tenant frames are tunneled to
   remote NVEs based on the IP addresses of the packet originated by the
   Tenant System; any L2 destination addresses provided by Tenant
   Systems are effectively ignored by the NVEs and overlay network.  For
   L3 service, the Tenant System will be configured with an IP subnet
   that is effectively a point-to-point link, i.e., having only the
   Tenant System and a next-hop router address on it.
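
   The following non-normative Python sketch illustrates the difference
   between the two service types: the tenant header field used to look
   up the egress NVE.  All names (TenantFrame, VirtualNetwork,
   egress_nve) and the table layout are hypothetical and are not
   defined by this document.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class TenantFrame:
    """Minimal view of a tenant Ethernet frame (hypothetical fields)."""
    dst_mac: str
    dst_ip: Optional[str] = None  # None when the payload is not IP


@dataclass
class VirtualNetwork:
    """Per-VN mapping state held by an NVE (illustrative only)."""
    service: str  # "L2" or "L3"
    mac_to_nve: Dict[str, str] = field(default_factory=dict)
    ip_to_nve: Dict[str, str] = field(default_factory=dict)

    def egress_nve(self, frame: TenantFrame) -> Optional[str]:
        """Return the underlay address of the egress NVE, if known."""
        if self.service == "L2":
            # L2 service: the tenant destination MAC address is the key.
            return self.mac_to_nve.get(frame.dst_mac)
        if frame.dst_ip is None:
            # L3 service supports only IP traffic.
            return None
        # L3 service: the tenant destination IP address is the key; the
        # tenant-supplied L2 destination address is ignored.
        return self.ip_to_nve.get(frame.dst_ip)
```

   In this sketch, an L3 VN returns the same egress NVE for a given
   destination IP address regardless of the destination MAC address
   carried in the tenant frame.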

   L2 service is intended for systems that need native L2 Ethernet
   service and the ability to run protocols directly over Ethernet
   (i.e., not based on IP).  L3 service is intended for systems in which
   all the traffic can safely be assumed to be IP.  It is important to
   note that regardless of whether an NVO3 network provides L2 or L3
   service to a Tenant System, the Tenant System does not generally need to be
   aware of the distinction.  In both cases, the virtual network
   presents itself to the Tenant System as an L2 Ethernet interface.  An
   Ethernet interface is used in both cases simply as a widely supported
   interface type that essentially all Tenant Systems already support.
   Consequently, no special software is needed on Tenant Systems to use
   an L3 vs. an L2 overlay service.

   NVO3 can also provide a combined L2 and L3 service to tenants.  A
   combined service provides L2 service for intra-VN communication but
   also provides L3 service for L3 traffic entering or leaving the VN.
   Architecturally, the handling of a combined L2/L3 service within the
   NVO3 architecture is intended to match what is commonly done today in
   non-overlay environments by devices providing a combined bridge/
   router service.  With combined service, the virtual network itself
   retains the semantics of L2 service, and all traffic is processed
   according to its L2 semantics.  In addition, however, traffic
   requiring IP processing is also processed at the IP level.

   The IP processing for a combined service can be implemented on a
   standalone device attached to the virtual network (e.g., an IP
   router) or implemented locally on the NVE (see Section 5.4 on
   Distributed Inter-VN Gateways).  For unicast traffic, NVE
   implementation of a combined service may result in a packet being
   delivered to another Tenant System attached to the same NVE (on
   either the same or a different VN), tunneled to a remote NVE, or even
   forwarded outside the NV Domain.  For multicast or broadcast packets,
   the combination of NVE L2 and L3 processing may result in copies of
   the packet receiving both L2 and L3 treatments to realize delivery to
   all of the destinations involved.  This distributed NVE
   implementation of IP routing results in the same network delivery
   behavior as if the L2 processing of the packet included delivery of
   the packet to an IP router attached to the L2 VN as a Tenant System,
   with the router having additional network attachments to other
   networks, either virtual or not.

3.1.1.  VLAN Tags in L2 Service

   An NVO3 L2 virtual network service may include encapsulated L2 VLAN
   tags provided by a Tenant System but does not use encapsulated tags
   in deciding where and how to forward traffic.  Such VLAN tags can be
   passed through so that Tenant Systems that send or expect to receive
   them can be supported as appropriate.

   The processing of VLAN tags that an NVE receives from a TS is
   controlled by settings associated with the VAP.  Just as in the case
   with ports on Ethernet switches, a number of settings are possible.
   For example, Customer VLAN Tags (C-TAGs) can be passed through
   transparently, could always be stripped upon receipt from a Tenant
   System, could be compared against a list of explicitly configured
   tags, etc.
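
   As a non-normative illustration, the per-VAP settings listed above
   could be modeled as follows.  The names (CTagMode, process_ctag) are
   hypothetical, and the treatment of untagged frames in the allow-list
   mode is an assumption made for this sketch.

```python
from enum import Enum
from typing import FrozenSet, Optional, Tuple


class CTagMode(Enum):
    PASS_THROUGH = "pass-through"  # carry the tag unchanged
    STRIP = "strip"                # always remove the tag on receipt
    ALLOW_LIST = "allow-list"      # accept only explicitly configured tags


def process_ctag(mode: CTagMode, vlan_id: Optional[int],
                 allowed: FrozenSet[int] = frozenset()
                 ) -> Tuple[bool, Optional[int]]:
    """Return (accept, vlan_id_to_encapsulate) for a frame from a TS.

    vlan_id is None for an untagged frame.
    """
    if mode is CTagMode.STRIP:
        return True, None
    if mode is CTagMode.ALLOW_LIST:
        # Assumption: untagged frames are accepted; tagged frames must
        # carry a VLAN ID from the configured list.
        if vlan_id is not None and vlan_id not in allowed:
            return False, None
        return True, vlan_id
    # PASS_THROUGH: the tag is encapsulated as received.
    return True, vlan_id
```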

   Note that there are additional considerations when VLAN tags are used
   to identify both the VN and a Tenant System VLAN within that VN, as
   described in Section 4.2.1.

3.1.2.  Packet Lifetime Considerations

   For L3 service, Tenant Systems should expect the IPv4 Time to Live
   (TTL) or IPv6 Hop Limit in the packets they send to be decremented by
   at least 1.  For L2 service, neither the TTL nor the Hop Limit (when
   the packet is IP) is modified.  The underlay network manages TTLs and
   Hop Limits in the outer IP encapsulation -- the values in these
   fields could be independent from or related to the values in the same
   fields of tenant IP packets.
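
   A minimal sketch of the tenant-visible behavior follows.  The
   function name is hypothetical.  The sketch assumes that an L3
   service decrements by exactly 1 (the text above only requires "at
   least 1") and that a packet whose value would reach zero is
   discarded, as an IP router would do.

```python
from typing import Optional


def tenant_ttl_after_overlay(service: str, ttl: int) -> Optional[int]:
    """Return the tenant TTL/Hop Limit on delivery, or None if dropped."""
    if service == "L2":
        # L2 service: the tenant IP header is not modified.
        return ttl
    if service == "L3":
        # L3 service: decrement once (assumption: exactly 1).
        return ttl - 1 if ttl > 1 else None
    raise ValueError(f"unknown service type: {service!r}")
```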

3.2.  Network Virtualization Edge (NVE) Background

   Tenant Systems connect to NVEs via a Tenant System Interface (TSI).
   The TSI logically connects to the NVE via a Virtual Access Point
   (VAP), and each VAP is associated with one VN as shown in Figure 2.
   To the Tenant System, the TSI is like a NIC; the TSI presents itself
   to a Tenant System as a normal network interface.  On the NVE side, a
   VAP is a logical network port (virtual or physical) into a specific
   virtual network.  Note that two different Tenant Systems (and TSIs)
   attached to a common NVE can share a VAP (e.g., TS1 and TS2 in
   Figure 2) so long as they connect to the same VN.

                    |         Data-Center Network (IP)        |
                    |                                         |
                    +-----------------------------------------+
                         |                           |
                         |       Tunnel Overlay      |
            +------------+---------+       +---------+------------+
            | +----------+-------+ |       | +-------+----------+ |
            | |  Overlay Module  | |       | |  Overlay Module  | |
            | +---------+--------+ |       | +---------+--------+ |
            |           |          |       |           |          |
     NVE1   |           |          |       |           |          | NVE2
            |  +--------+-------+  |       |  +--------+-------+  |
            |  | VNI1      VNI2 |  |       |  | VNI1      VNI2 |  |
            |  +-+----------+---+  |       |  +-+-----------+--+  |
            |    | VAP1     | VAP2 |       |    | VAP1      | VAP2|
            +----+----------+------+       +----+-----------+-----+
                 |          |                   |           |
                 |\         |                   |           |
                 | \        |                   |          /|
          -------+--\-------+-------------------+---------/-+-------
                 |   \      |     Tenant        |        /  |
            TSI1 |TSI2\     | TSI3            TSI1  TSI2/   TSI3
                +---+ +---+ +---+             +---+ +---+   +---+
                |TS1| |TS2| |TS3|             |TS4| |TS5|   |TS6|
                +---+ +---+ +---+             +---+ +---+   +---+

                       Figure 2: NVE Reference Model

   The Overlay Module performs the actual encapsulation and
   decapsulation of tunneled packets.  The NVE maintains state about the
   virtual networks it is a part of so that it can provide the Overlay
   Module with information such as the destination address of the NVE to
   tunnel a packet to and the Context ID that should be placed in the
   encapsulation header to identify the virtual network that a tunneled
   packet belongs to.

   On the side facing the data-center network, the NVE sends and
   receives native IP traffic.  When ingressing traffic from a Tenant
   System, the NVE identifies the egress NVE to which the packet should
   be sent, adds an overlay encapsulation header, and sends the packet
   on the underlay network.  When receiving traffic from a remote NVE,
   an NVE strips off the encapsulation header and delivers the
   (original) packet to the appropriate Tenant System.  When the source
   and destination Tenant System are on the same NVE, no encapsulation
   is needed and the NVE forwards traffic directly.
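
   The ingress and egress behavior described above can be sketched as
   follows.  This is a non-normative illustration: the class and field
   names are hypothetical, tenant destinations are opaque strings, and
   no particular encapsulation format is implied.  A missing mapping
   raises KeyError here; a real NVE would typically consult the NVA
   instead.

```python
from dataclasses import dataclass
from typing import Dict, Tuple, Union


@dataclass(frozen=True)
class Encapsulated:
    outer_dst: str   # underlay IP address of the egress NVE
    context_id: int  # identifies the VN in the encapsulation header
    inner: bytes     # original tenant packet, carried unmodified


@dataclass(frozen=True)
class LocalDelivery:
    vap: str
    packet: bytes


class Nve:
    def __init__(self, address: str) -> None:
        self.address = address
        # (context_id, tenant destination) -> egress NVE underlay address
        self.mappings: Dict[Tuple[int, str], str] = {}
        # (context_id, tenant destination) -> local VAP name
        self.local_vaps: Dict[Tuple[int, str], str] = {}

    def ingress(self, context_id: int, dst: str,
                packet: bytes) -> Union[LocalDelivery, Encapsulated]:
        """Handle a packet received from a locally attached Tenant System."""
        key = (context_id, dst)
        if key in self.local_vaps:
            # Source and destination are on the same NVE: no encapsulation.
            return LocalDelivery(self.local_vaps[key], packet)
        return Encapsulated(self.mappings[key], context_id, packet)


def decapsulate(enc: Encapsulated) -> Tuple[int, bytes]:
    """Strip the overlay header; return (context_id, original packet)."""
    return enc.context_id, enc.inner
```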

   Conceptually, the NVE is a single entity implementing the NVO3
   functionality.  In practice, there are a number of different
   implementation scenarios, as described in detail in Section 4.

3.3.  Network Virtualization Authority (NVA) Background

   Address dissemination refers to the process of learning, building,
   and distributing the mapping/forwarding information that NVEs need in
   order to tunnel traffic to each other on behalf of communicating
   Tenant Systems.  For example, in order to send traffic to a remote
   Tenant System, the sending NVE must know the destination NVE for that
   Tenant System.

   One way to build and maintain mapping tables is to use learning, as
   802.1 bridges do [IEEE.802.1Q].  When forwarding traffic to multicast
   or unknown unicast destinations, an NVE could simply flood traffic.
   While flooding works, it can lead to traffic hot spots and to
   problems in larger networks (e.g., excessive amounts of flooded
   traffic).

   Alternatively, to reduce the scope of where flooding must take place,
   or to eliminate it altogether, NVEs can make use of a Network
   Virtualization Authority (NVA).  An NVA is the entity that provides
   address mapping and other information to NVEs.  NVEs interact with an
   NVA to obtain the address-mapping information they need in order to
   properly forward traffic on behalf of tenants.  The term
   "NVA" refers to the overall system, without regard to its scope or
   how it is implemented.  NVAs provide a service, and NVEs access that
   service via an NVE-NVA protocol as discussed in Section 8.

   Even when an NVA is present, Ethernet bridge MAC address learning
   could be used as a fallback mechanism, should the NVA be unable to
   provide an answer or for other reasons.  This document does not
   consider flooding approaches in detail, as there are a number of
   benefits in using an approach that depends on the presence of an NVA.
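
   One possible ordering of these mechanisms is sketched below: query
   the NVA first, fall back to a learned mapping, and flood only as a
   last resort.  This is a non-normative illustration; the names
   (MappingResolver, query_nva) are hypothetical, and the NVA query is
   modeled as a plain callable rather than any specific NVE-NVA
   protocol.

```python
from typing import Callable, Dict, List, Optional, Tuple

Key = Tuple[int, str]  # (VN identifier, tenant MAC address)


class MappingResolver:
    def __init__(self,
                 query_nva: Callable[[int, str], Optional[str]],
                 vn_members: Dict[int, List[str]]) -> None:
        self.query_nva = query_nva    # returns an NVE address or None
        self.vn_members = vn_members  # VN -> remote NVEs, used for flooding
        self.learned: Dict[Key, str] = {}

    def learn(self, vn: int, src_mac: str, remote_nve: str) -> None:
        """Data-plane learning from the source MAC of a received packet."""
        self.learned[(vn, src_mac)] = remote_nve

    def resolve(self, vn: int, dst_mac: str) -> List[str]:
        """Return the NVE address(es) to which the packet is tunneled."""
        answer = self.query_nva(vn, dst_mac)
        if answer is not None:
            return [answer]
        if (vn, dst_mac) in self.learned:
            # Fallback: mapping learned from earlier traffic.
            return [self.learned[(vn, dst_mac)]]
        # Last resort: flood to every remote NVE participating in the VN.
        return list(self.vn_members.get(vn, []))
```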

   For the rest of this document, it is assumed that an NVA exists and
   will be used.  NVAs are discussed in more detail in Section 7.

3.4.  VM Orchestration Systems

   VM orchestration systems manage server virtualization across a set of
   servers.  Although VM management is a separate topic from network
   virtualization, the two areas are closely related.  Managing the
   creation, placement, and movement of VMs also involves creating,
   attaching to, and detaching from virtual networks.  A number of
   existing VM orchestration systems have incorporated aspects of
   virtual network management into their systems.

   Note also that although this section uses the terms "VM" and
   "hypervisor" throughout, the same issues apply to other
   virtualization approaches, including Linux Containers (LXC), BSD
   Jails, Network Service Appliances as discussed in Section 5.1, etc.
   From an NVO3 perspective, it should be assumed that where the
   document uses the terms "VM" and "hypervisor", the discussion also
   applies to other systems in which, e.g., the host operating system
   plays the role of the hypervisor in supporting virtualization, and a
   container plays a role equivalent to that of a VM.

   When a new VM image is started, the VM orchestration system
   determines where the VM should be placed, interacts with the
   hypervisor on the target server to load and start the VM, and
   controls when a VM should be shut down or migrated elsewhere.  VM
   orchestration systems also have knowledge about how a VM should
   connect to a network, possibly including the name of the virtual
   network to which a VM is to connect.  The VM orchestration system can
   pass such information to the hypervisor when a VM is instantiated.
   VM orchestration systems have significant (and sometimes global)
   knowledge over the domain they manage.  They typically know on what
   servers a VM is running, and metadata associated with VM images can
   be useful from a network virtualization perspective.  For example,
   the metadata may include the addresses (MAC and IP) the VMs will use
   and the name(s) of the virtual network(s) they connect to.

   VM orchestration systems run a protocol with an agent running on the
   hypervisor of the servers they manage.  That protocol can also carry
   information about what virtual network a VM is associated with.  When
   the orchestrator instantiates a VM on a hypervisor, the hypervisor
   interacts with the NVE in order to attach the VM to the virtual
   networks it has access to.  In general, the hypervisor will need to
   communicate significant VM state changes to the NVE.  In the reverse
   direction, the NVE may need to communicate network connectivity
   information back to the hypervisor.  Examples of deployed VM
   orchestration systems include VMware's vCenter Server, Microsoft's
   System Center Virtual Machine Manager, and systems based on OpenStack
   and its associated plugins (e.g., Nova and Neutron).  Each can pass
   information about what virtual networks a VM connects to down to the
   hypervisor.  The protocol used between the VM orchestration system
   and hypervisors is generally proprietary.

   It should be noted that VM orchestration systems may not have direct
   access to all networking-related information a VM uses.  For example,
   a VM may make use of additional IP or MAC addresses that the VM
   orchestration system is not aware of.

4.  Network Virtualization Edge (NVE)

   As introduced in Section 3.2, an NVE is the entity that implements
   the overlay functionality.  This section describes NVEs in more
   detail.  An NVE will have two external interfaces:

   Facing the Tenant System:  On the side facing the Tenant System, an
      NVE interacts with the hypervisor (or equivalent entity) to
      provide the NVO3 service.  An NVE will need to be notified when a
      Tenant System "attaches" to a virtual network (so it can validate
      the request and set up any state needed to send and receive
      traffic on behalf of the Tenant System on that VN).  Likewise, an
      NVE will need to be informed when the Tenant System "detaches"
      from the virtual network so that it can reclaim state and
      resources appropriately.

   Facing the Data-Center Network:  On the side facing the data-center
      network, an NVE interfaces with the data-center underlay network,
      sending and receiving tunneled packets to and from the underlay.
      The NVE may also run a control protocol with other entities on the
      network, such as the Network Virtualization Authority.
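
   The attach and detach notifications on the tenant-facing side can be
   sketched as follows.  This is a non-normative illustration with
   hypothetical names; validation of the attach request is omitted, and
   VN state is reduced to a set of attached TSIs.

```python
from collections import defaultdict
from typing import Dict, Set


class NveAttachmentState:
    def __init__(self) -> None:
        # VN name -> identifiers of locally attached TSIs
        self.members: Dict[str, Set[str]] = defaultdict(set)

    def attach(self, vn: str, tsi: str) -> bool:
        """Record an attachment; return True if VN state was created."""
        created = vn not in self.members
        self.members[vn].add(tsi)
        return created

    def detach(self, vn: str, tsi: str) -> bool:
        """Remove an attachment; return True if VN state was reclaimed."""
        tsis = self.members.get(vn)
        if tsis is None:
            return False
        tsis.discard(tsi)
        if not tsis:
            # Last Tenant System on this VN has left: reclaim the state.
            del self.members[vn]
            return True
        return False
```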

4.1.  NVE Co-located with Server Hypervisor

   When server virtualization is used, the entire NVE functionality will
   typically be implemented as part of the hypervisor and/or virtual
   switch on the server.  In such cases, the Tenant System interacts
   with the hypervisor, and the hypervisor interacts with the NVE.
   Because the interaction between the hypervisor and NVE is implemented
   entirely in software on the server, there is no "on-the-wire"
   protocol between Tenant Systems (or the hypervisor) and the NVE that
   needs to be standardized.  While there may be APIs between the NVE
   and hypervisor to support necessary interaction, the details of such
   APIs are not in scope for the NVO3 WG at the time of publication of
   this memo.

   Implementing NVE functionality entirely on a server has the
   disadvantage that server CPU resources must be spent implementing the
   NVO3 functionality.  Experimentation with overlay approaches and
   previous experience with TCP and checksum adapter offloads suggest
   that offloading certain NVE operations (e.g., encapsulation and
   decapsulation operations) onto the physical network adapter can
   produce performance advantages.  As has been done with checksum and/
   or TCP server offload and other optimization approaches, there may be
   benefits to offloading common operations onto adapters where
   possible.  Just as important, the addition of an overlay header can
   disable existing adapter offload capabilities that are generally not
   prepared to handle the addition of a new header or other operations
   associated with an NVE.

   While the exact details of how to split the implementation of
   specific NVE functionality between a server and its network adapters
   are an implementation matter and outside the scope of IETF
   standardization, the NVO3 architecture should be cognizant of and
   support such separation.  Ideally, it may even be possible to bypass
   the hypervisor completely on critical data-path operations so that
   packets between a Tenant System and its VN can be sent and received
   without having the hypervisor involved in each individual packet
   operation.

4.2.  Split-NVE

   Another possible scenario leads to the need for a split-NVE
   implementation.  An NVE running on a server (e.g., within a
   hypervisor) could support NVO3 service towards the tenant but not
   perform all NVE functions (e.g., encapsulation) directly on the
   server; some of the actual NVO3 functionality could be implemented on
   (i.e., offloaded to) an adjacent switch to which the server is
   attached.  While one could imagine a number of link types between a
   server and the NVE, one simple deployment scenario would involve a
   server and NVE separated by a simple L2 Ethernet link.  A more
   complicated scenario would have the server and NVE separated by a
   bridged access network, such as when the NVE resides on a Top of Rack
   (ToR) switch, with an embedded switch residing between servers and
   the ToR switch.

   For the split-NVE case, protocols will be needed that allow the
   hypervisor and NVE to negotiate and set up the necessary state so
   that traffic sent across the access link between a server and the NVE
   can be associated with the correct virtual network instance.
   Specifically, on the access link, traffic belonging to a specific
   Tenant System would be tagged with a specific VLAN C-TAG that
   identifies which specific NVO3 virtual network instance it connects
   to.  The hypervisor-NVE protocol would negotiate which VLAN C-TAG to
   use for a particular virtual network instance.  More details of the
   protocol requirements for functionality between hypervisors and NVEs
   can be found in [NVE-NVA].
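
   As an illustration only, the per-access-link state that such a
   negotiation produces can be sketched as a two-way mapping between
   virtual network instances and VLAN IDs.  The class and method names
   below are hypothetical, and the allocation policy (lowest free VLAN
   ID) is an assumption; this document does not define the
   hypervisor-NVE protocol itself.

```python
class AccessLinkTagMap:
    """Hypothetical state for one access link between a server and an NVE."""

    # VLAN IDs 0 and 4095 are reserved by IEEE 802.1Q.
    MIN_VID, MAX_VID = 1, 4094

    def __init__(self):
        self._vid_by_vn = {}  # virtual network instance -> C-TAG VLAN ID
        self._vn_by_vid = {}  # C-TAG VLAN ID -> virtual network instance

    def negotiate(self, vn_id):
        """Return the VLAN ID agreed for vn_id, allocating one if needed."""
        if vn_id in self._vid_by_vn:
            return self._vid_by_vn[vn_id]
        for vid in range(self.MIN_VID, self.MAX_VID + 1):
            if vid not in self._vn_by_vid:
                self._vid_by_vn[vn_id] = vid
                self._vn_by_vid[vid] = vn_id
                return vid
        raise RuntimeError("no free VLAN ID on this access link")

    def release(self, vn_id):
        """Free the VLAN ID once no Tenant System on the link uses vn_id."""
        vid = self._vid_by_vn.pop(vn_id)
        del self._vn_by_vid[vid]

    def vn_for_tag(self, vid):
        """Associate tagged traffic from the access link with its VN."""
        return self._vn_by_vid[vid]
```

   Because the mapping is scoped to one access link, the same VLAN ID
   can identify different virtual network instances on different links.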

4.2.1.  Tenant VLAN Handling in Split-NVE Case

   Preserving tenant VLAN tags across an NVO3 VN, as described in
   Section 3.1.1, poses additional complications in the split-NVE case.
   The portion of the NVE that performs the encapsulation function needs
   access to the specific VLAN tags that the Tenant System is using in
   order to include them in the encapsulated packet.  When an NVE is
   implemented entirely within the hypervisor, the NVE has access to the
   complete original packet (including any VLAN tags) sent by the
   tenant.  In the split-NVE case, however, the VLAN tag used between
   the hypervisor and offloaded portions of the NVE normally only
   identifies the specific VN that traffic belongs to.  In order to
   allow a tenant to preserve VLAN information from end to end between
   Tenant Systems in the split-NVE case, additional mechanisms would be
   needed (e.g., carrying both a C-TAG and a Service VLAN Tag (S-TAG)
   as specified in [IEEE.802.1Q], where the C-TAG identifies the tenant
   VLAN end to end and the S-TAG identifies the VN locally between each
   Tenant System and the corresponding NVE).
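
   The double-tagging example can be sketched as follows.  The tag
   protocol identifiers 0x88A8 (S-TAG) and 0x8100 (C-TAG) come from
   IEEE 802.1Q; the function names and the choice to leave the priority
   bits at zero are assumptions made for illustration.

```python
import struct

TPID_S_TAG = 0x88A8  # Service VLAN tag (outer): identifies the VN locally
TPID_C_TAG = 0x8100  # Customer VLAN tag (inner): the tenant's own VLAN


def vlan_tag(tpid, vid, pcp=0, dei=0):
    """Build one 4-byte VLAN tag: 16-bit TPID followed by 16-bit TCI."""
    if not 0 <= vid <= 0xFFF:
        raise ValueError("VLAN ID must fit in 12 bits")
    tci = (pcp << 13) | (dei << 12) | vid
    return struct.pack("!HH", tpid, tci)


def double_tag(dst_mac, src_mac, vn_vid, tenant_vid, ethertype, payload):
    """Frame sent on the access link: S-TAG first, then the tenant C-TAG."""
    return (dst_mac + src_mac
            + vlan_tag(TPID_S_TAG, vn_vid)
            + vlan_tag(TPID_C_TAG, tenant_vid)
            + struct.pack("!H", ethertype)
            + payload)


def parse_double_tag(frame):
    """Return (vn_vid, tenant_vid) as seen by the offloaded NVE portion."""
    s_tpid, s_tci, c_tpid, c_tci = struct.unpack("!HHHH", frame[12:20])
    if (s_tpid, c_tpid) != (TPID_S_TAG, TPID_C_TAG):
        raise ValueError("frame is not S-TAG + C-TAG double-tagged")
    return s_tci & 0xFFF, c_tci & 0xFFF
```

   In this sketch, the encapsulating portion of the NVE would use the
   S-TAG value to select the VN and carry the C-TAG value inside the
   encapsulated packet.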

4.3.  NVE State

   NVEs maintain internal data structures and state to support the
   sending and receiving of tenant traffic.  An NVE may need some or all
   of the following information:

   1.  An NVE keeps track of which attached Tenant Systems are connected
       to which virtual networks.  When a Tenant System attaches to a
       virtual network, the NVE will need to create or update the local
       state for that virtual network.  When the last Tenant System
       detaches from a given VN, the NVE can reclaim state associated
       with that VN.

   2.  For tenant unicast traffic, an NVE maintains a per-VN table of
       mappings from Tenant System (inner) addresses to remote NVE
       (outer) addresses.

   3.  For tenant multicast (or broadcast) traffic, an NVE maintains a
       per-VN table of mappings and other information on how to deliver
       tenant multicast (or broadcast) traffic.  If the underlying
       network supports IP multicast, the NVE could use IP multicast to
       deliver tenant traffic.  In such a case, the NVE would need to
       know what IP underlay multicast address to use for a given VN.
       Alternatively, if the underlying network does not support
       multicast, a source NVE could use unicast replication to deliver
       traffic.  In such a case, an NVE would need to know which remote
       NVEs are participating in the VN.  An NVE could use both
       approaches, switching from one mode to the other depending on
       factors such as bandwidth efficiency and group membership
       sparseness.  [FRAMEWORK-MCAST] discusses the subject of multicast
       handling in NVO3 in further detail.

   4.  An NVE maintains necessary information to encapsulate outgoing
       traffic, including what type of encapsulation and what value to
       use for a Context ID to identify the VN within the encapsulation
       header.

   5.  In order to deliver incoming encapsulated packets to the correct
       Tenant Systems, an NVE maintains the necessary information to map
       incoming traffic to the appropriate VAP (i.e., TSI).

   6.  An NVE may find it convenient to maintain additional per-VN
       information such as QoS settings, Path MTU information, Access
       Control Lists (ACLs), etc.
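
   The items above can be summarized as a per-VN record held by the
   NVE.  The following is a minimal sketch under stated assumptions:
   all class, field, and method names are hypothetical, addresses are
   plain strings, and "example-encap" is a placeholder rather than a
   real encapsulation name.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class VnState:
    context_id: int                                   # item 4
    encap_type: str                                   # item 4
    vaps: set = field(default_factory=set)            # items 1 and 5
    unicast_map: dict = field(default_factory=dict)   # item 2: inner -> outer
    mcast_group: Optional[str] = None                 # item 3: underlay group
    remote_nves: set = field(default_factory=set)     # item 3: replication set
    extras: dict = field(default_factory=dict)        # item 6: QoS, MTU, ACLs


class NveState:
    def __init__(self):
        self.vns = {}  # VN identifier -> VnState

    def attach(self, vn_id, vap, context_id, encap_type="example-encap"):
        """Create or update local state when a Tenant System attaches."""
        vn = self.vns.setdefault(vn_id, VnState(context_id, encap_type))
        vn.vaps.add(vap)
        return vn

    def detach(self, vn_id, vap):
        """Reclaim the VN's state when its last Tenant System detaches."""
        vn = self.vns[vn_id]
        vn.vaps.discard(vap)
        if not vn.vaps:
            del self.vns[vn_id]

    def outer_destinations(self, vn_id, inner_dst, multicast=False):
        """Underlay destination(s) for one outgoing tenant packet."""
        vn = self.vns[vn_id]
        if not multicast:
            return [vn.unicast_map[inner_dst]]
        if vn.mcast_group is not None:
            return [vn.mcast_group]       # underlay IP multicast
        return sorted(vn.remote_nves)     # source unicast replication

    def vaps_for_context(self, context_id):
        """Map an incoming encapsulated packet to candidate local VAPs."""
        for vn in self.vns.values():
            if vn.context_id == context_id:
                return set(vn.vaps)
        return set()
```

   How these tables are populated (for example, by an NVA) is outside
   the scope of this sketch.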

4.4.  Multihoming of NVEs

   NVEs may be multihomed.  That is, an NVE may have more than one IP
   address associated with it on the underlay network.  Multihoming
   happens in two different scenarios.  First, an NVE may have multiple
   interfaces connecting it to the underlay.  Each of those interfaces
   will typically have a different IP address, resulting in a specific
   Tenant Address (on a specific VN) being reachable through the same
   NVE but through more than one underlay IP address.  Second, a
   specific Tenant System may be reachable through more than one NVE,
   each having one or more underlay addresses.  In both cases, NVE
   address-mapping functionality needs to support one-to-many mappings
   and enable a sending NVE, at a minimum, to fail over from one IP
   address to another, e.g., should a specific NVE underlay address
   become unreachable.
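
   A one-to-many mapping with simple failover might look like the
   sketch below.  The names are hypothetical, and the policy shown
   (prefer the first-learned address, skip any address marked
   unreachable) is an assumption; how reachability is detected is not
   addressed here.

```python
class MultihomedMapping:
    """Maps (VN, tenant address) to an ordered list of underlay addresses."""

    def __init__(self):
        self._candidates = {}      # (vn_id, tenant_addr) -> [underlay_addr]
        self._unreachable = set()  # underlay addresses currently failed

    def add(self, vn_id, tenant_addr, underlay_addr):
        addrs = self._candidates.setdefault((vn_id, tenant_addr), [])
        if underlay_addr not in addrs:
            addrs.append(underlay_addr)

    def mark_unreachable(self, underlay_addr):
        self._unreachable.add(underlay_addr)

    def mark_reachable(self, underlay_addr):
        self._unreachable.discard(underlay_addr)

    def resolve(self, vn_id, tenant_addr):
        """Return the first usable underlay address, or None if none is."""
        for addr in self._candidates.get((vn_id, tenant_addr), []):
            if addr not in self._unreachable:
                return addr
        return None
```

   The same structure covers both scenarios, since the candidate
   addresses may belong to one NVE or to several.  An active-active
   deployment would replace resolve() with a policy that spreads
   traffic across all usable addresses.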

   Finally, multihomed NVEs introduce complexities when source unicast
   replication is used to implement tenant multicast as described in
   Section 4.3.  Specifically, an NVE should receive only one copy of a
   replicated packet, even when it is reachable through more than one
   underlay address.
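
   One way a sending NVE could avoid duplicate copies is to build its
   replication list per remote NVE rather than per underlay address.
   The sketch below assumes that the sender can group underlay
   addresses by a hypothetical NVE identifier; this document does not
   specify such an identifier or how it would be learned.

```python
def replication_targets(nve_addresses, unreachable=frozenset()):
    """Pick exactly one usable underlay address per remote NVE.

    nve_addresses maps a (hypothetical) NVE identifier to the list of
    underlay addresses at which that NVE can be reached.  NVEs with no
    usable address are omitted from the result.
    """
    targets = {}
    for nve_id, addrs in nve_addresses.items():
        for addr in addrs:
            if addr not in unreachable:
                targets[nve_id] = addr
                break
    return targets
```

   The sender would then transmit one encapsulated copy to each value
   in the returned mapping.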

   Multihoming is needed to support important use cases.  First, a bare
   metal server may have multiple uplink connections to either the same
   or different NVEs.  Having only a single physical path to an upstream
   NVE, or indeed, having all traffic flow through a single NVE would be
   considered unacceptable in highly resilient deployment scenarios that
   seek to avoid single points of failure.  Moreover, in today's
   networks, the availability of multiple paths would require that they
   be usable in an active-active fashion (e.g., for load balancing).

4.5.  Virtual Access Point (VAP)

   The VAP is the NVE side of the interface between the NVE and the TS.
   Traffic to and from the tenant flows through the VAP.  If an NVE runs
   into difficulties sending traffic received on the VAP, it may need to
   signal such errors back to the VAP.  Because the VAP is an emulation
   of a physical port, its ability to signal NVE errors is limited and
   lacks sufficient granularity to reflect all possible errors an NVE
   may encounter (e.g., inability to reach a particular destination).
   Some errors, such as an NVE losing all of its connections to the
   underlay, could be reflected back to the VAP by effectively disabling
   it.  This state change would reflect itself on the TS as an interface
   going down, allowing the TS to implement interface error handling
   (e.g., failover) in the same manner as when a physical interface
   becomes disabled.
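
   The coarse signaling described above can be sketched as an NVE that
   disables its VAPs only when every underlay connection is down.  The
   class names and the oper_up attribute are hypothetical; finer
   grained errors, such as one unreachable destination, deliberately
   have no representation in this model.

```python
class Vap:
    """Emulated port; the Tenant System sees only up or down."""

    def __init__(self, name):
        self.name = name
        self.oper_up = True


class NveUplinkMonitor:
    """Reflects total loss of underlay connectivity onto all VAPs."""

    def __init__(self, uplinks, vaps):
        self.uplink_up = {uplink: True for uplink in uplinks}
        self.vaps = list(vaps)

    def set_uplink(self, uplink, up):
        self.uplink_up[uplink] = up
        # VAPs stay enabled while at least one underlay connection works.
        any_up = any(self.uplink_up.values())
        for vap in self.vaps:
            vap.oper_up = any_up
```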
