The top-level network telemetry framework partitions the network telemetry into four modules based on the telemetry data object source and represents their relationship. Once the network operation applications acquire the data from these modules, they can apply data analytics and take actions. At the next level, the framework decomposes each module into separate components. Each of these modules follows the same underlying structure, with one component dedicated to the configuration of data subscriptions and data sources, a second component dedicated to encoding and exporting data, and a third component instrumenting the generation of telemetry related to the underlying resources. Throughout the framework, the same set of abstract data-acquiring mechanisms and data types (Section 3.3) are applied. The two-level architecture with the uniform data abstraction helps to accurately pinpoint a protocol or technique to its position in a network telemetry system or to disaggregate a network telemetry system into manageable parts.
Telemetry can be applied on the forwarding plane, control plane, and management plane in a network, as well as on other sources out of the network, as shown in Figure 1. Therefore, we categorize the network telemetry into four distinct modules (management plane, control plane, forwarding plane, and external data and event telemetry) with each having its own interface to network operation applications.
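+------------------------------+
|                              |
|      Network Operation       |<-------+
|         Applications         |        |
|                              |        |
+------------------------------+        |
     ^          ^           ^           |
     |          |           |           |
     V          V           |           V
+------------+-----------+  |     +-----------+
|            |           |  |     |           |
|            | Control   |  |     |           |
|            | Plane     |  |     | External  |
|          <--->         |  |     | Data and  |
|            | Telemetry |  |     | Event     |
| Management |     ^     |  V     | Telemetry |
| Plane      +-----|-----+------+ |           |
| Telemetry  |     V            | +-----------+
|            | Forwarding       |
|            | Plane            |
|          <--->                |
|            | Telemetry        |
|            |                  |
+------------+------------------+
Figure 1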
The rationale of this partition lies in the different telemetry data objects that result in different data sources and export locations. Such differences have profound implications on in-network data programming and processing capability, data encoding and the transport protocol, and required data bandwidth and latency. Data can be sent directly or proxied via the control and management planes. There are advantages/disadvantages to both approaches.
Note that in some cases, the network controller itself may be the source of telemetry data that is unique to it or derived from the telemetry data collected from the network elements. Some of the principles and taxonomy specific to the control plane and management plane telemetry could also be applied to the controller when it is required to provide the telemetry data to network operation applications hosted outside. The scope of this document is focused on the network elements telemetry, and further details related to controllers are thus out of scope.
We summarize the major differences of the four modules in Table 1. They are compared from six angles:
Data Object
Data Export Location
Data Model
Data Encoding
Telemetry Application Protocol
Data Transport Method
Data Object is the target and source of each module. Because the data source varies, the location where data is most conveniently exported also varies. For example, forwarding plane data mainly originates as data exported from the forwarding Application-Specific Integrated Circuits (ASICs), while control plane data mainly originates from the protocol daemons running on the control CPU(s). For convenience and efficiency, it is preferred to export the data off the device from locations near the source. Because the locations that can export data have different capabilities, different choices of data models, encoding, and transport methods are made to balance the performance and cost. For example, the forwarding chip has high throughput but limited capacity for processing complex data and maintaining state, while the main control CPU is capable of complex data and state processing but has limited bandwidth for high throughput data. As a result, the suitable telemetry protocol for each module can be different. Some representative techniques are shown in the corresponding table blocks to highlight the technical diversity of these modules. Note that the selected techniques just reflect the de facto state of the art and are by no means exhaustive (e.g., IPFIX can also be implemented over TCP and SCTP, but that is not recommended for the forwarding plane). The key point is that one cannot expect to use a universal protocol to cover all the network telemetry requirements.
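Table 1: Comparison of Data Object Modules
+=============+===============+===============+================+===============+
| Module      | Management    | Control Plane | Forwarding     | External Data |
|             | Plane         |               | Plane          | and Event     |
+=============+===============+===============+================+===============+
| Data Object | configuration | control       | flow and       | terminal,     |
|             | and operation | protocol and  | packet QoS,    | social, and   |
|             | state         | signaling,    | traffic stat., | environmental |
|             |               | RIB           | buffer and     |               |
|             |               |               | queue stat.,   |               |
|             |               |               | FIB, Access    |               |
|             |               |               | Control List   |               |
|             |               |               | (ACL)          |               |
+-------------+---------------+---------------+----------------+---------------+
| Data Export | main control  | main control  | forwarding     |               |
| Location    | CPU           | CPU, linecard | chip or        |               |
|             |               | CPU, or       | linecard CPU;  |               |
|             |               | forwarding    | main control   |               |
|             |               | chip          | CPU unlikely   |               |
+-------------+---------------+---------------+----------------+---------------+
| Data Model  | YANG, MIB,    |               |                |               |
|             | syslog        |               |                |               |
+-------------+---------------+---------------+----------------+---------------+
| Data        | GPB, JSON,    | GPB, JSON,    |                | GPB, JSON,    |
| Encoding    | XML           | XML, plain    |                | XML, plain    |
|             |               | text          |                | text          |
+-------------+---------------+---------------+----------------+---------------+
| Telemetry   | gRPC,         | gRPC,         | IPFIX, traffic |               |
| Application | NETCONF,      | NETCONF,      | mirroring,     |               |
| Protocol    | RESTCONF      | IPFIX,        | gRPC, NETFLOW  |               |
|             |               | traffic       |                |               |
|             |               | mirroring     |                |               |
+-------------+---------------+---------------+----------------+---------------+
| Data        |               | HTTP(S), TCP, |                | HTTP(S), TCP, |
| Transport   |               | UDP           |                | UDP           |
| Method      |               |               |                |               |
+-------------+---------------+---------------+----------------+---------------+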
Note that the interaction with the applications that consume network telemetry data can be indirect. Some in-device data transfer is possible. For example, in the management plane telemetry, the management plane will need to acquire data from the data plane. Some operational states can only be derived from data plane data sources such as the interface status and statistics. As another example, obtaining control plane telemetry data may require the ability to access the Forwarding Information Base (FIB) of the data plane.
On the other hand, an application may involve more than one plane and interact with multiple planes simultaneously. For example, an SLA compliance application may require both the data plane telemetry and the control plane telemetry.
The requirements and challenges for each module are summarized as follows (note that the requirements may pertain across all telemetry modules; however, we emphasize those that are most pronounced for a particular plane).
The management plane of network elements interacts with the Network Management System (NMS) and provides information such as performance data, network logging data, network warning and defects data, and network statistics and state data. The management plane includes many protocols, including the classical SNMP and syslog. Regardless of the protocol, management plane telemetry must address the following requirements:
Convenient Data Subscription: An application should have the freedom to choose which data is exported (see Section 3.3) and the means and frequency of how that data is exported (e.g., on-change or periodic subscription).
Structured Data: For automatic network operation, machines will replace humans for network data comprehension. Data modeling languages, such as YANG, can efficiently describe structured data and normalize data encoding and transformation.
High-Speed Data Transport: In order to keep up with the velocity of information, a data source needs to be able to send large amounts of data at high frequency. Compact encoding formats or data compression schemes are needed to reduce the quantity of data and improve the data transport efficiency. The subscription mode, by replacing the query mode, reduces the interactions between clients and servers and helps to improve the data source's efficiency.
Network Congestion Avoidance: The application must protect the network from congestion with congestion control mechanisms or, at minimum, with circuit breakers. [RFC 8084] and [RFC 8085] provide some solutions in this space.
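The effect of compact encoding and compression named under High-Speed Data Transport can be illustrated with a minimal sketch. It uses only the Python standard library; the record layout (interface index plus two packet counters), the field names, and the sample values are hypothetical, and the fixed-layout binary format merely stands in for a compact encoding such as GPB, which is not used here.

```python
import json
import struct
import zlib

# Hypothetical samples: (interface index, rx packets, tx packets).
samples = [(i, 1_000_000 + 17 * i, 2_000_000 + 31 * i) for i in range(100)]

def encode_json(rows):
    """Self-describing text encoding: field names repeat in every record."""
    records = [{"if_index": i, "rx_packets": rx, "tx_packets": tx}
               for i, rx, tx in rows]
    return json.dumps(records).encode("utf-8")

def encode_binary(rows):
    """Fixed-layout binary encoding: u32 index + two u64 counters, network order."""
    return b"".join(struct.pack("!IQQ", i, rx, tx) for i, rx, tx in rows)

def decode_binary(blob):
    """Inverse of encode_binary, used to check that no information is lost."""
    size = struct.calcsize("!IQQ")
    return [struct.unpack("!IQQ", blob[off:off + size])
            for off in range(0, len(blob), size)]

json_bytes = encode_json(samples)
binary_bytes = encode_binary(samples)
compressed_json = zlib.compress(json_bytes)

# Sizes in bytes: text encoding, compact binary encoding, compressed text.
print(len(json_bytes), len(binary_bytes), len(compressed_json))
```

The binary form trades self-description for size, so both ends must agree on the layout in advance; this is the role a shared data model plays in a real deployment.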
The control plane telemetry refers to the health condition monitoring of different network control protocols at all layers of the protocol stack. Keeping track of the operational status of these protocols is beneficial for detecting, localizing, and even predicting various network issues, as well as for network optimization, in real time and with fine granularity. Some particular challenges and issues faced by the control plane telemetry are as follows:
How to correlate the End-to-End (E2E) Key Performance Indicators (KPIs) to a specific layer's KPIs. For example, IPTV users may describe their User Experience (UE) by the video smoothness and definition. Then in case of an unusually poor UE KPI or a service disconnection, it is non-trivial to delimit and pinpoint the issue in the responsible protocol layer (e.g., the transport layer or the network layer), the responsible protocol (e.g., IS-IS or BGP at the network layer), and finally the responsible device(s) with specific reasons.
How to improve on conventional OAM-based approaches for control plane KPI measurement, which include Ping (L3), Traceroute (L3), Y.1731 [y1731] (L2), and so on. One common issue behind these methods is that they only measure the KPIs instead of reflecting the actual running status of these protocols, making them less effective or efficient for control plane troubleshooting and network optimization.
How to extend control plane monitoring beyond the BGP Monitoring Protocol (BMP). BMP is an example of the control plane telemetry; it is currently used for monitoring BGP routes and enables rich applications, such as BGP peer analysis, Autonomous System (AS) analysis, prefix analysis, and security analysis. However, the monitoring of other layers and protocols and the cross-layer, cross-protocol KPI correlations are still in their infancy (e.g., IGP monitoring is not as extensive as BMP) and require further research.
Note that the requirements and solutions for network congestion avoidance are also applicable to the control plane telemetry.
An effective forwarding plane telemetry system relies on the data that the network device can expose. The quality, quantity, and timeliness of data must meet some stringent requirements. This raises some challenges for the network data plane devices where the first-hand data originates.
A data plane device's main function is user traffic processing and forwarding. While supporting network visibility is important, the telemetry is just an auxiliary function, and it should strive to not impede normal traffic processing and forwarding (i.e., the forwarding behavior should not be altered, and the trade-off between forwarding performance and telemetry should be well-balanced).
Network operation applications require end-to-end visibility across various sources, which can result in a huge volume of data. However, the sheer quantity of data must not exhaust the network bandwidth, regardless of the data delivery approach (i.e., whether through in-band or out-of-band channels).
The data plane devices must provide timely data with the minimum possible delay. Long processing, transport, storage, and analysis delay can impact the effectiveness of the control loop and even render the data useless.
The data should be structured, labeled, and easy for applications to parse and consume. At the same time, the data types needed by applications can vary significantly. The data plane devices need to provide enough flexibility and programmability to support the precise data provision for applications.
The data plane telemetry should support incremental deployment and work even though some devices are unaware of the system.
The requirements and solutions for network congestion avoidance are also applicable to the forwarding plane telemetry.
Although not specific to the forwarding plane, these challenges are more difficult for the forwarding plane because of the limited resources and flexibility. Data plane programmability is essential to support network telemetry. Newer data plane forwarding chips are equipped with advanced telemetry features and provide flexibility to support customized telemetry functions.
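One way to keep telemetry export from exhausting bandwidth is to cap the export rate at the device. The following is a minimal sketch of a token-bucket limiter; the technique is an illustrative assumption rather than something this framework prescribes, and the class name, rates, and report sizes are hypothetical. Time is passed in explicitly so the behavior is deterministic.

```python
class ExportRateLimiter:
    """Token bucket that caps the bytes of telemetry exported per second."""

    def __init__(self, rate_bytes_per_s, burst_bytes):
        self.rate = rate_bytes_per_s
        self.capacity = burst_bytes
        self.tokens = float(burst_bytes)
        self.last = 0.0  # timestamp of the previous call, in seconds

    def allow(self, size_bytes, now):
        """Return True if a report of size_bytes may be exported at time now."""
        elapsed = now - self.last
        self.last = now
        # Refill tokens for the elapsed time, never beyond the burst capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if size_bytes <= self.tokens:
            self.tokens -= size_bytes
            return True
        return False  # caller drops or aggregates the report instead


limiter = ExportRateLimiter(rate_bytes_per_s=1000, burst_bytes=1500)
# Ten 500-byte reports offered at the same instant, then one a second later.
decisions = [limiter.allow(500, now=0.0) for _ in range(10)]
later = limiter.allow(500, now=1.0)
print(decisions.count(True), later)
```

Reports that are refused need not be lost: a device could aggregate or sample them, which trades data coverage for bandwidth.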
Technique Taxonomy: This pertains to how one instruments the telemetry; there can be multiple possible dimensions to classify the forwarding plane telemetry techniques.
Active, Passive, and Hybrid: This dimension pertains to the end-to-end measurement. Active and passive methods (as well as the hybrid types) are well documented in [RFC 7799]. Passive methods include TCPDUMP, IPFIX [RFC 7011], sFlow, and traffic mirroring. These methods usually have low data coverage, and improving the coverage comes at a very high bandwidth cost. On the other hand, active methods include Ping, the One-Way Active Measurement Protocol (OWAMP) [RFC 4656], the Two-Way Active Measurement Protocol (TWAMP) [RFC 5357], the Simple Two-way Active Measurement Protocol (STAMP) [RFC 8762], and Cisco's Service-Level Assurance Protocol [RFC 6812]. These methods are intrusive and only provide indirect network measurements. Hybrid methods, including In Situ OAM (IOAM) [RFC 9197], Alternate Marking (AM) [RFC 8321], and Multipoint Alternate Marking [RFC 8889], provide a well-balanced and more flexible approach. However, these methods are also more complex to implement.
In-Band and Out-of-Band: Telemetry data carried in user packets before being exported to a data collector is considered in-band (e.g., IOAM [RFC 9197]). Telemetry data that is directly exported to a data collector without modifying user packets is considered out-of-band (e.g., the postcard-based approach described in Appendix A.3.5). It is also possible to have hybrid methods, where only the telemetry instruction or partial data is carried by user packets (e.g., AM [RFC 8321]).
End-to-End and In-Network: End-to-end methods start from, and end at, the network end hosts (e.g., Ping). In-network methods work in networks and are transparent to end hosts. However, if needed, in-network methods can be easily extended into end hosts.
Data Subject: Depending on the telemetry objective, the methods can be flow based (e.g., IOAM [RFC 9197]), path based (e.g., Traceroute), or node based (e.g., IPFIX [RFC 7011]). The various data objects can be packet, flow record, measurement, states, and signal.
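The in-band and out-of-band distinction in the taxonomy above can be sketched as follows. This is a toy model under stated assumptions, not an implementation of any packet format: the `Packet` class, the node names, and the queue-depth values are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Packet:
    flow_id: str
    payload: bytes
    trace: list = field(default_factory=list)  # in-band metadata carried in the packet

def forward_in_band(packet, path):
    """Each node appends its data to the packet; the last node exports the trace."""
    for node, queue_depth in path:
        packet.trace.append((node, queue_depth))
    exported = [(packet.flow_id, list(packet.trace))]  # one export at the path end
    packet.trace.clear()  # strip metadata before delivery to the end host
    return exported

def forward_out_of_band(packet, path):
    """Each node exports its own report (a "postcard"); the packet is untouched."""
    return [(packet.flow_id, node, queue_depth) for node, queue_depth in path]

path = [("node-a", 3), ("node-b", 12), ("node-c", 0)]  # (node name, queue depth)
in_band = forward_in_band(Packet("flow-1", b"data"), path)
postcards = forward_out_of_band(Packet("flow-1", b"data"), path)
print(len(in_band), len(postcards))
```

The in-band variant produces a single export that already holds the whole path but grows the packet at every hop; the out-of-band variant leaves the packet alone and leaves it to the collector to correlate one report per node.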
Events that occur outside the boundaries of the network system are another important source of network telemetry. Correlating both internal telemetry data and external events with the requirements of network systems, as presented in [NMRG-ANTICIPATED-ADAPTATION], provides a strategic and functional advantage to management operations.
As with other sources of telemetry information, the data and events must meet strict requirements, especially in terms of timeliness, which is essential to properly incorporate external event information into network management applications. The specific challenges are described as follows:
The role of the external event detector can be played by multiple elements, including hardware (e.g., physical sensors, such as seismometers) and software (e.g., big data sources that can analyze streams of information, such as Twitter messages). Thus, the transmitted data must support different shapes but, at the same time, follow a common but extensible schema.
Since the main function of the external event detectors is to perform the notifications, their timeliness is assumed. However, once messages have been dispatched, they must be quickly collected and inserted into the control plane with variable priority, which is higher for important sources and events and lower for secondary ones.
The schema used by external detectors must be easily adopted by current and future devices and applications. Therefore, it must be easily mapped to current data models, such as those expressed in YANG.
As the communication with external entities outside the boundary of a provider network may be realized over the Internet, the risk of congestion is even more relevant in this context and proper countermeasures must be taken. Solutions such as network transport circuit breakers are needed as well.
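The variable-priority insertion of external events described above can be sketched with a priority queue. This is a minimal illustration using the Python standard library; the class name, the numeric priority scale, and the example sources and events are hypothetical.

```python
import heapq
import itertools

class ExternalEventQueue:
    """Priority queue for external events; a lower number means more urgent."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # tie-breaker keeps FIFO order per priority

    def insert(self, priority, source, event):
        heapq.heappush(self._heap, (priority, next(self._order), source, event))

    def pop(self):
        priority, _, source, event = heapq.heappop(self._heap)
        return priority, source, event

    def __len__(self):
        return len(self._heap)

queue = ExternalEventQueue()
queue.insert(5, "social-feed", {"type": "traffic-surge-rumor"})
queue.insert(1, "seismometer", {"type": "earthquake", "magnitude": 6.1})
queue.insert(5, "weather-service", {"type": "storm-warning"})

# Drain the queue: the important source is served first, ties keep arrival order.
order = [queue.pop()[1] for _ in range(len(queue))]
print(order)
```

The events are plain dictionaries here; in practice they would follow the common, extensible schema that the challenges above call for.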
Organizing both internal and external telemetry information together will be key for the general exploitation of the management possibilities of current and future network systems, as reflected in the incorporation of cognitive capabilities to new hardware and software (virtual) elements.
The telemetry module at each plane can be further partitioned into five distinct conceptual components:
Data Query, Analysis, and Storage: This component works at the network operation application block in Figure 1. It is normally a part of the network management system at the receiver side. On the one hand, it is responsible for issuing data requirements. The data of interest can be modeled data through configuration or custom data through programming. The data requirements can be queries for one-shot data or subscriptions for events or streaming data. On the other hand, it receives, stores, and processes the returned data from network devices. Data analysis can be interactive to initiate further data queries. This component can reside in either network devices or remote controllers. It can be centralized or distributed and involve one or more instances.
Data Configuration and Subscription: This component manages data queries on devices. It determines the protocol and channel for applications to acquire desired data. This component is also responsible for configuring the desired data that might not be directly available from data sources. The subscription data can be described by models, templates, or programs.
Data Encoding and Export: This component determines how telemetry data is delivered to the data analysis and storage component with access control. The data encoding and the transport protocol may vary due to the data export location.
Data Generation and Processing: The requested data needs to be captured, filtered, processed, and formatted in network devices from raw data sources. This may involve in-network computing and processing on either the fast path or the slow path in network devices.
Data Object and Source: This component determines the monitoring objects and original data sources provisioned in the device. A data source usually just provides raw data that needs further processing. Each data source can be considered a probe. Some data sources can be dynamically installed, while others will be more static.
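  +----------------------------------------+
+----------------------------------------+ |
|                                        | |
|    Data Query, Analysis, & Storage     | |
|                                        | +
+----------------------------------------+
           |                   ^
           |                   |
           V                   |
    +----------------------------------------+
  +----------------------------------------+ |
+---------------------+------------------+ | |
| Data Configuration  |                  | | |
| & Subscription      | Data Encoding    | | |
| (model, template,   | & Export         | | |
| & program)          |                  | | |
+---------------------+------------------+ | |
|                                        | | |
|            Data Generation             | | |
|            & Processing                | | |
|                                        | | |
+----------------------------------------+ | |
|                                        | | |
|         Data Object and Source         | |-+
|                                        |-+
+----------------------------------------+
Figure 2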
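The way the five components hand data to one another can be sketched as a small pipeline. This is a toy model under stated assumptions: the function names, the subscription format, and the counter values are hypothetical, and JSON stands in for whichever encoding a real module would select.

```python
import json

# Data Object and Source: a raw probe; the counter values are hypothetical.
def read_source():
    return [{"if": "eth0", "rx": 1200, "tx": 900},
            {"if": "eth1", "rx": 0, "tx": 0}]

# Data Configuration and Subscription: what the application asked for.
subscription = {"fields": ["if", "rx"], "min_rx": 1}

# Data Generation and Processing: capture, filter, and format raw data.
def generate(raw, sub):
    return [{k: row[k] for k in sub["fields"]}
            for row in raw if row["rx"] >= sub["min_rx"]]

# Data Encoding and Export: serialize and hand off to a transport.
def export(records, send):
    send(json.dumps(records, sort_keys=True).encode("utf-8"))

# Data Query, Analysis, and Storage: the receiver side stores what arrives.
store = []
export(generate(read_source(), subscription), store.append)
received = json.loads(store[0])
print(received)
```

Passing the transport in as `send` keeps encoding and export independent of the other stages, which mirrors the point that the encoding and transport may vary with the export location.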
Broadly speaking, network data can be acquired through subscription (push) and query (poll). A subscription is a contract between publisher and subscriber. After initial setup, the subscribed data is automatically delivered to registered subscribers until the subscription expires. There are two variations of subscription. The subscriptions can be predefined, or the subscribers are allowed to configure and tailor the published data to their specific needs.
In contrast, queries are used when a client expects immediate and one-off feedback from network devices. The queried data may be directly extracted from some specific data source or synthesized and processed from raw data. Queries work well for interactive network telemetry applications.
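The contrast between subscription (push) and query (poll) can be sketched with a toy publisher. This is a minimal illustration, not the API of any real telemetry protocol: the `Publisher` class, the path string, and the choice to expire a subscription after a fixed number of deliveries are all assumptions made for the example.

```python
class Publisher:
    """Toy data source supporting query (poll) and subscription (push)."""

    def __init__(self):
        self.state = {}
        self.subscribers = []  # entries are [path, callback, remaining_updates]

    def query(self, path):
        """One-off request: return the current value immediately."""
        return self.state.get(path)

    def subscribe(self, path, callback, max_updates):
        """On-change subscription that expires after max_updates deliveries."""
        self.subscribers.append([path, callback, max_updates])

    def update(self, path, value):
        changed = self.state.get(path) != value
        self.state[path] = value
        if not changed:
            return
        for sub in self.subscribers:
            if sub[0] == path and sub[2] > 0:
                sub[1](path, value)
                sub[2] -= 1
        # Drop subscriptions that have used up their deliveries.
        self.subscribers = [s for s in self.subscribers if s[2] > 0]

pub = Publisher()
pushed = []
pub.subscribe("eth0/oper-status", lambda p, v: pushed.append(v), max_updates=2)
pub.update("eth0/oper-status", "up")
pub.update("eth0/oper-status", "up")     # no change, nothing is pushed
pub.update("eth0/oper-status", "down")
pub.update("eth0/oper-status", "up")     # subscription already expired
print(pushed, pub.query("eth0/oper-status"))
```

After the subscription expires, the client only learns the latest state by issuing a query, which is the one-off interaction described above.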
In general, data can be pulled (i.e., queried) whenever needed, but in many cases, pushing the data (i.e., subscription) is more efficient, and it can reduce the latency of a client detecting a change. From the data consumer point of view, there are four types of data from network devices that a telemetry data consumer can subscribe to or query:
Simple Data: Data that are steadily available from some datastore or static probes in network devices.
Derived Data: Data that need to be synthesized or processed in the network from raw data from one or more network devices. The data processing function can be statically or dynamically loaded into network devices.
Event-triggered Data: Data that are conditionally acquired based on the occurrence of some events. An example of event-triggered data could be an interface changing operational state between up and down. Such data can be actively pushed through subscription or passively polled through query. There are many ways to model events, including using Finite State Machine (FSM) or [NETMOD-ECA-POLICY].
Streaming Data: Data that are continuously generated. It can be a time series or the dump of databases. For example, an interface packet counter is exported every second. The streaming data reflect real-time network states and metrics and require large bandwidth and processing power. The streaming data are always actively pushed to the subscribers.
The above telemetry data types are not mutually exclusive. Rather, they are often composite. Derived data is composed of simple data; event-triggered data can be simple or derived; and streaming data can be based on some recurring event. The relationships of these data types are illustrated in Figure 3.
+----------------------+     +-----------------+
| Event-Triggered Data |<----+ Streaming Data  |
+-------+---+----------+     +-----+---+-------+
        |   |                      |   |
        |   |                      |   |
        |   |   +--------------+   |   |
        |   +-->| Derived Data |<--+   |
        |       +------+-------+       |
        |              |               |
        |              V               |
        |       +--------------+       |
        +------>| Simple Data  |<------+
                +--------------+
Figure 3
Subscription usually deals with event-triggered data and streaming data, and query usually deals with simple data and derived data. However, the other combinations are also possible. Advanced network telemetry techniques are designed mainly for event-triggered or streaming data subscription and derived data query.
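How the four data types build on one another can be sketched in a few lines. This is a minimal illustration under stated assumptions: the counter readings, the one-second interval, and the threshold are hypothetical values chosen for the example.

```python
def derive_rate(prev_count, curr_count, interval_s):
    """Derived data: packets per second computed from two simple counter reads."""
    return (curr_count - prev_count) / interval_s

def stream(counter_reads, interval_s):
    """Streaming data: yield one derived sample per interval, continuously."""
    for prev, curr in zip(counter_reads, counter_reads[1:]):
        yield derive_rate(prev, curr, interval_s)

def threshold_events(rates, limit):
    """Event-triggered data: report only samples that exceed the limit."""
    return [(index, rate) for index, rate in enumerate(rates) if rate > limit]

# Simple data: hypothetical packet counter read once per second.
counter_reads = [0, 100, 250, 1250, 1300]
rates = list(stream(counter_reads, interval_s=1))
events = threshold_events(rates, limit=500)
print(rates, events)
```

Here the stream is a time series of derived values, and the event is raised on a derived value, which matches the composition of data types described above.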
The following table shows how the existing mechanisms (mainly published in the IETF, with an emphasis on the latest technologies) are positioned in the framework. Given the vast body of existing work, we cannot provide an exhaustive list, so the mechanisms in the table should be considered as just examples. Also, some comprehensive protocols and techniques may cover multiple aspects or modules of the framework, so a name in a block only emphasizes one particular characteristic of it. More details about some listed mechanisms can be found in Appendix A.
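Table 2: Existing Work Mapping
+===============+==================+================+==================+
|               | Management Plane | Control Plane  | Forwarding Plane |
+===============+==================+================+==================+
| data          | gNMI, NETCONF,   | gNMI, NETCONF, | NETCONF,         |
| configuration | RESTCONF, SNMP,  | RESTCONF,      | RESTCONF,        |
| and subscribe | YANG-Push        | YANG-Push      | YANG-Push        |
+---------------+------------------+----------------+------------------+
| data          |                  |                | IOAM, PSAMP,     |
| generation    |                  |                | PBT, AM          |
| and process   |                  |                |                  |
+---------------+------------------+----------------+------------------+
| data encoding | gRPC, HTTP, TCP  |                |                  |
| and export    |                  |                |                  |
+---------------+------------------+----------------+------------------+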
Although the framework is generally suitable for any network environment, multi-domain telemetry has some unique challenges that deserve further architectural consideration, which is out of the scope of this document.