Internet Research Task Force (IRTF)                             J. Nobre
Request for Comments: 8316           University of Vale do Rio dos Sinos
Category: Informational                                     L. Granville
ISSN: 2070-1721                  Federal University of Rio Grande do Sul
                                                                A. Clemm
                                                                  Huawei
                                                      A. Gonzalez Prieto
                                                                  VMware
                                                           February 2018

       Autonomic Networking Use Case for Distributed Detection of
                Service Level Agreement (SLA) Violations
Abstract

This document describes an experimental use case that employs autonomic networking for the monitoring of Service Level Agreements (SLAs). The use case is for detecting violations of SLAs in a distributed fashion. It strives to optimize and dynamically adapt the autonomic deployment of active measurement probes in a way that maximizes the likelihood of detecting service-level violations with a given resource budget to perform active measurements. This optimization and adaptation should be done without any outside guidance or intervention.

This document is a product of the IRTF Network Management Research Group (NMRG). It is published for informational purposes.

Status of This Memo

This document is not an Internet Standards Track specification; it is published for informational purposes.

This document is a product of the Internet Research Task Force (IRTF). The IRTF publishes the results of Internet-related research and development activities. These results might not be suitable for deployment. This RFC represents the consensus of the Network Management Research Group of the Internet Research Task Force (IRTF). Documents approved for publication by the IRSG are not candidates for any level of Internet Standard; see Section 2 of RFC 7841.

Information about the current status of this document, any errata, and how to provide feedback on it may be obtained at https://www.rfc-editor.org/info/rfc8316.
Copyright Notice

Copyright (c) 2018 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document.

Table of Contents

1. Introduction
2. Definitions and Acronyms
3. Current Approaches
4. Use Case Description
5. A Distributed Autonomic Solution
6. Intended User Experience
7. Implementation Considerations
   7.1. Device-Based Self-Knowledge and Decisions
   7.2. Interaction with Other Devices
8. Comparison with Current Solutions
9. Related IETF Work
10. IANA Considerations
11. Security Considerations
12. Informative References
Acknowledgements
Authors' Addresses
RFC7297]. Violations of SLOs can be associated with significant financial loss, which can be divided into two categories.

First, there is the loss that can be incurred by the user of a service when the agreed service levels are not provided. For example, a financial brokerage might suffer losses on its stock orders when it is unable to execute stock transactions in a timely manner. An electronic retailer may lose customers when its online presence is perceived by customers as sluggish. An online gaming provider may not be able to provide fair access to online players, resulting in frustrated players who are lost as customers. In each case, the failure of a service provider to meet promised service-level guarantees can have a substantial financial impact on users of the service.

Second, there is the loss that is incurred by the provider of a service who is unable to meet promised SLOs. Those losses can take several forms, such as penalties for violating the service level agreement and even loss of future revenue due to reduced customer satisfaction (which, in many cases, is more serious).

Hence, SLOs are a key concern for the service provider. In order to ensure that SLOs are not being violated, service levels need to be continuously monitored at the network infrastructure layer in order to know, for example, when mitigating actions need to be taken. To that end, service-level measurements must take place.

Network measurements can be performed using active or passive measurement techniques. In passive measurements, production traffic is observed, and no monitoring traffic is created by the measurement process itself. That is, network conditions are checked in a non-intrusive way. In the context of IP Flow Information Export
(IPFIX), several documents were produced that define how to export data associated with flow records, i.e., data that is collected as part of passive measurement mechanisms, generally applied against flows of production traffic (e.g., [RFC7011]). In addition, it is possible to collect real data traffic (not just summarized flow records) with time-stamped packets, possibly sampled (e.g., per [RFC5474]), as a means of measuring and inferring service levels. Active measurements, on the other hand, are more intrusive to the network in the sense that they involve injecting synthetic test traffic into the network to measure network service levels, as opposed to simply observing production traffic. The IP Performance Metrics (IPPM) Working Group produced documents that describe active measurement mechanisms such as the One-Way Active Measurement Protocol (OWAMP) [RFC4656], the Two-Way Active Measurement Protocol (TWAMP) [RFC5357], and the Cisco Service-Level Assurance Protocol [RFC6812]. In addition, there are some mechanisms that do not cleanly fit into either active or passive categories, such as Performance and Diagnostic Metrics (PDM) Destination Option techniques [RFC8250]. Active measurement mechanisms offer a high level of control over what and how to measure. They do not require inspecting production traffic. Because of this, active measurements usually offer better accuracy and privacy than passive measurement mechanisms. Traffic encryption and regulations that limit the amount of payload inspection that can occur are non-issues. Furthermore, active measurement mechanisms are able to detect end-to-end network performance problems in a fine-grained way (e.g., simulating the traffic that must be handled considering specific SLOs). As a result, active measurements are often preferred over passive measurement for SLA monitoring. 
Measurement probes must be hosted in network devices and measurement sessions must be activated to compute the current network metrics (for example, metrics such as the ones described in [RFC4148], although note that [RFC4148] was obsoleted by [RFC6248]). This activation should be dynamic in order to follow changes in network conditions, such as those related to routes being added or new customer demands. While offering many advantages, active measurements are expensive in terms of network resource consumption. Active measurements generally involve measurement probes that generate synthetic test traffic that is directed at a responder. The responder needs to timestamp test traffic it receives and reflect it back to the originating measurement probe. The measurement probe subsequently processes the returned packets along with time-stamping information in order to compute service levels. Accordingly, active measurements consume substantial CPU cycles as well as memory of network devices to
generate and process test traffic. In addition, synthetic traffic increases network load. Thus, active measurements compete for resources with other functions, including routing and switching. The resources required and traffic generated by the active measurement sessions are, in a large part, a function of the number of measured network destinations. (In addition, the amount of traffic generated for each measurement plays a role that, in turn, influences the accuracy of the measurement.) When more destinations are measured, a greater number of resources are consumed and more traffic is needed to perform the measurements. Thus, to have better monitoring coverage, it is necessary to deploy more sessions, which consequently increases consumed resources. Otherwise, enabling the observation of just a small subset of all network flows can lead to insufficient coverage. Furthermore, while some end-to-end service levels can be determined by adding up the service levels observed across different path segments, the same is not true for all service levels. For example, the end-to-end delay or packet loss from a node A to a node C routed via a node B can often be computed simply by adding delays (or loss) from A to B and from B to C. This allows the decomposition of a large set of end-to-end measurements into a much smaller set of segment measurements. However, end-to-end jitter and mean opinion scores cannot be decomposed as easily and, for higher accuracy, must be measured end-to-end. Hence, the decision about how to place measurement probes becomes an important management activity. The goal is to obtain the maximum benefits of service-level monitoring with a limited amount of measurement overhead. Specifically, the goal is to maximize the number of service-level violations that are detected with a limited number of resources. The use case and the solution approach described in this document address an important practical issue. 
They are intended to provide a basis for further experimentation to lead to solutions for wider deployment. This document represents the consensus of the IRTF's Network Management Research Group (NMRG). It was discussed extensively and received three separate in-depth reviews.
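
The segment decomposition mentioned above (deriving an A-to-C service level from A-to-B and B-to-C measurements) can be illustrated with a short sketch. This is an illustration only, not part of the use case itself: the function names and sample values are hypothetical, and the loss calculation assumes that losses on the two segments are independent, which the text does not state. Under that assumption, adding loss rates is an approximation that holds when the rates are small.

```python
import math


def end_to_end_delay(segment_delays_ms):
    """Sum one-way delays measured on consecutive path segments."""
    return sum(segment_delays_ms)


def end_to_end_loss(segment_loss_rates):
    """Combine per-segment loss rates, assuming independent losses.

    A packet is delivered only if it survives every segment, so the
    delivery probabilities multiply.
    """
    delivered = math.prod(1.0 - p for p in segment_loss_rates)
    return 1.0 - delivered


# Hypothetical measurements for segments A->B and B->C.
delay_ac = end_to_end_delay([12.0, 8.5])
loss_ac = end_to_end_loss([0.01, 0.02])
print(delay_ac, round(loss_ac, 4))
```

For these sample values, the combined loss is slightly below the plain sum of the two loss rates (0.03), which is why simple addition is usually an acceptable estimate. Jitter and mean opinion scores have no comparable composition rule, so they still call for end-to-end sessions.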
Autonomic Network: A network containing exclusively autonomic nodes, requiring no configuration, and deriving all required information through self-knowledge, discovery, or intent.

Autonomic Service Agent (ASA): An agent implemented on an autonomic node that implements an autonomic function, either in part (in the case of a distributed function, as in the context of this document) or whole

Measurement Session: A communications association between a probe and a responder used to send and reflect synthetic test traffic for active measurements

Probe: The source of synthetic test traffic in an active measurement

Responder: The destination for synthetic test traffic in an active measurement

SLA: Service Level Agreement

SLO: Service Level Objective

P2P: Peer-to-Peer

(Note: The definitions for "Autonomic Network" and "Autonomic Service Agent" are borrowed from [RFC7575]).
activating the new set of required sessions every time the network traffic pattern changes. Finally, the current practice for active measurements usually covers only a fraction of the network flows that should be observed, which invariably leads to the damaging consequence of undetected SLA violations.
solution, or ideally an autonomic solution, is needed so that network measurements are automatically orchestrated and dynamically reconfigured from within the network. This can be accomplished using an autonomic solution that is distributed, using ASAs that are implemented on nodes in the network. RFC7575] can help such detection through an efficient activation of measurement sessions. Such an approach, along with a detailed assessment confirming its viability, is described in [P2PBNM-Nobre-2012]. The problem to be solved by AN in the present use case is how to steer the process of measurement session activation by a complete solution that sets all necessary parameters for this activation to operate efficiently, reliably, and securely, with no required human intervention other than setting overall policy. When a node first comes online, it has no information about which measurements are more critical than others. In the absence of information about past measurements and information from measurement peers, it may start with an initial set of measurement sessions, possibly randomly seeding a set of starter measurements and perhaps taking a round-robin approach for subsequent measurement rounds. However, as measurements are collected, a node will gain an increasing amount of information that it can utilize to refine its strategy of selecting measurement targets going forward. For one, it may take note of which targets returned measurement results very close to service-level thresholds; these targets may require closer scrutiny compared to others. Second, it may utilize observations that are made by its measurement peers in order to conclude which measurement targets may be more critical than others and to ensure that proper overall measurement coverage is obtained (so that not every node incidentally measures the same targets, while other targets are not measured at all). 
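
The selection behavior described above (random seeding when nothing is known, then a preference for targets whose results sit close to the service-level threshold) might be sketched as follows. This is a minimal sketch under stated assumptions, not the method of [P2PBNM-Nobre-2012]: the function name, the use of one-way delay as the only metric, and the choice to reserve roughly a quarter of the budget for targets that have not yet been measured are all hypothetical.

```python
import random


def select_targets(targets, history, slo_threshold_ms, budget, rng=None):
    """Pick up to `budget` targets for the next measurement round.

    targets:          list of target identifiers
    history:          dict mapping target -> last observed delay in ms
    slo_threshold_ms: delay above which the SLO counts as violated
    """
    rng = rng or random.Random()
    measured = [t for t in targets if t in history]
    unmeasured = [t for t in targets if t not in history]

    if not measured:
        # No past results yet: seed the first round randomly.
        return rng.sample(targets, min(budget, len(targets)))

    # Smallest margin to the threshold first; targets already in
    # violation have a negative margin and therefore sort to the front.
    measured.sort(key=lambda t: slo_threshold_ms - history[t])

    # Assumed policy: keep about a quarter of the budget for targets
    # never measured, and more if too few measured targets exist.
    explore = min(len(unmeasured),
                  max(budget // 4, budget - len(measured)))
    return measured[:budget - explore] + rng.sample(unmeasured, explore)


# Hypothetical usage with five targets and a 100 ms delay objective.
targets = ["t1", "t2", "t3", "t4", "t5"]
history = {"t1": 95.0, "t2": 40.0, "t3": 99.0, "t4": 70.0}
print(select_targets(targets, history, 100.0, 2))
```

A fuller version would also fold in observations reported by measurement peers, so that nodes avoid all selecting the same targets.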
We advocate for embedding P2P technology in network devices in order to use autonomic control loops to make decisions about measurement sessions. Specifically, we advocate for network devices to implement an autonomic function that monitors service levels for violations of SLOs and that determines which measurement sessions to set up at any given point in time based on current and past observations of the node and of other peer nodes. By performing these functions locally and autonomically on the device itself, which measurements to conduct can be modified quickly based
on local observations while taking local resource availability into account. This allows a solution to be more robust and react more dynamically to rapidly changing service levels than a solution that has to rely on central coordination. However, in order to optimize decisions about which measurements to conduct, a node will need to communicate with other nodes. This allows a node to take into account other nodes' observations in addition to its own in its decisions.

For example, remote destinations whose observed service levels are on the verge of violating stated objectives may require closer monitoring than remote destinations that are comfortably within a range of tolerance. A distributed autonomic solution also allows nodes to coordinate their probing decisions to collectively achieve the best possible measurement coverage. Because the number of resources available for monitoring, exchanging measurement data, and coordinating with other nodes is limited, a node may be interested in identifying other nodes whose observations are similar to and correlated with its own. This helps a node prioritize and decide which other nodes to coordinate and exchange data with. All of this requires the use of a P2P overlay.

A P2P overlay is essential for several reasons:

   o  It makes it possible for nodes (or more specifically, the ASAs that are deployed on those nodes) in the network to autonomically set up measurement sessions without having to rely on a central management system or controller to perform configuration operations associated with configuring measurement probes and responders.

   o  It facilitates the exchange of data between different nodes to share measurement results so that each node can refine its measurement strategy based not just on its own observations, but also on observations from its peers.

   o  It allows nodes to coordinate their measurements to obtain the best possible test coverage and avoid measurements that have a very low likelihood of detecting service-level violations.

The provisioning of the P2P overlay should be transparent for the network administrator. An Autonomic Control Plane such as defined in [ACP] provides an ideal candidate for the P2P overlay to run on.

An autonomic solution for the distributed detection of SLA violations provides several benefits. First, it provides efficiency; this solution should optimize the resource consumption and avoid resource starvation on the network devices. A device that is "self-aware" of
its available resources will be able to adjust measurement activities rapidly as needed, without requiring a separate control loop involving resource monitoring by an external system. Second, placing logic about where to conduct measurements into the node enables rapid control loops that allow devices to react instantly to observations and adjust their measurement strategy. For example, a device could decide to adjust the amount of synthetic test traffic being sent during the measurement itself depending on results observed so far on this and other concurrent measurement sessions. As a result, the solution could decrease the time necessary to detect SLA violations. Adaptivity features of an autonomic loop could capture the network dynamics faster than a human administrator or even a central controller. Finally, the solution could help to reduce the workload of human administrators.

In practice, these factors combine to maximize the likelihood of SLA violations being detected while operating within a given resource budget. They allow a continuous measurement strategy to be conducted that takes into account past measurement results, observations of other measures such as link utilization or flow data, measurement results shared between network devices, and future measurement activities coordinated among nodes. Combined, this can result in efficient measurement decisions that achieve a golden balance between offering broad network coverage and homing in on service-level "hot spots".
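
One way a node might identify peers "whose observations are similar to and correlated with its own" is to rank candidate peers by a correlation score over measurements of the same targets. The sketch below uses the Pearson correlation coefficient purely as an illustrative assumption; the document does not prescribe a similarity measure, and the names `pearson` and `rank_peers` as well as the sample data are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if one is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_peers(own, peers, k):
    """Return the k peers whose observations correlate best with `own`.

    own:   list of delays this node measured towards a set of targets
    peers: dict mapping peer id -> delays towards the same targets,
           listed in the same order
    """
    ranked = sorted(peers, key=lambda p: pearson(own, peers[p]), reverse=True)
    return ranked[:k]


# Hypothetical observations towards four shared targets.
own = [10.0, 20.0, 30.0, 40.0]
peers = {
    "n1": [11.0, 19.0, 31.0, 41.0],
    "n2": [40.0, 30.0, 20.0, 10.0],
    "n3": [5.0, 5.0, 5.0, 5.0],
}
print(rank_peers(own, peers, 2))
```

Limiting data exchange to the top-ranked peers is one possible way to respect the limited resources available for coordination.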
between the number of detected SLO violations and the number of total SLO violations that are actually occurring (some of which might go undetected). In that case, the solution will aim to minimize the resources spent (i.e., the amount of test traffic and number of measurement sessions) that are required to achieve that target.
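
The relationship between detected and total violations, and the goal of meeting a target with the fewest resources, can be expressed in a few lines. This is a hedged sketch for experimentation only: it assumes that the total number of violations is known, which in practice would require ground truth from a test bed or simulation, and the function names and sample figures are hypothetical.

```python
def detection_ratio(detected, total):
    """Fraction of occurring SLO violations that were detected."""
    if total == 0:
        return 1.0  # nothing to detect, so nothing was missed
    return detected / total


def min_sessions_for_target(ratio_by_sessions, target_ratio):
    """Smallest session count whose observed ratio meets the target.

    ratio_by_sessions: dict mapping number of measurement sessions ->
                       detection ratio observed with that many sessions
    Returns None if no configuration reaches the target.
    """
    feasible = [n for n, r in ratio_by_sessions.items() if r >= target_ratio]
    return min(feasible) if feasible else None


# Hypothetical results from three experiment runs.
observed = {10: 0.60, 20: 0.82, 40: 0.95}
print(detection_ratio(8, 10), min_sessions_for_target(observed, 0.80))
```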
measurement data (i.e., management peers) creates a new topology. Different approaches could be used to define this topology (e.g., correlated peers [P2PBNM-Nobre-2012]). To bootstrap peer selection, each device should use its known neighbors (e.g., FIB and RIB tables) as initial seeds to identify possible peers. It should be noted that a solution will benefit if topology information and network discovery functions are provided by the underlying autonomic framework. A solution will need to be able to discover measurement peers as well as measurement targets, specifically measurement targets that support active measurement responders and that will be able to respond to measurement requests and reflect measurement traffic as needed. DECON] and CSAMP [CSAMP]), but these do not focus on autonomic features.
3. ALTO: The Application-Layer Traffic Optimization Working Group aims to provide topological information at a higher abstraction layer, which can be based upon network policy, and with application-relevant service functions located in it. Their work could be leveraged to define the topology for network devices that exchange measurement data.

12. Informative References

[ACP] Eckert, T., Ed., Behringer, M., Ed., and S. Bjarnason, "An Autonomic Control Plane (ACP)", Work in Progress, draft-ietf-anima-autonomic-control-plane-13, December 2017.

[CSAMP] Sekar, V., Reiter, M., Willinger, W., Zhang, H., Kompella, R., and D. Andersen, "CSAMP: A System for Network-Wide Flow Monitoring", NSDI USENIX Symposium Networked Systems Design and Implementation, April 2008.

[DECON] di Pietro, A., Huici, F., Costantini, D., and S. Niccolini, "DECON: Decentralized Coordination for Large-Scale Flow Monitoring", IEEE INFOCOM Workshops, DOI 10.1109/INFCOMW.2010.5466642, March 2010.

[P2PBNM-Nobre-2012] Nobre, J., Granville, L., Clemm, A., and A. Gonzalez Prieto, "Decentralized Detection of SLA Violations Using P2P Technology", 8th International Conference on Network and Service Management (CNSM), 2012, <http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6379997>.

[RFC4148] Stephan, E., "IP Performance Metrics (IPPM) Metrics Registry", BCP 108, RFC 4148, DOI 10.17487/RFC4148, August 2005, <https://www.rfc-editor.org/info/rfc4148>.

[RFC4656] Shalunov, S., Teitelbaum, B., Karp, A., Boote, J., and M. Zekauskas, "A One-way Active Measurement Protocol (OWAMP)", RFC 4656, DOI 10.17487/RFC4656, September 2006, <https://www.rfc-editor.org/info/rfc4656>.

[RFC5357] Hedayat, K., Krzanowski, R., Morton, A., Yum, K., and J. Babiarz, "A Two-Way Active Measurement Protocol (TWAMP)", RFC 5357, DOI 10.17487/RFC5357, October 2008, <https://www.rfc-editor.org/info/rfc5357>.

[RFC5474] Duffield, N., Ed., Chiou, D., Claise, B., Greenberg, A., Grossglauser, M., and J. Rexford, "A Framework for Packet Selection and Reporting", RFC 5474, DOI 10.17487/RFC5474, March 2009, <https://www.rfc-editor.org/info/rfc5474>.

[RFC6248] Morton, A., "RFC 4148 and the IP Performance Metrics (IPPM) Registry of Metrics Are Obsolete", RFC 6248, DOI 10.17487/RFC6248, April 2011, <https://www.rfc-editor.org/info/rfc6248>.

[RFC6812] Chiba, M., Clemm, A., Medley, S., Salowey, J., Thombare, S., and E. Yedavalli, "Cisco Service-Level Assurance Protocol", RFC 6812, DOI 10.17487/RFC6812, January 2013, <https://www.rfc-editor.org/info/rfc6812>.

[RFC7011] Claise, B., Ed., Trammell, B., Ed., and P. Aitken, "Specification of the IP Flow Information Export (IPFIX) Protocol for the Exchange of Flow Information", STD 77, RFC 7011, DOI 10.17487/RFC7011, September 2013, <https://www.rfc-editor.org/info/rfc7011>.

[RFC7297] Boucadair, M., Jacquenet, C., and N. Wang, "IP Connectivity Provisioning Profile (CPP)", RFC 7297, DOI 10.17487/RFC7297, July 2014, <https://www.rfc-editor.org/info/rfc7297>.

[RFC7575] Behringer, M., Pritikin, M., Bjarnason, S., Clemm, A., Carpenter, B., Jiang, S., and L. Ciavaglia, "Autonomic Networking: Definitions and Design Goals", RFC 7575, DOI 10.17487/RFC7575, June 2015, <https://www.rfc-editor.org/info/rfc7575>.

[RFC8250] Elkins, N., Hamilton, R., and M. Ackermann, "IPv6 Performance and Diagnostic Metrics (PDM) Destination Option", RFC 8250, DOI 10.17487/RFC8250, September 2017, <https://www.rfc-editor.org/info/rfc8250>.