
RFC 8632

A YANG Data Model for Alarm Management

Pages: 82
Proposed Standard
Part 4 of 4 – Pages 70 to 82

Appendix A. Vendor-Specific Alarm Types Example

This example shows how to define alarm types in a vendor-specific module. In this case, the vendor "xyz" has chosen to define top-level identities according to X.733 event types.

   module example-xyz-alarms {
     namespace "urn:example:xyz-alarms";
     prefix xyz-al;

     import ietf-alarms {
       prefix al;
     }

     identity xyz-alarms {
       base al:alarm-type-id;
     }

     identity communications-alarm {
       base xyz-alarms;
     }
     identity quality-of-service-alarm {
       base xyz-alarms;
     }
     identity processing-error-alarm {
       base xyz-alarms;
     }
     identity equipment-alarm {
       base xyz-alarms;
     }
     identity environmental-alarm {
       base xyz-alarms;
     }

     // communications alarms
     identity link-alarm {
       base communications-alarm;
     }

     // QoS alarms
     identity high-jitter-alarm {
       base quality-of-service-alarm;
     }
   }
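
The identity hierarchy above lets a management system reason about abstract alarm types. As a rough illustration only, the following Python sketch walks the base chain of the example identities. The "BASE" table and the function name are hypothetical helpers, not part of any YANG toolkit; the check mirrors the idea behind the YANG 1.1 XPath function derived-from-or-self(), and it assumes each identity has a single base.

```python
# Base identity of each identity in the example module (hypothetical table).
# "al:alarm-type-id" is the root identity defined in ietf-alarms.
BASE = {
    "xyz-alarms": "al:alarm-type-id",
    "communications-alarm": "xyz-alarms",
    "quality-of-service-alarm": "xyz-alarms",
    "processing-error-alarm": "xyz-alarms",
    "equipment-alarm": "xyz-alarms",
    "environmental-alarm": "xyz-alarms",
    "link-alarm": "communications-alarm",
    "high-jitter-alarm": "quality-of-service-alarm",
}


def derived_from_or_self(identity, ancestor):
    """Return True if identity equals ancestor or derives from it."""
    current = identity
    while current is not None:
        if current == ancestor:
            return True
        current = BASE.get(current)  # None once we step past the root
    return False


# A filter on the abstract type also selects the concrete link alarm.
link_is_comms = derived_from_or_self("link-alarm", "communications-alarm")
```

With such a check, a filter on "communications-alarm" also selects "link-alarm" without the manager knowing every concrete alarm type in advance.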

Appendix B. Alarm Inventory Example

This shows an alarm inventory: one alarm type is defined only with the identifier and another is dynamically configured. In the latter case, a digital input has been connected to a smoke detector; therefore, the "alarm-type-qualifier" is set to "smoke-alarm" and the "alarm-type-id" to "environmental-alarm".

   <alarms xmlns="urn:ietf:params:xml:ns:yang:ietf-alarms"
           xmlns:xyz-al="urn:example:xyz-alarms"
           xmlns:dev="urn:example:device">
     <alarm-inventory>
       <alarm-type>
         <alarm-type-id>xyz-al:link-alarm</alarm-type-id>
         <alarm-type-qualifier/>
         <resource>
           /dev:interfaces/dev:interface
         </resource>
         <will-clear>true</will-clear>
         <description>
           Link failure; operational state down but admin state up
         </description>
       </alarm-type>
       <alarm-type>
         <alarm-type-id>xyz-al:environmental-alarm</alarm-type-id>
         <alarm-type-qualifier>smoke-alarm</alarm-type-qualifier>
         <will-clear>true</will-clear>
         <description>
           Connected smoke detector to digital input
         </description>
       </alarm-type>
     </alarm-inventory>
   </alarms>

Appendix C. Alarm List Example

In this example, we show an alarm that has toggled [major, clear, major]. An operator has acknowledged the alarm.

   <alarms xmlns="urn:ietf:params:xml:ns:yang:ietf-alarms"
           xmlns:xyz-al="urn:example:xyz-alarms"
           xmlns:dev="urn:example:device">
     <alarm-list>
       <number-of-alarms>1</number-of-alarms>
       <last-changed>2018-04-08T08:39:50.00Z</last-changed>
       <alarm>
         <resource>
           /dev:interfaces/dev:interface[name='FastEthernet1/0']
         </resource>
         <alarm-type-id>xyz-al:link-alarm</alarm-type-id>
         <alarm-type-qualifier></alarm-type-qualifier>
         <time-created>2018-04-08T08:20:10.00Z</time-created>
         <is-cleared>false</is-cleared>
         <alt-resource>1.3.6.1.2.1.2.2.1.1.17</alt-resource>
         <last-raised>2018-04-08T08:39:40.00Z</last-raised>
         <last-changed>2018-04-08T08:39:50.00Z</last-changed>
         <perceived-severity>major</perceived-severity>
         <alarm-text>
           Link operationally down but administratively up
         </alarm-text>
         <status-change>
           <time>2018-04-08T08:39:40.00Z</time>
           <perceived-severity>major</perceived-severity>
           <alarm-text>
             Link operationally down but administratively up
           </alarm-text>
         </status-change>
         <status-change>
           <time>2018-04-08T08:30:00.00Z</time>
           <perceived-severity>cleared</perceived-severity>
           <alarm-text>
             Link operationally up and administratively up
           </alarm-text>
         </status-change>
         <status-change>
           <time>2018-04-08T08:20:10.00Z</time>
           <perceived-severity>major</perceived-severity>
           <alarm-text>
             Link operationally down but administratively up
           </alarm-text>
         </status-change>
         <operator-state-change>
           <time>2018-04-08T08:39:50.00Z</time>
           <state>ack</state>
           <operator>joe</operator>
           <text>Will investigate, ticket TR764999</text>
         </operator-state-change>
       </alarm>
     </alarm-list>
   </alarms>
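
The alarm-level leaves in this example can be read as a summary of the two history lists. The Python sketch below is one possible way to derive them, under stated assumptions: all timestamps share the same UTC format (so string order equals time order), the history has not been pruned, and "last-raised" is interpreted as the most recent transition from cleared to a non-cleared severity. The function name and the tuple layout are hypothetical.

```python
def summarize(status_changes, operator_state_changes):
    """Derive alarm-level leaves from lists of (time, value) tuples."""
    ordered = sorted(status_changes)  # oldest first, by timestamp string
    last_raised = None
    previous = "cleared"  # before the first entry there was no active alarm
    for time, severity in ordered:
        if severity != "cleared" and previous == "cleared":
            last_raised = time  # a new raise after a clear (or the first raise)
        previous = severity
    all_times = [t for t, _ in ordered]
    all_times += [t for t, _ in operator_state_changes]
    return {
        "time-created": ordered[0][0],
        "is-cleared": ordered[-1][1] == "cleared",
        "last-raised": last_raised,
        "last-changed": max(all_times),
    }


# The status-change and operator-state-change entries from the example.
history = [
    ("2018-04-08T08:39:40.00Z", "major"),
    ("2018-04-08T08:30:00.00Z", "cleared"),
    ("2018-04-08T08:20:10.00Z", "major"),
]
operator_history = [("2018-04-08T08:39:50.00Z", "ack")]
summary = summarize(history, operator_history)
```

Under these assumptions, the summary reproduces the values shown in the XML: the alarm is not cleared, it was last raised at 08:39:40, and the operator acknowledgement at 08:39:50 is the last change.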

Appendix D. Alarm Shelving Example

This example shows how to shelve alarms. We shelve alarms related to the smoke detectors, since they are being installed and tested. We also shelve all alarms from FastEthernet1/0.

   <alarms xmlns="urn:ietf:params:xml:ns:yang:ietf-alarms"
           xmlns:xyz-al="urn:example:xyz-alarms"
           xmlns:dev="urn:example:device">
     <control>
       <alarm-shelving>
         <shelf>
           <name>FE10</name>
           <resource>
             /dev:interfaces/dev:interface[name='FastEthernet1/0']
           </resource>
         </shelf>
         <shelf>
           <name>detectortest</name>
           <alarm-type>
             <alarm-type-id>
               xyz-al:environmental-alarm
             </alarm-type-id>
             <alarm-type-qualifier-match>
               smoke-alarm
             </alarm-type-qualifier-match>
           </alarm-type>
         </shelf>
       </alarm-shelving>
     </control>
   </alarms>
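
To make the effect of the two shelves concrete, the following Python sketch decides which shelf, if any, an alarm falls into. It is a simplification under stated assumptions: every criterion present in a shelf must match, resources and alarm type identifiers are compared as exact strings (derived identities are not considered), and the qualifier pattern is evaluated with Python's "re" module rather than an XML Schema regular expression engine. The "SHELVES" list and the function name are hypothetical.

```python
import re

# The two shelves from the example, with surrounding whitespace trimmed.
SHELVES = [
    {
        "name": "FE10",
        "resource": "/dev:interfaces/dev:interface[name='FastEthernet1/0']",
    },
    {
        "name": "detectortest",
        "alarm-type-id": "xyz-al:environmental-alarm",
        "alarm-type-qualifier-match": "smoke-alarm",
    },
]


def matching_shelf(alarm, shelves=SHELVES):
    """Return the name of the first shelf whose criteria all match, else None."""
    for shelf in shelves:
        if "resource" in shelf and alarm["resource"] != shelf["resource"]:
            continue
        if ("alarm-type-id" in shelf
                and alarm["alarm-type-id"] != shelf["alarm-type-id"]):
            continue
        pattern = shelf.get("alarm-type-qualifier-match")
        if pattern is not None and not re.fullmatch(
                pattern, alarm["alarm-type-qualifier"]):
            continue
        return shelf["name"]
    return None


link_down = {
    "resource": "/dev:interfaces/dev:interface[name='FastEthernet1/0']",
    "alarm-type-id": "xyz-al:link-alarm",
    "alarm-type-qualifier": "",
}
shelf_for_link_down = matching_shelf(link_down)
```

Here the link alarm on FastEthernet1/0 lands in shelf "FE10" because of its resource alone; the same alarm type on another interface would not be shelved.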

Appendix E. X.733 Mapping Example

This example shows how to map a dynamic alarm type (alarm-type-id=environmental-alarm, alarm-type-qualifier=smoke-alarm) to the corresponding X.733 "event-type" and "probable-cause" parameters.

   <alarms xmlns="urn:ietf:params:xml:ns:yang:ietf-alarms"
           xmlns:xyz-al="urn:example:xyz-alarms">
     <control>
       <x733-mapping
           xmlns="urn:ietf:params:xml:ns:yang:ietf-alarms-x733">
         <alarm-type-id>xyz-al:environmental-alarm</alarm-type-id>
         <alarm-type-qualifier-match>
           smoke-alarm
         </alarm-type-qualifier-match>
         <event-type>quality-of-service-alarm</event-type>
         <probable-cause>777</probable-cause>
       </x733-mapping>
     </control>
   </alarms>

Appendix F. Relationship to Other Alarm Standards

This section briefly describes how this alarm data model relates to other relevant standards.

F.1. Definition of "Alarm"

The table below summarizes relevant definitions of the term "alarm" in other alarm standards.

   +------------+---------------------------+--------------------------+
   | Standard   | Definition                | Comment                  |
   +------------+---------------------------+--------------------------+
   | X.733      | error: A deviation of a   | The X.733 alarm          |
   | [X.733]    | system from normal        | definition is focused on |
   |            | operation.  fault: The    | the notification as such |
   |            | physical or algorithmic   | and not the state.       |
   |            | cause of a malfunction.   | X.733 defines an alarm   |
   |            | Faults manifest           | as a deviation from a    |
   |            | themselves as errors.     | normal condition but     |
   |            | alarm: A notification, of | without the requirement  |
   |            | the form defined by this  | that it needs corrective |
   |            | function, of a specific   | actions.                 |
   |            | event.  An alarm may or   |                          |
   |            | may not represent an      |                          |
   |            | error.                    |                          |
   |            |                           |                          |
   | G.7710     | Alarms are indications    | The G.7710 definition is |
   | [G.7710]   | that are automatically    | close to the original    |
   |            | generated by a device as  | X.733 definition.        |
   |            | a result of the           |                          |
   |            | declaration of a failure. |                          |
   |            |                           |                          |
   | Alarm MIB  | Alarm: Persistent         | RFC 3877 defines the     |
   | [RFC3877]  | indication of a fault.    | term alarm as referring  |
   |            | Fault: Lasting error or   | back to "a deviation     |
   |            | warning condition.        | from normal operation".  |
   |            | Error: A deviation of a   | The Alarm YANG data      |
   |            | system from normal        | model adds the           |
   |            | operation.                | requirement that it      |
   |            |                           | should require a         |
   |            |                           | corrective action and    |
   |            |                           | should be undesired, not |
   |            |                           | only a deviation from    |
   |            |                           | normal.  The alarm MIB   |
   |            |                           | is state oriented in the |
   |            |                           | same way as the Alarm    |
   |            |                           | YANG module; it focuses  |
   |            |                           | on the  "lasting         |
   |            |                           | condition", not the      |
   |            |                           | individual               |
   |            |                           | notifications.           |
   |            |                           |                          |
   | ISA        | Alarm: An audible and/or  | The ISA standard adds an |
   | [ISA182]   | visible means of          | important requirement to |
   |            | indicating to the         | the "deviation from      |
   |            | operator an equipment     | normal condition state": |
   |            | malfunction, process      | requiring a response.    |
   |            | deviation, or abnormal    |                          |
   |            | condition requiring a     |                          |
   |            | response.                 |                          |
   |            |                           |                          |
   | EEMUA      | An alarm is an event to   | This is the foundation   |
   | [EEMUA]    | which an operator must    | for the definition of    |
   |            | knowingly react, respond, | alarm in this document.  |
   |            | and acknowledge -- not    | It focuses on the core   |
   |            | simply acknowledge and    | criterion that an action |
   |            | ignore.                   | is really needed.        |
   |            |                           |                          |
   | 3GPP Alarm | 3GPP v15: An alarm        | The latest 3GPP Alarm    |
   | IRP        | signifies an undesired    | IRP version uses         |
   | [ALARMIRP] | condition of a resource   | literally the same alarm |
   |            | (e.g., device, link) for  | definition as this alarm |
   |            | which an operator action  | data model.  It is worth |
   |            | is required.  It          | noting that earlier      |
   |            | emphasizes a key          | versions used a          |
   |            | requirement that          | definition not requiring |
   |            | operators [...] should    | an operator action and   |
   |            | not be informed about an  | the more-broad           |
   |            | undesired condition       | definition of deviation  |
   |            | unless it requires        | from normal condition.   |
   |            | operator action.          | The earlier version also |
   |            | 3GPP v12: alarm: abnormal | defined an alarm as a    |
   |            | network entity condition, | special case of "event". |
   |            | which categorizes an      |                          |
   |            | event as a fault.         |                          |
   |            | fault: a deviation of a   |                          |
   |            | system from normal        |                          |
   |            | operation, which may      |                          |
   |            | result in the loss of     |                          |
   |            | operational capabilities  |                          |
   |            | [...]                     |                          |
   +------------+---------------------------+--------------------------+

           Table 1: Definition of the Term "Alarm" in Standards

   The definition of alarm has evolved from a focus on events reporting
   a deviation from normal operation towards an undesired *state* that
   *requires an operator action*.

F.2. Data Model

This section describes how this YANG alarm data model relates to other standard data models. Note well that we cover other data models for alarm interfaces but not other standards such as SDO-specific alarms.

F.2.1. X.733

X.733 has acted as a base for several alarm data models over the years. The YANG alarm data model differs in the following ways:

      X.733 models the alarm list as a list of notifications.  The YANG
      alarm data model defines the alarm list as the current alarm
      states for the resources, which is generated from the state change
      reporting notifications.

      In X.733, an alarm can have the severity level "clear".  In the
      YANG alarm data model, "clear" is not a severity level; it is a
      separate state of the alarm.  An alarm can have the following
      states, for example, (major, cleared) and (minor, not cleared).

      X.733 uses a flat, globally defined enumerated "probable-cause" to
      identify alarm types.  This alarm data model uses a hierarchical
      YANG identity: "alarm-type".  This enables delegation of alarm
      types within organizations.  It also enables management to reason
      about abstract alarm types corresponding to base identities; see
      Section 3.2.

      The YANG alarm data model has not included the majority of the
      X.733 alarm attributes.  Rather, these are defined in an
      augmenting module [X.733] if "strict" X.733 compliance is needed.
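
The second difference above, that "clear" is a separate state rather than a severity level, can be sketched in a few lines of Python. This is an illustration only: the "AlarmState" class and the function name are hypothetical, and the input is assumed to be an X.733-style severity string in which "cleared" may appear.

```python
from dataclasses import dataclass


@dataclass
class AlarmState:
    perceived_severity: str  # never "cleared" in this representation
    is_cleared: bool


def apply_x733_severity(state, x733_severity):
    """Fold one X.733-style severity into the (severity, cleared) pair."""
    if x733_severity == "cleared":
        # Clearing keeps the last real severity and only flips the flag.
        return AlarmState(state.perceived_severity, True)
    return AlarmState(x733_severity, False)


state = AlarmState("major", False)
after_clear = apply_x733_severity(state, "cleared")    # (major, cleared)
after_minor = apply_x733_severity(after_clear, "minor")  # (minor, not cleared)
```

Keeping the two values apart means that a cleared alarm still shows how severe it was when it was last active.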

F.2.2. The Alarm MIB (RFC 3877)

The MIB in RFC 3877 takes a different approach; rather than defining a concrete data model for alarms, it defines a model to map existing SNMP-managed objects and notifications into alarm states and alarm notifications. This was necessary since MIBs were already defined with both managed objects and notifications indicating alarms, for example, "linkUp" and "linkDown" notifications in combination with "ifAdminStatus" and "ifOperStatus".

So, RFC 3877 cannot really be compared to the alarm YANG module in that sense. The Alarm MIB maps existing MIB definitions into alarms through tables such as "alarmModelTable". The upside of that is that an SNMP Manager can, at runtime, read the possible alarm types. This corresponds to the "alarm-inventory" in the alarm YANG module.

F.2.3. 3GPP Alarm IRP

The 3GPP Alarm IRP is an evolution of X.733. Main differences between the alarm YANG module and 3GPP are as follows:

      3GPP keeps the majority of the X.733 attributes, but the alarm
      YANG module does not.

      3GPP introduced overlapping and possibly conflicting keys for
      alarms: alarmId and (managed object, event type, probable cause,
      specific problem).  (See Example 3 in Annex C of [ALARMIRP].)  In
      the YANG alarm data model, the key for identifying an alarm
      instance is clearly defined by ("resource", "alarm-type-id",
      "alarm-type-qualifier").  See also Section 3.4 for more
      information.

      The alarm YANG module clearly separates the resource/
      instrumentation lifecycle from the operator lifecycle. 3GPP allows
      operators to set the alarm severity to clear; this is not allowed
      by this module.  Rather, an operator closes an alarm, which does
      not affect the severity.
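
The single, unambiguous key described above can be illustrated with a small Python sketch. It is not an implementation of the module; the "alarm_list" dictionary and the "report" function are hypothetical. The point is only that every state change for the same ("resource", "alarm-type-id", "alarm-type-qualifier") tuple updates one entry instead of creating a new alarm.

```python
alarm_list = {}


def report(resource, alarm_type_id, alarm_type_qualifier, severity):
    """Record a state change; the same key always updates the same entry."""
    key = (resource, alarm_type_id, alarm_type_qualifier)
    entry = alarm_list.setdefault(key, {"status-change": []})
    entry["status-change"].append(severity)
    return key


fe10 = "/dev:interfaces/dev:interface[name='FastEthernet1/0']"
fe11 = "/dev:interfaces/dev:interface[name='FastEthernet1/1']"
report(fe10, "xyz-al:link-alarm", "", "major")
report(fe10, "xyz-al:link-alarm", "", "cleared")  # same alarm, new state
report(fe11, "xyz-al:link-alarm", "", "major")    # different resource
number_of_alarms = len(alarm_list)
```

Three reports yield two alarm entries, because the first two share the same key.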

F.2.4. G.7710

G.7710 is different from the previously referenced alarm standards. It does not define a data model for alarm reporting. It defines common equipment management function requirements including alarm instrumentation. The scope is transport networks.

The requirements in G.7710 correspond to features in the alarm YANG module in the following way:

      Alarm Severity Assignment Profile (ASAP): the alarm profile
      "/alarms/alarm-profile/".

      Alarm Reporting Control (ARC): alarm shelving
      "/alarms/control/alarm-shelving/" and the ability to control
      alarm notifications "/alarms/control/notify-status-changes".

Alarm shelving corresponds to the use case of turning off alarm reporting for a specific resource, which is the NALM (No ALarM) state in M.3100.

Appendix G. Alarm-Usability Requirements

This section defines usability requirements for alarms. Alarm usability is important for an alarm interface. A data model will help in defining the format, but if the actual alarms are of low value, we have not achieved the goal of alarm management. Common alarm problems and their causes are summarized in Table 2. This summary is adapted to networking based on the ISA [ISA182] and Engineering Equipment Materials Users Association (EEMUA) [EEMUA] standards.

   +-----------------+--------------------------------+----------------+
   | Problem         | Cause                          | How this       |
   |                 |                                | module         |
   |                 |                                | addresses the  |
   |                 |                                | cause          |
   +-----------------+--------------------------------+----------------+
   | Alarms are      | "Nuisance" alarms (chattering  | Strict         |
   | generated, but  | alarms and fleeting alarms),   | definition of  |
   | they are        | faulty hardware, redundant     | alarms         |
   | ignored by the  | alarms, cascading alarms,      | requiring      |
   | operator.       | incorrect alarm settings, and  | corrective     |
   |                 | alarms that have not been      | response.  See |
   |                 | rationalized; the alarms       | alarm          |
   |                 | represent log information      | requirements   |
   |                 | rather than true alarms.       | in Table 3.    |
   |                 |                                |                |
   | When alarms     | Insufficient alarm-response    | The alarm      |
   | occur,          | procedures and not well-       | inventory      |
   | operators do    | defined alarm types.           | lists all      |
   | not know how to |                                | alarm types    |
   | respond.        |                                | and corrective |
   |                 |                                | actions.  See  |
   |                 |                                | alarm          |
   |                 |                                | requirements   |
   |                 |                                | in Table 3.    |
   |                 |                                |                |
   | The alarm       | Nuisance alarms, stale alarms, | The alarm      |
   | display is full | and alarms from equipment not  | definition and |
   | of alarms, even | in service.                    | alarm          |
   | when there is   |                                | shelving.      |
   | nothing wrong.  |                                |                |
   |                 |                                |                |
   | During a        | Incorrect prioritization of    | State-based    |
   | failure,        | alarms.  Not using advanced    | alarm model    |
   | operators are   | alarm techniques (e.g., state- | and alarm-rate |
   | flooded with so | based alarming).               | requirements;  |
   | many alarms     |                                | see Tables 4   |
   | that they do    |                                | and 5,         |
   | not know which  |                                | respectively.  |
   | ones are the    |                                |                |
   | most important. |                                |                |
   +-----------------+--------------------------------+----------------+

                    Table 2: Alarm Problems and Causes
   Based upon the above problems, EEMUA gives the following definition
   of a good alarm:

   +----------------+--------------------------------------------------+
   | Characteristic | Explanation                                      |
   +----------------+--------------------------------------------------+
   | Relevant       | Not spurious or of low operational value.        |
   |                |                                                  |
   | Unique         | Not duplicating another alarm.                   |
   |                |                                                  |
   | Timely         | Not long before any response is needed or too    |
   |                | late to do anything.                             |
   |                |                                                  |
   | Prioritized    | Indicating the importance that the operator      |
   |                | deals with the problem.                          |
   |                |                                                  |
   | Understandable | Having a message that is clear and easy to       |
   |                | understand.                                      |
   |                |                                                  |
   | Diagnostic     | Identifying the problem that has occurred.       |
   |                |                                                  |
   | Advisory       | Indicative of the action to be taken.            |
   |                |                                                  |
   | Focusing       | Drawing attention to the most important issues.  |
   +----------------+--------------------------------------------------+

                    Table 3: Definition of a Good Alarm

   Vendors SHOULD rationalize all alarms according to the table above.
   Another crucial requirement is acceptable alarm notification rates.
   Vendors SHOULD make sure that they do not exceed the recommendations
   from EEMUA below:

   +-----------------------------------+-------------------------------+
   | Long-Term Alarm Rate in Steady    | Acceptability                 |
   | Operation                         |                               |
   +-----------------------------------+-------------------------------+
   | More than one per minute          | Very likely to be             |
   |                                   | unacceptable.                 |
   |                                   |                               |
   | One per 2 minutes                 | Likely to be overdemanding.   |
   |                                   |                               |
   | One per 5 minutes                 | Manageable.                   |
   |                                   |                               |
   | Less than one per 10 minutes      | Very likely to be acceptable. |
   +-----------------------------------+-------------------------------+

              Table 4: Acceptable Alarm Rates -- Steady State
   +----------------------------+--------------------------------------+
   | Number of alarms displayed | Acceptability                        |
   | in 10 minutes following a  |                                      |
   | major network problem      |                                      |
   +----------------------------+--------------------------------------+
   | More than 100              | Definitely excessive and very likely |
   |                            | to lead to the operator abandoning   |
   |                            | the use of the alarm system.         |
   |                            |                                      |
   | 20-100                     | Hard to cope with.                   |
   |                            |                                      |
   | Under 10                   | Should be manageable, but it may be  |
   |                            | difficult if several of the alarms   |
   |                            | require a complex operator response. |
   +----------------------------+--------------------------------------+

                 Table 5: Acceptable Alarm Rates -- Burst

   The numbers in Tables 4 and 5 are the sum of all alarms for a network
   being managed from one alarm console.  So every individual system or
   Network Management System (NMS) contributes to these numbers.
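
As a rough aid for benchmarking against Tables 4 and 5, the Python sketch below maps measured counts to the table labels. The tables list reference points rather than contiguous bands, so the cut-offs between rows are assumptions of this sketch: a rate that falls between two rows of Table 4 is given the less favorable label, and counts that Table 5 does not cover (10 to 19) return None. The function names are hypothetical.

```python
def steady_state_acceptability(alarms, minutes):
    """Label a long-term alarm rate using the rows of Table 4."""
    if alarms == 0:
        return "Very likely to be acceptable."
    interval = minutes / alarms  # mean minutes between alarms
    if interval < 1:             # more than one alarm per minute
        return "Very likely to be unacceptable."
    if interval < 5:             # assumed band: worse than one per 5 minutes
        return "Likely to be overdemanding."
    if interval <= 10:           # assumed band: one per 5 to 10 minutes
        return "Manageable."
    return "Very likely to be acceptable."


def burst_acceptability(alarms_in_10_minutes):
    """Label a burst count using the rows of Table 5; None if not covered."""
    if alarms_in_10_minutes > 100:
        return "Definitely excessive"
    if 20 <= alarms_in_10_minutes <= 100:
        return "Hard to cope with"
    if alarms_in_10_minutes < 10:
        return "Should be manageable"
    return None  # Table 5 gives no guidance for 10 to 19 alarms


hourly_label = steady_state_acceptability(alarms=30, minutes=60)
burst_label = burst_acceptability(50)
```

Because the numbers are sums over everything shown on one alarm console, the counts passed in should be aggregated across all contributing systems before they are labeled.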

   Vendors SHOULD make sure that the following rules are used in
   designing the alarm interface:

   1.  Rationalize the alarms in the system to ensure that every alarm
       is necessary, has a purpose, follows the cardinal rule that it
       requires an operator response, and adheres to the rules of
       Table 3.

   2.  Audit the quality of the alarms.  Talk with the operators about
       how well the alarm information supports them.  Do they know what
       to do in the event of an alarm?  Are they able to quickly
       diagnose the problem and determine the corrective action?  Does
       the alarm text adhere to the requirements in Table 3?

   3.  Analyze and benchmark the performance of the system and compare
       it to the recommended metrics in Tables 4 and 5.  Start by
       identifying nuisance alarms, as well as standing alarms at normal
       state and startup.

Acknowledgements

The authors wish to thank Viktor Leijon and Johan Nordlander for their valuable input on forming the alarm model. The authors also wish to thank Nick Hancock, Joey Boyd, Tom Petch, and Balazs Lengyel for their extensive reviews and contributions to this document.

Authors' Addresses

Stefan Vallin
Stefan Vallin AB
Email: stefan@wallan.se

Martin Bjorklund
Cisco
Email: mbj@tail-f.com