For Fault Management (FM) purposes, each NE shall store and retain the following information:
- a list of all active alarms, i.e. all alarms that have not yet been cleared; and
- alarm history information, i.e. all notifications related to the occurrence and clearing of alarms.
It shall be possible to apply filters when active alarm information is retrieved by the Manager and when the history information is stored by the NE and retrieved by the Manager.
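As an illustration only, the active alarm list and its filtered retrieval could be sketched as follows. The class and field names (`Alarm`, `ActiveAlarmList`, `source`, `alarm_type`, `severity`) are assumptions made for this sketch and are not defined by the present document; a real NE would carry the full alarm information of its notification format.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Alarm:
    source: str      # hypothetical identifier of the alarmed resource
    alarm_type: str  # hypothetical alarm type label
    severity: str    # e.g. "critical", "major", "minor", "warning"


# A filter is any predicate over an alarm (assumption for this sketch).
AlarmFilter = Callable[[Alarm], bool]


class ActiveAlarmList:
    """Holds every alarm that has been raised and not yet cleared."""

    def __init__(self) -> None:
        self._active: Dict[Tuple[str, str], Alarm] = {}

    def raise_alarm(self, alarm: Alarm) -> None:
        # An alarm is keyed by its source and type in this sketch.
        self._active[(alarm.source, alarm.alarm_type)] = alarm

    def clear_alarm(self, source: str, alarm_type: str) -> None:
        # Clearing removes the alarm from the active list.
        self._active.pop((source, alarm_type), None)

    def retrieve(self, alarm_filter: Optional[AlarmFilter] = None) -> List[Alarm]:
        # The Manager may pass a filter to restrict what is returned.
        alarms = list(self._active.values())
        if alarm_filter is None:
            return alarms
        return [a for a in alarms if alarm_filter(a)]
```

For example, a Manager interested only in critical alarms would call `retrieve(lambda a: a.severity == "critical")`.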
The storage space for alarm history in the NE is limited. Therefore it shall be organized as a circular buffer, i.e. the oldest data item(s) shall be overwritten by new data if the buffer is full. Further "buffer full" behaviours, e.g. those defined in ITU-T Recommendation X.735 [11], may be implemented as an option. The storage capacity itself, and thus the duration for which the data can be retained, shall be operator and implementation dependent.
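The circular-buffer behaviour can be illustrated with a minimal sketch. It assumes that notifications are represented as plain dictionaries and that filters are simple predicates; the names `AlarmHistory`, `store_filter` and `retrieve_filter` are hypothetical. Only the default "overwrite oldest" behaviour is shown, not the optional X.735 behaviours.

```python
from collections import deque
from typing import Callable, Deque, List, Optional

Notification = dict  # assumption: a notification is a plain dictionary
NotificationFilter = Callable[[Notification], bool]


class AlarmHistory:
    """Fixed-capacity alarm history that overwrites the oldest entries."""

    def __init__(self, capacity: int,
                 store_filter: Optional[NotificationFilter] = None) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        # A deque with maxlen drops the oldest item when a new one is
        # appended to a full buffer.
        self._buffer: Deque[Notification] = deque(maxlen=capacity)
        self._store_filter = store_filter

    def record(self, notification: Notification) -> bool:
        # Filter applied when the NE stores history information.
        if self._store_filter is not None and not self._store_filter(notification):
            return False
        self._buffer.append(notification)
        return True

    def retrieve(self,
                 retrieve_filter: Optional[NotificationFilter] = None) -> List[Notification]:
        # Filter applied when the Manager retrieves history information.
        items = list(self._buffer)
        if retrieve_filter is None:
            return items
        return [n for n in items if retrieve_filter(n)]
```

With a capacity of 3, recording five notifications leaves only the three most recent ones in the history, oldest first.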
After a fault has been detected and the replaceable faulty units have been identified, some management functions are necessary in order to perform system recovery and/or restoration, either automatically by the NE and/or the EM, or manually by the operator.
The fault recovery functions are used in various phases of Fault Management (FM):
- Once a fault has been detected, the NE shall be able to evaluate the effect of the fault on the telecommunication services and autonomously take recovery actions in order to minimize service degradation or disruption.
- Once the faulty unit(s) has (have) been replaced or repaired, it shall be possible from the EM to put the previously faulty unit(s) back into service so that normal operation is restored. This transition should be done in such a way that the currently provided telecommunication services are not, or only minimally, disturbed.
- At any time the NE shall be able to perform recovery actions if requested by the operator. The operator may have several reasons to require such actions; e.g. the operator has deduced a faulty condition by analysing and correlating alarm reports, or wants to verify that the NE is capable of performing the recovery actions (proactive maintenance).
The recovery actions that the NE performs (autonomously or on demand) in case of faults depend on the nature and severity of the faults, on the hardware and software capabilities of the NE and on the current configuration of the NE.
Faults are divided into two categories: software faults and hardware faults. In the case of software faults, depending on the severity of the fault, the recovery actions may be system initializations (at different levels), activation of a backup software load, activation of a fallback software load, download of a software unit, etc. In the case of hardware faults, the recovery actions depend on the existence and type of redundant (i.e. back-up) resources. Redundancy of some resources may be provided in the NE in order to achieve fault tolerance and to improve system availability.
If the faulty resource has no redundancy, the recovery actions shall be:
a) Isolate and remove from service the faulty resource so that it cannot disturb other working resources;
b) Remove from service the physical and functional resources (if any) which are dependent on the faulty one. This prevents the propagation of the fault effects to other fault-free resources;
c) State management related activities for the faulty resource and other affected/dependent resources (cf. clause 4.2);
d) Generate and forward appropriate notifications to inform the OS about all the changes performed.
If the faulty resource has redundancy, the NE shall perform actions a), c) and d) above and, in addition, the recovery sequence that is specific to that type of redundancy. Several types of redundancy exist (e.g. hot standby, cold standby, duplex, symmetric/asymmetric, N plus one or N plus K redundancy), and for each one, there is a specific sequence of actions to be performed in case of failure. The present document specifies the Fault Management aspects of the redundancies, but it does not define the specific recovery sequences of the redundancy types.
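The ordering of the recovery actions for both cases can be sketched as below. This is a simplified illustration under several assumptions: the `Resource` model, the single `in_service` flag standing in for state management, the notification strings and the trivial "activate the standby" changeover are all hypothetical, since the present document does not define the redundancy-specific sequences.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Resource:
    name: str
    in_service: bool = True
    depends_on: Optional[str] = None  # resource this one depends on, if any
    standby: Optional[str] = None     # redundant resource, if any


def _dependents(resources: Dict[str, Resource], name: str) -> List[str]:
    # Collect direct and indirect dependents of the named resource.
    found: List[str] = []
    queue = [name]
    while queue:
        current = queue.pop(0)
        for r in resources.values():
            if r.depends_on == current and r.name != name and r.name not in found:
                found.append(r.name)
                queue.append(r.name)
    return found


def recover(resources: Dict[str, Resource], faulty: str) -> List[str]:
    """Apply the recovery actions and return the notifications to forward."""
    notifications: List[str] = []
    res = resources[faulty]

    # Isolate the faulty resource and update its state.
    res.in_service = False
    notifications.append(f"stateChange:{faulty}:out-of-service")

    if res.standby is not None:
        # Placeholder for the redundancy-specific recovery sequence.
        resources[res.standby].in_service = True
        notifications.append(f"changeover:{faulty}->{res.standby}")
    else:
        # No redundancy: dependent resources are also removed from service.
        for dep in _dependents(resources, faulty):
            resources[dep].in_service = False
            notifications.append(f"stateChange:{dep}:out-of-service")

    # The returned list models the notifications sent to the OS.
    return notifications
```

In this sketch, a fault on a non-redundant power supply takes every resource that depends on it out of service, whereas a fault on a unit with a standby leaves dependent resources untouched and reports the changeover instead.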
In the case of a failure of a resource providing service, the recovery sequence shall start immediately. Before or during the changeover, a temporary and limited loss of service shall be acceptable. In the case of a management command, the NE should perform the changeover without degradation of the telecommunication services.
The detailed definition of the management of the redundancies is out of the scope of the present document. If a fault causes the interruption of ongoing calls, then the interrupted calls shall be cleared, i.e. all resources allocated to these calls shall immediately be released by the system.