Content for TR 23.820 Word version: 9.0.0
Although network nodes in the IMS Core Network should have very high availability, some maintenance downtime and occasional failures are unavoidable. Communication links between the network elements, although designed with robust protocols, are also subject to failures. A set of standardized procedures for automatic restoration after loss or corruption of data could reduce the impact of these problems, resulting in improved service to the users. The intention is that cases similar to those covered in TS 23.007 for the CS and PS Domains are also covered for the IMS domain.
The present document identifies the changes required in the 3GPP IMS specifications so that a consistent state is restored in the IMS Core Network after, or during, a planned or unplanned stop of a network element. The study goes through the following steps:
Establish the requirements that these procedures should cover, that is, which impacts on the end user service are acceptable after a network failure and which are not.
List the service interruption scenarios that need to be studied.
Provide solutions, so that in all the service interruption scenarios listed, the impacts to the end user service comply with the requirements. These solutions provide procedures for the automatic restoration to a consistent state in the network and indicate how to trigger these procedures.
Analyze the impacts of the solutions on the current specifications.
Draw conclusions and recommend a way forward.
It is important to realise that these procedures are meant to be operational procedures for restoration. Care must therefore be taken with existing and future OA&M procedures to avoid overlaps that could cause clashes.
The following documents contain provisions which, through reference in this text, constitute provisions of the present document.
References are either specific (identified by date of publication, edition number, version number, etc.) or non-specific.
For a specific reference, subsequent revisions do not apply.
For a non-specific reference, the latest version applies. In the case of a reference to a 3GPP document (including a GSM document), a non-specific reference implicitly refers to the latest version of that document in the same Release as the present document.
: "Vocabulary for 3GPP Specifications".
: "Charging Management; IP Multimedia Subsystem (IMS) Charging".
: "IP Multimedia (IM) Subsystem - Stage 2".
: "IP Multimedia Call Control Protocol based on SIP and SDP".
: "Organization of subscriber data".
: "Interworking between the Public Land Mobile Network (PLMN) supporting packet based services and Packet Data Networks (PDN)".
: "General Packet Radio Service (GPRS); GPRS Tunnelling Protocol (GTP) across the Gn and Gp interface".
: "IP Multimedia (IM) Subsystem Cx and Dx interfaces; Signalling flows and message contents".
For the purposes of the present document, the terms and definitions given in TR 21.905 and the following apply. A term defined in the present document takes precedence over the definition of the same term, if any, in TR 21.905.
A period of time in which one or more network elements do not respond to requests and do not send any requests to the rest of the system.
For the purposes of the present document, the abbreviations given in TR 21.905 and the following apply. An abbreviation defined in the present document takes precedence over the definition of the same abbreviation, if any, in TR 21.905.
Globally Routable User Agent URI
Operations, Administration & Maintenance
Operations & Maintenance
This clause lists the requirements for the solution provided in this study. The general goal is a set of procedures ensuring that the impact of the service interruption of a node is limited to the loss of that node's capacity for the time that it is out of service, plus some additional signalling needed for other network elements with the same function to take over.
There are network elements that hold permanent data. There should be methods to ensure that the information in these network elements is not lost even in disaster events, such as a complete site crash. For this reason, this study assumes that the redundancy provided for these network elements makes additional network procedures for the restoration of permanent data unnecessary. Temporary data, on the other hand, is considered in these procedures.
For the nodes that do not handle permanent data, the assumption is that their memory is affected if an outage occurs and that the information related to some of the users may be lost.
Interruption of established sessions is considered an acceptable consequence of the failure or service interruption of one of the network elements in the session path. This means that the restoration of session data does not need to be analyzed in this study. Implementations may take measures outside the scope of this study to enable established sessions to be restored or maintained.
The accounting of IMS sessions is already based on principles (interim accounting as described in TS 32.260) that ensure that the charging of these interrupted sessions is also terminated.
The restoration procedures could involve some steps that take place during the establishment of sessions after the outage has occurred (i.e. after the network element that failed returns to normal functioning or another network element takes over). If that is the case, the increase in the session establishment time should not be significant (the establishment time should remain, with high probability, within the levels acceptable to the end user).
In order to reduce OPEX and the time required to restore the network, the need for manual intervention should be minimized. This implies that the procedures should be triggered by network signalling events and O&M steps should also be minimized.
Loss of service refers to the state in which session origination attempts by the user, or session termination requests to that user, fail while the UE appears to be registered, and also to the state in which the network does not respond to registration attempts. Ideally, the proposed solution should avoid this kind of loss of service altogether and ensure that requests are terminated correctly in all cases. If this is not feasible, then the time of loss of service for the user should be minimized.
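The definition above can be read as a predicate over what a single user observes. The following is a minimal sketch of that reading only; the class and field names are hypothetical and do not come from any 3GPP specification.

```python
from dataclasses import dataclass


@dataclass
class UserObservation:
    """Hypothetical snapshot of one user's situation (illustrative names)."""
    appears_registered: bool       # the UE appears to be registered
    origination_fails: bool        # session origination attempts by the user fail
    termination_fails: bool        # session termination requests to the user fail
    registration_unanswered: bool  # the network does not respond to registration attempts


def is_loss_of_service(obs: UserObservation) -> bool:
    """Return True if the observation matches the loss-of-service definition."""
    # Case 1: sessions fail although the UE appears to be registered.
    silent_failure = obs.appears_registered and (
        obs.origination_fails or obs.termination_fails
    )
    # Case 2: registration attempts receive no response at all.
    return silent_failure or obs.registration_unanswered
```

Under this reading, a UE that is not registered and receives a prompt response to its registration attempt is not in loss of service, even if an earlier session attempt failed.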
In the solutions provided, it needs to be taken into consideration that network element failures tend to occur when the signalling is overloaded. In that case, restoration procedures that involve a high volume of messages (e.g. triggering re-registrations for all the UEs controlled by a P-CSCF or S-CSCF) should be avoided. Such procedures could cause further problems in other network elements and provoke a domino effect of subsequent failures.
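The concern about mass re-registration can be illustrated with a rough load estimate. The sketch below is back-of-the-envelope arithmetic, not a dimensioning method. The function names, the number of messages per registration, and all figures in the usage note are assumptions chosen for illustration.

```python
def reregistration_rate(num_ues: int, msgs_per_registration: int,
                        window_s: float) -> float:
    """Extra signalling rate (messages/s) if num_ues re-register within window_s.

    msgs_per_registration is an assumed per-UE message count; the real value
    depends on the registration flow and on which interface is observed.
    """
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    return num_ues * msgs_per_registration / window_s


def exceeds_capacity(extra_rate: float, current_rate: float,
                     capacity: float) -> bool:
    """True if the extra load would push a node beyond its assumed capacity."""
    return current_rate + extra_rate > capacity
```

For example, with the hypothetical figures of 120 000 UEs, 4 messages per registration and a 60 second window, `reregistration_rate(120_000, 4, 60)` gives 8000 messages per second. Added to an assumed existing load of 3000 messages per second on a node rated for 10 000, this would exceed the node's capacity, which is the kind of knock-on failure the requirement aims to prevent.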
The solution provided should allow the network to recover to a situation where the load is balanced between network elements performing the same function.
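One way to picture a balanced outcome is to hand the users of a failed node to the surviving nodes with the same function, always choosing the least loaded one. This is a minimal sketch under that assumption and does not represent a procedure proposed by this study. The function name and the node names in the usage note are hypothetical.

```python
import heapq
from typing import Dict


def redistribute(loads: Dict[str, int], failed: str) -> Dict[str, int]:
    """Reassign the users of `failed` to the least-loaded surviving nodes.

    `loads` maps a node name to the number of users it currently serves.
    Returns the per-node user counts of the surviving nodes after takeover.
    """
    orphaned = loads[failed]
    heap = [(load, name) for name, load in loads.items() if name != failed]
    if not heap:
        raise ValueError("no surviving node to take over")
    heapq.heapify(heap)
    # Assign one user at a time to whichever survivor is currently least loaded.
    for _ in range(orphaned):
        load, name = heapq.heappop(heap)
        heapq.heappush(heap, (load + 1, name))
    return {name: load for load, name in heap}
```

For instance, `redistribute({"scscf1": 10, "scscf2": 4, "scscf3": 6}, "scscf1")` leaves both surviving nodes with 10 users each. Assigning users one at a time keeps the sketch simple; a real deployment would also have to rebalance once the failed node returns to service.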