TR 22.813
Study of Use Cases and Requirements
for enhanced Voice Codecs for the EPS

V10.0.0 (Wzip) 2010/04 15 p.

Rapporteur: Mr. Lottin, Philippe
Orange

0 Introduction

Conversational voice and audio services are among the most important in the 3GPP mobile business. The switch from narrowband to wideband speech with AMR-WB is an important improvement for these services. To maintain the high value of such services, Rel-10 and the new EPS offer an opportunity for the next major step in voice and audio enhancement, particularly very high audio quality suited to the growing share of mixed music/speech content.

The Evolved Packet System (EPS) is an evolution of the 3GPP architecture to a higher-data-rate, lower-latency packet-optimized system that supports various radio-access technologies. The EPS architecture has been optimized for mobile broadband services, in particular enriched and new multimedia services. The EPS system is an enabler for providing conversational services with very high audio quality, including enhanced voice and mixed content quality.

This document aims at studying use cases which may benefit from improved audio quality for EPS services and examining the need for enhanced codecs. The related system and service requirements are discussed.

1 Scope

This study defines and analyses new use cases in the environment of EPS and its future services. It evaluates the extent to which the newly defined requirements are met by the codecs already available in 3GPP releases prior to Rel-10. Potential new requirements for voice codecs are elaborated in due course.

Specifically, the objective of this study item is to:

Identify service and system requirements,
Identify high level requirements,
Assess the existing codecs in respect to identified requirements,
Define a strategy for EPS voice codec(s) standardization,
Identify some use cases for enhanced voice and mixed content conversational multimedia service for EPS (see Annex A).

From the Rel-8 timeframe onwards, MTSI is provided over EPS as a conversational multimedia service. MTSI is the main service targeted by the enhancements. This does not, however, exclude the enhanced service requirements from later being adopted for other services where such enhancements are needed and found suitable.

2 References

The following documents contain provisions which, through reference in this text, constitute provisions of the present document.

References are either specific (identified by date of publication, edition number, version number, etc.) or non-specific.

For a specific reference, subsequent revisions do not apply.

For a non-specific reference, the latest version applies. In the case of a reference to a 3GPP document (including a GSM document), a non-specific reference implicitly refers to the latest version of that document in the same Release as the present document.

[1] TR 21.905: "Vocabulary for 3GPP Specifications".
[2] TS 22.278 V8.4.0: "Service requirements for the Evolved Packet System (EPS) (Release 8)".
[3] TS 22.173 V8.2.0: "Multi-media Telephony Service and supplementary services; Stage 1".
[4] TS 26.071: "Mandatory Speech Codec speech processing functions; AMR Speech CODEC; General description".
[5] TS 26.090: "Mandatory Speech Codec speech processing functions; Adaptive Multi-Rate (AMR) speech codec; Transcoding functions".
[6] TS 26.073: "ANSI C code for the Adaptive Multi Rate (AMR) speech codec".
[7] TS 26.104: "ANSI C code for the floating-point Adaptive Multi Rate (AMR) speech codec".
[8] TS 26.171: "Speech codec speech processing functions; Adaptive Multi-Rate - Wideband (AMR-WB) speech codec; General description".
[9] TS 26.190: "Speech codec speech processing functions; Adaptive Multi-Rate - Wideband (AMR-WB) speech codec; Transcoding functions".
[10] TS 26.173: "ANSI-C code for the Adaptive Multi Rate - Wideband (AMR-WB) speech codec".
[11] TS 26.204: "Speech codec speech processing functions; Adaptive Multi-Rate - Wideband (AMR-WB) speech codec; ANSI-C code".
[12] TS 26.103: "Speech codec list for GSM and UMTS".
[13] TS 26.290: "Extended AMR Wideband codec; Transcoding functions".
[14] TS 26.401: "General audio codec audio processing functions; Enhanced aacPlus general audio codec; General description".
[15] TS 26.114: "Multimedia Telephony; Media handling and interaction".
[16] TS 22.105: "Services and Service capabilities".

3 Definitions, symbols and abbreviations

3.1 Abbreviations

For the purposes of the present document, the following abbreviations apply in addition to TR 21.905:

AMR      Adaptive Multi Rate (codec)
AMR-NB   Adaptive Multi Rate Narrowband (codec) = AMR
AMR-WB   Adaptive Multi Rate Wideband (codec)
CS       Circuit Switched
EFR      Enhanced Full Rate (codec)
EVS      Enhanced Voice Service
FB       Full-Band
FR       (GSM) Full Rate (codec)
HR       (GSM) Half Rate (codec)
JBM      Jitter Buffer Management
MTSI     Multimedia Telephony Service for IMS
MCU      Multipoint Control Unit
NB       Narrowband
PS       Packet Switched
PSTN     Public Switched Telephone Network
SWB      Super Wideband
WB       Wideband
WiMAX    Worldwide Interoperability for Microwave Access
WLAN     Wireless Local Area Network

4 Overview on the EPS service objectives and characteristics

4.1 EPS objectives

EPS objectives indicate the potential for new enhanced voice services. As EPS will provide higher data rates, lower latency and enhanced QoS, voice services can obviously be developed to take advantage of those features.

Furthermore, since EPS is a packet-optimised system, it will introduce flexibility in codec functionality and voice service provision, but also requires consideration regarding e.g. jitter and error concealment.

4.2 EPS requirements

EPS service requirements introduce interworking with a variety of broadband networks as well as with PS services provided in Rel-9 and earlier. In addition, EPS will interwork with CS services. These requirements imply backward compatibility requirements for voice services as well.

4.3 EPS Enhanced Voice Service

Enhanced Voice Service (EVS) is used in this TR to refer to the targeted enhancements. The intention of this TR is to assess the existing voice services against the requirements (identified in this TR). The enhancements will be justified if the requirements cannot be met by the existing 3GPP services.

There are no specific 'Enhanced Voice Service' or 'Voice Service' definitions in 3GPP. The term 'EVS' is used for the purposes of this TR to refer to the enhancements within each relevant 3GPP service (such as MTSI) as shown in Figure 1.

In addition to other enhancements, improved narrowband and wideband quality should be offered by the EVS but it is not intended that EVS will replace the existing narrowband and wideband media types or codec formats.

The EVS codec will support the AMR-WB codec format as a subset of the coding formats supported by it. See details in Clause 6.1.6. The AMR-WB interoperable operation modes of the EVS codec become an alternative implementation for AMR-WB operation provided that the enhancements are consistently significant.

Figure 1: Illustration of the speech media component in Multimedia Telephony Service for IMS (MTSI) over EPS in Rel-8/Rel-9 (a) and plans for Rel-10 and upwards (b). MTSI is used here as an example of a conversational 3GPP multimedia service provided over EPS.

Figure label: Support of AMR and AMR-WB codecs (AMR-WB for MTSI clients offering wideband speech).

5 System and service requirements

The enhanced voice services for EPS should fulfill the system and service requirements described in the following sub-sections.

5.1 Quality of user experience

Enhanced voice services in the EPS should allow for significantly improved quality of user experience compared to voice services enabled by pre-Rel-10 3GPP access systems, considering the advancement of other communication systems. It should be possible to achieve significantly better service quality than is possible with both Rel-9 3GPP narrowband and wideband voice services. User experience is also impacted by the consistency of the quality experienced by the user under varying access and core network conditions. Under varying network conditions, the user experience with EVS should be better than in pre-Rel-10 systems. The improvement of the end-to-end quality should address all parts of the transmission chain.

5.2 Efficiency

Enhanced voice services in the EPS should be deployable in an efficient manner. They should make efficient use of the transmission resource in EPS access and transport networks, and it should be possible to implement them on low-cost devices and network equipment with limited computational resources. The transmission efficiency of enhanced voice services should exceed that of the pre-Rel-10 3GPP wideband voice service.

5.3 Backward service interoperability with pre-Rel-10 3GPP networks and devices

Transcoding generates additional network effort (e.g. costs, transcoding gateways), degrades audio quality and increases latency. Transcoding-based solutions for ensuring service interoperability should be avoided as much as possible.

In addition, a conference call may interconnect subscribers using enhanced quality services on new enhanced devices with users of legacy terminals. The best possible quality should be delivered to all participants.

If transcoding cannot be avoided, voice conversational quality degradation should be as limited as possible.

6 High level technical requirements on speech codecs

The high-level technical requirement topics are listed below. This list is not exhaustive.

6.1 Design constraints

6.1.1 Audio bandwidth

It is recommended that the EVS codec

have mandatory support of high-quality and high-efficiency operation of NB, WB, and SWB audio;
may support FB audio.

6.1.2 Number of audio channels

Stereo or multi-channel presentation is one way to realize significantly improved QoE. The codec may provide stereo/multi-channel coding capability. Multiple monophonic coding, i.e. one monophonic coding per channel, can be realized by appropriate packetization techniques, while stereo/multi-channel coding, i.e. joint coding of the channels, is part of the audio coder algorithm. The choice between dedicated stereo/multi-channel coding and multiple monophonic codings depends on a trade-off between achievable quality, available bit rates, available delay, complexity and other implementation factors.

If stereo/multi-channel coding capability is provided, it is necessary to specify how stereo or multi-channel capture/presentations could be achieved in mobile communications.

6.1.3 Bit rates

It is recommended that the codec shall span a large range of bit rates from low rates needed for high efficiency conversational speech services to high rates required for EVS with high quality operation. It is recommended that the offered span of bit rates shall be wide enough to allow for rate adaptation in response to available transmission resource.
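The rate adaptation mentioned above can be illustrated with a minimal sketch. It uses the nine AMR-WB codec mode bit rates as a familiar example set; the EVS bit rates are not defined in this TR. The function name `select_mode` and the rule "pick the highest mode that fits, else fall back to the lowest" are assumptions made for illustration, and the available rate is assumed to be the codec payload rate, ignoring packet header overhead.

```python
# AMR-WB codec mode bit rates in kbit/s, used only as an example set.
AMR_WB_MODES_KBPS = [6.60, 8.85, 12.65, 14.25, 15.85, 18.25, 19.85, 23.05, 23.85]


def select_mode(available_kbps, modes=AMR_WB_MODES_KBPS):
    """Return the highest mode rate not exceeding the available rate.

    Hypothetical policy: if no mode fits, fall back to the lowest mode
    rather than dropping the call.
    """
    fitting = [m for m in modes if m <= available_kbps]
    return max(fitting) if fitting else min(modes)


if __name__ == "__main__":
    for rate in (5.0, 13.0, 100.0):
        print(rate, "->", select_mode(rate))
```

A wider span of modes gives such a selector more room to follow the available transmission resource, which is the point of the recommendation.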

6.1.4 Algorithmic delay

The codec delay requirement for the EVS codec is recommended to be flexible within certain limits, allowing for overall optimizations of the system performance and considering that there is a trade-off between delay consumed by the speech codec and delay consumed by the PS transmission via the LTE air interface.

The delay requirements should be the same regardless of the nature of the input content (e.g. speech, music and mixed content).

The algorithmic delay of the EVS codec should be such that the overall end-to-end delay in an EVS-UE to EVS-UE connection meets or exceeds the preferred performance expectations in TS 22.105. According to TS 22.105 v9.0.0, conversational voice one way delay is <150 ms preferred and <400 ms limit where it is noted that the one way delay in the mobile network (from UE to PLMN border) is approximately 100 ms.

In addition it is recommended to consult the relevant 3GPP working groups (SA1, SA2, RAN, CT) in order to get a technically relevant breakdown of the various possible delay contributing elements of the end-to-end EVS-UE to EVS-UE transmission chain, which in turn enables specifying the allowable limits for the algorithmic delay of the EVS codec.
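As an illustration of how such a breakdown could be checked against the TS 22.105 figures quoted above (<150 ms preferred, <400 ms limit), the following sketch sums per-element one-way delay contributions and classifies the total. The element names and millisecond values are hypothetical and are not taken from any 3GPP specification; the actual breakdown is left to the working groups named above.

```python
PREFERRED_MS = 150  # one-way delay, preferred (TS 22.105 as quoted in the text)
LIMIT_MS = 400      # one-way delay, limit (TS 22.105 as quoted in the text)


def classify_one_way_delay(contributions_ms):
    """Sum the delay contributions and compare the total with the thresholds.

    contributions_ms maps an element name to its delay in milliseconds.
    Returns (total, verdict); the verdict labels are illustrative only.
    """
    total = sum(contributions_ms.values())
    if total < PREFERRED_MS:
        verdict = "preferred"
    elif total < LIMIT_MS:
        verdict = "within limit"
    else:
        verdict = "exceeds limit"
    return total, verdict


if __name__ == "__main__":
    # Hypothetical budget for one direction of an EVS-UE to EVS-UE call.
    budget = {"codec_algorithmic": 30, "jitter_buffer": 40, "transport": 60}
    print(classify_one_way_delay(budget))
```

Any delay spent in the codec comes out of the same total as transport and jitter buffering, which is why the text recommends a flexible codec delay requirement.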

6.1.5 Complexity

The EVS Codec should be implementable on a mobile device using today's technology. The EVS codec should provide low computational complexity not significantly exceeding the design limits set during the AMR-WB codec standardization, and should have low memory usage. Increased computational complexity and memory usage should be commensurate with the gain in quality of user experience (e.g. higher audio bandwidth such as SWB or stereo if it is supported) or with increased efficiency (e.g. lower bit rate for same quality when compared to a reference codec).

6.1.6 Backward interoperability

Backward interoperability with existing 3GPP codecs aims to reduce the need for transcoding as much as possible.

Backward interoperability with existing codecs can be achieved through

the use or negotiation of existing 3GPP codecs previously defined for voice services, or
bitstream interoperability with one or more of these codecs when a new codec is defined.

It is recommended that the EVS codec shall achieve backward interoperability to the existing 3GPP AMR-WB codec by supporting all AMR-WB codec formats used in 3GPP conversational speech telephony services including CS.
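The first route above, negotiating a codec both ends already support, can be sketched as a simple intersection of codec lists. This is a simplified model and not the actual session negotiation procedure; the function names and the codec lists in the example are assumptions for illustration. Bitstream interoperability, the second route, is not modelled here.

```python
def common_codecs(offered, supported):
    """Return the offered codecs that the other end also supports, in offer order."""
    return [codec for codec in offered if codec in supported]


def needs_transcoding(offered, supported):
    """True when the two ends share no codec, so a transcoder would be required."""
    return not common_codecs(offered, supported)


if __name__ == "__main__":
    evs_client = ["EVS", "AMR-WB", "AMR"]   # hypothetical new terminal
    legacy_client = {"AMR-WB", "AMR"}       # hypothetical pre-Rel-10 terminal
    print(common_codecs(evs_client, legacy_client))
    print(needs_transcoding(evs_client, legacy_client))
```

In this model a new terminal that keeps supporting the AMR-WB formats always shares at least one wideband codec with a wideband legacy terminal, so no transcoder is needed.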

6.2 Performance requirements

It is recommended that EVS codec performance requirements address speech quality, quality for mixed content and music. The quality of the EVS codec should be evaluated in ideal and realistic communication scenarios. The realistic communication scenarios include a case where the UE is located at a cell edge.

6.2.1 Speech quality

It is recommended that the EVS codec achieves the following quality requirements:

For narrowband signals: the EVS codec shall offer significant improvements in trade-offs among capacity, delay, error robustness and speech quality over the state-of-the-art 3GPP narrowband codec (AMR).
For wideband signals: the quality shall be significantly improved with respect to the state-of-the-art 3GPP wideband codec at equivalent operating points.
For super-wideband signals: the quality shall be significantly better than the state-of-the-art 3GPP wideband codec with wideband input and be no worse than state-of-the-art conversational super-wideband codecs at equivalent operating points.

Besides, the codec performance should show an increase in quality with increased bit rate and audio bandwidth.

For the AMR-WB codec modes that are included in the EVS codec (see section 4.3), it is additionally recommended that they should consistently provide significantly improved quality over the pre-Rel-10 AMR-WB codec when operating at the same bit rate:

in codec configurations with the AMR-WB encoder inside EVS and the AMR-WB decoder inside EVS,
in codec configurations with the AMR-WB encoder inside EVS and the pre-Rel-10 AMR-WB decoder,
in codec configurations with the pre-Rel-10 AMR-WB encoder and the AMR-WB decoder inside EVS, and
in relevant 3GPP CS system configurations.

6.2.2 Quality for mixed content and music

In order to retain music quality when interoperating with legacy CS networks (such as the PSTN), or when interoperating with networks that do not support signaling the start and stop of music, mixed content and music coded by the EVS codec should have quality that is significantly better than pre-Rel-10 3GPP speech codecs.

The following is recommended:

For narrowband signals: the quality shall be significantly improved with respect to the state-of-the-art 3GPP narrowband codec (AMR) at equivalent operating points.
For wideband signals: the quality shall be significantly improved with respect to the state-of-the-art 3GPP conversational wideband codec at equivalent operating points.
For super-wideband signals: the quality shall be significantly better than the state-of-the-art 3GPP conversational wideband codec with wideband input and be no worse than state-of-the-art conversational super-wideband codecs at equivalent operating points.

Note: It is envisioned that the EVS codec will provide better performance than a specific state-of-the-art conversational super-wideband codec at equivalent operating points.

in codec configurations with the AMR-WB encoder inside EVS and the AMR-WB decoder inside EVS,
in codec configurations with the AMR-WB encoder inside EVS and the pre-Rel-10 AMR-WB decoder,
in codec configurations with the pre-Rel-10 AMR-WB encoder and the AMR-WB decoder inside EVS, and
in relevant 3GPP CS system configurations.

6.2.3 Robustness to packet loss and delay jitter

The quality of the EVS codec should be evaluated using patterns of losses and delays based upon the expected behaviour of deployed EPS systems under a wide range of packet loss and delay/jitter conditions. As network conditions degrade, the degradation of EVS speech quality should be minimized. Furthermore, the performance of the codec should be evaluated in realistic scenarios including different delay and error profiles consistent with the objective requirements on JBM in TS 26.114.

The following is recommended:

For narrowband and wideband, speech quality in lossy conditions shall be significantly improved with respect to state-of-the-art 3GPP conversational codecs at equivalent operating points under the same loss conditions.
For super-wideband, quality should be maintained to be significantly better than state-of-the-art super-wideband codecs under equivalent loss conditions.
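To illustrate why loss and delay jitter are evaluated together, the sketch below computes the fraction of frames that are unusable at the decoder: frames lost in the network plus frames that arrive after their playout deadline. It assumes a single fixed playout deadline, which is a simplification; it does not reproduce the JBM requirements of TS 26.114, and the function name and sample values are hypothetical.

```python
def effective_loss_rate(arrival_delays_ms, playout_deadline_ms):
    """Fraction of frames that cannot be played out.

    arrival_delays_ms holds one entry per frame: the network delay in
    milliseconds, or None if the frame was lost. A frame arriving later
    than the (assumed fixed) playout deadline counts as lost too.
    """
    if not arrival_delays_ms:
        raise ValueError("at least one frame is required")
    unusable = sum(
        1 for delay in arrival_delays_ms
        if delay is None or delay > playout_deadline_ms
    )
    return unusable / len(arrival_delays_ms)


if __name__ == "__main__":
    # Hypothetical delay profile: one frame lost, one frame late.
    profile = [20, 35, None, 90, 40]
    print(effective_loss_rate(profile, playout_deadline_ms=60))
```

Raising the deadline reduces late losses but adds to the end-to-end delay, so the two conditions cannot be assessed independently.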

6.3 Transcoding performance

Transcoding can be avoided by negotiating the use of existing 3GPP codecs previously defined for voice services. Where transcoding cannot be avoided, it is recommended that the resulting quality degradation and additional delay shall be as limited as possible. Possible transcoding configurations are self-tandemings and tandemings with existing 3GPP codecs such as AMR-WB and AMR.

7 Assessment of existing codecs

The 3GPP-defined speech codecs for narrowband are the GSM Full-Rate (FR) codec, the GSM Half-Rate (HR) codec, the GSM Enhanced Full-Rate (EFR) codec and the Adaptive Multi-Rate (AMR) codec. For wideband speech, the Adaptive Multi-Rate Wideband (AMR-WB) codec has been standardized. The narrowband AMR codec is sometimes referred to as the Adaptive Multi-Rate Narrowband (AMR-NB) codec.

Section 6 places recommendations (including recommended requirements) on narrowband and wideband speech coding that will lead to improvements over the existing 3GPP speech codecs. Consequently, per definition, these recommendations are not met by the existing 3GPP codecs.

Section 6 places recommendations (including recommended requirements) for encoding of super-wideband speech. None of 3GPP's speech codecs are capable of rendering super-wideband speech. As such, none of the existing 3GPP speech codecs meet all the recommendations set forth for an EVS codec.

8 Strategy for EPS's codec(s) standardization

This TR focuses on the identification and definition of service and system requirements and high level codec requirements that should be met for the "enhanced voice services". An EVS codec meeting these requirements allows for substantially enhanced voice services over what is possible with pre-Rel-10 3GPP codecs.

It is recommended that 3GPP now define and start the EVS codec development work item with the target to meet the requirements and recommendations set in this TR.

A Use cases

Consensus was not reached regarding the relevance of the following use cases to this TR. Furthermore, some use case descriptions are considered incomplete.

A.1 High quality audio conferencing over heterogeneous networks and with different terminals

Among the different use cases related to voice services, the high quality audio conferencing scenario over heterogeneous networks and with different terminals, described in more detail below, is a relevant scenario for this study item for the following reasons:

It is one of the most demanding for enhanced quality of service and high voice quality because of longer duration of conference calls.
It requires interoperability and flexibility since audio conferencing service may interconnect different users connected from different access networks and devices.

The use case describes the scenario of a multipoint audio conferencing session with N participants over EPS.

The participants are connected over heterogeneous access networks that include 3GPP access (e.g. E-UTRAN) and non-3GPP access networks, like WiMAX, WLAN, cdma2000®, PSTN…, with different QoS (bit rate, delay, packet loss, and jitter). They are also supposed to be in a situation of mobility between different access networks. They are participating using different types of terminals and in different environments (home, office, car, train…). Such terminals have different capabilities in terms of:

Supported codecs: new terminals support the EPS's codec(s) but old terminals can still participate in the conference even with only 3GPP pre-EPS codec(s).

The following examples are some possibilities illustrating the diversity of the participation environment conditions:

User A with PC terminal connected to WLAN.
User B with a laptop terminal working outside and connected to their home network (e.g. a non-3GPP-defined network).
User C with a 3G/EPS hand-held terminal in a car, connected through the LTE access network. This participant is expected to continue participating seamlessly in the conference call on arriving at the office, but using e.g. a WLAN connection.
User D with a 2G or 3G terminal in a train, connected through GSM/GPRS/EDGE or UMTS, with noisy background.

A.2 Two-party communication

The most typical and most important use case, where the end-user has high expectations regarding the perceived audio quality, is conversational voice communication between two parties (e.g. Multimedia telephony).

Within the scope of 3GPP systems, the most likely case is that at least one of the involved terminals is a mobile terminal. Mobile terminals can in many cases be assumed to be used in mobile environments with a certain ambient noise level. It is also likely that mobile terminals will be used in hand-held mode with the built-in microphones and loudspeakers, or with a monaural mobile handsfree set. It is notable, however, that a technology trend is significantly enhancing the acoustic front-ends of mobile terminals. It can hence be assumed that there is an increasing number of cases in which the mobile terminal provides fullband and stereo (or multi-channel) capabilities. Furthermore, such capabilities can be regarded as typical for fixed line, PC, or IPTV terminals. To optimize the end-user experience, a wideband codec should be prioritized over a narrowband codec during codec negotiation. The network operator policy can override the codec negotiation.
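The prioritization described above can be sketched as follows, assuming the codecs common to both terminals are already known. Operator policy is modelled here as an optional allow-list of codec names, which is one possible interpretation of "override" and not something this TR specifies; the function name, the ranking table and the example values are likewise illustrative assumptions.

```python
# Audio bandwidths ordered from narrowest to widest.
BANDWIDTH_RANK = {"NB": 0, "WB": 1, "SWB": 2, "FB": 3}


def pick_codec(common, operator_allowed=None):
    """Pick the widest-bandwidth codec from (name, bandwidth) pairs.

    operator_allowed is a hypothetical policy hook: when given, only
    codecs named in it are considered. Returns None if nothing remains.
    """
    candidates = [
        (name, bw) for name, bw in common
        if operator_allowed is None or name in operator_allowed
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda item: BANDWIDTH_RANK[item[1]])[0]


if __name__ == "__main__":
    shared = [("AMR", "NB"), ("AMR-WB", "WB")]
    print(pick_codec(shared))                            # wideband preferred
    print(pick_codec(shared, operator_allowed={"AMR"}))  # policy override
```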

A.3 Interoperability with pre-Rel-10 3GPP systems

Today, 3GPP systems have a subscriber base counted in billions. The introduction of EPS can hence only be gradual, and interoperation with pre-Rel-10 systems and terminal equipment, as well as with Rel-10 equipment not supporting EPS, is likely to remain a use case for many years to come.

Although multiple methods to achieve interoperation may exist, the selected method of interoperation should ensure the highest possible quality for the end user relative to other possible methods of interoperation.

A.4 Mobility scenarios

By their nature, mobile terminals are likely to move during a session. Since EPS will not have 100% coverage from its first roll-out, there is a high likelihood of handovers between cells with EPS support and cells without. The latter are cells offering only CS or legacy pre-Rel-10 PS access. This scenario is likely to occur with mobiles participating in two-party as well as multi-party communication use cases.

For mobile terminals that camp on EPS, the service continuity described in TS 22.278 applies.

A.5 Call Hold and (Explicit) Call transfer

These are supplementary services defined in [2],[3]. The call hold service allows a served mobile subscriber (subscriber A) to interrupt communication on an existing active call (with subscriber B), to place another call (with subscriber C) and, if desired, to re-establish the original communication.

The ECT supplementary service enables the served mobile subscriber (subscriber A) who has two calls, each of which can be an incoming or outgoing call, to connect the other parties in the two calls (subscribers B and C) and release the served mobile subscriber's own connection.

In both of these supplementary services, a possible scenario is that one of the calls established with subscriber A is an enhanced EPS voice service while the other is a legacy 3GPP wideband or narrowband voice service.

A.6 Voice-mail service

A typical use case of voice mail service is that a first user A establishes a call to a voice-mail server and leaves a message for some other user B. Later user B connects to the server and listens to the message. One likely case is that the terminals of both users are of different capabilities.

A.7 Access to network media server content

An increasing number of conversational services between two or multiple parties include access to media servers which put requirements on music and mixed content (speech + music) quality. This use case involves server-based media access like:

Calling participants of a conference call listen to music on hold while waiting for the conference call organizer to open the bridge.
A user is provided with a mixed content (speech + music) message (informative or advertising) from a multimedia server while waiting for the end user to answer.

A.8 Music on hold

At present, music while on hold offers little more to the user than providing an indication that the call has not dropped. EPS is an opportunity to provide music on hold, of comparable quality to multimedia streaming, to improve user experience.