Content for TR 22.856 Word version: 19.2.0


5.11 Use case of IMS-based 3D Avatar Communication

5.11.1 Description

This use case identifies two fundamental scenarios and one sub-scenario for 3D Avatar Communication by means of the IMS. The intention of the proposal is to fully specify this system in 3GPP, to provide a standard for a new form of media to be used in telecommunication by mobile users. In the terminology of this use case, the avatar is a digital representation of a user, and this digital representation is exchanged (with other media, notably audio), with one or more users as mobile metaverse media.

An Avatar Call is similar to a Video Call in that both are visual and interactive, and both provide live feedback to participants regarding their emotions, attentiveness and other social information. They differ in that an Avatar Call can be more private, revealing neither the environment where the caller is nor their actual appearance. Displaying an avatar may be preferable to displaying one's own face in a call for a number of reasons: a user may not feel presentable, may want to make a specific impression, or may have to communicate when only limited data communication is possible. The key difference between an Avatar Call and a Video Call is that the Avatar Call requires only a very constrained data rate, e.g. 5 kbps, to support.

This use case is timely because the key enabling technologies have reached sufficient maturity. The key avatar technologies are the means to (1) capture facial information and calculate values according to a model, (2) efficiently send both media and model components through a communication channel, both initially and over time, and (3) produce media for presentation to a user for the duration of the communication. We anticipate that services will be increasingly available in the coming months and years. The current approaches under development are effectively proprietary and are not integrated with the IMS.

The scenarios considered in this use case are:

1(a). IMS users initiate an avatar call.

1(b). IMS users initiate a video call, but one (or both) of the users decide to provide an avatar representation instead of a video representation.

For both 1(a) and 1(b) the goal is to capture sensing data of the communicating users (especially facial data) to create an animated user digital representation (avatar). This media is provided to communicating users as a new teleservice user experience enabled by the IMS.

2. A user interacts with a computer-generated system. Avatar communication is used to generate an appearance for a simulated entity with whom the user communicates.

5.11.2 Pre-conditions

Users Adonis and Aphrodite are 3GPP subscribers. Both have terminal equipment sufficient to capture their facial expression and movements adequately for computing avatar modeling information. The terminal equipment also includes a display, e.g. a screen, to display visual media. The terminal equipment is capable of initiating and terminating the IMS multimedia application 'avatar call.' The terminal equipment is also capable of capturing the facial appearance and movements sufficiently to produce data required by a Facial Action Coding System (FACS).

A network accessible service is capable of initiating and terminating an IMS session and the IMS multimedia application 'avatar call.'

5.11.3 Service Flows

1(a). Avatar Call

i. Adonis is on a business trip and filthy after a day servicing industrial equipment. He calls Aphrodite, who is several time zones away and reading in bed after an exhausting day.
ii. Adonis doesn't want to initiate a video call since he hasn't had a chance to clean up and is still at work, surrounded by ugly machines. He explicitly initiates an 'avatar call' using his terminal equipment interface.
iii. Aphrodite is alerted to an incoming call. She sees that it is from Adonis and that it is an avatar call. She accepts the call, pleased that she will be presented on the call as an avatar.

Figure 5.11.3-1: Avatar media prepared for an avatar call

In more detail, the media that is provided uplink is generated on each terminal. This is analogous to the way in which speech and video codecs operate today.

Figure 5.11.3-2: Avatar generation on each UE

Once the avatar call is established, the communicating parties provide information uplink. The terminal (a) captures facial information of the call participants and (b) locally determines an encoding that captures the facial information (e.g. consisting of data points, colouring and other metadata). This information is transmitted uplink as a form of media and provided by the IMS to the other participant(s) in the avatar call. When (c) the media is received by a participant, it is rendered as a two- (or three-) dimensional digital representation, shown above as the 'comic figure' on the right.
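The capture, encode, transmit and render steps described above can be illustrated with a minimal Python sketch. It assumes, hypothetically, that each frame carries the intensities of a fixed set of FACS action units, quantised to 8 bits each. The choice of action units, the frame layout and the frame rate of 30 frames per second are illustrative assumptions and not part of any avatar codec defined by 3GPP.

```python
import struct

# Hypothetical subset of FACS action units (AU numbers). A real codec
# would define its own set, ordering and additional metadata.
ACTION_UNITS = (1, 2, 4, 5, 6, 7, 9, 10, 12, 14, 15, 17, 20, 23, 25, 26)


def encode_frame(seq, intensities):
    """Pack one frame: 16-bit sequence number plus one byte per action unit.

    `intensities` maps AU number -> activation in [0.0, 1.0]; action units
    that are missing from the mapping are treated as inactive.
    """
    payload = bytearray(struct.pack("!H", seq & 0xFFFF))
    for au in ACTION_UNITS:
        value = min(max(intensities.get(au, 0.0), 0.0), 1.0)
        payload.append(round(value * 255))
    return bytes(payload)


def decode_frame(data):
    """Inverse of encode_frame: returns (seq, {AU number: intensity})."""
    (seq,) = struct.unpack("!H", data[:2])
    values = {au: b / 255 for au, b in zip(ACTION_UNITS, data[2:])}
    return seq, values


def bit_rate_bps(frame_bytes, frames_per_second):
    """Payload bit rate, ignoring any transport or header overhead."""
    return frame_bytes * 8 * frames_per_second


# Sender side: AU6 (cheek raiser) and AU12 (lip corner puller) active.
frame = encode_frame(7, {6: 0.8, 12: 1.0})
# Receiver side: recover the values that would drive the avatar rendering.
seq, decoded = decode_frame(frame)
```

Under these assumed values a frame is 18 bytes, so 30 frames per second gives 18 × 8 × 30 = 4320 bit/s of payload. This is of the same order as the roughly 5 kbps figure quoted for an avatar call, although transport overhead is not counted here.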

In this use case, the UE processes the data it acquires in order to generate the avatar media encoding. It is possible to send the acquired data, e.g. video data from more than one camera, uplink so that the avatar media could be generated by the 5G network. It is however advantageous from a service perspective to support this capability on the UE. First, the uplink data requirement is greatly reduced. Second, the user may be unwilling to expose the captured data to the network for confidentiality reasons. Third, the avatar may not be based on sensor data at all if it is a 'software generated' avatar (as by a game or other application, etc.), and in this case there is no sensor data to send uplink to be rendered.

1(b). Video call falls back to an Avatar call

i. Adonis is striking as can be, and standing at an awe-inspiring vista on Mount Olympus. He initiates a video call with Aphrodite.
ii. Unfortunately, Adonis has forgotten to consider the time zone difference. For Aphrodite, it is the middle of the night. What's more, Aphrodite has been up for several hours in the middle of the night to clean up a mess made by her sick cat. While she wants to take the call from Adonis, she prefers to be presented by an avatar, and not to take the call as a video call from her side. She explicitly requests an 'avatar presentation' instead of a 'video presentation' and picks up Adonis' call.
iii. The call between Adonis and Aphrodite is established. Adonis sees Aphrodite's avatar representation. Aphrodite sees Adonis in the video media received as part of the call.
iv. Adonis walks further along the mountain trail while still speaking to Aphrodite. The coverage gets worse and worse until it is no longer possible to transmit video uplink adequately. Rather than switching to a voice-only call, Adonis activates 'avatar call' representation. This requires very little data throughput.
v. Adonis and Aphrodite enjoy the rest of their avatar call.
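The choices made in scenario 1(b) can be sketched as a simple selection policy for the visual media that one party sends uplink. The sketch is illustrative only: the function name, the `Media` values and the thresholds are assumptions. The thresholds are taken from the bit rates quoted in this use case (32 kbps as the low end of the video call range, 5 kbps as the avatar call bound). The function only proposes a presentation, and the decision to switch remains with the user.

```python
from enum import Enum


class Media(Enum):
    VIDEO = "video"
    AVATAR = "avatar"
    VOICE_ONLY = "voice-only"


# Illustrative thresholds (assumptions), based on the bit rates quoted
# in this use case for video calls and avatar calls.
VIDEO_MIN_KBPS = 32.0
AVATAR_MIN_KBPS = 5.0


def select_presentation(uplink_kbps, prefers_avatar, avatar_supported=True):
    """Propose the visual media one party sends uplink.

    An explicit user choice of avatar presentation is honoured first and
    never overridden by video. Otherwise video is proposed while the uplink
    can carry it, then avatar, then voice only.
    """
    can_avatar = avatar_supported and uplink_kbps >= AVATAR_MIN_KBPS
    if prefers_avatar and can_avatar:
        return Media.AVATAR
    if not prefers_avatar and uplink_kbps >= VIDEO_MIN_KBPS:
        return Media.VIDEO
    if can_avatar:
        return Media.AVATAR
    return Media.VOICE_ONLY


aphrodite = select_presentation(200.0, prefers_avatar=True)        # avatar by choice
adonis_at_vista = select_presentation(200.0, prefers_avatar=False)  # video
adonis_on_trail = select_presentation(10.0, prefers_avatar=False)   # avatar fallback
```

A user who has asked for avatar presentation is never moved to video in this sketch, which reflects the privacy motivation given in the description.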

2. Aphrodite calls automated customer service

i. Aphrodite calls the customer service of the company "Inhabitabilis" to initiate a video call.

ii. Inhabitabilis customer service employs a 'receptionist' named Nemo, who is actually not a person at all. He is a software construct. An artificial intelligence algorithm generates his utterances. At the same time, an appearance is generated as a set of code points using FACS, corresponding to the dialog and interaction between Aphrodite and Nemo.

iii. Aphrodite is able to get answers to her questions and thanks Nemo.

In all the above scenarios, the following applies.

3. Aphrodite uses a terminal device without cameras, or whose cameras are insufficient, and/or Adonis uses a terminal device without avatar codec support

In this scenario, the UE used by either calling party is not able to support an IMS 3D avatar call. Through the use of transcoding, this lack of support can be overcome. In the service flow shown below, as an example, Aphrodite's UE cannot capture her visually so as to generate an avatar encoding, so she expresses herself in text.

i. Aphrodite calls Adonis and wants to share an avatar call. She cannot however be captured via FACS due to a lack of sufficient camera support on her UE. Instead she uses text-based avatar media.

ii. The text-based avatar media is transported to the point at which it is rendered as 3D avatar media.

Figure 5.11.3-3: Example of text-based avatar media enabling an avatar call without camera (etc.) support on a UE

The transcoding (rendering) of the text-based avatar media into 3D avatar media could take place at any point in the system: Aphrodite's UE, the network, or Adonis' UE.

iii. Adonis' UE is able to display an avatar version of Aphrodite and play its speech (text-to-speech). To the extent that the avatar configuration and voice generation configuration are well associated with Aphrodite, Adonis can hear and see her speaking, though Aphrodite only provides text as input to the conversation.

Other examples (not further described here) could take the media provided by Aphrodite (e.g. text, binary, avatar encoding, etc.) and transcode it to video for presentation to Adonis. This would be useful if Adonis' UE did not have support for avatar encoding.
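The transcoding cases in scenario 3 can be summarised as a small decision function. This is a sketch under stated assumptions: the format labels ("text", "avatar", "video") and the function name are hypothetical, and the sketch deliberately does not decide where the conversion runs, since that could be the sending UE, the network or the receiving UE.

```python
# Visual formats the receiver may render, in order of preference (assumed).
VISUAL_FORMATS = ("avatar", "video")


def plan_transcoding(sender_output, receiver_inputs):
    """Return the media conversion needed between sender and receiver.

    sender_output: format the sending UE produces ("text", "avatar" or "video").
    receiver_inputs: set of formats the receiving UE can render.
    Returns None if no conversion is needed, otherwise (source, target).
    """
    if sender_output in VISUAL_FORMATS and sender_output in receiver_inputs:
        return None  # media can be rendered as sent
    for target in VISUAL_FORMATS:
        if target in receiver_inputs:
            return (sender_output, target)
    raise ValueError("receiver cannot render avatar or video media")


# Aphrodite has no usable camera and sends text-based avatar media.
text_to_avatar = plan_transcoding("text", {"avatar", "video"})
# Adonis' UE has no avatar codec support, so avatar media becomes video.
avatar_to_video = plan_transcoding("avatar", {"video"})
```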

5.11.4 Post-conditions

In each of the scenarios above, avatar media provides an acceptable interactive alternative to a video call experience. The advantages are privacy, efficiency and ease of integration with computer software to animate a simulated conversational partner.

5.11.5 Existing feature partly or fully covering use case functionality

TS 22.228 defines the service requirements for IMS. IMS supports different IMS multimedia applications. IMS supports a wide range of services, notably voice and video calls. There is extensive support for services, tightly integrated with the 3GPP system, with extensive support for roaming and integration with both PSTN and ISDN telephony, emergency services and more. The requirements for a 3D avatar application are largely covered by existing requirements in the 5G standard for IMS.

TS 22.173 defines the media handling capabilities of the IMS Multimedia Telephony service.

The specific gaps that are addressed in 5.11.6 include: extended feature negotiation, enabling the user to decide whether to present video or avatar communication, the ability to support avatar communication and content efficiently, and the ability to support standardized avatar media in the 5G system.

The following KPIs are easily supported by the 5G system. They are included in order to contrast the requirements of an avatar call with those of a video call.

Use Case	Characteristic parameter (KPI)
	End-to-end latency	Service bit rate: user-experienced data rate
Avatar call	[NOTE 1]	< 5 kbps [45]
Video call	< 150 msec preferred; < 400 msec limit; lip-synch: < 100 msec [46]	32-384 kbps [46]
NOTE 1: The latency requirement for real time immersive service experience would be the same as for the video call below. For some user experiences (smaller devices or an embedded icon-sized representation in another application, etc.) the latency tolerance could be greater.
NOTE 2: The video call KPIs are from TS 22.105 and have not changed since Rel-99. Actual transactional video call parameters may be higher now.
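To make the contrast in the table concrete, the following Python sketch converts the quoted bit rates into data volume per minute of call. It assumes a constant bit rate and 1 kbps = 1000 bit/s. The function name is illustrative.

```python
def megabytes_per_minute(kbps):
    """Data volume of one minute of media at a constant bit rate.

    Assumes 1 kbps = 1000 bit/s and 1 MB = 1,000,000 bytes.
    """
    return kbps * 1000 * 60 / 8 / 1_000_000


avatar_mb = megabytes_per_minute(5)        # upper bound for an avatar call
video_low_mb = megabytes_per_minute(32)    # low end of the video call range
video_high_mb = megabytes_per_minute(384)  # high end of the video call range

# How many times more data a video call needs than an avatar call.
ratio_low = 32 / 5
ratio_high = 384 / 5
```

With these figures, one minute of avatar media is at most about 0.04 MB, compared with 0.24 MB to 2.88 MB for video, i.e. a video call needs roughly 6 to 77 times the bit rate of an avatar call.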

5.11.6 Potential New Requirements needed to support the use case

[PR 5.11.6-1]

The IMS shall allow multimedia conversational communications between two or more users providing real time conversational transfer of animated user digital representation and speech data.

[PR 5.11.6-2]

The 5G system shall support a means for UEs to produce 3D avatar media to be sent uplink, and to receive this media downlink.

[PR 5.11.6-3]

The 5G system shall support a means for the production of 3D avatar media to be accomplished on a UE to support confidentiality of the data used to produce the 3D avatar (e.g. data from the UE cameras).

[PR 5.11.6-4]

Subject to user consent, the 5G system shall support a means to provide bidirectional transitioning between video and avatar media for parties of an IMS call.

[PR 5.11.6-5]

The 5G system shall support a means to enable locally generated media (e.g. text or video) of a party to be transcoded before it is rendered for the receiving party.

[PR 5.11.6-6]

The 5G system shall support collection of charging information associated with initiating and terminating an IMS-based 3D avatar call.