
3GPP TR 26.928, version 17.0.0

A.19  Use Case 18: AR avatar multi-party calls

Use Case Name
AR avatar multi-party call
Description
This use case is about multi-party communication with spatial audio rendering, where the avatars and audio of each participant are transmitted and spatially rendered in the direction of their geolocation. Each participant is equipped with AR glasses with external or built-in headphones. Instead of mono, 3D audio can be captured and transmitted, which enhances the shared audio experience.
A potential user experience is described as a user story:
Bob, Jeff, and Frank are in Venice and walking around the old city sightseeing. They are all wearing AR glasses with a mobile connection via their smartphone. The AR glasses support audio spatialization, e.g. via binaural rendering and playback over the built-in headphones, allowing the real world to be augmented with visuals and audio.
They start a multi-party call, where each of them gets the other two friends displayed on his AR glasses and can hear their audio. While they walk around in the quiet streets, they have a continuous voice call with the avatars displayed on their AR glasses, while other information is also displayed to direct them to the secret places of Venice. Each of them transmits his current location to his friends. Their AR glasses / headphones place the others visually and acoustically (i.e. binaurally rendered) in the direction where the others are. Thus, they all at least know the direction of the others.
As Jeff wants to buy some ice cream, he switches to push-to-talk so as not to annoy his friends with all the interactions he has at the ice cream shop.
As Bob gets closer to Piazza San Marco, the environment gets noisier, with pigeons sitting and flying all around him. Bob turns on the "hear what I hear" feature to give his friends an impression of the fascinating environment, sending 3D audio of the scene to Frank and Jeff. Now interested, they also want to experience the pigeons around them and walk through the city to the square. Each of the friends is still placed on the AR glasses visually and acoustically in the direction where that friend is, which makes it easy for them to find Piazza San Marco and for Frank to simply walk across the square to Bob as he approaches. Jeff, who is still eating his ice cream, is now also getting closer to Piazza San Marco and walks directly to Bob and Frank. As they get close to each other, they are no longer rendered (avatars and audio), based on the positional information, and they simply chat with each other.
Type: AR
Degrees of Freedom: 3DoF
Delivery: Conversational
Device: AR glasses, headphones
Preconditions
Connected AR glasses or phone with tethered AR glasses and headphones (with acoustic transparency).
Positioning support (e.g. using GNSS) to derive geolocation, allowing calculation of relative position (see the sketch after this list).
3D audio capturing (e.g. using microphone arrays) and rendering.
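The preconditions require each client to turn the participants' geolocations into a relative direction for binaural rendering, and the user story additionally stops rendering a participant once the users are physically co-located. The following is a minimal sketch of such a calculation; the function names, the bearing/haversine formulas as the chosen method, and the proximity threshold are illustrative assumptions and are not specified in this TR.

import math

def bearing_and_distance(lat1, lon1, lat2, lon2):
    """Initial great-circle bearing (degrees clockwise from north) and distance
    (metres) from the local user (lat1, lon1) to a remote participant (lat2, lon2)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    y = math.sin(dlon) * math.cos(phi2)
    x = math.cos(phi1) * math.sin(phi2) - math.sin(phi1) * math.cos(phi2) * math.cos(dlon)
    bearing = (math.degrees(math.atan2(y, x)) + 360.0) % 360.0
    a = (math.sin((phi2 - phi1) / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlon / 2) ** 2)
    distance = 2.0 * 6371000.0 * math.asin(math.sqrt(a))   # haversine, mean Earth radius
    return bearing, distance

def render_azimuth(bearing_deg, head_yaw_deg, distance_m, proximity_threshold_m=10.0):
    """Azimuth to hand to the binaural renderer, compensated by the head yaw
    reported by the AR glasses (both in degrees clockwise from north).
    Returns None when the participants are close enough to talk directly,
    i.e. the remote avatar and audio are no longer rendered."""
    if distance_m < proximity_threshold_m:
        return None
    return (bearing_deg - head_yaw_deg) % 360.0

A real implementation would also smooth the positioning data and handle elevation; this sketch only illustrates the direction-of-arrival idea used in the user story.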
Requirements and QoS/QoE Considerations
QoS: MTSI-like QoS requirements for voice/audio and avatars (conversational, RTP), e.g. 5QI 1 for audio.
QoE: Immersive voice/audio and visual experience; quality of the capturing and rendering of the avatars, the different participants, and the 3D audio.
Feasibility
AR glasses in various form factors exist. They usually include motion sensors, e.g. based on accelerometers, gyroscopes, and magnetometers; cameras are also common, allowing inside-out tracking and augmentation of the real world.
3D audio capturing and rendering are available, e.g. using spherical or arbitrary microphone arrays for capturing and using binaural rendering technologies for audio spatialization.
An audio-only solution using headphones and head tracking would be easier to implement; however, it would remove the possibility of visually augmenting the real world and displaying avatars.
Potential Standardization Status and Needs
Visual coding and transmission of avatars
Audio coding and transmission of mono objects and 3D audio for streams from all participants

A.20  Use Case 19: Front-facing camera video multi-party calls

Use Case Name
Front-facing camera video multi-party call
Description
This use case is based on front-facing camera calls, i.e. a user has a video call and sees the other participants on the display of e.g. a smartphone held at arm's length. The use case has some overlap with UC 6 (AR face-to-face calls) and UC 10 (Real-time 3D Communication), extended by spatial audio rendering for headphones/headsets. The spatial audio rendering is based on head-tracking data extracted from the smartphone's front-facing camera, giving the user the impression, even while moving, that the voices of the other participants originate from a virtual stage in the direction of the phone showing the video of the others' faces.
A potential user experience is described as a user story:
Bob, Jeff, and Frank are back in New York City and each of them is walking to work. They just have their smartphones with a front-facing camera and a small headset, allowing the real world to be augmented with audio.
They start a multi-party video call to discuss the plans for the evening, where each of them gets the other two friends displayed on his phone and can hear their audio coming from the direction of the phone in his hand on the horizontal plane, with a small spread to allow easy distinction between the talkers. While they walk around in the streets of New York, they have a continuous voice call with their phones at arm's length, with the, potentially cut-out, faces of their pals displayed on their phones. For Bob the acoustic front is always in the direction of his phone; thus the remote participants are always in front. When Bob rotates his head, however, the front-facing camera tracks this rotation and the spatial audio is binauralized using the head-tracking information, keeping the positions of the other participants steady relative to the phone. As Bob turns around a corner with the phone still at arm's length for the video call, his friends remain steady relative to the phone's position.
Type: AR
Degrees of Freedom: 3DoF
Delivery: Conversational
Device: Smartphone with front-facing camera, headset, AR glasses
Preconditions
Phone with front-facing camera, motion sensors, and headset (more or less acoustically transparent). The motion sensors compensate for movement of the phone; the front-facing camera captures the video for the call and potentially tracks the head's rotation.
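The head-tracking compensation described above can be illustrated with a short sketch: the remote talkers are placed on a virtual stage anchored to the phone, and the head yaw estimated from the front-facing camera is subtracted so that the stage stays fixed relative to the phone. The yaw-only model and all names below are simplifying assumptions, not part of this TR.

def source_azimuths(phone_azimuth_deg, head_yaw_deg, spread_deg, num_talkers):
    """Listener-relative azimuths (degrees, positive = to the right) for the
    binaural renderer.
    phone_azimuth_deg: direction of the phone in a world-fixed frame,
                       e.g. tracked with the phone's motion sensors.
    head_yaw_deg:      listener head yaw in the same frame, estimated from the
                       front-facing camera (positive = turned to the right).
    spread_deg:        small angular spread so the talkers remain distinguishable."""
    centre = phone_azimuth_deg - head_yaw_deg   # stage stays anchored to the phone
    if num_talkers == 1:
        return [centre % 360.0]
    step = spread_deg / (num_talkers - 1)
    start = centre - spread_deg / 2.0
    return [(start + i * step) % 360.0 for i in range(num_talkers)]

# Phone straight ahead (0 deg), head turned 30 deg to the left, two talkers
# spread over 40 deg: both talkers now appear to the listener's right.
print(source_azimuths(0.0, -30.0, 40.0, 2))   # [10.0, 50.0]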
Requirements and QoS/QoE Considerations
QoS: MTSI-like QoS requirements (conversational, RTP), e.g. 5QI 1 and 2.
QoE: Immersive voice/audio and visual experience; quality of the capturing, coding and rendering of the participant video (potentially cut-out faces); quality of the capturing, coding and rendering of the participant audio, including binaural rendering taking head-tracking data into account.
Feasibility
Several multi-party video call applications using the front-facing camera exist, e.g. https://www.cnet.com/how-to/how-to-use-group-facetime-iphone-ipad-ios-12/ and https://faq.whatsapp.com/en/android/26000026/?category=5245237
Head tracking using cameras exists, e.g. https://xlabsgaze.com
Binaural rendering with head tracking also exists (see also TS 26.118).
Potential Standardization Status and Needs
Visual coding and transmission of video recorded by the front-facing camera; potentially cut-out heads, alpha channel coding (see the sketch after this list)
Audio coding and transmission for streams from all participants
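As an illustration of what the alpha channel coding mentioned above enables, the sketch below composites a decoded participant frame carrying a cut-out head (RGBA) over a background frame on the receiving display; the NumPy-based "over" operation, array shapes and names are illustrative assumptions only.

import numpy as np

def alpha_over(foreground_rgba: np.ndarray, background_rgb: np.ndarray) -> np.ndarray:
    """Composite a cut-out participant frame over the background.
    foreground_rgba: (H, W, 4) uint8 with straight (non-premultiplied) alpha.
    background_rgb:  (H, W, 3) uint8. Returns the composited (H, W, 3) frame."""
    alpha = foreground_rgba[..., 3:4].astype(np.float32) / 255.0
    fg = foreground_rgba[..., :3].astype(np.float32)
    bg = background_rgb.astype(np.float32)
    return (fg * alpha + bg * (1.0 - alpha)).astype(np.uint8)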

A.21  Use Case 20: AR Streaming with Localization Registry

Use Case Description: AR Streaming with Localization Registry
A group of friends has arrived at a museum. The museum provides them with an AR guide for the exhibits. The museum's exhibition space has been scanned earlier and registered via one of the museum's AR devices to a Spatial Computing Service. The service allows storing, recalling and updating of the spatial configuration of the exhibition space by a registered AR device. Visitors' AR devices (to be used by museum guests as AR guides) have downloaded the spatial configuration upon entering the museum and are ready for use.
The group proceeds to the exhibit together with their AR guides, which receive a VoD stream of the museum guide with the identifier Group A. Registered surfaces next to exhibits are used for displaying the video content (may be 2D or 3D content) of the guide. In the case of spatial audio content, this may be presented in relation to the registered surfaces. The VoD stream playback is synched amongst the users of Group A. Any user within the group can pause, rewind or fast forward the content, and this affects the playback for all the members of the group. Since all users view the content together, this allows them to experience the exhibit as a group, and discuss during pauses without affecting the content streams for other museum visitors that they are physically sharing the space with. Other groups in the museum use the same spatial configuration, but their VoD content is synched within their own group.
The use case can be extended to private spaces, e.g., a group of friends gathered at their friend Alice's house to watch a movie. Alice's living room is registered already under her home profile on the Spatial Computing Service; the saved information includes her preferred selection of the living room wall as the movie screening surface. She shares this configuration via the service with her guests.
In this use case, the interaction with a travel guide avatar may also occur in a conversational fashion.
Categorization
Type: AR and Social AR
Degrees of Freedom: 6DoF
Delivery: Streaming, Interactive, Conversational
Device: AR glasses with binaural audio playback support
Preconditions
The use case requires technical solutions for the following functions:
Spatial Computing Service
A 5G service that registers users and stores their indoor spatial configuration with the following features:
  • Reception of a stream of visual features for a space to be registered. The input may be from a mobile phone camera, an AR device or a combination of data from multiple sensors and cameras located in the space.
  • Usage of a localization algorithm such as SLAM (Simultaneous Localization and Mapping) for indoor spatial localization, and the storage of spatial configurations, such as the selection of particular surfaces for specific functions (e.g., a wall for displaying a video stream).
  • Distribution of previously stored information to other devices belonging to the same user or to other authorized users.
  • Updating of localization information and redistribution when required.
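A minimal sketch of the data such a Spatial Computing Service could store and redistribute is given below; all class names, fields and the in-memory store are illustrative assumptions, since this TR does not define the service interface.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class RegisteredSurface:
    surface_id: str
    corners: List[List[float]]        # 3D corner points in the space's map frame
    role: str                         # e.g. "video_screen" for the selected wall

@dataclass
class SpatialConfiguration:
    space_id: str                     # e.g. "museum_exhibition_hall" or "alice_living_room"
    owner: str                        # registering user or institution
    map_version: int                  # bumped whenever localization data is updated
    feature_map_uri: str              # reference to the stored SLAM feature map
    surfaces: Dict[str, RegisteredSurface] = field(default_factory=dict)
    authorized_users: List[str] = field(default_factory=list)

class SpatialComputingService:
    """Stores, recalls and updates spatial configurations for registered devices."""
    def __init__(self):
        self._configs: Dict[str, SpatialConfiguration] = {}

    def store(self, config: SpatialConfiguration) -> None:
        self._configs[config.space_id] = config

    def recall(self, space_id: str, user: str) -> SpatialConfiguration:
        config = self._configs[space_id]
        if user != config.owner and user not in config.authorized_users:
            raise PermissionError("user not authorized for this space")
        return config

    def update_surface(self, space_id: str, surface: RegisteredSurface) -> None:
        config = self._configs[space_id]
        config.surfaces[surface.surface_id] = surface
        config.map_version += 1       # clients holding an older version re-download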
Content synchronization
A streaming server that distributes content and ensures synchronized content playback for multiple AR users. The server does not need to have the content stored locally. It can, for example, get the content stream from a streaming service and then redistribute it. For the museum guests, the functionality may be part of the XR client or embedded in a home gateway or console-like device.
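One way to realize the group-wide pause/rewind/fast-forward behaviour is for this server to redistribute a single shared playback state to every member of the group; the sketch below illustrates that idea under the assumption of a common wall clock, with all names and fields hypothetical.

from dataclasses import dataclass

@dataclass
class PlaybackState:
    group_id: str          # e.g. "Group A"
    content_uri: str       # the VoD stream the group is watching
    position_s: float      # media position at reference_time
    reference_time: float  # common wall-clock time (e.g. NTP) of that position
    playing: bool

def current_position(state: PlaybackState, now: float) -> float:
    """Media position every group member should present at wall-clock time 'now'."""
    if not state.playing:
        return state.position_s
    return state.position_s + (now - state.reference_time)

def on_user_pause(state: PlaybackState, now: float) -> PlaybackState:
    """Any member's pause freezes the shared state; the server then pushes the
    new state to all members so their playback stays in sync."""
    return PlaybackState(state.group_id, state.content_uri,
                         current_position(state, now), now, playing=False)

Rewind and fast-forward work analogously by rewriting position_s and reference_time; each client then seeks locally to current_position().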
QoS/QoE Considerations
Required QoS:
  • Sufficiently low latency for synchronized streaming playback and conversational QoS.
Required QoE:
  • Synchronization of VoD content for multiple users within acceptable parameters. This requires that stream playback occurs near-simultaneously for all users, so that user reactions to specific scenes, such as jump scares in a horror movie or a goal in a sports sequence, are also synchronized within the group. Furthermore, the playback reaction time to user actions such as pause, fast forward and rewind should be low and similar for all users within the group. Conversational low-delay QoE is also expected.
Feasibility
The use case is feasible within a timeframe of 3 years. The required hardware, AR glasses, is available on the market, and the network requirements are comparable to those of existing streaming services.
The feasibility of the use case depends on the accuracy of the localization registration and mapping algorithm. Multi-party AR experiences, such as a shared AR map annotation demo from Mapbox (https://blog.mapbox.com/multi-user-ar-experience-1a586f40b2ce?gi=60ceb3226701) and the multiuser AR experience exhibition at the San Francisco Museum of Modern Art by Ubiquity6 (https://www.youtube.com/watch?v=T-I3YG_w-Z4), provide good proof-of-concept examples of already available technology for creating a shared AR experience.
Potential Standardization Status and Needs
The following aspects may require standardization work:
  • Standardized way of sharing and storing indoor spatial information with the service and other devices.
  • Mixing VoD streams may require some additional functions for social AR media control playback. This would relate to allowing users to control the playout of the VoD stream (pause, rewind, fast-forward) for all users in a synchronized manner.
