Content for  TR 26.998  Word version:  18.0.0


4.6  Related Work

4.6.1  3GPP

This clause documents the 3GPP activities related to services using AR/MR devices.
  • 3GPP TR 26.928 provides an introduction to XR, including AR, and a mapping to 5G media-centric architectures. It also specifies the core use cases and device types for XR.
  • 3GPP TS 22.261 identifies use cases and requirements for 5G systems, including AR, and TR 22.873 documents new scenarios of AR communication for the IMS Multimedia Telephony service.
  • 3GPP SA4 is working on documenting 360-degree video support for MTSI in TS 26.114. It will provide recommendations on codec configurations and signalling mechanisms for viewport-dependent media delivery.
  • 3GPP TR 26.926 [48] provides Traffic Models and Quality Evaluation Methods for Media and XR Services in 5G Systems.
  • In the context of Release-17, 3GPP RAN work [16] identified traffic models for XR applications and an evaluation methodology to assess XR performance.
  • 3GPP SA4 is working on the development of the EVS Codec Extension for Immersive Voice and Audio Services (IVAS) codec. It targets encoding/decoding/rendering of speech, music and generic sound, with low-latency operation and support of high error robustness under various transmission conditions. The IVAS codec is expected to provide support for a range of service capabilities, e.g. from mono to stereo to fully immersive audio, implementable on a wide range of UEs. The work on IVAS is expected to provide support for MTSI services and potentially streaming services through the definition of a new immersive audio media component.
  • In the context of Release-18 under the Terminal Audio quality performance and Test methods for Immersive Audio Services (ATIAS) work item, 3GPP SA4 is working on the specification of test methods in TS 26.260 and requirements in TS 26.261 [57] for immersive audio.

4.6.2  MPEG

MPEG has developed a suite of standards for immersive media under the MPEG-I project (ISO/IEC 23090, Coded Representation of Immersive Media). It covers all media-related components, including video, audio, and systems aspects for AR/MR as well as 360-degree video.
  • Part 1 - Immersive Media Architectures: Provides the structure of MPEG-I, core use cases and scenarios, and definitions of terminologies for immersive media
  • Part 2 - Omnidirectional MediA Format (OMAF): Defines a media format that enables omnidirectional media applications (360-degree video) with support of 3DoF, 3DoF+, and 6DoF based on ISOBMFF. The second edition of OMAF was published in 2021 [17].
  • Part 3 - Versatile Video Coding (VVC): Describes the 2D video compression standard, providing the improved compression performance and new functionalities as compared to HEVC. The first edition was published in 2021 [18].
  • Part 4 - Immersive Audio Coding: It provides the compression and rendering technologies to deliver 6DoF immersive audio experience.
  • Part 5 - Visual Volumetric Video-based Coding (V3C) and Video-based Point Cloud Compression (V-PCC): It defines the coding technologies for point cloud media data, utilizing legacy and future 2D video coding standards. The first edition was published in 2021 [19], and a second edition for dynamic mesh compression is under development.
  • Part 6 - Immersive Media Metrics: It specifies a list of media metrics and a measurement framework to evaluate immersive media quality and experience. The first edition was published in 2021 [47].
  • Part 7 - Immersive Media Metadata: It defines common immersive media metadata to be referenced to various other standards.
  • Part 8 - Network-Based Media Processing (NBMP): It defines a media framework to support media processing for immersive media which may be performed in network entities. It also specifies the composition of network-based media processing services and provides the common interfaces. The first edition was published in 2020 [20], and a second amendment covering media processing entity capabilities and split-rendering support is under development.
  • Part 9 - Geometry-based Point Cloud Compression (G-PCC): It defines the coding technologies for point cloud media data, using techniques that traverse directly the 3D space in order to create the predictors for compression.
  • Part 10 - Carriage of Visual Volumetric Video-based Coding Data: It specifies the storage format for V3C and V-PCC coded data. It also supports flexible extraction of component streams at delivery and/or decoding time.
  • Part 11 - Implementation Guidelines for Network-based Media Processing
  • Part 12 - Immersive Video: It provides coding technology of multiple texture and depth views representing immersive video for 6DoF.
  • Part 13 - Video Decoding Interface for Immersive Media: It provides the interface and operation of video engines to support flexible use of media decoders.
  • Part 14 - Scene Description for MPEG Media: It describes the spatial-temporal relationships among the individual media objects to be integrated. MPEG Systems has also initiated the next phase of development, extending MPEG-I Scene Description with support for additional immersive media codecs, haptics, AR anchoring, user representation and avatars, as well as interactivity.
  • Part 15 - Conformance Testing for Versatile Video Coding
  • Part 16 - Reference Software for Versatile Video Coding
  • Part 17 - Reference Software and Conformance for Omnidirectional MediA Format
  • Part 18 - Carriage of Geometry-based Point Cloud Compression Data
  • Part 19 - Reference Software for V-PCC
  • Part 20 - Conformance for V-PCC
  • Part 21 - Reference Software for G-PCC
  • Part 22 - Conformance for G-PCC
  • Part 23 - Conformance and Reference Software for MPEG Immersive Video
  • Part 24 - Conformance and Reference Software for Scene Description for MPEG Media
  • Part 25 - Conformance and Reference Software for Carriage of Visual Volumetric Video-based Coding Data
  • Part 26 - Conformance and Reference Software for Carriage of Geometry-based Point Cloud Compression Data
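To make the role of NBMP (Part 8) more concrete, the following is a minimal, hypothetical sketch of a workflow description assembling network-based media processing tasks. The field names and task identifiers are illustrative only and are not the normative NBMP schema:

```python
# Hypothetical, simplified workflow description inspired by NBMP (ISO/IEC
# 23090-8). Field names and task ids are illustrative, NOT the normative schema.

def make_workflow(name, tasks, connections):
    """Assemble a minimal workflow description: media processing tasks to be
    deployed in network entities, plus the directed links between them."""
    task_ids = {t["id"] for t in tasks}
    for src, dst in connections:
        if src not in task_ids or dst not in task_ids:
            raise ValueError(f"connection {src}->{dst} references unknown task")
    return {"workflow": name, "tasks": tasks, "connections": connections}

# Example: a split-rendering pipeline in which a network entity pre-renders
# the 3D scene and streams encoded 2D views down to a lightweight AR device.
wf = make_workflow(
    "split-rendering",
    tasks=[
        {"id": "scene-manager", "function": "scene-update"},
        {"id": "renderer", "function": "3d-render"},
        {"id": "encoder", "function": "2d-encode"},
    ],
    connections=[("scene-manager", "renderer"), ("renderer", "encoder")],
)
print(len(wf["tasks"]))  # 3
```

The point of the structure is the separation NBMP standardizes: the workflow description names the processing functions abstractly, while their deployment to concrete network entities is left to the workflow manager.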

4.6.3  ETSI Industry Specification Group

ETSI Industry Specification Group AR Framework (ISG ARF) has developed a framework for AR components and systems [21]. It introduces the characteristics of an AR system and describes the functional building blocks of the AR reference architecture and their mutual relationships. The generic nature of the architecture is validated by mapping the workflow of several use cases to the components of this framework architecture.
The ETSI AR Framework Architecture describes a system composed of hardware and software components as well as data describing the real world and virtual content. The architecture is composed of three layers as described in clause 4 of [21] and illustrated in Figure 4.6.3-1.
  • Hardware layer including:
    • Tracking Sensors: These sensors aim to localize (position and orientation) the AR system in real time in order to register virtual content with the real environment. Most AR systems, such as smartphones, tablets or see-through glasses, embed one or several vision sensors (generally monochrome or RGB cameras) as well as an inertial measurement unit and a GPS™. However, specific and/or recent systems use complementary sensors such as dedicated vision sensors (e.g. depth sensors and event cameras) or exteroceptive sensors (e.g. infrared/laser tracking, Li-Fi™ and Wi-Fi™).
    • Processing Units: Computer vision, machine learning-based inference and 3D rendering are processing operations requiring significant computing resources, optimized by dedicated processor architectures (e.g. GPU, VPU and TPU). These processing units may be embedded in the device, or may be remote and/or distributed.
    • Rendering Interfaces: Virtual content requires interfaces to be rendered to the user so that it is perceived as part of the real world. As each rendering device has its own characteristics, the signals generated by the rendering software generally need to be transformed to adapt them to each specific rendering hardware.
  • Software layer including:
    • Vision Engine: This software mixes the virtual content with the real world. It localizes (position and orientation) the AR device relative to the real-world reference, localizes specific real objects relative to the AR device, reconstructs a 3D representation of the real world, and analyses the real world (e.g. object detection, segmentation, classification and tracking). This component essentially uses vision sensor signals as input, but not exclusively (e.g. fusion of visual information with inertial measurements, or initialization with GPS). It benefits from the hardware optimization offered by the various dedicated processors, embedded in the device or remote, and delivers to the rendering engine all the information required to adapt the rendering for a consistent combination of virtual content with the real world.
    • 3D Rendering Engine: This software maintains an up-to-date internal 3D representation of the virtual scene augmenting the real world. This internal representation is updated in real time according to various inputs, such as the user's interactions, virtual objects' behaviour, the latest user viewpoint estimated by the Vision Engine, or an update of the World Knowledge (e.g. to manage occlusions between real and virtual elements). This internal representation of the virtual content is accessible by the renderer (e.g. video, audio or haptic), which uses dedicated hardware (e.g. a Graphics Processing Unit) to produce data (e.g. 2D images, sounds or forces) ready to be played by the Rendering Interfaces (e.g. screens, headphones or a force-feedback arm).
  • Data layer including:
    • World Knowledge: This represents the information either generated by the Vision Engine or imported from external tools to describe the real world or a part of it (CAD models, markers, etc.). It corresponds to the digital representation of the real space, used for different purposes such as localization, world analysis and 3D reconstruction.
    • Interactive Content: This represents the virtual content mixed with the perception of the real world. The content may be interactive or dynamic, meaning that it includes 3D content, its animations, and its behaviour with respect to input events such as the user's interactions. Interactive Content may be extracted from external authoring tools, which may require adapting the original content to the AR application (e.g. 3D model simplification, fusion, and conversion of instruction guidelines).
Figure 4.6.3-1: Global overview of the architecture of an AR system
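The data flow across the three ARF layers can be sketched as a simple pipeline. This is a minimal illustrative sketch of the flow described above (Tracking Sensors → Vision Engine → 3D Rendering Engine → Rendering Interfaces, with the Data layer as input); all names and stubbed values are ours, not part of the ETSI specification:

```python
from dataclasses import dataclass

@dataclass
class Pose:                      # position + orientation of the AR device
    position: tuple
    orientation: tuple           # e.g. a quaternion (w, x, y, z)

def tracking_sensors() -> dict:
    """Hardware layer: raw sensor samples (camera + IMU, possibly GPS)."""
    return {"camera_frame": b"...", "imu": (0.0, 0.0, 9.81), "gps": None}

def vision_engine(samples: dict, world_knowledge: dict) -> Pose:
    """Software layer: localize the device against the real-world reference,
    fusing visual and inertial data (trivially stubbed here)."""
    assert "camera_frame" in samples and "imu" in samples
    return Pose(position=(0.0, 0.0, 0.0), orientation=(1.0, 0.0, 0.0, 0.0))

def rendering_engine(pose: Pose, interactive_content: list) -> dict:
    """Software layer: update the internal 3D scene from the latest pose and
    produce data ready for the rendering interfaces."""
    return {"viewpoint": pose, "draw_list": list(interactive_content)}

def rendering_interface(frame: dict) -> str:
    """Hardware layer: adapt the rendered signal to a specific display."""
    return f"displaying {len(frame['draw_list'])} virtual object(s)"

# Data layer inputs
world_knowledge = {"markers": ["marker_0"]}        # digital model of real space
interactive_content = ["virtual_arrow", "label"]   # virtual objects to overlay

pose = vision_engine(tracking_sensors(), world_knowledge)
frame = rendering_engine(pose, interactive_content)
print(rendering_interface(frame))  # displaying 2 virtual object(s)
```

The sketch only makes the layering visible: sensors and displays sit in the hardware layer, localization and scene maintenance in the software layer, and World Knowledge / Interactive Content enter as data-layer inputs.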
In the ETSI AR functional architecture, there are eleven logical functions as illustrated in Figure 4.6.3-2. Each function is composed of two or more subfunctions.
Figure 4.6.3-2: Diagram of the functional reference architecture
The ETSI ISG ARF has validated that the functional architecture covers all requirements for AR experience delivery in a variety of use cases. The logical functions are connected by Reference Points (RPs). An RP in the AR functional architecture is located at the juncture of two non-overlapping functions and represents the interactions between those functions.
Details for each of the eleven functions and their subfunctions are described in clause 5 of [21] and details of each of the 18 RPs are described in clause 6 of [21].
