AR Conversational services are end-to-end use-cases that include communication between two or more parties. The following building blocks to realize AR conversational services are identified:
Call setup and control: this building block covers
the signalling to set up a call or a conference;
the fetching of the entry point for the AR experience. The protocol needs to support upgrading to and downgrading from an AR experience, as well as adding and removing media. This also includes signalling the device type (Type-1, Type-2, or Type-3) as well as non-AR experiences, e.g. a tablet.
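The upgrade/downgrade and add/remove-media signalling described above can be sketched as a session whose media set is re-negotiated mid-call. The following is an illustrative sketch only; the class, method, and media names are invented for illustration and are not defined by any 3GPP specification:

```python
# Illustrative sketch: a call session whose media set can be re-negotiated,
# modelling upgrade to / downgrade from an AR experience. All names here
# are hypothetical, not from a 3GPP specification.
class CallSession:
    def __init__(self, media=("audio",)):
        self.media = set(media)
        self.ar_active = False
        self.entry_point = None

    def upgrade_to_ar(self, entry_point_url):
        # Fetching the entry point (the scene description) is represented
        # here by simply storing its URL.
        self.entry_point = entry_point_url
        self.media.add("scene-description")
        self.ar_active = True

    def downgrade_from_ar(self):
        self.media.discard("scene-description")
        self.ar_active = False

    def add_media(self, kind):
        self.media.add(kind)

    def remove_media(self, kind):
        self.media.discard(kind)

session = CallSession()
session.add_media("2d-video")                               # add media mid-call
session.upgrade_to_ar("https://example.com/scene.json")     # upgrade to AR
session.downgrade_from_ar()                                 # downgrade again
```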
Formats: the media and metadata types and formats for AR calls need to be identified. This includes the format of the entry point, namely the scene description, together with any extensions needed to support AR telephony, as well as the formats for media capturing, e.g. 2D video, depth maps, 3D point clouds, and colour attributes. For the AR telephony media types, the necessary QoS characteristics need to be defined, as well as format properties and codecs.
Delivery: the transport protocols for the AR media need to be identified. AR telephony and conferencing applications require low latency exchange of real-time media. A protocol stack, e.g. based on RTP, will be required.
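Whatever stack is chosen, an RTP-based option builds on the small fixed RTP header of RFC 3550. As a minimal sketch (standard library only, no padding, extension, or CSRC fields), the 12-byte fixed header can be packed as follows:

```python
import struct

def rtp_header(payload_type, seq, timestamp, ssrc, marker=False):
    """Pack the minimal 12-byte RTP fixed header (RFC 3550):
    version=2, no padding, no extension, zero CSRC entries."""
    b0 = 2 << 6                                   # V=2, P=0, X=0, CC=0
    b1 = (int(marker) << 7) | (payload_type & 0x7F)
    return struct.pack("!BBHII", b0, b1, seq & 0xFFFF,
                       timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)

# Dynamic payload type 96 is a common choice for negotiated media formats.
hdr = rtp_header(payload_type=96, seq=1, timestamp=3000, ssrc=0x1234)
```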
5G system integration: offering the appropriate support by the 5G system to AR telephony and conferencing applications includes:
signalling for QoS allocation,
discovery and setup of edge resources to process media for AR telephony,
usage of MBS and MTSI,
data collection and reporting.
The building blocks may have different instantiations and/or options. For example, the delivery may be mapped to a WebRTC protocol stack or to an MTSI protocol stack. Furthermore, a single session may combine several delivery methods to accommodate the different media types supported by an AR conversational service.
In addition, AR telephony and conferencing applications may support asymmetrical and symmetrical experiences. In an asymmetrical case, one party is sending AR immersive media and the backchannel from other participants may be audio only, 2D video, etc. In a symmetrical case, all involved parties are sending and receiving AR immersive media.
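The symmetric/asymmetric distinction maps naturally onto SDP-style direction attributes for the AR media of each party. A small sketch (the mapping itself is an illustration, not normative signalling):

```python
# Illustrative sketch: derive an SDP-style direction attribute for a party's
# AR immersive media, depending on whether that party sends and/or receives
# AR media in the session.
def ar_media_direction(party_sends_ar, party_receives_ar):
    return {
        (True, True): "sendrecv",
        (True, False): "sendonly",
        (False, True): "recvonly",
        (False, False): "inactive",
    }[(party_sends_ar, party_receives_ar)]

# Symmetric case: all parties send and receive AR immersive media.
assert ar_media_direction(True, True) == "sendrecv"
# Asymmetric case: one party sends AR media; the backchannel from the
# other participant is audio only, 2D video, etc.
assert ar_media_direction(True, False) == "sendonly"
assert ar_media_direction(False, True) == "recvonly"
```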
The use case relevant to this scenario may be further categorized into AR two-party call use cases and AR conferencing use cases. The AR two-party call use cases include:
UC#3: Real-time 3D Communication
UC#4: AR guided assistant at remote location (industrial services)
UC#7: Real-time communication with the shop assistant
UC#11: AR animated avatar calls
UC#16: AR remote cooperation
UC#19: AR conferencing
The AR conferencing use cases include:
UC#8: 360-degree conference meeting
UC#9: XR meeting
UC#10: Convention / Poster Session
UC#12: AR avatar multi-party calls
UC#13: Front-facing camera video multi-party calls
UC#19: AR conferencing
3GPP TR 22.873 also addresses use cases relevant to this scenario, namely conference call with AR holography and AR call. Note that the first use case is similar to UC#19 and the second is similar to UC#4 as listed in Table 5-1.
There are different options for mapping to 5G system:
The MTSI architecture (TS 26.114) supports audio and 2D video conversational services; it can be extended to support AR signalling and immersive media. This includes both the MTSI/RTP and the MTSI/Data channel (DC) stack options.
The 5GMS architecture (TS 26.501) can be extended to support AR conversational services by combining live uplink and live downlink. 5GMS offers basic functionality such as QoS support and reporting, and, in the future, also edge support, which will be beneficial for all types of applications. The typical/expected QoS parameters (especially delay) need to be clarified.
An architecture based on something other than MTSI/IMS or 5GMS, for example browser implementations such as WebRTC. WebRTC is widely deployed today for conversational services and is built on a flexible ecosystem on the device side, which is important in this case since conversational AR will require significant device-side changes.
The three options compare as follows:
Protocols and formats:
MTSI: SIP- and RTP-based or DC-based. SDP signalling and formats for AR are missing and need to be defined. Encoding and decoding at the MTSI client need to be extended beyond ITT4RT with improved support of AR immersive media formats (e.g. meshes, point clouds).
5GMS: TCP- and HTTP-based streaming, using DASH/HLS and MPEG OMAF/CMAF technology. AR and immersive media content and signalling are assumed to work with HTTP-based streaming in the other use-case mappings.
WebRTC: for example, using the non-IMS WebRTC data channel and/or extending WebRTC audio/video for AR media such as immersive media communications. AR signalling aspects are to be studied.
Find and connect:
MTSI: solved through SIP and E.164 addressing in IMS.
5GMS: find-and-connect for the conversational, UE-to-UE, case is undefined.
WebRTC: WebRTC implementations offer dedicated APIs for connection establishment in various contexts such as social media platforms. Browser applications are widely available.
Latency and QoS:
MTSI: technically possible; latency is in principle unlikely to be a problem to achieve, building on the existing QoS and policy framework in 5GC.
5GMS: low-latency and QoS DASH support in 5GMS is to be studied.
WebRTC: WebRTC is designed with low latency in mind but has no defined relation to the QoS and policy framework in 5GC; its use with that framework needs to be studied.
Cross-operator interconnect and edge:
MTSI: cross-operator interconnect aspects are included. Edge processing functions are to be studied.
5GMS: cross-operator interconnect aspects are currently not addressed. Edge processing functions are to be studied (e.g. EMSA).
WebRTC: cross-operator interconnect aspects are currently not applicable since WebRTC is used OTT today, but they will become relevant and need study, especially if used with QoS. Edge processing functions are to be studied.
Lawful intercept (only in scope of SA3, not SA4):
MTSI: an LI framework exists. Possible extensions to cover new AR media formats are to be studied in SA3.
5GMS: not in scope, since 5GMS is not a telephony service.
WebRTC: not in scope if used only OTT.
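Since SDP signalling and formats for AR are missing and need to be defined, the following sketch can only show what an SDP media description for an AR immersive stream might eventually look like. The media subtype "ar-mesh" and its fmtp parameter are invented purely for illustration:

```python
# Hypothetical sketch of an SDP media description for an AR immersive video
# stream. The subtype "ar-mesh" and the "max-vertices" parameter are
# invented; actual SDP signalling and formats for AR remain to be defined.
def ar_video_mline(port=49170, pt=96):
    return "\r\n".join([
        f"m=video {port} RTP/AVP {pt}",
        f"a=rtpmap:{pt} ar-mesh/90000",      # hypothetical media subtype
        f"a=fmtp:{pt} max-vertices=100000",  # hypothetical format parameter
        "a=sendrecv",
    ]) + "\r\n"

offer_fragment = ar_video_mline()
```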
To describe the functional architecture for AR conversational use-cases such as the one in Annex A.4, and to identify the content delivery protocols and performance indicators, an end-to-end architecture is considered. The end-to-end workflow for AR conferencing (one direction) is shown in Figure 6.5.3-1. A camera captures the participant in an AR conferencing scenario. The camera is connected to a UE (e.g. a laptop) via a data network (wired or wireless). The live camera feed, sensor data, and audio signals are provided to a UE/edge node (or a split between the two), which processes, encodes, and transmits immersive media content to the 5G system for distribution. The immersive media processing function in the UE may include pre-processing of the captured 3D video, format conversion, and any other processing needed before compression. The immersive media content includes a 3D representation, e.g. in the form of meshes or point clouds, of the participants in the AR conferencing scenario. After processing and encoding, the compressed 3D video and audio streams are transmitted over the 5G system. A 5G STAR UE decodes, processes, and renders the 3D video and audio streams.
The use-case may be extended to the bi-directional/symmetric case by adding a 3D camera on the receiver side and AR glasses on the sender side, and applying a similar workflow. In the asymmetrical case of an EDGAR UE, the immersive media is further pre-rendered by an immersive media processing function in the 5G system and then transmitted to the UE. Depending on the device capability, further media processing such as main scene management, composition, and rendering of partial scenes for individual participants is performed in the cloud/edge.
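The one-direction workflow above can be sketched as a chain of stages, capture through rendering. In this illustrative sketch each stage is reduced to a labelled pass-through; the dictionary stands in for real 3D media:

```python
# Illustrative sketch of the one-direction AR conferencing workflow:
# capture -> pre-process -> encode -> 5GS transport -> decode -> render.
# The stage names follow the text; the frame dict is a stand-in for media.
def capture():       return {"kind": "3d-frame", "stages": ["capture"]}
def preprocess(f):   f["stages"].append("preprocess"); return f
def encode(f):       f["stages"].append("encode"); return f
def transmit_5gs(f): f["stages"].append("5gs-transport"); return f
def decode(f):       f["stages"].append("decode"); return f
def render(f):       f["stages"].append("render"); return f

frame = render(decode(transmit_5gs(encode(preprocess(capture())))))
assert frame["stages"] == ["capture", "preprocess", "encode",
                           "5gs-transport", "decode", "render"]
```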
We consider an immersive AR two-party call between Alice and Bob. The end-to-end call flow is described below:
[STAR UE Alice - STAR UE Bob]: Either one of the UEs may initiate an AR immersive call by starting an application on the phone or AR glasses.
[STAR UE Alice - STAR UE Bob]: Both UEs communicate with a signalling server to establish the AR call. During the session establishment, both parties agree on the format (e.g. point clouds, triangular/polygon meshes). The exact session type and configuration depend on the capabilities of the STAR UEs.
[STAR UE Alice]: Alice is captured by a depth camera, either embedded within the STAR UE or external, which generates an immersive 3D media stream (audio and video).
[STAR UE Alice]: The immersive 3D media is encoded and transmitted in real-time to Bob over the 5G system. Additional pre-processing may be applied before encoding such as format conversion.
[STAR UE Bob]: The immersive 3D media is received on Bob's STAR UE. The immersive 3D media stream is decoded and rendered on the AR glasses. Additional post-processing may be applied before rendering, such as format conversion and customization to match the stream to the rendered environment, e.g. hole filling.
[STAR UE Bob]: Bob is captured by a depth camera generating an immersive 3D media which is encoded and transmitted in real-time to Alice's AR glasses.
[STAR UE Alice]: The immersive 3D media is received, decoded, and rendered on Alice's AR glasses.
[STAR UE Alice - STAR UE Bob]: Both UEs terminate the service at the end of the call.
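The format agreement in the second step of this flow can be sketched as a capability intersection: both STAR UEs advertise the 3D formats they support, and the session uses a common one. The format names and the preference order below are assumptions for illustration only:

```python
# Illustrative sketch of format agreement during AR session establishment.
# Format names and preference order are assumptions, not normative.
PREFERENCE = ["point-cloud", "triangle-mesh", "polygon-mesh"]

def agree_format(alice_caps, bob_caps):
    common = set(alice_caps) & set(bob_caps)
    for fmt in PREFERENCE:
        if fmt in common:
            return fmt
    return None  # no common 3D format: fall back to a non-AR session

# Both sides support triangle meshes, so the session uses that format.
assert agree_format({"point-cloud", "triangle-mesh"},
                    {"triangle-mesh"}) == "triangle-mesh"
```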
We consider an immersive AR asymmetrical call between Alice and Bob, where Bob is transmitting immersive media to be consumed on the AR glasses of Alice (STAR UE). Bob (non-STAR UE) is receiving content from Alice via other means such as audio, 2D video, etc. The end-to-end call flow is described:
[STAR UE Alice - non-STAR UE Bob]: Alice initiates an AR immersive call by starting an application on the phone or AR glasses.
[STAR UE Alice - non-STAR UE Bob]: Alice communicates with a signalling server to establish the AR call. During the session establishment, the format is identified (e.g. point clouds, triangular/polygon meshes). The exact session type and configuration depend on the capabilities of the STAR UE.
[non-STAR UE Bob]: Bob is captured by a depth camera, either embedded within the UE or external, which generates an immersive 3D media stream (audio and video).
[non-STAR UE Bob]: The immersive 3D media is encoded and transmitted in real-time to Alice over the 5G system. Additional pre-processing may be applied before encoding such as format conversion.
[STAR UE Alice]: The immersive 3D media is received on Alice's STAR UE. The immersive 3D media stream is decoded and rendered on the AR glasses. Additional post-processing may be applied before rendering, such as format conversion and customization to match the stream to the rendered environment, e.g. hole filling.
[STAR UE Alice]: Alice is transmitting audio, 2D video or other media content as a back channel to Bob.
[non-STAR UE Bob]: The audio, 2D video, or other media content is received, decoded, and rendered on Bob's device.
[STAR UE Alice - non-STAR UE Bob]: Alice terminates the service at the end of the call.