Content for TR 26.928 Word version: 17.0.0

A.14 Use Case 13: 3D shared experience

Use Case Description: 3D shared experience

In this shared 3D use case, two friends (Eileen and Bob) share a virtual experience. The experience is built around a crime investigation: it shows the investigation of two murder suspects and lets the users discuss and identify who committed the murder. Both Eileen and Bob join from home, each wearing a VR HMD and being captured by an RGB+depth camera. In VR they experience a 3-dimensional room (6DoF, a police station) in which they are represented in 3D, including a self-representation that allows them to point at items in the room and at each other. This representation can be based on the same RGB+depth capture that is used for communication purposes. Further, in the virtual police station each of them has a window through which to follow a different interrogation (windowed 6DoF / 3DoF+), allowing them to collect information and solve the murder together (see Figure A.14-1).

Figure A.14-1: example image of a virtual 3D experience with photo-realistic user representations

Categorization

Type: AR, MR, VR

Degrees of Freedom: 3DoF+ / 6DoF

Delivery: Conversational

Device: Mobile / Laptop

Preconditions

The above use case results in the following hardware requirements:

Each user needs a VR HMD (mobile, stand alone, wired/wireless VR HMD).

Each user needs a depth camera for capture (connected via Bluetooth, integrated into a mobile phone, or wired).

Each user needs a microphone and an audio headset for audio upload and spatial audio playback.

Each user needs to be connected and registered to a network that is able to facilitate the end-to-end audio/video call.

Requirements and QoS/QoE Considerations

The following QoS requirements are considered:

Bandwidth: A minimum of 6 Mbit/s is expected (for a single 2D+ user stream with RGB + depth video); this requirement can increase with more complex and higher-resolution streams.

Delay: suitable for real-time communication

Delay (self-view): suitable for feeling of embodiment
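
To get a feel for how the 6 Mbit/s figure scales, the following sketch estimates per-user bandwidth as more participants join. It assumes a full-mesh topology in which each participant sends one RGB+depth stream directly to every other participant. The topology, the function name `mesh_bandwidth_mbit`, and the omission of audio and signalling overhead are illustrative assumptions, not part of the use case.

```python
def mesh_bandwidth_mbit(participants: int, per_stream_mbit: float = 6.0) -> dict:
    """Estimate per-user video bandwidth in a full-mesh call.

    Assumption (not from the use case text): every participant sends its
    RGB+depth stream to each of the other participants and receives one
    stream from each of them.
    """
    if participants < 2:
        raise ValueError("a call needs at least two participants")
    peers = participants - 1
    return {
        "uplink_mbit": peers * per_stream_mbit,
        "downlink_mbit": peers * per_stream_mbit,
    }


# Two users as in the use case, and a hypothetical four-user session.
two_users = mesh_bandwidth_mbit(2)
four_users = mesh_bandwidth_mbit(4)
print(two_users)
print(four_users)
```

With a central media server instead of a mesh, the uplink would stay at one stream per user, so the topology choice strongly affects the uplink requirement.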

The main goal of this use case is to create shared presence and immersion in a 3DoF+/6DoF experience. Thus, the following QoE considerations are relevant:

Capture & Processing:

The resolution of the RGB+depth camera needs to be sufficient.

The foreground/background extraction needs to result in an accurate cut-out of the user.

Transmission:

The compression of audio and video data should follow similar constraints as traditional video conferencing.

Rendering:

Users need to be scaled and positioned in the AR/VR environment in a natural way.

Audio playback needs to match the spatial orientation of the user.

A self-view needs to be properly aligned with the actual body movement so that proprioceptive and visual experience match. The delay for this also needs to be kept to a minimum.
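
As an illustration of the audio requirement above, the following sketch computes the horizontal angle at which a remote participant's voice should appear, given the listener's position and head yaw. The coordinate convention (yaw 0 looks along +z, positive angles turn towards +x) and the function name `relative_azimuth_deg` are assumptions made for this example; a real renderer would pass such an angle to its spatial audio engine.

```python
import math


def relative_azimuth_deg(listener_xz, listener_yaw_deg, source_xz):
    """Horizontal angle of a sound source relative to the listener's facing direction.

    Assumed convention: positions are (x, z) pairs on the floor plane,
    yaw 0 looks along +z, and positive angles turn towards +x.
    The result is normalised to the range (-180, 180].
    """
    dx = source_xz[0] - listener_xz[0]
    dz = source_xz[1] - listener_xz[1]
    # Absolute bearing of the source as seen from the listener's position.
    bearing = math.degrees(math.atan2(dx, dz))
    # Subtract head yaw and wrap into [-180, 180).
    rel = (bearing - listener_yaw_deg + 180.0) % 360.0 - 180.0
    return 180.0 if rel == -180.0 else rel


# Listener at the origin facing +z, remote user two metres straight ahead.
ahead = relative_azimuth_deg((0.0, 0.0), 0.0, (0.0, 2.0))
# Same positions, but the listener has turned their head by 90 degrees.
after_turn = relative_azimuth_deg((0.0, 0.0), 90.0, (0.0, 2.0))
print(ahead, after_turn)
```

Recomputing the angle whenever the HMD reports a new head orientation keeps the voice anchored to the remote user's position in the room rather than to the listener's head.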

Feasibility

Demos & Technology overview:

M. J. Prins, S. N. B. Gunkel, H. M. Stokking, and O. A. Niamut. TogetherVR: A Framework for photorealistic shared media experiences in 360-degree VR. SMPTE Motion Imaging Journal 127.7:39-44, August 2018.

S. N. B. Gunkel, H. M. Stokking, M. J. Prins, N. van der Stap, F. B. T. Haar, and O. A. Niamut. Virtual Reality Conferencing: Multi-user immersive VR experiences on the web. In Proceedings of the 9th ACM Multimedia Systems Conference, pp. 498-501, ACM, June 2018.

2018, IBC Demo: https://vrtogether.eu/2018/09/14/ibc-show-2018/

In summary:

Users are captured with an RGB+depth device, e.g. a Microsoft Kinect or an Intel RealSense camera.

This capture is processed locally for foreground/background segmentation and optionally for creation of a self-view.

WebRTC is used for setting up streams to the other call participants.

A-Frame / WebVR is used for rendering the virtual environment.
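
The local foreground/background segmentation step can be sketched with a simple depth threshold: pixels whose depth falls inside an assumed range in front of the camera are kept, and everything else is replaced by a background colour. This is a minimal illustration, not necessarily the method used in the demos cited above. The range limits, the plain nested-list frame layout, and the function names `foreground_mask` and `cut_out` are assumptions.

```python
def foreground_mask(depth_mm, near_mm=300, far_mm=1500):
    """Return a boolean mask that is True where a pixel lies in [near_mm, far_mm].

    Assumption: depth is given in millimetres and 0 means 'no reading',
    which therefore falls outside the range and counts as background.
    """
    return [[near_mm <= d <= far_mm for d in row] for row in depth_mm]


def cut_out(rgb, mask, background=(0, 0, 0)):
    """Keep RGB pixels where the mask is True and paint the rest with `background`."""
    return [
        [px if keep else background for px, keep in zip(rgb_row, mask_row)]
        for rgb_row, mask_row in zip(rgb, mask)
    ]


# Tiny 2x3 example frame: the user sits roughly one metre from the camera.
depth = [[0, 900, 2500],
         [1200, 1000, 4000]]
rgb = [[(10, 10, 10), (200, 150, 120), (30, 30, 30)],
       [(190, 140, 110), (180, 130, 100), (40, 40, 40)]]

mask = foreground_mask(depth)
user_only = cut_out(rgb, mask)
print(mask)
print(user_only)
```

A fixed threshold only works when nothing else stands within the depth range; a production system would likely refine the mask edges, which is where the accuracy requirement on the cut-out comes in.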

Existing Service:

http://www.mimesysvr.com/

Summary of steps:

Figure A.14-2: Functional blocks of end-to-end communication

Furthermore, to realize this use case, it is mapped onto the following functional blocks:

Capture & Processing: The data from the RGB+depth camera needs to be acquired and further processed (to separate the user from the background); in particular, the depth information might need further processing before transmission.

Transmission: There needs to be a two-way end-to-end link between individual participants to transmit audio and video data. The video data should include both the RGB colour and the depth information.

Rendering: The transferred user representation has to be blended into the VR environment (according to its geometrical properties based on the RGB + depth data) and any audio needs to be played according to its spatial origin within the environment. Further, the self-representation of the user has to be displayed aligned so that the user's view and physical position match.

Please note that all three functional blocks can be executed on one device, on multiple devices, or in the network.
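
One possible form of the depth processing before transmission is to quantise depth into 8-bit values so that it can travel as a greyscale image next to the RGB picture in a conventional video stream. The sketch below shows such a mapping and its inverse on the receiving side. The 8-bit target, the depth range, the use of 0 as a 'no user' marker, and the function names are all assumptions for illustration; the use case itself does not prescribe a depth format.

```python
def quantise_depth(depth_mm, near_mm=300, far_mm=1500):
    """Map depth in millimetres to 8-bit codes.

    Codes 1..255 cover [near_mm, far_mm] linearly; code 0 is reserved
    for pixels outside that range (background or missing reading).
    """
    span = far_mm - near_mm
    out = []
    for row in depth_mm:
        out_row = []
        for d in row:
            if d < near_mm or d > far_mm:
                out_row.append(0)
            else:
                out_row.append(1 + round((d - near_mm) * 254 / span))
        out.append(out_row)
    return out


def dequantise_depth(codes, near_mm=300, far_mm=1500):
    """Inverse mapping on the receiver; code 0 becomes None (no geometry)."""
    span = far_mm - near_mm
    return [
        [None if c == 0 else near_mm + (c - 1) * span / 254 for c in row]
        for row in codes
    ]


sent = quantise_depth([[0, 300, 900, 1500, 2000]])
received = dequantise_depth(sent)
print(sent)
print(received)
```

The receiver needs the same `near_mm` and `far_mm` values to reconstruct geometry, which is one example of the kind of metadata that would have to be signalled alongside the media.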

Potential Standardization Status and Needs

The following aspects may require standardization work:

System

Architecture

Communication interfaces (signalling)

Media Orchestration (i.e. metadata)

Position and scaling of people

Spatial Audio (e.g. including audio directionality of users)

Background audio

Shared content, i.e. multi-device media synchronization

Allow network-based processing (e.g. cloud rendering, foreground/background removal of user capture, image enhancements like hole filling, replacing the HMD of a user with a photo-realistic representation of their face, etc.)

Transmission

The end-to-end system (including the network) needs to support the RGB+Depth video data.