Use Case Description: 3D shared experience
In this shared 3D use case, two friends (Eileen and Bob) share a virtual experience. The experience is built around a crime investigation involving two murder suspects and allows the users to discuss and identify who committed the murder. Both Eileen and Bob join from home, each wearing a VR HMD and being captured by an RGB+depth camera. In VR they experience a 3-dimensional room (6DoF, a police station) in which each of them is represented in 3D, including a self-representation that allows them to point at items in the room and at each other. This representation can be based on the same RGB+depth capture that is made for communication purposes. Further, in the virtual police station each of them has a window through which to follow a different interrogation (windowed 6DoF / 3DoF+), allowing them to collect information and solve the murder together (see figure 2).
Categorization
Type: AR, MR, VR
Degrees of Freedom: 3DoF+ / 6DoF
Delivery: Conversational
Device: Mobile / Laptop
Preconditions
The above use case results in the following hardware requirements:
Each user needs a VR HMD (mobile, standalone, or wired/wireless VR HMD).
Each user needs to be captured by a depth camera (connected via Bluetooth, integrated into a mobile phone, or wired).
Each user needs a microphone and an audio headset for audio upload and spatial audio playback.
Each user needs to be connected and registered to a network that is able to facilitate the end-to-end audio/video call.
Requirements and QoS/QoE Considerations
The following QoS requirements are considered:
Bandwidth: A minimum of 6 Mbit/s is expected (for a single 2D+ user stream with RGB + depth video); this requirement can increase with more complex and higher-resolution streams.
Delay: suitable for real-time communication
Delay (self-view): suitable for feeling of embodiment
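As a rough illustration of how the per-stream bandwidth figure above could scale with the number of participants, the following sketch estimates the uplink and downlink load of one user. The function name, the full-mesh topology and the optional relay are assumptions made for this example only; the use case itself only states the 6 Mbit/s figure for a single stream.

```python
def per_user_bitrate_mbps(participants, per_stream_mbps=6.0, relayed=False):
    """Return (uplink, downlink) in Mbit/s for one user.

    Assumption: every participant sends one RGB+depth stream and receives
    one such stream from each other participant.
    """
    if participants < 2:
        raise ValueError("a call needs at least two participants")
    others = participants - 1
    # Full mesh: one copy of the own stream is uploaded per remote peer.
    # Relayed (hypothetical central forwarding unit): a single upload.
    uplink = per_stream_mbps if relayed else per_stream_mbps * others
    downlink = per_stream_mbps * others
    return uplink, downlink


if __name__ == "__main__":
    for n in (2, 3, 4):
        print(n, per_user_bitrate_mbps(n), per_user_bitrate_mbps(n, relayed=True))
```

For the two-user scenario described here (Eileen and Bob), both topologies give the same result: one stream up and one stream down per user.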
The main goal of this use case is to create shared presence and immersion in a 3DoF+/6DoF experience. Thus, the following QoE considerations are relevant:
Capture & Processing:
The resolution of the RGB+depth camera needs to be sufficient.
The foreground/background extraction needs to result in an accurate cut-out of the user.
Transmission:
The compression of audio and video data should follow similar constraints as traditional video conferencing.
Rendering:
Users need to be scaled and positioned in the AR/VR environment in a natural way.
Audio playback needs to match the spatial orientation of the user.
A self-view needs to be properly aligned with the actual body movement so that the proprioceptive and visual experience match. The delay of the self-view also needs to be kept to a minimum.
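The requirement that audio playback matches the spatial orientation of the user can be illustrated with a small geometric sketch. It computes the direction of a remote user's voice relative to the direction the listener is facing, which is the kind of value a spatial audio renderer would typically need. The coordinate convention (positions on the floor plane, yaw 0 looking along +z, positive angles to the right) and the function name are assumptions made for this example.

```python
import math


def relative_azimuth_deg(listener_pos, listener_yaw_deg, source_pos):
    """Direction of a sound source relative to the listener's view direction.

    Positions are (x, z) tuples on the floor plane. Assumed convention:
    yaw 0 looks along +z and positive yaw turns towards +x.
    The result lies in (-180, 180]; positive means "to the right".
    """
    dx = source_pos[0] - listener_pos[0]
    dz = source_pos[1] - listener_pos[1]
    # Bearing of the source in world coordinates: 0 deg = +z, 90 deg = +x.
    bearing = math.degrees(math.atan2(dx, dz))
    # Subtract the head orientation and wrap into [-180, 180).
    rel = (bearing - listener_yaw_deg + 180.0) % 360.0 - 180.0
    return 180.0 if rel == -180.0 else rel


if __name__ == "__main__":
    # Listener at the origin facing +z, remote user one metre to the right.
    print(relative_azimuth_deg((0.0, 0.0), 0.0, (1.0, 0.0)))
```

When the listener turns the head towards the remote user, the relative azimuth moves towards 0, so the voice is rendered in front.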
Feasibility
Demos & Technology overview:
M. J. Prins, S. N. B. Gunkel, H. M. Stokking, and O. A. Niamut. TogetherVR: A Framework for photorealistic shared media experiences in 360-degree VR. SMPTE Motion Imaging Journal 127.7:39-44, August 2018.
S. N. B. Gunkel, H. M. Stokking, M. J. Prins, N. van der Stap, F. B. T. Haar, and O. A. Niamut. Virtual Reality Conferencing: Multi-user immersive VR experiences on the web. In Proceedings of the 9th ACM Multimedia Systems Conference, pp. 498-501. ACM, June 2018.
2018, IBC Demo: https://vrtogether.eu/2018/09/14/ibc-show-2018/
In summary:
Users are captured with an RGB+depth device, e.g. a Microsoft Kinect or an Intel RealSense camera.
This capture is processed locally for foreground/background segmentation and optionally for creation of a self-view.
WebRTC is used for setting up streams to the other call participants.
A-Frame / WebVR is used for rendering the virtual environment.
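The local foreground/background segmentation step listed above can be sketched with a simple depth threshold. The cited demos do not state which segmentation method they use, so this is only a stand-in to show the idea; the function names, the depth range in millimetres and the treatment of a depth value of 0 as "no measurement" are assumptions.

```python
def foreground_mask(depth_mm, near_mm=300, far_mm=1500):
    """Return a 2D list of booleans, True where a pixel is kept.

    Assumption: the user sits within [near_mm, far_mm] of the camera.
    A depth of 0 (assumed to mean "no measurement") falls outside the
    range and is therefore treated as background.
    """
    return [[near_mm <= d <= far_mm for d in row] for row in depth_mm]


def cut_out(rgb, mask, fill=(0, 0, 0)):
    """Replace every background pixel of the colour image with `fill`."""
    return [
        [pixel if keep else fill for pixel, keep in zip(rgb_row, mask_row)]
        for rgb_row, mask_row in zip(rgb, mask)
    ]


if __name__ == "__main__":
    depth = [[0, 800], [1200, 3000]]             # millimetres
    rgb = [[(1, 1, 1), (2, 2, 2)], [(3, 3, 3), (4, 4, 4)]]
    print(cut_out(rgb, foreground_mask(depth)))
```

A fixed threshold is cheap but produces rough edges; a real capture pipeline would likely add filtering or hole filling before transmission.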
Existing Service:
http://www.mimesysvr.com/
Summary of steps:
To realize this use case, it is mapped onto the following functional blocks:
Capture & Processing: The data from the RGB+depth camera needs to be acquired and further processed (to separate the user from the background). In particular, the depth information might need further processing before transmission.
Transmission: There needs to be a two-way end-to-end link between the individual participants to transmit audio and video data. The video data should include both the RGB colour and the depth information.
Rendering: The transferred user representation has to be blended into the VR environment (according to its geometrical properties based on the RGB+depth data), and any audio needs to be played according to its spatial origin within the environment. Further, the self-representation of the user has to be displayed aligned so that the user's view and physical position match.
Please note that all 3 functional blocks can be executed on one device, on multiple devices, or in the network.
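For the rendering block, placing a user according to the geometry in the RGB+depth data can be illustrated by back-projecting depth pixels into 3D points with a pinhole camera model. This is a minimal sketch, not the method of any particular implementation; the function name and the intrinsic parameters fx, fy, cx, cy (focal lengths and principal point in pixels) are assumptions that would in practice come from the camera calibration.

```python
def back_project(depth_mm, fx, fy, cx, cy):
    """Convert a depth image into a list of 3D points in metres.

    Uses the pinhole model: x = (u - cx) * z / fx, y = (v - cy) * z / fy.
    Points are in camera coordinates (x right, y down, z forward).
    Pixels with a depth of 0 or less are skipped (assumed invalid).
    """
    points = []
    for v, row in enumerate(depth_mm):        # v: pixel row
        for u, d in enumerate(row):           # u: pixel column
            if d <= 0:
                continue
            z = d / 1000.0                    # millimetres to metres
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


if __name__ == "__main__":
    depth = [[0, 1000], [2000, 0]]
    print(back_project(depth, fx=1.0, fy=1.0, cx=0.0, cy=0.0))
```

The resulting points can then be scaled and transformed into the coordinate system of the virtual room, with the colour of each point taken from the matching RGB pixel.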
Potential Standardization Status and Needs
The following aspects may require standardization work:
System
Architecture
Communication interfaces (signalling)
Media Orchestration (i.e. metadata)
Position and scaling of people
Spatial Audio (e.g. including audio directionality of users)
Background audio
Shared content, i.e. multi-device media synchronization
Allow network-based processing (e.g. cloud rendering, foreground/background removal of the user capture, image enhancements such as hole filling, replacing the HMD of a user with a photo-realistic representation of their face, etc.)
Transmission
The end-to-end system (including the network) needs to support the RGB+Depth video data.