An immersive multi-modal VR application describes the case of a human interacting with virtual entities in a remote environment such that the perception of interaction with a real physical world is achieved. Users perceive multiple senses (vision, sound, touch) for full immersion in the virtual environment. The degree of immersion achieved indicates how real the created virtual environment feels. Even a tiny error in the preparation of the remote environment might be noticed, as humans are quite sensitive when using immersive multi-modal VR applications. Therefore, a high-fidelity virtual environment (high-resolution images and 3-D stereo audio) is essential to achieve a fully immersive experience.
One of the major objectives of VR designers and researchers is to obtain more realistic and compelling virtual environments. As the asynchrony between different modalities increases, users' sense of presence and realism decreases. There have been efforts in multi-modal-interaction research (since the 1960s or even earlier) regarding the detection of synchronisation thresholds. The results obtained vary, depending on the kind of stimuli and the psychometric methods employed. Hirsh and Sherrick measured the synchronisation thresholds for the visual, auditory and tactile modalities.
M.E. Altinsoy and colleagues state that audio-tactile synchronisation has to be accurate to within ±40 ms. Further results, based on extensive theoretical and experimental efforts, indicated perceptual threshold values of 50 ms for audio lag and 25 ms for audio lead.
As to the visual-tactile synchronisation threshold, Massimiliano Di Luca and Arash Mahnan provided test results indicating that none of the participants could reliably detect the asynchrony when haptic feedback was presented less than 50 ms after the visual contact with an object. The asynchrony tolerated for haptic before visual feedback was instead only 15 ms.
The devices for an immersive multi-modal VR application may include multiple types of devices, such as a VR glasses-type device, gloves, and other devices that support the haptic and/or kinaesthetic modalities. These devices, which are 5G UEs, are connected to the immersive multi-modal VR application server via the 5G network without any UE relays, see Figure 5.1.2-1.
Based on the service agreement between the MNO and the immersive multi-modal VR application operator, the application operator may in advance provide the 5G network with application information, including the application traffic characteristics and the service requirements for the network connection. For example, the packet size for haptic data is related to the Degrees of Freedom (DoF) that the haptic device supports: the packet size for one DoF is 2-8 bytes, and the haptic device generates and sends 500 haptic packets per second.
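The traffic characteristics above determine the haptic uplink payload rate. The following sketch illustrates the arithmetic using only the figures quoted in this clause (2-8 bytes per DoF, 500 packets per second); the function name and the 3-DoF example device are illustrative assumptions, not taken from the TR.

```python
def haptic_bitrate_bps(dof: int, bytes_per_dof: int, packets_per_s: int = 500) -> int:
    """Payload bit rate of the haptic uplink stream (protocol headers excluded)."""
    return dof * bytes_per_dof * packets_per_s * 8  # bytes -> bits

# A hypothetical 3-DoF glove at the upper bound of 8 bytes per DoF:
rate = haptic_bitrate_bps(dof=3, bytes_per_dof=8)  # 96 000 bit/s
```

Even at the upper bound the haptic payload rate is small; the challenge for the 5G system lies in the latency and synchronisation requirements rather than in throughput.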
The application user uses the devices to experience the immersive multi-modal VR application. The user powers on the devices to connect to the application server, then starts the gaming application.
While the game is running, the devices periodically send sensing information to the application server, including the haptic and/or kinaesthetic feedback signal information generated by the haptic device, and sensing information such as positioning and view information generated by the VR glasses.
Based on the uplink data from the devices, the application server performs the necessary processing for the immersive game, including rendering and encoding the video, audio and haptic model data. The application server then periodically sends the downlink data to the devices via the 5G network, with different time periods for the different modalities.
3GPP TS 22.261 specifies KPIs for high data rate and low latency interactive services including Cloud/Edge/Split Rendering, Gaming or Interactive Data Exchanging, Consumption of VR content via tethered VR headset, and audio-video synchronization thresholds.
Support of audio-video synchronisation thresholds has been captured in TS 22.261:
Due to the separate handling of the audio and video component, the 5G system will have to cater for the VR audio-video synchronisation in order to avoid having a negative impact on the user experience (i.e. viewers detecting lack of synchronization). To support VR environments, the 5G system shall support audio-video synchronisation thresholds:
in the range of [125 ms to 5 ms] for audio delayed and
in the range of [45 ms to 5 ms] for audio advanced.
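A receiver applying the TS 22.261 thresholds quoted above could be sketched as follows. The upper bounds of the bracketed ranges (125 ms audio delayed, 45 ms audio advanced) are used as the limits; the function and variable names are illustrative assumptions.

```python
def av_sync_ok(audio_ts_ms: float, video_ts_ms: float,
               max_audio_delay_ms: float = 125.0,
               max_audio_advance_ms: float = 45.0) -> bool:
    """Check presentation-timestamp skew against the asymmetric A/V thresholds."""
    skew = audio_ts_ms - video_ts_ms  # positive: audio lags (is delayed w.r.t.) video
    return -max_audio_advance_ms <= skew <= max_audio_delay_ms
```

Note the asymmetry: audio arriving early is perceptually far more disturbing than audio arriving late, hence the much tighter 45 ms bound on audio advance.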
The motion-to-photon delay (the time difference between the user's motion and the corresponding change of the video image on the display) is less than 20 ms, and the communication latency for transferring the packets of one audio-visual media stream is less than 10 ms, e.g. the packets corresponding to one video/audio frame are transferred to the devices within 10 ms.
According to IEEE 1918.1, the latency for haptic feedback has to be less than 25 ms for accurately completing haptic operations. As rendering and hardware introduce some delay, the communication delay for the haptic modality can reasonably be less than 5 ms, i.e. the packets related to one haptic feedback are transferred to the devices within 5 ms.
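The per-modality communication-delay figures above can be collected into a simple admission check. This is a minimal sketch, assuming measured one-way delays per modality; the dictionary layout and names are illustrative, not part of the TR.

```python
# Communication-delay budgets quoted in this clause (one-way, in ms).
KPI_COMM_DELAY_MS = {"video": 10.0, "audio": 10.0, "haptic": 5.0}

def meets_latency_kpis(measured_ms: dict) -> bool:
    """True iff every modality's measured delay is within its budget.

    A modality missing from the measurements is treated as failing.
    """
    return all(measured_ms.get(m, float("inf")) <= budget
               for m, budget in KPI_COMM_DELAY_MS.items())
```

For example, measured delays of 8/9/4 ms (video/audio/haptic) pass, while a 7 ms haptic delay fails even though it would be acceptable for the audio-visual streams.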
In practice, the service area depends on the actual deployment. In some cases a local approach (e.g. the application servers are hosted at the network edge) is preferred in order to satisfy the requirements of low latency and high reliability.
Due to the separate handling of the multiple media components, synchronisation between the different media components is critical in order to avoid a negative impact on the user experience (i.e. viewers detecting the lack of synchronisation). Applying synchronisation thresholds in the 5G system may be helpful in support of immersive multi-modal VR applications when the synchronisation threshold between two or more modalities is smaller than the latency KPI for the application. Typical synchronisation thresholds are summarised in Table 5.1.6-2.
For each media component, "delay" refers to the case where that media component is delayed compared to the other media component.
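A pairwise skew check against such a threshold table could be sketched as below, using thresholds reported earlier in this clause (±40 ms audio-tactile per Altinsoy; 50 ms haptic-after-visual and 15 ms haptic-before-visual per Di Luca and Mahnan). The table layout and sign convention are illustrative assumptions.

```python
# (first, second) -> (max ms the second modality may lag, max ms it may lead).
THRESHOLDS_MS = {
    ("audio", "tactile"): (40.0, 40.0),
    ("visual", "tactile"): (50.0, 15.0),
}

def in_sync(pair: tuple, skew_ms: float) -> bool:
    """skew_ms > 0 means the second modality is delayed w.r.t. the first."""
    max_lag, max_lead = THRESHOLDS_MS[pair]
    return -max_lead <= skew_ms <= max_lag
```

This mirrors the table note: the threshold is asymmetric, so the check must distinguish which media component is the delayed one.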
The 5G network shall support a mechanism to allow an authorized 3rd party to provide QoS policy for multiple flows (e.g., haptic, audio and video) of multiple UEs associated with a multi-modal application. The policy may contain e.g. coordination information.
The 5G system shall support a mechanism to apply 3rd party provided policy for flows associated with an application. The policy may contain e.g. coordination information.
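To make the two requirements above concrete, the following is a purely illustrative sketch of what a 3rd-party-provided policy covering the coordinated flows of one multi-modal application might contain. All field names are invented for illustration and are not 3GPP-defined; the delay and skew values are the figures quoted earlier in this clause.

```python
# Hypothetical policy structure a 3rd party could provide to the 5G network.
policy = {
    "application_id": "immersive-vr-game-1",  # hypothetical identifier
    "flows": [
        {"ue": "vr-glasses", "modality": "video",  "max_delay_ms": 10},
        {"ue": "vr-glasses", "modality": "audio",  "max_delay_ms": 10},
        {"ue": "glove",      "modality": "haptic", "max_delay_ms": 5},
    ],
    # Coordination information: inter-flow thresholds to be enforced jointly
    # (audio delayed / audio advanced, from the TS 22.261 figures above).
    "coordination": {"audio_video_skew_ms": [125, 45]},
}
```

The point of the "coordination" part is that the flows cannot be treated independently: meeting each per-flow delay budget is not sufficient if the relative skew between flows exceeds the synchronisation thresholds.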