This clause focuses on the interoperability point to a media decoder as indicated in Figure 6.1-1. This clause does not deal with the access engine and file parser which addresses aspects how the audio bitstream is delivered.
In all audio operation points, the VR Presentation can be rendered using a single media decoder which provides decoded PCM signals and rendering metadata to the audio renderer.
This clause defines the potential parameters of Audio Operation Points. This includes the detailed audio decoder requirements and audio rendering metadata. The requirements are defined from the perspective of the audio decoder and renderer.
Parameters for an Audio Operation Point include:
the audio decoder that the bitstream needs to conform to,
the mandated or permitted rendering data that is included in the audio bitstream.
The 3GPP MPEG-H Audio Operation Point fulfills the requirements to support 3D audio and is specified in ISO/IEC 23090-2 , clause 10.2.2. Channels, Objects and First/Higher-Order Ambisonics (FOA/HOA) are supported, as well as combinations of those. The Operation Point is based on MPEG-H 3D Audio .
A bitstream conforming to the 3GPP MPEG-H Audio Operation Point shall conform to the requirements in clause 18.104.22.168.
A receiver conforming to the 3GPP MPEG-H Audio Operation Point shall support decoding and rendering a Bitstream conforming to the 3GPP MPEG-H Audio Operation Point. Detailed receiver requirements are provided in clause 22.214.171.124.
The audio decoder is able to start decoding a new audio stream at every random access point (RAP). As defined in clause 126.96.36.199, the sync sample (RAP) contains the configuration information (PACTYP_MPEGH3DACFG and PACTYP_AUDIOSCENEINFO) that is used to initialize the audio decoder. After initialization, the audio decoder reads encoded audio frames (PACTYP_MPEGH3DAFRAME) and decodes them.
To optimize startup delay at random access, the information from the MHAS PACTYP_BUFFERINFO packet should be taken into account. The input buffer should be filled at least to the state indicated in the MHAS PACTYP_BUFFERINFO packet before starting to decode audio frames.
It is recommended that, at random access into an audio stream, the receiving device performs a 100 ms fade-in on the first PCM output buffer that it receives from the audio decoder.
If the decoder receives an MHAS stream that contains a configuration change, the decoder shall perform a configuration change according to ISO/IEC 23008-3 , clause 5.5.6. The configuration change can, for instance, be detected through the change of the MHASPacketLabel of the packet PACTYP_MPEGH3DACFG compared to the value of the MHASPacketLabel of previous MHAS packets.
If MHAS packets of type PACTYP_AUDIOTRUNCATION are present, they shall be used as described in ISO/IEC 23008-3 , clause 14.
The Access Unit that contains the configuration change and the last Access Unit before the configuration change may contain a truncation message (PACTYP_AUDIOTRUNCATION) as defined in ISO/IEC 23008-3 , clause 14. The MHAS packet of type PACTYP_AUDIOTRUNCATION enables synchronization between video and audio elementary streams at program boundaries. When used, sample-accurate splicing and reconfiguration of the audio stream are possible.
The receiver shall be capable of simultaneously receiving at least 3 MHAS streams. The MHAS streams can be simultaneously decoded or combined into a single stream prior to the decoder, by utilizing the field mae_bsMetaDataElementIDoffset in the Audio Scene Information as described in ISO/IEC 23008-3 , clause 14.6.
The 3GPP MPEG-H Audio Operation Point builds on the MPEG-H 3D Audio codec, which includes rendering to loudspeakers, binaural rendering and also provides an interface for external rendering. Legacy binaural rendering using fixed loudspeaker setups can be supported by using loudspeaker feeds as output of the decoder.
MPEG-H 3D Audio specifies methods for binauralizing the presentation of immersive content for playback via headphones, as is needed for omnidirectional media presentations. MPEG-H 3D Audio specifies a normative interface for the user's viewing orientation and permits low-complexity, low-latency rendering of the audio scene to any user orientation.
The binaural rendering of MPEG-H 3D Audio shall be applied as described in ISO/IEC 23008-3 , clause 13 according to the Low Complexity Profile and Levels restrictions for binaural rendering specified in ISO/IEC 23008-3 , clause 188.8.131.52.
For binaural rendering using head tracking the useTrackingMode flag in the BinauralRendering() syntax element shall be set to 1, as described in ISO/IEC 23008-3 , clause 17.4. This flag defines if a tracker device is connected and the binaural rendering shall be processed in a special headtracking mode, using the scene displacement values (yaw, pitch and roll).
The values for the scene displacement data shall be sent using the interface for scene displacement data specified in ISO/IEC 23008-3 , clause 17.9. The syntax of mpegh3daSceneDisplacementData() interface provided in ISO/IEC 23008-3 , clause 17.9.3 shall be used.
The metadata flag fixedPosition in SignalGroupInformation() indicates if the corresponding audio signals are updated during the processing of scene-displacement angles. In case the flag is equal to one, the positions of the corresponding audio signals are not updated during the processing of scene displacement angles.
Channel groups for which the flag gca_directHeadphone is set to "1" in the mpegh3da_getChannelMetadata() syntax element are routed to left and right output channel directly and are excluded from binaural rendering using scene displacement data (non-diegetic content). Non-diegetic content may have stereo or mono format. For mono, the signal is mixed to left and right headphone channel with a gain factor of 0.707.
The interface for binaural room impulse responses (BRIRs) specified in ISO/IEC 23008-3 , clause 17.4 shall be used for external BRIRs and HRIRs. The HRIR/BRIR data for the binaural rendering can be fed to the decoder by using the syntax element BinauralRendering(). The number of BRIR/HRIR pairs in each BRIR/HRIR set shall correspond to the number indicated in the relevant level-dependent row in Table 9 - "The binaural restrictions for the LC profile" of ISO/IEC 23008-3  according to the Low Complexity Profile and Levels restrictions in ISO/IEC 23008-3 , clause 184.108.40.206.
The measured BRIR positions are passed to the mpegh3daLocalSetupInformation(), as specified in ISO/IEC 23008-3 , clause 220.127.116.11. Thus, all renderer stages are set to the target layout that is equal to the transmitted channel configuration. As one BRIR is available per regular input channel, the Format Converter can be passed through in case regular input channel positions are used. Preferably, the BRIR measurement positions for standard target layouts 2.0, 5.1, 10.2 and 7.1.4 should be provided.
MPEG-H 3DA provides the output interfaces for the delivery of un-rendered channels, objects, and HOA content and associated metadata as specified in clause 18.104.22.168.6.5. External binaural renderers can connect to this interface e.g. for playback of head-tracked audio via headphones. An example of such external binaural renderer that connects to the external rendering interface of MPEG-H 3DA is specified in Annex B.
ISO/IEC 23008-3 , clause 17.10 specifies the output interfaces for the delivery of un-rendered channels, objects, and HOA content and associated metadata. For connecting to external renderers, a receiver shall implement the interfaces for object output, channel output and HOA output as specified in ISO/IEC 23008-3 , clause 17.10, including the additional specification of production metadata defined in ISO/IEC 23008-3 , clause 27. Any external renderer should apply the metadata provided in this interface and related audio data in the same manner as if MPEG-H internal rendering is applied:
Correct handling of loudness-related metadata in particular with the aim of preserving intended target loudness
Preserving artistic intent, such as applying transmitted Downmix and HOA Rendering matrices correctly
Rendering spatial attributes of objects appropriately (position, spatial extent, etc.)
In this interface the PCM data of the channels and objects interfaces is provided through the decoder PCM buffer, which first contains the regular rendered PCM signals (e.g. 12 signals for a 7.1+4 setup). Subsequently additional signals carry the PCM data of the originally transmitted channel representation. These are followed by signals carrying the PCM data of the un-rendered output objects. Then additional signals carry the HOA audio PCM data which number is indicated in the HOA metadata interface via the HOA order (e.g. 16 signals for HOA order 3). The HOA audio PCM data in the HOA output interface is provided in the so-called Equivalent Spatial Domain (ESD) representation. The conversion from the HOA domain into the ESD representation and vice versa is described in ISO/IEC 23008-3 , Annex C.5.1.
The metadata for channels, objects, and HOA is available once per frame and their syntax is specified in mpegh3da_getChannelMetadata(), mpegh3da_getObjectAudioAndMetadata(), and mpegh3da_getHoaMetadata() respectively. The metadata and PCM data shall be aligned for an external renderer to match each metadata element with the respective PCM frame.
MPEG-H 3D Audio  specifies coding of immersive audio material and the storage of the coded representation in an ISO BMFF track. The MPEG-H 3D Audio decoder has a constant latency, see Table 1 - "MPEG-H 3DA functional blocks and internal processing domain", of ISO/IEC 23008-3 . With this information, content authors could synchronize audio and video portions of a media presentation, e.g. ensuring lip-synch.
ISO BMFF integration for this profile is provided following the requirements and recommendations in ISO/IEC 23090-2 , clause 10.2.2.3.
A configuration change takes place in an audio stream when the content setup or the Audio Scene Information changes (e.g., when changes occur in the channel layout, the number of objects etc.), and therefore new PACTYP_MPEGH3DACFG and PACTYP_AUDIOSCENEINFO packets are required upon such occurrences. A configuration change usually happens at program boundaries, but it may also occur within a program.
Configuration change constraints specified in ISO/IEC 23090-2 , clause 10.2.2.3.2 shall apply.
The multi-stream-enabled MPEG-H Audio System is capable of handling Audio Programme Components delivered in several different elementary streams (e.g., the main MHAS stream containing one complete audio main, and one or more auxiliary MHAS streams, containing different languages and audio descriptions). The MPEG-H Audio Metadata information (MAE) allows the MPEG-H Audio Decoder to correctly decode several MHAS streams.
The sample entry 'mhm2' shall be used in cases of multi-stream delivery, i.e., the MPEG-H Audio Scene is split into two or more streams for delivery as described in ISO/IEC 23008-3 , clause 14.6. All constraints for file formats using the sample entry 'mhm2' specified in ISO/IEC 23090-2 , clause 10.2.2.3.3 shall apply.
MPEG-H 3D Audio enables seamless bitrate switching in a DASH environment with different Representations (i.e., bit streams encoded at different bitrates) of the same content, i.e., those Representations are part of the same Adaptation Set.
If the decoder receives a DASH Segment of another Representation of the same Adaptation Set, the decoder shall perform an adaptive switch according to ISO/IEC 23008-3 , clause 5.5.6.