Internet Engineering Task Force (IETF) Y.-K. Wang Request for Comments: 7798 Qualcomm Category: Standards Track Y. Sanchez ISSN: 2070-1721 T. Schierl Fraunhofer HHI S. Wenger Vidyo M. M. Hannuksela Nokia March 2016 RTP Payload Format for High Efficiency Video Coding (HEVC) Abstract This memo describes an RTP payload format for the video coding standard ITU-T Recommendation H.265 and ISO/IEC International Standard 23008-2, both also known as High Efficiency Video Coding (HEVC) and developed by the Joint Collaborative Team on Video Coding (JCT-VC). The RTP payload format allows for packetization of one or more Network Abstraction Layer (NAL) units in each RTP packet payload as well as fragmentation of a NAL unit into multiple RTP packets. Furthermore, it supports transmission of an HEVC bitstream over a single stream as well as multiple RTP streams. When multiple RTP streams are used, a single transport or multiple transports may be utilized. The payload format has wide applicability in videoconferencing, Internet video streaming, and high-bitrate entertainment-quality video, among others. Status of This Memo This is an Internet Standards Track document. This document is a product of the Internet Engineering Task Force (IETF). It represents the consensus of the IETF community. It has received public review and has been approved for publication by the Internet Engineering Steering Group (IESG). Further information on Internet Standards is available in Section 2 of RFC 5741. Information about the current status of this document, any errata, and how to provide feedback on it may be obtained at http://www.rfc-editor.org/info/rfc7798.
Copyright Notice Copyright (c) 2016 IETF Trust and the persons identified as the document authors. All rights reserved. This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License. Table of Contents 1. Introduction ....................................................3 1.1. Overview of the HEVC Codec .................................4 1.1.1. Coding-Tool Features ................................4 1.1.2. Systems and Transport Interfaces ....................6 1.1.3. Parallel Processing Support ........................11 1.1.4. NAL Unit Header ....................................13 1.2. Overview of the Payload Format ............................14 2. Conventions ....................................................15 3. Definitions and Abbreviations ..................................15 3.1. Definitions ...............................................15 3.1.1. Definitions from the HEVC Specification ...........15 3.1.2. Definitions Specific to This Memo .................17 3.2. Abbreviations .............................................19 4. RTP Payload Format .............................................20 4.1. RTP Header Usage ..........................................20 4.2. Payload Header Usage ......................................22 4.3. Transmission Modes ........................................23 4.4. Payload Structures ........................................24 4.4.1. Single NAL Unit Packets ............................24 4.4.2. Aggregation Packets (APs) ..........................25 4.4.3. Fragmentation Units ................................29 4.4.4. PACI Packets .......................................32 220.127.116.11. Reasons for the PACI Rules (Informative) ..34 18.104.22.168. PACI Extensions (Informative) .............35 4.5. Temporal Scalability Control Information ..................36 4.6. Decoding Order Number .....................................37 5. Packetization Rules ............................................39 6. De-packetization Process .......................................40 7. Payload Format Parameters ......................................42 7.1. Media Type Registration ...................................42 7.2. SDP Parameters ............................................64
7.2.1. Mapping of Payload Type Parameters to SDP ..........64 7.2.2. Usage with SDP Offer/Answer Model ..................65 7.2.3. Usage in Declarative Session Descriptions ..........73 7.2.4. Considerations for Parameter Sets ..................75 7.2.5. Dependency Signaling in Multi-Stream Mode ..........75 8. Use with Feedback Messages .....................................75 8.1. Picture Loss Indication (PLI) .............................75 8.2. Slice Loss Indication (SLI) ...............................76 8.3. Reference Picture Selection Indication (RPSI) .............77 8.4. Full Intra Request (FIR) ..................................77 9. Security Considerations ........................................78 10. Congestion Control ............................................79 11. IANA Considerations ...........................................80 12. References ....................................................80 12.1. Normative References .....................................80 12.2. Informative References ...................................82 Acknowledgments ...................................................85 Authors' Addresses ................................................86 1. Introduction The High Efficiency Video Coding specification, formally published as both ITU-T Recommendation H.265 [HEVC] and ISO/IEC International Standard 23008-2 [ISO23008-2], was ratified by the ITU-T in April 2013; reportedly, it provides significant coding efficiency gains over H.264 [H.264]. This memo describes an RTP payload format for HEVC. It shares its basic design with the RTP payload formats of [RFC6184] and [RFC6190]. With respect to design philosophy, security, congestion control, and overall implementation complexity, it has similar properties to those earlier payload format specifications. This is a conscious choice, as at least RFC 6184 is widely deployed and generally known in the relevant implementer communities. Mechanisms from RFC 6190 were incorporated as HEVC version 1 supports temporal scalability. In order to help the overlapping implementer community, frequently only the differences between RFCs 6184 and 6190 and the HEVC payload format are highlighted in non-normative, explanatory parts of this memo. Basic familiarity with both specifications is assumed for those parts. However, the normative parts of this memo do not require study of RFCs 6184 or 6190.
1.1. Overview of the HEVC Codec H.264 and HEVC share a similar hybrid video codec design. In this memo, we provide a very brief overview of those features of HEVC that are, in some form, addressed by the payload format specified herein. Implementers have to read, understand, and apply the ITU-T/ISO/IEC specifications pertaining to HEVC to arrive at interoperable, well- performing implementations. Implementers should consider testing their design (including the interworking between the payload format implementation and the core video codec) using the tools provided by ITU-T/ISO/IEC, for example, conformance bitstreams as specified in [H.265.1]. Not doing so has historically led to systems that perform badly and that are not secure. Conceptually, both H.264 and HEVC include a Video Coding Layer (VCL), which is often used to refer to the coding-tool features, and a Network Abstraction Layer (NAL), which is often used to refer to the systems and transport interface aspects of the codecs. 1.1.1. Coding-Tool Features Similar to earlier hybrid-video-coding-based standards, including H.264, the following basic video coding design is employed by HEVC. A prediction signal is first formed by either intra- or motion- compensated prediction, and the residual (the difference between the original and the prediction) is then coded. The gains in coding efficiency are achieved by redesigning and improving almost all parts of the codec over earlier designs. In addition, HEVC includes several tools to make the implementation on parallel architectures easier. Below is a summary of HEVC coding-tool features. Quad-tree block and transform structure One of the major tools that contributes significantly to the coding efficiency of HEVC is the use of flexible coding blocks and transforms, which are defined in a hierarchical quad-tree manner. Unlike H.264, where the basic coding block is a macroblock of fixed- size 16x16, HEVC defines a Coding Tree Unit (CTU) of a maximum size of 64x64. Each CTU can be divided into smaller units in a hierarchical quad-tree manner and can represent smaller blocks down to size 4x4. Similarly, the transforms used in HEVC can have different sizes, starting from 4x4 and going up to 32x32. Utilizing large blocks and transforms contributes to the major gain of HEVC, especially at high resolutions.
Entropy coding HEVC uses a single entropy-coding engine, which is based on Context Adaptive Binary Arithmetic Coding (CABAC) [CABAC], whereas H.264 uses two distinct entropy coding engines. CABAC in HEVC shares many similarities with CABAC of H.264, but contains several improvements. Those include improvements in coding efficiency and lowered implementation complexity, especially for parallel architectures. In-loop filtering H.264 includes an in-loop adaptive deblocking filter, where the blocking artifacts around the transform edges in the reconstructed picture are smoothed to improve the picture quality and compression efficiency. In HEVC, a similar deblocking filter is employed but with somewhat lower complexity. In addition, pictures undergo a subsequent filtering operation called Sample Adaptive Offset (SAO), which is a new design element in HEVC. SAO basically adds a pixel- level offset in an adaptive manner and usually acts as a de-ringing filter. It is observed that SAO improves the picture quality, especially around sharp edges, contributing substantially to visual quality improvements of HEVC. Motion prediction and coding There have been a number of improvements in this area that are summarized as follows. The first category is motion merge and Advanced Motion Vector Prediction (AMVP) modes. The motion information of a prediction block can be inferred from the spatially or temporally neighboring blocks. This is similar to the DIRECT mode in H.264 but includes new aspects to incorporate the flexible quad- tree structure and methods to improve the parallel implementations. In addition, the motion vector predictor can be signaled for improved efficiency. The second category is high-precision interpolation. The interpolation filter length is increased to 8-tap from 6-tap, which improves the coding efficiency but also comes with increased complexity. In addition, the interpolation filter is defined with higher precision without any intermediate rounding operations to further improve the coding efficiency. Intra prediction and intra-coding Compared to 8 intra prediction modes in H.264, HEVC supports angular intra prediction with 33 directions. This increased flexibility improves both objective coding efficiency and visual quality as the edges can be better predicted and ringing artifacts around the edges can be reduced. In addition, the reference samples are adaptively smoothed based on the prediction direction. To avoid contouring
artifacts a new interpolative prediction generation is included to improve the visual quality. Furthermore, Discrete Sine Transform (DST) is utilized instead of traditional Discrete Cosine Transform (DCT) for 4x4 intra-transform blocks. Other coding-tool features HEVC includes some tools for lossless coding and efficient screen- content coding, such as skipping the transform for certain blocks. These tools are particularly useful, for example, when streaming the user interface of a mobile device to a large display. 1.1.2. Systems and Transport Interfaces HEVC inherited the basic systems and transport interfaces designs from H.264. These include the NAL-unit-based syntax structure, the hierarchical syntax and data unit structure, the Supplemental Enhancement Information (SEI) message mechanism, and the video buffering model based on the Hypothetical Reference Decoder (HRD). The hierarchical syntax and data unit structure consists of sequence- level parameter sets, multi-picture-level or picture-level parameter sets, slice-level header parameters, and lower-level parameters. In the following, a list of differences in these aspects compared to H.264 is summarized. Video parameter set A new type of parameter set, called Video Parameter Set (VPS), was introduced. For the first (2013) version of [HEVC], the VPS NAL unit is required to be available prior to its activation, while the information contained in the VPS is not necessary for operation of the decoding process. For future HEVC extensions, such as the 3D or scalable extensions, the VPS is expected to include information necessary for operation of the decoding process, e.g., decoding dependency or information for reference picture set construction of enhancement layers. The VPS provides a "big picture" of a bitstream, including what types of operation points are provided, the profile, tier, and level of the operation points, and some other high-level properties of the bitstream that can be used as the basis for session negotiation and content selection, etc. (see Section 7.1). Profile, tier, and level The profile, tier, and level syntax structure that can be included in both the VPS and Sequence Parameter Set (SPS) includes 12 bytes of data to describe the entire bitstream (including all temporally scalable layers, which are referred to as sub-layers in the HEVC specification), and can optionally include more profile, tier, and
level information pertaining to individual temporally scalable layers. The profile indicator shows the "best viewed as" profile when the bitstream conforms to multiple profiles, similar to the major brand concept in the ISO Base Media File Format (ISOBMFF) [IS014496-12] [IS015444-12] and file formats derived based on ISOBMFF, such as the 3GPP file format [3GPPFF]. The profile, tier, and level syntax structure also includes indications such as 1) whether the bitstream is free of frame-packed content, 2) whether the bitstream is free of interlaced source content, and 3) whether the bitstream is free of field pictures. When the answer is yes for both 2) and 3), the bitstream contains only frame pictures of progressive source. Based on these indications, clients/players without support of post-processing functionalities for the handling of frame-packed, interlaced source content or field pictures can reject those bitstreams that contain such pictures. Bitstream and elementary stream HEVC includes a definition of an elementary stream, which is new compared to H.264. An elementary stream consists of a sequence of one or more bitstreams. An elementary stream that consists of two or more bitstreams has typically been formed by splicing together two or more bitstreams (or parts thereof). When an elementary stream contains more than one bitstream, the last NAL unit of the last access unit of a bitstream (except the last bitstream in the elementary stream) must contain an end of bitstream NAL unit, and the first access unit of the subsequent bitstream must be an Intra-Random Access Point (IRAP) access unit. This IRAP access unit may be a Clean Random Access (CRA), Broken Link Access (BLA), or Instantaneous Decoding Refresh (IDR) access unit. Random access support HEVC includes signaling in the NAL unit header, through NAL unit types, of IRAP pictures beyond IDR pictures. Three types of IRAP pictures, namely IDR, CRA, and BLA pictures, are supported: IDR pictures are conventionally referred to as closed group-of-pictures (closed-GOP) random access points whereas CRA and BLA pictures are conventionally referred to as open-GOP random access points. BLA pictures usually originate from splicing of two bitstreams or part thereof at a CRA picture, e.g., during stream switching. To enable better systems usage of IRAP pictures, altogether six different NAL units are defined to signal the properties of the IRAP pictures, which can be used to better match the stream access point types as defined in the ISOBMFF [IS014496-12] [IS015444-12], which are utilized for random access support in both 3GP-DASH [3GPDASH] and MPEG DASH [MPEGDASH]. Pictures following an IRAP picture in decoding order and preceding the IRAP picture in output order are referred to
as leading pictures associated with the IRAP picture. There are two types of leading pictures: Random Access Decodable Leading (RADL) pictures and Random Access Skipped Leading (RASL) pictures. RADL pictures are decodable when the decoding started at the associated IRAP picture; RASL pictures are not decodable when the decoding started at the associated IRAP picture and are usually discarded. HEVC provides mechanisms to enable specifying the conformance of a bitstream wherein the originally present RASL pictures have been discarded. Consequently, system components can discard RASL pictures, when needed, without worrying about causing the bitstream to become non-compliant. Temporal scalability support HEVC includes an improved support of temporal scalability, by inclusion of the signaling of TemporalId in the NAL unit header, the restriction that pictures of a particular temporal sub-layer cannot be used for inter prediction reference by pictures of a lower temporal sub-layer, the sub-bitstream extraction process, and the requirement that each sub-bitstream extraction output be a conforming bitstream. Media-Aware Network Elements (MANEs) can utilize the TemporalId in the NAL unit header for stream adaptation purposes based on temporal scalability. Temporal sub-layer switching support HEVC specifies, through NAL unit types present in the NAL unit header, the signaling of Temporal Sub-layer Access (TSA) and Step- wise Temporal Sub-layer Access (STSA). A TSA picture and pictures following the TSA picture in decoding order do not use pictures prior to the TSA picture in decoding order with TemporalId greater than or equal to that of the TSA picture for inter prediction reference. A TSA picture enables up-switching, at the TSA picture, to the sub- layer containing the TSA picture or any higher sub-layer, from the immediately lower sub-layer. An STSA picture does not use pictures with the same TemporalId as the STSA picture for inter prediction reference. Pictures following an STSA picture in decoding order with the same TemporalId as the STSA picture do not use pictures prior to the STSA picture in decoding order with the same TemporalId as the STSA picture for inter prediction reference. An STSA picture enables up-switching, at the STSA picture, to the sub-layer containing the STSA picture, from the immediately lower sub-layer. Sub-layer reference or non-reference pictures The concept and signaling of reference/non-reference pictures in HEVC are different from H.264. In H.264, if a picture may be used by any other picture for inter prediction reference, it is a reference
picture; otherwise, it is a non-reference picture, and this is signaled by two bits in the NAL unit header. In HEVC, a picture is called a reference picture only when it is marked as "used for reference". In addition, the concept of sub-layer reference picture was introduced. If a picture may be used by another other picture with the same TemporalId for inter prediction reference, it is a sub- layer reference picture; otherwise, it is a sub-layer non-reference picture. Whether a picture is a sub-layer reference picture or sub- layer non-reference picture is signaled through NAL unit type values. Extensibility Besides the TemporalId in the NAL unit header, HEVC also includes the signaling of a six-bit layer ID in the NAL unit header, which must be equal to 0 for a single-layer bitstream. Extension mechanisms have been included in the VPS, SPS, Picture Parameter Set (PPS), SEI NAL unit, slice headers, and so on. All these extension mechanisms enable future extensions in a backward-compatible manner, such that bitstreams encoded according to potential future HEVC extensions can be fed to then-legacy decoders (e.g., HEVC version 1 decoders), and the then-legacy decoders can decode and output the base-layer bitstream. Bitstream extraction HEVC includes a bitstream-extraction process as an integral part of the overall decoding process. The bitstream extraction process is used in the process of bitstream conformance tests, which is part of the HRD buffering model. Reference picture management The reference picture management of HEVC, including reference picture marking and removal from the Decoded Picture Buffer (DPB) as well as Reference Picture List Construction (RPLC), differs from that of H.264. Instead of the reference picture marking mechanism based on a sliding window plus adaptive Memory Management Control Operation (MMCO) described in H.264, HEVC specifies a reference picture management and marking mechanism based on Reference Picture Set (RPS), and the RPLC is consequently based on the RPS mechanism. An RPS consists of a set of reference pictures associated with a picture, consisting of all reference pictures that are prior to the associated picture in decoding order, that may be used for inter prediction of the associated picture or any picture following the associated picture in decoding order. The reference picture set consists of five lists of reference pictures; RefPicSetStCurrBefore, RefPicSetStCurrAfter, RefPicSetStFoll, RefPicSetLtCurr, and RefPicSetLtFoll. RefPicSetStCurrBefore, RefPicSetStCurrAfter, and
RefPicSetLtCurr contain all reference pictures that may be used in inter prediction of the current picture and that may be used in inter prediction of one or more of the pictures following the current picture in decoding order. RefPicSetStFoll and RefPicSetLtFoll consist of all reference pictures that are not used in inter prediction of the current picture but may be used in inter prediction of one or more of the pictures following the current picture in decoding order. RPS provides an "intra-coded" signaling of the DPB status, instead of an "inter-coded" signaling, mainly for improved error resilience. The RPLC process in HEVC is based on the RPS, by signaling an index to an RPS subset for each reference index; this process is simpler than the RPLC process in H.264. Ultra-low delay support HEVC specifies a sub-picture-level HRD operation, for support of the so-called ultra-low delay. The mechanism specifies a standard- compliant way to enable delay reduction below a one-picture interval. Coded Picture Buffer (CPB) and DPB parameters at the sub-picture level may be signaled, and utilization of this information for the derivation of CPB timing (wherein the CPB removal time corresponds to decoding time) and DPB output timing (display time) is specified. Decoders are allowed to operate the HRD at the conventional access- unit level, even when the sub-picture-level HRD parameters are present. New SEI messages HEVC inherits many H.264 SEI messages with changes in syntax and/or semantics making them applicable to HEVC. Additionally, there are a few new SEI messages reviewed briefly in the following paragraphs. The display orientation SEI message informs the decoder of a transformation that is recommended to be applied to the cropped decoded picture prior to display, such that the pictures can be properly displayed, e.g., in an upside-up manner. The structure of pictures SEI message provides information on the NAL unit types, picture-order count values, and prediction dependencies of a sequence of pictures. The SEI message can be used, for example, for concluding what impact a lost picture has on other pictures. The decoded picture hash SEI message provides a checksum derived from the sample values of a decoded picture. It can be used for detecting whether a picture was correctly received and decoded.
The active parameter sets SEI message includes the IDs of the active video parameter set and the active sequence parameter set and can be used to activate VPSs and SPSs. In addition, the SEI message includes the following indications: 1) An indication of whether "full random accessibility" is supported (when supported, all parameter sets needed for decoding of the remaining of the bitstream when random accessing from the beginning of the current CVS by completely discarding all access units earlier in decoding order are present in the remaining bitstream, and all coded pictures in the remaining bitstream can be correctly decoded); 2) An indication of whether there is no parameter set within the current CVS that updates another parameter set of the same type preceding in decoding order. An update of a parameter set refers to the use of the same parameter set ID but with some other parameters changed. If this property is true for all CVSs in the bitstream, then all parameter sets can be sent out-of-band before session start. The decoding unit information SEI message provides information regarding coded picture buffer removal delay for a decoding unit. The message can be used in very-low-delay buffering operations. The region refresh information SEI message can be used together with the recovery point SEI message (present in both H.264 and HEVC) for improved support of gradual decoding refresh. This supports random access from inter-coded pictures, wherein complete pictures can be correctly decoded or recovered after an indicated number of pictures in output/display order. 1.1.3. Parallel Processing Support The reportedly significantly higher encoding computational demand of HEVC over H.264, in conjunction with the ever-increasing video resolution (both spatially and temporally) required by the market, led to the adoption of VCL coding tools specifically targeted to allow for parallelization on the sub-picture level. That is, parallelization occurs, at the minimum, at the granularity of an integer number of CTUs. The targets for this type of high-level parallelization are multicore CPUs and DSPs as well as multiprocessor systems. In a system design, to be useful, these tools require signaling support, which is provided in Section 7 of this memo. This section provides a brief overview of the tools available in [HEVC]. Many of the tools incorporated in HEVC were designed keeping in mind the potential parallel implementations in multicore/multiprocessor architectures. Specifically, for parallelization, four picture partition strategies, as described below, are available.
Slices are segments of the bitstream that can be reconstructed independently from other slices within the same picture (though there may still be interdependencies through loop filtering operations). Slices are the only tool that can be used for parallelization that is also available, in virtually identical form, in H.264. Parallelization based on slices does not require much inter-processor or inter-core communication (except for inter-processor or inter-core data sharing for motion compensation when decoding a predictively coded picture, which is typically much heavier than inter-processor or inter-core data sharing due to in-picture prediction), as slices are designed to be independently decodable. However, for the same reason, slices can require some coding overhead. Further, slices (in contrast to some of the other tools mentioned below) also serve as the key mechanism for bitstream partitioning to match Maximum Transfer Unit (MTU) size requirements, due to the in-picture independence of slices and the fact that each regular slice is encapsulated in its own NAL unit. In many cases, the goal of parallelization and the goal of MTU size matching can place contradicting demands to the slice layout in a picture. The realization of this situation led to the development of the more advanced tools mentioned below. Dependent slice segments allow for fragmentation of a coded slice into fragments at CTU boundaries without breaking any in-picture prediction mechanisms. They are complementary to the fragmentation mechanism described in this memo in that they need the cooperation of the encoder. As a dependent slice segment necessarily contains an integer number of CTUs, a decoder using multiple cores operating on CTUs can process a dependent slice segment without communicating parts of the slice segment's bitstream to other cores. Fragmentation, as specified in this memo, in contrast, does not guarantee that a fragment contains an integer number of CTUs. In Wavefront Parallel Processing (WPP), the picture is partitioned into rows of CTUs. Entropy decoding and prediction are allowed to use data from CTUs in other partitions. Parallel processing is possible through parallel decoding of CTU rows, where the start of the decoding of a row is delayed by two CTUs, so to ensure that data related to a CTU above and to the right of the subject CTU is available before the subject CTU is being decoded. Using this staggered start (which appears like a wavefront when represented graphically), parallelization is possible with up to as many processors/cores as the picture contains CTU rows. Because in-picture prediction between neighboring CTU rows within a picture is allowed, the required inter-processor/inter-core communication to enable in-picture prediction can be substantial. The WPP partitioning does not result in the creation of more NAL
units compared to when it is not applied; thus, WPP cannot be used for MTU size matching, though slices can be used in combination for that purpose. Tiles define horizontal and vertical boundaries that partition a picture into tile columns and rows. The scan order of CTUs is changed to be local within a tile (in the order of a CTU raster scan of a tile), before decoding the top-left CTU of the next tile in the order of tile raster scan of a picture. Similar to slices, tiles break in-picture prediction dependencies (including entropy decoding dependencies). However, they do not need to be included into individual NAL units (same as WPP in this regard); hence, tiles cannot be used for MTU size matching, though slices can be used in combination for that purpose. Each tile can be processed by one processor/core, and the inter-processor/inter-core communication required for in-picture prediction between processing units decoding neighboring tiles is limited to conveying the shared slice header in cases a slice is spanning more than one tile, and loop-filtering- related sharing of reconstructed samples and metadata. Insofar, tiles are less demanding in terms of inter-processor communication bandwidth compared to WPP due to the in-picture independence between two neighboring partitions. 1.1.4. NAL Unit Header HEVC maintains the NAL unit concept of H.264 with modifications. HEVC uses a two-byte NAL unit header, as shown in Figure 1. The payload of a NAL unit refers to the NAL unit excluding the NAL unit header. +---------------+---------------+ |0|1|2|3|4|5|6|7|0|1|2|3|4|5|6|7| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |F| Type | LayerId | TID | +-------------+-----------------+ Figure 1: The Structure of the HEVC NAL Unit Header The semantics of the fields in the NAL unit header are as specified in [HEVC] and described briefly below for convenience. In addition to the name and size of each field, the corresponding syntax element name in [HEVC] is also provided. F: 1 bit forbidden_zero_bit. Required to be zero in [HEVC]. Note that the inclusion of this bit in the NAL unit header was to enable transport of HEVC video over MPEG-2 transport systems (avoidance of start code emulations) [MPEG2S]. In the context of this memo,
the value 1 may be used to indicate a syntax violation, e.g., for a NAL unit resulted from aggregating a number of fragmented units of a NAL unit but missing the last fragment, as described in Section 4.4.3. Type: 6 bits nal_unit_type. This field specifies the NAL unit type as defined in Table 7-1 of [HEVC]. If the most significant bit of this field of a NAL unit is equal to 0 (i.e., the value of this field is less than 32), the NAL unit is a VCL NAL unit. Otherwise, the NAL unit is a non-VCL NAL unit. For a reference of all currently defined NAL unit types and their semantics, please refer to Section 7.4.2 in [HEVC]. LayerId: 6 bits nuh_layer_id. Required to be equal to zero in [HEVC]. It is anticipated that in future scalable or 3D video coding extensions of this specification, this syntax element will be used to identify additional layers that may be present in the CVS, wherein a layer may be, e.g., a spatial scalable layer, a quality scalable layer, a texture view, or a depth view. TID: 3 bits nuh_temporal_id_plus1. This field specifies the temporal identifier of the NAL unit plus 1. The value of TemporalId is equal to TID minus 1. A TID value of 0 is illegal to ensure that there is at least one bit in the NAL unit header equal to 1, so to enable independent considerations of start code emulations in the NAL unit header and in the NAL unit payload data. 1.2. Overview of the Payload Format This payload format defines the following processes required for transport of HEVC coded data over RTP [RFC3550]: o Usage of RTP header with this payload format o Packetization of HEVC coded NAL units into RTP packets using three types of payload structures: a single NAL unit packet, aggregation packet, and fragment unit o Transmission of HEVC NAL units of the same bitstream within a single RTP stream or multiple RTP streams (within one or more RTP sessions), where within an RTP stream transmission of NAL units may be either non-interleaved (i.e., the transmission order of NAL units is the same as their decoding order) or interleaved (i.e., the transmission order of NAL units is different from the decoding order)
o Media type parameters to be used with the Session Description Protocol (SDP) [RFC4566] o A payload header extension mechanism and data structures for enhanced support of temporal scalability based on that extension mechanism. 2. Conventions The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119]. In this document, the above key words will convey that interpretation only when in ALL CAPS. Lowercase uses of these words are not to be interpreted as carrying the significance described in RFC 2119. This specification uses the notion of setting and clearing a bit when bit fields are handled. Setting a bit is the same as assigning that bit the value of 1 (On). Clearing a bit is the same as assigning that bit the value of 0 (Off). 3. Definitions and Abbreviations 3.1. Definitions This document uses the terms and definitions of [HEVC]. Section 3.1.1 lists relevant definitions from [HEVC] for convenience. Section 3.1.2 provides definitions specific to this memo. 3.1.1. Definitions from the HEVC Specification access unit: A set of NAL units that are associated with each other according to a specified classification rule, that are consecutive in decoding order, and that contain exactly one coded picture. BLA access unit: An access unit in which the coded picture is a BLA picture. BLA picture: An IRAP picture for which each VCL NAL unit has nal_unit_type equal to BLA_W_LP, BLA_W_RADL, or BLA_N_LP. Coded Video Sequence (CVS): A sequence of access units that consists, in decoding order, of an IRAP access unit with NoRaslOutputFlag equal to 1, followed by zero or more access units that are not IRAP access units with NoRaslOutputFlag equal to 1, including all subsequent access units up to but not including any subsequent access unit that is an IRAP access unit with NoRaslOutputFlag equal to 1.
Informative note: An IRAP access unit may be an IDR access unit, a BLA access unit, or a CRA access unit. The value of NoRaslOutputFlag is equal to 1 for each IDR access unit, each BLA access unit, and each CRA access unit that is the first access unit in the bitstream in decoding order, is the first access unit that follows an end of sequence NAL unit in decoding order, or has HandleCraAsBlaFlag equal to 1. CRA access unit: An access unit in which the coded picture is a CRA picture. CRA picture: A RAP picture for which each VCL NAL unit has nal_unit_type equal to CRA_NUT. IDR access unit: An access unit in which the coded picture is an IDR picture. IDR picture: A RAP picture for which each VCL NAL unit has nal_unit_type equal to IDR_W_RADL or IDR_N_LP. IRAP access unit: An access unit in which the coded picture is an IRAP picture. IRAP picture: A coded picture for which each VCL NAL unit has nal_unit_type in the range of BLA_W_LP (16) to RSV_IRAP_VCL23 (23), inclusive. layer: A set of VCL NAL units that all have a particular value of nuh_layer_id and the associated non-VCL NAL units, or one of a set of syntactical structures having a hierarchical relationship. operation point: bitstream created from another bitstream by operation of the sub-bitstream extraction process with the another bitstream, a target highest TemporalId, and a target-layer identifier list as input. random access: The act of starting the decoding process for a bitstream at a point other than the beginning of the bitstream. sub-layer: A temporal scalable layer of a temporal scalable bitstream consisting of VCL NAL units with a particular value of the TemporalId variable, and the associated non-VCL NAL units. sub-layer representation: A subset of the bitstream consisting of NAL units of a particular sub-layer and the lower sub-layers. tile: A rectangular region of coding tree blocks within a particular tile column and a particular tile row in a picture.
tile column: A rectangular region of coding tree blocks having a height equal to the height of the picture and a width specified by syntax elements in the picture parameter set. tile row: A rectangular region of coding tree blocks having a height specified by syntax elements in the picture parameter set and a width equal to the width of the picture. 3.1.2. Definitions Specific to This Memo dependee RTP stream: An RTP stream on which another RTP stream depends. All RTP streams in a Multiple RTP streams on a Single media Transport (MRST) or Multiple RTP streams on Multiple media Transports (MRMT), except for the highest RTP stream, are dependee RTP streams. highest RTP stream: The RTP stream on which no other RTP stream depends. The RTP stream in a Single RTP stream on a Single media Transport (SRST) is the highest RTP stream. Media-Aware Network Element (MANE): A network element, such as a middlebox, selective forwarding unit, or application-layer gateway that is capable of parsing certain aspects of the RTP payload headers or the RTP payload and reacting to their contents. Informative note: The concept of a MANE goes beyond normal routers or gateways in that a MANE has to be aware of the signaling (e.g., to learn about the payload type mappings of the media streams), and in that it has to be trusted when working with Secure RTP (SRTP). The advantage of using MANEs is that they allow packets to be dropped according to the needs of the media coding. For example, if a MANE has to drop packets due to congestion on a certain link, it can identify and remove those packets whose elimination produces the least adverse effect on the user experience. After dropping packets, MANEs must rewrite RTCP packets to match the changes to the RTP stream, as specified in Section 7 of [RFC3550]. Media Transport: As used in the MRST, MRMT, and SRST definitions below, Media Transport denotes the transport of packets over a transport association identified by a 5-tuple (source address, source port, destination address, destination port, transport protocol). See also Section 2.1.13 of [RFC7656]. Informative note: The term "bitstream" in this document is equivalent to the term "encoded stream" in [RFC7656].
Multiple RTP streams on a Single media Transport (MRST): Multiple RTP streams carrying a single HEVC bitstream on a Single Transport. See also Section 3.5 of [RFC7656]. Multiple RTP streams on Multiple media Transports (MRMT): Multiple RTP streams carrying a single HEVC bitstream on Multiple Transports. See also Section 3.5 of [RFC7656]. NAL unit decoding order: A NAL unit order that conforms to the constraints on NAL unit order given in Section 22.214.171.124 in [HEVC]. NAL unit output order: A NAL unit order in which NAL units of different access units are in the output order of the decoded pictures corresponding to the access units, as specified in [HEVC], and in which NAL units within an access unit are in their decoding order. NAL-unit-like structure: A data structure that is similar to NAL units in the sense that it also has a NAL unit header and a payload, with a difference that the payload does not follow the start code emulation prevention mechanism required for the NAL unit syntax as specified in Section 126.96.36.199 of [HEVC]. Examples of NAL-unit-like structures defined in this memo are packet payloads of Aggregation Packet (AP), PAyload Content Information (PACI), and Fragmentation Unit (FU) packets. NALU-time: The value that the RTP timestamp would have if the NAL unit would be transported in its own RTP packet. RTP stream: See [RFC7656]. Within the scope of this memo, one RTP stream is utilized to transport one or more temporal sub-layers. Single RTP stream on a Single media Transport (SRST): Single RTP stream carrying a single HEVC bitstream on a Single (Media) Transport. See also Section 3.5 of [RFC7656]. transmission order: The order of packets in ascending RTP sequence number order (in modulo arithmetic). Within an aggregation packet, the NAL unit transmission order is the same as the order of appearance of NAL units in the packet.
3.2. Abbreviations AP Aggregation Packet BLA Broken Link Access CRA Clean Random Access CTB Coding Tree Block CTU Coding Tree Unit CVS Coded Video Sequence DPH Decoded Picture Hash FU Fragmentation Unit HRD Hypothetical Reference Decoder IDR Instantaneous Decoding Refresh IRAP Intra Random Access Point MANE Media-Aware Network Element MRMT Multiple RTP streams on Multiple media Transports MRST Multiple RTP streams on a Single media Transport MTU Maximum Transfer Unit NAL Network Abstraction Layer NALU Network Abstraction Layer Unit PACI PAyload Content Information PHES Payload Header Extension Structure PPS Picture Parameter Set RADL Random Access Decodable Leading (Picture) RASL Random Access Skipped Leading (Picture) RPS Reference Picture Set
SEI Supplemental Enhancement Information SPS Sequence Parameter Set SRST Single RTP stream on a Single media Transport STSA Step-wise Temporal Sub-layer Access TSA Temporal Sub-layer Access TSCI Temporal Scalability Control Information VCL Video Coding Layer VPS Video Parameter Set