RFC 7798

RTP Payload Format for High Efficiency Video Coding (HEVC)

Pages: 86
Proposed Standard

Internet Engineering Task Force (IETF)                        Y.-K. Wang
Request for Comments: 7798                                      Qualcomm
Category: Standards Track                                     Y. Sanchez
ISSN: 2070-1721                                               T. Schierl
                                                          Fraunhofer HHI
                                                               S. Wenger
                                                                   Vidyo
                                                        M. M. Hannuksela
                                                                   Nokia
                                                              March 2016


       RTP Payload Format for High Efficiency Video Coding (HEVC)

Abstract

   This memo describes an RTP payload format for the video coding
   standard ITU-T Recommendation H.265 and ISO/IEC International
   Standard 23008-2, both also known as High Efficiency Video Coding
   (HEVC) and developed by the Joint Collaborative Team on Video Coding
   (JCT-VC).  The RTP payload format allows for packetization of one or
   more Network Abstraction Layer (NAL) units in each RTP packet payload
   as well as fragmentation of a NAL unit into multiple RTP packets.
   Furthermore, it supports transmission of an HEVC bitstream over a
   single stream as well as multiple RTP streams.  When multiple RTP
   streams are used, a single transport or multiple transports may be
   utilized.  The payload format has wide applicability in
   videoconferencing, Internet video streaming, and high-bitrate
   entertainment-quality video, among others.

Status of This Memo

   This is an Internet Standards Track document.

   This document is a product of the Internet Engineering Task Force
   (IETF).  It represents the consensus of the IETF community.  It has
   received public review and has been approved for publication by the
   Internet Engineering Steering Group (IESG).  Further information on
   Internet Standards is available in Section 2 of RFC 5741.

   Information about the current status of this document, any errata,
   and how to provide feedback on it may be obtained at
   http://www.rfc-editor.org/info/rfc7798.

Copyright Notice

   Copyright (c) 2016 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1. Introduction
      1.1. Overview of the HEVC Codec
           1.1.1. Coding-Tool Features
           1.1.2. Systems and Transport Interfaces
           1.1.3. Parallel Processing Support
           1.1.4. NAL Unit Header
      1.2. Overview of the Payload Format
   2. Conventions
   3. Definitions and Abbreviations
      3.1. Definitions
           3.1.1. Definitions from the HEVC Specification
           3.1.2. Definitions Specific to This Memo
      3.2. Abbreviations
   4. RTP Payload Format
      4.1. RTP Header Usage
      4.2. Payload Header Usage
      4.3. Transmission Modes
      4.4. Payload Structures
           4.4.1. Single NAL Unit Packets
           4.4.2. Aggregation Packets (APs)
           4.4.3. Fragmentation Units
           4.4.4. PACI Packets
                  4.4.4.1. Reasons for the PACI Rules (Informative)
                  4.4.4.2. PACI Extensions (Informative)
      4.5. Temporal Scalability Control Information
      4.6. Decoding Order Number
   5. Packetization Rules
   6. De-packetization Process
   7. Payload Format Parameters
      7.1. Media Type Registration
      7.2. SDP Parameters
           7.2.1. Mapping of Payload Type Parameters to SDP
           7.2.2. Usage with SDP Offer/Answer Model
           7.2.3. Usage in Declarative Session Descriptions
           7.2.4. Considerations for Parameter Sets
           7.2.5. Dependency Signaling in Multi-Stream Mode
   8. Use with Feedback Messages
      8.1. Picture Loss Indication (PLI)
      8.2. Slice Loss Indication (SLI)
      8.3. Reference Picture Selection Indication (RPSI)
      8.4. Full Intra Request (FIR)
   9. Security Considerations
   10. Congestion Control
   11. IANA Considerations
   12. References
      12.1. Normative References
      12.2. Informative References
   Acknowledgments
   Authors' Addresses


1.  Introduction

   The High Efficiency Video Coding specification, formally published as
   both ITU-T Recommendation H.265 [HEVC] and ISO/IEC International
   Standard 23008-2 [ISO23008-2], was ratified by the ITU-T in April
   2013; reportedly, it provides significant coding efficiency gains
   over H.264 [H.264].

   This memo describes an RTP payload format for HEVC.  It shares its
   basic design with the RTP payload formats of [RFC6184] and [RFC6190].
   With respect to design philosophy, security, congestion control, and
   overall implementation complexity, it has similar properties to those
   earlier payload format specifications.  This is a conscious choice,
   as at least RFC 6184 is widely deployed and generally known in the
   relevant implementer communities.  Mechanisms from RFC 6190 were
   incorporated as HEVC version 1 supports temporal scalability.

   In order to help the overlapping implementer community, frequently
   only the differences between RFCs 6184 and 6190 and the HEVC payload
   format are highlighted in non-normative, explanatory parts of this
   memo.  Basic familiarity with both specifications is assumed for
   those parts.  However, the normative parts of this memo do not
   require study of RFCs 6184 or 6190.

1.1.  Overview of the HEVC Codec

   H.264 and HEVC share a similar hybrid video codec design.  In this
   memo, we provide a very brief overview of those features of HEVC that
   are, in some form, addressed by the payload format specified herein.
   Implementers have to read, understand, and apply the ITU-T/ISO/IEC
   specifications pertaining to HEVC to arrive at interoperable, well-
   performing implementations.  Implementers should consider testing
   their design (including the interworking between the payload format
   implementation and the core video codec) using the tools provided by
   ITU-T/ISO/IEC, for example, conformance bitstreams as specified in
   [H.265.1].  Not doing so has historically led to systems that perform
   badly and that are not secure.

   Conceptually, both H.264 and HEVC include a Video Coding Layer (VCL),
   which is often used to refer to the coding-tool features, and a
   Network Abstraction Layer (NAL), which is often used to refer to the
   systems and transport interface aspects of the codecs.

1.1.1.  Coding-Tool Features

   Similar to earlier hybrid-video-coding-based standards, including
   H.264, the following basic video coding design is employed by HEVC.
   A prediction signal is first formed by either intra- or motion-
   compensated prediction, and the residual (the difference between the
   original and the prediction) is then coded.  The gains in coding
   efficiency are achieved by redesigning and improving almost all parts
   of the codec over earlier designs.  In addition, HEVC includes
   several tools to make the implementation on parallel architectures
   easier.  Below is a summary of HEVC coding-tool features.

   Quad-tree block and transform structure

   One of the major tools that contributes significantly to the coding
   efficiency of HEVC is the use of flexible coding blocks and
   transforms, which are defined in a hierarchical quad-tree manner.
   Unlike H.264, where the basic coding block is a macroblock of fixed-
   size 16x16, HEVC defines a Coding Tree Unit (CTU) of a maximum size
   of 64x64.  Each CTU can be divided into smaller units in a
   hierarchical quad-tree manner and can represent smaller blocks down
   to size 4x4.  Similarly, the transforms used in HEVC can have
   different sizes, starting from 4x4 and going up to 32x32.  Utilizing
   large blocks and transforms contributes to the major gain of HEVC,
   especially at high resolutions.
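
   As an informal illustration (not part of the HEVC specification),
   the following Python sketch counts the CTUs needed to cover a
   picture and lists the square block sizes reached by repeatedly
   halving a 64x64 CTU.  The function names and the 1920x1080 example
   are assumptions made for this illustration only; the actual CTU
   size is chosen by the encoder.

```python
import math


def ctu_grid(width, height, ctu_size=64):
    """Return (columns, rows) of CTUs needed to cover a picture.

    Partial CTUs at the right and bottom picture edges still count.
    """
    return math.ceil(width / ctu_size), math.ceil(height / ctu_size)


def quadtree_block_sizes(ctu_size=64, min_size=4):
    """List the square block sizes obtained by halving from ctu_size."""
    sizes = []
    size = ctu_size
    while size >= min_size:
        sizes.append(size)
        size //= 2
    return sizes


cols, rows = ctu_grid(1920, 1080)
print(cols, rows, cols * rows)        # 30 17 510
print(ctu_grid(1920, 1080, 16))       # (120, 68) blocks of 16x16
print(quadtree_block_sizes())         # [64, 32, 16, 8, 4]
```

   With these assumptions, a 1920x1080 picture is covered by 510 CTUs
   of 64x64, compared to 8160 blocks of the 16x16 macroblock size.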

   Entropy coding

   HEVC uses a single entropy-coding engine, which is based on Context
   Adaptive Binary Arithmetic Coding (CABAC) [CABAC], whereas H.264 uses
   two distinct entropy coding engines.  CABAC in HEVC shares many
   similarities with CABAC of H.264, but contains several improvements.
   Those include improvements in coding efficiency and lowered
   implementation complexity, especially for parallel architectures.

   In-loop filtering

   H.264 includes an in-loop adaptive deblocking filter, where the
   blocking artifacts around the transform edges in the reconstructed
   picture are smoothed to improve the picture quality and compression
   efficiency.  In HEVC, a similar deblocking filter is employed but
   with somewhat lower complexity.  In addition, pictures undergo a
   subsequent filtering operation called Sample Adaptive Offset (SAO),
   which is a new design element in HEVC.  SAO basically adds a pixel-
   level offset in an adaptive manner and usually acts as a de-ringing
   filter.  It is observed that SAO improves the picture quality,
   especially around sharp edges, contributing substantially to visual
   quality improvements of HEVC.

   Motion prediction and coding

   There have been a number of improvements in this area that are
   summarized as follows.  The first category is motion merge and
   Advanced Motion Vector Prediction (AMVP) modes.  The motion
   information of a prediction block can be inferred from the spatially
   or temporally neighboring blocks.  This is similar to the DIRECT mode
   in H.264 but includes new aspects to incorporate the flexible quad-
   tree structure and methods to improve the parallel implementations.
   In addition, the motion vector predictor can be signaled for improved
   efficiency.  The second category is high-precision interpolation.
   The interpolation filter length is increased to 8-tap from 6-tap,
   which improves the coding efficiency but also comes with increased
   complexity.  In addition, the interpolation filter is defined with
   higher precision without any intermediate rounding operations to
   further improve the coding efficiency.

   Intra prediction and intra-coding

   Compared to the eight directional intra prediction modes in H.264,
   HEVC supports angular intra prediction with 33 directions.  This
   increased flexibility
   improves both objective coding efficiency and visual quality as the
   edges can be better predicted and ringing artifacts around the edges
   can be reduced.  In addition, the reference samples are adaptively
   smoothed based on the prediction direction.  To avoid contouring
   artifacts, a new interpolative prediction generation is included to
   improve the visual quality.  Furthermore, the Discrete Sine
   Transform (DST) is utilized instead of the traditional Discrete
   Cosine Transform (DCT) for 4x4 intra-transform blocks.

   Other coding-tool features

   HEVC includes some tools for lossless coding and efficient screen-
   content coding, such as skipping the transform for certain blocks.
   These tools are particularly useful, for example, when streaming the
   user interface of a mobile device to a large display.

1.1.2.  Systems and Transport Interfaces

   HEVC inherited the basic systems and transport interfaces designs
   from H.264.  These include the NAL-unit-based syntax structure, the
   hierarchical syntax and data unit structure, the Supplemental
   Enhancement Information (SEI) message mechanism, and the video
   buffering model based on the Hypothetical Reference Decoder (HRD).
   The hierarchical syntax and data unit structure consists of sequence-
   level parameter sets, multi-picture-level or picture-level parameter
   sets, slice-level header parameters, and lower-level parameters.  In
   the following, a list of differences in these aspects compared to
   H.264 is summarized.

   Video parameter set

   A new type of parameter set, called Video Parameter Set (VPS), was
   introduced.  For the first (2013) version of [HEVC], the VPS NAL unit
   is required to be available prior to its activation, while the
   information contained in the VPS is not necessary for operation of
   the decoding process.  For future HEVC extensions, such as the 3D or
   scalable extensions, the VPS is expected to include information
   necessary for operation of the decoding process, e.g., decoding
   dependency or information for reference picture set construction of
   enhancement layers.  The VPS provides a "big picture" of a bitstream,
   including what types of operation points are provided, the profile,
   tier, and level of the operation points, and some other high-level
   properties of the bitstream that can be used as the basis for session
   negotiation and content selection, etc. (see Section 7.1).

   Profile, tier, and level

   The profile, tier, and level syntax structure that can be included in
   both the VPS and Sequence Parameter Set (SPS) includes 12 bytes of
   data to describe the entire bitstream (including all temporally
   scalable layers, which are referred to as sub-layers in the HEVC
   specification), and can optionally include more profile, tier, and
   level information pertaining to individual temporally scalable
   layers.  The profile indicator shows the "best viewed as" profile
   when the bitstream conforms to multiple profiles, similar to the
   major brand concept in the ISO Base Media File Format (ISOBMFF)
   [IS014496-12] [IS015444-12] and file formats derived based on
   ISOBMFF, such as the 3GPP file format [3GPPFF].  The profile, tier,
   and level syntax structure also includes indications such as 1)
   whether the bitstream is free of frame-packed content, 2) whether the
   bitstream is free of interlaced source content, and 3) whether the
   bitstream is free of field pictures.  When the answer is yes for both
   2) and 3), the bitstream contains only frame pictures of progressive
   source.  Based on these indications, clients/players without support
   of post-processing functionalities for the handling of frame-packed,
   interlaced source content or field pictures can reject those
   bitstreams that contain such pictures.
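
   The 12 bytes of general profile, tier, and level data can be read
   with a few shifts and masks.  The Python sketch below is an
   illustration only: it assumes that the 12 bytes have already been
   extracted from a VPS or SPS with emulation prevention removed and
   that they follow the general part of the profile_tier_level()
   syntax of the first version of [HEVC].  The function names and the
   example bytes are hypothetical.

```python
def parse_general_ptl(data):
    """Parse the 12-byte general profile, tier, and level fields."""
    if len(data) < 12:
        raise ValueError("need at least 12 bytes")
    flags = data[5]
    return {
        "profile_space": data[0] >> 6,
        "tier_flag": (data[0] >> 5) & 0x1,
        "profile_idc": data[0] & 0x1F,
        "profile_compatibility_flags": int.from_bytes(data[1:5], "big"),
        "progressive_source_flag": (flags >> 7) & 0x1,
        "interlaced_source_flag": (flags >> 6) & 0x1,
        "non_packed_constraint_flag": (flags >> 5) & 0x1,
        "frame_only_constraint_flag": (flags >> 4) & 0x1,
        # Bytes 6 to 10 carry reserved bits in HEVC version 1.
        "level_idc": data[11],
    }


def is_progressive_frames_only(ptl):
    """True when indications 2) and 3) above are both 'yes'."""
    return (ptl["progressive_source_flag"] == 1
            and ptl["interlaced_source_flag"] == 0
            and ptl["frame_only_constraint_flag"] == 1)


# Hypothetical example: Main profile (1), Main tier, level_idc 93.
example = bytes([0x01, 0x60, 0x00, 0x00, 0x00, 0x90,
                 0x00, 0x00, 0x00, 0x00, 0x00, 0x5D])
ptl = parse_general_ptl(example)
print(ptl["profile_idc"], ptl["tier_flag"], ptl["level_idc"])  # 1 0 93
print(is_progressive_frames_only(ptl))                         # True
```

   A player without de-interlacing or frame-unpacking support could
   use a check of this kind to reject a bitstream before decoding it.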

   Bitstream and elementary stream

   HEVC includes a definition of an elementary stream, which is new
   compared to H.264.  An elementary stream consists of a sequence of
   one or more bitstreams.  An elementary stream that consists of two or
   more bitstreams has typically been formed by splicing together two or
   more bitstreams (or parts thereof).  When an elementary stream
   contains more than one bitstream, the last NAL unit of the last
   access unit of a bitstream (except the last bitstream in the
   elementary stream) must be an end of bitstream NAL unit, and the
   first access unit of the subsequent bitstream must be an Intra-Random
   Access Point (IRAP) access unit.  This IRAP access unit may be a
   Clean Random Access (CRA), Broken Link Access (BLA), or Instantaneous
   Decoding Refresh (IDR) access unit.

   Random access support

   HEVC includes signaling in the NAL unit header, through NAL unit
   types, of IRAP pictures beyond IDR pictures.  Three types of IRAP
   pictures, namely IDR, CRA, and BLA pictures, are supported: IDR
   pictures are conventionally referred to as closed group-of-pictures
   (closed-GOP) random access points whereas CRA and BLA pictures are
   conventionally referred to as open-GOP random access points.  BLA
   pictures usually originate from splicing of two bitstreams or parts
   thereof at a CRA picture, e.g., during stream switching.  To enable
   better systems usage of IRAP pictures, altogether six different NAL
   unit types are defined to signal the properties of the IRAP pictures,
   which can be used to better match the stream access point types as
   defined in the ISOBMFF [IS014496-12] [IS015444-12], which are
   utilized for random access support in both 3GP-DASH [3GPDASH] and
   MPEG DASH [MPEGDASH].  Pictures following an IRAP picture in decoding
   order and preceding the IRAP picture in output order are referred to
   as leading pictures associated with the IRAP picture.  There are two
   types of leading pictures: Random Access Decodable Leading (RADL)
   pictures and Random Access Skipped Leading (RASL) pictures.  RADL
   pictures are decodable when decoding starts at the associated IRAP
   picture; RASL pictures are not decodable when decoding starts at
   the associated IRAP picture and are usually discarded.
   HEVC provides mechanisms to enable specifying the conformance of a
   bitstream wherein the originally present RASL pictures have been
   discarded.  Consequently, system components can discard RASL
   pictures, when needed, without worrying about causing the bitstream
   to become non-compliant.
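
   The discarding of RASL pictures at a random access point can be
   sketched as follows.  This Python example is a simplified
   illustration, not a normative procedure: it assumes the input list
   begins at the IRAP picture where decoding starts, uses the NAL unit
   type values of [HEVC], and detects the first slice segment of a
   picture through the first bit following the two-byte NAL unit
   header.  The helper make_nal() and all function names are
   hypothetical.

```python
RASL_TYPES = {8, 9}        # RASL_N, RASL_R
BLA_TYPES = {16, 17, 18}   # BLA_W_LP, BLA_W_RADL, BLA_N_LP
# Other types used below: 1 = TRAIL_R, 7 = RADL_R, 21 = CRA_NUT.


def nal_unit_type(nal):
    """Return the 6-bit NAL unit type from the first header byte."""
    return (nal[0] >> 1) & 0x3F


def is_irap(nut):
    """Types 16 to 23 are IRAP types (22 and 23 are reserved)."""
    return 16 <= nut <= 23


def starts_new_picture(nal):
    """Test first_slice_segment_in_pic_flag, the first payload bit."""
    return len(nal) > 2 and bool(nal[2] & 0x80)


def drop_undecodable_rasl(nal_units):
    """Remove RASL pictures that cannot be decoded correctly."""
    kept = []
    seen_irap = False
    drop_rasl = False
    for nal in nal_units:
        nut = nal_unit_type(nal)
        if is_irap(nut) and starts_new_picture(nal):
            # RASL pictures are not decodable after the IRAP picture
            # at which decoding starts, nor after any BLA picture.
            drop_rasl = (not seen_irap) or nut in BLA_TYPES
            seen_irap = True
        if nut in RASL_TYPES and drop_rasl:
            continue
        kept.append(nal)
    return kept


def make_nal(nut, first_slice=True):
    """Hypothetical helper: header (layer 0, TemporalId 0) + 1 byte."""
    return bytes([nut << 1, 0x01, 0x80 if first_slice else 0x00])


stream = [make_nal(t) for t in (21, 9, 7, 1, 21, 8, 1)]
print([nal_unit_type(n) for n in drop_undecodable_rasl(stream)])
# [21, 7, 1, 21, 8, 1]
```

   In the example, the RASL picture after the first CRA picture is
   dropped, while the RASL picture after the second CRA picture is
   kept because its reference pictures have been decoded by then.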

   Temporal scalability support

   HEVC includes improved support for temporal scalability, by
   inclusion of the signaling of TemporalId in the NAL unit header, the
   restriction that pictures of a particular temporal sub-layer cannot
   be used for inter prediction reference by pictures of a lower
   temporal sub-layer, the sub-bitstream extraction process, and the
   requirement that each sub-bitstream extraction output be a conforming
   bitstream.  Media-Aware Network Elements (MANEs) can utilize the
   TemporalId in the NAL unit header for stream adaptation purposes
   based on temporal scalability.
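
   A MANE-style thinning step based on TemporalId might look like the
   Python sketch below.  It is a simplification of the sub-bitstream
   extraction process and is given for illustration only: it assumes
   the two-byte NAL unit header layout of [HEVC] (1-bit forbidden zero
   bit, 6-bit type, 6-bit layer ID, and 3-bit TemporalId plus 1), and
   the function names are hypothetical.

```python
def parse_nal_header(nal):
    """Split the two-byte HEVC NAL unit header into its fields."""
    if len(nal) < 2:
        raise ValueError("NAL unit shorter than its header")
    header = (nal[0] << 8) | nal[1]
    return {
        "forbidden_zero_bit": header >> 15,
        "nal_unit_type": (header >> 9) & 0x3F,
        "nuh_layer_id": (header >> 3) & 0x3F,
        # The header carries TemporalId + 1 in its last three bits.
        "temporal_id": (header & 0x07) - 1,
    }


def thin_to_temporal_id(nal_units, max_tid):
    """Keep only NAL units whose TemporalId does not exceed max_tid."""
    return [nal for nal in nal_units
            if parse_nal_header(nal)["temporal_id"] <= max_tid]


vps = bytes([0x40, 0x01])          # type 32 (VPS), TemporalId 0
slice_tid0 = bytes([0x02, 0x01])   # type 1 (TRAIL_R), TemporalId 0
slice_tid1 = bytes([0x02, 0x02])   # type 1 (TRAIL_R), TemporalId 1
slice_tid2 = bytes([0x02, 0x03])   # type 1 (TRAIL_R), TemporalId 2

units = [vps, slice_tid0, slice_tid2, slice_tid1, slice_tid2]
print(len(thin_to_temporal_id(units, 1)))   # 3
```

   Because pictures of a lower sub-layer never reference pictures of
   a higher sub-layer, the units that remain are still decodable.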

   Temporal sub-layer switching support

   HEVC specifies, through NAL unit types present in the NAL unit
   header, the signaling of Temporal Sub-layer Access (TSA) and Step-
   wise Temporal Sub-layer Access (STSA).  A TSA picture and pictures
   following the TSA picture in decoding order do not use pictures prior
   to the TSA picture in decoding order with TemporalId greater than or
   equal to that of the TSA picture for inter prediction reference.  A
   TSA picture enables up-switching, at the TSA picture, to the sub-
   layer containing the TSA picture or any higher sub-layer, from the
   immediately lower sub-layer.  An STSA picture does not use pictures
   with the same TemporalId as the STSA picture for inter prediction
   reference.  Pictures following an STSA picture in decoding order with
   the same TemporalId as the STSA picture do not use pictures prior to
   the STSA picture in decoding order with the same TemporalId as the
   STSA picture for inter prediction reference.  An STSA picture enables
   up-switching, at the STSA picture, to the sub-layer containing the
   STSA picture, from the immediately lower sub-layer.
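
   The up-switching rules above can be summarized in a small decision
   function.  The Python sketch below is an informal reading of this
   paragraph, not a normative algorithm; it assumes the NAL unit type
   values of [HEVC] for TSA and STSA pictures, and the function and
   parameter names are hypothetical.

```python
TSA_TYPES = {2, 3}    # TSA_N, TSA_R
STSA_TYPES = {4, 5}   # STSA_N, STSA_R


def max_tid_after_picture(current_max_tid, nut, pic_tid, highest_tid):
    """Return the highest TemporalId that can be decoded from this
    picture onward, given that sub-layers up to current_max_tid were
    being decoded so far.

    nut is the NAL unit type of the picture, pic_tid its TemporalId,
    and highest_tid the highest TemporalId present in the stream.
    """
    # Switching is only enabled from the immediately lower sub-layer.
    if pic_tid != current_max_tid + 1:
        return current_max_tid
    if nut in TSA_TYPES:
        # TSA: the picture's own sub-layer or any higher sub-layer.
        return highest_tid
    if nut in STSA_TYPES:
        # STSA: only the sub-layer containing the STSA picture.
        return pic_tid
    return current_max_tid


print(max_tid_after_picture(0, 2, 1, 2))   # 2 (TSA_N at TemporalId 1)
print(max_tid_after_picture(0, 5, 1, 2))   # 1 (STSA_R at TemporalId 1)
print(max_tid_after_picture(0, 3, 2, 2))   # 0 (TSA_R two levels up)
```

   A receiver or MANE that wishes to add sub-layers could wait for a
   picture for which this function returns a larger value.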

   Sub-layer reference or non-reference pictures

   The concept and signaling of reference/non-reference pictures in HEVC
   are different from H.264.  In H.264, if a picture may be used by any
   other picture for inter prediction reference, it is a reference
   picture; otherwise, it is a non-reference picture, and this is
   signaled by two bits in the NAL unit header.  In HEVC, a picture is
   called a reference picture only when it is marked as "used for
   reference".  In addition, the concept of sub-layer reference picture
   was introduced.  If a picture may be used by another picture
   with the same TemporalId for inter prediction reference, it is a sub-
   layer reference picture; otherwise, it is a sub-layer non-reference
   picture.  Whether a picture is a sub-layer reference picture or sub-
   layer non-reference picture is signaled through NAL unit type values.
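
   Since the distinction is carried in the NAL unit type, it can be
   tested without parsing the slice data.  The Python sketch below is
   illustrative only.  It assumes the type numbering of [HEVC], in
   which the sub-layer non-reference VCL types are the even values 0
   to 14 (TRAIL_N, TSA_N, STSA_N, RADL_N, RASL_N, and reserved
   values); the function names are hypothetical.

```python
SUB_LAYER_NON_REF_TYPES = frozenset(range(0, 15, 2))   # 0, 2, ..., 14


def nal_unit_type(nal):
    """Return the 6-bit NAL unit type from the first header byte."""
    return (nal[0] >> 1) & 0x3F


def temporal_id(nal):
    """Return TemporalId (the header carries TemporalId + 1)."""
    return (nal[1] & 0x07) - 1


def is_sub_layer_non_reference(nal):
    return nal_unit_type(nal) in SUB_LAYER_NON_REF_TYPES


def can_drop_safely(nal, highest_forwarded_tid):
    """A sub-layer non-reference picture of the highest forwarded
    sub-layer is not referenced by any picture that is forwarded."""
    return (is_sub_layer_non_reference(nal)
            and temporal_id(nal) == highest_forwarded_tid)


trail_n_tid2 = bytes([0 << 1, 0x03])   # TRAIL_N, TemporalId 2
trail_r_tid2 = bytes([1 << 1, 0x03])   # TRAIL_R, TemporalId 2

print(can_drop_safely(trail_n_tid2, 2))   # True
print(can_drop_safely(trail_r_tid2, 2))   # False
```

   The second helper reflects one possible use: such pictures may only
   be referenced by pictures of higher sub-layers, so they can be
   removed when no higher sub-layer is forwarded.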

   Extensibility

   Besides the TemporalId in the NAL unit header, HEVC also includes the
   signaling of a six-bit layer ID in the NAL unit header, which must be
   equal to 0 for a single-layer bitstream.  Extension mechanisms have
   been included in the VPS, SPS, Picture Parameter Set (PPS), SEI NAL
   unit, slice headers, and so on.  All these extension mechanisms
   enable future extensions in a backward-compatible manner, such that
   bitstreams encoded according to potential future HEVC extensions can
   be fed to then-legacy decoders (e.g., HEVC version 1 decoders), and
   the then-legacy decoders can decode and output the base-layer
   bitstream.

   Bitstream extraction

   HEVC includes a bitstream-extraction process as an integral part of
   the overall decoding process.  The bitstream extraction process is
   used in the process of bitstream conformance tests, which is part of
   the HRD buffering model.

   Reference picture management

   The reference picture management of HEVC, including reference picture
   marking and removal from the Decoded Picture Buffer (DPB) as well as
   Reference Picture List Construction (RPLC), differs from that of
   H.264.  Instead of the reference picture marking mechanism based on a
   sliding window plus adaptive Memory Management Control Operation
   (MMCO) described in H.264, HEVC specifies a reference picture
   management and marking mechanism based on Reference Picture Set
   (RPS), and the RPLC is consequently based on the RPS mechanism.  An
   RPS consists of a set of reference pictures associated with a
   picture, consisting of all reference pictures that are prior to the
   associated picture in decoding order, that may be used for inter
   prediction of the associated picture or any picture following the
   associated picture in decoding order.  The reference picture set
   consists of five lists of reference pictures: RefPicSetStCurrBefore,
   RefPicSetStCurrAfter, RefPicSetStFoll, RefPicSetLtCurr, and
   RefPicSetLtFoll.  RefPicSetStCurrBefore, RefPicSetStCurrAfter, and
   RefPicSetLtCurr contain all reference pictures that may be used in
   inter prediction of the current picture and that may be used in inter
   prediction of one or more of the pictures following the current
   picture in decoding order.  RefPicSetStFoll and RefPicSetLtFoll
   consist of all reference pictures that are not used in inter
   prediction of the current picture but may be used in inter prediction
   of one or more of the pictures following the current picture in
   decoding order.  RPS provides an "intra-coded" signaling of the DPB
   status, instead of an "inter-coded" signaling, mainly for improved
   error resilience.  The RPLC process in HEVC is based on the RPS, by
   signaling an index to an RPS subset for each reference index; this
   process is simpler than the RPLC process in H.264.
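
   The "intra-coded" nature of the RPS can be illustrated with a toy
   model in which the DPB state is derived solely from the RPS of the
   current picture.  The Python sketch below is an assumption-laden
   illustration, not the normative marking process: pictures are
   identified by hypothetical picture order count values, and the
   class and function names are invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class ReferencePictureSet:
    st_curr_before: list = field(default_factory=list)
    st_curr_after: list = field(default_factory=list)
    st_foll: list = field(default_factory=list)
    lt_curr: list = field(default_factory=list)
    lt_foll: list = field(default_factory=list)

    def current(self):
        """Pictures that may be referenced by the current picture."""
        return (set(self.st_curr_before) | set(self.st_curr_after)
                | set(self.lt_curr))

    def all_pictures(self):
        """Union of the five lists."""
        return self.current() | set(self.st_foll) | set(self.lt_foll)


def apply_rps(dpb, rps):
    """Return (kept, released, missing) for the pictures in dpb.

    kept:     pictures that remain available for reference
    released: pictures absent from the RPS, no longer needed
    missing:  pictures listed in the RPS but not present in the DPB
    """
    wanted = rps.all_pictures()
    kept = sorted(p for p in dpb if p in wanted)
    released = sorted(p for p in dpb if p not in wanted)
    missing = sorted(wanted - set(dpb))
    return kept, released, missing


rps = ReferencePictureSet(st_curr_before=[4], st_curr_after=[8],
                          lt_curr=[0])
print(apply_rps([0, 2, 4, 8], rps))   # ([0, 4, 8], [2], [])
print(apply_rps([0, 2, 8], rps))      # ([0, 8], [2], [4])
```

   The second call shows the error resilience aspect: because the
   complete set is signaled with each picture, a lost reference
   picture (here, 4) is detected without knowledge of earlier
   signaling.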

   Ultra-low delay support

   HEVC specifies a sub-picture-level HRD operation, for support of the
   so-called ultra-low delay.  The mechanism specifies a standard-
   compliant way to enable delay reduction below a one-picture interval.
   Coded Picture Buffer (CPB) and DPB parameters at the sub-picture
   level may be signaled, and utilization of this information for the
   derivation of CPB timing (wherein the CPB removal time corresponds to
   decoding time) and DPB output timing (display time) is specified.
   Decoders are allowed to operate the HRD at the conventional access-
   unit level, even when the sub-picture-level HRD parameters are
   present.

   New SEI messages

   HEVC inherits many H.264 SEI messages with changes in syntax and/or
   semantics making them applicable to HEVC.  Additionally, there are a
   few new SEI messages reviewed briefly in the following paragraphs.

   The display orientation SEI message informs the decoder of a
   transformation that is recommended to be applied to the cropped
   decoded picture prior to display, such that the pictures can be
   properly displayed, e.g., in an upside-up manner.

   The structure of pictures SEI message provides information on the NAL
   unit types, picture-order count values, and prediction dependencies
   of a sequence of pictures.  The SEI message can be used, for example,
   for concluding what impact a lost picture has on other pictures.

   The decoded picture hash SEI message provides a checksum derived from
   the sample values of a decoded picture.  It can be used for detecting
   whether a picture was correctly received and decoded.

   The active parameter sets SEI message includes the IDs of the active
   video parameter set and the active sequence parameter set and can be
   used to activate VPSs and SPSs.  In addition, the SEI message
   includes the following indications: 1) An indication of whether "full
   random accessibility" is supported (when supported, all parameter
   sets needed for decoding of the remainder of the bitstream when
   random accessing from the beginning of the current CVS by completely
   discarding all access units earlier in decoding order are present in
   the remaining bitstream, and all coded pictures in the remaining
   bitstream can be correctly decoded); 2) An indication of whether
   there is no parameter set within the current CVS that updates another
   parameter set of the same type preceding in decoding order.  An
   update of a parameter set refers to the use of the same parameter set
   ID but with some other parameters changed.  If this property is true
   for all CVSs in the bitstream, then all parameter sets can be sent
   out-of-band before session start.

   The decoding unit information SEI message provides information
   regarding coded picture buffer removal delay for a decoding unit.
   The message can be used in very-low-delay buffering operations.

   The region refresh information SEI message can be used together with
   the recovery point SEI message (present in both H.264 and HEVC) for
   improved support of gradual decoding refresh.  This supports random
   access from inter-coded pictures, wherein complete pictures can be
   correctly decoded or recovered after an indicated number of pictures
   in output/display order.

1.1.3.  Parallel Processing Support

   The reportedly much higher computational demand of HEVC encoding
   compared to H.264, in conjunction with the ever-increasing video
   resolution (both spatially and temporally) required by the market,
   led to the adoption of VCL coding tools specifically targeted to
   allow for parallelization on the sub-picture level.  That is,
   parallelization occurs, at the minimum, at the granularity of an
   integer number of CTUs.  The targets for this type of high-level
   parallelization are multicore CPUs and DSPs as well as multiprocessor
   systems.  In a system design, to be useful, these tools require
   signaling support, which is provided in Section 7 of this memo.  This
   section provides a brief overview of the tools available in [HEVC].

   Many of the tools incorporated in HEVC were designed keeping in mind
   the potential parallel implementations in multicore/multiprocessor
   architectures.  Specifically, for parallelization, four picture
   partition strategies, as described below, are available.

   Slices are segments of the bitstream that can be reconstructed
   independently from other slices within the same picture (though there
   may still be interdependencies through loop filtering operations).
   Slices are the only tool that can be used for parallelization that is
   also available, in virtually identical form, in H.264.
   Parallelization based on slices does not require much inter-processor
   or inter-core communication (except for inter-processor or inter-core
   data sharing for motion compensation when decoding a predictively
   coded picture, which is typically much heavier than inter-processor
   or inter-core data sharing due to in-picture prediction), as slices
   are designed to be independently decodable.  However, for the same
   reason, slices can require some coding overhead.  Further, slices (in
   contrast to some of the other tools mentioned below) also serve as
   the key mechanism for bitstream partitioning to match Maximum
   Transfer Unit (MTU) size requirements, due to the in-picture
   independence of slices and the fact that each regular slice is
   encapsulated in its own NAL unit.  In many cases, the goal of
   parallelization and the goal of MTU size matching can place
   contradictory demands on the slice layout in a picture.  The
   realization of this situation led to the development of the more
   advanced tools mentioned below.

   Dependent slice segments allow for fragmentation of a coded slice
   into fragments at CTU boundaries without breaking any in-picture
   prediction mechanisms.  They are complementary to the fragmentation
   mechanism described in this memo in that they need the cooperation of
   the encoder.  As a dependent slice segment necessarily contains an
   integer number of CTUs, a decoder using multiple cores operating on
   CTUs can process a dependent slice segment without communicating
   parts of the slice segment's bitstream to other cores.
   Fragmentation, as specified in this memo, in contrast, does not
   guarantee that a fragment contains an integer number of CTUs.

   In Wavefront Parallel Processing (WPP), the picture is partitioned
   into rows of CTUs.  Entropy decoding and prediction are allowed to
   use data from CTUs in other partitions.  Parallel processing is
   possible through parallel decoding of CTU rows, where the start of
   the decoding of a row is delayed by two CTUs relative to the row
   above, so as to ensure that data related to a CTU above and to the
   right of the subject CTU is available before the subject CTU is
   decoded.  Using this
   staggered start (which appears like a wavefront when represented
   graphically), parallelization is possible with up to as many
   processors/cores as the picture contains CTU rows.
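
   As an informal illustration of the staggered start described above,
   the following Python sketch computes the step at which each CTU
   could be decoded.  It assumes an idealized model that is not part of
   [HEVC] or this memo: every CTU takes exactly one time step, and one
   core is available per CTU row.  All function names are hypothetical.

```python
from collections import Counter


def wpp_start_step(row: int, col: int, lag: int = 2) -> int:
    """Earliest step at which CTU (row, col) can be decoded.

    Assumes one step per CTU and that each row starts `lag` CTUs
    after the row above it (two CTUs for WPP, as described above).
    """
    return col + lag * row


def wpp_schedule(rows: int, cols: int, lag: int = 2) -> list:
    """Return a rows x cols grid of decode steps."""
    return [[wpp_start_step(r, c, lag) for c in range(cols)]
            for r in range(rows)]


def max_parallel_ctus(rows: int, cols: int, lag: int = 2) -> int:
    """Largest number of CTUs decoded during the same step."""
    steps = Counter(s for row in wpp_schedule(rows, cols, lag) for s in row)
    return max(steps.values())


if __name__ == "__main__":
    for grid_row in wpp_schedule(3, 6):
        print(grid_row)
```

   In this model, the CTU above and to the right of CTU (row, col) is
   decoded one step before CTU (row, col) itself, which is what the
   two-CTU delay is meant to guarantee.  A picture of 3 x 6 CTUs keeps
   three cores busy at its peak, one per CTU row, whereas a picture of
   3 x 4 CTUs is too narrow to reach that bound.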

   Because in-picture prediction between neighboring CTU rows within a
   picture is allowed, the required inter-processor/inter-core
   communication to enable in-picture prediction can be substantial.
   The WPP partitioning does not result in the creation of more NAL
   units compared to when it is not applied; thus, WPP cannot be used
   for MTU size matching, though slices can be used in combination for
   that purpose.

   Tiles define horizontal and vertical boundaries that partition a
   picture into tile columns and rows.  The scan order of CTUs is
   changed to be local within a tile (in the order of a CTU raster scan
   of a tile), before decoding the top-left CTU of the next tile in the
   order of tile raster scan of a picture.  Similar to slices, tiles
   break in-picture prediction dependencies (including entropy decoding
   dependencies).  However, they do not need to be included in
   individual NAL units (same as WPP in this regard); hence, tiles
   cannot be used for MTU size matching, though slices can be used in
   combination for that purpose.  Each tile can be processed by one
   processor/core, and the inter-processor/inter-core communication
   required for in-picture prediction between processing units decoding
   neighboring tiles is limited to conveying the shared slice header in
   cases where a slice spans more than one tile, and to
   loop-filtering-related sharing of reconstructed samples and
   metadata.  In this respect, tiles are less demanding in terms of
   inter-processor communication bandwidth than WPP, due to the
   in-picture independence between two neighboring partitions.

1.1.4.  NAL Unit Header

   HEVC maintains the NAL unit concept of H.264 with modifications.
   HEVC uses a two-byte NAL unit header, as shown in Figure 1.  The
   payload of a NAL unit refers to the NAL unit excluding the NAL unit
   header.

            +---------------+---------------+
            |0|1|2|3|4|5|6|7|0|1|2|3|4|5|6|7|
            +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
            |F|   Type    |  LayerId  | TID |
            +-------------+-----------------+

   Figure 1: The Structure of the HEVC NAL Unit Header

   The semantics of the fields in the NAL unit header are as specified
   in [HEVC] and described briefly below for convenience.  In addition
   to the name and size of each field, the corresponding syntax element
   name in [HEVC] is also provided.

   F: 1 bit
      forbidden_zero_bit.  Required to be zero in [HEVC].  Note that the
      inclusion of this bit in the NAL unit header was to enable
      transport of HEVC video over MPEG-2 transport systems (avoidance
      of start code emulations) [MPEG2S].  In the context of this memo,
      the value 1 may be used to indicate a syntax violation, e.g., for
      a NAL unit resulting from aggregating a number of fragmentation
      units of a NAL unit but missing the last fragment, as described in
      Section 4.4.3.

   Type: 6 bits
      nal_unit_type.  This field specifies the NAL unit type as defined
      in Table 7-1 of [HEVC].  If the most significant bit of this field
      of a NAL unit is equal to 0 (i.e., the value of this field is less
      than 32), the NAL unit is a VCL NAL unit.  Otherwise, the NAL unit
      is a non-VCL NAL unit.  For a reference of all currently defined
      NAL unit types and their semantics, please refer to Section 7.4.2
      in [HEVC].

   LayerId: 6 bits
      nuh_layer_id.  Required to be equal to zero in [HEVC].  It is
      anticipated that in future scalable or 3D video coding extensions
      of this specification, this syntax element will be used to
      identify additional layers that may be present in the CVS, wherein
      a layer may be, e.g., a spatial scalable layer, a quality scalable
      layer, a texture view, or a depth view.

   TID: 3 bits
      nuh_temporal_id_plus1.  This field specifies the temporal
      identifier of the NAL unit plus 1.  The value of TemporalId is
      equal to TID minus 1.  A TID value of 0 is illegal to ensure that
      there is at least one bit in the NAL unit header equal to 1, so as
      to enable independent consideration of start code emulations in
      the NAL unit header and in the NAL unit payload data.
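
   The field layout in Figure 1 can be illustrated with a short Python
   sketch.  The class and function names below are hypothetical, and
   the code performs no validation beyond a length check; it is meant
   only to show the bit positions described above.

```python
from typing import NamedTuple


class NalUnitHeader(NamedTuple):
    f: int          # forbidden_zero_bit (1 bit)
    nal_type: int   # nal_unit_type (6 bits)
    layer_id: int   # nuh_layer_id (6 bits)
    tid: int        # nuh_temporal_id_plus1 (3 bits)

    @property
    def is_vcl(self) -> bool:
        # Types below 32 (most significant Type bit equal to 0) are VCL.
        return self.nal_type < 32

    @property
    def temporal_id(self) -> int:
        # TemporalId is equal to TID minus 1.
        return self.tid - 1


def parse_nal_unit_header(data: bytes) -> NalUnitHeader:
    """Split the first two bytes of a NAL unit into the Figure 1 fields."""
    if len(data) < 2:
        raise ValueError("a NAL unit header is two bytes long")
    value = (data[0] << 8) | data[1]  # 16 bits, most significant first
    return NalUnitHeader(
        f=(value >> 15) & 0x01,
        nal_type=(value >> 9) & 0x3F,
        layer_id=(value >> 3) & 0x3F,
        tid=value & 0x07,
    )


if __name__ == "__main__":
    print(parse_nal_unit_header(bytes([0x40, 0x01])))
```

   For example, the header bytes 0x40 0x01 yield F equal to 0, Type
   equal to 32 (a non-VCL NAL unit), LayerId equal to 0, and TID equal
   to 1, i.e., TemporalId equal to 0.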

1.2.  Overview of the Payload Format

   This payload format defines the following processes required for
   transport of HEVC coded data over RTP [RFC3550]:

   o  Usage of RTP header with this payload format

   o  Packetization of HEVC coded NAL units into RTP packets using three
      types of payload structures: a single NAL unit packet, an
      aggregation packet, and a fragmentation unit

   o  Transmission of HEVC NAL units of the same bitstream within a
      single RTP stream or multiple RTP streams (within one or more RTP
      sessions), where within an RTP stream transmission of NAL units
      may be either non-interleaved (i.e., the transmission order of NAL
      units is the same as their decoding order) or interleaved (i.e.,
      the transmission order of NAL units is different from the decoding
      order)

   o  Media type parameters to be used with the Session Description
      Protocol (SDP) [RFC4566]

   o  A payload header extension mechanism and data structures for
      enhanced support of temporal scalability based on that extension
      mechanism.

2.  Conventions

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in BCP 14 [RFC2119].

   In this document, the above key words will convey that interpretation
   only when in ALL CAPS.  Lowercase uses of these words are not to be
   interpreted as carrying the significance described in RFC 2119.

   This specification uses the notion of setting and clearing a bit when
   bit fields are handled.  Setting a bit is the same as assigning that
   bit the value of 1 (On).  Clearing a bit is the same as assigning
   that bit the value of 0 (Off).
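
   A minimal Python sketch of this terminology follows.  The function
   names are hypothetical, and bit positions are counted from the least
   significant bit (position 0), which is an assumption made for this
   illustration only.

```python
def set_bit(value: int, position: int) -> int:
    """Setting a bit: assign the bit at `position` the value 1 (On)."""
    return value | (1 << position)


def clear_bit(value: int, position: int) -> int:
    """Clearing a bit: assign the bit at `position` the value 0 (Off)."""
    return value & ~(1 << position)
```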

3.  Definitions and Abbreviations

3.1.  Definitions

   This document uses the terms and definitions of [HEVC].  Section
   3.1.1 lists relevant definitions from [HEVC] for convenience.
   Section 3.1.2 provides definitions specific to this memo.

3.1.1.  Definitions from the HEVC Specification

   access unit: A set of NAL units that are associated with each other
   according to a specified classification rule, that are consecutive in
   decoding order, and that contain exactly one coded picture.

   BLA access unit: An access unit in which the coded picture is a BLA
   picture.

   BLA picture: An IRAP picture for which each VCL NAL unit has
   nal_unit_type equal to BLA_W_LP, BLA_W_RADL, or BLA_N_LP.

   Coded Video Sequence (CVS): A sequence of access units that consists,
   in decoding order, of an IRAP access unit with NoRaslOutputFlag equal
   to 1, followed by zero or more access units that are not IRAP access
   units with NoRaslOutputFlag equal to 1, including all subsequent
   access units up to but not including any subsequent access unit that
   is an IRAP access unit with NoRaslOutputFlag equal to 1.

      Informative note: An IRAP access unit may be an IDR access unit, a
      BLA access unit, or a CRA access unit.  The value of
      NoRaslOutputFlag is equal to 1 for each IDR access unit, each BLA
      access unit, and each CRA access unit that is the first access
      unit in the bitstream in decoding order, is the first access unit
      that follows an end of sequence NAL unit in decoding order, or has
      HandleCraAsBlaFlag equal to 1.

   CRA access unit: An access unit in which the coded picture is a CRA
   picture.

   CRA picture: An IRAP picture for which each VCL NAL unit has
   nal_unit_type equal to CRA_NUT.

   IDR access unit: An access unit in which the coded picture is an IDR
   picture.

   IDR picture: An IRAP picture for which each VCL NAL unit has
   nal_unit_type equal to IDR_W_RADL or IDR_N_LP.

   IRAP access unit: An access unit in which the coded picture is an
   IRAP picture.

   IRAP picture: A coded picture for which each VCL NAL unit has
   nal_unit_type in the range of BLA_W_LP (16) to RSV_IRAP_VCL23 (23),
   inclusive.

   layer: A set of VCL NAL units that all have a particular value of
   nuh_layer_id and the associated non-VCL NAL units, or one of a set of
   syntactical structures having a hierarchical relationship.

   operation point: A bitstream created from another bitstream by
   operation of the sub-bitstream extraction process with that other
   bitstream, a target highest TemporalId, and a target-layer identifier
   list as inputs.

   random access: The act of starting the decoding process for a
   bitstream at a point other than the beginning of the bitstream.

   sub-layer: A temporal scalable layer of a temporal scalable bitstream
   consisting of VCL NAL units with a particular value of the TemporalId
   variable, and the associated non-VCL NAL units.

   sub-layer representation: A subset of the bitstream consisting of NAL
   units of a particular sub-layer and the lower sub-layers.

   tile: A rectangular region of coding tree blocks within a particular
   tile column and a particular tile row in a picture.

   tile column: A rectangular region of coding tree blocks having a
   height equal to the height of the picture and a width specified by
   syntax elements in the picture parameter set.

   tile row: A rectangular region of coding tree blocks having a height
   specified by syntax elements in the picture parameter set and a width
   equal to the width of the picture.
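
   The nal_unit_type-based definitions above (BLA, CRA, IDR, and IRAP
   pictures) can be summarized in a small Python sketch.  The values 16
   and 23 appear in the IRAP picture definition; the remaining numeric
   values are taken from Table 7-1 of [HEVC] and should be checked
   against that table.  The function names are hypothetical, and the
   sketch classifies a single nal_unit_type value, whereas the
   definitions apply to a picture, i.e., to all of its VCL NAL units.

```python
from typing import Optional

# nal_unit_type values as listed in Table 7-1 of [HEVC].
BLA_W_LP, BLA_W_RADL, BLA_N_LP = 16, 17, 18
IDR_W_RADL, IDR_N_LP = 19, 20
CRA_NUT = 21
RSV_IRAP_VCL22, RSV_IRAP_VCL23 = 22, 23


def is_irap(nal_unit_type: int) -> bool:
    """True for types in the range BLA_W_LP to RSV_IRAP_VCL23, inclusive."""
    return BLA_W_LP <= nal_unit_type <= RSV_IRAP_VCL23


def irap_kind(nal_unit_type: int) -> Optional[str]:
    """Return "BLA", "IDR", "CRA", "reserved IRAP", or None (not IRAP)."""
    if nal_unit_type in (BLA_W_LP, BLA_W_RADL, BLA_N_LP):
        return "BLA"
    if nal_unit_type in (IDR_W_RADL, IDR_N_LP):
        return "IDR"
    if nal_unit_type == CRA_NUT:
        return "CRA"
    if is_irap(nal_unit_type):
        return "reserved IRAP"
    return None
```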

3.1.2.  Definitions Specific to This Memo

   dependee RTP stream: An RTP stream on which another RTP stream
   depends.  All RTP streams in a Multiple RTP streams on a Single media
   Transport (MRST) or Multiple RTP streams on Multiple media Transports
   (MRMT) configuration, except for the highest RTP stream, are dependee
   RTP streams.

   highest RTP stream: The RTP stream on which no other RTP stream
   depends.  The RTP stream in a Single RTP stream on a Single media
   Transport (SRST) is the highest RTP stream.

   Media-Aware Network Element (MANE): A network element, such as a
   middlebox, selective forwarding unit, or application-layer gateway
   that is capable of parsing certain aspects of the RTP payload headers
   or the RTP payload and reacting to their contents.

      Informative note: The concept of a MANE goes beyond normal routers
      or gateways in that a MANE has to be aware of the signaling (e.g.,
      to learn about the payload type mappings of the media streams),
      and in that it has to be trusted when working with Secure RTP
      (SRTP).  The advantage of using MANEs is that they allow packets
      to be dropped according to the needs of the media coding.  For
      example, if a MANE has to drop packets due to congestion on a
      certain link, it can identify and remove those packets whose
      elimination produces the least adverse effect on the user
      experience.  After dropping packets, MANEs must rewrite RTCP
      packets to match the changes to the RTP stream, as specified in
      Section 7 of [RFC3550].
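
   As a rough illustration of the kind of selective dropping a MANE
   might perform, the Python sketch below discards NAL units above a
   chosen TemporalId.  It relies on the TID field occupying the three
   least significant bits of the second NAL unit header byte, and it
   assumes, for simplicity, that each input item is one complete NAL
   unit.  Handling of aggregation packets, fragmentation units, and the
   RTCP rewriting mentioned above is outside its scope.  The function
   names are hypothetical.

```python
from typing import Iterable, List


def temporal_id(nal_unit: bytes) -> int:
    """TemporalId of a NAL unit: the 3-bit TID field minus 1."""
    if len(nal_unit) < 2:
        raise ValueError("a NAL unit header is two bytes long")
    return (nal_unit[1] & 0x07) - 1


def keep_sub_layers(nal_units: Iterable[bytes],
                    max_temporal_id: int) -> List[bytes]:
    """Keep only NAL units whose TemporalId does not exceed the target."""
    return [n for n in nal_units if temporal_id(n) <= max_temporal_id]
```

   Dropping the highest temporal sub-layers first is one plausible
   policy, since it lowers the frame rate without removing pictures on
   which the remaining sub-layers depend; other policies are possible.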

   Media Transport: As used in the MRST, MRMT, and SRST definitions
   below, Media Transport denotes the transport of packets over a
   transport association identified by a 5-tuple (source address, source
   port, destination address, destination port, transport protocol).
   See also Section 2.1.13 of [RFC7656].

      Informative note: The term "bitstream" in this document is
      equivalent to the term "encoded stream" in [RFC7656].

   Multiple RTP streams on a Single media Transport (MRST):  Multiple
   RTP streams carrying a single HEVC bitstream on a Single Transport.
   See also Section 3.5 of [RFC7656].

   Multiple RTP streams on Multiple media Transports (MRMT):  Multiple
   RTP streams carrying a single HEVC bitstream on Multiple Transports.
   See also Section 3.5 of [RFC7656].

   NAL unit decoding order: A NAL unit order that conforms to the
   constraints on NAL unit order given in Section 7.4.2.4 in [HEVC].

   NAL unit output order: A NAL unit order in which NAL units of
   different access units are in the output order of the decoded
   pictures corresponding to the access units, as specified in [HEVC],
   and in which NAL units within an access unit are in their decoding
   order.

   NAL-unit-like structure: A data structure that is similar to NAL
   units in the sense that it also has a NAL unit header and a payload,
   with the difference that the payload does not follow the start code
   emulation prevention mechanism required for the NAL unit syntax as
   specified in Section 7.3.1.1 of [HEVC].  Examples of NAL-unit-like
   structures defined in this memo are packet payloads of Aggregation
   Packet (AP), PAyload Content Information (PACI), and Fragmentation
   Unit (FU) packets.

   NALU-time: The value that the RTP timestamp would have if the NAL
   unit were transported in its own RTP packet.

   RTP stream: See [RFC7656].  Within the scope of this memo, one RTP
   stream is utilized to transport one or more temporal sub-layers.

   Single RTP stream on a Single media Transport (SRST):  Single RTP
   stream carrying a single HEVC bitstream on a Single (Media)
   Transport.  See also Section 3.5 of [RFC7656].

   transmission order: The order of packets in ascending RTP sequence
   number order (in modulo arithmetic).  Within an aggregation packet,
   the NAL unit transmission order is the same as the order of
   appearance of NAL units in the packet.
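
   Because RTP sequence numbers are 16 bits wide [RFC3550], ascending
   order "in modulo arithmetic" has to account for wraparound.  The
   following Python sketch shows one common way to do so.  The
   half-range comparison and the function names are assumptions of this
   illustration, not requirements of this memo.

```python
from typing import Iterable, List

SEQ_MOD = 1 << 16  # RTP sequence numbers are 16 bits wide


def seq_precedes(a: int, b: int) -> bool:
    """True if a comes before b in modulo-2^16 arithmetic.

    Forward distances of less than half the sequence space are
    treated as "a before b".
    """
    return a != b and ((b - a) % SEQ_MOD) < SEQ_MOD // 2


def transmission_order(seqs: Iterable[int], first: int) -> List[int]:
    """Sort sequence numbers in ascending modulo order.

    `first` is the earliest sequence number expected in the window.
    """
    return sorted(seqs, key=lambda s: (s - first) % SEQ_MOD)
```

   With this comparison, sequence number 65535 precedes 0, so packets
   received as 1, 65534, 0, 65535 are in transmission order when
   arranged as 65534, 65535, 0, 1.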

3.2.  Abbreviations

   AP       Aggregation Packet

   BLA      Broken Link Access

   CRA      Clean Random Access

   CTB      Coding Tree Block

   CTU      Coding Tree Unit

   CVS      Coded Video Sequence

   DPH      Decoded Picture Hash

   FU       Fragmentation Unit

   HRD      Hypothetical Reference Decoder

   IDR      Instantaneous Decoding Refresh

   IRAP     Intra Random Access Point

   MANE     Media-Aware Network Element

   MRMT     Multiple RTP streams on Multiple media Transports

   MRST     Multiple RTP streams on a Single media Transport

   MTU      Maximum Transfer Unit

   NAL      Network Abstraction Layer

   NALU     Network Abstraction Layer Unit

   PACI     PAyload Content Information

   PHES     Payload Header Extension Structure

   PPS      Picture Parameter Set

   RADL     Random Access Decodable Leading (Picture)

   RASL     Random Access Skipped Leading (Picture)

   RPS      Reference Picture Set

   SEI      Supplemental Enhancement Information

   SPS      Sequence Parameter Set

   SRST     Single RTP stream on a Single media Transport

   STSA     Step-wise Temporal Sub-layer Access

   TSA      Temporal Sub-layer Access

   TSCI     Temporal Scalability Control Information

   VCL      Video Coding Layer

   VPS      Video Parameter Set
