Internet Engineering Task Force (IETF) S. Wenger
Request for Comments: 6190 Independent
Category: Standards Track Y.-K. Wang
ISSN: 2070-1721 Huawei Technologies
May 2011
RTP Payload Format for Scalable Video Coding
This memo describes an RTP payload format for Scalable Video Coding
(SVC) as defined in Annex G of ITU-T Recommendation H.264, which is
technically identical to Amendment 3 of ISO/IEC International
Standard 14496-10. The RTP payload format allows for packetization
of one or more Network Abstraction Layer (NAL) units in each RTP
packet payload, as well as fragmentation of a NAL unit in multiple
RTP packets. Furthermore, it supports transmission of an SVC stream
over a single as well as multiple RTP sessions. The payload format
defines a new media subtype name "H264-SVC", but is still backward
compatible to RFC 6184 since the base layer, when encapsulated in its
own RTP stream, must use the H.264 media subtype name ("H264") and
the packetization method specified in RFC 6184. The payload format
has wide applicability in videoconferencing, Internet video
streaming, and high-bitrate entertainment-quality video, among
others.
Status of This Memo
This is an Internet Standards Track document.
This document is a product of the Internet Engineering Task Force
(IETF). It represents the consensus of the IETF community. It has
received public review and has been approved for publication by the
Internet Engineering Steering Group (IESG). Further information on
Internet Standards is available in Section 2 of RFC 5741.
Information about the current status of this document, any errata,
and how to provide feedback on it may be obtained at
Copyright (c) 2011 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.
This document may contain material from IETF Documents or IETF
Contributions published or made publicly available before November
10, 2008. The person(s) controlling the copyright in some of this
material may not have granted the IETF Trust the right to allow
modifications of such material outside the IETF Standards Process.
Without obtaining an adequate license from the person(s) controlling
the copyright in such materials, this document may not be modified
outside the IETF Standards Process, and derivative works of it may
not be created outside the IETF Standards Process, except to format
it for publication as an RFC or to translate it into languages other
than English.
6. De-Packetization Process
   6.1. De-Packetization Process for Single-Session Transmission
   6.2. De-Packetization Process for Multi-Session Transmission
      6.2.1. Decoding Order Recovery for the NI-T and NI-TC Modes
         6.2.1.1. Informative Algorithm for NI-T Decoding Order
                  Recovery within an Access Unit
      6.2.2. Decoding Order Recovery for the NI-C, NI-TC, and I-C
             Modes
7. Payload Format Parameters
   7.1. Media Type Registration
   7.2. SDP Parameters
      7.2.1. Mapping of Payload Type Parameters to SDP
      7.2.2. Usage with the SDP Offer/Answer Model
      7.2.3. Dependency Signaling in Multi-Session Transmission
      7.2.4. Usage in Declarative Session Descriptions
   7.3. Examples
      7.3.1. Example for Offering a Single SVC Session
      7.3.2. Example for Offering a Single SVC Session Using
             scalable-layer-id
      7.3.3. Example for Offering Multiple Sessions in MST
      7.3.4. Example for Offering Multiple Sessions in MST Including
             Operation with Answerer Using scalable-layer-id
      7.3.5. Example for Negotiating an SVC Stream with a
             Constrained Base Layer in SST
   7.4. Parameter Set Considerations
8. Security Considerations
9. Congestion Control
10. IANA Considerations
11. Informative Appendix: Application Examples
   11.1. Introduction
   11.2. Layered Multicast
   11.3. Streaming
   11.4. Videoconferencing (Unicast to MANE, Unicast to Endpoints)
   11.5. Mobile TV (Multicast to MANE, Unicast to Endpoint)
12. Acknowledgements
13. References
   13.1. Normative References
   13.2. Informative References
This memo specifies an RTP [RFC3550] payload format for the Scalable
Video Coding (SVC) extension of the H.264/AVC video coding standard.
SVC is specified in Amendment 3 to ISO/IEC 14496 Part 10
[ISO/IEC14496-10] and equivalently in Annex G of ITU-T Rec. H.264
[H.264]. In this memo, unless explicitly stated otherwise,
"H.264/AVC" refers to the specification of [H.264] excluding Annex G.
SVC covers the entire application range of H.264/AVC, from low-
bitrate mobile applications, to High-Definition Television (HDTV)
broadcasting, and even Digital Cinema that requires nearly lossless
coding and hundreds of megabits per second. The scalability features
that SVC adds to H.264/AVC enable several system-level
functionalities related to the ability of a system to adapt the
signal to different system conditions with no or minimal processing.
The adaptation relates both to the capabilities of potentially
heterogeneous receivers (differing in screen resolution, processing
speed, etc.), and to differing or time-varying network conditions.
The adaptation can be performed at the source, the destination, or in
intermediate media-aware network elements (MANEs). The payload
format specified in this memo exposes these system-level
functionalities so that system designers can take direct advantage of
them.
Informative note: Since SVC streams contain, by design, a sub-
stream that is compliant with H.264/AVC, it is trivial for a MANE
to filter the stream so that all SVC-specific information is
removed. This memo, in fact, defines a media type parameter
(sprop-avc-ready, Section 7.2) that indicates whether or not the
stream can be converted to one compliant with [RFC6184] by
eliminating RTP packets, and rewriting RTP Control Protocol (RTCP)
to match the changes to the RTP packet stream as specified in
Section 7 of [RFC3550].
This memo defines two basic modes for transmission of SVC data,
single-session transmission (SST) and multi-session transmission
(MST). In SST, a single RTP session is used for the transmission of
all scalability layers comprising an SVC bitstream; in MST, the
scalability layers are transported on different RTP sessions. In
SST, packetization is a straightforward extension of [RFC6184]. For
MST, four different modes are defined in this memo. They differ on
whether or not they allow interleaving, i.e., transmitting Network
Abstraction Layer (NAL) units in an order different than the decoding
order, and by the technique used to effect inter-session NAL unit
decoding order recovery. Decoding order recovery is performed using
either inter-session timestamp alignment [RFC3550] or cross-session
decoding order numbers (CS-DONs). One of the MST modes supports both
decoding order recovery techniques, so that receivers can select
their preferred technique. More details can be found in Section
1.2.2.
This memo further defines three new NAL unit types. The first type
is the payload content scalability information (PACSI) NAL unit,
which is used to provide an informative summary of the scalability
information of the data contained in an RTP packet, as well as
ancillary data (e.g., CS-DON values). The second and third new NAL
unit types are the empty NAL unit and the non-interleaved multi-time
aggregation packet (NI-MTAP) NAL unit. The empty NAL unit is used to
ensure inter-session timestamp alignment required for decoding order
recovery in MST. The NI-MTAP is used as a new payload structure
allowing the grouping of NAL units of different time instances in
decoding order. More details about the new packet structures can be
found in Section 1.2.3.
This memo also defines the signaling support for SVC transport over
RTP, including a new media subtype name (H264-SVC).
A non-normative overview of the SVC codec and the payload is given in
the remainder of this section.
1.1. The SVC Codec
SVC defines a coded video representation in which a given bitstream
offers representations of the source material at different levels of
fidelity (hence the term "scalable"). Scalable video coding
bitstreams, or scalable bitstreams, are constructed in a pyramidal
fashion: the coding process creates bitstream components that improve
the fidelity of hierarchically lower components.
The fidelity dimensions offered by SVC are spatial (picture size),
quality (or Signal-to-Noise Ratio (SNR)), and temporal (pictures per
second). Bitstream components associated with a given level of
spatial, quality, and temporal fidelity are identified using
corresponding parameters in the bitstream: dependency_id, quality_id,
and temporal_id (see also Section 1.1.3). The fidelity identifiers
have integer values, where higher values designate components that
are higher in the hierarchy. It is noted that SVC offers significant
flexibility in terms of how an encoder may choose to structure the
dependencies between the various components. Decoding of a
particular component requires the availability of all the components
it depends upon, either directly, or indirectly. An operation point
of an SVC bitstream consists of the bitstream components required to
be able to decode a particular dependency_id, quality_id, and
temporal_id combination.
The term "layer" is used in various contexts in this memo. For
example, in the terms "Video Coding Layer" and "Network Abstraction
Layer" it refers to conceptual organization levels. When referring
to bitstream syntax elements such as block layer or macroblock layer,
it refers to hierarchical bitstream structure levels. When used in
the context of bitstream scalability, e.g., "AVC base layer", it
refers to a level of representation fidelity of the source signal
with a specific set of NAL units included. The correct
interpretation is supported by providing the appropriate context.
SVC maintains the bitstream organization introduced in H.264/AVC.
Specifically, all bitstream components are encapsulated in Network
Abstraction Layer (NAL) units, which are organized as Access Units
(AUs). An AU is associated with a single sampling instance in time.
A subset of the NAL unit types correspond to the Video Coding Layer
(VCL), and contain the coded picture data associated with the source
content. Non-VCL NAL units carry ancillary data that may be
necessary for decoding (e.g., parameter sets as explained below) or
that facilitate certain system operations but are not needed by the
decoding process itself. Coded picture data at the various fidelity
dimensions are organized in slices. Within one AU, a coded picture
of an operation point consists of all the coded slices required for
decoding up to the particular combination of dependency_id and
quality_id values at the time instance corresponding to the AU.
It is noted that the concept of temporal scalability is already
present in H.264/AVC, as profiles defined in Annex A of [H.264]
already support it. Specifically, in H.264/AVC, the concept of sub-
sequences has been introduced to allow optional use of temporal
layers through Supplemental Enhancement Information (SEI) messages.
SVC extends this approach by exposing the temporal scalability
information using the temporal_id parameter, alongside (and unified
with) the dependency_id and quality_id values that are used for
spatial and quality scalability, respectively. For coded picture
data defined in Annex G of [H.264], this is accomplished by using a
new type of NAL unit, namely, coded slice in scalable extension NAL
unit (type 20), where the fidelity parameters are part of its header.
For coded picture data that follow H.264/AVC, and to ensure
compatibility with existing H.264/AVC decoders, another new type of
NAL unit, namely, prefix NAL unit (type 14), has been defined to
carry this header information. SVC additionally specifies a third
new type of NAL unit, namely, subset sequence parameter set NAL unit
(type 15), to contain sequence parameter set information for quality
and spatial enhancement layers. All these three newly specified NAL
unit types (14, 15, and 20) are among those reserved in H.264/AVC and
are to be ignored by decoders conforming to one or more of the
profiles specified in Annex A of [H.264].
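Because types 14, 15, and 20 are reserved in H.264/AVC, a legacy decoder can skip them by looking only at the NAL unit type, which occupies the low five bits of the first header octet. The following is a minimal sketch of that filtering, assuming each NAL unit is available as a bytes object that starts with its header. The names SVC_ONLY_NAL_TYPES, nal_unit_type, and drop_svc_nal_units are illustrative and do not appear in this memo or in [H.264].

```python
# NAL unit types introduced by SVC: prefix NAL unit (14), subset sequence
# parameter set (15), coded slice in scalable extension (20).
SVC_ONLY_NAL_TYPES = {14, 15, 20}


def nal_unit_type(nal: bytes) -> int:
    """Return the 5-bit nal_unit_type from the first header octet."""
    return nal[0] & 0x1F


def drop_svc_nal_units(nal_units):
    """Keep only the NAL units that an Annex A (H.264/AVC) decoder processes.

    Hypothetical helper: it mimics what a legacy decoder does implicitly
    when it ignores the reserved types.
    """
    return [n for n in nal_units if nal_unit_type(n) not in SVC_ONLY_NAL_TYPES]
```

Applied to a list holding NAL units of types 7, 14, 5, 20, and 15, the helper returns only the type 7 and type 5 units.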
Within an AU, the VCL NAL units associated with a given dependency_id
and quality_id are referred to as a "layer representation". The
layer representation corresponding to the lowest values of
dependency_id and quality_id (i.e., zero for both) is compliant by
design to H.264/AVC. The set of VCL and associated non-VCL NAL units
across all AUs in a bitstream associated with a particular
combination of values of dependency_id and quality_id, and regardless
of the value of temporal_id, is conceptually a scalable layer. For
backward compatibility with H.264/AVC, it is important to
differentiate, however, whether or not SVC-specific NAL units are
present in a given bitstream. This is particularly important for the
lowest fidelity values in terms of dependency_id and quality_id (zero
for both), as the corresponding VCL data are compliant with
H.264/AVC, and may or may not be accompanied by associated prefix NAL
units. This memo therefore uses the term "AVC base layer" to
designate the layer that does not contain SVC-specific NAL units, and
"SVC base layer" to designate the same layer but with the addition of
the associated SVC prefix NAL units. Note that the SVC specification
uses the term "base layer" for what in this memo will be referred to
as "AVC base layer". Similarly, it is also important to be able to
differentiate, within a layer, the temporal fidelity components it
contains. This memo uses the term "T0" to indicate, within a
particular layer, the subset that contains the NAL units associated
with temporal_id equal to 0.
SNR scalability in SVC is offered in two different ways. In what is
called coarse-grain scalability (CGS), scalability is provided by
including or excluding a complete layer when decoding a particular
bitstream. In contrast, in medium-grain scalability (MGS),
scalability is provided by selectively omitting the decoding of
specific NAL units belonging to MGS layers. The selection of the NAL
units to omit can be based on fixed-length fields present in the NAL
unit header (see also Sections 1.1.3 and 4.2).
1.1.2. Parameter Sets
SVC maintains the parameter sets concept in H.264/AVC and introduces
a new type of sequence parameter set, referred to as the subset
sequence parameter set [H.264]. Subset sequence parameter sets have
NAL unit type equal to 15, which is different from the NAL unit type
value (7) of sequence parameter sets. VCL NAL units of NAL unit type
1 to 5 must only (indirectly) refer to sequence parameter sets, while
VCL NAL units of NAL unit type 20 must only (indirectly) refer to
subset sequence parameter sets. The references are indirect because
VCL NAL units refer to picture parameter sets (in their slice
header), which in turn refer to regular or subset sequence parameter
sets. Subset sequence parameter sets use an identifier value space
that is separate from that of sequence parameter sets.
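The indirect reference chain and the separate identifier spaces can be illustrated with a small lookup sketch. It assumes that parameter sets have already been parsed and stored by the caller. The store layout (a dict keyed by parameter set NAL unit type and identifier), the pps_to_sps_id mapping, and all function names are hypothetical and chosen only for this example.

```python
SPS = 7          # sequence parameter set NAL unit type
SUBSET_SPS = 15  # subset sequence parameter set NAL unit type


def required_sps_type(vcl_nal_unit_type: int) -> int:
    """Return which kind of sequence parameter set a VCL NAL unit refers to."""
    if 1 <= vcl_nal_unit_type <= 5:
        return SPS
    if vcl_nal_unit_type == 20:
        return SUBSET_SPS
    raise ValueError(f"not a VCL NAL unit type covered here: {vcl_nal_unit_type}")


def resolve_sps(store, pps_to_sps_id, vcl_nal_unit_type, pps_id):
    """Follow slice header -> picture parameter set -> (subset) SPS.

    store:          dict keyed by (sps_nal_unit_type, sps_id); keeping the
                    type in the key models the two separate identifier spaces.
    pps_to_sps_id:  dict mapping a picture parameter set id to the sequence
                    parameter set id that the picture parameter set names.
    """
    sps_id = pps_to_sps_id[pps_id]
    return store[(required_sps_type(vcl_nal_unit_type), sps_id)]
```

With this layout, a sequence parameter set and a subset sequence parameter set can both carry identifier 0 without colliding, and the NAL unit type of the referring slice selects between them.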
In SVC, coded picture data from different layers may use the same or
different sequence and picture parameter sets. Let the variable DQId
be equal to dependency_id * 16 + quality_id. At any time instant
during the decoding process there is one active sequence parameter
set for the layer representation with the highest value of DQId and
one or more active layer SVC sequence parameter set(s) for layer
representations with lower values of DQId. The active sequence
parameter set or an active layer SVC sequence parameter set remains
unchanged throughout a coded video sequence in the scalable layer in
which the active sequence parameter set or active layer SVC sequence
parameter set is referred to. This means that the referred sequence
parameter set or subset sequence parameter set can only change at
instantaneous decoding refresh (IDR) access units for any layer. At
any time instant during the decoding process there may be one active
picture parameter set (for the layer representation with the highest
value of DQId) and one or more active layer picture parameter set(s)
(for layer representations with lower values of DQId). The active
picture parameter set or an active layer picture parameter set
remains unchanged throughout a layer representation in which the
active picture parameter set or active layer picture parameter set is
referred to, but may change from one AU to the next.
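As a worked illustration of the DQId variable, the sketch below computes DQId and picks the layer representation with the highest value, which is the one associated with the active (rather than active layer) parameter sets. The value ranges follow from the 3-bit dependency_id and 4-bit quality_id fields; the function names are illustrative.

```python
def dqid(dependency_id: int, quality_id: int) -> int:
    """Compute DQId = dependency_id * 16 + quality_id."""
    if not 0 <= dependency_id <= 7:
        raise ValueError("dependency_id is a 3-bit field (0..7)")
    if not 0 <= quality_id <= 15:
        raise ValueError("quality_id is a 4-bit field (0..15)")
    # Equivalent to (dependency_id << 4) | quality_id for in-range values.
    return dependency_id * 16 + quality_id


def split_dqid(value: int):
    """Recover (dependency_id, quality_id) from a DQId value."""
    return divmod(value, 16)


def highest_layer_representation(layer_ids):
    """Return the (dependency_id, quality_id) pair with the largest DQId."""
    return max(layer_ids, key=lambda ids: dqid(*ids))
```

For example, dependency_id 2 with quality_id 3 gives DQId 35, and among the pairs (0, 0), (1, 2), and (1, 0) the pair (1, 2) has the highest DQId (18).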
1.1.3. NAL Unit Header
SVC extends the one-byte H.264/AVC NAL unit header by three
additional octets for NAL units of types 14 and 20. The header
indicates the type of the NAL unit, the (potential) presence of bit
errors or syntax violations in the NAL unit payload, information
regarding the relative importance of the NAL unit for the decoding
process, the layer identification information, and other fields as
discussed below.
The syntax and semantics of the NAL unit header are specified in
[H.264], but the essential properties of the NAL unit header are
summarized below for convenience.
The first byte of the NAL unit header has the following format (the
bit fields are the same as defined for the one-byte H.264/AVC NAL
unit header, while the semantics of some fields have changed
slightly, in a backward-compatible way):
      +---------------+
      |0|1|2|3|4|5|6|7|
      +-+-+-+-+-+-+-+-+
      |F|NRI|  Type   |
      +---------------+
The semantics of the components of the NAL unit type octet, as
specified in [H.264], are described briefly below. In addition to
the name and size of each field, the corresponding syntax element
name in [H.264] is also provided.
F: 1 bit
forbidden_zero_bit. H.264/AVC declares a value of 1 as a
syntax violation.
NRI: 2 bits
nal_ref_idc. A value of "00" (in binary form) indicates that
the content of the NAL unit is not used to reconstruct
reference pictures for future prediction. Such NAL units can
be discarded without risking the integrity of the reference
pictures in the same layer. A value greater than "00"
indicates that the decoding of the NAL unit is required to
maintain the integrity of reference pictures in the same layer
or that the NAL unit contains parameter sets.
Type: 5 bits
nal_unit_type. This component specifies the NAL unit type as
defined in Table 7-1 of [H.264], and later within this memo.
For a reference of all currently defined NAL unit types and
their semantics, please refer to Section 7.4.1 in [H.264].
In H.264/AVC, NAL unit types 14, 15, and 20 are reserved for
future extensions. SVC uses these three NAL unit types as
follows: NAL unit type 14 is used for prefix NAL unit, NAL unit
type 15 is used for subset sequence parameter set, and NAL unit
type 20 is used for coded slice in scalable extension (see
Section 7.4.1 in [H.264]). NAL unit types 14 and 20 indicate
the presence of three additional octets in the NAL unit header,
as shown below.
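      +---------------+---------------+---------------+
      |0|1|2|3|4|5|6|7|0|1|2|3|4|5|6|7|0|1|2|3|4|5|6|7|
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |R|I|   PRID    |N| DID |  QID  | TID |U|D|O| RR|
      +---------------+---------------+---------------+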
R: 1 bit
reserved_one_bit. Reserved bit for future extension. R must
be equal to 1. The value of R must be ignored by decoders.
I: 1 bit
idr_flag. This component specifies whether the layer
representation is an instantaneous decoding refresh (IDR) layer
representation (when equal to 1) or not (when equal to 0).
PRID: 6 bits
priority_id. This component specifies a priority identifier for the
NAL unit. A lower value of PRID indicates a higher priority.
N: 1 bit
no_inter_layer_pred_flag. This flag specifies, when present in
a coded slice NAL unit, whether inter-layer prediction may be
used for decoding the coded slice (when equal to 0) or not
(when equal to 1).
DID: 3 bits
dependency_id. This component indicates the inter-layer coding
dependency level of a layer representation. At any access
unit, a layer representation with a given dependency_id may be
used for inter-layer prediction for coding of a layer
representation with a higher dependency_id, while a layer
representation with a given dependency_id shall not be used for
inter-layer prediction for coding of a layer representation
with a lower dependency_id.
QID: 4 bits
quality_id. This component indicates the quality level of an
MGS layer representation. At any access unit and for identical
dependency_id values, a layer representation with quality_id
equal to ql uses a layer representation with quality_id equal
to ql-1 for inter-layer prediction.
TID: 3 bits
temporal_id. This component indicates the temporal level of a
layer representation. The temporal_id is associated with the
frame rate, with lower values of temporal_id corresponding to
lower frame rates. A layer representation at a given
temporal_id typically depends on layer representations with
lower temporal_id values, but it never depends on layer
representations with higher temporal_id values.
U: 1 bit
use_ref_base_pic_flag. A value of 1 indicates that only
reference base pictures are used during the inter prediction
process. A value of 0 indicates that the reference base
pictures are not used during the inter prediction process.
D: 1 bit
discardable_flag. A value of 1 indicates that the current NAL
unit is not used for decoding NAL units with values of
dependency_id higher than the one of the current NAL unit, in
the current and all subsequent access units. Such NAL units
can be discarded without risking the integrity of layers with
higher dependency_id values. discardable_flag equal to 0
indicates that the decoding of the NAL unit is required to
maintain the integrity of layers with higher dependency_id.
O: 1 bit
output_flag. Affects the decoded picture output process as
defined in Annex C of [H.264].
RR: 2 bits
reserved_three_2bits. Reserved bits for future extension. RR
MUST be equal to "11" (in binary form). The value of RR must
be ignored by decoders.
This memo extends the semantics of F, NRI, I, PRID, DID, QID, TID, U,
and D per Annex G of [H.264] as described in Section 4.2.
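The field widths listed above determine how the header octets are unpacked: F (1 bit), NRI (2 bits), and Type (5 bits) in the first octet, followed, for types 14 and 20 only, by R, I, PRID in the second octet, N, DID, QID in the third, and TID, U, D, O, RR in the fourth. The following is a minimal parsing sketch under those assumptions. The class and function names are illustrative, and the reserved fields R and RR are not checked because decoders are told to ignore them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SvcExtension:
    idr_flag: int
    priority_id: int
    no_inter_layer_pred_flag: int
    dependency_id: int
    quality_id: int
    temporal_id: int
    use_ref_base_pic_flag: int
    discardable_flag: int
    output_flag: int


@dataclass
class NalHeader:
    f: int
    nri: int
    type: int
    svc: Optional[SvcExtension]  # present only for NAL unit types 14 and 20


def parse_nal_header(data: bytes) -> NalHeader:
    """Parse the one-octet header and, if present, the three SVC octets."""
    if len(data) < 1:
        raise ValueError("empty NAL unit")
    b0 = data[0]
    f, nri, typ = b0 >> 7, (b0 >> 5) & 0x3, b0 & 0x1F
    svc = None
    if typ in (14, 20):
        if len(data) < 4:
            raise ValueError("truncated SVC NAL unit header")
        b1, b2, b3 = data[1], data[2], data[3]
        svc = SvcExtension(
            idr_flag=(b1 >> 6) & 0x1,        # I (R in bit 0 is ignored)
            priority_id=b1 & 0x3F,           # PRID
            no_inter_layer_pred_flag=b2 >> 7,  # N
            dependency_id=(b2 >> 4) & 0x7,   # DID
            quality_id=b2 & 0xF,             # QID
            temporal_id=b3 >> 5,             # TID
            use_ref_base_pic_flag=(b3 >> 4) & 0x1,  # U
            discardable_flag=(b3 >> 3) & 0x1,       # D
            output_flag=(b3 >> 2) & 0x1,            # O (RR is ignored)
        )
    return NalHeader(f, nri, typ, svc)
```

As a usage example, the four octets 0x74 0xC5 0x23 0x5F parse as a type 20 NAL unit with NRI 3, I = 1, PRID = 5, DID = 2, QID = 3, TID = 2, and U, D, and O all set.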
1.2. Overview of the Payload Format
Similar to [RFC6184], this payload format can only be used to carry
the raw NAL unit stream over RTP and not the bytestream format
specified in Annex B of [H.264].
The design principles, transmission modes, and packetization modes as
well as new payload structures are summarized in this section. It is
assumed that the reader is familiar with the terminology and concepts
defined in [RFC6184].
1.2.1. Design Principles
The following design principles have been observed for this payload
format:
o Backward compatibility with [RFC6184] wherever possible.
o The SVC base layer or any H.264/AVC compatible subset of the SVC
base layer, when transmitted in its own RTP stream, must be
encapsulated using [RFC6184]. This ensures that such an RTP
stream can be understood by [RFC6184] receivers.
o Media-aware network elements (MANEs) as defined in [RFC6184] are
signaling-aware, rely on signaling information, and have state.
o MANEs can aggregate multiple RTP streams, possibly from multiple
o MANEs can perform media-aware stream thinning (selective
elimination of packets or portions thereof). By using the payload
header information identifying layers within an RTP session, MANEs
are able to remove packets or portions thereof from the incoming
RTP packet stream. This implies rewriting the RTP headers of the
outgoing packet stream, and rewriting of RTCP packets as specified
in Section 7 of [RFC3550].
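The stream thinning principle can be sketched as a forwarding decision based on layer identifiers. The rule used here is a deliberately conservative simplification and an assumption of this example: a NAL unit is kept if its temporal_id and dependency_id do not exceed the target, and, at the target dependency_id only, its quality_id does not exceed the target. A real MANE would also rewrite RTP headers and RTCP as noted above, and could additionally consult fields such as the discardable flag or priority_id. All names are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class LayerIds(NamedTuple):
    dependency_id: int
    quality_id: int
    temporal_id: int


def keep_for_target(ids: LayerIds, target: LayerIds) -> bool:
    """Decide whether a NAL unit is forwarded for the target operation point."""
    if ids.temporal_id > target.temporal_id:
        return False  # higher temporal levels are never depended upon
    if ids.dependency_id > target.dependency_id:
        return False  # higher dependency layers are not needed
    if (ids.dependency_id == target.dependency_id
            and ids.quality_id > target.quality_id):
        return False  # quality refinements above the target are not needed
    return True       # lower dependency layers are kept in full (conservative)


def thin(units: List[Tuple[LayerIds, bytes]], target: LayerIds):
    """Return the (ids, payload) pairs that survive thinning, in order."""
    return [u for u in units if keep_for_target(u[0], target)]
```

For a target of dependency_id 1, quality_id 0, temporal_id 1, a unit tagged (1, 1, 0) is dropped because of its quality_id, and a unit tagged (0, 0, 2) is dropped because of its temporal_id, while (0, 2, 1) and (1, 0, 1) are forwarded.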
1.2.2. Transmission Modes and Packetization Modes
This memo allows the packetization of SVC data for both single-
session transmission (SST) and multi-session transmission (MST). In
the case of SST all SVC data are carried in a single RTP session. In
the case of MST two or more RTP sessions are used to carry the SVC
data, in accordance with the MST-specific packetization modes defined
in this memo, which are based on the packetization modes defined in
[RFC6184]. In MST, each RTP session is associated with one RTP
stream, which may carry one or more layers.
The base layer is, by design, compatible to H.264/AVC. During
transmission, the associated prefix NAL units, which are introduced
by SVC and, when present, are ignored by H.264/AVC decoders, may be
encapsulated within the same RTP packet stream as the H.264/AVC VCL
NAL units or in a different RTP packet stream (when MST is used).
For convenience, the term "AVC base layer" is used to refer to the
base layer without prefix NAL units, while the term "SVC base layer"
is used to refer to the base layer with prefix NAL units.
Furthermore, the base layer may have multiple temporal components
(i.e., supporting different frame rates). As a result, the lowest
temporal component ("T0") of the AVC or SVC base layer is used as the
starting point of the SVC bitstream hierarchy.
This memo allows encapsulating in a given RTP stream any of the
following three alternatives of layer combinations:
1. the T0 AVC base layer or the T0 SVC base layer only;
2. one or more enhancement layers only; or
3. the T0 SVC base layer, and one or more enhancement layers.
SST should be used in point-to-point unicast applications and, in
general, whenever the potential benefit of using multiple RTP
sessions does not justify the added complexity. When SST is used,
the layer combination cases 1 and 3 above can be used. When an
H.264/AVC compatible subset of the SVC base layer is transmitted
using SST, the packetization of [RFC6184] must be used, thus ensuring
compatibility with [RFC6184] receivers. When, however, one or more
SVC quality or spatial enhancement layers are transmitted using SST,
the packetization defined in this memo must be used. In SST, any of
the three [RFC6184] packetization modes, namely, single NAL unit
mode, non-interleaved mode, and interleaved mode, can be used.
MST should be used in a multicast session when different receivers
may request different layers of the scalable bitstream. An operation
point for an SVC bitstream, as defined in this memo, corresponds to a
set of layers that together conform to one of the profiles defined in
Annex A or G of [H.264] and, when decoded, offer a representation of
the original video at a certain fidelity. The number of streams used
in MST should be at least equal to the number of operation points
that may be requested by the receivers. Depending on the
application, this may result in each layer being carried in its own
RTP session, or in having multiple layers encapsulated within one RTP
session.
Informative note: Layered multicast is a term commonly used to
describe the application where multicast is used to transmit
layered or scalable data that has been encapsulated into more than
one RTP session. This application allows different receivers in
the multicast session to receive different operation points of the
scalable bitstream. Layered multicast, among other application
examples, is discussed in more detail in Section 11.2.
When MST is used, any of the three layer combinations above can be
used for each of the sessions. When an H.264/AVC compatible subset
of the SVC base layer is transmitted in its own session in MST, the
packetization of [RFC6184] must be used, such that [RFC6184]
receivers can be part of the MST and receive only this session. For
MST, this memo defines four different MST-specific packetization
modes, namely, non-interleaved timestamp (NI-T) based mode, non-
interleaved CS-DON (NI-C) based mode, non-interleaved combined
timestamp and CS-DON mode (NI-TC), and interleaved CS-DON (I-C) based
mode (detailed in Section 4.5.2). The modes differ depending on
whether the SVC data are allowed to be interleaved, i.e., to be
transmitted in an order different than the intended decoding order,
and they also differ in the mechanisms provided in order to recover
the correct decoding order of the NAL units across the multiple RTP
sessions. These four MST modes reuse the packetization modes
introduced in [RFC6184] for the packetization of NAL units in each of
their individual RTP sessions.
As the names of the MST packetization modes imply, the NI-T, NI-C,
and NI-TC modes do not allow interleaved transmission, while the I-C
mode allows interleaved transmission. With any of the three non-
interleaved MST packetization modes, legacy [RFC6184] receivers with
implementation of the non-interleaved mode specified in [RFC6184] can
join a multi-session transmission of SVC, to receive the base RTP
session encapsulated according to [RFC6184].
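The properties of the four MST packetization modes described in this section can be summarized as a small lookup table. The sketch below is only a restatement of this overview in data form. The MstMode class, its field names, and the helper function are hypothetical and are not signaling parameters defined by this memo.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MstMode:
    name: str
    interleaved: bool         # may NAL units be sent out of decoding order?
    timestamp_recovery: bool  # decoding order recoverable via timestamps?
    cs_don_recovery: bool     # decoding order recoverable via CS-DON?


MST_MODES = {
    "NI-T": MstMode("NI-T", False, True, False),
    "NI-C": MstMode("NI-C", False, False, True),
    "NI-TC": MstMode("NI-TC", False, True, True),  # receiver may pick either
    "I-C": MstMode("I-C", True, False, True),
}


def legacy_non_interleaved_receiver_can_join(mode_name: str) -> bool:
    """True if an [RFC6184] non-interleaved-mode receiver can take the base
    RTP session of an MST using the given mode."""
    return not MST_MODES[mode_name].interleaved
```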
1.2.3. New Payload Structures
[RFC6184] specifies three basic payload structures, namely, single
NAL unit packet, aggregation packet, and fragmentation unit.
Depending on the basic payload structure, an RTP packet may contain a
NAL unit not aggregating other NAL units, one or more NAL units
aggregated in another NAL unit, or a fragment of a NAL unit not
aggregating other NAL units. Each NAL unit of a type specified in
[H.264] (i.e., 1 to 23, inclusive) may be carried in its entirety in
a single NAL unit packet, may be aggregated in an aggregation packet,
or may be fragmented and carried in a number of fragmentation unit
packets. To enable aggregation or fragmentation of NAL units while
still ensuring that the RTP packet payload is only composed of NAL
units, [RFC6184] introduced six new NAL unit types (24-29) to be used
as payload structures, selected from the NAL unit types left
unspecified in [H.264].
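The division of the NAL unit type space described above can be sketched as a simple classifier. The names given to types 24 to 29 are those used by [RFC6184] and are not restated in this section; values outside 1 to 29 are reported as "other" because this sketch does not cover them. The function name is illustrative.

```python
# Payload structure NAL unit types introduced by [RFC6184].
RFC6184_PAYLOAD_STRUCTURES = {
    24: "STAP-A",
    25: "STAP-B",
    26: "MTAP16",
    27: "MTAP24",
    28: "FU-A",
    29: "FU-B",
}


def classify_nal_unit_type(nal_unit_type: int) -> str:
    """Classify a 5-bit NAL unit type value (0..31)."""
    if not 0 <= nal_unit_type <= 31:
        raise ValueError("nal_unit_type is a 5-bit field")
    if 1 <= nal_unit_type <= 23:
        return "H.264 NAL unit"  # carried whole, aggregated, or fragmented
    if nal_unit_type in RFC6184_PAYLOAD_STRUCTURES:
        return RFC6184_PAYLOAD_STRUCTURES[nal_unit_type]
    return "other"  # not covered by this sketch
```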
This memo reuses all the payload structures used in [RFC6184].
Furthermore, three new types of NAL units are defined: payload
content scalability information (PACSI) NAL unit, empty NAL unit, and
non-interleaved multi-time aggregation packet (NI-MTAP) (specified in
Sections 4.9, 4.10, and 4.7.1, respectively).
PACSI NAL units may be used for the following purposes:
o To enable MANEs to decide whether to forward, process, or discard
aggregation packets, by checking in PACSI NAL units the
scalability information and other characteristics of the
aggregated NAL units, rather than looking into the aggregated NAL
units themselves, which are defined by the video coding
specification.
o To enable correct decoding order recovery in MST using the NI-C or
NI-TC mode, with the help of the CS-DON information included in
PACSI NAL units.
o To improve resilience to packet losses, e.g., by utilizing the
following data or information included in PACSI NAL units:
repeated Supplemental Enhancement Information (SEI) messages,
information regarding the start and end of layer representations,
and the indices to layer representations of the lowest temporal
Empty NAL units may be used to enable correct decoding order recovery
in MST using the NI-T or NI-TC mode. NI-MTAP NAL units may be used
to aggregate NAL units from multiple access units but without
interleaving.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in BCP 14, RFC 2119
[RFC2119].
This specification uses the notion of setting and clearing a bit when
bit fields are handled. Setting a bit is the same as assigning that
bit the value of 1 (On). Clearing a bit is the same as assigning
that bit the value of 0 (Off).
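As a non-normative illustration of this convention, the following sketch sets, clears, and tests a single bit of an integer field. The helper names are hypothetical, and the sketch assumes that position 0 denotes the least significant bit, which differs from the left-to-right bit numbering used in packet diagrams.

```python
def set_bit(value: int, position: int) -> int:
    """Return value with the bit at position assigned 1 (On)."""
    return value | (1 << position)


def clear_bit(value: int, position: int) -> int:
    """Return value with the bit at position assigned 0 (Off)."""
    return value & ~(1 << position)


def is_set(value: int, position: int) -> bool:
    """Report whether the bit at position currently has the value 1."""
    return (value >> position) & 1 == 1


if __name__ == "__main__":
    flags = set_bit(0, 2)       # bit 2 is now set
    flags = clear_bit(flags, 2) # bit 2 is cleared again
    print(is_set(flags, 2))
```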
3. Definitions and Abbreviations
3.1. Definitions
This document uses the terms and definitions of [H.264]. Section
3.1.1 lists relevant definitions copied from [H.264] for convenience.
When there is discrepancy, the definitions in [H.264] take
precedence. Section 3.1.2 gives definitions specific to this memo.
Some of the definitions in Section 3.1.2 are also present in
[RFC6184] and copied here with slight adaptations as needed.
3.1.1. Definitions from the SVC Specification
access unit: A set of NAL units always containing exactly one primary
coded picture. In addition to the primary coded picture, an access
unit may also contain one or more redundant coded pictures, one
auxiliary coded picture, or other NAL units not containing slices or
slice data partitions of a coded picture. The decoding of an access
unit always results in a decoded picture.
base layer: A bitstream subset that contains all the NAL units with
the nal_unit_type syntax element equal to 1 or 5 of the bitstream and
does not contain any NAL unit with the nal_unit_type syntax element
equal to 14, 15, or 20 and conforms to one or more of the profiles
specified in Annex A of [H.264].
base quality layer representation: The layer representation of the
target dependency representation of an access unit that is associated
with the quality_id syntax element equal to 0.
coded video sequence: A sequence of access units that consists, in
decoding order, of an IDR access unit followed by zero or more non-
IDR access units including all subsequent access units up to but not
including any subsequent IDR access unit.
dependency representation: A subset of Video Coding Layer (VCL) NAL
units within an access unit that are associated with the same value
of the dependency_id syntax element, which is provided as part of the
NAL unit header or by an associated prefix NAL unit. A dependency
representation consists of one or more layer representations.
IDR access unit: An access unit in which the primary coded picture is
an IDR picture.
IDR picture: Instantaneous decoding refresh picture. A coded picture
in which all slices of the target dependency representation within
the access unit are I or EI slices that causes the decoding process
to mark all reference pictures as "unused for reference" immediately
after decoding the IDR picture. After the decoding of an IDR picture
all following coded pictures in decoding order can be decoded without
inter prediction from any picture decoded prior to the IDR picture.
The first picture of each coded video sequence is an IDR picture.
layer representation: A subset of VCL NAL units within an access unit
that are associated with the same values of the dependency_id and
quality_id syntax elements, which are provided as part of the VCL NAL
unit header or by an associated prefix NAL unit. One or more layer
representations represent a dependency representation.
prefix NAL unit: A NAL unit with nal_unit_type equal to 14 that
immediately precedes in decoding order a NAL unit with nal_unit_type
equal to 1, 5, or 12. The NAL unit that immediately succeeds in
decoding order the prefix NAL unit is referred to as the associated
NAL unit. The prefix NAL unit contains data associated with the
associated NAL unit, which are considered to be part of the
associated NAL unit.
reference base picture: A reference picture that is obtained by
decoding a base quality layer representation with the nal_ref_idc
syntax element not equal to 0 and the store_ref_base_pic_flag syntax
element equal to 1 of an access unit and all layer representations of
the access unit that are referred to by inter-layer prediction of the
base quality layer representation. A reference base picture is not
an output of the decoding process, but the samples of a reference
base picture may be used for inter prediction in the decoding process
of subsequent pictures in decoding order. Reference base picture is
a collective term for a reference base field or a reference base
frame.
scalable bitstream: A bitstream with the property that one or more
bitstream subsets that are not identical to the scalable bitstream
form another bitstream that conforms to the SVC specification
[H.264].
target dependency representation: The dependency representation of an
access unit that is associated with the largest value of the
dependency_id syntax element for all dependency representations of
the access unit.
target layer representation: The layer representation of the target
dependency representation of an access unit that is associated with
the largest value of the quality_id syntax element for all layer
representations of the target dependency representation of the access
unit.
3.1.2. Definitions Specific to This Memo
anchor layer representation: A layer representation such that, if
decoding of the operation point corresponding to the layer starts
from the access unit containing this layer representation, all
following layer representations of the layer, in output order, can
be correctly decoded. The output
order is defined in [H.264] as the order in which decoded pictures
are output from the decoded picture buffer of the decoder. As H.264
does not specify the picture display process, this more general term
is used instead of display order. An anchor layer representation is
a random access point to the layer to which the anchor layer
representation belongs. However, some layer representations,
succeeding an anchor
layer representation in decoding order but preceding the anchor layer
representation in output order, may refer to earlier layer
representations for inter prediction, and hence the decoding may be
incorrect if random access is performed at the anchor layer
representation.
AVC base layer: The subset of the SVC base layer in which all prefix
NAL units (type 14) are removed. Note that this is equivalent to the
term "base layer" as defined in Annex G of [H.264].
base RTP session: When multi-session transmission is used, the RTP
session that carries the RTP stream containing the T0 AVC base layer
or the T0 SVC base layer, and zero or more enhancement layers. This
RTP session does not depend on any other RTP session as indicated by
mechanisms defined in Section 7.2.3. The base RTP session may carry
NAL units of NAL unit type equal to 14 and 15.
decoding order number (DON): A field in the payload structure or a
derived variable indicating NAL unit decoding order. Values of DON
are in the range of 0 to 65535, inclusive. After reaching the
maximum value, the value of DON wraps around to 0. Note that this
definition also exists in [RFC6184] in exactly the same form.
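Because DON values wrap around, two values cannot be compared with a plain less-than test. The following non-normative sketch computes a signed distance between two DON values, modeled on the case analysis of the don_diff function in [RFC6184]. The half-range threshold of 32768 and the helper name next_don are assumptions of this sketch rather than part of the definition above.

```python
DON_MODULUS = 65536          # DON values are in the range 0..65535
HALF_RANGE = DON_MODULUS // 2


def next_don(don: int) -> int:
    """Return the DON following don, wrapping from 65535 to 0."""
    return (don + 1) % DON_MODULUS


def don_diff(don_m: int, don_n: int) -> int:
    """Signed distance from m to n; positive if n follows m in decoding order."""
    for don in (don_m, don_n):
        if not 0 <= don < DON_MODULUS:
            raise ValueError("DON out of range")
    if don_m == don_n:
        return 0
    if don_m < don_n:
        gap = don_n - don_m
        # A large forward gap is read as n preceding m across a wrap.
        return gap if gap < HALF_RANGE else -(don_m + DON_MODULUS - don_n)
    gap = don_m - don_n
    # A large backward gap is read as n following m across a wrap.
    return (DON_MODULUS - don_m + don_n) if gap >= HALF_RANGE else -gap


if __name__ == "__main__":
    print(don_diff(65535, 0))  # n follows m across the wrap-around
```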
Empty NAL unit: A NAL unit with NAL unit type equal to 31 and sub-
type equal to 1. An empty NAL unit consists of only the two-byte NAL
unit header with an empty payload.
enhancement RTP session: When multi-session transmission is used, an
RTP session that is not the base RTP session. An enhancement RTP
session typically contains an RTP stream that depends on at least one
other RTP session as indicated by mechanisms defined in Section
7.2.3. A lower RTP session to an enhancement RTP session is an RTP
session on which the enhancement RTP session depends. The lowest RTP
session for a receiver is the RTP session that does not depend on any
other RTP session received by the receiver. The highest RTP session
for a receiver is the RTP session on which no other RTP session
received by the receiver depends.
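The notions of lowest and highest RTP session for a receiver can be sketched as follows. This is a non-normative illustration: the session names and the dictionary representation of the signaled dependencies are hypothetical, and the sketch assumes that the dependencies have already been obtained through the mechanisms of Section 7.2.3.

```python
from typing import Dict, List, Set, Tuple


def lowest_and_highest(depends_on: Dict[str, Set[str]],
                       received: Set[str]) -> Tuple[List[str], List[str]]:
    """Return (lowest, highest) sessions among those a receiver receives.

    depends_on maps a session name to the sessions it depends on.
    Only dependencies on sessions the receiver actually receives count.
    """
    deps = {s: depends_on.get(s, set()) & received for s in received}
    # Lowest: depends on no other received session.
    lowest = sorted(s for s in received if not deps[s])
    # Highest: no other received session depends on it.
    depended_upon = set().union(*deps.values())
    highest = sorted(s for s in received if s not in depended_upon)
    return lowest, highest


if __name__ == "__main__":
    signaled = {
        "base": set(),
        "enh1": {"base"},
        "enh2": {"base", "enh1"},
    }
    print(lowest_and_highest(signaled, {"base", "enh1"}))
```

Note that a receiver that does not receive the base RTP session still has a lowest RTP session under this definition, namely the received session with no received dependency.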
cross-session decoding order number (CS-DON): A derived variable
indicating NAL unit decoding order number over all NAL units within
all the session-multiplexed RTP sessions that carry the same SVC
bitstream.
default level: The level indicated by the profile-level-id parameter.
In Session Description Protocol (SDP) Offer/Answer, the level is
downgradable, i.e., the answer may either use the default level or a
lower level. Note that this definition also exists in [RFC6184] in a
slightly different form.
default sub-profile: The subset of coding tools, which may be all
coding tools of one profile or the common subset of coding tools of
more than one profile, indicated by the profile-level-id parameter.
In SDP Offer/Answer, the default sub-profile must be used in a
symmetric manner, i.e., the answer must either use the same sub-
profile as the offer or reject the offer. Note that this definition
also exists in [RFC6184] in a slightly different form.
enhancement layer: A layer in which at least one of the values of
dependency_id or quality_id is higher than 0, or a layer in which
none of the NAL units is associated with the value of temporal_id
equal to 0. An operation point constructed using the maximum
temporal_id, dependency_id, and quality_id values associated with an
enhancement layer may or may not conform to one or more of the
profiles specified in Annex A of [H.264].
H.264/AVC compatible: The property of a bitstream subset of
conforming to one or more of the profiles specified in Annex A of
[H.264].
intra layer representation: A layer representation that contains
only slices that use intra prediction, and hence do not refer to any
earlier layer representation in decoding order in the same layer.
Note that in SVC intra prediction includes intra-layer intra
prediction as well as inter-layer intra prediction.
layer: A bitstream subset in which all NAL units of type 1, 5, 12,
14, or 20 have the same values of dependency_id and quality_id,
either directly through their NAL unit header (for NAL units of type
14 or 20) or through association to a prefix (type 14) NAL unit (for
NAL unit type 1, 5, or 12). A layer may contain NAL units associated
with more than one value of temporal_id.
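The association rule in this definition can be illustrated with a non-normative sketch that walks NAL units in decoding order and groups them into layers keyed by (dependency_id, quality_id). The Nal record and the function name are hypothetical. The sketch assumes that a NAL unit of type 1, 5, or 12 with no immediately preceding prefix NAL unit belongs to the layer (0, 0).

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Optional, Tuple


class Nal(NamedTuple):
    nal_type: int
    did: Optional[int] = None  # dependency_id, carried only by types 14/20
    qid: Optional[int] = None  # quality_id, carried only by types 14/20


def group_into_layers(nals: Iterable[Nal]) -> Dict[Tuple[int, int], List[Nal]]:
    """Group NAL units of type 1, 5, 12, 14, or 20 by (did, qid)."""
    layers = defaultdict(list)
    pending_prefix = None              # ids of an immediately preceding prefix
    for nal in nals:
        if nal.nal_type == 14:         # prefix NAL unit carries its own ids
            key = (nal.did, nal.qid)
            pending_prefix = key
        elif nal.nal_type == 20:       # SVC slice carries its own ids
            key = (nal.did, nal.qid)
            pending_prefix = None
        elif nal.nal_type in (1, 5, 12):
            # Inherit from the prefix; (0, 0) is an assumption of this sketch.
            key = pending_prefix if pending_prefix is not None else (0, 0)
            pending_prefix = None
        else:
            pending_prefix = None      # other types break the association
            continue
        layers[key].append(nal)
    return dict(layers)


if __name__ == "__main__":
    stream = [Nal(14, 0, 0), Nal(5), Nal(20, 1, 0), Nal(20, 1, 1)]
    print(sorted(group_into_layers(stream)))
```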
media-aware network element (MANE): A network element, such as a
middlebox or application layer gateway, that is capable of parsing
certain aspects of the RTP payload headers or the RTP payload and
reacting to their contents. Note that this definition also exists in
[RFC6184] in exactly the same form.
Informative note: The concept of a MANE goes beyond normal routers
or gateways in that a MANE has to be aware of the signaling (e.g.,
to learn about the payload type mappings of the media streams),
and in that it has to be trusted when working with Secure Real-
time Transport Protocol (SRTP). The advantage of using MANEs is
that they allow packets to be dropped according to the needs of
the media coding. For example, if a MANE has to drop packets due
to congestion on a certain link, it can identify and remove those
packets whose elimination produces the least adverse effect on the
user experience. After dropping packets, MANEs must rewrite RTCP
packets to match the changes to the RTP packet stream as specified
in Section 7 of [RFC3550].
multi-session transmission: The transmission mode in which the SVC
stream is transmitted over multiple RTP sessions. Dependency between
RTP sessions MUST be signaled according to Section 7.2.3 of this
memo.
NAL unit decoding order: A NAL unit order that conforms to the
constraints on NAL unit order given in Section G.7.4.1.2 in [H.264].
Note that this definition also exists in [RFC6184] in a slightly
different form.
NALU-time: The value that the RTP timestamp would have if the NAL
unit were transported in its own RTP packet. Note that this
definition also exists in [RFC6184] in exactly the same form.
operation point: An operation point is identified by a set of values
of temporal_id, dependency_id, and quality_id. A bitstream
corresponding to an operation point can be constructed by removing
all NAL units associated with a higher value of dependency_id, and
all NAL units associated with the same value of dependency_id but
higher values of quality_id or temporal_id. An operation point
bitstream conforms to at least one of the profiles defined in Annex A
or G of [H.264], and offers a representation of the original video
signal at a certain fidelity.
Informative note: Additional NAL units may be removed (with lower
dependency_id or same dependency_id but lower quality_id) if they
are not required for decoding the bitstream at the particular
operation point. The resulting bitstream, however, may no longer
conform to any of the profiles defined in Annex A or G of [H.264].
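The removal rule in the operation point definition can be sketched as a simple filter. This is a non-normative illustration that applies the rule literally: it does not perform the additional removals mentioned in the informative note, and the record and function names are hypothetical.

```python
from typing import Iterable, List, NamedTuple


class NalInfo(NamedTuple):
    dependency_id: int
    quality_id: int
    temporal_id: int


def extract_operation_point(nals: Iterable[NalInfo],
                            target: NalInfo) -> List[NalInfo]:
    """Keep the NAL units that form the bitstream of the target operation point."""
    kept = []
    for nal in nals:
        # Remove all NAL units with a higher dependency_id.
        if nal.dependency_id > target.dependency_id:
            continue
        # At the same dependency_id, remove higher quality_id or temporal_id.
        if nal.dependency_id == target.dependency_id and (
                nal.quality_id > target.quality_id
                or nal.temporal_id > target.temporal_id):
            continue
        kept.append(nal)
    return kept


if __name__ == "__main__":
    stream = [NalInfo(0, 0, 0), NalInfo(1, 0, 0), NalInfo(1, 1, 0)]
    print(extract_operation_point(stream, NalInfo(1, 0, 0)))
```

Under this literal reading, a NAL unit with a lower dependency_id is kept regardless of its temporal_id. An implementation may choose to remove such NAL units as well, as the informative note permits.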
operation point representation: The set of all NAL units of an
operation point within the same access unit.
RTP packet stream: A sequence of RTP packets with increasing sequence
numbers (except for wrap-around), identical payload type and
identical SSRC (Synchronization Source), carried in one RTP session.
Within the scope of this memo, one RTP packet stream is utilized to
transport one or more layers.
single-session transmission: The transmission mode in which the SVC
bitstream is transmitted over a single RTP session.
SVC base layer: The layer that includes all NAL units associated with
dependency_id and quality_id values both equal to 0, including prefix
NAL units (NAL unit type 14).
SVC enhancement layer: A layer in which at least one of the values of
dependency_id or quality_id is higher than 0. An operation point
constructed using the maximum dependency_id and quality_id values and
any temporal_id value associated with an SVC enhancement layer does
not conform to any of the profiles specified in Annex A of [H.264].
SVC NAL unit: A NAL unit of NAL unit type 14, 15, or 20 as specified
in Annex G of [H.264].
SVC NAL unit header: A four-byte header consisting of the one-byte
NAL unit header followed by the three-byte SVC-specific header
extension that is present in NAL unit types 14 and 20.
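As a non-normative illustration, the following sketch extracts the fields of a four-byte SVC NAL unit header. The field order and widths used here follow the NAL unit header SVC extension syntax of Annex G of [H.264]. They are not stated in this definition and should be verified against that syntax. The function name is hypothetical.

```python
from typing import Dict


def parse_svc_nal_header(data: bytes) -> Dict[str, int]:
    """Parse the one-byte NAL unit header plus the three-byte SVC extension."""
    if len(data) < 4:
        raise ValueError("an SVC NAL unit header is four bytes long")
    nal_type = data[0] & 0x1F
    if nal_type not in (14, 20):
        raise ValueError("only NAL unit types 14 and 20 carry the extension")
    return {
        # Byte 0: F (1 bit), NRI (2 bits), Type (5 bits)
        "forbidden_zero_bit": data[0] >> 7,
        "nal_ref_idc": (data[0] >> 5) & 0x03,
        "nal_unit_type": nal_type,
        # Byte 1: reserved (1 bit), idr_flag (1 bit), priority_id (6 bits)
        "idr_flag": (data[1] >> 6) & 0x01,
        "priority_id": data[1] & 0x3F,
        # Byte 2: no_inter_layer_pred_flag (1), dependency_id (3), quality_id (4)
        "no_inter_layer_pred_flag": data[2] >> 7,
        "dependency_id": (data[2] >> 4) & 0x07,
        "quality_id": data[2] & 0x0F,
        # Byte 3: temporal_id (3), three one-bit flags, reserved (2 bits)
        "temporal_id": data[3] >> 5,
        "use_ref_base_pic_flag": (data[3] >> 4) & 0x01,
        "discardable_flag": (data[3] >> 3) & 0x01,
        "output_flag": (data[3] >> 2) & 0x01,
    }


if __name__ == "__main__":
    header = parse_svc_nal_header(bytes([0x74, 0xC5, 0x21, 0x6F]))
    print(header["dependency_id"], header["quality_id"], header["temporal_id"])
```

The three identifiers returned here are the values on which the layer and operation point definitions in this section are based.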
SVC RTP session: Either the base RTP session or an enhancement RTP
session.
T0 AVC base layer: A subset of the AVC base layer constructed by
removing all VCL NAL units associated with temporal_id values higher
than 0 and non-VCL NAL units and SEI messages associated only with
the VCL NAL units being removed.
T0 SVC base layer: A subset of the SVC base layer constructed by
removing all VCL NAL units associated with temporal_id values higher
than 0 as well as prefix NAL units, non-VCL NAL units, and SEI
messages associated only with the VCL NAL units being removed.
transmission order: The order of packets in ascending RTP sequence
number order (in modulo arithmetic). Within an aggregation packet,
the NAL unit transmission order is the same as the order of
appearance of NAL units in the packet. Note that this definition
also exists in [RFC6184] in exactly the same form.
3.2. Abbreviations
In addition to the abbreviations defined in [RFC6184], the following
abbreviations are used in this memo.
CGS: Coarse-Grain Scalability
CS-DON: Cross-Session Decoding Order Number
MGS: Medium-Grain Scalability
MST: Multi-Session Transmission
PACSI: Payload Content Scalability Information
SST: Single-Session Transmission
SNR: Signal-to-Noise Ratio
SVC: Scalable Video Coding