2. Protocol Overview
This section provides an informative overview of the different
mechanisms in the RTSP 2.0 protocol to give the reader a high-level
understanding before getting into all the specific details. In case
of conflict with this description and the later sections, the later
sections take precedence. For more information about use cases
considered for RTSP, see Appendix E.
RTSP 2.0 is a bidirectional request and response protocol that first
establishes a context including content resources (the media) and
then controls the delivery of these content resources from the
provider to the consumer. RTSP has three fundamental parts: Session
Establishment, Media Delivery Control, and an extensibility model
described below. The protocol is based on some assumptions about
existing functionality to provide a complete solution for client-
controlled real-time media delivery.
RTSP uses text-based messages, requests and responses, that may
contain a binary message body. An RTSP request starts with a method
line that identifies the method, the protocol, and version and the
resource on which to act. The resource is identified by a URI and
the hostname part of the URI is used by RTSP client to resolve the
IPv4 or IPv6 address of the RTSP server. Following the method line
are a number of RTSP headers. These lines are ended by two
consecutive carriage return line feed (CRLF) character pairs. The
message body, if present, follows the two CRLF character pairs, and
the body's length is described by a message header. RTSP responses
are similar, but they start with a response line with the protocol
and version followed by a status code and a reason phrase. RTSP
messages are sent over a reliable transport protocol between the
client and server. RTSP 2.0 requires clients and servers to
implement TCP and TLS over TCP as mandatory transports for RTSP
2.1. Presentation Description
RTSP exists to provide access to multimedia presentations and content
but tries to be agnostic about the media type or the actual media
delivery protocol that is used. To enable a client to implement a
complete system, an RTSP-external mechanism for describing the
presentation and the delivery protocol(s) is used. RTSP assumes that
this description is either delivered completely out of band or as a
data object in the response to a client's request using the DESCRIBE
method (Section 13.2).
Parameters that commonly have to be included in the presentation
description are the following:
o The number of media streams;
o the resource identifier for each media stream/resource that is to
be controlled by RTSP;
o the protocol that will be used to deliver each media stream;
o the transport protocol parameters that are not negotiated or vary
with each client;
o the media-encoding information enabling a client to correctly
decode the media upon reception; and
o an aggregate control resource identifier.
RTSP uses its own URI schemes ("rtsp" and "rtsps") to reference media
resources and aggregates under common control (see Section 4.2).
This specification describes in Appendix D how one uses SDP [RFC4566]
for describing the presentation.
2.2. Session Establishment
The RTSP client can request the establishment of an RTSP session
after having used the presentation description to determine which
media streams are available, which media delivery protocol is used,
and the resource identifiers of the media streams. The RTSP session
is a common context between the client and the server that consists
of one or more media resources that are to be under common media
The client creates an RTSP session by sending a request using the
SETUP method (Section 13.3) to the server. In the Transport header
(Section 18.54) of the SETUP request, the client also includes all
the transport parameters necessary to enable the media delivery
protocol to function. This includes parameters that are
preestablished by the presentation description but necessary for any
middlebox to correctly handle the media delivery protocols. The
Transport header in a request may contain multiple alternatives for
media delivery in a prioritized list, which the server can select
from. These alternatives are typically based on information in the
When receiving a SETUP request, the server determines if the media
resource is available and if one or more of the of the transport
parameter specifications are acceptable. If that is successful, an
RTSP session context is created and the relevant parameters and state
is stored. An identifier is created for the RTSP session and
included in the response in the Session header (Section 18.49). The
SETUP response includes a Transport header that specifies which of
the alternatives has been selected and relevant parameters.
A SETUP request that references an existing RTSP session but
identifies a new media resource is a request to add that media
resource under common control with the already-present media
resources in an aggregated session. A client can expect this to work
for all media resources under RTSP control within a multimedia
content container. However, a server will likely refuse to aggregate
resources from different content containers. Even if an RTSP session
contains only a single media stream, the RTSP session can be
referenced by the aggregate control URI.
To avoid an extra round trip in the session establishment of
aggregated RTSP sessions, RTSP 2.0 supports pipelined requests; i.e.,
the client can send multiple requests back-to-back without waiting
first for the completion of any of them. The client uses a client-
selected identifier in the Pipelined-Requests header (Section 18.33)
to instruct the server to bind multiple requests together as if they
included the session identifier.
The SETUP response also provides additional information about the
established sessions in a couple of different headers. The Media-
Properties header (Section 18.29) includes a number of properties
that apply for the aggregate that is valuable when doing media
delivery control and configuring user interface. The Accept-Ranges
header (Section 18.5) informs the client about range formats that the
server supports for these media resources. The Media-Range header
(Section 18.30) informs the client about the time range of the media
2.3. Media Delivery Control
After having established an RTSP session, the client can start
controlling the media delivery. The basic operations are "begin
playback", using the PLAY method (Section 13.4) and "suspend (pause)
playback" by using the PAUSE method (Section 13.6). PLAY also allows
for choosing the starting media position from which the server should
deliver the media. The positioning is done by using the Range header
(Section 18.40) that supports several different time formats: Normal
Play Time (NPT) (Section 4.4.2), Society of Motion Picture and
Television Engineers (SMPTE) Timestamps (Section 4.4.1), and absolute
time (Section 4.4.3). The Range header also allows the client to
specify a position where delivery should end, thus allowing a
specific interval to be delivered.
The support for positioning/searching within media content depends on
the content's media properties. Content exists in a number of
different types, such as on-demand, live, and live with simultaneous
recording. Even within these categories, there are differences in
how the content is generated and distributed, which affect how it can
be accessed for playback. The properties applicable for the RTSP
session are provided by the server in the SETUP response using the
Media-Properties header (Section 18.29). These are expressed using
one or several independent attributes. A first attribute is Random-
Access, which indicates whether positioning is possible, and with
what granularity. Another aspect is whether the content will change
during the lifetime of the session. While on-demand content will be
provided in full from the beginning, a live stream being recorded
results in the length of the accessible content growing as the
session goes on. There also exists content that is dynamically built
by a protocol other than RTSP and, thus, also changes in steps during
the session, but maybe not continuously. Furthermore, when content
is recorded, there are cases where the complete content is not
maintained, but, for example, only the last hour. All of these
properties result in the need for mechanisms that will be discussed
When the client accesses on-demand content that allows random access,
the client can issue the PLAY request for any point in the content
between the start and the end. The server will deliver media from
the closest random access point prior to the requested point and
indicate that in its PLAY response. If the client issues a PAUSE,
the delivery will be halted and the point at which the server stopped
will be reported back in the response. The client can later resume
by sending a PLAY request without a Range header. When the server is
about to complete the PLAY request by delivering the end of the
content or the requested range, the server will send a PLAY_NOTIFY
request (Section 13.5) indicating this.
When playing live content with no extra functions, such as recording,
the client will receive the live media from the server after having
sent a PLAY request. Seeking in such content is not possible as the
server does not store it, but only forwards it from the source of the
session. Thus, delivery continues until the client sends a PAUSE
request, tears down the session, or the content ends.
For live sessions that are being recorded, the client will need to
keep track of how the recording progresses. Upon session
establishment, the client will learn the current duration of the
recording from the Media-Range header. Because the recording is
ongoing, the content grows in direct relation to the time passed.
Therefore, each server's response to a PLAY request will contain the
current Media-Range header. The server should also regularly send
(approximately every 5 minutes) the current media range in a
PLAY_NOTIFY request (Section 13.5.2). If the live transmission ends,
the server must send a PLAY_NOTIFY request with the updated Media-
Properties indicating that the content stopped being a recorded live
session and instead became on-demand content; the request also
contains the final media range. While the live delivery continues,
the client can request to play the current live point by using the
NPT timescale symbol "now", or it can request a specific point in the
available content by an explicit range request for that point. If
the requested point is outside of the available interval, the server
will adjust the position to the closest available point, i.e., either
at the beginning or the end.
A special case of recording is that where the recording is not
retained longer than a specific time period; thus, as the live
delivery continues, the client can access any media within a moving
window that covers, for example, "now" to "now" minus 1 hour. A
client that pauses on a specific point within the content may not be
able to retrieve the content anymore. If the client waits too long
before resuming the pause point, the content may no longer be
available. In this case, the pause point will be adjusted to the
closest point in the available media.
2.4. Session Parameter Manipulations
A session may have additional state or functionality that affects how
the server or client treats the session or content, how it functions,
or feedback on how well the session works. Such extensions are not
defined in this specification, but they may be covered in various
extensions. RTSP has two methods for retrieving and setting
parameter values on either the client or the server: GET_PARAMETER
(Section 13.8) and SET_PARAMETER (Section 13.9). These methods carry
the parameters in a message body of the appropriate format. One can
also use headers to query state with the GET_PARAMETER method. As an
example, clients needing to know the current media range for a time-
progressing session can use the GET_PARAMETER method and include the
media range. Furthermore, synchronization information can be
requested by using a combination of RTP-Info (Section 18.45) and
Range (Section 18.40).
RTSP 2.0 does not have a strong mechanism for negotiating the headers
or parameters and their formats. However, responses will indicate
request-headers or parameters that are not supported. A priori
determination of what features are available needs to be done through
out-of-band mechanisms, like the session description, or through the
usage of feature tags (Section 4.5).
2.5. Media Delivery
This document specifies how media is delivered with RTP [RFC3550]
over UDP [RFC768], TCP [RFC793], or the RTSP connection. Additional
protocols may be specified in the future as needed.
The usage of RTP as a media delivery protocol requires some
additional information to function well. The PLAY response contains
information to enable reliable and timely delivery of how a client
should synchronize different sources in the different RTP sessions.
It also provides a mapping between RTP timestamps and the content-
time scale. When the server wants to notify the client about the
completion of the media delivery, it sends a PLAY_NOTIFY request to
the client. The PLAY_NOTIFY request includes information about the
stream end, including the last RTP sequence number for each stream,
thus enabling the client to empty the buffer smoothly.
2.5.1. Media Delivery Manipulations
The basic playback functionality of RTSP enables delivery of a range
of requested content to the client at the pace intended by the
content's creator. However, RTSP can also manipulate the delivery to
the client in two ways.
Scale: The ratio of media-content time delivered per unit of
Speed: The ratio of playback time delivered per unit of wallclock
Both affect the media delivery per time unit. However, they
manipulate two independent timescales and the effects are possible to
Scale (Section 18.46) is used for fast-forward or slow-motion control
as it changes the amount of content timescale that should be played
back per time unit. Scale > 1.0, means fast forward, e.g., scale =
2.0 results in that 2 seconds of content being played back every
second of playback. Scale = 1.0 is the default value that is used if
no scale is specified, i.e., playback at the content's original rate.
Scale values between 0 and 1.0 provide for slow motion. Scale can be
negative to allow for reverse playback in either regular pace
(scale = -1.0), fast backwards (scale < -1.0), or slow-motion
backwards (-1.0 < scale < 0). Scale = 0 would be equal to pause and
is not allowed.
In most cases, the realization of scale means server-side
manipulation of the media to ensure that the client can actually play
it back. The nature of these media manipulations and when they are
needed is highly media-type dependent. Let's consider two common
media types, audio and video.
It is very difficult to modify the playback rate of audio.
Typically, no more than a factor of two is possible while maintaining
intelligibility by changing the pitch and rate of speech. Music goes
out of tune if one tries to manipulate the playback rate by
resampling it. This is a well-known problem, and audio is commonly
muted or played back in short segments with skips to keep up with the
current playback point.
For video, it is possible to manipulate the frame rate, although the
rendering capabilities are often limited to certain frame rates.
Also, the allowed bitrates in decoding, the structure used in the
encoding, and the dependency between frames and other capabilities of
the rendering device limits the possible manipulations. Therefore,
the basic fast-forward capabilities often are implemented by
selecting certain subsets of frames.
Due to the media restrictions, the possible scale values are commonly
restricted to the set of realizable scale ratios. To enable the
clients to select from the possible scale values, RTSP can signal the
supported scale ratios for the content. To support aggregated or
dynamic content, where this may change during the ongoing session and
dependent on the location within the content, a mechanism for
updating the media properties and the scale factor currently in use,
Speed (Section 18.50) affects how much of the playback timeline is
delivered in a given wallclock period. The default is Speed = 1
which means to deliver at the same rate the media is consumed.
Speed > 1 means that the receiver will get content faster than it
regularly would consume it. Speed < 1 means that delivery is slower
than the regular media rate. Speed values of 0 or lower have no
meaning and are not allowed. This mechanism enables two general
functionalities. One is client-side scale operations, i.e., the
client receives all the frames and makes the adjustment to the
playback locally. The second is delivery control for the buffering
of media. By specifying a speed over 1.0, the client can build up
the amount of playback time it has present in its buffers to a level
that is sufficient for its needs.
A naive implementation of Speed would only affect the transmission
schedule of the media and has a clear impact on the needed bandwidth.
This would result in the data rate being proportional to the speed
factor. Speed = 1.5, i.e., 50% faster than normal delivery, would
result in a 50% increase in the data-transport rate. Whether or not
that can be supported depends solely on the underlying network path.
Scale may also have some impact on the required bandwidth due to the
manipulation of the content in the new playback schedule. An example
is fast forward where only the independently decodable intra-frames
are included in the media stream. This usage of solely intra-frames
increases the data rate significantly compared to a normal sequence
with the same number of frames, where most frames are encoded using
This potential increase of the data rate needs to be handled by the
media sender. The client has requested that the media be delivered
in a specific way, which should be honored. However, the media
sender cannot ignore if the network path between the sender and the
receiver can't handle the resulting media stream. In that case, the
media stream needs to be adapted to fit the available resources of
the path. This can result in a reduced media quality.
The need for bitrate adaptation becomes especially problematic in
connection with the Speed semantics. If the goal is to fill up the
buffer, the client may not want to do that at the cost of reduced
quality. If the client wants to make local playout changes, then it
may actually require that the requested speed be honored. To resolve
this issue, Speed uses a range so that both cases can be supported.
The server is requested to use the highest possible speed value
within the range, which is compatible with the available bandwidth.
As long as the server can maintain a speed value within the range, it
shall not change the media quality, but instead modify the actual
delivery rate in response to available bandwidth and reflect this in
the Speed value in the response. However, if this is not possible,
the server should instead modify the media quality to respect the
lowest speed value and the available bandwidth.
This functionality enables the local scaling implementation to use a
tight range, or even a range where the lower bound equals the upper
bound, to identify that it requires the server to deliver the
requested amount of media time per delivery time, independent of how
much it needs to adapt the media quality to fit within the available
path bandwidth. For buffer filling, it is suitable to use a range
with a reasonable span and with a lower bound at the nominal media
rate 1.0, such as 1.0 - 2.5. If the client wants to reduce the
buffer, it can specify an upper bound that is below 1.0 to force the
server to deliver slower than the nominal media rate.
2.6. Session Maintenance and Termination
The session context that has been established is kept alive by having
the client show liveness. This is done in two main ways:
o Media-transport protocol keep-alive. RTP Control Protocol (RTCP)
may be used when using RTP.
o Any RTSP request referencing the session context.
Section 10.5 discusses the methods for showing liveness in more
depth. If the client fails to show liveness for more than the
established session timeout value (normally 60 seconds), the server
may terminate the context. Other values may be selected by the
server through the inclusion of the timeout parameter in the session
The session context is normally terminated by the client sending a
TEARDOWN request (Section 13.7) to the server referencing the
aggregated control URI. An individual media resource can be removed
from a session context by a TEARDOWN request referencing that
particular media resource. If all media resources are removed from a
session context, the session context is terminated.
A client may keep the session alive indefinitely if allowed by the
server; however, a client is advised to release the session context
when an extended period of time without media delivery activity has
passed. The client can re-establish the session context if required
later. What constitutes an extended period of time is dependent on
the client, server, and their usage. It is recommended that the
client terminate the session before ten times the session timeout
value has passed. A server may terminate the session after one
session timeout period without any client activity beyond keep-alive.
When a server terminates the session context, it does so by sending a
TEARDOWN request indicating the reason.
A server can also request that the client tear down the session and
re-establish it at an alternative server, as may be needed for
maintenance. This is done by using the REDIRECT method
(Section 13.10). The Terminate-Reason header (Section 18.52) is used
to indicate when and why. The Location header indicates where it
should connect if there is an alternative server available. When the
deadline expires, the server simply stops providing the service. To
achieve a clean closure, the client needs to initiate session
termination prior to the deadline. In case the server has no other
server to redirect to, and it wants to close the session for
maintenance, it shall use the TEARDOWN method with a Terminate-Reason
2.7. Extending RTSP
RTSP is quite a versatile protocol that supports extensions in many
different directions. Even this core specification contains several
blocks of functionality that are optional to implement. The use case
and need for the protocol deployment should determine what parts are
implemented. Allowing for extensions makes it possible for RTSP to
address additional use cases. However, extensions will affect the
interoperability of the protocol; therefore, it is important that
they can be added in a structured way.
The client can learn the capability of a server by using the OPTIONS
method (Section 13.1) and the Supported header (Section 18.51). It
can also try and possibly fail using new methods or require that
particular features be supported using the Require (Section 18.43) or
Proxy-Require (Section 18.37) header.
The RTSP, in itself, can be extended in three ways, listed here in
increasing order of the magnitude of changes supported:
o Existing methods can be extended with new parameters, for example,
headers, as long as these parameters can be safely ignored by the
recipient. If the client needs negative acknowledgment when a
method extension is not supported, a tag corresponding to the
extension may be added in the field of the Require or Proxy-
o New methods can be added. If the recipient of the message does
not understand the request, it must respond with error code 501
(Not Implemented) so that the sender can avoid using this method
again. A client may also use the OPTIONS method to inquire about
methods supported by the server. The server must list the methods
it supports using the Public response-header.
o A new version of the protocol can be defined, allowing almost all
aspects (except the position of the protocol version number) to
change. A new version of the protocol must be registered through
a Standards Track document.
The basic capability discovery mechanism can be used to both discover
support for a certain feature and to ensure that a feature is
available when performing a request. For a detailed explanation of
this, see Section 11.
New media delivery protocols may be added and negotiated at session
establishment, in addition to extensions to the core protocol.
Certain types of protocol manipulations can be done through parameter
formats using SET_PARAMETER and GET_PARAMETER.
3. Document Conventions
3.1. Notational Conventions
All the mechanisms specified in this document are described in both
prose and the Augmented Backus-Naur form (ABNF) described in detail
Indented paragraphs are used to provide informative background and
motivation. This is intended to give readers who were not involved
with the formulation of the specification an understanding of why
things are the way they are in RTSP.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
"OPTIONAL" in this document are to be interpreted as described in
The word, "unspecified" is used to indicate functionality or features
that are not defined in this specification. Such functionality
cannot be used in a standardized manner without further definition in
an extension specification to RTSP.
Aggregate control: The concept of controlling multiple streams using
a single timeline, generally one maintained by the server. A
client, for example, uses aggregate control when it issues a
single play or pause message to simultaneously control both the
audio and video in a movie. A session that is under aggregate
control is referred to as an "aggregated session".
Aggregate control URI: The URI used in an RTSP request to refer to
and control an aggregated session. It normally, but not always,
corresponds to the presentation URI specified in the session
description. See Section 13.3 for more information.
Client: The client is the requester of media service from the media
Connection: A transport-layer virtual circuit established between
two programs for the purpose of communication.
Container file: A file that may contain multiple media streams that
often constitute a presentation when played together. The concept
of a container file is not embedded in the protocol. However,
RTSP servers may offer aggregate control on the media streams
within these files.
Continuous media: Data where there is a timing relationship between
source and sink; that is, the sink needs to reproduce the timing
relationship that existed at the source. The most common examples
of continuous media are audio and motion video. Continuous media
can be real time (interactive or conversational), where there is a
"tight" timing relationship between source and sink or it can be
streaming where the relationship is less strict.
Feature tag: A tag representing a certain set of functionality,
i.e., a feature.
IRI: An Internationalized Resource Identifier is similar to a URI
but allows characters from the whole Universal Character Set
(Unicode/ISO 10646), rather than the US-ASCII only. See [RFC3987]
for more information.
Live: A live presentation or session originates media from an event
taking place at the same time as the media delivery. Live
sessions often have an unbound or only loosely defined duration
and seek operations may not be possible.
Media initialization: The datatype- or codec-specific
initialization. This includes such things as clock rates, color
tables, etc. Any transport-independent information that is
required by a client for playback of a media stream occurs in the
media initialization phase of stream setup.
Media parameter: A parameter specific to a media type that may be
changed before or during stream delivery.
Media server: The server providing media-delivery services for one
or more media streams. Different media streams within a
presentation may originate from different media servers. A media
server may reside on the same host or on a different host from
which the presentation is invoked.
(Media) Stream: A single media instance, e.g., an audio stream or a
video stream as well as a single whiteboard or shared application
group. When using RTP, a stream consists of all RTP and RTCP
packets created by a media source within an RTP session.
Message: The basic unit of RTSP communication, consisting of a
structured sequence of octets matching the syntax defined in
Section 20 and transmitted over a transport between RTSP agents.
A message is either a request or a response.
Message body: The information transferred as the payload of a
message (request or response). A message body consists of meta-
information in the form of message body headers and content in the
form of an arbitrary number of data octets, as described in
Non-aggregated control: Control of a single media stream.
Presentation: A set of one or more streams presented to the client
as a complete media feed and described by a presentation
description as defined below. Presentations with more than one
media stream are often handled in RTSP under aggregate control.
Presentation description: A presentation description contains
information about one or more media streams within a presentation,
such as the set of encodings, network addresses, and information
about the content. Other IETF protocols, such as SDP ([RFC4566]),
use the term "session" for a presentation. The presentation
description may take several different formats, including but not
limited to SDP format.
Response: An RTSP response to a request. One type of RTSP message.
If an HTTP response is meant, it is indicated explicitly.
Request: An RTSP request. One type of RTSP message. If an HTTP
request is meant, it is indicated explicitly.
Request-URI: The URI used in a request to indicate the resource on
which the request is to be performed.
RTSP agent: Either an RTSP client, an RTSP server, or an RTSP proxy.
In this specification, there are many capabilities that are common
to these three entities such as the capability to send requests or
receive responses. This term will be used when describing
functionality that is applicable to all three of these entities.
RTSP session: A stateful abstraction upon which the main control
methods of RTSP operate. An RTSP session is a common context; it
is created and maintained on a client's request and can be
destroyed by either the client or server. It is established by an
RTSP server upon the completion of a successful SETUP request
(when a 200 OK response is sent) and is labeled with a session
identifier at that time. The session exists until timed out by
the server or explicitly removed by a TEARDOWN request. An RTSP
session is a stateful entity; an RTSP server maintains an explicit
session state machine (see Appendix B) where most state
transitions are triggered by client requests. The existence of a
session implies the existence of state about the session's media
streams and their respective transport mechanisms. A given
session can have one or more media streams associated with it. An
RTSP server uses the session to aggregate control over multiple
Origin server: The server on which a given resource resides.
Seeking: Requesting playback from a particular point in the content
Transport initialization: The negotiation of transport information
(e.g., port numbers, transport protocols) between the client and
URI: A Universal Resource Identifier; see [RFC3986]. The URIs used
in RTSP are generally URLs as they give a location for the
resource. As URLs are a subset of URIs, they will be referred to
as URIs to cover also the cases when an RTSP URI would not be a
URL: A Universal Resource Locator is a URI that identifies the
resource through its primary access mechanism rather than
identifying the resource by name or by some other attribute(s) of
4. Protocol Parameters
4.1. RTSP Version
This specification defines version 2.0 of RTSP.
RTSP uses a "<major>.<minor>" numbering scheme to indicate versions
of the protocol. The protocol versioning policy is intended to allow
the sender to indicate the format of a message and its capacity for
understanding further RTSP communication rather than the features
obtained via that communication. No change is made to the version
number for the addition of message components that do not affect
communication behavior or that only add to extensible field values.
The <minor> number is incremented when the changes made to the
protocol add features that do not change the general message parsing
algorithm but that may add to the message semantics and imply
additional capabilities of the sender. The <major> number is
incremented when the format of a message within the protocol is
changed. The version of an RTSP message is indicated by an RTSP-
Version field in the first line of the message. Note that the major
and minor numbers MUST be treated as separate integers and that each
MAY be incremented higher than a single digit. Thus, RTSP/2.4 is a
lower version than RTSP/2.13, which, in turn, is lower than
RTSP/12.3. Leading zeros SHALL NOT be sent and MUST be ignored by
4.2. RTSP IRI and URI
RTSP 2.0 defines and registers or updates three URI schemes "rtsp",
"rtsps", and "rtspu". The usage of the last, "rtspu", is unspecified
in RTSP 2.0 and is defined here to register the URI scheme that was
defined in RTSP 1.0. The "rtspu" scheme indicates unspecified
transport of the RTSP messages over unreliable transport means (UDP
in RTSP 1.0). An RTSP server MUST respond with an error code
indicating the "rtspu" scheme is not implemented (501) to a request
that carries a "rtspu" URI scheme.
The details of the syntax of "rtsp" and "rtsps" URIs have been
changed from RTSP 1.0. These changes include the addition of:
o Support for an IPv6 literal in the host part and future IP
literals through a mechanism defined in [RFC3986].
o A new relative format to use in the RTSP elements that is not
required to start with "/".
Neither should have any significant impact on interoperability. If
IPv6 literals are needed in the RTSP URI, then that RTSP server must
be IPv6 capable, and RTSP 1.0 is not a fully IPv6 capable protocol.
If an RTSP 1.0 client attempts to process the URI, the URI will not
match the allowed syntax, it will be considered invalid, and
processing will be stopped. This is clearly a failure to reach the
resource; however, it is not a signification issue as RTSP 2.0
support was needed anyway in both server and client. Thus, failure
will only occur in a later step when there is an RTSP version
mismatch between client and server. The second change will only
occur inside RTSP message headers, as the Request-URI must be an
absolute URI. Thus, such usages will only occur after an agent has
accepted and started processing RTSP 2.0 messages, and an agent using
RTSP 1.0 only will not be required to parse such types of relative
This specification also defines the format of RTSP IRIs [RFC3987]
that can be used as RTSP resource identifiers and locators on web
pages, user interfaces, on paper, etc. However, the RTSP request
message format only allows usage of the absolute URI format. The
RTSP IRI format MUST use the rules and transformation for IRIs to
URIs, as defined in [RFC3987]. This allows a URI that matches the
RTSP 2.0 specification, and so is suitable for use in a request, to
be created from an RTSP IRI.
The RTSP IRI and URI are both syntax restricted compared to the
generic syntax defined in [RFC3986] and [RFC3987]:
o An absolute URI requires the authority part; i.e., a host identity
MUST be provided.
o Parameters in the path element are prefixed with the reserved
The "scheme" and "host" parts of all URIs [RFC3986] and IRIs
[RFC3987] are case insensitive. All other parts of RTSP URIs and
IRIs are case sensitive, and they MUST NOT be case mapped.
The fragment identifier is used as defined in Sections 3.5 and 4.3 of
[RFC3986], i.e., the fragment is to be stripped from the IRI by the
requester and not included in the Request-URI. The user agent needs
to interpret the value of the fragment based on the media type the
request relates to; i.e., the media type indicated in Content-Type
header in the response to a DESCRIBE request.
The syntax of any URI query string is unspecified and responder
(usually the server) specific. The query is, from the requester's
perspective, an opaque string and needs to be handled as such.
Please note that relative URIs with queries are difficult to handle
due to the relative URI handling rules of RFC 3986. Any change of
the path element using a relative URI results in the stripping of the
query, which means the relative part needs to contain the query.
The URI scheme "rtsp" requires that commands be issued via a reliable
protocol (within the Internet, TCP), while the scheme "rtsps"
identifies a reliable transport using secure transport (TLS
[RFC5246]); see Section 19.
For the scheme "rtsp", if no port number is provided in the authority
part of the URI, the port number 554 MUST be used. For the scheme
"rtsps", if no port number is provided in the authority part of the
URI port number, the TCP port 322 MUST be used.
A presentation or a stream is identified by a textual media
identifier, using the character set and escape conventions of URIs
[RFC3986]. URIs may refer to a stream or an aggregate of streams;
i.e., a presentation. Accordingly, requests described in Section 13
can apply to either the whole presentation or an individual stream
within the presentation. Note that some request methods can only be
applied to streams, not presentations, and vice versa.
For example, the RTSP URI:
may identify the audio stream within the presentation "twister",
which can be controlled via RTSP requests issued over a TCP
connection to port 554 of host media.example.com.
Also, the RTSP URI:
identifies the presentation "twister", which may be composed of audio
and video streams, but could also be something else, such as a random
This does not imply a standard way to reference streams in URIs.
The presentation description defines the hierarchical
relationships in the presentation and the URIs for the individual
streams. A presentation description may name a stream "a.mov" and
the whole presentation "b.mov".
The path components of the RTSP URI are opaque to the client and do
not imply any particular file system structure for the server.
This decoupling also allows presentation descriptions to be used
with non-RTSP media control protocols simply by replacing the
scheme in the URI.
4.3. Session Identifiers
Session identifiers are strings of a length between 8-128 characters.
A session identifier MUST be generated using methods that make it
cryptographically random (see [RFC4086]). It is RECOMMENDED that a
session identifier contain 128 bits of entropy, i.e., approximately
22 characters from a high-quality generator (see Section 21).
However, note that the session identifier does not provide any
security against session hijacking unless it is kept confidential by
the client, server, and trusted proxies.
4.4. Media-Time Formats
RTSP currently supports three different media-time formats defined
below. Additional time formats may be specified in the future.
These time formats can be used with the Range header (Section 18.40)
to request playback and specify at which media position protocol
requests actually will or have taken place. They are also used in
description of the media's properties using the Media-Range header
(Section 18.30). The unqualified format identifier is used on its
own in Accept-Ranges header (Section 18.5) to declare supported time
formats and also in the Range header (Section 18.40) to request the
time format used in the response.
4.4.1. SMPTE-Relative Timestamps
A timestamp may use a format derived from a Society of Motion Picture
and Television Engineers (SMPTE) specification and expresses time
offsets anchored at the start of the media clip. Relative timestamps
are expressed as SMPTE time codes [SMPTE-TC] for frame-level access
accuracy. The time code has the format:
with the origin at the start of the clip. The default SMPTE format
is "SMPTE 30 drop" format, with a frame rate of 29.97 frames per
second. Other SMPTE codes MAY be supported (such as "SMPTE 25")
through the use of "smpte-type". For SMPTE 30, the "frames" field in
the time value can assume the values 0 through 29. The difference
between 30 and 29.97 frames per second is handled by dropping the
first two frame indices (values 00 and 01) of every minute, except
every tenth minute. If the frame and the subframe values are zero,
they may be omitted. Subframes are measured in hundredths of a
4.4.2. Normal Play Time
Normal Play Time (NPT) indicates the stream-absolute position
relative to the beginning of the presentation. The timestamp
consists of two parts: The mandatory first part may be expressed in
either seconds only or in hours, minutes, and seconds. The optional
second part consists of a decimal point and decimal figures and
indicates fractions of a second.
The beginning of a presentation corresponds to 0.0 seconds. Negative
values are not defined.
The special constant "now" is defined as the current instant of a
live event. It MAY only be used for live events and MUST NOT be used
for on-demand (i.e., non-live) content.
NPT is defined as in Digital Storage Media Command and Control
Intuitively, NPT is the clock the viewer associates with a
program. It is often digitally displayed on a DVD player. NPT
advances normally when in normal play mode (scale = 1), advances
at a faster rate when in fast-scan forward (high positive scale
ratio), decrements when in scan reverse (negative scale ratio) and
is fixed in pause mode. NPT is (logically) equivalent to SMPTE
The syntax is based on ISO 8601 [ISO.8601.2000] and expresses the
time elapsed since presentation start, with two different notations
o The npt-hhmmss notation uses an ISO 8601 extended complete
representation of the time of the day format (Section 22.214.171.124 of
[ISO.8601.2000] ) using colons (":") as separators between hours,
minutes, and seconds (hh:mm:ss). The hour counter is not limited
to 0-24 hours; up to nineteen (19) hour digits are allowed.
* In accordance with the requirements of the ISO 8601 time
format, the hours, minutes, and seconds MUST all be present,
with two digits used for minutes and for seconds and with at
least two digits for hours. An NPT of 7 minutes and 0 seconds
is represented as "00:07:00", and an NPT of 392 hours, 0
minutes, and 6 seconds is represented as "392:00:06".
* RTSP 1.0 allowed NPT in the npt-hhmmss notation without any
leading zeros to ensure that implementations don't fail; for
backward compatibility, all RTSP 2.0 implementations are
REQUIRED to support receiving NPT values, hours, minutes, or
seconds, without leading zeros.
o The npt-sec notation expresses the time in seconds, using between
one and nineteen (19) digits.
Both notations allow decimal fractions of seconds as specified in
Section 126.96.36.199 of [ISO.8601.2000], using at most nine digits, and
allowing only "." (full stop) as the decimal separator.
The npt-sec notation is optimized for automatic generation; the npt-
hhmmss notation is optimized for consumption by human readers. The
"now" constant allows clients to request to receive the live feed
rather than the stored or time-delayed version. This is needed since
neither absolute time nor zero time are appropriate for this case.
4.4.3. Absolute Time
Absolute time is expressed using a timestamp based on ISO 8601
[ISO.8601.2000]. The date is a complete representation of the
calendar date in basic format (YYYYMMDD) without separators (per
Section 188.8.131.52 of [ISO.8601.2000]). The time of day is provided in
the complete representation basic format (hhmmss) as specified in
Section 184.108.40.206 of [ISO.8601.2000], allowing decimal fractions of
seconds following Section 220.127.116.11 requiring "." (full stop) as
decimal separator and limiting the number of digits to no more than
nine. The time expressed MUST use UTC (GMT), i.e., no time zone
offsets are allowed. The full date and time specification is the
eight-digit date followed by a "T" followed by the six-digit time
value, optionally followed by a full stop followed by one to nine
fractions of a second and ended by "Z", e.g., YYYYMMDDThhmmss.ssZ.
The reasons for this time format rather than using "Date and Time
on the Internet: Timestamps" [RFC3339] are historic. We continue
to use the format specified in RTSP 1.0. The motivations raised
in RFC 3339 apply to why a selection from ISO 8601 was made;
however, a different and even more restrictive selection was
applied in this case.
Below are three examples of media time formats, first, a request for
a clock format range request for a starting time of November 8, 1996
at 14 h 37 min and 20 1/4 seconds UTC playing for 10 min and 5
seconds, followed by a Media-Properties header's "Time-Limited" UTC
property for the 24th of December 2014 at 15 hours and 00 minutes,
and finally a Terminate-Reason header "time" property for the 18th of
June 2013 at 16 hours, 12 minutes, and 56 seconds:
4.5. Feature Tags
Feature tags are unique identifiers used to designate features in
RTSP. These tags are used in Require (Section 18.43), Proxy-Require
(Section 18.37), Proxy-Supported (Section 18.38), Supported
(Section 18.51), and Unsupported (Section 18.55) header fields.
A feature tag definition MUST indicate which combination of clients,
servers, or proxies to which it applies.
The creator of a new RTSP feature tag should either prefix the
feature tag with a reverse domain name (e.g.,
"com.example.mynewfeature" is an apt name for a feature whose
inventor can be reached at "example.com") or register the new feature
tag with the Internet Assigned Numbers Authority (IANA). (See
Section 22, "IANA Considerations".)
The usage of feature tags is further described in Section 11, which
deals with capability handling.
4.6. Message Body Tags
Message body tags are opaque strings that are used to compare two
message bodies from the same resource, for example, in caches or to
optimize setup after a redirect. Message body tags can be carried in
the MTag header (see Section 18.31) or in SDP (see Appendix D.1.9).
MTag is similar to ETag in HTTP/1.1 (see Section 3.11 of [RFC2068]).
A message body tag MUST be unique across all versions of all message
bodies associated with a particular resource. A given message body
tag value MAY be used for message bodies obtained by requests on
different URIs. The use of the same message body tag value in
conjunction with message bodies obtained by requests on different
URIs does not imply the equivalence of those message bodies.
Message body tags are used in RTSP to make some methods conditional.
The methods are made conditional through the inclusion of headers;
see Section 18.24 and Section 18.26 for information on the If-Match
and If-None-Match headers, respectively. Note that RTSP message body
tags apply to the complete presentation, i.e., both the presentation
description and the individual media streams. Thus, message body
tags can be used to verify at setup time after a redirect that the
same session description applies to the media at the new location
using the If-Match header.
4.7. Media Properties
When an RTSP server handles media, it is important to consider the
different properties a media instance for delivery and playback can
have. This specification considers the media properties listed below
in its protocol operations. They are derived from the differences
between a number of supported usages.
On-demand: Media that has a fixed (given) duration that doesn't
change during the lifetime of the RTSP session and is known at the
time of the creation of the session. It is expected that the
content of the media will not change, even if the representation,
such as encoding, or quality, may change. Generally, one can
seek, i.e., request any range, within the media.
Dynamic On-demand: This is a variation of the on-demand case where
external methods are used to manipulate the actual content of the
media setup for the RTSP session. The main example is content
defined by a playlist.
Live: Live media represents a progressing content stream (such as
broadcast TV) where the duration may or may not be known. It is
not seekable, only the content presently being delivered can be
Live with Recording: A live stream that is combined with a server-
side capability to store and retain the content of the live
session and allow for random access delivery within the part of
the already-recorded content. The actual behavior of the media
stream is very much dependent on the retention policy for the
media stream; either the server will be able to capture the
complete media stream or it will have a limitation in how much
will be retained. The media range will dynamically change as the
session progress. For servers with a limited amount of storage
available for recording, there will typically be a sliding window
that moves forward while new data is made available and older data
To cover the above usages, the following media properties with
appropriate values are specified.
4.7.1. Random Access and Seeking
Random access is the ability to specify and get media delivered
starting from any time (instant) within the content, an operation
called "seeking". The Media-Properties header will indicate the
general capability for a media resource to perform random access.
Random-Access: The media is seekable to any out of a large number of
points within the media. Due to media-encoding limitations, a
particular point may not be reachable, but seeking to a point
close by is enabled. A floating-point number of seconds may be
provided to express the worst-case distance between random access
Beginning-Only: Seeking is only possible to the beginning of the
No-Seeking: Seeking is not possible at all.
If random access is possible, as indicated by the Media-Properties
header, the actual behavior policy when seeking can be controlled
using the Seek-Style header (Section 18.47).
The following retention policies are used by media to limit possible
Unlimited: The media will not be removed as long as the RTSP session
is in existence.
Time-Limited: The media will not be removed before the given
wallclock time. After that time, it may or may not be available
Time-Duration: The media (on fragment or unit basis) will be
retained for the specified duration.
4.7.3. Content Modifications
The media content and its timeline can be of different types, e.g.
pre-produced content on demand, a live source that is being generated
as time progresses, or something that is dynamically altered or
recomposed during playback. Therefore, a media property for content
modifications is needed and the following initial values are defined:
Immutable: The content of the media will not change, even if the
representation, such as encoding or quality changes.
Dynamic: The content can change due to external methods or triggers,
such as playlists, but this will be announced by explicit updates.
Time-Progressing: As time progresses, new content will become
available. If the content is also retained, it will become longer
as everything between the start point and the point currently
being made available can be accessed. If the media server uses a
sliding-window policy for retention, the start point will also
change as time progresses.
4.7.4. Supported Scale Factors
A particular media content item often supports only a limited set or
range of scales when delivering the media. To enable the client to
know what values or ranges of scale operations that the whole content
or the current position supports, a media properties attribute for
this is defined that contains a list with the values or ranges that
are supported. The attribute is named "Scales". The "Scales"
attribute may be updated at any point in the content due to content
consisting of spliced pieces or content being dynamically updated by
4.7.5. Mapping to the Attributes
This section shows examples of how one would map the above usages to
the properties and their values.
Example of On-Demand:
Random Access: Random-Access=5.0, Content Modifications:
Immutable, Retention: Unlimited or Time-Limited.
Example of Dynamic On-Demand:
Random Access: Random-Access=3.0, Content Modifications: Dynamic,
Retention: Unlimited or Time-Limited.
Example of Live:
Random Access: No-Seeking, Content Modifications: Time-
Progressing, Retention: Time-Duration=0.0
Example of Live with Recording:
Random Access: Random-Access=3.0, Content Modifications: Time-
Progressing, Retention: Time-Duration=7200.0