
RFC 4396


RTP Payload Format for 3rd Generation Partnership Project (3GPP) Timed Text

5.  Resilient Transport

   Apart from the basic fragmentation guidelines described in the
   section above, the simplest option for packet-loss-resilient
   transport is packet repetition.  This mechanism may consist of a
   strict window-based repetition mechanism or, simply, a repetition
   mechanism in a wider sense, where new and old packets are mixed, for
   example.

   A server MAY decide to use repetition as a measure for packet loss
   resilience.  Thereby, a server MAY send the same RTP payloads or just
   some of the units from the payloads.

   As for the case of complete payloads, single repeated units MUST
   exactly match the same units sent in the first transmission; i.e., if
   fragmentation is needed, it SHALL be performed only once for each
   text sample.  Only then can a receiver use the already received and
   the repeated units to reconstruct the original text samples.  Since
   the RTP timestamp is used to group together the fragments of a
   sample, care must be taken to preserve the timing of units when
   constructing new RTP packets.

        For example, if a text sample was originally sent as a single
        non-fragmented text sample (one TYPE 1 unit), a repetition of
        that sample MUST be sent also as a single non-fragmented text
        sample in one unit.  Likewise, if the original text sample was
        fragmented and spread over several RTP packets (say, a total of
        3 units), then the repeated fragments SHALL also have the same
        byte boundaries and use the same unit headers and bytes per
        fragment.

   With repetition, repeated units resolve to the same timestamp as
   their originals.  Where redundant units are available, only one of
   them SHALL be used.

   Regarding the RTP header fields:

   o If the whole RTP payload is repeated, the payload-specific fields
     in the RTP header (the M, TS, and PT fields) MUST keep their
     original values.  The sequence number, however, MUST be
     incremented to comply with RTP (the TOTAL/THIS fields make it
     possible to reassemble fragments that arrive with different
     sequence numbers).

   o In packets containing single repeated units, the general rules in
     Section 3 for assigning values to the RTP header fields apply.
     Keeping the value of the RTP timestamp to preserve the timing of
     the units is particularly relevant here.
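
   As a rough illustration of how a receiver might combine original and
   repeated units, the following Python sketch groups fragments by RTP
   timestamp and uses only one copy of each.  The "Fragment" tuple and
   its field names are hypothetical stand-ins for the TOTAL/THIS unit
   header fields; they are assumptions made for this example and are
   not defined by this document.

```python
from collections import namedtuple

# Hypothetical, simplified view of one fragment unit.  The field names
# are illustrative only; "total" and "this" mirror the TOTAL/THIS fields.
Fragment = namedtuple("Fragment", "timestamp total this data")


def reassemble(fragments):
    """Return {timestamp: sample bytes} for every complete text sample.

    Repeated fragments (same timestamp and same THIS index) are ignored,
    so only one copy of each redundant unit is used.
    """
    pending = {}  # timestamp -> {this: data}
    totals = {}   # timestamp -> TOTAL announced for that sample
    for frag in fragments:
        parts = pending.setdefault(frag.timestamp, {})
        totals.setdefault(frag.timestamp, frag.total)
        # Keep the first copy; a repetition must be byte-identical anyway.
        parts.setdefault(frag.this, frag.data)

    samples = {}
    for timestamp, parts in pending.items():
        if len(parts) == totals[timestamp]:
            samples[timestamp] = b"".join(parts[i] for i in sorted(parts))
    return samples
```

   Samples with missing fragments are simply left out of the result, so
   a later repetition can still complete them on a subsequent call.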

   Apart from repetition, other mechanisms such as FEC [7],
   retransmission [11], or similar techniques could be used to cope with
   packet losses.

6.  Congestion Control

   Congestion control for RTP SHALL be implemented in accordance with
   RTP [3] and the applicable RTP profile, e.g., RTP/AVP [17].

   When using this payload format, mainly two factors may affect the
   congestion control:

   o The use of (unit) aggregation may make the payload format more
     bandwidth efficient, by avoiding header overhead and thus reducing
     the used bitrate.

   o The use of resilient transport mechanisms: Although timed text
     applications typically operate at low bitrates, the increase due to
     resilient transport shall be considered for congestion control
     mechanisms.  This applies to all mechanisms but especially to less
     efficient ones like repetition.

7.  Scene Description

7.1.  Text Rendering Position and Composition

   In order to set up a timed text session, regardless of the stream
   being stored in a 3GP file or streamed live, some initial layout
   information is needed by the communicating peers.

      |      <-> tx                               |    +-------------+
      |     +-------------------------------+     |<---|Display Area |
      |  ^  |                               |     |    +-------------+
      |  :  |                               |     |
      |  :ty|                               |     |    +-------------+
      |  :  |                               |<---------|Video track  |
      |  :  |                               |     |    +-------------+
      |  :  |                               |     |
      |  :  |                               |     |
      |  :  |                               |     |
      |  v  |                               |     |
      |  -  |   x-------------------------+ |     |    +-------------+
      |h ^  |   |                         |<-----------|Text Track   |
      |e :  +---|-------------------------|-+     |    +-------------+
      |i :      | +---------------------+ |       |
      |g :      | |                     | |       |    +-------------+
      |h :      | |                     |<------------ |Text Box     |
      |t v      | +---------------------+ |       |    +-------------+
      |  -      +-------------------------+       |
                        w i d t h

   Figure 18.  Illustration of text rendering position and composition

   The parameters used for negotiating the position and size of the text
   track in the display area are shown in Figure 18.  These are the
   "width" and "height" of the text track, its translation values, "tx"
   and "ty", and its "layer" or proximity to the user.

   At the same time, the sender of the stream needs to know the
   receiver's capabilities.  In this case, these are the maximum
   allowable values for the text track height and width, "max-h" and
   "max-w", for the stream that the receiver is to display.

   This layout information MUST be conveyed in a reliable form before
   the start of the session, e.g., during session announcement or in an
   Offer/Answer (O/A) exchange.  An example of a reliable transport may
   be the out-of-band channel used for SDP.  Sections 8 and 9 provide
   details on the mapping of these parameters to SDP descriptions and
   their usage in O/A.

   For stored content, the layout values expressing stream properties
   MUST be obtained from the Track Header Box.  See Section 7.3.

   For live streaming, appropriate values as negotiated during session
   setup shall be used.

7.2.  SMIL Usage

   The attributes contained in the Track Header Boxes of a 3GP file only
   specify the spatial relationship of the tracks within the given 3GP
   file.

   If multiple 3GP files are sent, they require spatial synchronization.
   For example, for a text and video stream, the positions of the text
   and video tracks in Figure 18 shall be determined.  For this purpose,
   SMIL [9] MAY be used.

   SMIL assigns regions in the display to each of those files and places
   the tracks within those regions.  Generally, in SMIL, the position of
   one track (or stream) is expressed relative to another track.  This
   is different from the 3GP file, where the upper left corner is the
   reference for all translation offsets.  Hence, only if the position
   in SMIL is relative to the video track origin, then this translation
   offset has the same value as (tx, ty) in the 3GP file.

   Note also that the original track header information is used for each
   track only within its region, as assigned by SMIL.  Therefore, even
   if SMIL scene description is used, the track header information
   pieces SHOULD be sent anyway, as they represent the intrinsic media
   properties.  See 3GPP SMIL Language Profile in [27] for details.

7.3.  Finding Layout Values in a 3GP File

   In a 3GP file, within the Track Header Box (tkhd):

        o tx, ty: These values specify the translation offset of the
          (text) track relative to the upper left corner of the video
          track, if present.  They are, respectively, the third-to-last
          and second-to-last values in the unity matrix.  The values are
          fixed-point 16.16 numbers restricted to (signed) integers
          (i.e., the lower 16 bits of each value shall be all zeros).
          Therefore, only the upper 16 bits are used for obtaining the
          value of the media type parameters.

        o width, height: They have the same name in the tkhd box.  All
          (unsigned) 32 bits are meaningful.

        o layer: All (signed) 16 bits are used.
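
   The extraction of "tx" and "ty" described above can be sketched in
   Python as follows.  This is a minimal sketch: it assumes that the
   nine matrix entries of the Track Header Box have already been read
   as raw 32-bit unsigned integers in file order, and the function
   names are hypothetical.

```python
def fixed_16_16_to_int(value):
    """Return the signed integer part of a raw 32-bit 16.16 value."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("expected a 32-bit unsigned raw value")
    if value & 0xFFFF:
        raise ValueError("the lower 16 bits must be all zeros")
    upper = value >> 16
    # Interpret the upper 16 bits as a two's-complement signed integer.
    return upper - 0x10000 if upper & 0x8000 else upper


def translation_from_matrix(matrix):
    """Return (tx, ty) from the nine raw matrix entries (assumed order)."""
    if len(matrix) != 9:
        raise ValueError("expected nine matrix entries")
    # Third-to-last and second-to-last entries hold the translation.
    return fixed_16_16_to_int(matrix[-3]), fixed_16_16_to_int(matrix[-2])
```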

8.  3GPP Timed Text Media Type

   The media subtype for the 3GPP Timed Text codec is allocated from the
   standards tree.  The top-level media type under which this payload
   format is registered is 'video'.  This registration is done using the
   template defined in [29] and following RFC 3555 [28].

   The receiver MUST ignore any unrecognized parameter.

   Media type: video

   Media subtype: 3gpp-tt

   Required parameters:

      rate:
                Refer to Section 3 in RFC 4396.

      sver:
                The parameter "sver" contains a list of supported
                backwards-compatible versions of the timed text format
                specification (3GPP TS 26.245) that the sender accepts
                to receive (and that are the same that it would be
                willing to send).  The first value is the value
                preferred to receive (or preferred to send).  The first
                value MAY be followed by a comma-separated list of
                versions that SHOULD be used as alternatives.  The order
                is meaningful, being first the most preferred and last
                the least preferred.  Each entry has the format
                Zi(xi*256+yi), where "Zi" is the number of the Release
                and "xi" and "yi" are taken from the 3GPP specification
                version (i.e., vZi.xi.yi).  For example, for 3GPP TS
                26.245 v6.0.0, Zi(xi*256+yi)=6(0), the version value is
                "60".  (Note that "60" is the concatenation of the
                values Zi=6 and (xi*256+yi)=0 and not their product.)

                If no "sver" value is available, for example, when
                streaming out of a 3GP file, the default value "60",
                corresponding to the 3GPP Release 6 version of 3GPP TS
                26.245, SHALL be used.
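
   The "sver" entry format can be illustrated with a short Python
   sketch.  The function names are hypothetical, and reading the entry
   "6256" as version v6.1.0 is an inference from the formula above, not
   a statement taken from 3GPP TS 26.245.

```python
def sver_entry(release, x, y):
    """Encode specification version v<release>.<x>.<y> as an sver entry.

    The entry is the Release number followed by the decimal value of
    x*256+y; the two parts are concatenated, not multiplied.
    """
    return f"{release}{x * 256 + y}"


def sver_value(versions):
    """Join (release, x, y) tuples, most preferred first, into a list."""
    return ",".join(sver_entry(*version) for version in versions)
```

   Only encoding is shown because the concatenated form cannot always be
   split back unambiguously without knowing how many digits the Release
   number has.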

   Optional parameters:

      tx:
                This parameter indicates the horizontal translation
                offset in pixels of the text track with respect to the
                origin of the video track.  This value is the decimal
                representation of a 16-bit signed integer.  Refer to
                3GPP TS 26.245 for an illustration of this parameter.

      ty:
                This parameter indicates the vertical translation offset
                in pixels of the text track with respect to the origin
                of the video track.  This value is the decimal
                representation of a 16-bit signed integer.  Refer to
                3GPP TS 26.245 for an illustration of this parameter.

      layer:
                This parameter indicates the proximity of the text track
                to the viewer.  More negative values mean closer to the
                viewer.  This parameter has no units.  This value is the
                decimal representation of a 16-bit signed integer.

      tx3g:
                This parameter MUST be used for conveying sample
                descriptions out-of-band.  It contains a comma-separated
                list of base64-encoded entries.  The entries of this
                list MAY follow any particular order and the list SHALL
                NOT be empty.  Each entry is the result of running
                base64 encoding over the concatenation of the (static)
                SIDX value as an 8-bit unsigned integer and the (static)
                sample description for that SIDX, in that order.  The
                format of a sample description entry can be found in
                3GPP TS 26.245 Release 6 and later releases.  All
                servers and clients MUST understand this parameter and
                MUST be capable of using the sample description(s)
                contained in it.  Please refer to RFC 3548 [6] for
                details on the base64 encoding.

      width:
                This parameter indicates the width in pixels of the text
                track or area of the text being sent.  This value is the
                decimal representation of a 32-bit unsigned integer.
                Refer to 3GPP TS 26.245 for an illustration of this
                parameter.

      height:
                This parameter indicates the height in pixels of the
                text track being sent.  This value is the decimal
                representation of a 32-bit unsigned integer.  Refer to
                3GPP TS 26.245 for an illustration of this parameter.

      max-w:
                This parameter indicates display capabilities.  This is
                the maximum "width" value that the sender of this
                parameter supports.  This value is the decimal
                representation of a 32-bit unsigned integer.

      max-h:
                This parameter indicates display capabilities.  This is
                the maximum "height" value that the sender of this
                parameter supports.  This value is the decimal
                representation of a 32-bit unsigned integer.
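
   The construction of the "tx3g" value described above (base64 over
   the SIDX byte followed by the sample description, entries joined by
   commas) might look as follows in Python.  This is a sketch under
   stated assumptions: the function names are hypothetical, and the
   sample description bytes are treated as opaque input.

```python
import base64


def tx3g_value(sample_descriptions):
    """Build the tx3g parameter value.

    "sample_descriptions" maps a static SIDX (0..255) to the raw bytes
    of its sample description entry.
    """
    if not sample_descriptions:
        raise ValueError("the tx3g list must not be empty")
    entries = []
    for sidx, description in sample_descriptions.items():
        if not 0 <= sidx <= 255:
            raise ValueError("SIDX must fit in 8 bits")
        # SIDX first, then the sample description, encoded together.
        raw = bytes([sidx]) + description
        entries.append(base64.b64encode(raw).decode("ascii"))
    return ",".join(entries)


def parse_tx3g(value):
    """Inverse of tx3g_value: return {SIDX: sample description bytes}."""
    result = {}
    for entry in value.split(","):
        raw = base64.b64decode(entry)
        result[raw[0]] = raw[1:]
    return result
```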

   Encoding considerations:

        This media type is framed (see Section 4.8 in [29]) and
        partially contains binary data.

   Restrictions on usage:

        This media type depends on RTP framing, and hence is only
        defined for transfer via RTP [3].  Transport within other
        framing protocols is not defined at this time.

   Security considerations:

        Please refer to Section 11 of RFC 4396.

   Interoperability considerations:

        The 3GPP Timed Text media format and its file storage is
        specified in Release 6 of 3GPP TS 26.245, "Transparent end-to-
        end packet switched streaming service (PSS); Timed Text Format
        (Release 6)".  Note also that 3GPP may in future releases
        specify extensions or updates to the timed text media format in
        a backwards-compatible way, e.g., new modifier boxes or
        extensions to the sample descriptions.  The payload format
        defined in RFC 4396 allows for such extensions.  For future 3GPP
        Releases of the Timed Text Format, the parameter "sver" is used
        to identify the exact specification used.

        The defined storage format for 3GPP Timed Text format is the
        3GPP File Format (3GP) [30]. 3GP files may be transferred using
        the media type video/3gpp as registered by RFC 3839 [31].  The
        3GPP File Format is a container file that may contain, e.g.,
        audio and video that may be synchronized with the 3GPP Timed
        Text.

   Published specification: RFC 4396

   Applications which use this media type:

        Multimedia streaming applications.

   Additional information:

        The 3GPP Timed Text media format is specified in 3GPP TS 26.245,
        "Transparent end-to-end packet switched streaming service (PSS);
        Timed Text Format (Release 6)".  This document and future
        extensions to the 3GPP Timed Text format are publicly available

        Magic number(s): None.

        File extension(s): None.

        Macintosh File Type Code(s): None.

   Person & email address to contact for further information:

        Jose Rey,
        Yoshinori Matsui,
        Audio/Video Transport Working Group.

   Intended usage: COMMON

   Authors:

        Jose Rey
        Yoshinori Matsui

   Change controller: IETF Audio/Video Transport Working Group delegated
        from the IESG.

9.  SDP Usage

9.1.  Mapping to SDP

   The information carried in the media type specification has a
   specific mapping to fields in SDP [4].  If SDP is used to specify
   sessions using this payload format, the mapping is done as follows:

   o The media type ("video") goes in the SDP "m=" as the media name.

       m=video <port number> RTP/<RTP profile> <dynamic payload type>

   o The media subtype ("3gpp-tt") and the timestamp clockrate "rate"
     (the RECOMMENDED 1000 Hz or other value) go in SDP "a=rtpmap" line
     as the encoding name and rate, respectively:

       a=rtpmap:<payload type> 3gpp-tt/1000

   o The REQUIRED parameter "sver" goes in the SDP "a=fmtp" attribute by
     copying it directly from the media type string as a semicolon-
     separated parameter=value pair.

   o The OPTIONAL parameters "tx", "ty", "layer", "tx3g", "width",
     "height", "max-w" and "max-h" go in the SDP "a=fmtp" attribute by
     copying them directly from the media type string as a semicolon
     separated list of parameter=value(s) pairs:

       a=fmtp:<dynamic payload type> <parameter
       name>=<value>[,<value>][; <parameter name>=<value>]

   o Any parameter unknown to the device that uses the SDP SHALL be
     ignored.  For example, parameters added to the media format in
     later specifications MAY be copied into the SDP and SHALL be
     ignored by receivers that do not understand them.

9.2.  Parameter Usage in the SDP Offer/Answer Model

   In this section, the meaning of the SDP parameters defined in this
   document within the Offer/Answer [13] context is explained.

   In unicast, sender and receiver typically negotiate the streams,
   i.e., which codecs and parameter values are used in the session.
   This is also possible in multicast to a lesser extent.

   Additionally, the meaning of the parameters MAY vary depending on
   which direction is used.  In the following sections, a
   "<directionality> offer" means an offer that contains a stream set to
   <directionality>.  <directionality> may take the values sendrecv,
   sendonly, and recvonly.  Similar considerations apply for answers.
   For example, an answer to a sendonly offer is a recvonly answer.

9.2.1. Unicast Usage

   The following types of parameters are used in this payload format:

     1. Declarative parameters: Offerer and answerer declare the values
        they will use for the incoming (sendrecv/recvonly) or outgoing
        (sendonly) stream.  Offerer and answerer MAY use different
        values.

          a. "tx", "ty", and "layer": These are parameters describing
             where the received text track is placed.  Depending on the
             directionality:

              i. They MUST appear in all sendrecv offers and answers and
                 in all recvonly offers and answers (thus applying to
                 the incoming stream).  In the case of sendrecv offers
                 and answers and in recvonly offers, these values SHOULD
                 be used by the sender of the stream unless it has a
                 particular preference, in which case, it MUST make sure
                 that these different values do not corrupt the
                 presentation.  For recvonly answers, the answerer MAY
                 accept the proposed values for the incoming stream (in
                 a sendonly offer; see ii. below) or respond with
                 different ones.  The offerer MUST use the returned
                 values.

             ii. They MAY appear in sendonly offers and MUST appear in
                 sendonly answers.  In sendonly offers, they specify the
                 values that the offerer proposes for sending (see
                 example in Section 9.3).  In sendonly answers, these
                 values SHOULD be copied from the corresponding recvonly
                 offer upon accepting the stream, unless a particular
                 preference by the receiver of the stream exists, as
                 explained in the previous point.

     2. Parameters describing the display capabilities, "max-h" and
        "max-w", which indicate the maximum dimensions of the text track
        (text display area) for the incoming stream "tx" and "ty" values
        (see Figure 18).  "max-h" and "max-w" MUST be included in all
        offers and answers where "tx" and "ty" refer to the incoming
        stream, thus excluding sendonly offers and answers (see example
        in Section 9.3), where they SHALL NOT be present.

     3. Parameters describing the sent stream properties, i.e., the
        sender of the stream decides upon the values of these:

          a. "width" and "height" specify the text track dimensions.
             They SHALL ALWAYS be present in sendrecv and sendonly
             offers and answers.  For recvonly answers, the answerer
             MUST include the offered parameter values (if any) verbatim
             in the answer upon accepting the stream.

          b. "tx3g" contains static sample descriptions.  It MAY only be
             present in sendrecv and sendonly offers and answers.  This
             parameter applies to the stream that offerers or answerers
             send.

     4. Negotiable parameters, which MUST be agreed on.  This is the
        case of "sver".  This parameter MUST be present in every offer
        and answer.  The answerer SHALL choose one supported value from
        the offerer's list, or else it MUST remove the stream or reject
        the session.

     5. Symmetric parameters: "rate", timestamp clockrate, belongs to
        this class.  Symmetric parameters MUST be echoed verbatim in the
        answer.  Otherwise, the stream MUST be removed or the session
        rejected.

   The following table summarizes all options:

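     +--------------------------------------------------------------+
     |   ``--..__  Directionality/ | sendrecv | recvonly | sendonly |
     + Type of   ``--..__   O or A +----------+----------+----------+
     |    Parameter      ``--..__  |   O/A    |   O/A    |   O/A    |
     +--------------+--------------+----------+----------+----------+
     | Declarative  |tx, ty, layer |   M/M    |   M/M    |   m/M    |
     |              |              |          |          |          |
     +--------------+--------------+----------+----------+----------+
     | Display      |max-h, max-w  |   M/M    |   M/M    |   -/-    |
     | Capabilities |              |          |          |          |
     +--------------+--------------+----------+----------+----------+
     | Stream       |height, width |   M/M    |   -/(M)  |   M/M    |
     | properties   |tx3g          |   m/m    |   -/-    |   m/m    |
     |              |              |          |          |          |
     +--------------+--------------+----------+----------+----------+
     |  Negotiable  |sver          |   M/M    |   M/M    |   M/M    |
     |              |              |          |          |          |
     +--------------+--------------+----------+----------+----------+
     |  Symmetric   |rate          |   M/M    |   M/M    |   M/M    |
     +--------------+--------------+----------+----------+----------+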

          Table 1.  Parameter usage in Unicast Offer / Answer.

        o M means MUST be present.
        o m means MAY be present (such as proposed values).
        o (M) or (m) means MUST or MAY, if applicable.
        o a hyphen ("-") means the parameter MUST NOT be present.

   Other observations regarding parameter usage:

     o Translation and layer values: In sendonly offers, "tx",
       "ty", and "layer" indicate proposed values.  This is useful for
       visually composed sessions where the different streams occupy
       different parts of the display, e.g., a video stream and the
       captions.  These are just suggested values; the peer rendering
       the text ultimately decides where to place the text track.

     o Text track (area) dimensions, "height" and "width": In the case
       of sendonly offers, an answerer accepting the offer MUST be
       prepared to render the stream using these values.  If it cannot
       do so, the stream MUST be removed or the session rejected.

     o Display capabilities, "max-h" and "max-w": An answerer sending a
       stream SHALL ensure that the "height" and "width" values in the
       answer are compatible with the offerer's signaled capabilities.

     o Version handling via "sver": The idea is that offerer and
       answerer communicate using the same version.  This is achieved by
       letting the answerer choose from a list of supported versions,
       "sver".  For recvonly streams, the first value in the list is the
       preferred version to receive.  Consequently, for sendonly (and
       sendrecv) streams, the first value is the one preferred for
       sending (and receiving).  The answerer MUST choose one value and
       return it in the answer.  Upon receiving the answer, the offerer
       SHALL be prepared to send (sendonly and sendrecv) and receive
       (recvonly and sendrecv) a stream using that version.  If none of
       the versions in the list is supported, the stream MUST be removed
       or the session rejected.  Note that, if alternative non-
       compatible versions are offered, then this SHALL be done using
       different payload types.
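
   The unicast version selection can be sketched in Python as follows.
   The text above only requires that the answerer choose one value from
   the list; honoring the offerer's preference order is an assumption
   made for this example, as is the function name.

```python
def choose_sver(offered, supported):
    """Pick the answer's sver value for a unicast offer.

    "offered" is the offerer's comma-separated list, most preferred
    first; "supported" is the set of entries the answerer implements.
    Returns the chosen entry, or None if the stream has to be removed
    or the session rejected.
    """
    for entry in offered.split(","):
        entry = entry.strip()
        if entry in supported:
            return entry
    return None
```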

9.2.2.  Multicast Usage

   In multicast, the parameter usage is similar to the unicast case,
   except as follows:

   o the parameters "tx", "ty", and "layer" in multicast offers only
     have meaning for sendrecv and recvonly streams.  In order for all
     clients to have the same vision of the session, they MUST be used

   o for "height", "width", and "tx3g" (for sendrecv and sendonly),
     multicast offers specify which values of these parameters the
     participants MUST use for sending.  Thus, if the stream is
     accepted, the answerer MUST also include them verbatim in the
     answer (also "tx3g", if present).

   o The capability parameters, "max-h" and "max-w", SHALL NOT be used
     in multicast.  If the offered text track should change in size, a
     new offer SHALL be used instead.

   o Regarding version handling:

     In the case of multicast offers, an answerer MAY accept a multicast
     offer as long as one of the versions listed in the "sver" is
     supported.  Therefore, if the stream is accepted, the answerer MUST
     choose its preferred version, but, unlike in unicast, the offerer
     SHALL NOT change the offered stream to this chosen version because
     there may be other session participants that do support the newer
     extensions.  Consequently, different session participants may end
     up using different backwards-compatible media format versions.  It
     is RECOMMENDED that the multicast offer contains a limited number
     of versions, in order for all participants to have the same view of
     the session.  This is a responsibility of the session creator.  If
     none of the offered versions is supported, the stream SHALL be
     removed or the session rejected.  Also in this case, if alternative
     non-compatible versions are offered, then this SHALL be done using
     different payload types.

9.3.  Offer/Answer Examples

   In these unicast O/A examples, the long lines are wrapped around.
   Static sample descriptions are shortened for clarity.

   For sendrecv:

   O -> A

   m=video <port> RTP/AVP 98
   a=rtpmap:98 3gpp-tt/1000
   a=fmtp:98 tx=100; ty=100; layer=0; height=80; width=100; max-h=120;
   max-w=160; sver=6256,60; tx3g=81...

   A -> O

   m=video <port> RTP/AVP 98
   a=rtpmap:98 3gpp-tt/1000
   a=fmtp:98 tx=100; ty=95; layer=0; height=90; width=100; max-h=100;
   max-w=160; sver=60; tx3g=82...

   In this example, the offerer is telling the answerer where it will
   place the received stream and what is the maximum height and width
   allowable for the stream that it will receive.  Also, it tells the
   answerer the dimensions of the text track for the stream sent and
   which sample description it shall use.  It offers two versions, 6256
   and 60.  The answerer responds with an equivalent set of parameters
   for the stream it receives.  In this case, the answerer's "max-h" and
   "max-w" are compatible with the offerer's "height" and "width".
   Otherwise, the answerer would have to remove this stream, and the
   offerer would have to issue a new offer taking the answerer's
   capabilities into account.  This is possible only if multiple payload
   types are present in the initial offer so that at least one of them
   matches the answerer's capabilities as expressed by "max-h" and
   "max-w" in the negative answer.  Note also that the answerer's text
   box dimensions fit within the maximum values signaled in the offer.
   Finally, the answerer chooses to use version 60 of the timed text
   format.
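
   The compatibility reasoning in the sendrecv example can be replayed
   with a short Python sketch that parses unwrapped "a=fmtp" lines and
   compares dimensions against capabilities.  Treating "compatible" as
   "less than or equal to" is an assumption, and the function names are
   hypothetical.

```python
def parse_fmtp(line):
    """Parse an "a=fmtp:<pt> name=value; ..." line into (pt, dict)."""
    head, _, rest = line.partition(" ")
    payload_type = int(head.split(":", 1)[1])
    params = {}
    for item in rest.split(";"):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing semicolon
        name, _, value = item.partition("=")
        params[name.strip()] = value.strip()
    return payload_type, params


def fits(stream, capabilities):
    """True if the stream's height/width fit the peer's max-h/max-w."""
    return (int(stream["height"]) <= int(capabilities["max-h"])
            and int(stream["width"]) <= int(capabilities["max-w"]))
```

   With the offer and answer of the sendrecv example, fits(offer,
   answer) and fits(answer, offer) both hold, which matches the
   explanation given for that exchange.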

   For recvonly:

   O -> A

   m=video <port> RTP/AVP 98
   a=rtpmap:98 3gpp-tt/1000
   a=fmtp:98 tx=100; ty=100; layer=0; max-h=120; max-w=160; sver=6256,60

   A -> O

   m=video <port> RTP/AVP 98
   a=rtpmap:98 3gpp-tt/1000
   a=fmtp:98 tx=100; ty=100; layer=0; height=90; width=100; sver=60;

   In this case, the offer is different from the previous case: It does
   not include the stream properties "height", "width", and "tx3g".  The
   answerer copies the "tx", "ty", and "layer" values, thus
   acknowledging these.  "max-h" and "max-w" are not present in the
   answer because the "tx" and "ty" (and "layer") in this special case
   do not apply to the received stream, but to the sent stream.  Also,
   if offerer and answerer had very different display sizes, it would
   not be possible to express the answerer's capabilities.  In the
   example above and for an answerer with a 50x50 display, the
   translation values are already out of range.

   For sendonly:

   O -> A

   m=video <port> RTP/AVP 98
   a=rtpmap:98 3gpp-tt/1000
   a=fmtp:98 tx=100; ty=100; layer=0; height=80; width=100;
   sver=6256,60; tx3g=81...

   A -> O

   m=video <port> RTP/AVP 98
   a=rtpmap:98 3gpp-tt/1000
   a=fmtp:98 tx=100; ty=100; layer=0; height=80; width=100; max-h=100;
   max-w=160; sver=60

   Note that "max-h" and "max-w" are not present in the offer.  Also,
   with this answer, the answerer would accept the offer as is (thus
   echoing "tx", "ty", "height", "width", and "layer") and additionally
   inform the offerer about its capabilities: "max-h" and "max-w".

   Another possible answer for this case would be:

   A -> O

   m=video <port> RTP/AVP 98
   a=rtpmap:98 3gpp-tt/1000
   a=fmtp:98 tx=120; ty=105; layer=0; max-h=95; max-w=150; sver=60

   In this case, the answerer does not accept the values offered.  The
   offerer MUST use these values or else remove the stream.

9.4.  Parameter Usage outside of Offer/Answer

   SDP may also be employed outside of the Offer/Answer context, for
   instance for multimedia sessions that are announced through the
   Session Announcement Protocol (SAP) [14] or streamed through the Real
   Time Streaming Protocol (RTSP) [15].

   In this case, the receiver of a session description is required to
   support the parameters and given values for the streams, or else it
   MUST reject the session.  It is the responsibility of the sender (or
   creator) of the session descriptions to define the session parameters
   so that the probability of unsuccessful session setup is minimized.
   This is out of the scope of this document.

10.  IANA Considerations

   IANA has registered the media subtype name "3gpp-tt" for the media
   type "video" as specified in Section 8 of this document.

11.  Security Considerations

   RTP packets using the payload format defined in this specification
   are subject to the security considerations discussed in the RTP
   specification [3] and any applicable RTP profile, e.g., AVP [17].

   In particular, an attacker may invalidate the current set of active
   sample descriptions at the client by repeating a packet with an old
   sample description, i.e., by means of a replay attack.  As a result,
   the text would be displayed incorrectly, if it is displayed at all.
   Another form of attack may consist of sending redundant fragments,
   whose boundaries do not match the exact boundaries of the originals
   (as indicated by LEN) or fragments that carry different sample
   lengths (SLEN).  This may cause a decoder to crash.

   These types of attack may easily be avoided by using source
   authentication and integrity protection.
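
   Independently of authentication, a receiver can guard its decoder
   against the inconsistencies described above.  The following Python
   sketch is one possible defensive check, not a mechanism defined by
   this payload format.  The Fragment record is a simplified,
   hypothetical model of a received fragment (it is not the wire
   format), and the names FragmentStore and accept are assumptions
   made for illustration.

```python
from collections import namedtuple

# Simplified, hypothetical view of one received fragment unit:
#   timestamp      RTP timestamp, groups the fragments of one sample
#   this           position of this fragment within the sample
#   length         unit length as signalled by LEN
#   sample_length  total sample length as signalled by SLEN
Fragment = namedtuple("Fragment", "timestamp this length sample_length")


class FragmentStore:
    """Keep first-seen fragments and reject inconsistent repetitions."""

    def __init__(self):
        self._seen = {}  # (timestamp, this) -> first Fragment received
        self._slen = {}  # timestamp -> SLEN of the first fragment seen

    def accept(self, frag):
        """Return True if frag is new or an exact repeat, else False."""
        expected = self._slen.setdefault(frag.timestamp, frag.sample_length)
        if frag.sample_length != expected:
            return False  # fragments of one sample disagree on SLEN
        first = self._seen.setdefault((frag.timestamp, frag.this), frag)
        return first == frag  # a repeat must match the original exactly


store = FragmentStore()
ok_first = store.accept(Fragment(1000, 1, 40, 100))
ok_repeat = store.accept(Fragment(1000, 1, 40, 100))
bad_len = store.accept(Fragment(1000, 1, 50, 100))   # LEN differs
bad_slen = store.accept(Fragment(1000, 2, 60, 90))   # SLEN differs
```

   Such a check only keeps the decoder from acting on contradictory
   lengths.  It cannot tell a forged first fragment from a genuine one,
   which is why source authentication remains necessary.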

   Additionally, peers in a timed text session may desire to retain
   privacy in their communication, i.e., confidentiality.

   This payload format does not provide any mechanisms for achieving
   these goals.  Confidentiality, integrity protection, and authentication
   have to be solved by a mechanism external to this payload format,
   e.g., SRTP [10].

12.  References

12.1.  Normative References

   [1]  Transparent end-to-end packet switched streaming service (PSS);
        Timed Text Format (Release 6), TS 26.245 v 6.0.0, June 2004.

   [2]  ISO/IEC 14496-12:2004 Information technology - Coding of audio-
        visual objects - Part 12: ISO base media file format.

   [3]  Schulzrinne, H., Casner, S., Frederick, R., and V. Jacobson,
        "RTP: A Transport Protocol for Real-Time Applications", STD 64,
        RFC 3550, July 2003.

   [4]  Handley, M. and V. Jacobson, "SDP: Session Description
        Protocol", RFC 2327, April 1998.

   [5]  Bradner, S., "Key words for use in RFCs to Indicate Requirement
        Levels", BCP 14, RFC 2119, March 1997.

   [6]  Josefsson, S., "The Base16, Base32, and Base64 Data Encodings",
        RFC 3548, July 2003.

12.2.  Informative References

   [7]  Rosenberg, J. and H. Schulzrinne, "An RTP Payload Format for
        Generic Forward Error Correction", RFC 2733, December 1999.

   [8]  Perkins, C. and O. Hodson, "Options for Repair of Streaming
        Media", RFC 2354, June 1998.

   [9]  W3C, "Synchronised Multimedia Integration Language (SMIL 2.0)",
        August, 2001.

   [10] Baugher, M., McGrew, D., Naslund, M., Carrara, E., and K.
        Norrman, "The Secure Real-time Transport Protocol (SRTP)", RFC
        3711, March 2004.

   [11] Rey, J., Leon, D., Miyazaki, A., Varsa, V., and R. Hakenberg,
        "RTP Retransmission Payload Format", Work in Progress, September

   [12] van der Meer, J., Mackie, D., Swaminathan, V., Singer, D., and
        P. Gentric, "RTP Payload Format for Transport of MPEG-4
        Elementary Streams", RFC 3640, November 2003.

   [13] Rosenberg, J. and H. Schulzrinne, "An Offer/Answer Model with
        Session Description Protocol (SDP)", RFC 3264, June 2002.

   [14] Handley, M., Perkins, C., and E. Whelan, "Session Announcement
        Protocol", RFC 2974, October 2000.

   [15] Schulzrinne, H., Rao, A., and R. Lanphier, "Real Time Streaming
        Protocol (RTSP)", RFC 2326, April 1998.

   [16] Transparent end-to-end packet switched streaming service (PSS);
        Protocols and codecs (Release 6), TS 26.234 v 6.1.0, September

   [17] Schulzrinne, H. and S. Casner, "RTP Profile for Audio and Video
        Conferences with Minimal Control", STD 65, RFC 3551, July 2003.

   [18] Yergeau, F., "UTF-8, a transformation format of ISO 10646", STD
        63, RFC 3629, November 2003.

   [19] Hoffman, P. and F. Yergeau, "UTF-16, an encoding of ISO 10646",
        RFC 2781, February 2000.

   [20] Friedman, T., Caceres, R., and A. Clark, "RTP Control Protocol
        Extended Reports (RTCP XR)", RFC 3611, November 2003.

   [21] Ott, J., Wenger, S., Sato, N., Burmeister, C., and J. Rey,
        "Extended RTP Profile for RTCP-based Feedback (RTP/AVPF)", Work
        in Progress, August 2004.

   [22] Hellstrom, G., "RTP Payload for Text Conversation", RFC 2793,
        May 2000.

   [23] Hellstrom, G. and P. Jones, "RTP Payload for Text Conversation",
        RFC 4103, June 2005.

   [24] ITU-T Recommendation T.140 (1998) - Text conversation protocol
        for multimedia application, with amendment 1, (2000).

   [25] ISO/IEC 10646-1: (1993), Universal Multiple Octet Coded
        Character Set.

   [26] ISO/IEC FCD 14496-17 Information technology - Coding of audio-
        visual objects - Part 17: Streaming text format, Work in
        progress, June 2004.

   [27] Transparent end-to-end Packet-switched Streaming Service (PSS);
        3GPP SMIL language profile, (Release 6), TS 26.246 v 6.0.0, June

   [28] Casner, S. and P. Hoschka, "MIME Type Registration of RTP
        Payload Formats", RFC 3555, July 2003.

   [29] Freed, N. and J. Klensin, "Media Type Specifications and
        Registration Procedures", BCP 13, RFC 4288, December 2005.

   [30] Transparent end-to-end packet switched streaming service (PSS);
        3GPP file format (3GP) (Release 6), TS 26.244 V6.3. March 2005.

   [31] Castagno, R. and D. Singer, "MIME Type Registrations for 3rd
        Generation Partnership Project (3GPP) Multimedia files", RFC
        3839, July 2004.

13.  Basics of the 3GP File Structure

   This section provides a coarse overview of the 3GP file structure,
   which follows the ISO Base Media file Format [2].

   Each 3GP file consists of "Boxes".  In general, a 3GP file contains
   the File Type Box (ftyp), the Movie Box (moov), and the Media Data
   Box (mdat).  The File Type Box identifies the type and properties of
   the 3GP file itself.  The Movie Box and the Media Data Box, serving
   as containers, include their own boxes for each media.  Boxes start
   with a header, which indicates both size and type (these fields are
   named "size" and "type", respectively).  Additionally, each box type
   may include a number of nested boxes.
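
   The size/type header can be illustrated with a short Python sketch
   that walks the boxes in a byte string.  The function name iter_boxes
   is hypothetical.  The handling of a size value of 1 (a 64-bit size
   follows the type) and of 0 (the box extends to the end of the data)
   follows the ISO Base Media File Format [2] and is not described in
   this section.

```python
import struct


def iter_boxes(data):
    """Yield (box_type, payload) for each box found at the top of data."""
    offset = 0
    while offset + 8 <= len(data):
        # 32-bit big-endian size followed by a four-character type.
        size, box_type = struct.unpack_from(">I4s", data, offset)
        header = 8
        if size == 1:
            # A 64-bit size follows the type field.
            if offset + 16 > len(data):
                raise ValueError("truncated box header at offset %d" % offset)
            (size,) = struct.unpack_from(">Q", data, offset + 8)
            header = 16
        elif size == 0:
            # The box extends to the end of the data.
            size = len(data) - offset
        if size < header or offset + size > len(data):
            raise ValueError("malformed box at offset %d" % offset)
        yield box_type.decode("latin-1"), data[offset + header:offset + size]
        offset += size


# Two hand-built boxes: a 16-byte "ftyp" and an empty "mdat".
ftyp = struct.pack(">I4s", 16, b"ftyp") + b"3gp6" + b"\x00" * 4
mdat = struct.pack(">I4s", 8, b"mdat")
boxes = list(iter_boxes(ftyp + mdat))
```

   Because container boxes hold further boxes in their payload, the
   same function can be applied again to the payload of, for example,
   the Movie Box to reach its Track Boxes.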

   In the following, only those boxes are mentioned that are useful for
   the purposes of this payload format.

   The Movie Box (moov) contains one or more Track Boxes (trak), which
   include information about each track.  A Track Box contains, among
   others, the Track Header Box (tkhd), the Media Header Box (mdhd), and
   the Media Information Box (minf).

   The Track Header Box specifies the characteristics of a single track,
   where a track is, in this case, the streamed text during a session.
   Exactly one Track Header Box is present for a track.  It contains
   information about the track, such as the spatial layout (width and
   height), the video transformation matrix, and the layer number.
   Since these pieces of information are essential and static (i.e.,
   constant) for the duration of the session, they must be sent prior to
   the transmission of any text samples.

   The Media Header Box contains the "timescale" or number of time units
   that pass in one second, i.e., cycles per second or Hertz.  The Media
   Information Box includes the Sample Table Box (stbl), which contains
   all the time and data indexing of the media samples in a track. Using
   this box, it is possible to locate samples in time and to determine
   their type, size, container, and offset into that container. Inside
   the Sample Table Box, we can find the Sample Description Box (stsd,
   for finding sample descriptions), the Decoding Time to Sample Box
   (stts, for finding sample duration), the Sample Size Box (stsz), and
   the Sample to Chunk Box (stsc, for finding the sample description
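
   The relation between the timescale and sample durations can be
   sketched as follows.  The function name sample_start_times is
   hypothetical, and the sketch assumes the run-length layout of the
   Decoding Time to Sample Box defined in [2], i.e., a list of
   (sample count, sample delta) pairs with deltas expressed in
   timescale units.

```python
def sample_start_times(stts_entries, timescale):
    """Return the start time, in seconds, of every sample in a track.

    stts_entries is a list of (sample_count, sample_delta) pairs;
    timescale is the number of time units per second.
    """
    times = []
    ticks = 0
    for sample_count, sample_delta in stts_entries:
        for _ in range(sample_count):
            times.append(ticks / timescale)
            ticks += sample_delta
    return times


# Two samples lasting 3000 units each, then one lasting 1500 units,
# with a timescale of 1000 units per second.
starts = sample_start_times([(2, 3000), (1, 1500)], timescale=1000)
```

   With a timescale of 1000, one time unit corresponds to one
   millisecond, so the three samples start at 0, 3, and 6 seconds.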

   Finally, the Media Data Box contains the media data itself.  In timed
   text tracks, this box contains text samples.  For audio and video
   tracks, the equivalents are audio and video frames, respectively.
   The text sample consists of the text length, the text string, and
   one or several Modifier Boxes.  The text length is the size of the
   text in bytes.
   The text string is plain text to render.  The Modifier Box is
   information to render in addition to the text, such as color, font,
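
   The layout of a text sample can be illustrated with the following
   Python sketch.  The function name parse_text_sample is hypothetical.
   The sketch assumes that the text length is a 16-bit big-endian
   field, as specified in TS 26.245 [1], and that each Modifier Box
   uses the 32-bit size and four-character type header of [2]; neither
   detail is stated in this section.

```python
import struct


def parse_text_sample(sample):
    """Split a text sample into (text_bytes, [(box_type, payload), ...])."""
    if len(sample) < 2:
        raise ValueError("sample too short for the text length field")
    # Assumption: 16-bit big-endian text length, as in TS 26.245 [1].
    (text_len,) = struct.unpack_from(">H", sample, 0)
    end = 2 + text_len
    if end > len(sample):
        raise ValueError("text length exceeds sample size")
    text = sample[2:end]
    modifiers = []
    offset = end
    while offset < len(sample):
        if offset + 8 > len(sample):
            raise ValueError("truncated modifier box header")
        size, box_type = struct.unpack_from(">I4s", sample, offset)
        if size < 8 or offset + size > len(sample):
            raise ValueError("malformed modifier box")
        modifiers.append((box_type.decode("latin-1"),
                          sample[offset + 8:offset + size]))
        offset += size
    return text, modifiers


# Hand-built sample: the text "Hi" followed by one modifier box whose
# two payload bytes are placeholders, not a valid style record.
sample = (struct.pack(">H", 2) + b"Hi"
          + struct.pack(">I4s", 10, b"styl") + b"\x00\x00")
text, modifiers = parse_text_sample(sample)
```

   The text is returned as raw bytes, since decoding it depends on the
   character encoding in use; the modifier payloads are likewise left
   uninterpreted.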

14.  Acknowledgements

   The authors would like to thank Dave Singer, Jan van der Meer, Magnus
   Westerlund, and Colin Perkins for their comments and suggestions
   about this document.

   The authors would also like to thank Markus Gebhard for the free and
   publicly available JavE ASCII Editor (used for the ASCII drawings in
   this document) and Henrik Levkowetz for the Idnits web service.

Authors' Addresses

   Jose Rey
   Panasonic R&D Center Germany GmbH
   Monzastr. 4c
   D-63225 Langen, Germany

   Phone: +49-6103-766-134
   Fax:   +49-6103-766-166

   Yoshinori Matsui
   Matsushita Electric Industrial Co., LTD.
   1006 Kadoma
   Kadoma-shi, Osaka, Japan

   Phone: +81 6 6900 9689
   Fax:   +81 6 6900 9699

Full Copyright Statement

   Copyright (C) The Internet Society (2006).

   This document is subject to the rights, licenses and restrictions
   contained in BCP 78, and except as set forth therein, the authors
   retain all their rights.

   This document and the information contained herein are provided on an

Intellectual Property

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights.  Information
   on the procedures with respect to rights in RFC documents can be
   found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at  The IETF invites any interested party to
   bring to its attention any copyrights, patents or patent
   applications, or other proprietary rights that may cover technology
   that may be required to implement this standard.  Please address the
   information to the IETF at


   Funding for the RFC Editor function is provided by the IETF
   Administrative Support Activity (IASA).