RFC 4867


RTP Payload Format and File Storage Format for the Adaptive Multi-Rate (AMR) and Adaptive Multi-Rate Wideband (AMR-WB) Audio Codecs

4.  AMR and AMR-WB RTP Payload Formats

   The AMR and AMR-WB payload formats have identical structure, so they
   are specified together.  The only differences are in the types of
   codec frames contained in the payload.  The payload format consists
   of the RTP header, payload header, and payload data.

4.1.  RTP Header Usage

   The format of the RTP header is specified in [8].  This payload
   format uses the fields of the header in a manner consistent with that
   specification.

   The RTP timestamp corresponds to the sampling instant of the first
   sample encoded for the first frame-block in the packet.  The
   timestamp clock frequency is the same as the sampling frequency, so
   the timestamp unit is in samples.

   The duration of one speech frame-block is 20 ms for both AMR and
   AMR-WB.  For AMR, the sampling frequency is 8 kHz, corresponding to
   160 encoded speech samples per frame from each channel.  For AMR-WB,
   the sampling frequency is 16 kHz, corresponding to 320 samples per
   frame from each channel.  Thus, the timestamp is increased by 160 for
   AMR and 320 for AMR-WB for each consecutive frame-block.
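
   As an illustration of the timestamp rule above, the following
   minimal sketch advances an RTP timestamp by whole frame-blocks.  The
   names next_timestamp and SAMPLES_PER_FRAME_BLOCK are hypothetical
   and not part of this specification; the wrap-around at 32 bits
   follows the RTP header definition in [8].

```python
# Samples per 20 ms frame-block, taken from the sampling rates stated above.
SAMPLES_PER_FRAME_BLOCK = {"AMR": 160, "AMR-WB": 320}


def next_timestamp(timestamp: int, codec: str, frame_blocks: int = 1) -> int:
    """Advance an RTP timestamp by a number of consecutive 20 ms frame-blocks."""
    step = SAMPLES_PER_FRAME_BLOCK[codec] * frame_blocks
    # The RTP timestamp field is 32 bits wide, so it wraps modulo 2**32.
    return (timestamp + step) & 0xFFFFFFFF
```

   For example, a packet carrying three AMR-WB frame-blocks advances
   the timestamp of the following packet by 3 * 320 = 960.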

   A packet may contain multiple frame-blocks of encoded speech or
   comfort noise parameters.  If interleaving is employed, the frame-
   blocks encapsulated into a payload are picked according to the
   interleaving rules as defined in Section 4.4.1.  Otherwise, each
   packet covers a period of one or more contiguous 20 ms frame-block
   intervals.  In case the data from all the channels for a particular
   frame-block in the period is missing (for example, at a gateway from
   some other transport format), it is possible to indicate that no data
   is present for that frame-block rather than breaking a multi-frame-
   block packet into two, as explained in Section 4.3.2.

   To allow for error resiliency through redundant transmission, the
   periods covered by multiple packets MAY overlap in time.  A receiver
   MUST be prepared to receive any speech frame multiple times, in exact
   duplicates, in different AMR rate modes, or with data present in one
   packet and not present in another.  If multiple versions of the same
   speech frame are received, it is RECOMMENDED that the mode with the
   highest rate be used by the speech decoder.  A given frame MUST NOT
   be encoded as speech in one packet and comfort noise parameters in
   another.

   The payload length is always made an integral number of octets by
   padding with zero bits if necessary.  If additional padding is
   required to bring the payload length to a larger multiple of octets
   or for some other purpose, then the P bit in the RTP header may be
   set and padding appended as specified in [8].

   The RTP header marker bit (M) SHALL be set to 1 if the first frame-
   block carried in the packet contains a speech frame which is the
   first in a talkspurt.  For all other packets the marker bit SHALL be
   set to zero (M=0).

   The assignment of an RTP payload type for this new packet format is
   outside the scope of this document, and will not be specified here.
   It is expected that the RTP profile under which this payload format
   is being used will assign a payload type for this encoding or specify
   that the payload type is to be bound dynamically.

4.2.  Payload Structure

   The complete payload consists of a payload header, a payload table of
   contents, and speech data representing one or more speech frame-
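   blocks.  The following diagram shows the general payload format
   layout:

   +----------------+-------------------+----------------
   | payload header | table of contents | speech data ...
   +----------------+-------------------+----------------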

   Payloads containing more than one speech frame-block are called
   compound payloads.

   The following sections describe the variations taken by the payload
   format depending on whether the AMR session is set up to use the
   bandwidth-efficient mode or octet-aligned mode and any of the
   OPTIONAL functions for robust sorting, interleaving, and frame CRCs.
   Implementations SHOULD support both bandwidth-efficient and octet-
   aligned operation to increase interoperability.

4.3.  Bandwidth-Efficient Mode

4.3.1.  The Payload Header

   In bandwidth-efficient mode, the payload header simply consists of a
   4-bit codec mode request:

    0 1 2 3
   +-+-+-+-+
   |  CMR  |
   +-+-+-+-+

   CMR (4 bits): Indicates a codec mode request sent to the speech
      encoder at the site of the receiver of this payload.  The value of
      the CMR field is set to the frame type index of the corresponding
      speech mode being requested.  The frame type index may be 0-7 for
      AMR, as defined in Table 1a in [2], or 0-8 for AMR-WB, as defined
      in Table 1a in [4].  CMR value 15 indicates that no mode request
      is present, and other values are for future use.

   The codec mode request received in the CMR field is valid until the
   next codec mode request is received, i.e., a newly received CMR value
   corresponding to a speech mode, or NO_DATA overrides the previously
   received CMR value corresponding to a speech mode or NO_DATA.
   Therefore, if a terminal continuously wishes to receive frames in the
   same mode X, it needs to set CMR=X for all its outbound payloads, and
   if a terminal has no preference in which mode to receive, it SHOULD
   set CMR=15 in all its outbound payloads.

   If receiving a payload with a CMR value that is not a speech mode or
   NO_DATA, the CMR MUST be ignored by the receiver.

   In a multi-channel session, the codec mode request SHOULD be
   interpreted by the receiver of the payload as the desired encoding
   mode for all the channels in the session.

   An IP end-point SHOULD NOT set the codec mode request based on packet
   losses or other congestion indications, for several reasons:

      -  The other end of the IP path may be a gateway to a non-IP
         network (such as a radio link) that needs to set the CMR field
         to optimize performance on that network.

      -  Congestion on the IP network is managed by the IP sender, in
         this case, at the other end of the IP path.  Feedback about
         congestion SHOULD be provided to that IP sender through RTCP or
         other means, and then the sender can choose to avoid congestion
         using the most appropriate mechanism.  That may include
         adjusting the codec mode, but also includes adjusting the level
         of redundancy or number of frames per packet.

   The encoder SHOULD follow a received codec mode request, but MAY
   change to a lower-numbered mode if it so chooses, for example, to
   control congestion.

   The CMR field MUST be set to 15 for packets sent to a multicast
   group.  The encoder in the speech sender SHOULD ignore codec mode
   requests when sending speech to a multicast session but MAY use RTCP
   feedback information as a hint that a codec mode change is needed.

   The codec mode selection MAY be restricted by a session parameter to
   a subset of the available modes.  If so, the requested mode MUST be
   among the signalled subset (see Section 8).  If the received CMR
   value is outside the signalled subset of modes, it MUST be ignored.
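
   The receiver-side CMR rules of this section can be summarized in a
   short sketch.  The function name effective_cmr and the
   representation of the signalled mode subset as a Python set are
   assumptions made for illustration only.

```python
NO_MODE_REQUEST = 15  # CMR value meaning "no mode request is present"
HIGHEST_SPEECH_MODE = {"AMR": 7, "AMR-WB": 8}  # frame type index ranges 0-7 and 0-8


def effective_cmr(cmr, codec, mode_set=None):
    """Return the CMR value to act on, or None if the CMR must be ignored.

    mode_set is the optional subset of modes signalled for the session.
    """
    if cmr == NO_MODE_REQUEST:
        return NO_MODE_REQUEST
    if not 0 <= cmr <= HIGHEST_SPEECH_MODE[codec]:
        return None  # neither a speech mode nor 15: ignore
    if mode_set is not None and cmr not in mode_set:
        return None  # outside the signalled subset: ignore
    return cmr
```

   A return value of None means that the previously received codec
   mode request, if any, remains in effect.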

4.3.2.  The Payload Table of Contents

   The table of contents (ToC) consists of a list of ToC entries, each
   representing a speech frame.

   In bandwidth-efficient mode, a ToC entry takes the following format:

    0 1 2 3 4 5
   +-+-+-+-+-+-+
   |F|  FT   |Q|
   +-+-+-+-+-+-+

   F (1 bit): If set to 1, indicates that this frame is followed by
      another speech frame in this payload; if set to 0, indicates that
      this frame is the last frame in this payload.

   FT (4 bits): Frame type index, indicating either the AMR or AMR-WB
      speech coding mode or comfort noise (SID) mode of the
      corresponding frame carried in this payload.

   The value of FT is defined in Table 1a in [2] for AMR and in Table 1a
   in [4] for AMR-WB.  FT=14 (SPEECH_LOST, only available for AMR-WB)
   and FT=15 (NO_DATA) are used to indicate frames that are either lost
   or not being transmitted in this payload, respectively.

   A NO_DATA (FT=15) frame could mean either that no data for that frame
   has been produced by the speech encoder or that no data for that
   frame is transmitted in the current payload (i.e., valid data for
   that frame could be sent in either an earlier or later packet).

   If receiving a ToC entry with a FT value in the range 9-14 for AMR or
   10-13 for AMR-WB, the whole packet SHOULD be discarded.  This is to
   avoid the loss of data synchronization in the depacketization
   process, which can result in a huge degradation in speech quality.
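
   A depacketizer might apply the FT range check above before reading
   any speech data.  The following is a minimal sketch; the function
   name packet_should_be_discarded is hypothetical.

```python
def packet_should_be_discarded(ft_values, codec):
    """Return True if any ToC entry carries an FT value that is invalid for the codec."""
    # FT 9-14 is invalid for AMR; FT 10-13 is invalid for AMR-WB.
    invalid = range(9, 15) if codec == "AMR" else range(10, 14)
    return any(ft in invalid for ft in ft_values)
```

   Note that FT=14 (SPEECH_LOST) is valid for AMR-WB but falls in the
   discard range for AMR.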

   Note that packets containing only NO_DATA frames SHOULD NOT be
   transmitted in any payload format configuration, except in the case
   of interleaving.  Also, frame-blocks containing only NO_DATA frames
   at the end of a packet SHOULD NOT be transmitted in any payload
   format configuration, except in the case of interleaving.  The AMR
   SCR/DTX is described in [6] and AMR-WB SCR/DTX in [7].

   The extra comfort noise frame types specified in table 1a in [2]
   (i.e., GSM-EFR CN, IS-641 CN, and PDC-EFR CN) MUST NOT be used in
   this payload format because the standardized AMR codec is only
   required to implement the general AMR SID frame type and not those
   that are native to the incorporated encodings.

   Q (1 bit): Frame quality indicator.  If set to 0, indicates the
      corresponding frame is severely damaged, and the receiver should
      set the RX_TYPE (see [6]) to either SPEECH_BAD or SID_BAD
      depending on the frame type (FT).

   The frame quality indicator is included for interoperability with the
   ATM payload format described in ITU-T I.366.2, the UMTS Iu interface
   [20], as well as other transport formats.  The frame quality
   indicator enables damaged frames to be forwarded to the speech
   decoder for error concealment.  This can improve the speech quality
   more than dropping the damaged frames.  See Section 4.4.2.1 for more
   details.

   For multi-channel sessions, the ToC entries of all frames from a
   frame-block are placed in the ToC in consecutive order as defined in
   Section 4.1 in [12].  When multiple frame-blocks are present in a
   packet in bandwidth-efficient mode, they will be placed in the packet
   in order of their creation time.

   Therefore, with N channels and K speech frame-blocks in a packet,
   there MUST be N*K entries in the ToC, and the first N entries will be
   from the first frame-block, the second N entries will be from the
   second frame-block, and so on.

   The following figure shows an example of a ToC of three entries in a
   single-channel session using bandwidth-efficient mode.

    0                   1
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |1|  FT   |Q|1|  FT   |Q|0|  FT   |Q|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Below is an example of how the ToC entries will appear in the ToC of
   a packet carrying three consecutive frame-blocks in a session with
   two channels (L and R).

   | 1L | 1R | 2L | 2R | 3L | 3R |
     Frame-    Frame-    Frame-
     Block 1   Block 2   Block 3
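
   The ordering of the N*K ToC entries can be sketched as a nested
   loop over frame-blocks and then channels.  The function name
   toc_order and the string labels are illustrative assumptions.

```python
def toc_order(num_frame_blocks, channels):
    """List ToC entry labels: all channels of block 1, then block 2, and so on."""
    return [
        f"{block}{channel}"
        for block in range(1, num_frame_blocks + 1)  # frame-blocks in creation order
        for channel in channels                      # channels in session order
    ]
```

   Calling toc_order(3, ["L", "R"]) yields the six labels of the
   two-channel example above, in the same order.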

4.3.3.  Speech Data

   Speech data of a payload contains zero or more speech frames or
   comfort noise frames, as described in the ToC of the payload.

      Note, for ToC entries with FT=14 or 15, there will be no
      corresponding speech frame present in the speech data.

   Each speech frame represents 20 ms of speech encoded with the mode
   indicated in the FT field of the corresponding ToC entry.  The length
   of the speech frame is implicitly defined by the mode indicated in
   the FT field.  The order and numbering notation of the bits are as
   specified for Interface Format 1 (IF1) in [2] for AMR and [4] for
   AMR-WB.  As specified there, the bits of speech frames have been
   rearranged in order of decreasing sensitivity, while the bits of
   comfort noise frames are in the order produced by the encoder.  The
   resulting bit sequence for a frame of length K bits is denoted d(0),
   d(1), ..., d(K-1).
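
   Because the frame length is implied by FT, a receiver needs a
   lookup table of bits per frame type.  The sketch below uses such a
   table to compute the size of a bandwidth-efficient payload (4-bit
   header, 6-bit ToC entries, speech bits, then padding).  The bit
   counts for FT values not quoted in the examples of Section 4.3.5
   are reproduced here as an assumption and should be checked against
   [2] and [4]; the names SPEECH_BITS, frame_bits, and be_payload_size
   are hypothetical.

```python
# Speech or SID bits per frame, indexed by frame type (FT).
SPEECH_BITS = {
    "AMR": {0: 95, 1: 103, 2: 118, 3: 134, 4: 148, 5: 159, 6: 204, 7: 244, 8: 39},
    "AMR-WB": {0: 132, 1: 177, 2: 253, 3: 285, 4: 317, 5: 365,
               6: 397, 7: 461, 8: 477, 9: 40},
}


def frame_bits(ft, codec):
    """Number of data bits carried for one ToC entry."""
    if ft in (14, 15):
        return 0  # SPEECH_LOST and NO_DATA carry no data bits
    return SPEECH_BITS[codec][ft]  # raises KeyError for FT values not in the table


def be_payload_size(ft_values, codec):
    """Return (octets, padding_bits) for a bandwidth-efficient payload."""
    bits = 4 + 6 * len(ft_values) + sum(frame_bits(ft, codec) for ft in ft_values)
    octets = (bits + 7) // 8
    return octets, octets * 8 - bits
```

   With these values, a single AMR 7.4 kbps frame gives 20 octets with
   two padding bits, which agrees with the single-frame example in
   Section 4.3.5.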

4.3.4.  Algorithm for Forming the Payload

   The complete RTP payload in bandwidth-efficient mode is formed by
   packing bits from the payload header, table of contents, and speech
   frames in order (as defined by their corresponding ToC entries in the
   ToC list), and to bring the payload to octet alignment, 0 to 7
   padding bits.  Padding bits MUST be set to zero and MUST be ignored
   on reception.  They are packed contiguously into octets beginning
   with the most significant bits of the fields and the octets.

   To be precise, the four-bit payload header is packed into the first
   octet of the payload with bit 0 of the payload header in the most
   significant bit of the octet.  The four most significant bits
   (numbered 0-3) of the first ToC entry are packed into the least
   significant bits of the octet, ending with bit 3 in the least
   significant bit.  Packing continues in the second octet with bit 4 of
   the first ToC entry in the most significant bit of the octet.  If
   more than one frame is contained in the payload, then packing
   continues with the second and successive ToC entries.  Bit 0 of the
   first data frame follows immediately after the last ToC bit,
   proceeding through all the bits of the frame in numerical order.
   Bits from any successive frames follow contiguously in numerical
   order for each frame and in consecutive order of the frames.

   If speech data is missing for one or more speech frames within the
   sequence, because of, for example, DTX, a ToC entry with FT set to
   NO_DATA SHALL be included in the ToC for each of the missing frames,
   but no data bits are included in the payload for the missing frame
   (see Section 4.3.5.2 for an example).
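
   The packing procedure of this section can be sketched as follows.
   This is an illustrative single-channel implementation, not a
   reference one: the function name pack_bandwidth_efficient and the
   representation of each frame as a tuple (ft, q, bits), with bits
   given as a list of 0/1 integers in the order d(0)..d(K-1), are
   assumptions.

```python
def pack_bandwidth_efficient(cmr, frames):
    """Pack a CMR, the ToC entries, and the frame bits MSB-first into octets.

    frames is a list of (ft, q, bits); bits is empty for FT=14 or FT=15.
    """
    bits = [(cmr >> i) & 1 for i in (3, 2, 1, 0)]          # 4-bit payload header
    for index, (ft, q, _) in enumerate(frames):
        bits.append(1 if index < len(frames) - 1 else 0)   # F: another frame follows
        bits.extend((ft >> i) & 1 for i in (3, 2, 1, 0))   # FT, most significant bit first
        bits.append(q)                                     # Q: frame quality indicator
    for _, _, data in frames:
        bits.extend(data)                                  # frames in ToC order
    bits.extend([0] * (-len(bits) % 8))                    # 0 to 7 zero padding bits
    payload = bytearray()
    for start in range(0, len(bits), 8):
        octet = 0
        for bit in bits[start:start + 8]:
            octet = (octet << 1) | bit                     # first bit lands in the MSB
        payload.append(octet)
    return bytes(payload)
```

   For a single AMR 7.4 kbps frame with CMR=15 and Q=1, the first
   octet is 0xF2 (CMR=1111, F=0, and the first three FT bits 010), and
   the payload is 20 octets long.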

4.3.5.  Payload Examples

4.3.5.1.  Single-Channel Payload Carrying a Single Frame

   The following diagram shows a bandwidth-efficient AMR payload from a
   single-channel session carrying a single speech frame-block.

   In the payload, no specific mode is requested (CMR=15), the speech
   frame is not damaged at the IP origin (Q=1), and the coding mode is
   AMR 7.4 kbps (FT=4).  The encoded speech bits, d(0) to d(147), are
   arranged in descending sensitivity order according to [2].  Finally,
   two padding bits (P) are added to the end to make the payload octet
   aligned.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   | CMR=15|0| FT=4  |1|d(0)                                       |
   |                                                               |
   |                                                               |
   |                                                               |
   |                                                     d(147)|P|P|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+  Single-Channel Payload Carrying Multiple Frames

   The following diagram shows a single-channel, bandwidth-efficient
   compound AMR-WB payload that contains four frames, of which one has
   no speech data.  The first frame is a speech frame at 6.6 kbps mode
   (FT=0) that is composed of speech bits d(0) to d(131).  The second
   frame is an AMR-WB SID frame (FT=9), consisting of bits g(0) to
   g(39).  The third frame is a NO_DATA frame and does not carry any
   speech information; it is represented in the payload only by its ToC
   entry.  The fourth frame in the payload is a speech frame at 8.85
   kbps mode (FT=1); it consists of speech bits h(0) to h(176).

   As shown below, the payload carries a mode request for the encoder on
   the receiver's side to change its future coding mode to AMR-WB 8.85
   kbps (CMR=1).  None of the frames are damaged at IP origin (Q=1).
   The encoded speech and SID bits, d(0) to d(131), g(0) to g(39), and
   h(0) to h(176), are arranged in the payload in descending sensitivity
   order according to [4]. (Note, no speech bits are present for the
   third frame.)   Finally, seven zero bits are padded to the end to
   make the payload octet aligned.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   | CMR=1 |1| FT=0  |1|1| FT=9  |1|1| FT=15 |1|0| FT=1  |1|d(0)   |
   |                                                               |
   |                                                               |
   |                                                               |
   |                                                         d(131)|
   |g(0)                                                           |
   |          g(39)|h(0)                                           |
   |                                                               |
   |                                                               |
   |                                                               |
   |                                           h(176)|P|P|P|P|P|P|P|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+  Multi-Channel Payload Carrying Multiple Frames

   The following diagram shows a two-channel payload carrying 3 frame-
   blocks, i.e., the payload will contain 6 speech frames.

   In the payload, all speech frames contain the same mode 7.4 kbps
   (FT=4) and are not damaged at IP origin.  The CMR is set to 15, i.e.,
   no specific mode is requested.  The two channels are defined as left
   (L) and right (R) in that order.  The encoded speech bits are
   designated dXY(0).. dXY(K-1), where X = block number, Y = channel,
   and K is the number of speech bits for that mode.  Exemplifying this,
   for frame-block 1 of the left channel, the encoded bits are
   designated as d1L(0) to d1L(147).

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   | CMR=15|1|1L FT=4|1|1|1R FT=4|1|1|2L FT=4|1|1|2R FT=4|1|1|3L FT|
   |4|1|0|3R FT=4|1|d1L(0)                                         |
   |                                                               |
   |                                                               |
   |                                                               |
   |                                               d1L(147)|d1R(0) |
   : ...                                                           :
   |                       d1R(147)|d2L(0)                         |
   : ...                                                           :
   |d2L(147|d2R(0)                                                 |
   : ...                                                           :
   |                                       d2R(147)|d3L(0)         |
   : ...                                                           :
   |               d3L(147)|d3R(0)                                 |
   : ...                                                           :
   |                                                       d3R(147)|

4.4.  Octet-Aligned Mode

4.4.1.  The Payload Header

   In octet-aligned mode, the payload header consists of a 4-bit CMR, 4
   reserved bits, and optionally, an 8-bit interleaving header, as shown
   below:

    0                   1
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
   +-+-+-+-+-+-+-+-+- - - - - - - -
   |  CMR  |R|R|R|R|  ILL  |  ILP  |
   +-+-+-+-+-+-+-+-+- - - - - - - -

   CMR (4 bits): same as defined in Section 4.3.1.

   R: is a reserved bit that MUST be set to zero.  All R bits MUST be
      ignored by the receiver.

   ILL (4 bits, unsigned integer): This is an OPTIONAL field that is
      present only if interleaving is signalled out-of-band for the
      session.  ILL=L indicates to the receiver that the interleaving
      length is L+1, in number of frame-blocks.

   ILP (4 bits, unsigned integer): This is an OPTIONAL field that is
      present only if interleaving is signalled.  ILP MUST take a value
      between 0 and ILL, inclusive, indicating the interleaving index
      for frame-blocks in this payload in the interleaving group.  If
      the value of ILP is found greater than ILL, the payload SHOULD be
      discarded.

   ILL and ILP fields MUST be present in each packet in a session if
   interleaving is signalled for the session.  Interleaving MUST be
   performed on a frame-block basis (i.e., NOT on a frame basis) in a
   multi-channel session.

   The following example illustrates the arrangement of speech frame-
   blocks in an interleaving group during an interleaving session.  Here
   we assume ILL=L for the interleaving group that starts at speech
   frame-block n.  We also assume that the first payload packet of the
   interleaving group is s, and the number of speech frame-blocks
   carried in each payload is N.  Then we will have:

   Payload s (the first packet of this interleaving group):
      ILL=L, ILP=0,
      Carry frame-blocks: n, n+(L+1), n+2*(L+1), ..., n+(N-1)*(L+1)

   Payload s+1 (the second packet of this interleaving group):
      ILL=L, ILP=1,
      frame-blocks: n+1, n+1+(L+1), n+1+2*(L+1), ..., n+1+(N-1)*(L+1)

   Payload s+L (the last packet of this interleaving group):
      ILL=L, ILP=L,
      frame-blocks: n+L, n+L+(L+1), n+L+2*(L+1), ..., n+L+(N-1)*(L+1)

   The next interleaving group will start at frame-block n+N*(L+1).

   There will be no interleaving effect unless the number of frame-
   blocks per packet (N) is at least 2.  Moreover, the number of frame-
   blocks per payload (N) and the value of ILL MUST NOT be changed
   inside an interleaving group.  In other words, all payloads in an
   interleaving group MUST have the same ILL and MUST contain the same
   number of speech frame-blocks.

   The sender of the payload MUST only apply interleaving if the
   receiver has signalled its use through out-of-band means.  Since
   interleaving will increase buffering requirements at the receiver,
   the receiver uses media type parameter "interleaving=I" to set the
   maximum number of frame-blocks allowed in an interleaving group to I.

   When performing interleaving, the sender MUST use a proper number of
   frame-blocks per payload (N) and ILL so that the resulting size of an
   interleaving group is less than or equal to I, that is, N*(L+1)<=I.
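
   The interleaving pattern and the receiver limit can be expressed
   compactly.  In the sketch below, the function names and the
   parameter names (n for the first frame-block of the group,
   blocks_per_payload for N, max_group_size for I) are assumptions
   chosen to mirror the notation used in this section.

```python
def interleaved_frame_blocks(n, ill, ilp, blocks_per_payload):
    """Frame-block indices carried by the payload with index ILP in a group."""
    if not 0 <= ilp <= ill:
        raise ValueError("ILP must be between 0 and ILL, inclusive")
    # Payload s+ILP carries n+ILP, n+ILP+(L+1), ..., n+ILP+(N-1)*(L+1).
    return [n + ilp + k * (ill + 1) for k in range(blocks_per_payload)]


def fits_receiver_limit(blocks_per_payload, ill, max_group_size):
    """Check the constraint N*(L+1) <= I for the 'interleaving=I' parameter."""
    return blocks_per_payload * (ill + 1) <= max_group_size
```

   With n=1, ILL=2, and N=3, the three payloads of a group carry
   frame-blocks 1, 4, 7; then 2, 5, 8; then 3, 6, 9, and the group size
   is 9 frame-blocks.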

4.4.2.  The Payload Table of Contents and Frame CRCs

   The table of contents (ToC) in octet-aligned mode consists of a list
   of ToC entries where each entry corresponds to a speech frame carried
   in the payload and, optionally, a list of speech frame CRCs.  That
   is, the ToC is as follows:

   | list of ToC entries |
   | list of frame CRCs  | (optional)
    - - - - - - - - - - -

      Note, for ToC entries with FT=14 or 15, there will be no
      corresponding speech frame or frame CRC present in the payload.

   The list of ToC entries is organized in the same way as described for
   bandwidth-efficient mode in 4.3.2, with the following exception:
   when interleaving is used, the frame-blocks in the ToC will almost
   never be placed consecutively in time.  Instead, the presence and
   order of the frame-blocks in a packet will follow the pattern
   described in 4.4.1.

   The following example shows the ToC of three consecutive packets,
   each carrying three frame-blocks, in an interleaved two-channel
   session.  Here, the two channels are left (L) and right (R) with L
   coming before R, and the interleaving length is 3 (i.e., ILL=2).
   This results in the interleaving group size of 9 frame-blocks.

   Packet #1

   ILL=2, ILP=0:
   | 1L | 1R | 4L | 4R | 7L | 7R |
     Frame-    Frame-    Frame-
     Block 1   Block 4   Block 7

   Packet #2

   ILL=2, ILP=1:
   | 2L | 2R | 5L | 5R | 8L | 8R |
     Frame-    Frame-    Frame-
     Block 2   Block 5   Block 8

   Packet #3

   ILL=2, ILP=2:
   | 3L | 3R | 6L | 6R | 9L | 9R |
     Frame-    Frame-    Frame-
     Block 3   Block 6   Block 9

   A ToC entry takes the following format in octet-aligned mode:

    0 1 2 3 4 5 6 7
   +-+-+-+-+-+-+-+-+
   |F|  FT   |Q|P|P|
   +-+-+-+-+-+-+-+-+

   F (1 bit): see definition in Section 4.3.2.

   FT (4 bits, unsigned integer): see definition in Section 4.3.2.

   Q (1 bit): see definition in Section 4.3.2.

   P bits: padding bits, MUST be set to zero, and MUST be ignored on
      reception.

   The list of CRCs is OPTIONAL.  It only exists if the use of CRC is
   signalled out-of-band for the session.  When present, each CRC in the
   list is 8 bits long and corresponds to a speech frame (NOT a frame-
   block) carried in the payload.  Calculation and use of the CRC is
   specified in the next section.

4.4.2.1.  Use of Frame CRC for UED over IP

   The general concept of UED/UEP over IP is discussed in Section 3.6.
   This section provides more details on how to use the frame CRC in the
   octet-aligned payload header together with a partial transport layer
   checksum to achieve UED.

   To achieve UED, one SHOULD use a transport layer checksum (for
   example, the one defined in UDP-Lite [19]) to protect the IP,
   transport protocol (e.g., UDP-Lite), and RTP headers, as well as the
   payload header and the table of contents in the payload.  The frame
   CRC, when used, MUST be calculated only over all class A bits in the
   AMR or AMR-WB frame.  Class B and C bits in the AMR or AMR-WB frame
   MUST NOT be included in the CRC calculation and SHOULD NOT be covered
   by the transport checksum.

      Note, the number of class A bits for various coding modes in AMR
      codec is specified as informative in [2] and is therefore copied
      into Table 1 in Section 3.6 to make it normative for this payload
      format.  The number of class A bits for various coding modes in
      AMR-WB codec is specified as normative in Table 2 in [4], and the
      SID frame (FT=9) has 40 class A bits.  These definitions of class
      A bits MUST be used for this payload format.

   If the transport layer checksum or link layer checksum detects any
   errors within the protected (sensitive) part, it is assumed that the
   complete packet will be discarded as defined by UDP-Lite [19].

   The receiver of the payload SHOULD examine the data integrity of the
   received class A bits by re-calculating the CRC over the received
   class A bits and comparing the result to the value found in the
   received payload header.  If the two values mismatch, the receiver
   SHALL consider the class A bits in the received frame damaged and
   MUST clear the Q flag of the frame (i.e., set it to 0).  This will
   subsequently cause the frame to be marked as SPEECH_BAD, if the FT of
   the frame is 0..7 for AMR or 0..8 for AMR-WB, or SID_BAD if the FT of
   the frame is 8 for AMR or 9 for AMR-WB, before it is passed to the
   speech decoder.  See [6] and [7] for more details.
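
   The mapping from a frame with Q=0 to the receive type described
   above can be sketched as follows.  The function name
   rx_type_for_bad_frame is hypothetical, and only the two receive
   types named in this section are modelled.

```python
def rx_type_for_bad_frame(ft, codec):
    """Receive type for a frame whose Q flag is 0, or None if not applicable."""
    highest_speech_ft = 7 if codec == "AMR" else 8
    if 0 <= ft <= highest_speech_ft:
        return "SPEECH_BAD"       # FT 0..7 for AMR, 0..8 for AMR-WB
    if ft == highest_speech_ft + 1:
        return "SID_BAD"          # FT 8 for AMR, 9 for AMR-WB
    return None                   # e.g., NO_DATA carries no frame to mark
```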

   The following example shows an octet-aligned ToC with a CRC list for
   a payload containing 3 speech frames from a single-channel session
   (assuming none of the FTs is equal to 14 or 15):

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   |1|  FT#1 |Q|P|P|1|  FT#2 |Q|P|P|0|  FT#3 |Q|P|P|     CRC#1     |
   |     CRC#2     |     CRC#3     |

   Each of the CRCs takes 8 bits

     0   1   2   3   4   5   6   7
   | c0| c1| c2| c3| c4| c5| c6| c7|
   (MSB)                       (LSB)

   and is calculated by the cyclic generator polynomial,

     C(x) = 1 + x^2 + x^3 + x^4 + x^8

   where ^ is the exponentiation operator.

   In binary form, the polynomial appears as follows: 101110001

   The actual calculation of the CRC is made as follows:  First, an
   8-bit CRC register is reset to zero: 00000000.  For each bit over
   which the CRC shall be calculated, an XOR operation is made between
   the rightmost (LSB) bit of the CRC register and the bit.  The CRC
   register is then right-shifted one step (each bit's significance is
   reduced by one), inputting a "0" as the leftmost bit (MSB).  If the
   result of the XOR operation mentioned above is a "1", then "10111000"
   is bit-wise XOR-ed into the CRC register.  This operation is repeated
   for each bit that the CRC should cover.  In this case, the first bit
   would be d(0) of the speech frame that the CRC should cover.
   When the last bit (e.g., d(54) for AMR 5.9 according to Table 1 in
   Section 3.6) has been used in this CRC calculation, the contents in
   CRC register should simply be copied to the corresponding field in
   the list of CRCs.

   Fast calculation of the CRC on a general-purpose CPU is possible
   using a table-driven algorithm.
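
   The bit-serial procedure described above can be written directly as
   a short routine.  This is a sketch of the bit-by-bit method only
   (not the table-driven variant); the function name frame_crc and the
   representation of the class A bits as a list of 0/1 integers
   starting at d(0) are assumptions.

```python
def frame_crc(class_a_bits):
    """Compute the 8-bit frame CRC over the given class A bits."""
    register = 0                                # CRC register reset to 00000000
    for bit in class_a_bits:
        feedback = (register & 1) ^ bit         # XOR of register LSB and input bit
        register >>= 1                          # right shift, 0 enters as MSB
        if feedback:
            register ^= 0b10111000              # XOR "10111000" into the register
    return register
```

   For AMR 5.9, the input would be the 55 bits d(0) to d(54).  As a
   small check of the procedure, a single input bit of 1 leaves
   10111000 in the register.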

4.4.3.  Speech Data

   In octet-aligned mode, speech data is carried in a similar way to
   that in the bandwidth-efficient mode as discussed in Section 4.3.3,
   with the following exceptions:

      -  The last octet of each speech frame MUST be padded with zero
         bits at the end if all bits in the octet are not used.  The
         padding bits MUST be ignored on reception.  In other words,
         each speech frame MUST be octet-aligned.

      -  When multiple speech frames are present in the speech data
         (i.e., compound payload), the speech frames are arranged either
         one whole frame after another as usual, or with the octets of
         all frames interleaved together at the octet level, depending
         on the media type parameters negotiated for the payload type.
         Since the bits within each frame are ordered with the most
         error-sensitive bits first, interleaving the octets collects
         those sensitive bits from all frames to be nearer the beginning
         of the packet.  This is called "robust sorting order" which
         allows the application of UED (such as UDP-Lite [19]) or UEP
         (such as the ULP [22]) mechanisms to the payload data.  The
         details of assembling the payload are given in the next
         section.

   The use of robust sorting order for a payload type MUST be agreed via
   out-of-band means.  Section 8 specifies a media type parameter for
   this purpose.

   Note, robust sorting order MUST only be performed on the frame level
   and thus is independent of interleaving, which is at the frame-block
   level, as described in Section 4.4.1. In other words, robust sorting
   can be applied to either non-interleaved or interleaved payload

Top      Up      ToC       Page 31 
4.4.4.  Methods for Forming the Payload

   Two different packetization methods, namely, normal order and robust
   sorting order, exist for forming a payload in octet-aligned mode.  In
   both cases, the payload header and table of contents are packed into
   the payload the same way; the difference is in the packing of the
   speech frames.

   The payload begins with the payload header of one octet, or two
   octets if frame interleaving is selected.  The payload header is
   followed by the table of contents consisting of a list of one-octet
   ToC entries.  If frame CRCs are to be included, they follow the table
   of contents with one 8-bit CRC filling each octet.  Note that if a
   given frame has a ToC entry with FT=14 or 15, there will be no CRC
   present.

   The speech data follows the table of contents, or the CRCs if
   present.  For packetization in the normal order, all of the octets
   comprising a speech frame are appended to the payload as a unit.  The
   speech frames are packed in the same order as their corresponding ToC
   entries are arranged in the ToC list, with the exception that if a
   given frame has a ToC entry with FT=14 or 15, there will be no data
   octets present for that frame.

   For packetization in robust sorting order, the octets of all speech
   frames are interleaved together at the octet level.  That is, the
   data portion of the payload begins with the first octet of the first
   frame, followed by the first octet of the second frame, then the
   first octet of the third frame, and so on.  After the first octet of
   the last frame has been appended, the cycle repeats with the second
   octet of each frame.  The process continues for as many octets as are
   present in the longest frame.  If the frames are not all the same
   octet length, a shorter frame is skipped once all octets in it have
   been appended.  The order of the frames in the cycle will be
   sequential if frame interleaving is not in use, or according to the
   interleave pattern specified in the payload header if frame
   interleaving is in use.  Note that if a given frame has a ToC entry
   with FT=14 or 15, there will be no data octets present for that
   frame, so it is skipped in the robust sorting cycle.
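
   The two packing orders described above can be sketched as follows.
   This is a minimal illustration, not part of the specification: the
   function names are hypothetical, each frame is assumed to be already
   padded to whole octets, and frames with FT=14 or 15 are assumed to
   be represented as empty byte strings.

```python
from itertools import zip_longest

def pack_normal(frames):
    """Normal order: each frame's octets are appended as a unit."""
    return b"".join(frames)

def pack_robust(frames):
    """Robust sorting order: octet 0 of every frame, then octet 1, ..."""
    out = bytearray()
    # Frames without data octets (FT=14 or 15) are empty here, so
    # dropping empty entries skips them in the cycle.
    data = [f for f in frames if f]
    for column in zip_longest(*data):
        # zip_longest pads exhausted (shorter) frames with None;
        # those positions are skipped.
        out.extend(octet for octet in column if octet is not None)
    return bytes(out)

def unpack_robust(data, lengths):
    """Inverse of pack_robust, given each frame's octet length
    (which a receiver derives from the ToC entries)."""
    frames = [bytearray() for _ in lengths]
    pos = 0
    for i in range(max(lengths, default=0)):
        for frame, n in zip(frames, lengths):
            if i < n:
                frame.append(data[pos])
                pos += 1
    return [bytes(f) for f in frames]
```

   The frames are passed in the order of their ToC entries, so the
   same sketch applies whether or not frame interleaving is in use.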

   The UED and/or UEP is RECOMMENDED to cover at least the RTP header,
   payload header, table of contents, and class A bits of a sorted
   payload.  Exactly how many octets need to be covered depends on the
   network and application.  If CRCs are used together with robust
   sorting, only the RTP header, the payload header, and the ToC SHOULD
   be covered by UED/UEP.  The means for communicating the number of
   octets to be covered to other layers performing UED/UEP is beyond the
   scope of this specification.

4.4.5.  Payload Examples

4.4.5.1.  Basic Single-Channel Payload Carrying Multiple Frames

   The following diagram shows an octet aligned payload from a single
   channel payload type that carries two AMR frames of 7.95 kbps coding
   mode (FT=5).  In the payload, a codec mode request is sent (CMR=6),
   requesting the encoder at the receiver's side to use AMR 10.2 kbps
   coding mode.  No frame CRC, interleaving, or robust sorting is in
   use.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   | CMR=6 |R|R|R|R|1|FT#1=5 |Q|P|P|0|FT#2=5 |Q|P|P|   f1(0..7)    |
   |   f1(8..15)   |  f1(16..23)   |  ....                         |
   : ...                                                           :
   |                         ...   |f1(152..158) |P|   f2(0..7)    |
   |   f2(8..15)   |  f2(16..23)   |  ....                         |
   : ...                                                           :
   |                         ...   |f2(152..158) |P|

   Note, in the above example, the last octet in both speech frames is
   padded with one zero bit to make it octet-aligned.

4.4.5.2.  Two-Channel Payload with CRC, Interleaving, and Robust Sorting

   This example shows an octet aligned payload from a two-channel
   payload type.  Two frame-blocks, each containing two speech frames of
   7.95 kbps coding mode (FT=5), are carried in this payload.

   The two channels are left (L) and right (R) with L coming before R.
   In the payload, a codec mode request is also sent (CMR=6), requesting
   the encoder at the receiver's side to use AMR 10.2 kbps coding mode.

   Moreover, frame CRC, robust sorting, and frame-block interleaving are
   all enabled for the payload type.  The interleaving length is 2
   (ILL=1), and this payload is the first one in an interleaving group
   (ILP=0).

   The first two frames in the payload are the L and R channel speech
   frames of frame-block #1, consisting of bits f1L(0..158) and
   f1R(0..158), respectively.  The next two frames are the L and R
   channel frames of frame-block #3, consisting of bits f3L(0..158) and
   f3R(0..158), respectively, due to interleaving.  For each of the four
   speech frames, a CRC is calculated as CRC1L(0..7), CRC1R(0..7),
   CRC3L(0..7), and CRC3R(0..7), respectively.  Finally, the payload is
   robust sorted.

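    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | CMR=6 |R|R|R|R| ILL=1 | ILP=0 |1|FT#1L=5|Q|P|P|1|FT#1R=5|Q|P|P|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |1|FT#3L=5|Q|P|P|0|FT#3R=5|Q|P|P|      CRC1L    |      CRC1R    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |      CRC3L    |      CRC3R    |   f1L(0..7)   |   f1R(0..7)   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   f3L(0..7)   |   f3R(0..7)   |  f1L(8..15)   |  f1R(8..15)   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  f3L(8..15)   |  f3R(8..15)   |  f1L(16..23)  |  f1R(16..23)  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   : ...                                                           :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | f3L(144..151) | f3R(144..151) |f1L(152..158)|P|f1R(152..158)|P|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |f3L(152..158)|P|f3R(152..158)|P|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+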

   Note, in the above example, the last octet in all four speech frames
   is padded with one zero bit to make it octet-aligned.
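
   As an illustration of the header layout in the two examples above,
   the following sketch packs the octet-aligned payload header and the
   table of contents.  The function names are hypothetical, and the Q
   bit is assumed to be 1 (frame not damaged) because the diagrams do
   not give its value.

```python
def payload_header(cmr, ill=None, ilp=None):
    """Octet-aligned payload header: CMR plus four reserved bits,
    followed by an ILL/ILP octet only when interleaving is in use."""
    header = bytes([(cmr & 0x0F) << 4])
    if ill is not None:
        header += bytes([((ill & 0x0F) << 4) | (ilp & 0x0F)])
    return header

def toc_entry(ft, q=1, last=False):
    """One-octet ToC entry laid out as F(1) | FT(4) | Q(1) | P | P.
    F is 1 when another ToC entry follows and 0 on the last entry."""
    f = 0 if last else 1
    return bytes([(f << 7) | ((ft & 0x0F) << 3) | ((q & 1) << 2)])

def table_of_contents(frame_types, q=1):
    """ToC for a list of frame types, in payload order."""
    last_index = len(frame_types) - 1
    return b"".join(
        toc_entry(ft, q, last=(i == last_index))
        for i, ft in enumerate(frame_types)
    )

# Single-channel example: CMR=6, two FT=5 frames, no interleaving.
example1 = payload_header(6) + table_of_contents([5, 5])

# Two-channel example: CMR=6, ILL=1, ILP=0, four FT=5 frames.
example2 = payload_header(6, ill=1, ilp=0) + table_of_contents([5, 5, 5, 5])
```

   In the two-channel example, these six octets are followed by the
   four CRC octets and then by the robust-sorted speech data.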

4.5.  Implementation Considerations

   An application implementing this payload format MUST understand all
   the payload parameters in the out-of-band signaling used.  For
   example, if an application uses SDP, all the SDP and media type
   parameters in this document MUST be understood.  This requirement
   ensures that an implementation always can decide if it is capable or
   not of communicating.

   No operating mode of the payload format is mandatory to implement.
   The requirements of the application using the payload format should
   be used to determine what to implement.  To achieve basic
   interoperability, an implementation SHOULD at least implement both
   bandwidth-efficient and octet-aligned modes for a single audio
   channel.  The other operating modes: interleaving, robust sorting,
   and frame-wise CRC (in both single and multi-channel) are OPTIONAL to
   implement.

   The mode-change-period, mode-change-capability, and mode-change-
   neighbor parameters are intended for signaling with GSM endpoints.
   When interoperability with GSM is desired, encoders SHOULD only
   perform codec mode changes to neighboring modes and in integer
   multiples of 40 ms (two frame-blocks), but decoders SHOULD accept
   codec mode changes at any time, i.e., for every frame-block.  The
   encoder may arbitrarily select the initial phase (odd or even frame-
   block) where codec mode changes are performed, but then SHOULD stick
   to that phase as far as possible.  However, in rare cases, handovers
   or other events (e.g., call forwarding) may change this phase and may
   also cause mode changes to non-neighboring modes.  The decoder SHALL
   therefore be prepared to accept changes also in the other phase and
   to other modes.  Section 8 specifies the usage of the parameters
   mode-change-period and mode-change-capability to indicate the desired
   behavior in applications.
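
   The encoder-side restriction described above can be sketched as a
   simple check.  This is only an illustration under stated
   assumptions: the function and parameter names are hypothetical,
   "neighboring" is taken to mean adjacent entries in the negotiated
   mode set, and frame-blocks are assumed to be numbered from 0 so
   that the phase is 0 (even) or 1 (odd).

```python
def change_allowed(prev_mode, new_mode, block_index, mode_set,
                   period=2, phase=0):
    """Return True if an encoder may use new_mode for this frame-block.

    period=2 corresponds to changes every 40 ms (two frame-blocks).
    """
    if new_mode == prev_mode:
        return True          # keeping the current mode is always fine
    modes = sorted(mode_set)
    if prev_mode not in modes or new_mode not in modes:
        return False         # only modes from the mode set are usable
    if block_index % period != phase % period:
        return False         # not on a permitted change boundary
    # Only a step to the next lower or next higher mode is allowed.
    return abs(modes.index(new_mode) - modes.index(prev_mode)) == 1
```

   A decoder would not apply such a check, since it is expected to
   accept mode changes in any frame-block and to any mode.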

   See 3GPP TS 26.103 [28] for preferred AMR and AMR-WB configurations
   for operation in GSM and 3GPP UMTS networks.  In gateway scenarios,
   encoders can be requested through the "mode-set" parameter to use a
   limited mode-set that is supported by the link beyond the gateway.
   Further, to avoid congestion on that link, the encoder SHOULD limit
   the initial codec mode for a session to a lower mode, until at least
   one frame-block is received with rate control information.

4.5.1.  Decoding Validation

   When processing a received payload packet, if the receiver finds that
   the calculated payload length, based on the information for the
   payload type and the values found in the payload header fields, does
   not match the size of the received packet, the receiver SHOULD
   discard the packet.  This is because decoding a packet that has
   errors in its length field could severely degrade the speech quality.
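
   A length check of this kind could look like the following sketch
   for octet-aligned AMR payloads.  It is a hedged illustration, not
   a complete validator: the function names are hypothetical, the
   frame sizes in bits are taken from the AMR frame structure
   specification rather than from this section, AMR-WB is not
   handled, and the multi-channel rule that the number of ToC entries
   is a multiple of the channel count is not checked.

```python
# Speech bits per AMR frame type (FT 0..7 speech modes, 8 = SID,
# 15 = NO_DATA).  Other FT values are treated as invalid here.
AMR_FRAME_BITS = {0: 95, 1: 103, 2: 118, 3: 134, 4: 148,
                  5: 159, 6: 204, 7: 244, 8: 39, 15: 0}

def expected_length(payload, interleaving=False, crc=False):
    """Payload length in octets implied by the header and ToC,
    or None if the ToC is malformed or runs past the packet."""
    pos = 2 if interleaving else 1      # payload header octets
    data_octets = 0
    crc_octets = 0
    while True:
        if pos >= len(payload):
            return None                 # ToC never terminated
        entry = payload[pos]
        pos += 1
        ft = (entry >> 3) & 0x0F
        if ft not in AMR_FRAME_BITS:
            return None
        data_octets += (AMR_FRAME_BITS[ft] + 7) // 8
        if crc and ft != 15:
            crc_octets += 1             # no CRC for frames without data
        if not entry & 0x80:            # F=0 marks the last ToC entry
            break
    return pos + crc_octets + data_octets

def length_is_valid(payload, interleaving=False, crc=False):
    """True if the received size matches the calculated length."""
    return expected_length(payload, interleaving, crc) == len(payload)
```

   For the single-channel example in Section 4.4.5, this gives
   1 + 2 + 2 * 20 = 43 octets.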

5.  AMR and AMR-WB Storage Format

   The storage format is used for storing AMR or AMR-WB speech frames in
   a file or as an email attachment.  Multiple channel content is
   supported.

   In general, an AMR or AMR-WB file has the following structure:

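   +------------------+
   | Header           |
   +------------------+
   | Speech frame 1   |
   +------------------+
   : ...              :
   +------------------+
   | Speech frame n   |
   +------------------+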

   Note, to preserve interoperability with already deployed
   implementations, single-channel content uses a file header format
   different from that of multi-channel content.

   There also exists another storage format for AMR and AMR-WB that is
   suitable for applications with more advanced demands on the storage
   format, like random access or synchronization with video.  This
   format is the 3GPP-specified ISO-based multimedia file format 3GP
   [31].  Its media type is specified by RFC 3839 [32].

5.1.  Single-Channel Header

   A single-channel AMR or AMR-WB file header contains only a magic
   number.  Different magic numbers are defined to distinguish AMR from
   AMR-WB.

   The magic number for single-channel AMR files MUST consist of the
   ASCII character string:

      "#!AMR\n"
      (or 0x2321414d520a in hexadecimal).

   The magic number for single-channel AMR-WB files MUST consist of the
   ASCII character string:

      "#!AMR-WB\n"
      (or 0x2321414d522d57420a in hexadecimal).

   Note, the "\n" is an important part of the magic numbers and MUST be
   included in the comparison, since, otherwise, the single-channel
   magic numbers above will become indistinguishable from those of the
   multi-channel files defined in the next section.

5.2.  Multi-Channel Header

   The multi-channel header consists of a magic number followed by a
   32-bit channel description field, giving the multi-channel header the
   following structure:

   +------------------+
   | magic number     |
   +------------------+
   | chan-desc field  |
   +------------------+

   The magic number for multi-channel AMR files MUST consist of the
   ASCII character string:

      "#!AMR_MC1.0\n"
      (or 0x2321414d525F4D43312E300a in hexadecimal).

   The magic number for multi-channel AMR-WB files MUST consist of the
   ASCII character string:

      "#!AMR-WB_MC1.0\n"
      (or 0x2321414d522d57425F4D43312E300a in hexadecimal).

   The version number in the magic numbers refers to the version of the
   file format.

   The 32 bit channel description field is defined as:

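    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |      Reserved bits                                    | CHAN  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+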

   Reserved bits: MUST be set to 0 when written, and a reader MUST
                  ignore them.

   CHAN (4 bits, unsigned integer): Indicates the number of audio
   channels contained in this storage file.  The valid values and the
   order of the channels within a frame-block are specified in Section
   4.1 in [12].
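
   The header rules in Sections 5.1 and 5.2 can be illustrated with
   the following sketch.  The function name and return shape are
   hypothetical, and the channel description field is assumed to be
   stored in network (big-endian) byte order, matching the bit
   numbering of the diagram above.  Valid CHAN values are not checked
   because they are defined in [12].

```python
import struct

# Magic number -> (codec name, multi-channel header?).  The trailing
# "\n" is part of each magic number and is included in the comparison.
MAGIC = {
    b"#!AMR\n": ("AMR", False),
    b"#!AMR-WB\n": ("AMR-WB", False),
    b"#!AMR_MC1.0\n": ("AMR", True),
    b"#!AMR-WB_MC1.0\n": ("AMR-WB", True),
}

def parse_header(data):
    """Return (codec, channels, header_length) for a storage file,
    or raise ValueError if the header is not recognized."""
    for magic, (codec, multi) in MAGIC.items():
        if not data.startswith(magic):
            continue
        if not multi:
            return codec, 1, len(magic)
        end = len(magic) + 4
        if len(data) < end:
            raise ValueError("truncated channel description field")
        (desc,) = struct.unpack(">I", data[len(magic):end])
        # CHAN is the low 4 bits; reserved bits are ignored on read.
        return codec, desc & 0x0F, end
    raise ValueError("not an AMR or AMR-WB storage file")
```

   Because the "\n" is compared, no magic number is a prefix of
   another, so the order of the comparisons does not matter.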

5.3.  Speech Frames

   After the file header, speech frame-blocks consecutive in time are
   stored in the file.  Each frame-block contains a number of octet-
   aligned speech frames equal to the number of channels, and stored in
   increasing order, starting with channel 1.

   Each stored speech frame starts with a one-octet frame header with
   the following format:

    0 1 2 3 4 5 6 7
   +-+-+-+-+-+-+-+-+
   |P|  FT   |Q|P|P|
   +-+-+-+-+-+-+-+-+

   The FT field and the Q bit are defined in the same way as in Section
   4.3.2.  The P bits are padding and MUST be set to 0, and MUST be
   ignored.

   Following this one-octet header come the speech bits as defined in
   Section 4.4.3.  The last octet of each frame is padded with zeroes,
   if needed, to achieve octet alignment.

   The following example shows an AMR frame in 5.9 kbps coding mode
   (with 118 speech bits) in the storage format.

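    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |P| FT=2  |Q|P|P|                                               |
   +-+-+-+-+-+-+-+-+                                               +
   |                                                               |
   +          Speech bits for frame-block n, channel k             +
   |                                                               |
   +                                                           +-+-+
   |                                                           |P|P|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+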

   Non-received speech frames or frame-blocks between SID updates during
   non-speech periods MUST be stored as NO_DATA frames (frame type 15,
   as defined in [2] and [4]).  Frames or frame-blocks lost in
   transmission MUST be stored as NO_DATA frames or SPEECH_LOST (frame
   type 14, only available for AMR-WB) in complete frame-blocks to keep
   synchronization with the original media.

   Comfort noise frames of other types than AMR SID (FT=8) (i.e., frame
   type 9, 10, and 11 for AMR) SHALL NOT be used in the AMR file format.
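
   Reading the stored frames of a single-channel AMR file could be
   sketched as follows.  The function name is hypothetical, and the
   table of speech bits per frame type comes from the AMR frame
   structure specification rather than from this section.  AMR-WB
   files would need a different table.

```python
# Speech bits per AMR frame type (FT 0..7 speech modes, 8 = SID,
# 15 = NO_DATA).  Frame types 9, 10, and 11 are not used in AMR files.
AMR_FRAME_BITS = {0: 95, 1: 103, 2: 118, 3: 134, 4: 148,
                  5: 159, 6: 204, 7: 244, 8: 39, 15: 0}

def iter_frames(body):
    """Yield (ft, q, speech_octets) for each stored frame.

    `body` is the file content after the file header.
    """
    pos = 0
    while pos < len(body):
        header = body[pos]              # layout: P | FT(4) | Q | P | P
        ft = (header >> 3) & 0x0F
        q = (header >> 2) & 1
        if ft not in AMR_FRAME_BITS:
            raise ValueError(f"unexpected frame type {ft} at offset {pos}")
        size = (AMR_FRAME_BITS[ft] + 7) // 8   # octets after padding
        frame = body[pos + 1:pos + 1 + size]
        if len(frame) != size:
            raise ValueError("truncated speech frame")
        yield ft, q, frame
        pos += 1 + size
```

   For the 5.9 kbps example above, the 118 speech bits occupy 15
   octets, so the stored frame is 16 octets including its header.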
