Tech-invite3GPPspaceIETFspace
959493929190898887868584838281807978777675747372717069686766656463626160595857565554535251504948474645444342414039383736353433323130292827262524232221201918171615141312111009080706050403020100
in Index   Prev   Next

RFC 6716

Definition of the Opus Audio Codec

Pages: 326
Proposed Standard
Errata
Updated by:  8251
Part 2 of 14 – Pages 13 to 31
First   Prev   Next

Top   ToC   RFC6716 - Page 13   prevText

3. Internal Framing

The Opus encoder produces "packets", which are each a contiguous set of bytes meant to be transmitted as a single unit. The packets described here do not include such things as IP, UDP, or RTP headers, which are normally found in a transport-layer packet. A single packet may contain multiple audio frames, so long as they share a common set of parameters, including the operating mode, audio bandwidth, frame size, and channel count (mono vs. stereo). This section describes the possible combinations of these parameters and the internal framing used to pack multiple frames into a single packet. This framing is not self-delimiting. Instead, it assumes that a lower layer (such as UDP or RTP [RFC3550] or Ogg [RFC3533] or Matroska [MATROSKA-WEBSITE]) will communicate the length, in bytes, of the packet, and it uses this information to reduce the framing overhead in the packet itself. A decoder implementation MUST support the framing described in this section. An alternative, self- delimiting variant of the framing is described in Appendix B. Support for that variant is OPTIONAL. All bit diagrams in this document number the bits so that bit 0 is the most significant bit of the first byte, and bit 7 is the least significant. Bit 8 is thus the most significant bit of the second byte, etc. Well-formed Opus packets obey certain requirements, marked [R1] through [R7] below. These are summarized in Section 3.4 along with appropriate means of handling malformed packets.

3.1. The TOC Byte

A well-formed Opus packet MUST contain at least one byte [R1]. This byte forms a table-of-contents (TOC) header that signals which of the various modes and configurations a given packet uses. It is composed of a configuration number, "config", a stereo flag, "s", and a frame count code, "c", arranged as illustrated in Figure 1. A description of each of these fields follows.
Top   ToC   RFC6716 - Page 14
                              0
                              0 1 2 3 4 5 6 7
                             +-+-+-+-+-+-+-+-+
                             | config  |s| c |
                             +-+-+-+-+-+-+-+-+

                          Figure 1: The TOC Byte

   The top five bits of the TOC byte, labeled "config", encode one of 32
   possible configurations of operating mode, audio bandwidth, and frame
   size.  As described, the LP (SILK) layer and MDCT (CELT) layer can be
   combined in three possible operating modes:

   1.  A SILK-only mode for use in low bitrate connections with an audio
       bandwidth of WB or less,

   2.  A Hybrid (SILK+CELT) mode for SWB or FB speech at medium
       bitrates, and

   3.  A CELT-only mode for very low delay speech transmission as well
       as music transmission (NB to FB).

   The 32 possible configurations each identify which one of these
   operating modes the packet uses, as well as the audio bandwidth and
   the frame size.  Table 2 lists the parameters for each configuration.
Top   ToC   RFC6716 - Page 15
   +-----------------------+-----------+-----------+-------------------+
   | Configuration         | Mode      | Bandwidth | Frame Sizes       |
   | Number(s)             |           |           |                   |
   +-----------------------+-----------+-----------+-------------------+
   | 0...3                 | SILK-only | NB        | 10, 20, 40, 60 ms |
   |                       |           |           |                   |
   | 4...7                 | SILK-only | MB        | 10, 20, 40, 60 ms |
   |                       |           |           |                   |
   | 8...11                | SILK-only | WB        | 10, 20, 40, 60 ms |
   |                       |           |           |                   |
   | 12...13               | Hybrid    | SWB       | 10, 20 ms         |
   |                       |           |           |                   |
   | 14...15               | Hybrid    | FB        | 10, 20 ms         |
   |                       |           |           |                   |
   | 16...19               | CELT-only | NB        | 2.5, 5, 10, 20 ms |
   |                       |           |           |                   |
   | 20...23               | CELT-only | WB        | 2.5, 5, 10, 20 ms |
   |                       |           |           |                   |
   | 24...27               | CELT-only | SWB       | 2.5, 5, 10, 20 ms |
   |                       |           |           |                   |
   | 28...31               | CELT-only | FB        | 2.5, 5, 10, 20 ms |
   +-----------------------+-----------+-----------+-------------------+

                Table 2: TOC Byte Configuration Parameters

   The configuration numbers in each range (e.g., 0...3 for NB SILK-
   only) correspond to the various choices of frame size, in the same
   order.  For example, configuration 0 has a 10 ms frame size and
   configuration 3 has a 60 ms frame size.

   One additional bit, labeled "s", signals mono vs. stereo, with 0
   indicating mono and 1 indicating stereo.

   The remaining two bits of the TOC byte, labeled "c", code the number
   of frames per packet (codes 0 to 3) as follows:

   o  0: 1 frame in the packet

   o  1: 2 frames in the packet, each with equal compressed size

   o  2: 2 frames in the packet, with different compressed sizes

   o  3: an arbitrary number of frames in the packet

   This document refers to a packet as a code 0 packet, code 1 packet,
   etc., based on the value of "c".
Top   ToC   RFC6716 - Page 16

3.2. Frame Packing

This section describes how frames are packed according to each possible value of "c" in the TOC byte.

3.2.1. Frame Length Coding

When a packet contains multiple VBR frames (i.e., code 2 or 3), the compressed length of one or more of these frames is indicated with a one- or two-byte sequence, with the meaning of the first byte as follows: o 0: No frame (Discontinuous Transmission (DTX) or lost packet) o 1...251: Length of the frame in bytes o 252...255: A second byte is needed. The total length is (second_byte*4)+first_byte The special length 0 indicates that no frame is available, either because it was dropped during transmission by some intermediary or because the encoder chose not to transmit it. Any Opus frame in any mode MAY have a length of 0. The maximum representable length is 255*4+255=1275 bytes. For 20 ms frames, this represents a bitrate of 510 kbit/s, which is approximately the highest useful rate for lossily compressed fullband stereo music. Beyond this point, lossless codecs are more appropriate. It is also roughly the maximum useful rate of the MDCT layer as, shortly thereafter, quality no longer improves with additional bits due to limitations on the codebook sizes. No length is transmitted for the last frame in a VBR packet, or for any of the frames in a CBR packet, as it can be inferred from the total size of the packet and the size of all other data in the packet. However, the length of any individual frame MUST NOT exceed 1275 bytes [R2] to allow for repacketization by gateways, conference bridges, or other software.

3.2.2. Code 0: One Frame in the Packet

For code 0 packets, the TOC byte is immediately followed by N-1 bytes of compressed data for a single frame (where N is the size of the packet), as illustrated in Figure 2.
Top   ToC   RFC6716 - Page 17
      0                   1                   2                   3
      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     | config  |s|0|0|                                               |
     +-+-+-+-+-+-+-+-+                                               |
     |                    Compressed frame 1 (N-1 bytes)...          :
     :                                                               |
     |                                                               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                         Figure 2: A Code 0 Packet

3.2.3. Code 1: Two Frames in the Packet, Each with Equal Compressed Size

For code 1 packets, the TOC byte is immediately followed by the (N-1)/2 bytes of compressed data for the first frame, followed by (N-1)/2 bytes of compressed data for the second frame, as illustrated in Figure 3. The number of payload bytes available for compressed data, N-1, MUST be even for all code 1 packets [R3]. 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | config |s|0|1| | +-+-+-+-+-+-+-+-+ : | Compressed frame 1 ((N-1)/2 bytes)... | : +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ : | Compressed frame 2 ((N-1)/2 bytes)... | : +-+-+-+-+-+-+-+-+ | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ Figure 3: A Code 1 Packet

3.2.4. Code 2: Two Frames in the Packet, with Different Compressed Sizes

For code 2 packets, the TOC byte is followed by a one- or two-byte sequence indicating the length of the first frame (marked N1 in Figure 4), followed by N1 bytes of compressed data for the first frame. The remaining N-N1-2 or N-N1-3 bytes are the compressed data for the second frame. This is illustrated in Figure 4. A code 2 packet MUST contain enough bytes to represent a valid length. For example, a 1-byte code 2 packet is always invalid, and a 2-byte code 2 packet whose second byte is in the range 252...255 is also invalid.
Top   ToC   RFC6716 - Page 18
   The length of the first frame, N1, MUST also be no larger than the
   size of the payload remaining after decoding that length for all code
   2 packets [R4].  This makes, for example, a 2-byte code 2 packet with
   a second byte in the range 1...251 invalid as well (the only valid
   2-byte code 2 packet is one where the length of both frames is zero).

      0                   1                   2                   3
      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     | config  |s|1|0| N1 (1-2 bytes):                               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               :
     |               Compressed frame 1 (N1 bytes)...                |
     :                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                               |                               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               |
     |                     Compressed frame 2...                     :
     :                                                               |
     |                                                               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                         Figure 4: A Code 2 Packet

3.2.5. Code 3: A Signaled Number of Frames in the Packet

Code 3 packets signal the number of frames, as well as additional padding, called "Opus padding" to indicate that this padding is added at the Opus layer rather than at the transport layer. Code 3 packets MUST have at least 2 bytes [R6,R7]. The TOC byte is followed by a byte encoding the number of frames in the packet in bits 2 to 7 (marked "M" in Figure 5), with bit 1 indicating whether or not Opus padding is inserted (marked "p" in Figure 5), and bit 0 indicating VBR (marked "v" in Figure 5). M MUST NOT be zero, and the audio duration contained within a packet MUST NOT exceed 120 ms [R5]. This limits the maximum frame count for any frame size to 48 (for 2.5 ms frames), with lower limits for longer frame sizes. Figure 5 illustrates the layout of the frame count byte. 0 0 1 2 3 4 5 6 7 +-+-+-+-+-+-+-+-+ |v|p| M | +-+-+-+-+-+-+-+-+ Figure 5: The frame count byte When Opus padding is used, the number of bytes of padding is encoded in the bytes following the frame count byte. Values from 0...254 indicate that 0...254 bytes of padding are included, in addition to
Top   ToC   RFC6716 - Page 19
   the byte(s) used to indicate the size of the padding.  If the value
   is 255, then the size of the additional padding is 254 bytes, plus
   the padding value encoded in the next byte.  There MUST be at least
   one more byte in the packet in this case [R6,R7].  The additional
   padding bytes appear at the end of the packet and MUST be set to zero
   by the encoder to avoid creating a covert channel.  The decoder MUST
   accept any value for the padding bytes, however.

   Although this encoding provides multiple ways to indicate a given
   number of padding bytes, each uses a different number of bytes to
   indicate the padding size and thus will increase the total packet
   size by a different amount.  For example, to add 255 bytes to a
   packet, set the padding bit, p, to 1, insert a single byte after the
   frame count byte with a value of 254, and append 254 padding bytes
   with the value zero to the end of the packet.  To add 256 bytes to a
   packet, set the padding bit to 1, insert two bytes after the frame
   count byte with the values 255 and 0, respectively, and append 254
   padding bytes with the value zero to the end of the packet.  By using
   the value 255 multiple times, it is possible to create a packet of
   any specific, desired size.  Let P be the number of header bytes used
   to indicate the padding size plus the number of padding bytes
   themselves (i.e., P is the total number of bytes added to the
   packet).  Then, P MUST be no more than N-2 [R6,R7].

   In the CBR case, let R=N-2-P be the number of bytes remaining in the
   packet after subtracting the (optional) padding.  Then, the
   compressed length of each frame in bytes is equal to R/M.  The value
   R MUST be a non-negative integer multiple of M [R6].  The compressed
   data for all M frames follows, each of size R/M bytes, as illustrated
   in Figure 6.
Top   ToC   RFC6716 - Page 20
      0                   1                   2                   3
      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     | config  |s|1|1|0|p|     M     |  Padding length (Optional)    :
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                                                               |
     :               Compressed frame 1 (R/M bytes)...               :
     |                                                               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                                                               |
     :               Compressed frame 2 (R/M bytes)...               :
     |                                                               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                                                               |
     :                              ...                              :
     |                                                               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                                                               |
     :               Compressed frame M (R/M bytes)...               :
     |                                                               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     :                  Opus Padding (Optional)...                   |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                       Figure 6: A CBR Code 3 Packet

   In the VBR case, the (optional) padding length is followed by M-1
   frame lengths (indicated by "N1" to "N[M-1]" in Figure 7), each
   encoded in a one- or two-byte sequence as described above.  The
   packet MUST contain enough data for the M-1 lengths after removing
   the (optional) padding, and the sum of these lengths MUST be no
   larger than the number of bytes remaining in the packet after
   decoding them [R7].  The compressed data for all M frames follows,
   each frame consisting of the indicated number of bytes, with the
   final frame consuming any remaining bytes before the final padding,
   as illustrated in Figure 6.  The number of header bytes (TOC byte,
   frame count byte, padding length bytes, and frame length bytes), plus
   the signaled length of the first M-1 frames themselves, plus the
   signaled length of the padding MUST be no larger than N, the total
   size of the packet.
Top   ToC   RFC6716 - Page 21
      0                   1                   2                   3
      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     | config  |s|1|1|1|p|     M     | Padding length (Optional)     :
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     : N1 (1-2 bytes): N2 (1-2 bytes):     ...       :     N[M-1]    |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                                                               |
     :               Compressed frame 1 (N1 bytes)...                :
     |                                                               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                                                               |
     :               Compressed frame 2 (N2 bytes)...                :
     |                                                               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                                                               |
     :                              ...                              :
     |                                                               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                                                               |
     :                     Compressed frame M...                     :
     |                                                               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     :                  Opus Padding (Optional)...                   |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                       Figure 7: A VBR Code 3 Packet

3.3. Examples

Simplest case, one NB mono 20 ms SILK frame: 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | 1 |0|0|0| compressed data... : +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ Figure 8
Top   ToC   RFC6716 - Page 22
   Two FB mono 5 ms CELT frames of the same compressed size:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   29    |0|0|1|               compressed data...              :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                                 Figure 9

   Two FB mono 20 ms Hybrid frames of different compressed size:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   15    |0|1|1|1|0|     2     |      N1       |               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               |
   |                       compressed data...                      :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                                 Figure 10

   Four FB stereo 20 ms CELT frames of the same compressed size:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   31    |1|1|1|0|0|     4     |      compressed data...       :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                                 Figure 11

3.4. Receiving Malformed Packets

A receiver MUST NOT process packets that violate any of the rules above as normal Opus packets. They are reserved for future applications, such as in-band headers (containing metadata, etc.). Packets that violate these constraints may cause implementations of _this_ specification to treat them as malformed and discard them. These constraints are summarized here for reference: [R1] Packets are at least one byte. [R2] No implicit frame length is larger than 1275 bytes. [R3] Code 1 packets have an odd total length, N, so that (N-1)/2 is an integer.
Top   ToC   RFC6716 - Page 23
   [R4]  Code 2 packets have enough bytes after the TOC for a valid
         frame length, and that length is no larger than the number of
         bytes remaining in the packet.

   [R5]  Code 3 packets contain at least one frame, but no more than
         120 ms of audio total.

   [R6]  The length of a CBR code 3 packet, N, is at least two bytes,
         the number of bytes added to indicate the padding size plus the
         trailing padding bytes themselves, P, is no more than N-2, and
         the frame count, M, satisfies the constraint that (N-2-P) is a
         non-negative integer multiple of M.

   [R7]  VBR code 3 packets are large enough to contain all the header
         bytes (TOC byte, frame count byte, any padding length bytes,
         and any frame length bytes), plus the length of the first M-1
         frames, plus any trailing padding bytes.

4. Opus Decoder

The Opus decoder consists of two main blocks: the SILK decoder and the CELT decoder. At any given time, one or both of the SILK and CELT decoders may be active. The output of the Opus decode is the sum of the outputs from the SILK and CELT decoders with proper sample rate conversion and delay compensation on the SILK side, and optional decimation (when decoding to sample rates less than 48 kHz) on the CELT side, as illustrated in the block diagram below. +---------+ +------------+ | SILK | | Sample | +->| Decoder |--->| Rate |----+ Bit- +---------+ | | | | Conversion | v stream | Range |---+ +---------+ +------------+ /---\ Audio ------->| Decoder | | + |------> | |---+ +---------+ +------------+ \---/ +---------+ | | CELT | | Decimation | ^ +->| Decoder |--->| (Optional) |----+ | | | | +---------+ +------------+

4.1. Range Decoder

Opus uses an entropy coder based on range coding [RANGE-CODING] [MARTIN79], which is itself a rediscovery of the FIFO arithmetic code introduced by [CODING-THESIS]. It is very similar to arithmetic encoding, except that encoding is done with digits in any base
Top   ToC   RFC6716 - Page 24
   instead of with bits, so it is faster when using larger bases (i.e.,
   a byte).  All of the calculations in the range coder must use bit-
   exact integer arithmetic.

   Symbols may also be coded as "raw bits" packed directly into the
   bitstream, bypassing the range coder.  These are packed backwards
   starting at the end of the frame, as illustrated in Figure 12.  This
   reduces complexity and makes the stream more resilient to bit errors,
   as corruption in the raw bits will not desynchronize the decoding
   process, unlike corruption in the input to the range decoder.  Raw
   bits are only used in the CELT layer.

      0                   1                   2                   3
      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     | Range coder data (packed MSB to LSB) ->                       :
     +                                                               +
     :                                                               :
     +     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     :     | <- Boundary occurs at an arbitrary bit position         :
     +-+-+-+                                                         +
     :                          <- Raw bits data (packed LSB to MSB) |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

     Legend:

     LSB = Least Significant Bit
     MSB = Most Significant Bit

          Figure 12: Illustrative Example of Packing Range Coder
                             and Raw Bits Data

   Each symbol coded by the range coder is drawn from a finite alphabet
   and coded in a separate "context", which describes the size of the
   alphabet and the relative frequency of each symbol in that alphabet.

   Suppose there is a context with n symbols, identified with an index
   that ranges from 0 to n-1.  The parameters needed to encode or decode
   symbol k in this context are represented by a three-tuple
   (fl[k], fh[k], ft), all 16-bit unsigned integers, with
   0 <= fl[k] < fh[k] <= ft <= 65535.  The values of this tuple are
   derived from the probability model for the symbol, represented by
   traditional "frequency counts".  Because Opus uses static contexts,
   those are not updated as symbols are decoded.  Let f[i] be the
   frequency of symbol i.  Then, the three-tuple corresponding to symbol
   k is given by the following:
Top   ToC   RFC6716 - Page 25
                   k-1                                   n-1
                   __                                    __
           fl[k] = \  f[i],  fh[k] = fl[k] + f[k],  ft = \  f[i]
                   /_                                    /_
                   i=0                                   i=0

   The range decoder extracts the symbols and integers encoded using the
   range encoder in Section 5.1.  The range decoder maintains an
   internal state vector composed of the two-tuple (val, rng), where val
   represents the difference between the high end of the current range
   and the actual coded value, minus one, and rng represents the size of
   the current range.  Both val and rng are 32-bit unsigned integer
   values.

4.1.1. Range Decoder Initialization

Let b0 be an 8-bit unsigned integer containing first input byte (or containing zero if there are no bytes in this Opus frame). The decoder initializes rng to 128 and initializes val to (127 - (b0>>1)), where (b0>>1) is the top 7 bits of the first input byte. It saves the remaining bit, (b0&1), for use in the renormalization procedure described in Section 4.1.2.1, which the decoder invokes immediately after initialization to read additional bits and establish the invariant that rng > 2**23.

4.1.2. Decoding Symbols

Decoding a symbol is a two-step process. The first step determines a 16-bit unsigned value fs, which lies within the range of some symbol in the current context. The second step updates the range decoder state with the three-tuple (fl[k], fh[k], ft) corresponding to that symbol. The first step is implemented by ec_decode() (entdec.c), which computes val fs = ft - min(------ + 1, ft) rng/ft The divisions here are integer division. The decoder then identifies the symbol in the current context corresponding to fs; i.e., the value of k whose three-tuple (fl[k], fh[k], ft) satisfies fl[k] <= fs < fh[k]. It uses this tuple to update val according to
Top   ToC   RFC6716 - Page 26
                                   rng
                       val = val - --- * (ft - fh[k])
                                   ft

   If fl[k] is greater than zero, then the decoder updates rng using

                              rng
                        rng = --- * (fh[k] - fl[k])
                              ft

   Otherwise, it updates rng using

                                   rng
                       rng = rng - --- * (ft - fh[k])
                                   ft

   Using a special case for the first symbol (rather than the last
   symbol, as is commonly done in other arithmetic coders) ensures that
   all the truncation error from the finite precision arithmetic
   accumulates in symbol 0.  This makes the cost of coding a 0 slightly
   smaller, on average, than its estimated probability indicates and
   makes the cost of coding any other symbol slightly larger.  When
   contexts are designed so that 0 is the most probable symbol, which is
   often the case, this strategy minimizes the inefficiency introduced
   by the finite precision.  It also makes some of the special-case
   decoding routines in Section 4.1.3 particularly simple.

   After the updates, implemented by ec_dec_update() (entdec.c), the
   decoder normalizes the range using the procedure in the next section,
   and returns the index k.

4.1.2.1. Renormalization
To normalize the range, the decoder repeats the following process, implemented by ec_dec_normalize() (entdec.c), until rng > 2**23. If rng is already greater than 2**23, the entire process is skipped. First, it sets rng to (rng<<8). Then, it reads the next byte of the Opus frame and forms an 8-bit value sym, using the leftover bit buffered from the previous byte as the high bit and the top 7 bits of the byte just read as the other 7 bits of sym. The remaining bit in the byte just read is buffered for use in the next iteration. If no more input bytes remain, it uses zero bits instead. See Section 4.1.1 for the initialization used to process the first byte. Then, it sets val = ((val<<8) + (255-sym)) & 0x7FFFFFFF
Top   ToC   RFC6716 - Page 27
   It is normal and expected that the range decoder will read several
   bytes into the data of the raw bits (if any) at the end of the frame
   by the time the frame is completely decoded, as illustrated in
   Figure 13.  This same data MUST also be returned as raw bits when
   requested.  The encoder is expected to terminate the stream in such a
   way that the range decoder will decode the intended values regardless
   of the data contained in the raw bits.  Section 5.1.5 describes a
   procedure for doing this.  If the range decoder consumes all of the
   bytes belonging to the current frame, it MUST continue to use zero
   when any further input bytes are required, even if there is
   additional data in the current packet from padding or other frames.

      n              n+1             n+2             n+3
      0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     :     | <----------- Overlap region ------------> |             :
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           ^                                           ^
           |   End of data buffered by the range coder |
     ...-----------------------------------------------+
           |
           | End of data consumed by raw bits
           +-------------------------------------------------------...

          Figure 13: Illustrative Example of Raw Bits Overlapping
                             Range Coder Data

4.1.3. Alternate Decoding Methods

The reference implementation uses three additional decoding methods that are exactly equivalent to the above but make assumptions and simplifications that allow for a more efficient implementation.
4.1.3.1. ec_decode_bin()
The first is ec_decode_bin() (entdec.c), defined using the parameter ftb instead of ft. It is mathematically equivalent to calling ec_decode() with ft = (1<<ftb), but it avoids one of the divisions.
4.1.3.2. ec_dec_bit_logp()
The next is ec_dec_bit_logp() (entdec.c), which decodes a single binary symbol, replacing both the ec_decode() and ec_dec_update() steps. The context is described by a single parameter, logp, which is the absolute value of the base-2 logarithm of the probability of a "1". It is mathematically equivalent to calling ec_decode() with ft = (1<<logp), followed by ec_dec_update() with the 3-tuple (fl[k] = 0, fh[k] = (1<<logp) - 1, ft = (1<<logp)) if the returned
Top   ToC   RFC6716 - Page 28
   value of fs is less than (1<<logp) - 1 (a "0" was decoded), and with
   (fl[k] = (1<<logp) - 1, fh[k] = ft = (1<<logp)) otherwise (a "1" was
   decoded).  The implementation requires no multiplications or
   divisions.

4.1.3.3. ec_dec_icdf()
The last is ec_dec_icdf() (entdec.c), which decodes a single symbol with a table-based context of up to 8 bits, also replacing both the ec_decode() and ec_dec_update() steps, as well as the search for the decoded symbol in between. The context is described by two parameters, an icdf ("inverse" cumulative distribution function) table and ftb. As with ec_decode_bin(), (1<<ftb) is equivalent to ft. idcf[k], on the other hand, stores (1<<ftb)-fh[k], which is equal to (1<<ftb) - fl[k+1]. fl[0] is assumed to be 0, and the table is terminated by a value of 0 (where fh[k] == ft). The function is mathematically equivalent to calling ec_decode() with ft = (1<<ftb), using the returned value fs to search the table for the first entry where fs < (1<<ftb)-icdf[k], and calling ec_dec_update() with fl[k] = (1<<ftb) - icdf[k-1] (or 0 if k == 0), fh[k] = (1<<ftb) - idcf[k], and ft = (1<<ftb). Combining the search with the update allows the division to be replaced by a series of multiplications (which are usually much cheaper), and using an inverse CDF allows the use of an ftb as large as 8 in an 8-bit table without any special cases. This is the primary interface with the range decoder in the SILK layer, though it is used in a few places in the CELT layer as well. Although icdf[k] is more convenient for the code, the frequency counts, f[k], are a more natural representation of the probability distribution function (PDF) for a given symbol. Therefore, this document lists the latter, not the former, when describing the context in which a symbol is coded as a list, e.g., {4, 4, 4, 4}/16 for a uniform context with four possible values and ft = 16. The value of ft after the slash is always the sum of the entries in the PDF, but is included for convenience. Contexts with identical probabilities, f[k]/ft, but different values of ft (or equivalently, ftb) are not the same, and cannot, in general, be used in place of one another. An icdf table is also not capable of representing a PDF where the first symbol has 0 probability. In such contexts, ec_dec_icdf() can decode the symbol by using a table that drops the entries for any initial zero-probability values and by adding the constant offset of the first value with a non-zero probability to its return value.
Top   ToC   RFC6716 - Page 29

4.1.4. Decoding Raw Bits

The raw bits used by the CELT layer are packed at the end of the frame, with the least significant bit of the first value packed in the least significant bit of the last byte, filling up to the most significant bit in the last byte, continuing on to the least significant bit of the penultimate byte, and so on. The reference implementation reads them using ec_dec_bits() (entdec.c). Because the range decoder must read several bytes ahead in the stream, as described in Section 4.1.2.1, the input consumed by the raw bits may overlap with the input consumed by the range coder, and a decoder MUST allow this. The format should render it impossible to attempt to read more raw bits than there are actual bits in the frame, though a decoder may wish to check for this and report an error.

4.1.5. Decoding Uniformly Distributed Integers

The function ec_dec_uint() (entdec.c) decodes one of ft equiprobable values in the range 0 to (ft - 1), inclusive, each with a frequency of 1, where ft may be as large as (2**32 - 1). Because ec_decode() is limited to a total frequency of (2**16 - 1), it splits up the value into a range coded symbol representing up to 8 of the high bits, and, if necessary, raw bits representing the remainder of the value. The limit of 8 bits in the range coded symbol is a trade-off between implementation complexity, modeling error (since the symbols no longer truly have equal coding cost), and rounding error introduced by the range coder itself (which gets larger as more bits are included). Using raw bits reduces the maximum number of divisions required in the worst case, but means that it may be possible to decode a value outside the range 0 to (ft - 1), inclusive. ec_dec_uint() takes a single, positive parameter, ft, which is not necessarily a power of two, and returns an integer, t, whose value lies between 0 and (ft - 1), inclusive. Let ftb = ilog(ft - 1), i.e., the number of bits required to store (ft - 1) in two's complement notation. If ftb is 8 or less, then t is decoded with t = ec_decode(ft), and the range coder state is updated using the three-tuple (t, t + 1, ft). If ftb is greater than 8, then the top 8 bits of t are decoded using t = ec_decode(((ft - 1) >> (ftb - 8)) + 1) the decoder state is updated using the three-tuple (t, t + 1, ((ft - 1) >> (ftb - 8)) + 1), and the remaining bits are decoded as raw bits, setting
Top   ToC   RFC6716 - Page 30
                t = (t << (ftb - 8)) | ec_dec_bits(ftb - 8)

   If, at this point, t >= ft, then the current frame is corrupt.  In
   that case, the decoder should assume there has been an error in the
   coding, decoding, or transmission and SHOULD take measures to conceal
   the error (e.g., saturate to ft-1 or use the Packet Loss Concealment
   (PLC)) and/or report to the application that the error has occurred.

4.1.6. Current Bit Usage

The bit allocation routines in the CELT decoder need a conservative upper bound on the number of bits that have been used from the current frame thus far, including both range coder bits and raw bits. This drives allocation decisions that must match those made in the encoder. The upper bound is computed in the reference implementation to whole-bit precision by the function ec_tell() (entcode.h) and to fractional 1/8th bit precision by the function ec_tell_frac() (entcode.c). Like all operations in the range coder, it must be implemented in a bit-exact manner, and it must produce exactly the same value returned by the same functions in the encoder after encoding the same symbols. ec_tell() is guaranteed to return ceil(ec_tell_frac()/8.0). In various places, the codec will check to ensure there is enough room to contain a symbol before attempting to decode it. In practice, although the number of bits used so far is an upper bound, decoding a symbol whose probability model suggests it has a worst-case cost of p 1/8th bits may actually advance the return value of ec_tell_frac() by p-1, p, or p+1 1/8th bits, due to approximation error in that upper bound, truncation error in the range coder, and for large values of ft, modeling error in ec_dec_uint(). However, this error is bounded, and periodic calls to ec_tell() or ec_tell_frac() at precisely defined points in the decoding process prevent it from accumulating. For a range coder symbol that requires a whole number of bits (i.e., for which ft/(fh[k] - fl[k]) is a power of two), where there are at least p 1/8th bits available, decoding the symbol will never cause ec_tell() or ec_tell_frac() to exceed the size of the frame ("bust the budget"). In this case, the return value of ec_tell_frac() will only advance by more than p 1/8th bits if there were an additional, fractional number of bits remaining, and it will never advance beyond the next whole-bit boundary, which is safe, since frames always contain a whole number of bits. However, when p is not a whole number of bits, an extra 1/8th bit is required to ensure that decoding the symbol will not bust the budget.
Top   ToC   RFC6716 - Page 31
   The reference implementation keeps track of the total number of whole
   bits that have been processed by the decoder so far in the variable
   nbits_total, including the (possibly fractional) number of bits that
   are currently buffered, but not consumed, inside the range coder.
   nbits_total is initialized to 9 just before the initial range
   renormalization process completes (or equivalently, it can be
   initialized to 33 after the first renormalization).  The extra two
   bits over the actual amount buffered by the range coder guarantees
   that it is an upper bound and that there is enough room for the
   encoder to terminate the stream.  Each iteration through the range
   coder's renormalization loop increases nbits_total by 8.  Reading raw
   bits increases nbits_total by the number of raw bits read.

4.1.6.1. ec_tell()
The whole number of bits buffered in rng may be estimated via lg = ilog(rng). ec_tell() then becomes a simple matter of removing these bits from the total. It returns (nbits_total - lg). In a newly initialized decoder, before any symbols have been read, this reports that 1 bit has been used. This is the bit reserved for termination of the encoder.
4.1.6.2. ec_tell_frac()
ec_tell_frac() estimates the number of bits buffered in rng to fractional precision. Since rng must be greater than 2**23 after renormalization, lg must be at least 24. Let r_Q15 = rng >> (lg-16) so that 32768 <= r_Q15 < 65536, an unsigned Q15 value representing the fractional part of rng. Then, the following procedure can be used to add one bit of precision to lg. First, update r_Q15 = (r_Q15*r_Q15) >> 15 Then, add the 16th bit of r_Q15 to lg via lg = 2*lg + (r_Q15 >> 16) Finally, if this bit was a 1, reduce r_Q15 by a factor of two via r_Q15 = r_Q15 >> 1 so that it once again lies in the range 32768 <= r_Q15 < 65536. This procedure is repeated three times to extend lg to 1/8th bit precision. ec_tell_frac() then returns (nbits_total*8 - lg).


(next page on part 3)

Next Section