RFC 6716

Definition of the Opus Audio Codec

Pages: 326
Proposed Standard
Updated by: 8251

4.3.  CELT Decoder

   The CELT layer of Opus is based on the Modified Discrete Cosine
   Transform [MDCT] with partially overlapping windows of 5 to 22.5 ms.
   The main principle behind CELT is that the MDCT spectrum is divided
   into bands that (roughly) follow the Bark scale, i.e., the scale of
   the ear's critical bands [ZWICKER61].  The normal CELT layer uses 21
   of those bands, though Opus Custom (see Section 6.2) may use a
   different number of bands.  In Hybrid mode, the first 17 bands (up to
   8 kHz) are not coded.  A band can contain as few as one MDCT bin
   per channel, and as many as 176 bins per channel, as detailed in
   Table 55.  In each band, the gain (energy) is coded separately from
   the shape of the spectrum.  Coding the gain explicitly makes it easy
   to preserve the spectral envelope of the signal.  The remaining
   unit-norm shape vector is encoded using a Pyramid Vector Quantizer
   (PVQ), as described in Section 4.3.4.

   +--------+--------+------+-------+-------+-------------+------------+
   | Frame  | 2.5 ms | 5 ms | 10 ms | 20 ms |       Start |       Stop |
   | Size:  |        |      |       |       |   Frequency |  Frequency |
   +--------+--------+------+-------+-------+-------------+------------+
   | Band   |  Bins: |      |       |       |             |            |
   |        |        |      |       |       |             |            |
   | 0      |      1 |    2 |     4 |     8 |        0 Hz |     200 Hz |
   |        |        |      |       |       |             |            |
   | 1      |      1 |    2 |     4 |     8 |      200 Hz |     400 Hz |
   |        |        |      |       |       |             |            |
   | 2      |      1 |    2 |     4 |     8 |      400 Hz |     600 Hz |
   |        |        |      |       |       |             |            |
   | 3      |      1 |    2 |     4 |     8 |      600 Hz |     800 Hz |
   |        |        |      |       |       |             |            |
   | 4      |      1 |    2 |     4 |     8 |      800 Hz |    1000 Hz |
   |        |        |      |       |       |             |            |
   | 5      |      1 |    2 |     4 |     8 |     1000 Hz |    1200 Hz |
   |        |        |      |       |       |             |            |
   | 6      |      1 |    2 |     4 |     8 |     1200 Hz |    1400 Hz |
   |        |        |      |       |       |             |            |
   | 7      |      1 |    2 |     4 |     8 |     1400 Hz |    1600 Hz |
   |        |        |      |       |       |             |            |
   | 8      |      2 |    4 |     8 |    16 |     1600 Hz |    2000 Hz |
   |        |        |      |       |       |             |            |
   | 9      |      2 |    4 |     8 |    16 |     2000 Hz |    2400 Hz |
   |        |        |      |       |       |             |            |
   | 10     |      2 |    4 |     8 |    16 |     2400 Hz |    2800 Hz |
   |        |        |      |       |       |             |            |
   | 11     |      2 |    4 |     8 |    16 |     2800 Hz |    3200 Hz |
   |        |        |      |       |       |             |            |
   | 12     |      4 |    8 |    16 |    32 |     3200 Hz |    4000 Hz |
   |        |        |      |       |       |             |            |
   | 13     |      4 |    8 |    16 |    32 |     4000 Hz |    4800 Hz |
   |        |        |      |       |       |             |            |
   | 14     |      4 |    8 |    16 |    32 |     4800 Hz |    5600 Hz |
   |        |        |      |       |       |             |            |
   | 15     |      6 |   12 |    24 |    48 |     5600 Hz |    6800 Hz |
   |        |        |      |       |       |             |            |
   | 16     |      6 |   12 |    24 |    48 |     6800 Hz |    8000 Hz |
   |        |        |      |       |       |             |            |
   | 17     |      8 |   16 |    32 |    64 |     8000 Hz |    9600 Hz |
   |        |        |      |       |       |             |            |
   | 18     |     12 |   24 |    48 |    96 |     9600 Hz |   12000 Hz |
   |        |        |      |       |       |             |            |
   | 19     |     18 |   36 |    72 |   144 |    12000 Hz |   15600 Hz |
   |        |        |      |       |       |             |            |
   | 20     |     22 |   44 |    88 |   176 |    15600 Hz |   20000 Hz |
   +--------+--------+------+-------+-------+-------------+------------+

       Table 55: MDCT Bins per Channel per Band for Each Frame Size
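
   The layout in Table 55 can be reproduced from a single list of band
   edges.  The following C sketch is illustrative only: the names
   band_edges, band_bins, and band_start_hz are hypothetical, and the
   edge values are simply read off the table, expressed in units of one
   2.5 ms bin (200 Hz).  LM is assumed to be 0, 1, 2, or 3 for 2.5, 5,
   10, and 20 ms frames.

```c
/* Band edges in units of one 2.5 ms MDCT bin (200 Hz), read off Table 55.
 * The name band_edges is hypothetical. */
static const int band_edges[22] = {
    0, 1, 2, 3, 4, 5, 6, 7, 8, 10, 12, 14, 16,
    20, 24, 28, 34, 40, 48, 60, 78, 100
};

/* Bins per channel in `band` (0..20) for frame-size shift LM (0..3). */
static int band_bins(int band, int LM)
{
    return (band_edges[band + 1] - band_edges[band]) << LM;
}

/* Start frequency of `band` in Hz. */
static int band_start_hz(int band)
{
    return band_edges[band] * 200;
}
```

   For example, band_bins(20, 3) gives the 176 bins listed for band 20
   at 20 ms, and band_start_hz(17) gives the 8000 Hz lower edge of the
   first band coded in Hybrid mode.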

   Transients are notoriously difficult for transform codecs to code.
   CELT uses two different strategies for them:

   1.  Using multiple smaller MDCTs instead of a single large MDCT, and

   2.  Dynamic time-frequency resolution changes (see Section 4.3.4.5).

   To improve quality on highly tonal and periodic signals, CELT
   includes a pre-filter/post-filter combination.  The pre-filter on the
   encoder side attenuates the signal's harmonics.  The post-filter on
   the decoder side restores the original gain of the harmonics, while
   shaping the coding noise to roughly follow the harmonics.  Such noise
   shaping reduces the perception of the noise.

   When coding a stereo signal, three coding methods are available:

   o  mid-side stereo: encodes the mean and the difference of the left
      and right channels,

   o  intensity stereo: only encodes the mean of the left and right
      channels (discards the difference),

   o  dual stereo: encodes the left and right channels separately.

   An overview of the decoder is given in Figure 17.

                       +---------+
                       | Coarse  |
                    +->| decoder |----+
                    |  +---------+    |
                    |                 |
                    |  +---------+    v
                    |  |  Fine   |  +---+
                    +->| decoder |->| + |
                    |  +---------+  +---+
                    |       ^         |
        +---------+ |       |         |
        |  Range  | | +----------+    v
        | Decoder |-+ |   Bit    | +------+
        +---------+ | |Allocation| | 2**x |
                    | +----------+ +------+
                    |       |         |
                    |       v         v               +--------+
                    |  +---------+  +---+  +-------+  | pitch  |
                    +->|   PVQ   |->| * |->| IMDCT |->| post-  |--->
                    |  | decoder |  +---+  +-------+  | filter |
                    |  +---------+                    +--------+
                    |                                      ^
                    +--------------------------------------+

        Legend: IMDCT = Inverse MDCT

                 Figure 17: Structure of the CELT decoder

   The decoder is based on the following symbols and sets of symbols:

          +---------------+---------------------+---------------+
          |   Symbol(s)   |         PDF         |   Condition   |
          +---------------+---------------------+---------------+
          |    silence    |   {32767, 1}/32768  |               |
          |               |                     |               |
          |  post-filter  |       {1, 1}/2      |               |
          |               |                     |               |
          |     octave    |     uniform (6)     |  post-filter  |
          |               |                     |               |
          |     period    | raw bits (4+octave) |  post-filter  |
          |               |                     |               |
          |      gain     |     raw bits (3)    |  post-filter  |
          |               |                     |               |
          |     tapset    |     {2, 1, 1}/4     |  post-filter  |
          |               |                     |               |
          |   transient   |       {7, 1}/8      |               |
          |               |                     |               |
          |     intra     |       {7, 1}/8      |               |
          |               |                     |               |
          | coarse energy |    Section 4.3.2    |               |
          |               |                     |               |
          |   tf_change   |    Section 4.3.1    |               |
          |               |                     |               |
          |   tf_select   |       {1, 1}/2      | Section 4.3.1 |
          |               |                     |               |
          |     spread    |   {7, 2, 21, 2}/32  |               |
          |               |                     |               |
          |  dyn. alloc.  |    Section 4.3.3    |               |
          |               |                     |               |
          |  alloc. trim  |       Table 58      |               |
          |               |                     |               |
          |      skip     |       {1, 1}/2      | Section 4.3.3 |
          |               |                     |               |
          |   intensity   |       uniform       | Section 4.3.3 |
          |               |                     |               |
          |      dual     |       {1, 1}/2      |               |
          |               |                     |               |
          |  fine energy  |    Section 4.3.2    |               |
          |               |                     |               |
          |    residual   |    Section 4.3.4    |               |
          |               |                     |               |
          | anti-collapse |       {1, 1}/2      | Section 4.3.5 |
          |               |                     |               |
          |    finalize   |    Section 4.3.2    |               |
          +---------------+---------------------+---------------+

    Table 56: Order of the Symbols in the CELT Section of the Bitstream

   The decoder extracts information from the range-coded bitstream in
   the order described in Table 56.  In some circumstances, it is
   possible for a decoded value to be out of range due to a very small
   amount of redundancy in the encoding of large integers by the range
   coder.  In that case, the decoder should assume there has been an
   error in the coding, decoding, or transmission and SHOULD take
   measures to conceal the error and/or report to the application that a
   problem has occurred.  Such out-of-range errors cannot occur in the
   SILK layer.

4.3.1.  Transient Decoding

   The "transient" flag indicates whether the frame uses a single long
   MDCT or several short MDCTs.  When it is set, the MDCT coefficients
   represent multiple short MDCTs in the frame.  When it is not set, the
   coefficients represent a single long MDCT for the frame.  The flag is
   encoded in the bitstream with a probability of 1/8.  In addition to
   the global transient flag, there is a per-band binary flag
   (tf_change) that changes the time-frequency (tf) resolution
   independently in each band.  The change in tf resolution is defined
   in tf_select_table[][] in celt.c and depends on the frame size,
   whether the transient flag is set, and the value of tf_select.  The
   tf_select flag uses a 1/2 probability, but it is only decoded if,
   given the values of all per-band tf_change flags, it can affect the
   result.

4.3.2.  Energy Envelope Decoding

   It is important to quantize the energy with sufficient resolution
   because any energy quantization error cannot be compensated for at a
   later stage.  Regardless of the resolution used for encoding the
   spectral shape of a band, it is perceptually important to preserve
   the energy in each band.  CELT uses a three-step coarse-fine-fine
   strategy for encoding the energy in the base-2 log domain, as
   implemented in quant_bands.c.

4.3.2.1.  Coarse Energy Decoding

   Coarse quantization of the energy uses a fixed resolution of 6 dB
   (integer part of base-2 log).  To minimize the bitrate, prediction is
   applied both in time (using the previous frame) and in frequency
   (using the previous bands).  The part of the prediction that is based
   on the previous frame can be disabled, creating an "intra" frame
   where the energy is coded without reference to prior frames.  The
   decoder first reads the intra flag to determine what prediction is
   used.  The 2-D z-transform [Z-TRANSFORM] of the prediction filter is

                                            -1          -1
                              (1 - alpha*z_l  )*(1 - z_b  )
                A(z_l, z_b) = -----------------------------
                                                 -1
                                     1 - beta*z_b

   where b is the band index and l is the frame index.  When intra
   energy is not used, the prediction coefficients depend on the frame
   size in use; when intra energy is used, they are alpha=0 and
   beta=4915/32768.  The time-domain prediction is based on the final
   fine quantization of the previous frame, while the frequency-domain
   prediction (within the current frame) is based on coarse quantization
   only (because the fine quantization has not been computed yet).  The
   prediction is clamped internally so that fixed-point implementations
   with limited dynamic range always remain in the same state as
   floating-point implementations.  We approximate the ideal probability
   distribution of the prediction error using a Laplace distribution
   with separate parameters for each frame size in intra- and inter-
   frame modes.  These parameters are held in the e_prob_model table in
   quant_bands.c.  The coarse energy decoding is performed by
   unquant_coarse_energy() (quant_bands.c).  The decoding of the
   Laplace-distributed values is implemented in ec_laplace_decode()
   (laplace.c).
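
   To make the role of A(z_l, z_b) concrete, the sketch below inverts
   the filter to rebuild band energies from already-decoded prediction
   residuals.  It is a floating-point illustration under stated
   assumptions, not the reference code: the function and parameter
   names are hypothetical, and the clamping, fixed-point scaling, and
   Laplace decoding performed by unquant_coarse_energy() are omitted.

```c
/* Reconstruct coarse band energies (base-2 log domain) from decoded
 * prediction residuals by inverting A(z_l, z_b).
 *   res[b]  : decoded residual for band b (hypothetical input)
 *   prev[b] : final energy of band b in the previous frame
 *   out[b]  : reconstructed energy for the current frame
 * Clamping and fixed-point details of the reference are omitted. */
static void coarse_energy_reconstruct(const float *res, const float *prev,
                                      float *out, int nbands,
                                      float alpha, float beta)
{
    float freq_pred = 0.0f;   /* running inter-band prediction state */
    int b;
    for (b = 0; b < nbands; b++) {
        /* Time prediction (alpha) plus frequency prediction plus residual. */
        out[b] = alpha * prev[b] + freq_pred + res[b];
        /* (1 - beta*z_b^-1)/(1 - z_b^-1): accumulate (1 - beta)*res. */
        freq_pred += (1.0f - beta) * res[b];
    }
}
```

   With alpha set to zero, the previous frame has no influence on the
   output, which corresponds to the "intra" case described above.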

4.3.2.2.  Fine Energy Quantization

   The number of bits assigned to fine energy quantization in each band
   is determined by the bit allocation computation described in
   Section 4.3.3.  Let B_i be the number of fine energy bits for band i;
   the refinement is an integer f in the range [0,2**B_i-1].  The
   mapping between f and the correction applied to the coarse energy is
   equal to (f+1/2)/2**B_i - 1/2.  Fine energy quantization is
   implemented in quant_fine_energy() (quant_bands.c).
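
   The mapping above can be written as a one-line helper.  This is a
   minimal sketch; the function name is hypothetical, and the result is
   expressed in units of the 6 dB coarse step.

```c
/* Correction (in units of the 6 dB coarse step) for refinement index f,
 * where 0 <= f < 2**B and B is the number of fine bits in the band. */
static double fine_energy_offset(unsigned f, unsigned B)
{
    return (f + 0.5) / (double)(1u << B) - 0.5;
}
```

   With one fine bit, the two possible corrections are -1/4 and +1/4 of
   a coarse step, so each added bit halves the remaining uncertainty.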

   When some bits are left "unused" after all other flags have been
   decoded, these bits are assigned to a "final" step of fine
   allocation.  In effect, these bits are used to add one extra fine
   energy bit per band per channel.  The allocation process determines
   two "priorities" for the final fine bits.  Any remaining bits are
   first assigned only to bands of priority 0, starting from band 0 and
   going up.  If all bands of priority 0 have received one bit per
   channel, then bands of priority 1 are assigned an extra bit per
   channel, starting from band 0.  If any bits are left after this, they
   are left unused.  This is implemented in unquant_energy_finalise()
   (quant_bands.c).
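
   The two-pass priority scheme can be sketched as follows.  The
   function and array names are hypothetical, and the sketch only
   decides which bands receive the extra bit: it does not read the bits
   from the bitstream or apply the energy correction, and it omits any
   per-band limit on fine bits that the reference may enforce.

```c
/* Hand out leftover bits as one extra fine-energy bit per channel,
 * first to priority-0 bands, then to priority-1 bands, each pass
 * scanning upward from band `start`.  extra[b] is set to 1 for bands
 * that receive the extra bit.  Returns the number of bits left unused. */
static int assign_final_fine_bits(const int *priority, int *extra,
                                  int start, int end,
                                  int channels, int bits_left)
{
    int prio, b;
    for (b = start; b < end; b++)
        extra[b] = 0;
    for (prio = 0; prio < 2; prio++) {
        for (b = start; b < end && bits_left >= channels; b++) {
            if (priority[b] != prio)
                continue;
            extra[b] = 1;
            bits_left -= channels;
        }
    }
    return bits_left;
}
```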

4.3.3.  Bit Allocation

   Because the bit allocation drives the decoding of the range-coder
   stream, it MUST be recovered exactly so that identical coding
   decisions are made in the encoder and decoder.  Any deviation from
   the reference's resulting bit allocation will result in corrupted
   output, though implementers are free to implement the procedure in
   any way that produces identical results.

   The per-band gain-shape structure of the CELT layer ensures that
   using the same number of bits for the spectral shape of a band in
   every frame will result in a roughly constant signal-to-noise ratio
   in that band.  This results in coding noise that has the same
   spectral envelope as the signal.  The masking curve produced by a
   standard psychoacoustic model also closely follows the spectral
   envelope of the signal.  This structure means that the ideal
   allocation is more consistent from frame to frame than it is for
   other codecs without an equivalent structure and that a fixed
   allocation provides fairly consistent perceptual
   performance [VALIN2010].

   Many codecs transmit significant amounts of side information to
   control the bit allocation within a frame.  Often this control is
   only indirect, and it must be exercised carefully to achieve the
   desired rate constraints.  The CELT layer, however, can adapt over a
   very wide range of rates, so it has a large number of codebook sizes
   to choose from for each band.  Explicitly signaling the size of each
   of these codebooks would impose considerable overhead, even though
   the allocation is relatively static from frame to frame.  This is
   because all of the information required to compute these codebook
   sizes must be derived from a single frame by itself, in order to
   retain robustness to packet loss, so the signaling cannot take
   advantage of knowledge of the allocation in neighboring frames.  This
   problem is exacerbated in low-latency (small frame size)
   applications, which would include this overhead in every frame.

   For this reason, in the MDCT mode, Opus uses a primarily implicit bit
   allocation.  The available bitstream capacity is known in advance to
   both the encoder and decoder without additional signaling, ultimately
   from the packet sizes expressed by a higher-level protocol.  Using
   this information, the codec interpolates an allocation from a hard-
   coded table.

   While the band-energy structure effectively models intra-band
   masking, it ignores the weaker inter-band masking, band-temporal
   masking, and other less significant perceptual effects.  While these
   effects can often be ignored, they can become significant for
   particular samples.  One mechanism available to encoders would be to
   simply increase the overall rate for these frames, but this is not
   possible in a constant rate mode and can be fairly inefficient.  As a
   result, three explicitly signaled mechanisms are provided to alter the
   implicit allocation:

   o  Band boost

   o  Allocation trim

   o  Band skipping

   The first of these mechanisms, band boost, allows an encoder to boost
   the allocation in specific bands.  The second, allocation trim, works
   by biasing the overall allocation towards higher or lower frequency
   bands.  The third, band skipping, selects which low-precision high
   frequency bands will be allocated no shape bits at all.

   In stereo mode, there are two additional parameters potentially coded
   as part of the allocation procedure: a parameter to allow the
   selective elimination of allocation for the 'side' (i.e., intensity
   stereo) in jointly coded bands, and a flag to deactivate joint coding
   (i.e., dual stereo).  These values are not signaled if they would be
   meaningless in the overall context of the allocation.

   Because every signaled adjustment increases overhead and
   implementation complexity, none were included speculatively: the
   reference encoder makes use of all of these mechanisms.  While the
   decision logic in the reference was found to be effective enough to
   justify the overhead and complexity, further analysis techniques may
   be discovered that increase the effectiveness of these parameters.
   As with other signaled parameters, an encoder is free to choose the
   values in any manner, but, unless a technique is known to deliver
   superior perceptual results, the methods used by the reference
   implementation should be used.

   The allocation process consists of the following steps: determining
   the per-band maximum allocation vector, decoding the boosts, decoding
   the tilt, determining the remaining capacity of the frame, searching
   the mode table for the entry nearest but not exceeding the available
   space (subject to the tilt, boosts, band maximums, and band
   minimums), linear interpolation, reallocation of unused bits with
   concurrent skip decoding, determination of the fine-energy vs. shape
   split, and final reallocation.  This process results in a per-band
   shape allocation (in 1/8th-bit units), a per-band fine-energy
   allocation (in 1 bit per channel units), a set of band priorities for
   controlling the use of remaining bits at the end of the frame, and a
   remaining balance of unallocated space, which is usually zero except
   at very high rates.

   The "static" bit allocation (in 1/8 bits) for a quality q, excluding
   the minimums, maximums, tilt and boosts, is equal to
   channels*N*alloc[band][q]<<LM>>2, where alloc[][] is given in
   Table 57 and LM=log2(frame_size/120).  The allocation is obtained by
   linearly interpolating between two values of q (in steps of 1/64) to
   find the highest allocation that does not exceed the number of bits
   remaining.
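
   As a worked illustration of the static allocation formula, the
   sketch below evaluates it for a single band at an integer quality q.
   The function names are hypothetical.  It assumes that N is the band
   width in bins for the shortest frame size (so that N<<LM is the
   actual number of bins), which is inferred from the shift in the
   formula rather than stated in the text.  Interpolation between
   adjacent values of q is not shown.

```c
/* Frame-size shift: LM = log2(frame_size/120), with frame_size in
 * samples (120, 240, 480, or 960). */
static int frame_shift(int frame_size)
{
    int LM = 0;
    while ((120 << LM) < frame_size)
        LM++;
    return LM;
}

/* Static allocation in 1/8 bit units for one band at integer quality q.
 * N is the band width in bins for the shortest frame size (assumed),
 * and alloc_q is alloc[band][q] from Table 57 (1/32 bit per bin). */
static int static_alloc_8th(int channels, int N, int alloc_q, int LM)
{
    return ((channels * N * alloc_q) << LM) >> 2;
}
```

   For a mono 20 ms frame (LM=3), band 0 at q=1 has alloc_q=90, giving
   90*8/4 = 180 eighth-bits, i.e., 22.5 bits.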

    Rows indicate the MDCT bands, columns are the different quality (q)
             parameters.  The units are 1/32 bit per MDCT bin.

     +---+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
     | 0 |  1 |   2 |   3 |   4 |   5 |   6 |   7 |   8 |   9 |  10 |
     +---+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
     | 0 | 90 | 110 | 118 | 126 | 134 | 144 | 152 | 162 | 172 | 200 |
     |   |    |     |     |     |     |     |     |     |     |     |
     | 0 | 80 | 100 | 110 | 119 | 127 | 137 | 145 | 155 | 165 | 200 |
     |   |    |     |     |     |     |     |     |     |     |     |
     | 0 | 75 |  90 | 103 | 112 | 120 | 130 | 138 | 148 | 158 | 200 |
     |   |    |     |     |     |     |     |     |     |     |     |
     | 0 | 69 |  84 |  93 | 104 | 114 | 124 | 132 | 142 | 152 | 200 |
     |   |    |     |     |     |     |     |     |     |     |     |
     | 0 | 63 |  78 |  86 |  95 | 103 | 113 | 123 | 133 | 143 | 200 |
     |   |    |     |     |     |     |     |     |     |     |     |
     | 0 | 56 |  71 |  80 |  89 |  97 | 107 | 117 | 127 | 137 | 200 |
     |   |    |     |     |     |     |     |     |     |     |     |
     | 0 | 49 |  65 |  75 |  83 |  91 | 101 | 111 | 121 | 131 | 200 |
     |   |    |     |     |     |     |     |     |     |     |     |
     | 0 | 40 |  58 |  70 |  78 |  85 |  95 | 105 | 115 | 125 | 200 |
     |   |    |     |     |     |     |     |     |     |     |     |
     | 0 | 34 |  51 |  65 |  72 |  78 |  88 |  98 | 108 | 118 | 198 |
     |   |    |     |     |     |     |     |     |     |     |     |
     | 0 | 29 |  45 |  59 |  66 |  72 |  82 |  92 | 102 | 112 | 193 |
     |   |    |     |     |     |     |     |     |     |     |     |
     | 0 | 20 |  39 |  53 |  60 |  66 |  76 |  86 |  96 | 106 | 188 |
     |   |    |     |     |     |     |     |     |     |     |     |
     | 0 | 18 |  32 |  47 |  54 |  60 |  70 |  80 |  90 | 100 | 183 |
     |   |    |     |     |     |     |     |     |     |     |     |
     | 0 | 10 |  26 |  40 |  47 |  54 |  64 |  74 |  84 |  94 | 178 |
     |   |    |     |     |     |     |     |     |     |     |     |
     | 0 |  0 |  20 |  31 |  39 |  47 |  57 |  67 |  77 |  87 | 173 |
     |   |    |     |     |     |     |     |     |     |     |     |
     | 0 |  0 |  12 |  23 |  32 |  41 |  51 |  61 |  71 |  81 | 168 |
     |   |    |     |     |     |     |     |     |     |     |     |
     | 0 |  0 |   0 |  15 |  25 |  35 |  45 |  55 |  65 |  75 | 163 |
     |   |    |     |     |     |     |     |     |     |     |     |
     | 0 |  0 |   0 |   4 |  17 |  29 |  39 |  49 |  59 |  69 | 158 |
     |   |    |     |     |     |     |     |     |     |     |     |
     | 0 |  0 |   0 |   0 |  12 |  23 |  33 |  43 |  53 |  63 | 153 |
     |   |    |     |     |     |     |     |     |     |     |     |
     | 0 |  0 |   0 |   0 |   1 |  16 |  26 |  36 |  46 |  56 | 148 |
     |   |    |     |     |     |     |     |     |     |     |     |
     | 0 |  0 |   0 |   0 |   0 |  10 |  15 |  20 |  30 |  45 | 129 |
     |   |    |     |     |     |     |     |     |     |     |     |
     | 0 |  0 |   0 |   0 |   0 |   1 |   1 |   1 |   1 |  20 | 104 |
     +---+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+

                  Table 57: CELT Static Allocation Table

   The maximum allocation vector is an approximation of the maximum
   space that can be used by each band for a given mode.  The value is
   approximate because the shape encoding is variable rate (due to
   entropy coding of splitting parameters).  Setting the maximum too low
   reduces the maximum achievable quality in a band while setting it too
   high may result in waste: bitstream capacity available at the end of
   the frame that cannot be put to any use.  The maximums specified by
   the codec reflect the average maximum.  In the reference
   implementation, the maximums in bits/sample are precomputed in a
   static table (see cache_caps50[] in static_modes_float.h) for each
   band, for each value of LM, and for both mono and stereo.
   Implementations are expected to simply use the same table data, but
   the procedure for generating this table is included in rate.c as part
   of compute_pulse_cache().

   To convert the values in cache.caps into the actual maximums: first,
   set nbBands to the maximum number of bands for this mode, and stereo
   to zero if stereo is not in use and one otherwise.  For each band,
   set N to the number of MDCT bins covered by the band (for one
   channel), set LM to the shift value for the frame size.  Then, set i
   to nbBands*(2*LM+stereo).  Next, set the maximum for the band to the
   i-th index of cache.caps + 64 and multiply by the number of channels
   in the current frame (one or two) and by N, then divide the result by
   4 using integer division.  The resulting vector will be called cap[].
   The elements fit in signed 16-bit integers but do not fit in 8 bits.
   This procedure is implemented in the reference in the function
   init_caps() in celt.c.
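
   The conversion can be sketched as below.  This is an illustration
   under assumptions, not a copy of init_caps(): the function name and
   the bins[] argument are hypothetical, and the sketch assumes that
   the stored table holds nbBands consecutive entries for each
   (LM, stereo) combination, so that the entry for a given band is
   found at offset i plus the band index.

```c
/* Convert stored cap entries into per-band maximums.
 *   caps     : table laid out as [2*LM + stereo][band] (assumed layout)
 *   bins[b]  : MDCT bins in band b for one channel at this frame size
 *   channels : 1 or 2
 *   cap[b]   : output maximum for band b */
static void compute_caps(const unsigned char *caps, const int *bins,
                         int nbBands, int LM, int channels, int *cap)
{
    int stereo = (channels == 2);
    int i = nbBands * (2 * LM + stereo);
    int b;
    for (b = 0; b < nbBands; b++)
        cap[b] = ((caps[i + b] + 64) * channels * bins[b]) / 4;
}
```

   The stored values are promoted to int before the addition, so the
   intermediate products cannot overflow an 8-bit type.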

   The band boosts are represented by a series of binary symbols that
   are entropy coded with very low probability.  Each band can
   potentially be boosted multiple times, subject to the frame actually
   having enough room to obey the boost and having enough room to code
   the boost symbol.  The default coding cost for a boost starts out at
   six bits (probability p=1/64), but subsequent boosts in a band cost
   only a single bit and every time a band is boosted the initial cost
   is reduced (down to a minimum of two bits, or p=1/4).  Since the
   initial cost of coding a boost is 6 bits, the coding cost of the
   boost symbols when completely unused is 0.48 bits/frame for a 21-band
   mode (21*-log2(1-1/2**6)).

   To decode the band boosts: First, set 'dynalloc_logp' to 6, the
   initial amount of storage required to signal a boost in bits,
   'total_bits' to the size of the frame in 8th bits, 'total_boost' to
   zero, and 'tell' to the total number of 8th bits decoded so far.  For
   each band from the coding start (0 normally, but 17 in Hybrid mode)
   to the coding end (which changes depending on the signaled
   bandwidth), the boost quanta in units of 1/8 bit is calculated as
   quanta = min(8*N, max(48, N)).  This represents a boost step size of
   six bits, subject to a lower limit of 1/8th bit/sample and an upper
   limit of 1 bit/sample.  Set 'boost' to zero and 'dynalloc_loop_logp'
   to dynalloc_logp.  While dynalloc_loop_logp (the current worst-case
   symbol cost) in 8th bits plus tell is less than total_bits plus
   total_boost and boost is less than cap[] for this band: decode a bit
   from the bitstream with dynalloc_loop_logp as the cost of a one and
   update tell to reflect the current used capacity.  If the decoded
   value is zero, break the loop.  Otherwise, add quanta to boost and
   total_boost, subtract quanta from total_bits, and set
   dynalloc_loop_logp to 1.  When the loop finishes, 'boost' contains the
   bit allocation boost for this band.  If boost is non-zero and
   dynalloc_logp is greater than 2, decrease dynalloc_logp.  Once this
   process has been executed on all bands, the band boosts have been
   decoded.  This procedure is implemented around line 2474 of celt.c.
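
   Two pieces of the procedure above are pure arithmetic and can be
   sketched in isolation: the boost step size and the adaptation of the
   initial boost cost.  The helper names are hypothetical, and the
   range-decoder loop itself (which needs the bitstream state) is
   deliberately not reproduced here.

```c
/* Boost step for a band, in 1/8 bit units: six bits, but no less than
 * 1/8 bit/sample and no more than 1 bit/sample.  N is the number of
 * samples in the band as used in the formula above. */
static int boost_quanta(int N)
{
    int q = N > 48 ? N : 48;         /* max(48, N) */
    return 8 * N < q ? 8 * N : q;    /* min(8*N, max(48, N)) */
}

/* After a band has been processed: lower the initial boost cost if the
 * band was boosted, down to a floor of 2 bits. */
static int next_dynalloc_logp(int dynalloc_logp, int boost)
{
    if (boost != 0 && dynalloc_logp > 2)
        dynalloc_logp--;
    return dynalloc_logp;
}
```

   For N=1 the step is capped at 8 (1 bit/sample), for N=8 it is the
   nominal 48 (six bits), and for N=176 it rises to 176 (1/8
   bit/sample).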

   At very low rates, it is possible that there won't be enough
   available space to execute the inner loop even once.  In these cases,
   band boost is not possible, but its overhead is completely
   eliminated.  Because of the high cost of band boost when activated, a
   reasonable encoder should not be using it at very low rates.  The
   reference implements its dynalloc decision logic around line 1304 of
   celt.c.

   The allocation trim is an integer from 0 to 10.  The default value
   of 5 indicates no trim.  The trim parameter is entropy coded in order
   to lower the coding cost of less extreme adjustments.  Values lower
   than 5 bias the allocation towards lower frequencies and values above
   5 bias it towards higher frequencies.  Like other signaled
   parameters, signaling of the trim is gated so that it is not included
   if there is insufficient space available in the bitstream.  To decode
   the trim, first set the trim value to 5, then if and only if the
   count of decoded 8th bits so far (ec_tell_frac) plus 48 (6 bits) is
   less than or equal to the total frame size in 8th bits minus
   total_boost (a product of the above band boost procedure), decode the
   trim value using the PDF in Table 58.

              +--------------------------------------------+
              | PDF                                        |
              +--------------------------------------------+
              | {2, 2, 5, 10, 22, 46, 22, 10, 5, 2, 2}/128 |
              +--------------------------------------------+

                        Table 58: PDF for the Trim
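
   The gating condition for the trim and the PDF in Table 58 can be
   sketched as follows.  The names are hypothetical.  The second helper
   converts the PDF into an inverse cumulative table, which is the form
   consumed by an inverse-CDF style decoder such as ec_dec_icdf() in
   the reference range decoder; the decoding call itself is not shown.

```c
/* PDF for the allocation trim from Table 58 (sums to 128). */
static const int trim_pdf[11] = {2, 2, 5, 10, 22, 46, 22, 10, 5, 2, 2};

/* Nonzero if the trim symbol is present in the bitstream.
 *   tell_frac   : 1/8 bits decoded so far
 *   total_8th   : total frame size in 1/8 bits
 *   total_boost : 1/8 bits consumed by band boosts
 * When this returns zero, the trim keeps its default value of 5. */
static int trim_is_coded(int tell_frac, int total_8th, int total_boost)
{
    return tell_frac + 48 <= total_8th - total_boost;
}

/* Fill icdf[k] with 128 minus the cumulative sum of pdf[0..k]. */
static void build_trim_icdf(int icdf[11])
{
    int k, cum = 0;
    for (k = 0; k < 11; k++) {
        cum += trim_pdf[k];
        icdf[k] = 128 - cum;
    }
}
```

   The default value 5 carries 46/128 of the probability mass, so
   signaling "no trim" costs roughly 1.5 bits.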

   For 10 ms and 20 ms frames that use short blocks and have at least
   LM+2 bits left prior to the allocation process, one anti-collapse bit
   is reserved in the allocation process so it can be decoded later.
   Following the anti-collapse reservation, one bit is reserved for skip
   if available.

   For stereo frames, bits are reserved for intensity stereo and for
   dual stereo.  Intensity stereo requires ilog2(end-start) bits.  Those
   bits are reserved if there are enough bits left.  Following this, one
   bit is reserved for dual stereo if available.

   The allocation computation begins by setting up some initial
   conditions. 'total' is set to the remaining available 8th bits,
   computed by taking the size of the coded frame times 8 and
   subtracting ec_tell_frac().  From this value, one (8th bit) is
   subtracted to ensure that the resulting allocation will be
   conservative. 'anti_collapse_rsv' is set to 8 (8th bits) if and only
   if the frame is a transient, LM is greater than 1, and total is
   greater than or equal to (LM+2) * 8.  Total is then decremented by
   anti_collapse_rsv and clamped to be equal to or greater than zero.
   'skip_rsv' is set to 8 (8th bits) if total is greater than 8,
   otherwise it is zero.  Total is then decremented by skip_rsv.  This
   reserves space for the final skipping flag.

   If the current frame is stereo, intensity_rsv is set to the
   conservative log2 in 8th bits of the number of coded bands for this
   frame (given by the table LOG2_FRAC_TABLE in rate.c).  If
   intensity_rsv is greater than total, then intensity_rsv is set to
   zero.  Otherwise, total is decremented by intensity_rsv, and if total
   is still greater than 8, dual_stereo_rsv is set to 8 and total is
   decremented by dual_stereo_rsv.
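
   The reservation steps in the two preceding paragraphs can be
   collected into one routine.  This sketch follows the prose as
   written, including its strict and non-strict comparisons; the exact
   boundary behavior should be checked against rate.c and celt.c before
   relying on it.  The struct, the function name, and the frame_bits
   and intensity_cost parameters are hypothetical; intensity_cost
   stands in for the LOG2_FRAC_TABLE lookup, whose values are not
   reproduced here.

```c
struct alloc_rsv {
    int total;              /* remaining 1/8 bits after reservations */
    int anti_collapse_rsv;
    int skip_rsv;
    int intensity_rsv;
    int dual_stereo_rsv;
};

/* frame_bits: coded frame size in bits; tell_frac: 1/8 bits used so far;
 * intensity_cost: conservative log2 (in 1/8 bits) of the coded band
 * count, supplied by the caller. */
static struct alloc_rsv reserve_bits(int frame_bits, int tell_frac,
                                     int is_transient, int LM, int stereo,
                                     int intensity_cost)
{
    struct alloc_rsv r = {0, 0, 0, 0, 0};
    r.total = frame_bits * 8 - tell_frac - 1;   /* stay conservative */
    if (is_transient && LM > 1 && r.total >= (LM + 2) * 8)
        r.anti_collapse_rsv = 8;
    r.total -= r.anti_collapse_rsv;
    if (r.total < 0)
        r.total = 0;
    if (r.total > 8)
        r.skip_rsv = 8;
    r.total -= r.skip_rsv;
    if (stereo && intensity_cost <= r.total) {
        r.intensity_rsv = intensity_cost;
        r.total -= r.intensity_rsv;
        if (r.total > 8) {
            r.dual_stereo_rsv = 8;
            r.total -= r.dual_stereo_rsv;
        }
    }
    return r;
}
```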

   The allocation process then computes a vector representing the hard
   minimum allocation any band will receive for shape.  This
   minimum is higher than the technical limit of the PVQ process, but
   very low rate allocations produce an excessively sparse spectrum and
   these bands are better served by having no allocation at all.  For
   each coded band, set thresh[band] to 24 times the number of MDCT bins
   in the band and divide by 16.  If 8 times the number of channels is
   greater, use that instead.  This sets the minimum allocation to one
   bit per channel or 24 128th bits per MDCT bin, whichever is greater.
   The band-size dependent part of this value is not scaled by the
   channel count, because at the very low rates where this limit is
   applicable there will usually be no bits allocated to the side.
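
   As a small sketch of the threshold rule just described (the function
   name is hypothetical, and integer division is assumed for the
   "divide by 16" step), with the result in 8th bits:

```python
def shape_threshold(bins_in_band, channels):
    """Minimum shape allocation for one band, in 8th bits (illustrative)."""
    per_band = (24 * bins_in_band) // 16   # band-size dependent part
    per_channel = 8 * channels             # one bit per channel
    return max(per_band, per_channel)
```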

   The previously decoded allocation trim is used to derive a vector of
   per-band adjustments, 'trim_offsets[]'.  For each coded band take the
   alloc_trim and subtract 5 and LM.  Then, multiply the result by the
   number of channels, the number of MDCT bins in the shortest frame
   size for this mode, the number of remaining bands, 2**LM, and 8.
   Next, divide this value by 64.  Finally, if the number of MDCT bins
   in the band per channel is only one, 8 times the number of channels
   is subtracted in order to diminish the allocation by one bit, because
   width 1 bands receive greater benefit from the coarse energy coding.
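
   The per-band trim adjustment can be written out as below.  This is a
   sketch under stated assumptions: the function and parameter names
   are hypothetical, 'short_bins' is the number of MDCT bins in the
   band at the shortest frame size, and the division by 64 is assumed
   to round toward minus infinity (Python's floor division).

```python
def trim_offset(alloc_trim, lm, channels, short_bins, remaining_bands):
    """Trim adjustment for one band, in 8th bits (illustrative)."""
    value = (alloc_trim - 5 - lm)
    value *= channels * short_bins * remaining_bands * (1 << lm) * 8
    offset = value // 64                   # assumed floor division
    # Bands that are one bin wide at the current frame size lose one bit
    # per channel.
    if (short_bins << lm) == 1:
        offset -= 8 * channels
    return offset
```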

4.3.4.  Shape Decoding

   In each band, the normalized "shape" is encoded using a Pyramid
   Vector Quantizer (PVQ).

   In the simplest case, the number of bits allocated in Section 4.3.3
   is converted to a number of pulses as described by Section 4.3.4.1.
   Knowing the number of pulses and the number of samples in the band,
   the decoder calculates the size of the codebook as detailed in
   Section 4.3.4.2.  The size is used to decode an unsigned integer
   (uniform probability model), which is the codeword index.  This index
   is converted into the corresponding vector as explained in
   Section 4.3.4.2.  This vector is then scaled to unit norm.

4.3.4.1.  Bits to Pulses

   Although the allocation is performed in 1/8th bit units, the
   quantization requires an integer number of pulses K.  To do this, the
   encoder searches for the value of K that produces the number of bits
   nearest to the allocated value (rounding down if exactly halfway
   between two values), not to exceed the total number of bits
   available.  For efficiency reasons, the search is performed against a
   precomputed allocation table that only permits some K values for each
   N.  The number of codebook entries can be computed as explained in
   Section 4.3.4.2.  The difference between the number of bits allocated
   and the number of bits used is accumulated to a "balance"
   (initialized to zero) that helps adjust the allocation for the next
   bands.  One third of the balance is applied to the bit allocation of
   each band to help achieve the target allocation.  The only exceptions
   are the band before the last and the last band, for which half the
   balance and the whole balance are applied, respectively.
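
   The share of the balance applied to each band can be illustrated as
   follows.  The helper name is hypothetical, 'bands_remaining' is
   assumed to count the current band (so it is 1 for the last band),
   and rounding toward zero is an assumption not stated in the text.

```python
def balance_share(balance, bands_remaining):
    """Portion of the running balance (8th bits) applied to the current band."""
    divisor = min(3, bands_remaining)      # 3 normally, 2 then 1 at the end
    magnitude = abs(balance) // divisor    # assumed truncation toward zero
    return magnitude if balance >= 0 else -magnitude
```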

4.3.4.2.  PVQ Decoding

   Decoding of PVQ vectors is implemented in decode_pulses() (cwrs.c).
   The unique codeword index is decoded as a uniformly distributed
   integer value between 0 and V(N,K)-1, where V(N,K) is the number of
   possible combinations of K pulses in N samples.  The index is then
   converted to a vector in the same way specified in [PVQ].  The
   indexing is based on the calculation of V(N,K) (denoted N(L,K) in
   [PVQ]).

   The number of combinations can be computed recursively as V(N,K) =
   V(N-1,K) + V(N,K-1) + V(N-1,K-1), with V(N,0) = 1 and V(0,K) = 0, K
   != 0.  There are many different ways to compute V(N,K), including
   precomputed tables and direct use of the recursive formulation.  The
   reference implementation applies the recursive formulation one line
   (or column) at a time to save on memory use, along with an alternate,
   univariate recurrence to initialize an arbitrary line, and direct
   polynomial solutions for small N.  All of these methods are
   equivalent, and have different trade-offs in speed, memory usage, and
   code size.  Implementations MAY use any methods they like, as long as
   they are equivalent to the mathematical definition.
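
   One of the permitted approaches, applying the recurrence one row at
   a time, might look like the sketch below.  It is not the method used
   in cwrs.c, and the function name is hypothetical; it only
   illustrates the mathematical definition.

```python
def pvq_codebook_size(n, k):
    """V(N,K): number of integer vectors of N samples with K pulses."""
    row = [1] + [0] * k                    # V(0,0) = 1, V(0,K) = 0 for K != 0
    for _ in range(n):
        nxt = [1] * (k + 1)                # V(N,0) = 1
        for kk in range(1, k + 1):
            # V(N,K) = V(N-1,K) + V(N,K-1) + V(N-1,K-1)
            nxt[kk] = row[kk] + nxt[kk - 1] + row[kk - 1]
        row = nxt
    return row[k]
```

   Only two rows are held in memory at any time, which is the memory
   saving the text refers to.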

   The decoded vector X is recovered as follows.  Let i be the index
   decoded with the procedure in Section 4.1.5 with ft = V(N,K), so that
   0 <= i < V(N,K).  Let k = K.  Then, for j = 0 to (N - 1), inclusive,
   do:

   1.  Let p = (V(N-j-1,k) + V(N-j,k))/2.

   2.  If i < p, then let sgn = 1, else let sgn = -1 and set i = i - p.

   3.  Let k0 = k and set p = p - V(N-j-1,k).

   4.  While p > i, set k = k - 1 and p = p - V(N-j-1,k).

   5.  Set X[j] = sgn*(k0 - k) and i = i - p.

   The decoded vector X is then normalized such that its L2-norm equals
   one.
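
   The five steps above can be transcribed directly.  The sketch below
   uses a memoized recursive V(N,K) for brevity rather than the
   row-based method of the reference implementation; the function names
   are hypothetical.

```python
from functools import lru_cache
import math


@lru_cache(maxsize=None)
def V(n, k):
    """Number of combinations of k pulses in n samples."""
    if k == 0:
        return 1
    if n == 0:
        return 0
    return V(n - 1, k) + V(n, k - 1) + V(n - 1, k - 1)


def decode_pvq(i, n, k):
    """Map index i (0 <= i < V(n,k)) to an integer pulse vector."""
    x = []
    for j in range(n):
        p = (V(n - j - 1, k) + V(n - j, k)) // 2   # step 1
        if i < p:                                  # step 2
            sgn = 1
        else:
            sgn = -1
            i -= p
        k0 = k                                     # step 3
        p -= V(n - j - 1, k)
        while p > i:                               # step 4
            k -= 1
            p -= V(n - j - 1, k)
        x.append(sgn * (k0 - k))                   # step 5
        i -= p
    return x


def normalize(x):
    """Scale x so that its L2-norm equals one."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x]
```

   Every index yields a distinct vector whose absolute values sum to K,
   so enumerating i from 0 to V(N,K)-1 visits the whole codebook.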

4.3.4.3.  Spreading

   The normalized vector decoded in Section 4.3.4.2 is then rotated for
   the purpose of avoiding tonal artifacts.  The rotation gain is equal
   to

                           g_r = N / (N + f_r*K)

   where N is the number of dimensions, K is the number of pulses, and
   f_r depends on the value of the "spread" parameter in the bitstream.

                 +--------------+------------------------+
                 | Spread value | f_r                    |
                 +--------------+------------------------+
                 | 0            | infinite (no rotation) |
                 |              |                        |
                 | 1            | 15                     |
                 |              |                        |
                 | 2            | 10                     |
                 |              |                        |
                 | 3            | 5                      |
                 +--------------+------------------------+

                        Table 59: Spreading Values

   The rotation angle is then calculated as

                                              2
                                     pi *  g_r
                             theta = ----------
                                         4

   A 2-D rotation R(i,j) between points x_i and x_j is defined as:

                  x_i' =  cos(theta)*x_i + sin(theta)*x_j
                  x_j' = -sin(theta)*x_i + cos(theta)*x_j

   An N-D rotation is then achieved by applying a series of 2-D
   rotations back and forth, in the following order: R(x_1, x_2), R(x_2,
   x_3), ..., R(x_N-2, x_N-1), R(x_N-1, x_N), R(x_N-2, x_N-1), ...,
   R(x_1, x_2).

   If the decoded vector represents more than one time block, then this
   spreading process is applied separately on each time block.  Also, if
   each block represents 8 samples or more, then another N-D rotation,
   by (pi/2-theta), is applied _before_ the rotation described above.
   This extra rotation is applied in an interleaved manner with a stride
   equal to round(sqrt(N/nb_blocks)), i.e., it is applied independently
   for each set of samples S_k = {stride*n + k}, n=0..N/stride-1.
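
   The rotation gain, the angle, and the back-and-forth sequence of 2-D
   rotations for a single time block can be sketched as below.  This is
   a direct transcription of the formulas, not a bit-exact equivalent
   of the reference implementation, and it omits the extra strided
   rotation.  The function names are hypothetical.

```python
import math

# Table 59: spread value -> f_r (None stands for "infinite", no rotation)
F_R = {0: None, 1: 15, 2: 10, 3: 5}


def rotation_angle(n, k, spread):
    """theta = pi * g_r^2 / 4 with g_r = N / (N + f_r*K)."""
    f_r = F_R[spread]
    if f_r is None:
        return 0.0                         # g_r tends to 0 for infinite f_r
    g_r = n / (n + f_r * k)
    return math.pi * g_r * g_r / 4


def rotate_block(x, theta):
    """Apply R(x_1,x_2) ... R(x_N-1,x_N) and back down to R(x_1,x_2)."""
    c, s = math.cos(theta), math.sin(theta)
    y = list(x)
    n = len(y)
    pairs = list(range(n - 1)) + list(range(n - 3, -1, -1))
    for i in pairs:
        a, b = y[i], y[i + 1]
        y[i] = c * a + s * b               # x_i'
        y[i + 1] = -s * a + c * b          # x_j'
    return y
```

   Each 2-D rotation is orthogonal, so the unit norm of the decoded
   vector is preserved.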

4.3.4.4.  Split Decoding

   To avoid the need for multi-precision calculations when decoding PVQ
   codevectors, the maximum size allowed for codebooks is 32 bits.  When
   larger codebooks are needed, the vector is instead split in two sub-
   vectors of size N/2.  A quantized gain parameter with precision
   derived from the current allocation is entropy coded to represent the
   relative gains of each side of the split, and the entire decoding
   process is recursively applied.  Multiple levels of splitting may be
   applied up to a limit of LM+1 splits.  The same recursive mechanism
   is applied for the joint coding of stereo audio.

4.3.4.5.  Time-Frequency Change

   The time-frequency (TF) parameters are used to control the time-
   frequency resolution trade-off in each coded band.  For each band,
   there are two possible TF choices.  For the first band coded, the PDF
   is {3, 1}/4 for frames marked as transient and {15, 1}/16 for the
   other frames.  For subsequent bands, the TF choice is coded relative
   to the previous TF choice with probability {15, 1}/16 for transient
   frames and {31, 1}/32 otherwise.  The mapping between the decoded TF
   choices and the adjustment in TF resolution is shown in the tables
   below.

                       +-----------------+---+----+
                       | Frame size (ms) | 0 |  1 |
                       +-----------------+---+----+
                       |       2.5       | 0 | -1 |
                       |                 |   |    |
                       |        5        | 0 | -1 |
                       |                 |   |    |
                       |        10       | 0 | -2 |
                       |                 |   |    |
                       |        20       | 0 | -2 |
                       +-----------------+---+----+

     Table 60: TF Adjustments for Non-transient Frames and tf_select=0

                       +-----------------+---+----+
                       | Frame size (ms) | 0 |  1 |
                       +-----------------+---+----+
                       |       2.5       | 0 | -1 |
                       |                 |   |    |
                       |        5        | 0 | -2 |
                       |                 |   |    |
                       |        10       | 0 | -3 |
                       |                 |   |    |
                       |        20       | 0 | -3 |
                       +-----------------+---+----+

     Table 61: TF Adjustments for Non-transient Frames and tf_select=1

                       +-----------------+---+----+
                       | Frame size (ms) | 0 |  1 |
                       +-----------------+---+----+
                       |       2.5       | 0 | -1 |
                       |                 |   |    |
                       |        5        | 1 |  0 |
                       |                 |   |    |
                       |        10       | 2 |  0 |
                       |                 |   |    |
                       |        20       | 3 |  0 |
                       +-----------------+---+----+

       Table 62: TF Adjustments for Transient Frames and tf_select=0

                       +-----------------+---+----+
                       | Frame size (ms) | 0 |  1 |
                       +-----------------+---+----+
                       |       2.5       | 0 | -1 |
                       |                 |   |    |
                       |        5        | 1 | -1 |
                       |                 |   |    |
                       |        10       | 1 | -1 |
                       |                 |   |    |
                       |        20       | 1 | -1 |
                       +-----------------+---+----+

       Table 63: TF Adjustments for Transient Frames and tf_select=1

   A negative TF adjustment means that the temporal resolution is
   increased, while a positive TF adjustment means that the frequency
   resolution is increased.  Changes in TF resolution are implemented
   using the Hadamard transform [HADAMARD].  To increase the time
   resolution by N, N "levels" of the Hadamard transform are applied to
   the decoded vector for each interleaved MDCT vector.  To increase the
   frequency resolution by N (which assumes a transient frame), N levels
   of the Hadamard transform are applied _across_ the interleaved MDCT
   vector.  In the case of increased time resolution, the decoder uses
   the "sequency order" because the input vector is sorted in time.
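
   Tables 60 to 63 can be collapsed into a single lookup keyed by the
   transient flag, tf_select, the frame size, and the decoded TF choice.
   The layout and function name below are hypothetical; the values are
   copied from the tables.

```python
# (transient, tf_select) -> {frame size in ms: (choice 0, choice 1)}
TF_ADJUST = {
    (False, 0): {2.5: (0, -1), 5: (0, -1), 10: (0, -2), 20: (0, -2)},  # Table 60
    (False, 1): {2.5: (0, -1), 5: (0, -2), 10: (0, -3), 20: (0, -3)},  # Table 61
    (True, 0): {2.5: (0, -1), 5: (1, 0), 10: (2, 0), 20: (3, 0)},      # Table 62
    (True, 1): {2.5: (0, -1), 5: (1, -1), 10: (1, -1), 20: (1, -1)},   # Table 63
}


def tf_adjustment(frame_ms, transient, tf_select, tf_choice):
    """Return the TF resolution adjustment for one band."""
    return TF_ADJUST[(bool(transient), tf_select)][frame_ms][tf_choice]
```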

4.3.5.  Anti-collapse Processing

   The anti-collapse feature is designed to avoid the situation where
   the use of multiple short MDCTs causes the energy in one or more of
   the MDCTs to be zero for some bands, causing unpleasant artifacts.
   When the frame has the transient bit set, an anti-collapse bit is
   decoded.  When anti-collapse is set, the energy in each small MDCT is
   prevented from collapsing to zero.  For each band of each MDCT where
   a collapse is detected, a pseudo-random signal is inserted with an
   energy corresponding to the minimum energy over the two previous
   frames.  A renormalization step is then required to ensure that the
   anti-collapse step did not alter the energy preservation property.

4.3.6.  Denormalization

   Just as each band was normalized in the encoder, the last step of the
   decoder before the inverse MDCT is to denormalize the bands.  Each
   decoded normalized band is multiplied by the square root of the
   decoded energy.  This is done by denormalise_bands() (bands.c).

4.3.7.  Inverse MDCT

   The inverse MDCT implementation has no special characteristics.  The
   input is N frequency-domain samples and the output is 2*N time-domain
   samples, scaled by 1/2.  A "low-overlap" window reduces the
   algorithmic delay.  It is derived from a basic (full-overlap) 240-
   sample version of the window used by the Vorbis codec:

                             /pi      2 /pi   n + 1/2\ \
                  W(n) = sin |-- * sin  |-- * -------| |
                             \2         \2       L   / /

   The low-overlap window is created by zero-padding the basic window
   and inserting ones in the middle, such that the resulting window
   still satisfies power complementarity [PRINCEN86].  The IMDCT and
   windowing are performed by mdct_backward (mdct.c).
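
   The power-complementarity property can be checked numerically for
   the basic window.  The sketch below takes the square on the inner
   sine, the form for which the property holds, and assumes that L is
   the length of the rising half of the window (120 samples for the
   240-sample version); the text does not state this explicitly.  The
   function names are hypothetical.

```python
import math


def basic_window(n, overlap):
    """W(n) = sin(pi/2 * sin^2(pi/2 * (n + 1/2) / L)) with L = overlap."""
    a = 0.5 * math.pi * (n + 0.5) / overlap
    return math.sin(0.5 * math.pi * math.sin(a) ** 2)


def max_power_error(overlap=120):
    """Largest deviation of W(n)^2 + W(L-1-n)^2 from 1 over the overlap."""
    return max(
        abs(basic_window(n, overlap) ** 2
            + basic_window(overlap - 1 - n, overlap) ** 2 - 1.0)
        for n in range(overlap)
    )
```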

4.3.7.1.  Post-Filter

   The output of the inverse MDCT (after weighted overlap-add) is sent
   to the post-filter.  Although the post-filter is applied at the end,
   the post-filter parameters are encoded at the beginning, just after
   the silence flag.  The post-filter can be switched on or off using
   one bit (logp=1).  If the post-filter is enabled, then the octave is
   decoded as a uniformly distributed integer between 0 and 5, inclusive
   (ft = 6 in Section 4.1.5).  Once the octave is known, the fine pitch
   within the octave is decoded using 4+octave raw bits.  The final
   pitch period is equal to (16<<octave)+fine_pitch-1, so it is bounded
   between 15 and 1022, inclusive.  Next, the gain is decoded as three
   raw bits and is
   equal to G=3*(int_gain+1)/32.  The set of post-filter taps is decoded
   last, using a pdf equal to {2, 1, 1}/4.  Tapset zero corresponds to
   the filter coefficients g0 = 0.3066406250, g1 = 0.2170410156, g2 =
   0.1296386719.  Tapset one corresponds to the filter coefficients g0 =
   0.4638671875, g1 = 0.2680664062, g2 = 0, and tapset two uses filter
   coefficients g0 = 0.7998046875, g1 = 0.1000976562, g2 = 0.

   The post-filter response is thus computed as:


             y(n) = x(n) + G*(g0*y(n-T) + g1*(y(n-T+1)+y(n-T-1))
                                        + g2*(y(n-T+2)+y(n-T-2)))


   During a transition between different gains, a smooth transition is
   calculated using the square of the MDCT window.  It is important that
   values of y(n) be interpolated one at a time such that the past value
   of y(n) used is interpolated.
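
   The parameter mapping and the filter recursion can be sketched as
   below for the constant-gain case.  This is an illustration only: it
   uses taps placed symmetrically around y(n-T), assumes zero history
   before the first sample, does not implement the gain transition, and
   uses hypothetical function names.

```python
# Tapset index -> (g0, g1, g2), as listed in the text.
TAPSETS = (
    (0.3066406250, 0.2170410156, 0.1296386719),
    (0.4638671875, 0.2680664062, 0.0),
    (0.7998046875, 0.1000976562, 0.0),
)


def pitch_period(octave, fine_pitch):
    """Final pitch period T from the decoded octave and fine pitch."""
    return (16 << octave) + fine_pitch - 1


def postfilter_gain(int_gain):
    """G = 3*(int_gain+1)/32 for the three-bit gain index."""
    return 3 * (int_gain + 1) / 32


def comb_filter(x, period, gain, tapset):
    """Apply y(n) = x(n) + G*(g0*y(n-T) + g1*(...) + g2*(...)).

    Requires period >= 3 so that every tap refers to an earlier output.
    """
    g0, g1, g2 = TAPSETS[tapset]
    y = []

    def past(m):
        return y[m] if m >= 0 else 0.0     # assumed zero history

    for n, xn in enumerate(x):
        t = n - period
        y.append(xn + gain * (g0 * past(t)
                              + g1 * (past(t + 1) + past(t - 1))
                              + g2 * (past(t + 2) + past(t - 2))))
    return y
```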

4.3.7.2.  De-emphasis

   After the post-filter, the signal is de-emphasized using the inverse
   of the pre-emphasis filter used in the encoder:

                            1            1
                           ---- = ---------------
                           A(z)                -1
                                  1 - alpha_p*z

   where alpha_p=0.8500061035.
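
   In the time domain, this transfer function corresponds to the
   one-pole recursion y(n) = x(n) + alpha_p*y(n-1).  A floating-point
   sketch (the function name and the explicit state argument are
   hypothetical):

```python
ALPHA_P = 0.8500061035


def deemphasis(x, state=0.0):
    """Filter x through 1/(1 - alpha_p*z^-1); returns (output, new state)."""
    y = []
    for xn in x:
        state = xn + ALPHA_P * state       # y(n) = x(n) + alpha_p*y(n-1)
        y.append(state)
    return y, state
```

   Returning the state lets the caller carry the filter memory from one
   frame to the next.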

4.4.  Packet Loss Concealment (PLC)

   Packet Loss Concealment (PLC) is an optional decoder-side feature
   that SHOULD be included when receiving from an unreliable channel.
   Because PLC is not part of the bitstream, there are many acceptable
   ways to implement PLC with different complexity/quality trade-offs.

   The PLC in the reference implementation depends on the mode of the
   last packet received.  In CELT mode, the PLC finds a periodicity in
   the decoded signal and repeats the windowed waveform using the pitch
   offset.  The windowed waveform is overlapped in such a way as to
   preserve the time-domain aliasing cancellation with the previous
   frame and the next frame.  This is implemented in celt_decode_lost()
   (celt.c).  In SILK mode, the PLC uses LPC extrapolation from the
   previous frame, implemented in silk_PLC() (PLC.c).

4.4.1.  Clock Drift Compensation

   Clock drift refers to the gradual desynchronization of two endpoints
   whose sample clocks run at different frequencies while they are
   streaming live audio.  Differences in clock frequencies are generally
   attributable to manufacturing variation in the endpoints' clock
   hardware.  For long-lived streams, the time difference between sender
   and receiver can grow without bound.

   When the sender's clock runs slower than the receiver's, the effect
   is similar to packet loss: too few packets are received.  The
   receiver can distinguish between drift and loss if the transport
   provides packet timestamps.  A receiver for live streams SHOULD
   conceal the effects of drift, and it MAY do so by invoking the PLC.

   When the sender's clock runs faster than the receiver's, too many
   packets will be received.  The receiver MAY respond by skipping any
   packet (i.e., not submitting the packet for decoding).  This is
   likely to produce a less severe artifact than if the frame were
   dropped after decoding.

   A decoder MAY employ a more sophisticated drift compensation method.
   For example, the NetEQ component [GOOGLE-NETEQ] of the Google WebRTC
   codebase [GOOGLE-WEBRTC] compensates for drift by adding or removing
   one period when the signal is highly periodic.  The reference
   implementation of Opus allows a caller to learn whether the current
   frame's signal is highly periodic, and if so what the period is,
   using the OPUS_GET_PITCH() request.

4.5.  Configuration Switching

   Switching between the Opus coding modes, audio bandwidths, and
   channel counts requires careful consideration to avoid audible
   glitches.  Switching between any two configurations of the CELT-only
   mode, any two configurations of the Hybrid mode, or from WB SILK to
   Hybrid mode does not require any special treatment in the decoder, as
   the MDCT overlap will smooth the transition.  Switching from Hybrid
   mode to WB SILK requires adding in the final contents of the CELT
   overlap buffer to the first SILK-only packet.  This can be done by
   decoding a 2.5 ms silence frame with the CELT decoder using the
   channel count of the SILK-only packet (and any choice of audio
   bandwidth), which will correctly handle the cases when the channel
   count changes as well.

   When changing the channel count for SILK-only or Hybrid packets, the
   encoder can avoid glitches by smoothly varying the stereo width of
   the input signal before or after the transition, and it SHOULD do so.
   However, other transitions between SILK-only packets or between NB or
   MB SILK and Hybrid packets may cause glitches, because neither the
   LSF coefficients nor the LTP, LPC, stereo unmixing, and resampler
   buffers are available at the new sample rate.  These switches SHOULD
   be delayed by the encoder until quiet periods or transients, where
   the inevitable glitches will be less audible.  Additionally, the
   bitstream MAY include redundant side information ("redundancy"), in
   the form of additional CELT frames embedded in each of the Opus
   frames around the transition.

   The other transitions that cannot be easily handled are those where
   the lower frequencies switch between the SILK LP-based model and the
   CELT MDCT model.  However, an encoder may not have an opportunity to
   delay such a switch to a convenient point.  For example, if the
   content switches from speech to music, and the encoder does not have
   enough latency in its analysis to detect this in advance, there may
   be no convenient silence period during which to make the transition
   for quite some time.  To avoid or reduce glitches during these
   problematic mode transitions, and between audio bandwidth changes in
   the SILK-only modes, transitions MAY include redundant side
   information ("redundancy"), in the form of an additional CELT frame
   embedded in the Opus frame.

   A transition between coding the lower frequencies with the LP model
   and the MDCT model or a transition that involves changing the SILK
   bandwidth is only normatively specified when it includes redundancy.
   For those without redundancy, it is RECOMMENDED that the decoder use
   a concealment technique (e.g., make use of a PLC algorithm) to "fill
   in" the gap or discontinuity caused by the mode transition.
   Therefore, PLC MUST NOT be applied during any normative transition,
   i.e., when

   o  A packet includes redundancy for this transition (as described
      below),

   o  The transition is between any WB SILK packet and any Hybrid
      packet, or vice versa,

   o  The transition is between any two Hybrid mode packets, or

   o  The transition is between any two CELT mode packets,

   unless there is actual packet loss.

4.5.1.  Transition Side Information (Redundancy)

   Transitions with side information include an extra 5 ms "redundant"
   CELT frame within the Opus frame.  This frame is designed to fill in
   the gap or discontinuity in the different layers without requiring
   the decoder to conceal it.  For transitions from CELT-only to SILK-
   only or Hybrid, the redundant frame is inserted in the first Opus
   frame after the transition (i.e., the first SILK-only or Hybrid
   frame).  For transitions from SILK-only or Hybrid to CELT-only, the
   redundant frame is inserted in the last Opus frame before the
   transition (i.e., the last SILK-only or Hybrid frame).

4.5.1.1.  Redundancy Flag

   The presence of redundancy is signaled in all SILK-only and Hybrid
   frames, not just those involved in a mode transition.  This allows
   the frames to be decoded correctly even if an adjacent frame is lost.
   For SILK-only frames, this signaling is implicit, based on the size
   of the Opus frame and the number of bits consumed decoding the SILK
   portion of it.  After decoding the SILK portion of the Opus frame,
   the decoder uses ec_tell() (see Section 4.1.6.1) to check if there
   are at least 17 bits remaining.  If so, then the frame contains
   redundancy.

   For Hybrid frames, this signaling is explicit.  After decoding the
   SILK portion of the Opus frame, the decoder uses ec_tell() (see
   Section 4.1.6.1) to ensure there are at least 37 bits remaining.  If
   so, it reads a symbol with the PDF in Table 64, and if the value is
   1, then the frame contains redundancy.  Otherwise (if there were
   fewer than 37 bits left or the value was 0), the frame does not
   contain redundancy.

                            +----------------+
                            | PDF            |
                            +----------------+
                            | {4095, 1}/4096 |
                            +----------------+

                       Table 64: Redundancy Flag PDF
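
   The implicit and explicit signaling rules can be summarized in a
   short sketch.  The function name, the mode strings, and the
   'read_flag' callable (standing in for decoding one symbol with the
   PDF in Table 64) are hypothetical; 'bits_remaining' is assumed to be
   the frame size in bits minus ec_tell().

```python
def has_redundancy(mode, bits_remaining, read_flag=None):
    """Decide whether a SILK-only or Hybrid frame carries redundancy."""
    if mode == "silk":
        # Implicit: enough room left after the SILK data.
        return bits_remaining >= 17
    if mode == "hybrid":
        # Explicit: the flag is only coded when at least 37 bits remain.
        if bits_remaining < 37:
            return False
        return read_flag() == 1
    raise ValueError("redundancy is only signaled in SILK-only and Hybrid frames")
```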

4.5.1.2.  Redundancy Position Flag

   Since the current frame is a SILK-only or a Hybrid frame, it must be
   at least 10 ms.  Therefore, it needs an additional flag to indicate
   whether the redundant 5 ms CELT frame should be mixed into the
   beginning of the current frame, or the end.  After determining that a
   frame contains redundancy, the decoder reads a 1 bit symbol with a
   uniform PDF (Table 65).

                               +----------+
                               | PDF      |
                               +----------+
                               | {1, 1}/2 |
                               +----------+

                     Table 65: Redundancy Position PDF

   If the value is zero, this is the first frame in the transition, and
   the redundancy belongs at the end.  If the value is one, this is the
   second frame in the transition, and the redundancy belongs at the
   beginning.  There is no way to specify that an Opus frame contains
   separate redundant CELT frames at both the beginning and the end.

4.5.1.3.  Redundancy Size

   Unlike the CELT portion of a Hybrid frame, the redundant CELT frame
   does not use the same entropy coder state as the rest of the Opus
   frame, because this would break the CELT bit allocation mechanism in
   Hybrid frames.  Thus, a redundant CELT frame always starts and ends
   on a byte boundary, even in SILK-only frames, where this is not
   strictly necessary.

   For SILK-only frames, the number of bytes in the redundant CELT frame
   is simply the number of whole bytes remaining, which must be at least
   2, due to the space check in Section 4.5.1.1.  For Hybrid frames, the
   number of bytes is equal to 2, plus a decoded unsigned integer less
   than 256 (see Section 4.1.5).  This may be more than the number of
   whole bytes remaining in the Opus frame, in which case the frame is
   invalid.  However, a decoder is not required to ignore the entire
   frame, as this may be the result of a bit error that desynchronized
   the range coder.  There may still be useful data before the error,
   and a decoder MAY keep any audio decoded so far instead of invoking
   the PLC, but it is RECOMMENDED that the decoder stop decoding and
   discard the rest of the current Opus frame.
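
   The size rules above can be sketched as follows.  The names are
   hypothetical; 'whole_bytes_remaining' is assumed to be measured
   after any explicit length has been decoded, and None is used here to
   mark an invalid frame.

```python
def redundancy_size(mode, whole_bytes_remaining, decoded_uint=None):
    """Return the redundant CELT frame size in bytes, or None if invalid."""
    if mode == "silk":
        # Implicit: everything that is left belongs to the redundant frame.
        return whole_bytes_remaining
    # Hybrid: explicit length, 2 plus an unsigned integer less than 256.
    size = 2 + decoded_uint
    if size > whole_bytes_remaining:
        return None                        # more than the frame can hold
    return size
```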

   It would have been possible to avoid these invalid states in the
   design of Opus by limiting the range of the explicit length decoded
   from Hybrid frames by the actual number of whole bytes remaining.
   However, this would require an encoder to determine the rate
   allocation for the MDCT layer up front, before it began encoding that
   layer.  By allowing some invalid sizes, the encoder is able to defer
   that decision until much later.  When encoding Hybrid frames that do
   not include redundancy, the encoder must still decide up front if it
   wishes to use the minimum 37 bits required to trigger encoding of the
   redundancy flag, but this is a much looser restriction.

   After determining the size of the redundant CELT frame, the decoder
   reduces the size of the buffer currently in use by the range coder by
   that amount.  The MDCT layer reads any raw bits from the end of this
   reduced buffer, and all calculations of the number of bits remaining
   in the buffer must be done using this new, reduced size, rather than
   the original size of the Opus frame.

4.5.1.4.  Decoding the Redundancy

   The redundant frame is decoded like any other CELT-only frame, with
   the exception that it does not contain a TOC byte.  The frame size is
   fixed at 5 ms, the channel count is set to that of the current frame,
   and the audio bandwidth is also set to that of the current frame,
   with the exception that for MB SILK frames, it is set to WB.

   If the redundancy belongs at the beginning (in a CELT-only to SILK-
   only or Hybrid transition), the final reconstructed output uses the
   first 2.5 ms of audio output by the decoder for the redundant frame
   as is, discarding the corresponding output from the SILK-only or
   Hybrid portion of the frame.  The remaining 2.5 ms is cross-lapped
   with the decoded SILK/Hybrid signal using the CELT's power-
   complementary MDCT window to ensure a smooth transition.

   If the redundancy belongs at the end (in a SILK-only or Hybrid to
   CELT-only transition), only the second half (2.5 ms) of the audio
   output by the decoder for the redundant frame is used.  In that case,
   the second half of the redundant frame is cross-lapped with the end
   of the SILK/Hybrid signal, again using CELT's power-complementary
   MDCT window to ensure a smooth transition.

4.5.2.  State Reset

   When a transition occurs, the state of the SILK or the CELT decoder
   (or both) may need to be reset before decoding a frame in the new
   mode.  This avoids reusing "out of date" memory, which may not have
   been updated in some time or may not be in a well-defined state due
   to, e.g., PLC.  The SILK state is reset before every SILK-only or
   Hybrid frame where the previous frame was CELT-only.  The CELT state
   is reset every time the operating mode changes and the new mode is
   either Hybrid or CELT-only, except when the transition uses
   redundancy as described above.  When switching from SILK-only or
   Hybrid to CELT-only with redundancy, the CELT state is reset before
   decoding the redundant CELT frame embedded in the SILK-only or Hybrid
   frame, but it is not reset before decoding the following CELT-only
   frame.  When switching from CELT-only mode to SILK-only or Hybrid
   mode with redundancy, the CELT decoder is not reset for decoding the
   redundant CELT frame.

4.5.3.  Summary of Transitions

   Figure 18 illustrates all of the normative transitions involving a
   mode change, an audio bandwidth change, or both.  Each one uses an S,
   H, or C to represent an Opus frame in the corresponding mode.  In
   addition, an R indicates the presence of redundancy in the Opus frame
   with which it is cross-lapped.  Its location in the first or last
   5 ms is assumed to correspond to whether it is the frame before or
   after the transition.  Other uses of redundancy are non-normative.
   Finally, a c indicates the contents of the CELT overlap buffer after
   the previously decoded frame (i.e., as extracted by decoding a
   silence frame).

    SILK to SILK with Redundancy:             S -> S -> S
                                                        &
                                                       !R -> R
                                                             &
                                                            ;S -> S -> S

    NB or MB SILK to Hybrid with Redundancy:  S -> S -> S
                                                        &
                                                       !R ->;H -> H -> H

    WB SILK to Hybrid:                        S -> S -> S ->!H -> H -> H

    SILK to CELT with Redundancy:             S -> S -> S
                                                        &
                                                       !R -> C -> C -> C

    Hybrid to NB or MB SILK with Redundancy:  H -> H -> H
                                                        &
                                                       !R -> R
                                                             &
                                                            ;S -> S -> S

    Hybrid to WB SILK:                        H -> H -> H -> c
                                                          \  +
                                                           > S -> S -> S

    Hybrid to CELT with Redundancy:           H -> H -> H
                                                        &
                                                       !R -> C -> C -> C

    CELT to SILK with Redundancy:             C -> C -> C -> R
                                                             &
                                                            ;S -> S -> S

    CELT to Hybrid with Redundancy:           C -> C -> C -> R
                                                             &
                                                            |H -> H -> H

    Key:
    S   SILK-only frame                 ;   SILK decoder reset
    H   Hybrid frame                    |   CELT and SILK decoder resets
    C   CELT-only frame                 !   CELT decoder reset
    c   CELT overlap                    +   Direct mixing
    R   Redundant CELT frame            &   Windowed cross-lap

                     Figure 18: Normative Transitions

   The first two and the last two Opus frames in each example are
   illustrative, i.e., there is no requirement that a stream remain in
   the same configuration for three consecutive frames before or after a
   switch.
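
   As a reading aid, the rows of Figure 18 can be written down as a
   lookup table.  The following C fragment is a minimal sketch and is
   not taken from the reference decoder: every identifier in it (the
   fig18_ prefix, the row names, the struct fields) is hypothetical and
   introduced only for illustration.  Each entry records the reset drawn
   on the first redundant frame, the reset drawn on the first frame
   after the switch, and how the two segments are joined.

```c
#include <assert.h>
#include <stddef.h>

/* Decoder-reset flags, named after the key of Figure 18 (hypothetical names). */
enum {
    FIG18_RESET_NONE = 0,
    FIG18_RESET_SILK = 1,                                   /* ';' */
    FIG18_RESET_CELT = 2,                                   /* '!' */
    FIG18_RESET_BOTH = FIG18_RESET_SILK | FIG18_RESET_CELT  /* '|' */
};

/* How the outgoing and incoming audio are joined. */
typedef enum {
    FIG18_JOIN_NONE,       /* frames simply follow each other         */
    FIG18_JOIN_CROSS_LAP,  /* '&': windowed cross-lap around R frames */
    FIG18_JOIN_DIRECT_MIX  /* '+': direct mixing of the CELT overlap  */
} fig18_join;

/* One identifier per row of Figure 18, in the order drawn. */
typedef enum {
    FIG18_SILK_TO_SILK_RED,
    FIG18_NBMB_SILK_TO_HYBRID_RED,
    FIG18_WB_SILK_TO_HYBRID,
    FIG18_SILK_TO_CELT_RED,
    FIG18_HYBRID_TO_NBMB_SILK_RED,
    FIG18_HYBRID_TO_WB_SILK,
    FIG18_HYBRID_TO_CELT_RED,
    FIG18_CELT_TO_SILK_RED,
    FIG18_CELT_TO_HYBRID_RED,
    FIG18_COUNT
} fig18_id;

typedef struct {
    int redundancy_reset;  /* reset drawn on the first R frame, if any      */
    int new_mode_reset;    /* reset drawn on the first frame after a switch */
    fig18_join join;
} fig18_row;

static const fig18_row fig18_rows[FIG18_COUNT] = {
    [FIG18_SILK_TO_SILK_RED]        = { FIG18_RESET_CELT, FIG18_RESET_SILK, FIG18_JOIN_CROSS_LAP },
    [FIG18_NBMB_SILK_TO_HYBRID_RED] = { FIG18_RESET_CELT, FIG18_RESET_SILK, FIG18_JOIN_CROSS_LAP },
    [FIG18_WB_SILK_TO_HYBRID]       = { FIG18_RESET_NONE, FIG18_RESET_CELT, FIG18_JOIN_NONE },
    [FIG18_SILK_TO_CELT_RED]        = { FIG18_RESET_CELT, FIG18_RESET_NONE, FIG18_JOIN_CROSS_LAP },
    [FIG18_HYBRID_TO_NBMB_SILK_RED] = { FIG18_RESET_CELT, FIG18_RESET_SILK, FIG18_JOIN_CROSS_LAP },
    [FIG18_HYBRID_TO_WB_SILK]       = { FIG18_RESET_NONE, FIG18_RESET_NONE, FIG18_JOIN_DIRECT_MIX },
    [FIG18_HYBRID_TO_CELT_RED]      = { FIG18_RESET_CELT, FIG18_RESET_NONE, FIG18_JOIN_CROSS_LAP },
    [FIG18_CELT_TO_SILK_RED]        = { FIG18_RESET_NONE, FIG18_RESET_SILK, FIG18_JOIN_CROSS_LAP },
    [FIG18_CELT_TO_HYBRID_RED]      = { FIG18_RESET_NONE, FIG18_RESET_BOTH, FIG18_JOIN_CROSS_LAP }
};

/* Return the table entry for one row of Figure 18. */
const fig18_row *fig18_lookup(fig18_id id)
{
    assert((unsigned)id < (unsigned)FIG18_COUNT);
    return &fig18_rows[id];
}
```

   For instance, fig18_lookup(FIG18_CELT_TO_HYBRID_RED) returns a row
   whose new_mode_reset has both flags set, matching the '|' drawn in
   front of the first Hybrid frame.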

   The behavior of transitions without redundancy where PLC is allowed
   is non-normative.  An encoder might still wish to use these
   transitions if, for example, it does not want to spend the extra
   bitrate required for redundancy or if it decides to switch only
   after it has already transmitted the frame that would have had to
   contain the redundancy.  Figure 19 illustrates the recommended
   cross-lapping and decoder resets for these transitions.

    SILK to SILK (audio bandwidth change):    S -> S -> S   ;S -> S -> S

    NB or MB SILK to Hybrid:                  S -> S -> S   |H -> H -> H

    SILK to CELT without Redundancy:          S -> S -> S -> P
                                                             &
                                                            !C -> C -> C

    Hybrid to NB or MB SILK:                  H -> H -> H -> c
                                                             +
                                                            ;S -> S -> S

    Hybrid to CELT without Redundancy:        H -> H -> H -> P
                                                             &
                                                            !C -> C -> C

    CELT to SILK without Redundancy:          C -> C -> C -> P
                                                             &
                                                            ;S -> S -> S

    CELT to Hybrid without Redundancy:        C -> C -> C -> P
                                                             &
                                                            |H -> H -> H

    Key:
    S   SILK-only frame                 ;   SILK decoder reset
    H   Hybrid frame                    |   CELT and SILK decoder resets
    C   CELT-only frame                 !   CELT decoder reset
    c   CELT overlap                    +   Direct mixing
    P   Packet Loss Concealment         &   Windowed cross-lap

             Figure 19: Recommended Non-Normative Transitions
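
   The rows of Figure 19 can also be expressed as a small decision
   function keyed on the configuration before and after the switch.
   The C fragment below is a sketch under stated assumptions, not
   decoder code from this specification: all names (the f19_ prefix,
   the enums, the plan structure) are hypothetical, and the function
   does not check that a mode and bandwidth combination is valid.  It
   reports the reset to apply to the first frame in the new
   configuration and how the gap is bridged, or that Figure 19 has no
   row for the requested switch.

```c
typedef enum { F19_SILK, F19_HYBRID, F19_CELT } f19_mode;
typedef enum { F19_NB, F19_MB, F19_WB, F19_SWB, F19_FB } f19_bandwidth;

/* Decoder-reset flags, named after the key of Figure 19 (hypothetical names). */
enum {
    F19_RESET_NONE = 0,
    F19_RESET_SILK = 1,  /* ';' */
    F19_RESET_CELT = 2,  /* '!' */
    F19_RESET_BOTH = 3   /* '|' */
};

typedef enum {
    F19_BRIDGE_NONE,             /* nothing is drawn between the frames */
    F19_BRIDGE_PLC_CROSS_LAP,    /* 'P' followed by '&'                 */
    F19_BRIDGE_CELT_OVERLAP_MIX  /* 'c' followed by '+'                 */
} f19_bridge;

typedef struct { f19_mode mode; f19_bandwidth bw; } f19_config;

typedef struct {
    int listed;        /* nonzero if Figure 19 has a row for this switch */
    int reset;         /* reset for the first frame after the switch     */
    f19_bridge bridge;
} f19_plan;

static int f19_is_nb_or_mb(f19_bandwidth bw)
{
    return bw == F19_NB || bw == F19_MB;
}

f19_plan f19_recommend(f19_config from, f19_config to)
{
    const f19_plan none = { 0, F19_RESET_NONE, F19_BRIDGE_NONE };
    f19_plan p = { 1, F19_RESET_NONE, F19_BRIDGE_NONE };

    if (from.mode == F19_SILK && to.mode == F19_SILK) {
        if (from.bw == to.bw)
            return none;              /* no bandwidth change, no switch */
        p.reset = F19_RESET_SILK;
        return p;
    }
    if (from.mode == F19_SILK && to.mode == F19_HYBRID) {
        if (!f19_is_nb_or_mb(from.bw))
            return none;              /* WB SILK to Hybrid is in Figure 18 */
        p.reset = F19_RESET_BOTH;
        return p;
    }
    if (from.mode == F19_HYBRID && to.mode == F19_SILK) {
        if (!f19_is_nb_or_mb(to.bw))
            return none;              /* Hybrid to WB SILK is in Figure 18 */
        p.reset = F19_RESET_SILK;
        p.bridge = F19_BRIDGE_CELT_OVERLAP_MIX;
        return p;
    }
    if (from.mode != to.mode) {
        /* Remaining rows: SILK or Hybrid to CELT, CELT to SILK or Hybrid. */
        p.bridge = F19_BRIDGE_PLC_CROSS_LAP;
        p.reset = (to.mode == F19_CELT) ? F19_RESET_CELT
                : (to.mode == F19_SILK) ? F19_RESET_SILK
                : F19_RESET_BOTH;
        return p;
    }
    return none;                      /* Hybrid to Hybrid, CELT to CELT */
}
```

   Keeping the "not listed" result separate from "no reset" lets a
   caller distinguish a switch that Figure 19 does not cover from one
   that it covers with nothing to do.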

   Encoders SHOULD NOT use other transitions, e.g., those that involve
   redundancy in ways not illustrated in Figure 18.
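
   The keys of Figures 18 and 19 name two ways of combining overlapping
   audio: direct mixing ('+') and windowed cross-lap ('&').  The
   difference can be sketched in C as follows.  This is an illustration
   only: the function names are hypothetical, and the linear ramp used
   for the cross-lap is an assumed placeholder, not the window that the
   codec applies, which is specified elsewhere in this document.

```c
#include <assert.h>
#include <stddef.h>

/* '+': direct mixing. The two segments are summed sample by sample. */
void sketch_direct_mix(const float *a, const float *b, float *out, size_t n)
{
    size_t i;
    assert(a != NULL && b != NULL && out != NULL);
    for (i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}

/*
 * '&': windowed cross-lap. The outgoing segment fades out while the
 * incoming segment fades in. ASSUMPTION: a linear ramp stands in for
 * the real window; its two weights always sum to one.
 */
void sketch_cross_lap(const float *outgoing, const float *incoming,
                      float *out, size_t n)
{
    size_t i;
    assert(outgoing != NULL && incoming != NULL && out != NULL);
    for (i = 0; i < n; i++) {
        float w = ((float)i + 0.5f) / (float)n;  /* fade-in weight */
        out[i] = (1.0f - w) * outgoing[i] + w * incoming[i];
    }
}
```

   With a four-sample overlap, an outgoing segment of all ones and an
   incoming segment of all zeros, sketch_cross_lap() produces a ramp
   from 0.875 down to 0.125, whereas sketch_direct_mix() would leave
   the ones unchanged.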
