
RFC 6716

Definition of the Opus Audio Codec

Pages: 326
Proposed Standard
Errata
Updated by:  8251
Part 6 of 14 – Pages 104 to 130


4.3. CELT Decoder

   The CELT layer of Opus is based on the Modified Discrete Cosine
   Transform [MDCT] with partially overlapping windows of 5 to 22.5 ms.
   The main principle behind CELT is that the MDCT spectrum is divided
   into bands that (roughly) follow the Bark scale, i.e., the scale of
   the ear's critical bands [ZWICKER61].  The normal CELT layer uses 21
   of those bands, though Opus Custom (see Section 6.2) may use a
   different number of bands.  In Hybrid mode, the first 17 bands (up to
   8 kHz) are not coded.  A band can contain as few as one MDCT bin per
   channel, and as many as 176 bins per channel, as detailed in
   Table 55.  In each band, the gain (energy) is coded separately from
   the shape of the spectrum.  Coding the gain explicitly makes it easy
   to preserve the spectral envelope of the signal.  The remaining
   unit-norm shape vector is encoded using a Pyramid Vector Quantizer
   (PVQ) (see Section 4.3.4).

   +--------+--------+------+-------+-------+-------------+------------+
   | Frame  | 2.5 ms | 5 ms | 10 ms | 20 ms |       Start |       Stop |
   | Size:  |        |      |       |       |   Frequency |  Frequency |
   +--------+--------+------+-------+-------+-------------+------------+
   | Band   |  Bins: |      |       |       |             |            |
   |        |        |      |       |       |             |            |
   | 0      |      1 |    2 |     4 |     8 |        0 Hz |     200 Hz |
   |        |        |      |       |       |             |            |
   | 1      |      1 |    2 |     4 |     8 |      200 Hz |     400 Hz |
   |        |        |      |       |       |             |            |
   | 2      |      1 |    2 |     4 |     8 |      400 Hz |     600 Hz |
   |        |        |      |       |       |             |            |
   | 3      |      1 |    2 |     4 |     8 |      600 Hz |     800 Hz |
   |        |        |      |       |       |             |            |
   | 4      |      1 |    2 |     4 |     8 |      800 Hz |    1000 Hz |
   |        |        |      |       |       |             |            |
   | 5      |      1 |    2 |     4 |     8 |     1000 Hz |    1200 Hz |
   |        |        |      |       |       |             |            |
   | 6      |      1 |    2 |     4 |     8 |     1200 Hz |    1400 Hz |
   |        |        |      |       |       |             |            |
   | 7      |      1 |    2 |     4 |     8 |     1400 Hz |    1600 Hz |
   |        |        |      |       |       |             |            |
   | 8      |      2 |    4 |     8 |    16 |     1600 Hz |    2000 Hz |
   |        |        |      |       |       |             |            |
   | 9      |      2 |    4 |     8 |    16 |     2000 Hz |    2400 Hz |
   |        |        |      |       |       |             |            |
   | 10     |      2 |    4 |     8 |    16 |     2400 Hz |    2800 Hz |
   |        |        |      |       |       |             |            |
   | 11     |      2 |    4 |     8 |    16 |     2800 Hz |    3200 Hz |
   |        |        |      |       |       |             |            |
   | 12     |      4 |    8 |    16 |    32 |     3200 Hz |    4000 Hz |
   |        |        |      |       |       |             |            |
   | 13     |      4 |    8 |    16 |    32 |     4000 Hz |    4800 Hz |
   |        |        |      |       |       |             |            |
   | 14     |      4 |    8 |    16 |    32 |     4800 Hz |    5600 Hz |
   |        |        |      |       |       |             |            |
   | 15     |      6 |   12 |    24 |    48 |     5600 Hz |    6800 Hz |
   |        |        |      |       |       |             |            |
   | 16     |      6 |   12 |    24 |    48 |     6800 Hz |    8000 Hz |
   |        |        |      |       |       |             |            |
   | 17     |      8 |   16 |    32 |    64 |     8000 Hz |    9600 Hz |
   |        |        |      |       |       |             |            |
   | 18     |     12 |   24 |    48 |    96 |     9600 Hz |   12000 Hz |
   |        |        |      |       |       |             |            |
   | 19     |     18 |   36 |    72 |   144 |    12000 Hz |   15600 Hz |
   |        |        |      |       |       |             |            |
   | 20     |     22 |   44 |    88 |   176 |    15600 Hz |   20000 Hz |
   +--------+--------+------+-------+-------+-------------+------------+

       Table 55: MDCT Bins per Channel per Band for Each Frame Size
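
   As a cross-check of Table 55, the following sketch rebuilds the bin
   counts and band edges from the 2.5 ms column alone.  It assumes, from
   the first rows of the table, that one bin of a 2.5 ms frame spans
   200 Hz and that the bin count doubles each time the frame size
   doubles.  The names BINS_2_5_MS, bins_per_band(), and band_edges_hz()
   are illustrative and are not taken from the reference implementation.

```python
# Bins per band for the 2.5 ms frame size, copied from Table 55.
BINS_2_5_MS = [1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 4, 4, 4, 6, 6, 8, 12, 18, 22]


def bins_per_band(lm):
    """Bins per channel per band for a frame of 2.5 ms * 2**lm (lm = 0..3)."""
    return [n << lm for n in BINS_2_5_MS]


def band_edges_hz(bin_width_hz=200):
    """(start, stop) frequency of each band, assuming 200 Hz per 2.5 ms bin."""
    edges = []
    start = 0
    for n in BINS_2_5_MS:
        stop = start + n * bin_width_hz
        edges.append((start, stop))
        start = stop
    return edges


if __name__ == "__main__":
    print(bins_per_band(3)[20])   # bins in band 20 for a 20 ms frame
    print(band_edges_hz()[17])    # first band that is coded in Hybrid mode
```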

   Transients are notoriously difficult for transform codecs to code.
   CELT uses two different strategies for them:

   1.  Using multiple smaller MDCTs instead of a single large MDCT, and

   2.  Dynamic time-frequency resolution changes (See Section 4.3.4.5).

   To improve quality on highly tonal and periodic signals, CELT
   includes a pre-filter/post-filter combination.  The pre-filter on the
   encoder side attenuates the signal's harmonics.  The post-filter on
   the decoder side restores the original gain of the harmonics, while
   shaping the coding noise to roughly follow the harmonics.  Such noise
   shaping reduces the perception of the noise.

   When coding a stereo signal, three coding methods are available:

   o  mid-side stereo: encodes the mean and the difference of the left
      and right channels,

   o  intensity stereo: only encodes the mean of the left and right
      channels (discards the difference),

   o  dual stereo: encodes the left and right channels separately.

   An overview of the decoder is given in Figure 17.

                       +---------+
                       | Coarse  |
                    +->| decoder |----+
                    |  +---------+    |
                    |                 |
                    |  +---------+    v
                    |  |  Fine   |  +---+
                    +->| decoder |->| + |
                    |  +---------+  +---+
                    |       ^         |
        +---------+ |       |         |
        |  Range  | | +----------+    v
        | Decoder |-+ |   Bit    | +------+
        +---------+ | |Allocation| | 2**x |
                    | +----------+ +------+
                    |       |         |
                    |       v         v               +--------+
                    |  +---------+  +---+  +-------+  | pitch  |
                    +->|   PVQ   |->| * |->| IMDCT |->| post-  |--->
                    |  | decoder |  +---+  +-------+  | filter |
                    |  +---------+                    +--------+
                    |                                      ^
                    +--------------------------------------+

        Legend: IMDCT = Inverse MDCT

                 Figure 17: Structure of the CELT decoder

   The decoder is based on the following symbols and sets of symbols:

          +---------------+---------------------+---------------+
          |   Symbol(s)   |         PDF         |   Condition   |
          +---------------+---------------------+---------------+
          |    silence    |   {32767, 1}/32768  |               |
          |               |                     |               |
          |  post-filter  |       {1, 1}/2      |               |
          |               |                     |               |
          |     octave    |     uniform (6)     |  post-filter  |
          |               |                     |               |
          |     period    | raw bits (4+octave) |  post-filter  |
          |               |                     |               |
          |      gain     |     raw bits (3)    |  post-filter  |
          |               |                     |               |
          |     tapset    |     {2, 1, 1}/4     |  post-filter  |
          |               |                     |               |
          |   transient   |       {7, 1}/8      |               |
          |               |                     |               |
          |     intra     |       {7, 1}/8      |               |
          |               |                     |               |
          | coarse energy |    Section 4.3.2    |               |
          |               |                     |               |
          |   tf_change   |    Section 4.3.1    |               |
          |               |                     |               |
          |   tf_select   |       {1, 1}/2      | Section 4.3.1 |
          |               |                     |               |
          |     spread    |   {7, 2, 21, 2}/32  |               |
          |               |                     |               |
          |  dyn. alloc.  |    Section 4.3.3    |               |
          |               |                     |               |
          |  alloc. trim  |       Table 58      |               |
          |               |                     |               |
          |      skip     |       {1, 1}/2      | Section 4.3.3 |
          |               |                     |               |
          |   intensity   |       uniform       | Section 4.3.3 |
          |               |                     |               |
          |      dual     |       {1, 1}/2      |               |
          |               |                     |               |
          |  fine energy  |    Section 4.3.2    |               |
          |               |                     |               |
          |    residual   |    Section 4.3.4    |               |
          |               |                     |               |
          | anti-collapse |       {1, 1}/2      | Section 4.3.5 |
          |               |                     |               |
          |    finalize   |    Section 4.3.2    |               |
          +---------------+---------------------+---------------+

    Table 56: Order of the Symbols in the CELT Section of the Bitstream

   The decoder extracts information from the range-coded bitstream in
   the order described in Table 56.  In some circumstances, it is
   possible for a decoded value to be out of range due to a very small
   amount of redundancy in the encoding of large integers by the range
   coder.  In that case, the decoder should assume there has been an
   error in the coding, decoding, or transmission and SHOULD take
   measures to conceal the error and/or report to the application that a
   problem has occurred.  Such out of range errors cannot occur in the
   SILK layer.

4.3.1. Transient Decoding

   The "transient" flag indicates whether the frame uses a single long
   MDCT or several short MDCTs.  When it is set, the MDCT coefficients
   represent multiple short MDCTs in the frame.  When not set, the
   coefficients represent a single long MDCT for the frame.  The flag is
   encoded in the bitstream with a probability of 1/8.  In addition to
   the global transient flag, there is a per-band binary flag to change
   the time-frequency (tf) resolution independently in each band.  The
   change in tf resolution is defined in tf_select_table[][] in celt.c
   and depends on the frame size, whether the transient flag is set, and
   the value of tf_select.  The tf_select flag uses a 1/2 probability,
   but is only decoded if it can have an impact on the result knowing
   the value of all per-band tf_change flags.

4.3.2. Energy Envelope Decoding

   It is important to quantize the energy with sufficient resolution
   because any energy quantization error cannot be compensated for at a
   later stage.  Regardless of the resolution used for encoding the
   spectral shape of a band, it is perceptually important to preserve
   the energy in each band.  CELT uses a three-step coarse-fine-fine
   strategy for encoding the energy in the base-2 log domain, as
   implemented in quant_bands.c.

4.3.2.1. Coarse Energy Decoding

   Coarse quantization of the energy uses a fixed resolution of 6 dB
   (integer part of base-2 log).  To minimize the bitrate, prediction is
   applied both in time (using the previous frame) and in frequency
   (using the previous bands).  The part of the prediction that is based
   on the previous frame can be disabled, creating an "intra" frame
   where the energy is coded without reference to prior frames.  The
   decoder first reads the intra flag to determine what prediction is
   used.  The 2-D z-transform [Z-TRANSFORM] of the prediction filter is

                                            -1          -1
                              (1 - alpha*z_l  )*(1 - z_b  )
                A(z_l, z_b) = -----------------------------
                                                 -1
                                     1 - beta*z_b

   where b is the band index and l is the frame index.  The prediction
   coefficients applied depend on the frame size in use when not using
   intra energy and are alpha=0, beta=4915/32768 when using intra
   energy.  The time-domain prediction is based on the final fine
   quantization of the previous frame, while the frequency domain
   (within the current frame) prediction is based on coarse quantization
   only (because the fine quantization has not been computed yet).  The
   prediction is clamped internally so that fixed-point implementations
   with limited dynamic range always remain in the same state as
   floating point implementations.  We approximate the ideal probability
   distribution of the prediction error using a Laplace distribution
   with separate parameters for each frame size in intra- and inter-
   frame modes.  These parameters are held in the e_prob_model table in
   quant_bands.c.  The coarse energy decoding is performed by
   unquant_coarse_energy() (quant_bands.c).  The decoding of the
   Laplace-distributed values is implemented in ec_laplace_decode()
   (laplace.c).
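
   The recursion implied by A(z_l, z_b) can be illustrated with a
   floating-point sketch.  This is not unquant_coarse_energy(): it omits
   the clamping and the Laplace decoding described above, takes the
   decoded residual q[b] as given, and assumes that A maps the log
   energy to the coded residual.  Under that reading, with
   x[b] = E[b] - alpha*E_prev[b], the filter gives
   x[b] - x[b-1] = q[b] - beta*q[b-1].  The function and constant names
   are illustrative only.

```python
def decode_coarse_energy(q, prev_frame_energy, alpha, beta):
    """Rebuild per-band log2 energies from decoded residuals q.

    q                 -- residual per band (units of 6 dB steps)
    prev_frame_energy -- final energies of the previous frame, per band
    alpha, beta       -- time and frequency prediction coefficients
    """
    energies = []
    prev = 0.0  # frequency-domain predictor state, reset at the first band
    for band, residual in enumerate(q):
        energy = alpha * prev_frame_energy[band] + prev + residual
        energies.append(energy)
        # Carry (1 - beta) of the residual forward to the next band.
        prev += (1.0 - beta) * residual
    return energies


# Intra-frame coefficients quoted in the text above.
INTRA_ALPHA = 0.0
INTRA_BETA = 4915 / 32768

if __name__ == "__main__":
    print(decode_coarse_energy([3, -1, 0, 2], [0.0] * 4, INTRA_ALPHA, INTRA_BETA))
```

   With alpha set to zero, the previous frame drops out of the result,
   which is what makes an intra frame decodable on its own.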

4.3.2.2. Fine Energy Quantization

   The number of bits assigned to fine energy quantization in each band
   is determined by the bit allocation computation described in
   Section 4.3.3.  Let B_i be the number of fine energy bits for band i;
   the refinement is an integer f in the range [0,2**B_i-1].  The
   mapping between f and the correction applied to the coarse energy is
   equal to (f+1/2)/2**B_i - 1/2.  Fine energy quantization is
   implemented in quant_fine_energy() (quant_bands.c).

   When some bits are left "unused" after all other flags have been
   decoded, these bits are assigned to a "final" step of fine
   allocation.  In effect, these bits are used to add one extra fine
   energy bit per band per channel.  The allocation process determines
   two "priorities" for the final fine bits.  Any remaining bits are
   first assigned only to bands of priority 0, starting from band 0 and
   going up.  If all bands of priority 0 have received one bit per
   channel, then bands of priority 1 are assigned an extra bit per
   channel, starting from band 0.  If any bits are left after this, they
   are left unused.  This is implemented in unquant_energy_finalise()
   (quant_bands.c).
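
   The mapping (f+1/2)/2**B_i - 1/2 can be evaluated exactly with
   rational arithmetic.  The sketch below only illustrates the formula;
   the name fine_energy_offset() is hypothetical, and the reference
   implementation works in fixed or floating point instead.

```python
from fractions import Fraction


def fine_energy_offset(f, bits):
    """Correction added to the coarse energy (base-2 log units).

    f    -- refinement index in [0, 2**bits - 1]
    bits -- number of fine energy bits B_i for the band
    """
    if not 0 <= f < (1 << bits):
        raise ValueError("f must lie in [0, 2**bits - 1]")
    return Fraction(2 * f + 1, 2) / (1 << bits) - Fraction(1, 2)


if __name__ == "__main__":
    for f in range(4):
        print(f, fine_energy_offset(f, 2))
```

   The offsets are symmetric around zero and stay strictly inside
   (-1/2, 1/2), so the refinement never moves the energy into a
   neighboring coarse step.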

4.3.3. Bit Allocation

   Because the bit allocation drives the decoding of the range-coder
   stream, it MUST be recovered exactly so that identical coding
   decisions are made in the encoder and decoder.  Any deviation from
   the reference's resulting bit allocation will result in corrupted
   output, though implementers are free to implement the procedure in
   any way that produces identical results.

   The per-band gain-shape structure of the CELT layer ensures that
   using the same number of bits for the spectral shape of a band in
   every frame will result in a roughly constant signal-to-noise ratio
   in that band.  This results in coding noise that has the same
   spectral envelope as the signal.  The masking curve produced by a
   standard psychoacoustic model also closely follows the spectral
   envelope of the signal.  This structure means that the ideal
   allocation is more consistent from frame to frame than it is for
   other codecs without an equivalent structure and that a fixed
   allocation provides fairly consistent perceptual performance
   [VALIN2010].

   Many codecs transmit significant amounts of side information to
   control the bit allocation within a frame.  Often this control is
   only indirect, and it must be exercised carefully to achieve the
   desired rate constraints.  The CELT layer, however, can adapt over a
   very wide range of rates, so it has a large number of codebook sizes
   to choose from for each band.  Explicitly signaling the size of each
   of these codebooks would impose considerable overhead, even though
   the allocation is relatively static from frame to frame.  This is
   because all of the information required to compute these codebook
   sizes must be derived from a single frame by itself, in order to
   retain robustness to packet loss, so the signaling cannot take
   advantage of knowledge of the allocation in neighboring frames.  This
   problem is exacerbated in low-latency (small frame size)
   applications, which would include this overhead in every frame.

   For this reason, in the MDCT mode, Opus uses a primarily implicit bit
   allocation.  The available bitstream capacity is known in advance to
   both the encoder and decoder without additional signaling, ultimately
   from the packet sizes expressed by a higher-level protocol.  Using
   this information, the codec interpolates an allocation from a
   hard-coded table.

   While the band-energy structure effectively models intra-band
   masking, it ignores the weaker inter-band masking, band-temporal
   masking, and other less significant perceptual effects.  While these
   effects can often be ignored, they can become significant for
   particular samples.  One mechanism available to encoders would be to
   simply increase the overall rate for these frames, but this is not
   possible in a constant rate mode and can be fairly inefficient.  As a
   result, three explicitly signaled mechanisms are provided to alter the
   implicit allocation:

   o  Band boost

   o  Allocation trim

   o  Band skipping

   The first of these mechanisms, band boost, allows an encoder to boost
   the allocation in specific bands.  The second, allocation trim, works
   by biasing the overall allocation towards higher or lower frequency
   bands.  The third, band skipping, selects which low-precision high
   frequency bands will be allocated no shape bits at all.

   In stereo mode, there are two additional parameters potentially coded
   as part of the allocation procedure: a parameter to allow the
   selective elimination of allocation for the 'side' (i.e., intensity
   stereo) in jointly coded bands, and a flag to deactivate joint coding
   (i.e., dual stereo).  These values are not signaled if they would be
   meaningless in the overall context of the allocation.

   Because every signaled adjustment increases overhead and
   implementation complexity, none were included speculatively: the
   reference encoder makes use of all of these mechanisms.  While the
   decision logic in the reference was found to be effective enough to
   justify the overhead and complexity, further analysis techniques may
   be discovered that increase the effectiveness of these parameters.
   As with other signaled parameters, an encoder is free to choose the
   values in any manner, but, unless a technique is known to deliver
   superior perceptual results, the methods used by the reference
   implementation should be used.

   The allocation process consists of the following steps: determining
   the per-band maximum allocation vector, decoding the boosts, decoding
   the tilt, determining the remaining capacity of the frame, searching
   the mode table for the entry nearest but not exceeding the available
   space (subject to the tilt, boosts, band maximums, and band
   minimums), linear interpolation, reallocation of unused bits with
   concurrent skip decoding, determination of the fine-energy vs. shape
   split, and final reallocation.  This process results in a per-band
   shape allocation (in 1/8th-bit units), a per-band fine-energy
   allocation (in 1 bit per channel units), a set of band priorities for
   controlling the use of remaining bits at the end of the frame, and a
   remaining balance of unallocated space, which is usually zero except
   at very high rates.

   The "static" bit allocation (in 1/8 bits) for a quality q, excluding
   the minimums, maximums, tilt and boosts, is equal to
   channels*N*alloc[band][q]<<LM>>2, where alloc[][] is given in
   Table 57 and LM=log2(frame_size/120).  The allocation is obtained by
   linearly interpolating between two values of q (in steps of 1/64) to
   find the highest allocation that does not exceed the number of bits
   remaining.

    Rows indicate the MDCT bands, columns are the different quality (q)
             parameters.  The units are 1/32 bit per MDCT bin.

     +---+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
     | 0 |  1 |   2 |   3 |   4 |   5 |   6 |   7 |   8 |   9 |  10 |
     +---+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
     | 0 | 90 | 110 | 118 | 126 | 134 | 144 | 152 | 162 | 172 | 200 |
     |   |    |     |     |     |     |     |     |     |     |     |
     | 0 | 80 | 100 | 110 | 119 | 127 | 137 | 145 | 155 | 165 | 200 |
     |   |    |     |     |     |     |     |     |     |     |     |
     | 0 | 75 |  90 | 103 | 112 | 120 | 130 | 138 | 148 | 158 | 200 |
     |   |    |     |     |     |     |     |     |     |     |     |
     | 0 | 69 |  84 |  93 | 104 | 114 | 124 | 132 | 142 | 152 | 200 |
     |   |    |     |     |     |     |     |     |     |     |     |
     | 0 | 63 |  78 |  86 |  95 | 103 | 113 | 123 | 133 | 143 | 200 |
     |   |    |     |     |     |     |     |     |     |     |     |
     | 0 | 56 |  71 |  80 |  89 |  97 | 107 | 117 | 127 | 137 | 200 |
     |   |    |     |     |     |     |     |     |     |     |     |
     | 0 | 49 |  65 |  75 |  83 |  91 | 101 | 111 | 121 | 131 | 200 |
     |   |    |     |     |     |     |     |     |     |     |     |
     | 0 | 40 |  58 |  70 |  78 |  85 |  95 | 105 | 115 | 125 | 200 |
     |   |    |     |     |     |     |     |     |     |     |     |
     | 0 | 34 |  51 |  65 |  72 |  78 |  88 |  98 | 108 | 118 | 198 |
     |   |    |     |     |     |     |     |     |     |     |     |
     | 0 | 29 |  45 |  59 |  66 |  72 |  82 |  92 | 102 | 112 | 193 |
     |   |    |     |     |     |     |     |     |     |     |     |
     | 0 | 20 |  39 |  53 |  60 |  66 |  76 |  86 |  96 | 106 | 188 |
     |   |    |     |     |     |     |     |     |     |     |     |
     | 0 | 18 |  32 |  47 |  54 |  60 |  70 |  80 |  90 | 100 | 183 |
     |   |    |     |     |     |     |     |     |     |     |     |
     | 0 | 10 |  26 |  40 |  47 |  54 |  64 |  74 |  84 |  94 | 178 |
     |   |    |     |     |     |     |     |     |     |     |     |
     | 0 |  0 |  20 |  31 |  39 |  47 |  57 |  67 |  77 |  87 | 173 |
     |   |    |     |     |     |     |     |     |     |     |     |
     | 0 |  0 |  12 |  23 |  32 |  41 |  51 |  61 |  71 |  81 | 168 |
     |   |    |     |     |     |     |     |     |     |     |     |
     | 0 |  0 |   0 |  15 |  25 |  35 |  45 |  55 |  65 |  75 | 163 |
     |   |    |     |     |     |     |     |     |     |     |     |
     | 0 |  0 |   0 |   4 |  17 |  29 |  39 |  49 |  59 |  69 | 158 |
     |   |    |     |     |     |     |     |     |     |     |     |
     | 0 |  0 |   0 |   0 |  12 |  23 |  33 |  43 |  53 |  63 | 153 |
     |   |    |     |     |     |     |     |     |     |     |     |
     | 0 |  0 |   0 |   0 |   1 |  16 |  26 |  36 |  46 |  56 | 148 |
     |   |    |     |     |     |     |     |     |     |     |     |
     | 0 |  0 |   0 |   0 |   0 |  10 |  15 |  20 |  30 |  45 | 129 |
     |   |    |     |     |     |     |     |     |     |     |     |
     | 0 |  0 |   0 |   0 |   0 |   1 |   1 |   1 |   1 |  20 | 104 |
     +---+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+

                  Table 57: CELT Static Allocation Table
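
   The expression channels*N*alloc[band][q]<<LM>>2 can be checked
   against the units of Table 57 with a short sketch.  It assumes that N
   is the band's bin count at the 2.5 ms frame size (so that N<<LM is
   the bin count at the current frame size) and that frame_size is given
   in samples at 48 kHz.  Both are readings of the text, not statements
   from it, and the function names are hypothetical.

```python
def lm_from_frame_size(frame_size):
    """LM = log2(frame_size / 120) for frame sizes 120, 240, 480, or 960."""
    lm = (frame_size // 120).bit_length() - 1
    if lm < 0 or (120 << lm) != frame_size:
        raise ValueError("frame_size must be 120 * 2**LM")
    return lm


def static_allocation(alloc_entry, n_bins_short, lm, channels):
    """Static allocation of one band, in 1/8-bit units.

    alloc_entry  -- Table 57 value for the band and quality q
                    (1/32 bit per MDCT bin)
    n_bins_short -- assumed N: bins in the band at the 2.5 ms frame size
    """
    return ((channels * n_bins_short * alloc_entry) << lm) >> 2


if __name__ == "__main__":
    lm = lm_from_frame_size(960)
    # Band 0, q = 1 (Table 57 entry 90), mono, 20 ms frame.
    print(static_allocation(90, 1, lm, 1) / 8, "bits")
```

   For band 0 at q = 1, the 8 bins of a 20 ms frame at 90/32 bit each
   give 22.5 bits, which matches the 180 eighth-bits the shift
   expression produces.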

   The maximum allocation vector is an approximation of the maximum
   space that can be used by each band for a given mode.  The value is
   approximate because the shape encoding is variable rate (due to
   entropy coding of splitting parameters).  Setting the maximum too low
   reduces the maximum achievable quality in a band while setting it too
   high may result in waste: bitstream capacity available at the end of
   the frame that cannot be put to any use.  The maximums specified by
   the codec reflect the average maximum.  In the reference
   implementation, the maximums in bits/sample are precomputed in a
   static table (see cache_caps50[] in static_modes_float.h) for each
   band, for each value of LM, and for both mono and stereo.
   Implementations are expected to simply use the same table data, but
   the procedure for generating this table is included in rate.c as part
   of compute_pulse_cache().

   To convert the values in cache.caps into the actual maximums: first,
   set nbBands to the maximum number of bands for this mode, and stereo
   to zero if stereo is not in use and one otherwise.  For each band,
   set N to the number of MDCT bins covered by the band (for one
   channel), set LM to the shift value for the frame size.  Then, set i
   to nbBands*(2*LM+stereo).  Next, set the maximum for the band to the
   i-th index of cache.caps + 64 and multiply by the number of channels
   in the current frame (one or two) and by N, then divide the result by
   4 using integer division.  The resulting vector will be called cap[].
   The elements fit in signed 16-bit integers but do not fit in 8 bits.
   This procedure is implemented in the reference in the function
   init_caps() in celt.c.
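
   The arithmetic of the conversion, without the table lookup, can be
   sketched as follows.  The cache.caps entry is passed in directly
   because the table contents are not reproduced here, and the upper
   bound of 255 used in the example is an assumption that the table
   stores 8-bit values.  band_cap() is a hypothetical helper, not the
   reference init_caps().

```python
def band_cap(caps_entry, channels, n_bins):
    """Maximum allocation for one band.

    caps_entry -- value read from cache.caps for this band, LM, and
                  stereo setting
    channels   -- 1 for mono, 2 for stereo
    n_bins     -- MDCT bins covered by the band, for one channel
    """
    return (caps_entry + 64) * channels * n_bins // 4


if __name__ == "__main__":
    # Largest case under the 8-bit assumption: stereo, widest band of
    # Table 55 at 20 ms (176 bins).
    worst = band_cap(255, 2, 176)
    print(worst, worst < 2**15)
```

   Under that assumption, the largest value stays below 32768, which is
   consistent with the statement that the elements fit in signed 16-bit
   integers but not in 8 bits.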

   The band boosts are represented by a series of binary symbols that
   are entropy coded with very low probability.  Each band can
   potentially be boosted multiple times, subject to the frame actually
   having enough room to obey the boost and having enough room to code
   the boost symbol.  The default coding cost for a boost starts out at
   six bits (probability p=1/64), but subsequent boosts in a band cost
   only a single bit and every time a band is boosted the initial cost
   is reduced (down to a minimum of two bits, or p=1/4).  Since the
   initial cost of coding a boost is 6 bits, the coding cost of the
   boost symbols when completely unused is 0.48 bits/frame for a 21 band
   mode (21*-log2(1-1/2**6)).
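
   The 0.48 bits/frame figure follows from the cost of coding a "no
   boost" symbol, which has probability 1-1/2**6, once in each of the 21
   bands.  A minimal numeric check, with a hypothetical function name:

```python
import math


def unused_boost_overhead(n_bands, logp=6):
    """Bits spent per frame when no band is boosted.

    Each band codes one 'zero' symbol with probability 1 - 2**-logp.
    """
    return n_bands * -math.log2(1 - 1 / 2**logp)


if __name__ == "__main__":
    print(round(unused_boost_overhead(21), 2))
```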

   To decode the band boosts: First, set 'dynalloc_logp' to 6, the
   initial amount of storage required to signal a boost in bits,
   'total_bits' to the size of the frame in 8th bits, 'total_boost' to
   zero, and 'tell' to the total number of 8th bits decoded so far.  For
   each band from the coding start (0 normally, but 17 in Hybrid mode)
   to the coding end (which changes depending on the signaled
   bandwidth), the boost quanta in units of 1/8 bit is calculated as
   quanta = min(8*N, max(48, N)).  This represents a boost step size of
   six bits, subject to a lower limit of 1/8th bit/sample and an upper
   limit of 1 bit/sample.  Set 'boost' to zero and 'dynalloc_loop_logp'
   to dynalloc_logp.  While dynalloc_loop_logp (the current worst case
   symbol cost) in 8th bits plus tell is less than total_bits plus
   total_boost and boost is less than cap[] for this band: Decode a bit
   from the bitstream with dynalloc_loop_logp as the cost of a one and
   update tell to reflect the current used capacity.  If the decoded
   value is zero break the loop.  Otherwise, add quanta to boost and
   total_boost, subtract quanta from total_bits, and set
   dynalloc_loop_logp to 1.  When the loop finishes, 'boost' contains the
   bit allocation boost for this band.  If boost is non-zero and
   dynalloc_logp is greater than 2, decrease dynalloc_logp.  Once this
   process has been executed on all bands, the band boosts have been
   decoded.  This procedure is implemented around line 2474 of celt.c.
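
   The boost loop can be sketched as follows, following the wording
   above step by step.  The range decoder is replaced by a hypothetical
   MockDecoder that replays preset bits and charges a crude cost for
   each symbol; a real decoder would use the range coder's own bit
   decoding and fractional-bit accounting.  N is taken to be the bin
   count of the band as stated in the text.

```python
def decode_band_boosts(decode_bit, tell, frame_bits8, band_bins, cap, start=0):
    """Return (per-band boosts, total_boost, total_bits), all in 1/8 bits."""
    dynalloc_logp = 6
    total_bits = frame_bits8
    total_boost = 0
    used = tell()
    boosts = [0] * len(band_bins)
    for band in range(start, len(band_bins)):
        n = band_bins[band]
        quanta = min(8 * n, max(48, n))
        boost = 0
        loop_logp = dynalloc_logp
        while (used + 8 * loop_logp < total_bits + total_boost
               and boost < cap[band]):
            bit = decode_bit(loop_logp)
            used = tell()
            if not bit:
                break
            boost += quanta
            total_boost += quanta
            total_bits -= quanta
            loop_logp = 1  # further boosts in this band cost one bit
        boosts[band] = boost
        if boost > 0 and dynalloc_logp > 2:
            dynalloc_logp -= 1
    return boosts, total_boost, total_bits


class MockDecoder:
    """Stand-in for a range decoder (assumption, for illustration only)."""

    def __init__(self, bits):
        self.bits = list(bits)
        self.used = 0  # consumed capacity in 1/8 bits

    def decode_bit(self, logp):
        bit = self.bits.pop(0) if self.bits else 0
        # Crude cost model: a one costs logp bits, a zero costs 1/8 bit.
        self.used += 8 * logp if bit else 1
        return bit

    def tell(self):
        return self.used


if __name__ == "__main__":
    dec = MockDecoder([0, 1, 1, 0, 0])
    print(decode_band_boosts(dec.decode_bit, dec.tell, 8000,
                             [8, 8, 16], [1000, 1000, 1000]))
```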

   At very low rates, it is possible that there won't be enough
   available space to execute the inner loop even once.  In these cases,
   band boost is not possible, but its overhead is completely
   eliminated.  Because of the high cost of band boost when activated, a
   reasonable encoder should not be using it at very low rates.  The
   reference implements its dynalloc decision logic around line 1304 of
   celt.c.

   The allocation trim is an integer value from 0 to 10.  The default value
   of 5 indicates no trim.  The trim parameter is entropy coded in order
   to lower the coding cost of less extreme adjustments.  Values lower
   than 5 bias the allocation towards lower frequencies and values above
   5 bias it towards higher frequencies.  Like other signaled
   parameters, signaling of the trim is gated so that it is not included
   if there is insufficient space available in the bitstream.  To decode
   the trim, first set the trim value to 5, then if and only if the
   count of decoded 8th bits so far (ec_tell_frac) plus 48 (6 bits) is
   less than or equal to the total frame size in 8th bits minus
   total_boost (a product of the above band boost procedure), decode the
   trim value using the PDF in Table 58.

              +--------------------------------------------+
              | PDF                                        |
              +--------------------------------------------+
              | {2, 2, 5, 10, 22, 46, 22, 10, 5, 2, 2}/128 |
              +--------------------------------------------+

                        Table 58: PDF for the Trim
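
   The gating rule and the cost of each trim value can be illustrated
   with a small sketch.  The function names are hypothetical, and the
   cost is the ideal entropy-coding cost -log2(p) rather than the exact
   number of bits the range coder consumes.

```python
import math

# Table 58, frequencies out of 128, indexed by trim value 0..10.
TRIM_PDF = [2, 2, 5, 10, 22, 46, 22, 10, 5, 2, 2]


def trim_is_coded(tell8, frame_bits8, total_boost):
    """True if the trim symbol is present; otherwise the trim stays at 5."""
    return tell8 + 48 <= frame_bits8 - total_boost


def trim_cost_bits(trim):
    """Ideal coding cost of a trim value in bits."""
    return -math.log2(TRIM_PDF[trim] / sum(TRIM_PDF))


if __name__ == "__main__":
    print(trim_cost_bits(5), trim_cost_bits(0))
```

   The least likely values (0, 1, 9, and 10) cost 6 bits each, which is
   consistent with the 48 eighth-bits required by the gate.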

   For 10 ms and 20 ms frames that use short blocks and have at least
   LM+2 bits left prior to the allocation process, one anti-collapse bit
   is reserved in the allocation process so it can be decoded later.
   Following the anti-collapse reservation, one bit is reserved for skip
   if available.

   For stereo frames, bits are reserved for intensity stereo and for
   dual stereo.  Intensity stereo requires ilog2(end-start) bits.  Those
   bits are reserved if there are enough bits left.  Following this, one
   bit is reserved for dual stereo if available.

   The allocation computation begins by setting up some initial
   conditions. 'total' is set to the remaining available 8th bits,
   computed by taking the size of the coded frame times 8 and
   subtracting ec_tell_frac().  From this value, one (8th bit) is
   subtracted to ensure that the resulting allocation will be
   conservative. 'anti_collapse_rsv' is set to 8 (8th bits) if and only
   if the frame is a transient, LM is greater than 1, and total is
   greater than or equal to (LM+2) * 8.  Total is then decremented by
   anti_collapse_rsv and clamped to be equal to or greater than zero.
   'skip_rsv' is set to 8 (8th bits) if total is greater than 8,
   otherwise it is zero.  Total is then decremented by skip_rsv.  This
   reserves space for the final skipping flag.

   If the current frame is stereo, intensity_rsv is set to the
   conservative log2 in 8th bits of the number of coded bands for this
   frame (given by the table LOG2_FRAC_TABLE in rate.c).  If
   intensity_rsv is greater than total, then intensity_rsv is set to
   zero.  Otherwise, total is decremented by intensity_rsv, and if total
   is still greater than 8, dual_stereo_rsv is set to 8 and total is
   decremented by dual_stereo_rsv.
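
   The reservation steps of the two preceding paragraphs can be
   collected into one sketch.  It follows the comparisons as worded
   above.  The conservative log2 value from LOG2_FRAC_TABLE is not
   reproduced here, so it is supplied by the caller as the hypothetical
   parameter intensity_cost8.

```python
def reserve_bits(frame_bits8, tell8, is_transient, lm, stereo,
                 intensity_cost8=0):
    """Compute the reservations; all quantities are in 1/8 bits.

    frame_bits8     -- size of the coded frame times 8 (in 1/8 bits)
    tell8           -- 1/8 bits consumed so far (ec_tell_frac())
    intensity_cost8 -- assumed input: conservative log2 of the number of
                       coded bands, in 1/8 bits
    """
    total = frame_bits8 - tell8 - 1  # one 1/8 bit of safety margin

    anti_collapse_rsv = 0
    if is_transient and lm > 1 and total >= (lm + 2) * 8:
        anti_collapse_rsv = 8
    total = max(0, total - anti_collapse_rsv)

    skip_rsv = 8 if total > 8 else 0
    total -= skip_rsv

    intensity_rsv = 0
    dual_stereo_rsv = 0
    if stereo:
        intensity_rsv = intensity_cost8
        if intensity_rsv > total:
            intensity_rsv = 0
        else:
            total -= intensity_rsv
            if total > 8:
                dual_stereo_rsv = 8
                total -= dual_stereo_rsv

    return {
        "total": total,
        "anti_collapse_rsv": anti_collapse_rsv,
        "skip_rsv": skip_rsv,
        "intensity_rsv": intensity_rsv,
        "dual_stereo_rsv": dual_stereo_rsv,
    }


if __name__ == "__main__":
    print(reserve_bits(800, 100, True, 3, True, intensity_cost8=36))
```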

   The allocation process then computes a vector representing the hard
   minimum allocation any band will receive for shape.  This
   minimum is higher than the technical limit of the PVQ process, but
   very low rate allocations produce an excessively sparse spectrum and
   these bands are better served by having no allocation at all.  For
   each coded band, set thresh[band] to 24 times the number of MDCT bins
   in the band and divide by 16.  If 8 times the number of channels is
   greater, use that instead.  This sets the minimum allocation to one
   bit per channel or 48 128th bits per MDCT bin, whichever is greater.
   The band-size dependent part of this value is not scaled by the
   channel count, because at the very low rates where this limit is
   applicable there will usually be no bits allocated to the side.
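
   A minimal sketch of the threshold computation, assuming integer
   (floor) division and a hypothetical list bins_per_band holding the
   number of MDCT bins in each coded band:

```python
def shape_thresholds(bins_per_band, channels):
    """Per-band hard minimum shape allocation, in 1/8 bit units."""
    # 24*N/16 per band, but never less than one bit (8 8th bits) per channel.
    return [max((24 * n) // 16, 8 * channels) for n in bins_per_band]
```

   Narrow bands end up at the per-channel floor, while wide bands are
   governed by the bin-count term.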

   The previously decoded allocation trim is used to derive a vector of
   per-band adjustments, 'trim_offsets[]'.  For each coded band take the
   alloc_trim and subtract 5 and LM.  Then, multiply the result by the
   number of channels, the number of MDCT bins in the shortest frame
   size for this mode, the number of remaining bands, 2**LM, and 8.
   Next, divide this value by 64.  Finally, if the number of MDCT bins
   in the band per channel is only one, 8 times the number of channels
   is subtracted in order to diminish the allocation by one bit, because
   width 1 bands receive greater benefit from the coarse energy coding.
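
   The trim computation can be sketched as below.  The names are
   hypothetical; short_bins is assumed to list, for each coded band, the
   number of MDCT bins in the shortest frame size, "remaining bands" is
   read as the number of coded bands after the current one, and the
   division by 64 is assumed to round toward negative infinity.

```python
def trim_offsets(alloc_trim, lm, channels, short_bins):
    """Per-band allocation adjustments, in 1/8 bit units."""
    n_bands = len(short_bins)
    offsets = []
    for band, n0 in enumerate(short_bins):
        remaining = n_bands - band - 1
        value = ((alloc_trim - 5 - lm) * channels * n0 * remaining
                 * (2 ** lm) * 8)
        value //= 64  # floor division; the rounding mode is an assumption
        # Bands that are one bin wide per channel give up one bit per channel.
        if n0 * (2 ** lm) == 1:
            value -= 8 * channels
        offsets.append(value)
    return offsets
```

   With alloc_trim equal to 5 and LM=0, the trim term vanishes and only
   the width-1 penalty remains.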

4.3.4. Shape Decoding

In each band, the normalized "shape" is encoded using a Pyramid Vector Quantizer. In the simplest case, the number of bits allocated in Section 4.3.3 is converted to a number of pulses as described by Section 4.3.4.1. Knowing the number of pulses and the number of samples in the band, the decoder calculates the size of the codebook as detailed in Section 4.3.4.2. The size is used to decode an unsigned integer (uniform probability model), which is the codeword index. This index is converted into the corresponding vector as explained in Section 4.3.4.2. This vector is then scaled to unit norm.
4.3.4.1. Bits to Pulses
Although the allocation is performed in 1/8th bit units, the quantization requires an integer number of pulses K. To do this, the encoder searches for the value of K that produces the number of bits nearest to the allocated value (rounding down if exactly halfway between two values), not to exceed the total number of bits available. For efficiency reasons, the search is performed against a precomputed allocation table that only permits some K values for each N. The number of codebook entries can be computed as explained in Section 4.3.4.2. The difference between the number of bits allocated and the number of bits used is accumulated to a "balance" (initialized to zero) that helps adjust the allocation for the next bands. One third of the balance is applied to the bit allocation of each band to help achieve the target allocation. The only exceptions are the band before the last and the last band, for which half the balance and the whole balance are applied, respectively.
4.3.4.2. PVQ Decoding
Decoding of PVQ vectors is implemented in decode_pulses() (cwrs.c). The unique codeword index is decoded as a uniformly distributed integer value between 0 and V(N,K)-1, where V(N,K) is the number of possible combinations of K pulses in N samples. The index is then converted to a vector in the same way specified in [PVQ]. The indexing is based on the calculation of V(N,K) (denoted N(L,K) in [PVQ]).

The number of combinations can be computed recursively as

   V(N,K) = V(N-1,K) + V(N,K-1) + V(N-1,K-1),

with V(N,0) = 1 and V(0,K) = 0, K != 0. There are many different ways to compute V(N,K), including precomputed tables and direct use of the recursive formulation. The reference implementation applies the recursive formulation one line (or column) at a time to save on memory use, along with an alternate, univariate recurrence to initialize an arbitrary line, and direct polynomial solutions for small N. All of these methods are equivalent, and have different trade-offs in speed, memory usage, and code size. Implementations MAY use any methods they like, as long as they are equivalent to the mathematical definition.

The decoded vector X is recovered as follows. Let i be the index decoded with the procedure in Section 4.1.5 with ft = V(N,K), so that 0 <= i < V(N,K). Let k = K. Then, for j = 0 to (N - 1), inclusive, do:

   1.  Let p = (V(N-j-1,k) + V(N-j,k))/2.

   2.  If i < p, then let sgn = 1, else let sgn = -1 and set i = i - p.

   3.  Let k0 = k and set p = p - V(N-j-1,k).

   4.  While p > i, set k = k - 1 and p = p - V(N-j-1,k).

   5.  Set X[j] = sgn*(k0 - k) and i = i - p.

The decoded vector X is then normalized such that its L2-norm equals one.
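
The recurrence for V(N,K) and the five decoding steps translate almost directly into code. The following is a minimal sketch that uses plain memoized recursion, not the row-by-row method of the reference implementation; the function names are hypothetical.

```python
from functools import lru_cache
import math


@lru_cache(maxsize=None)
def pvq_size(n, k):
    """V(N,K): number of ways to place K signed pulses in N samples."""
    if k == 0:
        return 1
    if n == 0:
        return 0
    return pvq_size(n - 1, k) + pvq_size(n, k - 1) + pvq_size(n - 1, k - 1)


def pvq_decode(i, n, k):
    """Convert codeword index i (0 <= i < V(n,k)) into an integer vector."""
    x = []
    for j in range(n):
        # Step 1: indices below p carry a non-negative X[j].
        p = (pvq_size(n - j - 1, k) + pvq_size(n - j, k)) // 2
        # Step 2: pick the sign.
        if i < p:
            sgn = 1
        else:
            sgn = -1
            i -= p
        # Steps 3 and 4: find how many pulses are left after position j.
        k0 = k
        p -= pvq_size(n - j - 1, k)
        while p > i:
            k -= 1
            p -= pvq_size(n - j - 1, k)
        # Step 5: emit the pulse count for this position.
        x.append(sgn * (k0 - k))
        i -= p
    return x


def normalize(x):
    """Scale x so that its L2-norm equals one."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x]
```

Enumerating every index for a small codebook, e.g. N=3 and K=2, yields V(3,2) distinct vectors whose absolute values each sum to K, which is a convenient self-check for any alternative implementation of V(N,K).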
4.3.4.3. Spreading
The normalized vector decoded in Section 4.3.4.2 is then rotated for the purpose of avoiding tonal artifacts. The rotation gain is equal to

                         g_r = N / (N + f_r*K)

   where N is the number of dimensions, K is the number of pulses, and
   f_r depends on the value of the "spread" parameter in the bitstream.

                 +--------------+------------------------+
                 | Spread value | f_r                    |
                 +--------------+------------------------+
                 | 0            | infinite (no rotation) |
                 |              |                        |
                 | 1            | 15                     |
                 |              |                        |
                 | 2            | 10                     |
                 |              |                        |
                 | 3            | 5                      |
                 +--------------+------------------------+

                        Table 59: Spreading Values

   The rotation angle is then calculated as

                                              2
                                     pi *  g_r
                             theta = ----------
                                         4

   A 2-D rotation R(i,j) between points x_i and x_j is defined as:

                  x_i' =  cos(theta)*x_i + sin(theta)*x_j
                  x_j' = -sin(theta)*x_i + cos(theta)*x_j

   An N-D rotation is then achieved by applying a series of 2-D
   rotations back and forth, in the following order: R(x_1, x_2), R(x_2,
   x_3), ..., R(x_N-2, x_N-1), R(x_N-1, x_N), R(x_N-2, x_N-1), ...,
   R(x_1, x_2).

   If the decoded vector represents more than one time block, then this
   spreading process is applied separately on each time block.  Also, if
   each block represents 8 samples or more, then another N-D rotation,
   by (pi/2-theta), is applied _before_ the rotation described above.
   This extra rotation is applied in an interleaved manner with a stride
   equal to round(sqrt(N/nb_blocks)), i.e., it is applied independently
   for each set of samples S_k = {stride*n + k}, n=0..N/stride-1.
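
   The basic N-D rotation can be sketched as follows.  This covers a
   single time block only and leaves out the extra interleaved rotation
   by (pi/2-theta); the function names are hypothetical, and a spread
   value of 0 is treated as a zero rotation angle.

```python
import math

# f_r values from Table 59; spread 0 means no rotation.
SPREAD_FACTORS = {1: 15, 2: 10, 3: 5}


def rotation_angle(n, k, spread):
    """theta = pi * g_r**2 / 4 with g_r = N / (N + f_r*K)."""
    if spread == 0:
        return 0.0
    g_r = n / (n + SPREAD_FACTORS[spread] * k)
    return math.pi * g_r * g_r / 4


def rotate_pair(x, i, j, theta):
    """Apply the 2-D rotation R(x_i, x_j) in place."""
    c, s = math.cos(theta), math.sin(theta)
    xi, xj = x[i], x[j]
    x[i] = c * xi + s * xj
    x[j] = -s * xi + c * xj


def spread_rotate(x, k, spread):
    """Forward sweep R(x_1,x_2)..R(x_N-1,x_N), then back down to R(x_1,x_2)."""
    x = list(x)
    n = len(x)
    theta = rotation_angle(n, k, spread)
    for i in range(n - 1):
        rotate_pair(x, i, i + 1, theta)
    for i in range(n - 3, -1, -1):
        rotate_pair(x, i, i + 1, theta)
    return x
```

   Because every step is a plane rotation, the unit norm of the decoded
   vector is preserved.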

4.3.4.4. Split Decoding
To avoid the need for multi-precision calculations when decoding PVQ codevectors, the maximum size allowed for codebooks is 32 bits. When larger codebooks are needed, the vector is instead split in two sub-vectors of size N/2. A quantized gain parameter with precision
   derived from the current allocation is entropy coded to represent the
   relative gains of each side of the split, and the entire decoding
   process is recursively applied.  Multiple levels of splitting may be
   applied up to a limit of LM+1 splits.  The same recursive mechanism
   is applied for the joint coding of stereo audio.

4.3.4.5. Time-Frequency Change
The time-frequency (TF) parameters are used to control the time-frequency resolution trade-off in each coded band. For each band, there are two possible TF choices. For the first band coded, the PDF is {3, 1}/4 for frames marked as transient and {15, 1}/16 for the other frames. For subsequent bands, the TF choice is coded relative to the previous TF choice with probability {15, 1}/16 for transient frames and {31, 1}/32 otherwise. The mapping between the decoded TF choices and the adjustment in TF resolution is shown in the tables below.

                       +-----------------+---+----+
                       | Frame size (ms) | 0 |  1 |
                       +-----------------+---+----+
                       |       2.5       | 0 | -1 |
                       |                 |   |    |
                       |        5        | 0 | -1 |
                       |                 |   |    |
                       |        10       | 0 | -2 |
                       |                 |   |    |
                       |        20       | 0 | -2 |
                       +-----------------+---+----+

     Table 60: TF Adjustments for Non-transient Frames and tf_select=0

                       +-----------------+---+----+
                       | Frame size (ms) | 0 |  1 |
                       +-----------------+---+----+
                       |       2.5       | 0 | -1 |
                       |                 |   |    |
                       |        5        | 0 | -2 |
                       |                 |   |    |
                       |        10       | 0 | -3 |
                       |                 |   |    |
                       |        20       | 0 | -3 |
                       +-----------------+---+----+

     Table 61: TF Adjustments for Non-transient Frames and tf_select=1

                       +-----------------+---+----+
                       | Frame size (ms) | 0 |  1 |
                       +-----------------+---+----+
                       |       2.5       | 0 | -1 |
                       |                 |   |    |
                       |        5        | 1 |  0 |
                       |                 |   |    |
                       |        10       | 2 |  0 |
                       |                 |   |    |
                       |        20       | 3 |  0 |
                       +-----------------+---+----+

       Table 62: TF Adjustments for Transient Frames and tf_select=0

                       +-----------------+---+----+
                       | Frame size (ms) | 0 |  1 |
                       +-----------------+---+----+
                       |       2.5       | 0 | -1 |
                       |                 |   |    |
                       |        5        | 1 | -1 |
                       |                 |   |    |
                       |        10       | 1 | -1 |
                       |                 |   |    |
                       |        20       | 1 | -1 |
                       +-----------------+---+----+

       Table 63: TF Adjustments for Transient Frames and tf_select=1

   A negative TF adjustment means that the temporal resolution is
   increased, while a positive TF adjustment means that the frequency
   resolution is increased.  Changes in TF resolution are implemented
   using the Hadamard transform [HADAMARD].  To increase the time
   resolution by N, N "levels" of the Hadamard transform are applied to
   the decoded vector for each interleaved MDCT vector.  To increase the
   frequency resolution (which assumes a transient frame), N levels of
   the Hadamard transform are applied _across_ the interleaved MDCT
   vector.  In the case of increased time resolution, the decoder uses
   the "sequency order" because the input vector is sorted in time.
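
   Tables 60 to 63 amount to a small lookup keyed by the transient
   flag, tf_select, the frame size, and the decoded TF choice.  A
   minimal sketch, assuming the frame sizes 2.5, 5, 10, and 20 ms are
   indexed as LM = 0..3 and using hypothetical names:

```python
# TF_ADJUST[is_transient][tf_select][LM] = (choice 0, choice 1)
TF_ADJUST = {
    False: {
        0: [(0, -1), (0, -1), (0, -2), (0, -2)],   # Table 60
        1: [(0, -1), (0, -2), (0, -3), (0, -3)],   # Table 61
    },
    True: {
        0: [(0, -1), (1, 0), (2, 0), (3, 0)],      # Table 62
        1: [(0, -1), (1, -1), (1, -1), (1, -1)],   # Table 63
    },
}


def tf_adjustment(is_transient, tf_select, lm, tf_choice):
    """Return the TF resolution adjustment for one band.

    Negative values increase time resolution, positive values increase
    frequency resolution.
    """
    return TF_ADJUST[is_transient][tf_select][lm][tf_choice]
```

   For example, a 20 ms transient frame with tf_select=0 and TF choice
   0 maps to an adjustment of 3.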

4.3.5. Anti-collapse Processing

The anti-collapse feature is designed to avoid the situation where the use of multiple short MDCTs causes the energy in one or more of the MDCTs to be zero for some bands, causing unpleasant artifacts. When the frame has the transient bit set, an anti-collapse bit is decoded. When anti-collapse is set, the energy in each small MDCT is prevented from collapsing to zero. For each band of each MDCT where a collapse is detected, a pseudo-random signal is inserted with an
   energy corresponding to the minimum energy over the two previous
   frames.  A renormalization step is then required to ensure that the
   anti-collapse step did not alter the energy preservation property.

4.3.6. Denormalization

Just as each band was normalized in the encoder, the last step of the decoder before the inverse MDCT is to denormalize the bands. Each decoded normalized band is multiplied by the square root of the decoded energy. This is done by denormalise_bands() (bands.c).

4.3.7. Inverse MDCT

The inverse MDCT implementation has no special characteristics. The input is N frequency-domain samples and the output is 2*N time-domain samples, while scaling by 1/2. A "low-overlap" window reduces the algorithmic delay. It is derived from a basic (full-overlap) 240-sample version of the window used by the Vorbis codec:

                                         2
              /   /pi      /pi   n + 1/2\ \ \
       W(n) = |sin|-- * sin|-- * -------| | |
              \   \2       \2       L   / / /

The low-overlap window is created by zero-padding the basic window and inserting ones in the middle, such that the resulting window still satisfies power complementarity [PRINCEN86]. The IMDCT and windowing are performed by mdct_backward (mdct.c).
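
The power-complementarity property of the basic window can be checked numerically. The sketch below reads the exponent in the window formula as squaring the inner sine, which is the form of the Vorbis window and the reading under which the property holds; it also assumes L = 120, i.e., half of the 240-sample basic window. Function names are hypothetical.

```python
import math


def basic_window(n, overlap):
    """W(n) = sin(pi/2 * sin(pi/2 * (n + 1/2) / L)**2)."""
    inner = math.sin(0.5 * math.pi * (n + 0.5) / overlap)
    return math.sin(0.5 * math.pi * inner * inner)


def is_power_complementary(overlap, tol=1e-12):
    """Check W(n)**2 + W(n + L)**2 == 1 for every n in the overlap."""
    return all(
        abs(basic_window(n, overlap) ** 2
            + basic_window(n + overlap, overlap) ** 2 - 1.0) < tol
        for n in range(overlap)
    )
```

Evaluating the formula for n = 0..2L-1 produces the rising and falling halves of the full-overlap window, and the check confirms that overlapping halves sum to unit power.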
4.3.7.1. Post-Filter
The output of the inverse MDCT (after weighted overlap-add) is sent to the post-filter. Although the post-filter is applied at the end, the post-filter parameters are encoded at the beginning, just after the silence flag. The post-filter can be switched on or off using one bit (logp=1). If the post-filter is enabled, then the octave is decoded as an integer value between 0 and 6 of uniform probability. Once the octave is known, the fine pitch within the octave is decoded using 4+octave raw bits. The final pitch period is equal to (16<<octave)+fine_pitch-1 so it is bounded between 15 and 1022, inclusively. Next, the gain is decoded as three raw bits and is equal to G=3*(int_gain+1)/32. The set of post-filter taps is decoded last, using a pdf equal to {2, 1, 1}/4. Tapset zero corresponds to the filter coefficients g0 = 0.3066406250, g1 = 0.2170410156, g2 = 0.1296386719. Tapset one corresponds to the filter coefficients g0 = 0.4638671875, g1 = 0.2680664062, g2 = 0, and tapset two uses filter coefficients g0 = 0.7998046875, g1 = 0.1000976562, g2 = 0.
   The post-filter response is thus computed as:


             y(n) = x(n) + G*(g0*y(n-T) + g1*(y(n-T+1)+y(n-T-1))
                                        + g2*(y(n-T+2)+y(n-T-2)))


   During a transition between different gains, a smooth transition is
   calculated using the square of the MDCT window.  It is important that
   values of y(n) be interpolated one at a time such that the past value
   of y(n) used is interpolated.
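
   A minimal sketch of the post-filter parameters and response.  It
   assumes taps placed symmetrically around the pitch period, at
   y(n-T+k) and y(n-T-k), a constant gain and period (no transition
   smoothing), and zero history before the first sample.  The octave is
   taken to range over 0..5, which is the range that yields the stated
   15..1022 bound on the pitch period.  All names are hypothetical.

```python
# (g0, g1, g2) for tapsets 0, 1, and 2.
TAPSETS = [
    (0.3066406250, 0.2170410156, 0.1296386719),
    (0.4638671875, 0.2680664062, 0.0),
    (0.7998046875, 0.1000976562, 0.0),
]


def pitch_period(octave, fine_pitch):
    """Final pitch period from the octave and the 4+octave raw bits."""
    return (16 << octave) + fine_pitch - 1


def postfilter_gain(int_gain):
    """G = 3*(int_gain+1)/32 for a three-bit int_gain."""
    return 3 * (int_gain + 1) / 32


def post_filter(x, period, gain, tapset):
    """Apply the recursive comb filter to the samples in x."""
    g0, g1, g2 = TAPSETS[tapset]
    y = []

    def past(idx):
        # Output samples before the start of the buffer are taken as zero.
        return y[idx] if idx >= 0 else 0.0

    for n, xn in enumerate(x):
        t = n - period
        y.append(xn + gain * (g0 * past(t)
                              + g1 * (past(t + 1) + past(t - 1))
                              + g2 * (past(t + 2) + past(t - 2))))
    return y
```

   Feeding a unit impulse through the filter shows the five taps
   centered one pitch period later, scaled by G.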

4.3.7.2. De-emphasis
After the post-filter, the signal is de-emphasized using the inverse of the pre-emphasis filter used in the encoder:

                        1            1
                       ---- = ---------------
                       A(z)                -1
                              1 - alpha_p*z

where alpha_p=0.8500061035.
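
In the time domain, 1/A(z) is the one-pole recursion y(n) = x(n) + alpha_p*y(n-1). A minimal sketch, with a hypothetical function name and a filter state that the caller is assumed to carry from frame to frame:

```python
ALPHA_P = 0.8500061035


def de_emphasis(x, state=0.0):
    """Apply y(n) = x(n) + alpha_p*y(n-1); return (samples, final state)."""
    y = []
    for xn in x:
        state = xn + ALPHA_P * state
        y.append(state)
    return y, state
```

Passing the returned state into the next call keeps the filter continuous across frame boundaries.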

4.4. Packet Loss Concealment (PLC)

Packet Loss Concealment (PLC) is an optional decoder-side feature that SHOULD be included when receiving from an unreliable channel. Because PLC is not part of the bitstream, there are many acceptable ways to implement PLC with different complexity/quality trade-offs. The PLC in the reference implementation depends on the mode of the last packet received. In CELT mode, the PLC finds a periodicity in the decoded signal and repeats the windowed waveform using the pitch offset. The windowed waveform is overlapped in such a way as to preserve the time-domain aliasing cancellation with the previous frame and the next frame. This is implemented in celt_decode_lost() (mdct.c). In SILK mode, the PLC uses LPC extrapolation from the previous frame, implemented in silk_PLC() (PLC.c).

4.4.1. Clock Drift Compensation

Clock drift refers to the gradual desynchronization of two endpoints whose sample clocks run at different frequencies while they are streaming live audio. Differences in clock frequencies are generally attributable to manufacturing variation in the endpoints' clock hardware. For long-lived streams, the time difference between sender and receiver can grow without bound.
   When the sender's clock runs slower than the receiver's, the effect
   is similar to packet loss: too few packets are received.  The
   receiver can distinguish between drift and loss if the transport
   provides packet timestamps.  A receiver for live streams SHOULD
   conceal the effects of drift, and it MAY do so by invoking the PLC.

   When the sender's clock runs faster than the receiver's, too many
   packets will be received.  The receiver MAY respond by skipping any
   packet (i.e., not submitting the packet for decoding).  This is
   likely to produce a less severe artifact than if the frame were
   dropped after decoding.

   A decoder MAY employ a more sophisticated drift compensation method.
   For example, the NetEQ component [GOOGLE-NETEQ] of the Google WebRTC
   codebase [GOOGLE-WEBRTC] compensates for drift by adding or removing
   one period when the signal is highly periodic.  The reference
   implementation of Opus allows a caller to learn whether the current
   frame's signal is highly periodic, and if so what the period is,
   using the OPUS_GET_PITCH() request.

4.5. Configuration Switching

Switching between the Opus coding modes, audio bandwidths, and channel counts requires careful consideration to avoid audible glitches. Switching between any two configurations of the CELT-only mode, any two configurations of the Hybrid mode, or from WB SILK to Hybrid mode does not require any special treatment in the decoder, as the MDCT overlap will smooth the transition. Switching from Hybrid mode to WB SILK requires adding in the final contents of the CELT overlap buffer to the first SILK-only packet. This can be done by decoding a 2.5 ms silence frame with the CELT decoder using the channel count of the SILK-only packet (and any choice of audio bandwidth), which will correctly handle the cases when the channel count changes as well. When changing the channel count for SILK-only or Hybrid packets, the encoder can avoid glitches by smoothly varying the stereo width of the input signal before or after the transition, and it SHOULD do so. However, other transitions between SILK-only packets or between NB or MB SILK and Hybrid packets may cause glitches, because neither the LSF coefficients nor the LTP, LPC, stereo unmixing, and resampler buffers are available at the new sample rate. These switches SHOULD be delayed by the encoder until quiet periods or transients, where the inevitable glitches will be less audible. Additionally, the bitstream MAY include redundant side information ("redundancy"), in the form of additional CELT frames embedded in each of the Opus frames around the transition.
   The other transitions that cannot be easily handled are those where
   the lower frequencies switch between the SILK LP-based model and the
   CELT MDCT model.  However, an encoder may not have an opportunity to
   delay such a switch to a convenient point.  For example, if the
   content switches from speech to music, and the encoder does not have
   enough latency in its analysis to detect this in advance, there may
   be no convenient silence period during which to make the transition
   for quite some time.  To avoid or reduce glitches during these
   problematic mode transitions, and between audio bandwidth changes in
   the SILK-only modes, transitions MAY include redundant side
   information ("redundancy"), in the form of an additional CELT frame
   embedded in the Opus frame.

   A transition between coding the lower frequencies with the LP model
   and the MDCT model or a transition that involves changing the SILK
   bandwidth is only normatively specified when it includes redundancy.
   For those without redundancy, it is RECOMMENDED that the decoder use
   a concealment technique (e.g., make use of a PLC algorithm) to "fill
   in" the gap or discontinuity caused by the mode transition.
   Therefore, PLC MUST NOT be applied during any normative transition,
   i.e., when

   o  A packet includes redundancy for this transition (as described
      below),

   o  The transition is between any WB SILK packet and any Hybrid
      packet, or vice versa,

   o  The transition is between any two Hybrid mode packets, or

   o  The transition is between any two CELT mode packets,

   unless there is actual packet loss.
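
   The list above can be expressed as a small predicate.  This is a
   sketch under stated assumptions: it is meant to be called only when
   the mode or audio bandwidth actually changes between two consecutive
   received packets, and the mode and bandwidth strings are
   hypothetical labels rather than values defined by this document.

```python
def is_normative_transition(prev, cur, has_redundancy):
    """Return True if PLC must not be used to smooth this transition.

    prev and cur are (mode, bandwidth) pairs such as ("SILK", "WB"),
    with mode one of "SILK", "HYBRID", or "CELT".
    """
    if has_redundancy:
        return True
    (prev_mode, prev_bw), (cur_mode, cur_bw) = prev, cur
    if {prev_mode, cur_mode} == {"SILK", "HYBRID"}:
        # Only WB SILK pairs normatively with Hybrid, in either direction.
        silk_bw = prev_bw if prev_mode == "SILK" else cur_bw
        return silk_bw == "WB"
    if prev_mode == cur_mode == "HYBRID":
        return True
    if prev_mode == cur_mode == "CELT":
        return True
    return False
```

   A decoder following the recommendation would fall back to
   concealment only when this predicate is false, or when a packet was
   actually lost.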

4.5.1. Transition Side Information (Redundancy)

Transitions with side information include an extra 5 ms "redundant" CELT frame within the Opus frame. This frame is designed to fill in the gap or discontinuity in the different layers without requiring the decoder to conceal it. For transitions from CELT-only to SILK-only or Hybrid, the redundant frame is inserted in the first Opus frame after the transition (i.e., the first SILK-only or Hybrid frame). For transitions from SILK-only or Hybrid to CELT-only, the redundant frame is inserted in the last Opus frame before the transition (i.e., the last SILK-only or Hybrid frame).
4.5.1.1. Redundancy Flag
The presence of redundancy is signaled in all SILK-only and Hybrid frames, not just those involved in a mode transition. This allows the frames to be decoded correctly even if an adjacent frame is lost.

For SILK-only frames, this signaling is implicit, based on the size of the Opus frame and the number of bits consumed decoding the SILK portion of it. After decoding the SILK portion of the Opus frame, the decoder uses ec_tell() (see Section 4.1.6.1) to check if there are at least 17 bits remaining. If so, then the frame contains redundancy.

For Hybrid frames, this signaling is explicit. After decoding the SILK portion of the Opus frame, the decoder uses ec_tell() (see Section 4.1.6.1) to ensure there are at least 37 bits remaining. If so, it reads a symbol with the PDF in Table 64, and if the value is 1, then the frame contains redundancy. Otherwise (if there were fewer than 37 bits left or the value was 0), the frame does not contain redundancy.

                           +----------------+
                           | PDF            |
                           +----------------+
                           | {4095, 1}/4096 |
                           +----------------+

                      Table 64: Redundancy Flag PDF

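
The implicit and explicit signaling rules can be sketched as one decision function. The names are hypothetical: tell_bits stands for the value returned by ec_tell() after the SILK portion has been decoded, and read_flag is assumed to be a callable that decodes one symbol with the Table 64 PDF.

```python
def frame_has_redundancy(mode, frame_bytes, tell_bits, read_flag):
    """Decide whether a SILK-only or Hybrid frame carries redundancy."""
    bits_left = 8 * frame_bytes - tell_bits
    if mode == "SILK":
        # Implicit: redundancy is present whenever 17 or more bits remain.
        return bits_left >= 17
    if mode == "HYBRID":
        # Explicit: the flag is read only if at least 37 bits remain.
        return bits_left >= 37 and read_flag() == 1
    return False
```

Note that the flag symbol is not consumed at all when fewer than 37 bits remain, which the short-circuit evaluation preserves.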
4.5.1.2. Redundancy Position Flag
Since the current frame is a SILK-only or a Hybrid frame, it must be at least 10 ms. Therefore, it needs an additional flag to indicate whether the redundant 5 ms CELT frame should be mixed into the beginning of the current frame, or the end. After determining that a frame contains redundancy, the decoder reads a 1 bit symbol with a uniform PDF (Table 65).

                              +----------+
                              | PDF      |
                              +----------+
                              | {1, 1}/2 |
                              +----------+

                    Table 65: Redundancy Position PDF

   If the value is zero, this is the first frame in the transition, and
   the redundancy belongs at the end.  If the value is one, this is the
   second frame in the transition, and the redundancy belongs at the
   beginning.  There is no way to specify that an Opus frame contains
   separate redundant CELT frames at both the beginning and the end.

4.5.1.3. Redundancy Size
Unlike the CELT portion of a Hybrid frame, the redundant CELT frame does not use the same entropy coder state as the rest of the Opus frame, because this would break the CELT bit allocation mechanism in Hybrid frames. Thus, a redundant CELT frame always starts and ends on a byte boundary, even in SILK-only frames, where this is not strictly necessary. For SILK-only frames, the number of bytes in the redundant CELT frame is simply the number of whole bytes remaining, which must be at least 2, due to the space check in Section 4.5.1.1. For Hybrid frames, the number of bytes is equal to 2, plus a decoded unsigned integer less than 256 (see Section 4.1.5). This may be more than the number of whole bytes remaining in the Opus frame, in which case the frame is invalid. However, a decoder is not required to ignore the entire frame, as this may be the result of a bit error that desynchronized the range coder. There may still be useful data before the error, and a decoder MAY keep any audio decoded so far instead of invoking the PLC, but it is RECOMMENDED that the decoder stop decoding and discard the rest of the current Opus frame. It would have been possible to avoid these invalid states in the design of Opus by limiting the range of the explicit length decoded from Hybrid frames by the actual number of whole bytes remaining. However, this would require an encoder to determine the rate allocation for the MDCT layer up front, before it began encoding that layer. By allowing some invalid sizes, the encoder is able to defer that decision until much later. When encoding Hybrid frames that do not include redundancy, the encoder must still decide up front if it wishes to use the minimum 37 bits required to trigger encoding of the redundancy flag, but this is a much looser restriction. After determining the size of the redundant CELT frame, the decoder reduces the size of the buffer currently in use by the range coder by that amount. 
The MDCT layer reads any raw bits from the end of this reduced buffer, and all calculations of the number of bits remaining in the buffer must be done using this new, reduced size, rather than the original size of the Opus frame.
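
The size rules can be sketched as follows. The names are hypothetical, and the number of whole bytes remaining is assumed to be the frame size minus the range coder position rounded up to a whole byte; tell_bits is the ec_tell() value after the last symbol read for the redundancy header.

```python
def redundancy_bytes(mode, frame_bytes, tell_bits, decoded_uint=None):
    """Return the redundant CELT frame size in bytes, or None if invalid."""
    whole_bytes_left = frame_bytes - (tell_bits + 7) // 8
    if mode == "SILK":
        # Implicit size: everything that is left.
        return whole_bytes_left
    # Hybrid: 2 plus a decoded unsigned integer less than 256.
    size = 2 + decoded_uint
    if size > whole_bytes_left:
        return None  # the frame is invalid
    return size
```

A caller would then shrink the range coder's buffer by the returned size before decoding the MDCT layer, and treat None as the invalid-frame case described above.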
4.5.1.4. Decoding the Redundancy
The redundant frame is decoded like any other CELT-only frame, with the exception that it does not contain a TOC byte. The frame size is fixed at 5 ms, the channel count is set to that of the current frame, and the audio bandwidth is also set to that of the current frame, with the exception that for MB SILK frames, it is set to WB.

If the redundancy belongs at the beginning (in a CELT-only to SILK-only or Hybrid transition), the final reconstructed output uses the first 2.5 ms of audio output by the decoder for the redundant frame as is, discarding the corresponding output from the SILK-only or Hybrid portion of the frame. The remaining 2.5 ms is cross-lapped with the decoded SILK/Hybrid signal using the CELT's power-complementary MDCT window to ensure a smooth transition.

If the redundancy belongs at the end (in a SILK-only or Hybrid to CELT-only transition), only the second half (2.5 ms) of the audio output by the decoder for the redundant frame is used. In that case, the second half of the redundant frame is cross-lapped with the end of the SILK/Hybrid signal, again using CELT's power-complementary MDCT window to ensure a smooth transition.

4.5.2. State Reset

When a transition occurs, the state of the SILK or the CELT decoder (or both) may need to be reset before decoding a frame in the new mode. This avoids reusing "out of date" memory, which may not have been updated in some time or may not be in a well-defined state due to, e.g., PLC. The SILK state is reset before every SILK-only or Hybrid frame where the previous frame was CELT-only. The CELT state is reset every time the operating mode changes and the new mode is either Hybrid or CELT-only, except when the transition uses redundancy as described above. When switching from SILK-only or Hybrid to CELT-only with redundancy, the CELT state is reset before decoding the redundant CELT frame embedded in the SILK-only or Hybrid frame, but it is not reset before decoding the following CELT-only frame. When switching from CELT-only mode to SILK-only or Hybrid mode with redundancy, the CELT decoder is not reset for decoding the redundant CELT frame.

4.5.3. Summary of Transitions

Figure 18 illustrates all of the normative transitions involving a mode change, an audio bandwidth change, or both. Each one uses an S, H, or C to represent an Opus frame in the corresponding mode. In addition, an R indicates the presence of redundancy in the Opus frame with which it is cross-lapped. Its location in the first or last 5 ms is assumed to correspond to whether it is the frame before or after the transition. Other uses of redundancy are non-normative. Finally, a c indicates the contents of the CELT overlap buffer after the previously decoded frame (i.e., as extracted by decoding a silence frame).
    SILK to SILK with Redundancy:             S -> S -> S
                                                        &
                                                       !R -> R
                                                             &
                                                            ;S -> S -> S

    NB or MB SILK to Hybrid with Redundancy:  S -> S -> S
                                                        &
                                                       !R ->;H -> H -> H

    WB SILK to Hybrid:                        S -> S -> S ->!H -> H -> H

    SILK to CELT with Redundancy:             S -> S -> S
                                                        &
                                                       !R -> C -> C -> C

    Hybrid to NB or MB SILK with Redundancy:  H -> H -> H
                                                        &
                                                       !R -> R
                                                             &
                                                            ;S -> S -> S

    Hybrid to WB SILK:                        H -> H -> H -> c
                                                          \  +
                                                           > S -> S -> S

    Hybrid to CELT with Redundancy:           H -> H -> H
                                                        &
                                                       !R -> C -> C -> C

    CELT to SILK with Redundancy:             C -> C -> C -> R
                                                             &
                                                            ;S -> S -> S

    CELT to Hybrid with Redundancy:           C -> C -> C -> R
                                                             &
                                                            |H -> H -> H

    Key:
    S   SILK-only frame                 ;   SILK decoder reset
    H   Hybrid frame                    |   CELT and SILK decoder resets
    C   CELT-only frame                 !   CELT decoder reset
    c   CELT overlap                    +   Direct mixing
    R   Redundant CELT frame            &   Windowed cross-lap

                     Figure 18: Normative Transitions
   The first two and the last two Opus frames in each example are
   illustrative, i.e., there is no requirement that a stream remain in
   the same configuration for three consecutive frames before or after a
   switch.

   The behavior of transitions without redundancy where PLC is allowed
   is non-normative.  An encoder might still wish to use these
   transitions if, for example, it doesn't want to add the extra bitrate
   required for redundancy or if it makes a decision to switch after it
   has already transmitted the frame that would have had to contain the
   redundancy.  Figure 19 illustrates the recommended cross-lapping and
   decoder resets for these transitions.

    SILK to SILK (audio bandwidth change):    S -> S -> S   ;S -> S -> S

    NB or MB SILK to Hybrid:                  S -> S -> S   |H -> H -> H

    SILK to CELT without Redundancy:          S -> S -> S -> P
                                                             &
                                                            !C -> C -> C

    Hybrid to NB or MB SILK:                  H -> H -> H -> c
                                                             +
                                                            ;S -> S -> S

    Hybrid to CELT without Redundancy:        H -> H -> H -> P
                                                             &
                                                            !C -> C -> C

    CELT to SILK without Redundancy:          C -> C -> C -> P
                                                             &
                                                            ;S -> S -> S

    CELT to Hybrid without Redundancy:        C -> C -> C -> P
                                                             &
                                                            |H -> H -> H

    Key:
    S   SILK-only frame                 ;   SILK decoder reset
    H   Hybrid frame                    |   CELT and SILK decoder resets
    C   CELT-only frame                 !   CELT decoder reset
    c   CELT overlap                    +   Direct mixing
    P   Packet Loss Concealment         &   Windowed cross-lap

             Figure 19: Recommended Non-Normative Transitions

   Encoders SHOULD NOT use other transitions, e.g., those that involve
   redundancy in ways not illustrated in Figure 18.

