RFC 6716

Definition of the Opus Audio Codec

Pages: 326
Proposed Standard
Updated by: 8251

RFC6716 - Page 131

5.  Opus Encoder

   Just like the decoder, the Opus encoder also normally consists of two
   main blocks: the SILK encoder and the CELT encoder.  However, unlike
   the case of the decoder, a valid (though potentially suboptimal) Opus
   encoder is not required to support all modes and may thus only
   include a SILK encoder module or a CELT encoder module.  The output
   bitstream of the Opus encoding contains bits from the SILK and CELT
   encoders, though these are not separable due to the use of a range
   coder.  A block diagram of the encoder is illustrated below.

                        +------------+    +---------+
                        |   Sample   |    |  SILK   |------+
                     +->|    Rate    |--->| Encoder |      V
      +-----------+  |  | Conversion |    |         | +---------+
      | Optional  |  |  +------------+    +---------+ |  Range  |
    ->| High-pass |--+                                | Encoder |---->
      |  Filter   |  |  +--------------+  +---------+ |         | Bit-
      +-----------+  |  |    Delay     |  |  CELT   | +---------+ stream
                     +->| Compensation |->| Encoder |      ^
                        |              |  |         |------+
                        +--------------+  +---------+

                          Figure 20: Opus Encoder

   For a normal encoder where both the SILK and the CELT modules are
   included, an optimal encoder should select which coding mode to use
   at run-time depending on the conditions.  In the reference
   implementation, the frame size is selected by the application, but
   the other configuration parameters (number of channels, bandwidth,
   mode) are automatically selected (unless explicitly overridden by the
   application) depending on the following:

   o  Requested bitrate

   o  Input sampling rate

   o  Type of signal (speech vs. music)

   o  Frame size in use

   The type of signal currently needs to be provided by the application
   (though it can be changed in real-time).  An Opus encoder
   implementation could also do automatic detection, but since Opus is
   an interactive codec, such an implementation would likely have to
   either delay the signal (for non-interactive applications) or delay
   the mode switching decisions (for interactive applications).

   When the encoder is configured for voice over IP applications, the
   input signal is filtered by a high-pass filter to remove the lowest
   part of the spectrum that contains little speech energy and may
   contain background noise.  This is a second order Auto Regressive
   Moving Average (i.e., with poles and zeros) filter with a cut-off
   frequency around 50 Hz.  In the future, a music detector may also be
   used to lower the cut-off frequency when the input signal is detected
   to be music rather than speech.

5.1.  Range Encoder

   The range coder acts as the bit-packer for Opus.  It is used in three
   different ways: to encode

   o  Entropy-coded symbols with a fixed probability model using
      ec_encode() (entenc.c),

   o  Integers from 0 to (2**M - 1) using ec_enc_uint() or ec_enc_bits()
      (entenc.c),

   o  Integers from 0 to (ft - 1) (where ft is not a power of two) using
      ec_enc_uint() (entenc.c).

   The range encoder maintains an internal state vector composed of the
   four-tuple (val, rng, rem, ext) representing the low end of the
   current range, the size of the current range, a single buffered
   output byte, and a count of additional carry-propagating output
   bytes.  Both val and rng are 32-bit unsigned integer values, rem is
   either a byte value less than 255 or the special value -1, and ext
   is an unsigned integer with at least 11 bits.  This state vector is
   initialized at the start of each frame to the value
   (0, 2**31, -1, 0).  After encoding a sequence of symbols, the value
   of rng in the encoder should exactly match the value of rng in the
   decoder after decoding the same sequence of symbols.  This is a
   powerful tool for detecting errors in either an encoder or decoder
   implementation.  The value of val, on the other hand, represents
   different things in the encoder and decoder, and is not expected to
   match.

   The decoder has no analog for rem and ext.  These are used to perform
   carry propagation in the renormalization loop below.  Each iteration
   of this loop produces 9 bits of output, consisting of 8 data bits and
   a carry flag.  The encoder cannot determine the final value of the
   output bytes until it propagates these carry flags.  Therefore, the
   reference implementation buffers a single non-propagating output byte
   (i.e., a byte with a value less than 255) in rem and keeps a count of
   additional
   propagating (i.e., 255) output bytes in ext.  An implementation may
   choose to use any mathematically equivalent scheme to perform carry
   propagation.

5.1.1.  Encoding Symbols

   The main encoding function is ec_encode() (entenc.c), which encodes
   symbol k in the current context using the same three-tuple
   (fl[k], fh[k], ft) as the decoder to describe the range of the symbol
   (see Section 4.1).

   ec_encode() updates the state of the encoder as follows.  If fl[k] is
   greater than zero, then

                                       rng
                     val = val + rng - --- * (ft - fl)
                                       ft

                           rng
                     rng = --- * (fh - fl)
                           ft

   Otherwise, val is unchanged and

                                    rng
                        rng = rng - --- * (ft - fh)
                                    ft

   The divisions here are integer division.
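
   As an illustration only, the update can be sketched in C as follows.
   The type enc_range and the function enc_update() are names made up
   for this sketch and are not taken from entenc.c.  Renormalization
   (Section 5.1.1.1) is left out.  When fl is zero, the sketch removes
   (rng/ft)*(ft - fh) from rng, i.e., the share of the range that belongs
   to the symbols above k, so that the symbol ranges tile the interval
   without gaps.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical minimal state: only the two fields this step touches. */
typedef struct {
    uint32_t val; /* low end of the current range */
    uint32_t rng; /* size of the current range */
} enc_range;

/* Narrow the range to the symbol described by (fl, fh, ft).
 * Renormalization is deliberately not performed here. */
static void enc_update(enc_range *s, uint32_t fl, uint32_t fh, uint32_t ft)
{
    assert(ft > 0 && fl < fh && fh <= ft);
    uint32_t r = s->rng / ft;            /* integer division */
    if (fl > 0) {
        s->val += s->rng - r * (ft - fl);
        s->rng = r * (fh - fl);
    } else {
        /* First symbol: val is unchanged; the symbol also absorbs the
         * remainder of the integer division. */
        s->rng -= r * (ft - fh);
    }
}
```

   With ft = 3 and the initial range 2**31, the first symbol receives
   the two units lost to integer division, and the three symbol ranges
   still cover the whole interval.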

5.1.1.1.  Renormalization

   After this update, the range is normalized using a procedure very
   similar to that of Section 4.1.2.1, implemented by ec_enc_normalize()
   (entenc.c).  The following process is repeated until rng > 2**23.
   First, the top 9 bits of val, (val>>23), are sent to the carry
   buffer, described in Section 5.1.1.2.  Then, the encoder sets

                        val = (val<<8) & 0x7FFFFFFF

                        rng = rng<<8
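
   A minimal sketch of this loop follows.  It is not the code of
   ec_enc_normalize(): the struct enc_norm_state and its out[] array are
   assumptions of the sketch, and the 9-bit values are simply recorded
   instead of being passed to the carry buffer.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_OUT 16

/* Hypothetical state for the sketch. */
typedef struct {
    uint32_t val, rng;
    uint16_t out[MAX_OUT]; /* 9-bit values destined for the carry buffer */
    int      n_out;
} enc_norm_state;

/* Repeat until rng > 2**23: emit the top 9 bits of val, then shift. */
static void enc_normalize(enc_norm_state *s)
{
    assert(s->rng > 0);
    while (s->rng <= ((uint32_t)1 << 23)) {
        assert(s->n_out < MAX_OUT);
        s->out[s->n_out++] = (uint16_t)(s->val >> 23); /* carry + 8 bits */
        s->val = (s->val << 8) & 0x7FFFFFFFu;
        s->rng <<= 8;
    }
}
```

   Note that the loop also runs when rng is exactly 2**23, since the
   exit condition is a strict inequality.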

5.1.1.2.  Carry Propagation and Output Buffering

   The function ec_enc_carry_out() (entenc.c) implements carry
   propagation and output buffering.  It takes, as input, a 9-bit
   unsigned value, c, consisting of 8 data bits and an additional carry
   bit.  If c is equal to the value 255, then ext is simply incremented,
   and no other state updates are performed.  Otherwise, let b = (c>>8)
   be the carry bit.  Then,

   o  If the buffered byte rem contains a value other than -1, the
      encoder outputs the byte (rem + b).  Otherwise, if rem is -1, no
      byte is output.

   o  If ext is non-zero, then the encoder outputs ext bytes -- all with
      a value of 0 if b is set, or 255 if b is unset -- and sets ext to
      0.

   o  rem is set to the 8 data bits:

                               rem = c & 255
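
   The three steps can be sketched as follows.  The names carry_state,
   put_byte(), and carry_out() and the fixed-size buffer are assumptions
   made for this sketch; the reference implementation writes into the
   frame's output buffer instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical state for the sketch. */
typedef struct {
    int      rem;     /* buffered byte, or -1 if nothing is buffered */
    uint32_t ext;     /* number of pending bytes with value 255 */
    uint8_t  buf[32]; /* output bytes (fixed size for the sketch) */
    size_t   len;
} carry_state;

static void put_byte(carry_state *s, unsigned v)
{
    assert(s->len < sizeof s->buf);
    s->buf[s->len++] = (uint8_t)v;
}

/* c is a 9-bit value: 8 data bits plus a carry bit in bit 8. */
static void carry_out(carry_state *s, unsigned c)
{
    assert(c < 512);
    if (c == 255) {          /* a later carry could still change it */
        s->ext++;
        return;
    }
    unsigned b = c >> 8;     /* carry bit */
    if (s->rem >= 0)
        put_byte(s, (unsigned)s->rem + b);   /* rem < 255, cannot wrap */
    for (; s->ext > 0; s->ext--)
        put_byte(s, b ? 0x00u : 0xFFu);      /* 255 + carry wraps to 0 */
    s->rem = (int)(c & 255);
}
```

   Feeding the values 0x0AB, 0x0FF, and 0x101 in that order writes the
   bytes 0xAC and 0x00 and leaves 0x01 buffered in rem: the carry in the
   third value ripples through the pending 255 into the first byte.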

5.1.2.  Alternate Encoding Methods

   The reference implementation uses three additional encoding methods
   that are exactly equivalent to the above, but make assumptions and
   simplifications that allow for a more efficient implementation.

5.1.2.1.  ec_encode_bin()

   The first is ec_encode_bin() (entenc.c), defined using the parameter
   ftb instead of ft.  It is mathematically equivalent to calling
   ec_encode() with ft = (1<<ftb), but it avoids using division.
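
   The equivalence can be checked with a small sketch.  Both functions
   below are illustrative (the names update_div() and update_bin() are
   not from the reference implementation) and omit renormalization; the
   only difference between them is that the division by ft becomes a
   right shift by ftb.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint32_t val, rng; } bin_state; /* hypothetical */

/* General update: ft is passed explicitly and a division is used. */
static void update_div(bin_state *s, uint32_t fl, uint32_t fh, uint32_t ft)
{
    uint32_t r = s->rng / ft;
    if (fl > 0) {
        s->val += s->rng - r * (ft - fl);
        s->rng = r * (fh - fl);
    } else {
        s->rng -= r * (ft - fh);
    }
}

/* Same update for ft = 1 << ftb: the division is replaced by a shift. */
static void update_bin(bin_state *s, uint32_t fl, uint32_t fh, unsigned ftb)
{
    assert(ftb < 32);
    uint32_t ft = (uint32_t)1 << ftb;
    uint32_t r  = s->rng >> ftb;
    if (fl > 0) {
        s->val += s->rng - r * (ft - fl);
        s->rng = r * (fh - fl);
    } else {
        s->rng -= r * (ft - fh);
    }
}
```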

5.1.2.2.  ec_enc_bit_logp()

   The next is ec_enc_bit_logp() (entenc.c), which encodes a single
   binary symbol.  The context is described by a single parameter, logp,
   which is the absolute value of the base-2 logarithm of the
   probability of a "1".  It is mathematically equivalent to calling
   ec_encode() with the three-tuple (fl[k] = 0, fh[k] = (1<<logp) - 1,
   ft = (1<<logp)) if k is 0 and with (fl[k] = (1<<logp) - 1,
   fh[k] = ft = (1<<logp)) if k is 1.  The implementation requires no
   multiplications or divisions.
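
   A sketch of the update, using only a shift, an addition, and a
   subtraction, is given below.  The names bit_state and
   update_bit_logp() are assumptions of the sketch, and renormalization
   is again omitted.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint32_t val, rng; } bit_state; /* hypothetical */

/* Encode one bit whose probability of being "1" is 2**(-logp). */
static void update_bit_logp(bit_state *s, int bit, unsigned logp)
{
    assert(logp >= 1 && logp < 32);
    uint32_t one  = s->rng >> logp;  /* share of the range for a "1" */
    uint32_t zero = s->rng - one;    /* everything else goes to "0" */
    if (bit) {
        s->val += zero;              /* "1" sits at the top of the range */
        s->rng = one;
    } else {
        s->rng = zero;               /* "0" keeps the low end, val unchanged */
    }
}
```

   For logp = 3 and the initial range 2**31, a "0" leaves a range of
   0x70000000 and a "1" leaves a range of 0x10000000 starting at
   0x70000000, which is what the two three-tuples above produce.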

5.1.2.3.  ec_enc_icdf()

   The last is ec_enc_icdf() (entenc.c), which encodes a single symbol
   with a table-based context of up to 8 bits.  This uses the
   same icdf table as ec_dec_icdf() from Section 4.1.3.3.  The function
   is mathematically equivalent to calling ec_encode() with
   fl[k] = (1<<ftb) - icdf[k-1] (or 0 if k == 0),
   fh[k] = (1<<ftb) - icdf[k], and ft = (1<<ftb).  This only saves a few
   arithmetic operations over ec_encode_bin(), but it allows the encoder
   to use the same icdf tables as the decoder.
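
   The mapping from an icdf table to the three-tuple can be sketched as
   follows.  The type sym_range and the function icdf_to_range() are
   hypothetical helpers, not part of the reference implementation.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint32_t fl, fh, ft; } sym_range; /* hypothetical */

/* Derive (fl[k], fh[k], ft) for symbol k from an inverse CDF table. */
static sym_range icdf_to_range(const uint8_t *icdf, unsigned ftb, unsigned k)
{
    assert(ftb <= 8);
    sym_range s;
    s.ft = (uint32_t)1 << ftb;
    s.fl = (k == 0) ? 0 : s.ft - icdf[k - 1];
    s.fh = s.ft - icdf[k];
    return s;
}
```

   For example, with the made-up table {192, 64, 0} and ftb = 8, the
   three symbols map to (0, 64, 256), (64, 192, 256), and
   (192, 256, 256).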

5.1.3.  Encoding Raw Bits

   The raw bits used by the CELT layer are packed at the end of the
   buffer using ec_enc_bits() (entenc.c).  Because the raw bits may
   continue into the last byte output by the range coder if there is
   room in the low-order bits, the encoder must be prepared to merge
   these values into a single byte.  The procedure in Section 5.1.5 does
   this in a way that ensures both the range coded data and the raw bits
   can be decoded successfully.

5.1.4.  Encoding Uniformly Distributed Integers

   The function ec_enc_uint() (entenc.c) encodes one of ft equiprobable
   symbols in the range 0 to (ft - 1), inclusive, each with a frequency
   of 1, where ft may be as large as (2**32 - 1).  Like the decoder (see
   Section 4.1.5), it splits up the value into a range coded symbol
   representing up to 8 of the high bits, and, if necessary, raw bits
   representing the remainder of the value.

   ec_enc_uint() takes a two-tuple (t, ft), where t is the unsigned
   integer to be encoded, 0 <= t < ft, and ft is not necessarily a power
   of two.  Let ftb = ilog(ft - 1), i.e., the number of bits required to
   store (ft - 1) in two's complement notation.  If ftb is 8 or less,
   then t is encoded directly using ec_encode() with the three-tuple (t,
   t + 1, ft).

   If ftb is greater than 8, then the top 8 bits of t are encoded using
   the three-tuple (t>>(ftb - 8), (t>>(ftb - 8)) + 1,
   ((ft - 1)>>(ftb - 8)) + 1), and the remaining bits,
   (t & ((1<<(ftb - 8)) - 1)), are encoded as raw bits with
   ec_enc_bits().
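
   The split can be sketched as follows.  The helper names ilog(),
   uint_split, and split_uint() are assumptions of this sketch; it only
   computes what would be passed to ec_encode() and ec_enc_bits() and
   does not perform any encoding.

```c
#include <assert.h>
#include <stdint.h>

/* ilog(x): number of bits needed to store x, with ilog(0) = 0. */
static unsigned ilog(uint32_t x)
{
    unsigned n = 0;
    while (x) { n++; x >>= 1; }
    return n;
}

/* Hypothetical result type for the sketch. */
typedef struct {
    uint32_t fl, fh, ft; /* three-tuple for the range coded part */
    uint32_t raw;        /* value of the raw bits (0 if there are none) */
    unsigned raw_bits;   /* number of raw bits (0 if there are none) */
} uint_split;

static uint_split split_uint(uint32_t t, uint32_t ft)
{
    assert(ft > 1 && t < ft);
    uint_split s;
    unsigned ftb = ilog(ft - 1);
    if (ftb <= 8) {
        /* Small alphabet: t is range coded directly. */
        s.fl = t; s.fh = t + 1; s.ft = ft;
        s.raw = 0; s.raw_bits = 0;
    } else {
        /* Top 8 bits are range coded, the rest are raw bits. */
        unsigned shift = ftb - 8;
        s.fl = t >> shift;
        s.fh = s.fl + 1;
        s.ft = ((ft - 1) >> shift) + 1;
        s.raw = t & (((uint32_t)1 << shift) - 1);
        s.raw_bits = shift;
    }
    return s;
}
```

   For t = 777 and ft = 1000, ftb is 10, so the range coded part is the
   three-tuple (194, 195, 250) and the two raw bits hold the value 1.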

5.1.5.  Finalizing the Stream

   After all symbols are encoded, the stream must be finalized by
   outputting a value inside the current range.  Let end be the unsigned
   integer in the interval [val, val + rng) with the largest number of
   trailing zero bits, b, such that (end + (1<<b) - 1) is also in the
   interval [val, val + rng).  This choice of end allows the maximum
   number of trailing bits to be set to arbitrary values while still
   ensuring the range coded part of the buffer can be decoded correctly.

   Then, while end is not zero, the top 9 bits of end, i.e., (end>>23),
   are passed to the carry buffer in accordance with the procedure in
   Section 5.1.1.2, and end is updated via

                        end = (end<<8) & 0x7FFFFFFF

   Finally, if the buffered output byte, rem, is neither zero nor the
   special value -1, or the carry count, ext, is greater than zero, then
   9 zero bits are sent to the carry buffer to flush it to the output
   buffer.  When outputting the final byte from the range coder, if it
   would overlap any raw bits already packed into the end of the output
   buffer, they should be ORed into the same byte.  The bit allocation
   routines in the CELT layer should ensure that this can be done
   without corrupting the range coder data so long as end is chosen as
   described above.  If there is any space between the end of the range
   coder data and the end of the raw bits, it is padded with zero bits.
   This entire process is implemented by ec_enc_done() (entenc.c).
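
   One way to select end is sketched below.  It is modeled on the
   approach of trying the largest aligned block first and halving it
   once if it does not fit; the function names ilog32() and choose_end()
   are assumptions of the sketch, and flushing through the carry buffer
   and merging with raw bits are not shown.

```c
#include <assert.h>
#include <stdint.h>

static unsigned ilog32(uint32_t x)
{
    unsigned n = 0;
    while (x) { n++; x >>= 1; }
    return n;
}

/* Returns end; *mask receives (1<<b) - 1, the trailing bits that may
 * take arbitrary values without leaving [val, val + rng). */
static uint32_t choose_end(uint32_t val, uint32_t rng, uint32_t *mask)
{
    assert(rng > ((uint32_t)1 << 23));
    assert((uint64_t)val + rng <= 0xFFFFFFFFu);
    unsigned l   = 32 - ilog32(rng);
    uint32_t msk = 0x7FFFFFFFu >> l;        /* 2**(ilog(rng) - 1) - 1 */
    uint32_t end = (val + msk) & ~msk;      /* round val up to alignment */
    if ((end | msk) >= val + rng) {
        /* The block does not fit: use one fewer trailing bit. */
        msk >>= 1;
        end = (val + msk) & ~msk;
    }
    *mask = msk;
    return end;
}
```

   For val = 0x12345678 and rng = 0x00800001, the first attempt
   (23 trailing bits) overshoots the interval, so the sketch settles on
   end = 0x12400000 with 22 free trailing bits.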

5.1.6.  Current Bit Usage

   The bit allocation routines in Opus need to be able to determine a
   conservative upper bound on the number of bits that have been used to
   encode the current frame thus far.  This drives allocation decisions
   and ensures that the range coder and raw bits will not overflow the
   output buffer.  This is computed in the reference implementation to
   whole-bit precision by the function ec_tell() (entcode.h) and to
   fractional 1/8th bit precision by the function ec_tell_frac()
   (entcode.c).  Like all operations in the range coder, it must be
   implemented in a bit-exact manner, and it must produce exactly the
   same value returned by the same functions in the decoder after
   decoding the same symbols.

5.2.  SILK Encoder

   In many respects, the SILK encoder mirrors the SILK decoder described
   in Section 4.2.  Details such as the quantization and range coder
   tables can be found there, while this section describes the high-
   level design choices that were made.  The diagram below shows the
   basic modules of the SILK encoder.

               +----------+    +--------+    +---------+
               |  Sample  |    | Stereo |    |  SILK   |
        ------>|   Rate   |--->| Mixing |--->|  Core   |---------->
        Input  |Conversion|    |        |    | Encoder |  Bitstream
               +----------+    +--------+    +---------+

                          Figure 21: SILK Encoder

5.2.1.  Sample Rate Conversion

   The input signal's sampling rate is adjusted by a sample rate
   conversion module so that it matches the SILK internal sampling rate.
   The input to the sample rate converter is delayed by a number of
   samples depending on the sample rate ratio, such that the overall
   delay is constant for all input and output sample rates.

5.2.2.  Stereo Mixing

   The stereo mixer is only used for stereo input signals.  It converts
   a stereo left-right signal into an adaptive mid-side representation.
   The first step is to compute non-adaptive mid-side signals as half
   the sum and difference between left and right signals.  The side
   signal is then minimized in energy by subtracting a prediction of it
   based on the mid signal.  This prediction works well when the left
   and right signals exhibit linear dependency, for instance, for an
   amplitude-panned input signal.  As in the decoder, the prediction
   coefficients are linearly interpolated during the first 8 ms of the
   frame.  The mid signal is always encoded, whereas the residual side
   signal is only encoded if it has sufficient energy compared to the
   mid signal's energy.  If it does not, the "mid_only_flag" is set and
   the side signal is not encoded.

   The predictor coefficients are coded regardless of whether the side
   signal is encoded.  For each frame, two predictor coefficients are
   computed, one that predicts between low-passed mid and side channels,
   and one that predicts between high-passed mid and side channels.  The
   low-pass filter is a simple three-tap filter and creates a delay of
   one sample.  The high-pass filtered signal is the difference between
   the mid signal delayed by one sample and the low-passed signal.
   Instead of explicitly computing the high-passed signal, it is
   computationally more efficient to transform the prediction
   coefficients before applying them to the filtered mid signal, as
   follows:

               pred(n) = LP(n) * w0 + HP(n) * w1
                       = LP(n) * w0 + (mid(n-1) - LP(n)) * w1
                       = LP(n) * (w0 - w1) + mid(n-1) * w1


   where w0 and w1 are the low-pass and high-pass prediction
   coefficients, mid(n-1) is the mid signal delayed by one sample, LP(n)
   and HP(n) are the low-passed and high-passed signals and pred(n) is
   the prediction signal that is subtracted from the side signal.
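
   The identity can be verified with a short floating-point sketch.  The
   low-pass taps [0.25, 0.5, 0.25] are an assumption (the text only
   states that the filter has three taps and a delay of one sample), and
   all function names are hypothetical.  The reference implementation
   uses fixed-point arithmetic, which this sketch does not attempt to
   reproduce.

```c
#include <assert.h>
#include <stddef.h>

/* Non-adaptive mid/side: half the sum and half the difference. */
static void to_mid_side(double l, double r, double *mid, double *side)
{
    *mid  = 0.5 * (l + r);
    *side = 0.5 * (l - r);
}

/* Assumed three-tap low-pass with a delay of one sample. */
static double lp3(const double *mid, size_t n)
{
    assert(n >= 2);
    return 0.25 * mid[n] + 0.5 * mid[n - 1] + 0.25 * mid[n - 2];
}

/* Direct form: computes the high-passed signal explicitly. */
static double pred_direct(const double *mid, size_t n, double w0, double w1)
{
    double lp = lp3(mid, n);
    double hp = mid[n - 1] - lp;
    return lp * w0 + hp * w1;
}

/* Transformed form: no explicit high-passed signal is needed. */
static double pred_fast(const double *mid, size_t n, double w0, double w1)
{
    return lp3(mid, n) * (w0 - w1) + mid[n - 1] * w1;
}
```

   Both forms return the same prediction; the second saves one
   subtraction per sample because (w0 - w1) can be computed once per
   frame.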

5.2.3.  SILK Core Encoder

   What follows is a description of the core encoder and its components.
   For simplicity, the core encoder is referred to simply as the encoder
   in the remainder of this section.  An overview of the encoder is
   given in Figure 22.

                                                                +---+
                             +--------------------------------->|   |
        +---------+          |      +---------+                 |   |
        |Voice    |          |      |LTP      |12               |   |
    +-->|Activity |--+       +----->|Scaling  |-----------+---->|   |
    |   |Detection|3 |       |      |Control  |<--+       |     |   |
    |   +---------+  |       |      +---------+   |       |     |   |
    |                |       |      +---------+   |       |     |   |
    |                |       |      |Gains    |   |       |     |   |
    |                |       |  +-->|Processor|---|---+---|---->| R |
    |                |       |  |   |         |11 |   |   |     | a |
    |               \/       |  |   +---------+   |   |   |     | n |
    |          +---------+   |  |   +---------+   |   |   |     | g |
    |          |Pitch    |   |  |   |LSF      |   |   |   |     | e |
    |       +->|Analysis |---+  |   |Quantizer|---|---|---|---->|   |
    |       |  |         |4  |  |   |         |8  |   |   |     | E |-->
    |       |  +---------+   |  |   +---------+   |   |   |     | n | 2
    |       |                |  |    9/\  10|     |   |   |     | c |
    |       |                |  |     |    \/     |   |   |     | o |
    |       |  +---------+   |  |   +----------+  |   |   |     | d |
    |       |  |Noise    |   +--|-->|Prediction|--+---|---|---->| e |
    |       +->|Shaping  |---|--+   |Analysis  |7 |   |   |     | r |
    |       |  |Analysis |5  |  |   |          |  |   |   |     |   |
    |       |  +---------+   |  |   +----------+  |   |   |     |   |
    |       |                |  |        /\       |   |   |     |   |
    |       |     +----------|--|--------+        |   |   |     |   |
    |       |     |         \/  \/               \/  \/  \/     |   |
    |       |     |      +----------+          +------------+   |   |
    |       |     |      |          |          |Noise       |   |   |
   -+-------+-----+----->|Pre-filter|--------->|Shaping     |-->|   |
   1                     |          | 6        |Quantization|13 |   |
                         +----------+          +------------+   +---+

   1:  Input speech signal
   2:  Range encoded bitstream
   3:  Voice activity estimate
   4:  Pitch lags (per 5 ms) and voicing decision (per 20 ms)
   5:  Noise shaping quantization coefficients
     - Short-term synthesis and analysis
       noise shaping coefficients (per 5 ms)
     - Long-term synthesis and analysis noise
       shaping coefficients (per 5 ms and for voiced speech only)
     - Noise shaping tilt (per 5 ms)
     - Quantizer gain/step size (per 5 ms)
   6:  Input signal filtered with analysis noise shaping filters
   7:  Short- and Long-Term Prediction coefficients
       LTP (per 5 ms) and LPC (per 20 ms)
   8:  LSF quantization indices
   9:  LSF coefficients
   10: Quantized LSF coefficients
   11: Processed gains, and synthesis noise shape coefficients
   12: LTP state scaling coefficient.  Controlling error
       propagation / prediction gain trade-off
   13: Quantized signal

                       Figure 22: SILK Core Encoder

5.2.3.1.  Voice Activity Detection

   The input signal is processed by a Voice Activity Detection (VAD)
   algorithm to produce a measure of voice activity, spectral tilt, and
   signal-to-noise estimates for each frame.  The VAD uses a sequence of
   half-band filterbanks to split the signal into four subbands:
   0...Fs/16, Fs/16...Fs/8, Fs/8...Fs/4, and Fs/4...Fs/2, where Fs is
   the sampling frequency (8, 12, 16, or 24 kHz).  The lowest subband,
   0...Fs/16, is high-pass filtered with a first-order moving
   average (MA) filter (with transfer function H(z) = 1-z**(-1)) to
   reduce the energy at the lowest frequencies.  For each frame, the
   signal energy per subband is computed.  In each subband, a noise
   level estimator tracks the background noise level and a Signal-to-
   Noise Ratio (SNR) value is computed as the logarithm of the ratio of
   energy-to-noise level.  Using these intermediate variables, the
   following parameters are calculated for use in other SILK modules:

   o  Average SNR.  The average of the subband SNR values.

   o  Smoothed subband SNRs.  Temporally smoothed subband SNR values.

   o  Speech activity level.  Based on the average SNR and a weighted
      average of the subband energies.

   o  Spectral tilt.  A weighted average of the subband SNRs, with
      positive weights for the low subbands and negative weights for the
      high subbands.
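
   As a small illustration of the first step only, the first-order MA
   filter H(z) = 1-z**(-1) and the energy of its output can be sketched
   as below.  The function name and the floating-point arithmetic are
   assumptions of the sketch; noise tracking, the SNR computation, and
   the weights of the averages are not specified here and are not shown.

```c
#include <assert.h>
#include <stddef.h>

/* Applies y[n] = x[n] - x[n-1] and returns the energy of y.
 * *prev holds x[-1] on entry and the last input sample on return,
 * so the filter state carries over from frame to frame. */
static double ma_highpass_energy(const double *x, size_t len, double *prev)
{
    double energy = 0.0;
    for (size_t n = 0; n < len; n++) {
        double y = x[n] - *prev;
        *prev = x[n];
        energy += y * y;
    }
    return energy;
}
```

   A constant (DC) input yields zero energy, while an input alternating
   between +1 and -1 has each output sample doubled in amplitude.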

5.2.3.2.  Pitch Analysis

   The input signal is processed by the open loop pitch estimator shown
   in Figure 23.

                                    +--------+  +----------+
                                    |2 x Down|  |Time-     |
                                 +->|sampling|->|Correlator|     |
                                 |  |        |  |          |     |4
                                 |  +--------+  +----------+    \/
                                 |                    | 2    +-------+
                                 |                    |  +-->|Speech |5
       +---------+    +--------+ |                   \/  |   |Type   |->
       |LPC      |    |Down    | |              +----------+ |       |
    +->|Analysis | +->|sample  |-+------------->|Time-     | +-------+
    |  |         | |  |to 8 kHz|                |Correlator|----------->
    |  +---------+ |  +--------+                |__________|          6
    |       |      |                                  |3
    |      \/      |                                 \/
    |  +---------+ |                            +----------+
    |  |Whitening| |                            |Time-     |
   -+->|Filter   |-+--------------------------->|Correlator|----------->
   1   |         |                              |          |          7
       +---------+                              +----------+

   1: Input signal
   2: Lag candidates from stage 1
   3: Lag candidates from stage 2
   4: Correlation threshold
   5: Voiced/unvoiced flag
   6: Pitch correlation
   7: Pitch lags

              Figure 23: Block Diagram of the Pitch Estimator

   The pitch analysis finds a binary voiced/unvoiced classification,
   and, for frames classified as voiced, four pitch lags per frame --
   one for each 5 ms subframe -- and a pitch correlation indicating the
   periodicity of the signal.  The input is first whitened using a
   Linear Prediction (LP) whitening filter, where the coefficients are
   computed through standard Linear Predictive Coding (LPC) analysis.
   The order of the whitening filter is 16 for best results, but is
   reduced to 12 for medium complexity and 8 for low complexity modes.
   The whitened signal is analyzed to find pitch lags for which the time
   correlation is high.  The analysis consists of three stages for
   reducing the complexity:

   o  In the first stage, the whitened signal is downsampled to 4 kHz
      (from 8 kHz), and the current frame is correlated to a signal
      delayed by a range of lags, starting from a shortest lag
      corresponding to 500 Hz, to a longest lag corresponding to 56 Hz.

   o  The second stage operates on an 8 kHz signal (downsampled from 12,
      16, or 24 kHz) and measures time correlations only near the lags
      corresponding to those that had sufficiently high correlations in
      the first stage.  The resulting correlations are adjusted for a
      small bias towards short lags to avoid ending up with a multiple
      of the true pitch lag.  The highest adjusted correlation is
      compared to a threshold depending on:

      *  Whether the previous frame was classified as voiced.

      *  The speech activity level.

      *  The spectral tilt.

      If the threshold is exceeded, the current frame is classified as
      voiced and the lag with the highest adjusted correlation is stored
      for a final pitch analysis of the highest precision in the third
      stage.

   o  The last stage operates directly on the whitened input signal to
      compute time correlations for each of the four subframes
      independently in a narrow range around the lag with highest
      correlation from the second stage.
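
   The first stage can be sketched as a search over lags for the highest
   normalized correlation.  This is a simplified illustration, not the
   reference pitch estimator: the function names are hypothetical, the
   squared normalized correlation is used so that no square root is
   needed, and the short-lag bias and thresholds of the later stages are
   not modeled.  At 4 kHz, the stated limits of 500 Hz and 56 Hz
   correspond to lags of 8 and 71 samples.

```c
#include <assert.h>
#include <stddef.h>

/* Lag in samples for a given pitch frequency, rounded down. */
static int lag_for_freq(int fs_hz, int f0_hz)
{
    return fs_hz / f0_hz;
}

/* Searches lags min_lag..max_lag for the frame x[start..start+len-1]
 * and returns the first lag with the highest squared normalized
 * correlation (or -1 if no lag has a positive correlation). */
static int best_lag(const double *x, size_t start, size_t len,
                    int min_lag, int max_lag, double *score)
{
    assert(min_lag > 0 && min_lag <= max_lag);
    assert((size_t)max_lag <= start);     /* enough history available */
    int best = -1;
    double best_score = 0.0;
    for (int lag = min_lag; lag <= max_lag; lag++) {
        double xc = 0.0, e0 = 0.0, e1 = 0.0;
        for (size_t n = start; n < start + len; n++) {
            double cur = x[n], old = x[n - (size_t)lag];
            xc += cur * old;
            e0 += cur * cur;
            e1 += old * old;
        }
        if (xc <= 0.0 || e0 <= 0.0 || e1 <= 0.0)
            continue;                     /* ignore negative correlation */
        double s = (xc * xc) / (e0 * e1);
        if (s > best_score) { best_score = s; best = lag; }
    }
    *score = best_score;
    return best;
}
```

   Because ties keep the earlier (shorter) lag, a perfectly periodic
   input resolves to its fundamental period and not to a multiple of it.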

5.2.3.3.  Noise Shaping Analysis

   The noise shaping analysis finds gains and filter coefficients used
   in the pre-filter and noise shaping quantizer.  These parameters are
   chosen such that they will fulfill several requirements:

   o  Balancing quantization noise and bitrate.  The quantization gains
      determine the step size between reconstruction levels of the
      excitation signal.  Therefore, increasing the quantization gain
      amplifies quantization noise, but also reduces the bitrate by
      lowering the entropy of the quantization indices.

   o  Spectral shaping of the quantization noise; the noise shaping
      quantizer is capable of reducing quantization noise in some parts
      of the spectrum at the cost of increased noise in other parts
      without substantially changing the bitrate.  By shaping the noise
      such that it follows the signal spectrum, it becomes less audible.
      In practice, best results are obtained by making the shape of the
      noise spectrum slightly flatter than the signal spectrum.

   o  De-emphasizing spectral valleys; by using different coefficients
      in the analysis and synthesis part of the pre-filter and noise
      shaping quantizer, the levels of the spectral valleys can be
      decreased relative to the levels of the spectral peaks such as
      speech formants and harmonics.  This reduces the entropy of the
      signal, which is the difference between the coded signal and the
      quantization noise, thus lowering the bitrate.

   o  Matching the levels of the decoded speech formants to the levels
      of the original speech formants; an adjustment gain and a first
      order tilt coefficient are computed to compensate for the effect
      of the noise shaping quantization on the level and spectral tilt.

                 / \   ___
                  |   // \\
                  |  //   \\     ____
                  |_//     \\___//  \\         ____
                  | /  ___  \   /    \\       //  \\
                P |/  /   \  \_/      \\_____//    \\
                o |  /     \     ____  \     /      \\
                w | /       \___/    \  \___/  ____  \\___ 1
                e |/                  \       /    \  \
                r |                    \_____/      \  \__ 2
                  |                                  \
                  |                                   \___ 3
                  |
                  +---------------------------------------->
                                   Frequency

               1: Input signal spectrum
               2: De-emphasized and level matched spectrum
               3: Quantization noise spectrum

      Figure 24: Noise Shaping and Spectral De-emphasis Illustration

   Figure 24 shows an example of an input signal spectrum (1).  After
   de-emphasis and level matching, the spectrum has deeper valleys (2).
   The quantization noise spectrum (3) more or less follows the input
   signal spectrum, while having slightly less pronounced peaks.  The
   entropy, which provides a lower bound on the bitrate for encoding the
   excitation signal, is proportional to the area between the de-
   emphasized spectrum (2) and the quantization noise spectrum (3).
   Without de-emphasis, the entropy is proportional to the area between
   input spectrum (1) and quantization noise (3) -- clearly higher.

   The transformation from input signal to de-emphasized signal can be
   described as a filtering operation with a filter

                                             -1    Wana(z)
                  H(z) = G * ( 1 - c_tilt * z  ) * -------
                                                   Wsyn(z)

   having an adjustment gain G, a first order tilt adjustment filter
   with tilt coefficient c_tilt, and where

                       16                           d
                       __            -k        -L  __            -k
        Wana(z) = (1 - \ a_ana(k) * z  )*(1 - z  * \ b_ana(k) * z  )
                       /_                          /_
                       k=1                         k=-d

   is the analysis part of the de-emphasis filter, consisting of the
   short-term shaping filter with coefficients a_ana(k) and the long-
   term shaping filter with coefficients b_ana(k) and pitch lag L.  The
   parameter d determines the number of long-term shaping filter taps.

   Similarly, but without the tilt adjustment, the synthesis part can be
   written as

                       16                           d
                       __            -k        -L  __            -k
        Wsyn(z) = (1 - \ a_syn(k) * z  )*(1 - z  * \ b_syn(k) * z  )
                       /_                          /_
                       k=1                         k=-d

   All noise shaping parameters are computed and applied per subframe of
   5 ms.  First, an LPC analysis is performed on a windowed signal block
   of 15 ms.  The signal block has a look-ahead of 5 ms relative to the
   current subframe, and the window is an asymmetric sine window.  The
   LPC analysis is done with the autocorrelation method, with an order
   between 8 (in lowest-complexity mode) and 16 (for best quality).

   Optionally, the LPC analysis and noise shaping filters are warped by
   replacing the delay elements by first-order allpass filters.  This
   increases the frequency resolution at low frequencies and reduces it
   at high ones, which better matches the human auditory system and
   improves quality.  The warped analysis and filtering comes at a cost
   in complexity and is therefore only done in higher complexity modes.

   The quantization gain is found by taking the square root of the
   residual energy from the LPC analysis and multiplying it by a value
   inversely proportional to the coding quality control parameter and
   the pitch correlation.

   Next, the two sets of short-term noise shaping coefficients a_ana(k)
   and a_syn(k) are obtained by applying different amounts of bandwidth
   expansion to the coefficients found in the LPC analysis.  This
   bandwidth expansion moves the roots of the LPC polynomial towards the
   origin, using the formulas

                                              k
                         a_ana(k) = a(k)*g_ana   and

                                              k
                         a_syn(k) = a(k)*g_syn

   where a(k) is the k'th LPC coefficient, and the bandwidth expansion
   factors g_ana and g_syn are calculated as

                         g_ana = 0.95 - 0.01*C  and

                         g_syn = 0.95 + 0.01*C

   where C is the coding quality control parameter between 0 and 1.
   Applying more bandwidth expansion to the analysis part than to the
   synthesis part gives the desired de-emphasis of spectral valleys in
   between formants.
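
   The bandwidth expansion can be sketched in floating point as follows.
   The function names are hypothetical, and the array index k-1 holds
   the coefficient for k in the formulas above.

```c
#include <assert.h>
#include <stddef.h>

/* out[k-1] = a[k-1] * g**k for k = 1..order. */
static void bw_expand(const double *a, double *out, size_t order, double g)
{
    double gk = g;                 /* running power g**k */
    for (size_t k = 0; k < order; k++) {
        out[k] = a[k] * gk;
        gk *= g;
    }
}

/* Derives both short-term shaping coefficient sets from the LPC
 * coefficients a[] and the coding quality control parameter C. */
static void shaping_coefs(const double *a, size_t order, double C,
                          double *a_ana, double *a_syn)
{
    assert(C >= 0.0 && C <= 1.0);
    bw_expand(a, a_ana, order, 0.95 - 0.01 * C);
    bw_expand(a, a_syn, order, 0.95 + 0.01 * C);
}
```

   With C = 1, the factors are g_ana = 0.94 and g_syn = 0.96, so the
   second coefficient is scaled by 0.8836 on the analysis side and by
   0.9216 on the synthesis side.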

   The long-term shaping is applied only during voiced frames.  It uses
   a three-tap filter, described by

                   b_ana = F_ana * [0.25, 0.5, 0.25]  and

                   b_syn = F_syn * [0.25, 0.5, 0.25].

   For unvoiced frames, these coefficients are set to 0.  The
   multiplication factors F_ana and F_syn are chosen between 0 and 1,
   depending on the coding quality control parameter, as well as the
   calculated pitch correlation and smoothed subband SNR of the lowest
   subband.  By having F_ana less than F_syn, the pitch harmonics are
   emphasized relative to the valleys in between the harmonics.

   For unvoiced frames, the tilt coefficient c_tilt is chosen as

                               c_tilt = 0.25

   and as

                         c_tilt = 0.25 + 0.2625 * V

   for voiced frames, where V is the voice activity level between 0 and
   1.
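
   The long-term shaping taps and the tilt coefficient described above
   can be written compactly as in the following Python sketch.  The
   function names are hypothetical, and F_ana and F_syn are taken as
   inputs because the text does not give the rule used to derive them.

```python
LTP_SHAPE_TAPS = (0.25, 0.5, 0.25)


def long_term_shaping(voiced, f_ana, f_syn):
    """Return (b_ana, b_syn); both are all-zero for unvoiced frames."""
    if not voiced:
        return [0.0] * 3, [0.0] * 3
    b_ana = [f_ana * t for t in LTP_SHAPE_TAPS]
    b_syn = [f_syn * t for t in LTP_SHAPE_TAPS]
    return b_ana, b_syn


def tilt_coefficient(voiced, v):
    """Return c_tilt; v is the voice activity level in [0, 1]."""
    return 0.25 + 0.2625 * v if voiced else 0.25
```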

   The adjustment gain G serves to correct any level mismatch between
   the original and decoded signals that might arise from the noise
   shaping and de-emphasis.  This gain is computed as the ratio of the
   prediction gain of the short-term analysis and synthesis filter
   coefficients.  The prediction gain of an LPC synthesis filter is the
   square root of the output energy when the filter is excited by a
   unit-energy impulse on the input.  An efficient way to compute the
   prediction gain is by first computing the reflection coefficients
   from the LPC coefficients through the step-down algorithm, and
   extracting the prediction gain from the reflection coefficients as

                                    K
                                   ___          2  -0.5
                      predGain = ( | | 1 - (r_k)  )
                                   k=1

   where r_k is the k'th reflection coefficient.
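
   A minimal Python sketch of this computation is given below.  It
   assumes the predictor convention x_hat(n) = sum_k a(k) * x(n-k); the
   function names are illustrative and are not taken from the reference
   implementation.

```python
def step_down(a):
    """Convert LPC coefficients a(1)..a(K) to reflection coefficients.

    Assumes the predictor convention x_hat(n) = sum_k a(k) * x(n - k).
    """
    a = list(a)
    refl = [0.0] * len(a)
    for m in range(len(a), 0, -1):
        r = a[m - 1]              # last coefficient of the order-m predictor
        if abs(r) >= 1.0:
            raise ValueError("filter is not stable")
        refl[m - 1] = r
        # Recover the order m-1 predictor from the order m predictor.
        a = [(a[i] + r * a[m - 2 - i]) / (1.0 - r * r) for i in range(m - 1)]
    return refl


def prediction_gain(refl):
    """predGain = (prod_k (1 - r_k^2)) ** -0.5."""
    prod = 1.0
    for r in refl:
        prod *= 1.0 - r * r
    return prod ** -0.5
```

   For a stable filter, the result agrees with the square root of the
   energy of the synthesis filter's impulse response.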

   Initial values for the quantization gains are computed as the square
   root of the residual energy of the LPC analysis, adjusted by the
   coding quality control parameter.  These quantization gains are later
   adjusted based on the results of the prediction analysis.

5.2.3.4.  Prediction Analysis

   The prediction analysis is performed in one of two ways depending on
   how the pitch estimator classified the frame.  The processing for
   voiced and unvoiced speech is described in Section 5.2.3.4.1 and
   Section 5.2.3.4.2, respectively.  Inputs to this function include the
   pre-whitened signal from the pitch estimator (see Section 5.2.3.2).

5.2.3.4.1.  Voiced Speech

   For a frame of voiced speech, the pitch pulses will remain dominant
   in the pre-whitened input signal.  Further whitening is desirable as
   it leads to higher quality at the same available bitrate.  To achieve
   this, a Long-Term Prediction (LTP) analysis is carried out to
   estimate the coefficients of a fifth-order LTP filter for each of
   four subframes.  The LTP coefficients are quantized using the method
   described in Section 5.2.3.6, and the quantized LTP coefficients are
   used to compute the LTP residual signal.  This LTP residual signal is
   the input to an LPC analysis where the LPC coefficients are estimated
   using Burg's method [BURG], such that the residual energy is
   minimized.  The estimated LPC coefficients are converted to a Line
   Spectral Frequency (LSF) vector and quantized as described in
   Section 5.2.3.5.  After quantization, the quantized LSF vector is
   converted back to LPC coefficients using the full procedure in
   Section 4.2.7.5.  By using quantized LTP coefficients and LPC
   coefficients derived from the quantized LSF coefficients, the encoder
   remains fully synchronized with the decoder.  The quantized LPC and
   LTP coefficients are also used to filter the input signal and measure
   residual energy for each of the four subframes.

5.2.3.4.2.  Unvoiced Speech

   For a speech signal that has been classified as unvoiced, there is no
   need for LTP filtering, as it has already been determined that the
   pre-whitened input signal is not periodic enough within the allowed
   pitch period range for LTP analysis to be worth the cost in terms of
   complexity and bitrate.  The pre-whitened input signal is therefore
   discarded, and, instead, the input signal is used for LPC analysis
   using Burg's method.  The resulting LPC coefficients are converted to
   an LSF vector and quantized as described in the following section.
   They are then transformed back to obtain quantized LPC coefficients,
   which are then used to filter the input signal and measure residual
   energy for each of the four subframes.

5.2.3.4.2.1.  Burg's Method

   The main purpose of linear prediction in SILK is to reduce the
   bitrate by minimizing the residual energy.  At least at high
   bitrates, perceptual aspects are handled independently by the noise
   shaping filter.  Burg's method is used because it provides higher
   prediction gain than the autocorrelation method and, unlike the
   covariance method, produces stable filters (assuming numerical errors
   don't spoil that).  SILK's implementation of Burg's method is also
   computationally faster than the autocovariance method.  The
   implementation of Burg's method differs from traditional
   implementations in two aspects.  The first difference is that it
   operates on autocorrelations, similar to the Schur algorithm [SCHUR],
   but with a simple update to the autocorrelations after finding each
   reflection coefficient to make the result identical to Burg's method.
   This brings down the complexity of Burg's method to near that of the
   autocorrelation method.  The second difference is that the signal in
   each subframe is scaled by the inverse of the residual quantization
   step size.  Subframes with a small quantization step size will, on
   average, spend more bits for a given amount of residual energy than
   subframes with a large step size.  Without scaling, Burg's method
   minimizes the total residual energy in all subframes, which doesn't
   necessarily minimize the total number of bits needed for coding the
   quantized residual.  The residual energy of the scaled subframes is a
   better measure for that number of bits.
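
   For orientation, the sketch below shows the textbook form of Burg's
   recursion together with the per-subframe scaling step.  It is not
   SILK's autocorrelation-based variant, and it does not show how the
   reference implementation combines the scaled subframes; all names
   are assumptions made for this example.

```python
def burg(x, order):
    """Textbook Burg recursion.

    Returns (a, refl), where a = [1, a1, ..., a_order] are the
    coefficients of the prediction error filter
    e(n) = x(n) + sum_k a_k * x(n - k).
    """
    n = len(x)
    f = list(x)          # forward prediction errors
    b = list(x)          # backward prediction errors
    a = [1.0]
    refl = []
    for m in range(order):
        num = sum(f[i] * b[i - 1] for i in range(m + 1, n))
        den = sum(f[i] ** 2 + b[i - 1] ** 2 for i in range(m + 1, n))
        k = -2.0 * num / den if den > 0.0 else 0.0
        refl.append(k)
        # Levinson-style update of the error filter coefficients.
        ext = a + [0.0]
        a = [ext[i] + k * ext[m + 1 - i] for i in range(m + 2)]
        # Update both error signals in place, highest index first.
        for i in range(n - 1, m, -1):
            f_old = f[i]
            f[i] = f_old + k * b[i - 1]
            b[i] = b[i - 1] + k * f_old
    return a, refl


def scale_subframes(subframes, step_sizes):
    """Scale each subframe by the inverse of its quantization step size."""
    return [[v / q for v in sf] for sf, q in zip(subframes, step_sizes)]
```

   Each reflection coefficient is bounded by 1 in magnitude by
   construction, which is why the resulting filter is stable in exact
   arithmetic.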

5.2.3.5.  LSF Quantization

   Unlike many other speech codecs, SILK uses variable bitrate coding
   for the LSFs.  This improves the average rate-distortion (R-D) trade-
   off and reduces outliers.  The variable bitrate coding minimizes a
   linear combination of the weighted quantization errors and the
   bitrate.  The weights for the quantization errors are the Inverse
   Harmonic Mean Weighting (IHMW) function proposed by Laroia et al.
   (see [LAROIA-ICASSP]).  These weights are referred to here as Laroia
   weights.

   The LSF quantizer consists of two stages.  The first stage is an
   (unweighted) vector quantizer (VQ), with a codebook size of 32
   vectors.  The quantization errors of the codebook vectors are
   sorted, and a second-stage quantizer is run for the N best vectors.
   Varying N trades off R-D performance against computational
   efficiency.  For each of the N codebook vectors, the
   Laroia weights corresponding to that vector (and not to the input
   vector) are calculated.  Then, the residual between the input LSF
   vector and the codebook vector is scaled by the square roots of these
   Laroia weights.  This scaling partially normalizes error sensitivity
   for the residual vector so that a uniform quantizer with fixed step
   sizes can be used in the second stage without too much performance
   loss.  Additionally, by scaling with Laroia weights determined from
   the first-stage codebook vector, the process can be reversed in the
   decoder.
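
   The first-stage survivor selection and the residual scaling can be
   sketched as follows.  The weight formula is the commonly cited form
   of the inverse harmonic mean weighting, w_i = 1/(f_i - f_(i-1)) +
   1/(f_(i+1) - f_i); the normalization of LSFs to the interval (0, 1),
   the border values, and all function names are assumptions made for
   this example rather than details taken from the reference
   implementation.

```python
def laroia_weights(lsf):
    """IHMW weights for an increasing LSF vector normalized to (0, 1)."""
    padded = [0.0] + list(lsf) + [1.0]
    return [1.0 / (padded[i] - padded[i - 1])
            + 1.0 / (padded[i + 1] - padded[i])
            for i in range(1, len(padded) - 1)]


def first_stage_survivors(lsf, codebook, n_best):
    """Indices of the n_best codebook vectors by unweighted squared error."""
    def err(vec):
        return sum((x - c) ** 2 for x, c in zip(lsf, vec))
    order = sorted(range(len(codebook)), key=lambda i: err(codebook[i]))
    return order[:n_best]


def scaled_residual(lsf, cb_vec):
    """Residual scaled by the square roots of the codebook vector's weights."""
    w = laroia_weights(cb_vec)   # from the codebook vector, not the input
    return [(x - c) * wi ** 0.5 for x, c, wi in zip(lsf, cb_vec, w)]
```

   Deriving the weights from the codebook vector rather than from the
   input is what lets a decoder, which only knows the first-stage
   index, undo the scaling.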

   The second stage uses predictive delayed decision scalar
   quantization.  The quantization error is weighted by Laroia weights
   determined from the LSF input vector.  The predictor multiplies the
   previous quantized residual value by a prediction coefficient that
   depends on the vector index from the first stage VQ and on the
   location in the LSF vector.  The prediction is subtracted from the
   LSF residual value before quantizing the result and is added back
   afterwards.  This subtraction can be interpreted as shifting the
   quantization levels of the scalar quantizer, and as a result the
   quantization error of each value depends on the quantization decision
   of the previous value.  The delayed decision mechanism exploits this
   dependency, using a Viterbi-like algorithm [VITERBI] to search for
   the quantization sequence with the best R-D performance.  The
   quantizer processes the residual LSF vector in reverse order (i.e.,
   it starts with the highest residual LSF value).  This is done because
   the prediction works slightly better in the reverse direction.

   The quantization index of the first stage is entropy coded.  The
   quantization sequence from the second stage is also entropy coded,
   where for each element the probability table is chosen depending on
   the vector index from the first stage and the location of that
   element in the LSF vector.

5.2.3.5.1.  LSF Stabilization

   If the input is stable, finding the best candidate usually results in
   a quantized vector that is also stable.  Because of the two-stage
   approach, however, it is possible that the best quantization
   candidate is unstable.  The encoder applies the same stabilization
   procedure applied by the decoder (see Section 4.2.7.5.4) to ensure
   the LSF parameters are within their valid range, increasingly sorted,
   and have minimum distances between each other and the border values.

5.2.3.6.  LTP Quantization

   For voiced frames, the prediction analysis described in
   Section 5.2.3.4.1 resulted in four sets (one set per subframe) of
   five LTP coefficients, plus four weighting matrices.  The LTP
   coefficients for each subframe are quantized using entropy
   constrained vector quantization.  A total of three vector codebooks
   are available for quantization, with different rate-distortion trade-
   offs.  The three codebooks have 10, 20, and 40 vectors and average
   rates of about 3, 4, and 5 bits per vector, respectively.
   Consequently, the first codebook has larger average quantization
   distortion at a lower rate, whereas the last codebook has smaller
   average quantization distortion at a higher rate.  Given the
   weighting matrix W_ltp and LTP vector b, the weighted rate-distortion
   measure for a codebook vector cb_i with rate r_i is given by

               RD = u * (b - cb_i)' * W_ltp * (b - cb_i) + r_i

   where u is a fixed, heuristically determined parameter balancing the
   distortion and rate.  Which codebook gives the best performance for a
   given LTP vector depends on the weighting matrix for that LTP vector.
   For example, for a low valued W_ltp, it is advantageous to use the
   codebook with 10 vectors as it has a lower average rate.  For a large
   W_ltp, on the other hand, it is often better to use the codebook with
   40 vectors, as it is more likely to contain the best codebook vector.
   The weighting matrix W_ltp depends mostly on two aspects of the input
   signal.  The first is the periodicity of the signal; the more
   periodic, the larger W_ltp.  The second is the change in signal
   energy in the current subframe, relative to the signal one pitch lag
   earlier.  A decaying energy leads to a larger W_ltp than an
   increasing energy.  Both aspects fluctuate relatively slowly, which
   causes the W_ltp matrices for different subframes of one frame often
   to be similar.  Because of this, one of the three codebooks typically
   gives good performance for all subframes.  Therefore, the codebook
   search for the subframe LTP vectors is constrained to only allow
   codebook vectors to be chosen from the same codebook, resulting in a
   rate reduction.

   To find the best codebook, each of the three vector codebooks is used
   to quantize all subframe LTP vectors and produce a combined weighted
   rate-distortion measure for each vector codebook.  The vector
   codebook with the lowest combined rate-distortion over all subframes
   is chosen.  The quantized LTP vectors are used in the noise shaping
   quantizer, and the index of the codebook plus the four indices for
   the four subframe codebook vectors are passed on to the range
   encoder.
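
   The constrained codebook search described above can be sketched in
   Python as follows.  The data layout (each codebook as a pair of
   vector and rate lists) and the function names are assumptions for
   this example, and the rate of the codebook index itself is ignored.

```python
def rd_cost(b, cb_vec, w, rate, u):
    """RD = u * (b - cb)' * W * (b - cb) + rate."""
    d = [bi - ci for bi, ci in zip(b, cb_vec)]
    n = len(d)
    dist = sum(d[i] * w[i][j] * d[j] for i in range(n) for j in range(n))
    return u * dist + rate


def quantize_subframe(b, w, vectors, rates, u):
    """Return (cost, index) of the best vector in one codebook."""
    return min((rd_cost(b, v, w, r, u), i)
               for i, (v, r) in enumerate(zip(vectors, rates)))


def select_codebook(ltp_vectors, weights, codebooks, u):
    """Pick one codebook for all subframes.

    codebooks is a list of (vectors, rates) pairs.  Returns
    (codebook_index, [vector index per subframe]).
    """
    best = None
    for cb_index, (vectors, rates) in enumerate(codebooks):
        results = [quantize_subframe(b, w, vectors, rates, u)
                   for b, w in zip(ltp_vectors, weights)]
        total = sum(cost for cost, _ in results)
        if best is None or total < best[0]:
            best = (total, cb_index, [i for _, i in results])
    return best[1], best[2]
```

   With a small weighting matrix the rate term dominates and the
   smaller codebook wins; with a large one the distortion term
   dominates and the larger codebook wins.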

5.2.3.7.  Pre-filter

   In the pre-filter, the input signal is filtered using the spectral
   valley de-emphasis filter coefficients from the noise shaping
   analysis (see Section 5.2.3.3).  By applying only the noise shaping
   analysis filter to the input signal, it provides the input to the
   noise shaping quantizer.

5.2.3.8.  Noise Shaping Quantizer

   The noise shaping quantizer independently shapes the signal and
   coding noise spectra to obtain a perceptually higher quality at the
   same bitrate.

   The pre-filter output signal is multiplied with a compensation gain G
   computed in the noise shaping analysis.  Then, the output of a
   synthesis shaping filter is added, and the output of a prediction
   filter is subtracted to create a residual signal.  The residual
   signal is multiplied by the inverse quantized quantization gain from
   the noise shaping analysis and input to a scalar quantizer.  The
   quantization indices of the scalar quantizer represent a signal of
   pulses that is input to the pyramid range encoder.  The scalar
   quantizer also outputs a quantization signal, which is multiplied by
   the quantized quantization gain from the noise shaping analysis to
   create an excitation signal.  The output of the prediction filter is
   added to the excitation signal to form the quantized output signal
   y(n).  The quantized output signal y(n) is input to the synthesis
   shaping and prediction filters.

   Optionally, the noise shaping quantizer operates in a delayed
   decision mode.  In this mode, it uses a Viterbi algorithm to keep
   track of multiple rounding choices in the quantizer and select the
   best one after a delay of 32 samples.  This improves the rate/
   distortion performance of the quantizer.

5.2.3.9.  Constant Bitrate Mode

   SILK was designed to run in variable bitrate (VBR) mode.  However,
   the reference implementation also has a constant bitrate (CBR) mode
   for SILK.  In CBR mode, SILK will attempt to encode each packet with
   no more than the allowed number of bits.  The Opus wrapper code then
   pads the bitstream if any unused bits are left in SILK mode, or it
   encodes the high band with the remaining number of bits in Hybrid
   mode.  The number of payload bits is adjusted by changing the
   quantization gains and the rate/distortion trade-off in the noise
   shaping quantizer, in an iterative loop around the noise shaping
   quantizer and entropy coding.  Compared to the SILK VBR mode, the CBR
   mode has lower audio quality at a given average bitrate and has
   higher computational complexity.

5.3.  CELT Encoder

   Most of the aspects of the CELT encoder can be directly derived from
   the description of the decoder.  For example, the filters and
   rotations in the encoder are simply the inverse of the operation
   performed by the decoder.  Similarly, the quantizers generally
   optimize for the mean square error (because noise shaping is part of
   the bitstream itself), so no special search is required.  For this
   reason, only the less straightforward aspects of the encoder are
   described here.

5.3.1.  Pitch Pre-filter

   The pitch pre-filter is applied after the pre-emphasis.  It is
   applied in such a way as to be the inverse of the decoder's post-
   filter.  The main non-obvious aspect of the pre-filter is the
   selection of the pitch period.  The pitch search should be optimized
   for the following criteria:

   o  continuity: it is important that the pitch period does not change
      abruptly between frames; and

   o  avoidance of pitch multiples: when the period used is a multiple
      of the real period (lower frequency fundamental), the post-filter
      loses most of its ability to reduce noise.

5.3.2.  Bands and Normalization

   The MDCT output is divided into bands that are designed to match the
   ear's critical bands for the smallest (2.5 ms) frame size.  The
   larger frame sizes use integer multiples of the 2.5 ms layout.  For
   each band, the encoder computes the energy that will later be
   encoded.  Each band is then normalized by the square root of the
   *unquantized* energy, such that each band now forms a unit vector X.
   The energy and the normalization are computed by
   compute_band_energies() and normalise_bands() (bands.c),
   respectively.
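
   The following Python sketch illustrates the normalization step.  It
   is not a transcription of the functions in bands.c: the band edge
   list, the small epsilon guarding against silent bands, and the
   function names are assumptions made for this example.

```python
import math


def band_energies(mdct, band_edges):
    """Energy (sum of squares) per band; band i spans edges[i]:edges[i+1]."""
    return [sum(v * v for v in mdct[lo:hi])
            for lo, hi in zip(band_edges, band_edges[1:])]


def normalize_bands(mdct, band_edges, eps=1e-27):
    """Return (energies, X) with every band of X scaled to unit L2 norm."""
    energies = band_energies(mdct, band_edges)
    out = []
    for (lo, hi), e in zip(zip(band_edges, band_edges[1:]), energies):
        scale = 1.0 / math.sqrt(e + eps)   # eps avoids division by zero
        out.extend(v * scale for v in mdct[lo:hi])
    return energies, out
```

   For example, normalize_bands([3.0, 4.0, 0.0, 5.0], [0, 2, 4]) yields
   two bands of energy 25 each, normalized to (0.6, 0.8) and (0, 1).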

5.3.3.  Energy Envelope Quantization

   Energy quantization (both coarse and fine) can be easily understood
   from the decoding process.  For all useful bitrates, the coarse
   quantizer always chooses the quantized log energy value that
   minimizes the error for each band.  Only at very low rate does the
   encoder allow larger errors to minimize the rate and avoid using more
   bits than are available.  When the available CPU resources allow
   it, it is best to try encoding the coarse energy both with and
   without inter-frame prediction such that the best prediction mode can
   be selected.  The optimal mode depends on the coding rate, the
   available bitrate, and the current rate of packet loss.

   The fine energy quantizer always chooses the quantized log energy
   value that minimizes the error for each band because the rate of the
   fine quantization depends only on the bit allocation and not on the
   values that are coded.

5.3.4.  Bit Allocation

   The encoder must use exactly the same bit allocation process as used
   by the decoder and described in Section 4.3.3.  The three mechanisms
   that can be used by the encoder to adjust the bitrate on a frame-by-
   frame basis are band boost, allocation trim, and band skipping.

5.3.4.1.  Band Boost

   The reference encoder makes a decision to boost a band when the
   energy of that band is significantly higher than that of the
   neighboring bands.  Let E_j be the log-energy of band j, and define

      D_j = 2*E_j - E_(j-1) - E_(j+1)

   The allocation of band j is boosted once if D_j > t1 and twice if D_j
   > t2.  For LM>=1, t1=2 and t2=4, while for LM<1, t1=3 and t2=5.
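
   A minimal Python sketch of this rule follows.  The first and last
   bands are left unboosted here because they lack one neighbor; that
   choice, like the function name, is an assumption of this example and
   not a statement about the reference encoder.

```python
def band_boosts(log_energy, lm):
    """Boost count (0, 1 or 2) per band from D_j = 2*E_j - E_(j-1) - E_(j+1)."""
    t1, t2 = (2.0, 4.0) if lm >= 1 else (3.0, 5.0)
    boosts = [0] * len(log_energy)
    for j in range(1, len(log_energy) - 1):
        d = 2.0 * log_energy[j] - log_energy[j - 1] - log_energy[j + 1]
        if d > t2:
            boosts[j] = 2
        elif d > t1:
            boosts[j] = 1
    return boosts
```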

5.3.4.2.  Allocation Trim

   The allocation trim is a value between 0 and 10 (inclusively) that
   controls the allocation balance between the low and high frequencies.
   The encoder starts with a safe "default" of 5 and deviates from that
   default in two different ways.  First, the trim can deviate by +/- 2
   depending on the spectral tilt of the input signal.  For signals with
   more low frequencies, the trim is increased by up to 2, while for
   signals with more high frequencies, the trim is decreased by up to 2.
   Second, for stereo inputs, the trim can be decreased by up to 4 when
   the inter-channel correlation at low frequency (first 8 bands) is
   high.

5.3.4.3.  Band Skipping

   The encoder uses band skipping to ensure that the shape of the bands
   is only coded if there is at least 1/2 bit per sample available for
   the PVQ.  If not, then no bit is allocated and folding is used
   instead.  To ensure continuity in the allocation, some amount of
   hysteresis is added to the process, such that a band that received
   PVQ bits in the previous frame only needs 7/16 bit/sample to be coded
   for the current frame, while a band that did not receive PVQ bits in
   the previous frame needs at least 9/16 bit/sample to be coded.
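
   The hysteresis can be expressed as a single comparison, as in the
   Python sketch below.  The function and parameter names are
   hypothetical, and integer arithmetic in units of 1/16 bit per sample
   is used only to keep the comparison exact.

```python
def code_band_shape(pvq_bits, n_samples, coded_in_previous_frame):
    """Decide whether a band's shape is coded with PVQ or folded instead."""
    # 7/16 bit/sample keeps a band that was coded; 9/16 is needed to start.
    threshold_16ths = 7 if coded_in_previous_frame else 9
    return 16 * pvq_bits >= threshold_16ths * n_samples
```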

5.3.5.  Stereo Decisions

   Because CELT applies mid-side stereo coupling in the normalized
   domain, it does not suffer from important stereo image problems even
   when the two channels are completely uncorrelated.  For this reason,
   it is always safe to use stereo coupling on any audio frame.  That
   being said, there are some frames for which dual (independent) stereo
   is still more efficient.  This decision is made by comparing the
   estimated entropy with and without coupling over the first 13 bands,
   taking into account the fact that all bands with more than two MDCT
   bins require one extra degree of freedom when coded in mid-side.  Let
   L1_ms and L1_lr be the L1-norm of the mid-side vector and the L1-norm
   of the left-right vector, respectively.  The decision to use mid-side
   is made if and only if

                            L1_ms          L1_lr
                           --------    <   -----
                           bins + E        bins

   where bins is the number of MDCT bins in the first 13 bands and E is
   the number of extra degrees of freedom for mid-side coding.  For
   LM>1, E=13, otherwise E=5.
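
   The decision rule can be sketched in Python as follows.  The L1
   norms are taken as inputs, since the scaling of the mid and side
   vectors is not restated here; the function names are assumptions.
   The inequality is cross-multiplied to avoid divisions.

```python
def l1_norm(vec):
    """Sum of absolute values."""
    return sum(abs(v) for v in vec)


def use_mid_side(l1_ms, l1_lr, bins, lm):
    """True if L1_ms / (bins + E) < L1_lr / bins."""
    extra = 13 if lm > 1 else 5   # extra degrees of freedom E
    return l1_ms * bins < l1_lr * (bins + extra)
```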

   The reference encoder decides on the intensity stereo threshold based
   on the bitrate alone.  After taking into account the frame size by
   subtracting 80 bits per frame for coarse energy, the first band using
   intensity coding is as follows:

                     +------------------+------------+
                     | bitrate (kbit/s) | start band |
                     +------------------+------------+
                     |        <35       |      8     |
                     |                  |            |
                     |       35-50      |     12     |
                     |                  |            |
                     |       50-68      |     16     |
                     |                  |            |
                     |       68-84      |     18     |
                     |                  |            |
                     |      84-102      |     19     |
                     |                  |            |
                     |      102-130     |     20     |
                     |                  |            |
                     |       >130       |  disabled  |
                     +------------------+------------+

                 Table 66: Thresholds for Intensity Stereo
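
   As a sketch, Table 66 can be implemented as a sorted lookup.  The
   fourth row is read as 68-84 kbit/s, which the neighboring rows
   imply.  The table does not say which row a rate exactly on a
   boundary belongs to; this example assigns it to the higher range,
   except that 130 kbit/s maps to band 20 because the last row is
   strictly ">130".  The input is assumed to be the bitrate after the
   coarse energy adjustment, and None stands for "disabled".

```python
import bisect

_UPPER_KBPS = [35, 50, 68, 84, 102]
_START_BAND = [8, 12, 16, 18, 19, 20]


def intensity_start_band(bitrate_kbps):
    """First band using intensity stereo, or None if it is disabled."""
    if bitrate_kbps > 130:
        return None
    return _START_BAND[bisect.bisect_right(_UPPER_KBPS, bitrate_kbps)]
```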

5.3.6.  Time-Frequency Decision

   The choice of time-frequency resolution used in Section 4.3.4.5 is
   based on R-D optimization.  The distortion is the L1-norm (sum of
   absolute values) of each band after each TF resolution under
   consideration.  The L1 norm is used because it represents the entropy
   for a Laplacian source.  The number of bits required to code a change
   in TF resolution between two bands is higher than the cost of having
   those two bands use the same resolution, which is what requires the
   R-D optimization.  The optimal decision is computed using the Viterbi
   algorithm.  See tf_analysis() in celt/celt.c.
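
   The structure of such a search is illustrated by the Python sketch
   below.  It is not a transcription of tf_analysis(): it assumes only
   two candidate resolutions per band, a precomputed L1 metric for
   each, and a single fixed cost for changing resolution between
   adjacent bands.

```python
def tf_viterbi(metric, switch_cost):
    """Choose one of two TF resolutions per band.

    metric[band][s] is the L1 distortion of the band under resolution s.
    Returns (path, total_cost).
    """
    cost = list(metric[0])
    back = []
    for band in metric[1:]:
        new_cost, choices = [], []
        for s in range(2):
            # Cost of arriving in state s from each previous state.
            cands = [cost[p] + (switch_cost if p != s else 0)
                     for p in range(2)]
            p_best = 0 if cands[0] <= cands[1] else 1
            new_cost.append(cands[p_best] + band[s])
            choices.append(p_best)
        cost = new_cost
        back.append(choices)
    s = 0 if cost[0] <= cost[1] else 1
    total = cost[s]
    path = [s]
    for choices in reversed(back):   # trace the best path backwards
        s = choices[s]
        path.append(s)
    path.reverse()
    return path, total
```

   A low switching cost lets the path follow the per-band minimum,
   while a high one keeps all bands at the same resolution.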

5.3.7.  Spreading Values Decision

   The choice of the spreading value in Table 59 has an impact on the
   nature of the coding noise introduced by CELT.  The larger the f_r
   value, the lower the impact of the rotation, and the more tonal the
   coding noise.  The more tonal the signal, the more tonal the noise
   should be, so the CELT encoder determines the optimal value for f_r
   by estimating how tonal the signal is.  The tonality estimate is
   based on a discrete pdf (4-bin histogram) of each band.  Bands that
   have a large number of small values are considered more tonal, and a
   decision is made by combining all bands with more than 8 samples.
   See spreading_decision() in celt/bands.c.

5.3.8.  Spherical Vector Quantization

   CELT uses a Pyramid Vector Quantizer (PVQ) [PVQ] for quantizing the
   details of the spectrum in each band that have not been predicted by
   the pitch predictor.  The PVQ codebook consists of all sums of K
   signed pulses in a vector of N samples, where two pulses at the same
   position are required to have the same sign.  Thus, the codebook
   includes all integer codevectors y of N dimensions that satisfy
   sum(abs(y(j))) = K.

   In bands where there are sufficient bits allocated, PVQ is used to
   encode the unit vector that results from the normalization in
   Section 5.3.2 directly.  Given a PVQ codevector y, the unit vector X
   is obtained as X = y/||y||, where ||.|| denotes the L2 norm.
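
   For small N and K, the codebook can be enumerated by brute force,
   which makes the definition concrete.  The Python sketch below is for
   illustration only (a real encoder never materializes the codebook),
   and the function names are assumptions.

```python
import math


def pvq_codebook(n, k):
    """Enumerate all integer vectors y of length n with sum(abs(y)) == k."""
    if n == 0:
        return [[]] if k == 0 else []
    out = []
    for first in range(-k, k + 1):
        for rest in pvq_codebook(n - 1, k - abs(first)):
            out.append([first] + rest)
    return out


def unit_vector(y):
    """X = y / ||y||, with ||.|| the L2 norm."""
    norm = math.sqrt(sum(v * v for v in y))
    return [v / norm for v in y]
```

   For N = 2 and K = 2, this yields the eight codevectors (2,0),
   (-2,0), (0,2), (0,-2), and (+/-1,+/-1).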

5.3.8.1.  PVQ Search

   The search for the best codevector y is performed by alg_quant()
   (vq.c).  There are several possible approaches to the search, with a
   trade-off between quality and complexity.  The method used in the
   reference implementation computes an initial codeword y0 by
   projecting the normalized spectrum X onto the codebook pyramid of K-1
   pulses:

   y0 = truncate_towards_zero( (K-1) * X / sum(abs(X)))

   Depending on N, K and the input data, the initial codeword y0 may
   contain from 0 to K-1 non-zero values.  All the remaining pulses,
   with the exception of the last one, are found iteratively with a
   greedy search that maximizes the normalized correlation between y and
   X, that is, it minimizes the cost

                                   T
                             J = -X * y / ||y||

   The search described above is considered to be a good trade-off
   between quality and computational cost.  However, there are other
   possible ways to search the PVQ codebook and the implementers MAY use
   any other search methods.  See alg_quant() in celt/vq.c.
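
   A simplified Python sketch of such a greedy search is shown below.
   It is not a transcription of alg_quant(): it places every remaining
   pulse greedily (including the last one), works on magnitudes and
   restores signs at the end, and handles an all-zero input with an
   arbitrary choice.  All of these are assumptions of this example.

```python
import math


def pvq_search(x, k):
    """Greedy search for an integer y with sum(abs(y)) == k aligned with x."""
    if k < 1 or not x:
        raise ValueError("need k >= 1 and a non-empty vector")
    n = len(x)
    mag = [abs(v) for v in x]
    l1 = sum(mag)
    if l1 == 0.0:
        return [k] + [0] * (n - 1)   # arbitrary choice for all-zero input
    # Initial projection onto the pyramid with K-1 pulses.
    y = [int((k - 1) * m / l1) for m in mag]
    xy = sum(m * p for m, p in zip(mag, y))
    yy = sum(p * p for p in y)
    for _ in range(k - sum(y)):
        best_j, best_score = 0, -1.0
        for j in range(n):
            # Normalized correlation if one more pulse were added at j.
            score = (xy + mag[j]) / math.sqrt(yy + 2 * y[j] + 1)
            if score > best_score:
                best_j, best_score = j, score
        xy += mag[best_j]
        yy += 2 * y[best_j] + 1
        y[best_j] += 1
    return [p if v >= 0 else -p for p, v in zip(y, x)]
```

   Updating the running sums xy and yy incrementally avoids recomputing
   the correlation and the norm for every candidate position.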

5.3.8.2.  PVQ Encoding

   The vector to encode, X, is converted into an index i such that
   0 <= i < V(N,K) as follows.  Let i = 0 and k = 0.  Then, for
   j = (N - 1) down to 0, inclusive, do:

   1.  If k > 0, set i = i + (V(N-j-1,k-1) + V(N-j,k-1))/2.

   2.  Set k = k + abs(X[j]).

   3.  If X[j] < 0, set i = i + (V(N-j-1,k) + V(N-j,k))/2.

   The index i is then encoded using the procedure in Section 5.1.4 with
   ft = V(N,K).
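
   The steps above translate directly into the Python sketch below.
   V(N,K) is computed with the recurrence V(N,K) = V(N-1,K) + V(N,K-1)
   + V(N-1,K-1), with V(N,0) = 1 and V(0,K) = 0 for K > 0; memoized
   recursion is used here for brevity, which is an assumption of this
   example rather than the reference implementation's approach.  The
   input is the integer codevector.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def pvq_size(n, k):
    """V(N,K): number of integer vectors of length n with sum(abs) == k."""
    if k == 0:
        return 1
    if n == 0:
        return 0
    return pvq_size(n - 1, k) + pvq_size(n, k - 1) + pvq_size(n - 1, k - 1)


def pvq_index(x):
    """Map an integer codevector x to an index i with 0 <= i < V(N,K)."""
    n = len(x)
    i = 0
    k = 0
    for j in range(n - 1, -1, -1):
        if k > 0:                                       # step 1
            i += (pvq_size(n - j - 1, k - 1) + pvq_size(n - j, k - 1)) // 2
        k += abs(x[j])                                  # step 2
        if x[j] < 0:                                    # step 3
            i += (pvq_size(n - j - 1, k) + pvq_size(n - j, k)) // 2
    return i
```

   Enumerating every codevector for small N and K and checking that the
   indices cover 0..V(N,K)-1 exactly once is a convenient sanity test.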
