5. Opus Encoder

Just like the decoder, the Opus encoder also normally consists of two main blocks: the SILK encoder and the CELT encoder. However, unlike the case of the decoder, a valid (though potentially suboptimal) Opus encoder is not required to support all modes and may thus only include a SILK encoder module or a CELT encoder module. The output bitstream of the Opus encoding contains bits from the SILK and CELT encoders, though these are not separable due to the use of a range coder. A block diagram of the encoder is illustrated below.

                     +------------+    +---------+
                     |   Sample   |    |  SILK   |------+
                  +->|    Rate    |--->| Encoder |      V
   +-----------+  |  | Conversion |    |         | +---------+
   | Optional  |  |  +------------+    +---------+ |  Range  |
 ->| High-pass |--+                                | Encoder |---->
   |  Filter   |  |  +--------------+  +---------+ |         | Bit-
   +-----------+  |  |    Delay     |  |  CELT   | +---------+ stream
                  +->| Compensation |->| Encoder |      ^
                     |              |  |         |------+
                     +--------------+  +---------+

                       Figure 20: Opus Encoder

For a normal encoder where both the SILK and the CELT modules are included, an optimal encoder should select which coding mode to use at run-time depending on the conditions. In the reference implementation, the frame size is selected by the application, but the other configuration parameters (number of channels, bandwidth, mode) are automatically selected (unless explicitly overridden by the application) depending on the following:

o  Requested bitrate

o  Input sampling rate

o  Type of signal (speech vs. music)

o  Frame size in use

The type of signal currently needs to be provided by the application (though it can be changed in real-time). An Opus encoder implementation could also do automatic detection, but since Opus is an interactive codec, such an implementation would likely have to either delay the signal (for non-interactive applications) or delay the mode switching decisions (for interactive applications).

When the encoder is configured for voice over IP applications, the input signal is filtered by a high-pass filter to remove the lowest part of the spectrum that contains little speech energy and may contain background noise. This is a second order Auto Regressive Moving Average (i.e., with poles and zeros) filter with a cut-off frequency around 50 Hz. In the future, a music detector may also be used to lower the cut-off frequency when the input signal is detected to be music rather than speech.

5.1. Range Encoder

The range coder acts as the bit-packer for Opus. It is used in three different ways, to encode:

o  Entropy-coded symbols with a fixed probability model using ec_encode() (entenc.c),

o  Integers from 0 to (2**M - 1) using ec_enc_uint() or ec_enc_bits() (entenc.c),

o  Integers from 0 to (ft - 1) (where ft is not a power of two) using ec_enc_uint() (entenc.c).

The range encoder maintains an internal state vector composed of the four-tuple (val, rng, rem, ext) representing the low end of the current range, the size of the current range, a single buffered output byte, and a count of additional carry-propagating output bytes. Both val and rng are 32-bit unsigned integer values, rem is a byte value less than 255 or the special value -1, and ext is an unsigned integer with at least 11 bits. This state vector is initialized at the start of each frame to the value (0, 2**31, -1, 0).

After encoding a sequence of symbols, the value of rng in the encoder should exactly match the value of rng in the decoder after decoding the same sequence of symbols. This is a powerful tool for detecting errors in either an encoder or decoder implementation. The value of val, on the other hand, represents different things in the encoder and decoder, and is not expected to match.

The decoder has no analog for rem and ext. These are used to perform carry propagation in the renormalization loop below.
Each iteration of this loop produces 9 bits of output, consisting of 8 data bits and a carry flag. The encoder cannot determine the final value of the output bytes until it propagates these carry flags. Therefore, the reference implementation buffers a single non-propagating output byte (i.e., one with a value less than 255) in rem and keeps a count of additional propagating (i.e., 255) output bytes in ext. An implementation may choose to use any mathematically equivalent scheme to perform carry propagation.

5.1.1. Encoding Symbols

The main encoding function is ec_encode() (entenc.c), which encodes symbol k in the current context using the same three-tuple (fl[k], fh[k], ft) as the decoder to describe the range of the symbol (see Section 4.1).

ec_encode() updates the state of the encoder as follows. If fl[k] is greater than zero, then

   val = val + rng - (rng/ft) * (ft - fl)

   rng = (rng/ft) * (fh - fl)

Otherwise, val is unchanged and

   rng = rng - (rng/ft) * (ft - fh)

The divisions here are integer division.

5.1.1.1. Renormalization

After this update, the range is normalized using a procedure very similar to that of Section 4.1.2.1, implemented by ec_enc_normalize() (entenc.c). The following process is repeated until rng > 2**23. First, the top 9 bits of val, (val>>23), are sent to the carry buffer, described in Section 5.1.1.2. Then, the encoder sets

   val = (val<<8) & 0x7FFFFFFF

   rng = rng<<8

5.1.1.2. Carry Propagation and Output Buffering

The function ec_enc_carry_out() (entenc.c) implements carry propagation and output buffering. It takes, as input, a 9-bit unsigned value, c, consisting of 8 data bits and an additional carry bit. If c is equal to the value 255, then ext is simply incremented, and no other state updates are performed. Otherwise, let b = (c>>8) be the carry bit. Then,

o  If the buffered byte rem contains a value other than -1, the encoder outputs the byte (rem + b). Otherwise, if rem is -1, no byte is output.

o  If ext is non-zero, then the encoder outputs ext bytes -- all with a value of 0 if b is set, or 255 if b is unset -- and sets ext to 0.

o  rem is set to the 8 data bits:

      rem = c & 255

5.1.2. Alternate Encoding Methods

The reference implementation uses three additional encoding methods that are exactly equivalent to the above, but make assumptions and simplifications that allow for a more efficient implementation.

5.1.2.1. ec_encode_bin()

The first is ec_encode_bin() (entenc.c), defined using the parameter ftb instead of ft. It is mathematically equivalent to calling ec_encode() with ft = (1<<ftb), but it avoids using division.

5.1.2.2. ec_enc_bit_logp()

The next is ec_enc_bit_logp() (entenc.c), which encodes a single binary symbol. The context is described by a single parameter, logp, which is the absolute value of the base-2 logarithm of the probability of a "1". It is mathematically equivalent to calling ec_encode() with the 3-tuple (fl[k] = 0, fh[k] = (1<<logp) - 1, ft = (1<<logp)) if k is 0 and with (fl[k] = (1<<logp) - 1, fh[k] = ft = (1<<logp)) if k is 1. The implementation requires no multiplications or divisions.

5.1.2.3. ec_enc_icdf()

The last is ec_enc_icdf() (entenc.c), which encodes a single symbol with a table-based context of up to 8 bits. This uses the same icdf table as ec_dec_icdf() from Section 4.1.3.3. The function is mathematically equivalent to calling ec_encode() with fl[k] = (1<<ftb) - icdf[k-1] (or 0 if k == 0), fh[k] = (1<<ftb) - icdf[k], and ft = (1<<ftb). This only saves a few arithmetic operations over ec_encode_bin(), but it allows the encoder to use the same icdf tables as the decoder.

5.1.3. Encoding Raw Bits

The raw bits used by the CELT layer are packed at the end of the buffer using ec_enc_bits() (entenc.c). Because the raw bits may continue into the last byte output by the range coder if there is room in the low-order bits, the encoder must be prepared to merge these values into a single byte. The procedure in Section 5.1.5 does this in a way that ensures both the range coded data and the raw bits can be decoded successfully.

5.1.4. Encoding Uniformly Distributed Integers

The function ec_enc_uint() (entenc.c) encodes one of ft equiprobable symbols in the range 0 to (ft - 1), inclusive, each with a frequency of 1, where ft may be as large as (2**32 - 1). Like the decoder (see Section 4.1.5), it splits up the value into a range coded symbol representing up to 8 of the high bits, and, if necessary, raw bits representing the remainder of the value.

ec_enc_uint() takes a two-tuple (t, ft), where t is the unsigned integer to be encoded, 0 <= t < ft, and ft is not necessarily a power of two. Let ftb = ilog(ft - 1), i.e., the number of bits required to store (ft - 1) in two's complement notation. If ftb is 8 or less, then t is encoded directly using ec_encode() with the three-tuple (t, t + 1, ft).

If ftb is greater than 8, then the top 8 bits of t are encoded using the three-tuple (t>>(ftb - 8), (t>>(ftb - 8)) + 1, ((ft - 1)>>(ftb - 8)) + 1), and the remaining bits, (t & ((1<<(ftb - 8)) - 1)), are encoded as raw bits with ec_enc_bits().

5.1.5. Finalizing the Stream

After all symbols are encoded, the stream must be finalized by outputting a value inside the current range.
Let end be the unsigned integer in the interval [val, val + rng) with the largest number of trailing zero bits, b, such that (end + (1<<b) - 1) is also in the interval [val, val + rng). This choice of end allows the maximum number of trailing bits to be set to arbitrary values while still ensuring the range coded part of the buffer can be decoded correctly.
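
As a rough illustration of Sections 5.1.1 through 5.1.1.2 and of the choice of end described above, the following Python sketch models the encoder state (val, rng, rem, ext) and searches for end directly from its definition. It is not the reference entenc.c: the class name RangeEncoder, the function find_end(), and the out byte array are hypothetical names introduced here, and raw bits and buffer-size handling are left out.

```python
class RangeEncoder:
    """Toy model of the (val, rng, rem, ext) state; illustrative only."""

    def __init__(self):
        self.val = 0          # low end of the current range
        self.rng = 1 << 31    # size of the current range
        self.rem = -1         # buffered output byte, -1 means "none yet"
        self.ext = 0          # count of pending 255 (carry-propagating) bytes
        self.out = bytearray()

    def carry_out(self, c):
        """Take 9 bits: 8 data bits plus a carry flag in bit 8."""
        if c == 255:
            self.ext += 1     # cannot be resolved until a later carry is known
            return
        b = c >> 8            # carry bit
        if self.rem != -1:
            # The mask mirrors the truncation of a byte store.
            self.out.append((self.rem + b) & 255)
        while self.ext > 0:
            # Pending bytes become 0 if a carry arrived, else stay 255.
            self.out.append((255 + b) & 255)
            self.ext -= 1
        self.rem = c & 255

    def normalize(self):
        while self.rng <= (1 << 23):
            self.carry_out(self.val >> 23)
            self.val = (self.val << 8) & 0x7FFFFFFF
            self.rng <<= 8

    def encode(self, fl, fh, ft):
        """Encode a symbol described by the three-tuple (fl, fh, ft)."""
        r = self.rng // ft    # integer division
        if fl > 0:
            self.val += self.rng - r * (ft - fl)
            self.rng = r * (fh - fl)
        else:
            self.rng -= r * (ft - fh)
        self.normalize()


def find_end(val, rng):
    """Return (end, b): end lies in [val, val + rng), has at least b trailing
    zero bits, and end + (1 << b) - 1 is still inside the interval."""
    for b in range(31, -1, -1):
        mask = (1 << b) - 1
        end = (val + mask) & ~mask   # smallest multiple of 2**b that is >= val
        if end + mask < val + rng:
            return end, b
```

For example, find_end(5, 3) examines the interval [5, 8) and returns (6, 1): the low bit of 6 can be set to either value without leaving the interval.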

Then, while end is not zero, the top 9 bits of end, i.e., (end>>23), are passed to the carry buffer in accordance with the procedure in Section 5.1.1.2, and end is updated via

   end = (end<<8) & 0x7FFFFFFF

Finally, if the buffered output byte, rem, is neither zero nor the special value -1, or the carry count, ext, is greater than zero, then 9 zero bits are sent to the carry buffer to flush it to the output buffer. When outputting the final byte from the range coder, if it would overlap any raw bits already packed into the end of the output buffer, they should be ORed into the same byte. The bit allocation routines in the CELT layer should ensure that this can be done without corrupting the range coder data so long as end is chosen as described above. If there is any space between the end of the range coder data and the end of the raw bits, it is padded with zero bits. This entire process is implemented by ec_enc_done() (entenc.c).

5.1.6. Current Bit Usage

The bit allocation routines in Opus need to be able to determine a conservative upper bound on the number of bits that have been used to encode the current frame thus far. This drives allocation decisions and ensures that the range coder and raw bits will not overflow the output buffer. This is computed in the reference implementation to whole-bit precision by the function ec_tell() (entcode.h) and to fractional 1/8th bit precision by the function ec_tell_frac() (entcode.c). Like all operations in the range coder, it must be implemented in a bit-exact manner, and it must produce exactly the same value returned by the same functions in the decoder after decoding the same symbols.

5.2. SILK Encoder

In many respects, the SILK encoder mirrors the SILK decoder described in Section 4.2. Details such as the quantization and range coder tables can be found there, while this section describes the high-level design choices that were made. The diagram below shows the basic modules of the SILK encoder.
          +----------+    +--------+    +---------+
          |  Sample  |    | Stereo |    |  SILK   |
   ------>|   Rate   |--->| Mixing |--->|  Core   |---------->
   Input  |Conversion|    |        |    | Encoder |  Bitstream
          +----------+    +--------+    +---------+

                    Figure 21: SILK Encoder

5.2.1. Sample Rate Conversion

The input signal's sampling rate is adjusted by a sample rate conversion module so that it matches the SILK internal sampling rate. The input to the sample rate converter is delayed by a number of samples depending on the sample rate ratio, such that the overall delay is constant for all input and output sample rates.

5.2.2. Stereo Mixing

The stereo mixer is only used for stereo input signals. It converts a stereo left-right signal into an adaptive mid-side representation. The first step is to compute non-adaptive mid-side signals as half the sum and difference between left and right signals. The side signal is then minimized in energy by subtracting a prediction of it based on the mid signal. This prediction works well when the left and right signals exhibit linear dependency, for instance, for an amplitude-panned input signal. As in the decoder, the prediction coefficients are linearly interpolated during the first 8 ms of the frame. The mid signal is always encoded, whereas the residual side signal is only encoded if it has sufficient energy compared to the mid signal's energy. If it does not, the "mid_only_flag" is set without encoding the side signal.

The predictor coefficients are coded regardless of whether the side signal is encoded. For each frame, two predictor coefficients are computed, one that predicts between low-passed mid and side channels, and one that predicts between high-passed mid and side channels. The low-pass filter is a simple three-tap filter and creates a delay of one sample. The high-pass filtered signal is the difference between the mid signal delayed by one sample and the low-passed signal.
Instead of explicitly computing the high-passed signal, it is computationally more efficient to transform the prediction coefficients before applying them to the filtered mid signal, as follows:

   pred(n) = LP(n) * w0 + HP(n) * w1
           = LP(n) * w0 + (mid(n-1) - LP(n)) * w1
           = LP(n) * (w0 - w1) + mid(n-1) * w1

where w0 and w1 are the low-pass and high-pass prediction coefficients, mid(n-1) is the mid signal delayed by one sample, LP(n) and HP(n) are the low-passed and high-passed signals, and pred(n) is the prediction signal that is subtracted from the side signal.
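
The equivalence of the explicit and the transformed forms of pred(n) can be checked with a small Python sketch. The three-tap low-pass taps [1, 2, 1]/4 are an assumption made for this example (the text only states that the filter has three taps and a delay of one sample), and the function names are hypothetical.

```python
def mid_side(left, right):
    """Non-adaptive mid/side: half the sum and half the difference."""
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    return mid, side


def lowpass3(x):
    """Three-tap low-pass with ASSUMED taps [1, 2, 1]/4; delay of one sample."""
    y = []
    for n in range(len(x)):
        x1 = x[n - 1] if n >= 1 else 0.0
        x2 = x[n - 2] if n >= 2 else 0.0
        y.append(0.25 * x[n] + 0.5 * x1 + 0.25 * x2)
    return y


def predict_explicit(mid, w0, w1):
    """pred(n) = LP(n)*w0 + HP(n)*w1 with HP(n) = mid(n-1) - LP(n)."""
    lp = lowpass3(mid)
    pred = []
    for n in range(len(mid)):
        mid_delayed = mid[n - 1] if n >= 1 else 0.0
        hp = mid_delayed - lp[n]
        pred.append(lp[n] * w0 + hp * w1)
    return pred


def predict_transformed(mid, w0, w1):
    """pred(n) = LP(n)*(w0 - w1) + mid(n-1)*w1, no explicit high-pass."""
    lp = lowpass3(mid)
    pred = []
    for n in range(len(mid)):
        mid_delayed = mid[n - 1] if n >= 1 else 0.0
        pred.append(lp[n] * (w0 - w1) + mid_delayed * w1)
    return pred
```

Both functions return the same prediction up to rounding; the transformed form simply avoids building the high-passed signal.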

5.2.3. SILK Core Encoder

What follows is a description of the core encoder and its components. For simplicity, the core encoder is referred to simply as the encoder in the remainder of this section. An overview of the encoder is given in Figure 22.

   [Block diagram of the SILK core encoder. Blocks: Voice Activity Detection, Pitch Analysis, Noise Shaping Analysis, LTP Scaling Control, Gains Processor, LSF Quantizer, Prediction Analysis, Pre-filter, Noise Shaping Quantization, and Range Encoder. The numbered signals are listed below.]

    1:  Input speech signal
    2:  Range encoded bitstream
    3:  Voice activity estimate
    4:  Pitch lags (per 5 ms) and voicing decision (per 20 ms)
    5:  Noise shaping quantization coefficients
        -  Short-term synthesis and analysis noise shaping coefficients (per 5 ms)
        -  Long-term synthesis and analysis noise shaping coefficients (per 5 ms and for voiced speech only)
        -  Noise shaping tilt (per 5 ms)
        -  Quantizer gain/step size (per 5 ms)
    6:  Input signal filtered with analysis noise shaping filters
    7:  Short- and Long-Term Prediction coefficients LTP (per 5 ms) and LPC (per 20 ms)
    8:  LSF quantization indices
    9:  LSF coefficients
   10:  Quantized LSF coefficients
   11:  Processed gains, and synthesis noise shape coefficients
   12:  LTP state scaling coefficient. Controlling error propagation / prediction gain trade-off
   13:  Quantized signal

                    Figure 22: SILK Core Encoder

5.2.3.1. Voice Activity Detection

The input signal is processed by a Voice Activity Detection (VAD) algorithm to produce a measure of voice activity, spectral tilt, and signal-to-noise estimates for each frame. The VAD uses a sequence of half-band filterbanks to split the signal into four subbands: 0...Fs/16, Fs/16...Fs/8, Fs/8...Fs/4, and Fs/4...Fs/2, where Fs is the sampling frequency (8, 12, 16, or 24 kHz). The lowest subband, from 0...Fs/16, is high-pass filtered with a first-order moving average (MA) filter (with transfer function H(z) = 1-z**(-1)) to reduce the energy at the lowest frequencies. For each frame, the signal energy per subband is computed. In each subband, a noise level estimator tracks the background noise level and a Signal-to-Noise Ratio (SNR) value is computed as the logarithm of the ratio of energy-to-noise level. Using these intermediate variables, the following parameters are calculated for use in other SILK modules:

o  Average SNR. The average of the subband SNR values.

o  Smoothed subband SNRs. Temporally smoothed subband SNR values.

o  Speech activity level. Based on the average SNR and a weighted average of the subband energies.

o  Spectral tilt. A weighted average of the subband SNRs, with positive weights for the low subbands and negative weights for the high subbands.
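
The subband layout and two of the derived parameters can be sketched in Python as follows. This is only an illustration: the tilt weights, the use of decibels for the logarithm, and the function names are assumptions made here, since the text does not give the actual weights or the log base, and the noise level tracking and temporal smoothing are omitted.

```python
import math


def subband_edges(fs_hz):
    """Band edges in Hz of the four VAD subbands for sampling rate fs_hz."""
    return [(0, fs_hz / 16), (fs_hz / 16, fs_hz / 8),
            (fs_hz / 8, fs_hz / 4), (fs_hz / 4, fs_hz / 2)]


def vad_features(energies, noise_levels, tilt_weights=(1.0, 0.5, -0.5, -1.0)):
    """Per-subband SNRs, their average, and a spectral tilt measure.

    tilt_weights are HYPOTHETICAL: positive for the low subbands and
    negative for the high subbands, as the text describes qualitatively.
    """
    # SNR as the logarithm of the energy-to-noise ratio (dB assumed here).
    snrs = [10.0 * math.log10(e / n) for e, n in zip(energies, noise_levels)]
    average_snr = sum(snrs) / len(snrs)
    norm = sum(abs(w) for w in tilt_weights)
    tilt = sum(w * s for w, s in zip(tilt_weights, snrs)) / norm
    return snrs, average_snr, tilt
```

With these assumed weights, a frame whose energy is concentrated in the lowest subband yields a positive tilt, and one concentrated in the highest subband yields a negative tilt.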

5.2.3.2. Pitch Analysis

The input signal is processed by the open loop pitch estimator shown in Figure 23.

   [Block diagram of the pitch estimator. Blocks: LPC Analysis, Whitening Filter, Downsample to 8 kHz, 2 x Downsampling, three Time-Correlator stages, and Speech Type. The numbered signals are listed below.]

   1:  Input signal
   2:  Lag candidates from stage 1
   3:  Lag candidates from stage 2
   4:  Correlation threshold
   5:  Voiced/unvoiced flag
   6:  Pitch correlation
   7:  Pitch lags

            Figure 23: Block Diagram of the Pitch Estimator

The pitch analysis finds a binary voiced/unvoiced classification, and, for frames classified as voiced, four pitch lags per frame -- one for each 5 ms subframe -- and a pitch correlation indicating the periodicity of the signal. The input is first whitened using a Linear Prediction (LP) whitening filter, where the coefficients are computed through standard Linear Predictive Coding (LPC) analysis. The order of the whitening filter is 16 for best results, but is reduced to 12 for medium complexity and 8 for low complexity modes. The whitened signal is analyzed to find pitch lags for which the time correlation is high. To reduce complexity, the analysis is split into three stages:

o  In the first stage, the whitened signal is downsampled to 4 kHz (from 8 kHz), and the current frame is correlated to a signal delayed by a range of lags, starting from a shortest lag corresponding to 500 Hz, to a longest lag corresponding to 56 Hz.

o  The second stage operates on an 8 kHz signal (downsampled from 12, 16, or 24 kHz) and measures time correlations only near the lags corresponding to those that had sufficiently high correlations in the first stage. The resulting correlations are adjusted for a small bias towards short lags to avoid ending up with a multiple of the true pitch lag. The highest adjusted correlation is compared to a threshold depending on:

   *  Whether the previous frame was classified as voiced.

   *  The speech activity level.

   *  The spectral tilt.

   If the threshold is exceeded, the current frame is classified as voiced and the lag with the highest adjusted correlation is stored for a final pitch analysis of the highest precision in the third stage.

o  The last stage operates directly on the whitened input signal to compute time correlations for each of the four subframes independently in a narrow range around the lag with highest correlation from the second stage.

5.2.3.3. Noise Shaping Analysis

The noise shaping analysis finds gains and filter coefficients used in the pre-filter and noise shaping quantizer. These parameters are chosen such that they will fulfill several requirements:

o  Balancing quantization noise and bitrate. The quantization gains determine the step size between reconstruction levels of the excitation signal. Therefore, increasing the quantization gain amplifies quantization noise, but also reduces the bitrate by lowering the entropy of the quantization indices.

o  Spectral shaping of the quantization noise; the noise shaping quantizer is capable of reducing quantization noise in some parts of the spectrum at the cost of increased noise in other parts without substantially changing the bitrate. By shaping the noise such that it follows the signal spectrum, it becomes less audible. In practice, best results are obtained by making the shape of the noise spectrum slightly flatter than the signal spectrum.

o  De-emphasizing spectral valleys; by using different coefficients in the analysis and synthesis part of the pre-filter and noise shaping quantizer, the levels of the spectral valleys can be decreased relative to the levels of the spectral peaks such as speech formants and harmonics. This reduces the entropy of the signal, which is the difference between the coded signal and the quantization noise, thus lowering the bitrate.

o  Matching the levels of the decoded speech formants to the levels of the original speech formants; an adjustment gain and a first order tilt coefficient are computed to compensate for the effect of the noise shaping quantization on the level and spectral tilt.

   [Plot of power (vertical axis) versus frequency (horizontal axis) showing the three spectra listed below.]

   1:  Input signal spectrum
   2:  De-emphasized and level matched spectrum
   3:  Quantization noise spectrum

     Figure 24: Noise Shaping and Spectral De-emphasis Illustration

Figure 24 shows an example of an input signal spectrum (1). After de-emphasis and level matching, the spectrum has deeper valleys (2). The quantization noise spectrum (3) more or less follows the input signal spectrum, while having slightly less pronounced peaks. The entropy, which provides a lower bound on the bitrate for encoding the excitation signal, is proportional to the area between the de-emphasized spectrum (2) and the quantization noise spectrum (3). Without de-emphasis, the entropy is proportional to the area between input spectrum (1) and quantization noise (3) -- clearly higher.

The transformation from input signal to de-emphasized signal can be described as a filtering operation with a filter

   H(z) = G * (1 - c_tilt * z**(-1)) * Wana(z) / Wsyn(z)

having an adjustment gain G, a first order tilt adjustment filter with tilt coefficient c_tilt, and where

   Wana(z) = (1 - sum_{k=1..16} a_ana(k) * z**(-k))
             * (1 - z**(-L) * sum_{k=-d..d} b_ana(k) * z**(-k))

is the analysis part of the de-emphasis filter, consisting of the short-term shaping filter with coefficients a_ana(k) and the long-term shaping filter with coefficients b_ana(k) and pitch lag L. The parameter d determines the number of long-term shaping filter taps.

Similarly, but without the tilt adjustment, the synthesis part can be written as

   Wsyn(z) = (1 - sum_{k=1..16} a_syn(k) * z**(-k))
             * (1 - z**(-L) * sum_{k=-d..d} b_syn(k) * z**(-k))

All noise shaping parameters are computed and applied per subframe of 5 ms. First, an LPC analysis is performed on a windowed signal block of 15 ms. The signal block has a look-ahead of 5 ms relative to the current subframe, and the window is an asymmetric sine window. The LPC analysis is done with the autocorrelation method, with an order between 8 (in lowest-complexity mode) and 16 (for best quality).

Optionally, the LPC analysis and noise shaping filters are warped by replacing the delay elements by first-order allpass filters. This increases the frequency resolution at low frequencies and reduces it at high ones, which better matches the human auditory system and improves quality. The warped analysis and filtering comes at a cost in complexity and is therefore only done in higher complexity modes.

The quantization gain is found by taking the square root of the residual energy from the LPC analysis and multiplying it by a value inversely proportional to the coding quality control parameter and the pitch correlation.

Next, the two sets of short-term noise shaping coefficients a_ana(k) and a_syn(k) are obtained by applying different amounts of bandwidth expansion to the coefficients found in the LPC analysis. This bandwidth expansion moves the roots of the LPC polynomial towards the origin, using the formulas

   a_ana(k) = a(k) * g_ana**k

and

   a_syn(k) = a(k) * g_syn**k

where a(k) is the k'th LPC coefficient, and the bandwidth expansion factors g_ana and g_syn are calculated as

   g_ana = 0.95 - 0.01*C

and

   g_syn = 0.95 + 0.01*C

where C is the coding quality control parameter between 0 and 1. Applying more bandwidth expansion to the analysis part than to the synthesis part gives the desired de-emphasis of spectral valleys in between formants.

The long-term shaping is applied only during voiced frames. It uses a three-tap filter, described by

   b_ana = F_ana * [0.25, 0.5, 0.25]

and

   b_syn = F_syn * [0.25, 0.5, 0.25].

For unvoiced frames, these coefficients are set to 0. The multiplication factors F_ana and F_syn are chosen between 0 and 1, depending on the coding quality control parameter, as well as the calculated pitch correlation and smoothed subband SNR of the lowest subband. By having F_ana less than F_syn, the pitch harmonics are emphasized relative to the valleys in between the harmonics.

The tilt coefficient c_tilt is chosen as

   c_tilt = 0.25

for unvoiced frames and as

   c_tilt = 0.25 + 0.2625 * V

for voiced frames, where V is the voice activity level between 0 and 1.
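
The formulas above translate almost directly into code. The Python sketch below is a minimal illustration, not the reference implementation: the function names are hypothetical, and F_ana and F_syn are passed in as inputs because the text does not specify how they are derived from the quality parameter, pitch correlation, and SNR.

```python
def bandwidth_expand(a, g):
    """Return a(k) * g**k for k = 1..len(a); a[0] holds a(1)."""
    return [ak * g ** k for k, ak in enumerate(a, start=1)]


def shaping_params(a, C, voiced, V, F_ana, F_syn):
    """Short-term, long-term, and tilt shaping parameters for one subframe.

    a      : LPC coefficients a(1)..a(K) from the shaping LPC analysis
    C      : coding quality control parameter in [0, 1]
    voiced : True for voiced frames
    V      : voice activity level in [0, 1]
    F_ana, F_syn : long-term shaping factors in [0, 1] (supplied by caller)
    """
    g_ana = 0.95 - 0.01 * C
    g_syn = 0.95 + 0.01 * C
    taps = [0.25, 0.5, 0.25]
    if voiced:
        b_ana = [F_ana * t for t in taps]
        b_syn = [F_syn * t for t in taps]
        c_tilt = 0.25 + 0.2625 * V
    else:
        # Long-term shaping is disabled for unvoiced frames.
        b_ana = [0.0, 0.0, 0.0]
        b_syn = [0.0, 0.0, 0.0]
        c_tilt = 0.25
    return {
        "a_ana": bandwidth_expand(a, g_ana),
        "a_syn": bandwidth_expand(a, g_syn),
        "b_ana": b_ana,
        "b_syn": b_syn,
        "c_tilt": c_tilt,
    }
```

Because g_ana is never larger than g_syn, the analysis coefficients always receive at least as much bandwidth expansion as the synthesis coefficients.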

The adjustment gain G serves to correct any level mismatch between the original and decoded signals that might arise from the noise shaping and de-emphasis. This gain is computed as the ratio of the prediction gain of the short-term analysis and synthesis filter coefficients. The prediction gain of an LPC synthesis filter is the square root of the output energy when the filter is excited by a unit-energy impulse on the input. An efficient way to compute the prediction gain is by first computing the reflection coefficients from the LPC coefficients through the step-down algorithm, and extracting the prediction gain from the reflection coefficients as

   predGain = ( prod_{k=1..K} (1 - (r_k)**2) )**(-0.5)

where r_k is the k'th reflection coefficient.

Initial values for the quantization gains are computed as the square root of the residual energy of the LPC analysis, adjusted by the coding quality control parameter. These quantization gains are later adjusted based on the results of the prediction analysis.

5.2.3.4. Prediction Analysis

The prediction analysis is performed in one of two ways depending on how the pitch estimator classified the frame. The processing for voiced and unvoiced speech is described in Section 5.2.3.4.1 and Section 5.2.3.4.2, respectively. Inputs to this function include the pre-whitened signal from the pitch estimator (see Section 5.2.3.2).

5.2.3.4.1. Voiced Speech

For a frame of voiced speech, the pitch pulses will remain dominant in the pre-whitened input signal. Further whitening is desirable as it leads to higher quality at the same available bitrate. To achieve this, a Long-Term Prediction (LTP) analysis is carried out to estimate the coefficients of a fifth-order LTP filter for each of four subframes. The LTP coefficients are quantized using the method described in Section 5.2.3.6, and the quantized LTP coefficients are used to compute the LTP residual signal.
This LTP residual signal is the input to an LPC analysis where the LPC coefficients are estimated using Burg's method [BURG], such that the residual energy is minimized. The estimated LPC coefficients are converted to a Line Spectral Frequency (LSF) vector and quantized as described in Section 5.2.3.5. After quantization, the quantized LSF vector is converted back to LPC coefficients using the full procedure in Section 4.2.7.5. By using quantized LTP coefficients and LPC coefficients derived from the quantized LSF coefficients, the encoder remains fully synchronized with the decoder. The quantized LPC and LTP coefficients are also used to filter the input signal and measure residual energy for each of the four subframes.

5.2.3.4.2. Unvoiced Speech

For a speech signal that has been classified as unvoiced, there is no need for LTP filtering, as it has already been determined that the pre-whitened input signal is not periodic enough within the allowed pitch period range for LTP analysis to be worth the cost in terms of complexity and bitrate. The pre-whitened input signal is therefore discarded, and, instead, the input signal is used for LPC analysis using Burg's method. The resulting LPC coefficients are converted to an LSF vector and quantized as described in the following section. They are then transformed back to obtain quantized LPC coefficients, which are then used to filter the input signal and measure residual energy for each of the four subframes.

5.2.3.4.2.1. Burg's Method

The main purpose of linear prediction in SILK is to reduce the bitrate by minimizing the residual energy. At least at high bitrates, perceptual aspects are handled independently by the noise shaping filter. Burg's method is used because it provides higher prediction gain than the autocorrelation method and, unlike the covariance method, produces stable filters (assuming numerical errors don't spoil that). SILK's implementation of Burg's method is also computationally faster than the autocovariance method. The implementation of Burg's method differs from traditional implementations in two aspects. The first difference is that it operates on autocorrelations, similar to the Schur algorithm [SCHUR], but with a simple update to the autocorrelations after finding each reflection coefficient to make the result identical to Burg's method. This brings down the complexity of Burg's method to near that of the autocorrelation method.
The second difference is that the signal in each subframe is scaled by the inverse of the residual quantization step size. Subframes with a small quantization step size will, on average, spend more bits for a given amount of residual energy than subframes with a large step size. Without scaling, Burg's method minimizes the total residual energy in all subframes, which doesn't necessarily minimize the total number of bits needed for coding the quantized residual. The residual energy of the scaled subframes is a better measure for that number of bits.
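
For orientation, the sketch below shows a textbook time-domain form of Burg's method in Python, preceded by the per-subframe scaling described above. It is not SILK's autocorrelation-based variant, and all function names are hypothetical. The returned filter uses the convention A(z) = 1 + sum a[k] * z**(-k), so the coefficients a(k) as written elsewhere in this section correspond to -a[k]. The small pred_gain() helper evaluates the predGain formula of Section 5.2.3.3 from reflection coefficients.

```python
def scale_subframes(x, step_sizes, subframe_len):
    """Divide each subframe by its residual quantization step size."""
    return [s / step_sizes[i // subframe_len] for i, s in enumerate(x)]


def burg(x, order):
    """Textbook Burg recursion.

    Returns (a, ks): a is the prediction error filter [1, a1, ..., a_order]
    and ks holds the reflection coefficients.
    """
    n = len(x)
    f = list(x)   # forward prediction error
    b = list(x)   # backward prediction error
    a = [1.0]
    ks = []
    for m in range(order):
        num = -2.0 * sum(f[i] * b[i - 1] for i in range(m + 1, n))
        den = sum(f[i] ** 2 + b[i - 1] ** 2 for i in range(m + 1, n))
        k = num / den if den > 0.0 else 0.0
        ks.append(k)
        # Levinson-style update of the error filter.
        a_ext = a + [0.0]
        a = [a_ext[i] + k * a_ext[m + 1 - i] for i in range(m + 2)]
        # Update the errors from the end so that b[i - 1] is still the old value.
        for i in range(n - 1, m, -1):
            fi = f[i]
            f[i] = fi + k * b[i - 1]
            b[i] = b[i - 1] + k * fi
    return a, ks


def pred_gain(ks):
    """predGain = (prod_k (1 - r_k**2)) ** (-0.5)."""
    prod = 1.0
    for k in ks:
        prod *= 1.0 - k * k
    return prod ** -0.5
```

A caller would first apply scale_subframes() with one step size per subframe and then run burg() on the scaled signal, so that subframes with a small step size weigh more heavily in the minimization.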

5.2.3.5. LSF Quantization

Unlike many other speech codecs, SILK uses variable bitrate coding for the LSFs. This improves the average rate-distortion (R-D) trade-off and reduces outliers. The variable bitrate coding minimizes a linear combination of the weighted quantization errors and the bitrate. The weights for the quantization errors are the Inverse Harmonic Mean Weighting (IHMW) function proposed by Laroia et al. (see [LAROIA-ICASSP]). These weights are referred to here as Laroia weights.

The LSF quantizer consists of two stages. The first stage is an (unweighted) vector quantizer (VQ), with a codebook size of 32 vectors. The quantization errors for the codebook vector are sorted, and for the N best vectors a second stage quantizer is run. By varying the number N, a trade-off is made between R-D performance and computational efficiency. For each of the N codebook vectors, the Laroia weights corresponding to that vector (and not to the input vector) are calculated. Then, the residual between the input LSF vector and the codebook vector is scaled by the square roots of these Laroia weights. This scaling partially normalizes error sensitivity for the residual vector so that a uniform quantizer with fixed step sizes can be used in the second stage without too much performance loss. Additionally, by scaling with Laroia weights determined from the first-stage codebook vector, the process can be reversed in the decoder.

The second stage uses predictive delayed decision scalar quantization. The quantization error is weighted by Laroia weights determined from the LSF input vector. The predictor multiplies the previous quantized residual value by a prediction coefficient that depends on the vector index from the first stage VQ and on the location in the LSF vector. The prediction is subtracted from the LSF residual value before quantizing the result and is added back afterwards.
This subtraction can be interpreted as shifting the quantization levels of the scalar quantizer, and as a result the quantization error of each value depends on the quantization decision of the previous value. This dependency is exploited by the delayed decision mechanism to search for a quantization sequence with the best R-D performance with a Viterbi-like algorithm [VITERBI]. The quantizer processes the residual LSF vector in reverse order (i.e., it starts with the highest residual LSF value). This is done because the prediction works slightly better in the reverse direction.
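
As an illustration of the predictive scalar quantization described above, the following C sketch quantizes a residual vector in reverse order with a first-order predictor. It is a simplified greedy version: each value is rounded to the nearest level, and the Laroia weighting, the rate term, and the delayed decision search are all omitted. The function name, the uniform step size, and the per-position prediction coefficients pred[] are assumptions made for this example and are not taken from the reference implementation.

```c
#include <assert.h>

/* Hypothetical helper: greedy predictive scalar quantization of an
 * LSF residual vector, processed from the last element to the first.
 *   res[]   input residual values
 *   pred[]  assumed prediction coefficient for each position
 *   step    assumed uniform quantizer step size (> 0)
 *   idx[]   output quantization indices
 *   res_q[] output quantized residual values */
void residual_predictive_quantize(const float *res, const float *pred,
                                  float step, int n,
                                  int *idx, float *res_q)
{
    assert(n > 0 && step > 0.0f);

    float prev = 0.0f; /* previously quantized value; none at the start */
    for (int i = n - 1; i >= 0; i--) {
        /* Predict from the previously quantized (higher-index) value. */
        float p = pred[i] * prev;

        /* Subtract the prediction, then round to the nearest level. */
        float target = (res[i] - p) / step;
        int q = (int)(target >= 0.0f ? target + 0.5f : target - 0.5f);

        /* Add the prediction back to form the quantized value. */
        idx[i] = q;
        res_q[i] = (float)q * step + p;
        prev = res_q[i];
    }
}
```

Because each prediction depends on the previous quantized value, changing one rounding decision shifts the levels available to the next element. That coupling is what a delayed decision search would exploit and what this greedy sketch ignores.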

The quantization index of the first stage is entropy coded. The quantization sequence from the second stage is also entropy coded, where for each element the probability table is chosen depending on the vector index from the first stage and the location of that element in the LSF vector.

5.2.3.5.1. LSF Stabilization

If the input is stable, finding the best candidate usually results in a quantized vector that is also stable. Because of the two-stage approach, however, it is possible that the best quantization candidate is unstable. The encoder applies the same stabilization procedure applied by the decoder (see Section 4.2.7.5.4) to ensure the LSF parameters are within their valid range, increasingly sorted, and have minimum distances between each other and the border values.

5.2.3.6. LTP Quantization

For voiced frames, the prediction analysis described in Section 5.2.3.4.1 resulted in four sets (one set per subframe) of five LTP coefficients, plus four weighting matrices. The LTP coefficients for each subframe are quantized using entropy constrained vector quantization. A total of three vector codebooks are available for quantization, with different rate-distortion trade-offs. The three codebooks have 10, 20, and 40 vectors and average rates of about 3, 4, and 5 bits per vector, respectively. Consequently, the first codebook has larger average quantization distortion at a lower rate, whereas the last codebook has smaller average quantization distortion at a higher rate. Given the weighting matrix W_ltp and LTP vector b, the weighted rate-distortion measure for a codebook vector cb_i with rate r_i is given by

   RD = u * (b - cb_i)' * W_ltp * (b - cb_i) + r_i

where u is a fixed, heuristically determined parameter balancing the distortion and rate. Which codebook gives the best performance for a given LTP vector depends on the weighting matrix for that LTP vector.
For example, for a low-valued W_ltp, it is advantageous to use the codebook with 10 vectors as it has a lower average rate. For a large W_ltp, on the other hand, it is often better to use the codebook with 40 vectors, as it is more likely to contain the best codebook vector. The weighting matrix W_ltp depends mostly on two aspects of the input signal. The first is the periodicity of the signal; the more periodic, the larger W_ltp. The second is the change in signal energy in the current subframe, relative to the signal one pitch lag earlier. A decaying energy leads to a larger W_ltp than an increasing energy. Both aspects fluctuate relatively slowly, which causes the W_ltp matrices for different subframes of one frame often
to be similar. Because of this, one of the three codebooks typically gives good performance for all subframes. Therefore, the codebook search for the subframe LTP vectors is constrained to only allow codebook vectors to be chosen from the same codebook, resulting in a rate reduction. To find the best codebook, each of the three vector codebooks is used to quantize all subframe LTP vectors and produce a combined weighted rate-distortion measure for each vector codebook. The vector codebook with the lowest combined rate-distortion over all subframes is chosen. The quantized LTP vectors are used in the noise shaping quantizer, and the index of the codebook plus the four indices for the four subframe codebook vectors are passed on to the range encoder.

5.2.3.7. Pre-filter

In the pre-filter, the input signal is filtered using the spectral valley de-emphasis filter coefficients from the noise shaping analysis (see Section 5.2.3.3). By applying only the noise shaping analysis filter to the input signal, it provides the input to the noise shaping quantizer.

5.2.3.8. Noise Shaping Quantizer

The noise shaping quantizer independently shapes the signal and coding noise spectra to obtain a perceptually higher quality at the same bitrate. The pre-filter output signal is multiplied with a compensation gain G computed in the noise shaping analysis. Then, the output of a synthesis shaping filter is added, and the output of a prediction filter is subtracted to create a residual signal. The residual signal is multiplied by the inverse quantized quantization gain from the noise shaping analysis and input to a scalar quantizer. The quantization indices of the scalar quantizer represent a signal of pulses that is input to the pyramid range encoder. The scalar quantizer also outputs a quantization signal, which is multiplied by the quantized quantization gain from the noise shaping analysis to create an excitation signal.
The output of the prediction filter is added to the excitation signal to form the quantized output signal y(n). The quantized output signal y(n) is input to the synthesis shaping and prediction filters.

Optionally, the noise shaping quantizer operates in a delayed decision mode. In this mode, it uses a Viterbi algorithm to keep track of multiple rounding choices in the quantizer and select the best one after a delay of 32 samples. This improves the rate/distortion performance of the quantizer.

5.2.3.9. Constant Bitrate Mode

SILK was designed to run in variable bitrate (VBR) mode. However, the reference implementation also has a constant bitrate (CBR) mode for SILK. In CBR mode, SILK will attempt to encode each packet with no more than the allowed number of bits. The Opus wrapper code then pads the bitstream if any unused bits are left in SILK mode, or it encodes the high band with the remaining number of bits in Hybrid mode. The number of payload bits is adjusted by changing the quantization gains and the rate/distortion trade-off in the noise shaping quantizer, in an iterative loop around the noise shaping quantizer and entropy coding. Compared to the SILK VBR mode, the CBR mode has lower audio quality at a given average bitrate and has higher computational complexity.

5.3. CELT Encoder

Most of the aspects of the CELT encoder can be directly derived from the description of the decoder. For example, the filters and rotations in the encoder are simply the inverse of the operation performed by the decoder. Similarly, the quantizers generally optimize for the mean square error (because noise shaping is part of the bitstream itself), so no special search is required. For this reason, only the less straightforward aspects of the encoder are described here.

5.3.1. Pitch Pre-filter

The pitch pre-filter is applied after the pre-emphasis. It is applied in such a way as to be the inverse of the decoder's post-filter. The main non-obvious aspect of the pre-filter is the selection of the pitch period.
The pitch search should be optimized for the following criteria:

o  continuity: it is important that the pitch period does not change abruptly between frames; and

o  avoidance of pitch multiples: when the period used is a multiple of the real period (lower frequency fundamental), the post-filter loses most of its ability to reduce noise.

5.3.2. Bands and Normalization

The MDCT output is divided into bands that are designed to match the ear's critical bands for the smallest (2.5 ms) frame size. The larger frame sizes use integer multiples of the 2.5 ms layout. For each band, the encoder computes the energy that will later be encoded. Each band is then normalized by the square root of the *unquantized* energy, such that each band now forms a unit vector X. The energy and the normalization are computed by compute_band_energies() and normalise_bands() (bands.c), respectively.

5.3.3. Energy Envelope Quantization

Energy quantization (both coarse and fine) can be easily understood from the decoding process. For all useful bitrates, the coarse quantizer always chooses the quantized log energy value that minimizes the error for each band. Only at very low rates does the encoder allow larger errors to minimize the rate and avoid using more bits than are available. When CPU resources allow it, it is best to try encoding the coarse energy both with and without inter-frame prediction such that the best prediction mode can be selected. The optimal mode depends on the coding rate, the available bitrate, and the current rate of packet loss.

The fine energy quantizer always chooses the quantized log energy value that minimizes the error for each band because the rate of the fine quantization depends only on the bit allocation and not on the values that are coded.

5.3.4. Bit Allocation

The encoder must use exactly the same bit allocation process as used by the decoder and described in Section 4.3.3. The three mechanisms that can be used by the encoder to adjust the bitrate on a frame-by-frame basis are band boost, allocation trim, and band skipping.

5.3.4.1. Band Boost

The reference encoder makes a decision to boost a band when the energy of that band is significantly higher than that of the neighboring bands.
Let E_j be the log-energy of band j, and define

   D_j = 2*E_j - E_(j-1) - E_(j+1)

The allocation of band j is boosted once if D_j > t1 and twice if D_j > t2. For LM>=1, t1=2 and t2=4, while for LM<1, t1=3 and t2=5.
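
The band boost rule above can be sketched in C as follows. The function name and argument layout are hypothetical. The text does not say how the first and last bands are handled, since each has only one neighbor, so this sketch simply never boosts them; that is an assumption of the example.

```c
#include <assert.h>

/* Hypothetical helper: number of boosts (0, 1, or 2) for band j.
 *   log_e[]   log-energies E_j of all bands
 *   nb_bands  number of bands
 *   lm        LM, as used in the text to select the thresholds */
int band_boost_count(const float *log_e, int nb_bands, int j, int lm)
{
    assert(nb_bands > 0 && j >= 0 && j < nb_bands);

    /* Assumption of this sketch: edge bands are never boosted. */
    if (j == 0 || j == nb_bands - 1)
        return 0;

    /* Thresholds from the text: (2, 4) for LM >= 1, (3, 5) otherwise. */
    float t1 = (lm >= 1) ? 2.0f : 3.0f;
    float t2 = (lm >= 1) ? 4.0f : 5.0f;

    /* D_j = 2*E_j - E_(j-1) - E_(j+1) */
    float d = 2.0f * log_e[j] - log_e[j - 1] - log_e[j + 1];

    if (d > t2)
        return 2;
    if (d > t1)
        return 1;
    return 0;
}
```

D_j is a discrete second difference of the log-energy across bands, so it is large only where a band stands out from both of its neighbors.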

5.3.4.2. Allocation Trim

The allocation trim is a value between 0 and 10 (inclusive) that controls the allocation balance between the low and high frequencies. The encoder starts with a safe "default" of 5 and deviates from that default in two different ways. First, the trim can deviate by +/- 2 depending on the spectral tilt of the input signal. For signals with more low frequencies, the trim is increased by up to 2, while for signals with more high frequencies, the trim is decreased by up to 2. Second, for stereo inputs, the trim value can be decreased by up to 4 when the inter-channel correlation at low frequency (first 8 bands) is high.

5.3.4.3. Band Skipping

The encoder uses band skipping to ensure that the shape of the bands is only coded if there is at least 1/2 bit per sample available for the PVQ. If not, then no bit is allocated and folding is used instead. To ensure continuity in the allocation, some amount of hysteresis is added to the process, such that a band that received PVQ bits in the previous frame only needs 7/16 bit/sample to be coded for the current frame, while a band that did not receive PVQ bits in the previous frame needs at least 9/16 bit/sample to be coded.

5.3.5. Stereo Decisions

Because CELT applies mid-side stereo coupling in the normalized domain, it does not suffer from important stereo image problems even when the two channels are completely uncorrelated. For this reason, it is always safe to use stereo coupling on any audio frame. That being said, there are some frames for which dual (independent) stereo is still more efficient. This decision is made by comparing the estimated entropy with and without coupling over the first 13 bands, taking into account the fact that all bands with more than two MDCT bins require one extra degree of freedom when coded in mid-side. Let L1_ms and L1_lr be the L1-norm of the mid-side vector and the L1-norm of the left-right vector, respectively.
The decision to use mid-side is made if and only if

   L1_ms / (bins + E) < L1_lr / bins

where bins is the number of MDCT bins in the first 13 bands and E is the number of extra degrees of freedom for mid-side coding. For LM>1, E=13, otherwise E=5.
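
The mid-side test above can be sketched in C as follows. Both function names are hypothetical. The text does not define how the mid and side signals are scaled, so the helper assumes the orthonormal form M = (L+R)/sqrt(2), S = (L-R)/sqrt(2); the decision function itself only needs the two L1 norms and cross-multiplies the inequality to avoid divisions.

```c
#include <assert.h>

static double abs_d(double v) { return v < 0.0 ? -v : v; }

/* Hypothetical helper: L1 norms of the left-right and mid-side
 * representations over n MDCT bins. The 1/sqrt(2) scaling of mid and
 * side is an assumption of this sketch. */
void stereo_l1_norms(const float *l, const float *r, int n,
                     double *l1_lr, double *l1_ms)
{
    const double inv_sqrt2 = 0.70710678118654752;
    *l1_lr = 0.0;
    *l1_ms = 0.0;
    for (int i = 0; i < n; i++) {
        double m = inv_sqrt2 * ((double)l[i] + (double)r[i]);
        double s = inv_sqrt2 * ((double)l[i] - (double)r[i]);
        *l1_lr += abs_d(l[i]) + abs_d(r[i]);
        *l1_ms += abs_d(m) + abs_d(s);
    }
}

/* Returns 1 if mid-side coding should be used, 0 for dual stereo.
 * Implements L1_ms / (bins + E) < L1_lr / bins, with both sides
 * multiplied by bins * (bins + E), which is positive. */
int use_mid_side(double l1_ms, double l1_lr, int bins, int lm)
{
    assert(bins > 0);
    int extra = (lm > 1) ? 13 : 5; /* E from the text */
    return l1_ms * bins < l1_lr * (bins + extra);
}
```

For two identical channels the side signal vanishes and L1_ms is smaller than L1_lr, so mid-side is selected. For channels with disjoint content, L1_ms exceeds L1_lr and the test can fall back to dual stereo once bins is large relative to E.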

The reference encoder decides on the intensity stereo threshold based on the bitrate alone. After taking into account the frame size by subtracting 80 bits per frame for coarse energy, the first band using intensity coding is as follows:

   +------------------+------------+
   | bitrate (kbit/s) | start band |
   +------------------+------------+
   | <35              | 8          |
   | 35-50            | 12         |
   | 50-68            | 16         |
   | 68-84            | 18         |
   | 84-102           | 19         |
   | 102-130          | 20         |
   | >130             | disabled   |
   +------------------+------------+

   Table 66: Thresholds for Intensity Stereo

5.3.6. Time-Frequency Decision

The choice of time-frequency resolution used in Section 4.3.4.5 is based on R-D optimization. The distortion is the L1-norm (sum of absolute values) of each band after each TF resolution under consideration. The L1 norm is used because it represents the entropy for a Laplacian source. The number of bits required to code a change in TF resolution between two bands is higher than the cost of having those two bands use the same resolution, which is what requires the R-D optimization. The optimal decision is computed using the Viterbi algorithm. See tf_analysis() in celt/celt.c.

5.3.7. Spreading Values Decision

The choice of the spreading value in Table 59 has an impact on the nature of the coding noise introduced by CELT. The larger the f_r value, the lower the impact of the rotation, and the more tonal the coding noise. The more tonal the signal, the more tonal the noise should be, so the CELT encoder determines the optimal value for f_r by estimating how tonal the signal is. The tonality estimate is based on a discrete pdf (4-bin histogram) of each band. Bands that
have a large number of small values are considered more tonal and a decision is made by combining all bands with more than 8 samples. See spreading_decision() in celt/bands.c.

5.3.8. Spherical Vector Quantization

CELT uses a Pyramid Vector Quantizer (PVQ) [PVQ] for quantizing the details of the spectrum in each band that have not been predicted by the pitch predictor. The PVQ codebook consists of all sums of K signed pulses in a vector of N samples, where two pulses at the same position are required to have the same sign. Thus, the codebook includes all integer codevectors y of N dimensions that satisfy sum(abs(y(j))) = K.

In bands where there are sufficient bits allocated, PVQ is used to encode the unit vector that results from the normalization in Section 5.3.2 directly. Given a PVQ codevector y, the unit vector X is obtained as X = y/||y||, where ||.|| denotes the L2 norm.

5.3.8.1. PVQ Search

The search for the best codevector y is performed by alg_quant() (vq.c). There are several possible approaches to the search, with a trade-off between quality and complexity. The method used in the reference implementation computes an initial codeword y0 by projecting the normalized spectrum X onto the codebook pyramid of K-1 pulses:

   y0 = truncate_towards_zero( (K-1) * X / sum(abs(X)))

Depending on N, K and the input data, the initial codeword y0 may contain from 0 to K-1 non-zero values. All the remaining pulses, with the exception of the last one, are found iteratively with a greedy search that minimizes the cost J, the negated normalized correlation between y and X:

   J = -X^T * y / ||y||

The search described above is considered to be a good trade-off between quality and computational cost. However, there are other possible ways to search the PVQ codebook and the implementers MAY use any other search methods. See alg_quant() in celt/vq.c.
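
The projection and greedy steps described above can be sketched in C as follows. This is a simplified illustration and not the alg_quant() routine: the function name is hypothetical, it uses double-precision floating point, and it treats every remaining pulse the same way, including the last one. Minimizing J for a candidate with non-negative correlation is done here by maximizing (X^T y)^2 / ||y||^2, which avoids square roots.

```c
#include <assert.h>

static double abs_d(double v) { return v < 0.0 ? -v : v; }

/* Hypothetical helper: fills y[0..n-1] with integers such that
 * sum(|y[j]|) == k, chosen greedily to minimize J = -X'y / ||y||. */
void pvq_search(const float *x, int n, int k, int *y)
{
    assert(n > 0 && k > 0);

    double sum = 0.0;
    for (int j = 0; j < n; j++)
        sum += abs_d(x[j]);

    /* Initial codeword: project |X| onto the pyramid of K-1 pulses.
     * The search works on magnitudes; signs are restored at the end. */
    int placed = 0;
    double xy = 0.0, yy = 0.0; /* running X'y and y'y */
    for (int j = 0; j < n; j++) {
        double ax = abs_d(x[j]);
        y[j] = (sum > 0.0) ? (int)((k - 1) * ax / sum) : 0;
        placed += y[j];
        xy += ax * y[j];
        yy += (double)y[j] * y[j];
    }

    /* Greedy search: add one pulse at a time where it maximizes
     * (X'y)^2 / (y'y), compared by cross-multiplication. */
    while (placed < k) {
        int best = 0;
        double best_num = -1.0, best_den = 1.0;
        for (int j = 0; j < n; j++) {
            double rxy = xy + abs_d(x[j]);
            double ryy = yy + 2.0 * y[j] + 1.0;
            double num = rxy * rxy;
            if (num * best_den > best_num * ryy) {
                best = j;
                best_num = num;
                best_den = ryy;
            }
        }
        xy += abs_d(x[best]);
        yy += 2.0 * y[best] + 1.0;
        y[best]++;
        placed++;
    }

    /* Restore the signs of the input. */
    for (int j = 0; j < n; j++)
        if (x[j] < 0.0f)
            y[j] = -y[j];
}
```

For example, with X = (0.6, -0.8) and K = 5, the projection gives (1, 2) on the magnitudes, and the two greedy steps then add one pulse to each position, yielding y = (2, -3).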

5.3.8.2. PVQ Encoding

The vector to encode, X, is converted into an index i such that 0 <= i < V(N,K) as follows. Let i = 0 and k = 0. Then, for j = (N - 1) down to 0, inclusive, do:

1.  If k > 0, set i = i + (V(N-j-1,k-1) + V(N-j,k-1))/2.

2.  Set k = k + abs(X[j]).

3.  If X[j] < 0, set i = i + (V(N-j-1,k) + V(N-j,k))/2.

The index i is then encoded using the procedure in Section 5.1.4 with ft = V(N,K).
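
The three steps above translate directly into C, as in the following sketch. The function names are hypothetical. V(N,K) is computed here with the recurrence V(N,K) = V(N-1,K) + V(N,K-1) + V(N-1,K-1), with V(N,0) = 1 and V(0,K) = 0 for K > 0. The plain recursion is exponential in cost and is only meant for small N and K, for which 32-bit arithmetic is sufficient.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Number of PVQ codevectors with n dimensions and k pulses.
 * Naive recursion for illustration only; suitable for small n and k. */
uint32_t pvq_v(int n, int k)
{
    assert(n >= 0 && k >= 0);
    if (k == 0)
        return 1;
    if (n == 0)
        return 0;
    return pvq_v(n - 1, k) + pvq_v(n, k - 1) + pvq_v(n - 1, k - 1);
}

/* Hypothetical helper: converts an integer codevector x[0..n-1] into
 * its index i, following steps 1 to 3 of the text. */
uint32_t pvq_index(const int *x, int n)
{
    uint32_t i = 0;
    int k = 0;
    for (int j = n - 1; j >= 0; j--) {
        /* Step 1: skip over codevectors ordered before this position. */
        if (k > 0)
            i += (pvq_v(n - j - 1, k - 1) + pvq_v(n - j, k - 1)) / 2;
        /* Step 2: accumulate the pulses seen so far. */
        k += abs(x[j]);
        /* Step 3: account for a negative sign at this position. */
        if (x[j] < 0)
            i += (pvq_v(n - j - 1, k) + pvq_v(n - j, k)) / 2;
    }
    return i;
}
```

With N = 2 and K = 1, V(2,1) = 4 and the four codevectors (1,0), (0,1), (0,-1), and (-1,0) map to the indices 0, 1, 2, and 3, respectively.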