
Network Working Group                                        S. Andersen
Request for Comments: 3951                            Aalborg University
Category: Experimental                                          A. Duric
                                                                   Telio
                                                               H. Astrom
                                                                R. Hagen
                                                               W. Kleijn
                                                               J. Linden
                                                         Global IP Sound
                                                           December 2004

                  Internet Low Bit Rate Codec (iLBC)

Status of this Memo

This memo defines an Experimental Protocol for the Internet community. It does not specify an Internet standard of any kind. Discussion and suggestions for improvement are requested. Distribution of this memo is unlimited.

Copyright Notice

Copyright (C) The Internet Society (2004).

Abstract

This document specifies a speech codec suitable for robust voice communication over IP. The codec is developed by Global IP Sound (GIPS). It is designed for narrow band speech and results in a payload bit rate of 13.33 kbit/s for 30 ms frames and 15.20 kbit/s for 20 ms frames. The codec enables graceful speech quality degradation in the case of lost frames, which occurs in connection with lost or delayed IP packets.

Table of Contents

   1. Introduction
   2. Outline of the Codec
      2.1. Encoder
      2.2. Decoder
   3. Encoder Principles
      3.1. Pre-processing
      3.2. LPC Analysis and Quantization
           3.2.1. Computation of Autocorrelation Coefficients
           3.2.2. Computation of LPC Coefficients
           3.2.3. Computation of LSF Coefficients from LPC Coefficients
           3.2.4. Quantization of LSF Coefficients
           3.2.5. Stability Check of LSF Coefficients
           3.2.6. Interpolation of LSF Coefficients
           3.2.7. LPC Analysis and Quantization for 20 ms Frames
      3.3. Calculation of the Residual
      3.4. Perceptual Weighting Filter
      3.5. Start State Encoder
           3.5.1. Start State Estimation
           3.5.2. All-Pass Filtering and Scale Quantization
           3.5.3. Scalar Quantization
      3.6. Encoding the Remaining Samples
           3.6.1. Codebook Memory
           3.6.2. Perceptual Weighting of Codebook Memory and Target
           3.6.3. Codebook Creation
                  3.6.3.1. Creation of a Base Codebook
                  3.6.3.2. Codebook Expansion
                  3.6.3.3. Codebook Augmentation
           3.6.4. Codebook Search
                  3.6.4.1. Codebook Search at Each Stage
                  3.6.4.2. Gain Quantization at Each Stage
                  3.6.4.3. Preparation of Target for Next Stage
      3.7. Gain Correction Encoding
      3.8. Bitstream Definition
   4. Decoder Principles
      4.1. LPC Filter Reconstruction
      4.2. Start State Reconstruction
      4.3. Excitation Decoding Loop
      4.4. Multistage Adaptive Codebook Decoding
           4.4.1. Construction of the Decoded Excitation Signal
      4.5. Packet Loss Concealment
           4.5.1. Block Received Correctly and Previous Block Also Received
           4.5.2. Block Not Received

           4.5.3. Block Received Correctly When Previous Block Not Received
      4.6. Enhancement
           4.6.1. Estimating the Pitch
           4.6.2. Determination of the Pitch-Synchronous Sequences
           4.6.3. Calculation of the Smoothed Excitation
           4.6.4. Enhancer Criterion
           4.6.5. Enhancing the Excitation
      4.7. Synthesis Filtering
      4.8. Post Filtering
   5. Security Considerations
   6. Evaluation of the iLBC Implementations
   7. References
      7.1. Normative References
      7.2. Informative References
   8. ACKNOWLEDGEMENTS
   Appendix A. Reference Implementation
      A.1. iLBC_test.c
      A.2. iLBC_encode.h
      A.3. iLBC_encode.c
      A.4. iLBC_decode.h
      A.5. iLBC_decode.c
      A.6. iLBC_define.h
      A.7. constants.h
      A.8. constants.c
      A.9. anaFilter.h
      A.10. anaFilter.c
      A.11. createCB.h
      A.12. createCB.c
      A.13. doCPLC.h
      A.14. doCPLC.c
      A.15. enhancer.h
      A.16. enhancer.c
      A.17. filter.h
      A.18. filter.c
      A.19. FrameClassify.h
      A.20. FrameClassify.c
      A.21. gainquant.h
      A.22. gainquant.c
      A.23. getCBvec.h
      A.24. getCBvec.c
      A.25. helpfun.h
      A.26. helpfun.c
      A.27. hpInput.h
      A.28. hpInput.c
      A.29. hpOutput.h
      A.30. hpOutput.c

      A.31. iCBConstruct.h
      A.32. iCBConstruct.c
      A.33. iCBSearch.h
      A.34. iCBSearch.c
      A.35. LPCdecode.h
      A.36. LPCdecode.c
      A.37. LPCencode.h
      A.38. LPCencode.c
      A.39. lsf.h
      A.40. lsf.c
      A.41. packing.h
      A.42. packing.c
      A.43. StateConstructW.h
      A.44. StateConstructW.c
      A.45. StateSearchW.h
      A.46. StateSearchW.c
      A.47. syntFilter.h
      A.48. syntFilter.c
   Authors' Addresses
   Full Copyright Statement

1. Introduction

This document contains the description of an algorithm for the coding of speech signals sampled at 8 kHz. The algorithm, called iLBC, uses a block-independent linear-predictive coding (LPC) algorithm and has support for two basic frame lengths: 20 ms at 15.2 kbit/s and 30 ms at 13.33 kbit/s. When the codec operates at block lengths of 20 ms, it produces 304 bits per block, which SHOULD be packetized as in [1]. Similarly, for block lengths of 30 ms it produces 400 bits per block, which SHOULD be packetized as in [1].

The two modes for the different frame sizes operate in a very similar way. When they differ it is explicitly stated in the text, usually with the notation x/y, where x refers to the 20 ms mode and y refers to the 30 ms mode.

The described algorithm results in a speech coding system with a controlled response to packet losses similar to what is known from pulse code modulation (PCM) with packet loss concealment (PLC), such as the ITU-T G.711 standard [4], which operates at a fixed bit rate of 64 kbit/s. At the same time, the described algorithm enables fixed bit rate coding with a quality-versus-bit rate tradeoff close to state-of-the-art. A suitable RTP payload format for the iLBC codec is specified in [1].

Some of the applications for which this coder is suitable are real time communications such as telephony and videoconferencing, streaming audio, archival, and messaging.
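The bit rates quoted in this section follow directly from the number of bits per block and the frame length. The following C fragment is a minimal sketch of that arithmetic; the function names are illustrative and do not appear in the reference implementation.

```c
#include <assert.h>

/* Payload bit rate in bit/s for one iLBC mode.
 * bits_per_block: 304 (20 ms mode) or 400 (30 ms mode), as stated above.
 * frame_ms:       frame length in milliseconds (20 or 30).
 * Hypothetical helper, not part of the reference code. */
double ilbc_payload_bitrate(int bits_per_block, int frame_ms)
{
    assert(bits_per_block > 0 && frame_ms > 0);
    return (double)bits_per_block * 1000.0 / (double)frame_ms;
}

/* Number of whole bytes needed to carry one block. */
int ilbc_payload_bytes(int bits_per_block)
{
    assert(bits_per_block > 0);
    return (bits_per_block + 7) / 8;
}
```

With 304 bits every 20 ms the result is 15200 bit/s, and with 400 bits every 30 ms it is approximately 13333 bit/s, which matches the 15.2 and 13.33 kbit/s figures given above.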

Cable Television Laboratories (CableLabs(R)) has adopted iLBC as a mandatory PacketCable(TM) audio codec standard for VoIP over Cable applications [3].

This document is organized as follows. Section 2 gives a brief outline of the codec. The specific encoder and decoder algorithms are explained in sections 3 and 4, respectively. Appendix A provides a C code reference implementation.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14, RFC 2119 [2].

2. Outline of the Codec

The codec consists of an encoder and a decoder as described in sections 2.1 and 2.2, respectively.

The essence of the codec is LPC and block-based coding of the LPC residual signal. For each 160/240 (20 ms/30 ms) sample block, the following major steps are performed: A set of LPC filters is computed, and the speech signal is filtered through them to produce the residual signal. The codec uses scalar quantization of the dominant part, in terms of energy, of the residual signal for the block. The dominant state is of length 57/58 (20 ms/30 ms) samples and forms a start state for dynamic codebooks constructed from the already coded parts of the residual signal. These dynamic codebooks are used to code the remaining parts of the residual signal. By this method, coding independence between blocks is achieved, resulting in elimination of propagation of perceptual degradations due to packet loss. The method facilitates high-quality packet loss concealment (PLC).

2.1. Encoder

The input to the encoder SHOULD be 16-bit uniform PCM sampled at 8 kHz. It SHOULD be partitioned into blocks of BLOCKL=160/240 samples for the 20/30 ms frame size. Each block is divided into NSUB=4/6 consecutive sub-blocks of SUBL=40 samples each. For 30 ms frame size, the encoder performs two LPC_FILTERORDER=10 linear-predictive coding (LPC) analyses. The first analysis applies a smooth window centered over the second sub-block and extending to the middle of the fifth sub-block. The second LPC analysis applies a smooth asymmetric window centered over the fifth sub-block and extending to the end of the sixth sub-block. For 20 ms frame size, one LPC_FILTERORDER=10 linear-predictive coding (LPC) analysis is performed with a smooth window centered over the third sub-frame.

For each of the LPC analyses, a set of line-spectral frequencies (LSFs) are obtained, quantized, and interpolated to obtain LSF coefficients for each sub-block. Subsequently, the LPC residual is computed by using the quantized and interpolated LPC analysis filters.

The two consecutive sub-blocks of the residual exhibiting the maximal weighted energy are identified. Within these two sub-blocks, the start state (segment) is selected from two choices: the first 57/58 samples or the last 57/58 samples of the two consecutive sub-blocks. The selected segment is the one of higher energy. The start state is encoded with scalar quantization.

A dynamic codebook encoding procedure is used to encode 1) the 23/22 (20 ms/30 ms) remaining samples in the two sub-blocks containing the start state; 2) the sub-blocks after the start state in time; and 3) the sub-blocks before the start state in time. Thus, the encoding target can be either the 23/22 samples remaining of the two sub-blocks containing the start state or a 40-sample sub-block. This target can consist of samples indexed forward in time or backward in time, depending on the location of the start state.

The codebook coding is based on an adaptive codebook built from a codebook memory that contains decoded LPC excitation samples from the already encoded part of the block. These samples are indexed in the same time direction as the target vector, ending at the sample instant prior to the first sample instant represented in the target vector. The codebook is used in CB_NSTAGES=3 stages in a successive refinement approach, and the resulting three code vector gains are encoded with 5-, 4-, and 3-bit scalar quantization, respectively.

The codebook search method employs noise shaping derived from the LPC filters, and the main decision criterion is to minimize the squared error between the target vector and the code vectors. Each code vector in this codebook comes from one of CB_EXPAND=2 codebook sections. The first section is filled with delayed, already encoded residual vectors. The code vectors of the second codebook section are constructed by predefined linear combinations of vectors in the first section of the codebook.

As codebook encoding with squared-error matching is known to produce a coded signal of less power than does the scalar quantized start state signal, a gain re-scaling method is implemented by a refined search for a better set of codebook gains in terms of power matching after encoding. This is done by searching for a higher value of the gain factor for the first stage codebook, as the subsequent stage codebook gains are scaled by the first stage gain.
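The x/y parameters introduced in this section can be collected per mode. The C fragment below is a minimal sketch of such a table; the struct ilbc_mode, its field names, and ilbc_mode_init are hypothetical and are not taken from Appendix A. Only the numeric values come from the text above.

```c
#include <assert.h>
#include <stddef.h>

#define SUBL            40  /* samples per sub-block (both modes) */
#define LPC_FILTERORDER 10  /* LPC order (both modes) */
#define CB_NSTAGES       3  /* codebook refinement stages */
#define CB_EXPAND        2  /* codebook sections */

/* Hypothetical container for the mode-dependent parameters. */
typedef struct {
    int blockl;          /* BLOCKL: samples per block, 160/240 */
    int nsub;            /* NSUB: sub-blocks per block, 4/6 */
    int lpc_n;           /* LPC analyses per block, 1/2 */
    int state_short_len; /* start state length, 57/58 */
    int state_rest;      /* samples left in the two start sub-blocks */
} ilbc_mode;

/* Fills *m for a 20 or 30 ms frame size.
 * Returns 0 on success and -1 for any other frame size. */
int ilbc_mode_init(ilbc_mode *m, int frame_ms)
{
    assert(m != NULL);
    if (frame_ms == 20) {
        m->blockl = 160;
        m->nsub = 4;
        m->lpc_n = 1;
        m->state_short_len = 57;
    } else if (frame_ms == 30) {
        m->blockl = 240;
        m->nsub = 6;
        m->lpc_n = 2;
        m->state_short_len = 58;
    } else {
        return -1;
    }
    /* Two sub-blocks hold the start state plus the 23/22 remaining samples. */
    m->state_rest = 2 * SUBL - m->state_short_len;
    return 0;
}
```

The derived field state_rest shows where the 23/22 figure comes from: the two sub-blocks containing the start state hold 80 samples, of which 57/58 belong to the start state.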

2.2. Decoder

Typically for packet communications, a jitter buffer placed at the receiving end decides whether the packet containing an encoded signal block has been received or lost. This logic is not part of the codec described here. For each encoded signal block received, the decoder performs a decoding operation. For each lost signal block, the decoder performs a PLC operation.

The decoding for each block starts by decoding and interpolating the LPC coefficients. Subsequently the start state is decoded.

For codebook-encoded segments, each segment is decoded by constructing the three code vectors given by the received codebook indices in the same way that the code vectors were constructed in the encoder. The three gain factors are also decoded, and the resulting decoded signal is given by the sum of the three codebook vectors, each scaled by its respective gain.

An enhancement algorithm is applied to the reconstructed excitation signal. This enhancement augments the periodicity of voiced speech regions. The enhancement is optimized under the constraint that the modification signal (defined as the difference between the enhanced excitation and the excitation signal prior to enhancement) has a short-time energy that does not exceed a preset fraction of the short-time energy of the excitation signal prior to enhancement.

A packet loss concealment (PLC) operation is easily embedded in the decoder. The PLC operation can, e.g., be based on repeating LPC filters and obtaining the LPC residual signal by using a long-term prediction estimate from previous residual blocks.

3. Encoder Principles

The following block diagram is an overview of all the components of the iLBC encoding procedure. The description of the blocks contains references to the section where that particular procedure is further described.

              +-----------+    +---------+    +---------+
   speech ->  | 1. Pre P  | -> | 2. LPC  | -> | 3. Ana  | ->
              +-----------+    +---------+    +---------+

              +---------------+   +--------------+
           -> | 4. Start Sel  | ->| 5. Scalar Qu | ->
              +---------------+   +--------------+

              +--------------+    +---------------+
           -> |6. CB Search  | -> | 7. Packetize  | -> payload
           |  +--------------+ |  +---------------+
           ----<---------<------
        sub-frame 0..2/4 (20 ms/30 ms)

   Figure 3.1. Flow chart of the iLBC encoder

   1. Pre-process speech with a HP filter, if needed (section 3.1).

   2. Compute LPC parameters, quantize, and interpolate (section 3.2).

   3. Use analysis filters on speech to compute residual (section 3.3).

   4. Select position of 57/58-sample start state (section 3.5).

   5. Quantize the 57/58-sample start state with scalar quantization (section 3.5).

   6. Search the codebook for each sub-frame. Start with 23/22 sample block, then encode sub-blocks forward in time, and then encode sub-blocks backward in time. For each block, the steps in Figure 3.4 are performed (section 3.6).

   7. Packetize the bits into the payload specified in Table 3.2.

The input to the encoder SHOULD be 16-bit uniform PCM sampled at 8 kHz. Also it SHOULD be partitioned into blocks of BLOCKL=160/240 samples. Each block input to the encoder is divided into NSUB=4/6 consecutive sub-blocks of SUBL=40 samples each.

   0         39        79        119       159
   +---------------------------------------+
   |    1    |    2    |    3    |    4    |
   +---------------------------------------+
                  20 ms frame

   0         39        79        119       159       199       239
   +-----------------------------------------------------------+
   |    1    |    2    |    3    |    4    |    5    |    6    |
   +-----------------------------------------------------------+
                            30 ms frame

   Figure 3.2. One input block to the encoder for 20 ms (with four sub-frames) and 30 ms (with six sub-frames).

3.1. Pre-processing

In some applications, the recorded speech signal contains DC level and/or 50/60 Hz noise. If these components have not been removed prior to the encoder call, they should be removed by a high-pass filter. A reference implementation of this, using a filter with a cutoff frequency of 90 Hz, can be found in Appendix A.28.

3.2. LPC Analysis and Quantization

The input to the LPC analysis module is a possibly high-pass filtered speech buffer, speech_hp, that contains 240/300 (LPC_LOOKBACK + BLOCKL = 80/60 + 160/240 = 240/300) speech samples, where samples 0 through 79/59 are from the previous block and samples 80/60 through 239/299 are from the current block. No look-ahead into the next block is used. For the very first block processed, the look-back samples are assumed to be zeros.

For each input block, the LPC analysis calculates one/two set(s) of LPC_FILTERORDER=10 LPC filter coefficients using the autocorrelation method and the Levinson-Durbin recursion. These coefficients are converted to the Line Spectral Frequency (LSF) representation. In the 20 ms case, the single LSF set represents the spectral characteristics as measured at the center of the third sub-block. For 30 ms frames, the first set, lsf1, represents the spectral properties of the input signal at the center of the second sub-block, and the other set, lsf2, represents the spectral characteristics as measured at the center of the fifth sub-block. The details of the computation for 30 ms frames are described in sections 3.2.1 through 3.2.6. Section 3.2.7 explains how the LPC Analysis and Quantization differs for 20 ms frames.
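The autocorrelation method and the Levinson-Durbin recursion named above can be sketched in C as follows. This is a generic textbook formulation under the sign convention A(z) = 1 + sum of a(i)*z^(-i); it is not the reference implementation, and the function names and the use of double precision are assumptions made for illustration. The windowing and bandwidth expansion specified in the following sections are not included.

```c
#include <assert.h>

#define LPC_FILTERORDER 10

/* r[lag] = sum over i of x[i]*x[i+lag], for lag = 0..order. */
void autocorr(const double *x, int n, int order, double *r)
{
    int lag, i;

    assert(n > order);
    for (lag = 0; lag <= order; lag++) {
        double sum = 0.0;
        for (i = 0; i < n - lag; i++)
            sum += x[i] * x[i + lag];
        r[lag] = sum;
    }
}

/* Solves for a[0..order] with a[0] = 1 from autocorrelations r[0..order].
 * Returns 0 on success, -1 if the prediction error becomes non-positive. */
int levinson_durbin(const double *r, double *a, int order)
{
    double tmp[LPC_FILTERORDER + 1];
    double err, acc, k;
    int m, i;

    assert(order >= 1 && order <= LPC_FILTERORDER);
    a[0] = 1.0;
    for (i = 1; i <= order; i++)
        a[i] = 0.0;
    err = r[0];
    if (err <= 0.0)
        return -1;

    for (m = 1; m <= order; m++) {
        /* Reflection coefficient for model order m. */
        acc = r[m];
        for (i = 1; i < m; i++)
            acc += a[i] * r[m - i];
        k = -acc / err;

        /* Update the lower-order coefficients, then append the new one. */
        for (i = 1; i < m; i++)
            tmp[i] = a[i] + k * a[m - i];
        for (i = 1; i < m; i++)
            a[i] = tmp[i];
        a[m] = k;

        err *= (1.0 - k * k);
        if (err <= 0.0)
            return -1;
    }
    return 0;
}
```

For an input whose autocorrelation decays as 0.5^lag, the recursion yields a(1) = -0.5 and zero for the higher coefficients, which is the expected first-order predictor.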

3.2.1. Computation of Autocorrelation Coefficients

The first step in the LPC analysis procedure is to calculate autocorrelation coefficients by using windowed speech samples. This windowing is the only difference in the LPC analysis procedure for the two sets of coefficients. For the first set, a 240-sample-long standard symmetric Hanning window is applied to samples 0 through 239 of the input data. The first window, lpc_winTbl, is defined as

   lpc_winTbl[i] = 0.5 * (1.0 - cos((2*PI*(i+1))/(BLOCKL+1)));
                   i=0,...,119
   lpc_winTbl[i] = lpc_winTbl[BLOCKL - i - 1]; i=120,...,239

The windowed speech speech_hp_win1 is then obtained by multiplying the first 240 samples of the input speech buffer with the window coefficients:

   speech_hp_win1[i] = speech_hp[i] * lpc_winTbl[i];
                   i=0,...,BLOCKL-1

From these 240 windowed speech samples, 11 (LPC_FILTERORDER + 1) autocorrelation coefficients, acf1, are calculated:

   acf1[lag] += speech_hp_win1[n] * speech_hp_win1[n + lag];
                   lag=0,...,LPC_FILTERORDER; n=0,...,BLOCKL-lag-1

In order to make the analysis more robust against numerical precision problems, a spectral smoothing procedure is applied by windowing the autocorrelation coefficients before the LPC coefficients are computed. Also, a white noise floor is added to the autocorrelation function by multiplying coefficient zero by 1.0001 (40 dB below the energy of the windowed speech signal). These two steps are implemented by multiplying the autocorrelation coefficients with the following window:

   lpc_lagwinTbl[0] = 1.0001;
   lpc_lagwinTbl[i] = exp(-0.5 * ((2 * PI * 60.0 * i) /FS)^2);
                   i=1,...,LPC_FILTERORDER
                   where FS=8000 is the sampling frequency

Then, the windowed acf function acf1_win is obtained by

   acf1_win[i] = acf1[i] * lpc_lagwinTbl[i];
                   i=0,...,LPC_FILTERORDER

The second set of autocorrelation coefficients, acf2_win, is obtained in a similar manner. The window, lpc_asymwinTbl, is applied to samples 60 through 299, i.e., the entire current block. The window consists of two segments, the first (samples 0 to 219) being half a Hanning window with length 440 and the second a quarter of a cycle of a cosine wave. By using this asymmetric window, an LPC analysis centered in the fifth sub-block is obtained without the need for any look-ahead, which would add delay. The asymmetric window is defined as

   lpc_asymwinTbl[i] = (sin(PI * (i + 1) / 441))^2; i=0,...,219

   lpc_asymwinTbl[i] = cos((i - 220) * PI / 40); i=220,...,239

and the windowed speech is computed by

   speech_hp_win2[i] = speech_hp[i + LPC_LOOKBACK] *
                   lpc_asymwinTbl[i];  i=0,...,BLOCKL-1

The windowed autocorrelation coefficients are then obtained in exactly the same way as for the first analysis instance.

The generation of the windows lpc_winTbl, lpc_asymwinTbl, and lpc_lagwinTbl is typically done in advance, and the arrays are stored in ROM rather than repeating the calculation for every block.

3.2.2. Computation of LPC Coefficients

From the 2 x 11 smoothed autocorrelation coefficients, acf1_win and acf2_win, the 2 x 11 LPC coefficients, lp1 and lp2, are calculated in the same way for both analysis locations by using the well known Levinson-Durbin recursion. The first LPC coefficient is always 1.0, resulting in ten unique coefficients.

After determining the LPC coefficients, a bandwidth expansion procedure is applied to smooth the spectral peaks in the short-term spectrum. The bandwidth expansion is obtained by the following modification of the LPC coefficients:

   lp1_bw[i] = lp1[i] * chirp^i; i=0,...,LPC_FILTERORDER
   lp2_bw[i] = lp2[i] * chirp^i; i=0,...,LPC_FILTERORDER

where "chirp" is a real number between 0 and 1. It is RECOMMENDED to use a value of 0.9.

3.2.3. Computation of LSF Coefficients from LPC Coefficients

Thus far, two sets of LPC coefficients that represent the short-term spectral characteristics of the speech signal for two different time locations within the current block have been determined. These coefficients SHOULD be quantized and interpolated. Before this is done, it is advantageous to convert the LPC parameters into another type of representation called Line Spectral Frequencies (LSF). The LSF parameters are used because they are better suited for quantization and interpolation than the regular LPC coefficients. Many computationally efficient methods for calculating the LSFs from the LPC coefficients have been proposed in the literature. The detailed implementation of one applicable method can be found in Appendix A.26. The two arrays of LSF coefficients obtained, lsf1 and lsf2, are of dimension 10 (LPC_FILTERORDER).

3.2.4. Quantization of LSF Coefficients

Because the LPC filters defined by the two sets of LSFs are also needed in the decoder, the LSF parameters need to be quantized and transmitted as side information. The total number of bits required to represent the quantization of the two LSF representations for one block of speech is 40, with 20 bits used for each of lsf1 and lsf2.

For computational and storage reasons, the LSF vectors are quantized using three-split vector quantization (VQ). That is, the LSF vectors are split into three sub-vectors that are each quantized with a regular VQ. The quantized versions of lsf1 and lsf2, qlsf1 and qlsf2, are obtained by using the same memoryless split VQ. The length of each of these two LSF vectors is 10, and they are split into three sub-vectors containing 3, 3, and 4 values, respectively.

For each of the sub-vectors, a separate codebook of quantized values has been designed with a standard VQ training method for a large database containing speech from a large number of speakers recorded under various conditions. The size of each of the three codebooks associated with the split definitions above is

   int size_lsfCbTbl[LSF_NSPLIT] = {64,128,128};

The actual values of the vector quantization codebook that must be used can be found in the reference code of Appendix A.
Both sets of LSF coefficients, lsf1 and lsf2, are quantized with a standard memoryless split vector quantization (VQ) structure using the squared error criterion in the LSF domain. The split VQ quantization consists of the following steps:

   1) Quantize the first three LSF coefficients (1 - 3) with a VQ codebook of size 64.

   2) Quantize the next three LSF coefficients (4 - 6) with a VQ codebook of size 128.

   3) Quantize the last four LSF coefficients (7 - 10) with a VQ codebook of size 128.
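The three steps above can be sketched as a nearest-neighbour search applied to each sub-vector in turn. The C fragment below is an illustration only: the function names, the row-major codebook layout, and the use of double precision are assumptions, and the actual codebook values from Appendix A are not reproduced here.

```c
#include <assert.h>
#include <float.h>

#define LSF_NSPLIT 3

/* Returns the index of the entry in cb (n_entries rows of dim values,
 * stored row by row) with the smallest squared error to x, and copies
 * that entry to xq. */
int vq_search(const double *cb, int n_entries, int dim,
              const double *x, double *xq)
{
    int best = 0, i, j;
    double best_dist = DBL_MAX;

    assert(n_entries > 0 && dim > 0);
    for (j = 0; j < n_entries; j++) {
        double dist = 0.0;
        for (i = 0; i < dim; i++) {
            double d = x[i] - cb[j * dim + i];
            dist += d * d;
        }
        if (dist < best_dist) {
            best_dist = dist;
            best = j;
        }
    }
    for (i = 0; i < dim; i++)
        xq[i] = cb[best * dim + i];
    return best;
}

/* Quantizes x one sub-vector at a time; index[s] receives the codebook
 * index chosen for split s and xq the quantized vector. */
void split_vq(const double *cb[], const int cb_size[], const int dim[],
              const double *x, double *xq, int index[])
{
    int s, pos = 0;

    for (s = 0; s < LSF_NSPLIT; s++) {
        index[s] = vq_search(cb[s], cb_size[s], dim[s], x + pos, xq + pos);
        pos += dim[s];
    }
}
```

For iLBC the call would use dim = {3, 3, 4} and cb_size = {64, 128, 128}, so that the three indices fit in 6, 7, and 7 bits.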

This procedure, repeated for lsf1 and lsf2, gives six quantization indices and the quantized sets of LSF coefficients qlsf1 and qlsf2. Each set of three indices is encoded with 6 + 7 + 7 = 20 bits. The total number of bits used for LSF quantization in a block is thus 40 bits.

3.2.5. Stability Check of LSF Coefficients

The LSF representation of the LPC filter has the convenient property that the coefficients are ordered by increasing value, i.e., lsf(n-1) < lsf(n), 0 < n < 10, if the corresponding synthesis filter is stable. As we are employing a split VQ scheme, it is possible that at the split boundaries the LSF coefficients are not ordered correctly and hence that the corresponding LP filter is unstable. To ensure that the filter used is stable, a stability check is performed for the quantized LSF vectors. If it turns out that the coefficients are not ordered appropriately (with a safety margin of 50 Hz to ensure that formant peaks are not too narrow), they will be moved apart. The detailed method for this can be found in Appendix A.40. The same procedure is performed in the decoder. This ensures that exactly the same LSF representations are used in both encoder and decoder.

3.2.6. Interpolation of LSF Coefficients

From the two sets of LSF coefficients that are computed for each block of speech, different LSFs are obtained for each sub-block by means of interpolation. This procedure is performed for the original LSFs (lsf1 and lsf2), as well as the quantized versions qlsf1 and qlsf2, as both versions are used in the encoder. Here follows a brief summary of the interpolation scheme; the details are found in the C code of Appendix A. In the first sub-block, the average of the second LSF vector from the previous block and the first LSF vector in the current block is used. For sub-blocks two through five, the LSFs used are obtained by linear interpolation from lsf1 (and qlsf1) to lsf2 (and qlsf2), with lsf1 used in sub-block two and lsf2 in sub-block five. In the last sub-block, lsf2 is used. For the very first block it is assumed that the last LSF vector of the previous block is equal to a predefined vector, lsfmeanTbl, obtained by calculating the mean LSF vector of the LSF design database.

   lsfmeanTbl[LPC_FILTERORDER] = {0.281738, 0.445801, 0.663330,
                   0.962524, 1.251831, 1.533081, 1.850586, 2.137817,
                   2.481445, 2.777344}
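The interpolation schedule summarized above for 30 ms frames can be sketched as follows. This is an illustration derived only from the prose description; the function name, the NSUB_30MS constant, and the argument layout are hypothetical, and Appendix A remains the authoritative source for the details.

```c
#include <assert.h>
#include <stddef.h>

#define LPC_FILTERORDER 10
#define NSUB_30MS        6

/* lsf_prev: second LSF vector of the previous block (lsfmeanTbl for the
 *           very first block).
 * lsf1, lsf2: the two LSF vectors of the current block.
 * out[k]:   interpolated LSF vector for sub-block k+1. */
void lsf_interpolate_30ms(const double *lsf_prev, const double *lsf1,
                          const double *lsf2,
                          double out[NSUB_30MS][LPC_FILTERORDER])
{
    int k, i;

    assert(lsf_prev != NULL && lsf1 != NULL && lsf2 != NULL);

    /* Sub-block 1: average of previous lsf2 and current lsf1. */
    for (i = 0; i < LPC_FILTERORDER; i++)
        out[0][i] = 0.5 * (lsf_prev[i] + lsf1[i]);

    /* Sub-blocks 2..5: linear interpolation from lsf1 to lsf2. */
    for (k = 1; k <= 4; k++) {
        double w2 = (double)(k - 1) / 3.0; /* weight of lsf2: 0, 1/3, 2/3, 1 */
        for (i = 0; i < LPC_FILTERORDER; i++)
            out[k][i] = (1.0 - w2) * lsf1[i] + w2 * lsf2[i];
    }

    /* Sub-block 6: lsf2 unchanged. */
    for (i = 0; i < LPC_FILTERORDER; i++)
        out[5][i] = lsf2[i];
}
```

The same routine would be called once with the unquantized vectors and once with qlsf1 and qlsf2, since both versions are used in the encoder.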

The interpolation method is standard linear interpolation in the LSF domain. The interpolated LSF values are converted to LPC coefficients for each sub-block. The unquantized and quantized LPC coefficients form two sets of filters respectively. The unquantized analysis filter for sub-block k is defined as follows

                 ___
                 \
   Ak(z)= 1 +     > ak(i)*z^(-i)
                 /__
            i=1...LPC_FILTERORDER

The quantized analysis filter for sub-block k is defined as follows

                  ___
                  \
   A~k(z)= 1 +     > a~k(i)*z^(-i)
                  /__
            i=1...LPC_FILTERORDER

A reference implementation of the LSF encoding is given in Appendix A.38. A reference implementation of the corresponding decoding can be found in Appendix A.36.

3.2.7. LPC Analysis and Quantization for 20 ms Frames

As previously stated, the codec only calculates one set of LPC parameters for the 20 ms frame size as opposed to two sets for 30 ms frames. A single set of autocorrelation coefficients is calculated on the LPC_LOOKBACK + BLOCKL = 80 + 160 = 240 samples. These samples are windowed with the asymmetric window lpc_asymwinTbl, centered over the third sub-frame, to form speech_hp_win. Autocorrelation coefficients, acf, are calculated on the 240 samples in speech_hp_win and then windowed exactly as in section 3.2.1 (resulting in acf_win).

This single set of windowed autocorrelation coefficients is used to calculate LPC coefficients, LSF coefficients, and quantized LSF coefficients in exactly the same manner as in sections 3.2.3 through 3.2.4. As for the 30 ms frame size, the ten LSF coefficients are divided into three sub-vectors of size 3, 3, and 4 and quantized by using the same scheme and codebook as in section 3.2.4 to finally get 3 quantization indices. The quantized LSF coefficients are stabilized with the algorithm described in section 3.2.5.

From the set of LSF coefficients computed for this block and those from the previous block, different LSFs are obtained for each sub-block by means of interpolation. The interpolation is done linearly in the LSF domain over the four sub-blocks, so that the n-th sub-frame applies the weight (4-n)/4 to the LSF from the previous frame and the weight n/4 to the LSF from the current frame. For the very first block the mean LSF, lsfmeanTbl, is used as the LSF from the previous block. As in section 3.2.6, both unquantized, A(z), and quantized, A~(z), analysis filters are calculated for each of the four sub-blocks.

3.3. Calculation of the Residual

The block of speech samples is filtered by the quantized and interpolated LPC analysis filters to yield the residual signal. In particular, the corresponding LPC analysis filter for each 40 sample sub-block is used to filter the speech samples for the same sub-block. The filter memory at the end of each sub-block is carried over to the LPC filter of the next sub-block. The signal at the output of each LP analysis filter constitutes the residual signal for the corresponding sub-block.

A reference implementation of the LPC analysis filters is given in Appendix A.10.

3.4. Perceptual Weighting Filter

In principle any good design of a perceptual weighting filter can be applied in the encoder without compromising this codec definition. However, it is RECOMMENDED to use the perceptual weighting filter Wk for sub-block k specified below:

   Wk(z)=1/Ak(z/LPC_CHIRP_WEIGHTDENUM), where
                            LPC_CHIRP_WEIGHTDENUM = 0.4222

This is a simple design with low complexity that is applied in the LPC residual domain. Here Ak(z) is the filter obtained for sub-block k from unquantized but interpolated LSF coefficients.

3.5. Start State Encoder

The start state is quantized by using a common 6-bit scalar quantizer for the block and a 3-bit scalar quantizer operating on scaled samples in the weighted speech domain. In the following we describe the state encoding in greater detail.

3.5.1. Start State Estimation

The two sub-blocks containing the start state are determined by finding the two consecutive sub-blocks in the block having the highest power. Advantageously, down-weighting is used in the beginning and end of the sub-frames, i.e., the following measure is computed (NSUB=4/6 for 20/30 ms frame size):

      nsub=1,...,NSUB-1
      ssqn[nsub] = 0.0;
      for (i=(nsub-1)*SUBL; i<(nsub-1)*SUBL+5; i++)
               ssqn[nsub] += sampEn_win[i-(nsub-1)*SUBL]*
                                 residual[i]*residual[i];
      for (i=(nsub-1)*SUBL+5; i<(nsub+1)*SUBL-5; i++)
               ssqn[nsub] += residual[i]*residual[i];
      for (i=(nsub+1)*SUBL-5; i<(nsub+1)*SUBL; i++)
               ssqn[nsub] += sampEn_win[(nsub+1)*SUBL-i-1]*
                                 residual[i]*residual[i];

where sampEn_win[5]={1/6, 2/6, 3/6, 4/6, 5/6} MAY be used. The sub-frame number corresponding to the maximum value of ssqEn_win[nsub-1]*ssqn[nsub] is selected as the start state indicator. A weighting of ssqEn_win[]={0.8,0.9,1.0,0.9,0.8} for 30 ms frames and ssqEn_win[]={0.9,1.0,0.9} for 20 ms frames MAY advantageously be used to bias the start state towards the middle of the frame.

For 20 ms frames there are three possible positions for the two-sub-block length maximum power segment; the start state position is encoded with 2 bits. The start state position, start, MUST be encoded as

      start=1: start state in sub-frame 0 and 1
      start=2: start state in sub-frame 1 and 2
      start=3: start state in sub-frame 2 and 3

For 30 ms frames there are five possible positions of the two-sub-block length maximum power segment; the start state position is encoded with 3 bits. The start state position, start, MUST be encoded as

      start=1: start state in sub-frame 0 and 1
      start=2: start state in sub-frame 1 and 2
      start=3: start state in sub-frame 2 and 3
      start=4: start state in sub-frame 3 and 4
      start=5: start state in sub-frame 4 and 5
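The search described above can be sketched as a single function that returns the start value directly. This is an illustration and not the reference code. The function name find_start_state and its arguments are hypothetical, the three loops of the measure are folded into one loop over a local index, and the optional windows sampEn_win and ssqEn_win are assumed to be in use.

```c
#include <assert.h>

#define SUBL 40   /* samples per sub-frame */

static const float sampEn_win[5] =
    {1.0f / 6, 2.0f / 6, 3.0f / 6, 4.0f / 6, 5.0f / 6};
static const float ssqEn_win_30ms[5] = {0.8f, 0.9f, 1.0f, 0.9f, 0.8f};
static const float ssqEn_win_20ms[3] = {0.9f, 1.0f, 0.9f};

/* residual: nsub_total*SUBL samples; nsub_total is 4 (20 ms) or 6 (30 ms).
   Returns start in 1..nsub_total-1; start=n selects sub-frames n-1 and n. */
int find_start_state(const float *residual, int nsub_total)
{
    assert(nsub_total == 4 || nsub_total == 6);
    const float *ssqEn_win =
        (nsub_total == 6) ? ssqEn_win_30ms : ssqEn_win_20ms;
    int best = 1;
    float best_val = -1.0f;

    for (int nsub = 1; nsub < nsub_total; nsub++) {
        const float *seg = residual + (nsub - 1) * SUBL;
        float ssqn = 0.0f;
        for (int j = 0; j < 2 * SUBL; j++) {
            float w = 1.0f;
            if (j < 5)
                w = sampEn_win[j];                 /* ramp up   */
            else if (j >= 2 * SUBL - 5)
                w = sampEn_win[2 * SUBL - 1 - j];  /* ramp down */
            ssqn += w * seg[j] * seg[j];
        }
        float val = ssqEn_win[nsub - 1] * ssqn;
        if (val > best_val) {
            best_val = val;
            best = nsub;
        }
    }
    return best;
}
```

For a 30 ms residual whose energy lies entirely in sub-frames 3 and 4, the weighted measure is largest for nsub=4, so the sketch returns start=4.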

Hence, in both cases, index 0 is not used. In order to shorten the start state for bit rate efficiency, the start state is brought down to STATE_SHORT_LEN=57 samples for 20 ms frames and STATE_SHORT_LEN=58 samples for 30 ms frames. The power of the first 23/22 and last 23/22 samples of the two sub-frame blocks identified above is computed as the sum of the squared signal sample values, and the 23/22-sample segment with the lowest power is excluded from the start state. One bit is transmitted to indicate which of the two possible 57/58 sample segments is used. The start state position within the two sub-frames determined above, state_first, MUST be encoded as

      state_first=1: start state is first STATE_SHORT_LEN samples
      state_first=0: start state is last STATE_SHORT_LEN samples

3.5.2. All-Pass Filtering and Scale Quantization

The block of residual samples in the start state is first filtered by an all-pass filter with the quantized LPC coefficients as denominator and reversed quantized LPC coefficients as numerator. The purpose of this phase-dispersion filter is to get a more even distribution of the sample values in the residual signal. The filtering is performed by circular convolution, where the initial filter memory is set to zero.

      res(0..(STATE_SHORT_LEN-1))   = uncoded start state residual
      res((STATE_SHORT_LEN)..(2*STATE_SHORT_LEN-1)) = 0

      Pk(z) = A~rk(z)/A~k(z), where
                                   ___
                                   \
      A~rk(z)= z^(-LPC_FILTERORDER)+>a~k(i+1)*z^(i-(LPC_FILTERORDER-1))
                                   /__
                               i=0...(LPC_FILTERORDER-1)

      and A~k(z) is taken from the block where the start state begins

      res -> Pk(z) -> filtered

      ccres(k) = filtered(k) + filtered(k+STATE_SHORT_LEN),
                                        k=0..(STATE_SHORT_LEN-1)

The all-pass filtered block is searched for its largest magnitude sample. The base-10 logarithm of this magnitude is quantized with a 6-bit quantizer, state_frgqTbl, by finding the nearest representation.

This results in an index, idxForMax, corresponding to a quantized value, qmax. The all-pass filtered residual samples in the block are then multiplied with a scaling factor scal=4.5/(10^qmax) to yield normalized samples.

      state_frgqTbl[64] = {1.000085, 1.071695, 1.140395, 1.206868,
                    1.277188, 1.351503, 1.429380, 1.500727, 1.569049,
                    1.639599, 1.707071, 1.781531, 1.840799, 1.901550,
                    1.956695, 2.006750, 2.055474, 2.102787, 2.142819,
                    2.183592, 2.217962, 2.257177, 2.295739, 2.332967,
                    2.369248, 2.402792, 2.435080, 2.468598, 2.503394,
                    2.539284, 2.572944, 2.605036, 2.636331, 2.668939,
                    2.698780, 2.729101, 2.759786, 2.789834, 2.818679,
                    2.848074, 2.877470, 2.906899, 2.936655, 2.967804,
                    3.000115, 3.033367, 3.066355, 3.104231, 3.141499,
                    3.183012, 3.222952, 3.265433, 3.308441, 3.350823,
                    3.395275, 3.442793, 3.490801, 3.542514, 3.604064,
                    3.666050, 3.740994, 3.830749, 3.938770, 4.101764}

3.5.3. Scalar Quantization

The normalized samples are quantized in the perceptually weighted speech domain by a sample-by-sample scalar DPCM quantization as depicted in Figure 3.3. Each sample in the block is filtered by a weighting filter Wk(z), specified in section 3.4, to form a weighted speech sample x[n]. The target sample d[n] is formed by subtracting a predicted sample y[n], where the prediction filter is given by

      Pk(z) = 1 - 1 / Wk(z).

             +-------+  x[n] +    d[n] +-----------+ u[n]
residual  -->| Wk(z) |-------->(+)---->| Quantizer |------> quantized
             +-------+       - /|\     +-----------+    |   residual
                                |                      \|/
                           y[n] +--------------------->(+)
                                |                       |
                                |        +------+       |
                                +--------| Pk(z)|<------+
                                         +------+

Figure 3.3. Quantization of start state samples by DPCM in weighted speech domain.

The coded state sample u[n] is obtained by quantizing d[n] with a 3-bit quantizer with quantization table state_sq3Tbl.

      state_sq3Tbl[8] = {-3.719849, -2.177490, -1.130005, -0.309692,
                          0.444214,  1.329712,  2.436279,  3.983887}
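The DPCM loop of Figure 3.3 can be sketched as follows. This is an illustration and not the reference implementation. It assumes that the input x[n] has already been normalized and weighted, that aw holds the coefficients of Ak(z/LPC_CHIRP_WEIGHTDENUM) with aw[0]=1 (so that Pk(z) = 1 - 1/Wk(z) has the taps -aw[1..10]), and that the predictor memory starts at zero. The names nearest_level and dpcm_quantize are hypothetical.

```c
#include <assert.h>

#define LPC_FILTERORDER 10

static const float state_sq3Tbl[8] = {
    -3.719849f, -2.177490f, -1.130005f, -0.309692f,
     0.444214f,  1.329712f,  2.436279f,  3.983887f
};

/* Index of the entry of state_sq3Tbl nearest to v. */
int nearest_level(float v)
{
    int best = 0;
    float best_err = (v - state_sq3Tbl[0]) * (v - state_sq3Tbl[0]);
    for (int i = 1; i < 8; i++) {
        float err = (v - state_sq3Tbl[i]) * (v - state_sq3Tbl[i]);
        if (err < best_err) {
            best_err = err;
            best = i;
        }
    }
    return best;
}

/* x:   weighted, normalized start state samples x[n]
   aw:  coefficients of Ak(z/LPC_CHIRP_WEIGHTDENUM), aw[0] = 1
   idx: output, one 3-bit index per sample */
void dpcm_quantize(const float *x, int len, const float *aw, int *idx)
{
    /* Past reconstructed samples u[n]+y[n]; s[0] is the newest.
       Zero initial memory is an assumption of this sketch. */
    float s[LPC_FILTERORDER] = {0};

    assert(len >= 0);
    for (int n = 0; n < len; n++) {
        /* y[n]: output of Pk(z) driven by the reconstructed signal */
        float y = 0.0f;
        for (int i = 1; i <= LPC_FILTERORDER; i++)
            y -= aw[i] * s[i - 1];

        idx[n] = nearest_level(x[n] - y);       /* quantize d[n]   */
        float rec = state_sq3Tbl[idx[n]] + y;   /* u[n] + y[n]     */

        for (int i = LPC_FILTERORDER - 1; i > 0; i--)
            s[i] = s[i - 1];
        s[0] = rec;
    }
}
```

Because the prediction is formed from reconstructed samples only, a decoder that receives the same indices can rebuild the same y[n] sequence.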

The quantized samples are transformed back to the residual domain by 1) scaling with 1/scal; 2) time-reversing the scaled samples; 3) filtering the time-reversed samples by the same all-pass filter, as in section 3.5.2, by using circular convolution; and 4) time-reversing the filtered samples. (More detail is in section 4.2.)

A reference implementation of the start-state encoding can be found in Appendix A.46.
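The all-pass filtering by circular convolution, used in section 3.5.2 and again in step 3 above, can be sketched as follows. This is an illustration and not the reference implementation. The function name allpass_circular, the buffer bound STATE_SHORT_LEN_MAX, and the argument layout are hypothetical. The array a is assumed to hold the quantized coefficients a~k(0..LPC_FILTERORDER) with a[0]=1, so that the numerator of Pk(z) is the same array read in reverse.

```c
#include <assert.h>

#define LPC_FILTERORDER     10
#define STATE_SHORT_LEN_MAX 58

/* res:   len start state samples (len is 57 or 58)
   a:     quantized LPC coefficients, a[0] = 1
   ccres: len output samples of the circular convolution */
void allpass_circular(const float *res, int len, const float *a,
                      float *ccres)
{
    float in[2 * STATE_SHORT_LEN_MAX] = {0};  /* res followed by zeros */
    float filtered[2 * STATE_SHORT_LEN_MAX];

    assert(len > 0 && len <= STATE_SHORT_LEN_MAX);
    for (int k = 0; k < len; k++)
        in[k] = res[k];

    /* Pole-zero filter Pk(z) with zero initial memory, run over
       2*len samples. */
    for (int n = 0; n < 2 * len; n++) {
        float acc = 0.0f;
        for (int j = 0; j <= LPC_FILTERORDER && j <= n; j++)
            acc += a[LPC_FILTERORDER - j] * in[n - j];  /* numerator   */
        for (int i = 1; i <= LPC_FILTERORDER && i <= n; i++)
            acc -= a[i] * filtered[n - i];              /* denominator */
        filtered[n] = acc;
    }

    /* Fold the second half onto the first half. */
    for (int k = 0; k < len; k++)
        ccres[k] = filtered[k] + filtered[k + len];
}
```

With a[0]=1 and all other coefficients zero, Pk(z) reduces to a delay of LPC_FILTERORDER samples, and the folding step turns that delay into a circular shift of the block.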