RFC 3951

Internet Low Bit Rate Codec (iLBC)

Pages: 194
Experimental

RFC3951 - Page 1

Network Working Group                                        S. Andersen
Request for Comments: 3951                            Aalborg University
Category: Experimental                                          A. Duric
                                                                   Telio
                                                               H. Astrom
                                                                R. Hagen
                                                               W. Kleijn
                                                               J. Linden
                                                         Global IP Sound
                                                           December 2004


                   Internet Low Bit Rate Codec (iLBC)

Status of this Memo

   This memo defines an Experimental Protocol for the Internet
   community.  It does not specify an Internet standard of any kind.
   Discussion and suggestions for improvement are requested.
   Distribution of this memo is unlimited.

Copyright Notice

   Copyright (C) The Internet Society (2004).

Abstract

   This document specifies a speech codec suitable for robust voice
   communication over IP.  The codec is developed by Global IP Sound
   (GIPS).  It is designed for narrow band speech and results in a
   payload bit rate of 13.33 kbit/s for 30 ms frames and 15.20 kbit/s
   for 20 ms frames.  The codec enables graceful speech quality
   degradation in the case of lost frames, which occurs in connection
   with lost or delayed IP packets.

Table of Contents

   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  4
   2.  Outline of the Codec . . . . . . . . . . . . . . . . . . . . .  5
       2.1.  Encoder. . . . . . . . . . . . . . . . . . . . . . . . .  5
       2.2.  Decoder. . . . . . . . . . . . . . . . . . . . . . . . .  7
   3.  Encoder Principles . . . . . . . . . . . . . . . . . . . . . .  7
       3.1.  Pre-processing . . . . . . . . . . . . . . . . . . . . .  9
       3.2.  LPC Analysis and Quantization. . . . . . . . . . . . . .  9
             3.2.1.  Computation of Autocorrelation Coefficients. . . 10
             3.2.2.  Computation of LPC Coefficients. . . . . . . . . 11
             3.2.3.  Computation of LSF Coefficients from LPC
                     Coefficients . . . . . . . . . . . . . . . . . . 11
             3.2.4.  Quantization of LSF Coefficients . . . . . . . . 12
             3.2.5.  Stability Check of LSF Coefficients. . . . . . . 13
             3.2.6.  Interpolation of LSF Coefficients. . . . . . . . 13
             3.2.7.  LPC Analysis and Quantization for 20 ms Frames . 14
       3.3.  Calculation of the Residual. . . . . . . . . . . . . . . 15
       3.4.  Perceptual Weighting Filter. . . . . . . . . . . . . . . 15
       3.5.  Start State Encoder. . . . . . . . . . . . . . . . . . . 15
             3.5.1.  Start State Estimation . . . . . . . . . . . . . 16
             3.5.2.  All-Pass Filtering and Scale Quantization. . . . 17
             3.5.3.  Scalar Quantization. . . . . . . . . . . . . . . 18
       3.6.  Encoding the Remaining Samples . . . . . . . . . . . . . 19
             3.6.1.  Codebook Memory. . . . . . . . . . . . . . . . . 20
             3.6.2.  Perceptual Weighting of Codebook Memory
                     and Target . . . . . . . . . . . . . . . . . . . 22
             3.6.3.  Codebook Creation. . . . . . . . . . . . . . . . 23
                     3.6.3.1. Creation of a Base Codebook . . . . . . 23
                     3.6.3.2. Codebook Expansion. . . . . . . . . . . 24
                     3.6.3.3. Codebook Augmentation . . . . . . . . . 24
             3.6.4.  Codebook Search. . . . . . . . . . . . . . . . . 26
                     3.6.4.1. Codebook Search at Each Stage . . . . . 26
                     3.6.4.2. Gain Quantization at Each Stage . . . . 27
                     3.6.4.3. Preparation of Target for Next Stage. . 28
       3.7.  Gain Correction Encoding . . . . . . . . . . . . . . . . 28
       3.8.  Bitstream Definition . . . . . . . . . . . . . . . . . . 29
   4.  Decoder Principles . . . . . . . . . . . . . . . . . . . . . . 32
       4.1.  LPC Filter Reconstruction. . . . . . . . . . . . . . . . 33
       4.2.  Start State Reconstruction . . . . . . . . . . . . . . . 33
       4.3.  Excitation Decoding Loop . . . . . . . . . . . . . . . . 34
       4.4.  Multistage Adaptive Codebook Decoding. . . . . . . . . . 35
             4.4.1.  Construction of the Decoded Excitation Signal. . 35
       4.5.  Packet Loss Concealment. . . . . . . . . . . . . . . . . 35
             4.5.1.  Block Received Correctly and Previous Block
                     Also Received. . . . . . . . . . . . . . . . . . 35
             4.5.2.  Block Not Received . . . . . . . . . . . . . . . 36
             4.5.3.  Block Received Correctly When Previous Block
                     Not Received . . . . . . . . . . . . . . . . . . 36
       4.6.  Enhancement. . . . . . . . . . . . . . . . . . . . . . . 37
             4.6.1.  Estimating the Pitch . . . . . . . . . . . . . . 39
             4.6.2.  Determination of the Pitch-Synchronous
                     Sequences. . . . . . . . . . . . . . . . . . . . 39
             4.6.3.  Calculation of the Smoothed Excitation . . . . . 41
             4.6.4.  Enhancer Criterion . . . . . . . . . . . . . . . 41
             4.6.5.  Enhancing the Excitation . . . . . . . . . . . . 42
       4.7.  Synthesis Filtering. . . . . . . . . . . . . . . . . . . 43
       4.8.  Post Filtering . . . . . . . . . . . . . . . . . . . . . 43
   5.  Security Considerations. . . . . . . . . . . . . . . . . . . . 43
   6.  Evaluation of the iLBC Implementations . . . . . . . . . . . . 43
   7.  References . . . . . . . . . . . . . . . . . . . . . . . . . . 43
       7.1.  Normative References . . . . . . . . . . . . . . . . . . 43
       7.2.  Informative References . . . . . . . . . . . . . . . . . 44
   8.  ACKNOWLEDGEMENTS . . . . . . . . . . . . . . . . . . . . . . . 44
   Appendix A. Reference Implementation . . . . . . . . . . . . . . . 45
       A.1.  iLBC_test.c. . . . . . . . . . . . . . . . . . . . . . . 46
       A.2.  iLBC_encode.h. . . . . . . . . . . . . . . . . . . . . . 52
       A.3.  iLBC_encode.c. . . . . . . . . . . . . . . . . . . . . . 53
       A.4.  iLBC_decode.h. . . . . . . . . . . . . . . . . . . . . . 63
       A.5.  iLBC_decode.c. . . . . . . . . . . . . . . . . . . . . . 64
       A.6.  iLBC_define.h. . . . . . . . . . . . . . . . . . . . . . 76
       A.7.  constants.h. . . . . . . . . . . . . . . . . . . . . . . 80
       A.8.  constants.c. . . . . . . . . . . . . . . . . . . . . . . 82
       A.9.  anaFilter.h. . . . . . . . . . . . . . . . . . . . . . . 96
       A.10. anaFilter.c. . . . . . . . . . . . . . . . . . . . . . . 97
       A.11. createCB.h . . . . . . . . . . . . . . . . . . . . . . . 98
       A.12. createCB.c . . . . . . . . . . . . . . . . . . . . . . . 99
       A.13. doCPLC.h . . . . . . . . . . . . . . . . . . . . . . . .104
       A.14. doCPLC.c . . . . . . . . . . . . . . . . . . . . . . . .104
       A.15. enhancer.h . . . . . . . . . . . . . . . . . . . . . . .109
       A.16. enhancer.c . . . . . . . . . . . . . . . . . . . . . . .110
       A.17. filter.h . . . . . . . . . . . . . . . . . . . . . . . .123
       A.18. filter.c . . . . . . . . . . . . . . . . . . . . . . . .125
       A.19. FrameClassify.h. . . . . . . . . . . . . . . . . . . . .128
       A.20. FrameClassify.c. . . . . . . . . . . . . . . . . . . . .129
       A.21. gainquant.h. . . . . . . . . . . . . . . . . . . . . . .131
       A.22. gainquant.c. . . . . . . . . . . . . . . . . . . . . . .131
       A.23. getCBvec.h . . . . . . . . . . . . . . . . . . . . . . .134
       A.24. getCBvec.c . . . . . . . . . . . . . . . . . . . . . . .134
       A.25. helpfun.h. . . . . . . . . . . . . . . . . . . . . . . .138
       A.26. helpfun.c. . . . . . . . . . . . . . . . . . . . . . . .140
       A.27. hpInput.h. . . . . . . . . . . . . . . . . . . . . . . .146
       A.28. hpInput.c. . . . . . . . . . . . . . . . . . . . . . . .146
       A.29. hpOutput.h . . . . . . . . . . . . . . . . . . . . . . .148
       A.30. hpOutput.c . . . . . . . . . . . . . . . . . . . . . . .148
       A.31. iCBConstruct.h . . . . . . . . . . . . . . . . . . . . .149
       A.32. iCBConstruct.c . . . . . . . . . . . . . . . . . . . . .150
       A.33. iCBSearch.h. . . . . . . . . . . . . . . . . . . . . . .152
       A.34. iCBSearch.c. . . . . . . . . . . . . . . . . . . . . . .153
       A.35. LPCdecode.h. . . . . . . . . . . . . . . . . . . . . . .163
       A.36. LPCdecode.c. . . . . . . . . . . . . . . . . . . . . . .164
       A.37. LPCencode.h. . . . . . . . . . . . . . . . . . . . . . .167
       A.38. LPCencode.c. . . . . . . . . . . . . . . . . . . . . . .167
       A.39. lsf.h. . . . . . . . . . . . . . . . . . . . . . . . . .172
       A.40. lsf.c. . . . . . . . . . . . . . . . . . . . . . . . . .172
       A.41. packing.h. . . . . . . . . . . . . . . . . . . . . . . .178
       A.42. packing.c. . . . . . . . . . . . . . . . . . . . . . . .179
       A.43. StateConstructW.h. . . . . . . . . . . . . . . . . . . .182
       A.44. StateConstructW.c. . . . . . . . . . . . . . . . . . . .183
       A.45. StateSearchW.h . . . . . . . . . . . . . . . . . . . . .185
       A.46. StateSearchW.c . . . . . . . . . . . . . . . . . . . . .186
       A.47. syntFilter.h . . . . . . . . . . . . . . . . . . . . . .190
       A.48. syntFilter.c . . . . . . . . . . . . . . . . . . . . . .190
   Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . .192
   Full Copyright Statement . . . . . . . . . . . . . . . . . . . . .194

1.  Introduction

   This document contains the description of an algorithm for the coding
   of speech signals sampled at 8 kHz.  The algorithm, called iLBC, uses
   a block-independent linear-predictive coding (LPC) algorithm and has
   support for two basic frame lengths: 20 ms at 15.2 kbit/s and 30 ms
   at 13.33 kbit/s.  When the codec operates at block lengths of 20 ms,
   it produces 304 bits per block, which SHOULD be packetized as in [1].
   Similarly, for block lengths of 30 ms it produces 400 bits per block,
   which SHOULD be packetized as in [1].  The two modes for the
   different frame sizes operate in a very similar way.  When they
   differ it is explicitly stated in the text, usually with the notation
   x/y, where x refers to the 20 ms mode and y refers to the 30 ms mode.
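
   The relation between block size and payload bit rate stated above
   can be checked with a few lines of C.  This is only an illustrative
   sketch; the function names are hypothetical and do not appear in the
   reference implementation.

```c
/* Payload bit rate in bit/s when one block of bits_per_block bits is
 * produced every frame_ms milliseconds. */
double ilbc_bitrate_bps(int bits_per_block, int frame_ms)
{
    return bits_per_block * 1000.0 / frame_ms;
}

/* Payload size in octets; 304 and 400 are both multiples of 8. */
int ilbc_payload_octets(int bits_per_block)
{
    return bits_per_block / 8;
}
```

   For 304 bits every 20 ms this gives 15200 bit/s (38 octets per
   block), and for 400 bits every 30 ms about 13333 bit/s (50 octets
   per block).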

   The described algorithm results in a speech coding system with a
   controlled response to packet losses similar to what is known from
   pulse code modulation (PCM) with packet loss concealment (PLC), such
   as the ITU-T G.711 standard [4], which operates at a fixed bit rate
   of 64 kbit/s.  At the same time, the described algorithm enables
   fixed bit rate coding with a quality-versus-bit rate tradeoff close
   to state-of-the-art.  A suitable RTP payload format for the iLBC
   codec is specified in [1].

   Some of the applications for which this coder is suitable are real
   time communications such as telephony and videoconferencing,
   streaming audio, archival, and messaging.

   Cable Television Laboratories (CableLabs(R)) has adopted iLBC as a
   mandatory PacketCable(TM) audio codec standard for VoIP over Cable
   applications [3].

   This document is organized as follows.  Section 2 gives a brief
   outline of the codec.  The specific encoder and decoder algorithms
   are explained in sections 3 and 4, respectively.  Appendix A provides
   a C-code reference implementation.

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in BCP 14, RFC 2119 [2].

2.  Outline of the Codec

   The codec consists of an encoder and a decoder as described in
   sections 2.1 and 2.2, respectively.

   The essence of the codec is LPC and block-based coding of the LPC
   residual signal.  For each 160/240 (20 ms/30 ms) sample block, the
   following major steps are performed: A set of LPC filters is
   computed, and the speech signal is filtered through them to produce
   the residual signal.  The codec uses scalar quantization of the
   dominant part, in terms of energy, of the residual signal for the
   block.  The dominant state is of length 57/58 (20 ms/30 ms) samples
   and forms a start state for dynamic codebooks constructed from the
   already coded parts of the residual signal.  These dynamic codebooks
   are used to code the remaining parts of the residual signal.  By this
   method, coding independence between blocks is achieved, resulting in
   elimination of propagation of perceptual degradations due to packet
   loss.  The method facilitates high-quality packet loss concealment
   (PLC).

2.1.  Encoder

   The input to the encoder SHOULD be 16-bit uniform PCM sampled at 8
   kHz.  It SHOULD be partitioned into blocks of BLOCKL=160/240 samples
   for the 20/30 ms frame size.  Each block is divided into NSUB=4/6
   consecutive sub-blocks of SUBL=40 samples each.  For 30 ms frame
   size, the encoder performs two LPC_FILTERORDER=10 linear-predictive
   coding (LPC) analyses.  The first analysis applies a smooth window
   centered over the second sub-block and extending to the middle of the
   fifth sub-block.  The second LPC analysis applies a smooth asymmetric
   window centered over the fifth sub-block and extending to the end of
   the sixth sub-block.  For 20 ms frame size, one LPC_FILTERORDER=10
   linear-predictive coding (LPC) analysis is performed with a smooth
   window centered over the third sub-frame.
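
   The framing parameters above follow directly from the frame size.
   The sketch below derives them for both modes.  The structure and
   function names are hypothetical and chosen for illustration only;
   FS, SUBL, BLOCKL, and NSUB are the constants named in this document.

```c
#define FS   8000  /* sampling frequency in Hz */
#define SUBL 40    /* samples per sub-block */

/* Hypothetical container for the per-mode framing parameters. */
typedef struct {
    int frame_ms;      /* 20 or 30 */
    int blockl;        /* BLOCKL: 160 or 240 samples */
    int nsub;          /* NSUB: 4 or 6 sub-blocks */
    int lpc_analyses;  /* LPC analyses per block: 1 or 2 */
} ilbc_framing;

/* Fills *f for a 20 or 30 ms frame size.  Returns 0 on success and
 * -1 for any other frame size. */
int ilbc_framing_init(ilbc_framing *f, int frame_ms)
{
    if (frame_ms != 20 && frame_ms != 30)
        return -1;
    f->frame_ms = frame_ms;
    f->blockl = FS / 1000 * frame_ms;
    f->nsub = f->blockl / SUBL;
    f->lpc_analyses = (frame_ms == 30) ? 2 : 1;
    return 0;
}
```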

   For each of the LPC analyses, a set of line-spectral frequencies
   (LSFs) are obtained, quantized, and interpolated to obtain LSF
   coefficients for each sub-block.  Subsequently, the LPC residual is
   computed by using the quantized and interpolated LPC analysis
   filters.

   The two consecutive sub-blocks of the residual exhibiting the maximal
   weighted energy are identified.  Within these two sub-blocks, the
   start state (segment) is selected from two choices: the first 57/58
   samples or the last 57/58 samples of the two consecutive sub-blocks.
   The selected segment is the one of higher energy.  The start state is
   encoded with scalar quantization.

   A dynamic codebook encoding procedure is used to encode 1) the 23/22
   (20 ms/30 ms) remaining samples in the two sub-blocks containing the
   start state; 2) the sub-blocks after the start state in time; and 3)
   the sub-blocks before the start state in time.  Thus, the encoding
   target can be either the 23/22 samples remaining of the two sub-
   blocks containing the start state or a 40-sample sub-block.  This
   target can consist of samples indexed forward in time or backward in
   time, depending on the location of the start state.
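
   The choice between the two candidate start-state segments can be
   sketched as follows.  The sketch assumes that the two sub-blocks
   with maximal weighted energy have already been located and compares
   plain (unweighted) energy of the two candidates; the function names
   and the handling of ties are assumptions made for illustration.

```c
#define SUBL 40

/* Sum of squared samples. */
static double segment_energy(const float *x, int n)
{
    double e = 0.0;
    int i;
    for (i = 0; i < n; i++)
        e += (double)x[i] * x[i];
    return e;
}

/* pair points at the 2*SUBL residual samples of the two selected
 * sub-blocks; state_len is 57 (20 ms) or 58 (30 ms).  Returns 1 if
 * the first state_len samples form the start state and 0 if the last
 * state_len samples do.  A tie selects the last segment (assumption). */
int start_state_is_first(const float *pair, int state_len)
{
    int rest = 2 * SUBL - state_len;   /* 23 or 22 remaining samples */
    return segment_energy(pair, state_len) >
           segment_energy(pair + rest, state_len);
}
```

   The 23/22 samples not covered by the start state are then the first
   target of the dynamic codebook encoding.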

   The codebook coding is based on an adaptive codebook built from a
   codebook memory that contains decoded LPC excitation samples from the
   already encoded part of the block.  These samples are indexed in the
   same time direction as the target vector, ending at the sample
   instant prior to the first sample instant represented in the target
   vector.  The codebook is used in CB_NSTAGES=3 stages in a successive
   refinement approach, and the resulting three code vector gains are
   encoded with 5-, 4-, and 3-bit scalar quantization, respectively.

   The codebook search method employs noise shaping derived from the LPC
   filters, and the main decision criterion is to minimize the squared
   error between the target vector and the code vectors.  Each code
   vector in this codebook comes from one of CB_EXPAND=2 codebook
   sections.  The first section is filled with delayed, already encoded
   residual vectors.  The code vectors of the second codebook section
   are constructed by predefined linear combinations of vectors in the
   first section of the codebook.

   As codebook encoding with squared-error matching is known to produce
   a coded signal of less power than does the scalar quantized start
   state signal, a gain re-scaling method is implemented by a refined
   search for a better set of codebook gains in terms of power matching
   after encoding.  This is done by searching for a higher value of the
   gain factor for the first stage codebook, as the subsequent stage
   codebook gains are scaled by the first stage gain.

2.2.  Decoder

   Typically for packet communications, a jitter buffer placed at the
   receiving end decides whether the packet containing an encoded signal
   block has been received or lost.  This logic is not part of the codec
   described here.  For each encoded signal block received, the decoder
   decodes the block.  For each lost signal block, the decoder performs
   a PLC operation.

   The decoding for each block starts by decoding and interpolating the
   LPC coefficients.  Subsequently the start state is decoded.

   For codebook-encoded segments, each segment is decoded by
   constructing the three code vectors given by the received codebook
   indices in the same way that the code vectors were constructed in the
   encoder.  The three gain factors are also decoded and the resulting
   decoded signal is given by the sum of the three codebook vectors
   scaled with respective gain.
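
   The gain-scaled sum described above can be sketched as follows.  The
   code assumes that the three code vectors have already been
   constructed and that the gains have been decoded; the function name
   and the flat memory layout are assumptions made for illustration.

```c
#define CB_NSTAGES 3

/* cbvec holds CB_NSTAGES code vectors of len samples each, stored one
 * after another; gain holds the CB_NSTAGES decoded gain factors. */
void combine_code_vectors(float *decoded, const float *cbvec,
                          const float *gain, int len)
{
    int i, s;
    for (i = 0; i < len; i++) {
        float acc = 0.0f;
        for (s = 0; s < CB_NSTAGES; s++)
            acc += gain[s] * cbvec[s * len + i];
        decoded[i] = acc;
    }
}
```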

   An enhancement algorithm is applied to the reconstructed excitation
   signal.  This enhancement augments the periodicity of voiced speech
   regions.  The enhancement is optimized under the constraint that the
   modification signal (defined as the difference between the enhanced
   excitation and the excitation signal prior to enhancement) has a
   short-time energy that does not exceed a preset fraction of the
   short-time energy of the excitation signal prior to enhancement.

   A packet loss concealment (PLC) operation is easily embedded in the
   decoder.  The PLC operation can, e.g., be based on repeating LPC
   filters and obtaining the LPC residual signal by using a long-term
   prediction estimate from previous residual blocks.

3.  Encoder Principles

   The following block diagram is an overview of all the components of
   the iLBC encoding procedure.  The description of the blocks contains
   references to the section where that particular procedure is further
   described.

             +-----------+    +---------+    +---------+
   speech -> | 1. Pre P  | -> | 2. LPC  | -> | 3. Ana  | ->
             +-----------+    +---------+    +---------+

             +---------------+   +--------------+
          -> | 4. Start Sel  | ->| 5. Scalar Qu | ->
             +---------------+   +--------------+

             +--------------+    +---------------+
          -> |6. CB Search  | -> | 7. Packetize  | -> payload
          |  +--------------+ |  +---------------+
          ----<---------<------
       sub-frame 0..2/4 (20 ms/30 ms)

   Figure 3.1. Flow chart of the iLBC encoder

   1. Pre-process speech with a HP filter, if needed (section 3.1).

   2. Compute LPC parameters, quantize, and interpolate (section 3.2).

   3. Use analysis filters on speech to compute residual (section 3.3).

   4. Select position of 57/58-sample start state (section 3.5).

   5. Quantize the 57/58-sample start state with scalar quantization
      (section 3.5).

   6. Search the codebook for each sub-frame.  Start with 23/22 sample
      block, then encode sub-blocks forward in time, and then encode
      sub-blocks backward in time.  For each block, the steps in Figure
      3.4 are performed (section 3.6).

   7. Packetize the bits into the payload specified in Table 3.2.

   The input to the encoder SHOULD be 16-bit uniform PCM sampled at 8
   kHz.  Also it SHOULD be partitioned into blocks of BLOCKL=160/240
   samples.  Each block input to the encoder is divided into NSUB=4/6
   consecutive sub-blocks of SUBL=40 samples each.

             0        39        79       119       159
             +---------------------------------------+
             |    1    |    2    |    3    |    4    |
             +---------------------------------------+
                            20 ms frame

   0        39        79       119       159       199       239
   +-----------------------------------------------------------+
   |    1    |    2    |    3    |    4    |    5    |    6    |
   +-----------------------------------------------------------+
                                  30 ms frame
   Figure 3.2. One input block to the encoder for 20 ms (with four sub-
   frames) and 30 ms (with six sub-frames).

3.1.  Pre-processing

   In some applications, the recorded speech signal contains DC level
   and/or 50/60 Hz noise.  If these components have not been removed
   prior to the encoder call, they should be removed by a high-pass
   filter.  A reference implementation of this, using a filter with a
   cutoff frequency of 90 Hz, can be found in Appendix A.28.

3.2.  LPC Analysis and Quantization

   The input to the LPC analysis module is a possibly high-pass filtered
   speech buffer, speech_hp, that contains 240/300 (LPC_LOOKBACK +
   BLOCKL = 80/60 + 160/240 = 240/300) speech samples, where samples 0
   through 79/59 are from the previous block and samples 80/60 through
   239/299 are from the current block.  No look-ahead into the next
   block is used.  For the very first block processed, the look-back
   samples are assumed to be zeros.
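
   One way to maintain the speech_hp buffer from block to block is
   sketched below.  The function name is hypothetical.  The caller is
   assumed to zero the buffer before the first block, which provides
   the all-zero look-back samples mentioned above.

```c
#include <string.h>

/* speech_hp holds lookback + blockl samples (80 + 160 or 60 + 240).
 * The last lookback samples of the previous block are moved to the
 * front, and the new block is copied in after them. */
void lpc_buffer_push(float *speech_hp, const float *block,
                     int lookback, int blockl)
{
    memmove(speech_hp, speech_hp + blockl,
            (size_t)lookback * sizeof(float));
    memcpy(speech_hp + lookback, block,
           (size_t)blockl * sizeof(float));
}
```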

   For each input block, the LPC analysis calculates one/two set(s) of
   LPC_FILTERORDER=10 LPC filter coefficients using the autocorrelation
   method and the Levinson-Durbin recursion.  These coefficients are
   converted to the Line Spectral Frequency (LSF) representation.  In the
   20 ms case, the single lsf set represents the spectral characteristics
   as measured at the center of the third sub-block.  For 30 ms frames,
   the first set, lsf1, represents the spectral properties of the input
   signal at the center of the second sub-block, and the other set,
   lsf2, represents the spectral characteristics as measured at the
   center of the fifth sub-block.  The details of the computation for 30
   ms frames are described in sections 3.2.1 through 3.2.6.  Section
   3.2.7 explains how the LPC Analysis and Quantization differs for 20
   ms frames.

3.2.1.  Computation of Autocorrelation Coefficients

   The first step in the LPC analysis procedure is to calculate
   autocorrelation coefficients by using windowed speech samples.  This
   windowing is the only difference in the LPC analysis procedure for
   the two sets of coefficients.  For the first set, a 240-sample-long
   standard symmetric Hanning window is applied to samples 0 through 239
   of the input data.  The first window, lpc_winTbl, is defined as

      lpc_winTbl[i] = 0.5 * (1.0 - cos((2*PI*(i+1))/(BLOCKL+1)));
               i=0,...,119
      lpc_winTbl[i] = lpc_winTbl[BLOCKL - i - 1]; i=120,...,239

   The windowed speech speech_hp_win1 is then obtained by multiplying
   the first 240 samples of the input speech buffer with the window
   coefficients:

      speech_hp_win1[i] = speech_hp[i] * lpc_winTbl[i];
               i=0,...,BLOCKL-1

   From these 240 windowed speech samples, 11 (LPC_FILTERORDER + 1)
   autocorrelation coefficients, acf1, are calculated:

      acf1[lag] += speech_hp_win1[n] * speech_hp_win1[n + lag];
               lag=0,...,LPC_FILTERORDER; n=0,...,BLOCKL-lag-1

   In order to make the analysis more robust against numerical precision
   problems, a spectral smoothing procedure is applied by windowing the
   autocorrelation coefficients before the LPC coefficients are
   computed.  Also, a white noise floor is added to the autocorrelation
   function by multiplying coefficient zero by 1.0001 (40 dB below the
   energy of the windowed speech signal).  These two steps are
   implemented by multiplying the autocorrelation coefficients with the
   following window:

      lpc_lagwinTbl[0] = 1.0001;
      lpc_lagwinTbl[i] = exp(-0.5 * ((2 * PI * 60.0 * i) /FS)^2);
               i=1,...,LPC_FILTERORDER
               where FS=8000 is the sampling frequency

   Then, the windowed acf function acf1_win is obtained by

      acf1_win[i] = acf1[i] * lpc_lagwinTbl[i];
               i=0,...,LPC_FILTERORDER

   The second set of autocorrelation coefficients, acf2_win, are
   obtained in a similar manner.  The window, lpc_asymwinTbl, is applied
   to samples 60 through 299, i.e., the entire current block.  The
   window consists of two segments, the first (samples 0 to 219) being
   half a Hanning window with length 440 and the second a quarter of a
   cycle of a cosine wave.  By using this asymmetric window, an LPC
   analysis centered in the fifth sub-block is obtained without the need
   for any look-ahead, which would add delay.  The asymmetric window is
   defined as

      lpc_asymwinTbl[i] = (sin(PI * (i + 1) / 441))^2; i=0,...,219

      lpc_asymwinTbl[i] = cos((i - 220) * PI / 40); i=220,...,239

   and the windowed speech is computed by

      speech_hp_win2[i] = speech_hp[i + LPC_LOOKBACK] *
               lpc_asymwinTbl[i];  i=0,...,BLOCKL-1

   The windowed autocorrelation coefficients are then obtained in
   exactly the same way as for the first analysis instance.

   The windows lpc_winTbl, lpc_asymwinTbl, and lpc_lagwinTbl are
   typically generated in advance, and the arrays are stored in ROM
   rather than being recomputed for every block.

3.2.2.  Computation of LPC Coefficients

   From the 2 x 11 smoothed autocorrelation coefficients, acf1_win and
   acf2_win, the 2 x 11 LPC coefficients, lp1 and lp2, are calculated
   in the same way for both analysis locations by using the well known
   Levinson-Durbin recursion.  The first LPC coefficient is always 1.0,
   resulting in ten unique coefficients.
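
   For readers unfamiliar with the recursion, a textbook form of
   Levinson-Durbin is sketched below.  It is not copied from the
   reference implementation.  The sign convention is assumed to match
   the analysis filter A(z) = 1 + sum of a(i)*z^(-i) used in this
   document, and the function name is hypothetical.

```c
/* Computes a[0..order] from autocorrelation values r[0..order] so
 * that A(z) = 1 + a[1]*z^-1 + ... + a[order]*z^-order minimizes the
 * prediction error.  Returns 0 on success.  Returns -1 if the
 * prediction error becomes non-positive (e.g., for silent input);
 * coefficients not yet computed are then left at zero. */
int levinson_durbin(float *a, const float *r, int order)
{
    double err = r[0];
    int i, m;

    a[0] = 1.0f;
    for (i = 1; i <= order; i++)
        a[i] = 0.0f;
    if (err <= 0.0)
        return -1;

    for (m = 1; m <= order; m++) {
        double acc = r[m];
        double k;
        for (i = 1; i < m; i++)
            acc += (double)a[i] * r[m - i];
        k = -acc / err;            /* reflection coefficient */

        /* Update a[1..m-1] in mirrored pairs, then append a[m]. */
        for (i = 1; i <= m / 2; i++) {
            double lo = a[i] + k * a[m - i];
            double hi = a[m - i] + k * a[i];
            a[i] = (float)lo;
            a[m - i] = (float)hi;
        }
        a[m] = (float)k;
        err *= 1.0 - k * k;
        if (err <= 0.0)
            return -1;
    }
    return 0;
}
```

   For an autocorrelation sequence r = {1, 0.5, 0.25}, which
   corresponds to a first-order process, the recursion returns
   a = {1, -0.5, 0}.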

   After determining the LPC coefficients, a bandwidth expansion
   procedure is applied to smooth the spectral peaks in the
   short-term spectrum.  The bandwidth expansion is obtained by the
   following modification of the LPC coefficients:

      lp1_bw[i] = lp1[i] * chirp^i; i=0,...,LPC_FILTERORDER
      lp2_bw[i] = lp2[i] * chirp^i; i=0,...,LPC_FILTERORDER

   where "chirp" is a real number between 0 and 1.  It is RECOMMENDED to
   use a value of 0.9.
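
   The bandwidth expansion can be sketched as a running product, which
   avoids repeated calls to a power function.  The function name is
   hypothetical.

```c
/* out[i] = in[i] * chirp^i for i = 0..length-1.  For an LPC filter,
 * length is LPC_FILTERORDER + 1 and in[0] is 1.0. */
void bw_expand(float *out, const float *in, float chirp, int length)
{
    float scale = 1.0f;   /* chirp^0 */
    int i;
    for (i = 0; i < length; i++) {
        out[i] = in[i] * scale;
        scale *= chirp;
    }
}
```

   Because chirp is less than 1, higher-order coefficients are
   attenuated more strongly, which moves the filter poles toward the
   origin and widens the spectral peaks.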

3.2.3.  Computation of LSF Coefficients from LPC Coefficients

   Thus far, two sets of LPC coefficients that represent the short-term
   spectral characteristics of the speech signal for two different time
   locations within the current block have been determined.  These
   coefficients SHOULD be quantized and interpolated.  Before this is
   done, it is advantageous to convert the LPC parameters into another
   type of representation called Line Spectral Frequencies (LSF).  The
   LSF parameters are used because they are better suited for
   quantization and interpolation than the regular LPC coefficients.
   Many computationally efficient methods for calculating the LSFs from
   the LPC coefficients have been proposed in the literature.  The
   detailed implementation of one applicable method can be found in
   Appendix A.26.  The two arrays of LSF coefficients obtained, lsf1 and
   lsf2, are of dimension 10 (LPC_FILTERORDER).

3.2.4.  Quantization of LSF Coefficients

   Because the LPC filters defined by the two sets of LSFs are also
   needed in the decoder, the LSF parameters need to be quantized and
   transmitted as side information.  The total number of bits required
   to represent the quantization of the two LSF representations for one
   block of speech is 40, with 20 bits used for each of lsf1 and lsf2.

   For computational and storage reasons, the LSF vectors are quantized
   using three-split vector quantization (VQ).  That is, the LSF vectors
   are split into three sub-vectors that are each quantized with a
   regular VQ.  The quantized versions of lsf1 and lsf2, qlsf1 and
   qlsf2, are obtained by using the same memoryless split VQ.  The
   length of each of these two LSF vectors is 10, and they are split
   into three sub-vectors containing 3, 3, and 4 values, respectively.

   For each of the sub-vectors, a separate codebook of quantized values
   has been designed with a standard VQ training method for a large
   database containing speech from a large number of speakers recorded
   under various conditions.  The size of each of the three codebooks
   associated with the split definitions above is

      int size_lsfCbTbl[LSF_NSPLIT] = {64,128,128};

   The actual values of the vector quantization codebook that must be
   used can be found in the reference code of Appendix A.  Both sets of
   LSF coefficients, lsf1 and lsf2, are quantized with a standard
   memoryless split vector quantization (VQ) structure using the squared
   error criterion in the LSF domain.  The split VQ quantization
   consists of the following steps:

   1) Quantize the first three LSF coefficients (1 - 3) with a VQ
      codebook of size 64.
   2) Quantize the next three LSF coefficients (4 - 6) with a VQ
      codebook of size 128.
   3) Quantize the last four LSF coefficients (7 - 10) with a VQ
      codebook of size 128.

   This procedure, repeated for lsf1 and lsf2, gives six quantization
   indices and the quantized sets of LSF coefficients qlsf1 and qlsf2.
   Each set of three indices is encoded with 6 + 7 + 7 = 20 bits.  The
   total number of bits used for LSF quantization in a block is thus 40
   bits.
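
   The three-step split VQ above amounts to an exhaustive
   nearest-neighbor search in each codebook section.  The sketch below
   takes the codebook and its section sizes as arguments so that it
   can be exercised with small test codebooks.  The function name, the
   array name dim_lsfCbTbl, and the flat codebook layout (all entries
   of section one, then section two, then section three) are
   assumptions made for illustration.  The actual codebook values are
   those of Appendix A.

```c
#include <float.h>

#define LPC_FILTERORDER 10
#define LSF_NSPLIT      3

/* Sub-vector dimensions of the 3, 3, 4 split. */
static const int dim_lsfCbTbl[LSF_NSPLIT] = {3, 3, 4};

/* Quantizes lsf[0..9].  cb_size[s] is the number of entries in
 * section s (64, 128, 128 for iLBC).  On return, index[s] holds the
 * chosen entry of section s and qlsf holds the quantized vector. */
void split_vq(float *qlsf, int *index, const float *lsf,
              const float *cb, const int *cb_size)
{
    int s, j, d, pos = 0;
    for (s = 0; s < LSF_NSPLIT; s++) {
        int dim = dim_lsfCbTbl[s];
        double best = DBL_MAX;
        index[s] = 0;
        for (j = 0; j < cb_size[s]; j++) {
            double dist = 0.0;       /* squared error to entry j */
            for (d = 0; d < dim; d++) {
                double e = lsf[pos + d] - cb[j * dim + d];
                dist += e * e;
            }
            if (dist < best) {
                best = dist;
                index[s] = j;
            }
        }
        for (d = 0; d < dim; d++)
            qlsf[pos + d] = cb[index[s] * dim + d];
        pos += dim;
        cb += cb_size[s] * dim;      /* advance to the next section */
    }
}
```

   With section sizes of 64, 128, and 128, the three indices fit in 6,
   7, and 7 bits, which gives the 20 bits per LSF set stated above.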

3.2.5.  Stability Check of LSF Coefficients

   The LSF representation of the LPC filter has the convenient property
   that the coefficients are ordered by increasing value, i.e., lsf(n-1)
   < lsf(n), 0 < n < 10, if the corresponding synthesis filter is
   stable.  As we are employing a split VQ scheme, it is possible that
   at the split boundaries the LSF coefficients are not ordered
   correctly and hence that the corresponding LP filter is unstable.  To
   ensure that the filter used is stable, a stability check is performed
   for the quantized LSF vectors.  If it turns out that the coefficients
   are not ordered appropriately (with a safety margin of 50 Hz to
   ensure that formant peaks are not too narrow), they will be moved
   apart.  The detailed method for this can be found in Appendix A.40.
   The same procedure is performed in the decoder.  This ensures that
   exactly the same LSF representations are used in both encoder and
   decoder.
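
   The idea of moving badly ordered coefficients apart can be
   illustrated with the simplified sketch below.  It is not the method
   of the reference implementation, which remains authoritative.  The
   function name, the fixed number of passes, and the symmetric
   displacement around the midpoint are all assumptions.  If LSFs are
   expressed in radians with PI corresponding to 4000 Hz (also an
   assumption here), a 50 Hz margin is 2*PI*50/8000, or about 0.039.

```c
/* Ensures lsf[i] - lsf[i-1] is at least about min_gap by moving
 * offending neighbors symmetrically around their midpoint.  Two
 * passes are made because one correction can disturb the preceding
 * pair.  Returns 1 if any coefficient was moved and 0 otherwise. */
int lsf_spread(float *lsf, int order, float min_gap)
{
    int moved = 0, pass, i;
    for (pass = 0; pass < 2; pass++) {
        for (i = 1; i < order; i++) {
            if (lsf[i] - lsf[i - 1] < min_gap) {
                float mid = 0.5f * (lsf[i] + lsf[i - 1]);
                lsf[i - 1] = mid - 0.5f * min_gap;
                lsf[i] = mid + 0.5f * min_gap;
                moved = 1;
            }
        }
    }
    return moved;
}
```

   Whatever method is used, the encoder and the decoder have to apply
   exactly the same one, as noted above.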

3.2.6.  Interpolation of LSF Coefficients

   From the two sets of LSF coefficients that are computed for each
   block of speech, different LSFs are obtained for each sub-block by
   means of interpolation.  This procedure is performed for the original
   LSFs (lsf1 and lsf2), as well as the quantized versions qlsf1 and
   qlsf2, as both versions are used in the encoder.  Here follows a
   brief summary of the interpolation scheme; the details are found in
   the C code of Appendix A.  In the first sub-block, the average of the
   second LSF vector from the previous block and the first LSF vector in
   the current block is used.  For sub-blocks two through five, the LSFs
   used are obtained by linear interpolation from lsf1 (and qlsf1) to
   lsf2 (and qlsf2), with lsf1 used in sub-block two and lsf2 in sub-
   block five.  In the last sub-block, lsf2 is used.  For the very first
   block it is assumed that the last LSF vector of the previous block is
   equal to a predefined vector, lsfmeanTbl, obtained by calculating the
   mean LSF vector of the LSF design database.

   lsfmeanTbl[LPC_FILTERORDER] = {0.281738, 0.445801, 0.663330,
                  0.962524, 1.251831, 1.533081, 1.850586, 2.137817,
                  2.481445, 2.777344}
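
   The per-sub-block weighting summarized above can be sketched as
   follows.  The helper name lsf_interp_30ms and the table w30 are
   hypothetical, and the weights 2/3 and 1/3 assume evenly spaced
   linear interpolation between sub-blocks two and five; Appendix A
   remains the authoritative description.

```c
#include <assert.h>

#define NSUB_30MS 6

/* Weight of the "from" vector in each sub-block, ASSUMING evenly
   spaced linear interpolation across sub-blocks two through five. */
static const float w30[NSUB_30MS] = {
    0.5f, 1.0f, 2.0f / 3.0f, 1.0f / 3.0f, 0.0f, 0.0f
};

/* Hypothetical helper: LSF vector for sub-block k (0-based).
   Sub-block 0 mixes the previous block's second vector with lsf1;
   all other sub-blocks mix lsf1 with lsf2. */
void lsf_interp_30ms(float *out, const float *lsf2_prev,
                     const float *lsf1, const float *lsf2,
                     int k, int dim)
{
    const float *from = (k == 0) ? lsf2_prev : lsf1;
    const float *to   = (k == 0) ? lsf1 : lsf2;
    int i;

    assert(k >= 0 && k < NSUB_30MS);
    for (i = 0; i < dim; i++)
        out[i] = w30[k] * from[i] + (1.0f - w30[k]) * to[i];
}
```

   For the very first block, lsfmeanTbl would be passed as lsf2_prev.
   The same routine applies to the quantized vectors qlsf1 and qlsf2.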

   The interpolation method is standard linear interpolation in the LSF
   domain.  The interpolated LSF values are converted to LPC
   coefficients for each sub-block.  The unquantized and quantized LPC
   coefficients form two sets of filters.  The unquantized analysis
   filter for sub-block k is defined as follows:

                ___
                \
      Ak(z)= 1 + > ak(i)*z^(-i)
                /__
             i=1...LPC_FILTERORDER

   The quantized analysis filter for sub-block k is defined as follows:

                 ___
                 \
      A~k(z)= 1 + > a~k(i)*z^(-i)
                 /__
             i=1...LPC_FILTERORDER

   A reference implementation of the LSF encoding is given in Appendix
   A.38.  A reference implementation of the corresponding decoding can
   be found in Appendix A.36.

3.2.7.  LPC Analysis and Quantization for 20 ms Frames

   As previously stated, the codec only calculates one set of LPC
   parameters for the 20 ms frame size as opposed to two sets for 30 ms
   frames.  A single set of autocorrelation coefficients is calculated
   on the LPC_LOOKBACK + BLOCKL = 80 + 160 = 240 samples.  These samples
   are windowed with the asymmetric window lpc_asymwinTbl, centered over
   the third sub-frame, to form speech_hp_win.  Autocorrelation
   coefficients, acf, are calculated on the 240 samples in speech_hp_win
   and then windowed exactly as in section 3.2.1 (resulting in
   acf_win).
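
   As an illustration of the autocorrelation step only, the following
   minimal sketch computes acf for lags 0 through order from an
   already windowed buffer.  The helper name autocorr is hypothetical,
   and the windowing with lpc_asymwinTbl and the lag window of section
   3.2.1 are left out.

```c
#include <assert.h>

/* Hypothetical helper: acf[k] = sum over n of x[n]*x[n+k], for lags
   k = 0..order, over a windowed buffer of `len` samples. */
void autocorr(float *acf, const float *x, int len, int order)
{
    int lag, n;

    assert(order >= 0 && order < len);
    for (lag = 0; lag <= order; lag++) {
        float sum = 0.0f;
        for (n = 0; n < len - lag; n++)
            sum += x[n] * x[n + lag];
        acf[lag] = sum;
    }
}
```

   For the 20 ms frame size, x would be speech_hp_win with len = 240,
   and order would be LPC_FILTERORDER.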

   This single set of windowed autocorrelation coefficients is used to
   calculate LPC coefficients, LSF coefficients, and quantized LSF
   coefficients in exactly the same manner as in sections 3.2.3 through
   3.2.4.  As for the 30 ms frame size, the ten LSF coefficients are
   divided into three sub-vectors of size 3, 3, and 4 and quantized by
   using the same scheme and codebook as in section 3.2.4 to finally get
   3 quantization indices.  The quantized LSF coefficients are
   stabilized with the algorithm described in section 3.2.5.

   From the set of LSF coefficients computed for this block and those
   from the previous block, different LSFs are obtained for each sub-
   block by means of interpolation.  The interpolation is done linearly
   in the LSF domain over the four sub-blocks, so that the n-th sub-
   frame (n = 1, ..., 4) uses the weight (4-n)/4 for the LSF from the
   previous frame and the weight n/4 for the LSF from the current
   frame.  For the very first block the mean LSF, lsfmeanTbl, is used
   as the LSF from the previous block.  As in section 3.2.6, both
   unquantized, A(z), and quantized, A~(z), analysis filters are
   calculated for each of the four sub-blocks.
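
   The 20 ms weighting can be written out as a minimal sketch.  The
   helper name lsf_interp_20ms is hypothetical; the sub-frame index n
   is taken as 1-based so that the last sub-frame uses the current
   frame's LSF unchanged.

```c
#include <assert.h>

/* Hypothetical helper: LSF vector for sub-frame n (1-based, n = 1..4)
   of a 20 ms frame, using weight (4-n)/4 on the previous frame's LSF
   and n/4 on the current frame's LSF. */
void lsf_interp_20ms(float *out, const float *lsf_prev,
                     const float *lsf_cur, int n, int dim)
{
    float w_prev = (4 - n) / 4.0f;
    float w_cur  = n / 4.0f;
    int i;

    assert(n >= 1 && n <= 4);
    for (i = 0; i < dim; i++)
        out[i] = w_prev * lsf_prev[i] + w_cur * lsf_cur[i];
}
```

   For the very first block, lsfmeanTbl would be passed as lsf_prev.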

3.3.  Calculation of the Residual

   The block of speech samples is filtered by the quantized and
   interpolated LPC analysis filters to yield the residual signal.  In
   particular, the corresponding LPC analysis filter for each 40 sample
   sub-block is used to filter the speech samples for the same sub-
   block.  The filter memory at the end of each sub-block is carried
   over to the LPC filter of the next sub-block.  The signal at the
   output of each LP analysis filter constitutes the residual signal for
   the corresponding sub-block.

   A reference implementation of the LPC analysis filters is given in
   Appendix A.10.
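
   The carry-over of filter memory between sub-blocks can be
   illustrated with a minimal sketch.  It is not the code of Appendix
   A.10; the helper name analysis_filter and the layout of the memory
   array are assumptions.

```c
#include <assert.h>

#define LPC_FILTERORDER 10

/* Hypothetical helper: FIR analysis filter for one sub-block.
   a[0] is assumed to be 1.0 and a[1..LPC_FILTERORDER] are the
   coefficients of A~k(z).  mem holds the last LPC_FILTERORDER input
   samples (mem[0] is the most recent) and is updated on return so
   that it can be handed to the next sub-block. */
void analysis_filter(float *out, const float *in, int len,
                     const float *a, float *mem)
{
    int n, i;

    assert(len > 0);
    for (n = 0; n < len; n++) {
        float acc = a[0] * in[n];
        for (i = 1; i <= LPC_FILTERORDER; i++)
            acc += a[i] * mem[i - 1];
        /* shift the history and store the newest input sample */
        for (i = LPC_FILTERORDER - 1; i > 0; i--)
            mem[i] = mem[i - 1];
        mem[0] = in[n];
        out[n] = acc;
    }
}
```

   A caller would invoke this once per 40-sample sub-block, switching
   the coefficient set a for each sub-block while reusing the same mem
   array across the calls.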

3.4.  Perceptual Weighting Filter

   In principle any good design of a perceptual weighting filter can be
   applied in the encoder without compromising this codec definition.
   However, it is RECOMMENDED to use the perceptual weighting filter Wk
   for sub-block k specified below:

      Wk(z)=1/Ak(z/LPC_CHIRP_WEIGHTDENUM), where
                               LPC_CHIRP_WEIGHTDENUM = 0.4222

   This is a simple design with low complexity that is applied in the
   LPC residual domain.  Here Ak(z) is the filter obtained for sub-block
   k from unquantized but interpolated LSF coefficients.
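
   Substituting z/LPC_CHIRP_WEIGHTDENUM for z in Ak(z) scales the i-th
   coefficient by LPC_CHIRP_WEIGHTDENUM^i.  A minimal sketch of that
   step follows; the helper name chirp_coefs is hypothetical.  The
   result is the denominator polynomial of Wk(z).

```c
#include <assert.h>

#define LPC_CHIRP_WEIGHTDENUM 0.4222f

/* Hypothetical helper: coefficients of A(z/gamma) from those of A(z).
   Each a[i] is scaled by gamma^i, so a[0] is left unchanged. */
void chirp_coefs(float *out, const float *a, float gamma, int order)
{
    float g = 1.0f;
    int i;

    assert(order >= 0);
    for (i = 0; i <= order; i++) {
        out[i] = g * a[i];
        g *= gamma;
    }
}
```

   Since the factor is smaller than one, higher-order coefficients are
   attenuated progressively more strongly.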

3.5.  Start State Encoder

   The start state is quantized by using a common 6-bit scalar quantizer
   for the block and a 3-bit scalar quantizer operating on scaled
   samples in the weighted speech domain.  In the following we describe
   the state encoding in greater detail.

3.5.1.  Start State Estimation

   The two sub-blocks containing the start state are determined by
   finding the two consecutive sub-blocks in the block having the
   highest power.  Advantageously, down-weighting is used in the
   beginning and end of the sub-frames, i.e., the following measure is
   computed (NSUB=4/6 for 20/30 ms frame size):

      for (nsub=1; nsub<NSUB; nsub++) {
         /* energy of sub-frames nsub-1 and nsub, with 5-sample
            ramps at both ends */
         ssqn[nsub] = 0.0;
         for (i=(nsub-1)*SUBL; i<(nsub-1)*SUBL+5; i++)
            ssqn[nsub] += sampEn_win[i-(nsub-1)*SUBL]*
                              residual[i]*residual[i];
         for (i=(nsub-1)*SUBL+5; i<(nsub+1)*SUBL-5; i++)
            ssqn[nsub] += residual[i]*residual[i];
         for (i=(nsub+1)*SUBL-5; i<(nsub+1)*SUBL; i++)
            ssqn[nsub] += sampEn_win[(nsub+1)*SUBL-i-1]*
                              residual[i]*residual[i];
      }

   where sampEn_win[5]={1/6, 2/6, 3/6, 4/6, 5/6} MAY be used.  The
   sub-frame number corresponding to the maximum value of
   ssqEn_win[nsub-1]*ssqn[nsub] is selected as the start state
   indicator.  A weighting of ssqEn_win[]={0.8,0.9,1.0,0.9,0.8} for 30
   ms frames and ssqEn_win[]={0.9,1.0,0.9} for 20 ms frames MAY
   advantageously be used to bias the start state towards the middle of
   the frame.
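
   Putting the energy measure and the position weighting together, the
   selection can be sketched as a single routine.  The function name
   find_start is hypothetical, and taking the first maximum when two
   positions tie is an assumption of this sketch.

```c
#include <assert.h>

#define SUBL 40

/* Hypothetical helper: returns the start state indicator (1-based)
   for a residual of nsub_total*SUBL samples; nsub_total is 4 for
   20 ms frames and 6 for 30 ms frames. */
int find_start(const float *residual, int nsub_total)
{
    static const float sampEn_win[5] = {
        1.0f / 6, 2.0f / 6, 3.0f / 6, 4.0f / 6, 5.0f / 6
    };
    static const float win_30ms[5] = {0.8f, 0.9f, 1.0f, 0.9f, 0.8f};
    static const float win_20ms[3] = {0.9f, 1.0f, 0.9f};
    const float *ssqEn_win;
    float best = -1.0f;
    int nsub, i, start = 1;

    assert(nsub_total == 4 || nsub_total == 6);
    ssqEn_win = (nsub_total == 6) ? win_30ms : win_20ms;

    for (nsub = 1; nsub < nsub_total; nsub++) {
        /* r points at sub-frames nsub-1 and nsub */
        const float *r = residual + (nsub - 1) * SUBL;
        float ssqn = 0.0f;

        for (i = 0; i < 5; i++)                    /* ramp up */
            ssqn += sampEn_win[i] * r[i] * r[i];
        for (i = 5; i < 2 * SUBL - 5; i++)         /* flat middle */
            ssqn += r[i] * r[i];
        for (i = 2 * SUBL - 5; i < 2 * SUBL; i++)  /* ramp down */
            ssqn += sampEn_win[2 * SUBL - i - 1] * r[i] * r[i];

        /* keep the first position with the largest weighted energy */
        if (ssqEn_win[nsub - 1] * ssqn > best) {
            best = ssqEn_win[nsub - 1] * ssqn;
            start = nsub;
        }
    }
    return start;
}
```

   The returned value is always at least 1, so index 0 is never
   produced.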

   For 20 ms frames there are three possible positions for the two-sub-
   block length maximum power segment; the start state position is
   encoded with 2 bits.  The start state position, start, MUST be
   encoded as

      start=1: start state in sub-frame 0 and 1
      start=2: start state in sub-frame 1 and 2
      start=3: start state in sub-frame 2 and 3

   For 30 ms frames there are five possible positions of the two-sub-
   block length maximum power segment; the start state position is
   encoded with 3 bits.  The start state position, start, MUST be
   encoded as

      start=1: start state in sub-frame 0 and 1
      start=2: start state in sub-frame 1 and 2
      start=3: start state in sub-frame 2 and 3
      start=4: start state in sub-frame 3 and 4
      start=5: start state in sub-frame 4 and 5

   Hence, in both cases, index 0 is not used.  In order to shorten the
   start state for bit rate efficiency, the start state is brought down
   to STATE_SHORT_LEN=57 samples for 20 ms frames and STATE_SHORT_LEN=58
   samples for 30 ms frames.  The power of the first 23/22 and last
   23/22 samples (for 20/30 ms frames, i.e., 2*SUBL-STATE_SHORT_LEN) of
   the two sub-frame blocks identified above is computed as the sum of
   the squared signal sample values, and the 23/22-sample segment with
   the lower power is excluded from the start state.  One bit is
   transmitted to indicate which of the two possible 57/58 sample
   segments is used.  The start state position within the two
   sub-frames determined above, state_first, MUST be encoded as

      state_first=1: start state is first STATE_SHORT_LEN samples
      state_first=0: start state is last STATE_SHORT_LEN samples
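
   The choice of state_first can be sketched as a comparison of the
   two end segments.  The helper name choose_state_first is
   hypothetical, and the behaviour on equal powers (the first samples
   are dropped) is an assumption of this sketch.

```c
#include <assert.h>

#define SUBL 40

/* Hypothetical helper: seg points at the 2*SUBL residual samples of
   the two selected sub-frames.  Returns state_first. */
int choose_state_first(const float *seg, int state_short_len)
{
    int diff = 2 * SUBL - state_short_len;   /* 23 or 22 samples */
    float en_first = 0.0f, en_last = 0.0f;
    int i;

    assert(diff > 0 && diff < 2 * SUBL);
    for (i = 0; i < diff; i++) {
        float f = seg[i];                     /* leading segment */
        float l = seg[state_short_len + i];   /* trailing segment */
        en_first += f * f;
        en_last  += l * l;
    }
    /* Drop the weaker end.  On a tie the first samples are dropped,
       which is an ASSUMPTION of this sketch. */
    return (en_first > en_last) ? 1 : 0;
}
```

   A return value of 1 means the trailing segment is excluded, so the
   start state consists of the first STATE_SHORT_LEN samples.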

3.5.2.  All-Pass Filtering and Scale Quantization

   The block of residual samples in the start state is first filtered by
   an all-pass filter with the quantized LPC coefficients as denominator
   and reversed quantized LPC coefficients as numerator.  The purpose of
   this phase-dispersion filter is to get a more even distribution of
   the sample values in the residual signal.  The filtering is performed
   by circular convolution, where the initial filter memory is set to
   zero.

      res(0..(STATE_SHORT_LEN-1))   = uncoded start state residual
      res((STATE_SHORT_LEN)..(2*STATE_SHORT_LEN-1)) = 0

      Pk(z) = A~rk(z)/A~k(z), where
                                   ___
                                   \
      A~rk(z)= z^(-LPC_FILTERORDER)+>a~k(i+1)*z^(i-(LPC_FILTERORDER-1))
                                   /__
                               i=0...(LPC_FILTERORDER-1)

      and A~k(z) is taken from the block where the start state begins

      res -> Pk(z) -> filtered

      ccres(k) = filtered(k) + filtered(k+STATE_SHORT_LEN),
                                        k=0..(STATE_SHORT_LEN-1)
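
   The zero padding, pole-zero filtering, and folding steps above can
   be sketched as follows.  The helper name allpass_circular and the
   buffer limit MAX_STATE_LEN are assumptions; the coefficient array a
   holds A~k(z) with a[0] = 1.0.

```c
#include <assert.h>

#define MAX_STATE_LEN 58   /* largest STATE_SHORT_LEN */

/* Hypothetical helper: filters res (len samples) through
   A~rk(z)/A~k(z) with zero initial memory over 2*len samples, then
   folds the second half back onto the first half. */
void allpass_circular(float *ccres, const float *res, int len,
                      const float *a, int order)
{
    float x[2 * MAX_STATE_LEN], y[2 * MAX_STATE_LEN];
    int n, i;

    assert(len > 0 && len <= MAX_STATE_LEN && order >= 1);

    for (n = 0; n < 2 * len; n++)
        x[n] = (n < len) ? res[n] : 0.0f;      /* zero padding */

    for (n = 0; n < 2 * len; n++) {
        float acc = 0.0f;
        /* numerator: time-reversed coefficients, tap i is a[order-i] */
        for (i = 0; i <= order && i <= n; i++)
            acc += a[order - i] * x[n - i];
        /* denominator: feedback through a[1..order] */
        for (i = 1; i <= order && i <= n; i++)
            acc -= a[i] * y[n - i];
        y[n] = acc;
    }

    for (n = 0; n < len; n++)                   /* circular fold */
        ccres[n] = y[n] + y[n + len];
}
```

   With all of a[1..order] equal to zero the filter reduces to a pure
   delay of order samples, and the fold turns that delay into a
   circular shift of the start state.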

   The all-pass filtered block is searched for its largest magnitude
   sample.  The base-10 logarithm of this magnitude is quantized with a
   6-bit quantizer, state_frgqTbl, by finding the nearest
   representation.  This results in an index, idxForMax, corresponding
   to a quantized value, qmax.  The all-pass filtered residual samples
   in the block are then multiplied with a scaling factor
   scal=4.5/(10^qmax) to yield normalized samples.

   state_frgqTbl[64] = {1.000085, 1.071695, 1.140395, 1.206868,
                  1.277188, 1.351503, 1.429380, 1.500727, 1.569049,
                  1.639599, 1.707071, 1.781531, 1.840799, 1.901550,
                  1.956695, 2.006750, 2.055474, 2.102787, 2.142819,
                  2.183592, 2.217962, 2.257177, 2.295739, 2.332967,
                  2.369248, 2.402792, 2.435080, 2.468598, 2.503394,
                  2.539284, 2.572944, 2.605036, 2.636331, 2.668939,
                  2.698780, 2.729101, 2.759786, 2.789834, 2.818679,
                  2.848074, 2.877470, 2.906899, 2.936655, 2.967804,
                  3.000115, 3.033367, 3.066355, 3.104231, 3.141499,
                  3.183012, 3.222952, 3.265433, 3.308441, 3.350823,
                  3.395275, 3.442793, 3.490801, 3.542514, 3.604064,
                  3.666050, 3.740994, 3.830749, 3.938770, 4.101764}
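
   The two search steps involved in scale quantization can be sketched
   as follows.  The helper names max_magnitude and nearest_index are
   hypothetical.  Taking the base-10 logarithm and evaluating
   4.5/(10^qmax) are left to the caller (for example with log10f and
   powf from math.h), so the sketch itself needs no math library.

```c
#include <assert.h>

/* Hypothetical helper: largest absolute sample value in a block. */
float max_magnitude(const float *v, int len)
{
    float m = 0.0f;
    int i;

    assert(len > 0);
    for (i = 0; i < len; i++) {
        float mag = (v[i] < 0.0f) ? -v[i] : v[i];
        if (mag > m)
            m = mag;
    }
    return m;
}

/* Hypothetical helper: index of the table entry closest to x,
   e.g. x = log10 of the largest magnitude, tbl = state_frgqTbl. */
int nearest_index(const float *tbl, int n, float x)
{
    float best_d, d;
    int i, best = 0;

    assert(n > 0);
    best_d = (tbl[0] - x) * (tbl[0] - x);
    for (i = 1; i < n; i++) {
        d = (tbl[i] - x) * (tbl[i] - x);
        if (d < best_d) {
            best_d = d;
            best = i;
        }
    }
    return best;
}
```

   With the full 64-entry table, the returned index would correspond
   to idxForMax and the table entry at that index to qmax.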

3.5.3.  Scalar Quantization

   The normalized samples are quantized in the perceptually weighted
   speech domain by a sample-by-sample scalar DPCM quantization as
   depicted in Figure 3.3.  Each sample in the block is filtered by a
   weighting filter Wk(z), specified in section 3.4, to form a weighted
   speech sample x[n].  The target sample d[n] is formed by subtracting
   a predicted sample y[n] from x[n], where the prediction filter (not
   to be confused with the all-pass filter of section 3.5.2, which
   reuses the name Pk) is given by

           Pk(z) = 1 - 1 / Wk(z).

               +-------+  x[n] +    d[n] +-----------+ u[n]
   residual -->| Wk(z) |-------->(+)---->| Quantizer |------> quantized
               +-------+       - /|\     +-----------+    |   residual
                                  |                      \|/
                             y[n] +--------------------->(+)
                                  |                       |
                                  |        +------+       |
                                  +--------| Pk(z)|<------+
                                           +------+

   Figure 3.3.  Quantization of start state samples by DPCM in weighted
   speech domain.

   The coded state sample u[n] is obtained by quantizing d[n] with a 3-
   bit quantizer with quantization table state_sq3Tbl.

   state_sq3Tbl[8] = {-3.719849, -2.177490, -1.130005, -0.309692,
                  0.444214, 1.329712, 2.436279, 3.983887}
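
   The DPCM loop of Figure 3.3 can be sketched as follows.  The helper
   name dpcm_quantize and the bound MAX_ORDER are assumptions.  The
   input x is taken to be already filtered by Wk(z), and aw[1..order]
   are the coefficients of 1/Wk(z), so that Pk(z) = 1 - 1/Wk(z) has
   taps -aw[i] acting on past values of u[n] + y[n].

```c
#include <assert.h>

#define MAX_ORDER 10

static const float state_sq3Tbl[8] = {
    -3.719849f, -2.177490f, -1.130005f, -0.309692f,
    0.444214f, 1.329712f, 2.436279f, 3.983887f
};

/* Hypothetical helper: writes one 3-bit index per sample to idx[].
   aw[0] is unused and taken as 1.0. */
void dpcm_quantize(int *idx, const float *x, int len,
                   const float *aw, int order)
{
    float hist[MAX_ORDER] = {0.0f};  /* past u[n]+y[n], newest first */
    int n, i, j;

    assert(order >= 1 && order <= MAX_ORDER);
    for (n = 0; n < len; n++) {
        float y = 0.0f, d, err, best_err;

        for (i = 1; i <= order; i++)     /* predicted sample y[n] */
            y -= aw[i] * hist[i - 1];
        d = x[n] - y;                    /* target sample d[n] */

        idx[n] = 0;                      /* nearest table entry */
        best_err = (d - state_sq3Tbl[0]) * (d - state_sq3Tbl[0]);
        for (j = 1; j < 8; j++) {
            err = (d - state_sq3Tbl[j]) * (d - state_sq3Tbl[j]);
            if (err < best_err) {
                best_err = err;
                idx[n] = j;
            }
        }

        for (i = order - 1; i > 0; i--)  /* update predictor input */
            hist[i] = hist[i - 1];
        hist[0] = state_sq3Tbl[idx[n]] + y;
    }
}
```

   Feeding the reconstructed value back into the predictor, rather
   than the unquantized input, keeps the encoder's prediction aligned
   with what a decoder can rebuild from the indices alone.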

   The quantized samples are transformed back to the residual domain by
   1) scaling with 1/scal; 2) time-reversing the scaled samples; 3)
   filtering the time-reversed samples by the same all-pass filter, as
   in section 3.5.2, by using circular convolution; and 4) time-
   reversing the filtered samples.  (More detail is in section 4.2.)

   A reference implementation of the start-state encoding can be found
   in Appendix A.46.
