RFC 6716

Definition of the Opus Audio Codec

Pages: 326
Proposed Standard
Updated by: 8251

RFC6716 - Page 1

Internet Engineering Task Force (IETF)                         JM. Valin
Request for Comments: 6716                           Mozilla Corporation
Category: Standards Track                                         K. Vos
ISSN: 2070-1721                                  Skype Technologies S.A.
                                                           T. Terriberry
                                                     Mozilla Corporation
                                                          September 2012


                   Definition of the Opus Audio Codec

Abstract

   This document defines the Opus interactive speech and audio codec.
   Opus is designed to handle a wide range of interactive audio
   applications, including Voice over IP, videoconferencing, in-game
   chat, and even live, distributed music performances.  It scales from
   low bitrate narrowband speech at 6 kbit/s to very high quality stereo
   music at 510 kbit/s.  Opus uses both Linear Prediction (LP) and the
   Modified Discrete Cosine Transform (MDCT) to achieve good compression
   of both speech and music.

Status of This Memo

   This is an Internet Standards Track document.

   This document is a product of the Internet Engineering Task Force
   (IETF).  It represents the consensus of the IETF community.  It has
   received public review and has been approved for publication by the
   Internet Engineering Steering Group (IESG).  Further information on
   Internet Standards is available in Section 2 of RFC 5741.

   Information about the current status of this document, any errata,
   and how to provide feedback on it may be obtained at
   http://www.rfc-editor.org/info/rfc6716.

Copyright Notice

   Copyright (c) 2012 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

   The licenses granted by the IETF Trust to this RFC under Section 3.c
   of the Trust Legal Provisions shall also include the right to extract
   text from Sections 1 through 8 and Appendix A and Appendix B of this
   RFC and create derivative works from these extracts, and to copy,
   publish, display and distribute such derivative works in any medium
   and for any purpose, provided that no such derivative work shall be
   presented, displayed or published in a manner that states or implies
   that it is part of this RFC or any other IETF Document.

Table of Contents

   1. Introduction ....................................................5
      1.1. Notation and Conventions ...................................6
   2. Opus Codec Overview .............................................8
      2.1. Control Parameters ........................................10
           2.1.1. Bitrate ............................................10
           2.1.2. Number of Channels (Mono/Stereo) ...................11
           2.1.3. Audio Bandwidth ....................................11
           2.1.4. Frame Duration .....................................11
           2.1.5. Complexity .........................................11
           2.1.6. Packet Loss Resilience .............................12
           2.1.7. Forward Error Correction (FEC) .....................12
           2.1.8. Constant/Variable Bitrate ..........................12
           2.1.9. Discontinuous Transmission (DTX) ...................13
   3. Internal Framing ...............................................13
      3.1. The TOC Byte ..............................................13
      3.2. Frame Packing .............................................16
           3.2.1. Frame Length Coding ................................16
           3.2.2. Code 0: One Frame in the Packet ....................16
           3.2.3. Code 1: Two Frames in the Packet, Each with
                  Equal Compressed Size ..............................17
           3.2.4. Code 2: Two Frames in the Packet, with
                  Different Compressed Sizes .........................17
           3.2.5. Code 3: A Signaled Number of Frames in the Packet ..18
      3.3. Examples ..................................................21
      3.4. Receiving Malformed Packets ...............................22
   4. Opus Decoder ...................................................23
      4.1. Range Decoder .............................................23
           4.1.1. Range Decoder Initialization .......................25
           4.1.2. Decoding Symbols ...................................25
           4.1.3. Alternate Decoding Methods .........................27
           4.1.4. Decoding Raw Bits ..................................29
           4.1.5. Decoding Uniformly Distributed Integers ............29
           4.1.6. Current Bit Usage ..................................30
      4.2. SILK Decoder ..............................................32
           4.2.1. SILK Decoder Modules ...............................32
           4.2.2. LP Layer Organization ..............................33
           4.2.3. Header Bits ........................................35
           4.2.4. Per-Frame LBRR Flags ...............................36
           4.2.5. LBRR Frames ........................................36
           4.2.6. Regular SILK Frames ................................37
           4.2.7. SILK Frame Contents ................................37
                  4.2.7.1. Stereo Prediction Weights .................40
                  4.2.7.2. Mid-Only Flag .............................42
                  4.2.7.3. Frame Type ................................43
                  4.2.7.4. Subframe Gains ............................44
                  4.2.7.5. Normalized Line Spectral Frequency
                           (LSF) and Linear Predictive Coding (LPC)
                           Coefficients ..............................46
                  4.2.7.6. Long-Term Prediction (LTP) Parameters .....74
                  4.2.7.7. Linear Congruential Generator (LCG) Seed ..86
                  4.2.7.8. Excitation ................................86
                  4.2.7.9. SILK Frame Reconstruction .................98
           4.2.8. Stereo Unmixing ...................................102
           4.2.9. Resampling ........................................103
      4.3. CELT Decoder .............................................104
           4.3.1. Transient Decoding ................................108
           4.3.2. Energy Envelope Decoding ..........................108
           4.3.3. Bit Allocation ....................................110
           4.3.4. Shape Decoding ....................................116
           4.3.5. Anti-collapse Processing ..........................120
           4.3.6. Denormalization ...................................121
           4.3.7. Inverse MDCT ......................................121
      4.4. Packet Loss Concealment (PLC) ............................122
           4.4.1. Clock Drift Compensation ..........................122
      4.5. Configuration Switching ..................................123
           4.5.1. Transition Side Information (Redundancy) ..........124
           4.5.2. State Reset .......................................127
           4.5.3. Summary of Transitions ............................128
   5. Opus Encoder ..................................................131
      5.1. Range Encoder ............................................132
           5.1.1. Encoding Symbols ..................................133
           5.1.2. Alternate Encoding Methods ........................134
           5.1.3. Encoding Raw Bits .................................135
           5.1.4. Encoding Uniformly Distributed Integers ...........135
           5.1.5. Finalizing the Stream .............................135
           5.1.6. Current Bit Usage .................................136
      5.2. SILK Encoder .............................................136
           5.2.1. Sample Rate Conversion ............................137
           5.2.2. Stereo Mixing .....................................137
           5.2.3. SILK Core Encoder .................................138
      5.3. CELT Encoder .............................................150
           5.3.1. Pitch Pre-filter ..................................150
           5.3.2. Bands and Normalization ...........................151
           5.3.3. Energy Envelope Quantization ......................151
           5.3.4. Bit Allocation ....................................151
           5.3.5. Stereo Decisions ..................................152
           5.3.6. Time-Frequency Decision ...........................153
           5.3.7. Spreading Values Decision .........................153
           5.3.8. Spherical Vector Quantization .....................154
   6. Conformance ...................................................155
      6.1. Testing ..................................................155
      6.2. Opus Custom ..............................................156
   7. Security Considerations .......................................157
   8. Acknowledgements ..............................................158
   9. References ....................................................159
      9.1. Normative References .....................................159
      9.2. Informative References ...................................159
   Appendix A. Reference Implementation .............................163
      A.1. Extracting the Source ....................................164
      A.2. Up-to-Date Implementation ................................164
      A.3. Base64-Encoded Source Code ...............................164
      A.4. Test Vectors .............................................321
   Appendix B. Self-Delimiting Framing ..............................321

1.  Introduction

   The Opus codec is a real-time interactive audio codec designed to
   meet the requirements described in [REQUIREMENTS].  It is composed of
   a layer based on Linear Prediction (LP) [LPC] and a layer based on
   the Modified Discrete Cosine Transform (MDCT) [MDCT].  The main idea
   behind using two layers is as follows: in speech, linear prediction
   techniques (such as Code-Excited Linear Prediction, or CELP) code low
   frequencies more efficiently than transform (e.g., MDCT) domain
   techniques, while the situation is reversed for music and higher
   speech frequencies.  Thus, a codec with both layers available can
   operate over a wider range than either one alone and can achieve
   better quality by combining them than by using either one
   individually.

   The primary normative part of this specification is provided by the
   source code in Appendix A.  Only the decoder portion of this software
   is normative, though a significant amount of code is shared by both
   the encoder and decoder.  Section 6 provides a decoder conformance
   test.  The decoder contains a great deal of integer and fixed-point
   arithmetic that needs to be performed exactly, including all rounding
   considerations, so any useful specification requires domain-specific
   symbolic language to adequately define these operations.
   Additionally, any conflict between the symbolic representation and
   the included reference implementation must be resolved.  For the
   practical reasons of compatibility and testability, it would be
   advantageous to give the reference implementation priority in any
   disagreement.  The C language is also one of the most widely
   understood, human-readable symbolic representations for machine
   behavior.  For these reasons, this RFC uses the reference
   implementation as the sole symbolic representation of the codec.

   While the symbolic representation is unambiguous and complete, it is
   not always the easiest way to understand the codec's operation.  For
   this reason, this document also describes significant parts of the
   codec in prose and takes the opportunity to explain the rationale
   behind many of the more surprising elements of the design.  These
   descriptions are intended to be accurate and informative, but the
   limitations of common English sometimes result in ambiguity, so it is
   expected that the reader will always read them alongside the symbolic
   representation.  Numerous references to the implementation are
   provided for this purpose.  The descriptions sometimes differ from
   the reference in ordering or through mathematical simplification
   wherever such deviation makes an explanation easier to understand.
   For example, the right shift and left shift operations in the
   reference implementation are often described using division and
   multiplication in the text.  In general, the text is focused on the
   "what" and "why" while the symbolic representation most clearly
   provides the "how".

1.1.  Notation and Conventions

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

   Various operations in the codec require bit-exact fixed-point
   behavior, even when writing a floating point implementation.  The
   notation "Q<n>", where n is an integer, denotes the number of binary
   digits to the right of the binary point in a fixed-point number.
   For example, a signed Q14 value in a 16-bit word can represent values
   from -2.0 to 1.99993896484375, inclusive.  This notation is for
   informational purposes only.  Arithmetic, when described, always
   operates on the underlying integer.  For example, the text will
   explicitly indicate any shifts required after a multiplication.
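
   As an illustration of the Q notation, the following sketch converts
   a Q14 value to a real number and multiplies two Q14 values with the
   explicit shift mentioned above.  The function names are
   hypothetical and are not taken from the reference implementation;
   the multiply is restricted to non-negative operands as a
   simplifying assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper names; they do not appear in the reference
   implementation. */

/* Interpret a signed 16-bit Q14 integer as a real number. */
static double q14_to_double(int16_t v)
{
    return (double)v / 16384.0;   /* 16384 = 2**14 */
}

/* Multiply two non-negative Q14 values.  The 32-bit product has 28
   fractional bits (Q28), so an explicit right shift by 14 is needed
   to return to Q14.  Negative operands are excluded here because
   right-shifting a negative signed integer is implementation-defined
   in C. */
static int32_t q14_mul(int16_t a, int16_t b)
{
    int32_t prod;
    assert(a >= 0 && b >= 0);
    prod = (int32_t)a * (int32_t)b;   /* Q28 */
    return prod >> 14;                /* Q14 */
}
```

   With these helpers, the 16-bit extremes map to -2.0 and
   1.99993896484375, and 1.5 times 0.5 (24576 and 8192 in Q14) yields
   12288, i.e., 0.75.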

   Expressions, where included in the text, follow C operator rules and
   precedence, with the exception that the syntax "x**y" indicates x
   raised to the power y.  The text also makes use of the following
   functions.

1.1.1.  min(x,y)

   The smaller of the two values x and y.

1.1.2.  max(x,y)

   The larger of the two values x and y.

1.1.3.  clamp(lo,x,hi)

                     clamp(lo,x,hi) = max(lo,min(x,hi))

   With this definition, if lo > hi, then lo is returned.
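
   A direct transcription of min, max, and clamp for integers might
   look as follows.  The "rfc_" prefix is a hypothetical naming choice
   made here to avoid clashing with commonly defined macros.

```c
/* Hypothetical names; integer-only versions of the text's functions. */
static int rfc_min(int x, int y) { return x < y ? x : y; }
static int rfc_max(int x, int y) { return x > y ? x : y; }

/* clamp(lo,x,hi) = max(lo,min(x,hi)) */
static int rfc_clamp(int lo, int x, int hi)
{
    return rfc_max(lo, rfc_min(x, hi));
}
```

   Because max() is applied last, rfc_clamp(10, 5, 0) and
   rfc_clamp(10, 20, 0) both return 10, matching the lo > hi case
   described above.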

1.1.4.  sign(x)

   The sign of x, i.e.,

                                    ( -1,  x < 0
                          sign(x) = <  0,  x == 0
                                    (  1,  x > 0

1.1.5.  abs(x)

   The absolute value of x, i.e.,

                             abs(x) = sign(x)*x
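
   A minimal integer sketch of sign and abs as defined in the text is
   given below.  The names are hypothetical.  Note that, as with any
   int-based absolute value in C, the result is not representable for
   the most negative int.

```c
/* Hypothetical names; integer-only versions of the text's functions. */

/* Returns -1, 0, or 1 for negative, zero, or positive x. */
static int rfc_sign(int x)
{
    return (x > 0) - (x < 0);
}

/* abs(x) = sign(x)*x; overflows for the most negative int. */
static int rfc_abs(int x)
{
    return rfc_sign(x) * x;
}
```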

1.1.6.  floor(f)

   The largest integer z such that z <= f.

1.1.7.  ceil(f)

   The smallest integer z such that z >= f.

1.1.8.  round(f)

   The integer z nearest to f, with ties rounded towards negative
   infinity, i.e.,

                           round(f) = ceil(f - 0.5)
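
   The tie-breaking rule differs from that of the C library round(),
   which rounds ties away from zero.  The sketch below evaluates
   ceil(f - 0.5) without the math library; the function name is
   hypothetical, and the input is assumed to lie well within the range
   of a long.

```c
/* Hypothetical name.  Computes ceil(f - 0.5), i.e., the nearest
   integer with ties rounded towards negative infinity.  Assumes
   f - 0.5 is within the range of a long. */
static long rfc_round(double f)
{
    double t = f - 0.5;
    long z = (long)t;        /* conversion truncates towards zero */
    if ((double)z < t)       /* truncation went below t: step up  */
        z += 1;
    return z;
}
```

   For example, rfc_round(0.5) is 0, rfc_round(1.5) is 1, and
   rfc_round(-0.5) is -1.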

1.1.9.  log2(f)

   The base-two logarithm of f.

1.1.10.  ilog(n)

   The minimum number of bits required to store a positive integer n in
   binary, or 0 for a non-positive integer n.

                              ( 0,                 n <= 0
                    ilog(n) = <
                              ( floor(log2(n))+1,  n > 0

   Examples:

   o  ilog(-1) = 0

   o  ilog(0) = 0

   o  ilog(1) = 1

   o  ilog(2) = 2

   o  ilog(3) = 2

   o  ilog(4) = 3

   o  ilog(7) = 3
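
   The definition of ilog can be sketched with a simple shift loop.
   The name rfc_ilog is hypothetical; optimized implementations
   typically use a count-leading-zeros operation instead.

```c
#include <stdint.h>

/* Hypothetical name.  Number of bits needed to store a positive n,
   or 0 when n is not positive. */
static int rfc_ilog(int32_t n)
{
    int bits = 0;
    if (n <= 0)
        return 0;
    while (n > 0) {
        bits++;
        n >>= 1;   /* n is positive here, so the shift is well defined */
    }
    return bits;
}
```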

2.  Opus Codec Overview

   The Opus codec scales from 6 kbit/s narrowband mono speech to
   510 kbit/s fullband stereo music, with algorithmic delays ranging
   from 5 ms to 65.2 ms.  At any given time, either the LP layer, the
   MDCT layer, or both, may be active.  It can seamlessly switch between
   all of its various operating modes, giving it a great deal of
   flexibility to adapt to varying content and network conditions
   without renegotiating the current session.  The codec allows input
   and output of various audio bandwidths, defined as follows:

   +----------------------+-----------------+-------------------------+
   | Abbreviation         | Audio Bandwidth | Sample Rate (Effective) |
   +----------------------+-----------------+-------------------------+
   | NB (narrowband)      |           4 kHz |                   8 kHz |
   |                      |                 |                         |
   | MB (medium-band)     |           6 kHz |                  12 kHz |
   |                      |                 |                         |
   | WB (wideband)        |           8 kHz |                  16 kHz |
   |                      |                 |                         |
   | SWB (super-wideband) |          12 kHz |                  24 kHz |
   |                      |                 |                         |
   | FB (fullband)        |      20 kHz (*) |                  48 kHz |
   +----------------------+-----------------+-------------------------+

                                  Table 1

   (*) Although the sampling theorem allows a bandwidth as large as half
   the sampling rate, Opus never codes audio above 20 kHz, as that is
   the generally accepted upper limit of human hearing.
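
   For illustration, Table 1 can be held as a small lookup table.  The
   type and function names below are hypothetical and are not part of
   any Opus API.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical representation of one row of Table 1. */
typedef struct {
    const char *abbrev;
    int audio_bandwidth_hz;
    int effective_rate_hz;
} bandwidth_row;

static const bandwidth_row TABLE1[] = {
    { "NB",   4000,  8000 },
    { "MB",   6000, 12000 },
    { "WB",   8000, 16000 },
    { "SWB", 12000, 24000 },
    { "FB",  20000, 48000 },
};

/* Look up a row by abbreviation; returns NULL if it is unknown. */
static const bandwidth_row *find_bandwidth(const char *abbrev)
{
    size_t i;
    for (i = 0; i < sizeof TABLE1 / sizeof TABLE1[0]; i++) {
        if (strcmp(TABLE1[i].abbrev, abbrev) == 0)
            return &TABLE1[i];
    }
    return NULL;
}

/* Returns 1 when the coded bandwidth is exactly half the effective
   sample rate.  Per the footnote, this holds for every row except
   FB. */
static int codes_full_nyquist_band(const bandwidth_row *row)
{
    return 2 * row->audio_bandwidth_hz == row->effective_rate_hz;
}
```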

   Opus defines super-wideband (SWB) with an effective sample rate of
   24 kHz, unlike some other audio coding standards that use 32 kHz.
   This was chosen for a number of reasons.  The band layout in the MDCT
   layer naturally allows skipping coefficients for frequencies over
   12 kHz, but does not allow cleanly dropping just those frequencies
   over 16 kHz.  A sample rate of 24 kHz also makes resampling in the
   MDCT layer easier, as 24 evenly divides 48, and when 24 kHz is
   sufficient, it can save computation in other processing, such as
   Acoustic Echo Cancellation (AEC).  Experimental changes to the band
   layout to allow a 16 kHz cutoff (32 kHz effective sample rate) showed
   potential quality degradations at other sample rates, and, at typical
   bitrates, the number of bits saved by using such a cutoff instead of
   coding in fullband (FB) mode is very small.  Therefore, if an
   application wishes to process a signal sampled at 32 kHz, it should
   just use FB.

   The LP layer is based on the SILK codec [SILK].  It supports NB, MB,
   or WB audio and frame sizes from 10 ms to 60 ms, and requires an
   additional 5 ms look-ahead for noise shaping estimation.  A small
   additional delay (up to 1.5 ms) may be required for sampling rate
   conversion.  Like Vorbis [VORBIS-WEBSITE] and many other modern
   codecs, SILK is inherently designed for variable bitrate (VBR)
   coding, though the encoder can also produce constant bitrate (CBR)
   streams.  The version of SILK used in Opus is substantially modified
   from, and not compatible with, the stand-alone SILK codec previously
   deployed by Skype.  This document does not serve to define that
   format, but those interested in the original SILK codec should see
   [SILK] instead.

   The MDCT layer is based on the Constrained-Energy Lapped Transform
   (CELT) codec [CELT].  It supports NB, WB, SWB, or FB audio and frame
   sizes from 2.5 ms to 20 ms, and requires an additional 2.5 ms look-
   ahead due to the overlapping MDCT windows.  The CELT codec is
   inherently designed for CBR coding, but unlike many CBR codecs, it is
   not limited to a set of predetermined rates.  It internally allocates
   bits to exactly fill any given target budget, and an encoder can
   produce a VBR stream by varying the target on a per-frame basis.  The
   MDCT layer is not used for speech when the audio bandwidth is WB or
   less, as it is not useful there.  On the other hand, non-speech
   signals are not always adequately coded using linear prediction.
   Therefore, the MDCT layer should be used for music signals.

   A "Hybrid" mode allows the use of both layers simultaneously with a
   frame size of 10 or 20 ms and an SWB or FB audio bandwidth.  The LP
   layer codes the low frequencies by resampling the signal down to WB.
   The MDCT layer follows, coding the high frequency portion of the
   signal.  The cutoff between the two lies at 8 kHz, the maximum WB
   audio bandwidth.  In the MDCT layer, all bands below 8 kHz are
   discarded, so there is no coding redundancy between the two layers.
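
   The bandwidth and frame-size constraints stated above for the LP
   layer, the MDCT layer, and Hybrid mode can be summarized as a
   validity check.  This is a sketch under the assumption that the
   frame sizes in each range are those listed in Section 2.1.4; the
   enum and function names are hypothetical.  Frame durations are
   given in tenths of a millisecond so that 2.5 ms is an integer.

```c
#include <stdbool.h>

/* Hypothetical names for the three operating modes and five
   audio bandwidths. */
typedef enum { LAYER_LP_ONLY, LAYER_HYBRID, LAYER_MDCT_ONLY } layer_mode;
typedef enum { BAND_NB, BAND_MB, BAND_WB, BAND_SWB, BAND_FB } audio_band;

/* frame is the frame duration in tenths of a millisecond
   (25 = 2.5 ms, 600 = 60 ms). */
static bool config_is_valid(layer_mode mode, audio_band band, int frame)
{
    switch (mode) {
    case LAYER_LP_ONLY:    /* NB, MB, or WB; 10 ms to 60 ms */
        return (band == BAND_NB || band == BAND_MB || band == BAND_WB)
            && (frame == 100 || frame == 200 ||
                frame == 400 || frame == 600);
    case LAYER_HYBRID:     /* SWB or FB; 10 or 20 ms */
        return (band == BAND_SWB || band == BAND_FB)
            && (frame == 100 || frame == 200);
    case LAYER_MDCT_ONLY:  /* NB, WB, SWB, or FB; 2.5 ms to 20 ms */
        return band != BAND_MB
            && (frame == 25 || frame == 50 ||
                frame == 100 || frame == 200);
    }
    return false;
}
```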

   The sample rate (in contrast to the actual audio bandwidth) can be
   chosen independently on the encoder and decoder side, e.g., a
   fullband signal can be decoded as wideband, or vice versa.  This
   approach ensures a sender and receiver can always interoperate,
   regardless of the capabilities of their actual audio hardware.
   Internally, the LP layer always operates at a sample rate of twice
   the audio bandwidth, up to a maximum of 16 kHz, which it continues to
   use for SWB and FB.  The decoder simply resamples its output to
   support different sample rates.  The MDCT layer always operates
   internally at a sample rate of 48 kHz.  Since all the supported
   sample rates evenly divide this rate, and since the decoder may
   easily zero out the high frequency portion of the spectrum in the
   frequency domain, it can simply decimate the MDCT layer output to
   achieve the other supported sample rates very cheaply.
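
   The decimation step can be sketched as follows.  The function names
   are hypothetical, and the sketch assumes that the spectrum above
   the new Nyquist frequency has already been zeroed, as described
   above; it performs no filtering of its own.

```c
#include <stddef.h>

/* Hypothetical helper: integer ratio between the 48 kHz internal
   rate of the MDCT layer and a supported output rate, or 0 if the
   rate is not one of those in Table 1. */
static int mdct_decimation_factor(int output_rate_hz)
{
    switch (output_rate_hz) {
    case 8000:
    case 12000:
    case 16000:
    case 24000:
    case 48000:
        return 48000 / output_rate_hz;
    default:
        return 0;
    }
}

/* Keep every factor-th sample of in[0..n_in-1].  Only valid when the
   input carries no energy above the new Nyquist frequency.  Returns
   the number of samples written to out. */
static size_t decimate(const float *in, size_t n_in, int factor,
                       float *out)
{
    size_t i, n_out = 0;
    for (i = 0; i < n_in; i += (size_t)factor)
        out[n_out++] = in[i];
    return n_out;
}
```

   For instance, a 16 kHz output keeps every third sample of the
   48 kHz signal.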

   After conversion to the common, desired output sample rate, the
   decoder simply adds the output from the two layers together.  To
   compensate for the different look-ahead required by each layer, the
   CELT encoder input is delayed by an additional 2.7 ms.  This ensures
   that low frequencies and high frequencies arrive at the same time.
   This extra delay may be reduced by an encoder by using less look-
   ahead for noise shaping or using a simpler resampler in the LP layer,
   but this will reduce quality.  However, the base 2.5 ms look-ahead in
   the CELT layer cannot be reduced in the encoder because it is needed
   for the MDCT overlap, whose size is fixed by the decoder.

   Both layers use the same entropy coder, avoiding any waste from
   "padding bits" between them.  The hybrid approach makes it easy to
   support both CBR and VBR coding.  Although the LP layer is VBR, the
   bit allocation of the MDCT layer can produce a final stream that is
   CBR by using all the bits left unused by the LP layer.

2.1.  Control Parameters

   The Opus codec includes a number of control parameters that can be
   changed dynamically during regular operation of the codec, without
   interrupting the audio stream from the encoder to the decoder.  These
   parameters only affect the encoder since any impact they have on the
   bitstream is signaled in-band such that a decoder can decode any Opus
   stream without any out-of-band signaling.  Any Opus implementation
   can add or modify these control parameters without affecting
   interoperability.  The most important encoder control parameters in
   the reference encoder are listed below.

2.1.1.  Bitrate

   Opus supports all bitrates from 6 kbit/s to 510 kbit/s.  All other
   parameters being equal, higher bitrate results in higher quality.
   For a frame size of 20 ms, these are the bitrate "sweet spots" for
   Opus in various configurations:

   o  8-12 kbit/s for NB speech,

   o  16-20 kbit/s for WB speech,

   o  28-40 kbit/s for FB speech,

   o  48-64 kbit/s for FB mono music, and

   o  64-128 kbit/s for FB stereo music.
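
   To relate these bitrates to compressed frame sizes, the following
   sketch computes the number of whole bytes available per frame at a
   given bitrate and frame duration.  The function name is
   hypothetical, and the result is truncated towards zero.

```c
#include <stdint.h>

/* Hypothetical helper: whole bytes per frame for a bitrate in bit/s
   and a frame duration in microseconds. */
static int32_t bytes_per_frame(int32_t bitrate_bps, int32_t frame_us)
{
    /* bits = bitrate * seconds; bytes = bits / 8 */
    return (int32_t)(((int64_t)bitrate_bps * frame_us) / 8000000);
}
```

   With 20 ms frames, 6 kbit/s corresponds to 15 bytes per frame,
   64 kbit/s to 160 bytes, and 510 kbit/s to 1275 bytes.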

2.1.2.  Number of Channels (Mono/Stereo)

   Opus can transmit either mono or stereo frames within a single
   stream.  When decoding a mono frame in a stereo decoder, the left and
   right channels are identical, and when decoding a stereo frame in a
   mono decoder, the mono output is the average of the left and right
   channels.  In some cases, it is desirable to encode a stereo input
   stream in mono (e.g., because the bitrate is too low to encode stereo
   with sufficient quality).  The number of channels encoded can be
   selected in real-time, but by default the reference encoder attempts
   to make the best decision possible given the current bitrate.
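
   The channel conversions described above can be illustrated on
   floating-point sample buffers.  This sketch only shows the stated
   input/output relationship; it is not how the reference decoder is
   structured, and the function names and the interleaved
   left/right layout are assumptions made for the example.

```c
#include <stddef.h>

/* Hypothetical helper: duplicate each mono sample into the left and
   right slots of an interleaved stereo buffer of 2*n samples. */
static void mono_to_stereo(const float *mono, size_t n, float *stereo)
{
    size_t i;
    for (i = 0; i < n; i++) {
        stereo[2 * i]     = mono[i];
        stereo[2 * i + 1] = mono[i];
    }
}

/* Hypothetical helper: average the left and right samples of an
   interleaved stereo buffer into n mono samples. */
static void stereo_to_mono(const float *stereo, size_t n, float *mono)
{
    size_t i;
    for (i = 0; i < n; i++)
        mono[i] = 0.5f * (stereo[2 * i] + stereo[2 * i + 1]);
}
```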

2.1.3.  Audio Bandwidth

   The audio bandwidths supported by Opus are listed in Table 1.  As
   with the number of channels, any decoder can decode audio that is
   encoded at any bandwidth.  For example, any Opus decoder operating at
   8 kHz can decode an FB Opus frame, and any Opus decoder operating at
   48 kHz can decode an NB frame.  Similarly, the reference encoder can
   take a 48 kHz input signal and encode it as NB.  The higher the audio
   bandwidth, the higher the required bitrate to achieve acceptable
   quality.  The audio bandwidth can be explicitly specified in real-
   time, but, by default, the reference encoder attempts to make the
   best bandwidth decision possible given the current bitrate.

2.1.4.  Frame Duration

   Opus can encode frames of 2.5, 5, 10, 20, 40, or 60 ms.  It can also
   combine multiple frames into packets of up to 120 ms.  For real-time
   applications, sending fewer packets per second reduces the bitrate,
   since it reduces the overhead from IP, UDP, and RTP headers.
   However, it increases latency and sensitivity to packet losses, as
   losing one packet constitutes a loss of a bigger chunk of audio.
   Increasing the frame duration also slightly improves coding
   efficiency, but the gain becomes small for frame sizes above 20 ms.
   For this reason, 20 ms frames are a good choice for most
   applications.
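
   The header overhead mentioned above can be estimated with a short
   calculation.  The sketch assumes an IPv4 header without options
   (20 bytes), a UDP header (8 bytes), and an RTP header without CSRC
   entries or extensions (12 bytes); other transports will differ.
   The names are hypothetical.

```c
#include <stdint.h>

/* Assumed per-packet header sizes in bytes (see the text above). */
enum { IPV4_HDR_BYTES = 20, UDP_HDR_BYTES = 8, RTP_HDR_BYTES = 12 };

/* Hypothetical helper: header overhead in bit/s when one packet is
   sent every packet_us microseconds.  Truncates towards zero. */
static int32_t header_overhead_bps(int32_t packet_us)
{
    int64_t bits = 8 * (IPV4_HDR_BYTES + UDP_HDR_BYTES + RTP_HDR_BYTES);
    return (int32_t)(bits * 1000000 / packet_us);
}
```

   Under these assumptions, the overhead is 128 kbit/s for 2.5 ms
   packets, 16 kbit/s for 20 ms packets, and about 5.3 kbit/s for
   60 ms packets.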

2.1.5.  Complexity

   There are various aspects of the Opus encoding process where trade-
   offs can be made between CPU complexity and quality/bitrate.  In the
   reference encoder, the complexity is selected using an integer from 0
   to 10, where 0 is the lowest complexity and 10 is the highest.
   Examples of computations for which such trade-offs may occur are:

   o  The order of the pitch analysis whitening filter [WHITENING],

   o  The order of the short-term noise shaping filter,

   o  The number of states in delayed decision quantization of the
      residual signal, and

   o  The use of certain bitstream features such as variable time-
      frequency resolution and the pitch post-filter.

2.1.6.  Packet Loss Resilience

   Audio codecs often exploit inter-frame correlations to reduce the
   bitrate at a cost in error propagation: after losing one packet,
   several packets need to be received before the decoder is able to
   accurately reconstruct the speech signal.  The extent to which Opus
   exploits inter-frame dependencies can be adjusted on the fly to
   choose a trade-off between bitrate and amount of error propagation.

2.1.7.  Forward Error Correction (FEC)

   Another mechanism providing robustness against packet loss is the in-
   band Forward Error Correction (FEC).  Packets that are determined to
   contain perceptually important speech information, such as onsets or
   transients, are encoded again at a lower bitrate and this re-encoded
   information is added to a subsequent packet.

2.1.8.  Constant/Variable Bitrate

   Opus is more efficient when operating with variable bitrate (VBR),
   which is the default.  When low-latency transmission is required over
   a relatively slow connection, constrained VBR can also be used.
   This uses VBR in a way that simulates a "bit reservoir" and is
   equivalent to what MP3 (MPEG 1, Layer 3) and AAC (Advanced Audio
   Coding) call CBR (i.e., not true CBR due to the bit reservoir).  In
   some (rare) applications, constant bitrate (CBR) is required.  There
   are two main reasons to operate in CBR mode:

   o  When the transport only supports a fixed size for each compressed
      frame, or

   o  When encryption is used for an audio stream that is either highly
      constrained (e.g., yes/no, recorded prompts) or highly sensitive
      [SRTP-VBR].

   Bitrate may still be allowed to vary, even with sensitive data, as
   long as the variation is not driven by the input signal (for example,
   to match changing network conditions).  To achieve this, an
   application should still run Opus in CBR mode, but change the target
   rate before each packet.

2.1.9.  Discontinuous Transmission (DTX)

   Discontinuous Transmission (DTX) reduces the bitrate during silence
   or background noise.  When DTX is enabled, only one frame is encoded
   every 400 milliseconds during such periods.
