Internet Engineering Task Force (IETF) JM. Valin
Request for Comments: 6716 Mozilla Corporation
Category: Standards Track K. Vos
ISSN: 2070-1721 Skype Technologies S.A.
September 2012

Definition of the Opus Audio Codec

Abstract

This document defines the Opus interactive speech and audio codec.
Opus is designed to handle a wide range of interactive audio
applications, including Voice over IP, videoconferencing, in-game
chat, and even live, distributed music performances. It scales from
low bitrate narrowband speech at 6 kbit/s to very high quality stereo
music at 510 kbit/s. Opus uses both Linear Prediction (LP) and the
Modified Discrete Cosine Transform (MDCT) to achieve good compression
of both speech and music.
Status of This Memo
This is an Internet Standards Track document.
This document is a product of the Internet Engineering Task Force
(IETF). It represents the consensus of the IETF community. It has
received public review and has been approved for publication by the
Internet Engineering Steering Group (IESG). Further information on
Internet Standards is available in Section 2 of RFC 5741.
Information about the current status of this document, any errata,
and how to provide feedback on it may be obtained at
http://www.rfc-editor.org/info/rfc6716.
Copyright Notice

Copyright (c) 2012 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.
The licenses granted by the IETF Trust to this RFC under Section 3.c
of the Trust Legal Provisions shall also include the right to extract
text from Sections 1 through 8 and Appendix A and Appendix B of this
RFC and create derivative works from these extracts, and to copy,
publish, display and distribute such derivative works in any medium
and for any purpose, provided that no such derivative work shall be
presented, displayed or published in a manner that states or implies
that it is part of this RFC or any other IETF Document.
Table of Contents
1. Introduction
   1.1. Notation and Conventions
2. Opus Codec Overview
   2.1. Control Parameters
        2.1.1. Bitrate
        2.1.2. Number of Channels (Mono/Stereo)
        2.1.3. Audio Bandwidth
        2.1.4. Frame Duration
        2.1.5. Complexity
        2.1.6. Packet Loss Resilience
        2.1.7. Forward Error Correction (FEC)
        2.1.8. Constant/Variable Bitrate
        2.1.9. Discontinuous Transmission (DTX)
3. Internal Framing
   3.1. The TOC Byte
   3.2. Frame Packing
        3.2.1. Frame Length Coding
        3.2.2. Code 0: One Frame in the Packet
        3.2.3. Code 1: Two Frames in the Packet, Each with Equal
               Compressed Size
        3.2.4. Code 2: Two Frames in the Packet, with Different
               Compressed Sizes
1. Introduction

The Opus codec is a real-time interactive audio codec designed to
meet the requirements described in [REQUIREMENTS]. It is composed of
a layer based on Linear Prediction (LP) [LPC] and a layer based on
the Modified Discrete Cosine Transform (MDCT) [MDCT]. The main idea
behind using two layers is as follows: in speech, linear prediction
techniques (such as Code-Excited Linear Prediction, or CELP) code low
frequencies more efficiently than transform (e.g., MDCT) domain
techniques, while the situation is reversed for music and higher
speech frequencies. Thus, a codec with both layers available can
operate over a wider range than either one alone and can achieve
better quality by combining them than by using either one
individually.
The primary normative part of this specification is provided by the
source code in Appendix A. Only the decoder portion of this software
is normative, though a significant amount of code is shared by both
the encoder and decoder. Section 6 provides a decoder conformance
test. The decoder contains a great deal of integer and fixed-point
arithmetic that needs to be performed exactly, including all rounding
considerations, so any useful specification requires domain-specific
symbolic language to adequately define these operations.
Additionally, any conflict between the symbolic representation and
the included reference implementation must be resolved. For the
practical reasons of compatibility and testability, it would be
advantageous to give the reference implementation priority in any
disagreement. The C language is also one of the most widely
understood, human-readable symbolic representations for machine
behavior. For these reasons, this RFC uses the reference
implementation as the sole symbolic representation of the codec.
While the symbolic representation is unambiguous and complete, it is
not always the easiest way to understand the codec's operation. For
this reason, this document also describes significant parts of the
codec in prose and takes the opportunity to explain the rationale
behind many of the more surprising elements of the design. These
descriptions are intended to be accurate and informative, but the
limitations of common English sometimes result in ambiguity, so it is
expected that the reader will always read them alongside the symbolic
representation. Numerous references to the implementation are
provided for this purpose. The descriptions sometimes differ from
the reference in ordering or through mathematical simplification
wherever such deviation makes an explanation easier to understand.
For example, the right shift and left shift operations in the
reference implementation are often described using division and
multiplication in the text. In general, the text is focused on the
"what" and "why" while the symbolic representation most clearly
provides the "how".
1.1. Notation and Conventions
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [RFC2119].
Various operations in the codec require bit-exact fixed-point
behavior, even when writing a floating point implementation. The
notation "Q<n>", where n is an integer, denotes the number of binary
digits to the right of the decimal point in a fixed-point number.
For example, a signed Q14 value in a 16-bit word can represent values
from -2.0 to 1.99993896484375, inclusive. This notation is for
informational purposes only. Arithmetic, when described, always
operates on the underlying integer. For example, the text will
explicitly indicate any shifts required after a multiplication.
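As an illustration of the Q<n> notation and the explicit shifts that
accompany multiplications, the following sketch multiplies two Q14
values. The helper names are this document's own illustration and are
not taken from the reference implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative Q14 helpers; names are not from the reference code. */

/* A signed Q14 value stored in a 16-bit word spans
 * [-32768/16384, 32767/16384] = [-2.0, 1.99993896484375]. */
static double q14_to_double(int16_t q) {
    return q / 16384.0;
}

/* Multiplying two Q14 values yields a Q28 product in 32 bits; an
 * explicit right shift by 14 returns the result to Q14. */
static int16_t mul_q14(int16_t a, int16_t b) {
    return (int16_t)(((int32_t)a * b) >> 14);
}
```

For example, 0.5 is represented as 8192 in Q14, and mul_q14(8192,
8192) yields 4096, i.e., 0.25 in Q14.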
Expressions, where included in the text, follow C operator rules and
precedence, with the exception that the syntax "x**y" indicates x
raised to the power y. The text also makes use of the following
functions:

min(x, y)
   The smallest of two values x and y.

max(x, y)
   The largest of two values x and y.

clamp(lo, x, hi)
   clamp(lo, x, hi) = max(lo, min(x, hi))
   With this definition, if lo > hi, then lo is returned.

sign(x)
   The sign of x, i.e.,
             ( -1, x < 0
   sign(x) = <  0, x == 0
             (  1, x > 0

abs(x)
   The absolute value of x, i.e.,
   abs(x) = sign(x)*x

floor(f)
   The largest integer z such that z <= f.

ceil(f)
   The smallest integer z such that z >= f.

round(f)
   The integer z nearest to f, with ties rounded towards negative
   infinity, i.e.,
   round(f) = ceil(f - 0.5)

log2(f)
   The base-two logarithm of f.

ilog(n)
   The minimum number of bits required to store a positive integer n
   in binary, or 0 for a non-positive integer n.
             ( 0,                n <= 0
   ilog(n) = <
             ( floor(log2(n))+1, n > 0

   Examples:
   o  ilog(-1) = 0
   o  ilog(0) = 0
   o  ilog(1) = 1
   o  ilog(2) = 2
   o  ilog(3) = 2
   o  ilog(4) = 3
   o  ilog(7) = 3
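These definitions translate directly into C; the following sketch is
this document's illustration of two of them (clamp and ilog), not the
reference implementation's code.

```c
#include <assert.h>
#include <stdint.h>

/* clamp(lo, x, hi) = max(lo, min(x, hi)); note that if lo > hi,
 * lo is returned, matching the definition above. */
static int32_t clamp(int32_t lo, int32_t x, int32_t hi) {
    int32_t m = x < hi ? x : hi;   /* min(x, hi) */
    return lo > m ? lo : m;        /* max(lo, min(x, hi)) */
}

/* ilog(n): minimum number of bits needed to store a positive n,
 * or 0 for non-positive n, i.e., floor(log2(n)) + 1 for n > 0. */
static int ilog(int32_t n) {
    int bits = 0;
    while (n > 0) {
        bits++;
        n >>= 1;
    }
    return bits;
}
```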
2. Opus Codec Overview
The Opus codec scales from 6 kbit/s narrowband mono speech to
510 kbit/s fullband stereo music, with algorithmic delays ranging
from 5 ms to 65.2 ms. At any given time, either the LP layer, the
MDCT layer, or both, may be active. It can seamlessly switch between
all of its various operating modes, giving it a great deal of
flexibility to adapt to varying content and network conditions
without renegotiating the current session. The codec allows input
and output of various audio bandwidths, defined as follows:
+----------------------+-----------------+-------------------------+
| Abbreviation         | Audio Bandwidth | Sample Rate (Effective) |
+----------------------+-----------------+-------------------------+
| NB (narrowband)      |           4 kHz |                   8 kHz |
|                      |                 |                         |
| MB (medium-band)     |           6 kHz |                  12 kHz |
|                      |                 |                         |
| WB (wideband)        |           8 kHz |                  16 kHz |
|                      |                 |                         |
| SWB (super-wideband) |          12 kHz |                  24 kHz |
|                      |                 |                         |
| FB (fullband)        |      20 kHz (*) |                  48 kHz |
+----------------------+-----------------+-------------------------+

Table 1
(*) Although the sampling theorem allows a bandwidth as large as half
the sampling rate, Opus never codes audio above 20 kHz, as that is
the generally accepted upper limit of human hearing.
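The mapping in Table 1 can be expressed as a small lookup table. The
enum and array names below are illustrative only and are not part of
the reference implementation's API.

```c
#include <assert.h>

/* Audio bandwidths from Table 1; names are illustrative. */
typedef enum { BW_NB, BW_MB, BW_WB, BW_SWB, BW_FB } bandwidth_t;

/* Audio bandwidth in Hz (FB is capped at 20 kHz, not 24 kHz). */
static const int audio_bandwidth_hz[] = { 4000, 6000, 8000, 12000, 20000 };

/* Effective sample rate in Hz. */
static const int sample_rate_hz[] = { 8000, 12000, 16000, 24000, 48000 };
```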
Opus defines super-wideband (SWB) with an effective sample rate of
24 kHz, unlike some other audio coding standards that use 32 kHz.
This was chosen for a number of reasons. The band layout in the MDCT
layer naturally allows skipping coefficients for frequencies over
12 kHz, but does not allow cleanly dropping just those frequencies
over 16 kHz. A sample rate of 24 kHz also makes resampling in the
MDCT layer easier, as 24 evenly divides 48, and when 24 kHz is
sufficient, it can save computation in other processing, such as
Acoustic Echo Cancellation (AEC). Experimental changes to the band
layout to allow a 16 kHz cutoff (32 kHz effective sample rate) showed
potential quality degradations at other sample rates, and, at typical
bitrates, the number of bits saved by using such a cutoff instead of
coding in fullband (FB) mode is very small. Therefore, if an
application wishes to process a signal sampled at 32 kHz, it should
just use FB.
The LP layer is based on the SILK codec [SILK]. It supports NB, MB,
or WB audio and frame sizes from 10 ms to 60 ms, and requires an
additional 5 ms look-ahead for noise shaping estimation. A small
additional delay (up to 1.5 ms) may be required for sampling rate
conversion. Like Vorbis [VORBIS-WEBSITE] and many other modern
codecs, SILK is inherently designed for variable bitrate (VBR)
coding, though the encoder can also produce constant bitrate (CBR)
streams. The version of SILK used in Opus is substantially modified
from, and not compatible with, the stand-alone SILK codec previously
deployed by Skype. This document does not serve to define that
format, but those interested in the original SILK codec should see
[SILK].
The MDCT layer is based on the Constrained-Energy Lapped Transform
(CELT) codec [CELT]. It supports NB, WB, SWB, or FB audio and frame
sizes from 2.5 ms to 20 ms, and requires an additional 2.5 ms look-
ahead due to the overlapping MDCT windows. The CELT codec is
inherently designed for CBR coding, but unlike many CBR codecs, it is
not limited to a set of predetermined rates. It internally allocates
bits to exactly fill any given target budget, and an encoder can
produce a VBR stream by varying the target on a per-frame basis. The
MDCT layer is not used for speech when the audio bandwidth is WB or
less, as it is not useful there. On the other hand, non-speech
signals are not always adequately coded using linear prediction.
Therefore, the MDCT layer should be used for music signals.
A "Hybrid" mode allows the use of both layers simultaneously with a
frame size of 10 or 20 ms and an SWB or FB audio bandwidth. The LP
layer codes the low frequencies by resampling the signal down to WB.
The MDCT layer follows, coding the high frequency portion of the
signal. The cutoff between the two lies at 8 kHz, the maximum WB
audio bandwidth. In the MDCT layer, all bands below 8 kHz are
discarded, so there is no coding redundancy between the two layers.
The sample rate (in contrast to the actual audio bandwidth) can be
chosen independently on the encoder and decoder side, e.g., a
fullband signal can be decoded as wideband, or vice versa. This
approach ensures a sender and receiver can always interoperate,
regardless of the capabilities of their actual audio hardware.
Internally, the LP layer always operates at a sample rate of twice
the audio bandwidth, up to a maximum of 16 kHz, which it continues to
use for SWB and FB. The decoder simply resamples its output to
support different sample rates. The MDCT layer always operates
internally at a sample rate of 48 kHz. Since all the supported
sample rates evenly divide this rate, and since the decoder may
easily zero out the high frequency portion of the spectrum in the
frequency domain, it can simply decimate the MDCT layer output to
achieve the other supported sample rates very cheaply.
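Because every supported rate evenly divides 48 kHz, decimation reduces
to keeping every Nth sample. The following simplified sketch (not
reference code) illustrates this; it is only alias-free because, as
described above, the decoder has already zeroed the high frequency
portion of the spectrum in the frequency domain.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified decimation sketch: keep every (48000/target_rate_hz)-th
 * sample of a 48 kHz signal. Assumes the spectrum above the target
 * bandwidth has already been zeroed in the frequency domain.
 * Returns the number of output samples written. */
static size_t decimate_48k(const float *in, size_t n_in,
                           float *out, int target_rate_hz) {
    int factor = 48000 / target_rate_hz;  /* always an integer divisor */
    size_t n_out = 0;
    for (size_t i = 0; i < n_in; i += (size_t)factor)
        out[n_out++] = in[i];
    return n_out;
}
```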
After conversion to the common, desired output sample rate, the
decoder simply adds the output from the two layers together. To
compensate for the different look-ahead required by each layer, the
CELT encoder input is delayed by an additional 2.7 ms. This ensures
that low frequencies and high frequencies arrive at the same time.
This extra delay may be reduced by an encoder by using less look-
ahead for noise shaping or using a simpler resampler in the LP layer,
but this will reduce quality. However, the base 2.5 ms look-ahead in
the CELT layer cannot be reduced in the encoder because it is needed
for the MDCT overlap, whose size is fixed by the decoder.
Both layers use the same entropy coder, avoiding any waste from
"padding bits" between them. The hybrid approach makes it easy to
support both CBR and VBR coding. Although the LP layer is VBR, the
bit allocation of the MDCT layer can produce a final stream that is
CBR by using all the bits left unused by the LP layer.
2.1. Control Parameters
The Opus codec includes a number of control parameters that can be
changed dynamically during regular operation of the codec, without
interrupting the audio stream from the encoder to the decoder. These
parameters only affect the encoder since any impact they have on the
bitstream is signaled in-band such that a decoder can decode any Opus
stream without any out-of-band signaling. Any Opus implementation
can add or modify these control parameters without affecting
interoperability. The most important encoder control parameters in
the reference encoder are listed below.
2.1.1. Bitrate

Opus supports all bitrates from 6 kbit/s to 510 kbit/s. All other
parameters being equal, higher bitrate results in higher quality.
For a frame size of 20 ms, these are the bitrate "sweet spots" for
Opus in various configurations:
o 8-12 kbit/s for NB speech,
o 16-20 kbit/s for WB speech,
o 28-40 kbit/s for FB speech,
o 48-64 kbit/s for FB mono music, and
o 64-128 kbit/s for FB stereo music.
2.1.2. Number of Channels (Mono/Stereo)
Opus can transmit either mono or stereo frames within a single
stream. When decoding a mono frame in a stereo decoder, the left and
right channels are identical, and when decoding a stereo frame in a
mono decoder, the mono output is the average of the left and right
channels. In some cases, it is desirable to encode a stereo input
stream in mono (e.g., because the bitrate is too low to encode stereo
with sufficient quality). The number of channels encoded can be
selected in real-time, but by default the reference encoder attempts
to make the best decision possible given the current bitrate.
2.1.3. Audio Bandwidth
The audio bandwidths supported by Opus are listed in Table 1. Just
like for the number of channels, any decoder can decode audio that is
encoded at any bandwidth. For example, any Opus decoder operating at
8 kHz can decode an FB Opus frame, and any Opus decoder operating at
48 kHz can decode an NB frame. Similarly, the reference encoder can
take a 48 kHz input signal and encode it as NB. The higher the audio
bandwidth, the higher the required bitrate to achieve acceptable
quality. The audio bandwidth can be explicitly specified in real-
time, but, by default, the reference encoder attempts to make the
best bandwidth decision possible given the current bitrate.
2.1.4. Frame Duration
Opus can encode frames of 2.5, 5, 10, 20, 40, or 60 ms. It can also
combine multiple frames into packets of up to 120 ms. For real-time
applications, sending fewer packets per second reduces the bitrate,
since it reduces the overhead from IP, UDP, and RTP headers.
However, it increases latency and sensitivity to packet losses, as
losing one packet constitutes a loss of a bigger chunk of audio.
Increasing the frame duration also slightly improves coding
efficiency, but the gain becomes small for frame sizes above 20 ms.
For this reason, 20 ms frames are a good choice for most
applications.
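To see why longer frames reduce overhead, consider the roughly 40
bytes of IPv4 (20), UDP (8), and RTP (12) headers carried by each
packet. This back-of-the-envelope helper (this document's own
illustration, not reference code) gives the resulting overhead
bitrate.

```c
#include <assert.h>

/* Approximate per-packet header sizes in bytes (IPv4 + UDP + RTP). */
#define HEADER_BYTES (20 + 8 + 12)

/* Header overhead in bits per second for a given packet duration. */
static int header_overhead_bps(int packet_ms) {
    int packets_per_second = 1000 / packet_ms;
    return packets_per_second * HEADER_BYTES * 8;
}
```

With 20 ms packets the headers alone cost 16 kbit/s; halving the
packet duration to 10 ms doubles that cost to 32 kbit/s.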
2.1.5. Complexity

There are various aspects of the Opus encoding process where
trade-offs can be made between CPU complexity and quality/bitrate. In the
reference encoder, the complexity is selected using an integer from 0
to 10, where 0 is the lowest complexity and 10 is the highest.
Examples of computations for which such trade-offs may occur are:
o The order of the pitch analysis whitening filter [WHITENING],
o The order of the short-term noise shaping filter,
o The number of states in delayed decision quantization of the
residual signal, and
o The use of certain bitstream features such as variable time-
frequency resolution and the pitch post-filter.
2.1.6. Packet Loss Resilience
Audio codecs often exploit inter-frame correlations to reduce the
bitrate at a cost in error propagation: after losing one packet,
several packets need to be received before the decoder is able to
accurately reconstruct the speech signal. The extent to which Opus
exploits inter-frame dependencies can be adjusted on the fly to
choose a trade-off between bitrate and amount of error propagation.
2.1.7. Forward Error Correction (FEC)
Another mechanism providing robustness against packet loss is the in-
band Forward Error Correction (FEC). Packets that are determined to
contain perceptually important speech information, such as onsets or
transients, are encoded again at a lower bitrate and this re-encoded
information is added to a subsequent packet.
2.1.8. Constant/Variable Bitrate
Opus is more efficient when operating with variable bitrate (VBR),
which is the default. When low-latency transmission is required over
a relatively slow connection, then constrained VBR can also be used.
This uses VBR in a way that simulates a "bit reservoir" and is
equivalent to what MP3 (MPEG 1, Layer 3) and AAC (Advanced Audio
Coding) call CBR (i.e., not true CBR due to the bit reservoir). In
some (rare) applications, constant bitrate (CBR) is required. There
are two main reasons to operate in CBR mode:
o When the transport only supports a fixed size for each compressed
frame, or
o When encryption is used for an audio stream that is either highly
constrained (e.g., yes/no, recorded prompts) or highly sensitive
[SRTP-VBR].
Bitrate may still be allowed to vary, even with sensitive data, as
long as the variation is not driven by the input signal (for example,
to match changing network conditions). To achieve this, an
application should still run Opus in CBR mode, but change the target
rate before each packet.