Internet Engineering Task Force (IETF) J. Lennox, Ed. Request for Comments: 6464 Vidyo Category: Standards Track E. Ivov ISSN: 2070-1721 Jitsi E. Marocco Telecom Italia December 2011 A Real-time Transport Protocol (RTP) Header Extension for Client-to-Mixer Audio Level Indication
AbstractThis document defines a mechanism by which packets of Real-time Transport Protocol (RTP) audio streams can indicate, in an RTP header extension, the audio level of the audio sample carried in the RTP packet. In large conferences, this can reduce the load on an audio mixer or other middlebox that wants to forward only a few of the loudest audio streams, without requiring it to decode and measure every stream that is received. Status of This Memo This is an Internet Standards Track document. This document is a product of the Internet Engineering Task Force (IETF). It represents the consensus of the IETF community. It has received public review and has been approved for publication by the Internet Engineering Steering Group (IESG). Further information on Internet Standards is available in Section 2 of RFC 5741. Information about the current status of this document, any errata, and how to provide feedback on it may be obtained at http://www.rfc-editor.org/info/rfc6464.
Copyright Notice Copyright (c) 2011 IETF Trust and the persons identified as the document authors. All rights reserved. This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License. 1. Introduction ....................................................2 2. Terminology .....................................................3 3. Audio Levels ....................................................3 4. Signaling (Setup) Information ...................................5 5. Considerations on Use ...........................................6 6. Security Considerations .........................................6 7. IANA Considerations .............................................7 8. References ......................................................7 8.1. Normative References .......................................7 8.2. Informative References .....................................8 RFC3550] audio conference, an audio mixer or forwarder receives audio streams from many or all of the conference participants. It then selectively forwards some of them to other participants in the conference. In large conferences, it is possible that such a server might be receiving a large number of streams, of which only a few are intended to be forwarded to the other conference participants. In such a scenario, in order to pick the audio streams to forward, a centralized server needs to decode, measure audio levels, and possibly perform voice activity detection on audio data from a large number of streams. The need for such processing limits the size or number of conferences such a server can support. As an alternative, this document defines an RTP header extension [RFC5285] through which senders of audio packets can indicate the audio level of the packets' payload, reducing the processing load for a server.
The header extension in this document is different than, but complementary with, the one defined in [RFC6465], which defines a mechanism by which audio mixers can indicate to clients the levels of the contributing sources that made up the mixed audio. RFC2119] and indicate requirement levels for compliant implementations. RFC3550] payload of the packet with which it is associated. This information is carried in an RTP header extension element as defined by "A General Mechanism for RTP Header Extensions" [RFC5285]. The payload of the audio level header extension element can be encoded using either the one-byte or two-byte header defined in [RFC5285]. Figures 1 and 2 show sample audio level encodings with each of these header formats. 0 1 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ID | len=0 |V| level | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ Figure 1: Sample Audio Level Encoding Using the One-Byte Header Format 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ID | len=1 |V| level | 0 (pad) | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ Figure 2: Sample Audio Level Encoding Using the Two-Byte Header Format Note that, as indicated in [RFC5285], the length field in the one- byte header format takes the value 0 to indicate that 1 byte follows. In the two-byte header format, on the other hand, the length field takes the value of 1.
The magnitude of the audio level itself is packed into the seven least significant bits of the single byte of the header extension, shown in Figures 1 and 2. The least significant bit of the audio level magnitude is packed into the least significant bit of the byte. The most significant bit of the byte is used as a separate flag bit "V", defined below. The audio level is expressed in -dBov, with values from 0 to 127 representing 0 to -127 dBov. dBov is the level, in decibels, relative to the overload point of the system, i.e., the highest-intensity signal encodable by the payload format. (Note: Representation relative to the overload point of a system is particularly useful for digital implementations, since one does not need to know the relative calibration of the analog circuitry.) For example, in the case of u-law (audio/pcmu) audio [ITU.G711], the 0 dBov reference would be a square wave with values +/- 8031. (This translates to 6.18 dBm0, relative to u-law's dBm0 definition in Table 6 of [ITU.G711].) The audio level for digital silence -- for a muted audio source, for example -- MUST be represented as 127 (-127 dBov), regardless of the dynamic range of the encoded audio format. The audio level header extension only carries the level of the audio in the RTP payload of the packet with which it is associated, with no long-term averaging or smoothing applied. For payload formats that contain extra error-correction bits or loss-concealment information, the level corresponds only to the data that would result from the payload's normal decoding process, not what it would produce under error or packet loss concealment. The level is measured as a root mean square of all the samples in the audio encoded by the packet. To simplify implementation of the encoding procedures described here, Appendix A of [RFC6465] provides a sample Java implementation of an audio level calculator that helps obtain such values from raw linear Pulse Code Modulation (PCM) audio samples. In addition, a flag bit (labeled "V") optionally indicates whether the encoder believes the audio packet contains voice activity. If the V bit is in use, the value 1 indicates that the encoder believes the audio packet contains voice activity, and the value 0 indicates that the encoder believes it does not. (The voice activity detection algorithm is unspecified and left implementation-specific.) If the V bit is not in use, its value is unspecified and MUST be ignored by receivers. The use of the V bit is signaled using the extension attribute "vad", discussed in Section 4.
When this header extension is used with RTP data sent using the RTP Payload for Redundant Audio Data [RFC2198], the header's data describes the contents of the primary encoding. Note: This audio level is defined in the same manner as is audio noise level in the RTP Payload Comfort Noise specification [RFC3389]. In [RFC3389], the overall magnitude of the noise level in comfort noise is encoded into the first byte of the payload, with spectral information about the noise in subsequent bytes. This specification's audio level parameter is defined so as to be identical to the comfort noise payload's noise-level byte. Section 3 is in use, and MUST be set by senders. If the header extension element is signaled with "vad=off", the V bit is not in use, and its value MUST be ignored by receivers. If the vad extension attribute is not specified, the default is "vad=on". An example attribute line in the Session Description Protocol (SDP) for a conference might hence be: a=extmap:6 urn:ietf:params:rtp-hdrext:ssrc-audio-level vad=on The vad extension attribute only controls the semantics of this header extension attribute, and does not make any statement about whether the sender is using any other voice activity detection features, such as discontinuous transmission, comfort noise, or silence suppression. Using the mechanisms of [RFC5285], an endpoint MAY signal multiple instances of the header extension element, with different values of the vad attribute, so long as these instances use different values for the extension identifier. However, again following the rules of [RFC5285], the semantics chosen for a header extension element (including its vad setting) for a particular extension identifier value MUST NOT be changed within an RTP session.
In the Secure Real-time Transport Protocol (SRTP) [RFC3711], RTP header extensions are authenticated but not encrypted. When this header extension is used, audio levels are therefore visible on a packet-by-packet basis to an attacker passively observing the audio stream. As discussed in [SRTP-VBR-AUDIO], such an attacker might be able to infer information about the conversation, possibly with phoneme-level resolution. In scenarios where this is a concern, additional mechanisms MUST be used to protect the confidentiality of the header extension. This mechanism could be header extension encryption [SRTP-ENCR-HDR], or a lower-level security and authentication mechanism such as IPsec [RFC4301]. [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997. [RFC2198] Perkins, C., Kouvelas, I., Hodson, O., Hardman, V., Handley, M., Bolot, J., Vega-Garcia, A., and S. Fosse- Parisis, "RTP Payload for Redundant Audio Data", RFC 2198, September 1997. [RFC3550] Schulzrinne, H., Casner, S., Frederick, R., and V. Jacobson, "RTP: A Transport Protocol for Real-Time Applications", STD 64, RFC 3550, July 2003. [RFC5285] Singer, D. and H. Desineni, "A General Mechanism for RTP Header Extensions", RFC 5285, July 2008.
[ITU.G711] International Telecommunication Union, "Pulse Code Modulation (PCM) of Voice Frequencies", ITU-T Recommendation G.711, November 1988. [RFC3389] Zopf, R., "Real-time Transport Protocol (RTP) Payload for Comfort Noise (CN)", RFC 3389, September 2002. [RFC3711] Baugher, M., McGrew, D., Naslund, M., Carrara, E., and K. Norrman, "The Secure Real-time Transport Protocol (SRTP)", RFC 3711, March 2004. [RFC4301] Kent, S. and K. Seo, "Security Architecture for the Internet Protocol", RFC 4301, December 2005. [RFC6465] Ivov, E., Ed., Marocco, E., Ed., and J. Lennox, "A Real-time Transport Protocol (RTP) Header Extension for Mixer-to-Client Audio Level Indication", RFC 6465, December 2011. [SRTP-ENCR-HDR] Lennox, J., "Encryption of Header Extensions in the Secure Real-Time Transport Protocol (SRTP)", Work in Progress, October 2011. [SRTP-VBR-AUDIO] Perkins, C. and JM. Valin, "Guidelines for the use of Variable Bit Rate Audio with Secure RTP", Work in Progress, July 2011.