Internet Engineering Task Force (IETF)                        D. Burnett
Request for Comments: 6787                                         Voxeo
Category: Standards Track                                  S. Shanmugham
ISSN: 2070-1721                                      Cisco Systems, Inc.
                                                           November 2012
          Media Resource Control Protocol Version 2 (MRCPv2)
Abstract
The Media Resource Control Protocol Version 2 (MRCPv2) allows client
hosts to control media service resources such as speech synthesizers,
recognizers, verifiers, and identifiers residing in servers on the
network. MRCPv2 is not a "stand-alone" protocol -- it relies on
other protocols, such as the Session Initiation Protocol (SIP), to
coordinate MRCPv2 clients and servers and manage sessions between
them, and the Session Description Protocol (SDP) to describe,
discover, and exchange capabilities. It also depends on SIP and SDP
to establish the media sessions and associated parameters between the
media source or sink and the media server. Once this is done, the
MRCPv2 exchange operates over the control session established above,
allowing the client to control the media processing resources on the
speech resource server.
Status of This Memo
This is an Internet Standards Track document.
This document is a product of the Internet Engineering Task Force
(IETF). It represents the consensus of the IETF community. It has
received public review and has been approved for publication by the
Internet Engineering Steering Group (IESG). Further information on
Internet Standards is available in Section 2 of RFC 5741.
Information about the current status of this document, any errata,
and how to provide feedback on it may be obtained at
Copyright (c) 2012 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
15. ABNF Normative Definition
16. XML Schemas
   16.1. NLSML Schema Definition
   16.2. Enrollment Results Schema Definition
   16.3. Verification Results Schema Definition
17. References
   17.1. Normative References
   17.2. Informative References
Appendix A. Contributors
Appendix B. Acknowledgements
1. Introduction
MRCPv2 is designed to allow a client device to control media
processing resources on the network. Some of these media processing
resources include speech recognition engines, speech synthesis
engines, speaker verification, and speaker identification engines.
MRCPv2 enables the implementation of distributed Interactive Voice
Response platforms using VoiceXML [W3C.REC-voicexml20-20040316]
browsers or other client applications while maintaining separate
back-end speech processing capabilities on specialized speech
processing servers. MRCPv2 is based on the earlier Media Resource
Control Protocol (MRCP) [RFC4463] developed jointly by Cisco Systems,
Inc., Nuance Communications, and Speechworks, Inc. Although some of
the method names are similar, the way in which these methods are
communicated is different. There are also more resources and more
methods for each resource. The first version of MRCP was essentially
taken only as input to the development of this protocol. There is no
expectation that an MRCPv2 client will work with an MRCPv1 server or
vice versa. There is no migration plan or gateway definition between
the two protocols.
The protocol requirements of Speech Services Control (SPEECHSC)
[RFC4313] include that the solution be capable of reaching a media
processing server, setting up communication channels to the media
resources, and sending and receiving control messages and media
streams to/from the server. The Session Initiation Protocol (SIP)
[RFC3261] meets these requirements.
The proprietary version of MRCP ran over the Real Time Streaming
Protocol (RTSP) [RFC2326]. At the time work on MRCPv2 was begun, the
consensus was that this use of RTSP would break the RTSP protocol or
cause backward-compatibility problems, something forbidden by Section
3.2 of [RFC4313]. This is the reason why MRCPv2 does not run over
RTSP.
MRCPv2 leverages these capabilities by building upon SIP and the
Session Description Protocol (SDP) [RFC4566]. MRCPv2 uses SIP to set
up and tear down media and control sessions with the server. In
addition, the client can use a SIP re-INVITE method (an INVITE request
sent within an existing SIP dialog) to change the characteristics of
these media and control sessions while maintaining the SIP dialog
between the client and server. SDP is used to describe the
parameters of the media sessions associated with that dialog. It is
mandatory to support SIP as the session establishment protocol to
ensure interoperability. Other protocols can be used for session
establishment by prior agreement. This document only describes the
use of SIP and SDP.
MRCPv2 uses SIP and SDP to create the speech client/server dialog and
set up the media channels to the server. It also uses SIP and SDP to
establish MRCPv2 control sessions between the client and the server
for each media processing resource required for that dialog. The
MRCPv2 protocol exchange between the client and the media resource is
carried on that control session. MRCPv2 exchanges do not change the
state of the SIP dialog, the media sessions, or other parameters of
the dialog initiated via SIP. Instead, MRCPv2 controls and affects
the state of the media processing resource associated with the MRCPv2
session(s).
MRCPv2 defines the messages to control the different media processing
resources and the state machines required to guide their operation.
It also describes how these messages are carried over a transport-
layer protocol such as the Transmission Control Protocol (TCP)
[RFC0793] or the Transport Layer Security (TLS) Protocol [RFC5246].
(Note: the Stream Control Transmission Protocol (SCTP) [RFC4960] is a
viable transport for MRCPv2 as well, but the mapping onto SCTP is not
described in this specification.)
2. Document Conventions
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [RFC2119].
Since many of the definitions and syntax are identical to those for
the Hypertext Transfer Protocol -- HTTP/1.1 [RFC2616], this
specification refers to the sections where they are defined rather
than copying them. For brevity, [HX.Y] is to be taken to refer to
Section X.Y of RFC 2616.
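The [HX.Y] shorthand can be expanded mechanically. The following Python sketch is illustrative only and is not part of this specification; the function name expand_http_ref is hypothetical.

```python
import re

# Matches the [HX.Y] shorthand, e.g. "[H4.2]" or "[H14.10]".
_HTTP_REF = re.compile(r"\[H(\d+(?:\.\d+)*)\]")


def expand_http_ref(text: str) -> str:
    """Replace each [HX.Y] with 'Section X.Y of RFC 2616'."""
    return _HTTP_REF.sub(
        lambda m: f"Section {m.group(1)} of RFC 2616", text
    )
```

Other bracketed citations, such as [RFC2616], do not match the pattern and are left untouched.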
All the mechanisms specified in this document are described in both
prose and an augmented Backus-Naur form (ABNF [RFC5234]).
The complete message format in ABNF form is provided in Section 15
and is the normative format definition. Note that productions may be
duplicated within the main body of the document for reading
convenience. If a production in the body of the text conflicts with
one in the normative definition, the latter rules.
2.1. Definitions
Media Resource:  An entity on the speech processing server that can be
   controlled through MRCPv2.
MRCP Server:  Aggregate of one or more "Media Resource" entities on
   a server, exposed through MRCPv2. Often, 'server' in
   this document refers to an MRCP server.
MRCP Client:  An entity controlling one or more Media Resources
   through MRCPv2 ("Client" for short).
DTMF:  Dual-Tone Multi-Frequency; a method of transmitting
   key presses in-band, either as actual tones (Q.23
   [Q.23]) or as named tone events (RFC 4733 [RFC4733]).
Endpointing:  The process of automatically detecting the beginning
   and end of speech in an audio stream. This is
   critical both for speech recognition and for automated
   recording as one would find in voice mail systems.
Hotword Mode:  A mode of speech recognition where a stream of
   utterances is evaluated for match against a small set
   of command words. This is generally employed either
   to trigger some action or to control the subsequent
   grammar to be used for further recognition.
2.2. State-Machine Diagrams
The state-machine diagrams in this document do not show every
possible method call. Rather, they reflect the state of the resource
based on the methods that have moved to IN-PROGRESS or COMPLETE
states (see Section 5.3). Note that since PENDING requests
essentially have not affected the resource yet and are in the queue
to be processed, they are not reflected in the state-machine
diagrams.
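The rule above can be summarized as a simple predicate. This is a minimal Python sketch under stated assumptions: the three state names come from the text above, while the RequestState type and the affects_diagram function are hypothetical names introduced for illustration.

```python
from enum import Enum


class RequestState(Enum):
    """Request states named in the text (see Section 5.3)."""
    PENDING = "PENDING"
    IN_PROGRESS = "IN-PROGRESS"
    COMPLETE = "COMPLETE"


def affects_diagram(state: RequestState) -> bool:
    """Return True if a request in this state is reflected in the
    state-machine diagrams. PENDING requests are still queued and
    have not yet affected the resource, so they are excluded."""
    return state in (RequestState.IN_PROGRESS, RequestState.COMPLETE)
```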
2.3. URI Schemes
This document defines many protocol headers that contain URIs
(Uniform Resource Identifiers [RFC3986]) or lists of URIs for
referencing media. The entire document, including the Security
Considerations section (Section 12), assumes that HTTP or HTTP over
TLS (HTTPS) [RFC2818] will be used as the URI addressing scheme
unless otherwise stated. However, implementations MAY support other
schemes (such as 'file'), provided they have addressed any security
considerations described in this document and any others particular
to the specific scheme. For example, implementations where the
client and server both reside on the same physical hardware and the
file system is secured by traditional user-level file access controls
could be reasonable candidates for supporting the 'file' scheme.
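One way an implementation might enforce this policy is sketched below in Python. It is not a normative procedure: the names DEFAULT_SCHEMES and scheme_allowed are hypothetical, and the decision to enable an extra scheme such as 'file' is assumed to be made by the deployer after addressing the relevant security considerations.

```python
from urllib.parse import urlsplit

# Schemes assumed throughout this document unless otherwise stated.
DEFAULT_SCHEMES = frozenset({"http", "https"})


def scheme_allowed(uri: str,
                   extra_schemes: frozenset = frozenset()) -> bool:
    """Accept HTTP and HTTPS URIs by default; accept any other scheme
    only if the deployment has explicitly enabled it."""
    scheme = urlsplit(uri).scheme.lower()
    return scheme in (DEFAULT_SCHEMES | extra_schemes)
```

For example, scheme_allowed("file:///tmp/prompt.wav") is rejected unless the caller passes frozenset({"file"}) as extra_schemes.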
3. Architecture
A system using MRCPv2 consists of a client that requires the
generation and/or consumption of media streams and a media resource
server that has the resources or "engines" to process these streams
as input or generate these streams as output. The client uses SIP
and SDP to establish an MRCPv2 control channel with the server to use
its media processing resources. MRCPv2 servers are addressed using
SIP URIs.
SIP uses SDP with the offer/answer model described in RFC 3264
[RFC3264] to set up the MRCPv2 control channels and describe their
characteristics. A separate MRCPv2 session is needed to control each
of the media processing resources associated with the SIP dialog
between the client and server. Within a SIP dialog, the individual
resource control channels for the different resources are added or
removed through SDP offer/answer carried in a SIP re-INVITE
transaction.
The server, through the SDP exchange, provides the client with a
difficult-to-guess, unambiguous channel identifier and a TCP port
number (see Section 4.2). The client MAY then open a new TCP
connection with the server on this port number. Multiple MRCPv2
channels can share a TCP connection between the client and the
server. All MRCPv2 messages exchanged between the client and the
server carry the specified channel identifier that the server MUST
ensure is unambiguous among all MRCPv2 control channels that are
active on that server. The client uses this channel identifier to
indicate the media processing resource associated with that channel.
For information on message framing, see Section 5.
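Because several channels can share one TCP connection, a client needs some way to map each channel identifier back to the resource it controls. The following Python sketch shows one possible client-side bookkeeping structure. The class name ChannelRegistry, its methods, and the identifiers used in the usage note are hypothetical; the actual identifier format is defined in Section 4.2.

```python
from typing import Dict


class ChannelRegistry:
    """Maps server-assigned channel identifiers to the media
    processing resource that each control channel addresses."""

    def __init__(self) -> None:
        self._resource_by_channel: Dict[str, str] = {}

    def register(self, channel_id: str, resource: str) -> None:
        # The server is required to keep identifiers unambiguous;
        # a duplicate here indicates an error, so fail loudly.
        if channel_id in self._resource_by_channel:
            raise ValueError(f"ambiguous channel identifier: {channel_id}")
        self._resource_by_channel[channel_id] = resource

    def resource_for(self, channel_id: str) -> str:
        """Look up the resource for a message's channel identifier."""
        return self._resource_by_channel[channel_id]
```

A client might call register("chan-a", "synthesizer") and register("chan-b", "recognizer") after the SDP exchange, then use resource_for() to dispatch each incoming message that arrives on the shared connection.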
SIP also establishes the media sessions between the client (or other
source/sink of media) and the MRCPv2 server using SDP "m=" lines.
One or more media processing resources may share a media session
under a SIP session, or each media processing resource may have its
own media session.
The following diagram shows the general architecture of a system that
uses MRCPv2. To simplify the diagram, only a few resources are
shown.
     MRCPv2 client                   MRCPv2 Media Resource Server
|--------------------|            |------------------------------------|
||------------------||            ||----------------------------------||
|| Application Layer||            ||Synthesis|Recognition|Verification||
||------------------||            || Engine  |  Engine   |   Engine   ||
||Media Resource API||            ||    ||   |    ||     |    ||      ||
||------------------||            ||Synthesis|Recognizer |  Verifier  ||
|| SIP  |  MRCPv2   ||            ||Resource | Resource  |  Resource  ||
||Stack |           ||            ||     Media Resource Management    ||
||      |           ||            ||----------------------------------||
||------------------||            ||   SIP  |        MRCPv2           ||
||   TCP/IP Stack   ||---MRCPv2---||  Stack |                         ||
||                  ||            ||----------------------------------||
||------------------||----SIP-----||           TCP/IP Stack           ||
|--------------------|            ||                                  ||
                                  ||----------------------------------||
                                  |------------------------------------|
                                    /
                                   /
|-------------------|             /
|                   |            /
| Media Source/Sink |------------/
|                   |
|-------------------|
                   Figure 1: Architectural Diagram
3.1. MRCPv2 Media Resource Types
An MRCPv2 server may offer one or more of the following media
processing resources to its clients.
Basic Synthesizer:  A speech synthesizer resource that has very limited
   capabilities and can generate its media stream
   exclusively from concatenated audio clips. The speech
   data is described using a limited subset of the Speech
   Synthesis Markup Language (SSML)
   [W3C.REC-speech-synthesis-20040907] elements. A basic
   synthesizer MUST support the SSML tags <speak>,
   <audio>, <say-as>, and <mark>.
Speech Synthesizer:  A full-capability speech synthesis resource that
   can render speech from text. Such a synthesizer MUST have
   full SSML [W3C.REC-speech-synthesis-20040907] support.
Recorder:  A resource capable of recording audio and providing a
   URI pointer to the recording. A recorder MUST provide
   endpointing capabilities for suppressing silence at
   the beginning and end of a recording, and MAY also
   suppress silence in the middle of a recording. If
   such suppression is done, the recorder MUST maintain
   timing metadata to indicate the actual timestamps of
   the recorded media.
DTMF Recognizer:  A recognizer resource capable of extracting and
   interpreting Dual-Tone Multi-Frequency (DTMF) [Q.23]
   digits in a media stream and matching them against a
   supplied digit grammar. It could also do a semantic
   interpretation based on semantic tags in the grammar.
Speech Recognizer:  A full speech recognition resource that is capable
   of receiving a media stream containing audio and
   interpreting it to produce recognition results. It also has a
   natural language semantic interpreter to post-process
   the recognized data according to the semantic data in
   the grammar and provide semantic results along with
   the recognized input. The recognizer MAY also support
   enrolled grammars, where the client can enroll and
   create new personal grammars for use in future
   recognition operations.
Speaker Verifier:  A resource capable of verifying the authenticity of
   a claimed identity by matching a media stream containing
   spoken input to a pre-existing voiceprint. This may
   also involve matching the caller's voice against more
than one voiceprint, also called multi-verification or