RFC 6787

Media Resource Control Protocol Version 2 (MRCPv2)

Pages: 224
Proposed Standard

RFC6787 - Page 1

Internet Engineering Task Force (IETF)                        D. Burnett
Request for Comments: 6787                                         Voxeo
Category: Standards Track                                  S. Shanmugham
ISSN: 2070-1721                                      Cisco Systems, Inc.
                                                           November 2012


           Media Resource Control Protocol Version 2 (MRCPv2)

Abstract

   The Media Resource Control Protocol Version 2 (MRCPv2) allows client
   hosts to control media service resources such as speech synthesizers,
   recognizers, verifiers, and identifiers residing in servers on the
   network.  MRCPv2 is not a "stand-alone" protocol -- it relies on
   other protocols, such as the Session Initiation Protocol (SIP), to
   coordinate MRCPv2 clients and servers and manage sessions between
   them, and the Session Description Protocol (SDP) to describe,
   discover, and exchange capabilities.  It also depends on SIP and SDP
   to establish the media sessions and associated parameters between the
   media source or sink and the media server.  Once this is done, the
   MRCPv2 exchange operates over the control session established above,
   allowing the client to control the media processing resources on the
   speech resource server.

Status of This Memo

   This is an Internet Standards Track document.

   This document is a product of the Internet Engineering Task Force
   (IETF).  It represents the consensus of the IETF community.  It has
   received public review and has been approved for publication by the
   Internet Engineering Steering Group (IESG).  Further information on
   Internet Standards is available in Section 2 of RFC 5741.

   Information about the current status of this document, any errata,
   and how to provide feedback on it may be obtained at
   http://www.rfc-editor.org/info/rfc6787.

Copyright Notice

   Copyright (c) 2012 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

   This document may contain material from IETF Documents or IETF
   Contributions published or made publicly available before November
   10, 2008.  The person(s) controlling the copyright in some of this
   material may not have granted the IETF Trust the right to allow
   modifications of such material outside the IETF Standards Process.
   Without obtaining an adequate license from the person(s) controlling
   the copyright in such materials, this document may not be modified
   outside the IETF Standards Process, and derivative works of it may
   not be created outside the IETF Standards Process, except to format
   it for publication as an RFC or to translate it into languages other
   than English.

Table of Contents

   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   8
   2.  Document Conventions  . . . . . . . . . . . . . . . . . . . .   9
     2.1.   Definitions  . . . . . . . . . . . . . . . . . . . . . .  10
     2.2.   State-Machine Diagrams . . . . . . . . . . . . . . . . .  10
     2.3.   URI Schemes  . . . . . . . . . . . . . . . . . . . . . .  11
   3.  Architecture  . . . . . . . . . . . . . . . . . . . . . . . .  11
     3.1.   MRCPv2 Media Resource Types  . . . . . . . . . . . . . .  12
     3.2.   Server and Resource Addressing . . . . . . . . . . . . .  14
   4.  MRCPv2 Basics . . . . . . . . . . . . . . . . . . . . . . . .  14
     4.1.   Connecting to the Server . . . . . . . . . . . . . . . .  14
     4.2.   Managing Resource Control Channels . . . . . . . . . . .  14
     4.3.   SIP Session Example  . . . . . . . . . . . . . . . . . .  17
     4.4.   Media Streams and RTP Ports  . . . . . . . . . . . . . .  22
     4.5.   MRCPv2 Message Transport . . . . . . . . . . . . . . . .  24
     4.6.   MRCPv2 Session Termination . . . . . . . . . . . . . . .  24
   5.  MRCPv2 Specification  . . . . . . . . . . . . . . . . . . . .  24
     5.1.   Common Protocol Elements . . . . . . . . . . . . . . . .  25
     5.2.   Request  . . . . . . . . . . . . . . . . . . . . . . . .  28
     5.3.   Response . . . . . . . . . . . . . . . . . . . . . . . .  29
     5.4.   Status Codes . . . . . . . . . . . . . . . . . . . . . .  30
     5.5.   Events . . . . . . . . . . . . . . . . . . . . . . . . .  31
   6.  MRCPv2 Generic Methods, Headers, and Result Structure . . . .  32
     6.1.   Generic Methods  . . . . . . . . . . . . . . . . . . . .  32
       6.1.1.   SET-PARAMS . . . . . . . . . . . . . . . . . . . . .  32
       6.1.2.   GET-PARAMS . . . . . . . . . . . . . . . . . . . . .  33
     6.2.   Generic Message Headers  . . . . . . . . . . . . . . . .  34
       6.2.1.   Channel-Identifier . . . . . . . . . . . . . . . . .  35
       6.2.2.   Accept . . . . . . . . . . . . . . . . . . . . . . .  36
       6.2.3.   Active-Request-Id-List . . . . . . . . . . . . . . .  36
       6.2.4.   Proxy-Sync-Id  . . . . . . . . . . . . . . . . . . .  36
       6.2.5.   Accept-Charset . . . . . . . . . . . . . . . . . . .  37
       6.2.6.   Content-Type . . . . . . . . . . . . . . . . . . . .  37
       6.2.7.   Content-ID . . . . . . . . . . . . . . . . . . . . .  38
       6.2.8.   Content-Base . . . . . . . . . . . . . . . . . . . .  38
       6.2.9.   Content-Encoding . . . . . . . . . . . . . . . . . .  38
       6.2.10.  Content-Location . . . . . . . . . . . . . . . . . .  39
       6.2.11.  Content-Length . . . . . . . . . . . . . . . . . . .  39
       6.2.12.  Fetch Timeout  . . . . . . . . . . . . . . . . . . .  39
       6.2.13.  Cache-Control  . . . . . . . . . . . . . . . . . . .  40
       6.2.14.  Logging-Tag  . . . . . . . . . . . . . . . . . . . .  41
       6.2.15.  Set-Cookie . . . . . . . . . . . . . . . . . . . . .  42
       6.2.16.  Vendor-Specific Parameters . . . . . . . . . . . . .  44
     6.3.   Generic Result Structure . . . . . . . . . . . . . . . .  44
       6.3.1.   Natural Language Semantics Markup Language . . . . .  45
   7.  Resource Discovery  . . . . . . . . . . . . . . . . . . . . .  46
   8.  Speech Synthesizer Resource . . . . . . . . . . . . . . . . .  47
     8.1.   Synthesizer State Machine  . . . . . . . . . . . . . . .  48
     8.2.   Synthesizer Methods  . . . . . . . . . . . . . . . . . .  48
     8.3.   Synthesizer Events . . . . . . . . . . . . . . . . . . .  49
     8.4.   Synthesizer Header Fields  . . . . . . . . . . . . . . .  49
       8.4.1.   Jump-Size  . . . . . . . . . . . . . . . . . . . . .  49
       8.4.2.   Kill-On-Barge-In . . . . . . . . . . . . . . . . . .  50
       8.4.3.   Speaker-Profile  . . . . . . . . . . . . . . . . . .  51
       8.4.4.   Completion-Cause . . . . . . . . . . . . . . . . . .  51
       8.4.5.   Completion-Reason  . . . . . . . . . . . . . . . . .  52
       8.4.6.   Voice-Parameter  . . . . . . . . . . . . . . . . . .  52
       8.4.7.   Prosody-Parameters . . . . . . . . . . . . . . . . .  53
       8.4.8.   Speech-Marker  . . . . . . . . . . . . . . . . . . .  53
       8.4.9.   Speech-Language  . . . . . . . . . . . . . . . . . .  54
       8.4.10.  Fetch-Hint . . . . . . . . . . . . . . . . . . . . .  54
       8.4.11.  Audio-Fetch-Hint . . . . . . . . . . . . . . . . . .  55
       8.4.12.  Failed-URI . . . . . . . . . . . . . . . . . . . . .  55
       8.4.13.  Failed-URI-Cause . . . . . . . . . . . . . . . . . .  55
       8.4.14.  Speak-Restart  . . . . . . . . . . . . . . . . . . .  56
       8.4.15.  Speak-Length . . . . . . . . . . . . . . . . . . . .  56
       8.4.16.  Load-Lexicon . . . . . . . . . . . . . . . . . . . .  57
       8.4.17.  Lexicon-Search-Order . . . . . . . . . . . . . . . .  57
     8.5.   Synthesizer Message Body . . . . . . . . . . . . . . . .  57
       8.5.1.   Synthesizer Speech Data  . . . . . . . . . . . . . .  57
       8.5.2.   Lexicon Data . . . . . . . . . . . . . . . . . . . .  59
     8.6.   SPEAK Method . . . . . . . . . . . . . . . . . . . . . .  60
     8.7.   STOP . . . . . . . . . . . . . . . . . . . . . . . . . .  62
     8.8.   BARGE-IN-OCCURRED  . . . . . . . . . . . . . . . . . . .  63
     8.9.   PAUSE  . . . . . . . . . . . . . . . . . . . . . . . . .  65
     8.10.  RESUME . . . . . . . . . . . . . . . . . . . . . . . . .  66
     8.11.  CONTROL  . . . . . . . . . . . . . . . . . . . . . . . .  67
     8.12.  SPEAK-COMPLETE . . . . . . . . . . . . . . . . . . . . .  69
     8.13.  SPEECH-MARKER  . . . . . . . . . . . . . . . . . . . . .  70
     8.14.  DEFINE-LEXICON . . . . . . . . . . . . . . . . . . . . .  71
   9.  Speech Recognizer Resource  . . . . . . . . . . . . . . . . .  72
     9.1.   Recognizer State Machine . . . . . . . . . . . . . . . .  74
     9.2.   Recognizer Methods . . . . . . . . . . . . . . . . . . .  74
     9.3.   Recognizer Events  . . . . . . . . . . . . . . . . . . .  75
     9.4.   Recognizer Header Fields . . . . . . . . . . . . . . . .  75
       9.4.1.   Confidence-Threshold . . . . . . . . . . . . . . . .  77
       9.4.2.   Sensitivity-Level  . . . . . . . . . . . . . . . . .  77
       9.4.3.   Speed-Vs-Accuracy  . . . . . . . . . . . . . . . . .  77
       9.4.4.   N-Best-List-Length . . . . . . . . . . . . . . . . .  78
       9.4.5.   Input-Type . . . . . . . . . . . . . . . . . . . . .  78
       9.4.6.   No-Input-Timeout . . . . . . . . . . . . . . . . . .  78
       9.4.7.   Recognition-Timeout  . . . . . . . . . . . . . . . .  79
       9.4.8.   Waveform-URI . . . . . . . . . . . . . . . . . . . .  79
       9.4.9.   Media-Type . . . . . . . . . . . . . . . . . . . . .  80
       9.4.10.  Input-Waveform-URI . . . . . . . . . . . . . . . . .  80
       9.4.11.  Completion-Cause . . . . . . . . . . . . . . . . . .  80
       9.4.12.  Completion-Reason  . . . . . . . . . . . . . . . . .  83
       9.4.13.  Recognizer-Context-Block . . . . . . . . . . . . . .  83
       9.4.14.  Start-Input-Timers . . . . . . . . . . . . . . . . .  83
       9.4.15.  Speech-Complete-Timeout  . . . . . . . . . . . . . .  84
       9.4.16.  Speech-Incomplete-Timeout  . . . . . . . . . . . . .  84
       9.4.17.  DTMF-Interdigit-Timeout  . . . . . . . . . . . . . .  85
       9.4.18.  DTMF-Term-Timeout  . . . . . . . . . . . . . . . . .  85
       9.4.19.  DTMF-Term-Char . . . . . . . . . . . . . . . . . . .  85
       9.4.20.  Failed-URI . . . . . . . . . . . . . . . . . . . . .  86
       9.4.21.  Failed-URI-Cause . . . . . . . . . . . . . . . . . .  86
       9.4.22.  Save-Waveform  . . . . . . . . . . . . . . . . . . .  86
       9.4.23.  New-Audio-Channel  . . . . . . . . . . . . . . . . .  86
       9.4.24.  Speech-Language  . . . . . . . . . . . . . . . . . .  87
       9.4.25.  Ver-Buffer-Utterance . . . . . . . . . . . . . . . .  87
       9.4.26.  Recognition-Mode . . . . . . . . . . . . . . . . . .  87
       9.4.27.  Cancel-If-Queue  . . . . . . . . . . . . . . . . . .  88
       9.4.28.  Hotword-Max-Duration . . . . . . . . . . . . . . . .  88
       9.4.29.  Hotword-Min-Duration . . . . . . . . . . . . . . . .  88
       9.4.30.  Interpret-Text . . . . . . . . . . . . . . . . . . .  89
       9.4.31.  DTMF-Buffer-Time . . . . . . . . . . . . . . . . . .  89
       9.4.32.  Clear-DTMF-Buffer  . . . . . . . . . . . . . . . . .  89
       9.4.33.  Early-No-Match . . . . . . . . . . . . . . . . . . .  90
       9.4.34.  Num-Min-Consistent-Pronunciations  . . . . . . . . .  90
       9.4.35.  Consistency-Threshold  . . . . . . . . . . . . . . .  90
       9.4.36.  Clash-Threshold  . . . . . . . . . . . . . . . . . .  90
       9.4.37.  Personal-Grammar-URI . . . . . . . . . . . . . . . .  91
       9.4.38.  Enroll-Utterance . . . . . . . . . . . . . . . . . .  91
       9.4.39.  Phrase-Id  . . . . . . . . . . . . . . . . . . . . .  91
       9.4.40.  Phrase-NL  . . . . . . . . . . . . . . . . . . . . .  92
       9.4.41.  Weight . . . . . . . . . . . . . . . . . . . . . . .  92
       9.4.42.  Save-Best-Waveform . . . . . . . . . . . . . . . . .  92
       9.4.43.  New-Phrase-Id  . . . . . . . . . . . . . . . . . . .  93
       9.4.44.  Confusable-Phrases-URI . . . . . . . . . . . . . . .  93
       9.4.45.  Abort-Phrase-Enrollment  . . . . . . . . . . . . . .  93
     9.5.   Recognizer Message Body  . . . . . . . . . . . . . . . .  93
       9.5.1.   Recognizer Grammar Data  . . . . . . . . . . . . . .  93
       9.5.2.   Recognizer Result Data . . . . . . . . . . . . . . .  97
       9.5.3.   Enrollment Result Data . . . . . . . . . . . . . . .  98
       9.5.4.   Recognizer Context Block . . . . . . . . . . . . . .  98
     9.6.   Recognizer Results . . . . . . . . . . . . . . . . . . .  99
       9.6.1.   Markup Functions . . . . . . . . . . . . . . . . . .  99
       9.6.2.   Overview of Recognizer Result Elements and Their
                Relationships  . . . . . . . . . . . . . . . . . . . 100
       9.6.3.   Elements and Attributes  . . . . . . . . . . . . . . 101
     9.7.   Enrollment Results . . . . . . . . . . . . . . . . . . . 106
       9.7.1.   <num-clashes> Element  . . . . . . . . . . . . . . . 106
       9.7.2.   <num-good-repetitions> Element . . . . . . . . . . . 106
       9.7.3.   <num-repetitions-still-needed> Element . . . . . . . 107
       9.7.4.   <consistency-status> Element . . . . . . . . . . . . 107
       9.7.5.   <clash-phrase-ids> Element . . . . . . . . . . . . . 107
       9.7.6.   <transcriptions> Element . . . . . . . . . . . . . . 107
       9.7.7.   <confusable-phrases> Element . . . . . . . . . . . . 107
     9.8.   DEFINE-GRAMMAR . . . . . . . . . . . . . . . . . . . . . 107
     9.9.   RECOGNIZE  . . . . . . . . . . . . . . . . . . . . . . . 111
     9.10.  STOP . . . . . . . . . . . . . . . . . . . . . . . . . . 118
     9.11.  GET-RESULT . . . . . . . . . . . . . . . . . . . . . . . 119
     9.12.  START-OF-INPUT . . . . . . . . . . . . . . . . . . . . . 120
     9.13.  START-INPUT-TIMERS . . . . . . . . . . . . . . . . . . . 120
     9.14.  RECOGNITION-COMPLETE . . . . . . . . . . . . . . . . . . 120
     9.15.  START-PHRASE-ENROLLMENT  . . . . . . . . . . . . . . . . 123
     9.16.  ENROLLMENT-ROLLBACK  . . . . . . . . . . . . . . . . . . 124
     9.17.  END-PHRASE-ENROLLMENT  . . . . . . . . . . . . . . . . . 124
     9.18.  MODIFY-PHRASE  . . . . . . . . . . . . . . . . . . . . . 125
     9.19.  DELETE-PHRASE  . . . . . . . . . . . . . . . . . . . . . 125
     9.20.  INTERPRET  . . . . . . . . . . . . . . . . . . . . . . . 125
     9.21.  INTERPRETATION-COMPLETE  . . . . . . . . . . . . . . . . 127
     9.22.  DTMF Detection . . . . . . . . . . . . . . . . . . . . . 128
   10. Recorder Resource . . . . . . . . . . . . . . . . . . . . . . 129
     10.1.  Recorder State Machine . . . . . . . . . . . . . . . . . 129
     10.2.  Recorder Methods . . . . . . . . . . . . . . . . . . . . 130
     10.3.  Recorder Events  . . . . . . . . . . . . . . . . . . . . 130
     10.4.  Recorder Header Fields . . . . . . . . . . . . . . . . . 130
       10.4.1.  Sensitivity-Level  . . . . . . . . . . . . . . . . . 130
       10.4.2.  No-Input-Timeout . . . . . . . . . . . . . . . . . . 131
       10.4.3.  Completion-Cause . . . . . . . . . . . . . . . . . . 131
       10.4.4.  Completion-Reason  . . . . . . . . . . . . . . . . . 132
       10.4.5.  Failed-URI . . . . . . . . . . . . . . . . . . . . . 132
       10.4.6.  Failed-URI-Cause . . . . . . . . . . . . . . . . . . 132
       10.4.7.  Record-URI . . . . . . . . . . . . . . . . . . . . . 132
       10.4.8.  Media-Type . . . . . . . . . . . . . . . . . . . . . 133
       10.4.9.  Max-Time . . . . . . . . . . . . . . . . . . . . . . 133
       10.4.10. Trim-Length  . . . . . . . . . . . . . . . . . . . . 134
       10.4.11. Final-Silence  . . . . . . . . . . . . . . . . . . . 134
       10.4.12. Capture-On-Speech  . . . . . . . . . . . . . . . . . 134
       10.4.13. Ver-Buffer-Utterance . . . . . . . . . . . . . . . . 134
       10.4.14. Start-Input-Timers . . . . . . . . . . . . . . . . . 135
       10.4.15. New-Audio-Channel  . . . . . . . . . . . . . . . . . 135
     10.5.  Recorder Message Body  . . . . . . . . . . . . . . . . . 135
     10.6.  RECORD . . . . . . . . . . . . . . . . . . . . . . . . . 135
     10.7.  STOP . . . . . . . . . . . . . . . . . . . . . . . . . . 136
     10.8.  RECORD-COMPLETE  . . . . . . . . . . . . . . . . . . . . 137
     10.9.  START-INPUT-TIMERS . . . . . . . . . . . . . . . . . . . 138
     10.10. START-OF-INPUT . . . . . . . . . . . . . . . . . . . . . 138
   11. Speaker Verification and Identification . . . . . . . . . . . 139
     11.1.  Speaker Verification State Machine . . . . . . . . . . . 140
     11.2.  Speaker Verification Methods . . . . . . . . . . . . . . 142
     11.3.  Verification Events  . . . . . . . . . . . . . . . . . . 144
     11.4.  Verification Header Fields . . . . . . . . . . . . . . . 144
       11.4.1.  Repository-URI . . . . . . . . . . . . . . . . . . . 144
       11.4.2.  Voiceprint-Identifier  . . . . . . . . . . . . . . . 145
       11.4.3.  Verification-Mode  . . . . . . . . . . . . . . . . . 145
       11.4.4.  Adapt-Model  . . . . . . . . . . . . . . . . . . . . 146
       11.4.5.  Abort-Model  . . . . . . . . . . . . . . . . . . . . 146
       11.4.6.  Min-Verification-Score . . . . . . . . . . . . . . . 147
       11.4.7.  Num-Min-Verification-Phrases . . . . . . . . . . . . 147
       11.4.8.  Num-Max-Verification-Phrases . . . . . . . . . . . . 147
       11.4.9.  No-Input-Timeout . . . . . . . . . . . . . . . . . . 148
       11.4.10. Save-Waveform  . . . . . . . . . . . . . . . . . . . 148
       11.4.11. Media-Type . . . . . . . . . . . . . . . . . . . . . 148
       11.4.12. Waveform-URI . . . . . . . . . . . . . . . . . . . . 148
       11.4.13. Voiceprint-Exists  . . . . . . . . . . . . . . . . . 149
       11.4.14. Ver-Buffer-Utterance . . . . . . . . . . . . . . . . 149
       11.4.15. Input-Waveform-URI . . . . . . . . . . . . . . . . . 149
       11.4.16. Completion-Cause . . . . . . . . . . . . . . . . . . 150
       11.4.17. Completion-Reason  . . . . . . . . . . . . . . . . . 151
       11.4.18. Speech-Complete-Timeout  . . . . . . . . . . . . . . 151
       11.4.19. New-Audio-Channel  . . . . . . . . . . . . . . . . . 152
       11.4.20. Abort-Verification . . . . . . . . . . . . . . . . . 152
       11.4.21. Start-Input-Timers . . . . . . . . . . . . . . . . . 152
     11.5.  Verification Message Body  . . . . . . . . . . . . . . . 152
       11.5.1.  Verification Result Data . . . . . . . . . . . . . . 152
       11.5.2.  Verification Result Elements . . . . . . . . . . . . 153
     11.6.  START-SESSION  . . . . . . . . . . . . . . . . . . . . . 157
     11.7.  END-SESSION  . . . . . . . . . . . . . . . . . . . . . . 158
     11.8.  QUERY-VOICEPRINT . . . . . . . . . . . . . . . . . . . . 159
     11.9.  DELETE-VOICEPRINT  . . . . . . . . . . . . . . . . . . . 160
     11.10. VERIFY . . . . . . . . . . . . . . . . . . . . . . . . . 160
     11.11. VERIFY-FROM-BUFFER . . . . . . . . . . . . . . . . . . . 160
     11.12. VERIFY-ROLLBACK  . . . . . . . . . . . . . . . . . . . . 164
     11.13. STOP . . . . . . . . . . . . . . . . . . . . . . . . . . 164
     11.14. START-INPUT-TIMERS . . . . . . . . . . . . . . . . . . . 165
     11.15. VERIFICATION-COMPLETE  . . . . . . . . . . . . . . . . . 165
     11.16. START-OF-INPUT . . . . . . . . . . . . . . . . . . . . . 166
     11.17. CLEAR-BUFFER . . . . . . . . . . . . . . . . . . . . . . 166
     11.18. GET-INTERMEDIATE-RESULT  . . . . . . . . . . . . . . . . 167
   12. Security Considerations . . . . . . . . . . . . . . . . . . . 168
     12.1.  Rendezvous and Session Establishment . . . . . . . . . . 168
     12.2.  Control Channel Protection . . . . . . . . . . . . . . . 168
     12.3.  Media Session Protection . . . . . . . . . . . . . . . . 169
     12.4.  Indirect Content Access  . . . . . . . . . . . . . . . . 169
     12.5.  Protection of Stored Media . . . . . . . . . . . . . . . 170
     12.6.  DTMF and Recognition Buffers . . . . . . . . . . . . . . 171
     12.7.  Client-Set Server Parameters . . . . . . . . . . . . . . 171
     12.8.  DELETE-VOICEPRINT and Authorization  . . . . . . . . . . 171
   13. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 171
     13.1.  New Registries . . . . . . . . . . . . . . . . . . . . . 171
       13.1.1.  MRCPv2 Resource Types  . . . . . . . . . . . . . . . 171
       13.1.2.  MRCPv2 Methods and Events  . . . . . . . . . . . . . 172
       13.1.3.  MRCPv2 Header Fields . . . . . . . . . . . . . . . . 173
       13.1.4.  MRCPv2 Status Codes  . . . . . . . . . . . . . . . . 176
       13.1.5.  Grammar Reference List Parameters  . . . . . . . . . 176
       13.1.6.  MRCPv2 Vendor-Specific Parameters  . . . . . . . . . 176
     13.2.  NLSML-Related Registrations  . . . . . . . . . . . . . . 177
       13.2.1.  'application/nlsml+xml' Media Type Registration  . . 177
     13.3.  NLSML XML Schema Registration  . . . . . . . . . . . . . 178
     13.4.  MRCPv2 XML Namespace Registration  . . . . . . . . . . . 178
     13.5.  Text Media Type Registrations  . . . . . . . . . . . . . 178
       13.5.1.  text/grammar-ref-list  . . . . . . . . . . . . . . . 178
     13.6.  'session' URI Scheme Registration  . . . . . . . . . . . 180
     13.7.  SDP Parameter Registrations  . . . . . . . . . . . . . . 181
       13.7.1.  Sub-Registry "proto" . . . . . . . . . . . . . . . . 181
       13.7.2.  Sub-Registry "att-field (media-level)" . . . . . . . 182
   14. Examples  . . . . . . . . . . . . . . . . . . . . . . . . . . 183
     14.1.  Message Flow . . . . . . . . . . . . . . . . . . . . . . 183
     14.2.  Recognition Result Examples  . . . . . . . . . . . . . . 192
       14.2.1.  Simple ASR Ambiguity . . . . . . . . . . . . . . . . 192
       14.2.2.  Mixed Initiative . . . . . . . . . . . . . . . . . . 192
       14.2.3.  DTMF Input . . . . . . . . . . . . . . . . . . . . . 193
       14.2.4.  Interpreting Meta-Dialog and Meta-Task Utterances  . 194
       14.2.5.  Anaphora and Deixis  . . . . . . . . . . . . . . . . 195
       14.2.6.  Distinguishing Individual Items from Sets with
                One Member . . . . . . . . . . . . . . . . . . . . . 195
       14.2.7.  Extensibility  . . . . . . . . . . . . . . . . . . . 196
   15. ABNF Normative Definition . . . . . . . . . . . . . . . . . . 196
   16. XML Schemas . . . . . . . . . . . . . . . . . . . . . . . . . 211
     16.1.  NLSML Schema Definition  . . . . . . . . . . . . . . . . 211
     16.2.  Enrollment Results Schema Definition . . . . . . . . . . 213
     16.3.  Verification Results Schema Definition . . . . . . . . . 214
   17. References  . . . . . . . . . . . . . . . . . . . . . . . . . 218
     17.1.  Normative References . . . . . . . . . . . . . . . . . . 218
     17.2.  Informative References . . . . . . . . . . . . . . . . . 220
   Appendix A.  Contributors . . . . . . . . . . . . . . . . . . . . 223
   Appendix B.  Acknowledgements . . . . . . . . . . . . . . . . . . 223

1.  Introduction

   MRCPv2 is designed to allow a client device to control media
   processing resources on the network.  Some of these media processing
   resources include speech recognition engines, speech synthesis
   engines, speaker verification, and speaker identification engines.
   MRCPv2 enables the implementation of distributed Interactive Voice
   Response platforms using VoiceXML [W3C.REC-voicexml20-20040316]
   browsers or other client applications while maintaining separate
   back-end speech processing capabilities on specialized speech
   processing servers.  MRCPv2 is based on the earlier Media Resource
   Control Protocol (MRCP) [RFC4463] developed jointly by Cisco Systems,
   Inc., Nuance Communications, and Speechworks, Inc.  Although some of
   the method names are similar, the way in which these methods are
   communicated is different.  There are also more resources and more
   methods for each resource.  The first version of MRCP was essentially
   taken only as input to the development of this protocol.  There is no
   expectation that an MRCPv2 client will work with an MRCPv1 server or
   vice versa.  There is no migration plan or gateway definition between
   the two protocols.

   The protocol requirements of Speech Services Control (SPEECHSC)
   [RFC4313] include that the solution be capable of reaching a media
   processing server, setting up communication channels to the media
   resources, and sending and receiving control messages and media
   streams to/from the server.  The Session Initiation Protocol (SIP)
   [RFC3261] meets these requirements.

   The proprietary version of MRCP ran over the Real Time Streaming
   Protocol (RTSP) [RFC2326].  At the time work on MRCPv2 was begun, the
   consensus was that this use of RTSP would break the RTSP protocol or
   cause backward-compatibility problems, something forbidden by Section
   3.2 of [RFC4313].  This is the reason why MRCPv2 does not run over
   RTSP.

   MRCPv2 leverages these capabilities by building upon SIP and the
   Session Description Protocol (SDP) [RFC4566].  MRCPv2 uses SIP to set
   up and tear down media and control sessions with the server.  In
   addition, the client can use a SIP re-INVITE (an INVITE request sent
   within an existing SIP dialog) to change the characteristics of
   these media and control sessions while maintaining the SIP dialog
   between the client and server.  SDP is used to describe the
   parameters of the media sessions associated with that dialog.  It is
   mandatory to support SIP as the session establishment protocol to
   ensure interoperability.  Other protocols can be used for session
   establishment by prior agreement.  This document only describes the
   use of SIP and SDP.

   MRCPv2 uses SIP and SDP to create the speech client/server dialog and
   set up the media channels to the server.  It also uses SIP and SDP to
   establish MRCPv2 control sessions between the client and the server
   for each media processing resource required for that dialog.  The
   MRCPv2 protocol exchange between the client and the media resource is
   carried on that control session.  MRCPv2 exchanges do not change the
   state of the SIP dialog, the media sessions, or other parameters of
   the dialog initiated via SIP.  They control and affect the state of
   the media processing resource associated with the MRCPv2 session(s).

   MRCPv2 defines the messages to control the different media processing
   resources and the state machines required to guide their operation.
   It also describes how these messages are carried over a transport-
   layer protocol such as the Transmission Control Protocol (TCP)
   [RFC0793] or the Transport Layer Security (TLS) Protocol [RFC5246].
   (Note: the Stream Control Transmission Protocol (SCTP) [RFC4960] is a
   viable transport for MRCPv2 as well, but the mapping onto SCTP is not
   described in this specification.)

2.  Document Conventions

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

   Since many of the definitions and syntax are identical to those for
   the Hypertext Transfer Protocol -- HTTP/1.1 [RFC2616], this
   specification refers to the section where they are defined rather
   than copying them.  For brevity, [HX.Y] is to be taken to refer to
   Section X.Y of RFC 2616.
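
   As a reading aid only, the [HX.Y] shorthand can be expanded
   mechanically.  The sketch below is not part of the specification; the
   function name expand_http_refs and the regular expression are
   assumptions made for this illustration, and only numeric section
   labels are handled.

```python
import re

# Matches the shorthand "[H<section>]", e.g. "[H4.2]" or "[H14.9.1]".
_H_REF = re.compile(r"\[H(\d+(?:\.\d+)*)\]")


def expand_http_refs(text: str) -> str:
    """Replace each [HX.Y] shorthand with "Section X.Y of RFC 2616"."""
    return _H_REF.sub(lambda m: f"Section {m.group(1)} of RFC 2616", text)


expanded = expand_http_refs("see [H4.2] for message headers")
```

   Ordinary citations such as [RFC2616] are left untouched because they
   do not begin with "H" followed by a digit.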

   All the mechanisms specified in this document are described in both
   prose and an augmented Backus-Naur form (ABNF [RFC5234]).

   The complete message format in ABNF form is provided in Section 15
   and is the normative format definition.  Note that productions may be
   duplicated within the main body of the document for reading
   convenience.  If a production in the body of the text conflicts with
   one in the normative definition, the latter takes precedence.

2.1.  Definitions

   Media Resource
                  An entity on the speech processing server that can be
                  controlled through MRCPv2.

   MRCP Server
                  Aggregate of one or more "Media Resource" entities on
                  a server, exposed through MRCPv2.  Often, 'server' in
                  this document refers to an MRCP server.

   MRCP Client
                  An entity controlling one or more Media Resources
                  through MRCPv2 ("Client" for short).

   DTMF
                  Dual-Tone Multi-Frequency; a method of transmitting
                  key presses in-band, either as actual tones (Q.23
                  [Q.23]) or as named tone events (RFC 4733 [RFC4733]).

   Endpointing
                  The process of automatically detecting the beginning
                  and end of speech in an audio stream.  This is
                  critical both for speech recognition and for automated
                  recording as one would find in voice mail systems.

   Hotword Mode
                  A mode of speech recognition where a stream of
                  utterances is evaluated for a match against a small
                  set of command words.  This is generally employed
                  either to trigger some action or to control the
                  subsequent grammar to be used for further recognition.

2.2.  State-Machine Diagrams

   The state-machine diagrams in this document do not show every
   possible method call.  Rather, they reflect the state of the resource
   based on the methods that have moved to IN-PROGRESS or COMPLETE
   states (see Section 5.3).  Note that since PENDING requests
   essentially have not affected the resource yet and are in the queue
   to be processed, they are not reflected in the state-machine
   diagrams.
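
   The rule above can be read as a simple filter over request states.
   The following is a minimal sketch, not a normative definition: the
   three state names come from this section, while the function name
   reflected_in_diagram and the sample request list are hypothetical.

```python
from enum import Enum


class RequestState(Enum):
    """Request states named in Section 2.2 (defined in Section 5.3)."""
    PENDING = "PENDING"
    IN_PROGRESS = "IN-PROGRESS"
    COMPLETE = "COMPLETE"


def reflected_in_diagram(state: RequestState) -> bool:
    """Return True if a request in this state is shown in the diagrams.

    PENDING requests are still queued and have not affected the
    resource, so the diagrams omit them.
    """
    return state is not RequestState.PENDING


# Hypothetical queue of (method, request-state) pairs for one resource.
requests = [
    ("SPEAK", RequestState.IN_PROGRESS),
    ("SPEAK", RequestState.PENDING),
    ("STOP", RequestState.COMPLETE),
]
visible = [method for method, state in requests if reflected_in_diagram(state)]
```

   In this sample, the queued second SPEAK is excluded, and only the
   requests that have reached IN-PROGRESS or COMPLETE remain.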

2.3.  URI Schemes

   This document defines many protocol headers that contain URIs
   (Uniform Resource Identifiers [RFC3986]) or lists of URIs for
   referencing media.  The entire document, including the Security
   Considerations section (Section 12), assumes that HTTP or HTTP over
   TLS (HTTPS) [RFC2818] will be used as the URI addressing scheme
   unless otherwise stated.  However, implementations MAY support other
   schemes (such as 'file'), provided they have addressed any security
   considerations described in this document and any others particular
   to the specific scheme.  For example, implementations where the
   client and server both reside on the same physical hardware and the
   file system is secured by traditional user-level file access controls
   could be reasonable candidates for supporting the 'file' scheme.
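
   One way an implementation might apply this policy is to accept HTTP
   and HTTPS by default and treat every other scheme as an explicit
   opt-in.  The sketch below assumes this approach; the names
   scheme_allowed and extra_schemes are hypothetical and are not defined
   by this document.

```python
from urllib.parse import urlsplit

# Schemes this document assumes unless otherwise stated.
DEFAULT_SCHEMES = frozenset({"http", "https"})


def scheme_allowed(uri: str, extra_schemes: frozenset = frozenset()) -> bool:
    """Return True if the URI's scheme is acceptable under local policy.

    extra_schemes is a hypothetical deployment setting listing schemes
    (for example "file") whose security implications the operator has
    already addressed.
    """
    scheme = urlsplit(uri).scheme.lower()
    return scheme in DEFAULT_SCHEMES or scheme in extra_schemes


default_ok = scheme_allowed("https://example.com/grammar.grxml")
file_by_default = scheme_allowed("file:///var/prompts/hello.wav")
file_opted_in = scheme_allowed("file:///var/prompts/hello.wav",
                               frozenset({"file"}))
```

   Keeping the additional schemes in a separate, initially empty set
   makes the security review of each one a deliberate configuration
   step rather than a default.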

3.  Architecture

   A system using MRCPv2 consists of a client that requires the
   generation and/or consumption of media streams and a media resource
   server that has the resources or "engines" to process these streams
   as input or generate these streams as output.  The client uses SIP
   and SDP to establish an MRCPv2 control channel with the server to use
   its media processing resources.  MRCPv2 servers are addressed using
   SIP URIs.

   SIP uses SDP with the offer/answer model described in RFC 3264
   [RFC3264] to set up the MRCPv2 control channels and describe their
   characteristics.  A separate MRCPv2 session is needed to control each
   of the media processing resources associated with the SIP dialog
   between the client and server.  Within a SIP dialog, the individual
   resource control channels for the different resources are added or
   removed through SDP offer/answer carried in a SIP re-INVITE
   transaction.

   The server, through the SDP exchange, provides the client with a
   difficult-to-guess, unambiguous channel identifier and a TCP port
   number (see Section 4.2).  The client MAY then open a new TCP
   connection with the server on this port number.  Multiple MRCPv2
   channels can share a TCP connection between the client and the
   server.  All MRCPv2 messages exchanged between the client and the
   server carry the specified channel identifier that the server MUST
   ensure is unambiguous among all MRCPv2 control channels that are
   active on that server.  The client uses this channel identifier to
   indicate the media processing resource associated with that channel.
   For information on message framing, see Section 5.
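
   The server-side bookkeeping implied above can be sketched as
   follows.  This is an illustration under stated assumptions, not a
   required design: the class and method names are hypothetical, the
   32-hex-digit token length is an arbitrary choice, and the identifier
   layout (a token, "@", then the resource type) follows the form
   described in Section 4.2.

```python
import secrets


class ChannelRegistry:
    """Server-side table of active MRCPv2 control channels (illustrative)."""

    def __init__(self):
        self._active = {}  # channel identifier -> resource object

    def allocate(self, resource_type, resource):
        """Create a hard-to-guess identifier that no active channel uses."""
        while True:
            # 16 random bytes as 32 hex digits; the length is an assumption.
            channel_id = f"{secrets.token_hex(16)}@{resource_type}"
            if channel_id not in self._active:
                self._active[channel_id] = resource
                return channel_id

    def lookup(self, channel_id):
        """Map the identifier carried in a message to its resource."""
        return self._active[channel_id]

    def release(self, channel_id):
        """Forget a channel once it is removed from the session."""
        self._active.pop(channel_id, None)
```

   Because lookup is keyed on the identifier alone, messages for
   several channels can arrive on one shared TCP connection and still
   be routed to the correct resource.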

   SIP also establishes the media sessions between the client (or other
   source/sink of media) and the MRCPv2 server using SDP "m=" lines.
   One or more media processing resources may share a media session
   under a SIP session, or each media processing resource may have its
   own media session.

   The following diagram shows the general architecture of a system that
   uses MRCPv2.  To simplify the diagram, only a few resources are
   shown.

     MRCPv2 client                   MRCPv2 Media Resource Server
|--------------------|            |------------------------------------|
||------------------||            ||----------------------------------||
|| Application Layer||            ||Synthesis|Recognition|Verification||
||------------------||            || Engine  |  Engine   |   Engine   ||
||Media Resource API||            ||    ||   |    ||     |    ||      ||
||------------------||            ||Synthesis|Recognizer |  Verifier  ||
|| SIP  |  MRCPv2   ||            ||Resource | Resource  |  Resource  ||
||Stack |           ||            ||     Media Resource Management    ||
||      |           ||            ||----------------------------------||
||------------------||            ||   SIP  |        MRCPv2           ||
||   TCP/IP Stack   ||---MRCPv2---||  Stack |                         ||
||                  ||            ||----------------------------------||
||------------------||----SIP-----||           TCP/IP Stack           ||
|--------------------|            ||                                  ||
         |                        ||----------------------------------||
        SIP                       |------------------------------------|
         |                          /
|-------------------|             RTP
|                   |             /
| Media Source/Sink |------------/
|                   |
|-------------------|

                      Figure 1: Architectural Diagram

3.1.  MRCPv2 Media Resource Types

   An MRCPv2 server may offer one or more of the following media
   processing resources to its clients.

   Basic Synthesizer
                  A speech synthesizer resource that has very limited
                  capabilities and can generate its media stream
                  exclusively from concatenated audio clips.  The speech
                  data is described using a limited subset of the Speech
                  Synthesis Markup Language (SSML)
                  [W3C.REC-speech-synthesis-20040907] elements.  A basic
                  synthesizer MUST support the SSML tags <speak>,
                  <audio>, <say-as>, and <mark>.

   Speech Synthesizer
                  A full-capability speech synthesis resource that can
                  render speech from text.  Such a synthesizer MUST have
                  full SSML [W3C.REC-speech-synthesis-20040907] support.

   Recorder
                  A resource capable of recording audio and providing a
                  URI pointer to the recording.  A recorder MUST provide
                  endpointing capabilities for suppressing silence at
                  the beginning and end of a recording, and MAY also
                  suppress silence in the middle of a recording.  If
                  such suppression is done, the recorder MUST maintain
                  timing metadata to indicate the actual timestamps of
                  the recorded media.

   DTMF Recognizer
                  A recognizer resource capable of extracting and
                  interpreting Dual-Tone Multi-Frequency (DTMF) [Q.23]
                  digits in a media stream and matching them against a
                  supplied digit grammar.  It can also perform semantic
                  interpretation based on semantic tags in the grammar.

   Speech Recognizer
                  A full speech recognition resource that is capable of
                  receiving a media stream containing audio and
                  interpreting it to produce recognition results.  It
                  also has a natural language semantic interpreter to
                  post-process the recognized data according to the
                  semantic data in the grammar and provide semantic
                  results along with the recognized input.  The
                  recognizer MAY also support
                  enrolled grammars, where the client can enroll and
                  create new personal grammars for use in future
                  recognition operations.

   Speaker Verifier
                  A resource capable of verifying the authenticity of a
                  claimed identity by matching a media stream containing
                  spoken input to a pre-existing voiceprint.  This may
                  also involve matching the caller's voice against more
                  than one voiceprint, also called multi-verification or
                  speaker identification.
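
   As an illustration of the basic synthesizer requirement above, a
   client could check whether an SSML document stays within the four
   tags that every basic synthesizer is required to support.  The
   sketch below is only a client-side convenience under that
   assumption; a particular basic synthesizer may accept more tags than
   these, and the function and constant names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Tags a basic synthesizer is required to support, per the text above.
BASIC_SSML_TAGS = frozenset({"speak", "audio", "say-as", "mark"})


def local_name(tag):
    """Strip an XML namespace such as '{http://...}speak' down to 'speak'."""
    return tag.rsplit("}", 1)[-1]


def unsupported_by_basic(ssml_text):
    """Return the sorted element names outside the required basic set."""
    root = ET.fromstring(ssml_text)
    found = {local_name(el.tag) for el in root.iter()}
    return sorted(found - BASIC_SSML_TAGS)
```

   An empty result means the document uses only tags from the required
   set; a non-empty result suggests requesting a full speech
   synthesizer resource instead.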

3.2.  Server and Resource Addressing

   The MRCPv2 server is a generic SIP server, and is thus addressed by a
   SIP URI (RFC 3261 [RFC3261]).

   For example:

        sip:mrcpv2@example.net   or
        sips:mrcpv2@example.net
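
   A client might split such an address into its parts before handing
   it to a SIP stack.  The following is a simplified sketch: the
   function name is hypothetical, and URI parameters, headers, and
   ports permitted by RFC 3261 are deliberately not handled.

```python
from urllib.parse import urlsplit


def split_sip_uri(uri):
    """Split 'sip:user@host' or 'sips:user@host' into (scheme, user, host).

    The user part is returned as None when the URI has none.
    """
    parts = urlsplit(uri)
    if parts.scheme not in ("sip", "sips"):
        raise ValueError(f"not a SIP URI: {uri!r}")
    user, _, host = parts.path.rpartition("@")
    if not host:
        raise ValueError(f"missing host in {uri!r}")
    return parts.scheme, user or None, host
```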
