RFC 6787

Media Resource Control Protocol Version 2 (MRCPv2)

Pages: 224
Proposed Standard

7.  Resource Discovery

   Server resources may be discovered and their capabilities learned by
   clients through standard SIP machinery.  The client MAY issue a SIP
   OPTIONS transaction to a server, which has the effect of requesting
   the capabilities of the server.  The server MUST respond to such a
   request with an SDP-encoded description of its capabilities according
   to RFC 3264 [RFC3264].  The MRCPv2 capabilities are described by a
   single "m=" line containing the media type "application" and
   transport type "TCP/TLS/MRCPv2" or "TCP/MRCPv2".  There MUST be one
   "resource" attribute for each media resource that the server
   supports, and it has the resource type identifier as its value.

   The SDP description MUST also contain "m=" lines describing the audio
   capabilities and the coders the server supports.

   In this example, the client uses the SIP OPTIONS method to query the
   capabilities of the MRCPv2 server.

   C->S:
        OPTIONS sip:mrcp@server.example.com SIP/2.0
        Via:SIP/2.0/TCP client.atlanta.example.com:5060;
         branch=z9hG4bK74bf7
        Max-Forwards:6
        To:<sip:mrcp@example.com>
        From:Sarvi <sip:sarvi@example.com>;tag=1928301774
        Call-ID:a84b4c76e66710
        CSeq:63104 OPTIONS
        Contact:<sip:sarvi@client.example.com>
        Accept:application/sdp
        Content-Length:0


   S->C:
        SIP/2.0 200 OK
        Via:SIP/2.0/TCP client.atlanta.example.com:5060;
         branch=z9hG4bK74bf7;received=192.0.32.10
        To:<sip:mrcp@example.com>;tag=62784
        From:Sarvi <sip:sarvi@example.com>;tag=1928301774
        Call-ID:a84b4c76e66710
        CSeq:63104 OPTIONS
        Contact:<sip:mrcp@server.example.com>
        Allow:INVITE, ACK, CANCEL, OPTIONS, BYE
        Accept:application/sdp
        Accept-Encoding:gzip
        Accept-Language:en
        Supported:foo
        Content-Type:application/sdp
        Content-Length:...

        v=0
        o=sarvi 2890844536 2890842811 IN IP4 192.0.2.12
        s=-
        i=MRCPv2 server capabilities
        c=IN IP4 192.0.2.12/127
        t=0 0
        m=application 0 TCP/TLS/MRCPv2 1
        a=resource:speechsynth
        a=resource:speechrecog
        a=resource:speakverify
        m=audio 0 RTP/AVP 0 3
        a=rtpmap:0 PCMU/8000
        a=rtpmap:3 GSM/8000

         Using SIP OPTIONS for MRCPv2 Server Capability Discovery
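
   A client might read such an OPTIONS response roughly as follows.
   This is a minimal sketch, not a full SDP parser: the function name
   parse_mrcp_capabilities and the returned dictionary layout are
   assumptions made for illustration, and only the "m=", "a=resource",
   and "a=rtpmap" lines shown in the example above are examined.

```python
def parse_mrcp_capabilities(sdp: str) -> dict:
    """Collect MRCPv2 resources and audio formats from an SDP body.

    Hypothetical helper: the result layout is chosen for illustration.
    """
    caps = {"transport": None, "resources": [], "audio_formats": {}}
    section = None  # media type of the "m=" section currently being read
    for raw in sdp.splitlines():
        line = raw.strip()
        if line.startswith("m="):
            fields = line[2:].split()
            section = fields[0]
            is_mrcp = len(fields) >= 3 and fields[2].endswith("MRCPv2")
            if section == "application" and is_mrcp:
                caps["transport"] = fields[2]
        elif section == "application" and line.startswith("a=resource:"):
            caps["resources"].append(line[len("a=resource:"):])
        elif section == "audio" and line.startswith("a=rtpmap:"):
            payload, encoding = line[len("a=rtpmap:"):].split(None, 1)
            caps["audio_formats"][int(payload)] = encoding
    return caps


# SDP body copied from the example response above.
EXAMPLE_SDP = """\
v=0
o=sarvi 2890844536 2890842811 IN IP4 192.0.2.12
s=-
i=MRCPv2 server capabilities
c=IN IP4 192.0.2.12/127
t=0 0
m=application 0 TCP/TLS/MRCPv2 1
a=resource:speechsynth
a=resource:speechrecog
a=resource:speakverify
m=audio 0 RTP/AVP 0 3
a=rtpmap:0 PCMU/8000
a=rtpmap:3 GSM/8000
"""

capabilities = parse_mrcp_capabilities(EXAMPLE_SDP)
```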

8.  Speech Synthesizer Resource

   This resource processes text markup provided by the client and
   generates a stream of synthesized speech in real time.  Depending
   upon the server implementation and capability of this resource, the
   client can also dictate parameters of the synthesized speech such as
   voice characteristics, speaker speed, etc.

   The synthesizer resource is controlled by MRCPv2 requests from the
   client.  Similarly, the resource can respond to these requests or
   generate asynchronous events to the client to indicate conditions of
   interest to the client during the generation of the synthesized
   speech stream.

   This section applies for the following resource types:

   o  speechsynth

   o  basicsynth

   The capabilities of these resources are defined in Section 3.1.

8.1.  Synthesizer State Machine

   The synthesizer maintains a state machine to process MRCPv2 requests
   from the client.  The state transitions shown below describe the
   states of the synthesizer and reflect the state of the request at the
   head of the synthesizer resource queue.  A SPEAK request in the
   PENDING state can be deleted or stopped by a STOP request without
   affecting the state of the resource.

   Idle                    Speaking                  Paused
   State                   State                     State
     |                        |                          |
     |----------SPEAK-------->|                 |--------|
     |<------STOP-------------|             CONTROL      |
     |<----SPEAK-COMPLETE-----|                 |------->|
     |<----BARGE-IN-OCCURRED--|                          |
     |              |---------|                          |
     |          CONTROL       |-----------PAUSE--------->|
     |              |-------->|<----------RESUME---------|
     |                        |               |----------|
     |----------|             |              PAUSE       |
     |    BARGE-IN-OCCURRED   |               |--------->|
     |<---------|             |----------|               |
     |                        |      SPEECH-MARKER       |
     |                        |<---------|               |
     |----------|             |----------|               |
     |         STOP           |       RESUME             |
     |          |             |<---------|               |
     |<---------|             |                          |
     |<---------------------STOP-------------------------|
     |----------|             |                          |
     |     DEFINE-LEXICON     |                          |
     |          |             |                          |
     |<---------|             |                          |
     |<---------------BARGE-IN-OCCURRED------------------|

                         Synthesizer State Machine
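
   The diagram can also be read as a lookup table keyed by the current
   state and the method or event that triggers a transition.  The sketch
   below is one possible transcription of the arrows in the figure; the
   names TRANSITIONS and next_state are illustrative and are not
   defined by this specification.

```python
IDLE, SPEAKING, PAUSED = "idle", "speaking", "paused"

# (current state, method or event) -> next state, read off the figure.
TRANSITIONS = {
    (IDLE, "SPEAK"): SPEAKING,
    (IDLE, "STOP"): IDLE,
    (IDLE, "BARGE-IN-OCCURRED"): IDLE,
    (IDLE, "DEFINE-LEXICON"): IDLE,
    (SPEAKING, "STOP"): IDLE,
    (SPEAKING, "SPEAK-COMPLETE"): IDLE,
    (SPEAKING, "BARGE-IN-OCCURRED"): IDLE,
    (SPEAKING, "CONTROL"): SPEAKING,
    (SPEAKING, "SPEECH-MARKER"): SPEAKING,
    (SPEAKING, "RESUME"): SPEAKING,
    (SPEAKING, "PAUSE"): PAUSED,
    (PAUSED, "CONTROL"): PAUSED,
    (PAUSED, "PAUSE"): PAUSED,
    (PAUSED, "RESUME"): SPEAKING,
    (PAUSED, "STOP"): IDLE,
    (PAUSED, "BARGE-IN-OCCURRED"): IDLE,
}


def next_state(state: str, trigger: str) -> str:
    """Return the state reached from `state` on `trigger`.

    Raises ValueError for pairs that have no arrow in the figure.
    """
    try:
        return TRANSITIONS[(state, trigger)]
    except KeyError:
        raise ValueError(
            f"{trigger} is not shown for the {state} state"
        ) from None
```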

8.2.  Synthesizer Methods

   The synthesizer supports the following methods.

   synthesizer-method   =  "SPEAK"
                        /  "STOP"
                        /  "PAUSE"
                        /  "RESUME"
                        /  "BARGE-IN-OCCURRED"
                        /  "CONTROL"
                        /  "DEFINE-LEXICON"

8.3.  Synthesizer Events

   The synthesizer can generate the following events.

   synthesizer-event    =  "SPEECH-MARKER"
                        /  "SPEAK-COMPLETE"

8.4.  Synthesizer Header Fields

   A synthesizer message can contain header fields that carry request
   options and information to augment the Request, Response, or Event
   with which it is associated.

   synthesizer-header  =  jump-size
                       /  kill-on-barge-in
                       /  speaker-profile
                       /  completion-cause
                       /  completion-reason
                       /  voice-parameter
                       /  prosody-parameter
                       /  speech-marker
                       /  speech-language
                       /  fetch-hint
                       /  audio-fetch-hint
                       /  failed-uri
                       /  failed-uri-cause
                       /  speak-restart
                       /  speak-length
                       /  load-lexicon
                       /  lexicon-search-order

8.4.1.  Jump-Size

   This header field MAY be specified in a CONTROL method and controls
   the amount to jump forward or backward in an active SPEAK request.  A
   '+' or '-' indicates a relative value to what is being currently
   played.  This header field MAY also be specified in a SPEAK request
   as a desired offset into the synthesized speech.  In this case, the
   synthesizer MUST begin speaking from this amount of time into the
   speech markup.  Note that an offset that extends beyond the end of
   the produced speech will result in audio of length zero.  The
   different speech length units supported are dependent on the
   synthesizer implementation.  If the synthesizer resource does not
   support a unit for the operation, the resource MUST respond with a
   status-code of 409 "Unsupported Header Field Value".

   jump-size             =   "Jump-Size" ":" speech-length-value CRLF

   speech-length-value   =   numeric-speech-length
                         /   text-speech-length

   text-speech-length    =   1*UTFCHAR SP "Tag"

   numeric-speech-length =    ("+" / "-") positive-speech-length

   positive-speech-length =   1*19DIGIT SP numeric-speech-unit

   numeric-speech-unit   =   "Second"
                         /   "Word"
                         /   "Sentence"
                         /   "Paragraph"
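
   As an illustration of the grammar above, a receiver could validate a
   Jump-Size value with two regular expressions.  This is a sketch under
   stated assumptions: parse_jump_size is a hypothetical helper, UTFCHAR
   is approximated as "any non-whitespace character", and unit names are
   matched with exact case for simplicity.

```python
import re

# numeric-speech-length: sign, 1 to 19 digits, one space, a unit name.
_NUMERIC = re.compile(r"([+-])([0-9]{1,19}) (Second|Word|Sentence|Paragraph)")
# text-speech-length: a tag name, one space, the literal "Tag".
# Assumption: UTFCHAR is approximated by \S (no whitespace in tag names).
_TAG = re.compile(r"(\S+) Tag")


def parse_jump_size(value: str):
    """Return ("numeric", signed_amount, unit) or ("tag", name, "Tag")."""
    match = _NUMERIC.fullmatch(value)
    if match:
        sign = -1 if match.group(1) == "-" else 1
        return ("numeric", sign * int(match.group(2)), match.group(3))
    match = _TAG.fullmatch(value)
    if match:
        return ("tag", match.group(1), "Tag")
    raise ValueError(f"not a valid speech-length-value: {value!r}")
```

   A server that recognizes the syntax but does not support the unit
   would still answer with status-code 409, as described above.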

8.4.2.  Kill-On-Barge-In

   This header field MAY be sent as part of the SPEAK method to enable
   "kill-on-barge-in" support.  If enabled, the SPEAK method is
   interrupted by DTMF input detected by a signal detector resource or
   by the start of speech sensed or recognized by the speech recognizer
   resource.

   kill-on-barge-in      =   "Kill-On-Barge-In" ":" BOOLEAN CRLF

   The client MUST send a BARGE-IN-OCCURRED method to the synthesizer
   resource when it receives a barge-in-able event from any source.
   This source could be a recognizer resource or signal detector
   resource and MAY be either local or distributed.  If this header
   field is not specified in a SPEAK request or explicitly set by a
   SET-PARAMS, the default value for this header field is "true".

   If the recognizer or signal detector resource is on the same server
   as the synthesizer and both are part of the same session, the server
   MAY work with both to provide internal notification to the
   synthesizer so that audio may be stopped without having to wait for
   the client's BARGE-IN-OCCURRED request.

   It is generally RECOMMENDED when playing a prompt to the user with
   Kill-On-Barge-In and asking for input, that the client issue the
   RECOGNIZE request ahead of the SPEAK request for optimum performance
   and user experience.  This way, it is guaranteed that the recognizer
   is online before the prompt starts playing and the user's speech will
   not be truncated at the beginning (especially for power users).

8.4.3.  Speaker-Profile

   This header field MAY be part of the SET-PARAMS/GET-PARAMS or SPEAK
   request from the client to the server and specifies a URI that
   references the profile of the speaker.  Speaker profiles are
   collections of voice parameters like gender, accent, etc.

   speaker-profile       =   "Speaker-Profile" ":" uri CRLF

8.4.4.  Completion-Cause

   This header field MUST be specified in a SPEAK-COMPLETE event coming
   from the synthesizer resource to the client.  This indicates the
   reason the SPEAK request completed.

   completion-cause      =   "Completion-Cause" ":" 3DIGIT SP
                             1*VCHAR CRLF

   +------------+-----------------------+------------------------------+
   | Cause-Code | Cause-Name            | Description                  |
   +------------+-----------------------+------------------------------+
   | 000        | normal                | SPEAK completed normally.    |
   | 001        | barge-in              | SPEAK request was terminated |
   |            |                       | because of barge-in.         |
   | 002        | parse-failure         | SPEAK request terminated     |
   |            |                       | because of a failure to      |
   |            |                       | parse the speech markup      |
   |            |                       | text.                        |
   | 003        | uri-failure           | SPEAK request terminated     |
   |            |                       | because access to one of the |
   |            |                       | URIs failed.                 |
   | 004        | error                 | SPEAK request terminated     |
   |            |                       | prematurely due to           |
   |            |                       | synthesizer error.           |
   | 005        | language-unsupported  | Language not supported.      |
   | 006        | lexicon-load-failure  | Lexicon loading failed.      |
   | 007        | cancelled             | A prior SPEAK request failed |
   |            |                       | while this one was still in  |
   |            |                       | the queue.                   |
   +------------+-----------------------+------------------------------+

                Synthesizer Resource Completion Cause Codes
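
   A client can split a Completion-Cause value into its numeric code and
   name and compare them against the table above.  The following is a
   minimal sketch; the names SYNTH_COMPLETION_CAUSES and
   parse_completion_cause are illustrative only.

```python
# Cause codes and names copied from the table above.
SYNTH_COMPLETION_CAUSES = {
    0: "normal",
    1: "barge-in",
    2: "parse-failure",
    3: "uri-failure",
    4: "error",
    5: "language-unsupported",
    6: "lexicon-load-failure",
    7: "cancelled",
}


def parse_completion_cause(value: str):
    """Split "NNN name" into (code, name, matches_table)."""
    code_text, _, name = value.strip().partition(" ")
    digits_ok = code_text.isascii() and code_text.isdigit()
    if len(code_text) != 3 or not digits_ok or not name:
        raise ValueError(f"malformed Completion-Cause: {value!r}")
    code = int(code_text)
    return code, name, SYNTH_COMPLETION_CAUSES.get(code) == name
```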

8.4.5.  Completion-Reason

   This header field MAY be specified in a SPEAK-COMPLETE event coming
   from the synthesizer resource to the client.  This contains the
   reason text behind the SPEAK request completion.  This header field
   communicates text describing the reason for the failure, such as an
   error in parsing the speech markup text.

   completion-reason   =   "Completion-Reason" ":"
                           quoted-string CRLF

   The completion reason text is provided for client use in logs and for
   debugging and instrumentation purposes.  Clients MUST NOT interpret
   the completion reason text.

8.4.6.  Voice-Parameter

   This set of header fields defines the voice of the speaker.

   voice-parameter    =   voice-gender
                       /   voice-age
                       /   voice-variant
                       /   voice-name

   voice-gender        =   "Voice-Gender:" voice-gender-value CRLF
   voice-gender-value  =   "male"
                       /   "female"
                       /   "neutral"
   voice-age           =   "Voice-Age:" 1*3DIGIT CRLF
   voice-variant       =   "Voice-Variant:" 1*19DIGIT CRLF
   voice-name          =   "Voice-Name:"
                           1*UTFCHAR *(1*WSP 1*UTFCHAR) CRLF

   The "Voice-" parameters are derived from the similarly named
   attributes of the voice element specified in W3C's Speech Synthesis
   Markup Language Specification (SSML)
   [W3C.REC-speech-synthesis-20040907].  Legal values for these
   parameters are as defined in that specification.

   These header fields MAY be sent in SET-PARAMS or GET-PARAMS requests
   to define or get default values for the entire session or MAY be sent
   in the SPEAK request to define default values for that SPEAK request.
   Note that SSML content can itself set these values internal to the
   SSML document, of course.

   Voice parameter header fields MAY also be sent in a CONTROL method to
   affect a SPEAK request in progress and change its behavior on the
   fly.  If the synthesizer resource does not support this operation, it
   MUST reject the request with a status-code of 403 "Unsupported Header
   Field".

8.4.7.  Prosody-Parameters

   This set of header fields defines the prosody of the speech.

   prosody-parameter   =   "Prosody-" prosody-param-name ":"
                           prosody-param-value CRLF

   prosody-param-name    =    1*VCHAR

   prosody-param-value   =    1*VCHAR

   prosody-param-name is any one of the attribute names under the
   prosody element specified in W3C's Speech Synthesis Markup Language
   Specification [W3C.REC-speech-synthesis-20040907].  The prosody-
   param-value is any one of the value choices of the corresponding
   prosody element attribute from that specification.

   These header fields MAY be sent in SET-PARAMS or GET-PARAMS requests
   to define or get default values for the entire session or MAY be sent
   in the SPEAK request to define default values for that SPEAK request.
   Furthermore, these attributes can be part of the speech text marked
   up in SSML.

   The prosody parameter header fields in the SET-PARAMS or SPEAK
   request only apply if the speech data is of type 'text/plain' and
   does not use a speech markup format.

   These prosody parameter header fields MAY also be sent in a CONTROL
   method to affect a SPEAK request in progress and change its behavior
   on the fly.  If the synthesizer resource does not support this
   operation, it MUST respond back to the client with a status-code of
   403 "Unsupported Header Field".

8.4.8.  Speech-Marker

   This header field contains timestamp information in a "timestamp"
   field.  This is a Network Time Protocol (NTP) [RFC5905] timestamp, a
   64-bit number in decimal form.  It MUST be synced with the Real-time
   Transport Protocol (RTP) [RFC3550] timestamp of the media stream
   through the RTP Control Protocol (RTCP) [RFC3550].

   Markers are bookmarks that are defined within the markup.  Most
   speech markup formats provide mechanisms to embed marker fields
   within speech texts.  The synthesizer generates SPEECH-MARKER events
   when it reaches these marker fields.  This header field MUST be part
   of the SPEECH-MARKER event and contain the marker tag value after the
   timestamp, separated by a semicolon.  In these events, the timestamp
   marks the time the text corresponding to the marker was emitted as
   speech by the synthesizer.

   This header field MUST also be returned in responses to STOP,
   CONTROL, and BARGE-IN-OCCURRED methods, in the SPEAK-COMPLETE event,
   and in an IN-PROGRESS SPEAK response.  In these messages, if any
   markers have been encountered for the current SPEAK, the marker tag
   value MUST be the last embedded marker encountered.  If no markers
   have yet been encountered for the current SPEAK, only the timestamp
   is REQUIRED.  Note that in these events, the purpose of this header
   field is to provide timestamp information associated with important
   events within the lifecycle of a request (start of SPEAK processing,
   end of SPEAK processing, receipt of CONTROL/STOP/BARGE-IN-OCCURRED).

   timestamp           =   "timestamp" "=" time-stamp-value

   time-stamp-value    =   1*20DIGIT

   speech-marker       =   "Speech-Marker" ":"
                           timestamp
                           [";" 1*(UTFCHAR / %x20)] CRLF
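
   The grammar above can be handled with simple string splitting.  The
   sketch below is illustrative: parse_speech_marker and split_ntp are
   hypothetical helpers, and split_ntp assumes the usual 64-bit NTP
   layout of RFC 5905 (32 bits of seconds followed by 32 bits of
   fraction).

```python
def parse_speech_marker(value: str):
    """Return (timestamp, tag) where tag is None if no marker is present."""
    head, sep, tag = value.strip().partition(";")
    name, eq, digits = head.partition("=")
    digits_ok = digits.isascii() and digits.isdigit() and len(digits) <= 20
    if name != "timestamp" or not eq or not digits_ok:
        raise ValueError(f"malformed Speech-Marker: {value!r}")
    return int(digits), (tag if sep and tag else None)


def split_ntp(timestamp: int):
    """Split a 64-bit NTP timestamp into (seconds, fraction of a second).

    Assumption: upper 32 bits are seconds, lower 32 bits are the fraction.
    """
    return timestamp >> 32, (timestamp & 0xFFFFFFFF) / 2**32
```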

8.4.9.  Speech-Language

   This header field specifies the default language of the speech data
   if the language is not specified in the markup.  The value of this
   header field MUST follow RFC 5646 [RFC5646].  The header field MAY
   occur in SPEAK, SET-PARAMS, or GET-PARAMS requests.

   speech-language     =   "Speech-Language" ":" 1*VCHAR CRLF

8.4.10.  Fetch-Hint

   When the synthesizer needs to fetch documents or other resources like
   speech markup or audio files, this header field controls the
   corresponding URI access properties.  This provides client policy on
   when the synthesizer should retrieve content from the server.  A
   value of "prefetch" indicates the content MAY be downloaded when the
   request is received, whereas "safe" indicates that content MUST NOT
   be downloaded until actually referenced.  The default value is
   "prefetch".  This header field MAY occur in SPEAK, SET-PARAMS, or
   GET-PARAMS requests.

   fetch-hint          =   "Fetch-Hint" ":" ("prefetch" / "safe") CRLF

8.4.11.  Audio-Fetch-Hint

   When the synthesizer needs to fetch documents or other resources like
   speech audio files, this header field controls the corresponding URI
   access properties.  This provides client policy on whether or not
   the synthesizer is permitted to attempt to optimize speech by
   pre-fetching audio.  The value is either "safe" to say that audio is
   only fetched when it is referenced, never before; "prefetch" to
   permit, but not require, the implementation to pre-fetch the audio;
   or "stream" to allow it to stream the audio fetches.  The default
   value is "prefetch".  This header field MAY occur in SPEAK,
   SET-PARAMS, or GET-PARAMS requests.

   audio-fetch-hint    =   "Audio-Fetch-Hint" ":"
                           ("prefetch" / "safe" / "stream") CRLF

8.4.12.  Failed-URI

   When a synthesizer method needs a synthesizer to fetch or access a
   URI and the access fails, the server SHOULD provide the failed URI in
   this header field in the method response, unless there are multiple
   URI failures, in which case the server MUST provide one of the failed
   URIs in this header field in the method response.

   failed-uri          =   "Failed-URI" ":" absoluteURI CRLF

8.4.13.  Failed-URI-Cause

   When a synthesizer method needs a synthesizer to fetch or access a
   URI and the access fails, the server MUST use this header field in
   the method response to provide the URI-specific or protocol-specific
   response code for the URI named in the Failed-URI header field.  The
   value encoding is UTF-8 (RFC 3629 [RFC3629]) to accommodate any
   access protocol, since some access protocols might have a response
   string instead of a numeric response code.

   failed-uri-cause    =   "Failed-URI-Cause" ":" 1*UTFCHAR CRLF

8.4.14.  Speak-Restart

   When a client issues a CONTROL request to a currently speaking
   synthesizer resource to jump backward, and the target jump point is
   before the start of the current SPEAK request, the current SPEAK
   request MUST restart from the beginning of its speech data and the
   server's response to the CONTROL request MUST contain this header
   field with a value of "true" indicating a restart.

   speak-restart       =   "Speak-Restart" ":" BOOLEAN CRLF

8.4.15.  Speak-Length

   This header field MAY be specified in a CONTROL method to control the
   maximum length of speech to speak, relative to the current speaking
   point in the currently active SPEAK request.  If numeric, the value
   MUST be a positive integer.  If a header field with a Tag unit is
   specified, then the speech output continues until the tag is reached
   or the SPEAK request is completed, whichever comes first.  This
   header field MAY be specified in a SPEAK request to indicate the
   length to speak from the speech data and is relative to the point in
   speech that the SPEAK request starts.  The different speech length
   units supported are synthesizer implementation dependent.  If a
   server does not support the specified unit, the server MUST respond
   with a status-code of 409 "Unsupported Header Field Value".

   speak-length          =   "Speak-Length" ":" positive-length-value
                             CRLF

   positive-length-value =   positive-speech-length
                         /   text-speech-length

   text-speech-length    =   1*UTFCHAR SP "Tag"

   positive-speech-length =  1*19DIGIT SP numeric-speech-unit

   numeric-speech-unit   =   "Second"
                         /   "Word"
                         /   "Sentence"
                         /   "Paragraph"

8.4.16.  Load-Lexicon

   This header field is used to indicate whether a lexicon has to be
   loaded or unloaded.  The value "true" means to load the lexicon if
   not already loaded, and the value "false" means to unload the lexicon
   if it is loaded.  The default value for this header field is "true".
   This header field MAY be specified in a DEFINE-LEXICON method.

   load-lexicon       =   "Load-Lexicon" ":" BOOLEAN CRLF

8.4.17.  Lexicon-Search-Order

   This header field is used to specify a list of active pronunciation
   lexicon URIs and the search order among the active lexicons.
   Lexicons specified within the SSML document take precedence over the
   lexicons specified in this header field.  This header field MAY be
   specified in the SPEAK, SET-PARAMS, and GET-PARAMS methods.

   lexicon-search-order =   "Lexicon-Search-Order" ":"
             "<" absoluteURI ">" *(" " "<" absoluteURI ">") CRLF
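
   For illustration, the value of this header field can be produced and
   read back as shown below.  Both function names are hypothetical, and
   the parser only checks the angle-bracket framing; it does not validate
   that each entry is a well-formed absolute URI.

```python
import re


def format_lexicon_search_order(uris):
    """Join URIs as "<uri> <uri> ..." in search order."""
    return " ".join(f"<{uri}>" for uri in uris)


def parse_lexicon_search_order(value: str):
    """Return the URIs in search order, or raise ValueError."""
    uris = re.findall(r"<([^<>\s]+)>", value)
    # Re-serializing must reproduce the input, otherwise the framing
    # (brackets or single-space separators) was not as specified.
    if not uris or format_lexicon_search_order(uris) != value.strip():
        raise ValueError(f"malformed Lexicon-Search-Order: {value!r}")
    return uris
```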

8.5.  Synthesizer Message Body

   A synthesizer message can contain additional information associated
   with the Request, Response, or Event in its message body.

8.5.1.  Synthesizer Speech Data

   Marked-up text for the synthesizer to speak is specified as a typed
   media entity in the message body.  The speech data to be spoken by
   the synthesizer can be specified inline by embedding the data in the
   message body or by reference by providing a URI for accessing the
   data.  In either case, the data and the format used to mark up the
   speech need to be of a content type supported by the server.

   All MRCPv2 servers containing synthesizer resources MUST support both
   plain text speech data and W3C's Speech Synthesis Markup Language
   [W3C.REC-speech-synthesis-20040907] and hence MUST support the media
   types 'text/plain' and 'application/ssml+xml'.  Other formats MAY be
   supported.

   If the speech data is to be fetched by URI reference, the media type
   'text/uri-list' (see RFC 2483 [RFC2483]) is used to indicate one or
   more URIs that, when dereferenced, will contain the content to be
   spoken.  If a list of speech URIs is specified, the resource MUST
   speak the speech data provided by each URI in the order in which the
   URIs are specified in the content.

   MRCPv2 clients and servers MUST support the 'multipart/mixed' media
   type.  This is the appropriate media type to use when providing a mix
   of URI and inline speech data.  Embedded within the multipart content
   block, there MAY be content for the 'text/uri-list', 'application/
   ssml+xml', and/or 'text/plain' media types.  The character set and
   encoding used in the speech data is specified according to standard
   media type definitions.  The multipart content MAY also contain
   actual audio data.  Clients may have recorded audio clips stored in
   memory or on a local device and wish to play them as part of the
   SPEAK request.  The audio portions MAY be sent by the client as part
   of the multipart content block.  This audio is referenced in the
   speech markup data that is another part in the multipart content
   block according to the 'multipart/mixed' media type specification.

   Content-Type:text/uri-list
   Content-Length:...

   http://www.example.com/ASR-Introduction.ssml
   http://www.example.com/ASR-Document-Part1.ssml
   http://www.example.com/ASR-Document-Part2.ssml
   http://www.example.com/ASR-Conclusion.ssml

                             URI List Example
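
   A server receiving such a body needs the URIs in the order given.  A
   minimal sketch follows; parse_uri_list is a hypothetical helper, and
   it skips lines beginning with "#", which RFC 2483 treats as comments.

```python
def parse_uri_list(body: str):
    """Return the URIs of a text/uri-list body in their original order."""
    uris = []
    for line in body.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):  # skip blanks and comments
            uris.append(line)
    return uris


# Body copied from the URI list example above.
URI_LIST_BODY = (
    "http://www.example.com/ASR-Introduction.ssml\r\n"
    "http://www.example.com/ASR-Document-Part1.ssml\r\n"
    "http://www.example.com/ASR-Document-Part2.ssml\r\n"
    "http://www.example.com/ASR-Conclusion.ssml\r\n"
)
speech_uris = parse_uri_list(URI_LIST_BODY)
```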


   Content-Type:application/ssml+xml
   Content-Length:...

   <?xml version="1.0"?>
        <speak version="1.0"
               xmlns="http://www.w3.org/2001/10/synthesis"
               xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
               xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
               xml:lang="en-US">
          <p>
            <s>You have 4 new messages.</s>
            <s>The first is from Aldine Turnbet
            and arrived at <break/>
            <say-as interpret-as="vxml:time">0345p</say-as>.</s>

            <s>The subject is <prosody
            rate="-20%">ski trip</prosody></s>
         </p>
        </speak>

                               SSML Example

   Content-Type:multipart/mixed; boundary="break"

   --break
   Content-Type:text/uri-list
   Content-Length:...

   http://www.example.com/ASR-Introduction.ssml
   http://www.example.com/ASR-Document-Part1.ssml
   http://www.example.com/ASR-Document-Part2.ssml
   http://www.example.com/ASR-Conclusion.ssml

   --break
   Content-Type:application/ssml+xml
   Content-Length:...

   <?xml version="1.0"?>
       <speak version="1.0"
              xmlns="http://www.w3.org/2001/10/synthesis"
              xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
              xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
              xml:lang="en-US">
          <p>
            <s>You have 4 new messages.</s>
            <s>The first is from Stephanie Williams
            and arrived at <break/>
            <say-as interpret-as="vxml:time">0342p</say-as>.</s>

            <s>The subject is <prosody
            rate="-20%">ski trip</prosody></s>
          </p>
       </speak>
   --break--

                             Multipart Example

8.5.2.  Lexicon Data

   Synthesizer lexicon data from the client to the server can be
   provided inline or by reference.  Either way, they are carried as
   typed media in the message body of the MRCPv2 request message (see
   Section 8.14).

   When a lexicon is specified inline in the message, the client MUST
   provide a Content-ID for that lexicon as part of the content header
   fields.  The server MUST store the lexicon associated with that
   Content-ID for the duration of the session.  A stored lexicon can be
   overwritten by defining a new lexicon with the same Content-ID.

   Lexicons that have been associated with a Content-ID can be
   referenced through the 'session' URI scheme (see Section 13.6).

   If lexicon data is specified by external URI reference, the media
   type 'text/uri-list' (see RFC 2483 [RFC2483] ) is used to list the
   one or more URIs that may be dereferenced to obtain the lexicon data.
   All MRCPv2 servers MUST support the "http" and "https" URI access
   mechanisms, and MAY support other mechanisms.

   If the data in the message body consists of a mix of URI and inline
   lexicon data, the 'multipart/mixed' media type is used.  The
   character set and encoding used in the lexicon data may be specified
   according to standard media type definitions.

8.6.  SPEAK Method

   The SPEAK request provides the synthesizer resource with the speech
   text and initiates speech synthesis and streaming.  The SPEAK method
   MAY carry voice and prosody header fields that alter the behavior of
   the voice being synthesized, as well as a typed media message body
   containing the actual marked-up text to be spoken.

   The SPEAK method implementation MUST do a fetch of all external URIs
   that are part of that operation.  If caching is implemented, this URI
   fetching MUST conform to the cache-control hints and parameter header
   fields associated with the method in deciding whether it is to be
   fetched from cache or from the external server.  If these hints/
   parameters are not specified in the method, the values set for the
   session using SET-PARAMS/GET-PARAMS apply.  If they were not set for
   the session, their default values apply.

   When applying voice parameters, there are three levels of precedence.
   The highest precedence goes to those specified within the speech
   markup text, followed by those specified in the header fields of the
   SPEAK request, which apply to that SPEAK request only, followed by
   the session default values that can be set using the SET-PARAMS
   request and apply for subsequent methods invoked during the session.
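
   The three precedence levels can be modeled as a layered lookup.  The
   sketch below uses Python's collections.ChainMap, which searches its
   mappings from first to last; effective_voice_params and the sample
   values in the usage lines are assumptions made for illustration.

```python
from collections import ChainMap


def effective_voice_params(markup, speak_headers, session_defaults):
    """Merge parameters; earlier arguments win over later ones."""
    return dict(ChainMap(markup, speak_headers, session_defaults))


# Illustrative values only.
resolved = effective_voice_params(
    markup={"Voice-Gender": "female"},
    speak_headers={"Voice-Gender": "male", "Voice-Age": "25"},
    session_defaults={"Voice-Age": "40", "Voice-Name": "alex"},
)
```

   Here the markup supplies Voice-Gender, the SPEAK header fields supply
   Voice-Age, and the session default is used only for Voice-Name.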

   If the resource was idle at the time the SPEAK request arrived at the
   server and the SPEAK method is being actively processed, the resource
   responds immediately with a success status code and a request-state
   of IN-PROGRESS.

   If the resource is in the speaking or paused state when the SPEAK
   method arrives at the server, i.e., it is in the middle of processing
   a previous SPEAK request, the server returns a success status with a
   request-state of PENDING.  The server places the SPEAK request in the
   synthesizer resource request queue.  The request queue operates
   strictly FIFO: requests are processed serially in order of receipt.
   If the current SPEAK fails, all SPEAK methods in the pending queue
   are cancelled and each generates a SPEAK-COMPLETE event with a
   Completion-Cause of "cancelled".

   For the synthesizer resource, SPEAK is the only method that can
   return a request-state of IN-PROGRESS or PENDING.  When the text has
   been synthesized and played into the media stream, the resource
   issues a SPEAK-COMPLETE event with the request-id of the SPEAK
   request and a request-state of COMPLETE.
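
   The queueing rules above can be illustrated with a small model.  This
   is a sketch, not a server implementation: the class SynthQueue, its
   method names, and the event tuples are hypothetical, and media
   handling is omitted entirely.

```python
from collections import deque


class SynthQueue:
    """Toy model of the synthesizer resource request queue."""

    def __init__(self):
        self.active = None      # request-id of the IN-PROGRESS SPEAK
        self.pending = deque()  # request-ids of PENDING SPEAKs, FIFO
        self.events = []        # (event-name, request-id, cause)

    def speak(self, request_id):
        """Accept a SPEAK and return the request-state of the response."""
        if self.active is None:
            self.active = request_id
            return "IN-PROGRESS"
        self.pending.append(request_id)
        return "PENDING"

    def complete(self):
        """Active SPEAK finished normally; promote the next pending one."""
        self.events.append(("SPEAK-COMPLETE", self.active, "000 normal"))
        self.active = self.pending.popleft() if self.pending else None

    def fail(self, cause):
        """Active SPEAK failed; every pending SPEAK is cancelled."""
        self.events.append(("SPEAK-COMPLETE", self.active, cause))
        self.active = None
        while self.pending:
            cancelled = self.pending.popleft()
            self.events.append(("SPEAK-COMPLETE", cancelled, "007 cancelled"))
```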

   C->S: MRCP/2.0 ... SPEAK 543257
         Channel-Identifier:32AECB23433802@speechsynth
         Voice-gender:neutral
         Voice-Age:25
         Prosody-volume:medium
         Content-Type:application/ssml+xml
         Content-Length:...

         <?xml version="1.0"?>
            <speak version="1.0"
                xmlns="http://www.w3.org/2001/10/synthesis"
                xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
                xml:lang="en-US">
            <p>
             <s>You have 4 new messages.</s>
             <s>The first is from Stephanie Williams and arrived at
                <break/>
                <say-as interpret-as="vxml:time">0342p</say-as>.
                </s>
             <s>The subject is
                    <prosody rate="-20%">ski trip</prosody>
             </s>
            </p>
           </speak>

   S->C: MRCP/2.0 ... 543257 200 IN-PROGRESS
         Channel-Identifier:32AECB23433802@speechsynth
         Speech-Marker:timestamp=857206027059

   S->C: MRCP/2.0 ... SPEAK-COMPLETE 543257 COMPLETE
         Channel-Identifier:32AECB23433802@speechsynth
         Completion-Cause:000 normal
         Speech-Marker:timestamp=857206027059

                               SPEAK Example
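
   The queueing rules of this section can be sketched as a small Python
   model.  This is an illustration under stated assumptions, not a
   server implementation: the class and method names are hypothetical,
   request-ids are plain integers, and the cause recorded for the
   failed request ("error") is a placeholder chosen for the sketch.

```python
from collections import deque


class SynthesizerQueue:
    """Toy model of the synthesizer resource's SPEAK request queue."""

    def __init__(self):
        self.active = None      # request-id of the IN-PROGRESS SPEAK, if any
        self.pending = deque()  # PENDING request-ids, oldest first (FIFO)
        self.events = []        # (event-name, request-id, completion-cause)

    def speak(self, request_id):
        """Return the request-state carried in the response to a SPEAK."""
        if self.active is None:
            self.active = request_id
            return "IN-PROGRESS"
        self.pending.append(request_id)
        return "PENDING"

    def finish_active(self):
        """The active SPEAK was fully played; promote the next in line."""
        self.events.append(("SPEAK-COMPLETE", self.active, "normal"))
        self.active = self.pending.popleft() if self.pending else None

    def fail_active(self):
        """The active SPEAK failed; every pending SPEAK is cancelled."""
        self.events.append(("SPEAK-COMPLETE", self.active, "error"))
        while self.pending:
            cancelled_id = self.pending.popleft()
            self.events.append(("SPEAK-COMPLETE", cancelled_id, "cancelled"))
        self.active = None


queue = SynthesizerQueue()
states = [queue.speak(543257), queue.speak(543258), queue.speak(543259)]
queue.fail_active()
```

   After the failure, the model holds one SPEAK-COMPLETE event for the
   failed request and one with cause "cancelled" for each request that
   was still PENDING.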

8.7.  STOP

   The STOP method from the client to the server tells the synthesizer
   resource to stop speaking if it is speaking something.

   The STOP request can be sent with an Active-Request-Id-List header
   field to stop zero or more specific SPEAK requests that may be in the
   queue; the server returns a response status-code of 200 "Success".  If
   no Active-Request-Id-List header field is sent in the STOP request,
   the server terminates all outstanding SPEAK requests.

   If a STOP request successfully terminated one or more PENDING or
   IN-PROGRESS SPEAK requests, then the response MUST contain an Active-
   Request-Id-List header field enumerating the SPEAK request-ids that
   were terminated.  Otherwise, there is no Active-Request-Id-List
   header field in the response.  No SPEAK-COMPLETE events are sent for
   such terminated requests.

   If a SPEAK request that was IN-PROGRESS and speaking was stopped, the
   next pending SPEAK request, if any, becomes IN-PROGRESS at the
   resource and enters the speaking state.

   If a SPEAK request that was IN-PROGRESS and paused was stopped, the
   next pending SPEAK request, if any, becomes IN-PROGRESS and enters
   the paused state.

   C->S: MRCP/2.0 ... SPEAK 543258
         Channel-Identifier:32AECB23433802@speechsynth
         Content-Type:application/ssml+xml
         Content-Length:...

         <?xml version="1.0"?>
           <speak version="1.0"
                xmlns="http://www.w3.org/2001/10/synthesis"
                xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
                xml:lang="en-US">
            <p>
             <s>You have 4 new messages.</s>
             <s>The first is from Stephanie Williams and arrived at
                <break/>
                <say-as interpret-as="vxml:time">0342p</say-as>.</s>
             <s>The subject is
                 <prosody rate="-20%">ski trip</prosody></s>
            </p>
           </speak>

   S->C: MRCP/2.0 ... 543258 200 IN-PROGRESS
         Channel-Identifier:32AECB23433802@speechsynth
         Speech-Marker:timestamp=857206027059

   C->S: MRCP/2.0 ... STOP 543259
         Channel-Identifier:32AECB23433802@speechsynth

   S->C: MRCP/2.0 ... 543259 200 COMPLETE
         Channel-Identifier:32AECB23433802@speechsynth
         Active-Request-Id-List:543258
         Speech-Marker:timestamp=857206039059

                               STOP Example
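
   The STOP rules above can be sketched in Python as follows.  The
   function name, its arguments, and the state labels are hypothetical;
   the sketch assumes request-ids are integers and that the resource
   tracks one active SPEAK plus an ordered list of pending ones.

```python
def handle_stop(active, pending, paused, id_list=None):
    """Apply a STOP request to a synthesizer's SPEAK requests.

    active  -- request-id of the IN-PROGRESS SPEAK, or None
    pending -- PENDING request-ids, oldest first
    paused  -- True if the resource is in the paused state
    id_list -- ids from Active-Request-Id-List, or None if the header
               field was absent (terminate everything outstanding)

    Returns (response_headers, new_active, new_pending, new_state).
    No SPEAK-COMPLETE events are produced for terminated requests.
    """
    outstanding = ([active] if active is not None else []) + list(pending)
    if id_list is None:
        terminated = outstanding
    else:
        terminated = [r for r in outstanding if r in id_list]

    remaining = [r for r in pending if r not in terminated]
    new_active = active
    if active is not None and active in terminated:
        # The next pending SPEAK, if any, becomes IN-PROGRESS.
        new_active = remaining.pop(0) if remaining else None

    if new_active is None:
        state = "idle"
    else:
        # The promoted request inherits the speaking or paused state.
        state = "paused" if paused else "speaking"

    headers = {}
    if terminated:
        headers["Active-Request-Id-List"] = ",".join(str(r) for r in terminated)
    return headers, new_active, remaining, state


# Mirrors the STOP Example: one IN-PROGRESS SPEAK, nothing queued.
example = handle_stop(543258, [], paused=False)
```

   When nothing was terminated, the returned header dictionary is
   empty, which matches the rule that the response then carries no
   Active-Request-Id-List header field.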

8.8.  BARGE-IN-OCCURRED

   The BARGE-IN-OCCURRED method, when used with the synthesizer
   resource, provides a client that has detected a barge-in-able event a
   means to communicate the occurrence of the event to the synthesizer
   resource.

   This method is useful in two scenarios:

   1.  The client has detected DTMF digits in the input media or some
       other barge-in-able event and wants to communicate that to the
       synthesizer resource.

   2.  The recognizer resource and the synthesizer resource are in
       different servers.  In this case, the client acts as an
       intermediary for the two servers.  It receives an event from the
       recognition resource and sends a BARGE-IN-OCCURRED request to the
       synthesizer.  In such cases, the BARGE-IN-OCCURRED method would
       also have a Proxy-Sync-Id header field received from the resource
       generating the original event.

   If a SPEAK request is active with kill-on-barge-in enabled (see
   Section 8.4.2) and a BARGE-IN-OCCURRED request is received, the
   synthesizer MUST immediately stop streaming out audio.  It MUST also
   terminate any speech requests queued behind the current active one,
   irrespective of whether or not they have barge-in enabled.  If a
   barge-in-able SPEAK request was playing and it was terminated, the
   response MUST contain an Active-Request-Id-List header field listing
   the request-ids of all SPEAK requests that were terminated.  The
   server generates no SPEAK-COMPLETE events for these requests.

   If there were no SPEAK requests terminated by the synthesizer
   resource as a result of the BARGE-IN-OCCURRED method, the server MUST
   respond to the BARGE-IN-OCCURRED with a status-code of 200 "Success",
   and the response MUST NOT contain an Active-Request-Id-List header
   field.

   If the synthesizer and recognizer resources are part of the same
   MRCPv2 session, they can be optimized for a quicker kill-on-barge-in
   response if the recognizer and synthesizer interact directly.  In
   these cases, the client MUST still react to a START-OF-INPUT event
   from the recognizer by invoking the BARGE-IN-OCCURRED method to the
   synthesizer.  The client MUST invoke the BARGE-IN-OCCURRED if it has
   any outstanding requests to the synthesizer resource in either the
   PENDING or IN-PROGRESS state.

   C->S: MRCP/2.0 ... SPEAK 543258
         Channel-Identifier:32AECB23433802@speechsynth
         Voice-gender:neutral
         Voice-Age:25
         Prosody-volume:medium
         Content-Type:application/ssml+xml
         Content-Length:...

         <?xml version="1.0"?>
           <speak version="1.0"
                xmlns="http://www.w3.org/2001/10/synthesis"
                xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
                xml:lang="en-US">
            <p>
             <s>You have 4 new messages.</s>
             <s>The first is from Stephanie Williams and arrived at
                <break/>
                <say-as interpret-as="vxml:time">0342p</say-as>.</s>
             <s>The subject is
                <prosody rate="-20%">ski trip</prosody></s>
            </p>
           </speak>

   S->C: MRCP/2.0 ... 543258 200 IN-PROGRESS
         Channel-Identifier:32AECB23433802@speechsynth
         Speech-Marker:timestamp=857206027059

   C->S: MRCP/2.0 ... BARGE-IN-OCCURRED 543259
         Channel-Identifier:32AECB23433802@speechsynth
         Proxy-Sync-Id:987654321

   S->C: MRCP/2.0 ... 543259 200 COMPLETE
         Channel-Identifier:32AECB23433802@speechsynth
         Active-Request-Id-List:543258
         Speech-Marker:timestamp=857206039059

                         BARGE-IN-OCCURRED Example
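
   A minimal Python sketch of the server-side handling described above
   follows.  The function name and arguments are hypothetical.  The
   text does not say what happens to queued requests when the active
   SPEAK is not barge-in-able; the sketch assumes they are left
   untouched in that case.

```python
def handle_barge_in(active, pending, kill_on_barge_in):
    """Apply BARGE-IN-OCCURRED to a synthesizer resource.

    active           -- request-id of the IN-PROGRESS SPEAK, or None
    pending          -- PENDING request-ids, oldest first
    kill_on_barge_in -- whether the active SPEAK has kill-on-barge-in
                        enabled

    Returns (status_code, response_headers, new_active, new_pending).
    """
    if active is None or not kill_on_barge_in:
        # Nothing terminated: 200 "Success" without the header field.
        return 200, {}, active, list(pending)

    # The active SPEAK and everything queued behind it are terminated,
    # whether or not the queued requests have barge-in enabled.
    terminated = [active] + list(pending)
    headers = {"Active-Request-Id-List": ",".join(str(r) for r in terminated)}
    return 200, headers, None, []


# Mirrors the BARGE-IN-OCCURRED Example: one active SPEAK, empty queue.
example = handle_barge_in(543258, [], kill_on_barge_in=True)
```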

8.9.  PAUSE

   The PAUSE method from the client to the server tells the synthesizer
   resource to pause speech output if it is speaking something.  If a
   PAUSE method is issued on a session when a SPEAK is not active, the
   server MUST respond with a status-code of 402 "Method not valid in
   this state".  If a PAUSE method is issued on a session when a SPEAK
   is active and paused, the server MUST respond with a status-code of
   200 "Success".  If a SPEAK request was active, the server MUST return
   an Active-Request-Id-List header field whose value contains the
   request-id of the SPEAK request that was paused.

   C->S: MRCP/2.0 ... SPEAK 543258
         Channel-Identifier:32AECB23433802@speechsynth
         Voice-gender:neutral
         Voice-Age:25
         Prosody-volume:medium
         Content-Type:application/ssml+xml
         Content-Length:...

         <?xml version="1.0"?>
           <speak version="1.0"
                xmlns="http://www.w3.org/2001/10/synthesis"
                xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
                xml:lang="en-US">
            <p>
             <s>You have 4 new messages.</s>
             <s>The first is from Stephanie Williams and arrived at
                <break/>
                <say-as interpret-as="vxml:time">0342p</say-as>.</s>

             <s>The subject is
                <prosody rate="-20%">ski trip</prosody></s>
            </p>
           </speak>

   S->C: MRCP/2.0 ... 543258 200 IN-PROGRESS
         Channel-Identifier:32AECB23433802@speechsynth
         Speech-Marker:timestamp=857206027059

   C->S: MRCP/2.0 ... PAUSE 543259
         Channel-Identifier:32AECB23433802@speechsynth

   S->C: MRCP/2.0 ... 543259 200 COMPLETE
         Channel-Identifier:32AECB23433802@speechsynth
         Active-Request-Id-List:543258

                               PAUSE Example

8.10.  RESUME

   The RESUME method from the client to the server tells a paused
   synthesizer resource to resume speaking.  If a RESUME request is
   issued on a session with no active SPEAK request, the server MUST
   respond with a status-code of 402 "Method not valid in this state".
   If a RESUME request is issued on a session with an active SPEAK
   request that is speaking (i.e., not paused), the server MUST respond
   with a status-code of 200 "Success".  If a SPEAK request was paused,
   the server MUST return an Active-Request-Id-List header field whose
   value contains the request-id of the SPEAK request that was resumed.

   C->S: MRCP/2.0 ... SPEAK 543258
         Channel-Identifier:32AECB23433802@speechsynth
         Voice-gender:neutral
         Voice-age:25
         Prosody-volume:medium
         Content-Type:application/ssml+xml
         Content-Length:...

         <?xml version="1.0"?>
           <speak version="1.0"
                xmlns="http://www.w3.org/2001/10/synthesis"
                xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
                xml:lang="en-US">
            <p>
             <s>You have 4 new messages.</s>
             <s>The first is from Stephanie Williams and arrived at
                <break/>
                <say-as interpret-as="vxml:time">0342p</say-as>.</s>
             <s>The subject is
                <prosody rate="-20%">ski trip</prosody></s>
            </p>
           </speak>

   S->C: MRCP/2.0 ... 543258 200 IN-PROGRESS
         Channel-Identifier:32AECB23433802@speechsynth
         Speech-Marker:timestamp=857206027059

   C->S: MRCP/2.0 ... PAUSE 543259
         Channel-Identifier:32AECB23433802@speechsynth

   S->C: MRCP/2.0 ... 543259 200 COMPLETE
         Channel-Identifier:32AECB23433802@speechsynth
         Active-Request-Id-List:543258

   C->S: MRCP/2.0 ... RESUME 543260
         Channel-Identifier:32AECB23433802@speechsynth

   S->C: MRCP/2.0 ... 543260 200 COMPLETE
         Channel-Identifier:32AECB23433802@speechsynth
         Active-Request-Id-List:543258

                              RESUME Example
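
   The PAUSE rules of Section 8.9 and the RESUME rules of this section
   can be sketched together in Python.  The function name and the
   state labels "idle", "speaking", and "paused" are hypothetical.  For
   a RESUME received while already speaking, the text requires only a
   200 "Success" status; the sketch assumes the header field is omitted
   in that case.

```python
def handle_pause_resume(method, state, active_id):
    """Return (status_code, response_headers, new_state).

    method    -- "PAUSE" or "RESUME"
    state     -- "idle", "speaking", or "paused"
    active_id -- request-id of the active SPEAK (ignored when idle)
    """
    if method not in ("PAUSE", "RESUME"):
        raise ValueError("unsupported method: " + method)
    if state == "idle":
        # No active SPEAK: 402 "Method not valid in this state".
        return 402, {}, "idle"

    header = {"Active-Request-Id-List": str(active_id)}
    if method == "PAUSE":
        # Pausing an already paused SPEAK still succeeds.
        return 200, header, "paused"
    if state == "paused":
        return 200, header, "speaking"
    # RESUME while already speaking: success, nothing was resumed.
    return 200, {}, "speaking"


# Replays the RESUME Example: PAUSE, then RESUME, of SPEAK 543258.
pause_response = handle_pause_resume("PAUSE", "speaking", 543258)
resume_response = handle_pause_resume("RESUME", pause_response[2], 543258)
```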

8.11.  CONTROL

   The CONTROL method from the client to the server tells a synthesizer
   that is speaking to modify what it is speaking on the fly.  This
   method is used to request the synthesizer to jump forward or backward
   in what it is speaking, change speaker rate, speaker parameters, etc.
   It affects only the currently IN-PROGRESS SPEAK request.  Depending
   on the implementation and capability of the synthesizer resource, it
   may or may not support the various modifications indicated by header
   fields in the CONTROL request.

   When a client invokes a CONTROL method to jump forward and the
   operation goes beyond the end of the active SPEAK method's text, the
   CONTROL request still succeeds.  The active SPEAK request completes
   and returns a SPEAK-COMPLETE event following the response to the
   CONTROL method.  If there are more SPEAK requests in the queue, the
   synthesizer resource starts at the beginning of the next SPEAK
   request in the queue.

   When a client invokes a CONTROL method to jump backward and the
   operation jumps to the beginning or beyond the beginning of the
   speech data of the active SPEAK method, the CONTROL request still
   succeeds.  The response to the CONTROL request contains the speak-
   restart header field, and the active SPEAK request restarts from the
   beginning of its speech data.

   These two behaviors can be used to rewind or fast-forward across
   multiple speech requests, if the client wants to break up a speech
   markup text into multiple SPEAK requests.

   If a SPEAK request was active when the CONTROL method was received,
   the server MUST return an Active-Request-Id-List header field
   containing the request-id of the SPEAK request that was active.

   C->S: MRCP/2.0 ... SPEAK 543258
         Channel-Identifier:32AECB23433802@speechsynth
         Voice-gender:neutral
         Voice-age:25
         Prosody-volume:medium
         Content-Type:application/ssml+xml
         Content-Length:...

         <?xml version="1.0"?>
           <speak version="1.0"
                xmlns="http://www.w3.org/2001/10/synthesis"
                xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
                xml:lang="en-US">
            <p>
             <s>You have 4 new messages.</s>
             <s>The first is from Stephanie Williams
                and arrived at <break/>
                <say-as interpret-as="vxml:time">0342p</say-as>.</s>

             <s>The subject is <prosody
                rate="-20%">ski trip</prosody></s>
            </p>
           </speak>

   S->C: MRCP/2.0 ... 543258 200 IN-PROGRESS
         Channel-Identifier:32AECB23433802@speechsynth
         Speech-Marker:timestamp=857205016059

   C->S: MRCP/2.0 ... CONTROL 543259
         Channel-Identifier:32AECB23433802@speechsynth
         Prosody-rate:fast

   S->C: MRCP/2.0 ... 543259 200 COMPLETE
         Channel-Identifier:32AECB23433802@speechsynth
         Active-Request-Id-List:543258
         Speech-Marker:timestamp=857206027059

   C->S: MRCP/2.0 ... CONTROL 543260
         Channel-Identifier:32AECB23433802@speechsynth
         Jump-Size:-15 Words

   S->C: MRCP/2.0 ... 543260 200 COMPLETE
         Channel-Identifier:32AECB23433802@speechsynth
         Active-Request-Id-List:543258
         Speech-Marker:timestamp=857206039059

                              CONTROL Example
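
   The jump behavior described in this section can be sketched in
   Python.  The helper names are hypothetical, positions are counted in
   the same unit as the jump (words, for instance), and the sketch
   assumes that landing exactly on the end of the text is treated like
   jumping past it.  The "true" value shown for the speak-restart
   header field is likewise an assumption of the sketch.

```python
def parse_jump_size(value):
    """Split a value such as "-15 Words" into (amount, unit)."""
    amount, unit = value.split()
    return int(amount), unit


def apply_jump(position, length, jump):
    """Apply a relative jump to the active SPEAK request.

    position -- current offset into the speech data
    length   -- total length of the speech data
    jump     -- signed offset; negative values jump backward

    Returns (outcome, new_position, response_headers).  The CONTROL
    request succeeds in all three outcomes.
    """
    target = position + jump
    if target >= length:
        # Past the end: the SPEAK completes, and SPEAK-COMPLETE follows
        # the CONTROL response.
        return "complete", length, {}
    if target <= 0:
        # At or before the beginning: restart the speech data.
        return "restart", 0, {"Speak-Restart": "true"}
    return "moved", target, {}


amount, unit = parse_jump_size("-15 Words")
outcome = apply_jump(position=10, length=50, jump=amount)
```

   Because both out-of-range cases still succeed, a client that splits
   one markup text across several SPEAK requests can rewind or
   fast-forward across request boundaries by repeating the jump.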

8.12.  SPEAK-COMPLETE

   This is an Event message from the synthesizer resource to the client
   that indicates the corresponding SPEAK request was completed.  The
   request-id field matches the request-id of the SPEAK request that
   initiated the speech that just completed.  The request-state field is
   set to COMPLETE by the server, indicating that this is the last event
   with the corresponding request-id.  The Completion-Cause header field
   specifies the cause code pertaining to the status and reason of
   request completion, such as the SPEAK completed normally or because
   of an error, kill-on-barge-in, etc.

   C->S: MRCP/2.0 ... SPEAK 543260
         Channel-Identifier:32AECB23433802@speechsynth
         Voice-gender:neutral
         Voice-age:25
         Prosody-volume:medium
         Content-Type:application/ssml+xml
         Content-Length:...

         <?xml version="1.0"?>
           <speak version="1.0"
                xmlns="http://www.w3.org/2001/10/synthesis"
                xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
                xml:lang="en-US">
            <p>
             <s>You have 4 new messages.</s>
             <s>The first is from Stephanie Williams
                and arrived at <break/>
                <say-as interpret-as="vxml:time">0342p</say-as>.</s>
             <s>The subject is
                <prosody rate="-20%">ski trip</prosody></s>
            </p>
           </speak>

   S->C: MRCP/2.0 ... 543260 200 IN-PROGRESS
         Channel-Identifier:32AECB23433802@speechsynth
         Speech-Marker:timestamp=857206027059

   S->C: MRCP/2.0 ... SPEAK-COMPLETE 543260 COMPLETE
         Channel-Identifier:32AECB23433802@speechsynth
         Completion-Cause:000 normal
         Speech-Marker:timestamp=857206039059

                          SPEAK-COMPLETE Example
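
   A client might pick apart a SPEAK-COMPLETE event along the following
   lines.  This Python sketch is illustrative only: the function name
   is hypothetical, the message-length token on the start-line is
   treated as opaque (it is elided as "..." in the examples here), and
   header field names are lowercased for lookup.

```python
def parse_event(message):
    """Parse an MRCPv2 event message into its start-line and headers."""
    lines = message.strip().splitlines()
    version, _length, event_name, request_id, request_state = lines[0].split()

    headers = {}
    for line in lines[1:]:
        if not line.strip():
            break  # a blank line ends the header section
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()

    parsed = {
        "version": version,
        "event": event_name,
        "request_id": int(request_id),
        "request_state": request_state,
        "headers": headers,
    }
    cause = headers.get("completion-cause")
    if cause is not None:
        # Completion-Cause carries a numeric code followed by a name.
        code, cause_name = cause.split(None, 1)
        parsed["cause_code"] = code
        parsed["cause_name"] = cause_name
    return parsed


event = parse_event("\n".join([
    "MRCP/2.0 ... SPEAK-COMPLETE 543260 COMPLETE",
    "Channel-Identifier:32AECB23433802@speechsynth",
    "Completion-Cause:000 normal",
    "Speech-Marker:timestamp=857206039059",
]))
```

   The request_id of the parsed event is what the client matches
   against the request-id of its outstanding SPEAK request.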

8.13.  SPEECH-MARKER

   This is an event generated by the synthesizer resource to the client
   when the synthesizer encounters a marker tag in the speech markup it
   is currently processing.  The value of the request-id field MUST
   match that of the corresponding SPEAK request.  The request-state
   field MUST have the value "IN-PROGRESS" as the speech is still not
   complete.  The value of the speech marker tag hit, describing where
   the synthesizer is in the speech markup, MUST be returned in the
   Speech-Marker header field, along with an NTP timestamp indicating
   the instant in the output speech stream that the marker was
   encountered.  The SPEECH-MARKER event MUST also be generated with a
   null marker value and output NTP timestamp when a SPEAK request in
   the PENDING state (i.e., in the queue) becomes IN-PROGRESS and starts
   speaking.  The NTP timestamp MUST be synchronized with the RTP
   timestamp used to generate the speech stream through standard RTCP
   machinery.

   C->S: MRCP/2.0 ... SPEAK 543261
         Channel-Identifier:32AECB23433802@speechsynth
         Voice-gender:neutral
         Voice-age:25
         Prosody-volume:medium
         Content-Type:application/ssml+xml
         Content-Length:...

         <?xml version="1.0"?>
           <speak version="1.0"
                xmlns="http://www.w3.org/2001/10/synthesis"
                xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
                xml:lang="en-US">
            <p>
             <s>You have 4 new messages.</s>
             <s>The first is from Stephanie Williams
                and arrived at <break/>

                <say-as interpret-as="vxml:time">0342p</say-as>.</s>
                <mark name="here"/>
             <s>The subject is
                <prosody rate="-20%">ski trip</prosody>
             </s>
             <mark name="ANSWER"/>
            </p>
           </speak>

   S->C: MRCP/2.0 ... 543261 200 IN-PROGRESS
         Channel-Identifier:32AECB23433802@speechsynth
         Speech-Marker:timestamp=857205015059

   S->C: MRCP/2.0 ... SPEECH-MARKER 543261 IN-PROGRESS
         Channel-Identifier:32AECB23433802@speechsynth
         Speech-Marker:timestamp=857206027059;here

   S->C: MRCP/2.0 ... SPEECH-MARKER 543261 IN-PROGRESS
         Channel-Identifier:32AECB23433802@speechsynth
         Speech-Marker:timestamp=857206039059;ANSWER

   S->C: MRCP/2.0 ... SPEAK-COMPLETE 543261 COMPLETE
         Channel-Identifier:32AECB23433802@speechsynth
         Completion-Cause:000 normal
         Speech-Marker:timestamp=857207689259;ANSWER

                           SPEECH-MARKER Example
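
   The Speech-Marker values shown in the example can be split into
   their timestamp and marker name with a short Python helper.  The
   function name is hypothetical, and the sketch assumes the value has
   the "timestamp=<digits>[;<marker>]" shape seen in the examples,
   returning None when the marker is null.

```python
def parse_speech_marker(value):
    """Return (ntp_timestamp, marker_name_or_None) from a header value."""
    field, _, marker = value.partition(";")
    key, _, timestamp = field.partition("=")
    timestamp = timestamp.strip()
    if key.strip().lower() != "timestamp" or not timestamp.isdigit():
        raise ValueError("malformed Speech-Marker value: " + value)
    # An absent or empty marker is reported as None (the null marker).
    return int(timestamp), (marker or None)


first = parse_speech_marker("timestamp=857205015059")
second = parse_speech_marker("timestamp=857206027059;here")
```

   In the example above, the response to the SPEAK request carries the
   null marker, and each SPEECH-MARKER event carries the name of the
   mark element that was reached.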

8.14.  DEFINE-LEXICON

   The DEFINE-LEXICON method, from the client to the server, provides a
   lexicon and tells the server to load or unload the lexicon (see
   Section 8.4.16).  The media type of the lexicon is provided in the
   Content-Type header (see Section 8.5.2).  One such media type is
   "application/pls+xml" for the Pronunciation Lexicon Specification
   (PLS) [W3C.REC-pronunciation-lexicon-20081014] [RFC4267].

   If the server resource is in the speaking or paused state, the server
   MUST respond with a failure status-code of 402 "Method not valid in
   this state".

   If the resource is in the idle state and is able to successfully
   load/unload the lexicon, the server MUST respond with a 200 "Success"
   status-code and a request-state of COMPLETE.

   If the synthesizer could not define the lexicon for some reason, for
   example, because the download failed or the lexicon was in an
   unsupported form, the server MUST respond with a failure status-code
   of 407 and a Completion-Cause header field describing the failure
   reason.
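
   The three outcomes above reduce to a small decision, sketched here
   in Python.  The function name, the state labels, and the
   load_lexicon callable are hypothetical; the sketch assumes that any
   exception raised while loading or unloading counts as a failure.

```python
def define_lexicon_status(state, load_lexicon):
    """Return the status-code for a DEFINE-LEXICON request.

    state        -- "idle", "speaking", or "paused"
    load_lexicon -- zero-argument callable that loads or unloads the
                    lexicon and raises an exception on failure
    """
    if state in ("speaking", "paused"):
        return 402  # "Method not valid in this state"
    try:
        load_lexicon()
    except Exception:
        # Download failure, unsupported form, and so on; the response
        # would also carry a Completion-Cause header field.
        return 407
    return 200  # "Success", with a request-state of COMPLETE


def failing_loader():
    raise IOError("download failed")


status_ok = define_lexicon_status("idle", lambda: None)
status_busy = define_lexicon_status("speaking", lambda: None)
status_failed = define_lexicon_status("idle", failing_loader)
```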
