7. Resource Discovery
Server resources may be discovered and their capabilities learned by
clients through standard SIP machinery. The client MAY issue a SIP
OPTIONS transaction to a server, which has the effect of requesting
the capabilities of the server. The server MUST respond to such a
request with an SDP-encoded description of its capabilities according
to RFC 3264 [RFC3264]. The MRCPv2 capabilities are described by a
single "m=" line containing the media type "application" and
transport type "TCP/TLS/MRCPv2" or "TCP/MRCPv2". There MUST be one
"resource" attribute for each media resource that the server
supports, and it has the resource type identifier as its value.
The SDP description MUST also contain "m=" lines describing the audio
capabilities and the coders the server supports.
In this example, the client uses the SIP OPTIONS method to query the
capabilities of the MRCPv2 server.
C->S:
OPTIONS sip:mrcp@server.example.com SIP/2.0
Via:SIP/2.0/TCP client.atlanta.example.com:5060;
branch=z9hG4bK74bf7
Max-Forwards:6
To:<sip:mrcp@example.com>
From:Sarvi <sip:sarvi@example.com>;tag=1928301774
Call-ID:a84b4c76e66710
CSeq:63104 OPTIONS
Contact:<sip:sarvi@client.example.com>
Accept:application/sdp
Content-Length:0
S->C:
SIP/2.0 200 OK
Via:SIP/2.0/TCP client.atlanta.example.com:5060;
branch=z9hG4bK74bf7;received=192.0.32.10
To:<sip:mrcp@example.com>;tag=62784
From:Sarvi <sip:sarvi@example.com>;tag=1928301774
Call-ID:a84b4c76e66710
CSeq:63104 OPTIONS
Contact:<sip:mrcp@server.example.com>
Allow:INVITE, ACK, CANCEL, OPTIONS, BYE
Accept:application/sdp
Accept-Encoding:gzip
Accept-Language:en
Supported:foo
Content-Type:application/sdp
Content-Length:...
v=0
o=sarvi 2890844536 2890842811 IN IP4 192.0.2.12
s=-
i=MRCPv2 server capabilities
c=IN IP4 192.0.2.12/127
t=0 0
m=application 0 TCP/TLS/MRCPv2 1
a=resource:speechsynth
a=resource:speechrecog
a=resource:speakverify
m=audio 0 RTP/AVP 0 3
a=rtpmap:0 PCMU/8000
a=rtpmap:3 GSM/8000
Using SIP OPTIONS for MRCPv2 Server Capability Discovery
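As a non-normative illustration, a client might extract the advertised
resources and coders from the OPTIONS response body along the following
lines. This is a minimal Python sketch: the function name
parse_mrcp_capabilities and the shape of the returned dictionary are
assumptions made for this example and are not defined by MRCPv2 or SDP.

```python
def parse_mrcp_capabilities(sdp):
    """Collect MRCPv2 resources and audio coders from an SDP body.

    Hypothetical helper for illustration; not part of any MRCPv2 API.
    """
    caps = {"transport": None, "resources": [], "codecs": {}}
    section = None  # media type of the "m=" block currently being read
    for raw in sdp.splitlines():
        line = raw.strip()
        if line.startswith("m="):
            fields = line[2:].split()
            section = fields[0] if fields else None
            # m=<media> <port> <proto> <fmt>; the third field is the transport
            if section == "application" and len(fields) >= 3:
                caps["transport"] = fields[2]
        elif section == "application" and line.startswith("a=resource:"):
            caps["resources"].append(line[len("a=resource:"):])
        elif section == "audio" and line.startswith("a=rtpmap:"):
            payload, encoding = line[len("a=rtpmap:"):].split(None, 1)
            caps["codecs"][int(payload)] = encoding
    return caps


# SDP body copied from the OPTIONS response in the example above.
EXAMPLE_SDP = """\
v=0
o=sarvi 2890844536 2890842811 IN IP4 192.0.2.12
s=-
i=MRCPv2 server capabilities
c=IN IP4 192.0.2.12/127
t=0 0
m=application 0 TCP/TLS/MRCPv2 1
a=resource:speechsynth
a=resource:speechrecog
a=resource:speakverify
m=audio 0 RTP/AVP 0 3
a=rtpmap:0 PCMU/8000
a=rtpmap:3 GSM/8000
"""

capabilities = parse_mrcp_capabilities(EXAMPLE_SDP)
```

The sketch keeps only the attributes discussed in this section and
ignores every other SDP line.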
8. Speech Synthesizer Resource
This resource processes text markup provided by the client and
generates a stream of synthesized speech in real time. Depending
upon the server implementation and capability of this resource, the
client can also dictate parameters of the synthesized speech such as
voice characteristics, speaker speed, etc.
The synthesizer resource is controlled by MRCPv2 requests from the
client. Similarly, the resource can respond to these requests or
generate asynchronous events to the client to indicate conditions of
interest to the client during the generation of the synthesized
speech stream.
This section applies to the following resource types:
o speechsynth
o basicsynth
The capabilities of these resources are defined in Section 3.1.
8.1. Synthesizer State Machine
The synthesizer maintains a state machine to process MRCPv2 requests
from the client. The state transitions shown below describe the
states of the synthesizer and reflect the state of the request at the
head of the synthesizer resource queue. A SPEAK request in the
PENDING state can be deleted or stopped by a STOP request without
affecting the state of the resource.
Idle                  Speaking                    Paused
State                  State                      State
|                        |                          |
|----------SPEAK-------->|                 |--------|
|<------STOP-------------|             CONTROL      |
|<----SPEAK-COMPLETE-----|                 |------->|
|<----BARGE-IN-OCCURRED--|                          |
|              |---------|                          |
|          CONTROL       |-----------PAUSE--------->|
|              |-------->|<----------RESUME---------|
|                        |               |----------|
|----------|             |              PAUSE       |
|    BARGE-IN-OCCURRED   |               |--------->|
|<---------|             |----------|               |
|                        |      SPEECH-MARKER       |
|                        |<---------|               |
|----------|             |----------|               |
|         STOP           |       RESUME             |
|          |             |<---------|               |
|<---------|             |                          |
|<---------------------STOP-------------------------|
|----------|             |                          |
|     DEFINE-LEXICON     |                          |
|          |             |                          |
|<---------|             |                          |
|<---------------BARGE-IN-OCCURRED------------------|
Synthesizer State Machine
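The transitions in the diagram can be captured as a lookup table. The
following Python sketch is illustrative only: the names SynthState,
TRANSITIONS, and next_state are assumptions of this example, and the
table deliberately ignores the request queue (for instance, it treats
STOP in the speaking state as always returning to idle, which holds
only when no further SPEAK request is pending).

```python
from enum import Enum


class SynthState(Enum):
    IDLE = "Idle"
    SPEAKING = "Speaking"
    PAUSED = "Paused"


I, S, P = SynthState.IDLE, SynthState.SPEAKING, SynthState.PAUSED

# (current state, method or event) -> next state, read off the diagram.
TRANSITIONS = {
    (I, "SPEAK"): S,
    (I, "STOP"): I,
    (I, "BARGE-IN-OCCURRED"): I,
    (I, "DEFINE-LEXICON"): I,
    (S, "STOP"): I,
    (S, "SPEAK-COMPLETE"): I,
    (S, "BARGE-IN-OCCURRED"): I,
    (S, "CONTROL"): S,
    (S, "SPEECH-MARKER"): S,
    (S, "RESUME"): S,
    (S, "PAUSE"): P,
    (P, "CONTROL"): P,
    (P, "PAUSE"): P,
    (P, "RESUME"): S,
    (P, "STOP"): I,
    (P, "BARGE-IN-OCCURRED"): I,
}


def next_state(state, message):
    """Return the state reached from `state` on `message`.

    Raises ValueError for a pair the diagram does not show.
    """
    try:
        return TRANSITIONS[(state, message)]
    except KeyError:
        raise ValueError(f"{message} is not shown for state {state.value}") from None
```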
8.2. Synthesizer Methods
The synthesizer supports the following methods.
synthesizer-method = "SPEAK"
/ "STOP"
/ "PAUSE"
/ "RESUME"
/ "BARGE-IN-OCCURRED"
/ "CONTROL"
/ "DEFINE-LEXICON"
8.3. Synthesizer Events
The synthesizer can generate the following events.
synthesizer-event = "SPEECH-MARKER"
/ "SPEAK-COMPLETE"
8.4. Synthesizer Header Fields
A synthesizer method can contain header fields containing request
options and information to augment the Request, Response, or Event it
is associated with.
synthesizer-header = jump-size
/ kill-on-barge-in
/ speaker-profile
/ completion-cause
/ completion-reason
/ voice-parameter
/ prosody-parameter
/ speech-marker
/ speech-language
/ fetch-hint
/ audio-fetch-hint
/ failed-uri
/ failed-uri-cause
/ speak-restart
/ speak-length
/ load-lexicon
/ lexicon-search-order
8.4.1. Jump-Size
This header field MAY be specified in a CONTROL method and controls
the amount to jump forward or backward in an active SPEAK request. A
leading '+' or '-' indicates an offset relative to the point currently
being played. This header field MAY also be specified in a SPEAK request
as a desired offset into the synthesized speech. In this case, the
synthesizer MUST begin speaking from this amount of time into the
speech markup. Note that an offset that extends beyond the end of
the produced speech will result in audio of length zero. The
different speech length units supported are dependent on the
synthesizer implementation. If the synthesizer resource does not
support a unit for the operation, the resource MUST respond with a
status-code of 409 "Unsupported Header Field Value".
jump-size = "Jump-Size" ":" speech-length-value CRLF
speech-length-value = numeric-speech-length
/ text-speech-length
text-speech-length = 1*UTFCHAR SP "Tag"
numeric-speech-length = ("+" / "-") positive-speech-length
positive-speech-length = 1*19DIGIT SP numeric-speech-unit
numeric-speech-unit = "Second"
/ "Word"
/ "Sentence"
/ "Paragraph"
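A receiver might validate a Jump-Size value against the grammar above
as sketched below in Python. The function name parse_jump_size and the
tuple it returns are assumptions of this example. As a simplification,
the sketch approximates UTFCHAR with "any non-whitespace character" and
matches the unit names case-sensitively.

```python
import re

# numeric-speech-length: sign, 1 to 19 digits, one space, a unit name
_NUMERIC = re.compile(r"([+-])(\d{1,19}) (Second|Word|Sentence|Paragraph)")
# text-speech-length: a marker name, one space, the literal "Tag"
_TAG = re.compile(r"(\S+) Tag")


def parse_jump_size(value):
    """Return ("numeric", signed_amount, unit) or ("tag", name, "Tag").

    Hypothetical helper; raises ValueError when the value fits neither form.
    """
    match = _NUMERIC.fullmatch(value)
    if match:
        sign, digits, unit = match.groups()
        amount = int(digits)
        return ("numeric", -amount if sign == "-" else amount, unit)
    match = _TAG.fullmatch(value)
    if match:
        return ("tag", match.group(1), "Tag")
    raise ValueError(f"not a valid Jump-Size value: {value!r}")
```

For example, parse_jump_size("-15 Second") describes a jump of 15
seconds backward, while a value without a sign, such as "15 Second",
is rejected because the numeric form requires '+' or '-'.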
8.4.2. Kill-On-Barge-In
This header field MAY be sent as part of the SPEAK method to enable
"kill-on-barge-in" support. If enabled, the SPEAK method is
interrupted by DTMF input detected by a signal detector resource or
by the start of speech sensed or recognized by the speech recognizer
resource.
kill-on-barge-in = "Kill-On-Barge-In" ":" BOOLEAN CRLF
The client MUST send a BARGE-IN-OCCURRED method to the synthesizer
resource when it receives a barge-in-able event from any source.
This source could be a recognizer resource or signal detector
resource and MAY be either local or distributed. If this header
field is not specified in a SPEAK request or explicitly set by a
SET-PARAMS, the default value for this header field is "true".
If the recognizer or signal detector resource is on the same server
as the synthesizer and both are part of the same session, the server
MAY work with both to provide internal notification to the
synthesizer so that audio may be stopped without having to wait for
the client's BARGE-IN-OCCURRED request.
When playing a prompt to the user with Kill-On-Barge-In enabled and
asking for input, it is generally RECOMMENDED that the client issue the
RECOGNIZE request ahead of the SPEAK request for optimum performance
and user experience. This way, it is guaranteed that the recognizer
is online before the prompt starts playing and the user's speech will
not be truncated at the beginning (especially for power users).
8.4.3. Speaker-Profile
This header field MAY be part of the SET-PARAMS/GET-PARAMS or SPEAK
request from the client to the server and specifies a URI that
references the profile of the speaker. Speaker profiles are
collections of voice parameters like gender, accent, etc.
speaker-profile = "Speaker-Profile" ":" uri CRLF
8.4.4. Completion-Cause
This header field MUST be specified in a SPEAK-COMPLETE event coming
from the synthesizer resource to the client. This indicates the
reason the SPEAK request completed.
completion-cause = "Completion-Cause" ":" 3DIGIT SP
1*VCHAR CRLF
+------------+-----------------------+------------------------------+
| Cause-Code | Cause-Name            | Description                  |
+------------+-----------------------+------------------------------+
| 000        | normal                | SPEAK completed normally.    |
| 001        | barge-in              | SPEAK request was terminated |
|            |                       | because of barge-in.         |
| 002        | parse-failure         | SPEAK request terminated     |
|            |                       | because of a failure to      |
|            |                       | parse the speech markup      |
|            |                       | text.                        |
| 003        | uri-failure           | SPEAK request terminated     |
|            |                       | because access to one of the |
|            |                       | URIs failed.                 |
| 004        | error                 | SPEAK request terminated     |
|            |                       | prematurely due to           |
|            |                       | synthesizer error.           |
| 005        | language-unsupported  | Language not supported.      |
| 006        | lexicon-load-failure  | Lexicon loading failed.      |
| 007        | cancelled             | A prior SPEAK request failed |
|            |                       | while this one was still in  |
|            |                       | the queue.                   |
+------------+-----------------------+------------------------------+
Synthesizer Resource Completion Cause Codes
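A client could split a Completion-Cause value into its code and name
and check the pair against the table above, as in this Python sketch.
The names SYNTH_COMPLETION_CAUSES and parse_completion_cause, and the
choice to return a third "known pair" flag, are assumptions of this
example.

```python
# Cause codes and names copied from the table above.
SYNTH_COMPLETION_CAUSES = {
    0: "normal",
    1: "barge-in",
    2: "parse-failure",
    3: "uri-failure",
    4: "error",
    5: "language-unsupported",
    6: "lexicon-load-failure",
    7: "cancelled",
}


def parse_completion_cause(value):
    """Return (code, name, known) for a value such as "000 normal".

    `known` is True when the code and name agree with the table.
    Hypothetical helper; raises ValueError on a malformed value.
    """
    code_text, _, name = value.partition(" ")
    if len(code_text) != 3 or not (code_text.isascii() and code_text.isdigit()):
        raise ValueError(f"cause code must be exactly 3 digits: {value!r}")
    if not name:
        raise ValueError(f"cause name is missing: {value!r}")
    code = int(code_text)
    return code, name, SYNTH_COMPLETION_CAUSES.get(code) == name
```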
8.4.5. Completion-Reason
This header field MAY be specified in a SPEAK-COMPLETE event coming
from the synthesizer resource to the client. It carries free-form text
describing why the SPEAK request completed, typically the reason for a
failure such as an error in parsing the speech markup text.
completion-reason = "Completion-Reason" ":"
quoted-string CRLF
The completion reason text is provided for client use in logs and for
debugging and instrumentation purposes. Clients MUST NOT interpret
the completion reason text.
8.4.6. Voice-Parameter
This set of header fields defines the voice of the speaker.
voice-parameter = voice-gender
/ voice-age
/ voice-variant
/ voice-name
voice-gender = "Voice-Gender:" voice-gender-value CRLF
voice-gender-value = "male"
/ "female"
/ "neutral"
voice-age = "Voice-Age:" 1*3DIGIT CRLF
voice-variant = "Voice-Variant:" 1*19DIGIT CRLF
voice-name = "Voice-Name:"
1*UTFCHAR *(1*WSP 1*UTFCHAR) CRLF
The "Voice-" parameters are derived from the similarly named
attributes of the voice element specified in W3C's Speech Synthesis
Markup Language Specification (SSML)
[W3C.REC-speech-synthesis-20040907]. Legal values for these
parameters are as defined in that specification.
These header fields MAY be sent in SET-PARAMS or GET-PARAMS requests
to define or get default values for the entire session or MAY be sent
in the SPEAK request to define default values for that SPEAK request.
Note that SSML content can itself set these values internal to the
SSML document, of course.
Voice parameter header fields MAY also be sent in a CONTROL method to
affect a SPEAK request in progress and change its behavior on the
fly. If the synthesizer resource does not support this operation, it
MUST reject the request with a status-code of 403 "Unsupported Header
Field".
8.4.7. Prosody-Parameters
This set of header fields defines the prosody of the speech.
prosody-parameter = "Prosody-" prosody-param-name ":"
prosody-param-value CRLF
prosody-param-name = 1*VCHAR
prosody-param-value = 1*VCHAR
prosody-param-name is any one of the attribute names under the
prosody element specified in W3C's Speech Synthesis Markup Language
Specification [W3C.REC-speech-synthesis-20040907]. The prosody-
param-value is any one of the value choices of the corresponding
prosody element attribute from that specification.
These header fields MAY be sent in SET-PARAMS or GET-PARAMS requests
to define or get default values for the entire session or MAY be sent
in the SPEAK request to define default values for that SPEAK request.
Furthermore, these attributes can be part of the speech text marked
up in SSML.
The prosody parameter header fields in the SET-PARAMS or SPEAK
request only apply if the speech data is of type 'text/plain' and
does not use a speech markup format.
These prosody parameter header fields MAY also be sent in a CONTROL
method to affect a SPEAK request in progress and change its behavior
on the fly. If the synthesizer resource does not support this
operation, it MUST respond back to the client with a status-code of
403 "Unsupported Header Field".
8.4.8. Speech-Marker
This header field contains timestamp information in a "timestamp"
field. This is a Network Time Protocol (NTP) [RFC5905] timestamp, a
64-bit number in decimal form. It MUST be synced with the Real-Time
Protocol (RTP) [RFC3550] timestamp of the media stream through the
Real-Time Control Protocol (RTCP) [RFC3550].
Markers are bookmarks that are defined within the markup. Most
speech markup formats provide mechanisms to embed marker fields
within speech texts. The synthesizer generates SPEECH-MARKER events
when it reaches these marker fields. This header field MUST be part
of the SPEECH-MARKER event and contain the marker tag value after the
timestamp, separated by a semicolon. In these events, the timestamp
marks the time the text corresponding to the marker was emitted as
speech by the synthesizer.
This header field MUST also be returned in responses to STOP,
CONTROL, and BARGE-IN-OCCURRED methods, in the SPEAK-COMPLETE event,
and in an IN-PROGRESS SPEAK response. In these messages, if any
markers have been encountered for the current SPEAK, the marker tag
value MUST be the last embedded marker encountered. If no markers
have yet been encountered for the current SPEAK, only the timestamp
is REQUIRED. Note that in these messages, the purpose of this header
field is to provide timestamp information associated with important
events within the lifecycle of a request (start of SPEAK processing,
end of SPEAK processing, receipt of CONTROL/STOP/BARGE-IN-OCCURRED).
timestamp = "timestamp" "=" time-stamp-value
time-stamp-value = 1*20DIGIT
speech-marker = "Speech-Marker" ":"
timestamp
[";" 1*(UTFCHAR / %x20)] CRLF
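The grammar above could be parsed as in the following Python sketch,
which returns the timestamp and the optional marker tag. The function
name parse_speech_marker is an assumption of this example, and the
sketch does not interpret the timestamp as an NTP value.

```python
def parse_speech_marker(value):
    """Return (timestamp, tag) from a Speech-Marker header value.

    `tag` is None when no marker tag follows the timestamp.
    Hypothetical helper; raises ValueError on a malformed value.
    """
    ts_part, semicolon, tag = value.partition(";")
    key, equals, digits = ts_part.partition("=")
    if key != "timestamp" or not equals:
        raise ValueError(f"missing timestamp field: {value!r}")
    # time-stamp-value is 1 to 20 decimal digits
    if not (digits.isascii() and digits.isdigit()) or len(digits) > 20:
        raise ValueError(f"bad time-stamp-value: {value!r}")
    if semicolon and not tag:
        raise ValueError(f"empty marker tag: {value!r}")
    return int(digits), (tag if semicolon else None)
```

A value with no marker, as in an IN-PROGRESS response before any marker
has been reached, yields a tag of None.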
8.4.9. Speech-Language
This header field specifies the default language of the speech data
if the language is not specified in the markup. The value of this
header field MUST be a language tag as defined in RFC 5646 [RFC5646].
The header field MAY occur in SPEAK, SET-PARAMS, or GET-PARAMS requests.
speech-language = "Speech-Language" ":" 1*VCHAR CRLF
8.4.10. Fetch-Hint
When the synthesizer needs to fetch documents or other resources like
speech markup or audio files, this header field controls the
corresponding URI access properties. This provides client policy on
when the synthesizer should retrieve content from the server. A
value of "prefetch" indicates the content MAY be downloaded when the
request is received, whereas "safe" indicates that content MUST NOT
be downloaded until actually referenced. The default value is
"prefetch". This header field MAY occur in SPEAK, SET-PARAMS, or
GET-PARAMS requests.
fetch-hint = "Fetch-Hint" ":" ("prefetch" / "safe") CRLF
8.4.11. Audio-Fetch-Hint
When the synthesizer needs to fetch documents or other resources like
speech audio files, this header field controls the corresponding URI
access properties. This provides client policy on whether or not the
synthesizer is permitted to attempt to optimize speech by pre-
fetching audio. The value is either "safe" to say that audio is only
fetched when it is referenced, never before; "prefetch" to permit,
but not require the implementation to pre-fetch the audio; or
"stream" to allow it to stream the audio fetches. The default value
is "prefetch". This header field MAY occur in SPEAK, SET-PARAMS, or
GET-PARAMS requests.
audio-fetch-hint = "Audio-Fetch-Hint" ":"
("prefetch" / "safe" / "stream") CRLF
8.4.12. Failed-URI
When a synthesizer method needs a synthesizer to fetch or access a
URI and the access fails, the server SHOULD provide the failed URI in
this header field in the method response, unless there are multiple
URI failures, in which case the server MUST provide one of the failed
URIs in this header field in the method response.
failed-uri = "Failed-URI" ":" absoluteURI CRLF
8.4.13. Failed-URI-Cause
When a synthesizer method needs a synthesizer to fetch or access a
URI and the access fails, the server MUST provide the URI-specific or
protocol-specific response code for the URI in the Failed-URI header
field in the method response through this header field. The value
encoding is UTF-8 (RFC 3629 [RFC3629]) to accommodate any access
protocol -- some access protocols might have a response string
instead of a numeric response code.
failed-uri-cause = "Failed-URI-Cause" ":" 1*UTFCHAR CRLF
8.4.14. Speak-Restart
When a client issues a CONTROL request to a currently speaking
synthesizer resource to jump backward, and the target jump point is
before the start of the current SPEAK request, the current SPEAK
request MUST restart from the beginning of its speech data and the
server's response to the CONTROL request MUST contain this header
field with a value of "true" indicating a restart.
speak-restart = "Speak-Restart" ":" BOOLEAN CRLF
8.4.15. Speak-Length
This header field MAY be specified in a CONTROL method to control the
maximum length of speech to speak, relative to the current speaking
point in the currently active SPEAK request. If numeric, the value
MUST be a positive integer. If a header field with a Tag unit is
specified, then the speech output continues until the tag is reached
or the SPEAK request is completed, whichever comes first. This
header field MAY be specified in a SPEAK request to indicate the
length to speak from the speech data and is relative to the point in
speech that the SPEAK request starts. The different speech length
units supported are synthesizer implementation dependent. If a
server does not support the specified unit, the server MUST respond
with a status-code of 409 "Unsupported Header Field Value".
speak-length = "Speak-Length" ":" positive-length-value
CRLF
positive-length-value = positive-speech-length
/ text-speech-length
text-speech-length = 1*UTFCHAR SP "Tag"
positive-speech-length = 1*19DIGIT SP numeric-speech-unit
numeric-speech-unit = "Second"
/ "Word"
/ "Sentence"
/ "Paragraph"
8.4.16. Load-Lexicon
This header field is used to indicate whether a lexicon has to be
loaded or unloaded. The value "true" means to load the lexicon if
not already loaded, and the value "false" means to unload the lexicon
if it is loaded. The default value for this header field is "true".
This header field MAY be specified in a DEFINE-LEXICON method.
load-lexicon = "Load-Lexicon" ":" BOOLEAN CRLF
8.4.17. Lexicon-Search-Order
This header field is used to specify a list of active pronunciation
lexicon URIs and the search order among the active lexicons.
Lexicons specified within the SSML document take precedence over the
lexicons specified in this header field. This header field MAY be
specified in the SPEAK, SET-PARAMS, and GET-PARAMS methods.
lexicon-search-order = "Lexicon-Search-Order" ":"
"<" absoluteURI ">" *(" " "<" absoluteURI ">") CRLF
8.5. Synthesizer Message Body
A synthesizer message can contain additional information associated
with the Request, Response, or Event in its message body.
8.5.1. Synthesizer Speech Data
Marked-up text for the synthesizer to speak is specified as a typed
media entity in the message body. The speech data to be spoken by
the synthesizer can be specified inline by embedding the data in the
message body or by reference by providing a URI for accessing the
data. In either case, the data and the format used to mark up the
speech need to be of a content type supported by the server.
All MRCPv2 servers containing synthesizer resources MUST support both
plain text speech data and W3C's Speech Synthesis Markup Language
[W3C.REC-speech-synthesis-20040907] and hence MUST support the media
types 'text/plain' and 'application/ssml+xml'. Other formats MAY be
supported.
If the speech data is to be fetched by URI reference, the media type
'text/uri-list' (see RFC 2483 [RFC2483]) is used to indicate one or
more URIs that, when dereferenced, will contain the content to be
spoken. If a list of speech URIs is specified, the resource MUST
speak the speech data provided by each URI in the order in which the
URIs are specified in the content.
MRCPv2 clients and servers MUST support the 'multipart/mixed' media
type. This is the appropriate media type to use when providing a mix
of URI and inline speech data. Embedded within the multipart content
block, there MAY be content for the 'text/uri-list', 'application/
ssml+xml', and/or 'text/plain' media types. The character set and
encoding used in the speech data is specified according to standard
media type definitions. The multipart content MAY also contain
actual audio data. Clients may have recorded audio clips stored in
memory or on a local device and wish to play it as part of the SPEAK
request. The audio portions MAY be sent by the client as part of the
multipart content block. This audio is referenced in the speech
markup data that is another part in the multipart content block
according to the 'multipart/mixed' media type specification.
Content-Type:text/uri-list
Content-Length:...
http://www.example.com/ASR-Introduction.ssml
http://www.example.com/ASR-Document-Part1.ssml
http://www.example.com/ASR-Document-Part2.ssml
http://www.example.com/ASR-Conclusion.ssml
URI List Example
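A receiver of a 'text/uri-list' body such as the one above might turn
it into an ordered list of URIs as sketched below in Python. The
function name parse_uri_list is an assumption of this example. Lines
beginning with '#' are skipped because RFC 2483 treats them as
comments.

```python
def parse_uri_list(body):
    """Return the URIs of a text/uri-list body in the order listed.

    Hypothetical helper for illustration; it does not validate the URIs.
    """
    uris = []
    for raw in body.splitlines():
        line = raw.strip()
        if line and not line.startswith("#"):
            uris.append(line)
    return uris


# Body copied from the URI list example above.
EXAMPLE_URI_LIST = (
    "http://www.example.com/ASR-Introduction.ssml\r\n"
    "http://www.example.com/ASR-Document-Part1.ssml\r\n"
    "http://www.example.com/ASR-Document-Part2.ssml\r\n"
    "http://www.example.com/ASR-Conclusion.ssml\r\n"
)

speech_uris = parse_uri_list(EXAMPLE_URI_LIST)
```

Preserving the list order matters because the resource speaks the
content of each URI in the order in which the URIs appear.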
Content-Type:application/ssml+xml
Content-Length:...
<?xml version="1.0"?>
<speak version="1.0"
xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
xml:lang="en-US">
<p>
<s>You have 4 new messages.</s>
<s>The first is from Aldine Turnbet
and arrived at <break/>
<say-as interpret-as="vxml:time">0345p</say-as>.</s>
<s>The subject is <prosody
rate="-20%">ski trip</prosody></s>
</p>
</speak>
SSML Example
Content-Type:multipart/mixed; boundary="break"
--break
Content-Type:text/uri-list
Content-Length:...
http://www.example.com/ASR-Introduction.ssml
http://www.example.com/ASR-Document-Part1.ssml
http://www.example.com/ASR-Document-Part2.ssml
http://www.example.com/ASR-Conclusion.ssml
--break
Content-Type:application/ssml+xml
Content-Length:...
<?xml version="1.0"?>
<speak version="1.0"
xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
xml:lang="en-US">
<p>
<s>You have 4 new messages.</s>
<s>The first is from Stephanie Williams
and arrived at <break/>
<say-as interpret-as="vxml:time">0342p</say-as>.</s>
<s>The subject is <prosody
rate="-20%">ski trip</prosody></s>
</p>
</speak>
--break--
Multipart Example
8.5.2. Lexicon Data
Synthesizer lexicon data from the client to the server can be
provided inline or by reference. Either way, it is carried as
typed media in the message body of the MRCPv2 request message (see
Section 8.14).
When a lexicon is specified inline in the message, the client MUST
provide a Content-ID for that lexicon as part of the content header
fields. The server MUST store the lexicon associated with that
Content-ID for the duration of the session. A stored lexicon can be
overwritten by defining a new lexicon with the same Content-ID.
Lexicons that have been associated with a Content-ID can be
referenced through the 'session' URI scheme (see Section 13.6).
If lexicon data is specified by external URI reference, the media
type 'text/uri-list' (see RFC 2483 [RFC2483]) is used to list the
one or more URIs that may be dereferenced to obtain the lexicon data.
All MRCPv2 servers MUST support the "http" and "https" URI access
mechanisms, and MAY support other mechanisms.
If the data in the message body consists of a mix of URI and inline
lexicon data, the 'multipart/mixed' media type is used. The
character set and encoding used in the lexicon data may be specified
according to standard media type definitions.
8.6. SPEAK Method
The SPEAK request provides the synthesizer resource with the speech
text and initiates speech synthesis and streaming. The SPEAK method
MAY carry voice and prosody header fields that alter the behavior of
the voice being synthesized, as well as a typed media message body
containing the actual marked-up text to be spoken.
The SPEAK method implementation MUST do a fetch of all external URIs
that are part of that operation. If caching is implemented, this URI
fetching MUST conform to the cache-control hints and parameter header
fields associated with the method in deciding whether it is to be
fetched from cache or from the external server. If these hints/
parameters are not specified in the method, the values set for the
session using SET-PARAMS/GET-PARAMS apply. If they were not set for
the session, their default values apply.
When applying voice parameters, there are three levels of precedence.
The highest precedence goes to values specified within the speech markup
text, followed by those specified in the header fields of the SPEAK
request and hence that apply for that SPEAK request only, followed by
the session default values that can be set using the SET-PARAMS
request and apply for subsequent methods invoked during the session.
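The three levels of precedence can be illustrated with a layered
lookup. In this Python sketch, the function name
effective_voice_params and the use of plain dictionaries keyed by
header field name are assumptions of the example; it also assumes that
values found in the markup have already been mapped to the
corresponding header field names.

```python
from collections import ChainMap


def effective_voice_params(markup, speak_headers, session_defaults):
    """Merge voice parameters; earlier arguments take precedence.

    Hypothetical helper: markup beats SPEAK header fields, which beat
    the session defaults set with SET-PARAMS.
    """
    return dict(ChainMap(markup, speak_headers, session_defaults))


resolved = effective_voice_params(
    markup={"Voice-Gender": "female"},
    speak_headers={"Voice-Gender": "neutral", "Voice-Age": "25"},
    session_defaults={"Voice-Age": "40", "Voice-Variant": "1"},
)
```

Here the markup value wins for Voice-Gender, the SPEAK header field
wins for Voice-Age, and Voice-Variant falls back to the session
default.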
If the resource was idle at the time the SPEAK request arrived at the
server and the SPEAK method is being actively processed, the resource
responds immediately with a success status code and a request-state
of IN-PROGRESS.
If the resource is in the speaking or paused state when the SPEAK
method arrives at the server, i.e., it is in the middle of processing
a previous SPEAK request, the status returns success with a request-
state of PENDING. The server places the SPEAK request in the
synthesizer resource request queue. The request queue operates
strictly FIFO: requests are processed serially in order of receipt.
If the current SPEAK fails, all SPEAK methods in the pending queue
are cancelled and each generates a SPEAK-COMPLETE event with a
Completion-Cause of "cancelled".
For the synthesizer resource, SPEAK is the only method that can
return a request-state of IN-PROGRESS or PENDING. When the text has
been synthesized and played into the media stream, the resource
issues a SPEAK-COMPLETE event with the request-id of the SPEAK
request and a request-state of COMPLETE.
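The queueing behavior described above can be modeled as follows. This
Python sketch is a toy model, not a server implementation: the class
SynthesizerQueue, its method names, and the `failed` flag are
assumptions of this example, and the events are simply recorded in a
list rather than sent to a client.

```python
from collections import deque


class SynthesizerQueue:
    """Toy model of the SPEAK request queue (names are hypothetical)."""

    def __init__(self):
        self.active = None      # request-id of the IN-PROGRESS SPEAK
        self.pending = deque()  # request-ids of PENDING SPEAKs, FIFO
        self.events = []        # (event, request-id, completion-cause)

    def speak(self, request_id):
        """Accept a SPEAK and return the request-state of the response."""
        if self.active is None:
            self.active = request_id
            return "IN-PROGRESS"
        self.pending.append(request_id)
        return "PENDING"

    def finish(self, cause="000 normal", failed=False):
        """End the active SPEAK, emit SPEAK-COMPLETE, and advance."""
        if self.active is None:
            raise RuntimeError("no SPEAK request is active")
        self.events.append(("SPEAK-COMPLETE", self.active, cause))
        self.active = None
        if failed:
            # A failure cancels every SPEAK still waiting in the queue.
            while self.pending:
                cancelled = self.pending.popleft()
                self.events.append(("SPEAK-COMPLETE", cancelled, "007 cancelled"))
        elif self.pending:
            # Otherwise the next request in FIFO order becomes active.
            self.active = self.pending.popleft()
```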
C->S: MRCP/2.0 ... SPEAK 543257
Channel-Identifier:32AECB23433802@speechsynth
Voice-gender:neutral
Voice-Age:25
Prosody-volume:medium
Content-Type:application/ssml+xml
Content-Length:...
<?xml version="1.0"?>
<speak version="1.0"
xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
xml:lang="en-US">
<p>
<s>You have 4 new messages.</s>
<s>The first is from Stephanie Williams and arrived at
<break/>
<say-as interpret-as="vxml:time">0342p</say-as>.
</s>
<s>The subject is
<prosody rate="-20%">ski trip</prosody>
</s>
</p>
</speak>
S->C: MRCP/2.0 ... 543257 200 IN-PROGRESS
Channel-Identifier:32AECB23433802@speechsynth
Speech-Marker:timestamp=857206027059
S->C: MRCP/2.0 ... SPEAK-COMPLETE 543257 COMPLETE
Channel-Identifier:32AECB23433802@speechsynth
Completion-Cause:000 normal
Speech-Marker:timestamp=857206027059
SPEAK Example
8.7. STOP
The STOP method from the client to the server tells the synthesizer
resource to stop speaking if it is speaking something.
The STOP request can be sent with an Active-Request-Id-List header
field to stop zero or more specific SPEAK requests that may be in the
queue; the server returns a response status-code of 200 "Success". If no
Active-Request-Id-List header field is sent in the STOP request, the
server terminates all outstanding SPEAK requests.
If a STOP request successfully terminated one or more PENDING or
IN-PROGRESS SPEAK requests, then the response MUST contain an Active-
Request-Id-List header field enumerating the SPEAK request-ids that
were terminated. Otherwise, there is no Active-Request-Id-List
header field in the response. No SPEAK-COMPLETE events are sent for
such terminated requests.
If a SPEAK request that was IN-PROGRESS and speaking was stopped, the
next pending SPEAK request, if any, becomes IN-PROGRESS at the
resource and enters the speaking state.
If a SPEAK request that was IN-PROGRESS and paused was stopped, the
next pending SPEAK request, if any, becomes IN-PROGRESS and enters
the paused state.
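The STOP rules above can be summarized in a small function. This
Python sketch is illustrative only: the function name handle_stop, its
parameters, and the dictionary used to represent the response are
assumptions of this example.

```python
def handle_stop(active, paused, pending, id_list=None):
    """Apply a STOP request to a modeled synthesizer resource.

    active:  request-id of the IN-PROGRESS SPEAK, or None
    paused:  True if the resource is in the paused state
    pending: request-ids of PENDING SPEAKs in FIFO order
    id_list: value of Active-Request-Id-List in the STOP, or None

    Returns (response, new_active, new_state, new_pending).
    Hypothetical helper for illustration.
    """
    outstanding = ([active] if active is not None else []) + list(pending)
    if id_list is None:
        targets = outstanding  # no list: terminate everything
    else:
        targets = [rid for rid in outstanding if rid in id_list]
    remaining = [rid for rid in pending if rid not in targets]
    if active is not None and active in targets:
        # The next pending SPEAK, if any, becomes IN-PROGRESS and
        # inherits the speaking or paused state of the stopped one.
        active = remaining.pop(0) if remaining else None
    if active is None:
        state = "idle"
    else:
        state = "paused" if paused else "speaking"
    response = {"status": 200, "request-state": "COMPLETE"}
    if targets:
        response["Active-Request-Id-List"] = targets
    return response, active, state, remaining
```

No SPEAK-COMPLETE events are modeled, since none are sent for requests
terminated by STOP.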
C->S: MRCP/2.0 ... SPEAK 543258
Channel-Identifier:32AECB23433802@speechsynth
Content-Type:application/ssml+xml
Content-Length:...
<?xml version="1.0"?>
<speak version="1.0"
xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
xml:lang="en-US">
<p>
<s>You have 4 new messages.</s>
<s>The first is from Stephanie Williams and arrived at
<break/>
<say-as interpret-as="vxml:time">0342p</say-as>.</s>
<s>The subject is
<prosody rate="-20%">ski trip</prosody></s>
</p>
</speak>
S->C: MRCP/2.0 ... 543258 200 IN-PROGRESS
Channel-Identifier:32AECB23433802@speechsynth
Speech-Marker:timestamp=857206027059
C->S: MRCP/2.0 ... STOP 543259
Channel-Identifier:32AECB23433802@speechsynth
S->C: MRCP/2.0 ... 543259 200 COMPLETE
Channel-Identifier:32AECB23433802@speechsynth
Active-Request-Id-List:543258
Speech-Marker:timestamp=857206039059
STOP Example
8.8. BARGE-IN-OCCURRED
The BARGE-IN-OCCURRED method, when used with the synthesizer
resource, provides a client that has detected a barge-in-able event a
means to communicate the occurrence of the event to the synthesizer
resource.
This method is useful in two scenarios:
1. The client has detected DTMF digits in the input media or some
other barge-in-able event and wants to communicate that to the
synthesizer resource.
2. The recognizer resource and the synthesizer resource are in
different servers. In this case, the client acts as an
intermediary for the two servers. It receives an event from the
recognition resource and sends a BARGE-IN-OCCURRED request to the
synthesizer. In such cases, the BARGE-IN-OCCURRED method would
also have a Proxy-Sync-Id header field received from the resource
generating the original event.
If a SPEAK request is active with kill-on-barge-in enabled (see
Section 8.4.2), and the BARGE-IN-OCCURRED request is received, the
synthesizer MUST immediately stop streaming out audio. It MUST also
terminate any speech requests queued behind the current active one,
irrespective of whether or not they have barge-in enabled. If a
barge-in-able SPEAK request was playing and it was terminated, the
response MUST contain an Active-Request-Id-List header field listing
the request-ids of all SPEAK requests that were terminated. The
server generates no SPEAK-COMPLETE events for these requests.
If there were no SPEAK requests terminated by the synthesizer
resource as a result of the BARGE-IN-OCCURRED method, the server MUST
respond to the BARGE-IN-OCCURRED with a status-code of 200 "Success",
and the response MUST NOT contain an Active-Request-Id-List header
field.
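The two outcomes described above can be sketched as follows. In this
Python example, the function name handle_barge_in, its parameters, and
the dictionary used to represent the response are assumptions made for
illustration.

```python
def handle_barge_in(active, kill_on_barge_in, pending):
    """Apply BARGE-IN-OCCURRED to a modeled synthesizer resource.

    active:           request-id of the IN-PROGRESS SPEAK, or None
    kill_on_barge_in: Kill-On-Barge-In setting of the active SPEAK
    pending:          request-ids of PENDING SPEAKs in FIFO order

    Returns (response, new_active, new_pending).
    Hypothetical helper for illustration.
    """
    response = {"status": 200, "request-state": "COMPLETE"}
    if active is not None and kill_on_barge_in:
        # Stop the active SPEAK and everything queued behind it,
        # whether or not the queued requests enabled barge-in.
        response["Active-Request-Id-List"] = [active, *pending]
        return response, None, []
    # Nothing terminated: 200 "Success" without the list header field.
    return response, active, list(pending)
```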
If the synthesizer and recognizer resources are part of the same
MRCPv2 session, they can be optimized for a quicker kill-on-barge-in
response if the recognizer and synthesizer interact directly. In
these cases, the client MUST still react to a START-OF-INPUT event
from the recognizer by invoking the BARGE-IN-OCCURRED method to the
synthesizer. The client MUST invoke the BARGE-IN-OCCURRED if it has
any outstanding requests to the synthesizer resource in either the
PENDING or IN-PROGRESS state.
C->S: MRCP/2.0 ... SPEAK 543258
Channel-Identifier:32AECB23433802@speechsynth
Voice-gender:neutral
Voice-Age:25
Prosody-volume:medium
Content-Type:application/ssml+xml
Content-Length:...
<?xml version="1.0"?>
<speak version="1.0"
xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
xml:lang="en-US">
<p>
<s>You have 4 new messages.</s>
<s>The first is from Stephanie Williams and arrived at
<break/>
<say-as interpret-as="vxml:time">0342p</say-as>.</s>
<s>The subject is
<prosody rate="-20%">ski trip</prosody></s>
</p>
</speak>
S->C: MRCP/2.0 ... 543258 200 IN-PROGRESS
Channel-Identifier:32AECB23433802@speechsynth
Speech-Marker:timestamp=857206027059
C->S: MRCP/2.0 ... BARGE-IN-OCCURRED 543259
Channel-Identifier:32AECB23433802@speechsynth
Proxy-Sync-Id:987654321
S->C: MRCP/2.0 ... 543259 200 COMPLETE
Channel-Identifier:32AECB23433802@speechsynth
Active-Request-Id-List:543258
Speech-Marker:timestamp=857206039059
BARGE-IN-OCCURRED Example
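The server-side rules above can be sketched in a few lines of Python.
This is an illustrative model only, not part of the protocol: the
SynthesizerQueue class and its fields are hypothetical names, and a
request with no recorded barge-in setting is assumed to be
barge-in-able.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class SynthesizerQueue:
    """Hypothetical model of a synthesizer resource's SPEAK queue."""
    active: Optional[int] = None                      # request-id that is IN-PROGRESS
    pending: List[int] = field(default_factory=list)  # request-ids that are PENDING
    barge_in_enabled: Dict[int, bool] = field(default_factory=dict)

    def barge_in_occurred(self) -> Tuple[int, Dict[str, str]]:
        """Return (status-code, header fields) for the BARGE-IN-OCCURRED response."""
        headers: Dict[str, str] = {}
        # Assumption: a request with no recorded setting is barge-in-able.
        if self.active is not None and self.barge_in_enabled.get(self.active, True):
            # Stop the active SPEAK and every request queued behind it,
            # whether or not the queued requests have barge-in enabled.
            terminated = [self.active, *self.pending]
            self.active = None
            self.pending.clear()
            headers["Active-Request-Id-List"] = ",".join(str(r) for r in terminated)
        # If nothing was terminated, the header field is left out.
        return 200, headers
```

In this sketch no SPEAK-COMPLETE events are produced for the
terminated requests; their request-ids are reported only in the
response.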
8.9. PAUSE
The PAUSE method from the client to the server tells the synthesizer
resource to pause speech output if it is speaking something. If a
PAUSE method is issued on a session when a SPEAK is not active, the
server MUST respond with a status-code of 402 "Method not valid in
this state". If a PAUSE method is issued on a session when a SPEAK
is active and paused, the server MUST respond with a status-code of
200 "Success". If a SPEAK request was active, the server MUST return
an Active-Request-Id-List header field whose value contains the
request-id of the SPEAK request that was paused.
C->S: MRCP/2.0 ... SPEAK 543258
Channel-Identifier:32AECB23433802@speechsynth
Voice-gender:neutral
Voice-Age:25
Prosody-volume:medium
Content-Type:application/ssml+xml
Content-Length:...
<?xml version="1.0"?>
<speak version="1.0"
xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
xml:lang="en-US">
<p>
<s>You have 4 new messages.</s>
<s>The first is from Stephanie Williams and arrived at
<break/>
<say-as interpret-as="vxml:time">0342p</say-as>.</s>
<s>The subject is
<prosody rate="-20%">ski trip</prosody></s>
</p>
</speak>
S->C: MRCP/2.0 ... 543258 200 IN-PROGRESS
Channel-Identifier:32AECB23433802@speechsynth
Speech-Marker:timestamp=857206027059
C->S: MRCP/2.0 ... PAUSE 543259
Channel-Identifier:32AECB23433802@speechsynth
S->C: MRCP/2.0 ... 543259 200 COMPLETE
Channel-Identifier:32AECB23433802@speechsynth
Active-Request-Id-List:543258
PAUSE Example
8.10. RESUME
The RESUME method from the client to the server tells a paused
synthesizer resource to resume speaking. If a RESUME request is
issued on a session with no active SPEAK request, the server MUST
respond with a status-code of 402 "Method not valid in this state".
If a RESUME request is issued on a session with an active SPEAK
request that is speaking (i.e., not paused), the server MUST respond
with a status-code of 200 "Success". If a SPEAK request was paused,
the server MUST return an Active-Request-Id-List header field whose
value contains the request-id of the SPEAK request that was resumed.
C->S: MRCP/2.0 ... SPEAK 543258
Channel-Identifier:32AECB23433802@speechsynth
Voice-gender:neutral
Voice-age:25
Prosody-volume:medium
Content-Type:application/ssml+xml
Content-Length:...
<?xml version="1.0"?>
<speak version="1.0"
xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
xml:lang="en-US">
<p>
<s>You have 4 new messages.</s>
<s>The first is from Stephanie Williams and arrived at
<break/>
<say-as interpret-as="vxml:time">0342p</say-as>.</s>
<s>The subject is
<prosody rate="-20%">ski trip</prosody></s>
</p>
</speak>
S->C: MRCP/2.0 ... 543258 200 IN-PROGRESS
Channel-Identifier:32AECB23433802@speechsynth
Speech-Marker:timestamp=857206027059
C->S: MRCP/2.0 ... PAUSE 543259
Channel-Identifier:32AECB23433802@speechsynth
S->C: MRCP/2.0 ... 543259 200 COMPLETE
Channel-Identifier:32AECB23433802@speechsynth
Active-Request-Id-List:543258
C->S: MRCP/2.0 ... RESUME 543260
Channel-Identifier:32AECB23433802@speechsynth
S->C: MRCP/2.0 ... 543260 200 COMPLETE
Channel-Identifier:32AECB23433802@speechsynth
Active-Request-Id-List:543258
RESUME Example
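The status-code rules for PAUSE (Section 8.9) and RESUME can be read
as a small state machine. The following Python sketch is one possible
reading, not a normative algorithm; SynthState and pause_or_resume are
hypothetical names. It assumes that a RESUME received while already
speaking returns 200 without an Active-Request-Id-List header field,
since the text requires that header field only when a SPEAK request
was paused.

```python
from enum import Enum
from typing import Dict, Optional, Tuple


class SynthState(Enum):
    IDLE = "idle"          # no active SPEAK request
    SPEAKING = "speaking"  # SPEAK active and producing audio
    PAUSED = "paused"      # SPEAK active but paused


def pause_or_resume(method: str, state: SynthState, active_id: Optional[int]
                    ) -> Tuple[int, SynthState, Dict[str, str]]:
    """Return (status-code, new state, response header fields)."""
    if method not in ("PAUSE", "RESUME"):
        raise ValueError("unsupported method: " + method)
    if state is SynthState.IDLE or active_id is None:
        # No active SPEAK: 402 "Method not valid in this state".
        return 402, state, {}
    if method == "PAUSE":
        # Pausing an already paused SPEAK still succeeds with 200.
        return 200, SynthState.PAUSED, {"Active-Request-Id-List": str(active_id)}
    if state is SynthState.PAUSED:
        return 200, SynthState.SPEAKING, {"Active-Request-Id-List": str(active_id)}
    # RESUME while already speaking: 200, nothing was resumed (assumption
    # of this sketch: the header field is then omitted).
    return 200, SynthState.SPEAKING, {}
```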
8.11. CONTROL
The CONTROL method from the client to the server tells a synthesizer
that is speaking to modify what it is speaking on the fly. This
method is used to ask the synthesizer to jump forward or backward
in what it is speaking, change the speaking rate, adjust other
speaker parameters, and so on.
It affects only the currently IN-PROGRESS SPEAK request. Depending
on the implementation and capability of the synthesizer resource, it
may or may not support the various modifications indicated by header
fields in the CONTROL request.
When a client invokes a CONTROL method to jump forward and the
operation goes beyond the end of the active SPEAK method's text, the
CONTROL request still succeeds. The active SPEAK request completes
and returns a SPEAK-COMPLETE event following the response to the
CONTROL method. If there are more SPEAK requests in the queue, the
synthesizer resource starts at the beginning of the next SPEAK
request in the queue.
When a client invokes a CONTROL method to jump backward and the
operation jumps to the beginning or beyond the beginning of the
speech data of the active SPEAK method, the CONTROL request still
succeeds. The response to the CONTROL request contains the
Speak-Restart header field, and the active SPEAK request restarts
from the beginning of its speech data.
These two behaviors can be used to rewind or fast-forward across
multiple speech requests, if the client wants to break up a speech
markup text into multiple SPEAK requests.
If a SPEAK request was active when the CONTROL method was received,
the server MUST return an Active-Request-Id-List header field
containing the request-id of the SPEAK request that was active.
C->S: MRCP/2.0 ... SPEAK 543258
Channel-Identifier:32AECB23433802@speechsynth
Voice-gender:neutral
Voice-age:25
Prosody-volume:medium
Content-Type:application/ssml+xml
Content-Length:...
<?xml version="1.0"?>
<speak version="1.0"
xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
xml:lang="en-US">
<p>
<s>You have 4 new messages.</s>
<s>The first is from Stephanie Williams
and arrived at <break/>
<say-as interpret-as="vxml:time">0342p</say-as>.</s>
<s>The subject is <prosody
rate="-20%">ski trip</prosody></s>
</p>
</speak>
S->C: MRCP/2.0 ... 543258 200 IN-PROGRESS
Channel-Identifier:32AECB23433802@speechsynth
Speech-Marker:timestamp=857205016059
C->S: MRCP/2.0 ... CONTROL 543259
Channel-Identifier:32AECB23433802@speechsynth
Prosody-rate:fast
S->C: MRCP/2.0 ... 543259 200 COMPLETE
Channel-Identifier:32AECB23433802@speechsynth
Active-Request-Id-List:543258
Speech-Marker:timestamp=857206027059
C->S: MRCP/2.0 ... CONTROL 543260
Channel-Identifier:32AECB23433802@speechsynth
Jump-Size:-15 Words
S->C: MRCP/2.0 ... 543260 200 COMPLETE
Channel-Identifier:32AECB23433802@speechsynth
Active-Request-Id-List:543258
Speech-Marker:timestamp=857206039059
CONTROL Example
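The two boundary cases for jumps described above can be illustrated
with a short Python sketch. It is a simplification under stated
assumptions: the active SPEAK request is modeled as a count of words
with a zero-based position, the jump is given as a signed number of
words, and control_jump is a hypothetical helper. The
Active-Request-Id-List header field that accompanies the response is
omitted for brevity.

```python
from typing import Dict, Tuple


def control_jump(total_words: int, position: int, jump_words: int
                 ) -> Tuple[int, bool, Dict[str, str]]:
    """Return (new position, speak_completes, extra response header fields).

    Positions are zero-based word indexes, so total_words is one past
    the last word of the active SPEAK request.
    """
    target = position + jump_words
    if target >= total_words:
        # Jump goes beyond the end: CONTROL still succeeds, and the
        # active SPEAK then finishes with a SPEAK-COMPLETE event.
        return total_words, True, {}
    if target <= 0:
        # Jump reaches or passes the beginning: CONTROL still succeeds
        # and the SPEAK restarts from the start of its speech data.
        return 0, False, {"Speak-Restart": "true"}
    return target, False, {}
```

When speak_completes is true and more SPEAK requests are queued, the
next one would start from its beginning, which is what lets a client
fast-forward or rewind across a prompt split into several requests.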
8.12. SPEAK-COMPLETE
This is an Event message from the synthesizer resource to the client
that indicates the corresponding SPEAK request was completed. The
request-id field matches the request-id of the SPEAK request that
initiated the speech that just completed. The request-state field is
set to COMPLETE by the server, indicating that this is the last event
with the corresponding request-id. The Completion-Cause header field
specifies the cause code giving the status of and reason for request
completion, for example, whether the SPEAK completed normally or
because of an error, kill-on-barge-in, etc.
C->S: MRCP/2.0 ... SPEAK 543260
Channel-Identifier:32AECB23433802@speechsynth
Voice-gender:neutral
Voice-age:25
Prosody-volume:medium
Content-Type:application/ssml+xml
Content-Length:...
<?xml version="1.0"?>
<speak version="1.0"
xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
xml:lang="en-US">
<p>
<s>You have 4 new messages.</s>
<s>The first is from Stephanie Williams
and arrived at <break/>
<say-as interpret-as="vxml:time">0342p</say-as>.</s>
<s>The subject is
<prosody rate="-20%">ski trip</prosody></s>
</p>
</speak>
S->C: MRCP/2.0 ... 543260 200 IN-PROGRESS
Channel-Identifier:32AECB23433802@speechsynth
Speech-Marker:timestamp=857206027059
S->C: MRCP/2.0 ... SPEAK-COMPLETE 543260 COMPLETE
Channel-Identifier:32AECB23433802@speechsynth
Completion-Cause:000 normal
Speech-Marker:timestamp=857206039059
SPEAK-COMPLETE Example
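The examples in this section show three start-line shapes: requests
(method name, then request-id), responses (request-id, status-code,
request-state), and events (event name, request-id, request-state).
The Python sketch below tells them apart by token count and by
whether the third token is numeric. It is an informal illustration,
not a full parser; classify_start_line is a hypothetical name, and
the message-length token is passed through untouched, so the elided
"..." used in the examples is accepted as-is.

```python
from typing import Dict


def classify_start_line(line: str) -> Dict[str, str]:
    """Classify an MRCPv2 start-line as a request, response, or event."""
    tokens = line.split()
    if len(tokens) < 4 or not tokens[0].startswith("MRCP/"):
        raise ValueError("not an MRCP start-line: " + line)
    rest = tokens[2:]  # skip the version and message-length tokens
    if len(rest) == 2:
        return {"kind": "request", "method": rest[0], "request-id": rest[1]}
    if len(rest) == 3 and rest[0].isdigit():
        # A response leads with the numeric request-id.
        return {"kind": "response", "request-id": rest[0],
                "status-code": rest[1], "request-state": rest[2]}
    if len(rest) == 3:
        # An event leads with the event name.
        return {"kind": "event", "event": rest[0],
                "request-id": rest[1], "request-state": rest[2]}
    raise ValueError("unexpected start-line: " + line)
```

For instance, the last message of the example above is classified as
an event whose request-id matches the SPEAK request and whose
request-state is COMPLETE.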
8.13. SPEECH-MARKER
This is an event generated by the synthesizer resource to the client
when the synthesizer encounters a marker tag in the speech markup it
is currently processing. The value of the request-id field MUST
match that of the corresponding SPEAK request. The request-state
field MUST have the value "IN-PROGRESS" as the speech is still not
complete. The value of the speech marker tag hit, describing where
the synthesizer is in the speech markup, MUST be returned in the
Speech-Marker header field, along with an NTP timestamp indicating
the instant in the output speech stream that the marker was
encountered. The SPEECH-MARKER event MUST also be generated with a
null marker value and output NTP timestamp when a SPEAK request in
the PENDING state (i.e., in the queue) changes state to IN-PROGRESS and
starts speaking. The NTP timestamp MUST be synchronized with the RTP
timestamp used to generate the speech stream through standard RTCP
machinery.
C->S: MRCP/2.0 ... SPEAK 543261
Channel-Identifier:32AECB23433802@speechsynth
Voice-gender:neutral
Voice-age:25
Prosody-volume:medium
Content-Type:application/ssml+xml
Content-Length:...
<?xml version="1.0"?>
<speak version="1.0"
xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
xml:lang="en-US">
<p>
<s>You have 4 new messages.</s>
<s>The first is from Stephanie Williams
and arrived at <break/>
<say-as interpret-as="vxml:time">0342p</say-as>.</s>
<mark name="here"/>
<s>The subject is
<prosody rate="-20%">ski trip</prosody>
</s>
<mark name="ANSWER"/>
</p>
</speak>
S->C: MRCP/2.0 ... 543261 200 IN-PROGRESS
Channel-Identifier:32AECB23433802@speechsynth
Speech-Marker:timestamp=857205015059
S->C: MRCP/2.0 ... SPEECH-MARKER 543261 IN-PROGRESS
Channel-Identifier:32AECB23433802@speechsynth
Speech-Marker:timestamp=857206027059;here
S->C: MRCP/2.0 ... SPEECH-MARKER 543261 IN-PROGRESS
Channel-Identifier:32AECB23433802@speechsynth
Speech-Marker:timestamp=857206039059;ANSWER
S->C: MRCP/2.0 ... SPEAK-COMPLETE 543261 COMPLETE
Channel-Identifier:32AECB23433802@speechsynth
Completion-Cause:000 normal
Speech-Marker:timestamp=857207689259;ANSWER
SPEECH-MARKER Example
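A client typically needs to split the Speech-Marker header field value
into its timestamp and marker name. The Python sketch below handles
the two shapes seen in the example above, "timestamp=<digits>" alone
and "timestamp=<digits>;<mark>". It is a minimal illustration;
parse_speech_marker is a hypothetical helper, and a missing mark is
reported as None to represent the null marker value.

```python
from typing import Optional, Tuple


def parse_speech_marker(value: str) -> Tuple[int, Optional[str]]:
    """Split a Speech-Marker value into (timestamp, mark name or None)."""
    timestamp_part, _, mark = value.partition(";")
    key, _, digits = timestamp_part.partition("=")
    digits = digits.strip()
    if key.strip() != "timestamp" or not digits.isdigit():
        raise ValueError("malformed Speech-Marker value: " + value)
    # An empty mark stands for the null marker value.
    return int(digits), (mark.strip() or None)
```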
8.14. DEFINE-LEXICON
The DEFINE-LEXICON method, from the client to the server, provides a
lexicon and tells the server to load or unload the lexicon (see
Section 8.4.16). The media type of the lexicon is provided in the
Content-Type header (see Section 8.5.2). One such media type is
"application/pls+xml" for the Pronunciation Lexicon Specification
(PLS) [W3C.REC-pronunciation-lexicon-20081014] [RFC4267].
If the server resource is in the speaking or paused state, the server
MUST respond with a failure status-code of 402 "Method not valid in
this state".
If the resource is in the idle state and is able to successfully
load/unload the lexicon, the server MUST respond with a 200 "Success"
status-code and the request-state MUST be COMPLETE.
If the synthesizer could not define the lexicon for some reason, for
example, because the download failed or the lexicon was in an
unsupported form, the server MUST respond with a failure status-code
of 407 "Method or Operation Failed" and a Completion-Cause header
field describing the failure reason.
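The three outcomes of DEFINE-LEXICON can be summarized as a decision
function. The Python sketch below is an informal reading of the rules
above: define_lexicon, LexiconError, and the load callback are
hypothetical names, the resource state is passed as a plain string,
and the Completion-Cause value is whatever text the loader supplies
rather than a value chosen here.

```python
from typing import Callable, Dict, Tuple


class LexiconError(Exception):
    """Hypothetical loader error; its text becomes the Completion-Cause value."""


def define_lexicon(state: str, load: Callable[[], None]
                   ) -> Tuple[int, Dict[str, str]]:
    """Return (status-code, response header fields) for DEFINE-LEXICON."""
    if state in ("speaking", "paused"):
        # Not idle: 402 "Method not valid in this state".
        return 402, {}
    try:
        load()  # load or unload the lexicon; raises LexiconError on failure
    except LexiconError as err:
        # Download failed, unsupported form, and so on.
        return 407, {"Completion-Cause": str(err)}
    # Success: the request-state of this response is COMPLETE.
    return 200, {}
```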