RFC 6787

Media Resource Control Protocol Version 2 (MRCPv2)

9.6.  Recognizer Results

   The recognizer portion of NLSML (see Section 6.3.1) represents
   information automatically extracted from a user's utterances by a
   semantic interpretation component, where "utterance" is to be taken
   in the general sense of a meaningful user input in any modality
   supported by the MRCPv2 implementation.

9.6.1.  Markup Functions

   MRCPv2 recognizer resources employ the Natural Language Semantics
   Markup Language (NLSML) to interpret natural language speech input
   and to format the interpretation for consumption by an MRCPv2 client.

   The elements of the markup fall into the following general functional
   categories: interpretation, side information, and multi-modal
   integration.

9.6.1.1.  Interpretation

   These are the elements and attributes that represent the semantics
   of a user's utterance; they include the <result>, <interpretation>,
   and <instance> elements.  The <result> element contains the full
   result of
   processing one utterance.  It MAY contain multiple <interpretation>
   elements if the interpretation of the utterance results in multiple
   alternative meanings due to uncertainty in speech recognition or
   natural language understanding.  There are at least two reasons for
   providing multiple interpretations:

   1.  The client application might have additional information, for
       example, information from a database, that would allow it to
       select a preferred interpretation from among the possible
       interpretations returned from the semantic interpreter.

   2.  A client-based dialog manager (e.g., VoiceXML
       [W3C.REC-voicexml20-20040316]) that was unable to select between
       several competing interpretations could use this information to
       go back to the user and find out what was intended.  For example,
       it could issue a SPEAK request to a synthesizer resource to emit
       "Did you say 'Boston' or 'Austin'?"

9.6.1.2.  Side Information

   These are elements and attributes representing additional information
   about the interpretation, over and above the interpretation itself.
   Side information includes:

   1.  Whether an interpretation was achieved (the <nomatch> element)
       and the system's confidence in an interpretation (the
       "confidence" attribute of <interpretation>).

   2.  Alternative interpretations (<interpretation>)

   3.  Input formats and Automatic Speech Recognition (ASR) information:
       the <input> element, representing the input to the semantic
       interpreter.

9.6.1.3.  Multi-Modal Integration

   When more than one modality is available for input, the
   interpretation of the inputs needs to be coordinated.  The "mode"
   attribute of <input> supports this by indicating whether the
   utterance was input by speech, DTMF, pointing, etc.  The "timestamp-
   start" and "timestamp-end" attributes of <input> also provide for
   temporal coordination by indicating when inputs occurred.

9.6.2.  Overview of Recognizer Result Elements and Their Relationships

   The recognizer elements in NLSML fall into two categories:

   1.  description of the input that was processed, and

   2.  description of the meaning that was extracted from the input.

   Next to each element are its attributes.  In addition, some elements
   can contain multiple instances of other elements.  For example, a
   <result> can contain multiple <interpretation> elements, each of
   which is taken to be an alternative.  Similarly, <input> can contain
   multiple child <input> elements, which are taken to be cumulative.
   To illustrate the basic usage of these elements, as a simple example,

   consider the utterance "OK" (interpreted as "yes").  The example
   illustrates how that utterance and its interpretation would be
   represented in the NLSML markup.

   <?xml version="1.0"?>
   <result xmlns="urn:ietf:params:xml:ns:mrcpv2"
           xmlns:ex="http://www.example.com/example"
           grammar="http://www.example.com/theYesNoGrammar">
     <interpretation>
        <instance>
           <ex:response>yes</ex:response>
        </instance>
        <input>OK</input>
     </interpretation>
   </result>

   This example includes only the minimum required information.  There
   is an overall <result> element, which includes one interpretation and
   an input element.  The interpretation contains the application-
   specific element "<response>", which is the semantically interpreted
   result.

9.6.3.  Elements and Attributes

9.6.3.1.  <result> Root Element

   The root element of the markup is <result>.  The <result> element
   includes one or more <interpretation> elements.  Multiple
   interpretations can result from ambiguities in the input or in the
   semantic interpretation.  If the "grammar" attribute does not apply
   to all of the interpretations in the result, it can be overridden for
   individual interpretations at the <interpretation> level.

   Attributes:

   1.  grammar: The grammar or recognition rule matched by this result.
       The format of the grammar attribute will match the rule reference
       semantics defined in the grammar specification.  Specifically,
       the rule reference is in the external XML form for grammar rule
       references.  The markup interpreter needs to know the grammar
       rule that is matched by the utterance because multiple rules may
       be simultaneously active.  The value is the grammar URI used by
       the markup interpreter to specify the grammar.  The grammar can
       be overridden by a grammar attribute in the <interpretation>
       element if the input was ambiguous as to which grammar it
       matched.  If all interpretation elements within the result
       element contain their own grammar attributes, the attribute can
       be dropped from the result element.

   <?xml version="1.0"?>
   <result xmlns="urn:ietf:params:xml:ns:mrcpv2"
           grammar="http://www.example.com/grammar">
     <interpretation>
      ....
     </interpretation>
   </result>

9.6.3.2.  <interpretation> Element

   An <interpretation> element contains a single semantic
   interpretation.

   Attributes:

   1.  confidence: A float value from 0.0-1.0 indicating the semantic
       analyzer's confidence in this interpretation.  A value of 1.0
       indicates maximum confidence.  The values are implementation
       dependent but are intended to align with the value interpretation
       for the MRCPv2 Confidence-Threshold header field defined in
       Section 9.4.1.  This attribute is OPTIONAL.

   2.  grammar: The grammar or recognition rule matched by this
       interpretation.  This attribute is only needed under
       <interpretation> if it is necessary to override a grammar that
       was specified at the <result> level.  Note that the grammar
       attribute of the <interpretation> element can be omitted if and
       only if the grammar attribute is specified in the <result>
       element.

   Interpretations MUST be sorted best-first by some measure of
   "goodness".  The goodness measure is "confidence" if present;
   otherwise, it is some implementation-specific indication of quality.

   The grammar is expected to be specified most frequently at the
   <result> level.  However, it can be overridden at the
   <interpretation> level because it is possible that different
   interpretations may match different grammar rules.

   The <interpretation> element includes an optional <input> element
   containing the input being analyzed, and at least one <instance>
   element containing the interpretation of the utterance.

   <interpretation confidence="0.75"
                   grammar="http://www.example.com/grammar">
       ...
   </interpretation>

9.6.3.3.  <instance> Element

   The <instance> element contains the interpretation of the utterance.
   When the Semantic Interpretation for Speech Recognition format is
   used, the <instance> element contains the XML serialization of the
   result using the approach defined in that specification.  When there
   is semantic markup in the grammar that does not create semantic
   objects, but instead only does a semantic translation of a portion of
   the input, such as translating "coke" to "coca-cola", the instance
   contains the whole input but with the translation applied.  The NLSML
   looks like the markup in Figure 2 below.  If there are no semantic
   objects created, nor any semantic translation, the instance value is
   the same as the input value.

   Attributes:

   1.  confidence: Each element of the instance MAY have a confidence
       attribute, defined in the NLSML namespace.  The confidence
       attribute contains a float value in the range from 0.0-1.0
       reflecting the system's confidence in the analysis of that slot.
       A value of 1.0 indicates maximum confidence.  The values are
       implementation dependent, but are intended to align with the
       value interpretation for the MRCPv2 header field Confidence-
       Threshold defined in Section 9.4.1.  This attribute is OPTIONAL.

   <instance>
     <nameAddress>
         <street confidence="0.75">123 Maple Street</street>
         <city>Mill Valley</city>
         <state>CA</state>
         <zip>90952</zip>
     </nameAddress>
   </instance>
   <input>
     My address is 123 Maple Street,
     Mill Valley, California, 90952
   </input>


   <instance>
       I would like to buy a coca-cola
   </instance>
   <input>
     I would like to buy a coke
   </input>

                           Figure 2: NLSML Example

9.6.3.4.  <input> Element

   The <input> element is the text representation of a user's input.  It
   includes an optional "confidence" attribute, which indicates the
   recognizer's confidence in the recognition result (as opposed to the
   confidence in the interpretation, which is indicated by the
   "confidence" attribute of <interpretation>).  Optional "timestamp-
   start" and "timestamp-end" attributes indicate the start and end
   times of a spoken utterance, in ISO 8601 format [ISO.8601.1988].

   Attributes:

   1.  timestamp-start: The time at which the input began. (optional)

   2.  timestamp-end: The time at which the input ended. (optional)

   3.  mode: The modality of the input, for example, speech, DTMF, etc.
       (optional)

   4.  confidence: The confidence of the recognizer in the correctness
       of the input in the range 0.0 to 1.0. (optional)

   Note that it may not make sense for temporally overlapping inputs to
   have the same mode; however, this constraint is not expected to be
   enforced by implementations.

   When there is no time zone designator, ISO 8601 time representations
   default to local time.

   There are three possible formats for the <input> element.

   1.  The <input> element can contain simple text:

       <input>onions</input>

       A future possibility is for <input> to contain not only text but
       additional markup that represents prosodic information that was
       contained in the original utterance and extracted by the speech
       recognizer.  This depends on the availability of ASRs that are
       capable of producing prosodic information.  MRCPv2 clients MUST
       be prepared to receive such markup and MAY make use of it.

   2.  An <input> tag can also contain additional <input> tags.  Having
       additional input elements allows the representation to support
       future multi-modal inputs as well as finer-grained speech
       information, such as timestamps for individual words and word-
       level confidences.

        <input>
             <input mode="speech" confidence="0.5"
                 timestamp-start="2000-04-03T00:00:00"
                 timestamp-end="2000-04-03T00:00:00.2">fried</input>
             <input mode="speech" confidence="1.0"
                 timestamp-start="2000-04-03T00:00:00.25"
                 timestamp-end="2000-04-03T00:00:00.6">onions</input>
        </input>

   3.  Finally, the <input> element can contain <nomatch> and <noinput>
       elements, which describe situations in which the speech
       recognizer received input that it was unable to process or did
       not receive any input at all, respectively.

9.6.3.5.  <nomatch> Element

   The <nomatch> element under <input> is used to indicate that the
   semantic interpreter was unable to successfully match any input with
   confidence above the threshold.  It can optionally contain the text
   of the best of the (rejected) matches.

   <interpretation>
      <instance/>
      <input confidence="0.1">
         <nomatch/>
      </input>
   </interpretation>
   <interpretation>
      <instance/>
      <input mode="speech" confidence="0.1">
        <nomatch>I want to go to New York</nomatch>
      </input>
   </interpretation>

9.6.3.6.  <noinput> Element

   <noinput> indicates that there was no input -- a timeout occurred in
   the speech recognizer due to silence.

   <interpretation>
      <instance/>
      <input>
         <noinput/>
      </input>
   </interpretation>

   If there are multiple levels of inputs, the most natural place for
   <nomatch> and <noinput> elements to appear is under the highest
   level of <input> for <noinput>, and under the appropriate
   mode-specific <input> for <nomatch>.  So, <noinput> means "no input
   at all", and <nomatch> means "no match in speech modality" or "no
   match in DTMF modality".  For example, to represent garbled speech
   combined with DTMF "1 2 3 4", the markup would be:

   <input>
      <input mode="speech"><nomatch/></input>
      <input mode="dtmf">1 2 3 4</input>
   </input>

   Note: while <noinput> could be represented as an attribute of input,
   <nomatch> cannot, since it could potentially include PCDATA content
   with the best match.  For parallelism, <noinput> is also an element.

9.7.  Enrollment Results

   All enrollment elements are contained within a single
   <enrollment-result> element under <result>.  The elements are
   described below and have the schema defined in Section 16.2.  The
   following elements are defined:

   1.  num-clashes

   2.  num-good-repetitions

   3.  num-repetitions-still-needed

   4.  consistency-status

   5.  clash-phrase-ids

   6.  transcriptions

   7.  confusable-phrases

9.7.1.  <num-clashes> Element

   The <num-clashes> element contains the number of clashes that this
   pronunciation has with other pronunciations in an active enrollment
   session.  The associated Clash-Threshold header field determines the
   sensitivity of the clash measurement.  Note that clash testing can be
   turned off completely by setting the Clash-Threshold header field
   value to 0.

9.7.2.  <num-good-repetitions> Element

   The <num-good-repetitions> element contains the number of consistent
   pronunciations obtained so far in an active enrollment session.

9.7.3.  <num-repetitions-still-needed> Element

   The <num-repetitions-still-needed> element contains the number of
   consistent pronunciations that must still be obtained before the new
   phrase can be added to the enrollment grammar.  The number of
   consistent pronunciations required is specified by the client in the
   request header field Num-Min-Consistent-Pronunciations.  The returned
   value must be 0 before the client can successfully commit a phrase to
   the grammar by ending the enrollment session.

9.7.4.  <consistency-status> Element

   The <consistency-status> element is used to indicate how consistent
   the repetitions are when learning a new phrase.  It can have the
   values of consistent, inconsistent, and undecided.

9.7.5.  <clash-phrase-ids> Element

   The <clash-phrase-ids> element contains the phrase IDs of clashing
   pronunciation(s), if any.  This element is absent if there are no
   clashes.

9.7.6.  <transcriptions> Element

   The <transcriptions> element contains the transcriptions returned in
   the last repetition of the phrase being enrolled.

9.7.7.  <confusable-phrases> Element

   The <confusable-phrases> element contains a list of phrases from a
   command grammar that are confusable with the phrase being added to
   the personal grammar.  This element MAY be absent if there are no
   confusable phrases.

9.8.  DEFINE-GRAMMAR

   The DEFINE-GRAMMAR method, from the client to the server, provides
   one or more grammars and requests the server to access, fetch, and
   compile the grammars as needed.  The DEFINE-GRAMMAR method
   implementation MUST do a fetch of all external URIs that are part of
   that operation.  If caching is implemented, this URI fetching MUST
   conform to the cache control hints and parameter header fields
   associated with the method in deciding whether the URIs should be
   fetched from cache or from the external server.  If these hints/
   parameters are not specified in the method, the values set for the
   session using SET-PARAMS/GET-PARAMS apply.  If they were not set for
   the session, their default values apply.

   If the server resource is in the recognition state, the server MUST
   respond to the DEFINE-GRAMMAR request with a failure status.

   If the resource is in the idle state and is able to successfully
   process the supplied grammars, the server MUST return a success
   status-code, and the request-state MUST be COMPLETE.

   If the recognizer resource could not define the grammar for some
   reason (for example, if the download failed, the grammar failed to
   compile, or the grammar was in an unsupported form), the MRCPv2
   response for the DEFINE-GRAMMAR method MUST contain a failure status-
   code of 407 and contain a Completion-Cause header field describing
   the failure reason.

   C->S:MRCP/2.0 ... DEFINE-GRAMMAR 543257
   Channel-Identifier:32AECB23433801@speechrecog
   Content-Type:application/srgs+xml
   Content-ID:<request1@form-level.store>
   Content-Length:...

   <?xml version="1.0"?>

   <!-- the default grammar language is US English -->
   <grammar xmlns="http://www.w3.org/2001/06/grammar"
            xml:lang="en-US" version="1.0">

   <!-- single language attachment to tokens -->
   <rule id="yes">
               <one-of>
                     <item xml:lang="fr-CA">oui</item>
                     <item xml:lang="en-US">yes</item>
               </one-of>
         </rule>

   <!-- single language attachment to a rule expansion -->
         <rule id="request">
               may I speak to
               <one-of xml:lang="fr-CA">
                     <item>Michel Tremblay</item>
                     <item>Andre Roy</item>
               </one-of>
         </rule>

   </grammar>

   S->C:MRCP/2.0 ... 543257 200 COMPLETE
   Channel-Identifier:32AECB23433801@speechrecog
   Completion-Cause:000 success

   C->S:MRCP/2.0 ... DEFINE-GRAMMAR 543258
   Channel-Identifier:32AECB23433801@speechrecog
   Content-Type:application/srgs+xml
   Content-ID:<helpgrammar@root-level.store>
   Content-Length:...

   <?xml version="1.0"?>

   <!-- the default grammar language is US English -->
   <grammar xmlns="http://www.w3.org/2001/06/grammar"
            xml:lang="en-US" version="1.0">

         <rule id="request">
               I need help
         </rule>

   S->C:MRCP/2.0 ... 543258 200 COMPLETE
   Channel-Identifier:32AECB23433801@speechrecog
   Completion-Cause:000 success

   C->S:MRCP/2.0 ... DEFINE-GRAMMAR 543259
   Channel-Identifier:32AECB23433801@speechrecog
   Content-Type:application/srgs+xml
   Content-ID:<request2@field-level.store>
   Content-Length:...

   <?xml version="1.0" encoding="UTF-8"?>

   <!DOCTYPE grammar PUBLIC "-//W3C//DTD GRAMMAR 1.0//EN"
                     "http://www.w3.org/TR/speech-grammar/grammar.dtd">

   <grammar xmlns="http://www.w3.org/2001/06/grammar" xml:lang="en"
   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://www.w3.org/2001/06/grammar
              http://www.w3.org/TR/speech-grammar/grammar.xsd"
              version="1.0" mode="voice" root="basicCmd">

   <meta name="author" content="Stephanie Williams"/>

   <rule id="basicCmd" scope="public">
     <example> please move the window </example>
     <example> open a file </example>

     <ruleref
       uri="http://grammar.example.com/politeness.grxml#startPolite"/>

     <ruleref uri="#command"/>
     <ruleref
       uri="http://grammar.example.com/politeness.grxml#endPolite"/>
   </rule>

   <rule id="command">
     <ruleref uri="#action"/> <ruleref uri="#object"/>
   </rule>

   <rule id="action">
      <one-of>
         <item weight="10"> open   <tag>open</tag>   </item>
         <item weight="2">  close  <tag>close</tag>  </item>
         <item weight="1">  delete <tag>delete</tag> </item>
         <item weight="1">  move   <tag>move</tag>   </item>
      </one-of>
   </rule>

   <rule id="object">
     <item repeat="0-1">
       <one-of>
         <item> the </item>
         <item> a </item>
       </one-of>
     </item>

     <one-of>
         <item> window </item>
         <item> file </item>
         <item> menu </item>
     </one-of>
   </rule>

   </grammar>


   S->C:MRCP/2.0 ... 543259 200 COMPLETE
   Channel-Identifier:32AECB23433801@speechrecog
   Completion-Cause:000 success

   C->S:MRCP/2.0 ... RECOGNIZE 543260
   Channel-Identifier:32AECB23433801@speechrecog
   N-Best-List-Length:2
   Content-Type:text/uri-list
   Content-Length:...

   session:request1@form-level.store
   session:request2@field-level.store
   session:helpgrammar@root-level.store

   S->C:MRCP/2.0 ... 543260 200 IN-PROGRESS
   Channel-Identifier:32AECB23433801@speechrecog

   S->C:MRCP/2.0 ... START-OF-INPUT 543260 IN-PROGRESS
   Channel-Identifier:32AECB23433801@speechrecog

   S->C:MRCP/2.0 ... RECOGNITION-COMPLETE 543260 COMPLETE
   Channel-Identifier:32AECB23433801@speechrecog
   Completion-Cause:000 success
   Waveform-URI:<http://web.media.com/session123/audio.wav>;
                size=124535;duration=2340
   Content-Type:application/nlsml+xml
   Content-Length:...

   <?xml version="1.0"?>
   <result xmlns="urn:ietf:params:xml:ns:mrcpv2"
           xmlns:ex="http://www.example.com/example"
           grammar="session:request1@form-level.store">
       <interpretation>
           <instance name="Person">
               <ex:Person>
                   <ex:Name> Andre Roy </ex:Name>
               </ex:Person>
           </instance>
           <input>   may I speak to Andre Roy </input>
       </interpretation>
   </result>

                          Define Grammar Example

9.9.  RECOGNIZE

   The RECOGNIZE method from the client to the server requests the
   recognizer to start recognition and provides it with one or more
   grammar references for grammars to match against the input media.
   The RECOGNIZE method can carry header fields to control the
   sensitivity, confidence level, and the level of detail in results
   provided by the recognizer.  These header field values override the
   current values set by a previous SET-PARAMS method.

   The RECOGNIZE method can request the recognizer resource to operate
   in normal or hotword mode as specified by the Recognition-Mode header
   field.  The default value is "normal".  If the resource could not
   start a recognition, the server MUST respond with a failure status-

   code of 407 and a Completion-Cause header field in the response
   describing the cause of failure.

   The RECOGNIZE request uses the message body to specify the grammars
   applicable to the request.  The active grammar(s) for the request can
   be specified in one of three ways.  If the client needs to explicitly
   control grammar weights for the recognition operation, it MUST employ
   method 3 below.  The order of these grammars specifies the precedence
   of the grammars that is used when more than one grammar in the list
   matches the speech; in this case, the grammar with the highest
   precedence is returned as the match.  This precedence capability is
   useful in applications like VoiceXML browsers to order grammars
   specified at the dialog, document, and root level of a VoiceXML
   application.

   1.  The grammar MAY be placed directly in the message body as typed
       content.  If more than one grammar is included in the body, the
       order of inclusion controls the corresponding precedence for the
       grammars during recognition, with earlier grammars in the body
       having a higher precedence than later ones.

   2.  The body MAY contain a list of grammar URIs specified in content
       of media type 'text/uri-list' [RFC2483].  The order of the URIs
       determines the corresponding precedence for the grammars during
       recognition, with highest precedence first and decreasing for
       each URI thereafter.

   3.  The body MAY contain a list of grammar URIs specified in content
       of media type 'text/grammar-ref-list'.  This type defines a list
       of grammar URIs and allows each grammar URI to be assigned a
       weight in the list.  This weight has the same meaning as the
       weights described in Section 2.4.1 of the Speech Recognition
       Grammar Specification (SRGS) [W3C.REC-speech-grammar-20040316].
       An illustrative body of this type is sketched below.
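
   For illustration, a RECOGNIZE request using method 3 might look as
   follows.  This is a sketch only: the request-id, grammar URIs, and
   weight values are invented, and the entry syntax follows the
   'text/grammar-ref-list' media type definition elsewhere in this
   specification.

   C->S:MRCP/2.0 ... RECOGNIZE 543270
   Channel-Identifier:32AECB23433801@speechrecog
   Content-Type:text/grammar-ref-list
   Content-Length:...

   <http://www.example.com/grammars/menu.grxml>;weight="2.0"
   <http://www.example.com/grammars/help.grxml>;weight="1.0"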

   In addition to performing recognition on the input, the recognizer
   MUST also enroll the collected utterance in a personal grammar if the
   Enroll-Utterance header field is set to true and an Enrollment is
   active (via an earlier execution of the START-PHRASE-ENROLLMENT
   method).  If so, and if the RECOGNIZE request contains a Content-ID
   header field, then the resulting grammar (which includes the personal
   grammar as a sub-grammar) can be referenced through the 'session' URI
   scheme (see Section 13.6).

   If the resource was able to successfully start the recognition, the
   server MUST return a success status-code and a request-state of
   IN-PROGRESS.  This means that the recognizer is active and that the
   client MUST be prepared to receive further events with this
   request-id.

   If the resource was able to queue the request, the server MUST return
   a success code and request-state of PENDING.  This means that the
   recognizer is currently active with another request and that this
   request has been queued for processing.

   If the resource could not start a recognition, the server MUST
   respond with a failure status-code of 407 and a Completion-Cause
   header field in the response describing the cause of failure.

   For the recognizer resource, RECOGNIZE and INTERPRET are the only
   requests that return a request-state of IN-PROGRESS, meaning that
   recognition is in progress.  When the recognition completes by
   matching one of the grammar alternatives or by a timeout without a
   match or for some other reason, the recognizer resource MUST send the
   client a RECOGNITION-COMPLETE event (or INTERPRETATION-COMPLETE, if
   INTERPRET was the request) with the result of the recognition and a
   request-state of COMPLETE.

   Large grammars can take a long time for the server to compile.  For
   grammars that are used repeatedly, the client can improve server
   performance by issuing a DEFINE-GRAMMAR request with the grammar
   ahead of time.  In such a case, the client can issue the RECOGNIZE
   request and reference the grammar through the 'session' URI scheme
   (see Section 13.6).  This also applies in general if the client wants
   to repeat recognition with a previous inline grammar.

   The RECOGNIZE method implementation MUST do a fetch of all external
   URIs that are part of that operation.  If caching is implemented,
   this URI fetching MUST conform to the cache control hints and
   parameter header fields associated with the method in deciding
   whether it should be fetched from cache or from the external server.
   If these hints/parameters are not specified in the method, the values
   set for the session using SET-PARAMS/GET-PARAMS apply.  If they were
   not set for the session, their default values apply.

   Note that since the audio and the messages are carried over separate
   communication paths there may be a race condition between the start
   of the flow of audio and the receipt of the RECOGNIZE method.  For
   example, if an audio flow is started by the client at the same time
   as the RECOGNIZE method is sent, either the audio or the RECOGNIZE
   can arrive at the recognizer first.  As another example, the client
   may choose to continuously send audio to the server and signal the
   server to recognize using the RECOGNIZE method.  Mechanisms to
   resolve this condition are outside the scope of this specification.
   The recognizer can expect the media to start flowing when it receives
   the RECOGNIZE request, but it MUST NOT buffer anything it receives
   beforehand in order to preserve the semantics that application
   authors expect with respect to the input timers.

   When a RECOGNIZE method has been received, the recognition is
   initiated on the stream.  The No-Input-Timer MUST be started at this
   time if the Start-Input-Timers header field is specified as "true".
   If this header field is set to "false", the No-Input-Timer MUST be
   started when the resource receives the START-INPUT-TIMERS method
   from the client.  The Recognition-Timeout MUST be started when the
   recognition
   resource detects speech or a DTMF digit in the media stream.

   For recognition when not in hotword mode:

   When the recognizer resource detects speech or a DTMF digit in the
   media stream, it MUST send the START-OF-INPUT event.  When enough
   speech has been collected for the server to process, the recognizer
   can try to match the collected speech with the active grammars.  If
   the speech collected at this point fully matches with any of the
   active grammars, the Speech-Complete-Timer is started.  If it matches
   partially with one or more of the active grammars, with more speech
   needed before a full match is achieved, then the Speech-Incomplete-
   Timer is started.

   1.  When the No-Input-Timer expires, the recognizer MUST complete
       with a Completion-Cause code of "no-input-timeout".

   2.  The recognizer MUST support detecting a no-match condition upon
       detecting end of speech.  The recognizer MAY support detecting a
       no-match condition before waiting for end-of-speech.  If this is
       supported, this capability is enabled by setting the Early-No-
       Match header field to "true".  Upon detecting a no-match
       condition, the RECOGNIZE MUST return with "no-match".

   3.  When the Speech-Incomplete-Timer expires, the recognizer SHOULD
       complete with a Completion-Cause code of "partial-match", unless
       the recognizer cannot differentiate a partial-match, in which
       case it MUST return a Completion-Cause code of "no-match".  The
       recognizer MAY return results for the partially matched grammar.

   4.  When the Speech-Complete-Timer expires, the recognizer MUST
       complete with a Completion-Cause code of "success".

   5.  When the Recognition-Timeout expires, one of the following MUST
       happen:

       5.1.  If there was a partial-match, the recognizer SHOULD
             complete with a Completion-Cause code of "partial-match-
             maxtime", unless the recognizer cannot differentiate a
             partial-match, in which case it MUST complete with a
             Completion-Cause code of "no-match-maxtime".  The
             recognizer MAY return results for the partially matched
             grammar.

       5.2.  If there was a full-match, the recognizer MUST complete
             with a Completion-Cause code of "success-maxtime".

       5.3.  If there was a no match, the recognizer MUST complete with
             a Completion-Cause code of "no-match-maxtime".

   For recognition in hotword mode:

   Note that for recognition in hotword mode the START-OF-INPUT event is
   not generated when speech or a DTMF digit is detected.

   1.  When the No-Input-Timer expires, the recognizer MUST complete
       with a Completion-Cause code of "no-input-timeout".

   2.  If at any point a match occurs, the RECOGNIZE MUST complete with
       a Completion-Cause code of "success".

   3.  When the Recognition-Timeout expires and there is not a match,
       the RECOGNIZE MUST complete with a Completion-Cause code of
       "hotword-maxtime".

   4.  When the Recognition-Timeout expires and there is a match, the
       RECOGNIZE MUST complete with a Completion-Cause code of "success-
       maxtime".

   5.  When the Recognition-Timeout is running but the detected speech/
       DTMF has not resulted in a match, the Recognition-Timeout MUST be
       stopped and reset.  It MUST then be restarted when speech/DTMF is
       again detected.

   Below is a complete example of using RECOGNIZE.  It shows the call to
   RECOGNIZE, the IN-PROGRESS and START-OF-INPUT status messages, and
   the final RECOGNITION-COMPLETE message containing the result.

   C->S:MRCP/2.0 ... RECOGNIZE 543257
   Channel-Identifier:32AECB23433801@speechrecog
   Confidence-Threshold:0.9
   Content-Type:application/srgs+xml
   Content-ID:<request1@form-level.store>
   Content-Length:...

   <?xml version="1.0"?>

   <!-- the default grammar language is US English -->
   <grammar xmlns="http://www.w3.org/2001/06/grammar"
            xml:lang="en-US" version="1.0" root="request">

   <!-- single language attachment to tokens -->
       <rule id="yes">
               <one-of>
                     <item xml:lang="fr-CA">oui</item>
                     <item xml:lang="en-US">yes</item>
               </one-of>
         </rule>

   <!-- single language attachment to a rule expansion -->
         <rule id="request">
               may I speak to
               <one-of xml:lang="fr-CA">
                     <item>Michel Tremblay</item>
                     <item>Andre Roy</item>
               </one-of>
         </rule>

   </grammar>

   S->C: MRCP/2.0 ... 543257 200 IN-PROGRESS
   Channel-Identifier:32AECB23433801@speechrecog

   S->C:MRCP/2.0 ... START-OF-INPUT 543257 IN-PROGRESS
   Channel-Identifier:32AECB23433801@speechrecog

   S->C:MRCP/2.0 ... RECOGNITION-COMPLETE 543257 COMPLETE
   Channel-Identifier:32AECB23433801@speechrecog
   Completion-Cause:000 success
   Waveform-URI:<http://web.media.com/session123/audio.wav>;
                 size=424252;duration=2543
   Content-Type:application/nlsml+xml
   Content-Length:...

   <?xml version="1.0"?>
   <result xmlns="urn:ietf:params:xml:ns:mrcpv2"
           xmlns:ex="http://www.example.com/example"
           grammar="session:request1@form-level.store">
       <interpretation>
           <instance name="Person">
               <ex:Person>
                   <ex:Name> Andre Roy </ex:Name>
               </ex:Person>
           </instance>
               <input>   may I speak to Andre Roy </input>
       </interpretation>
   </result>

   Below is an example of calling RECOGNIZE with a different grammar.
   No status or completion messages are shown in this example, although
   they would of course occur in normal usage.

   C->S:   MRCP/2.0 ... RECOGNIZE 543257
           Channel-Identifier:32AECB23433801@speechrecog
           Confidence-Threshold:0.9
           Fetch-Timeout:20
           Content-Type:application/srgs+xml
           Content-Length:...

           <?xml version="1.0"? Version="1.0" mode="voice"
                 root="Basic md">
            <rule id="rule_list" scope="public">
                <one-of>
                    <item weight=10>
                        <ruleref uri=
               "http://grammar.example.com/world-cities.grxml#canada"/>
                   </item>
                   <item weight=1.5>
                       <ruleref uri=
               "http://grammar.example.com/world-cities.grxml#america"/>
                   </item>
                  <item weight=0.5>
                       <ruleref uri=
               "http://grammar.example.com/world-cities.grxml#india"/>
                  </item>
              </one-of>
           </rule>

9.10.  STOP

   The STOP method from the client to the server tells the resource to
   stop recognition if a request is active.  If a RECOGNIZE request is
   active and the STOP request successfully terminated it, then the
   response header section contains an Active-Request-Id-List header
   field containing the request-id of the RECOGNIZE request that was
   terminated.  In this case, no RECOGNITION-COMPLETE event is sent for
   the terminated request.  If there was no recognition active, then the
   response MUST NOT contain an Active-Request-Id-List header field.
   Either way, the response MUST contain a status-code of 200 "Success".

   C->S:   MRCP/2.0 ... RECOGNIZE 543257
           Channel-Identifier:32AECB23433801@speechrecog
           Confidence-Threshold:0.9
           Content-Type:application/srgs+xml
           Content-ID:<request1@form-level.store>
           Content-Length:...

           <?xml version="1.0"?>

           <!-- the default grammar language is US English -->
           <grammar xmlns="http://www.w3.org/2001/06/grammar"
                    xml:lang="en-US" version="1.0" root="request">

           <!-- single language attachment to tokens -->
               <rule id="yes">
                   <one-of>
                         <item xml:lang="fr-CA">oui</item>
                         <item xml:lang="en-US">yes</item>
                   </one-of>
               </rule>

           <!-- single language attachment to a rule expansion -->
               <rule id="request">
               may I speak to
                   <one-of xml:lang="fr-CA">
                         <item>Michel Tremblay</item>
                         <item>Andre Roy</item>
                   </one-of>
               </rule>
           </grammar>

   S->C:   MRCP/2.0 ... 543257 200 IN-PROGRESS
           Channel-Identifier:32AECB23433801@speechrecog

   C->S:   MRCP/2.0 ... STOP 543258
           Channel-Identifier:32AECB23433801@speechrecog

   S->C:   MRCP/2.0 ... 543258 200 COMPLETE
           Channel-Identifier:32AECB23433801@speechrecog
           Active-Request-Id-List:543257
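
   Had no recognition been active when the STOP request arrived, the
   response would omit the Active-Request-Id-List header field.  A
   minimal sketch (with an invented request-id):

   C->S:   MRCP/2.0 ... STOP 543259
           Channel-Identifier:32AECB23433801@speechrecog

   S->C:   MRCP/2.0 ... 543259 200 COMPLETE
           Channel-Identifier:32AECB23433801@speechrecog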

9.11.  GET-RESULT

   The GET-RESULT method from the client to the server MAY be issued
   when the recognizer resource is in the recognized state.  This
   request allows the client to retrieve results for a completed
   recognition.  This is useful if the client decides it wants more
   alternatives or more information.  When the server receives this
   request, it re-computes and returns the results according to the
   recognition constraints provided in the GET-RESULT request.

   The GET-RESULT request can specify constraints such as a different
   confidence-threshold or n-best-list-length.  This capability is
   OPTIONAL for MRCPv2 servers, and a server that does not support it
   MUST return a status of unsupported feature.

   C->S:   MRCP/2.0 ... GET-RESULT 543257
           Channel-Identifier:32AECB23433801@speechrecog
           Confidence-Threshold:0.9


   S->C:   MRCP/2.0 ... 543257 200 COMPLETE
           Channel-Identifier:32AECB23433801@speechrecog
           Content-Type:application/nlsml+xml
           Content-Length:...

           <?xml version="1.0"?>
           <result xmlns="urn:ietf:params:xml:ns:mrcpv2"
                   xmlns:ex="http://www.example.com/example"
                   grammar="session:request1@form-level.store">
               <interpretation>
                   <instance name="Person">
                       <ex:Person>
                           <ex:Name> Andre Roy </ex:Name>
                       </ex:Person>
                   </instance>
                   <input>   may I speak to Andre Roy </input>
               </interpretation>
           </result>

9.12.  START-OF-INPUT

   This is an event from the server to the client indicating that the
   recognizer resource has detected speech or a DTMF digit in the media
   stream.  This event is useful in implementing kill-on-barge-in
   scenarios when a synthesizer resource is in a different session from
   the recognizer resource and hence is not aware of an incoming audio
   source (see Section 8.4.2).  In these cases, it is up to the client
   to act as an intermediary and respond to this event by issuing a
   BARGE-IN-OCCURRED method to the synthesizer resource.  The
   recognizer resource also MUST send a Proxy-Sync-Id header field with
   a unique value for this event.

   This event MUST be generated by the server, irrespective of whether
   or not the synthesizer and recognizer are on the same server.
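
   A minimal sketch of this event (the request-id and Proxy-Sync-Id
   value are invented for illustration):

   S->C:   MRCP/2.0 ... START-OF-INPUT 543259 IN-PROGRESS
           Channel-Identifier:32AECB23433801@speechrecog
           Proxy-Sync-Id:987654321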

9.13.  START-INPUT-TIMERS

   This request is sent from the client to the recognizer resource when
   it knows that a kill-on-barge-in prompt has finished playing (see
   Section 8.4.2).  This is useful in the scenario when the recognition
   and synthesizer engines are not in the same session.  When a kill-on-
   barge-in prompt is being played, the client may want a RECOGNIZE
   request to be simultaneously active so that it can detect and
   implement kill-on-barge-in.  But at the same time the client doesn't
   want the recognizer to start the no-input timers until the prompt is
   finished.  The Start-Input-Timers header field in the RECOGNIZE
   request allows the client to say whether or not the timers should be
   started immediately.  If not, the recognizer resource MUST NOT start
   the timers until the client sends a START-INPUT-TIMERS method to the
   recognizer.
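
   A minimal sketch of the exchange (the request-id is invented for
   illustration):

   C->S:   MRCP/2.0 ... START-INPUT-TIMERS 543259
           Channel-Identifier:32AECB23433801@speechrecog

   S->C:   MRCP/2.0 ... 543259 200 COMPLETE
           Channel-Identifier:32AECB23433801@speechrecog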

9.14.  RECOGNITION-COMPLETE

   This is an event from the recognizer resource to the client
   indicating that the recognition completed.  The recognition result is
   sent in the body of the MRCPv2 message.  The request-state field MUST
   be COMPLETE indicating that this is the last event with that
   request-id and that the request with that request-id is now complete.
   The server MUST maintain the recognizer context containing the
   results and the audio waveform input of that recognition until the
   next RECOGNIZE request is issued for that resource or the session
   terminates.  If the server returns a URI to the audio waveform, it
   MUST do so in a Waveform-URI header field in the RECOGNITION-COMPLETE
   event.  The client can use this URI to retrieve or play back the
   audio.

   Note, if an enrollment session was active, the RECOGNITION-COMPLETE
   event can contain either recognition or enrollment results depending
   on what was spoken.  The following example shows a complete exchange
   with a recognition result.

   C->S:   MRCP/2.0 ... RECOGNIZE 543257
           Channel-Identifier:32AECB23433801@speechrecog
           Confidence-Threshold:0.9
           Content-Type:application/srgs+xml
           Content-ID:<request1@form-level.store>
           Content-Length:...

           <?xml version="1.0"?>

           <!-- the default grammar language is US English -->
           <grammar xmlns="http://www.w3.org/2001/06/grammar"
                    xml:lang="en-US" version="1.0" root="request">

           <!-- single language attachment to tokens -->
               <rule id="yes">
                      <one-of>
                          <item xml:lang="fr-CA">oui</item>
                          <item xml:lang="en-US">yes</item>
                      </one-of>
                 </rule>

           <!-- single language attachment to a rule expansion -->
                 <rule id="request">
                     may I speak to
                      <one-of xml:lang="fr-CA">
                             <item>Michel Tremblay</item>
                             <item>Andre Roy</item>
                      </one-of>
                 </rule>
           </grammar>

   S->C:   MRCP/2.0 ... 543257 200 IN-PROGRESS
           Channel-Identifier:32AECB23433801@speechrecog

   S->C:   MRCP/2.0 ... START-OF-INPUT 543257 IN-PROGRESS
           Channel-Identifier:32AECB23433801@speechrecog

   S->C:   MRCP/2.0 ... RECOGNITION-COMPLETE 543257 COMPLETE
           Channel-Identifier:32AECB23433801@speechrecog
           Completion-Cause:000 success
           Waveform-URI:<http://web.media.com/session123/audio.wav>;
                        size=342456;duration=25435
           Content-Type:application/nlsml+xml
           Content-Length:...

           <?xml version="1.0"?>
           <result xmlns="urn:ietf:params:xml:ns:mrcpv2"
                   xmlns:ex="http://www.example.com/example"
                   grammar="session:request1@form-level.store">
               <interpretation>
                   <instance name="Person">
                       <ex:Person>
                           <ex:Name> Andre Roy </ex:Name>
                       </ex:Person>
                   </instance>
                   <input>   may I speak to Andre Roy </input>
               </interpretation>
           </result>

   If the result were instead an enrollment result, the final message
   from the server above could have been:

   S->C:   MRCP/2.0 ... RECOGNITION-COMPLETE 543257 COMPLETE
           Channel-Identifier:32AECB23433801@speechrecog
           Completion-Cause:000 success
           Content-Type:application/nlsml+xml
           Content-Length:...

           <?xml version= "1.0"?>
           <result xmlns="urn:ietf:params:xml:ns:mrcpv2"
                   grammar="Personal-Grammar-URI">
               <enrollment-result>
                   <num-clashes> 2 </num-clashes>
                   <num-good-repetitions> 1 </num-good-repetitions>
                   <num-repetitions-still-needed>
                      1
                   </num-repetitions-still-needed>
                   <consistency-status> consistent </consistency-status>
                   <clash-phrase-ids>
                       <item> Jeff </item> <item> Andre </item>
                   </clash-phrase-ids>
                   <transcriptions>
                        <item> m ay b r ow k er </item>
                        <item> m ax r aa k ah </item>
                   </transcriptions>

                   <confusable-phrases>
                        <item>
                             <phrase> call </phrase>
                             <confusion-level> 10 </confusion-level>
                        </item>
                   </confusable-phrases>
               </enrollment-result>
           </result>

9.15.  START-PHRASE-ENROLLMENT

   The START-PHRASE-ENROLLMENT method from the client to the server
   starts a new phrase enrollment session during which the client can
   call RECOGNIZE multiple times to enroll a new utterance in a grammar.
   An enrollment session consists of a set of calls to RECOGNIZE in
   which the caller speaks a phrase several times so the system can
   "learn" it.  The phrase is then added to a personal grammar (speaker-
   trained grammar), so that the system can recognize it later.

   Only one phrase enrollment session can be active at a time for a
   resource.  The Personal-Grammar-URI identifies the grammar that is
   used during enrollment to store the personal list of phrases.  Once
   RECOGNIZE is called, the result is returned in a RECOGNITION-COMPLETE
   event and will contain either an enrollment result OR a recognition
   result for a regular recognition.

   Calling END-PHRASE-ENROLLMENT ends the ongoing phrase enrollment
   session, which is typically done after a sequence of successful calls
   to RECOGNIZE.  This method can be called to commit the new phrase to
   the personal grammar or to abort the phrase enrollment session.

   The grammar to contain the new enrolled phrase, specified by
   Personal-Grammar-URI, is created if it does not exist.  Also, the
   personal grammar MUST ONLY contain phrases added via a phrase
   enrollment session.

   The Phrase-ID passed to this method is used to identify this phrase
   in the grammar and will be returned as the speech input when doing a
   RECOGNIZE on the grammar.  The Phrase-NL similarly is returned in a
   RECOGNITION-COMPLETE event in the same manner as other Natural
   Language (NL) in a grammar.  The tag-format of this NL is
   implementation specific.

   If the client has specified Save-Best-Waveform as true, then the
   response after ending the phrase enrollment session MUST contain the
   location/URI of a recording of the best repetition of the learned
   phrase.

   C->S:   MRCP/2.0 ... START-PHRASE-ENROLLMENT 543258
           Channel-Identifier:32AECB23433801@speechrecog
           Num-Min-Consistent-Pronunciations:2
           Consistency-Threshold:30
           Clash-Threshold:12
           Personal-Grammar-URI:<personal grammar uri>
           Phrase-Id:<phrase id>
           Phrase-NL:<NL phrase>
           Weight:1
           Save-Best-Waveform:true

   S->C:   MRCP/2.0 ... 543258 200 COMPLETE
           Channel-Identifier:32AECB23433801@speechrecog

9.16.  ENROLLMENT-ROLLBACK

   The ENROLLMENT-ROLLBACK method discards the last live utterance from
   the RECOGNIZE operation.  The client can invoke this method when the
   caller provides undesirable input such as non-speech noises, side-
   speech, commands, utterances from the RECOGNIZE grammar, etc.  Note
   that this method does not provide a stack of rollback states.
   Executing ENROLLMENT-ROLLBACK twice in succession without an
   intervening recognition operation has no effect the second time.

   C->S:   MRCP/2.0 ... ENROLLMENT-ROLLBACK 543261
           Channel-Identifier:32AECB23433801@speechrecog

   S->C:   MRCP/2.0 ... 543261 200 COMPLETE
           Channel-Identifier:32AECB23433801@speechrecog

9.17.  END-PHRASE-ENROLLMENT

   The client MAY call the END-PHRASE-ENROLLMENT method ONLY during an
   active phrase enrollment session.  It MUST NOT be called during an
   ongoing RECOGNIZE operation.  To commit the new phrase in the
   grammar, the client MAY call this method once successive calls to
   RECOGNIZE have succeeded and Num-Repetitions-Still-Needed has been
   returned as 0 in the RECOGNITION-COMPLETE event.  Alternatively, the
   client MAY abort the phrase enrollment session by calling this method
   with the Abort-Phrase-Enrollment header field.

   If the client has specified Save-Best-Waveform as "true" in the
   START-PHRASE-ENROLLMENT request, then the response MUST contain a
   Waveform-URI header whose value is the location/URI of a recording of
   the best repetition of the learned phrase.

  C->S:   MRCP/2.0 ... END-PHRASE-ENROLLMENT 543262
          Channel-Identifier:32AECB23433801@speechrecog

  S->C:   MRCP/2.0 ... 543262 200 COMPLETE
          Channel-Identifier:32AECB23433801@speechrecog
          Waveform-URI:<http://mediaserver.com/recordings/file1324.wav>;
                       size=242453;duration=25432

9.18.  MODIFY-PHRASE

   The MODIFY-PHRASE method sent from the client to the server is used
   to change the phrase ID, NL phrase, and/or weight for a given phrase
   in a personal grammar.

   If no fields are supplied, then calling this method has no effect.

   C->S:   MRCP/2.0 ... MODIFY-PHRASE 543265
           Channel-Identifier:32AECB23433801@speechrecog
           Personal-Grammar-URI:<personal grammar uri>
           Phrase-Id:<phrase id>
           New-Phrase-Id:<new phrase id>
           Phrase-NL:<NL phrase>
           Weight:1

   S->C:   MRCP/2.0 ... 543265 200 COMPLETE
           Channel-Identifier:32AECB23433801@speechrecog

9.19.  DELETE-PHRASE

   The DELETE-PHRASE method sent from the client to the server is used
   to delete a phrase that is in a personal grammar and was added
   through voice enrollment or text enrollment.  If the specified
   phrase does not exist, this method has no effect.

   C->S:   MRCP/2.0 ... DELETE-PHRASE 543266
           Channel-Identifier:32AECB23433801@speechrecog
           Personal-Grammar-URI:<personal grammar uri>
           Phrase-Id:<phrase id>

   S->C:   MRCP/2.0 ... 543266 200 COMPLETE
           Channel-Identifier:32AECB23433801@speechrecog

9.20.  INTERPRET

   The INTERPRET method from the client to the server takes as input an
   Interpret-Text header field containing the text for which the
   semantic interpretation is desired, and returns, via the
   INTERPRETATION-COMPLETE event, an interpretation result that is very
   similar to the one returned from a RECOGNIZE method invocation; the
   only difference is that portions of the result relevant to acoustic
   matching are excluded.  The Interpret-Text header field MUST be
   included in the INTERPRET request.

   Recognizer grammar data is treated in the same way as it is when
   issuing a RECOGNIZE method call.

   If a RECOGNIZE, RECORD, or another INTERPRET operation is already in
   progress for the resource, the server MUST reject the request with a
   response having a status-code of 402 "Method not valid in this
   state", and a COMPLETE request state.

   C->S:   MRCP/2.0 ... INTERPRET 543266
           Channel-Identifier:32AECB23433801@speechrecog
           Interpret-Text:may I speak to Andre Roy
           Content-Type:application/srgs+xml
           Content-ID:<request1@form-level.store>
           Content-Length:...

           <?xml version="1.0"?>
           <!-- the default grammar language is US English -->
           <grammar xmlns="http://www.w3.org/2001/06/grammar"
                    xml:lang="en-US" version="1.0" root="request">
           <!-- single language attachment to tokens -->
               <rule id="yes">
                   <one-of>
                       <item xml:lang="fr-CA">oui</item>
                       <item xml:lang="en-US">yes</item>
                   </one-of>
               </rule>

           <!-- single language attachment to a rule expansion -->
               <rule id="request">
                   may I speak to
                   <one-of xml:lang="fr-CA">
                       <item>Michel Tremblay</item>
                       <item>Andre Roy</item>
                   </one-of>
               </rule>
           </grammar>

   S->C:   MRCP/2.0 ... 543266 200 IN-PROGRESS
           Channel-Identifier:32AECB23433801@speechrecog

   S->C:   MRCP/2.0 ... INTERPRETATION-COMPLETE 543266 COMPLETE
           Channel-Identifier:32AECB23433801@speechrecog
           Completion-Cause:000 success
           Content-Type:application/nlsml+xml
           Content-Length:...

           <?xml version="1.0"?>
           <result xmlns="urn:ietf:params:xml:ns:mrcpv2"
                   xmlns:ex="http://www.example.com/example"
                   grammar="session:request1@form-level.store">
               <interpretation>
                   <instance name="Person">
                       <ex:Person>
                           <ex:Name> Andre Roy </ex:Name>
                       </ex:Person>
                   </instance>
                   <input>   may I speak to Andre Roy </input>
               </interpretation>
           </result>

9.21.  INTERPRETATION-COMPLETE

   This event from the recognizer resource to the client indicates that
   the INTERPRET operation is complete.  The interpretation result is
   sent in the body of the MRCP message.  The request state MUST be set
   to COMPLETE.

   The Completion-Cause header field MUST be included in this event and
   MUST be set to an appropriate value from the list of cause codes.

   C->S:   MRCP/2.0 ... INTERPRET 543266
           Channel-Identifier:32AECB23433801@speechrecog
           Interpret-Text:may I speak to Andre Roy
           Content-Type:application/srgs+xml
           Content-ID:<request1@form-level.store>
           Content-Length:...

           <?xml version="1.0"?>
           <!-- the default grammar language is US English -->
           <grammar xmlns="http://www.w3.org/2001/06/grammar"
                    xml:lang="en-US" version="1.0" root="request">
           <!-- single language attachment to tokens -->
               <rule id="yes">
                   <one-of>
                       <item xml:lang="fr-CA">oui</item>
                       <item xml:lang="en-US">yes</item>
                   </one-of>
               </rule>

           <!-- single language attachment to a rule expansion -->
               <rule id="request">
                   may I speak to
                   <one-of xml:lang="fr-CA">
                       <item>Michel Tremblay</item>
                       <item>Andre Roy</item>
                   </one-of>
               </rule>
           </grammar>

   S->C:   MRCP/2.0 ... 543266 200 IN-PROGRESS
           Channel-Identifier:32AECB23433801@speechrecog

   S->C:   MRCP/2.0 ... INTERPRETATION-COMPLETE 543266 200 COMPLETE
           Channel-Identifier:32AECB23433801@speechrecog
           Completion-Cause:000 success
           Content-Type:application/nlsml+xml
           Content-Length:...

           <?xml version="1.0"?>
           <result xmlns="urn:ietf:params:xml:ns:mrcpv2"
                   xmlns:ex="http://www.example.com/example"
                   grammar="session:request1@form-level.store">
               <interpretation>
                   <instance name="Person">
                       <ex:Person>
                           <ex:Name> Andre Roy </ex:Name>
                       </ex:Person>
                   </instance>
                   <input>   may I speak to Andre Roy </input>
               </interpretation>
           </result>
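
   As a non-normative illustration, a client might validate this event
   along the following lines.  The simplistic header parsing and the
   handling of the start-line purely by its first and last tokens are
   assumptions of this sketch, not conformance requirements.

   def parse_interpretation_complete(raw: bytes):
       """Extract (cause-code, cause-name, body) from the event."""
       head, _, body = raw.partition(b"\r\n\r\n")
       lines = head.decode("utf-8").split("\r\n")
       tokens = lines[0].split()
       assert tokens[0] == "MRCP/2.0"
       request_state = tokens[-1]   # MUST be COMPLETE for this event
       headers = {}
       for line in lines[1:]:
           name, _, value = line.partition(":")
           headers[name.strip().lower()] = value.strip()
       cause = headers.get("completion-cause")  # MUST be present
       if request_state != "COMPLETE" or cause is None:
           raise ValueError("malformed INTERPRETATION-COMPLETE event")
       code, _, name = cause.partition(" ")
       return int(code), name, body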

9.22.  DTMF Detection

   Digits received as DTMF tones are delivered to the recognition
   resource in the MRCPv2 server in the RTP stream according to RFC 4733
   [RFC4733].  The Automatic Speech Recognizer (ASR) MUST support RFC
   4733 to recognize digits, and it MAY support recognizing DTMF tones
   [Q.23] in the audio.
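
   For illustration only, the 4-octet telephone-event payload defined
   by RFC 4733 can be decoded as sketched below; RTP header parsing,
   reordering, and duplicate suppression are assumed to have been done
   elsewhere.

   import struct

   DTMF_EVENTS = "0123456789*#ABCD"   # events 0-15 per RFC 4733

   def decode_telephone_event(payload: bytes):
       """Return (digit, end-of-event flag, volume, duration)."""
       event, flags, duration = struct.unpack("!BBH", payload[:4])
       end_bit = bool(flags & 0x80)   # E bit: final packet of event
       volume = flags & 0x3F          # power level in -dBm0
       digit = DTMF_EVENTS[event] if event < 16 else None
       return digit, end_bit, volume, duration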

10.  Recorder Resource

   This resource captures received audio and video and stores it as
   content pointed to by a URI.  The main usages of recorders are

   1.  capturing speech audio that may be submitted for recognition at
       a later time, and

   2.  recording voice or video mail.

   Both of these applications require functionality above and beyond
   that specified by protocols such as RTSP [RFC2326], including audio
   endpointing (i.e., detecting speech or silence).  Support for video
   is OPTIONAL and is mainly intended for capturing video mail, which
   may also require the speech or audio processing mentioned above.

   A recorder MUST provide endpointing capabilities for suppressing
   silence at the beginning and end of a recording, and it MAY also
   suppress silence in the middle of a recording.  If such suppression
   is done, the recorder MUST maintain timing metadata to indicate the
   actual time stamps of the recorded media.
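
   As a non-normative sketch of the timing-metadata requirement above,
   a recorder could track the spans of silence it removed and map
   positions in the stored audio back to the original capture times.
   The representation below is an assumption, not a mandated format.

   class SuppressionMap:
       def __init__(self):
           self.removed = []   # (start_ms, length_ms) spans dropped

       def note_suppressed(self, start_ms, length_ms):
           self.removed.append((start_ms, length_ms))

       def original_time(self, recorded_ms):
           """Map a stored-audio position back to capture time."""
           t = recorded_ms
           for start, length in sorted(self.removed):
               if start <= t:
                   t += length
           return t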

   See the discussion on the sensitivity of saved waveforms in
   Section 12.

10.1.  Recorder State Machine

   Idle                   Recording
   State                  State
    |                       |
    |---------RECORD------->|
    |                       |
    |<------STOP------------|
    |                       |
    |<--RECORD-COMPLETE-----|
    |                       |
    |              |--------|
    |       START-OF-INPUT  |
    |              |------->|
    |                       |
    |              |--------|
    |    START-INPUT-TIMERS |
    |              |------->|
    |                       |

                          Recorder State Machine
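
   The state machine above can be rendered as a simple transition
   table; the sketch below is non-normative, and the handling of
   invalid methods (cf. status-code 402) is an assumption.

   TRANSITIONS = {
       ("IDLE", "RECORD"): "RECORDING",
       ("RECORDING", "STOP"): "IDLE",
       ("RECORDING", "RECORD-COMPLETE"): "IDLE",
       # START-OF-INPUT and START-INPUT-TIMERS leave the recorder
       # in the Recording state.
       ("RECORDING", "START-OF-INPUT"): "RECORDING",
       ("RECORDING", "START-INPUT-TIMERS"): "RECORDING",
   }

   def next_state(state, message):
       try:
           return TRANSITIONS[(state, message)]
       except KeyError:
           # e.g., a second RECORD while already recording
           raise ValueError("Method not valid in this state")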

10.2.  Recorder Methods

   The recorder resource supports the following methods.

   recorder-method      =  "RECORD"
                        /  "STOP"
                        /  "START-INPUT-TIMERS"

10.3.  Recorder Events

   The recorder resource can generate the following events.

   recorder-event       =  "START-OF-INPUT"
                        /  "RECORD-COMPLETE"

10.4.  Recorder Header Fields

   Method invocations for the recorder resource can contain resource-
   specific header fields carrying request options and information to
   augment the Method, Response, or Event message they are associated
   with.

   recorder-header      =  sensitivity-level
                        /  no-input-timeout
                        /  completion-cause
                        /  completion-reason
                        /  failed-uri
                        /  failed-uri-cause
                        /  record-uri
                        /  media-type
                        /  max-time
                        /  trim-length
                        /  final-silence
                        /  capture-on-speech
                        /  ver-buffer-utterance
                        /  start-input-timers
                        /  new-audio-channel

10.4.1.  Sensitivity-Level

   To filter out background noise and not mistake it for speech, the
   recorder can support a variable level of sound sensitivity.  The
   Sensitivity-Level header field is a float value between 0.0 and 1.0
   and allows the client to set the sensitivity level for the recorder.
   This header field MAY occur in RECORD, SET-PARAMS, or GET-PARAMS.  A
   higher value for this header field means higher sensitivity.  The
   default value for this header field is implementation specific.

   sensitivity-level    =     "Sensitivity-Level" ":" FLOAT CRLF

10.4.2.  No-Input-Timeout

   When recording is started and there is no speech detected for a
   certain period of time, the recorder can send a RECORD-COMPLETE event
   to the client and terminate the record operation.  The No-Input-
   Timeout header field can set this timeout value.  The value is in
   milliseconds.  This header field MAY occur in RECORD, SET-PARAMS, or
   GET-PARAMS.  The value for this header field ranges from 0 to an
   implementation-specific maximum value.  The default value for this
   header field is implementation specific.

   no-input-timeout    =     "No-Input-Timeout" ":" 1*19DIGIT CRLF
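
   A client might validate this header field and the Sensitivity-Level
   field above before placing them in a RECORD or SET-PARAMS request,
   as in the non-normative sketch below.  Since the defaults are
   implementation specific, none are assumed here.

   def recorder_headers(sensitivity=None, no_input_timeout=None):
       headers = {}
       if sensitivity is not None:
           if not 0.0 <= sensitivity <= 1.0:
               raise ValueError(
                   "Sensitivity-Level must be in [0.0, 1.0]")
           headers["Sensitivity-Level"] = f"{sensitivity:.2f}"
       if no_input_timeout is not None:
           if no_input_timeout < 0:
               raise ValueError(
                   "No-Input-Timeout is a non-negative ms value")
           headers["No-Input-Timeout"] = str(no_input_timeout)
       return headers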

10.4.3.  Completion-Cause

   This header field MUST be part of a RECORD-COMPLETE event from the
   recorder resource to the client.  This indicates the reason behind
   the RECORD method completion.  This header field MUST be sent in the
   RECORD responses if they return with a failure status and a COMPLETE
   state.  In the ABNF below, the 'cause-code' contains a numerical
   value selected from the Cause-Code column of the following table.
   The 'cause-name' contains the corresponding token selected from the
   Cause-Name column.

   completion-cause         =  "Completion-Cause" ":" cause-code SP
                               cause-name CRLF
   cause-code               =  3DIGIT
   cause-name               =  *VCHAR

   +------------+-----------------------+------------------------------+
   | Cause-Code | Cause-Name            | Description                  |
   +------------+-----------------------+------------------------------+
   | 000        | success-silence       | RECORD completed with a      |
   |            |                       | silence at the end.          |
   | 001        | success-maxtime       | RECORD completed after       |
   |            |                       | reaching maximum recording   |
   |            |                       | time specified in record     |
   |            |                       | method.                      |
   | 002        | no-input-timeout      | RECORD failed due to no      |
   |            |                       | input.                       |
   | 003        | uri-failure           | Failure accessing the record |
   |            |                       | URI.                         |
   | 004        | error                 | RECORD request terminated    |
   |            |                       | prematurely due to a         |
   |            |                       | recorder error.              |
   +------------+-----------------------+------------------------------+
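
   The table above maps directly onto a lookup structure; the helper
   below is a non-normative sketch for parsing a Completion-Cause
   value such as "001 success-maxtime".

   RECORDER_CAUSES = {
       0: "success-silence",
       1: "success-maxtime",
       2: "no-input-timeout",
       3: "uri-failure",
       4: "error",
   }

   def parse_completion_cause(value: str):
       code_str, _, name = value.strip().partition(" ")
       code = int(code_str)
       # cross-check the token against the name registered for the code
       if RECORDER_CAUSES.get(code) not in (None, name):
           raise ValueError(f"name {name!r} does not match code {code}")
       return code, name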

10.4.4.  Completion-Reason

   This header field MAY be present in a RECORD-COMPLETE event coming
   from the recorder resource to the client.  It carries text
   describing the reason for the RECORD request completion, such as the
   nature of a failure.

   The completion reason text is provided for client use in logs and for
   debugging and instrumentation purposes.  Clients MUST NOT interpret
   the completion reason text.

   completion-reason        =  "Completion-Reason" ":"
                               quoted-string CRLF

10.4.5.  Failed-URI

   When a recorder method needs to post the audio to a URI and access to
   the URI fails, the server MUST provide the failed URI in this header
   field in the method response.

   failed-uri               =  "Failed-URI" ":" absoluteURI CRLF

10.4.6.  Failed-URI-Cause

   When a recorder method needs to post the audio to a URI and access to
   the URI fails, the server MAY provide the URI-specific or protocol-
   specific response code through this header field in the method
   response.  The value encoding is UTF-8 (RFC 3629 [RFC3629]) to
   accommodate any access protocol -- some access protocols might have a
   response string instead of a numeric response code.

   failed-uri-cause         =  "Failed-URI-Cause" ":" 1*UTFCHAR
                               CRLF

10.4.7.  Record-URI

   When a recorder method contains this header field, the server MUST
   capture the audio and store it.  If the header field is present but
   specified with no value, the server MUST store the content locally
   and generate a URI that points to it.  This URI is then returned in
   either the STOP response or the RECORD-COMPLETE event.  If the header
   field in the RECORD method specifies a URI, the server MUST attempt
   to capture and store the audio at that location.  If this header
   field is not specified in the RECORD request, the server MUST capture
   the audio, MUST encode it, and MUST send it in the STOP response or
   the RECORD-COMPLETE event as a message body.  In this case, the
   response carrying the audio content MUST include a Content ID (cid)
   [RFC2392] value in this header pointing to the Content-ID in the
   message body.

   The server MUST also return the size in octets and the duration in
   milliseconds of the recorded audio waveform as parameters associated
   with the header field.

   Implementations MUST support 'http' [RFC2616], 'https' [RFC2818],
   'file' [RFC3986], and 'cid' [RFC2392] schemes in the URI.  Note that
   implementations already exist that support other schemes.

   record-uri               =  "Record-URI" ":" ["<" uri ">"
                               ";" "size" "=" 1*19DIGIT
                               ";" "duration" "=" 1*19DIGIT] CRLF

10.4.8.  Media-Type

   A RECORD method MUST contain this header field, which specifies to
   the server the media type of the captured audio or video.

   media-type               =  "Media-Type" ":" media-type-value
                               CRLF

10.4.9.  Max-Time

   When recording is started, this header field specifies the maximum
   length of the recording in milliseconds, calculated from the time
   the actual capture and store begins, which is not necessarily the
   time the RECORD method is received.  The length is measured before
   any silence suppression is applied by the recorder resource.
   After this time, the recording stops and the server MUST return a
   RECORD-COMPLETE event to the client having a request-state of
   COMPLETE.  This header field MAY occur in RECORD, SET-PARAMS, or GET-
   PARAMS.  The value for this header field ranges from 0 to an
   implementation-specific maximum value.  A value of 0 means infinity,
   and hence the recording continues until one or more of the other stop
   conditions are met.  The default value for this header field is 0.

   max-time                 =  "Max-Time" ":" 1*19DIGIT CRLF

10.4.10.  Trim-Length

   This header field MAY be sent on a STOP method and specifies the
   length of audio to be trimmed from the end of the recording after the
   stop.  The length is interpreted to be in milliseconds.  The default
   value for this header field is 0.

   trim-length                 =  "Trim-Length" ":" 1*19DIGIT CRLF

10.4.11.  Final-Silence

   When the recorder is started and the actual capture begins, this
   header field specifies the length of silence in the audio that is to
   be interpreted as the end of the recording.  This header field MAY
   occur in RECORD, SET-PARAMS, or GET-PARAMS.  The value for this
   header field ranges from 0 to an implementation-specific maximum
   value and is interpreted to be in milliseconds.  A value of 0 means
   infinity, and hence the recording will continue until one of the
   other stop conditions is met.  The default value for this header
   field is implementation specific.

   final-silence            =  "Final-Silence" ":" 1*19DIGIT CRLF
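
   The "0 means infinity" convention shared by Max-Time and Final-
   Silence could be applied in a capture loop as sketched below; the
   function and its inputs are assumptions for illustration, with all
   times in milliseconds.

   def should_stop(elapsed_ms, trailing_silence_ms,
                   max_time=0, final_silence=0):
       if max_time and elapsed_ms >= max_time:
           return "success-maxtime"    # cause 001
       if final_silence and trailing_silence_ms >= final_silence:
           return "success-silence"    # cause 000
       return None                     # keep recording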

10.4.12.  Capture-On-Speech

   If "false", the recorder MUST start capturing immediately when
   started.  If "true", the recorder MUST wait for the endpointing
   functionality to detect speech before it starts capturing.  This
   header field MAY occur in the RECORD, SET-PARAMS, or GET-PARAMS.  The
   value for this header field is a Boolean.  The default value for this
   header field is "false".

   capture-on-speech        =  "Capture-On-Speech " ":" BOOLEAN CRLF

10.4.13.  Ver-Buffer-Utterance

   This header field is the same as the one described for the verifier
   resource (see Section 11.4.14).  It tells the server to buffer the
   utterance associated with this recording request into the
   verification buffer.  Sending this header field is permitted only if
   a verification buffer exists for the session.  This buffer is shared
   across resources within a session.  It gets instantiated when a
   verifier resource is added to the session and is released when the
   verifier resource is released from the session.

10.4.14.  Start-Input-Timers

   This header field MAY be sent as part of the RECORD request.  A value
   of "false" tells the recorder resource to start the operation but not
   to start the no-input timer until the client sends a START-INPUT-
   TIMERS request to the recorder resource.  This is useful in the
   scenario where the recorder and synthesizer resources are not part of
   the same session.  When a kill-on-barge-in prompt is being played,
   the client may want the RECORD request to be simultaneously active so
   that it can detect and implement kill-on-barge-in (see
   Section 8.4.2).  But at the same time, the client does not want the
   recorder resource to start the no-input timers until the prompt is
   finished.  The default value is "true".

   start-input-timers       =  "Start-Input-Timers" ":"
                               BOOLEAN CRLF
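
   The deferred-timer flow described above amounts to the following
   client-side sequencing; send_mrcp() and the media type shown are
   hypothetical placeholders, not an API defined by this document.

   def record_with_deferred_timers(send_mrcp, channel_id, req_id):
       # 1. Start recording, but hold the no-input timer.
       send_mrcp(f"RECORD {req_id}",
                 {"Channel-Identifier": channel_id,
                  "Start-Input-Timers": "false",
                  "Media-Type": "audio/basic"})  # illustrative type
       # 2. ...the client plays its kill-on-barge-in prompt...
       # 3. When the prompt finishes, start the no-input timer.
       send_mrcp(f"START-INPUT-TIMERS {req_id + 1}",
                 {"Channel-Identifier": channel_id})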

10.4.15.  New-Audio-Channel

   This header field is the same as the one described for the recognizer
   resource (see Section 9.4.23).


