XHTML+Voice Profile 1.2

16 March 2004

This version:
http://www.voicexml.org/specs/multimodal/x+v/12/spec.html
Latest version:
http://www.voicexml.org/specs/multimodal/x+v/12/spec.html
Previous version:
http://www.ibm.com/software/pervasive/multimodal/x+v/11/spec.htm
Editors:
Jonny Axelsson, Opera Software <jax@opera.no>
Chris Cross, IBM <xcross@us.ibm.com>
Jim Ferrans, Motorola <James.Ferrans@motorola.com>
Gerald McCobb, IBM <mccobb@us.ibm.com>
T. V. Raman, IBM <tvraman@us.ibm.com>
Les Wilson, IBM <lesw@us.ibm.com>

Abstract

The XHTML+Voice profile brings spoken interaction to standard web content by integrating the mature XHTML and XML-Events technologies with XML vocabularies developed as part of the W3C Speech Interface Framework. The profile includes voice modules that support speech synthesis, speech dialogs, command and control, and speech grammars. Voice handlers can be attached to XHTML elements and respond to specific DOM events, thereby reusing the event model familiar to web developers. Voice interaction features are integrated with XHTML and CSS and can consequently be used directly within XHTML content.

Status of this Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document.

Note that the language profile described in this specification re-uses W3C working drafts that are likely to change. This integration profile will be updated as needed to use the final stable versions of these specifications. This profile is an update to the XHTML+Voice 1.1 profile. XHTML+Voice 1.2 is current with the VoiceXML 2.0 Recommendation.

Errata

The list of known errors in this specification is available at xhtml-voice12-errata.html. Please report errors in this document to mccobb@us.ibm.com.

Table of Contents

1 Introduction
    1.1 Motivation And Applications
    1.2 Design Principles
    1.3 XHTML+Voice Processing Model
        1.3.1 Processing within one Document
            1.3.1.1 Language and Version
            1.3.1.2 VoiceXML Scope within XHTML+Voice
            1.3.1.3 VoiceXML Dialog Activation
            1.3.1.4 Accessing Speech Dialog Results from XHTML
            1.3.1.5 Accessing XHTML from a Speech Dialog
            1.3.1.6 Returning from a VoiceXML Form
        1.3.2 Cancel
        1.3.3 Declarative Synchronization of Input Modes
        1.3.4 Events and Event Handling
        1.3.5 Document Linking with Voice
        1.3.6 Aural Style Sheets
2 VoiceXML 2.0 Modules
    2.1 Modularization Of VoiceXML 2.0
    2.2 Speech Dialogs
    2.3 Executable Content
    2.4 Speech Grammars
    2.5 Speech And Non-speech Audio Output
    2.6 Event Handling
3 XHTML Modularization
    3.1 Document Conformance
    3.2 User Agent Conformance
    3.3 XHTML Namespace Integration
    3.4 XHTML+Voice Profile
    3.5 XHTML+Voice Abstract Modules
        3.5.1 Abstract Modules
        3.5.2 Element content shorthands
        3.5.3 Attribute list shorthands
4 XML Events Module
    4.1 Listener
    4.2 Event Types
        4.2.1 DOMActivate
    4.3 XHTML+Voice Event Propagation
5 XHTML+Voice Extension Module
    5.1 Sync
        5.1.1 Standard Grammars for XHTML Controls
    5.2 Cancel
    5.3 VoiceXML Field ID Attribute
    5.4 VoiceXML Prompt SRC and EXPR Attributes
        5.4.1 Styling External Prompt Resources
        5.4.2 Invalid Prompt Resource
        5.4.3 Prompt Resource Fetching Properties

Appendices

A Reusable VoiceXML
B Examples
    B.1 What You See Is What You Can Say
    B.2 Mixed-initiative Conversational Interface
    B.3 Speech-Enabled Mail Interface
    B.4 Reusable VoiceXML Subdialogs
C FIA for XHTML+Voice
D DTD
    D.1 xhtml+voice12.dtd
E Schema
    E.1 xhtml+voice12.xsd
F VoiceXML Container for the XHTML+Voice Subset
    F.1 vxml20-xvsubset.xsd
G Multimodal Auto Fill
H Changes from XHTML+Voice 1.1
    H.1 Modified Elements
    H.2 Clarifications
    H.3 Miscellaneous
I References
    I.1 Normative References
    I.2 Informative References


1 Introduction

This document defines version 1.2 of the XHTML+Voice profile. XHTML+Voice 1.2 is a member of the XHTML family of document types, as specified by XHTML Modularization [XHTML Modularization]. XHTML is extended with a modularized subset of VoiceXML 2.0, the XML Events module, and a module containing a small number of attribute extensions to both XHTML and VoiceXML. The latter module facilitates the sharing of multimodal input data between the VoiceXML dialog and XHTML input and text elements.

The XML Events module [XML Events] provides XML host languages the ability to uniformly integrate event listeners and associated event handlers with Document Object Model (DOM) Level 2 [DOM2 Events] event interfaces. The result is an event syntax for XHTML-based languages that enables an interoperable way of associating behaviors with document-level markup.

VoiceXML [VoiceXML 2.0] has been designed for creating audio dialogs that feature synthesized speech, digitized audio, recognition of spoken and DTMF key input, recording of spoken input, telephony, and mixed-initiative conversations. In this document, VoiceXML 2.0 is modularized to prepare it for integration into the XHTML family of languages using the XHTML modularization framework. The modules that combine to support speech dialogs for updating XHTML forms and form elements are selected for addition to XHTML. This document describes those modules and the associated integration issues. The modularization of VoiceXML 2.0 also specifies DOM event types specific to voice interaction for use with the XML Events module. Speech dialogs authored in VoiceXML 2.0 can then be treated as event handlers to add voice-interaction specific behaviors to XHTML documents. The language integration supports all of the modules defined in XHTML Modularization, and adds speech interaction functionality to XHTML elements to enable multimodal applications. The document type defined by the XHTML+Voice profile conforms to the XHTML Host Language document type.

1.1 Motivation And Applications

Two mature technologies, XHTML 1.1 [XHTML 1.1] and VoiceXML 2.0 [VoiceXML 2.0], are integrated using [XHTML Modularization] to bring spoken interaction to the visual web. The design leverages open industry APIs like the W3C DOM to create interoperable web content that can be deployed across a variety of end-user devices. Multiple modes of interaction are synchronized and integrated using the DOM 2 Events model [DOM2 Events] and exposed to the content author via XML Events [XML Events].

Today, web applications are authored in XHTML with user interaction created via XHTML form elements. The W3C is presently working on XForms [XForms], the next generation of web forms that bring the power of XML to web application development. The combination of XHTML and Voice described in this document can leverage the semantic richness of web applications created using XForms, while providing a smooth transition for today's developers wishing to deploy multimodal applications by adding spoken interaction to present-day web content. Integrating the work of the W3C voice browser working group into mainstream XHTML content has the advantage of ensuring that future enhancements to the voice browser component such as natural language understanding will be incorporated. Thus, a smooth transition path for web developers wishing to deliver increasingly smart user interaction for their web applications is provided. Building on XHTML Basic [XHTML Basic] and XHTML modularization, content developers will be able to deploy their content to a wide variety of end-user clients ranging from mobile phones and small PDAs to desktop browsers.

1.2 Design Principles

XHTML+Voice is an XML application [XML 1.0].

  1. XHTML is the host language.
  2. XHTML+Voice extends XHTML Basic with a subset of VoiceXML 2.0, as well as XML Events and a small extension module.
  3. XHTML+Voice makes authoring easy for common types of multimodal interactions.
  4. VoiceXML is modularized to permit the creation of profiles that match different application deployment environments.
  5. Those parts of VoiceXML that relate to the VoiceXML document being a stand-alone speech application are omitted from the XHTML+Voice profile.
  6. VoiceXML modularization does not alter the VoiceXML execution model. Specifically, a speech dialog is run as specified by the VoiceXML form interpretation algorithm.
  7. VoiceXML modularization does not modify the function of the VoiceXML 2.0 elements and attributes that are part of the profile.

1.3 XHTML+Voice Processing Model

XHTML+Voice is designed for creating multimodal dialogs that combine the visual input mode, represented by XHTML, and speech input and output, represented by a subset of VoiceXML. Here is a "Hello World" example of XHTML+Voice:

<?xml version="1.0"?>
<html 
xmlns="http://www.w3.org/1999/xhtml" 
xmlns:vxml="http://www.w3.org/2001/vxml"
xmlns:ev="http://www.w3.org/2001/xml-events"
xmlns:xv="http://www.voicexml.org/2002/xhtml+voice"
>
  <head>
    <title>XHTML+Voice Example</title>
    <!-- voice handler -->
    <vxml:form id="sayHello">
      <vxml:block><vxml:prompt xv:src="#hello"/>
      </vxml:block>
    </vxml:form>
  </head>
  <body>
    <h1>XHTML+Voice Example</h1>
    <p id="hello" ev:event="click" ev:handler="#sayHello">
      Hello World!
    </p>
  </body>
</html>

The speech dialog identified by "sayHello" is activated when the user clicks anywhere on the paragraph identified by "hello." The speech dialog is a VoiceXML form that synthesizes the text obtained from the same paragraph that activated the form. The speech output is "Hello World!"

1.3.1 Processing within one Document

A speech dialog is defined within XHTML+Voice as a [VoiceXML 2.0] form with a unique ID. The VoiceXML form is activated by an XML Events event with an associated handler that references the form's unique ID. The XML Events event is generated from a user interaction with an XHTML element, generally a form control, or from a document event such as load or unload. Activating the VoiceXML form sets all form and field item variables to their initial values. This clears the guard conditions on all form items that don't have an initial value set with the expr attribute. The form is run according to the form interpretation algorithm (FIA) specified by VoiceXML.

1.3.1.1 Language and Version

A VoiceXML form requires language and VoiceXML version information. VoiceXML 2.0 includes language and version attributes with its root <vxml> element. XHTML+Voice obtains language and VoiceXML version from XHTML as follows. Language is obtained from the HTML root element's xml:lang attribute, while the VoiceXML version can be derived from the value of the VoiceXML namespace. The language can be overridden by the xml:lang attribute on the VoiceXML grammar and prompt tags.
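The language lookup described above can be sketched in JavaScript as follows. This is only an illustration: the function name effectiveLanguage and the plain-object representation of elements are assumptions made for the sketch and are not defined by the profile.

```javascript
// Hypothetical helper: resolve the language used for a prompt or grammar.
// `element` and `root` are plain objects standing in for parsed elements;
// each may carry an "xml:lang" entry in its `attributes` map.
function effectiveLanguage(element, root) {
  const own = element.attributes["xml:lang"];
  if (own !== undefined && own !== "") {
    return own; // xml:lang on <grammar> or <prompt> overrides the root
  }
  return root.attributes["xml:lang"]; // otherwise inherit from <html>
}

// Made-up sample elements.
const htmlRoot = { name: "html", attributes: { "xml:lang": "en-US" } };
const frenchPrompt = { name: "prompt", attributes: { "xml:lang": "fr-FR" } };
const plainPrompt = { name: "prompt", attributes: {} };

console.log(effectiveLanguage(frenchPrompt, htmlRoot)); // fr-FR
console.log(effectiveLanguage(plainPrompt, htmlRoot)); // en-US
```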

1.3.1.2 VoiceXML Scope within XHTML+Voice

A VoiceXML form within an XHTML+Voice document does not have the session and document scopes defined by VoiceXML. It does not have these scopes for two reasons. First, <form> is the top level VoiceXML element in an XHTML+Voice document. Second, XHTML+Voice does not allow transitions from one voice handler to another. VoiceXML 2.0 allows a form to have either dialog or document scope. If the form's scope is document, as set by the scope attribute, the form is active while another form in the document is running. When the speech input matches the grammar of the form with document scope, there is a transition from the currently running form to the form with the document scope. XHTML+Voice does not allow this transition. Consequently, a form's scope is limited to dialog and the scope attribute is ignored. The grammar scope attribute is also ignored for the same reason. The remaining inner VoiceXML scopes, dialog and anonymous, are processed by XHTML+Voice, as required by the VoiceXML FIA.

XHTML+Voice supports only the default value of the scope attribute, which is "dialog." If the scope attribute is encountered on a voice handler form, the form is not invalidated and processing continues. The scope attribute on the <grammar> element is also ignored and its default value of "dialog" maintained. XHTML+Voice document processing ignores all VoiceXML 2.0 attributes it does not support when they are encountered.

If XHTML+Voice document processing encounters a VoiceXML 2.0 element not supported by XHTML+Voice (e.g., <goto>), a "badfetch" error is thrown. This means that a VoiceXML 2.0 interpreter and an XHTML+Voice interpreter can run the same VoiceXML 2.0 source if all the source tags are supported by XHTML+Voice. However, not all of the source attributes need to be supported by XHTML+Voice, because XHTML+Voice supports their default values.
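The two rules above (unsupported elements raise an error, unsupported attributes are ignored) can be illustrated with a small JavaScript sketch. It is not how any particular user agent is implemented: the function name checkVoiceHandler and the plain-object tree are hypothetical, the element set is taken from the modules marked "N" in Table 1, and only the scope attribute is modeled as an ignored attribute.

```javascript
// Elements from the Table 1 modules marked "N" (not part of the profile).
// meta and metadata are left out because Table 1 also lists them under
// modules that are included.
const UNSUPPORTED_ELEMENTS = new Set([
  "exit", "goto", "link", "script", "submit",
  "menu", "choice", "object", "vxml", "transfer", "disconnect",
]);

// Attributes this sketch treats as ignored (only the default is honored).
const IGNORED_ATTRIBUTES = { form: ["scope"], grammar: ["scope"] };

// Hypothetical checker over a plain-object tree:
// { name, attributes: {...}, children: [...] }.
// Returns a copy with ignored attributes dropped, or throws an Error whose
// message is "error.badfetch" when an unsupported element is found.
function checkVoiceHandler(node) {
  if (UNSUPPORTED_ELEMENTS.has(node.name)) {
    throw new Error("error.badfetch");
  }
  const ignored = IGNORED_ATTRIBUTES[node.name] || [];
  const attributes = {};
  for (const [key, value] of Object.entries(node.attributes || {})) {
    if (!ignored.includes(key)) attributes[key] = value;
  }
  const children = (node.children || []).map((child) => checkVoiceHandler(child));
  return { name: node.name, attributes, children };
}
```

Applied to a form carrying scope="document", the sketch returns the form without that attribute; applied to a form containing <goto>, it throws.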

XHTML+Voice allows a speech dialog to be referenced as a voice handler in an external file. Because the speech dialog has no scope outside of its enclosing form, only the form in the external file is processed when the form is activated. For example, the script elements in the external file will not be processed. This is because the visual browser only executes script in the current document, and the VoiceXML <script> element is not supported. Consequently, the external reference must contain a fragment identifier specifying the form, in addition to an absolute or relative URI. This differs from VoiceXML, which specifies that when the fragment is absent, the form "invoked is the lexically first dialog in the document" [VoiceXML 2.0]. With this restriction, the speech dialog can reside in any external XML document, including VoiceXML. Only the calling document has to be an XHTML+Voice document.
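A tool that checks external handler references for the required fragment identifier might look like the following JavaScript sketch. The function name hasFormFragment and the example base URI are assumptions; the file name and form id are borrowed from the subdialog example in this document.

```javascript
// Hypothetical check: an external voice handler reference must name the form
// through a fragment identifier, e.g. "cityorairport.vxml#cityform".
function hasFormFragment(reference, baseUri) {
  const resolved = new URL(reference, baseUri); // accepts absolute or relative references
  return resolved.hash.length > 1; // a bare "#" does not name a form
}

// Example base URI of the calling document (made up for illustration).
const callingDocument = "http://example.org/app/page.xhtml";

console.log(hasFormFragment("cityorairport.vxml#cityform", callingDocument)); // true
console.log(hasFormFragment("cityorairport.vxml", callingDocument)); // false
```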

Because XHTML script placed in an external file is not processed, validation of VoiceXML results cannot be performed within an external subdialog by calling out to some ECMAScript contained within a VoiceXML script tag. ECMAScript validation of subdialog results can only be performed from the calling document. Validation methods must be included in the ECMAScript objects passed as parameters to the subdialog.

VoiceXML <field>, <subdialog>, and <var> elements do not have any visibility to the XHTML namespace as ECMAScript variables. Furthermore, there is no requirement to support the VoiceXML elements as nodes in the DOM object available to JavaScript. There are several problems with supporting the DOM object. Unlike XHTML form control elements, VoiceXML form item elements don't have a value attribute and consequently the DOM node value property is missing. A value attribute is unnecessary in VoiceXML because the VoiceXML form item elements are their own ECMAScript variables, and they have defined values only while the enclosing form is active. At all other times their values are undefined.

1.3.1.3 VoiceXML Dialog Activation

When the browser loads the body of an XHTML+Voice document, a "load" event is generated. This begins the event cycle specified by the DOM Level 2 Events model. While the event cycle is running, events propagate through the HTML tree. An XML Events listener can observe an event on either a target HTML node, or an ancestor of the node, if the event bubbles. An XML Events listener activates a handler in response to the observed event. The handler can be a voice dialog activated in response to a "click" event on an HTML input, for example.

A voice dialog can also be activated by dispatching a DOMActivate event against it from XHTML script. The XML Events Module provides more details and an example.
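As a rough illustration of script-driven activation, the following JavaScript sketch uses the DOM Level 2 Events calls createEvent, initUIEvent, and dispatchEvent. The function name activateVoiceHandler is hypothetical, and the profile's own example in the XML Events Module section may differ in detail.

```javascript
// Sketch: activate a voice handler by dispatching DOMActivate at it.
// `doc` is expected to behave like a DOM Level 2 document
// (getElementById / createEvent). The function name is hypothetical.
function activateVoiceHandler(doc, handlerId) {
  const handler = doc.getElementById(handlerId);
  if (!handler) {
    return false; // no element with that id
  }
  const evt = doc.createEvent("UIEvents");
  // initUIEvent(type, canBubble, cancelable, view, detail)
  evt.initUIEvent("DOMActivate", true, true, doc.defaultView, 1);
  handler.dispatchEvent(evt);
  return true;
}

// In a page script, with the form id from the "Hello World" example:
//   activateVoiceHandler(document, "sayHello");
```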

1.3.1.4 Accessing Speech Dialog Results from XHTML

Speech dialog results may be accessed from XHTML in one of the following ways:

  1. The VoiceXML standard application variables are available to an XHTML+Voice application as global JavaScript variables. Each variable listed is an array of elements [0..i..n], where each element represents a possible result. See [VoiceXML 2.0] for more details:
    • application.lastresult$[i].confidence
    • application.lastresult$[i].utterance
    • application.lastresult$[i].inputmode
    • application.lastresult$[i].interpretation
  2. Through the XHTML+Voice <sync> element, which is described in the XHTML+Voice Extension Module section.
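As an illustration of the first access path, the following JavaScript sketch filters an array shaped like application.lastresult$ by confidence. The function name confidentResults, the 0.5 threshold, and the sample data are all made up for the example.

```javascript
// Sketch: keep only the recognition results at or above a confidence level.
// `lastresult` stands in for application.lastresult$; the 0.5 default
// threshold is an arbitrary value chosen for illustration.
function confidentResults(lastresult, threshold = 0.5) {
  const kept = [];
  for (let i = 0; i < lastresult.length; i++) {
    if (lastresult[i].confidence >= threshold) {
      kept.push({
        utterance: lastresult[i].utterance,
        interpretation: lastresult[i].interpretation,
        inputmode: lastresult[i].inputmode,
      });
    }
  }
  return kept;
}

// Made-up sample data shaped like the array described above.
const sampleResults = [
  { confidence: 0.82, utterance: "boston", inputmode: "voice", interpretation: "BOS" },
  { confidence: 0.31, utterance: "austin", inputmode: "voice", interpretation: "AUS" },
];

console.log(confidentResults(sampleResults).map((r) => r.utterance)); // [ 'boston' ]
```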

1.3.1.5 Accessing XHTML from a Speech Dialog

The global JavaScript scope of an XHTML+Voice document is available to a speech dialog. An XHTML form control element, such as input, can therefore be accessed from within VoiceXML using the DOM object traversal notation available to JavaScript. For example, the value of an input field with name "from_city" is set from the VoiceXML assign tag as follows:

<?xml version="1.0"?>
<html xmlns="http://www.w3.org/1999/xhtml"
    xmlns:ev="http://www.w3.org/2001/xml-events">
  <head>
    <form id="form_id" xmlns="http://www.w3.org/2001/vxml">
      <field name="from_field">
        <filled>
          <assign name="document.main.from_city.value"
                  expr="from_field"/>
        </filled>
      </field>
    </form>
  </head>
  <body>
    <form name="main" action="cgi/city.jsp">
      <input name="from_city" type="text"
             ev:event="focus" ev:handler="#form_id"/>
    </form>
  </body>
</html>

The document keyword in XHTML+Voice refers to the JavaScript DOM object. This works because XHTML+Voice allows a voice dialog to share the global JavaScript scope with the XHTML container. XHTML+Voice also puts the VoiceXML application scope below the shared global scope.

1.3.1.6 Returning from a VoiceXML Form

When an event is captured within a voice dialog the author may choose to end the dialog and return to the XHTML container. XHTML+Voice uses the VoiceXML <return> element for this purpose. If the <return> element is run within executable content of a top level voice handler (i.e., one that is not called as a subdialog), the voice handler will end its execution and return to the XHTML. The following example shows how the <return> element can be used:

<?xml version="1.0"?> 
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:vxml="http://www.w3.org/2001/vxml"
      xmlns:ev="http://www.w3.org/2001/xml-events"
      xmlns:xv="http://www.voicexml.org/2002/xhtml+voice" >
   <head><title>Find City or Airport</title> 

      <vxml:form id="vform">
         <vxml:subdialog name="cityorairport" src="cityorairport.vxml#cityform">
            <vxml:param name="paramPrompt" expr="'What city or airport?'"/> 
            <vxml:filled> 
               <vxml:assign name="document.xform.city.value"
                               expr="cityorairport.returnCityOrAirport"/>
            </vxml:filled> 
            <vxml:catch event="error.badfetch">
               Error fetching subdialog!
               <vxml:return/>
            </vxml:catch>
         </vxml:subdialog> 
      </vxml:form> 

   </head>
   <body bgcolor="#FFFFFF">
      <h3>City or Airport</h3>
      <form name="xform" action="cgi/cityorairport.jsp">
         <p>Enter city or airport:<br/>
            <input type="text" name="city" ev:event="focus" ev:handler="#vform"/>
         </p>
      </form>
   </body>
</html>

When the <return> element is specified within a top-level voice form, its namelist attribute has no meaning and is ignored. However, either the event or eventexpr attribute can be used to return a VoiceXML event to the XHTML container.

1.3.2 Cancel

XHTML+Voice does not allow multiple speech dialogs to run simultaneously. A speech dialog runs in its own thread and, for many devices, the audio subsystem can be owned by only one thread at a time. Also, other resources that are not guaranteed to be thread-safe may cause a voice handler to block indefinitely. Therefore, only one speech dialog can be running at one time per loaded XHTML+Voice document. Because only one speech dialog can run at a time, a newly activated speech dialog must cancel the currently running dialog. This is the default behavior. The running dialog should also be canceled when the current XHTML+Voice document is unloaded.

The document author can cancel the currently running speech dialog with the <cancel> element that can be specified by an XHTML element as a handler for an XML Events event. The XHTML+Voice Extension Module section provides more details.

Cancel is a message from the visual browser that must be handled by the VoiceXML FIA. It is separate from the cancel event supported by VoiceXML that cancels the currently running prompt. The cancel message from the visual browser modifies the FIA in the sense that it must be checked throughout the FIA, and if it is received then the FIA must terminate.
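The control flow implied by this section can be sketched in JavaScript as follows. This is a deliberately simplified model and not the VoiceXML FIA: DialogRunner and its methods are hypothetical names, and form items are reduced to plain callbacks so that the two rules (a new activation cancels the running dialog, and the cancel message is checked before every step) stay visible.

```javascript
// Sketch of the "one running dialog per document" rule.
class DialogRunner {
  constructor() {
    this.current = null; // the dialog that currently owns the audio subsystem
  }

  // Activating a dialog cancels whichever dialog is already running.
  activate(id, steps) {
    if (this.current) this.cancel();
    this.current = { id, steps, next: 0, cancelled: false, log: [] };
    return this.current;
  }

  // Models the cancel message sent by the visual browser.
  cancel() {
    if (this.current) this.current.cancelled = true;
  }

  // Runs one step of the simplified loop. The cancel flag is checked
  // before every step, and a cancelled dialog terminates immediately.
  step() {
    const d = this.current;
    if (!d) return "idle";
    if (d.cancelled) {
      this.current = null;
      return "cancelled";
    }
    if (d.next >= d.steps.length) {
      this.current = null;
      return "done";
    }
    d.log.push(d.steps[d.next++]());
    return "running";
  }
}
```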

1.3.3 Declarative Synchronization of Input Modes

The XHTML+Voice <sync> element provides a declarative synchronization of XHTML form control elements and the VoiceXML <field> element. The <sync> element specifies the following behaviors. First, sync allows input from either the speech or the visual modality to set the field in the other modality. Second, setting the focus of an <input> element that is synchronized with a VoiceXML field updates the FIA to visit that VoiceXML field. This is useful when there are multiple fields within a VoiceXML form. Sync is both a message to the VoiceXML FIA from the visual browser, like cancel, and a message from the FIA to the visual browser. The XHTML+Voice Extension Module section provides more details.
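The two behaviors can be modeled with a short JavaScript sketch. It does not show the <sync> markup, which is defined in the XHTML+Voice Extension Module section; SyncPair, its method names, and the plain objects standing in for the XHTML control and the running dialog are assumptions made for illustration.

```javascript
// Sketch of the two behaviors of <sync>.
class SyncPair {
  constructor(input, fia, fieldName) {
    this.input = input; // stands in for an XHTML control, e.g. { value: "" }
    this.fia = fia; // stands in for the dialog, e.g. { fields: {}, nextField: null }
    this.fieldName = fieldName; // name of the synchronized VoiceXML field
  }

  // Visual input sets the VoiceXML field.
  onVisualInput(text) {
    this.input.value = text;
    this.fia.fields[this.fieldName] = text;
  }

  // A filled VoiceXML field sets the XHTML control.
  onFieldFilled(result) {
    this.fia.fields[this.fieldName] = result;
    this.input.value = result;
  }

  // Focusing the control asks the FIA to visit the paired field next.
  onFocus() {
    this.fia.nextField = this.fieldName;
  }
}

// Names reused from the "from_city" example in this document.
const cityInput = { value: "" };
const dialogState = { fields: {}, nextField: null };
const cityPair = new SyncPair(cityInput, dialogState, "from_field");
```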

1.3.4 Events and Event Handling

The nomatch, noinput, help, and error VoiceXML event types are propagated as XML Events events to XHTML. They can be linked to an XML Events handler using the XML Events syntax for specifying target, observer, event, and handler. The events are propagated regardless of whether the event has already been caught and handled properly within the VoiceXML form. The VoiceXML event types nomatch, noinput, help, and error propagate to the XHTML container as the XHTML+Voice event types vxmlnomatch, vxmlnoinput, vxmlhelp, and vxmlerror, respectively.

Within VoiceXML a chain of events can be created, where one event is caught and another event is thrown, and so on. Because the entire chain of events is propagated to XHTML, the application author should be careful not to chain multiple events of the same type. The VoiceXML error event subtypes error.semantic, error.badfetch, error.unsupported.element, etc., are propagated as the vxmlerror event type to XHTML. This is in accordance with the VoiceXML specification. This allows for the application to define additional error subtypes that can be handled by the visual browser. More general application-defined event types are also supported. If an application-defined event type is defined within the VoiceXML form, such as "foo.bar", then when that event is thrown within the form, it is propagated to XHTML as an XML Events event. In the example below, both the vxmlnoinput and foo.bar events are handled by the visual browser via the XML Events listener tag. Note that the VoiceXML form exits because the foo.bar event is not handled within the form.

<vxml:form id="ex1">
   <vxml:catch event="noinput">
      <vxml:throw event="foo.bar"/>
   </vxml:catch>

   <vxml:field name="f1">
      <vxml:grammar type="boolean"/>
      <vxml:prompt>Say yes or no</vxml:prompt>
   </vxml:field>
</vxml:form>

<ev:listener ev:observer="ex1" ev:event="vxmlnoinput" ev:handler="#h1"/>
<ev:listener ev:observer="ex1" ev:event="foo.bar" ev:handler="#h2"/>

In addition to the VoiceXML event types listed above, XHTML+Voice supports the vxmldone event type. The vxmldone event is generated when the currently running VoiceXML form completes without an error. All the event types that XHTML+Voice supports are listed in the XML Events Module.
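The naming rules in this section can be summarized by a small JavaScript sketch. The function name xhtmlEventType is hypothetical. VoiceXML event types that this section does not mention are passed through unchanged here, which is an assumption of the sketch rather than a rule of the profile. The vxmldone event is not covered because it does not correspond to a thrown VoiceXML event.

```javascript
// Sketch: name of the XML Events event seen by XHTML for a VoiceXML event.
function xhtmlEventType(vxmlEvent) {
  const direct = { nomatch: "vxmlnomatch", noinput: "vxmlnoinput", help: "vxmlhelp" };
  if (Object.prototype.hasOwnProperty.call(direct, vxmlEvent)) {
    return direct[vxmlEvent];
  }
  // error and all of its subtypes (error.semantic, error.badfetch, ...)
  if (vxmlEvent === "error" || vxmlEvent.startsWith("error.")) {
    return "vxmlerror";
  }
  // Application-defined types such as "foo.bar" keep their name.
  return vxmlEvent;
}

console.log(xhtmlEventType("noinput")); // vxmlnoinput
console.log(xhtmlEventType("error.badfetch")); // vxmlerror
console.log(xhtmlEventType("foo.bar")); // foo.bar
```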

1.3.5 Document Linking with Voice

Document linking with voice is available to the author. Given an XHTML+Voice document with the following <link> and <a> elements:

<link rel="glossary" title="Glossary" href="glossary.html"/>
<link rel="contents" title="Contents" href="contents.html"/>
<a href="chapter3.html" title="Next Page" rel="next">Next</a>
<a href="chapter1.html" title="Previous Page" rel="previous">Previous</a>
<a href="http://www.nytimes.com" title="New York Times">NY Times</a>

From these elements, the grammar shown below can be produced. The document author uses the rel attribute to enable document linking for a select set of <link> and <a> elements. For each element with a rel attribute, the rel and href attribute values are added to the grammar, where the rel value is what the user might say, and the href value is the corresponding URI. If the rel attribute is omitted, the title attribute can be used for building a link activation grammar for all the <a> elements in the document.

#JSGF V1.0 iso-8859-1;
grammar document-links;

public <document-links> = Glossary {this.$value="glossary.html"}
             | Contents {this.$value="contents.html"}
             | Next Page {this.$value="chapter3.html"}
             | Previous Page {this.$value="chapter1.html"}
             | New York Times {this.$value="http://www.nytimes.com"};

The scope of this grammar is document, so it is always active. While XHTML+Voice does not support authoring a grammar with document scope within a form, the multimodal browser should support grammars with document scope for document linking and command and control.
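How a browser might derive such a grammar can be sketched in JavaScript. The function names are hypothetical, and the sketch assumes that the spoken phrase is the title value when one is present and the rel value otherwise, which is the choice that reproduces the sample grammar above. A real implementation may choose phrases differently.

```javascript
// Sketch: derive link-activation alternatives from <link> and <a> elements,
// given as plain objects with optional rel and title and a required href.
function linkAlternatives(elements) {
  return elements
    .filter((el) => el.href && (el.title || el.rel))
    .map((el) => `${el.title || el.rel} {this.$value="${el.href}"}`);
}

// Assemble the alternatives into a JSGF grammar laid out like the sample.
function linkGrammar(elements) {
  const alternatives = linkAlternatives(elements).join("\n             | ");
  return [
    "#JSGF V1.0 iso-8859-1;",
    "grammar document-links;",
    "",
    `public <document-links> = ${alternatives};`,
  ].join("\n");
}

// The five elements from the example above.
const sampleLinks = [
  { rel: "glossary", title: "Glossary", href: "glossary.html" },
  { rel: "contents", title: "Contents", href: "contents.html" },
  { rel: "next", title: "Next Page", href: "chapter3.html" },
  { rel: "previous", title: "Previous Page", href: "chapter1.html" },
  { title: "New York Times", href: "http://www.nytimes.com" },
];

console.log(linkGrammar(sampleLinks));
```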

1.3.6 Aural Style Sheets

With the addition of the src and expr attributes to the VoiceXML <prompt> element, XHTML+Voice is able to support Aural style sheets declared according to [CSS2]. Within XHTML, a paragraph with id set to "warnPara" can be styled with the CSS "warn" class:

<p id="warnPara" class="warn">warning</p>

The CSS has visual and aural rules for class "warn." When the VoiceXML <form> processes a prompt with the src attribute set to that paragraph, the aural style rules for "warn" are invoked. The VoiceXML Prompt SRC and EXPR Attributes section provides more details and a complete example.

2 VoiceXML 2.0 Modules

This section first modularizes VoiceXML 2.0 and then specifies the various VoiceXML 2.0 modules used in the creation of the XHTML+Voice profile.

2.1 Modularization Of VoiceXML 2.0

The files making up the modularization of the VoiceXML 2.0 SCHEMA are available as voice-xml-modules.zip and have been created to ease the process of integrating VoiceXML 2.0 and XHTML. These modules do not change the VoiceXML 2.0 language as specified by the voice browser working group of the W3C. This section gives a high-level overview of each module.

Table 1: VoiceXML Modules

| Module | Purpose | Elements | XHTML+Voice? |
|---|---|---|---|
| Events | Events thrown by VoiceXML processor | catch, help, noinput, nomatch, error, throw | Y |
| Executable statements | Statements for use in voice handlers | assign, clear, var, log, reprompt | Y |
| Filled | Voice handlers invoked when a slot is filled | filled | Y |
| Flow control | Flow control constructs from VoiceXML | if, else, elseif, return | Y |
| Forms | Encapsulate voice dialogs | form, field, record, subdialog, block, initial | Y |
| Miscellaneous | Non-local transfers in VoiceXML | exit, goto, link, script, submit | N |
| Menus | VoiceXML menus | menu, choice | N |
| Object | Foreign objects for VoiceXML | object | N |
| Resources | Specifying resources for VoiceXML | param, property | Y |
| Root | VoiceXML stand-alone documents | vxml, meta, metadata | N |
| Enumerate | Enumerate choices or options available to user | enumerate | Y |
| Option | Specify option in a field | option | Y |
| Output | Speech and audio output | prompt, value, audio, desc, emphasis, lexicon, mark, voice, break, prosody, say-as, sub, phoneme, p, s, meta, metadata | Y |
| Telephony | Telephony control | transfer, disconnect | N |
| User Input | Speech input constructs from VoiceXML | grammar, lexicon, example, tag, token, item, meta, metadata, one-of, rule, ruleref | Y |
| Attributes | Common attributes used in VoiceXML | NA | Y |
| Datatypes | Common datatypes used in VoiceXML | NA | Y |
| Document Model | Defines content model for VoiceXML elements | NA | N |

2.2 Speech Dialogs

Modules vxml-exec-1.xsd, vxml-filled-1.xsd, vxml-resource-1.xsd, vxml-flow-1.xsd, vxml-enumerate-1.xsd, vxml-option-1.xsd, and vxml-form-1.xsd support authoring handlers that implement speech dialogs.

2.3 Executable Content

Modules vxml-filled-1.xsd, vxml-flow-1.xsd, vxml-exec-1.xsd, and vxml-resource-1.xsd declare constructs for use within voice handlers. The semantics of these constructs are as defined in the VoiceXML 2.0 specification.

2.4 Speech Grammars

The speech grammar modules provide constructs for authoring speech grammars as specified in VoiceXML 2.0. The modules are provided by the normative VoiceXML 2.0 SCHEMA and are unchanged: grammar-core.xsd, grammar.xsd, vxml-grammar-restriction.xsd, and vxml-grammar-extension.xsd. The restriction and extension modules allow the elements and attributes normatively specified by the speech grammar specification [Speech Grammars] to be included within the VoiceXML 2.0 namespace.

2.5 Speech And Non-speech Audio Output

The speech and audio output modules define constructs for producing spoken and non-spoken audio output. The modules are provided by the normative VoiceXML SCHEMA and are unchanged: synthesis-core.xsd, synthesis.xsd, vxml-synthesis-restriction.xsd, and vxml-synthesis-extension.xsd. As with the speech grammar modules, the elements and attributes normatively defined in the SSML specification [SSML 1.0] are included within the VoiceXML 2.0 namespace.

2.6 Event Handling

Module vxml-events-1.xsd declares the event types defined in VoiceXML 2.0.

3 XHTML Modularization

This section is normative.

3.1 Document Conformance

A conforming XHTML+Voice document is a document that requires only the facilities described as mandatory in this specification. Such a document must meet all of the following criteria:

  1. It must validate against the XML Schema provided in this document.

  2. The root element of the document must be html.

  3. The name of the default namespace on the root element must be the XHTML namespace name: http://www.w3.org/1999/xhtml.

  4. If a DOCTYPE declaration is present and includes a public identifier, the DOCTYPE declaration must reference the DTD provided in this document using its Formal Public Identifier. The system identifier may be modified appropriately.

    <!DOCTYPE html PUBLIC "-//VoiceXML Forum//DTD XHTML+Voice 1.2//EN"
    "http://www.voicexml.org/specs/multimodal/x+v/12/dtd/xhtml+voice12.dtd">

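Criteria 2 through 4 lend themselves to a mechanical check, sketched below in JavaScript. Criterion 1 (schema validation) requires a validating parser and is not covered. The function name conformanceProblems and the shape of its argument, a pre-parsed summary of the document, are assumptions made for the sketch.

```javascript
const XHTML_NS = "http://www.w3.org/1999/xhtml";
const XV12_PUBLIC_ID = "-//VoiceXML Forum//DTD XHTML+Voice 1.2//EN";

// Sketch: check criteria 2 to 4 on a summary of the document.
// `info` is a hypothetical shape:
//   { rootName, defaultNamespace, doctypePublicId }
// where doctypePublicId is null or undefined when no public identifier
// is declared.
function conformanceProblems(info) {
  const problems = [];
  if (info.rootName !== "html") {
    problems.push("root element must be html");
  }
  if (info.defaultNamespace !== XHTML_NS) {
    problems.push("default namespace must be the XHTML namespace");
  }
  if (info.doctypePublicId != null && info.doctypePublicId !== XV12_PUBLIC_ID) {
    problems.push("DOCTYPE public identifier must be the XHTML+Voice 1.2 FPI");
  }
  return problems;
}
```

An empty result means only that these three criteria are met; the document must still validate against the schema.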
3.2 User Agent Conformance

The user agent must conform to the "User Agent Conformance" section of the XHTML specification [XHTML 1.0], section 3.2, and the conformance requirements detailed in the VoiceXML modules [VoiceXML 2.0] supported by the integration profile.

The user agent must conform to the following additional user agent rule:

  1. When the user agent claims to support facilities defined within the VoiceXML 2.0 specifications or facilities required by this specification through normative reference, it must do so in ways consistent with the facilities' definition.

3.3 XHTML Namespace Integration

The default XML namespace of an XHTML+Voice document is XHTML. XHTML+Voice extends XHTML with VoiceXML, XML Events, and XHTML+Voice extensions. The VoiceXML, XML Events, and XHTML+Voice extension elements and attributes are included through additional namespace declarations:

<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:vxml="http://www.w3.org/2001/vxml"
      xmlns:ev="http://www.w3.org/2001/xml-events"
      xmlns:xv="http://www.voicexml.org/2002/xhtml+voice">

The name of the unique prefix identifier for the namespace within the document, for example, vxml for VoiceXML elements, is left to the document author's discretion.

3.4 XHTML+Voice Profile

The XHTML functionality in the XHTML+Voice document type is based upon the XHTML modules defined in [XHTML Modularization]. The XHTML+Voice profile includes the XHTML modules defined in [XHTML Basic], such as the basic XHTML forms and tables modules. Added to the XHTML Basic modules are the following modules:

  • The XHTML scripting module.
  • XML Events as defined by the XML Events module, [XML Events]. XML Events with VoiceXML event types and handlers allow the XHTML author to associate voice-interaction specific behaviors.
  • A set of VoiceXML modules for speech-enabling XHTML constructs. The top-level VoiceXML element for defining a voice handler is <form>.
  • An XHTML+Voice Extension module for facilitating the authoring of the interaction between the visual and speech modules.

The notation, terms and document conventions used here are borrowed from [XHTML 1.1].

The profile includes the XHTML Basic modules defined in [XHTML Basic], the XHTML scripting module defined in [XHTML 1.1], the XML Events module defined in [XML Events], the XHTML+Voice extension module defined in the XHTML+Voice Extension Module, and the VoiceXML 2.0 modules listed in Table 2 (section 3.5.1).

3.5 XHTML+Voice Abstract Modules

The namespaces used in these modules are as follows:

XHTML:
http://www.w3.org/1999/xhtml
VoiceXML:
http://www.w3.org/2001/vxml
XML Events:
http://www.w3.org/2001/xml-events
XHTML+Voice:
http://www.voicexml.org/2002/xhtml+voice

3.5.1 Abstract Modules

Table 2: XHTML+Voice Abstract Modules
Element Content Attributes
Base Module (XHTML)
base EMPTY href* (URI)
Basic Forms Module (XHTML)
form Heading | Block - form Common, action* (URI), method ("get"* | "post"), enctype (ContentType)
input EMPTY Common, Access, checked ("checked"), maxlength (Number), name (CDATA), size (Number), src (URI), type ("text"* | "password" | "checkbox" | "radio" | "submit" | "reset" | "hidden"), value (CDATA)
label (PCDATA | Inline - label)* Common, accesskey (Character), for (IDREF)
select option+ Common, multiple ("multiple"), name (CDATA), size (Number)
option PCDATA Common, selected ("selected"), value (CDATA)
textarea PCDATA Common, Access, cols* (Number), name (CDATA), rows* (Number)
Basic Tables Module (XHTML)
caption (PCDATA | Inline)* Common
table caption?, tr+ Common, summary (Text), width (Length)
td (PCDATA | Flow - table)* Common, Cell, Align
th (PCDATA | Flow - table)* Common, Cell, Align
tr td+ Common, Align
Enumeration Module (VoiceXML)
enumerate (Audio | TTS)* -
Events Module (VoiceXML)
catch Exec VoiceHandler, event (NMTOKENS)
help Exec VoiceHandler
noinput Exec VoiceHandler
nomatch Exec VoiceHandler
error Exec VoiceHandler
throw EMPTY VoiceHandler, event (NMTOKEN), eventexpr (Script), message (CDATA), messageexpr (Script)
Executable Statements Module (VoiceXML)
assign EMPTY Expr
clear EMPTY namelist (CDATA)
var EMPTY Expr
log (PCDATA | value)* label (CDATA), expr (Script)
reprompt EMPTY -
Filled Module (VoiceXML)
filled (Exec)* mode ("any" | "all"*), namelist (CDATA)
Flow Control Module (VoiceXML)
if (Exec | elseif | else)* cond (Script)
else EMPTY -
elseif EMPTY cond (Script)
return EMPTY namelist (CDATA), event (NMTOKEN), eventexpr (Script), message (CDATA), messageexpr (Script)
Forms Module (VoiceXML)
form (Form)* id (ID)
field (Audio | EventHandler | filled | enumerate | grammar | link | vxml:option | prompt | property)* Item, type (GrammarType), slot (NMTOKEN), modal (Boolean), xv:id (ID)
record (Audio | EventHandler | filled | grammar | prompt | property)* Item, type (ContentType), beep (Boolean), maxtime (Duration), modal (Boolean), dtmfterm (Boolean), finalsilence (Duration)
subdialog (Audio | filled | param | prompt | property)* Item, Cache, Submit, src (URI), srcexpr (Script), fetchaudio (URI)
block Exec Item
initial (Audio | EventHandler | link | prompt | property)* Item
Hypertext Module (XHTML)
a (PCDATA | Inline - a)* Common, Access, Linking, hreflang (LanguageCode)
Image Module (XHTML)
img EMPTY Common, Dim, alt* (Text), longdesc (URI), src* (URI)
Link Module (XHTML)
List Module (XHTML)
dl (dd | dt)+ Common
dt (PCDATA | Inline)* Common
dd (PCDATA | Flow)* Common
ol li+ Common
ul li+ Common
li (PCDATA | Flow)* Common
Metainformation Module (XHTML)
meta EMPTY I18N, content* (CDATA), http-equiv (NMTOKEN), name (NMTOKEN), scheme (CDATA)
Object Module (XHTML)
object (PCDATA | Flow | param)* Common, Dim, archive (URI), classid (URI), codebase (URI), codetype (ContentType), data (URI), declare ("declare"), name (CDATA), standby (Text), tabindex (Number), type (ContentType)
param EMPTY id (IDREF), name* (CDATA), type (ContentType), value (CDATA), valuetype ("data"* | "ref" | "object")
Option Module (VoiceXML)
vxml:option PCDATA dtmf (CDATA), value (CDATA)
Output Module (VoiceXML)
prompt (Audio | TTS | lexicon | meta | metadata)* I18N, VoiceHandler, bargein (Boolean), bargeintype ("speech" | "hotword"), timeout (Duration), xml:base (URI), version ("1.0"), xv:src (URI), xv:expr (CDATA)
value EMPTY expr (Script)
audio (Audio | TTS | desc)* Cache, src (URI), expr (Script)
desc PCDATA xml:lang (NMTOKEN)
lexicon EMPTY uri (URI), type (ContentType)
emphasis SentenceContent level ("strong" | "moderate"* | "none" | "reduced")
voice (SentenceContent | Structure)* I18N, gender ("male" | "female" | "neutral"), age (Number), variant (Number), name (CDATA)
break EMPTY strength ("x-weak" | "weak" | "medium"* | "strong" | "x-strong" | "none"), time (Duration)
prosody (SentenceContent | Structure)* pitch (CDATA), contour (CDATA), range (CDATA), rate (CDATA), duration (Duration), volume (CDATA)
say-as (PCDATA | value)* interpret-as (NMTOKEN), format (NMTOKEN), detail (CDATA)
meta EMPTY name (NMTOKEN), content (CDATA), http-equiv (NMTOKEN)
metadata ANY  
phoneme PCDATA ph (CDATA), alphabet (CDATA)
p (SentenceContent | s)* I18N
s SentenceContent I18N
sub PCDATA alias (CDATA)
mark EMPTY name (CDATA)
Resources Module (VoiceXML)
param EMPTY Expr, value (CDATA), valuetype ("data"* | "ref"), type (CDATA)
property EMPTY name (NMTOKEN), value (CDATA)
Scripting Module (XHTML)
script PCDATA charset (CharSet), defer ("defer"), src (URI), type* (ContentType), xml:space="preserve", declare ("declare")
noscript (Heading | Block | List)+ Common
Structure Module (XHTML)
body (Heading | Block | List)* Common
html head, body I18N, version (CDATA), xmlns (URI = "http://www.w3.org/1999/xhtml")
title PCDATA I18N
Text Module (XHTML)
abbr (PCDATA | Inline)* Common
acronym (PCDATA | Inline)* Common
address (PCDATA | Inline)* Common
blockquote (PCDATA | Heading | Block | List)* Common, cite (URI)
br EMPTY Core
cite (PCDATA | Inline)* Common
code (PCDATA | Inline)* Common
dfn (PCDATA | Inline)* Common
div (PCDATA | Flow)* Common
em (PCDATA | Inline)* Common
h1 (PCDATA | Inline)* Common
h2 (PCDATA | Inline)* Common
h3 (PCDATA | Inline)* Common
h4 (PCDATA | Inline)* Common
h5 (PCDATA | Inline)* Common
h6 (PCDATA | Inline)* Common
kbd (PCDATA | Inline)* Common
p (PCDATA | Inline)* Common
pre (PCDATA | Inline)* Common, xml:space="preserve"
q (PCDATA | Inline)* Common, cite (URI)
samp (PCDATA | Inline)* Common
span (PCDATA | Inline)* Common
strong (PCDATA | Inline)* Common
var (PCDATA | Inline)* Common
User Input Module (VoiceXML)
grammar (PCDATA | meta | metadata | lexicon | tag | rule)* Cache, I18N, version (NMTOKEN), root (IDREF), mode ("voice"* | "dtmf"), src (URI), scope ("document" | "dialog"), type (ContentType), weight (CDATA), tag-format (URI), xml:base (URI)
example PCDATA  
lexicon EMPTY uri (URI), type (ContentType)
tag PCDATA  
token PCDATA I18N
item (RuleExpansion)* I18N, weight (NMTOKEN), repeat (NMTOKEN), repeat-prob (NMTOKEN)
meta EMPTY name (NMTOKEN), content (CDATA), http-equiv (NMTOKEN)
metadata ANY  
one-of (item)+ I18N
rule (RuleExpansion | example)* id (ID), scope ("private"* | "public")
ruleref EMPTY uri (URI), type (ContentType), special ("NULL" | "VOID" | "GARBAGE")
XML Events Module (XML Events)
listener EMPTY XEvents
XHTML+Voice Extension Module (XHTML+Voice)
sync EMPTY input (NMTOKEN), field (URI), html-form-id (IDREF)
cancel EMPTY id (ID), voice-handler (URI)

Elements Attributes
vxml:field& id (ID)
vxml:prompt& src (URI) | expr (CDATA)
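
The extension module rows above can be read together with the Forms Module entry for field, which lists xv:id. The following is a minimal sketch of how sync might tie an XHTML input to a VoiceXML field. The ids "drinkform" and "drinkfield", the input name "drinkinput", the action "order", and the grammar file "drinks.grxml" are hypothetical. Writing the sync attributes with the xv: prefix is an assumption made here, since the table lists them without a prefix:

```xml
<?xml version="1.0"?>
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:vxml="http://www.w3.org/2001/vxml"
      xmlns:xv="http://www.voicexml.org/2002/xhtml+voice">
  <head>
    <title>Sync sketch</title>
    <vxml:form id="drinkform">
      <!-- xv:id gives the field an ID that sync can reference by URI. -->
      <vxml:field name="drink" xv:id="drinkfield">
        <vxml:prompt>What would you like to drink?</vxml:prompt>
        <!-- Hypothetical external grammar file. -->
        <vxml:grammar src="drinks.grxml"/>
      </vxml:field>
    </vxml:form>
    <!-- input names the XHTML control; field points at the VoiceXML field.
         The xv: prefix on these attributes is an assumption. -->
    <xv:sync xv:input="drinkinput" xv:field="#drinkfield"/>
  </head>
  <body>
    <form action="order">
      <p><input type="text" name="drinkinput"/></p>
    </form>
  </body>
</html>
```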

3.5.2 Element content shorthands

Table 3: Element Entities and Content
Element Entities Content
Audio (VoiceXML) PCDATA | audio | value | enumerate
Block (XHTML) address | blockquote | div | p | pre
EventHandler (VoiceXML) catch | help | noinput | nomatch | error
Exec (VoiceXML) Audio | assign | clear | if | log | prompt | reprompt | return | throw | var
Flow (XHTML) Heading | List | Block | Inline
Form (VoiceXML) EventHandler | grammar | filled | initial | property | record | subdialog | Variable
Heading (XHTML) h1 | h2 | h3 | h4 | h5 | h6
Inline (XHTML) a | abbr | acronym | button | br | cite | code | dfn | em | img | input | kbd | label | object | q | samp | select | span | strong | textarea
RuleExpansion (VoiceXML) PCDATA | token | ruleref | item | one-of | tag
SentenceContent (VoiceXML) Audio | SentenceElements
SentenceElements (VoiceXML) break | emphasis | phoneme | mark | prosody | say-as | voice | sub
Structure (VoiceXML) s | p
TTS (VoiceXML) SentenceElements | Structure
Variable (VoiceXML) block | field | var
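
To show how these shorthands expand in practice, here is a small sketch of a field whose content is drawn from the sets above. The form id, field name, prompt text, and the grammar file "drinks.grxml" are made up for illustration:

```xml
<vxml:form id="drinkform" xmlns:vxml="http://www.w3.org/2001/vxml">
  <!-- field is part of the Variable shorthand, which form allows. -->
  <vxml:field name="drink">
    <vxml:prompt>Coffee or tea?</vxml:prompt>
    <!-- Hypothetical external grammar file. -->
    <vxml:grammar src="drinks.grxml"/>
    <!-- filled takes Exec content; if is one member of Exec. -->
    <vxml:filled>
      <vxml:if cond="drink == 'coffee'">
        <!-- Plain text is allowed here: Exec includes Audio, and Audio includes PCDATA. -->
        One coffee coming up.
      <vxml:elseif cond="drink == 'tea'"/>
        One tea coming up.
      <vxml:else/>
        Sorry, that is not on the menu.
        <vxml:clear namelist="drink"/>
      </vxml:if>
    </vxml:filled>
  </vxml:field>
</vxml:form>
```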

3.5.3 Attribute list shorthands

Table 4: Attribute Entities and Content
Attribute Entities Content
Access (XHTML) accesskey (Character), tabindex (Number)
Align (XHTML) align ("left" | "center" | "right"), valign ("top" | "middle" | "bottom")
Cache (VoiceXML) fetchhint ("prefetch" | "safe"), fetchtimeout (Duration), maxage (Number), maxstale (Number)
Cell (XHTML) abbr (Text), axis (CDATA), colspan (Number), headers (IDREFS), rowspan (Number), scope ("row" | "col")
Common (XHTML) Core, Events, XEvents
Core (XHTML) class (NMTOKENS), id (ID), title (CDATA)
Dim (XHTML) height (Length), width (Length)
Events (XHTML) MouseEvents, KeyEvents
Expr (VoiceXML) name (VarName), expr (Script)
I18N (XML) xml:lang (NMTOKEN)
Item (VoiceXML) name (VarName), cond (Script), expr (Script)
KeyEvents (XHTML) onkeypress (Script), onkeydown (Script), onkeyup (Script)
Linking (XHTML) charset (CharSet), href (URI), hreflang (LanguageCode), rel (LinkTypes), rev (LinkTypes), type (ContentType)
MouseEvents (XHTML) onclick (Script), ondblclick (Script), onmousedown (Script), onmouseover (Script), onmousemove (Script), onmouseout (Script)
Style (XHTML) style (CDATA)
VoiceHandler (VoiceXML) count (Number), cond (Script)
XEvents (XML Events) event, observer (IDREF), handler (URI), target (IDREF), phase ("capture" | "default"*), propagate ("stop" | "continue"*), defaultAction ("cancel" | "perform"*), id
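
As a sketch of how an attribute shorthand expands on a concrete element, the audio element below carries the four Cache attributes, and the enclosing prompt carries its I18N and VoiceHandler attributes. The file name "welcome.wav" and all attribute values are invented for this example:

```xml
<!-- prompt takes I18N (xml:lang) and VoiceHandler (count, cond). -->
<vxml:prompt xmlns:vxml="http://www.w3.org/2001/vxml"
             xml:lang="en-US" count="1">
  <!-- audio takes the Cache shorthand: fetchhint, fetchtimeout, maxage, maxstale. -->
  <vxml:audio src="welcome.wav"
              fetchhint="prefetch" fetchtimeout="10s"
              maxage="3600" maxstale="0">
    <!-- Text content of the audio element. -->
    Welcome.
  </vxml:audio>
</vxml:prompt>
```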

Attribute types

Table 5: Attribute Types
Attribute Type Description
Boolean "true" | "false"
Duration A positive real number followed by either 's' (seconds) or 'ms' (milliseconds)
GrammarType CDATA
VarName NMTOKEN or NMTOKEN with "$" appended
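
The Boolean and Duration types might appear in markup as follows. This is a sketch with made-up values; the form id "memo" and the variable name "msg" are hypothetical:

```xml
<vxml:form id="memo" xmlns:vxml="http://www.w3.org/2001/vxml">
  <!-- Boolean attributes take "true" or "false";
       Duration attributes take a number followed by "s" or "ms". -->
  <vxml:record name="msg" beep="true" dtmfterm="false"
               maxtime="30s" finalsilence="2500ms">
    <vxml:prompt bargein="false" timeout="5s">
      Record your message after the beep.
    </vxml:prompt>
  </vxml:record>
</vxml:form>
```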

4 XML Events Module

4.1 Listener

XHTML+Voice extends XHTML with the XML Events <listener> element and its attributes. The <listener> attributes are added to XHTML elements primarily for activating voice handlers. The <listener> element and attributes belong to the XML Events namespace:

xmlns:ev="http://www.w3.org/2001/xml-events"
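
A minimal sketch of the element form follows, assuming a paragraph with the hypothetical id "hellobutton" and a voice handler with the hypothetical id "sayhello". The attribute names on <listener> are those listed under XEvents in Table 4:

```xml
<?xml version="1.0"?>
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:vxml="http://www.w3.org/2001/vxml"
      xmlns:ev="http://www.w3.org/2001/xml-events">
  <head>
    <title>Listener sketch</title>
    <vxml:form id="sayhello">
      <vxml:block>Hello.</vxml:block>
    </vxml:form>
    <!-- When a click event reaches the observer element,
         the voice handler identified by the handler URI is activated. -->
    <ev:listener event="click" observer="hellobutton" handler="#sayhello"/>
  </head>
  <body>
    <p id="hellobutton">Click here to be greeted.</p>
  </body>
</html>
```

On the <listener> element itself the attributes are written without a prefix. When the same attributes are placed on an XHTML element, they carry the XML Events prefix (ev: in this sketch).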

4.2 Event Types

For a given XML language extended with XML Events, a set of event types must be specified independently of the [XML Events] module. The XML Events event types supported by the XHTML+Voice profile include all event types defined for [HTML 4.01] intrinsic events. VoiceXML handler activation is specified by including, with an XHTML element, one of these event types as an XML Events event and an ID reference to the VoiceXML form as an XML Events event handler.
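
The activation pattern described above can be sketched as follows. The input name "drink" and the handler id "fid" follow the naming of the example later in this section, while the action "order" and the grammar file "drinks.grxml" are hypothetical:

```xml
<?xml version="1.0"?>
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:ev="http://www.w3.org/2001/xml-events"
      xmlns:vxml="http://www.w3.org/2001/vxml">
  <head>
    <title>Focus activates a voice handler</title>
    <vxml:form id="fid">
      <vxml:field name="f1">
        <vxml:prompt>What would you like to drink?</vxml:prompt>
        <!-- Hypothetical external grammar file. -->
        <vxml:grammar src="drinks.grxml"/>
      </vxml:field>
    </vxml:form>
  </head>
  <body>
    <form action="order">
      <p>
        <!-- ev:event names the HTML 4.01 intrinsic event type;
             ev:handler references the VoiceXML form by its ID. -->
        <input type="text" name="drink" ev:event="focus" ev:handler="#fid"/>
      </p>
    </form>
  </body>
</html>
```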

The XHTML+Voice profile supports the following VoiceXML 2.0 event types: nomatch, noinput, error, and help. These event types are emitted to the XHTML container as the following XHTML+Voice event types: vxmlnomatch, vxmlnoinput, vxmlerror, and vxmlhelp, respectively. The VoiceXML exit and cancel event types are supported within the VoiceXML form but are not propagated to the visual browser. Event types defined by the author within VoiceXML, also known as application-defined event types, are also propagated to the visual browser. However, the VoiceXML <form> element does not support adding the XML Events attributes.
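
For orientation, this is roughly how three of these VoiceXML event types could be caught inside the voice handler itself, using the Events Module elements from Table 2. The sketch shows only the VoiceXML side and makes no claim about how local handling interacts with propagation to the XHTML container. The prompt text and the grammar file "drinks.grxml" are invented:

```xml
<vxml:form id="fid" xmlns:vxml="http://www.w3.org/2001/vxml">
  <vxml:field name="f1">
    <vxml:prompt>Please say a drink.</vxml:prompt>
    <!-- Hypothetical external grammar file. -->
    <vxml:grammar src="drinks.grxml"/>
    <!-- Runs when the user says nothing before the timeout. -->
    <vxml:noinput>
      I did not hear anything.
      <vxml:reprompt/>
    </vxml:noinput>
    <!-- Runs when the utterance does not match the grammar. -->
    <vxml:nomatch>
      I did not understand that.
      <vxml:reprompt/>
    </vxml:nomatch>
    <!-- Runs when the user asks for help. -->
    <vxml:help>
      You can say coffee, tea, or milk.
    </vxml:help>
  </vxml:field>
</vxml:form>
```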

An additional XHTML+Voice event type, "vxmldone", is supported. The vxmldone event is generated when the voice handler completes.

The XHTML+Voice profile extends the XHTML <script> element with XML Events. The <script> element does not generate any events of its own, so the observer attribute is required: it identifies the node in the XHTML tree whose events the script observes. The <script> element can observe any HTML 4.01 intrinsic event or VoiceXML event. The following example shows a <script> element acting as the handler for a "vxmldone" event. The value of the XHTML input "drink" is updated when the voice handler "fid" completes:

<?xml version="1.0"?>
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:ev="http://www.w3.org/2001/xml-events"
      xmlns:vxml="http://www.w3.org/2001/vxml"
      xmlns:xv="http://www.voicexml.org/2002/xhtml+voice" >
  <head><title>Script Event Handler</title>

    <script type="text/javascript" 
      ev:event="vxmldone" ev:observer="fid" declare="declare">
      document.xform.drink.value = application.lastresult$[0].utterance;
    </script>
    <vxml:form id="fid">
      <vxml:field name="f1">