XHTML+Voice (X+V for short) is a markup language that adds voice interaction, based on VoiceXML,
to today's web applications. Multimodal applications are the next generation of web applications,
and with X+V they are created and delivered using VoiceXML together with the XHTML that web developers already know.
In the past decade we have seen an explosion in the number of browser-based visual applications,
from the broad examples you use every day, such as accessing email, movie or flight schedules, and
financial information, to assisting surgeons in getting the latest patient information or helping
field technicians get detailed schematics while working on machinery. Now more than ever, end users,
and the decision makers who serve them, are expecting and demanding anytime/anyplace access to the
information that is most relevant to them and the task at hand.
During the same period, the number of cell phones, PDAs, laptops, and other access points running
web browsers has grown rapidly. The challenge we face is that while wireless networks have expanded
to support new devices in new locations, shrinking physical sizes and hands-busy environments mean
that traditional input methods are often not enough for efficient use of most Web applications.
Imagine how much more efficient you would be if you could ask your bank's mobile Web site for
"my current account balance" and see the results, or ask your portal for "my high priority emails
from work" and spot an important one fourth in the list. How about interacting with your
entertainment center?
How many movie and music choices do you have? How do you browse through all of them? Using a
multimodal interface, you could easily navigate your set-top box directly to 'Mozart' or to
documentaries playing this weekend.
Enter the XHTML+Voice (X+V) markup language. By allowing web applications to use audible input and
output, X+V lets you speak information into a Web application and hear the results in addition to
seeing them. It frees you from the current constraints of delivering information to small devices,
or even to traditional computers in environments where hands or eyes are busy.
At the same time, X+V is built on open web standards, such as XHTML and VoiceXML, giving your
Web-based multimodal applications the broadest possible reach. Thousands of Web
applications can now be delivered to millions of devices, creating a truly portable solution that
increases efficiency, reduces costs, increases revenue, and improves customer satisfaction.
Developers who understand this opportunity will truly be prepared for the convergence of
these two powerful trends.
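To make the idea concrete, here is a minimal sketch of what an X+V page can look like. The element and namespace names follow the XHTML+Voice profile; the form id, prompt text, and grammar word are invented for illustration, and real pages vary by browser and profile version.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:vxml="http://www.w3.org/2001/vxml"
      xmlns:ev="http://www.w3.org/2001/xml-events">
  <head>
    <title>X+V sketch</title>
    <!-- A VoiceXML dialog embedded in the XHTML head -->
    <vxml:form id="balanceDialog">
      <vxml:field name="command">
        <vxml:prompt>Say balance to hear your account balance.</vxml:prompt>
        <vxml:grammar>
          <!-- Inline SRGS grammar listing the words the field accepts -->
          <grammar xmlns="http://www.w3.org/2001/06/grammar" root="cmd">
            <rule id="cmd">balance</rule>
          </grammar>
        </vxml:grammar>
      </vxml:field>
    </vxml:form>
  </head>
  <body>
    <!-- XML Events attributes bind the voice dialog to an ordinary
         DOM event: clicking the paragraph activates the voice form -->
    <p ev:event="click" ev:handler="#balanceDialog">
      Click here, then speak.
    </p>
  </body>
</html>
```

The visual part is plain XHTML; the voice part is plain VoiceXML; XML Events glues the two together, which is why existing Web and voice skills carry over.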
IBM Software Group's multimodal technologies site
"As devices become smaller, modes of interaction other than keyboard and stylus are a
necessity. In particular, small handheld devices like cell phones and PDAs serve many functions
and contain sufficient processing power to handle a variety of tasks. Present and future devices
will greatly benefit from the use of multimodal access methods."
Opera Software's multimodal page
"The multimodal browser being developed by IBM and Opera is based on the XHTML+Voice (X+V)
specification.
This project builds upon IBM's and Opera's ongoing relationship. In 2001, IBM, Motorola and
Opera submitted the multimodal standard X+V to the standards body W3C. This mark-up language
leverages existing standards already familiar to voice and Web developers, so they can use
their skills and resources to extend current applications instead of building new ones from
the ground up."
ACCESS Systems NetFront Browser
"NetFront supports advanced mobile voice recognition technologies based on the
XHTML+Voice (X+V) 1.1 framework. X+V supports voice synthesis and voice recognition of mobile Internet
data allowing voice input and output interaction with voice supported Web pages."
Compound XML Document Editor
"A compound XML document combines XML markup from several namespaces into a single physical
document. A number of standards exist, and continue to be developed, that are descriptions of XML
markup within a single namespace. XHTML, XForms, XML Events, Scalable Vector Graphics (SVG),
VoiceXML, and MathML are prominent examples of such standards, each having its own namespace...
Sample models for these XML-based standards are provided with the Compound XML Document Editor
distribution; documents having markup of these types therefore may be created and edited immediately
upon installing:
XHTML 1.0
XForms 1.0
XML Events
Scalable Vector Graphics (SVG) 1.1
Synchronized Multimedia Integration Language (SMIL) 2.0
MathML 2.0
XLink 1.0
VoiceXML 2.0
XHTML + VoiceXML (X+V)
XML-based User Interface Language (XUL) 1.0"
XHTML+Voice Profile 1.0
W3C Note 21 December 2001
"The XHTML+Voice profile brings spoken interaction to standard WWW content by
integrating a set of mature WWW technologies such as XHTML and XML Events with XML vocabularies
developed as part of the W3C Speech Interface Framework. The profile includes voice modules that
support speech synthesis, speech dialogs, command and control, speech grammars, and the ability to
attach Voice handlers for responding to specific DOM events, thereby re-using the event model
familiar to web developers. Voice interaction features are integrated directly with XHTML and CSS,
and can consequently be used directly within XHTML content."
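The handler attachment described above uses the standard XML Events attributes. The fragment below is a hedged sketch (all ids and field names are invented, and the exact scripting hooks for updating the page vary by implementation) of how a filled voice field might write its result back into a visual form field:

```xml
<!-- In the head: a voice dialog; when the field is filled, the
     recognized value is copied into the visual text input via the DOM -->
<vxml:form id="cityDialog" xmlns:vxml="http://www.w3.org/2001/vxml">
  <vxml:field name="city">
    <vxml:prompt>Which city?</vxml:prompt>
    <vxml:filled>
      <vxml:assign name="document.getElementById('cityInput').value"
                   expr="city"/>
    </vxml:filled>
  </vxml:field>
</vxml:form>

<!-- In the body: XML Events attributes make focusing the input
     start the voice dialog, re-using the familiar DOM event model -->
<input type="text" id="cityInput"
       ev:event="focus" ev:handler="#cityDialog"
       xmlns:ev="http://www.w3.org/2001/xml-events"/>
```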
W3C Multimodal Interaction Activity
"The Multimodal Interaction Activity seeks to extend the Web to allow users to dynamically
select the most appropriate mode of interaction for their current needs, including any disabilities,
whilst enabling developers to provide an effective user interface for whichever modes the user selects.
Depending upon the device, users will be able to provide input via speech, handwriting, and keystrokes,
with output presented via displays, pre-recorded and synthetic speech, audio, and tactile mechanisms
such as mobile phone vibrators and Braille strips.
Multimodal interaction offers significant ease of use benefits over uni-modal interaction, for instance,
when hands-free operation is needed, for mobile devices with limited keypads, and for controlling other
devices when a traditional desktop computer is unavailable to host the application user interface. This
is being driven by advances in embedded and network-based speech processing that are creating
opportunities for integrated multimodal Web browsers and for solutions that separate the handling of
visual and aural modalities, for example, by coupling a local XHTML user agent with a remote VoiceXML
user agent."