XHTML+Voice (X+V for short) is a markup language that adds voice interaction, based on VoiceXML,
to today's web applications. Multimodal applications are the next generation of web applications,
and with X+V they are created and delivered using VoiceXML together with the XHTML that web developers already know.
In the past decade we have seen an explosion in the number of browser-based visual applications,
from the broad examples you use every day, such as accessing email, movie or flight schedules, and
financial information, to assisting surgeons in getting the latest patient information or helping
field technicians get detailed schematics while working on machinery. Now more than ever, end users,
and the decision makers who serve them, are expecting and demanding anytime/anyplace access to the
information that is most relevant to them and the task at hand.
During the same period, the number of cell phones, PDAs, laptops, and other access points running
web browsers has grown rapidly. The challenge we face is that while wireless networks have expanded
to support new devices in new locations, shrinking physical sizes and hands-busy environments mean
that traditional input methods are often not enough for efficient use of most Web applications.
Imagine how much more efficient you would be if you could ask your bank's mobile Web site for
"my current account balance" and see the results, or ask your portal for "my high priority emails
from work" and spot an important one fourth in the list. How about interacting with your
entertainment center?
How many movie and music choices do you have? How do you browse through all of them? Using a
multimodal interface, you could easily navigate your set-top box directly to 'Mozart' or to
documentaries playing this weekend.
Enter the XHTML+Voice (X+V) markup language. By allowing web applications to use audible input and
output, X+V lets you speak information into a Web application and hear the results in addition to
seeing them. It frees you from the current constraints of delivering information to small devices,
or even to traditional computers in environments where hands or eyes are busy.
At the same time, X+V is built on open web standards, such as XHTML and VoiceXML, giving your
Web-based multimodal applications the broadest possible reach. Thousands of Web
applications can now be delivered to millions of devices, creating a truly portable solution that
increases efficiency, reduces costs, increases revenue, and improves customer satisfaction.
Developers who understand this opportunity will truly be prepared for the convergence of
these two powerful trends.
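To make the idea concrete, here is a minimal sketch of what an X+V page can look like. The element and namespace names follow the XHTML+Voice profile; the form id, prompt text, and grammar word are invented for illustration, and real pages vary by browser and profile version.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:vxml="http://www.w3.org/2001/vxml"
      xmlns:ev="http://www.w3.org/2001/xml-events">
  <head>
    <title>X+V sketch</title>
    <!-- A VoiceXML dialog embedded in the XHTML head -->
    <vxml:form id="balanceDialog">
      <vxml:field name="command">
        <vxml:prompt>Say balance to hear your account balance.</vxml:prompt>
        <vxml:grammar>
          <!-- Inline SRGS grammar listing the words the field accepts -->
          <grammar xmlns="http://www.w3.org/2001/06/grammar" root="cmd">
            <rule id="cmd">balance</rule>
          </grammar>
        </vxml:grammar>
      </vxml:field>
    </vxml:form>
  </head>
  <body>
    <!-- XML Events attributes bind the voice dialog to an ordinary
         DOM event: clicking the paragraph activates the voice form -->
    <p ev:event="click" ev:handler="#balanceDialog">
      Click here, then speak.
    </p>
  </body>
</html>
```

The visual part is plain XHTML; the voice part is plain VoiceXML; XML Events glues the two together, which is why existing Web and voice skills carry over.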
IBM Software Group's multimodal technologies site
"As devices become smaller, modes of interaction other than keyboard and stylus are a
necessity. In particular, small handheld devices like cell phones and PDAs serve many functions
and contain sufficient processing power to handle a variety of tasks. Present and future devices
will greatly benefit from the use of multimodal access methods."
Opera Software's multimodal page
"The multimodal browser being developed by IBM and Opera is based on the XHTML+Voice (X+V)
specification.
This project builds upon IBM's and Opera's ongoing relationship. In 2001, IBM, Motorola and
Opera submitted the multimodal standard X+V to the standards body W3C. This mark-up language
leverages existing standards already familiar to voice and Web developers, so they can use
their skills and resources to extend current applications instead of building new ones from
the ground up."
ACCESS Systems NetFront Browser
"NetFront supports advanced mobile voice recognition technologies based on the
XHTML+Voice (X+V) 1.1 framework. X+V supports voice synthesis and voice recognition of mobile Internet
data allowing voice input and output interaction with voice supported Web pages."
Compound XML Document Editor
"A compound XML document combines XML markup from several namespaces into a single physical
document. A number of standards exist, and continue to be developed, that are descriptions of XML
markup within a single namespace. XHTML, XForms, XML Events, Scalable Vector Graphics (SVG),
VoiceXML, and MathML are prominent examples of such standards, each having its own namespace...
Sample models for these XML-based standards are provided with the Compound XML Document Editor
distribution; documents having markup of these types therefore may be created and edited immediately
upon installing:
XHTML 1.0
XForms 1.0
XML Events
Scalable Vector Graphics (SVG) 1.1
Synchronized Multimedia Integration Language (SMIL) 2.0
MathML 2.0
XLink 1.0
VoiceXML 2.0
XHTML + VoiceXML (X+V)
XML-based User Interface Language (XUL) 1.0"
XHTML+Voice Profile 1.0
W3C Note 21 December 2001
"The XHTML+Voice profile brings spoken interaction to standard WWW content by
integrating a set of mature WWW technologies such as XHTML and XML Events with XML vocabularies
developed as part of the W3C Speech Interface Framework. The profile includes voice modules that
support speech synthesis, speech dialogs, command and control, speech grammars, and the ability to
attach Voice handlers for responding to specific DOM events, thereby re-using the event model
familiar to web developers. Voice interaction features are integrated directly with XHTML and CSS,
and can consequently be used directly within XHTML content."
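The handler attachment described above uses the standard XML Events attributes. The fragment below is a hedged sketch (all ids and field names are invented, and the exact scripting hooks for updating the page vary by implementation) of how a filled voice field might write its result back into a visual form field:

```xml
<!-- In the head: a voice dialog; when the field is filled, the
     recognized value is copied into the visual text input via the DOM -->
<vxml:form id="cityDialog" xmlns:vxml="http://www.w3.org/2001/vxml">
  <vxml:field name="city">
    <vxml:prompt>Which city?</vxml:prompt>
    <vxml:filled>
      <vxml:assign name="document.getElementById('cityInput').value"
                   expr="city"/>
    </vxml:filled>
  </vxml:field>
</vxml:form>

<!-- In the body: XML Events attributes make focusing the input
     start the voice dialog, re-using the familiar DOM event model -->
<input type="text" id="cityInput"
       ev:event="focus" ev:handler="#cityDialog"
       xmlns:ev="http://www.w3.org/2001/xml-events"/>
```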
W3C Multimodal Interaction Activity
"The Multimodal Interaction Activity seeks to extend the Web to allow users to dynamically
select the most appropriate mode of interaction for their current needs, including any disabilities,
whilst enabling developers to provide an effective user interface for whichever modes the user selects.
Depending upon the device, users will be able to provide input via speech, handwriting, and keystrokes,
with output presented via displays, pre-recorded and synthetic speech, audio, and tactile mechanisms
such as mobile phone vibrators and Braille strips.
Multimodal interaction offers significant ease of use benefits over uni-modal interaction, for instance,
when hands-free operation is needed, for mobile devices with limited keypads, and for controlling other
devices when a traditional desktop computer is unavailable to host the application user interface. This
is being driven by advances in embedded and network-based speech processing that are creating
opportunities for integrated multimodal Web browsers and for solutions that separate the handling of
visual and aural modalities, for example, by coupling a local XHTML user agent with a remote VoiceXML
user agent."