Speech Technology
Automatic speech recognition (ASR) systems have
greatly improved in recent years as better
algorithms and acoustic models have been developed,
and as more computer power can be brought
to bear on the task. An ASR system running
on an inexpensive home or office computer
with a good microphone can take free-form
dictation, as long as it has been pre-trained
for the speaker's voice. Over the phone,
and with no speaker training, a speech
recognition system needs to be given a
set of speech grammars that tell it what
words and phrases it should expect. Within
these constraints, a surprisingly large
set of possible utterances can be recognized
(e.g., a particular mutual fund name out
of thousands). Recognition over mobile
phones in noisy environments, while problematic,
can be improved with a new technology
called distributed speech recognition,
where the early analysis is done on the
handset. Speech recognition is used today
in large numbers of commercial applications.
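
The effect of a grammar can be illustrated with a toy sketch in Python. This is not a recognizer: it assumes some upstream engine has already produced a noisy text hypothesis, and it simply picks the closest phrase from a fixed list using the standard library's difflib. The fund names, the function name match_to_grammar, and the cutoff value are all made up for illustration.

```python
import difflib

# Hypothetical grammar: the only phrases the application expects to hear.
FUND_GRAMMAR = [
    "global growth fund",
    "small cap value fund",
    "municipal bond fund",
    "international equity fund",
]

def match_to_grammar(hypothesis, grammar, cutoff=0.6):
    """Return the grammar phrase closest to a noisy hypothesis, or None.

    The cutoff is an assumed similarity threshold between 0 and 1;
    hypotheses that resemble nothing in the grammar are rejected.
    """
    matches = difflib.get_close_matches(
        hypothesis.lower(), grammar, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None
```

Because only listed phrases can be returned, a garbled hypothesis such as "globl groth fund" still resolves to "global growth fund", while input that resembles nothing in the list is rejected. A real recognizer applies the constraint much earlier, during acoustic search, but the principle is similar.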
Advances are also being made in speech synthesis,
or text-to-speech (TTS). Older TTS systems
generate speech completely from scratch,
and tend to sound like "drunken robots".
They can be hard to listen to, and at
times even incomprehensible. But newer
TTS systems are much more lifelike: they
use a technique called waveform concatenation,
in which speech is generated from libraries
of pre-recorded waveforms.
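
The idea behind waveform concatenation can be sketched in a few lines of Python. This is a deliberately simplified illustration, not how any particular TTS product works: it assumes each unit (here a whole word) has one pre-recorded waveform stored as a list of samples, and it joins them with an optional linear crossfade. The library contents, the function name concatenate_units, and the crossfade scheme are assumptions; real systems choose among many recordings per unit and smooth the joins far more carefully.

```python
# Hypothetical library: each unit maps to a short list of audio samples.
UNIT_LIBRARY = {
    "your": [1.0, 1.0],
    "balance": [3.0, 3.0, 3.0],
}

def concatenate_units(units, library, crossfade=0):
    """Join the pre-recorded waveform of each unit, in order.

    crossfade is the number of samples blended at each join
    (an assumed smoothing step; 0 means a plain butt join).
    """
    output = []
    for unit in units:
        samples = library[unit]
        # Never blend more samples than either side actually has.
        n = min(crossfade, len(output), len(samples))
        start = len(output) - n
        for i in range(n):
            w = (i + 1) / (n + 1)  # weight ramps toward the new unit
            output[start + i] = (1 - w) * output[start + i] + w * samples[i]
        output.extend(samples[n:])
    return output
```

With crossfade=0 the result is simply the recordings laid end to end; a small crossfade overlaps neighbouring units to soften the audible seam at each join.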
It is important to note here that VoiceXML
can be used even in environments lacking
speech technology. Audio output can consist
entirely of pre-recorded prompts, and
input can be exclusively from the keypad.
While speech technology makes applications
much more powerful and pleasant to use,
VoiceXML also brings the advantages of
web development and deployment to older
styles of computer telephony applications.
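
As a rough illustration of such an application, the Python sketch below holds a small keypad-only page as a string and lists the recordings it plays. The page assumes VoiceXML 2.0 syntax (the inputmodes property and the built-in digits field type); the form and field names, the .wav file names, and the helper audio_prompts are placeholders invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical keypad-only VoiceXML page (VoiceXML 2.0 syntax assumed).
KEYPAD_ONLY_PAGE = """<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <!-- Accept touch-tone input only, so no speech recognizer is needed. -->
  <property name="inputmodes" value="dtmf"/>
  <form id="account">
    <field name="account_number" type="digits">
      <!-- Output is a pre-recorded file, so no text-to-speech is needed. -->
      <prompt><audio src="enter_account_number.wav"/></prompt>
      <filled>
        <prompt><audio src="thank_you.wav"/></prompt>
      </filled>
    </field>
  </form>
</vxml>"""

VXML_NS = "{http://www.w3.org/2001/vxml}"

def audio_prompts(page):
    """Return the src of every <audio> element, in document order."""
    root = ET.fromstring(page)  # raises ParseError if the page is malformed
    return [el.get("src") for el in root.iter(VXML_NS + "audio")]
```

Since the page is ordinary XML, it could be served and updated like any other web document, which is the deployment advantage described above.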
The Ubiquitous Web
The Internet extends to more devices than
personal computers. Some examples are
personal organizers with wireless data
connections, mobile phones supporting
the Wireless Application Protocol (WAP),
and NTT Docomo's i-mode phones. The future
will bring more web-enabled devices: overnight
delivery drop-off boxes that schedule
pickups and record their contents, networked
MP3 portables, vending machines that reorder
supplies when running low, wall displays
that download artwork, web-based stereo
receivers and televisions, and many others.
Speech technology is a very natural and powerful
interface for ubiquitous web devices.
Microphones are much smaller than keyboards
and keypads, and speakers smaller than
screens. So it seems quite likely that
many future web devices will have on-board
speech recognition (as do some mobile
phones today).