Spoken language interfaces: The OM system
Jean-Michel Lunati and Alexander I. Rudnicky
School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213

We would like to thank the following individuals for their contributions to the project, including Paul Baker, Kathryn Arceneaux, Eric Thayer, Robert Weide, Alexander Franz. The research described in this paper was sponsored by the Defense Advanced Research Projects Agency (DOD), Arpa Order No. 5167, monitored by SPAWAR under contract N00039-85-C-0163. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Defense Advanced Research Projects Agency or the US Government.

The intrinsic properties of speech communication (e.g., the presence of malformed utterances) and the characteristics of current recognition technology (inaccurate recognition) pose special problems for the design of a speech interface. We are interested in understanding these problems and in identifying an interface structure that allows speech to be a useful form of computer input. Ultimately, our goal is to understand how to turn speech into a conventional input modality, well integrated into a multi-modal interface that includes keyboard and mouse.

To fully exploit the advantages of spoken communication, a spoken language system must afford the user the following forms of flexibility: use of speaker-independent recognition, natural language, and a natural flow of interaction. The Carnegie Mellon Spoken Language Shell (CM-SLS) attempts to provide such flexibility through continuous-speech recognition, natural language processing, as well as rudimentary "conversational skill" heuristics.

Figure 1: Components of the spoken language interface.

Figure 1 shows the functional decomposition of the spoken language interface. The design decomposes a recognition system into functionally independent units, each corresponding to a necessary function in the speech interface. Note that we have not created novel elements; these functions are implicit in all existing recognition systems but typically are not explicitly identified or recognized as separable components of the interface. The present design provides an explicit separation of these functions, simplifying the exploration of issues that correspond to each component. A good interface design embodies a clear functional decomposition, which in turn simplifies system implementation and allows for independent development of different components. The CM-SLS has proven useful, allowing us to implement several different systems while maintaining modularity.

ATTENTION MANAGER

The signal processing component produces a continuous stream of coded speech. The Attention Manager (AM) segments utterance-sized units from this stream and routes these utterances to the recognition engine.

The AM implements a range of strategies. At one extreme, the user explicitly controls the signal acquisition process by indicating to the system both the start and the end of an utterance (Push to Talk and Push and Hold modes). At a more complex level of function, the system determines one or both of these points through automatic end-point detection (Push to Start and Continuous Listening modes). Ideally, the AM should be capable of determining not only the bounds of a true utterance but also know to reject unintentional utterances (and noise) and be able to determine whether the user is actually addressing the computer (as opposed to another voice agent in the environment).
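The automatic end-point detection used by the Attention Manager can be illustrated with a minimal sketch. The paper does not say which algorithm the AM uses, so the energy-threshold method below is an assumption chosen for simplicity, and the function name `detect_utterances` and its parameters (`threshold`, `min_speech_frames`, `hangover_frames`) are hypothetical. It segments a stream of per-frame energies into utterance-sized units and discards very short bursts, loosely mirroring the rejection of noise.

```python
from typing import List, Tuple


def detect_utterances(
    frame_energies: List[float],
    threshold: float = 0.1,       # assumed energy level separating speech from silence
    min_speech_frames: int = 3,   # shorter bursts are treated as noise and rejected
    hangover_frames: int = 5,     # trailing silence required to close an utterance
) -> List[Tuple[int, int]]:
    """Return (start, end) frame indices, end exclusive, of detected utterances."""
    utterances: List[Tuple[int, int]] = []
    start = None      # first frame of the current candidate utterance, if any
    silence_run = 0   # consecutive low-energy frames since the last speech frame

    for i, energy in enumerate(frame_energies):
        if energy >= threshold:
            if start is None:
                start = i
            silence_run = 0
        elif start is not None:
            silence_run += 1
            if silence_run >= hangover_frames:
                end = i - silence_run + 1  # one past the last speech frame
                if end - start >= min_speech_frames:
                    utterances.append((start, end))
                start = None
                silence_run = 0

    # The stream ended while a candidate utterance was still open.
    if start is not None:
        end = len(frame_energies) - silence_run
        if end - start >= min_speech_frames:
            utterances.append((start, end))
    return utterances
```

In this sketch a Push to Start mode would correspond to fixing the start frame externally and letting only the end point be detected; Continuous Listening corresponds to running the loop unmodified over the whole stream.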
RECOGNITION ENGINE

The Recognition Engine (RE) decodes the input utterance into an ASCII string. In its present implementation, the RE functions as a dedicated server and allows multiple clients to share the same recognition facilities. Recognition imposes a high computational load on a computer and it is often impractical to have this process reside on a system on which several applications (themselves potentially requiring substantial resources) are active. Ideally, the RE would be implemented as a specialized co-processor. Currently, we use a separate computer for this purpose. A critical attribute of a recognition engine is its ability to decode speech in real-time. Real-time response (or rather response that is within a 200-300 msec delay of the end of an utterance) maintains the rhythm of interaction.

TASK MANAGER

Speech recognition systems are often built as monolithic processes. While this approach is adequate for a computer that runs a single speech application, it is inefficient for a computer that is meant to support a variety of speech-enabled applications. In the latter case it becomes more efficient to centralize speech resources and to allocate them dynamically to individual applications. The purpose of the Task Manager (TM) is to supervise, in the context of multiple voice-addressable applications, the assignment of the speech channel to the proper application. In our implementation, the actual services performed by the TM also include the maintenance of context information and the communication of this information to the RE, where it governs the selection of recognition knowledge bases.

CONFIRMATION MANAGER

The errorful nature of speech recognition compels the introduction of a component not normally found in other interface technologies, the Confirmation Manager (CM). The CM allows the user to intercept and edit a recognition before it is acted upon by the application.

In terms of human communication, the CM performs the error repair necessitated by breakdowns in the communication channel (such as might be caused by a noisy telephone line or a loud interruption). It does not concern itself with the consequences of errors due to some misunderstanding on the part of the user (although it does offer an opportunity for the immediate undoing of a just-spoken utterance).

THE OFFICE MANAGER SYSTEM

The Spoken Language Shell has been used to implement the Office Manager, a system that provides access to several common office applications.

Meaningful study of spoken language interaction requires a system that will be used on a daily basis and whose utility will persist past the initial stages of play and exploration. We believe that the Office Manager is such a system. Systems that do not have this persistence of utility will ultimately have little to tell us about spoken communication with computers.

The Office Manager (OM) domain is interesting for the following reasons: It provides a range of interaction requirements, from tight-loop (e.g., calculation) to open-ended (e.g., database retrieval); it supports meaningful problem-solving activity (e.g., scheduling, information search); and since the applications are directly usable in real-life settings, the activity of actual users can be studied over extended periods.

OM includes a personal database, an appointment calendar, a mailer interface and a calculator. OM itself understands a 36 word vocabulary and provides control functions, such as starting up and switching between applications, help invocation, etc.

The current implementation of the system includes a database of names and addresses of 172 conference participants. It is used by the Voice Mail and Personal Information Database components of OM. Database customizing tools are provided, allowing the non-expert user to create or modify entries in existing databases and to create new databases. Changes to a database result in automatic updates to the recognition knowledge base, allowing users who lack a speech background to easily extend the system.

All components of the system, with the exception of the Recognition Engine, are implemented on a NeXT computer, using Objective-C and the Nextstep interface. The RE server is implemented in C as a parallel program on a 4-processor Apollo 10040. Other platforms (DEC DS5000 and IBM R6000) are currently in use as REs. Word recognition accuracy for a typical database query, of perplexity about 105, is 92%.

CONCLUSION

Our future work includes the development of techniques for structuring recognition and parsing knowledge bases along "object" lines, to permit individual applications to inherit their language characteristics from the environment (the OM), and to encourage modularization and reusability of language components. The goal is to simplify the process of creating languages for particular applications by providing the developer not only with standard interface components but also with standard language components.
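The idea of structuring language knowledge along "object" lines, so that an application inherits language characteristics from its environment, can be pictured with a small sketch. The paper presents this only as future work and gives no design, so everything here is an assumption made for illustration: the class `LanguageComponent`, its `vocabulary` method, and the example phrases are hypothetical and are not the actual OM vocabulary.

```python
from typing import Optional, Set


class LanguageComponent:
    """A named set of phrases that may inherit phrases from a parent component."""

    def __init__(
        self,
        name: str,
        phrases: Set[str],
        parent: Optional["LanguageComponent"] = None,
    ) -> None:
        self.name = name
        self.own_phrases = set(phrases)
        self.parent = parent

    def vocabulary(self) -> Set[str]:
        """Phrases accepted here: the component's own plus everything inherited."""
        inherited = self.parent.vocabulary() if self.parent else set()
        return inherited | self.own_phrases


# Hypothetical environment-level control phrases shared by every application.
environment = LanguageComponent("environment", {"help", "switch application", "quit"})

# A hypothetical application that adds its own phrases on top of the inherited ones.
calculator = LanguageComponent("calculator", {"plus", "minus", "equals"}, parent=environment)
```

Under this arrangement a developer writing a new application would supply only the application-specific phrases, while control phrases come from the shared environment component.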