Spoken language interfaces: The OM system

Jean-Michel Lunati and Alexander I. Rudnicky
School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213

We would like to thank the following individuals for their contributions to the Office Manager project, including Paul Baker, Eric Thayer, Robert Weide, Kathryn Arceneaux, and Alexander Franz. The research described in this paper was sponsored by the Defense Advanced Research Projects Agency (DOD), Arpa Order No. 5167, monitored by SPAWAR under contract N00039-85-C-0163. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Defense Advanced Research Projects Agency or the US Government.

The intrinsic properties of speech (e.g., the presence of malformed utterances) and the characteristics of current recognition technology (inaccurate recognition) pose special problems for the design of a speech interface. We are interested in understanding these problems and in identifying an interface structure that allows speech to be a useful form of computer input. Ultimately, our goal is to understand how to turn speech into a conventional input modality, well integrated into a multi-modal interface that includes keyboard and mouse.

To fully exploit the advantages of spoken communication, a spoken language system must afford the user several forms of flexibility, including natural language and a natural flow of interaction. The Carnegie Mellon Spoken Language Shell (CM-SLS) attempts to provide such flexibility through the use of speaker-independent continuous-speech recognition, natural language processing, as well as rudimentary "conversational skill" heuristics.

Figure 1: Components of the spoken language interface.

Figure 1 shows the functional decomposition of the spoken language interface. The design decomposes the interface into functionally independent units, each corresponding to a necessary function in the speech interface. Note that we have not created novel elements; these functions are implicit in all existing recognition systems but typically are not explicitly identified as separable components of the interface. The present design provides an explicit separation of these functions, which in turn allows the exploration of issues that correspond to each component. A good interface design embodies a clear functional decomposition that simplifies system implementation and allows for independent development of different components. The CM-SLS decomposition has proven useful, allowing us to implement several different systems while maintaining modularity.

ATTENTION MANAGER

The signal processing component of a recognition system produces a continuous stream of coded speech. The Attention Manager (AM) segments utterance-sized units from this stream and routes these utterances to the recognition engine.

The AM implements a range of strategies. At one extreme, the user explicitly controls the signal acquisition process by indicating to the system both the start and the end of an utterance (Push to Talk and Push and Hold modes). At a more complex level of function, the system determines one or both of these points through automatic end-point detection (Push to Start and Continuous Listening modes). Ideally, the AM should be capable of determining not only the bounds of a true utterance but also know to reject unintentional utterances (and noise) and be able to determine whether the user is actually addressing the computer (as opposed to another agent in the environment).
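The range of Attention Manager strategies can be illustrated with a small sketch. This is not the CM-SLS implementation: the `Mode` enum, the `AUTOMATIC_ENDPOINTS` table, and the energy-threshold `segment` function are hypothetical names, and simple energy thresholding is only an assumed stand-in for whatever end-point detection the system actually uses. The split of "one or both" end points between Push to Start and Continuous Listening is likewise inferred from the mode names.

```python
from enum import Enum
from typing import List, Sequence, Tuple


class Mode(Enum):
    PUSH_TO_TALK = "Push to Talk"
    PUSH_AND_HOLD = "Push and Hold"
    PUSH_TO_START = "Push to Start"
    CONTINUOUS_LISTENING = "Continuous Listening"


# (system_finds_start, system_finds_end) per mode.
# Assumption: Push to Start automates only the end point,
# Continuous Listening automates both.
AUTOMATIC_ENDPOINTS = {
    Mode.PUSH_TO_TALK: (False, False),
    Mode.PUSH_AND_HOLD: (False, False),
    Mode.PUSH_TO_START: (False, True),
    Mode.CONTINUOUS_LISTENING: (True, True),
}


def segment(energies: Sequence[float], threshold: float = 0.5,
            min_silence: int = 3) -> List[Tuple[int, int]]:
    """Cut a stream of per-frame energies into utterance-sized units.

    Returns (start, end) frame indices with `end` exclusive. An utterance
    ends once `min_silence` consecutive frames fall below `threshold`.
    """
    utterances = []
    start = None      # index of the first speech frame of the open utterance
    silence = 0       # consecutive sub-threshold frames seen so far
    for i, energy in enumerate(energies):
        if energy >= threshold:
            if start is None:
                start = i
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= min_silence:
                utterances.append((start, i - silence + 1))
                start, silence = None, 0
    if start is not None:
        # Stream ended while an utterance was still open.
        utterances.append((start, len(energies) - silence))
    return utterances
```

A short pause inside an utterance (fewer than `min_silence` frames) does not split it, which is the minimum a Continuous Listening mode would need; rejecting noise or speech addressed to someone else is beyond this sketch.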


RECOGNITION ENGINE

The Recognition Engine (RE) decodes the input utterance into an ASCII string. In its present implementation, the RE functions as a dedicated server and allows multiple clients to share the same recognition facilities. Recognition imposes a high computational load on a computer and it is often impractical to have this process reside on a system on which several applications (themselves potentially requiring substantial resources) are active. Ideally, the RE would be implemented as a specialized co-processor. Currently, we use a separate computer for this purpose. A critical attribute of a recognition engine is its ability to decode speech in real-time. Real-time response (or rather response that is within a 200-300 msec delay of the end of an utterance) maintains the rhythm of interaction.

TASK MANAGER

Speech recognition systems are often built as monolithic processes. While this approach is adequate for a computer that runs a single speech application, it is inefficient for a computer that is meant to support a variety of speech-enabled applications. In the latter case it becomes more efficient to centralize speech resources and to allocate them dynamically to individual applications. The purpose of the Task Manager (TM) is to supervise, in the context of multiple voice-addressable applications, the assignment of the speech channel to the proper application. Other services performed by the TM also include the maintenance of context information and the communication of this information to the RE, where it governs the selection of recognition knowledge bases.

CONFIRMATION MANAGER

The errorful nature of speech recognition compels the introduction of a component not normally found in other interface technologies, the Confirmation Manager (CM). The CM allows the user to intercept and edit a recognition before it is acted upon by the application.

In terms of human communication, the CM performs the error repair necessitated by breakdowns in the communication channel (such as might be caused by a noisy telephone line or a loud interruption). It does not concern itself with the consequences of errors due to some misunderstanding on the part of the user (although it does offer an opportunity for the immediate undoing of a just-spoken utterance).

THE OFFICE MANAGER SYSTEM

The Spoken Language Shell has been used to implement the Office Manager, a system that provides voice access to several common office applications.

The Office Manager (OM) domain is interesting for the following reasons: It provides a range of interaction requirements, from tight-loop (e.g., calculation) to open-ended (e.g., database retrieval); it supports meaningful problem-solving activity (e.g., scheduling, information search); and since the applications are directly usable in real-life settings, the activity of actual users can be studied over extended periods.

OM includes a personal database, an appointment calendar, a mailer interface and a calculator. OM itself understands a 36 word vocabulary and provides control functions, such as starting up and switching between applications, help invocation, etc.

The current implementation of the system includes a database of names and addresses of 172 conference participants. It is used by the Voice Mail and Personal Information Database components of OM. Database customizing tools are provided, allowing the non-expert user to create or modify entries in existing databases and to create new databases. Changes to a database result in automatic updates to the recognition knowledge base, allowing users who lack a speech background to easily extend the system.

All components of the system, with the exception of the Recognition Engine, are implemented on a NeXT computer, using Objective-C and the Nextstep interface. The RE server is implemented in C as a parallel program on a 4-processor Apollo 10040. Other platforms (DEC DS5000 and IBM R6000) are currently in use as REs. Word recognition accuracy for a typical database query, of perplexity about 105, is 92%.

CONCLUSION

Our future work includes the development of techniques for structuring recognition and parsing knowledge bases along "object" lines, to permit individual applications to inherit language characteristics from their environment (the OM), and to encourage modularization and reusability of language components. The goal is to simplify the process of creating languages for particular applications by providing the developer not only with standard interface components but also with standard language components.

Meaningful study of spoken language interaction requires a system that will be used on a daily basis and whose utility will persist past the initial stages of play and exploration. We believe that the Office Manager is such a system. Systems that do not have this persistence of utility will ultimately have little to tell us about spoken communication with computers.
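As a rough illustration of the Task Manager role described in the TASK MANAGER section, the sketch below assigns a single speech channel to one of several registered applications and reports which recognition knowledge base the RE should use for it. Everything here is hypothetical: the `TaskManager` class, its method names, and the knowledge-base file names are invented for illustration and are not taken from the CM-SLS code, which the paper says is written in Objective-C.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class TaskManager:
    """Hypothetical sketch: one speech channel shared by several applications."""
    handlers: Dict[str, Callable[[str], str]] = field(default_factory=dict)
    knowledge_bases: Dict[str, str] = field(default_factory=dict)
    active: Optional[str] = None  # application currently holding the channel

    def register(self, name: str, handler: Callable[[str], str],
                 knowledge_base: str) -> None:
        """Make an application voice-addressable."""
        self.handlers[name] = handler
        self.knowledge_bases[name] = knowledge_base

    def switch_to(self, name: str) -> str:
        """Assign the speech channel to `name`.

        Returns the knowledge base the recognizer should load, standing in
        for the context information the TM communicates to the RE.
        """
        if name not in self.handlers:
            raise KeyError(f"unknown application: {name}")
        self.active = name
        return self.knowledge_bases[name]

    def route(self, decoded_text: str) -> str:
        """Deliver a decoded utterance to the application holding the channel."""
        if self.active is None:
            raise RuntimeError("no application holds the speech channel")
        return self.handlers[self.active](decoded_text)


# Usage with two made-up applications.
tm = TaskManager()
tm.register("calculator", lambda text: "calculator got: " + text, "calculator.kb")
tm.register("calendar", lambda text: "calendar got: " + text, "calendar.kb")
kb_for_recognizer = tm.switch_to("calculator")
reply = tm.route("two plus two")
```

Keeping the channel assignment in one place is what lets recognition resources be centralized and allocated dynamically, rather than each application embedding its own recognizer.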
