Elements of a Spoken Language Programming Interface for Robots

Tim Miller, Andy Exley, William Schuler
Department of Computer Science and Engineering
University of Minnesota - Twin Cities
Minneapolis, MN

ABSTRACT
In many settings, such as home care or mobile environments, demands on users' attention, users' anticipated level of formal training, or other on-site conditions will make standard keyboard- and monitor-based robot programming interfaces impractical. In such cases, a spoken language interface may be preferable. However, the open-ended task of programming a machine is very different from the sort of closed-vocabulary, data-rich applications (e.g. call routing) for which most speaker-independent spoken language interfaces are designed. This paper describes some of the challenges of designing a spoken language programming interface for robots, and presents an approach that uses semantic-level resources as extensively as possible in order to address these challenges.

Categories and Subject Descriptors I.2.7 [Artificial Intelligence]: Natural Language Processing—Language Modeling

General Terms Algorithms

Keywords Human-robot interaction, language modeling, natural language processing, spoken language interfaces

1. INTRODUCTION

As robots become more capable and more pervasive, the need for end users to edit or refine robot programming on site (as opposed to in the lab) will only increase. In many settings, such as home care or mobile environments, demands on users' attention, users' anticipated level of formal training, or other on-site conditions will make standard keyboard- and monitor-based programming sessions impractical. In such cases, a spoken language interface may be preferable. However, the open-ended task of programming a machine is very different from the sort of closed-vocabulary, data-rich applications (e.g. call routing) for which most speaker-independent spoken language interfaces are designed. Because the intended use of a programming interface is to define new behaviors, it will not be able to rely on task-specific training data such as word co-occurrence statistics to guide its search for hypothesized directives, as conventional spoken language interfaces do. In very novel programming applications, an interface may have no information about word use other than the definitions of the words in its lexicon, many of which may have been newly defined by the user.

This paper will describe some of the challenges of designing a spoken language programming interface for robots, and will describe an implemented system that uses these semantic-level resources as extensively as possible in order to address these challenges. The implemented system runs on the Sony AIBO four-legged robot, and allows a user to program complex coordinated motor movements using spoken language instructions which combine and relate simpler motions temporally.

1.1 Related Work
Similar systems have been developed which employ semantic constraints to filter recognition hypotheses [2, 4, 7, 10, 12]. Rayner et al. [10] describe a system directed at scripting or programming semi-autonomous agents, in particular a robotic agent aboard a space station, for tasks such as taking sensor measurements and opening and closing doors. Details in the scripts can be filled in by the agent using information from its environment. A similar approach was employed by Bugmann et al. [4] for programming a mobile robotic agent to follow traffic directions. Unlike the approach described in this paper, neither of these systems allows properties of the environment to influence low-level phonological and syntactic decisions made during recognition.

Other systems do allow environment information to influence low-level recognition. Systems such as those proposed by [2, 7] make use of heuristics in an agent-based architecture, which modify lower levels of analysis (word recognition and syntactic parsing) in cases where the lower-level analysis cannot be reconciled with a coherent higher-level semantic interpretation. However, unlike an integrated probabilistic approach, these heuristics do not admit straightforward training from corpus examples. Finally, Roy and Mukherjee [12] do propose an integrated probability model which distinguishes domain-independent syntax from domain-dependent semantics. By distinguishing syntactic and semantic portions of the model, this approach allows an interface developer to specify an environment-independent syntax which can then be constrained, by the semantics of the environment, to generate only coherent interpretations. However, this model is not sufficiently rich to capture a wide variety of syntactic constructs. The approach described in this paper attempts to take the best features of all of these approaches, using a syntactically rich, fully statistical model of syntax and semantics to allow environment information to affect low-level processing.

2. PROGRAMMING SEMANTICS

This paper will assume a relatively simple programming semantics, in order to establish a minimum level of complexity for the directive language that must be recognized, and to expose the problems with constraining this language to make recognition reliable. It is assumed that more complex programming constructs such as iteration and recursion will extend beyond the scope of the single utterances at which the recognizer operates (for example, repeated actions can be communicated using discourse-level organization commands, say by defining an action in one sentence, then directing the robot to 'repeat that five times').

In this minimal language, the semantics of the commands include reference to entities like physical objects, the most tangible of entities that one can talk about. In addition, there is the more abstract concept of event entities, which represent the idea that a command refers to an intangible thing: the action which the user desires the robot to perform. To accommodate these event entities, there must be a discourse model that explicitly keeps track of the events, since users will need to be able to refer to previously described events.

In addition to entities, the semantics of this language include relations over events. Relations such as BEFORE and AFTER apply to pairs of event entities (e.g. BEFORE(ev1, ev2) means that event ev1 begins before event ev2). Other relations may apply to single entities representing physical objects in the domain. Some example relations used in this domain can be seen in Table 1.

Finally, truth values must be incorporated into the semantics as another type of entity. This allows for things like negation, so that the relation NOT is true for the entities 0 and 1, but not for 1 and 1 (it may be easier to think of NOT as a function on truth values). For example, in the sentence "Pick up the box that is not blue", the word "blue" will have a truth value of 0 for every non-blue box, and the word "not", which is associated with the relation NOT, will take that 0 and output the truth value 1 for the entire sentence for the non-blue boxes. Truth functions over entities can also be made stochastic, in order to model uncertainty about propositions derived from noisy sensors. In a stochastic logic, or stochastic lambda calculus, truth values over propositions like NOT(t1) would be defined probabilistically: e.g. P(NOT(t1) = 1 | t1 = 0) = 1.0.

Another semantic construct required for a sequence programming application is a system for keeping track of temporal relations. Interval Temporal Logic [3] provides a way of reasoning about relations over time periods. In this system, the relations AFTER and BEFORE correspond neatly to the predicates After and Before in interval temporal logic, and COINITIAL corresponds roughly to the predicate Starts, although without the implication that one argument causes the other. These relations allow the system to put together a program once the user has completed instruction: a program consists of the collection of individual directives given by the user, organized into a temporal sequence based on the reasoning provided by the interval temporal logic.
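To make this bookkeeping concrete, the sketch below shows one way the entities, relations, and stochastic truth values just described could be represented. It is only an illustration of the data structures implied by the text, not the authors' implementation; the class and method names (DiscourseModel, add_relation, and so on) are hypothetical.

    # Minimal sketch of the discourse-model bookkeeping described above.
    # All names are hypothetical; entity labels follow the paper's examples.
    class DiscourseModel:
        def __init__(self):
            self.entities = []      # physical objects and event entities, e.g. 'rbk', 'ev0'
            self.relations = []     # tuples such as ('BEFORE', 'ev1', 'ev2')
            self.truth = {}         # P(relation holds), for stochastic truth values

        def add_entity(self, name):
            self.entities.append(name)

        def add_relation(self, rel, *args, p_true=1.0):
            # p_true = 1.0 gives an ordinary boolean predicate; values in (0, 1)
            # model uncertainty about propositions derived from noisy sensors.
            self.relations.append((rel,) + args)
            self.truth[(rel,) + args] = p_true

    # BEFORE(ev1, ev2): event ev1 begins before event ev2.
    dm = DiscourseModel()
    dm.add_entity('ev1'); dm.add_entity('ev2')
    dm.add_relation('BEFORE', 'ev1', 'ev2')        # discrete, certain
    dm.add_relation('BLUE', 'box3', p_true=0.8)    # stochastic, from a noisy percept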

3. EXAMPLE PROGRAM
To illustrate the power of this architecture, we will first give a general overview of an example command sequence a user could use to program a mobile robot. In our case the robot platform is the Sony AIBO, a four-legged robot meant to resemble a dog. The AIBO is interesting for this application because of the variety of possible movements which users might desire. Each leg has three actuators: one at the shoulder that moves the entire leg back and forth, one at the shoulder that raises and lowers the entire leg, and a third joint at the knee. The following list of commands shows how a user might program the AIBO to perform the beginning of a simple dance.

1. Bend your back right knee
2. While doing that, bend your back left knee
3. While doing that, move your back right thigh backwards
4. While doing that, move your back left thigh backwards
5. After that, move your right front leg outwards
6. While doing that, move your left front leg outwards

This movement occurs in two phases: first, the dog bends its knees to form a support base while it rotates its legs backwards to lift its body; then it moves its "arms" out to the sides in preparation for a fist pump.

After the first utterance, the system creates an entity representing the event of bending its right hind leg. During the second utterance, the system knows about all the original entities, but is also aware that the anaphoric expression "that" refers to an event entity in the previous sentence, in this case the event of bending the right back knee. After recognizing the second utterance correctly, the system has two event entities, one for each of the commands, and has also added the relation COINITIAL over those two event entities. Similarly, the next two utterances create new event entities for their directives, and add the COINITIAL relation with respect to their preceding event entities. The fifth command starts a new phase in the overall motion, adding a new event entity and the relation AFTER, whose two arguments are the new entity and the previous event entity. At any point in this process, if the recognizer misunderstands a command, the user can give the command "Not that" to eliminate the previous entity from the discourse model.

The program described by this sequence of spoken commands may seem simple, and indeed it is. However, even a simple command sequence like this is difficult to program using standard AIBO programming techniques. This example use of the system demonstrates the potential for a spoken language programming interface in human-robot interaction.
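As a rough illustration of the discourse model that results, the six utterances above might leave behind a collection of event entities and relations along the following lines. The entity labels and the MOVEOUTWARD relation are hypothetical; BEND, MOVEBACKWARD, COINITIAL, and AFTER come from the grammar and Table 1.

    # Hypothetical contents of the discourse model after the six commands.
    # Each tuple is (relation, event entity, object entity or second event).
    program = [
        ('BEND',         'ev0', 'back_right_knee'),   # 1.
        ('BEND',         'ev1', 'back_left_knee'),    # 2.
        ('COINITIAL',    'ev1', 'ev0'),               #    "while doing that"
        ('MOVEBACKWARD', 'ev2', 'back_right_thigh'),  # 3.
        ('COINITIAL',    'ev2', 'ev1'),
        ('MOVEBACKWARD', 'ev3', 'back_left_thigh'),   # 4.
        ('COINITIAL',    'ev3', 'ev2'),
        ('MOVEOUTWARD',  'ev4', 'front_right_leg'),   # 5. (hypothetical relation)
        ('AFTER',        'ev4', 'ev3'),               #    "after that"
        ('MOVEOUTWARD',  'ev5', 'front_left_leg'),    # 6.
        ('COINITIAL',    'ev5', 'ev4'),
    ]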

4. CHALLENGES
Even without the more complex programming constructs, reliably recognizing these kinds of directives in a robot programming interface poses serious challenges for conventional speaker-independent spoken language processing systems such as those used in call routing applications. These systems typically have a closed vocabulary, and are often syntactically constrained to allow only a small subset of this vocabulary to be recognized at any point in an utterance. Within this constrained grammar, most existing systems additionally have access to word co-occurrence statistics derived from large corpora of sample utterances from the same application domain as the utterances being recognized.

Relation      Arguments  Intuitive Meaning
INTERCOREF    e1         The argument could be referred to in the next sentence using coreference.
EVENT         ev1        The entity argument is of type event.
FRONT         e1         The entity argument is a front limb.
BEND          ev1, e1    The arguments are the event this creates and the knee entity to bend.
AFTER         ev1, ev2   Event entity ev1 occurs after event entity ev2.
COINITIAL     ev1, ev2   Event entities ev1 and ev2 begin at the same time.
MOVEFORWARD   ev1, e1    The arguments are the event created and the limb to move forward.
UNDO          ev1        The entity argument should be removed from the program.

Table 1: Sample relations and their intuitive meanings

4.1 Permissive Grammar
In contrast, users of a spoken language programming interface will expect to be able to combine arbitrary statements and functions, and even create new functions, just as users of conventional programming languages do. Moreover, since a programming interface is a tool for defining new behaviors, it will not be able to rely on task-specific training data such as word co-occurrence statistics to guide its search for hypothesized directives. It may, however, be possible to obtain, or learn, statistical models over the truth or falsity of abstract semantic predicates as observed in the interfaced application, given the semantic types of the predicates' arguments (e.g. whether the object of the predicate is a physical object like an arm or leg, or an event). These co-occurrence statistics over semantic types can be used to filter recognition hypotheses in the absence of co-occurrence statistics over individual words.
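One way to read this is as a back-off from word co-occurrence statistics to co-occurrence statistics over semantic types. A minimal sketch of such a filter is given below; the predicate names, type labels, and probabilities are hypothetical.

    # Hypothetical type-level model: P(predicate holds | types of its arguments).
    # The paper proposes obtaining or learning such statistics from the
    # interfaced application; the numbers here are illustrative only.
    P_TRUE_GIVEN_TYPES = {
        ('BEND', ('event', 'limb')):  0.9,   # bending a limb is a plausible directive
        ('BEND', ('event', 'event')): 0.0,   # bending an event is not
    }

    def semantic_score(predicate, arg_types):
        # Score a recognition hypothesis by the plausibility of its semantics,
        # with a small default for unseen predicate/type combinations.
        return P_TRUE_GIVEN_TYPES.get((predicate, tuple(arg_types)), 0.1)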

4.2 Open Vocabulary
Any programming interface should allow users to define new functions. In very novel programming applications, an interface may have no information about word use other than the definitions of the words in its lexicon, many of which may have been newly defined by the user. In such cases, the interface will have to rely on semantic cues alone to filter out hypothesized directives that do not make sense in the current discourse context, in addition to those that are ungrammatical. This means the interface will have to incrementally interpret (or understand) each utterance while it is still being recognized, and must do so very efficiently in order to satisfy the constraints of real-time or near-real-time (e.g. push-to-talk) interaction.

4.3 Vague References
Users will routinely need to refer to previously mentioned objects or previously programmed actions in order to coordinate simultaneous or successive actions in complex behaviors. In field environments where no monitor or text display is available, the user may have to rely on imperfect memory of these entities when describing them, or may simply wish to offer minimal descriptions and let context sort out the correct referent, and thus may unwittingly introduce vagueness or reference ambiguity into his or her directives. Examples of vague references are pronouns like 'it' and 'that', and incompletely specified noun phrases like 'when you move your leg' instead of '. . . your right hind leg'. These vague references must be correctly resolved in order to ensure the appropriate statistics are used in recognition.

Moreover, keeping track of these entities requires an explicit discourse model, containing instances of object or event entities, and predicates over these entities, in addition to the statistics over semantic types described earlier. This discourse model may contain ordinary, discrete boolean predicates, or, when the truth of predicates is based on uncertain percepts, continuous probability distributions over the truth or falsity of predicates given features of the entities serving as the predicates' arguments. For newly defined expressions, these probabilities can be computed as products of the other terms used in a user-supplied definition, or, in cases where an inductive definition is preferable, they can be calculated as a linear function on weights for user-specified features, learned from small sets of examples.
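For intuition, a predicate probability computed from weighted, user-specified features might look like the sketch below. The feature names and weights are hypothetical, and the logistic squashing used to keep the value in [0, 1] is an assumption; the paper only states that a linear function on learned weights is used.

    import math

    def p_predicate_true(features, weights, bias=0.0):
        # P(newly defined predicate holds | entity features), from a weighted
        # sum of user-specified features learned from a few examples.
        score = bias + sum(weights[f] * v for f, v in features.items())
        return 1.0 / (1.0 + math.exp(-score))   # squashing choice is ours, not the paper's

    # Hypothetical user-defined predicate DOWN(leg), with noisy percept features.
    weights = {'knee_angle': -0.05, 'height_above_ground': -2.0}
    print(p_predicate_true({'knee_angle': 10.0, 'height_above_ground': 0.02},
                           weights, bias=1.0))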

4.4 Intra-Utterance References
References will usually denote entities contained in the discourse model, which is updated after every utterance with the entities introduced in that utterance. But some utterances will contain references to entities introduced earlier in the same utterance, which are therefore not yet contained in the discourse model. For example, the directive 'move your left leg backward until it is completely down' is much more natural than a directive that does not use the intra-sentential anaphoric pronoun 'it' to refer to the leg: 'move your left leg backward until your left leg is completely down'. If there is a competing hypothesis ending '. . . until it is completely done' (which might be preferred if 'it' refers to an event rather than an object), it will be necessary to access several hypothesized antecedents in order to choose the correct ending. This means that hypotheses about specific entities must be considered in the (e.g. Viterbi) recognizer search.

5. MODEL

5.1 System
In addition to the model of syntax and semantics, which is fixed at run time, this system uses an explicit, dynamic discourse model to keep track of the events and entities denoted by the user's previous utterances. After every utterance, the discourse model is updated to reflect changes in the events the user may refer to. With the combined models, the system is able to make hypotheses about referred-to entities at each frame in the recognition process, based on the relations present in the most recent update of the discourse model.

5.2 Grammar
This system uses a statistical model derived from a semantic grammar. The semantic grammar consists of a set of context-free grammar (CFG) rules which specify the way in which truth values and entities map to arguments in a semantic relation, as exemplified below. Whenever the nonterminal symbol on the left-hand side of one rule is identified with a nonterminal symbol on the right-hand side of another rule during the derivation of a phrase structure tree, the variables associated with these nonterminal symbols are understood to be unified.

Previous versions of this system were trained by collecting a corpus of example sentences from users, and manually annotating these utterances as trees with the appropriate syntax and semantics. In contrast, the current system builds a model from an explicitly specified grammar consisting of syntax and semantics. This allows for a more parsimonious representation of the combinatorial nature of the system's command structures, and does not rely on collecting example sentences from users in every domain in which it is used. In addition, by placing essentially uniform probabilities on the different commands, the system does not make assumptions about command likelihood based on a possibly biased small corpus. Finally, this method allows new command structures to be easily added to the grammar, with the potential for new command structures to be added by the end user by example.

Simp.t.ev → move DP.t.e1 backward : MOVEBACKWARD(t,ev,e1)
Simp.t.ev → move DP.t.e1 forward : MOVEFORWARD(t,ev,e1)
DP.t.e1 → NP.t.e1
DP.t.e1 → it : INTRACOREF(t,e1)
DP.t.e1 → that : INTERCOREF(t,e1)
DP.t.ev → that : INTERCOREFEVENT(t,ev)
NP.t.e1 → D.t.e1 N1.t.e1
N1.t.e1 → A.t.e1 N1.t.e1
N1.t.e1 → leg : LEG(t,e1)
N1.t.e1 → paw : PAW(t,e1)
D.t.e1 → your : YOUR(t,e1)
PP.t.e1 → P.t.e1.e2 DP.t.e2
P.t.e1.e2 → in : IN(t,e1,e2)
P.t.e1.e2 → to : TO(t,e1,e2)
A.t.e1 → front : FRONT(t,e1)
A.t.e1 → hind : HIND(t,e1)
A.t.e1 → left : LEFT(t,e1)
A.t.e1 → right : RIGHT(t,e1)
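The rules above pair a context-free rewrite with a semantic relation over the rule's variables (t for a truth value, e and ev for entities), with shared variables unified between parent and child categories. A minimal machine-readable encoding, hypothetical in its details, could simply list (left-hand side, right-hand side, relation) triples:

    # Hypothetical encoding of a few of the semantic grammar rules above.
    RULES = [
        # (LHS, RHS, semantic relation or None)
        ('Simp.t.ev', ['move', 'DP.t.e1', 'backward'], 'MOVEBACKWARD(t,ev,e1)'),
        ('Simp.t.ev', ['move', 'DP.t.e1', 'forward'],  'MOVEFORWARD(t,ev,e1)'),
        ('DP.t.e1',   ['NP.t.e1'],                     None),
        ('DP.t.e1',   ['it'],                          'INTRACOREF(t,e1)'),
        ('NP.t.e1',   ['D.t.e1', 'N1.t.e1'],           None),
        ('N1.t.e1',   ['leg'],                         'LEG(t,e1)'),
        ('A.t.e1',    ['hind'],                        'HIND(t,e1)'),
    ]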


Figure 2: Graphical representation of the right-corner DBN model, including factored hidden random variables over incomplete constituents in stack elements (S_i^d), compositions or 'reductions' of incomplete constituents in stack elements (R_i^d), and depth-specific final states (F_i^d), plus random variables over words (W_i) and sub-phone states (Q_i). Taken together, each stack forms a complete analysis of the recognized input at every time frame i. Shaded random variables indicate observed evidence: A_i are observed frames of acoustical features, and the boolean F_i^1 are 'true' at the end of the utterance and 'false' otherwise.

5.3 Language Model
The basis for this interface is a structured probabilistic language model described in [13] and refined in [14]. This language model is based on a Dynamic Bayes Net (DBN) [5] formulation which contains random variables that explicitly represent syntactic constituents, as well as semantic notions of truth values and denoted entities. These DBN random variables can be compiled into a single random variable for efficient recognition as a hidden Markov model (HMM). These features enable incremental semantic filtering, in which semantic relations are used to reduce the search space, so that combinations of words that may be syntactically well-formed are not hypothesized if they are not semantically coherent.

In order to use this referential information to guide speech recognition, the parsing and interpretation of this semantic grammar must run incrementally, using a stack, as a nondeterministic pushdown automaton (PDA). To reduce the number of possible stack configurations that must be considered during recognition, the grammar is first converted into a right-corner form (the left-right dual of a left-corner form [1, 8, 11]). This right-corner transform converts all left recursion in a grammar into right recursion over incomplete nonterminal symbols lacking one or more symbols to the right, and thereby restricts stack use in a PDA to cases of center recursion (expansion of nonterminal symbols in the interiors of other rules), in which stack use is unavoidable. The transform preserves the relationship between the entity reference variables described above and the nonterminal symbols with which they are associated.

The right-corner transformed grammar, complete with entity reference variables, can then be mapped to random variables (RVs) in a dynamic Bayes net (DBN), a probabilistic time series model that can recognize a most likely sequence of hidden variable values (in this case, successive stack configurations) associated with a sequence of observations (in this case, ten-millisecond frames of speech). In particular, the approach described here uses a variant of a Hierarchic Hidden Markov Model (HHMM) [9], consisting of several layers of nested time-series models, which has been extended to include explicit random variables over stack reductions between time steps. The random variables in the resulting DBN correspond to uncertain estimates of the predict, scan, and reduce operations of an Earley-style parser [6], operating on the right-corner transformed grammar described in the previous paragraph:

• 'reduce' operations (R^d) model the combination of two constituents into a single constituent at a higher-level stack position;

• 'scan' operations (S^d, when F^{d+1}=1 and F^d=0) follow 'reduce' operations, and model the introduction of a new constituent as a transition from a previous constituent at the same stack level; and

• 'predict' operations (S^d, when F^{d+1}=1 and F^d=1), following 'scan' operations, model the introduction of a new constituent given a constituent in a higher-level stack position.

This DBN parsing model has the interesting property that, as each terminal is encountered, only one scan operation and only one predict operation need be considered. The grammar can then be converted to a normal form, similar to BNF, in which all lexicalized rules (those containing terminal symbols) must begin with these terminal symbols, so that only one lexicalized rule need be considered during the predict operation at each time step. Then, if it is stipulated that only lexicalized rules are associated with semantic functions (this 'lexicalization' is a common assumption in the linguistic semantics literature), it follows that only one semantic function need be considered per time step. This is important because semantic functions depend on the state of the world, and so cannot be precomputed before run time; but since this model ensures that only one semantic function need be evaluated at each time step (and moreover, it is evaluated during the 'predict' operation, which is the final operation at each time step), it follows that the entire rest of the model can be safely composed (multiplied together) into a single joint random variable prior to recognition, leaving only one semantic operation to compute at run time at each time step.
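For intuition, the 'incomplete constituent' categories produced by the right-corner transform can be read like fractions: S/NP is a sentence still lacking an NP to its right, NP/N a noun phrase still lacking its noun. The sketch below is a hypothetical, simplified prefix-by-prefix summary for the start of the Figure 1 sentence; it is meant only to convey what the stack elements range over, not to reproduce the figure's exact derivation.

    # Hypothetical stacks of incomplete constituents (A/B = an A lacking a B to
    # its right) for successive prefixes of 'pick up the red box ...'.
    prefix_stacks = {
        'pick up':             ['S/NP'],          # still awaiting the object NP
        'pick up the':         ['S/NP', 'NP/N'],  # an NP is open, awaiting its noun
        'pick up the red':     ['S/NP', 'NP/N'],  # 'red' extends the incomplete NP
        'pick up the red box': ['S/PP'],          # the NP completes and folds back into
                                                  # an S that still awaits the locative PP
    }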

Figure 1: DBN training process. A phrase structure tree for the sentence 'pick up the red box directly on top of the blue box' (a) is converted into right-corner form (b), which is then mapped to individual stack elements (instantiated random variables) at each time step in the DBN (c). Quote marks indicate labels copied from the previous frame.

The independence assumptions in this model are:

P(S_i^d \mid all) \overset{def}{=} P(S_i^d \mid F_{i-1}^d, F_{i-1}^{d+1}, S_i^{d-1}, R_i^d)    (1)

P(R_i^d \mid all) \overset{def}{=} P(R_i^d \mid F_i^{d+1}, S_{i-1}^d, R_i^{d+1})    (2)

P(F_i^d \mid all) \overset{def}{=} P(F_i^d \mid F_i^{d+1}, S_{i-1}^d, R_i^d)    (3)

The definitions for S_i^d and R_i^d are then further broken down into 'composition' (Θ_V), 'attention' (Θ_E), and 'lexicalization' (Θ_L) components, explained below, in which all f are true or false, all c are syntactic categories (e.g. 'S' or 'NP' or 'S/NP'), all \vec{v} are entity coindexation or re-write patterns described below, all t are truth values ∈ {0, 1}, and all \vec{e} are tuples of entities from the environment that are referred to or denoted by this instance of category c (e.g. a single entity denoted by a noun phrase like 'the box', or a pair of related entities denoted by a preposition like 'in'):

P(S_i^d = \langle c, \vec{e}, t \rangle \mid F_{i-1}^d = f', F_{i-1}^{d+1} = f'', S_i^{d-1} = \langle c', \vec{e}', t' \rangle, R_i^d = \langle c'', \vec{e}'', t'' \rangle)

    \overset{def}{=} P(\vec{v} \mid f', f'', c', \vec{e}', t', c'', \vec{e}'', t'') \cdot P(\vec{e}, t \mid \vec{v}, f', f'', c', \vec{e}', t', c'', \vec{e}'', t'') \cdot P(c \mid \vec{e}, t, \vec{v}, f', f'', c', \vec{e}', t', c'', \vec{e}'', t'')    (4)

    = P_{Θ_{V,d,f',f''}}(\vec{v} \mid c', c'') \cdot P_{Θ_E}(\vec{e}, t \mid \vec{v}, \vec{e}', t', \vec{e}'', t'') \cdot P_{Θ_{L,d,f',f''}}(c \mid \vec{v}, \vec{e}, t, c', c'')    (5)

(Technically, \vec{v} in Equations 4 and 5 is a nuisance variable, since it does not occur as part of S_i^d, and should therefore be marginalized out (or maximized over in Viterbi decoding); but in practice there is no way to generate the same c, \vec{e}, t via two different \vec{v} coindexation patterns, so this step can be eliminated.)

The breakdown for R_i^d is essentially identical to that shown above for S_i^d, except that it contains no term f''. Each lexicalization model P_{Θ_L}(c \mid \vec{v}, \vec{e}, t, c', c'') in the above equations is then calculated as the normalized product of the probability P_{Θ_{LC}}(c, u \mid \vec{v}, c', c'') of using category c with semantic predicate function u in the context of categories c' and c'', times the 'lexical semantic' or truth-function probability P_{Θ_{LS}}(t \mid u, \vec{e}) of the denoted entities \vec{e} satisfying predicate u (see Equation 7).
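Concretely, this normalized product could be computed as in the following sketch; the dictionary-based model look-ups and names are hypothetical stand-ins for the trained Θ_LC and Θ_LS models, and only the normalization structure follows Equation 7 and the Z term of Equation 8.

    # Sketch of P_ThetaL(c | v, e, t, c', c'') as a normalized product of the
    # category/predicate model P_LC and the truth-function model P_LS.
    def lexicalization_prob(c, u, v, c1, c2, e, t, P_LC, P_LS, candidates):
        num = P_LC[(c, u, v, c1, c2)] * P_LS[(t, u, e)]
        den = sum(P_LC[(c3, u3, v, c1, c2)] * P_LS[(t, u3, e)]
                  for (c3, u3) in candidates)
        return num / den if den > 0 else 0.0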

5.4 Training
Since the models Θ_V and Θ_LC do not directly depend on entities, they can be extracted from phrase-structure- and reference-annotated training sentences collected in different environments from those used in evaluation. Training instances for P_{Θ_V}(\vec{v} \mid c', c'') and P_{Θ_{LC}}(c, u \mid \vec{v}, c', c'') are then extracted from right-corner transformed and DBN-aligned versions of these sentences, with coindexation patterns \vec{v} determined by the patterns of identical entities in the conditions and conclusions of these training instances. The remaining models Θ_E and Θ_LS are directly based on entities, but can still be abstracted across environments using features of entities (e.g. relative position or size) rather than the particular entities themselves. In our implementation of this model, the truth functions P_{Θ_{LS}}(t \mid u, \vec{e}) were simply specified by hand, and the attention model P_{Θ_E}(\vec{e}, t \mid \vec{v}, \vec{e}', t', \vec{e}'', t'') was taken to be uniform for each 'NEW' entity generated by Θ_V.

Viewed as a generative process, this language model begins with the composition model selecting a coindexation pattern \vec{v} for a new constituent. This coindexation pattern contains an index pointer for each argument position j in \vec{e}, which points to the first entity in \vec{e}, \vec{e}' or \vec{e}'' that exactly matches e_j, or is set to 'NEW' if position j contains the first occurrence of e_j. Once the coindexation pattern has been chosen, each 'NEW' entity is then selected from the environment using the attention model Θ_E. For example, if a sentence constituent were being generated at the top level, a new entity would be chosen from a prior P_{Θ_E}(e \mid NEW) = P_{Θ_E}(e); or if some decomposition of a prepositional phrase were being generated, a 'NEW' landmark entity e might be chosen using P_{Θ_E}(e, e' \mid NEW, e') for use in further description (as the NP complement of the PP), based on its proximity and relation to an existing (coindexed) trajector entity e'. Finally, the lexicalization model selects words or syntactic categories (or multi-word/category combinations such as 'in front of NP'), weighted by how strongly they hold true of the chosen entity or relation tuple \vec{e}.

By generating probabilities for hypotheses in this manner, the model can incrementally recognize right-corner derivations while still preserving explicit representations of intermediate constituents at all levels of analysis: e.g. representing subphone symbols in the DBN's lowest (d = 6) level (these correspond to the onset, middle, and ending sounds of individual phonemes, whose distributions can be obtained from existing acoustical models), partial phonemes in the next (d = 5) level (this level and the one below it are similar to the state and emit variables in a hidden Markov model for subphone composition, which can also be extracted from existing acoustical models), partial words in the following (d = 4) level, and partial phrases and denotations at subsequent (d ≤ 3) levels, until eventually the denotation of a complete sentence can be recognized in the top level, at the end of the utterance.
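Viewed procedurally, this generative story might be sketched as below. All names and numbers are hypothetical; the uniform choice of 'NEW' entities mirrors the implementation's attention model, and the deterministic word choice stands in for the truth-weighted lexicalization step.

    import random

    # Toy sketch of the generative process: choose a coindexation pattern, fill
    # a 'NEW' argument position from the environment (uniformly, as in the
    # paper's implementation), then choose the word that holds most strongly
    # of the chosen entity. Illustrative only.
    def generate_constituent(env_entities, parent_entities, lexicon):
        v = random.choice(['NEW', 0])     # coindexation: new entity, or reuse parent's first
        e = random.choice(env_entities) if v == 'NEW' else parent_entities[0]
        word = max(lexicon, key=lambda w: lexicon[w](e))
        return v, e, word

    lexicon = {'leg': lambda e: 1.0 if 'leg' in e else 0.0,
               'paw': lambda e: 1.0 if 'paw' in e else 0.0}
    print(generate_constituent(['back_right_leg', 'front_left_paw'],
                               ['back_right_leg'], lexicon))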

5.5 Compiling environment-independent models to HMM
The model is then compiled into an HMM by grouping together the S and R random variables across all d stack depths into a single Θ_{Q*} distribution over stack configurations (the first bracketed term in Equation 8 below), composed of the Θ_V and Θ_{LC} models:

p_i = \prod_d P_{Θ_V}(\vec{v}_i^d \mid c_i'^d, c_i''^d) \cdot P_{Θ_E}(\vec{e}_i^d, t_i^d \mid \vec{v}_i^d, \vec{e}_i'^d, \vec{e}_i''^d) \cdot P_{Θ_L}(c_i^d \mid \vec{e}_i^d, t_i^d, \vec{v}_i^d, c_i'^d, c_i''^d)    (6)

    \overset{def}{=} \prod_d P_{Θ_V}(\vec{v}_i^d \mid c_i'^d, c_i''^d) \cdot P_{Θ_E}(\vec{e}_i^d, t_i^d \mid \vec{v}_i^d, \vec{e}_i'^d, \vec{e}_i''^d) \cdot P_{Θ_{LC}}(c_i^d, u_i^d \mid \vec{v}_i^d, c_i'^d, c_i''^d) \cdot P_{Θ_{LS}}(t_i^d \mid u_i^d, \vec{e}_i^d) \cdot Z    (7)

    = \Big[ \prod_d P_{Θ_V}(\vec{v}_i^d \mid c_i'^d, c_i''^d) \cdot P_{Θ_{LC}}(c_i^d, u_i^d \mid \vec{v}_i^d, c_i'^d, c_i''^d) \Big] \cdot \Big[ \prod_d P_{Θ_E}(\vec{e}_i^d, t_i^d \mid \vec{v}_i^d, \vec{e}_i'^d, \vec{e}_i''^d) \Big] \cdot \Big[ \prod_d P_{Θ_{LS}}(t_i^d \mid u_i^d, \vec{e}_i^d) \Big] \cdot Z    (8)

where Z = 1 / ( \sum_{c,u} P_{Θ_{LC}}(c, u \mid \vec{v}_i^d, c_i'^d, c_i''^d) \cdot P_{Θ_{LS}}(t_i^d \mid u, \vec{e}_i^d) ).
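The compilation step can be pictured as enumerating joint stack configurations and multiplying the per-depth, environment-independent factors into a single transition table before recognition begins. The sketch below is schematic: it uses a hypothetical two-depth stack, toy categories, and a flat factor table, and it ignores the cross-depth dependencies of the full model.

    import itertools

    CATEGORIES = ['S/NP', 'NP/N', '-']      # '-' marks an empty stack slot (toy set)
    DEPTHS = 2

    def joint_states():
        # Every joint stack configuration becomes one HMM state.
        return list(itertools.product(CATEGORIES, repeat=DEPTHS))

    def factored_transition(prev_cfg, next_cfg, theta):
        # P(next config | prev config) as a product over depths, mirroring the
        # product over d in Equations 6-8 (cross-depth terms omitted here).
        p = 1.0
        for d in range(DEPTHS):
            p *= theta.get((prev_cfg[d], next_cfg[d]), 1e-6)
        return p

    # Precompute the joint table once, before recognition, for the
    # environment-independent part of the model.
    theta = {('S/NP', 'S/NP'): 0.7, ('S/NP', 'NP/N'): 0.2, ('NP/N', '-'): 0.5}
    TRANSITIONS = {(prev, nxt): factored_transition(prev, nxt, theta)
                   for prev in joint_states() for nxt in joint_states()}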

5.6 Example Details
To make the recognition process more concrete, we will step through the recognition of the first utterance, "Bend your back right knee". First, as the word "bend" is input, the system maintains several hypotheses, including the word "bend" with all its combinations of relations (just BEND in this case) and entities ("rbk" for right back knee, etc.). In addition, a new event entity is generated, since part of the meaning of this command is the creation of a new event. The word "your" does not currently have a meaningful interpretation, as the robot assumes every utterance is directed at it. When the word "back" is recognized, the active hypotheses that contain entities corresponding to the front limbs are eliminated. Similarly, as the words "right" and "knee" are recognized, the hypotheses with entities representing left limbs and non-knee limbs are eliminated.

At the end of the utterance, the most likely sequence is obtained through the Viterbi algorithm and postprocessed to extract relations and their associated entities, to be added to the discourse model. In this example the relation BEND is extracted, with the associated entities "rbk" and "nev" (for new event). This relation is then added to the discourse model with a truth value of 1 having probability 1.0, and the new event entity is given a new semi-permanent label (e.g. ev0). In addition, the two relations INTERCOREF and EVENT are added in the postprocessing that is done after recognition. The INTERCOREF relation is added with the event entity argument ev0, since the event entity can be referred to in the next sentence without explicitly describing the same event. The EVENT relation simply denotes that the entity argument is of event type. Finally, for new event entities, the temporally useful relations BEFORE and AFTER are added to the discourse model, so that subsequent directives can make use of the previous temporal frame of reference.
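The hypothesis pruning in this example can be pictured as follows; the entity inventory and feature sets are hypothetical, and the code only illustrates the elimination of semantically incompatible referents as each word is recognized.

    # Illustrative trace of incremental semantic filtering for
    # "bend your back right knee": each recognized word eliminates
    # candidate referents that would make the interpretation false.
    LIMBS = {'rbk': {'back', 'right', 'knee'},
             'lbk': {'back', 'left', 'knee'},
             'rfk': {'front', 'right', 'knee'}}

    hypotheses = set(LIMBS)                   # candidate objects of the BEND relation
    for word in ['back', 'right', 'knee']:
        hypotheses = {e for e in hypotheses if word in LIMBS[e]}
    print(hypotheses)                         # {'rbk'}: only the right back knee survives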

6. CONCLUSION AND FUTURE WORK
This paper has introduced an interface that can be used for spoken language programming of robotic systems by non-expert users. By taking advantage of semantic information as much as possible, recognition can succeed in novel contexts despite the lack of relevant word co-occurrence statistics. The system described in this paper demonstrates that even with a small vocabulary and limited command set, a spoken language programming interface has the potential to be a powerful tool.

However, there are some obvious improvements that would make future versions even more powerful. First, there are some linguistic complexities whose addition would make spoken language programming more efficient. Quantification would simplify the programming of coordinated movements, as in "Bend both of your back knees". In addition, it should be possible to teach the interface new ways to program the robot. For example, a user may prefer to personify the AIBO by referring to its front legs as arms, and the interface could accommodate this by creating a new relation, ARM, which gives a truth value of 1 for all entities that have truth values of 1 for the relations FRONT and LEG.

There are two other improvements that would probably be necessary for creating precise movements. The first is the ability to specify the amount of a movement in units of degrees. The second is a real-time feedback mode that allows the user to tweak a sub-move before adding it to the discourse model. This would allow the user to experiment with a movement without having to watch the whole move every time a small change is desired.

7. ACKNOWLEDGEMENTS
This research was supported by National Science Foundation CAREER award 0447685, and by grants from the University of Minnesota Grant-In-Aid and Digital Technology Center Initiative Programs. The views expressed are not necessarily endorsed by the sponsors.

8. REFERENCES
[1] A. V. Aho and J. D. Ullman. The Theory of Parsing, Translation and Compiling; Volume I: Parsing. Prentice-Hall, Englewood Cliffs, New Jersey, 1972.
[2] G. Aist, J. Allen, E. Campana, L. Galescu, C. A. G. Gallo, S. C. Stoness, M. Swift, and M. Tanenhaus. Software architectures for incremental understanding of human speech. In Proceedings of Interspeech/ICSLP, Pittsburgh, PA, 2006.
[3] J. Allen and G. Ferguson. Actions and events in interval temporal logic. Journal of Logic and Computation, 4, 1994.
[4] G. Bugmann, E. Klein, S. Lauria, and T. Kyriacou. Corpus-based robotics: A route instruction example, 2004.
[5] T. Dean and K. Kanazawa. A model for reasoning about persistence and causation. Computational Intelligence, 5(3):142–150, 1989.
[6] J. Earley. An efficient context-free parsing algorithm. CACM, 13(2):94–102, 1970.
[7] P. Gorniak and D. Roy. Grounded semantic composition for visual scenes. Journal of Artificial Intelligence Research, 21:429–470, 2004.
[8] M. Johnson. Finite state approximation of constraint-based grammars using left-corner grammar transforms. In Proceedings of COLING/ACL, pages 619–623, 1998.
[9] K. P. Murphy and M. A. Paskin. Linear time inference in hierarchical HMMs. In Proceedings of Neural Information Processing Systems, pages 833–840, 2001.
[10] M. Rayner, B. A. Hockey, and F. James. A compact architecture for dialogue management based on scripts and meta-outputs. In ANLP/NAACL Workshop on Conversational Systems, pages 54–60, Somerset, New Jersey, 2000. Association for Computational Linguistics.
[11] S. J. Rosenkrantz and P. M. Lewis, II. Deterministic left corner parser. In IEEE Conference Record of the 11th Annual Symposium on Switching and Automata, pages 139–152, 1970.
[12] D. Roy and N. Mukherjee. Towards situated speech understanding: Visual context priming of language models. Computer Speech and Language, 19(2):227–248, 2005.
[13] W. Schuler and T. Miller. Integrating denotational meaning into a DBN language model. In Proceedings of Eurospeech/Interspeech, Lisbon, Portugal, 2005.
[14] W. Schuler, T. Miller, S. Wu, and A. Exley. Dynamic evidence models in a DBN phone recognizer. In Proceedings of Interspeech/ICSLP, Pittsburgh, PA, 2006.