Input Recognition in Voice Control Interfaces to Three-Tier Autonomous Agents

Vladimir Kulyukin
Computer Science Department
Utah State University
Logan, UT 84322

Adam Steele
School of Computer Science
DePaul University
Chicago, IL 60604-2301
Abstract
Voice control interfaces are based on the assumption that the difficult problem of understanding spoken utterances can be sufficiently constrained if reduced to the problem of mapping voice inputs to knowledge and control structures. Thus, a spoken input is recognized if and only if it references an appropriate structure. Context-free command and control grammars are used in speech recognition to constrain voice inputs and improve recognition accuracy. We show how such grammars can be automatically constructed from the knowledge structures used in three-tier autonomous agents.

1. Introduction

The purpose of voice control interfaces (VCI's) is to map human voice inputs to the appropriate knowledge and control structures in autonomous robots or other computational devices. VCI's combine three aspects of natural language processing (NLP): speech recognition, syntax, and semantics, each of which uses different formalisms. Speech recognition has traditionally relied on Hidden Markov Models (HMM's), while syntax and semantics have made heavy use of context-free grammars (CFG's), semantic networks, and first-order predicate calculus [6]. A direct consequence of this formalism divide is that intelligent VCI's to three-tier (3T) autonomous agents operate in two completely separate phases: speech recognition and symbol interpretation. During speech recognition, voice inputs are mapped to symbols; during symbol interpretation, symbols obtained from voice inputs are used to identify appropriate knowledge structures [5, 4]. However, symbol interpretation can be eliminated altogether if the recognition of knowledge structures occurs as a natural by-product of speech recognition. We show that this is possible due to the partial input recognition equivalence between CFG's and Direct Memory Access Parsing semantic networks (DMAP-Nets) [12, 10], which are knowledge structures used in many 3T agents [4, 7].

3T agents are viewed as consisting of three tiers of functionality: deliberation, execution, and sensory-motor skills. The deliberation tier plans and solves problems; the execution tier translates goals into task networks and executes them; the sensory-motor skills interact with the world. The execution tier of the 3T architecture is implemented using Reactive Action Packages (RAPs) [3]. 3T architectures have been fielded on diverse robotic platforms to solve a variety of problems [1, 9, 11, 8, 7].

In this paper, we show how VCI's to 3T agents can benefit from the CFG formalism. Our approach also utilizes recent advances in speech recognition that enhance HMM-based voice input recognition with context-free command and control grammars (CFCG's), i.e., CFG's with action directives. In particular, we argue that the input recognition capacity of CFG's is partially equivalent to the input recognition capacity of DMAP-Nets. We proceed to use this theoretical result to construct the VCI's to two autonomous agents. The first agent is Merlin, a Microsoft software agent that acts as a desktop assistant (see Figure 2). The second agent is a Pioneer 2DX mobile robot assembled from the robotic toolkit from ActivMedia, Inc. (www.activmedia.com) (see Figure 3). The robot patrols an office area looking for soda cans, coffee cups, and crumpled pieces of paper.

Figure 4 shows the hardware components of the Pioneer 2DX robot we used in our experiments. The robot has a three-wheel mobile base with two sonar rings, front and rear. The base has an onboard x86 computer with 32MB of RAM running Windows NT 4.0. The base also has an EVI-D30 camera mounted on it. The camera can pan, tilt, and zoom. It has a horizontal angle of view of 48.8 degrees and a vertical angle of view of 37.6 degrees. The video feed between an offboard client computer and the robotic base runs through a CCTV-900 wireless AV receiver and switcher and a Winnov video capture card (www.winnov.com). The commands from the client computer to the robot base are sent via an InfoWave radio modem manufactured by the InnoMedia Wireless Group (www.innomedia.com). The modem operates in the frequency band of 902-928 MHz with an air data rate of 85 Kbps. The robot has three onboard batteries that are periodically recharged with a PSC-124000 automatic battery charger.
Figure 1: A DMAP-Net.

The paper is organized as follows. In section 2, we investigate the input recognition capacities of DMAP-Nets with respect to context-free languages (CFL's). In section 3, we use the construction inherent in the analysis from section 2 to build voice control interfaces to two autonomous agents. We show how the voice inputs are mapped to the agents' goals that, in turn, enable and disable the agents' behaviors. Section 4 offers implementation details. Section 5 outlines future work. Section 6 offers conclusions.
2 Input Recognition Analysis

A DMAP-Net is a directed graph of nodes whose edges have two types of labels: abstraction and packaging. If two nodes are connected through an edge with an isa label, the node that receives the edge is an abstraction of the node that emits it. For example, in Figure 1, M-COMMAND is an abstraction of M-TURN-COMMAND. If two nodes are connected through an edge with a label other than isa, the receiving node is a frame and the emitting node is a slot filler of the slot whose name is the edge's label. For example, M-TURN-COMMAND is a frame with two slots: angle and direction. M-ANGLE is the filler of the angle slot, while M-DIRECTION is the filler of the direction slot. Each frame name starts with an "M-" prefix to indicate that the node stands for a memory organization package (MOP), a term introduced by Schank [1980] to refer to frames.

Frames are activated through recognition sequences associated with them. In Figure 1, the dashed box connected to M-TURN-COMMAND from below via a dashed arrow contains two recognition sequences, at least one of which must be completed by the input for M-TURN-COMMAND to be activated. Recognition sequences simulate spreading activation [10]: if a spreading activation function is known and is provably deterministic, one can effectively generate all of the recognition sequences necessary to activate a given frame.
DMAP-Nets connect to other modules through callbacks. A callback is an arbitrary piece of code that runs as soon as the frame it is associated with is activated. In Figure 1, the dotted box to the right of M-TURN-COMMAND and connected to it with a dotted arrow denotes a callback that installs an appropriate goal on the RAP executive's agenda and asks the executive to execute it.

Let $D = \langle \Phi, T, I, R, X, E \rangle$ be a DMAP-Net, where $\Phi$ is the set of frames, $T$ is the set of tokens, $I$ is the set of frame ids, $R \subseteq [T \cup I]^{+}$ is the set of r-sequences, $X$ is the set of edge labels, and $E$ is the set of labelled edges, i.e., $E = \{(M_i, M_j, x) \mid M_i \in I, M_j \in I, x \in X\}$. Note that $\Phi$ is defined by $I$, $X$, and $E$. Let $T^{+}$ be the set of t-sequences. Let $T \cap I = \emptyset$ so that there is no confusion between tokens and frame ids. Since $\Phi$ and $I$ are isomorphic, i.e., every frame has a unique id, frames and frame ids are used interchangeably. Let $\Gamma : I \rightarrow 2^{R}$ be a function that associates frames with sets of r-sequences. In the discussion below, it is assumed that t-sequences are non-empty.

A frame can be activated by a t-sequence directly or indirectly. Let $A_d(M, t)$ denote that a frame $M$ is directly activated by a t-sequence $t$ and let $A_i(M, t)$ denote that $M$ is indirectly activated by $t$. Let $A(M, t)$ denote that $M$ is activated by $t$ either directly or indirectly. A frame $M_i$ is directly activated by a t-sequence $t = t_1 t_2 \ldots t_n$, $n \geq 1$, denoted by $A_d(M_i, t)$, iff there exists an r-sequence $r = r_1 r_2 \ldots r_n \in \Gamma(M_i)$ such that $\forall i, 1 \leq i \leq n$, one of the following conditions holds:

1. If $r_i \in T$, then $r_i$ and $t_i$ are identical;
2. If $r_i \in I$, then $A(r_i, t_i)$.

A frame $M_i$ is indirectly activated by a t-sequence $t = t_1 t_2 \ldots t_n$, $n \geq 1$, denoted by $A_i(M_i, t)$, iff there exists $M_j \neq M_i$ such that $A(M_j, t)$ and $(M_j, M_i, isa) \in E$. In other words, a frame is indirectly activated by a token sequence if the frame is an abstraction of another frame activated by that sequence.
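For concreteness, the following minimal sketch implements direct and indirect activation over a fragment of the net in Figure 1. It is illustrative only: the recognition sequences and token sets are assumed from the examples in section 3, and the code is not part of the agent.

RSEQS = {  # frame id -> recognition sequences; elements are tokens or frame ids
    "M-TURN-COMMAND": [["turn", "M-ANGLE", "M-DIRECTION"],
                       ["turn", "M-DIRECTION", "M-ANGLE"]],
    "M-ANGLE": [["M-NUMBER"], ["M-NUMBER", "degrees"]],
    "M-NUMBER": [["ten"], ["twenty"], ["thirty"]],
    "M-DIRECTION": [["left"], ["right"]],
}
ISA = {"M-TURN-COMMAND": ["M-COMMAND"]}  # specialization -> its abstractions
FRAMES = set(RSEQS) | {a for absts in ISA.values() for a in absts}

def matches(rseq, tokens):
    """Check an r-sequence against a t-sequence (a list of tokens)."""
    if not rseq:
        return not tokens
    head, rest = rseq[0], rseq[1:]
    if head in FRAMES:
        # Condition 2: a frame id in the r-sequence must be activated
        # by some non-empty prefix of the remaining tokens.
        return any(activated(head, tokens[:i]) and matches(rest, tokens[i:])
                   for i in range(1, len(tokens) + 1))
    # Condition 1: a token in the r-sequence must match the input token.
    return bool(tokens) and tokens[0] == head and matches(rest, tokens[1:])

def activated(frame, tokens):
    """A(frame, t): direct activation, or indirect via a specialization."""
    if any(matches(r, tokens) for r in RSEQS.get(frame, [])):  # A_d
        return True
    return any(frame in ISA.get(spec, []) and activated(spec, tokens)  # A_i
               for spec in FRAMES)

print(activated("M-COMMAND", "turn left twenty degrees".split()))  # True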
Let $L(D) = \{t \mid t \in T^{+} \wedge \forall M \in I', A(M, t)\}$, where $I' \subseteq I$, be the language of $D$. In other words, a token sequence is in the language of $D$ if it activates a designated subset of frames. Note that the exact definition of $I'$ will vary for different DMAP-Nets. For example, one can define $I'$ to be a singleton and accept only those t-sequences that activate the only frame in the singleton.

Lemma 2.1 Let $D = \langle \Phi, T, I, R, X, E \rangle$ be a DMAP-Net; then there exists a CFG $G$ such that $L(D) = L(G)$.

Proof: Let $G = \langle \Sigma, N, P, S \rangle$ such that $\Sigma = T$, $N = I$, and $P = P_1 \cup P_2 \cup P_3$, where

1. $P_1 = \{M_i \rightarrow M_j \mid (M_j, M_i, isa) \in E\}$;
2. $P_2 = \{M \rightarrow r \mid M \in I \wedge r \in \Gamma(M)\}$;
3. $P_3 = \{S \rightarrow M_1 | M_2 | \ldots | M_n, \ 1 \leq n \leq |I|\}$, where the $M_i$ are the frames in $I'$.
Let $t$ be a t-sequence such that $t \in L(D)$. Let $M \in I'$ be a frame activated by $t$. If $A_d(M, t)$ holds, then there exists an r-sequence $r \in \Gamma(M)$ such that $r$ and $t$ satisfy the two conditions of direct activation. Since, by construction, $M \rightarrow r \in P$, $M$ derives $t$. Since, by construction, $S \rightarrow M \in P$, $S$ derives $t$, i.e., $t \in L(G)$. If $A_i(M, t)$ holds, $t$ activates a frame $N$ such that $M$ is one of its abstractions. Without loss of generality, assume that $A_d(N, t)$ holds; for, if $M$ is indirectly activated, there must be a frame $N$ such that $A_d(N, t)$ holds and $M$ is an abstraction of $N$. If $A_d(N, t)$ holds, there exists an r-sequence $r \in \Gamma(N)$ such that $r$ and $t$ satisfy the two conditions of direct activation. Since, by construction, both $N \rightarrow r$ and $M \rightarrow N$ are in $P$, $M$ derives $t$. Since, by construction, $S \rightarrow M \in P$, $S$ derives $t$, i.e., $t \in L(G)$.

Conversely, let $t \in L(G)$. Then $S$ derives $t$ in one of two ways: either $S \Rightarrow M \Rightarrow r \Rightarrow \ldots \Rightarrow t$, where $M \in I$ and $r \in \Gamma(M)$, or $S \Rightarrow M \Rightarrow N \Rightarrow r \Rightarrow \ldots \Rightarrow t$, where $M, N \in I$ and $r \in \Gamma(N)$. In the former case, since $M$ derives $t$ via $r$, by reading the yield of the derivation tree rooted at $M$, one can find a strictly increasing sequence of indices $1$ through $n$, $1 \leq n$, such that $r = r_1 r_2 \ldots r_n$ and $t = t_1 t_2 \ldots t_n$ and $\forall i, 1 \leq i \leq n$, $t_i$ is identical with $r_i$ or $r_i$ derives $t_i$. Since, by construction, $r \in \Gamma(M)$, $A_d(M, t)$ holds. In the latter case, $S$ derives $t$ via $M$ and $N$, and $N$ derives $t$ via $r$. By reading the yield of the derivation tree rooted at $N$, one can similarly find a strictly increasing sequence of indices that make $t$ and $r$ satisfy the two conditions of direct activation. Since, by construction, $M$ is an abstraction of $N$ and $A_d(N, t)$ holds, $A_i(M, t)$ holds as well. Thus, in either case, $t \in L(D)$. $\Box$

The proof of Lemma 2.1 offers an algorithm for constructing CFCG's from DMAP-Nets. Given a DMAP-Net, the algorithm automatically generates an equivalent CFCG for speech and frame recognition. Specifically, for each frame in the DMAP-Net and for each recognition sequence associated with the frame, a CFCG production is constructed such that the frame name is the production's left-hand side and the recognition sequence is its right-hand side. If the frame has callbacks, each callback becomes an action specification. If two frames are connected via an abstraction edge, the abstraction frame becomes the left-hand side and the specialization frame becomes the right-hand side of a production.
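The construction can be read off the proof directly. The sketch below builds the three production sets from the illustrative tables used earlier; the callback string and the accepting set are assumptions, not the system's actual values.

RSEQS = {"M-TURN-COMMAND": [["turn", "M-ANGLE", "M-DIRECTION"],
                            ["turn", "M-DIRECTION", "M-ANGLE"]],
         "M-ANGLE": [["M-NUMBER"], ["M-NUMBER", "degrees"]],
         "M-NUMBER": [["ten"], ["twenty"]],
         "M-DIRECTION": [["left"], ["right"]]}
ISA = {"M-TURN-COMMAND": ["M-COMMAND"]}
CALLBACKS = {"M-TURN-COMMAND": "execute-goal(turn, M-ANGLE, M-DIRECTION)"}
TARGETS = ["M-COMMAND"]  # the accepting frame set I'

def dmap_to_cfcg(rseqs, isa, callbacks, targets):
    prods = []
    for frame, rs in rseqs.items():       # P2: frame -> recognition sequence,
        for r in rs:                      # with the callback as an action spec
            prods.append((frame, r, callbacks.get(frame)))
    for spec, absts in isa.items():       # P1: abstraction -> specialization
        for abst in absts:
            prods.append((abst, [spec], None))
    for frame in targets:                 # P3: start symbol -> frames in I'
        prods.append(("S", [frame], None))
    return prods

for lhs, rhs, action in dmap_to_cfcg(RSEQS, ISA, CALLBACKS, TARGETS):
    print(lhs, "=>", " ".join(rhs), ("  :: " + action) if action else "")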
Lemma 2.1 leads to the following theorem.

Theorem 2.1 Let $DMAPL$ be the set of DMAP-Net languages and $CFL$ be the set of context-free languages; then $DMAPL \subseteq CFL$.
Proof: Let $L \in DMAPL$. Then there exists a DMAP-Net $D$ such that $L(D) = L$. By Lemma 2.1, there exists a CFG $G$ such that $L(G) = L$. Hence, $DMAPL \subseteq CFL$. $\Box$

The following lemma covers the construction of DMAP-Nets from CFG's.

Lemma 2.2 Let $G = \langle \Sigma, N, P, S \rangle$ be a CFG; then there exists a DMAP-Net $D$ such that $L(G) \subseteq L(D)$.

Proof:
Let $D = \langle \Phi, T, I, R, X, E \rangle$ be a DMAP-Net defined as follows:

1. $T = \Sigma$;
2. $I = N$;
3. $I' = \{S\}$;
4. $X = \{isa, partof\}$;
5. $R = \bigcup \Gamma(N_i)$, where $N_i \in N$ and $\Gamma(N_i) = \{\beta_j\}_{j=1}^{k}$ such that $N_i \rightarrow \beta_j \in P$, $1 \leq k$;
6. $E = E_1 \cup E_2$, where $E_1 = \{(N_i, N_j, isa) \mid N_j \rightarrow N_i \in P\}$ and $E_2 = \{(N_i, N_j, partof) \mid N_j \rightarrow \alpha N_i \beta \in P\}$, where $\alpha, \beta \in [\Sigma \cup N]^{*}$ and $\alpha\beta \neq \varepsilon$.
Let $t \in \Sigma^{+}$ and let $t \in L(G)$. Then $S$ derives $t$ in one of two ways: either $S \Rightarrow t$ or $S \Rightarrow r \Rightarrow \ldots \Rightarrow t$, where $r \in [\Sigma \cup N]^{+}$. In the former case, $S \rightarrow t \in P$ and, by construction, $t \in \Gamma(S)$. Thus, $t \in L(D)$. In the latter case, since $r$ derives $t$, by reading the yield of the derivation tree rooted at $S$, one can find a strictly increasing sequence of indices to make $r$ and $t$ satisfy the two conditions of direct activation, as was done in Lemma 2.1. By construction, $r \in \Gamma(S)$. Hence, $A(S, t)$ holds and $t \in L(D)$. $\Box$

The question arises why the construction offered in Lemma 2.1 has the equality sign between the two languages while the construction offered in Lemma 2.2 has the subset sign. It turns out that the construction of Lemma 2.2 can produce a DMAP-Net that recognizes a language strictly larger than the language recognized by the corresponding CFG. The following lemma formalizes this observation.

Lemma 2.3 Let the construction algorithm $C$ that generates DMAP-Nets from CFG's be as specified in Lemma 2.2. Let $C(G) = D$, where $G$ is a CFG and $D$ is a DMAP-Net. Then there exists a CFG $G'$ such that $L(G') \subset L(C(G'))$.

Proof: Let $G'$ have the following productions: $S \rightarrow ab$ and $S \rightarrow aSb$, i.e., $L(G') = \{a^n b^n\}$. Let $D' = C(G')$. By the definition of activation, $L(D') = \{a^n b^n\} \cup \{a^n b\}$, $1 \leq n$. $\Box$
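The construction $C$ of Lemma 2.2 is equally mechanical. The sketch below (same illustrative representation as before) builds the DMAP-Net components from a CFG given as a list of productions, and is run on the counterexample grammar of Lemma 2.3.

def cfg_to_dmap(productions, start="S"):
    nonterminals = {lhs for lhs, _ in productions}
    terminals = {s for _, rhs in productions for s in rhs
                 if s not in nonterminals}
    gamma, edges = {}, set()
    for lhs, rhs in productions:
        gamma.setdefault(lhs, []).append(list(rhs))  # r-sequences of the frame
        if len(rhs) == 1 and rhs[0] in nonterminals:
            edges.add((rhs[0], lhs, "isa"))          # E1: unit production
        else:
            for s in rhs:
                if s in nonterminals:
                    edges.add((s, lhs, "partof"))    # E2: slot filler
    return dict(T=terminals, I=nonterminals, I_prime={start},
                X={"isa", "partof"}, R=gamma, E=edges)

# The counterexample grammar G' of Lemma 2.3: S -> ab | aSb.
print(cfg_to_dmap([("S", ["a", "b"]), ("S", ["a", "S", "b"])]))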
Figure 2: Merlin.

Figure 3: Pioneer 2DX robot.

Figure 4: Pioneer 2DX Hardware.

3 Mapping Inputs to Knowledge Structures

Now we consider how CFCG's can be used in VCI's to map inputs to knowledge structures. Suppose that we want to build a VCI to a 3T mobile robot. One of the robot's physical abilities that the VCI needs to reference is turning a certain number of degrees left or right. A standard VCI carries out the reference in two steps [4]. First, an audio stream uttered by the user is mapped into a symbolic representation of the user's utterance, e.g., a set of symbols or a string. Second, the symbolic representation is used to activate the goals in the agent's internal representation, e.g., a DMAP-Net. For example, if the agent uses the DMAP-Net given in Figure 1, M-TURN-COMMAND is activated on such inputs as "turn left twenty," "turn left twenty degrees," "turn right thirty," or "turn right thirty degrees." Once M-TURN-COMMAND is activated, a callback associated with that node installs an appropriate goal on the RAP sequencer's agenda. Given the partial input recognition equivalence of DMAP-Nets and CFG's, we can construct a VCI that uses a CFCG to do goal identification as a by-product of speech recognition. Thus, only the appropriate goal is sent to the RAP sequencer. The productions of the CFG are as follows:
M-TURN-COMMAND => turn M-ANGLE M-DIRECTION
               |  turn M-DIRECTION M-ANGLE
               :: execute-goal(turn, M-ANGLE, M-DIRECTION)
M-ANGLE        => M-NUMBER | M-NUMBER degrees
M-NUMBER       => ten | twenty ...
M-COMMAND      => M-TURN-COMMAND ...

In the above CFCG, the nonterminals are capitalized and the terminals are in lower-case letters. The double colon in the first production separates the right-hand side of the production from an action specification. In this case, the action specification denotes a goal that will be installed on the RAP executive's agenda should the rule recognize the voice input. For example, if the voice input is "turn left twenty degrees," the RAP executive receives the following goal: (turn -20 100), which means that the robot should turn left twenty degrees at a speed of 100 mm/sec. The key point here is that the symbol interpretation that typically occurs through the DMAP-Net is bypassed because it is no longer necessary. In effect, the agent's conceptual memory now consists of a set of context-free command and control productions.
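The effect of the action specification can be seen in a small sketch. The number and direction tables below, and the treatment of "degrees" as optional, are assumptions consistent with the grammar above; the speed constant 100 comes from the example goal.

NUMBERS = {"ten": 10, "twenty": 20, "thirty": 30}
DIRECTIONS = {"left": -1, "right": 1}

def parse_turn_command(tokens):
    """Recognize 'turn' followed by a direction and an angle in either order."""
    if not tokens or tokens[0] != "turn":
        return None
    rest = [t for t in tokens[1:] if t != "degrees"]  # 'degrees' is optional
    if len(rest) != 2:
        return None
    for direction, angle in (rest, list(reversed(rest))):
        if direction in DIRECTIONS and angle in NUMBERS:
            # The action specification: build the goal for the RAP executive.
            return ("turn", DIRECTIONS[direction] * NUMBERS[angle], 100)
    return None

print(parse_turn_command("turn left twenty degrees".split()))  # ('turn', -20, 100)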
4 Implementation

Our VCI uses the Microsoft Speech API (SAPI) 5.1, freely available from www.microsoft.com/speech. SAPI couples an HMM-based recognition engine with a system for constraining recognized speech with a CFCG. It provides speaker-independent, relatively noise-robust speech recognition. The grammar to be recognized is written in an XML format defined by a Document Type Definition (DTD). The XML grammar used in the VCI to the Pioneer robot includes rules for the turn command.
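The sketch below shows what such rules look like in SAPI 5.1's XML grammar format; the rule names are illustrative, not necessarily the ones used in our grammar.

<GRAMMAR LANGID="409">
  <!-- Top-level rule for the turn command; names are illustrative. -->
  <RULE NAME="TurnCommand" TOPLEVEL="ACTIVE">
    <P>turn</P>
    <RULEREF NAME="Direction"/>
    <RULEREF NAME="Angle"/>
  </RULE>
  <RULE NAME="Direction">
    <L>
      <P>left</P>
      <P>right</P>
    </L>
  </RULE>
  <RULE NAME="Angle">
    <L>
      <P>ten</P>
      <P>twenty</P>
      <P>thirty</P>
    </L>
    <O>degrees</O>
  </RULE>
</GRAMMAR>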