Combining Planning and Dialog for Cooperative Assembly Construction

Jannik Fritsch, Hans Brandt-Pook, and Gerhard Sagerer

jannik, hbrandt, [email protected]
Applied Computer Science, Technical Faculty, University of Bielefeld, Germany

Presented at the IJCAI-99 Workshop "Scheduling and Planning meet Real-time Monitoring in a Dynamic and Uncertain World", pages 59-64, Stockholm, 1999.

Abstract

This article presents work on integrating a dialog component and a reactive planner into an assembly construction system. Using a reactive planner with a tight connection to the dialog component enables the system to ask for user assistance when encountering execution failures. We developed a vision-based action detection to monitor the execution of robotic and human actions. The combination of the robot-independent action detection with the reactive planner makes it possible to correct many errors automatically (e.g. the robot loses a part and the detection process knows where the lost part is in the scene) and to incorporate human actions for cooperative plan execution (e.g. to resolve robotic failures). Additionally, monitoring actions makes it possible to learn high-level assembly plans during their construction.

1 Introduction

Improvements in robotics over the past years have resulted in an increased research interest in developing easy ways to instruct these systems. Today two distinct approaches to instructing robots exist.

1. Command language: The first approach uses some kind of (high-level) command language to instruct the robot. The instructions can either be given using a keyboard (as in the PLAYBOT project [Tsotsos et al., 1998]) or speech (see, for example, [Brown et al., 1992]). If no restrictions are imposed on the speech input, the natural language instructions are often ambiguous. They need to be clarified in a dialog between the system and the instructor. The complex instructions need to be processed to obtain a sequence of basic robotic actions using, for example, standard planning techniques. This approach is used when the robot is working in a scenario with standardized parts where all actions can be implemented in advance.

2. Learning by Watching: The second approach allows the user to specify the sequence of robotic actions indirectly by executing the task

in front of a camera. The human actions are detected in the image sequence and translated into robotic actions. The details of the human action (e.g. how a part is taken by the hand) are used to guide the robotic action (e.g. [Kuniyoshi et al., 1994]). This coupling between the human action and the robot action calls for a tight integration of all system components, which is difficult to achieve when systems become more complex. The strength of this approach lies in situations where unknown parts or actions can occur.

In this paper we focus on error detection and recovery in systems instructed by command language. We argue that the integration of sophisticated modules into a complex system makes it necessary to develop additional modules for error handling which can take the information from all modules into account. We focus on resolving execution errors at the planning level, where a given goal has to be reached, in contrast to research on error recovery at the execution level (i.e. inside the robotics module), where a basic action has to be executed (e.g. [Donald, 1989]). This makes it possible to handle situations where simply repeating a failed action may not be appropriate or is even impossible (e.g. a part falls down and is now out of range). Our approach is pragmatic: we aim at developing a system capable of an "appropriate" reaction to errors in human-machine interaction. For this purpose the planning algorithm and the dialog module are closely connected to enable interaction with the user (e.g. asking for user assistance). The dialog module allows speech to be used to resolve ambiguous instructions before starting the planner and to report errors occurring during the execution of a plan. We use available results from vision modules to detect actions in the scene and monitor plan execution. Because detecting actions is robot-independent, human actions as well as robot actions can be monitored, which allows the cooperative construction of assemblies. When the robot fails to execute an action, the dialog module can ask the user to continue. The actions executed by the user in the scene are detected and change the state of the system. Afterwards the planner can proceed with assigning actions to the robot to reach the goal.


In the next section we give an overview of the system in which our research is realized. In Section 3 we describe the assembly representation used and the symbolic action detection approach. Section 4 introduces the dialog component, and Section 5 deals with the planning component and the interaction of planner and dialog.

2 System Overview

The work is conducted within a research project to build a "Situated Artificial Communicator" that constructs assemblies using a wooden construction kit (Baufix) for children. The final goal is the robotic construction of a toy airplane (see Figure 1).

Figure 1: Toy-airplane using Baufix parts.

Figure 2: Simplified diagram of module interaction. (The figure shows the USER and the SCENE linked to the ARTIFICIAL COMMUNICATOR by speech, user actions, robotic actions, and camera images. Inside the communicator, Object/Assembly recognition passes new/disappeared objects and assemblies to the action detection, which exchanges inferred and planned actions with the planning component (ASP planner + monitoring), passes assembly structures to the assembly database/sequencer, and the result of monitoring the plan execution is reported to the Dialog Interface. The planner handles the "Baufix" domain, e.g. 'put bar on screw'; the database/sequencer handles the "Airplane" domain, e.g. 'build a propeller'.)

The system is designed to behave like a human constructor with regard to its communication abilities and its actuators [Fink et al., 1996]. The instructions can be given using natural speech, just as one would instruct a human constructor. The scene in which all construction actions are carried out consists of a table and is modified by the robot or the user. The actuator part of the system consists of two robot arms (see [Zhang et al., 1998]). Two cameras for stereo vision are mounted at an angle of 45 degrees above the table to obtain images of the scene. Figure 2 shows a simplified diagram of the overall system with the modules important to this paper in a gray box. The images are processed to recognize the objects and assemblies on the table (see [Heidemann et al., 1996]), and based on these objects an assembly recognition is carried out using functional models (see [Bauckhage et al., 1998]). The recognized objects and assemblies are monitored over time and the new or disappeared objects are extracted. Note that only objects on the table and changes relating to those objects can be observed. Given the position of the cameras it is not possible to observe the contents of the robot hands. To detect actions, the information about new or disappeared objects and assemblies is used together with two hand models (see Section 3.2). Because there is no visual control of what is actually in the hands, the states of the hand models reflect the system's assumption about the current state of the world. If all actions necessary to build an assembly were detected, the hand models contain the complete assembly structure. To learn assembly construction plans, this structure is sent to the assembly database. If the dialog obtains a name for this assembly, subsequent instructions can refer to it. The verbal instructions from the user are processed by the dialog interface. This module contains speech recognition, interpretation and the dialog (see [Brandt-Pook et al., 1999]). It extracts the intended complex action from the verbal instruction and sends it to the appropriate level of the planning component.

1. Instructions related to construction actions are sent directly to the reactive planner. The result of the plan execution is sent back to the dialog, which can ask for user assistance if an error occurred that can be resolved by the user (e.g. putting a part closer to the robot). This enhances the ability of the system to overcome robotic errors, because the reactive planner can incorporate user actions when trying to achieve a goal.

2. Instructions related to assemblies from the airplane domain are sent to the assembly database/sequencer module. If this module contains a matching assembly, it sends the construction plans for the subassemblies one by one to the reactive planner for construction.

Because we use vision-based action detection we can monitor plan execution independently of whether a human or the robot acts in the scene. This also allows simulating the robotic execution of the desired actions to easily produce different error situations related to robot execution.
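The two-level routing just described can be illustrated with a small sketch; the class and method names below are hypothetical stand-ins, not the system's actual interfaces.

```python
# Hypothetical sketch of the two-level instruction routing described above.

class ArtificialCommunicator:
    def __init__(self, planner, assembly_db, dialog):
        self.planner = planner          # reactive ASP planner (first level)
        self.assembly_db = assembly_db  # learned assembly plans (second level)
        self.dialog = dialog            # speech understanding / dialog interface

    def handle_instruction(self, instruction):
        """Route an interpreted instruction to the appropriate planning level."""
        if instruction["domain"] == "baufix":
            # Construction actions go directly to the reactive planner.
            result = self.planner.achieve(instruction["goal"])
        elif instruction["domain"] == "airplane":
            # Named assemblies are expanded into subassembly plans one by one.
            result = "ok"
            for subassembly in self.assembly_db.lookup(instruction["name"]):
                result = self.planner.achieve(subassembly)
                if result != "ok":
                    break
        else:
            result = "unknown domain"
        # The dialog reports success or asks the user for assistance on failure.
        self.dialog.report(result)
        return result
```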

3 Assemblies and Basic Actions

The actions which can be executed in the scene are strongly related to the available parts and how they can be connected. Therefore we will first turn to the different functions of the single parts and a definition of their assemblies before we introduce the basic actions and their detection.

3.1 Assembly Structure

In our system, parts from the Baufix construction kit are used to construct bolted assemblies. The male parts are screws of different lengths; the female parts have threads and can be screwed onto the screws. Additionally, there is a third class of miscellaneous parts with holes (e.g. bar with 3 holes → 3 h bar) which can be put onto screws.

SCREW: round head bolt, hexagon bolt
MISC:  3 h bar, 5 h bar, 7 h bar, felly, socket, ring        (1)
NUT:   cube, rhomb-nut

The function of an assembly constructed from single parts falls into one or several classes. Only if the assembly is used as a subassembly in a larger assembly can one function be determined for this subassembly, based on the type of connection within the assembly. The following gives the formal definition for assembly construction (commas are used instead of conjunctions).

Assembly:  (Agg SCREW, Agg MISC*¹, Agg NUT)
Agg SCREW: SCREW | Assembly                                  (2)
Agg MISC:  MISC | Assembly
Agg NUT:   NUT | Assembly

Note that the list of parts in an assembly is ordered, always starting with Agg SCREW, followed by any number of Agg MISC, and ending with Agg NUT. A function can only be assigned to an assembly if there are free "ports" to use the assembly as a part (e.g. for Agg NUT there needs to be at least one free thread in the assembly). The recursive definition of the assembly structure using subassemblies allows different representations for one assembly. When constructing an assembly, the system needs to choose one representation out of the several learned representations. This decision can be based on several constraints, e.g. how the resulting subassemblies can be handled by the gripper. In the following we will refer to a "part" as an entity used to construct an assembly; this can be either a single Baufix part or an assembly itself.
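As an illustration, definition (2) maps naturally onto a recursive data structure along the following lines. This is only a sketch under our own naming conventions; the system's internal representation is not prescribed by the definition.

```python
# A minimal sketch of the recursive assembly structure from definition (2).
from dataclasses import dataclass
from typing import List, Union

@dataclass
class Part:
    kind: str   # "SCREW", "MISC", or "NUT"
    name: str   # e.g. "hexagon bolt", "3 h bar", "cube"

# A "part slot" in an assembly may hold a single Baufix part or a subassembly.
PartOrAssembly = Union[Part, "Assembly"]

@dataclass
class Assembly:
    agg_screw: PartOrAssembly          # must provide a free screw
    agg_misc: List[PartOrAssembly]     # zero or more parts put onto the screw
    agg_nut: PartOrAssembly            # must provide a free thread

# Example: the propeller from the dialog in Section 4 --
# two bars put onto a bolt and fixed with a cube.
propeller = Assembly(
    agg_screw=Part("SCREW", "round head bolt"),
    agg_misc=[Part("MISC", "3 h bar"), Part("MISC", "3 h bar")],
    agg_nut=Part("NUT", "cube"),
)
```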

3.2 Detection of Basic Actions

To enable closed-loop planning we need to detect the actions executed in the scene. As our primary goal is the monitoring of plan execution, we only monitor actions which can be executed by the robot (Take, Connect, Put down) or which represent an error.

Two hand models are used for action detection. The state of each hand model represents the part or (partial) assembly which is currently in the hand.

States of hand model:

    Empty | SCREW | MISC | NUT | (Agg SCREW, Agg MISC+²)³ | Assembly        (3)

Using these hand models and the information about new and disappeared objects we can infer actions and change the hand states appropriately:

Take X
    Preconditions: X disappeared ∧ hand: Empty
    Effects:       ¬(X on table) ∧ hand: X

Connect X Y                                                                 (4)
    Preconditions: hand 1⁴: X (Agg SCREW) ∧ hand 2⁴: Y (Agg MISC | Agg NUT)
    Effects:       hand 1: XY ∧ hand 2: Empty

Put down X
    Preconditions: X new ∧ hand: X
    Effects:       X on table ∧ hand: Empty

For failure detection, only one additional Falling down action is currently implemented, which is similar to the Put down action. The only difference is in the preconditions: before the part fell down it may have been part of an assembly in one hand, as long as it was removable (i.e. not fastened with a screw). The Connect action is not observable because mounting parts together does not lead to new or disappeared objects. Therefore this action can only be inferred once the next Take action has happened: if both hands contain parts and the preconditions of the Connect action are met, it is inferred that the parts have been connected together if another object disappears. Inferring the Connect action results in one hand containing the (begun) assembly and the other hand being empty. This empty hand can now take the disappeared object. If a Connect action is not possible and both hands contain parts, an error message is generated because the disappeared part cannot be taken by the hands. Because our approach uses only information about new and disappeared objects to detect actions, we need to make some assumptions:

- Each hand can hold one part or (partial) assembly.
- No parts are put down outside the visible scene.
- New parts may only be put into the scene by the user if no similar parts (same type and color) are in the robot hands.

The first point is true for our robot but not for the user. If our system is to detect user actions as well, the user must follow this restriction.

¹ The star operator indicates the possibility to put any number of Agg MISC on an Agg SCREW, limited by its length.
² The plus operator indicates the possibility to put at least one Agg MISC on an Agg SCREW, limited by its length.
³ This state represents a partial assembly.
⁴ The notation hand 1 and hand 2 is only used to indicate two different hand models; each of the two hand models can be hand 1 or hand 2.
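To make this inference concrete, the following is a much simplified sketch of how the Take, Put down, Falling down, and inferred Connect actions could be derived from new/disappeared objects and the two hand models. All names are assumptions, and the grammar checks of definitions (2) and (3) are reduced to a placeholder; this is not the system's implementation.

```python
# Simplified sketch of the action inference in Section 3.2: actions are inferred
# only from new/disappeared objects and the two hand-model states.

def connectable(x, y):
    """Placeholder for the real check that x (Agg SCREW) and y (Agg MISC/NUT) fit."""
    return x != "Empty" and y != "Empty"

def infer_action(event, obj, hands):
    """event is 'disappeared' or 'new'; hands is a list of two hand states."""
    if event == "disappeared":
        for i, h in enumerate(hands):
            if h == "Empty":
                hands[i] = obj                       # Take obj
                return ("Take", obj)
        # Both hands are full: infer the unobservable Connect to free one hand.
        if connectable(hands[0], hands[1]):
            hands[0] = (hands[0], hands[1])          # partial assembly in hand 1
            hands[1] = obj                           # the freed hand takes obj
            return ("Connect then Take", obj)
        return ("Error: no hand can take", obj)
    if event == "new":
        for i, h in enumerate(hands):
            if h == obj:
                hands[i] = "Empty"                   # Put down obj
                return ("Put down", obj)
        return ("Falling down", obj)                 # part dropped from an assembly (simplified)
    return ("Unknown event", obj)

# Usage: two parts are taken, then a third object disappears.
hands = ["Empty", "Empty"]
print(infer_action("disappeared", "bolt", hands))    # ('Take', 'bolt')
print(infer_action("disappeared", "bar", hands))     # ('Take', 'bar')
print(infer_action("disappeared", "cube", hands))    # ('Connect then Take', 'cube')
```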

4 The Dialog Module

The dialog strategy enables the system to obtain complete instructions by listening to the human instructor. Figure 3 shows an example of a dialog where user interaction is requested to resolve a robotic failure. In this dialog the user instructs the robot to build a propeller consisting of two bars which are put onto a bolt and fixed with a cube. Because the robot loses the cube and repeating the Take action is not possible, the user is asked to move the cube closer to the robot so that the robot can take it. Note that the user introduces the name for the assembly at the end of the construction. The system has no prior knowledge about airplanes and knows from now on what the user calls this assembly. The assembly name is sent to the database and, after successful execution of the instructions, attached to the learned assembly structure. Therefore the system now knows how to build the assembly called "propeller".

System: Hello! This system understands instructions related to Baufix. What is your first instruction?
User:   Oh, first of all put two bars on the short bolt!
System: Okay. I have put the short bars on the yellow bolt. What is to do now?
User:   Fix it with a red cube.
System: Okay.
        (The robot takes the cube, but it falls down.)
System: Please move the red cube close to the robot.
        (The user takes the cube and puts it closer to the robot; the system proceeds with taking the cube.)
System: What is to do now?
User:   Put down the propeller!
System: Okay. I have put down the propeller.

Figure 3: Example of a dialog with user interaction.

To develop an adequate dialog strategy we studied simulated human-machine interactions [Brindopke et al., 1995] that have the same constraints: a human guides a robot to build an airplane. We found many ambiguous instructions which needed to be clarified in a dialog before a common interpretation was possible. Because a common interpretation of an instruction is a basic precondition for learning more complex actions, the main aspect of the dialog model shown in Figure 4 is to obtain a sequence of clear and unambiguous instructions.

Figure 4: Dialog model (states: S_OPENING, U_UTTERANCE, S_REPEAT, S_CONFIRM_INT, S_REQUEST, S_REJECT, S_CONFIRM_EXE, S_CLOSING).

Every dialog starts with an opening by the system (S_OPENING) followed by a user utterance (U_UTTERANCE). If the system understood the intention of the utterance it confirms this (S_CONFIRM_INT), otherwise the user is asked to repeat the utterance (S_REPEAT). If an utterance cannot be completely interpreted the system asks for more information (S_REQUEST), otherwise the resulting instruction is sent to the planning module for execution. The planning module returns information on whether the robot executed the instruction successfully (S_CONFIRM_EXE) or some kind of error occurred (S_REJECT). In case of an error a system output notifies the user of the kind of failure (see Section 5.3).
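The flow of Figure 4 could be approximated by a simple loop like the following sketch. The hook functions understand, interpret_complete, and execute are hypothetical stand-ins for the speech understanding, interpretation, and planning interfaces; the transition logic is only distilled from the text above.

```python
# Rough sketch of the dialog flow in Figure 4; state names follow the figure.

def run_dialog(understand, interpret_complete, execute):
    """One dialog session; the three callables are hypothetical hooks."""
    print("S_OPENING: Hello! This system understands instructions related to Baufix.")
    while True:
        utterance = input("U_UTTERANCE> ")
        if utterance in ("bye", "quit"):
            print("S_CLOSING: Goodbye.")
            break
        if not understand(utterance):
            print("S_REPEAT: Please repeat your instruction.")
            continue
        print("S_CONFIRM_INT: Okay.")
        if not interpret_complete(utterance):
            print("S_REQUEST: Please give me more information.")
            continue
        ok, message = execute(utterance)        # send the instruction to the planner
        if ok:
            print("S_CONFIRM_EXE:", message)
        else:
            print("S_REJECT:", message)         # may also ask the user for assistance
```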

5 The Planning Component

The planning component should react quickly to "wrong" actions while being able to build complex assembly structures learned before. To meet both needs, the planning component is made up of two levels:

1. Planning and monitoring of basic actions for the construction of a single assembly.
2. Sequential assignment of subassemblies to the first level for the construction of complex assemblies.

The first level processes all instructions referring to construction actions with Baufix parts. This level uses the Action Selection for Planning algorithm (ASP) with functional encodings [Bonet et al., 1997]. At each point in time the ASP algorithm selects only the next action and waits for its execution (see Section 5.1 for details). Updating the planner with the actions monitored in the world therefore allows integrated error recovery, because selecting the next action is based on the actual situation. The assembly structure (see Section 3.1) is used as the goal of the planner at the first level. Several aspects make it a good interface between the two levels:

1. An assembly is a unit likely to be given a name by a human instructor.
2. A single assembly has adequate granularity for plan monitoring and learning (the construction of a complex assembly depends on the successful construction of its subassemblies).
3. The second level can use deliberation to select a specific subassembly sequence.

All instructions referring to the airplane domain (e.g. "Build a propeller.") are handled by the second level. This level contains a database with all learned assembly structures and the associated names given by the instructor. The second level assigns the subassemblies one by one to the reactive planner for construction.

5.1 The ASP Planner

The ASP algorithm is a variation of Korf's Learning Real-Time A* algorithm (LRTA*, see [Korf, 1990]). It yields a non-optimal solution because it treats planning as a search problem and does not search the search space exhaustively. Instead, a heuristic function is used to estimate the number of steps from the actual state to the goal (i.e. the cost to the goal) for every possible action. The action with the lowest (estimated) cost is selected and changes the actual state. Based on this new state the search is repeated to find the next action. When using ASP for closed-loop planning, the action selected by the planner is assigned to the robot but does not change the actual state. An exception to this rule is the Connect action, which is not observable. Only observable actions make the planning module wait for their execution. When an action is detected (whether it was the "right" action is not important), the actual state is changed to reflect the new state and the next action is searched for. Because searching for the next action is based on the actual state, a "wrong" action will be corrected automatically (e.g. if the robot loses a part, the system autonomously recovers by planning a Take action with the actual position of the fallen part). Because the ASP algorithm only selects the best action given the current state and never constructs a complete plan, it is very fast. An additional benefit is that there is only a small increase in planning time if the user changes the planning goal while the system is trying to achieve it. This feature is especially useful when interacting with a user in a dynamic environment.
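The closed-loop selection step can be sketched as follows. This is only a minimal LRTA*-style illustration under our own naming assumptions, not the original ASP implementation with functional encodings.

```python
# LRTA*-style action selection: pick the action minimizing cost(a) + h(result),
# update the heuristic estimate for the current state, and re-plan from whatever
# state the action detection reports next.

def select_action(state, actions, result, h, goal_test, cost=lambda a: 1):
    """Return the next action for `state`, updating the heuristic table `h`."""
    if goal_test(state):
        return None
    # Estimate f = cost + heuristic for every applicable action.
    scored = [(cost(a) + h.get(result(state, a), 0), a) for a in actions(state)]
    best_f, best_a = min(scored, key=lambda fa: fa[0])
    # Learning step: raise the estimate for the current state (as in LRTA*).
    h[state] = max(h.get(state, 0), best_f)
    return best_a
```

After the selected action has been assigned to the robot, the caller waits for the action detection to report a change; the next call then starts from whatever state was actually observed, which is what makes the automatic recovery from "wrong" actions described above fall out of the selection loop.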

5.2 Interfacing the Planner to the World

In our scenario, a situation with more than a dozen parts lying on the table is normal. The construction parts differ in color and functionality; there are 10 different types of parts (see Section 3.1), some of them available in 4 different colors. Clearly, modeling all those different parts makes the world model unnecessarily large and subsequently makes planning more time-consuming. Therefore we implemented generic parts in the planner which are initialized with the specific type and position of a part at run-time. At the beginning of the planning process only the parts needed for the actual goal are initialized; all other parts are "invisible" to the planner. If unknown parts are taken by the robot or by the human during the construction process, they are initialized in the planner. If a part not belonging to the actual goal assembly is placed back on the table, the matching generic part is cleared. The planning operators to reach the goal are comparable to the actions described in Section 3.2 but contain additional atoms to capture the sequential nature of assembly construction. To solve a planning problem in the original ASP implementation, the start/goal situation and the planning operators are compiled and the algorithm is run. To avoid this time-consuming procedure we pre-compiled the plan operators and generic goals (one for every size of assembly, i.e. number of parts) using the generic parts described above. At run-time the appropriate generic parts and the goal are initialized based on the instruction. To keep the number of plan operators small, the difference between putting miscellaneous parts on a screw and screwing parts together is not modeled in the current implementation of the operators. If the planner assigns a Connect action, this action is mapped onto a Screw or Put action for the robot based on the part used (e.g. whether it is a MISC or a NUT part).
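A rough sketch of this run-time binding of generic parts and of the Connect mapping might look as follows; all identifiers are illustrative only and do not reflect the actual planner code.

```python
# Sketch of run-time binding of pre-compiled generic parts and the mapping of
# the planner's Connect action onto robot primitives.

class GenericPartPool:
    def __init__(self, size):
        self.slots = [None] * size      # pre-compiled generic parts, unbound

    def bind(self, part_type, color, position):
        """Initialize the first free generic part with a concrete scene part."""
        for i, slot in enumerate(self.slots):
            if slot is None:
                self.slots[i] = {"type": part_type, "color": color, "pos": position}
                return i
        raise RuntimeError("no free generic part")

    def clear(self, index):
        """Release a generic part when its scene part leaves the current goal."""
        self.slots[index] = None

def robot_command(action, part):
    """Map the planner's Connect action onto the robot's Screw or Put primitive."""
    if action == "Connect":
        return "Screw" if part["type"] == "NUT" else "Put"
    return action
```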

5.3 Enhanced Performance by Interaction of Planner and Dialog

For executing instructions related to the Baufix domain, the reactive planner is invoked directly from the dialog (e.g. "Put bar on screw."). The result of the plan monitoring is sent back to the dialog. The planner sends S_CONFIRM_EXE to the dialog module if the instruction is executed successfully. Otherwise the S_REJECT message returned by the planner can be categorized into one of the following error classes.

Semantic error: The planning module contains the knowledge about allowed connections, so only the planner is able to decide whether an instruction is executable at all. For example, "screw the bar into the bar" is not an adequate instruction and will therefore be rejected.

Plan not executable: Some instructions are not executable although the semantic check of the utterance has delivered no error. For example, if a hole is already used it will not be possible to use it for another connection.

Robotic failure: The robot fails to execute the instruction for technical reasons (e.g. the object is too large for the gripper).

User interaction: The robot needs help to execute an action (e.g. the object is out of range of the robot arm).

Every error class consists of several error messages to capture the details of the error. For the first three classes the dialog produces a system output to inform the user about the reason for the failure. The fourth class captures failures where user assistance can lead to the successful completion of a plan. The appropriate system output is generated ("Put the screw closer to the robot.") and the planner waits until an action is detected in the scene or a timeout occurs. If the user moves the screw, the action detection finds a Take screw action and later a Put down screw action. This changes the position of the screw within the planner. When the planner now searches for the next action, the one found is the same as before, but this time the robot is able to complete it because the updated screw position is inside its range.
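The four error classes and the resulting dialog reactions could be encoded roughly as in the following sketch; the enum and function names are our own assumptions, not the messages actually exchanged in the system.

```python
# Sketch of the error classes reported by the planner and the dialog's reaction.
from enum import Enum

class PlanResult(Enum):
    CONFIRM_EXE = "instruction executed successfully"
    SEMANTIC_ERROR = "instruction violates connection knowledge"
    PLAN_NOT_EXECUTABLE = "instruction is valid but cannot be executed"
    ROBOTIC_FAILURE = "robot failed for technical reasons"
    USER_INTERACTION = "user assistance can complete the plan"

def dialog_reaction(result, detail=""):
    """Produce the system output the dialog would generate for a planner result."""
    if result is PlanResult.CONFIRM_EXE:
        return "Okay. " + detail
    if result is PlanResult.USER_INTERACTION:
        # Ask the user for help; the planner then waits for a detected action or a timeout.
        return "Please " + detail
    # Semantic errors, non-executable plans, and robotic failures are only reported.
    return "Sorry, I cannot do that: " + detail
```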

5.4 System Performance

The complete system consists of several distributed applications. To demonstrate the small time delay between a user instruction and the first reaction of the system, we measured the performance of the dialog and the planner running on a single workstation⁵; all other modules (vision, robotics) were started on separate machines. The speech recognition processes the speech input incrementally and yields results with a delay of 750 ms. The dialog module needs approximately 20 ms to understand an instruction consisting of six words; for longer instructions the processing time increases, to about 120 ms for a 15-word instruction. This results in a typical "reaction time" below 800 ms until the dialog outputs a system message either confirming the instruction or asking for clarification. If the instruction is confirmed, the ASP planner is started. It needs approximately 600 ms to find the first action⁶; subsequent steps are processed faster (below 200 ms) because the internal hash table already contains some heuristic estimates calculated during the search for the first step. Summarizing the processing times given above, we think our system is a good realization of a mixed-initiative human-machine interface, responding fast enough to new or changing goals and unforeseen events in the scene.

⁵ All processing times were measured on a DEC AlphaStation 500/500 (SPECint95 15.0, SPECfp95 20.4).
⁶ We increased the Deep Lookahead value to 6 to reduce the possibility of wrong actions; in the standard implementation a value of 2 is used by [Bonet et al., 1997].

6 Future Work

We are currently improving the action detection by incorporating information about moving (human and robotic) hands in an image sequence to restrict when actions can occur. We plan to enhance the second level of our planning component to recognize known assemblies during their construction by the user. This will allow the subsequent completion of the construction by the system (e.g. "Do you want the robot to complete the propeller?"). Currently, the application of probabilistic models to learn the quality of a specific representation (e.g. construction time) is being considered. The ability of the system to learn assembly construction plans will be used to support other modules (e.g. visual assembly recognition).

7 Summary

This article demonstrated the integration of a planner and a dialog component into an assembly construction system. For action detection we presented a simple approach based on vision. A two-level planning component was introduced to facilitate reactive planning while at the same time allowing complex assemblies to be constructed. It was shown that symbolic action detection in connection with a dialog and a planner can increase the overall system performance by recovering from failures and allowing cooperative construction.

References

[Bauckhage et al., 1998] C. Bauckhage, F. Kummert, and G. Sagerer. Modeling and recognition of assembled objects. In IECON'98, Proceedings of the 24th Annual Conference of the IEEE Industrial Electronics Society, pages 2051-2056, Aachen, 1998.

[Bonet et al., 1997] B. Bonet, G. Loerincs, and H. Geffner. A robust and fast action selection mechanism for planning. In Proceedings of AAAI-97, pages 714-719. MIT Press, 1997.

[Brandt-Pook et al., 1999] H. Brandt-Pook, G. A. Fink, S. Wachsmuth, and G. Sagerer. Integrated Recognition and Interpretation of Speech for a Construction Task Domain. In Proc. of the 8th Int. Conf. on Human-Computer Interaction, Munich, 1999. To appear.

[Brindopke et al., 1995] C. Brindopke, M. Johanntokrax, A. Pahde, and B. Wrede. Instruktionsdialoge in einem Wizard-of-Oz-Szenario: Materialband. Report 7, SFB 360 "Situierte Künstliche Kommunikatoren", Universität Bielefeld, 1995.

[Brown et al., 1992] M. K. Brown, B. M. Buntschuh, and J. G. Wilpon. SAM: A Perceptive Spoken Language Understanding Robot. IEEE Trans. Systems, Man and Cybernetics, 22:1390-1402, 1992.

[Donald, 1989] B. Donald. Error Detection and Recovery. Lecture Notes in Computer Science. Springer-Verlag, 1989.

[Fink et al., 1996] G. A. Fink, N. Jungclaus, F. Kummert, H. Ritter, and G. Sagerer. A Distributed System for Integrated Speech and Image Understanding. In International Symposium on Artificial Intelligence, pages 117-126, Cancun, Mexico, 1996.

[Heidemann et al., 1996] G. Heidemann, F. Kummert, H. Ritter, and G. Sagerer. A Hybrid Object Recognition Architecture. In International Conference on Artificial Neural Networks (ICANN 96), pages 305-310, Bochum, 1996. Springer-Verlag.

[Korf, 1990] R. Korf. Real-time heuristic search. Artificial Intelligence, 42:189-211, 1990.

[Kuniyoshi et al., 1994] Y. Kuniyoshi, M. Inaba, and H. Inoue. Learning by Watching: Extracting Reusable Task Knowledge from Visual Observation of Human Performance. IEEE Trans. on Robotics and Automation, 10(6):799-822, 1994.

[Tsotsos et al., 1998] J. K. Tsotsos, G. Verghese, S. Dickinson, M. Jenkin, A. Jepson, E. Milios, F. Nuflo, S. Stevenson, M. Black, D. Metaxas, S. Culhane, Y. Ye, and R. Mann. PLAYBOT: A visually-guided robot for physically disabled children. Image and Vision Computing, 16:275-292, 1998.

[Zhang et al., 1998] J. Zhang, Y. v. Collani, and A. Knoll. Development of a Robot Agent for Interactive Assembly. In Proceedings of the 4th Int. Symp. on Distributed Robotic Systems, Karlsruhe, 1998.