Sonderforschungsbereich 314 Künstliche Intelligenz - Wissensbasierte Systeme
KI-Labor am Lehrstuhl für Informatik IV
Leitung: Prof. Dr. W. Wahlster

VITRA

Universität des Saarlandes
FB 14 Informatik IV
Postfach 151150
D-66041 Saarbrücken
Fed. Rep. of Germany
Tel. 0681 / 302-2363

Bericht Nr. 114

KANTRA - A Natural Language Interface for Intelligent Robots

Thomas Laengle, Tim C. Lueth Eva Stopp, Gerd Herzog, Gjertrud Kamstrup

März 1995

ISSN 0944-7814


KANTRA - A Natural Language Interface for Intelligent Robots

Thomas Laengle*, Tim C. Lueth*, Eva Stopp, Gerd Herzog, Gjertrud Kamstrup**

SFB 314 - Project VITRA, FB 14 Informatik, University of the Saarland, D-66041 Saarbrücken, email: [email protected]
* Institute for Real-Time Computer Systems and Robotics, University of Karlsruhe, D-76128 Karlsruhe, email: [email protected]
** Faculty of Electrical Engineering and Computer Science, Norwegian Institute of Technology, N-7034 Trondheim, Norway, email: [email protected]

Abstract

The future use of advanced technical systems, for example robots in the area of service and maintenance, will place high demands on man-machine and machine-man interaction in order to make these systems more easily accessible to human operators. Natural language can therefore be an efficient means of using robots in a more flexible manner: information can be conveyed in varying degrees of condensation, and communication can be performed on different levels of abstraction in an application-specific way. In order to fully exploit the capabilities of natural language access, a dialogue-based approach is used. In this article, we report on the joint efforts of the University of Karlsruhe and the University of the Saarland in providing natural language access to the autonomous mobile two-arm robot KAMRO, which is being developed at IPR. A closer view of the interface architecture is given, and we document how the man-machine interface should be integrated into the control architecture of the robot in order to provide access to all internal information and models that are necessary for autonomous behaviour.

Key Words. Natural Language, Man-Machine Interface, Robot-Human Interaction

1 Introduction

Natural language and robotics are two major areas of Artificial Intelligence, but they have been studied rather independently in the past. Only a few research efforts use natural language as a tool in human-machine interaction; some of them are presented in the next section. As advanced robot systems steadily gain higher intelligence and greater autonomy, the requirements on the design of flexible interfaces for controlling these systems increase. This is, for instance, the case, according to [Rembold et al. 93], for the future use of intelligent robots in manufacturing or as service robots for different applications. Natural language, as the communication medium of humans, is therefore an efficient means of making a technical system more easily accessible to its users [Wahlster 89]. A practical advantage of natural language access is the possibility to convey information in varying degrees of condensation and to communicate on different levels of abstraction in an application-specific way.

In: Proceedings of the 4th International Conference on Intelligent Autonomous Systems

2 State of the Art

Some of the research which has been carried out in order to combine the fields of robotics and natural language processing will now be presented. In [Sondheimer 76], the focus is on the problem of spatial reference in natural-language machine control. The well-known SHAKEY system [Nilsson 84], a mobile robot without manipulators, is able to understand simple commands given in natural language. The work described in [Sato & Hirai 87] concentrates on language-aided instruction for teleoperational control: specific words can be utilized to simplify the specification of teleoperational functions for the instruction of a remote robot system. In [Torrance 94], a natural language interface for a navigating indoor office-based mobile robot is presented. In addition to giving commands and asking questions about the robot's plans, the user can associate arbitrary names with specific locations in the environment. Some theoretical aspects of natural language communication with robot systems from the perspective of computational linguistics are discussed in [Lobin 92]. Other approaches have been concerned with natural language control of autonomous agents within simulated 2D or 3D environments [Badler et al. 91; Chapman 91; Vere & Bickmore 90]. One salient aspect of natural language access to robot systems is the relationship between sensory information and verbal descriptions. Such issues have already been investigated in the field of integrated natural language and vision processing [Bajcsy et al. 85; Herzog & Wazinski 94; Neumann 89; Wahlster et al. 83].

3 The Intelligent Mobile Robot KAMRO

Higher intelligence and greater autonomy of more advanced robot systems increase the requirements on the design of a flexible interface for controlling the system on different levels of abstraction. In the KAMRO (Karlsruhe Autonomous Mobile RObot) project, for example, an autonomous mobile robot (Fig. 1) for assembly tasks is being developed with the capability of recovering from error situations [Lüth & Rembold 94]. KAMRO is a two-arm robot system that consists of a mobile platform with an omnidirectional drive system, two Puma 260 manipulators, and different sensors for navigation, docking, and manipulation. KAMRO is capable of performing assembly tasks (Fig. 2) autonomously. The tasks or robot operations can be described on different levels: assembly precedence graphs, implicit elementary operations (pick, place), and explicit elementary operations (grasp, transfer, fine motion, join, exchange, etc.). A given complex task is transformed by the control architecture (Fig. 3) from the assembly-precedence-graph level to the explicit-elementary-operation level.

Figure 1: The Mobile Robot KAMRO

The generation of suitable sequences of elementary operations depends on the position and orientation of the assembly parts on the worktable, while the execution is controlled by the real-time robot control system. Status and sensor data, which are fed back to the planning system, enable KAMRO to monitor the execution of the plan and to correct it if necessary.
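The refinement from implicit to explicit elementary operations described above can be illustrated with a minimal sketch. The expansion table below is an illustrative assumption, not KAMRO's actual planner; the operation names are taken from the levels listed in the text.

```python
# Hypothetical sketch of task refinement: each implicit elementary
# operation ("pick", "place") expands into a fixed sequence of explicit
# elementary operations. The concrete expansions are assumptions.
IMPLICIT_TO_EXPLICIT = {
    "pick":  ["transfer", "fine-motion", "grasp"],
    "place": ["transfer", "fine-motion", "join"],
}

def refine(implicit_ops):
    """Expand (operation, object) pairs into explicit elementary operations."""
    explicit = []
    for op, obj in implicit_ops:
        for step in IMPLICIT_TO_EXPLICIT[op]:
            explicit.append((step, obj))
    return explicit

plan = refine([("pick", "sideplate"), ("place", "sideplate")])
```

In the real system the generated sequence additionally depends on the position and orientation of the parts on the worktable, which this static table deliberately ignores.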

4 Human-Robot Interaction

Intelligently behaving autonomous robot systems have several sensor systems, e.g., tactile, acoustic, and vision sensors, which provide the perceptual capabilities necessary to explore and analyze their environment. They use this information to generate an environment model. But an intelligent robot is sometimes unable to complete missing information from its sensors and its knowledge base alone. In such a situation, it is an advantage to be able to query the human operator for the missing information. We therefore argue that a natural language interface should not use natural language merely as a command language: there should be a dialogue between the user and the autonomous system to resolve ambiguities and misunderstandings.
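The fallback to a clarification dialogue can be sketched as follows. This is not the KANTRA implementation; the object names and the string-matching heuristic are illustrative assumptions.

```python
# Sketch of dialogue-based disambiguation: when the world model cannot
# resolve a referent uniquely, the system asks the operator instead of
# failing or guessing. Matching by substring is a toy heuristic.
def resolve_object(description, world_model, ask_user):
    """Return the unique object matching `description`, or ask the user."""
    candidates = [obj for obj in world_model if description in obj]
    if len(candidates) == 1:
        return candidates[0]
    # Ambiguous or unknown referent: fall back to a clarification question.
    return ask_user(f"Which object do you mean by '{description}'? "
                    f"Candidates: {candidates or 'none known'}")

world = ["left sideplate", "right sideplate", "pendulum"]
# Unique match: resolved directly, no question is asked.
obj = resolve_object("pendulum", world, ask_user=input)
```

With "sideplate" instead of "pendulum", two candidates match and the `ask_user` callback would be invoked.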


Figure 2: The Cranfield Assembly Benchmark

In the context of natural language access we consider four main situations of human-machine interaction:

- Task specification: Operations and tasks to be performed by the robot can be given on different levels of abstraction: from high-level commands like "assemble benchmark", through implicit robot operations, e.g., "pick sideplate", to explicit robot operations like "grasp" or "fine motion". Alternatively, the operator can simply describe the final positions of the objects concerned.

- Execution monitoring: One of the most significant features of autonomous systems is the ability to carry out an assembly mission in different orders. Because of this property, the operator should be informed about what the robot is actually doing: descriptions and explanations can be given in more or less detail.

- Explanation of error recovery: Autonomous systems are normally able to recover from error situations. This capability can cause comprehension problems for the user, because the robot sometimes does not behave as expected. An explanation of why and how plans have been changed therefore increases cooperativeness.

- Updating and describing the environment representation: Since the visual field of an autonomous mobile robot is restricted, geometric and visual data can hardly be complete in dynamic, complex environments. The human operator can aid the robot in maintaining the environment representation by providing additional information in natural language. On the other hand, he should also have the possibility to ask for verbal descriptions of the scene.

Most existing natural language interfaces have been developed in order to provide access to databases or expert systems. In general, three main modules can be distinguished:

Figure 3: Structure of the KAMRO System (the knowledge base and the plan execution system FATE turn an action plan into robot operations; the real-time robot control RT-RCS executes them on KAMRO and reports status back)

- Analysis component: Natural language input must be translated by a parser into a semantic representation encoded in a knowledge representation language.

- Evaluation component: The utterances are then interpreted with respect to the internal world knowledge of the intelligent system. This component forms the interface between the natural language access and the autonomous system. Feedback from the application system is passed back to the dialogue system, which has to contact the user.

- Generation component: The information provided by the evaluation component then has to be translated into natural language utterances, depending on the situational context.

Fig. 4 shows the resulting architecture of our KANTRA system (KAmro Natural language TRAnslator). The autonomous system and the dialogue system must continuously update their environment models; this must be done especially after the execution of a command. In order to analyse an utterance, we must be able to identify the objects it mentions because, in general, one cannot rely on unique identifiers. According to [Herskovits 86], spatial expressions are used to describe the location of an object in order to identify it. Such spatial expressions must be related to visual and geometric information about the environment, i.e., a referential semantics must be defined. After the analysis of an utterance, its result must be transferred to the robot using a representation for the different robot commands. Since the autonomous mobile robot KAMRO is a maximally cooperative system, instructions can be given as short as possible: underspecified information can, to a certain degree, be completed by the robot itself, while some other uncertainties are removed by the dialogue system.

Figure 4: Structure of the Natural Language Interface KANTRA (analysis, evaluation, and generation components, drawing on morpho-syntactic knowledge, conceptual knowledge, encapsulated knowledge of the robot, a user model, and a linguistic dialogue memory, connected to KAMRO's task, environment, and execution representations)

An autonomous system has a planning component which is responsible for the correct execution of plans. If the commands are given by the user, certain error situations can occur; e.g., a manipulator can only place an object if it has picked it up before. This information is often intended by the user but not mentioned in the utterance. Another problem is that a robot only has a certain number of manipulators: if the operator gives a sequence with more pick commands than manipulators, without any place commands between them, the robot will not be able to perform the instructions.
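The manipulator-capacity constraint just described can be checked before execution. The following is a minimal sketch under the stated assumption that KAMRO's two arms are the only resources; the function name and command encoding are illustrative, not part of the actual planner.

```python
# Sketch of a pre-execution consistency check: a command sequence is
# infeasible if at some point more objects are held than the robot has
# manipulators (two for KAMRO).
def feasible(commands, num_manipulators=2):
    """Check that picks never exceed the number of free manipulators."""
    held = 0
    for op, _obj in commands:
        if op == "pick":
            held += 1
            if held > num_manipulators:
                return False
        elif op == "place":
            held = max(0, held - 1)  # placing frees a manipulator
    return True

feasible([("pick", "a"), ("pick", "b"), ("pick", "c")])   # False: third pick
feasible([("pick", "a"), ("place", "a"), ("pick", "b")])  # True
```

A real check would also enforce the first constraint mentioned above, that an object can only be placed after it has been picked up.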

5 Environment Representation

The correct environment representation must be permanently accessible to KAMRO and its natural language interface. For this purpose, the robot uses one of its visual sensors, the overhead camera, to record the situation on the workbench (Fig. 5). In order to make this information available to the KANTRA system, it is stored in a common database. Since the world representation changes over time, it is important to store a timestamp with each snapshot: this way, it is possible to merge older and newer knowledge about the environment. For each object, the database contains the following information:
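The timestamp-based merging described above can be sketched as follows. The database shape and field names are illustrative assumptions, not the actual KAMRO schema.

```python
# Sketch of merging timestamped camera snapshots into a shared object
# database: an entry is overwritten only by a strictly newer observation.
def update_database(db, snapshot, timestamp):
    """Merge a timestamped snapshot {name: pose} into the database."""
    for name, pose in snapshot.items():
        entry = db.get(name)
        if entry is None or entry["timestamp"] < timestamp:
            db[name] = {"pose": pose, "timestamp": timestamp}
    return db

db = {}
update_database(db, {"sideplate": (0.10, 0.25, 90.0)}, timestamp=1)
update_database(db, {"sideplate": (0.12, 0.25, 90.0)}, timestamp=2)
# db["sideplate"] now holds the newer pose observation.
```

Stale snapshots arriving out of order are simply ignored, so both KAMRO and KANTRA always read the most recent knowledge about each object.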