Interactive Natural Programming of Robots: Introductory Overview

R. Dillmann, R. Zöllner, M. Ehrenmann, O. Rogalla
Universität Karlsruhe (TH), Institute of Computer Design and Fault Tolerance
P.O. Box 6980, D-76128 Karlsruhe, Germany
[email protected]

Abstract

Natural, multimodal interaction is a key issue for humanoid robots, since a human-like appearance suggests human-like properties and behavior. Although human communication in all its facets seems hardly reachable today, interaction via verbal input and output or via gestures is a big step towards meeting this requirement. Besides these channels, recognition of the interaction context is important because, for example, human references are often incomplete and sloppy. From human behavior, several context types like advising, commenting and teaching can be derived. In this article, we present a system that supports context-dependent interaction with robots in these contexts. Interaction may be established via verbal and gestural channels. Furthermore, an interactive teaching technique called "programming by demonstration" is presented for the natural and intuitive transfer of human skills and behavior to human-like robots. Since human manipulation skills are very well developed but not easy to describe with standard programming methods, this technique is expected to ease robot programming significantly.

1 Introduction

After the evolution of service robots, the development of humanoid robots is a current trend in robotics. However, there is a big gap between humanoid and service robots: users expect much more in terms of communication, behavior and cognitive skills from humanoids. Aiming to facilitate man-machine interaction in a direct and intuitive way, auditory and gesture-based interfaces have been an important research topic in recent years. Addressing real-life problems, as in household or workshop environments, raises new questions of handling gesture and speech input and of mapping reactions to the given actuators. Verbal utterances and gestures, like human communication in general, rely highly on their specific context. Therefore, new interaction methods may not be limited to single information channels but require cognitive processing and fusion of several modalities, depending on internal and external robot assistant states and based on extensive environment modelling.

Figure 1: Assistant ALBERT at the Hannover Fair.

However, one of the main purposes for building robots is their ability to manipulate objects in their environment and to take over tedious manual tasks. Here, a user should be able to easily teach a complex task and to have the robot accomplish it at will. Regarding household and workshop environments in particular, the following facts should also be kept in mind:

• Households and workshops are structured environments, but they differ from one another and keep changing steadily.

• Each household is tailored to an individual's needs and taste, so in each household a different set of facilities, furniture and tools is used.

Ready-made programs will not be able to cope with such environments, so flexibility will surely be the most important requirement for the robot's design. It must be able to navigate through a changing environment, to adapt its recognition abilities to a particular scenery and to manipulate a wide range of objects. Furthermore, it is extremely important for the user to easily adapt the system to his needs, that is, to teach the system what to do and how to do it. We think that the only approach that satisfies the latter condition is programming by demonstration. Other programming techniques like textual coding or teach-in will certainly not be acceptable for a consumer market.

Among household tasks, laying a table is a good example to start with. It is a complex task consisting of several sub-steps that may each be solved subsequently. Several different objects, designed for human use and possibly difficult to handle for a robot, have to be manipulated. Besides, from an operational point of view, it is a task that is similar to many other everyday chores like loading a dishwasher, shopping and unpacking goods, or tidying a room.

For our experiments, the service robot ALBERT has been set up (cf. figure 1). It consists of a mobile platform equipped with a SICK laser scanner for navigation. The robot's manipulator is a 7-DOF humanoid arm. The end effector is a three-finger Barrett hand which is attached to a DLR force/torque sensor. The head integrates stereo color vision. Additionally, speech input is captured with a headset microphone.

The following article presents processing techniques for interacting with ALBERT. After an overview of related approaches in section 2, the basis for communication with a user is explained. Both gestures and speech input are mapped to events which are processed according to internal and external states. The architecture for event processing is presented in section 3. Afterwards, the handling of visual and speech input is outlined in sections 4 and 5. Since teaching a robot complex tasks is a hard problem, interactive programming is explained in more detail in section 6. There, the observation of the user's actions is outlined in subsection 6.1. It serves as the basis for an abstract task representation which is used to generate a robot program depending on the robot's kinematic structure and gripper type. Subsection 6.4 presents the mapping strategies to do so. Experimental results for programming a table-laying task can be found in section 7.

2 Related Works

Surveying the development of service and humanoid robots over the last years, it turns out that there are several research issues like designing and controlling humanoids, basic interaction methods, and programming techniques. Despite very promising results in each of these disciplines, no single system has yet been presented that integrates all three topics.

2.1 Humanoids

Regarding humanoid projects, one comes to think of the successes of several famous Japanese projects like those at Waseda University, Honda or Sony [22]. However, all these projects mainly address mechatronic aspects of humanoid robots, e.g. coordinated movements or stable walking. When parts of humanoid systems are under development, as in the case of Cog, KISMET or ASKA [11, 30], they mostly serve as platforms for research on single abilities.

2.2 Advising Interaction

This section discusses the state of the art in the fields of user observation and tracking, gesture recognition, and interactive systems and event management.

User Observation and Tracking: Image-based tracking of the human hand is considered a specific and very demanding problem. The relatively large number of degrees of freedom and the complex kinematic structure entail not only the possibility of shape and brightness changes but also of occlusions [27, 2, 15].

Gesture Recognition: "Gesture" is the commonly used term for a symbol used to command, instruct or converse, expressed through hand postures or hand movements. Preferred sensors for the detection of gestures are image-based systems or data gloves. An overview of several gesture recognition systems has been compiled by Kohler [18]. Here, only vision-based systems are taken into account. Hand signs can be divided into static and dynamic gestures, depending on whether the significant part is a finger posture or a hand movement.

Interactive Systems and Event Management: An example of an interactive robot system is the robot assistant CORA. It observes deictic gestures

and obeys speech commands. The robot's behaviour is modeled with a dynamic systems approach using Amari neural networks [28]. As with CORA, many interactive robot systems make use of both speech input and gesture and/or object recognition, e.g. in [24] to give instructions to the robot more naturally, or to clarify ambiguities in the input [29]. [20] lay particular emphasis on the fact that both the (verbal) human-machine interface and the intelligent system of their robot KANTRA have to use similar environment models to allow for references to objects in the robot's environment. A specific reference problem, namely the resolution of pronoun references, is addressed in [14]. In their system, the relative salience of discourse referents is maintained in an attentional state to allow for the processing of underspecified input. The specific problems of speech recognition in the context of interaction with robot assistants are discussed in [3]. In particular, they mention replacement of words, unknown words, omission of words, semantic inconsistencies, verbal ambiguities, situational ambiguities and verbal interventions of the user as core problems of this domain.

2.3 Interactive Programming

Several programming systems and approaches based on human demonstrations have been proposed during the past years. Many of them address only special problems or a special subset of objects. An overview and classification of the approaches can be found in [6]. Often, the analysis of a demonstration is based on observing the changes in the scene caused by the user. These changes can be described using relational expressions or contact relations [19, 17, 23]. For generalising from a single demonstration, mainly explanation-based methods are used [21, 13]. They allow for an adequate generalisation from only one example (one-shot learning). Approaches based on one-shot learning are the only ones feasible for end users, since giving many similar examples is a tedious task. Besides the derivation of action sequences from a user demonstration, direct cooperation with users has been investigated. Here, user and robot reside in a common work cell. The user may direct the robot on the basis of a common vocabulary like speech or gestures [33, 32, 28], but this allows only for rudimentary teaching procedures. A general approach to programming by demonstration, together with the processing phases that have to be realized when setting up such a system, is outlined in [8]. A discussion of sensor employment and sensor fusion for the observation of actions can be found in [25, 34, 9], where basic concepts for the classification of dynamic grasps are presented as well.

3 Event Management

As can be seen from section 2, fusing multimodal input, especially input from gesture and object recognition with speech input, is a major topic in current research on robot instruction. In our system, so-called "events" are the key concept for representing input. An event is, on a rather abstract level of representation, the description of something that happened. For example, the input values from the data glove are transformed into "gesture events" or "grasp events". Thus, an event is a conceptual description of input or of inner states of the robot on a level which the user can easily understand and cope with – in a way, our robot ALBERT speaks "the same language" as a normal user. As events can be of very different types (e.g. gesture events, speech events), they have to be processed with regard to the specific information they carry.

Figure 2 shows the overall control structure transforming events in the environment into actions performed by robot devices. Action devices are called "agents" and represent semantic groups of action-performing devices (head, arms, platform, speech, ...). Each agent is capable of generating events. All events are stored in the "Eventplug". A set of event transformers called "automatons", each of which can transform a certain event or group of events into an "action", is responsible for selecting actions. Actions are then carried out by the corresponding part of the system, e.g. the platform or the speech output system. A very simple example would be a speech event generated from the user input "Hello Albert", which the responsible event transformer would transform into the robot action speechoutput("Hello!"). In principle, the event transformers are capable of fusing two or more events that belong together and of generating the appropriate actions, e.g. a pointing gesture with accompanying speech input like "take this". The appropriate robot action would be to take the object at which the user is pointing. The order of the event transformers is variable, and transformers can be added or deleted as suitable, thus representing a form of focus: the transformer highest in order tries to transform incoming events first, then (if the first one does not succeed) the transformer second in order, and so on. The priority module encapsulates the mechanism of reordering event transformers (automatons).
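To make this event-processing loop concrete, the following sketch shows one way such a prioritized chain of event transformers could look. It is an illustrative Python sketch, not the authors' implementation; the class and event names (Event, Automaton, GreetingAutomaton, PointAndTakeAutomaton, etc.) are assumptions chosen for readability.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Event:
    """Abstract description of something that happened (gesture, speech, ...)."""
    kind: str                      # e.g. "gesture", "speech"
    payload: dict = field(default_factory=dict)

@dataclass
class Action:
    """Command for an agent (platform, arm, speech output, ...)."""
    agent: str
    command: str
    args: dict = field(default_factory=dict)

class Automaton:
    """Event transformer: turns one event (or a fused group) into an action."""
    def transform(self, event: Event) -> Optional[Action]:
        raise NotImplementedError

class GreetingAutomaton(Automaton):
    def transform(self, event):
        if event.kind == "speech" and "hello" in event.payload.get("text", "").lower():
            return Action("speech", "speechoutput", {"text": "Hello!"})
        return None

class PointAndTakeAutomaton(Automaton):
    """Fuses a deictic gesture with an accompanying 'take this' utterance."""
    def __init__(self):
        self.last_pointing = None
    def transform(self, event):
        if event.kind == "gesture" and event.payload.get("type") == "pointing":
            self.last_pointing = event.payload.get("target")
            return None                       # wait for the verbal part
        if (event.kind == "speech" and "take this" in event.payload.get("text", "")
                and self.last_pointing is not None):
            return Action("arm", "grasp", {"object": self.last_pointing})
        return None

class EventManager:
    """Priority module: the first automaton in the list gets the event first."""
    def __init__(self, automatons):
        self.automatons = list(automatons)    # order encodes the current focus
    def dispatch(self, event: Event) -> Optional[Action]:
        for automaton in self.automatons:
            action = automaton.transform(event)
            if action is not None:
                return action
        return None                           # unhandled; could trigger a query to the user

manager = EventManager([PointAndTakeAutomaton(), GreetingAutomaton()])
manager.dispatch(Event("gesture", {"type": "pointing", "target": "cup"}))
print(manager.dispatch(Event("speech", {"text": "take this"})))
```

In this sketch, reordering the list held by EventManager corresponds to shifting the system's focus, exactly as described for the priority module above.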

Figure 2: Event manager architecture (agents such as head, arm, platform and speech output feed events into the Eventplug; a prioritized set of automatons transforms them into actions, supervised by the scheduler/priority manager, security/exception handling and the environment model).

In each case, one or more actions can be started, as appropriate for a certain event. In addition, it is easily possible to generate questions concerning the input: an event transformer designed specifically for queries to the user can handle events which are ambiguous or underspecified, start an action which asks an appropriate question to the user, and then use the additional information to transform the event. For controlling hardware resources, a scheduling system that locks and unlocks execution of the action performers is applied.

4 Visual Input

This section describes the observation of hand movements and gesture performance as well as the recognition of objects in our system ALBERT. Both components transmit perceived items (gestures and objects) to the event transformer mechanism.

4.1 Hand Tracking

Skin colour is used as the basic feature for hand tracking. For recognizing gestures, it is necessary to focus on the user's hand. Therefore, the robot's camera head follows hand movements. The basis for the skin color segmentation is the hue angle of the HLS representation. Initially, a value between 0 and 20 is accepted, setting mean and variance to µ_start := 10, σ_start := 10 in the binarization step of the camera image p:

b(x, y) := 1 if µ − k·σ < p(x, y) < µ + k·σ, 0 else    (1)

In the segmented image b, the user's head and hands can be detected heuristically: the head's position is usually above the hands. Then, the user's right hand is tracked using local windows (see figure 3). In order to increase robustness against colour shifts, the thresholds are adapted according to the color characteristics of the last hand region within the local window:

σ_t = (1 − α) · σ_{t−1} + α · σ_start    (2)
µ_t = (1 − α) · µ_{t−1} + α · µ_start    (3)

Figure 3: Hand Identification.

The adaptation rate α is a factor which controls the increment depending on the situation. After segmenting the skin color regions, the image b is filtered using close and open filters. The resulting regions consisting of less than a certain number of pixels are eliminated, and the remaining blobs serve for adjusting the camera head position. In order to classify hand postures by their silhouette, the contours of these remaining blobs are extracted (figure 4 shows all these steps).

Figure 4: Gesture Preprocessing.
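The binarization and threshold adaptation of equations (1)–(3) can be sketched as follows. This is an illustrative NumPy sketch under stated assumptions (the hue image is already extracted; k and α are free parameters), not the exact implementation used on ALBERT.

```python
import numpy as np

def binarize_hue(p, mu, sigma, k=1.0):
    """Eq. (1): skin mask from the hue image p given mean mu and std sigma."""
    return ((p > mu - k * sigma) & (p < mu + k * sigma)).astype(np.uint8)

def adapt_thresholds(hand_hues, mu_prev, sigma_prev,
                     mu_start=10.0, sigma_start=10.0, alpha=0.1):
    """Eqs. (2)-(3): blend the previous statistics with the start values.

    hand_hues: hue values inside the last detected hand window; in this sketch
    they are only used to decide whether an update makes sense at all.
    """
    if hand_hues.size == 0:          # hand lost: fall back to the start values
        return mu_start, sigma_start
    sigma_t = (1.0 - alpha) * sigma_prev + alpha * sigma_start
    mu_t = (1.0 - alpha) * mu_prev + alpha * mu_start
    return mu_t, sigma_t

# toy usage: a synthetic 4x4 hue image with a "skin" patch around hue 12
p = np.array([[12, 13, 50, 60],
              [11, 12, 55, 58],
              [40, 45, 90, 95],
              [42, 44, 92, 99]], dtype=float)
mu, sigma = 10.0, 10.0
mask = binarize_hue(p, mu, sigma)
mu, sigma = adapt_thresholds(p[mask == 1], mu, sigma)
print(mask, mu, sigma)
```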

4.2 Gesture Classification

Since hand contours (s_x, s_y) differ in length and depend on rotation, translation and scale, a Fourier description of the silhouette is computed, taking 32 samples at equidistant points along the outline (fewer points undersample the contour, see figure 5). This number turned out to be sufficient for the classification and has not shown inferior results compared to 64 samples.

Figure 5: Contour Sampling with 8, 16 and 32 Points.

Before applying the Fast Fourier Transform, the contour is normalized with respect to position and size. For this purpose, the center of gravity (cg_x, cg_y) of the outline is computed and used as the origin of the contour representation. Afterwards, all values are divided by max_k |s_x(k) − cg_x| and max_k |s_y(k) − cg_y| respectively and can then be assumed to lie in [0, 1]. The coordinates of the sample points s(k) are treated as complex numbers, {s(k)} = {x(k) + j · y(k)}, and are mapped to a sequence of the same length {S(n)} in Fourier space. Figure 6 (left) shows an example of five different hands performing a victory-symbol gesture that were transformed into a Fourier point sequence. These points still vary depending on the start point of the contour and on the hand rotation. Both of these characteristics affect only the point angle in Fourier space. Thus, a representation is used for classification that describes the absolute values of the single elements:

{D(n)} = {|S(n)|} = { sqrt(S_x²(n) + S_y²(n)) }    (4)

Figure 6: Descriptor for Hand Contour.

The right part of figure 6 shows an example of this description of a hand posture: the application of formula (4) to the five hands on the left. As can be seen in this figure, the descriptors are very similar for postures of the same type. Having a set of models D_M(n), a performed hand posture can be compared to them using the Euclidean distance ∆_M:

∆_M = Σ_{n=0}^{N−1} (D_M(n) − D(n))²    (5)

Reference gestures that serve as models for the gesture types were collected from examples performed by five persons and then averaged. A gesture is then classified by comparing ∆_M against a certain threshold θ.

Six reference gestures have been chosen to test the above classification approach. They are listed in figure 7. The first gesture type may be used as a yes/no answer in dialogue situations, while type two is supposed to stop actions in progress. Type three serves for deictic references and the fourth model is intended to detect grasp situations. The other types can be used to trigger arbitrary programs.

Figure 7: Reference Gestures 1 to 6.

The classification reaches an overall performance of 95.9%. Twelve persons contributed six variants of each gesture type for the reference model and 30 variants for the validation, giving 360 gestures to test. The single results are displayed in table 1 (where "A-Error" means that a performed gesture has been classified with a wrong type and "B-Error" shows the rate of detected gestures that were not performed). The comparatively low results for gesture type 4 stem from large variances in the gesture models: each person showed the grasp hand configuration with a different angle between the thumb and the other fingers.

             Gesture Type
            1      2      3      4      5      6      ∅
  Test     100     96    96.4   83.3   100    100    95.7
  A-Error    0      4     3.6    8.3     0      0     0
  B-Error    0      0     0      8.3     0      0     4.3

Table 1: Results of the Gesture Classification in [%]
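As an illustration of the descriptor of equation (4) and the matching of equation (5), the following NumPy sketch computes the magnitude descriptor for a resampled contour and matches it against stored models. It is a sketch under the assumptions stated in the comments (contours already resampled to 32 points, acceptance below a threshold), not the original implementation.

```python
import numpy as np

def contour_descriptor(contour, n_samples=32):
    """Magnitude of the FFT of the normalized contour, cf. eq. (4).

    contour: (N, 2) array of (x, y) outline points, already resampled
    to n_samples equidistant points along the silhouette.
    """
    pts = np.asarray(contour, dtype=float)
    cg = pts.mean(axis=0)                          # center of gravity (cg_x, cg_y)
    pts = pts - cg                                 # translation invariance
    pts /= np.abs(pts).max(axis=0)                 # scale normalization per axis
    z = pts[:, 0] + 1j * pts[:, 1]                 # s(k) = x(k) + j*y(k)
    S = np.fft.fft(z, n=n_samples)
    return np.abs(S)                               # D(n) = |S(n)|

def classify(descriptor, models, theta):
    """Pick the model with the smallest squared distance, cf. eq. (5)."""
    dists = {name: np.sum((D_M - descriptor) ** 2) for name, D_M in models.items()}
    best = min(dists, key=dists.get)
    return best if dists[best] <= theta else None  # reject implausible matches

# toy usage with a synthetic circular "contour"
t = np.linspace(0, 2 * np.pi, 32, endpoint=False)
circle = np.stack([np.cos(t), np.sin(t)], axis=1)
models = {"circle": contour_descriptor(circle)}
print(classify(contour_descriptor(circle * 1.5 + 3.0), models, theta=1.0))
```

Because the descriptor discards the Fourier phase, the scaled and shifted test contour in the usage example maps to the same model as the reference contour.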

4.3 Object Detection

Objects lying on a table in front of the robot are detected by means of a color segmentation. The color depth of the camera images is reduced with a mean shift analysis as proposed by Comaniciu [4]. The most frequent resulting colors serve for masking out the image background. The remaining blobs can then be compared to stored characteristics of previously segmented objects. The classification employs color histograms and simple geometric characteristics like the width/height ratio and compactness. Figure 8 gives an impression of the scene and a classification result. This approach has been chosen due to its run-time speed and the ease of learning new objects. Computing speed is one of the key conditions in interaction situations: users would not accept several seconds of processing time for simple recognition tasks.

Figure 8: Table Scene and Object Classification.

Object recognition runs in less than a second. New objects can be learned by presenting them singly on the table. The recognition rate, however, is high only for rather small sets of objects (see table 2). This will be improved by using more characteristics.

             Number of Learned Objects
              5      8      9     12
  Test      96.6   85.4   66.6   58.3
  A-Error    3.4   14.6   33.3   41.7
  B-Error    0.0    3.6    8.2   18.0

Table 2: Results of the Object Classification in [%]

5 Verbal Input

For processing the user's verbal input, the dictation engine of the speech recognition system ViaVoice is used for the German language. Since command sentences may depend on the context and on the user's preferences, the speech input cannot be used directly for generating an unambiguous reaction. Therefore, we use a parser with a simple grammar optimized for command sentences. The output of the parser consists of the verb and, if available, the subject, accusative object, dative object and prepositional phrase (e.g. "take the cup from the table which is in front of you"). Each parsed phrase is transformed into a speech event (see section 3) that contains all the parsing results. For robustness, it was necessary to limit the vocabulary to a domain-specific subset of natural language. Since the robot assistant will work in certain well defined domains, it is feasible to provide several specific language repositories and switch between them if necessary.
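To illustrate how a parsed command sentence could be packaged into a speech event for the event manager of section 3, here is a minimal sketch; the frame slots mirror the parser output described above, while the helper names and the toy pattern matching are assumptions, not the ViaVoice or parser interface actually used.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SpeechEvent:
    """Frame filled from the command-sentence parser (cf. section 5)."""
    verb: str
    subject: Optional[str] = None
    accusative_object: Optional[str] = None
    dative_object: Optional[str] = None
    prepositional_phrase: Optional[str] = None

def parse_command(sentence: str) -> SpeechEvent:
    """Toy pattern matcher for sentences like 'take the cup from the table'."""
    words = sentence.lower().split()
    verb = words[0]
    acc, pp = None, None
    if "the" in words:                          # first noun phrase -> accusative object
        acc = words[words.index("the") + 1]
    for prep in ("from", "on", "to", "in"):     # first preposition starts the PP
        if prep in words:
            pp = " ".join(words[words.index(prep):])
            break
    return SpeechEvent(verb=verb, accusative_object=acc, prepositional_phrase=pp)

print(parse_command("take the cup from the table which is in front of you"))
```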

6 Interactive Programming

For programming service tasks via demonstration, extensive sensor use has to be made, both for watching human movements and for tracking the environment. Especially for learning manipulative tasks, capturing the hand motion and its pose is crucial. For learning "pick & place" or "fetch & carry" operations, the tracking of the manipulated objects in the household environment is of great importance and helps to meet the high demands on robustness and flexibility of this programming method.

6.1 Recognition of Ongoing Action Sequences

Figure 9: Active Camera Head and Tactile Data Glove.

For observing a human demonstration, the following sensors are used: a data glove, a magnetic-field-based tracking system for the respective hand, force sensors mounted on the finger tips, and an active trinocular camera head (see figure 9). Object recognition is done with fast view-based computer vision approaches described in [7]. From the data glove, the system extracts and smoothes the finger joint movements and the hand position in 3D space. To reduce noise in the trajectory information, the user's hand is additionally observed by the camera system. Both measurements are fused using confidence factors, see [9]. This information is stored with discrete time stamps in the world model database.

In order to understand the user's intention, the performed demonstration has to be segmented and analyzed. Therefore, in a first step the sensor data is preprocessed to extract reliable measurements and key points, which are used in a second step for segmenting the demonstration.
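The fusion of glove-based and camera-based hand position measurements via confidence factors, as mentioned above, can be pictured with the following sketch; the weighting scheme shown here is only an assumed, simplified variant of what is described in [9].

```python
import numpy as np

def fuse_positions(p_glove, c_glove, p_camera, c_camera):
    """Confidence-weighted fusion of two 3D hand position estimates.

    p_*: 3-vectors (metres), c_*: scalar confidence factors in [0, 1].
    Returns the fused position and the combined confidence.
    """
    w = np.array([c_glove, c_camera], dtype=float)
    if w.sum() == 0.0:                      # neither sensor is trusted
        return None, 0.0
    w /= w.sum()
    fused = w[0] * np.asarray(p_glove) + w[1] * np.asarray(p_camera)
    return fused, float(max(c_glove, c_camera))

# toy usage: the camera is occluded, so its confidence is low
p, c = fuse_positions([0.40, 0.10, 0.90], 0.9, [0.55, 0.12, 0.88], 0.2)
print(p, c)
```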

The segmentation of a recorded demonstration is performed in two steps:

1. Trajectory segmentation: This step segments the trajectory of the hand during the manipulation task. The segmentation is done by detecting grasp actions; therefore, the time of contact between hand and object has to be determined. This is done by analyzing the force values with a threshold-based algorithm (a simplified sketch of this step follows below). In order to improve the reliability of the system, the results are fused with those of a second algorithm based on the analysis of the trajectories of finger poses, velocities and accelerations with respect to their minima. Figure 10 shows the trajectories of the force values, finger joints and velocity values of three Pick&Place actions.

2. Grasp segmentation: For detecting fine manipulation, the actions during a grasp have to be segmented and analyzed. The upper part of figure 10 shows that the shape of the force graph features a rather constant plateau. This is plausible because no external forces are applied to the object. But if the grasped object collides with the environment, the force profile changes: the result are high peaks, i.e. both amplitude and frequency oscillate (see the lower part of figure 10). Looking at the force values during a grasp, three distinct profile classes can be derived: one for static grasps, one for appearing external forces and a third one for dynamic grasps. For more details refer to [34].

Figure 10: Analyzing segments of a demonstration: force values and finger joint velocity (segments labelled static grasp, external force and dynamic grasp).
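A minimal sketch of the threshold-based trajectory segmentation described above might look as follows; the window sizes, thresholds and the simple contact criterion are assumptions for illustration, not the parameters used in the actual system.

```python
import numpy as np

def segment_grasps(forces, force_threshold=0.5, min_len=5):
    """Split a demonstration into grasp segments from finger tip force values.

    forces: 1D array of summed finger tip forces per time stamp.
    Returns a list of (start, end) index pairs where the hand holds an object.
    """
    in_contact = forces > force_threshold       # threshold-based contact detection
    segments, start = [], None
    for t, c in enumerate(in_contact):
        if c and start is None:
            start = t                           # contact begins: grasp starts
        elif not c and start is not None:
            if t - start >= min_len:            # suppress spurious short contacts
                segments.append((start, t))
            start = None
    if start is not None and len(forces) - start >= min_len:
        segments.append((start, len(forces)))
    return segments

# toy force profile: free motion, a grasp plateau, free motion again
forces = np.concatenate([np.zeros(10), np.full(20, 1.2), np.zeros(10)])
print(segment_grasps(forces))                   # -> [(10, 30)]
```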

6.2 Grasp Classification

For mapping manipulative tasks, the detection and representation of grasps has to take the diversity of human object handling into account. In order to preserve as much information as possible from the demonstration, two classes of grasps are distinguished: static grasps, which are used for Pick&Place tasks, and dynamic grasps, which are required for fine manipulative handling of objects.

Figure 11: Cutkosky Grasp Taxonomy (16 grasp types, ordered from power grasps emphasising security and stability towards precision grasps emphasising dexterity and sensitivity).

6.2.1 Static Grasp Classification

Static grasps are classified according to the Cutkosky hierarchy. This hierarchy subdivides static grasps into 16 different types, mainly with respect to their posture (see figure 11). Thus, a classification approach based on the finger angles measured by the data glove can be used. All these values are fed into a hierarchy of neural networks, each consisting of a three-layer RBF network. The hierarchy resembles the Cutkosky hierarchy, passing the finger values from node to node like in a decision tree. Thus, each network has to distinguish only very few classes and can be trained on a subset of the whole set of learning examples. Classification results for all grasp types are listed in table 3. A training set of 1280 grasps recorded from 10 persons and a test set of 640 examples were used. The average classification rate lies at 83.28%, where most confusions happen between grasps of similar posture (e.g. types 10 and 12). In the application, the classification reliability is furthermore enhanced by heuristically considering the hand movement speed (which is very low when grasping an object) and the distance to a possibly grasped object in the world model (which has to be very small for grasp recognition).

  Grasp type   1     2     3     4     5     6     7     8
  Rate        0.95  0.90  0.90  0.86  0.98  0.75  0.61  0.90

  Grasp type   9    10    11    12    13    14    15    16
  Rate        0.88  0.95  0.85  0.53  0.35  0.88  1.00  0.45

Table 3: Classification rates gained on test data
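The decision-tree-like cascade of small RBF networks described above can be sketched as follows; the two-level hierarchy, the node names and the tiny prototype-based RBF stub in this example are assumptions for illustration, not the actual network topology of the system.

```python
import numpy as np

class RBFNode:
    """Tiny RBF-style classifier: one Gaussian prototype per child branch."""
    def __init__(self, prototypes, sigma=0.5):
        self.labels = list(prototypes.keys())                 # child branch names
        self.centers = np.array([prototypes[l] for l in self.labels])
        self.sigma = sigma
    def decide(self, finger_angles):
        d = np.linalg.norm(self.centers - finger_angles, axis=1)
        activations = np.exp(-(d ** 2) / (2 * self.sigma ** 2))
        return self.labels[int(np.argmax(activations))]

def classify_static_grasp(finger_angles, tree):
    """Walk the hierarchy like a decision tree until a leaf (grasp type) is hit."""
    node = tree["root"]
    while True:
        branch = node.decide(finger_angles)
        if branch not in tree:          # leaf: a Cutkosky grasp type number
            return branch
        node = tree[branch]

# toy two-level hierarchy over 3 joint angles (the real glove provides 20 values)
tree = {
    "root":      RBFNode({"power": [1.0, 1.0, 1.0], "precision": [0.2, 0.2, 0.2]}),
    "power":     RBFNode({"type 3": [1.0, 1.0, 0.9], "type 5": [0.9, 1.2, 1.1]}),
    "precision": RBFNode({"type 9": [0.2, 0.3, 0.1], "type 14": [0.1, 0.1, 0.4]}),
}
print(classify_static_grasp(np.array([0.95, 1.05, 1.0]), tree))   # -> a power grasp type
```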

6.2.2 Dynamic Grasp Classification

For describing various household activities like opening a twist-off cap or screwing a bolt into a nut, dynamic grasps need to be detected and represented. By dynamic grasps we denote operations like screwing, inserting etc., which all have in common that the finger joints change during the manipulation of an object (i.e. the force sensors provide non-zero values). For classifying dynamic grasps, we choose the movement of the grasped object as the distinction criterion. This allows an intuitive description of the user's intention when performing fine manipulative tasks. For describing the grasped object movements, they are transformed into hand coordinates. Figure 12 shows the axes of the hand, according to the taxonomy of Elliot and Connolly [10].

Figure 12: Principal axes of the human hand.

Figure 13 depicts the distinguished dynamic grasps. The classification is done separately for rotation and translation of the grasped object. Furthermore, the number of fingers involved in the manipulation (i.e. from 2 to 5) is considered in order to distinguish between precision and power grasps. There are several precision tasks which are performed only with thumb and index finger, like opening a bottle, which are classified as a rotation around the x-axis. Other grasps require higher forces and involve all fingers in the manipulation, e.g. Full Roll for screwing actions. The presented classification contains most of the common fine manipulations if we assume that three- and four-finger manipulations are included in the five-finger classes. For example, a Rock Full dynamic grasp can be performed with three, four or five fingers.

Support Vector Machine Classifier: The classification of dynamic grasps is done by a time-delay support vector machine (SVM) [31]. For training the SVM, Gaussian kernel functions, an algorithm based on SVMlight [16] and the one-against-one strategy have been used. Twenty-six classes corresponding to the elementary dynamic grasps presented in figure 13 were trained. Because a dynamic grasp is defined by a progression of joint values, a time-delay approach was chosen. Consequently, the input vector of the SVM classifier comprises 50 joint configurations of 20 joint values each. The training data set contained 2600 input vectors. Due to the fact that SVMs learn from significantly less data than neural networks, the obtained results are reliable. Figure 13 shows the results of the dynamic grasp classifier (DGC).

Figure 13: Hierarchy of Dynamic Grasps containing the Results of the DGC (rotations and translations along the hand axes; classes include Roll Index, Roll Thumb, Roll Full, Roll Palm, Rock Index, Rock Thumb, Rock Full, Rock Radial, Shift Index, Shift Palm, Shift Full, Pinch Index and Pinch Full, with recognition rates between 88% and 98%).

Since the figure shows only the right resp. forward direction, the displayed percentages represent the average over the two directions; the maximum variance between these two directions is about 2%. It is remarkable that the SVM needs only 486 support vectors (SV) for generalizing over 2600 vectors (i.e. 18.7% of the data set). A smaller number of SVs improves not only the generalization behavior but also the runtime of the resulting algorithm during application.
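To make the time-delay input encoding concrete, the following sketch builds the flattened 50 × 20 joint-value window described above and trains a one-against-one SVM with a Gaussian kernel. It uses scikit-learn purely as an illustrative stand-in for the SVMlight-based training actually used, and the two class labels are made up for the toy data.

```python
import numpy as np
from sklearn.svm import SVC   # illustrative stand-in for the SVMlight-based trainer

WINDOW, N_JOINTS = 50, 20     # 50 joint configurations of 20 joint values each

def time_delay_vector(joint_sequence):
    """Flatten the last WINDOW joint configurations into one SVM input vector."""
    window = np.asarray(joint_sequence)[-WINDOW:]
    return window.reshape(-1)                 # shape: (WINDOW * N_JOINTS,)

# toy training data: random joint-angle windows for two dynamic grasp classes
rng = np.random.default_rng(0)
X = np.stack([time_delay_vector(rng.normal(loc=c, size=(WINDOW, N_JOINTS)))
              for c in (0.0, 1.0) for _ in range(20)])
y = np.array(["roll_index"] * 20 + ["shift_palm"] * 20)

clf = SVC(kernel="rbf", gamma="scale")        # Gaussian kernel, one-against-one by default
clf.fit(X, y)
print(clf.predict([time_delay_vector(rng.normal(loc=1.0, size=(WINDOW, N_JOINTS)))]))
print("support vectors:", clf.support_vectors_.shape[0])
```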

6.3 Task Representation

The result of the segmentation step is a sequence of elementary actions like moves and static and dynamic grasps. This symbol sequence is then chunked into semantically related groups, for example approach, grasp and ungrasp phases. The result is stored in a STRIPS-like [12] structure of operations ordered in a tree, called a macro operator. For instance, a macro operator for a simple pick-and-place operation has a tree-like structure starting the whole operation at the root. Figure 14 shows the knowledge representation of the pick&place operations of a table-laying task. The first branches lead to a symbol for a pick and a place. They are subdivided into approach, grasp/ungrasp and disapproach phases. Finally, the process ends with the elementary operations at the leaves. Chunking into semantic groups has two advantages: first, conditions for the applicability of subparts of the knowledge tree can be tested very easily; second, communication with the user becomes easier for complex tasks, since humans prefer to think in semantic groups rather than in operator sequences.

Figure 14: Tree-like macro operator knowledge representation.
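The tree-like macro operator of figure 14 can be pictured as a nested structure of semantic groups over elementary operations. The node and operation names in the following sketch are illustrative assumptions, not the actual STRIPS-style encoding used in the system.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class MacroNode:
    """Node of the macro operator tree: semantic group or elementary operation."""
    name: str
    preconditions: List[str] = field(default_factory=list)
    effects: List[str] = field(default_factory=list)
    children: List["MacroNode"] = field(default_factory=list)

def leaves(node: MacroNode) -> List[str]:
    """Flatten the tree back into the elementary operator sequence."""
    if not node.children:
        return [node.name]
    return [op for child in node.children for op in leaves(child)]

# pick&place fragment of a table-laying macro operator
pick = MacroNode("pick(cup)", preconditions=["clear(cup)"], effects=["holding(cup)"],
                 children=[MacroNode("approach(cup)"), MacroNode("grasp(cup)"),
                           MacroNode("disapproach(cup)")])
place = MacroNode("place(cup, table)", preconditions=["holding(cup)"],
                  effects=["on(cup, table)"],
                  children=[MacroNode("approach(table)"), MacroNode("ungrasp(cup)"),
                            MacroNode("disapproach(table)")])
task = MacroNode("lay_table", children=[pick, place])
print(leaves(task))
```

Testing the preconditions of a subtree before executing it corresponds to the easy applicability check mentioned above, while the group names are the vocabulary used when talking to the user.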

6.4 Mapping Tasks on Robot Manipulation Systems

Up to now, the problem solving information stored in the macro operators is not directly usable for the target system. Grasps are represented in terms of human hand configurations, and trajectories are not optimal for robot kinematics. Besides, sensor control information for the robot, e.g. force control parameters, cannot be extracted from the demonstration directly. Since we are dealing with pick-and-place operations, we have developed a method for automatically mapping grasps to robot grippers. Furthermore, the trajectories stored in the macro are optimized for the execution environment and the target system. The system uses generic robot and gripper models described in [26].

The mapping engine accepts the 16 different power and precision grasps of Cutkosky's grasp hierarchy [5]. As input, the symbolic operator description and the positions of the user's fingers and palm during the grasp are used. We calculate an optimal group of coupled fingers, i.e. fingers having the same grasp direction. These finger groups are associated with the chosen gripper type, which helps to orient the finger positions for an adequate grasp pose. Since the symbolic representation of the performed grasp includes information about the enclosed fingers, only those fingers are used for finding the correct pose. Assume that the fingers used for the grasp are numbered H_1, H_2, ..., H_n with n ≤ 5 (the number of enclosed fingers can of course be less than 5).

For coupling fingers with the same grasping direction, it is necessary to calculate the forces affecting the object. In the case of precision grasps, the finger tips are projected onto a grasping plane E defined by the finger tip positions during the grasp (figure 15). Since the plane might be overdetermined if more than three fingers are used, it is determined using the least squares method.

Figure 15: Calculation of a plane E and the gravity point G defined by the finger tips' positions.

Now the force vectors can be calculated using the geometric formation of the fingers given by the grasp type and the force values measured by the force sensors on the finger tips. Prismatic and sphere grasps are distinguished. For prismatic grasps, all finger forces are assumed to act against the thumb. This leads to a simple calculation of the forces F_1, ..., F_n, n ≤ 5, according to figure 16. For circular grasps, the forces are assumed to act towards the finger tips' center of gravity G (figure 17).

Figure 16: Calculation of forces for a prismatic 5-finger grasp.

Figure 17: Calculation of forces for a sphere precision grasp.

To evaluate the coupling of fingers, the degree of force coupling is defined:

D_c(i, j) = (F_i · F_j) / (|F_i| |F_j|)    (6)

This means |D_c(i, j)| ≤ 1, with D_c(i, j) = 0 for F_i ⊥ F_j. So the degree of force coupling is high for forces acting in the same direction and low for forces acting in orthogonal directions. The degree of force coupling is used for finding two or three optimal groups of fingers using Arbib's method [1]. When the optimal groups of fingers are found, the robot fingers are assigned to these finger groups. This process is rather easy for a three-finger hand, since three groups of fingers can be assigned directly to the robot fingers. In case there are only two finger groups, two arbitrary robot fingers are selected to be treated as one finger.

The grasping position depends on the grasp type as well as on the grasp pose. Since no knowledge about the object is present, the grasp position must be obtained directly from the performed grasp. For the system, two strategies have been used:

• Precision grasps: The center of gravity G which has been used for calculating the grasp pose serves as reference point for grasping. The robot gripper is positioned relative to this point when performing the grasp pose.

• Power grasps: The reference point for power grasping is the user's palm relative to the object. Thus, the robot's palm is positioned with a fixed transformation with respect to this point.

Now the correct grasp pose and position are determined and the system can perform force controlled grasping by closing the fingers. Figure 18 shows four different grasp types mapped from human grasp operations to the robot. It can be seen that the robot's hand position depends strongly on the grasp type.

Figure 18: Examples of different grasp types: 1. two-finger precision grasp, 2. three-finger tripod grasp, 3. circular power grasp, 4. prismatic power grasp.

For mapping simple movements of the human demonstration to the robot, we defined a set of logical rules that select sensor constraints depending on the execution context, for example selecting a force threshold parallel to the movement when approaching an object, or selecting zero-force control while grasping an object. So context information serves for selecting intelligent sensor control. The context information is stored within the macro operators and is available for processing. Thus, a set of heuristic rules has been defined for handling the system's behavior depending on the current context. As an overview, the following rules turned out to be useful in specific contexts (a sketch of such a rule table follows below):

• Approach: The main approach direction vector is determined. Force control is set to 0 in the directions orthogonal to the approach vector. Along the grasp direction, a maximum force threshold is selected.

• Grasp: Set force control to 0 in all directions.

• Retract: The main approach direction vector is determined. Force control is set to 0 in the directions orthogonal to the approach vector.

• Transfer: Segments of basic movements are chunked into complex robot moves depending on direction and speed.

Summarizing, the mapping module generates a set of default parameters for the target robot system itself as well as for movements and force control. These parameters are directly used for controlling the robot.
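The context-dependent selection of sensor constraints sketched in the rules above could be encoded, for example, as a small lookup from execution context to force control parameters; the parameter names and values below are purely illustrative assumptions, not the defaults generated by the mapping module.

```python
import numpy as np

def sensor_constraints(context, approach_direction=None, max_force=5.0):
    """Heuristic context -> force control parameters (cf. the rules above).

    context: one of "approach", "grasp", "retract", "transfer".
    approach_direction: 3-vector of the main approach/retract direction.
    Returns a dict with per-axis compliance and an optional force limit.
    """
    if context == "grasp":
        return {"force_setpoint": np.zeros(3), "force_limit": None}
    if context in ("approach", "retract"):
        d = np.asarray(approach_direction, dtype=float)
        d /= np.linalg.norm(d)
        constraints = {"force_setpoint": np.zeros(3),   # zero force orthogonal to d
                       "compliant_axes": 1.0 - np.abs(d)}
        if context == "approach":
            constraints["force_limit"] = max_force      # threshold along the grasp direction
        return constraints
    if context == "transfer":
        return {"chunk_moves": True}                    # merge segments into complex moves
    raise ValueError(f"unknown context: {context}")

print(sensor_constraints("approach", approach_direction=[0.0, 0.0, -1.0]))
```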

6.5 Human Comments and Advice

Although it is desirable to make hypotheses about the user's wishes and intentions based on only one demonstration, this will not work in the general case. Therefore, we added the possibility to accept and reject interpretations of the human demonstration made by the system. Since the generation of a well working robot program is the ultimate goal of the programming process, the human user can interact with the system in two ways:

1. Evaluation of hypotheses concerning the human demonstration, e.g. recognized grasps, grasped objects, and important effects in the environment.

2. Evaluation and correction of the robot program that will be generated from the demonstration.

Currently, interaction with the programming system is established by performing a post-processing phase on the demonstration. For this purpose, a simulation system has been developed which shows the identified objects of the demonstration and the actuators (the human hand with the respective observed trajectory, and the robot manipulators) in a 3D environment. During the evaluation phase of the user demonstration, all hypotheses about grasps, including their types and the grasped objects, are displayed in a replayed scenario. Through graphical interfaces, the user is prompted to accept or reject actions. It is also possible to modify the relevance of the relations necessary for the macro generation phase (see figure 19).

Figure 19: Dialogue mask for accepting and rejecting hypotheses made by the system.

After the modified user demonstration has been mapped to a robot program, the correctness of the program is validated in a simulation phase. Here, modifications of the environment and the gripper trajectories are displayed within the 3D environment. Again, the user can interactively modify trajectories or grasp points of the robot if desired (see figure 20). Modifications are performed by dragging trajectories or grippers to the favored positions.

Figure 20: Visualization of modifiable robot trajectories.

7 Interactive Programming: an Example

Based on the above mentioned cognitive skills of ALBERT and the presented event management concept, intuitive and fluent interaction with the robot is possible. One experiment deals with triggering a previously generated complex robot program for laying a table. As shown in figure 21, the user shows the manipulable objects to the robot, which instantiates an executable program from its database with the given environmental information. It is also possible to advise the robot to grasp selected objects and lay them down at desired positions.

Figure 21: Interaction with ALBERT.

For an experiment with this approach, off-the-shelf cups and plates in different forms and colors were used. The examples have been recorded in our training center (see figure 23, left). Here, the mentioned sensors are integrated into an environment allowing free movements without restrictions. In a single demonstration, crockery is placed for one person (see the left column of figure 22). The observed actions can be replayed in simulation (second column), where sensor errors are corrected interactively. The system displays its hypotheses on recognized actions and their segments in time. After interpreting and abstracting this information, the resulting macro operator can be mapped to a specific robot. For this purpose, our service robot ALBERT was modeled, set up and used (third column in figure 22).

Figure 23: Training Center (left) and Service Robot ALBERT (right).

After testing the mapped trajectory, the generated robot program is transferred to ALBERT and executed in the real world (see the fourth column in figure 22).

Figure 22: Experiment Phases.

8 Conclusions

In this paper, we presented a framework for building robot assistants which can be instructed and controlled by anthropomorphic interaction. Besides the general architecture with its basic human-friendly communication components, a methodology for integrating new manipulation skills from user demonstrations has been outlined. Fetch-and-carry tasks can be taught as well as fine manipulations of simple objects. Although promising results have already been achieved, the robustness of the cognitive components still needs further investigation.

ACKNOWLEDGMENT

This work has been partially supported by the BMBF project "Morpha" and the SFB 588 funded by the German Research Association.

References

[1] M. A. Arbib, T. Iberall, and D. M. Lyons. Coordinated Control Programs for Movements of the Hand, pages 111–129. Springer Verlag, 1985.

[2] A. Arsenio and J. Santos-Victor. Robust Visual Tracking by an Active Observer. In Proceedings of the IEEE International Conference on Intelligent Robots and Systems (IROS), volume 3, pages 1342–1347, 1997.

[3] W. Burgard, T. Christaller, and A. Knoll. Robotik. In G. Görz, C.-R. Rollinger, and J. Schneeberger, editors, Handbuch der Künstlichen Intelligenz, pages 871–939. Oldenbourg Wissenschaftsverlag, München, 2000.

[4] D. Comaniciu and P. Meer. Robust Analysis of Feature Spaces: Color Image Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 750–755, June 1997.

[5] M. R. Cutkosky. On Grasp Choice, Grasp Models, and the Design of Hands for Manufacturing Tasks. IEEE Transactions on Robotics and Automation, 5(3):269–279, 1989.

[6] R. Dillmann, O. Rogalla, M. Ehrenmann, R. Zöllner, and M. Bordegoni. Learning Robot Behaviour and Skills based on Human Demonstration and Advice: the Machine Learning Paradigm. In 9th International Symposium of Robotics Research (ISRR 1999), pages 229–238, Snowbird, Utah, USA, 9–12 October 1999.

[7] M. Ehrenmann, D. Ambela, P. Steinhaus, and R. Dillmann. A Comparison of Four Fast Vision-Based Object Recognition Methods for Programming by Demonstration Applications. In Proceedings of the 2000 International Conference on Robotics and Automation (ICRA), volume 1, pages 1862–1867, San Francisco, California, USA, 24–28 April 2000.

[8] M. Ehrenmann, O. Rogalla, R. Zöllner, and R. Dillmann. Teaching Service Robots Complex Tasks: Programming by Demonstration for Workshop and Household Environments. In Proceedings of the 2001 International Conference on Field and Service Robots (FSR), volume 1, pages 397–402, Helsinki, Finland, 11–13 June 2001.

[9] M. Ehrenmann, R. Zöllner, S. Knoop, and R. Dillmann. Sensor Fusion Approaches for Observation of User Actions in Programming by Demonstration. In Proceedings of the 2001 International Conference on Multi Sensor Fusion and Integration for Intelligent Systems (MFI), volume 1, pages 227–232, Baden-Baden, 19–22 August 2001.

[10] J. Elliot and K. Connolly. A classification of hand movements. Developmental Medicine and Child Neurology, 26:283–296, 1984.

[11] R. Brooks et al. The Cog project: Building a humanoid robot. In C. L. Nehaniv, editor, Computation for Metaphors, Analogy and Agents, volume 1562 of Springer Lecture Notes in Artificial Intelligence. Springer-Verlag, 1999.

[12] R. E. Fikes, P. E. Hart, and N. J. Nilsson. Learning and executing generalized robot plans. Artificial Intelligence, 3(4), 1972.

[13] H. Friedrich. Interaktive Programmierung von Manipulationssequenzen. PhD thesis, Universität Karlsruhe, 1998.

[14] J. Fry, H. Asoh, and T. Matsui. Natural dialogue with the Jijo-2 office robot. In Proceedings of the 1998 IEEE/RSJ International Conference on Intelligent Robots and Systems, volume 4, pages 1278–1283, 1998.

[15] D. Gavrila and L. Davis. Towards 3D Model-based Tracking and Recognition of Human Movement: a Multi-View Approach. In M. Bichsel, editor, International Workshop on Face and Gesture Recognition, Zürich, pages 272–277, June 1995.

[16] T. Joachims. Making large scale SVM learning practical. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods – Support Vector Learning, pages 169–184, 1999.

[17] S. Kang and K. Ikeuchi. Temporal Segmentation of Tasks from Human Hand Motion. Technical Report CMU-CS-93-150, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA, April 1993.

[18] M. Kohler. Übersicht zur Handgestenerkennung. http://ls7-www.cs.uni-dortmund.de/research/gesture/, 2000.

[19] Y. Kuniyoshi, M. Inaba, and H. Inoue. Learning by Watching: Extracting Reusable Task Knowledge from Visual Observation of Human Performance. IEEE Transactions on Robotics and Automation, 10(6):799–822, 1994.

[20] T. C. Lueth, Th. Laengle, G. Herzog, E. Stopp, and U. Rembold. KANTRA – Human-Machine Interaction for Intelligent Robots using Natural Language. In IEEE International Workshop on Robot and Human Communication, volume 4, pages 106–110, 1994.

[21] T. Mitchell. Explanation-based generalization – a unifying view. Machine Learning, 1:47–80, 1986.

[22] Y. Ogura, Y. Sugahara, Y. Kaneshima, N. Hieda, H. Lim, and A. Takanishi. Interactive biped locomotion based on visual/auditory information. In 2002 IEEE International Workshop on Robot and Human Interactive Communication (ROMAN), volume 1, pages 253–258, Berlin, 2002.

[23] H. Onda, H. Hirukawa, F. Tomita, T. Suehiro, and K. Takase. Assembly Motion Teaching System using Position/Force Simulator – Generating Control Program. In 10th IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 389–396, Grenoble, France, 7–11 September 1997.

[24] D. Perzanowski, A. Schultz, W. Adams, E. Marsh, and M. Bugajska. Building a Multimodal Human-Robot Interface. IEEE Intelligent Systems, pages 16–21, January–February 2001.

[25] O. Rogalla, M. Ehrenmann, and R. Dillmann. A sensor fusion approach for PbD. In Proceedings of the IEEE/RSJ Conference on Intelligent Robots and Systems (IROS '98), volume 2, pages 1040–1045, 1998.

[26] O. Rogalla, K. Pohl, and R. Dillmann. A General Approach for Modeling Robots. In Proceedings of the IEEE International Conference on Intelligent Robots and Systems (IROS), volume 3, pages 1963–1968, 2000.

[27] H. Sidenbladh, D. Kragic, and H. Christensen. A Person Following Behaviour for a Mobile Robot. In Proceedings of the IEEE International Conference on Robotics and Automation, Detroit, MI, USA, pages 670–675, April 1999.

[28] A. Steinhage and T. Bergener. Learning by Doing: A Dynamic Architecture for Generating Adaptive Behavioral Sequences. In Proceedings of the Second International ICSC Symposium on Neural Computation (NC), pages 813–820, 2000.

[29] T. Takahashi, S. Nakanishi, Y. Kuno, and Y. Shirai. Helping Computer Vision by Verbal and Nonverbal Communication. In Proceedings of the 14th IEEE International Conference on Pattern Recognition, volume 4, pages 1216–1218, 1998.

[30] T. Tojo, Y. Matsusaka, T. Ishii, and T. Kobayashi. A conversational robot utilizing facial and body expressions. In 2000 IEEE International Conference on Systems, Man, and Cybernetics, volume 2, pages 858–863, 2000.

[31] V. Vapnik. Statistical Learning Theory. John Wiley and Sons, Inc., 1998.

[32] R. Voyles and P. Khosla. Gesture-Based Programming: A Preliminary Demonstration. In Proceedings of the IEEE International Conference on Robotics and Automation, Detroit, Michigan, pages 708–713, May 1999.

[33] J. Zhang, Y. von Collani, and A. Knoll. Interactive Assembly by a Two-Arm-Robot Agent. Robotics and Autonomous Systems, 1(29):91–100, 1999.

[34] R. Zöllner, O. Rogalla, J. Zöllner, and R. Dillmann. Dynamic Grasp Recognition within the Framework of Programming by Demonstration. In 10th IEEE International Workshop on Robot and Human Interactive Communication (ROMAN), pages 418–423, 18–21 September 2001.