Interacting with a Mobile Robot: Evaluating Gestural Object References

Joachim Schmidt1, Nils Hofemann1, Axel Haasch1, Jannik Fritsch1,2 and Gerhard Sagerer1

1 Applied Computer Science, Faculty of Technology, Bielefeld University, 33615 Bielefeld, Germany
{jschmidt|nhofeman|ahaasch|sagerer}@techfak.uni-bielefeld.de
2 is now with: Honda Research Institute Europe, 63073 Offenbach, Germany
[email protected]
Abstract— Creating robots that are able to interact and cooperate with humans in household environments and everyday life is an emerging topic. Our goal is to facilitate a human-like and intuitive interaction with such robots. Besides verbal interaction, gestures are a fundamental aspect of human-human interaction, and one typical use of interactive gestures is the referencing of objects. This paper describes a novel integrated vision system combining different algorithms for pose tracking, gesture detection, and object attention in order to enable a mobile robot to resolve gesture-based object references. Results from the evaluation of the individual algorithms as well as the overall system are presented. A total of 20 minutes of video data collected from four subjects performing almost 500 gestures is evaluated to demonstrate the current performance of the approach as well as the overall success rate of gestural object references. This demonstrates that our integrated vision system can serve as the gestural front end that enables an interactive mobile robot to engage in multimodal human-robot interaction.
I. INTRODUCTION

The aim of bringing robots into the household, and with it the need to interact with humans in everyday life, is often addressed. But only by adapting to the interaction styles their human partners are used to will such robots become truly personal robots rather than technical gadgets. Consequently, research is turning to those perceptual processes that facilitate a human-like and intuitive interaction with robots. Besides verbal interaction, gestures are a fundamental aspect of human-human interaction. One typical use of interactive gestures, employed very often when the conversation focuses on the physical environment, is the referencing of objects. This requires not only the development of individual algorithms for pose tracking, gesture detection, and object attention but also the integration of these different methods in a single vision system. In combination with the analysis of verbal information (i.e., modules for speech understanding and dialog), the visually extracted information allows for situated human-robot interactions involving actions, objects, and references between them. This paper describes a novel integrated vision system that combines several vision algorithms to enable a mobile robot to resolve gesture-based object references. The first step is to enable the robot to detect and track the human partner independently of the image background. This is achieved using an articulated 3D body model that is fitted
to the image data by applying a particle filtering method. Using a 3D model is more principled than image-space approaches and makes the tracking independent of the camera viewpoint. As the next step in the process, the gesture is identified based on the tracked hand trajectory. A multiple-hypothesis approach employing a view-independent gesture representation is used to segment the trajectories and identify the current gesture from a set of learned models. To finally find the referenced objects, the recognized pointing gestures are used to calculate a region of interest (ROI) representing the estimated object position in the world. Images taken from this ROI could be analyzed with established object detection methods, but this is not in the scope of this paper. Instead, we evaluate here the vision front end that produces the ROI. Next, we outline related approaches. Section III illustrates the overall system design and the individual modules integrated. In Section IV we describe the extensive evaluation performed to show the usability and precision of the overall system as well as its parts. The paper ends with a conclusion in Section V.

II. RELATED WORK

Since the early work of Crowley and Coutaz [3], hands and the vision-based detection of their motions have been known as a way to interact with computer systems. This has developed into complex integrated vision systems, combining gesture recognition, object detection, and dialog systems into multimodal and interactive systems [1]. Especially the combination of several modalities enables robots to become aware of interaction partners in the real world and to achieve situated and human-like interactions with them. A robot system heading for this goal is presented by Stiefelhagen et al. [16]. In detail, they utilize the temporal synchrony of speech and hand trajectories for recognizing pointing actions to objects. However, a specialized depth sensor (stereo or time-of-flight) is needed for tracking the human's face and hands in 3D, similar to [9], who estimate the direction of pointing using stereo cameras. But even a monocular camera can be enough for realizing visual recognition and tracking of people and hand gestures, as Germa et al. [5] show with their tour-guide robot.
Fig. 1. Evaluation scenario: Robot BIRON facing a table with objects.
The fusion of multiple visual cues is achieved by a particle filtering framework; command gestures towards the robot are recognized from hand configuration templates. As Germa et al. neither apply 3D body tracking nor hand trajectory recognition, their scenario does not involve interaction with real-world objects. In the CoSy project, Kruijff et al. [10] work on aspects of anaphoric and exophoric references to objects. They present a framework to manage the problems of dynamic changes of the environment typical for human-robot interactions. Combining speech and object recognition, they achieve learning and detection of objects. However, they do not dwell on gestural input as an additional cue. Even though object recognition, visual attention, language, and action processing each yield useful results as individual algorithms, Fay et al. [4] show that integrating sensory data from different modalities is essential for human-like performance in dealing with ambiguities. Brooks and Breazeal [2] present a modular spatial reasoning framework that is able to determine the object referent from deictic reference, including pointing gestures from a stereo camera setup and speech. In our work we aim for an integrated system for resolving object references, where the pointing direction is explicitly recognized based on a 3D body model tracked with a monocular camera, relying on trained gesture models and not requiring an explicit triggering of the recognition process. As part of a larger framework for a mobile robot, our system conveniently benefits from incorporating verbal information, the robot's person attention, and further modalities. The same framework is also suitable for enabling human-oriented interaction with an anthropomorphic robot [15]. However, this paper focuses on the vision aspects only and provides a detailed evaluation of the quality of the individual algorithms and the vision system as a whole.
III. SYSTEM SKETCH AND ALGORITHMS

In our study we made use of the mobile robot BIRON (Bielefeld Robot Companion [7]). Intended to be a personal robot companion, BIRON is a research and evaluation platform used in human-robot studies focused on social interaction. Besides the vision system for resolving object references, the basic functionalities of our robotic interaction system are a model for human awareness, the ability to navigate in unknown environments, and a dialog system for naturally spoken language; see also [15]. The vision components presented in this paper are designed to be integrated into a multimodal system that can meet the requirements of human-robot interaction in everyday life. The contribution of this paper can be seen as part of a larger learning scenario: imagine the robot being placed in an unknown environment, e.g., seeing the flat of its human owner for the first time. The task here is to let the robot learn new objects by pointing at them, with the human and the robot observing each other and interacting on objects in the real world. Regarding the technical basis, we use two cameras, one for recognizing the human and his gestures and one active camera for the acquisition of close-up images of objects. For body tracking an Apple iSight (body tracking camera) is used, whose field of view is wide enough to completely observe the upper body of the interaction partner at a distance of 2.0 - 2.5 m. The second camera (object camera) is a Sony Evi pan-tilt camera, utilized to get a close-up of the referenced object. Calibrating both cameras with respect to the robot's coordinate system allows the object camera to be aimed at the target position of a pointing gesture seen from the body tracking camera. The software basis for the integrated system presented in this paper is provided by the XML-based communication framework XCF [18] and the image processing framework IceWing [11]. The robot system (cf. right side of Fig. 2) provides information about the environment and the human interaction partner. The person attention processes multimodal cues, including speaker localization as well as face and leg detection, for finding and tracking possible interaction partners. The natural spoken dialog system is used for speech input and output; the execution supervisor controls the different modules during interaction. Sensory data and processing results are stored in a scene model and made available for other modules.

Fig. 2. System sketch for recognizing pointing gestures on a mobile robot.
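As an illustration of the camera setup described above, the following sketch shows how a calibrated pan-tilt camera could be aimed at a 3D target position (e.g., a ROI center) given in the robot coordinate system. This is a minimal sketch under assumed frame conventions; the function name, the camera pose values, and the axis convention are illustrative and not taken from the actual BIRON calibration.

```python
import numpy as np

def aim_pan_tilt(target_robot, cam_R, cam_t):
    """Pan/tilt angles (rad) that point the object camera at a 3D target
    given in the robot coordinate system.

    cam_R (3x3) and cam_t (3,) describe the camera base frame relative to
    the robot frame, as obtained from calibration. Axis convention assumed
    here: x forward, y left, z up in the camera base frame."""
    p = cam_R.T @ (np.asarray(target_robot, dtype=float) - cam_t)  # target in camera frame
    pan = np.arctan2(p[1], p[0])                    # rotation around the vertical axis
    tilt = np.arctan2(p[2], np.hypot(p[0], p[1]))   # elevation above the horizontal plane
    return pan, tilt

# Example: a target 2.3 m in front and 0.4 m to the left of the robot, at table
# height, with the camera mounted 1.2 m above the robot origin (placeholder values).
pan, tilt = aim_pan_tilt([2.3, 0.4, 0.75], np.eye(3), np.array([0.0, 0.0, 1.2]))
print(np.degrees([pan, tilt]))
```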
The vision system evaluated in this paper (cf. left side of Fig. 2) consists of three modules that are presented in the following. We start with tracking the human’s upper body, especially his acting hand (Body Tracking, Section III-A). These motions are interpreted in order to detect actions like pointing (Gesture Recognition, Section III-B). Based on the
recognized action, a region of interest (ROI) is determined (Object Attention, Section III-C) to finally enable detecting and/or learning the referenced object.

A. 3D Body Model Tracking

In the presented scenario we assume that the human is standing in front of the robot with the upper body visible in the gesture camera. For estimating the pose of the human in each frame, we apply the monocular 3D body model tracking algorithm described in [14]. The motions of the person are tracked using the articulated model depicted in Fig. 4 with 14 degrees of freedom. The estimated parameters are the human's position and orientation (3 DOF each) with respect to the robot and the joint angles of the two arms (3 DOF for each shoulder, 1 DOF for each elbow). Currently, we apply an individual model for each observed person and initialize tracking by hand, assuming that the person attention module provides a coarse positioning. To match a given pose of the model with the image data, the model is first back-projected into the image and then evaluated for each limb based on multiple color and intensity cues. During system design, we deliberately avoided techniques employing background adaptation, such as silhouette computation, to allow tracking to work in unknown and changing environments or even while the robot or the camera move.
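The hypothesize-and-evaluate cycle of such model-based tracking can be outlined in a few lines. This is a generic particle-filter skeleton, not the kernel particle filter of [14]: the observation model is passed in as a placeholder callable standing in for back-projecting the body model and fusing the per-limb color and intensity cues, and the diffusion noise is an arbitrary value.

```python
import numpy as np

def pf_track_step(particles, score_pose, rng, noise=0.02):
    """One tracking step: diffuse the pose hypotheses, weight them with the
    observation model, resample, and return the mean pose estimate.

    particles  : (N, 14) array of pose hypotheses (torso position/orientation
                 plus the arm joint angles).
    score_pose : callable mapping one pose vector to a positive likelihood;
                 stands in for projecting the articulated model into the image
                 and combining the per-limb cues."""
    particles = particles + rng.normal(scale=noise, size=particles.shape)
    weights = np.array([score_pose(p) for p in particles])
    weights /= weights.sum()
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    resampled = particles[idx]
    return resampled, resampled.mean(axis=0)

# Toy usage: a dummy observation model that prefers poses close to zero.
rng = np.random.default_rng(0)
parts = rng.normal(size=(1500, 14))            # 1500 particles as in Sec. IV-A
for _ in range(10):
    parts, pose_estimate = pf_track_step(parts, lambda p: np.exp(-np.sum(p ** 2)), rng)
print(pose_estimate)
```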
Fig. 3. Tracking results for subject A while pointing at an object.

Although depth information is not directly observable in a monocular setup, it can be inferred from the best configuration of the articulated model. The pose tracking algorithm thereby provides 3D motion trajectories of the upper body in the robot-centered coordinate system.

B. Trajectory-based Gesture Recognition

The tracked motions are used as input for the gesture recognition in order to detect meaningful gestures. The use of a 3D model enables the robot to track motions independently of the camera viewpoint. Motions are represented in a cylindrical coordinate system with its basis in the human's shoulder; the features used for the recognition process are the relative radial and vertical velocities with respect to the torso. In a preprocessing step, the sequential data is smoothed by a causal Savitzky-Golay filter [12].

Fig. 4. Articulated 3D upper body model with 14 DOF.
Fig. 5. Activation potentials for different gesture models (a) and gesture completion likelihood (b).

Due to the constraints of the presented scenario, the trajectory of the right hand is sufficient for recognizing the human's gestures. Trajectory models are generated from annotated training examples. For the presented approach, we trained five models m^k for the following gestures: pointing at an object (point), retracting the arm after pointing (back), raising the arm (up), waving (wave), and finally lowering the arm (down), referred to by k ∈ {p, b, u, w, d}. Briefly, the actual recognition is done by comparing the observed motion z with each of the learned trajectory models m^k separately, using a time window [t − I, t]:

d(z, m^k) = \sum_{i=1}^{2} \sum_{j=t-I}^{t} \bigl( z_i(j) - \alpha\, m_i^k(\phi - \rho j) \bigr)^2 ,   (1)

with i iterating over the dimensions of the feature space. As a new gesture can start at any time, we employ a particle filtering scheme [8] that tracks a multitude of hypotheses, each with different parameters. The optimized parameters are the type of gesture k, the time offset φ at which the gesture started, and scaling factors for the duration ρ and the amplitude α of the motion. These parameters form the state vector of a particle; its values define a trajectory hypothesis that is compared with the actually observed trajectory using the similarity measure d. From this comparison the likelihood of each hypothesis is calculated and used for the propagation step in the particle filter. For each trajectory model m^k, the likelihoods of all hypotheses can be summed over the complete particle set, resulting in the likelihood of each motion k ∈ {p, b, u, w, d} at any given time (see Fig. 5(a)) and of the gesture being completed successfully (see Fig. 5(b)). The gesture recognition thereby segments the continuous hand trajectory into meaningful gestures. External triggering, e.g., by speech, is not needed.
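To make Eq. (1) concrete, the following sketch evaluates the distance for a single gesture hypothesis and converts it into a particle weight. The rounding and clipping of the warped model index and the Gaussian conversion to a likelihood are illustrative choices, not details taken from the paper; the toy trajectories are placeholders.

```python
import numpy as np

def gesture_distance(z, m, alpha, rho, phi, t, I):
    """d(z, m^k) of Eq. (1): squared difference between the observed feature
    trajectory z and the amplitude-scaled (alpha), time-warped (phi, rho)
    gesture model m, summed over the window [t - I, t] and both feature
    dimensions (radial and vertical velocity). z and m are (T, 2) arrays."""
    js = np.arange(t - I, t + 1)
    idx = np.clip(np.rint(phi - rho * js).astype(int), 0, len(m) - 1)
    return float(np.sum((z[js] - alpha * m[idx]) ** 2))

def hypothesis_weight(z, m, particle, t, I, sigma=0.1):
    """Turn the distance into a likelihood used for particle propagation."""
    alpha, rho, phi = particle
    return float(np.exp(-gesture_distance(z, m, alpha, rho, phi, t, I) / (2.0 * sigma ** 2)))

# Toy usage: score one hypothesis against a synthetic observed trajectory.
rng = np.random.default_rng(0)
z = rng.normal(scale=0.1, size=(120, 2))        # observed velocity features (placeholder)
m_point = rng.normal(scale=0.1, size=(60, 2))   # learned 'point' model (placeholder)
print(hypothesis_weight(z, m_point, particle=(1.0, 1.0, 119.0), t=119, I=30))
```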
C. Object Attention

The presented interaction scenario eventually aims at recognizing objects the human points at. The tracked and recognized gestures provide the basis for calculating a region of interest (ROI) to resolve the referenced object. Only gestures that have been recognized as 'pointing' are considered for further processing. The position of the human is compared to the robot's person attention information to ascertain that the tracking results originate from the interaction partner.
To determine the ROI, the pointing direction is assumed to be a line extending from the user's head over his hand, similar to [16]. The region of interest is represented as a sphere in the world coordinate system, centered at the estimated object position. The distance between the tracked hand position and the ROI center can be scaled using verbal information, e.g., describing the size ('large', 'tiny') of the object; the zoom of the object camera is set accordingly. A fixed distance is used for this evaluation, as speech input has not been considered. The ROI position is then transformed into the coordinate system of the robot's object camera and an image is captured for analysis, see Fig. 6.

Fig. 6. Region of interest for a pointing gesture.

The object attention verifies whether the verbal and visual information correspond during interaction. Predefined color words, e.g., 'blue', are used to apply a simple color-based object segmentation using a direct mapping to preassigned RGB values. This enables the user to resolve ambiguities from pointing, e.g., for two nearby objects, by naming their color. Thus a coupling of verbal and gestural modalities is supported by the object attention. The color segmentation can be used to acquire a close-up view of the referenced object for later use and/or training in an object recognition system.
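A minimal sketch of this ROI construction is given below. The fixed pointing distance, the sphere radius, and the variable names are illustrative assumptions; the transformation into the object camera frame and the color-word segmentation are not reproduced here.

```python
import numpy as np

def roi_from_pointing(head, hand, distance=0.6, radius=0.15):
    """Estimate the region of interest for a pointing gesture.

    The pointing direction is the line from the head through the hand (3D
    points in the robot/world frame); the ROI is a sphere of the given radius
    centered `distance` meters beyond the hand along this line. Verbal size
    cues ('large', 'tiny') could scale distance and radius; here they are fixed."""
    head = np.asarray(head, dtype=float)
    hand = np.asarray(hand, dtype=float)
    direction = hand - head
    direction /= np.linalg.norm(direction)
    return hand + distance * direction, radius

# Toy usage: head at 1.6 m height, hand stretched out towards the table.
center, r = roi_from_pointing(head=[0.0, 0.0, 1.6], hand=[0.4, 0.1, 1.2])
print(center, r)
```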
IV. EVALUATION

For the evaluation the following setup has been used: the human is standing in front of a table with five objects, facing the robot BIRON. The person is asked to show the objects to the robot by pointing at them. The human's actions are recorded by the body tracking camera, while we ensured that the upper body is in its field of view. Likewise, the object positions are well within range of the pan-tilt camera. As the evaluation was designed to test each system component on its own as well as the system as a whole, ground truth is needed for the different processing steps. All object positions have been measured in the coordinate system of the robot. Likewise, the video data has been manually annotated, for the evaluation of the body tracking by marking the right hand's position in the image and for the gesture recognition by segmenting and naming every action. For the experiments, the 3D body tracking has been parameterized to provide trajectories of the best accuracy, which as a trade-off requires longer processing time. Due to these demands and the drawback that the 3D body tracking needed a manual initialization, the evaluation was conducted offline. Since the dialog capabilities as well as the other robot functionalities (e.g., person attention, overall execution control) were not in the focus of this evaluation, the necessary parts were simulated. The experiments were performed with four persons, two female and two male. Half of the participants were inexperienced and performed the gestures for the very first time; the other half were experienced users.

After an initialization phase for the body tracking, the person pointed (point) at each object in sequence (crocodile, cup, ball, lemon, bottle), withdrawing the hand (back) after each gesture. The next gestures are raising the hand (up) and waving (wave) into the camera, finally lowering the arm (down). All participants had to perform the same sequence of gestures three times, changing the order of targets each time. We recorded each subject performing the experiments four times, resulting in 16 sequences with a total number of 18572 images, equivalent to more than 20 minutes of video and a total of 496 performed gestures. For evaluating the experiments, the 3D body model tracking is manually initialized by determining an initial pose. The resulting hand trajectories of the body tracker are used as input for the gesture recognition. The detected pointing gestures are consecutively used to calculate the ROIs and to acquire close-up images of the objects. In the following we discuss the recognition results of each module in detail.

A. Evaluating Body Tracking

The body tracking algorithm has been evaluated on three of the recorded sequences with 836 images in total. Ground truth has been generated by manually annotating the positions of the human's hands and head in the images. The measures of the body model and an initial posture are also defined manually for each sequence.
Fig. 7. Tracking error [pixel] over image number for subject B, for the 500-particle and 1500-particle configurations. Mean errors µ as in Tab. I. Pointing at object (5) around frame 270 is difficult to track.
TABLE I
Body tracking evaluation: position error for the right hand. µ: RMSE (root mean squared error), σ: standard deviation, both in [pixel]. Mean errors µ for subject B are also shown in Fig. 7.

Person   # pic   1500 particles (µ / σ)   500 particles (µ / σ)
subj A     318         18.7 / 13.9              52.7 / 41.3
subj B     242         14.5 /  8.6              32.9 / 23.3
subj C     276         12.0 / 10.1              72.6 / 38.3
We configured the system in two ways: 1) Using 1500 particles and six mean-shift iterations takes approx. 5 sec/image to compute; this configuration is used for carrying out the overall system evaluation. 2) Using 500 particles and three mean-shift iterations results in a lower accuracy but achieves faster computation at approx. 1-2 Hz; it is shown for comparison only. Tab. I shows that we still need a relatively high number of particles to ensure accurate tracking over a longer image sequence. Gestures that result in self-occlusion are more likely to produce noisy results (cf. Fig. 7 around frame 270), as the observation function for occluded limbs, e.g., the torso, provides imprecise or even wrong results. Fusing multiple cues for each limb and considering the body model as a whole helps to keep up tracking even if the detectors for individual body parts are misled. Both evaluated configurations are able to recover from tracking errors, but only configuration 1) permanently provides reliable trajectories.
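For reference, the per-frame hand position error behind Tab. I can be computed as follows. This is a sketch with synthetic placeholder data; the aggregation (µ as RMSE over the per-frame Euclidean pixel errors, σ as their standard deviation) follows our reading of the table caption.

```python
import numpy as np

def hand_tracking_error(tracked_px, annotated_px):
    """Per-frame Euclidean pixel distance between tracked and manually
    annotated right-hand positions (both (N, 2) arrays), plus the aggregate
    values reported in Tab. I: mu as RMSE, sigma as standard deviation."""
    err = np.linalg.norm(np.asarray(tracked_px) - np.asarray(annotated_px), axis=1)
    mu = float(np.sqrt(np.mean(err ** 2)))   # RMSE over the sequence
    sigma = float(np.std(err))               # spread of the per-frame errors
    return err, mu, sigma

# Toy usage with synthetic annotations (placeholder data, not the evaluation set).
rng = np.random.default_rng(1)
ground_truth = rng.uniform(0, 320, size=(242, 2))
tracked = ground_truth + rng.normal(scale=12.0, size=ground_truth.shape)
_, mu, sigma = hand_tracking_error(tracked, ground_truth)
print(round(mu, 1), round(sigma, 1))
```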
B. Evaluating Gesture Recognition

The gesture recognition is evaluated using the trajectories provided by the body tracking component. The models of the gestures are trained separately for each person, using the tracked and annotated motions from three of the four runs; the remaining sequence is used for evaluation. This results in sets of only five to 20 training samples for each gesture model. For the intended application, fast adaptation to new users is an important feature but obviously prevents extensive collection of training data. The presented algorithm shows robust performance even if only a low number of training examples is provided. Note that systematic errors of the body tracking are co-learned in the training process, thus enabling robust gesture detection even for coarse tracking results. Table II illustrates the recognition results and error types for the five different motions. In each column the number of annotated and correctly detected gestures is noted, followed by the error types of double detections, deletions, false insertions, and substitutions, as well as the resulting error and recognition rates. The last column summarizes the results. Subject IV executed the wave motion simply by moving the hand instead of the forearm, which is not recognizable for the body tracking. The marked column (wave*) therefore displays the results neglecting subject IV for better comparability.

TABLE II
Gesture detection evaluation: errors and recognition rates for all objects and subjects.

gesture      point    back     up    wave*    down     all
# gest.        182     187     29      71       27     496
correct        167     182     29      65       26     469
double          49     102      0       0        0     151
delete          15       5      0       5        1      26
insert          24      26      1       2        6      59
substit.         0       0      0       1        0       1
error [%]     21.4    16.6    3.5    11.3     25.9    17.3
recog. [%]    91.8    97.3    100    91.5     96.3    94.6
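The error and recognition rates in Tab. II can be reproduced directly from the listed counts, with the error rate summing deletions, insertions, and substitutions relative to the number of annotated gestures (as described in the following paragraph):

```python
def table2_rates(n_gestures, correct, deletions, insertions, substitutions):
    """Recompute the last two rows of Tab. II from the per-gesture counts."""
    error_rate = 100.0 * (deletions + insertions + substitutions) / n_gestures
    recognition_rate = 100.0 * correct / n_gestures
    return round(error_rate, 1), round(recognition_rate, 1)

# 'point' column: 182 annotated gestures, 167 correct, 15 deletions, 24 insertions.
print(table2_rates(182, 167, 15, 24, 0))   # -> (21.4, 91.8)
# All gestures: 496 annotated, 469 correct, 26 + 59 + 1 errors.
print(table2_rates(496, 469, 26, 59, 1))   # -> (17.3, 94.6)
```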
The gesture recognition shows a high recognition rate of 94.6% and an acceptably low error rate of 17.3%. This rate is calculated by adding up all errors (deletions, insertions, and substitutions) for the respective gesture. The false positive recognitions (insertions) make up the major part of this error (11.9%; e = 59). For a robot system, this could lead to false assumptions about the gestures performed by the user. Compared to our previous work [6], appropriate depth information about the current body pose can significantly improve the robustness of action recognition. Probabilistic fusion of data from gesture and speech recognition, however, could help to validate assumptions, as speech and gestures naturally co-occur in everyday communication [17].

C. Evaluating Object Attention

Both body tracking trajectories and recognized gestures are essential for determining the region of interest. For evaluating the object attention module, only the 167 correctly recognized pointing gestures are used. Five of those gestures have been neglected by the object attention (cf. Table II), because their calculated ROI was outside the interaction space observable by the object camera.

Fig. 8. ROI center positions for recognized pointing gestures and ground truth (bold markers) for each object. Objects (1-5) from left to right.
Furthermore, no verbal information is used, so the distance between the ROI and the hand is not altered, as the color-based selection of objects and the segmentation of the ROI image are not in the scope of this paper. The final object position error is calculated as the Euclidean distance in [m] in the world coordinate system between the measured object position and the ROI center. Table III presents the mean error µ and the standard deviation σ for the ROIs separately for each subject. The objects (1-5) are ordered in the same manner as in Figures 8 and 3.

D. System Performance

The ROI errors are for most objects not larger than 25 cm. The error in the ROI positions is mainly due to the coarse gesture information from the preceding modules. Also, estimating the pointing direction as a line from the head through the hand does not always fit well. For objects (4) and (5) to the left of the person (at the right in the image), the ROI is estimated too close to the human (cf. Fig. 8) for two reasons: 1) the body tracking cannot handle the self-occlusion that well and produces noisier results (cf. Fig. 7), and 2) for the described posture it is uncommon to have the hand in the line of sight towards the object; the gesture is more like 'shooting from the hip'. This calls for learning task-specific dependencies. The error could be overcome by altering the ROI offset for this specific gesture, as the recognized ROI positions are still compact (low σ in Table III) but their mean position is slightly offset (cf. Fig. 8). Objects (1-3) show that better results are achievable.
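The entries of Table III can be obtained by aggregating, per object and subject, the Euclidean distances between the ROI centers and the measured object position. A minimal sketch with placeholder data (µ taken as the mean error, as in the table caption):

```python
import numpy as np

def roi_position_errors(roi_centers, object_position):
    """Euclidean distances in the world frame between each ROI center and the
    measured object position, plus mean error mu and standard deviation sigma
    as reported per object and subject in Table III (all values in meters)."""
    err = np.linalg.norm(np.asarray(roi_centers) - np.asarray(object_position), axis=1)
    return err, float(err.mean()), float(err.std())

# Placeholder data: 9 ROI centers scattered around one measured object position.
rng = np.random.default_rng(2)
obj = np.array([1.8, -0.3, 0.75])
rois = obj + rng.normal(scale=0.2, size=(9, 3))
_, mu, sigma = roi_position_errors(rois, obj)
print(round(mu, 2), round(sigma, 2))
```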
TABLE III
Position error for the calculated region of interest (ROI) for all objects (1-5) and subjects (A-D). #g: number of recognized gestures, µ: mean error, σ: standard deviation, both in [m].

Object                subj A            subj B            subj C            subj D            all
                      #g    µ     σ     #g    µ     σ     #g    µ     σ     #g    µ     σ     #g    µ     σ
(1) green crocodile    9   0.22  0.04    7   0.10  0.03    7   0.15  0.04    9   0.13  0.06   32   0.15  0.06
(2) blue cup           8   0.28  0.02   10   0.21  0.07    7   0.23  0.02    7   0.23  0.03   32   0.24  0.05
(3) pink ball          8   0.29  0.06    7   0.14  0.05    8   0.25  0.04    7   0.22  0.07   30   0.23  0.08
(4) yellow lemon       7   0.38  0.05    7   0.25  0.03    7   0.25  0.05    8   0.34  0.03   29   0.31  0.07
(5) red bottle         8   0.58  0.04   10   0.47  0.07   11   0.51  0.05   10   0.50  0.04   39   0.51  0.06
all                   40   0.35  0.14   41   0.25  0.15   40   0.30  0.14   41   0.29  0.14  162   0.30  0.14
Comparing the different subjects, the errors vary only slightly, although the gestures were performed very differently across subjects. Individual gesture models therefore seem to be a good way to deal with such variations.

V. CONCLUSION AND OUTLOOK

In this paper we presented a short overview and an extensive offline evaluation of the vision system performing the processing of visual input on our mobile robot BIRON. Employing only the visual input channel, the quality of understanding object references is already quite good. The exactness could be further increased if verbal descriptions of size and color were available and information from the person attention (e.g., position of the gesturer, presence of acoustic input) were utilized. Integrating context knowledge about the environment and the current task could also help to reduce the search space for the body tracking as well as for the gesture recognition. The 3D body tracking is the only part of our vision system that does not yet provide robust tracking when configured to work in real time. Calculation speed could be improved by restricting the search space through the learned motion models or by refining the observation model to provide accurate results with even fewer cues. Automatic initialization and error recovery of the tracking algorithm [13] will help to make the system more usable for human-robot interaction. Ultimately, the presented system will be the basis for a sophisticated human-robot interaction system. As the body tracking operates independently of the background and a single camera would be sufficient, the system is well suited for application on a humanoid robot, contributing to more human-like interaction with such a robot.

REFERENCES

[1] C. Bauckhage, M. Hanheide, S. Wrede, T. Käster, M. Pfeiffer, and G. Sagerer. Vision systems with the human in the loop. EURASIP Journal on Applied Signal Processing, 2005(14):2375–2390, 2005.
[2] A. G. Brooks and C. Breazeal. Working with robots and objects: revisiting deictic reference for achieving spatial common ground. In HRI '06: Proceedings of the 1st ACM SIGCHI/SIGART Conference on Human-Robot Interaction, pages 297–304, New York, NY, USA, 2006. ACM.
[3] J. L. Crowley and J. Coutaz. Vision for man machine interaction. In Proceedings of the IFIP TC2/WG2.7 Working Conference on Engineering for Human-Computer Interaction, pages 28–45, London, UK, 1996. Chapman & Hall, Ltd.
[4] R. Fay, U. Kaufmann, A. Knoblauch, H. Markert, and G. Palm. Integrating object recognition, visual attention, language and action processing on a robot in a neurobiologically plausible associative architecture. In G. Palm and S. Wermter, editors, NeuroBotics Workshop Proceedings, 27th German Conf. of Artificial Intelligence, University of Ulm, 2004.
[5] T. Germa, F. Lerasle, P. Danes, and L. Brethes. Human / Robot Visual Interaction for a Tour-Guide Robot. In Int. Conf. on Intelligent Robots and Systems (IROS), 2007.
[6] A. Haasch, N. Hofemann, J. Fritsch, and G. Sagerer. A Multi-Modal Object Attention System for a Mobile Robot. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 1499–1504. IEEE/RSJ, IEEE, 2005.
[7] A. Haasch, S. Hohenner, S. Hüwel, M. Kleinehagenbrock, S. Lang, I. Toptsis, G. A. Fink, J. Fritsch, B. Wrede, and G. Sagerer. BIRON – The Bielefeld Robot Companion. In E. Prassler, G. Lawitzky, P. Fiorini, and M. Hägele, editors, Proc. Int. Workshop on Advances in Service Robotics, pages 27–32, Stuttgart, Germany, 2004. Fraunhofer IRB Verlag.
[8] N. Hofemann, J. Fritsch, and G. Sagerer. Recognition of Deictic Gestures with Context. In C. E. Rasmussen, H. H. Bülthoff, M. A. Giese, and B. Schölkopf, editors, Pattern Recognition, 26th DAGM Symposium, Tübingen, Germany, Proceedings, volume 3175 of Lecture Notes in Computer Science, pages 334–341, Heidelberg, Germany, 2004. Springer.
[9] N. Jojic, T. Huang, B. Brumitt, B. Meyers, and S. Harris. Detection and estimation of pointing gestures in dense disparity maps. In FG '00: Proceedings of the Fourth IEEE International Conference on Automatic Face and Gesture Recognition 2000, page 468, Washington, DC, USA, 2000. IEEE Computer Society.
[10] G.-J. M. Kruijff, J. D. Kelleher, and N. Hawes. Information fusion for visual reference resolution in dynamic situated dialogue. In Perception and Interactive Technologies, PIT 2006, volume 4021 of LNCS, pages 117–128. Springer Berlin / Heidelberg, 2006.
[11] F. Lömker, S. Wrede, M. Hanheide, and J. Fritsch. Building modular vision systems with a graphical plugin environment. In Proceedings of the International Conference on Vision Systems. IEEE, January 2006.
[12] W. Press, S. Teukolsky, W. Vetterling, and B. Flannery. Numerical Recipes in C. Cambridge University Press, Cambridge, UK, 2nd edition, 1992.
[13] J. Schmidt and M. Castrillon. Automatic Initialization for Body Tracking - Using Appearance to Learn a Model for Tracking Human Upper Body Motions. In Proceedings of the International Conference on Vision and Applications, Funchal, Madeira, Portugal, January 2008. Accepted.
[14] J. Schmidt, B. Kwolek, and J. Fritsch. Kernel Particle Filter for Real-Time 3D Body Tracking in Monocular Color Images. In Proceedings of the International Conference on Automatic Face and Gesture Recognition, volume 20, pages 1183–1199, Southampton, UK, 2006. IEEE.
[15] T. Spexard, M. Hanheide, and G. Sagerer. Human-oriented interaction with an anthropomorphic robot. IEEE Transactions on Robotics, 23(5) (Special Issue on Human-Robot Interaction):852–862, October 2007.
[16] R. Stiefelhagen, H. Ekenel, C. Fügen, P. Gieselmann, H. Holzapfel, F. Kraft, K. Nickel, M. Voit, and A. Waibel. Enabling multimodal human-robot interaction for the Karlsruhe humanoid robot. IEEE Transactions on Robotics, 23(5) (Special Issue on Human-Robot Interaction):840–851, October 2007.
[17] R. M. Willems, A. Özyürek, and P. Hagoort. When language meets action: The neural integration of gesture and speech. Cerebral Cortex, 2006.
[18] S. Wrede, M. Hanheide, S. Wachsmuth, and G. Sagerer. Integration and Coordination in a Cognitive Vision System. In Proceedings of the International Conference on Vision Systems. IEEE, 2006.