Proceedings of the 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems October 9 - 15, 2006, Beijing, China
A Real-Time Framework for Vision based Human Robot Interaction

Randeep Singh, KR School of Information Technology, Indian Institute of Technology, Bombay, Powai, Mumbai 400 076. Email: [email protected]
Bhartendu Seth, Dept. of Mechanical Engineering, Indian Institute of Technology, Bombay, Powai, Mumbai 400 076. Email: [email protected]
Uday Desai, Dept. of Electrical Engineering, Indian Institute of Technology, Bombay, Powai, Mumbai 400 076. Email: [email protected]
Abstract— Interactive mobile robots are an active area of research. This paper presents a framework for designing a real-time vision based hand-body dynamic gesture recognition system for such robots. The framework works in real-world lighting conditions with complex backgrounds, and can handle intermittent motion of the camera. We present a novel way in which the Motion History Image (MHI) and the Motion Energy Image (MEI) are built. We propose a robust combination of the motion and color cues, which we call the Motion Color Image (MCI). The input signal is captured using a monocular color camera; vision is the only feedback sensor used. It is assumed that the gesturer wears clothes that are slightly different from the background. Gestures are first learned offline and then matched in real time to the temporal data generated online. We have tested this on a gesture database consisting of 11 hand-body gestures and have recorded recognition accuracy of up to 90%. We have partially implemented and tested the system for Sony's Aibo robot dog using Sony's Remote Framework (RFW) SDK.
I. INTRODUCTION
Robots in the future will interact with users in a natural way using gestural and speech interfaces. These interfaces will be fast, intuitive and easy to learn for both the young and the elderly. Interactive and mobile robots will find widespread use in therapeutic, educational, social and amusement sectors. A robot that can respond to hand-body gestures of a user will enable a high level of personalized interaction. A typical example of this interaction would be the user pointing towards a corner of a room and that corner being cleaned by a vision enabled mobile robotic vacuum cleaner. Autonomous robots will be involved in exploration of and interaction with the environment. Robots equipped with sensors (such as IR, ultrasound, tactile and vision) will analyze the incoming data in order to generate a suitable response to the external stimulus. The vision sensor will mainly be used for obstacle detection and avoidance, object detection and tracking, world model building, wandering and learning behaviors, and interaction with users. We are interested in building a real-time vision based framework for dynamic gesture recognition to enable natural and intuitive human robot interaction. We have implemented an artificial vision system in which the robot performs a particular behavior in response to the user's gesture using the Active Vision methodology [1].
A hand-body gesture user interface should operate in real time, with a response time comparable to the conventional interfaces of a desktop computer. There are subtle differences between the way a user interacts with a computer and the way in which he will interact with the mobile robot of the future. The WIMP (Windows, Icons, Menus, and Pointing devices) interfaces of the desktop computer will not directly translate into similar interfaces for mobile robots. Clearly, we need technologies that are suited to the mobile robot domain.

Gesture recognition depends on the ability to separate an object from its background. In conventional approaches, this object segmentation requires a simple background, or the use of special markers affixed to the object to make it stand out. We perform gesture recognition without employing any abstract geometric model of the human body. Instead of a top-down approach, we use a feature-based bottom-up strategy that tracks the actual movements of the hands and body. Motion activity information is encoded into a continuous stream described by hand-body movements, learned offline and then recognized in real time.

The experimental setup includes a Sony Aibo ERS-7 connected to a PC through wireless LAN. The Aibo sends compressed JPEG images to our software, which uncompresses and analyzes them and sends motion control commands back to the Aibo. Consequently, the image processing work is done on the PC rather than on the Aibo. All the facilities for sending and receiving data are provided by the Remote Framework (RFW) SDK [2] from Sony.

II. RELATED RESEARCH
Pfinder [3] can track the user with blobs through the use of stereo vision and skin color using top-down, model-based human body tracking. For recognizing dynamic gestures, researchers have used various techniques ranging from Finite State Machines (FSM), Hidden Markov Models (HMM), Artificial Neural Networks (ANN) and Dynamic Time Warping (DTW) to other techniques reviewed in [4][5][6][7]. There are also many statistical learning techniques, such as discriminant analysis [8], the nearest centroid algorithm [9] and eigenspace methods [10], that have been applied to gesture recognition. There are some commercially available vision enabled smart toys such as Sony's Aibo [11]. Aibo is among the most advanced autonomous and interactive robots available.
Fig. 1. Vision enabled Smart Toys: (a) Aibo, (b) Cindy, (c) Robota.
Fig. 2. (a) Image Regions, (b) Fuzzy Input Categories.
Aibo's visual pattern recognition capabilities include face recognition, recognition of a supplied pink ball, the AIBOne and visual cards. Recent works on static gesture recognition for Aibo are [12] and [13]. Cindy Smart [14] is designed to be both entertaining and educational. Cindy's visual capabilities include reading alphabets, numbers, an analog clock, colors, pictures, and simple geometric shapes, but it is a stationary robot. Robota [15] is a social robot that can mirror the movements of the arms/head of a child. Robota recognizes postures (sitting, lifting up a leg/arm) of a stick figure drawn by the child on a drawing pad or on a touch screen and reproduces the posture of the stick figure by moving its arms and legs accordingly.
Fig. 3. Initialization Scheme.
A. Our Contributions
Our framework is simpler and easier to implement than many of the above schemes. It is real-time, can handle intermittent camera motion, works in cluttered backgrounds and uses monocular vision. To the best of our knowledge, this collection of capabilities has not been demonstrated earlier in a single framework.

III. ACTIVE VISION FOR ROBOT BEHAVIOR
Active vision is characterized by its specific, task-oriented design. The active vision paradigm closely integrates perception and action. It is therefore suited to putting vision into a behavioral context. It promotes the use of visual cues and attentional mechanisms, has a certain tolerance to errors, lacks elaborate internal representations of the 3D world and is therefore computationally less expensive. Active vision depends more on recognition than on reconstruction. Active vision has been successfully used on robots [16]. The robot behaviors are carefully calibrated such that the object of interest (the person in this case) remains inside the frame of view when the robot finishes performing a behavior. The robot behaviors are as follows:
• Japanese Bow: The robot comes forward, stops and then goes backwards.
• Yes: The robot comes forward, stops and then goes backwards; this cycle is repeated twice. This behavior is similar to head nodding in humans.
• No: The robot rotates clockwise (CW), stops, and then rotates counter clockwise (CCW); this cycle is repeated twice. This behavior is similar to a human shaking the head to say No.
• GoToLeft: The robot rotates CCW for some time, moves forward and then rotates CW.
• GoToRight: The robot rotates CW for some time, moves forward and then rotates CCW.
• TurnLeft: The robot rotates CCW until it completes one revolution.
• TurnRight: The robot rotates CW until it completes one revolution.
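To make the behavior descriptions above concrete, the sketch below shows one possible way to encode them as sequences of timed motion primitives and dispatch them when a gesture is recognized. This is an illustrative sketch only: the primitive names, durations and the `send_primitive` stub are our assumptions and are not calls from the RFW SDK.

```python
# Illustrative sketch: mapping recognized gestures to robot behaviors as
# sequences of (primitive, duration_s) pairs. The primitive names, durations
# and send_primitive() are hypothetical placeholders, not RFW SDK calls.

BEHAVIORS = {
    "JapaneseBow": [("forward", 1.0), ("stop", 0.5), ("backward", 1.0)],
    "Yes":         [("forward", 0.5), ("stop", 0.2), ("backward", 0.5)] * 2,
    "No":          [("rotate_cw", 0.5), ("stop", 0.2), ("rotate_ccw", 0.5)] * 2,
    "GoToLeft":    [("rotate_ccw", 0.8), ("forward", 1.0), ("rotate_cw", 0.8)],
    "GoToRight":   [("rotate_cw", 0.8), ("forward", 1.0), ("rotate_ccw", 0.8)],
    "TurnLeft":    [("rotate_ccw", 4.0)],   # roughly one full revolution
    "TurnRight":   [("rotate_cw", 4.0)],
}

def send_primitive(name: str, duration: float) -> None:
    """Placeholder for the actual motion command sent to the robot."""
    print(f"executing {name} for {duration:.1f} s")

def perform_behavior(gesture: str) -> None:
    """Execute the primitive sequence associated with a recognized gesture."""
    for primitive, duration in BEHAVIORS.get(gesture, []):
        send_primitive(primitive, duration)

perform_behavior("GoToLeft")
```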
A. Fuzzy Logic Control
We use a multi-input, multi-output (MIMO) fuzzy logic control system to keep the user centered in area ABCD in Fig. 2(a). The two inputs (Fig. 2(b)) are the X and Y coordinates of the center of gravity (C.G.) of the color blob. The outputs are the speed and direction of the head pan and tilt values for the Aibo. Six fuzzy rules are sufficient for control.

B. Ego Motion Detection
We use an idea derived from [17] to determine whether the Aibo is moving or stationary. The technique suspends the building of temporal data whenever the motion activity in any two of the regions Left (region 1), Top (region 2) or Right (region 3) in Fig. 2(a) is non-zero; a minimal sketch of this check is given below.

IV. IMAGE PROCESSING
The system is initialized by the user taking a walk in front of the stationary Aibo in such a way that he or she crosses the centerline of the image twice (Fig. 3). It has to be ensured that there is only one moving object in the image of size greater than 50 × 30 pixels. A model histogram of 256 bins of the user is stored, calculated from the moving pixels in the bounding box.
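Following up on the ego motion check of Section III-B, here is a minimal sketch in NumPy. The border width of the three regions is an assumed value, since the paper does not specify the region sizes.

```python
import numpy as np

def should_suspend(binary_motion: np.ndarray, border: int = 8) -> bool:
    """Ego motion check (Section III-B): suspend building temporal data when
    the motion activity in any two of the Left, Top and Right border regions
    of the binary difference image is non-zero.

    `binary_motion` is the thresholded difference image B(x, y, t) (0/1).
    The border width is an assumed value; the paper does not specify it."""
    h, w = binary_motion.shape
    left = binary_motion[:, :border].sum()
    top = binary_motion[:border, :].sum()
    right = binary_motion[:, w - border:].sum()
    active = sum(1 for activity in (left, top, right) if activity > 0)
    return active >= 2

# Usage: if should_suspend(B): skip pushing B into the temporal buffers.
```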
Fig. 4. Color Localization: (a) Original Image, (b) Back projected Image.
Fig. 5. Motion Activity: (a) LeftArmUp (top row), (b) StandUp (bottom row).
A. Color Localization
The color histogram [18] is well suited to this task because of its ability to implicitly capture complex, multimodal patterns of color. Moreover, because it disregards all geometric information, it remains relatively invariant to many complicated and non-rigid motions. For good results, the user has to wear clothes that are slightly different from the background, so that the back projected image (Fig. 4(b)) has more than 50% of its pixels inside the bounding box as compared to the rest of the image. At run time, the back projected image is calculated as the histogram intersection between the model histogram and the image histogram at each pixel location,

\[
\phi_c = \frac{\sum_{i=1}^{N} \min(I(i), M(i))}{\sum_{i=1}^{N} I(i)} \tag{1}
\]

where I(i) and M(i) are the numbers of pixels in the i-th bin, N is the total number of bins and \phi_c is the new value of the pixel in the back projected image, scaled between 0 and 255 as

\[
\phi_c = \frac{\phi_c - \min(\phi_c)}{\max(\phi_c) - \min(\phi_c)} \times 255. \tag{2}
\]

After initialization, the color window is tracked with Camshift [19].
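A minimal sketch of the back projection step of (1) and (2) is given below, using NumPy on a bin-quantized image. The 256-bin model histogram follows the initialization of Section IV. The library choice and the per-pixel reading of (1) (numerator evaluated at the pixel's own bin, denominator the total image histogram) are our assumptions; other interpretations, such as intersection over a local window, are possible.

```python
import numpy as np

N_BINS = 256  # model histogram size, as in Section IV

def back_project(quantized: np.ndarray, model_hist: np.ndarray) -> np.ndarray:
    """Back projection by histogram intersection, eqs. (1) and (2).

    `quantized` holds, for every pixel, the integer histogram bin index
    (0..N_BINS-1) of its color; `model_hist` is the 256-bin model histogram."""
    image_hist = np.bincount(quantized.ravel(), minlength=N_BINS).astype(float)
    # Eq. (1), read per bin: min(I(i), M(i)) / sum_i I(i).
    ratio = np.minimum(image_hist, model_hist) / max(image_hist.sum(), 1.0)
    phi = ratio[quantized]                    # look up each pixel's bin value
    # Eq. (2): min-max scale to 0..255.
    span = phi.max() - phi.min()
    if span == 0:
        return np.zeros(phi.shape, dtype=np.uint8)
    return ((phi - phi.min()) / span * 255.0).astype(np.uint8)
```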
B. Vision based Motion Detection
Our input video consists of a sequence of color frames in RGB space. We perform a color space conversion to a gray scale model,

\[
I(x, y, t) = \frac{I_r(x, y, t) + I_g(x, y, t) + I_b(x, y, t)}{3} \tag{3}
\]

where I_c(x, y, t) is channel c of the input frame at time t. The motion detection mechanism is based on change detection, the difference between two consecutive incoming frames. This difference is then thresholded to form a binary map that shows where there is a high likelihood of motion being present:

\[
D(x, y, t) = |I(x, y, t) - I(x, y, t - 1)| \tag{4}
\]

\[
B(x, y, t) =
\begin{cases}
1 & \text{if } D(x, y, t) > \tau \\
0 & \text{otherwise}
\end{cases} \tag{5}
\]

where I(x, y, t) is the processed frame, D(x, y, t) is the difference image, B(x, y, t) is the binary difference image at location (x, y) at time t, and \tau is the selected threshold. The threshold \tau is calculated dynamically to be equal to 10% of the highest occurring intensity value in the previous frame I(x, y, t - 1).

This low level processing does not necessarily guarantee that the captured motion will represent the motion in which we are interested. We assume here that there is very little movement of the camera between image captures. We detect when the robot is moving by ego motion detection and suspend processing of the data for gesture classification. We construct a view specific representation of the motion based on where motion has occurred and what its temporal characteristics are. The result of this space-time process collapses the time dimension to a static depiction of the action. These static representations are called Motion Energy Images (MEI) and Motion History Images (MHI). They are functions of the observed motion properties at the corresponding spatial image locations in the image sequence.

C. Motion Energy Image
A Motion Energy Image (MEI) [20] is a cumulative motion image. It is a simple but useful representation of the observed motion. It indicates the spatial locations where the motion occurred. It is calculated as in (6):

\[
E_\gamma(x, y, t) =
\begin{cases}
0 & \text{if } B(x, y, t_\gamma) = 0 \;\;\forall\, t_\gamma \in \{t - \gamma, \dots, t\} \\
1 & \text{otherwise}
\end{cases} \tag{6}
\]

where \gamma represents the duration of the time window used to capture the motion.

D. Motion History Image
Temporal characteristics of the motion are important when analyzing actions. The Motion History Image (MHI) [20] characterizes the temporal component of the action as in (7):

\[
H_\gamma(x, y, t) =
\begin{cases}
\gamma & \text{if } B(x, y, t) = 1 \\
\max(0, H_\gamma(x, y, t - 1) - 1) & \text{otherwise}
\end{cases} \tag{7}
\]

E. Motion Color Image
We propose a new image representation called the Motion Color Image (MCI), which is constructed by a bit-wise OR of the MEI over the 4 previous levels (γ = 4 in (6)) and the color localization data. This representation gives us a motion and color region from which to prepare a feature vector. This region of interest (ROI) is found by finding the connected components in the MCI. The MCI is important as it gives context to the motion data.
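A minimal sketch of the motion representations of (3)–(7) is given below, using NumPy. The dynamic threshold (10% of the previous frame's peak intensity), the 8-level circular buffer, the window γ = 4 and the OR-combination with the color mask follow the text; the function and variable names, and rebuilding the MHI from the buffer on every call, are our choices for illustration.

```python
import numpy as np
from collections import deque

GAMMA = 4                  # time window for the MEI / MCI (Section IV-E)
history = deque(maxlen=8)  # circular buffer of binary difference images (Section V-B)

def gray(frame_rgb: np.ndarray) -> np.ndarray:
    """Eq. (3): average of the R, G, B channels."""
    return frame_rgb.astype(float).mean(axis=2)

def binary_difference(curr: np.ndarray, prev: np.ndarray) -> np.ndarray:
    """Eqs. (4)-(5): thresholded absolute frame difference; tau is 10% of the
    highest intensity in the previous frame."""
    tau = 0.1 * prev.max()
    return (np.abs(curr - prev) > tau).astype(np.uint8)

def update(curr_rgb: np.ndarray, prev_rgb: np.ndarray, color_mask: np.ndarray):
    """Push the latest binary difference and return (MEI, MHI, MCI)."""
    b = binary_difference(gray(curr_rgb), gray(prev_rgb))
    history.append(b)
    recent = list(history)[-GAMMA:]
    mei = np.bitwise_or.reduce(np.stack(recent), axis=0)         # eq. (6)
    # Eq. (7): MHI rebuilt each call from the buffered differences.
    mhi = np.zeros_like(b, dtype=float)
    for past in recent:
        mhi = np.where(past == 1, GAMMA, np.maximum(0, mhi - 1))
    mci = np.bitwise_or(mei, (color_mask > 0).astype(np.uint8))  # Section IV-E
    return mei, mhi, mci
```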
Fig. 6. The complete framework for vision based Human Robot Interaction.
V. SYSTEM INTEGRATION
Our framework is a bottom-up framework that facilitates information sharing by passing information between different processes using object-oriented constructs. Fig. 6 shows the complete framework. It is divided into the Color, Motion, Temporal and Control processes.

A. Color Process
The input image obtained from the Aibo (tag 1 in Fig. 6; all tags refer to Fig. 6) is of dimension 208 × 159; we pad the image with extra rows and columns to make the dimension 224 × 160. The image is scaled down to 56 × 40 (tag 2) using pyramid vision techniques [21] to allow data reduction while still keeping the processing in real time; a sketch of this step follows at the end of this subsection. We had earlier tested with an image size of 112 × 80, but it did not give real-time performance. Note that all the sizes are multiples of 8; if not, we pad extra rows and columns to make them multiples of 8. This ensures byte alignment and hence fast processing. The user initializes the system as defined earlier. A histogram of the moving pixels is made, shown as the "blue" pixels of the pink shirt (tag 5). The next step shows the probability values of the pink color scaled between 0 and 255 according to (1) and (2). This image is binarized into image (tag 6) using a suitable threshold. The next step involves finding connected components of the color data. The colored boundaries in image (tag 6) (we recommend viewing the picture in color) represent the following:
• Yellow: The full bounding box of the color data.
• Red: The bounding box with the maximum color density.
• Blue: The color tracker box, implemented with inertia.
• Green: The median of the color data.
The control process moves the Aibo so as to keep the person's color data centered in the image.
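The padding and pyramid reduction described above can be sketched as follows with OpenCV, which we assume here purely for illustration (the paper does not name its image processing library); two pyramid levels take the padded 224 × 160 image down to 56 × 40. Replicating border pixels for the padding is also an assumed choice, since the paper only says that rows and columns are added.

```python
import cv2
import numpy as np

def reduce_frame(frame: np.ndarray) -> np.ndarray:
    """Pad a 208x159 Aibo frame to 224x160 and reduce it to 56x40.

    Padding replicates border pixels (an assumed choice) so that both
    dimensions suit the pyramid reduction; two pyrDown calls then halve
    each dimension twice: 224x160 -> 112x80 -> 56x40."""
    h, w = frame.shape[:2]                    # expected 159 x 208
    pad_bottom = 160 - h                      # 1 extra row
    pad_right = 224 - w                       # 16 extra columns
    padded = cv2.copyMakeBorder(frame, 0, pad_bottom, 0, pad_right,
                                cv2.BORDER_REPLICATE)
    small = cv2.pyrDown(cv2.pyrDown(padded))  # Gaussian pyramid reduction [21]
    return small                              # 40 rows x 56 columns
```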
B. Motion Process
The input images to the motion process are of dimension 56 × 40. The image differences between them from (5) (tag 2) are stored in a circular buffer of 8 levels (tag 3). This is slightly different from (7) and from what has been described earlier: we do not merge the motion difference data but maintain the data as separate binary images. A Motion Energy Image (MEI) using (6) (tag 4) is constructed over the 4 previous levels (γ = 4 in (6)) and this data is merged with the color localization data (tag 6) by a bit-wise OR.
Fig. 7. (a) Bounding Box Division, (b) Start/End of Gesture.
By the end of this step we get data which represents the motion and color data in a single image (tag 7). By finding connected components of the motion and color data we get a ROI from which to prepare a feature vector. We call this image the Motion Color Image (MCI) (tag 7). The red bounding box is the full bounding box of the motion color data. The feature vector is selected from within the bounding box calculated from the MCI. This section can be summarized as follows:
1) Compute a sequence of difference pictures from a sequence of images.
2) Compute the MHI, MEI and MCI.
3) Divide the motion color bounding box into 9 boxes.
4) Compute the motion pixel count of the motion data in the MHI in each of the 9 boxes.
5) Prepare the feature vector as the sum of motion pixels in each box for classification.
A sketch of steps 3)–5) follows.
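The following sketch illustrates steps 3)–5): the bounding box of the largest connected component of the MCI is split into a 3 × 3 grid and the motion pixels of the MHI are summed per cell, normalized by the total (as in Section VI). The NumPy implementation and function name are ours.

```python
import numpy as np

def grid_feature_vector(mhi: np.ndarray, bbox: tuple) -> np.ndarray:
    """Steps 3)-5): divide the MCI bounding box into a 3x3 grid and use the
    normalized motion activity of the MHI in each cell as a 9-dim feature.

    `bbox` is (x, y, width, height) of the largest connected component of
    the MCI; `mhi` is the motion history image."""
    x, y, w, h = bbox
    roi = (mhi[y:y + h, x:x + w] > 0).astype(float)   # motion pixels inside the box
    rows = np.array_split(roi, 3, axis=0)
    cells = [c for r in rows for c in np.array_split(r, 3, axis=1)]
    counts = np.array([c.sum() for c in cells])        # a_i per cell
    total = counts.sum()
    return counts / total if total > 0 else counts     # a_i / sum_i a_i
```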
Fig. 8. (a) Typical Stream, (b) LeftArmUp, (c) StandUp.
C. Control Process
The control process is responsible for calculating the Aibo's command parameters and for the following tasks performed by the Aibo:
• Track the user.
• Behave appropriately in response to the user's action.

VI. TEMPORAL FEATURE VECTOR
Let I_t be the current image (converted into gray scale), I_{t-1} be the previous image and I_{t-n} be the input image at time (t - n). The current 1-bit inter-frame difference image b_t is defined as I_t - I_{t-1}, and b_{t-n} as I_{t-n} - I_{t-n-1}. Another 1-bit image B_t is defined as b_t, and B_{t-n} is defined as b_{t-n} | B_{t-(n-1)}, where | denotes the bit-wise OR operator. Both b_{t-n} and B_{t-n} are calculated and maintained as separate images up to n = 8. This procedure is different from the conventional building of the MHI and MEI [22], as mentioned earlier. The bounding box for the motion color data is obtained from the biggest connected component of the MCI. The bounding box is divided into 9 equal rectangular regions as shown in Fig. 7(a). The start and end of a gesture are determined with the help of the motion activity in the image B_{t-4}. The gesture-start threshold is greater than the gesture-end threshold by δ (Fig. 7(b)); a minimal sketch of this check is given below. The motion activity is calculated from the corresponding image regions of b_n. The motion activity in each smaller region is normalized by dividing by the total motion activity of the bounding box (a_i / Σ_i a_i). Fig. 8(a) shows the temporal motion activity of the 9 regions in separate graphs. Figs. 8(b) and 8(c) show many instances of the same gesture overlaid on the same graph.
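A minimal sketch of the start/end detection with hysteresis described above follows. The two threshold values are illustrative assumptions; the paper specifies only that the start threshold exceeds the end threshold by δ.

```python
# Sketch of gesture start/end detection with hysteresis (Fig. 7(b)).
# The motion activity of B_{t-4} is compared against two thresholds; the
# start threshold exceeds the end threshold by delta. The numeric values
# here are illustrative assumptions, not taken from the paper.

END_THRESHOLD = 40.0            # assumed value (motion pixels)
DELTA = 20.0                    # the paper's hysteresis gap delta
START_THRESHOLD = END_THRESHOLD + DELTA

class GestureSegmenter:
    """Tracks whether a gesture is currently in progress."""

    def __init__(self) -> None:
        self.in_gesture = False

    def update(self, motion_activity: float) -> str:
        """Feed the motion activity of B_{t-4}; returns 'start', 'end' or 'none'."""
        if not self.in_gesture and motion_activity > START_THRESHOLD:
            self.in_gesture = True
            return "start"
        if self.in_gesture and motion_activity < END_THRESHOLD:
            self.in_gesture = False
            return "end"
        return "none"
```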
TABLE I
PRELIMINARY RECOGNITION RATE

Gesture         | % of wrongly classified patterns | % of false classified patterns | Recognition rate in %
Bow             | 3  | 10 | 87
BowUp           | 20 | 14 | 66
BowDown         | 20 | 10 | 70
RightArmUp      | 24 | 16 | 60
RightArmDown    | 15 | 10 | 75
RightArmUpDown  | 5  | 4  | 91
LeftArmUp       | 6  | 4  | 90
LeftArmDown     | 10 | 5  | 85
LeftArmUpDown   | 20 | 14 | 66
SitDown         | 6  | 5  | 89
StandUp         | 10 | 2  | 88
VII. CLASSIFICATION OF TIME SERIES INTO GESTURES
For classification of multivariate time series into gesture classes we use the TClass software developed in [23]. This software is built on Weka [24]. With it, we are able to generate a decision tree that can classify the gesture classes in real time. This is described in earlier work [25]. We are currently performing extensive tests to further validate our framework.
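The paper relies on TClass [23] over Weka [24] for this step. As a rough stand-in (not the authors' TClass pipeline), the sketch below shows the general idea: collapse each 9-region temporal motion activity stream into a fixed-length summary and fit a decision tree. The summarization and the use of scikit-learn are our assumptions for illustration only.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def summarize(stream: np.ndarray) -> np.ndarray:
    """Collapse a (frames x 9) motion activity stream into a fixed-length
    vector (mean and max per region). This is a simple stand-in for the
    metafeatures that TClass extracts from multivariate time series."""
    return np.concatenate([stream.mean(axis=0), stream.max(axis=0)])

def train(streams: list, labels: list) -> DecisionTreeClassifier:
    """Fit a decision tree on the summarized gesture streams."""
    X = np.stack([summarize(s) for s in streams])
    clf = DecisionTreeClassifier(max_depth=6, random_state=0)
    return clf.fit(X, labels)

# At run time: clf.predict(summarize(current_stream).reshape(1, -1))
```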
VIII. CONCLUSION
The recognition rates (Table I) achieved with this system are satisfactory. The recognition rate is highly dependent on the training data. We are validating our method with more experimental data and with comparisons to existing approaches. We are currently extending the system to 40 gestures. The tracker module relies heavily on the color of the user's clothes; we will be enhancing this with the use of skin color. In the long run, we will develop a continuous action-perception cycle between the robot and its human user in service system domains, where the framework described here will be the key building block. The results of this paper constitute an important building block for the design of a real-time architecture that can work with a large vocabulary of gestures and remain user independent. Recent trends in the robotics industry have given impetus to the development of interactive mobile robots, and it is not difficult to foresee robots interacting with humans in a natural and intuitive way. These natural interfaces will find use in toy, amusement, assistive, and geriatric care robots. Our work is a small step in that direction.

REFERENCES
[1] A. Blake and A. Yuille, Active Vision. Cambridge, Massachusetts: MIT Press, 1992.
[2] http://openr.aibo.com/openr/en/.
[3] C. Wren, A. Azarbayejani, T. Darrell, and A. Pentland, "Pfinder: Real-time tracking of the human body," MIT, Cambridge, MA, Tech. Rep. Media Lab Technical Report 353, 1995.
[4] V. Pavlovic, R. Sharma, and T. Huang, "Visual interpretation of hand gestures for human-computer interaction: a review," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 7, pp. 677–695, 1997.
[5] A. Corradini and H.-M. Gross, "Camera-based gesture recognition for robot control," in Neural Networks, 2000. IJCNN 2000, Proceedings of the IEEE-INNS-ENNS International Joint Conference on, vol. 4, 2000, pp. 133–138.
[6] M. Shah and R. Jain, Motion-Based Recognition. Kluwer Academic Publishers, 1997.
[7] T. Kobayashi and S. Haruyama, "Partly-hidden Markov model and its application to gesture recognition," in Acoustics, Speech, and Signal Processing, 1997. ICASSP-97., 1997 IEEE International Conference on, vol. 4, 1997, pp. 3081–3084.
[8] Y. Cui and J. Weng, "Hand sign recognition from intensity image sequences with complex backgrounds," in Automatic Face and Gesture Recognition, 1996., Proceedings of the Second International Conference on, Killington, VT, 1996, pp. 259–264.
[9] R. Polana and R. Nelson, "Low level recognition of human motion (or how to get your man without finding his body parts)," in Motion of Non-Rigid and Articulated Objects, 1994., Proceedings of the 1994 IEEE Workshop on, Austin, TX, 1994, pp. 77–82.
[10] T. Watanabe and M. Yachida, "Real time gesture recognition using eigenspace from multi-input image sequences," in Automatic Face and Gesture Recognition, 1998. Proceedings. Third IEEE International Conference on, Nara, 1998, pp. 428–433.
[11] http://www.us.aibo.com/.
[12] M. Hasanuzzaman, T. Zhang, V. Ampornaramveth, P. Kiatisevi, Y. Shirai, and H. Ueno, "Gesture based human-robot interaction using a frame based software platform," in Systems, Man and Cybernetics, 2004 IEEE International Conference on, vol. 3, 2004, pp. 2883–2888.
[13] S. Radhakrishnan and W. D. Potter, "Gesture recognition on Aibo," unpublished.
[14] http://www.manleytoyquest.com/.
[15] A. Billard, "Robota: Clever toy and educational tool," Robotics and Autonomous Systems, 2003.
[16] J. Riekki and Y. Kuniyoshi, "Architecture for vision-based purposive behaviors," in Intelligent Robots and Systems 95. 'Human Robot Interaction and Cooperative Robots', Proceedings. 1995 IEEE/RSJ International Conference on, vol. 1, Pittsburgh, PA, 1995, pp. 82–89.
[17] J.-G. Kim, H. S. Chang, J. Kim, and H.-M. Kim, "Efficient camera motion characterization for MPEG video indexing," in Multimedia and Expo, 2000. ICME 2000. 2000 IEEE International Conference on, vol. 2, New York, NY, 2000, pp. 1171–1174.
[18] M. Swain and D. Ballard, "Color indexing," International Journal of Computer Vision, vol. 7, no. 1, pp. 11–32, 1991.
[19] G. Bradski, "Real time face and object tracking as a component of a perceptual user interface," in Applications of Computer Vision, 1998. WACV '98. Proceedings., Fourth IEEE Workshop on, Princeton, NJ, 1998, pp. 214–219.
[20] A. Bobick and J. Davis, "Real-time recognition of activity using temporal templates," in Applications of Computer Vision, 1996. WACV '96., Proceedings 3rd IEEE Workshop on, Sarasota, FL, 1996, pp. 39–42.
[21] P. Burt, "Smart sensing within a pyramid vision machine," Proceedings of the IEEE, vol. 76, no. 8, pp. 1006–1015, 1988.
[22] G. Bradski and J. Davis, "Motion segmentation and pose recognition with motion history gradients," in Applications of Computer Vision, 2000, Fifth IEEE Workshop on., Palm Springs, CA, 2000, pp. 238–244.
[23] M. W. Kadous, "Learning comprehensible descriptions of multivariate time series," in Machine Learning: Proceedings of the Sixteenth International Conference (ICML '99), I. Bratko and S. Dzeroski, Eds. Morgan-Kaufmann, 1999, pp. 454–463.
[24] I. H. Witten and E. Frank, Data Mining: Practical machine learning tools with Java implementation. San Francisco: Morgan Kaufmann, 2000.
[25] R. Singh, B. Seth, and U. Desai, "A framework for real time gesture recognition for interactive mobile robots," in IEEE International Symposium on Circuits and Systems (ISCAS), Kobe, Japan, 2005, pp. 3207–3210.