Continuous Online Gesture Recognition Based on Hidden Markov Models

Stefan Eickeler and Gerhard Rigoll
Faculty of Electrical Engineering - Computer Science, Gerhard-Mercator-University Duisburg, 47057 Duisburg, Germany
{eickeler,rigoll}@fb9-ti.uni-duisburg.de
Abstract
This paper presents the extension of an existing vision-based gesture recognition system using Hidden Markov Models (HMMs). Several improvements have been carried out in order to increase the capabilities and the functionality of the system. These improvements include position-independent recognition, rejection of unknown gestures, and continuous online recognition of spontaneous gestures. We show that especially the latter requirement is highly complicated and demanding if we allow the user to move in front of the camera without any restrictions and to perform gestures spontaneously at any arbitrary moment, because gesture detection and the additional distinction between movements and gestures become extremely difficult in this case. We present novel solutions to this problem by modifying the HMM-based decoding process and by introducing online feature extraction and evaluation methods. The result is a user-friendly, high-performance gesture recognition system that identifies spontaneous dynamic gestures in isolated or connected mode and has real-time capabilities.

Keywords: online gesture recognition, Hidden Markov Models
1 Introduction

Gesture recognition has emerged as one of the most important research areas in the field of motion-based image processing and recognition. The potential of vision-based human-computer interfaces for visual communication has increased dramatically during recent years, and gesture recognition systems promise to play a major role in these scenarios. Although the problem of gesture recognition can be facilitated by using sensors for tracking body parts, or data gloves, the most desirable interface would clearly be a purely vision-based gesture recognition system, because this would lead to a largely unrestricted visual communication process for the user. Identification of gestures using only image information is, however, the most demanding and challenging approach to gesture recognition.

As in other research areas, various basic approaches to gesture recognition have been established. One major distinction can be made between static and dynamic gesture recognition. Recognition of static gestures mostly involves the identification of hand gestures, using classical image processing algorithms based on segmentation and template matching techniques. Dynamic gesture recognition approaches often involve hand and arm or even body gestures, performed as a characteristic motion of the particular body parts. In this case, pattern recognition methods for time-varying dynamic features have to be used in order to evaluate the motion information in the video sequences.

The following section gives a brief introduction to dynamic gesture recognition systems and presents the system developed by our research group. As will be shown, this system already represents a high technical standard for gesture recognition and obtains satisfactory recognition rates for a fairly challenging person-independent recognition task consisting of 24 gestures. However, there is still a long way to go before such a system is mature enough to be accepted by a wide range of users in a real-world application. Besides improved robustness and recognition accuracy, one of the most urgently needed properties is increased flexibility and user-friendly behavior of the system, in order to facilitate the visual communication process. In the following sections we present a few new approaches that might bring us a few steps closer to this goal.

1.1 Dynamic gesture recognition systems
In the research area of dynamic gesture recognition, Hidden Markov Models are among the most widely used methods: the movements of a person over a sequence of images are classified. The first approach to the recognition of human movements based on Hidden Markov Models is described in [7]. It distinguishes between six different tennis strokes. This system separates the person from the background, divides the image into meshes, and counts the number of person pixels in each mesh. The feature of each mesh is part of a feature vector, which is converted to a discrete label by a vector quantizer. The label sequence of a movement is classified with discrete HMMs. [5] presents a system to recognize three different hand gestures composed of four hand poses. The system is based on Hidden Markov Models and Kalman filters. The features used are the width and height of a bounding box of the hand.
Figure 1. Samples of the 24 gestures: Hand-Waving-Both, Hand-Waving-Right, Hand-Waving-Left, To-Right, To-Left, To-Top, To-Bottom, Round-Clockwise, Round-Counterclockwise, Stop, Come, Nod-No, Clapping, Kowtow, Spin, Go-Left, Go-Right, Turn-Right, Turn-Left, Draw-A, Draw-B, Draw-C, Draw-D, Nod-Yes
The system described in [6] is capable of recognizing 40 different connected, person-dependent gestures of American Sign Language. This system uses colored gloves to track the hands of the user, but can also track the hands without the help of gloves. The position and orientation of the hands are used for the classification.

1.2 Baseline System
The work presented in this paper is based on our gesture recognition system presented in [2, 3]. The system is able to recognize 24 different gestures with an accuracy of 92.9%; see Fig. 1 for samples of the 24 gestures. The system operates in person- and background-independent mode and is six times faster than real-time. It was demonstrated for the first time at the Hannover Industrial Fair in April 1996 and worked very reliably even with inexperienced users at the fair. The system consists of a preprocessing, a feature extraction, and a classification module. After the preprocessing of the image sequence, the feature extraction module calculates a feature vector sequence. In the final step, the classification module recognizes the feature sequence.
The preprocessing is based on the difference image of adjacent frames, which is a good indication of moving objects in the image:

$$D_t(x, y) = I_t(x, y) - I_{t-1}(x, y) \tag{1}$$

where $I_t$ denotes the frame captured at time $t$.
Moments of the difference image (e.g. center of gravity) are used as features to specify the movement. One feature vector is calculated for each frame of the image sequence. The main features are:
center of motion:

$$\left(\bar{x}_t,\ \bar{y}_t\right) = \left(\frac{\sum_{x,y} x\,|D_t(x,y)|}{\sum_{x,y} |D_t(x,y)|},\ \frac{\sum_{x,y} y\,|D_t(x,y)|}{\sum_{x,y} |D_t(x,y)|}\right) \tag{2}$$

mean absolute deviation from the center of motion:

$$\left(\sigma_{x,t},\ \sigma_{y,t}\right) = \left(\frac{\sum_{x,y} |x - \bar{x}_t|\,|D_t(x,y)|}{\sum_{x,y} |D_t(x,y)|},\ \frac{\sum_{x,y} |y - \bar{y}_t|\,|D_t(x,y)|}{\sum_{x,y} |D_t(x,y)|}\right) \tag{3}$$

intensity of motion:

$$i_t = \frac{1}{N} \sum_{x,y} |D_t(x,y)| \tag{4}$$

where $N$ is the number of pixels per frame.
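To make the feature definitions above concrete, the following is a minimal sketch of the feature extraction for one frame, assuming grayscale frames stored as NumPy arrays. The function name, the `eps` guard against motionless frames, and the exact normalization are our illustrative choices, not taken verbatim from the system.

```python
import numpy as np

def motion_features(prev_frame, frame, eps=1e-8):
    # Eq. (1): difference image of adjacent frames
    d = np.abs(frame.astype(np.float64) - prev_frame.astype(np.float64))
    total = d.sum() + eps                     # normalizer; eps guards empty motion

    ys, xs = np.mgrid[0:d.shape[0], 0:d.shape[1]]

    # Eq. (2): center of motion (center of gravity of the difference image)
    cx = (xs * d).sum() / total
    cy = (ys * d).sum() / total

    # Eq. (3): mean absolute deviation from the center of motion
    sx = (np.abs(xs - cx) * d).sum() / total
    sy = (np.abs(ys - cy) * d).sum() / total

    # Eq. (4): intensity of motion (mean absolute difference per pixel)
    i = d.sum() / d.size

    return np.array([cx, cy, sx, sy, i])
```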
The feature vector sequence is classified by the Viterbi algorithm, based on Hidden Markov Models [1] that were previously trained on sample data. See [3] for a detailed explanation of our gesture recognition system. The recognition system has three main restrictions. The first restriction is the position dependence, requiring the user to be positioned in the center of the image. The second restriction is the lack of a method to reject unknown gestures. The third is the need for a second operator, because the recognition is based on image sequences of three seconds' length: the second user has to start the recognition of an image sequence by pressing a button. The goal of this paper is to present an extended system that eliminates the described restrictions. The new system is capable of position-independent and continuous online recognition with a minimum time delay of the recognition output.
2 Position independent recognition

One restriction of our previous system is that the user has to be in the center of the image to obtain the best recognition rates. This is a direct result of the fact that the above-mentioned features are global motion features, which can be easily calculated for the entire image without any image segmentation procedure. The drawback of this approach is that these features indicate the absolute position of the motion in the image and are therefore position dependent. Several approaches have been tested in order to obtain a position-independent recognition while keeping the advantage of the simple and robust feature extraction.

2.1 Delta features for the center of motion
# " 8 #" # "
An obvious way to obtain a position independent recognition is to use the delta features of the center of motion , according to the equations:
#5 #5 #5 /
(5)
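As a small illustration of Eq. (5), the delta features can be obtained from the per-frame centers of motion as frame-to-frame differences; the function name is ours.

```python
def delta_center(centers):
    # centers: list of (cx, cy) tuples, one per frame;
    # returns one delta pair per adjacent frame pair
    return [(cx - pcx, cy - pcy)
            for (pcx, pcy), (cx, cy) in zip(centers, centers[1:])]
```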
This method worsens the recognition of the gestures, because gestures that consist of the same movement but at different positions relative to the body cannot be distinguished (e.g. waving with the right hand and waving with the left hand). It turned out that a position-independent recognition is not feasible by just modifying our usual global features with some postprocessing method. Instead, it is necessary to find a method that relates these features to the current position of the user and calculates relative features from this information.

2.2 Region of interest
This method is based on the assumption that the complete body of a person performing a gesture is always moving slightly, so that the person can be found by detecting these changing pixels. The sum of all difference images of the sequence is calculated and binarized by applying a threshold that depends on the motion intensity of the image sequence. The bounding box of the pixels above the threshold is the region of interest (ROI). Fig. 2 shows the region of interest for a sample sequence of the gesture STOP. The moments of the difference image are scaled and repositioned depending on the ROI. The region of interest results in a very good position-independent gesture recognition, but it can only be applied to complete sequences, because the ROI has to be calculated over all frames of the sequence. For spontaneous online recognition, the input stream would have to be segmented into isolated gesture sequences prior to the calculation of the region of interest. Therefore, the ROI method is not very suitable for this recognition mode. To achieve a continuous online recognition we need a person tracking algorithm that is based on a single frame of the image sequence.
Figure 2. Region of interest for the gesture STOP
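The ROI computation can be sketched as follows; the 10% relative threshold is an assumption, since the text only states that the threshold depends on the motion intensity of the sequence.

```python
import numpy as np

def region_of_interest(diff_images, rel_threshold=0.1):
    # sum of all difference images of the sequence
    acc = np.sum(np.abs(diff_images), axis=0)
    # binarize with a threshold that grows with the overall motion intensity
    mask = acc > rel_threshold * acc.max()
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return None                     # no motion in the whole sequence
    # bounding box of the above-threshold pixels = region of interest
    return xs.min(), ys.min(), xs.max(), ys.max()
```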
2.3 Person tracking
An obvious approach to obtaining position-independent features would be person tracking: from the information about the person's position, relative motion features could easily be calculated. However, accurate person tracking is usually a complicated and time-consuming process, which is somewhat counterproductive to our goal of real-time gesture recognition. Person tracking has emerged as an important research field in image sequence processing, and for the time being we hesitated to additionally incorporate it into our gesture recognition activities. An alternative way to track a person in an image is to compare the current image with a stored image of the background. The absolute difference image is calculated, and pixel values below a relatively high threshold are set to zero to eliminate the shadows of the person and noise in the images. For this difference image we calculate the center of gravity according to Eq. (2); this center corresponds to the center of the person. The position of the center of motion relative to the center of the person is used in the feature vector. This is only a simple way of person tracking, but it is sufficient for gesture recognition. To increase the accuracy of the person tracking technique, we use a method to correct changes in image brightness caused by the auto-shutter function of the camera. The tracking can be further improved by using the chrominance information of the images instead of the luminance information, because shadows are only darkened parts of the background and have almost no influence on the chrominance. Updating the stored background image with an estimate of the current background can increase the ability of the system to work over a long time, and Kalman filters are capable of improving the person tracking further.
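A hedged sketch of this background-difference tracking follows; the threshold value and function names are illustrative, and the brightness and chrominance corrections mentioned above are omitted.

```python
import numpy as np

def person_center(frame, background, threshold=40.0):
    # absolute difference against the stored background image
    d = np.abs(frame.astype(np.float64) - background.astype(np.float64))
    d[d < threshold] = 0.0            # suppress shadows and camera noise
    total = d.sum()
    if total == 0.0:
        return None                   # no person visible
    ys, xs = np.mgrid[0:d.shape[0], 0:d.shape[1]]
    # center of gravity of the thresholded difference (cf. Eq. 2)
    return (xs * d).sum() / total, (ys * d).sum() / total
```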
Gesture models:  Hand-Waving-Both, Hand-Waving-Left, Hand-Waving-Right, To-Right, To-Left, To-Top, To-Bottom, Round-Clockwise, Round-Counterclockwise, Stop, Come, Clapping, Kowtow
Filler model 1:  Go-Left, Go-Right, Turn-Right, Turn-Left
Filler model 2:  Nod-Yes, Nod-No, Spin, Draw-A, Draw-B, Draw-C, Draw-D

Table 1. Gesture models and filler models
3 Rejection of undefined gestures

The rejection of gestures which are not defined in the gesture vocabulary is very important for the use of the recognition system in real-world applications. The HMM approach provides an easy solution to this problem by using so-called "filler models" for modeling arbitrary movements and other "garbage" motions; in speech recognition, filler models are used for keyword spotting based on Hidden Markov Models [4]. We separated the gestures of Fig. 1 into a gesture vocabulary and gestures used to train the filler models, which are defined to represent all other movements (Tab. 1). The remaining gesture vocabulary consists of 13 different gestures, one filler model for movements of the whole body, and one filler model for head and arm movements. Fig. 3 shows the network which connects the models for the recognition. A test of this rejection technique showed that it gives good results for our purposes.
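In outline, the rejection works like keyword spotting: all models compete, and a sequence is rejected whenever a filler model wins. The sketch below assumes a `score` function returning an HMM log-likelihood (e.g. from the Viterbi algorithm); the names are hypothetical.

```python
def classify_with_rejection(features, gesture_hmms, filler_hmms, score):
    # let gesture models and filler models compete on the same sequence
    best_name, best_ll = None, float("-inf")
    for name, hmm in {**gesture_hmms, **filler_hmms}.items():
        ll = score(hmm, features)
        if ll > best_ll:
            best_name, best_ll = name, ll
    # a winning filler model means the movement is "garbage" -> reject
    return None if best_name in filler_hmms else best_name
```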
4 Continuous online gesture recognition

The main requirement for the use of gesture recognition in human-computer interfaces is a continuous online recognition with a minimum time delay of the recognition output. In this case, the user is able to position himself in front of the camera and to perform a gesture spontaneously, without the necessity of initializing the recognition procedure by a second person. He should then be able to move around in front of the camera without the system falsely detecting a gesture (which has to be accomplished by the filler models described in the previous section), and to perform a gesture at any arbitrary moment.
Figure 3. Network for recognition (gesture models and the two filler models connected in parallel; a loop-back transition is used only for continuous recognition)
It is obvious that this is the ultimate vision-based interface the user desires, but such a system performance is extremely difficult to achieve, for the following reasons. The fact that the starting and ending points of a gesture are not known is a very complicated problem: in our case, the Viterbi decoder used for recognition cannot process a fixed number of frames and assign this sequence to one of the competing classes in the usual way. The problem becomes even more complicated because the duration of the sequence cannot be determined by simple methods such as motion detection for the start and end of the sequence. This is due to the fact that two gestures can be connected, and the fact that we allow "garbage" movements (e.g. repositioning, turning) which can occur directly before or after a gesture. Therefore, the gesture system has to detect the start and end of a gesture online, and has the additional burden of distinguishing between garbage and real gestures. This can only be achieved by a direct modification of the decoding process. In addition, the three modules, which were executed one after the other in our old system, have to work synchronously: a frame captured by the framegrabber has to be immediately preprocessed, and the features have to be extracted and used for the classification before the next frame is captured. We developed two methods for online recognition.

4.1 Online backtracking
Online recognition is performed by a modification of the Viterbi algorithm that we call online backtracking. In online backtracking we calculate the temporary most likely state sequence (optimal path) of the recognition network (see Fig. 3) at every incoming feature vector of the classification module.
Figure 4. Online backtracking to determine the optimal state sequence
The major problem here is to reliably estimate the final most likely path, although the end of the sequence has not yet been reached. Appending further observations to the temporary optimal path has no influence on the beginning of the optimal path. Fig. 4 shows a typical evolution of the optimal path for a linear Hidden Markov Model: an increasing part of the temporary optimal path is equal to the final most likely state sequence, and only the end of the temporary optimal path changes. This means that we can determine the recognition result of the gestures up to a certain time. Fig. 5 shows the correspondence between the temporary optimal path and the final optimal path for a sample gesture sequence. The length of the temporary optimal path is equal to the number of time steps up to this point; the correspondence is the number of equal states of the final optimal path and the temporary optimal path. The important information is the difference between the length of the path and the correspondence with the final path: it shows the delay that has to be used to obtain the final optimal path from the temporary path. One can see that the optimal state sequence can be determined up to about 150 frames before the current time without the need for undoing any false assumptions. 150 frames at a frame rate of 12.5 fps correspond to a time delay of 12 seconds. This delay is too high for human-computer interaction, but the method can be used for some special cases of gesture recognition (e.g. sign language recognition). Furthermore, the time delay can be reduced if the system displays preliminary assumptions, which may be replaced by the final recognition result if they turn out to be wrong.
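The following sketch illustrates the idea: a standard Viterbi recursion in log space is extended so that after every frame the temporary optimal path is backtracked and compared with the previous one, and the prefix on which consecutive paths agree is committed as final. Committing on the agreement of two consecutive paths is a simplification of the delay-based behavior described above; `log_pi`, `log_A`, and `log_b` are assumed HMM parameters in log space, not names from the original system.

```python
import numpy as np

def online_backtrack(observations, log_pi, log_A, log_b):
    # log_pi: (n_states,) initial log probs; log_A: (n_states, n_states)
    # transition log probs; log_b(state, obs): emission log density
    n_states = len(log_pi)
    committed, prev_path = [], []
    delta, psi = None, []
    for t, obs in enumerate(observations):
        emit = np.array([log_b(s, obs) for s in range(n_states)])
        if t == 0:
            delta = np.asarray(log_pi, dtype=float) + emit
            psi.append(np.zeros(n_states, dtype=int))   # placeholder, unused
        else:
            scores = delta[:, None] + log_A             # scores[i, j]: i -> j
            psi.append(scores.argmax(axis=0))           # best predecessor of j
            delta = scores.max(axis=0) + emit
        # backtrack the temporary optimal path from the currently best state
        path = [int(delta.argmax())]
        for tau in range(t, 0, -1):
            path.append(int(psi[tau][path[-1]]))
        path.reverse()
        # commit the prefix on which consecutive temporary paths agree
        k = len(committed)
        while k < len(prev_path) and k < len(path) and prev_path[k] == path[k]:
            committed.append(path[k])
            k += 1
        prev_path = path
    return committed, prev_path
```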
Figure 5. Correspondence between final optimal path and online optimal path
4.2 Continuous isolated gesture recognition
We developed an alternative method to reduce the time delay of the resulting output. We calculate the probability of the feature vector sequence at each time step for the gesture and filler models, based on the forward algorithm. Based on our experiments with this method, we defined the following condition for the moment of recognition: if the probabilities of the observation sequence for all models are decreasing during a short interval, the most likely model at the end of this interval is the recognition result. Then we reset the forward algorithm and restart this procedure. Fig. 6 shows the logarithmic probabilities for some of the models during a sequence of six gestures. At the beginning of the gestures (time steps 0, 19, 42, ...) the probabilities for the models are zero (the logarithmic probability is $-\infty$), and after a number of time steps, which depends on the number of states in the linear models, the probabilities increase rapidly (time steps 10, 30, 53, ...). Then the probabilities of the observation sequence for the models decrease in most cases; this can be seen, for instance, at time steps 40 and 125 in Fig. 6. If all probabilities are decreasing, we assume that the model with the highest probability is the correctly classified model. Tests showed that this method gives surprisingly good recognition results and has no noticeable time delay.
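A hedged sketch of this decision rule follows; `forward_step` (one step of the forward algorithm for a single model, starting from `None`) and the window length are our assumptions, not the paper's exact implementation.

```python
def continuous_isolated_recognition(frames, models, forward_step, window=5):
    results = []
    state = {name: None for name in models}       # per-model forward variables
    history = {name: [] for name in models}       # recent log-likelihoods
    for obs in frames:
        for name, model in models.items():
            state[name], ll = forward_step(model, state[name], obs)
            history[name] = (history[name] + [ll])[-window:]
        # condition for the moment of recognition: every model's
        # log-likelihood has been strictly decreasing over the window
        decreasing = all(
            len(h) == window and all(a > b for a, b in zip(h, h[1:]))
            for h in history.values()
        )
        if decreasing:
            best = max(history, key=lambda n: history[n][-1])
            results.append(best)                  # emit recognition result
            state = {name: None for name in models}    # reset forward pass
            history = {name: [] for name in models}
    return results
```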
Figure 6. Model likelihood during continuous isolated gesture recognition (curves shown for Hand-Waving-Both, To-Right, To-Left, To-Top, Clapping, Kowtow, and Filler 1)
5 Conclusions and future work

We presented an extension of an existing gesture recognition system. The new capabilities of the system are recognition independent of the user's position in the frame, rejection of undefined gestures by filler models, and continuous online recognition of spontaneous gestures. We were able to keep all advantages of the old system while adding a considerable amount of new and important functionality. The system is able to recognize dynamic gestures in person- and background-independent mode and works several times faster than real-time. Further work will focus on the recognition of gestures of a rotated person and gestures of a person walking in the field of view.
References

[1] L. R. Rabiner. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proc. of the IEEE, 77(2):257-285, 1989.
[2] G. Rigoll and A. Kosmala. New Improved Feature Extraction Methods for Real-Time High Performance Image Sequence Recognition. In Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), pages 2901-2904, Munich, Apr. 1997.
[3] G. Rigoll, A. Kosmala, and S. Eickeler. High Performance Real-Time Gesture Recognition Using Hidden Markov Models. In Proc. Gesture Workshop, Bielefeld, Germany, Sept. 1997.
[4] R. C. Rose and D. B. Paul. A Hidden Markov Model Based Keyword Recognition System. In Proc. ICASSP '90, pages 129-132, Albuquerque, NM, Apr. 1990.
[5] J. Schlenzig, E. Hunter, and R. Jain. Recursive Identification of Gesture Inputs Using Hidden Markov Models. In Proc. of Workshop on Applications of Computer Vision, pages 187-194, Dec. 1994.
[6] T. Starner and A. Pentland. Visual Recognition of American Sign Language Using Hidden Markov Models. In International Workshop on Automatic Face and Gesture Recognition, Zurich, Switzerland, 1995.
[7] J. Yamato, J. Ohya, and K. Ishii. Recognizing Human Action in Time-Sequential Images Using Hidden Markov Models. In Proc. of Computer Vision and Pattern Recognition (CVPR), pages 379-385, Champaign, IL, June 1992.