Hidden Markov Model Based Continuous Online Gesture Recognition

Stefan Eickeler, Andreas Kosmala, Gerhard Rigoll
Faculty of Electrical Engineering - Computer Science
Gerhard-Mercator-University Duisburg
47057 Duisburg, Germany
feickeler,[email protected]

Abstract

This paper presents the extension of an existing vision-based gesture recognition system using Hidden Markov Models (HMMs). Several improvements have been carried out in order to increase the capabilities and the functionality of the system. These improvements include position-independent recognition, rejection of unknown gestures, and continuous online recognition of spontaneous gestures. We show that especially the latter requirement is highly complicated and demanding if we allow the user to move in front of the camera without any restrictions and to perform the gestures spontaneously at any arbitrary moment. We present novel solutions to this problem by modifying the HMM-based decoding process and by introducing online feature extraction and evaluation methods.

1 Introduction

Gesture recognition has emerged as one of the most important research areas in the field of motion-based image processing and recognition. The potential of vision-based human-computer interfaces for visual communication has increased dramatically during recent years, and gesture recognition systems promise to play a major role in these scenarios. Although the problem of gesture recognition can be facilitated by using sensors for tracking body parts, or data gloves, clearly the most desirable interface would be a purely vision-based gesture recognition system, because this would lead to a mostly unrestricted visual communication process for the user. Identification of gestures using only image information is, however, the most demanding and challenging approach to gesture recognition. The following section gives a brief introduction to dynamic gesture recognition systems and presents the system developed by our research group. As will be shown, this system already represents a high technical standard for gesture recognition and obtains satisfactory recognition rates for a fairly challenging person-independent recognition task consisting of 24 gestures.

1.1 Dynamic gesture recognition systems

In the research area of dynamic gesture recognition, Hidden Markov Models are one of the most widely used methods: the movements of a person over a sequence of images are classified. The first approach to the recognition of human movements based on Hidden Markov Models is described in [7]. It distinguishes between six different tennis strokes. This system divides the image into meshes and counts the number of pixels representing the person in each mesh. The counts are composed into a feature vector, which is converted into a discrete label by a vector quantizer, and the label sequences are classified with discrete HMMs. The system described in [6] is capable of recognizing 40 different connected, person-dependent gestures of American Sign Language. This system uses colored gloves to track the hands of the user, but can also track the hands without the help of gloves. The position and orientation of the hands are used for the HMM-based classification.

1.2 Baseline System

The work presented in this paper is based on our gesture recognition system presented in [3, 4]. The system is able to recognize 24 different gestures with an accuracy of 92.9%. It operates in person- and background-independent mode and runs six times faster than real-time. The system consists of three modules: preprocessing, feature extraction, and classification. After the preprocessing of the image sequence, the feature extraction module calculates a sequence of feature vectors, which the classification module then recognizes. The preprocessing is based on the difference image of adjacent frames, which is a good indicator of moving objects in the image. Moments of the difference image are used as features to specify the movement. The main features are:



- center of motion:

  $\vec{m}(t)^T = [m_x(t),\ m_y(t)]$ with

  $m_x(t) = \dfrac{\sum_{x,y} x\,|D(x,y)|}{\sum_{x,y} |D(x,y)|}, \qquad m_y(t) = \dfrac{\sum_{x,y} y\,|D(x,y)|}{\sum_{x,y} |D(x,y)|}$   (1)

- mean absolute deviation from the center of motion:

  $\vec{\sigma}(t)^T = [\sigma_x(t),\ \sigma_y(t)]$ with

  $\sigma_x(t) = \dfrac{\sum_{x,y} |D(x,y)\,(x - m_x(t))|}{\sum_{x,y} |D(x,y)|}$   (2)

  ($\sigma_y(t)$ is defined analogously with $y - m_y(t)$)

- intensity of motion:

  $i(t) = \dfrac{\sum_{x,y} |D(x,y)|}{\sum_{x,y} 1}$   (3)

gesture models: Hand-Waving-Both, Hand-Waving-Left, Hand-Waving-Right, To-Right, To-Left, To-Top, To-Bottom, Round-Clockwise, Round-Counterclockwise, Stop, Come, Clapping, Kowtow
filler model 1: Go-Left, Go-Right, Turn-Right, Turn-Left
filler model 2: Nod-Yes, Nod-No, Spin, A, B, C, D

Table 1. Gesture models and filler models (filler models are used only for continuous recognition)
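A direct implementation of these moment features might look as follows. This is a minimal sketch with numpy; the function name, the return layout, and the optional noise threshold are our own, not part of the original system.

```python
import numpy as np

def motion_features(prev_frame, frame, threshold=0.0):
    """Compute the moment features of the difference image D(x, y)
    for one pair of adjacent frames (Eqs. 1-3)."""
    # Difference image of adjacent frames; |D| indicates moving pixels.
    d = np.abs(frame.astype(float) - prev_frame.astype(float))
    d[d <= threshold] = 0.0

    ys, xs = np.mgrid[0:d.shape[0], 0:d.shape[1]]
    total = d.sum()
    if total == 0.0:                      # no motion in this frame pair
        return np.zeros(5)

    # Eq. (1): center of motion (m_x, m_y).
    m_x = (xs * d).sum() / total
    m_y = (ys * d).sum() / total
    # Eq. (2): mean absolute deviation from the center of motion.
    s_x = (d * np.abs(xs - m_x)).sum() / total
    s_y = (d * np.abs(ys - m_y)).sum() / total
    # Eq. (3): intensity of motion, the mean of |D| over all pixels.
    i = total / d.size

    return np.array([m_x, m_y, s_x, s_y, i])
```

For a frame pair with a single changed pixel, the center of motion is that pixel, the deviations are zero, and the intensity is the pixel's change averaged over the image.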

One feature vector is calculated for each frame of the image sequence. The feature vector sequence is classified by the Viterbi algorithm using Hidden Markov Models [2] that were previously trained on sample data. See [4] for a detailed explanation of our gesture recognition system. The recognition system has three main restrictions. The first is its position dependence: the user has to be positioned in the center of the image. The second is the lack of a method to reject unknown gestures. The third is the need for a second operator, because the recognition is based on image sequences of three seconds length, and this second user has to start the recognition of an image sequence by pressing a button. The goal of this paper is to present an extended system that eliminates these restrictions.
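As a minimal illustration of this classification step, the following sketch scores a feature-vector sequence against one HMM with the Viterbi algorithm in the log domain. The matrices are generic placeholders, not the trained models of [4]; classification then reduces to computing this score for every gesture model and taking the argmax.

```python
import numpy as np

def viterbi_score(log_pi, log_A, log_b):
    """Log-probability of the best state path through one HMM.

    log_pi[j]   : log initial probability of state j
    log_A[i, j] : log transition probability i -> j
    log_b[t, j] : log emission probability of frame t in state j
    """
    delta = log_pi + log_b[0]
    for t in range(1, log_b.shape[0]):
        # Best predecessor for every state, then emit frame t.
        delta = np.max(delta[:, None] + log_A, axis=0) + log_b[t]
    return delta.max()
```

For the linear (left-right) models used here, log_A is an upper-triangular band matrix; the sketch works for any topology.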

2 Position independent recognition

One restriction of our previous system is that the user has to be in the center of the image to obtain the best recognition rates. This is a direct consequence of the fact that the features described above are global motion features: they encode the absolute position of the motion in the image and are therefore position dependent. Several approaches have been tested [1] to obtain position-independent recognition while keeping the advantage of the simple and robust feature extraction. The user is tracked in the image by comparing the current image with a stored image of the background. The absolute difference image is calculated, and pixel values below a relatively high threshold are set to zero to eliminate the shadows of the person and noise in the images. For this difference image we calculate the center of gravity according to Eq. (1); this center corresponds to the center of the person. The position of the center of motion relative to the center of the person is then used in the feature vector. This is only a simple form of person tracking, but it is sufficient for gesture recognition.

3 Rejection of undefined gestures

The rejection of gestures which are not defined in the gesture vocabulary is very important for the use of the recognition system in real-world applications. The HMM approach provides a straightforward solution to this problem: so-called "filler models" are used to model arbitrary movements and other "garbage" motions, a technique known from HMM-based keyword spotting in speech recognition [5]. We separated the gestures of the baseline system into a gesture vocabulary and gestures used to train the filler models, which are defined to represent all other movements (Tab. 1). The remaining gesture vocabulary consists of 13 different gestures, plus one filler model for movements of the whole body and one filler model for head and arm movements. Fig. 1 shows the network which connects the models for the recognition.

Figure 1. Network for recognition

A test of this rejection technique showed that it gives good results for our purposes.
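The decision rule behind this rejection scheme can be summarized in a few lines. The model names and the score interface below are illustrative: all gesture and filler models compete, and the input is rejected whenever a filler model wins.

```python
def recognize_with_rejection(scores, fillers=("Filler-1", "Filler-2")):
    """scores: model name -> log-likelihood of the observed sequence.

    The filler models absorb movements outside the gesture vocabulary,
    so the sequence is rejected (None) whenever a filler model is the
    most likely explanation of the observation.
    """
    best = max(scores, key=scores.get)
    return None if best in fillers else best
```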

4 Continuous online gesture recognition


The main requirement for the use of gesture recognition in human-computer interfaces is continuous online recognition with a minimum time delay of the recognition output. Such a system performance is extremely difficult to achieve for the following reasons: the starting and ending points of a gesture are not known in advance, and the three modules, which were executed one after the other in our old system, have to work synchronously. A frame captured by the framegrabber has to be preprocessed immediately, and the features have to be extracted and used for the classification before the next frame is captured. Two methods were developed for online recognition [1], but only one of them is applicable to human-computer interfaces. The continuous recognition of isolated gestures has a minimum time delay for the recognition result. The probability of the feature vector sequence is calculated at each time step t for the gesture and filler models using the forward algorithm. Based on experiments with this method, the following condition for the moment of recognition was defined: if the probabilities P(O|λ_i) of the observation sequence O for all models λ_i are decreasing during a short interval, the most likely model at the end of this interval is the recognition result. Then the forward algorithm is reset and the procedure restarts. Fig. 2 shows the logarithmic probabilities for some of the models during a sequence of six gestures. At the beginning of each gesture (time steps 0, 19, 42, ...) the probabilities for the models are zero (logarithmic probability −∞); after a number of time steps that depends on the number of states in the linear models, the probabilities increase rapidly (time steps 10, 30, 53, ...). Then the probabilities of the observation sequence for the models decrease in most cases, as can be seen for instance at time steps 40 and 125 in Fig. 2. If all probabilities are decreasing, we assume that the model with the highest probability is the correctly classified model. Tests showed that this method gives surprisingly good recognition results and has no noticeable time delay.
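The procedure can be sketched as follows. This is a simplified reimplementation, not the authors' code: the HMMs are given as generic log-space parameters with an emission callback, and the `window` parameter stands in for the "short interval" of the recognition condition.

```python
import numpy as np

def logsumexp(v):
    m = v.max()
    if not np.isfinite(m):
        return m
    return m + np.log(np.exp(v - m).sum())

class OnlineGestureRecognizer:
    """At every frame the forward probability P(O|model) of each gesture
    and filler model is updated incrementally; once all probabilities have
    been falling for `window` consecutive frames, the most likely model is
    reported and the forward algorithm is reset."""

    def __init__(self, models, window=3):
        self.models = models      # name -> (log_pi, log_A, emit)
        self.window = window
        self.reset()

    def reset(self):
        self.log_alpha = {name: None for name in self.models}
        self.history = []         # per frame: name -> log P(O_1..t | model)

    def step(self, frame_features):
        scores = {}
        for name, (log_pi, log_A, emit) in self.models.items():
            log_b = emit(frame_features)       # log emission per state
            a = self.log_alpha[name]
            if a is None:                      # first frame after a reset
                a = log_pi + log_b
            else:                              # one forward-algorithm step
                a = np.array([logsumexp(a + log_A[:, j])
                              for j in range(len(log_b))]) + log_b
            self.log_alpha[name] = a
            scores[name] = logsumexp(a)
        self.history.append(scores)

        # Recognition condition: every model's probability has decreased
        # at each of the last `window` frames.
        if len(self.history) > self.window:
            recent = self.history[-(self.window + 1):]
            falling = all(recent[t + 1][n] < recent[t][n]
                          for n in self.models
                          for t in range(self.window))
            if falling:
                best = max(self.history[-1], key=self.history[-1].get)
                self.reset()
                return best
        return None
```

Driving `step` once per captured frame keeps the decoder synchronous with preprocessing and feature extraction, as required above.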

5 Conclusions and future work

We presented an extension of an existing gesture recognition system. The new capabilities of the system are position-independent recognition, rejection of undefined gestures by filler models, and continuous online recognition. We were able to keep all advantages of the old system while adding new and important functionality. The system is able to recognize dynamic

Figure 2. Model likelihood during continuous isolated gesture recognition

gestures in person- and background-independent mode and works several times faster than real-time. Further work will focus on the recognition of gestures of a rotated person and of a person walking in the field of view.

References

[1] S. Eickeler and G. Rigoll. Continuous Online Gesture Recognition Based on Hidden Markov Models. Technical report, Faculty of Electrical Engineering - Computer Science, Gerhard-Mercator-University Duisburg, 1998. http://www.fb9-ti.uni-duisburg.de/report.html.
[2] L. R. Rabiner. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proc. of the IEEE, 77(2):257–285, 1989.
[3] G. Rigoll and A. Kosmala. New Improved Feature Extraction Methods for Real-Time High Performance Image Sequence Recognition. In Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), pages 2901–2904, Munich, Apr. 1997.
[4] G. Rigoll, A. Kosmala, and S. Eickeler. High Performance Real-Time Gesture Recognition Using Hidden Markov Models. In Proc. Gesture Workshop, Bielefeld, Germany, Sept. 1997.
[5] R. C. Rose and D. B. Paul. A Hidden Markov Model Based Keyword Recognition System. In Proc. ICASSP '90, pages 129–132, Albuquerque, NM, Apr. 1990.
[6] T. Starner and A. Pentland. Visual Recognition of American Sign Language Using Hidden Markov Models. In International Workshop on Automatic Face and Gesture Recognition, Zurich, Switzerland, 1995.
[7] J. Yamato, J. Ohya, and K. Ishii. Recognizing Human Action in Time-Sequential Images Using Hidden Markov Models. In Proc. of Computer Vision and Pattern Recognition (CVPR), pages 379–385, Champaign, IL, June 1992. IEEE.
