A Hidden Markov Model-Based Continuous Gesture Recognition System for Hand Motion Trajectory

Mahmoud Elmezain, Ayoub Al-Hamadi, Jörg Appenrodt, Bernd Michaelis
Institute for Electronics, Signal Processing and Communications (IESK), Otto-von-Guericke-University Magdeburg, Germany
{Mahmoud.Elmezain, Ayoub.Al-Hamadi}@ovgu.de

Abstract

In this paper, we propose an automatic system that recognizes both isolated and continuous gestures for Arabic numbers (0-9) in real time based on Hidden Markov Models (HMM). To handle isolated gestures, HMMs with Ergodic, Left-Right (LR) and Left-Right Banded (LRB) topologies and with numbers of states ranging from 3 to 10 are applied. Orientation dynamic features are obtained from spatio-temporal trajectories and then quantized to generate their codewords. Continuous gestures are recognized by our novel idea of zero-codeword detection with static velocity motion. The LRB topology in conjunction with the Forward algorithm gives the best performance, achieving average recognition rates of 98.94% and 95.7% for isolated and continuous gestures, respectively.

978-1-4244-2175-6/08/$25.00 ©2008 IEEE

1. Introduction

The use of hand gestures is an active area of research in the vision community, mainly for sign language recognition and Human-Computer Interaction (HCI). A gesture is a spatio-temporal pattern that may be static, dynamic or both. Static hand shapes are called postures, and hand movements are called gestures. The goal of gesture interpretation is to advance human-machine communication so that human-machine interaction comes close to human-human interaction. In the last decade, several methods with potential applications [1, 2, 3] in advanced gesture interfaces for HCI have been proposed, but they differ in the models they use, among them Neural Networks [1], Hidden Markov Models (HMM) [2] and Fuzzy Systems [3]. An HMM is a statistical model capable of modeling spatio-temporal time series in which the same gesture can differ in shape and duration.

A major problem in real-time recognition of continuous gestures is segmentation, that is, extracting the isolated gestures by determining from the hand motion trajectory when each gesture starts and when it ends. Lee and Kim [5] proposed an HMM-based threshold model: an ergodic model built from all the states of all trained gesture models, which serves as an adaptive threshold for classifying meaningful gestures. However, this method runs offline against a simple background. Nguyen et al. [6] developed a real-time hand gesture recognition system for unconstrained environments and tested it on a vocabulary of 36 gestures comprising the American Sign Language (ASL) fingerspelling alphabet and digits. This method runs in real time against a complex background, but it analyzes the posture of the hand rather than its motion trajectory, on which our system relies. Nianjun et al. [7] introduced a method that recognizes the 26 letters from A to Z using different HMM topologies with different numbers of states.

In this paper, we develop a system that uses HMMs to recognize continuous Arabic numbers (0-9) in real time from the motion trajectory of a single hand in color image sequences. The system relies on the following main steps: a Gaussian Mixture Model (GMM) for skin color detection; the orientation between two consecutive trajectory points as the basic feature; zero-codeword detection with static velocity motion; the Baum-Welch (BW) algorithm for training; and the Forward algorithm in conjunction with the Viterbi path for testing. Each isolated gesture number is based on 30 video sequences (20 for training and 10 for testing), and 70 further video sequences are used for testing continuous gestures. Our system achieved average recognition rates of 98.94% for isolated gestures and 95.7% for continuous gestures. This paper is organized as follows. Section 2 describes the proposed system in three subsections. The experimental results are presented in Section 3. Finally, the summary and conclusion are given in Section 4.

2. Hand gesture recognition system

We propose an automatic system that uses HMMs to recognize both isolated and continuous gestures for Arabic numbers (0-9) in real time from the motion trajectory of a single hand in stereo color image sequences. Our main motivation is to improve gesture recognition in natural conversation. This requires techniques for skin segmentation and for handling occlusion between the hands and the face, to overcome the difficulties of overlapping regions. In particular, the gesture recognition system consists of three main stages: automatic segmentation and preprocessing, feature extraction, and classification (Fig. 1).

2.1 Automatic segmentation and tracking

In this paper, a method for detecting and segmenting a hand in stereo color images with a complex background is described, where hand segmentation uses both a 3D depth map and color information. Segmentation of skin-colored regions becomes robust if only chrominance is used in the analysis. Therefore, the YCbCr color space is used in our system, where the Y channel represents brightness and the (Cb, Cr) channels represent chrominance [8]. We ignore the Y channel to reduce the effect of brightness variation and use only the chrominance channels, which fully represent the color information. A large database of skin and non-skin pixels is used to train the Gaussian model: the training set contains 18972 skin pixels from persons of 36 different races and 88320 non-skin pixels from 84 different images. The GMM technique begins by modeling skin with the skin database, where a variant of the k-means clustering algorithm [8] performs the model training to determine the initial configuration of the GMM parameters. For more details, the reader can refer to [2].

For skin segmentation of the hands and face in stereo color image sequences, an algorithm is used that calculates the depth value in addition to the skin color information. With the depth information from the camera set-up (Fig. 2(b)) [2], the overlapping problem between hands and face is solved, since the hand regions are closer to the camera than the face region. In addition, we use blob analysis to determine the boundary area, bounding box and centroid point of each hand region. We then select a search area in the next frame (Fig. 2(c)) around the bounding box determined in the previous frame, in order to track the hand and reduce the gesture region of interest. The new bounding box is then calculated and its centroid point is determined. By iterating this process, the motion trajectory of the hand, the so-called gesture path, is generated by connecting the hand centroid points (Fig. 5).
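The chrominance-only skin test described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses the JFIF (BT.601 full-range) RGB-to-YCbCr equations, evaluates a two-component GMM over (Cb, Cr), and thresholds the resulting likelihood. The mixture weights, means, covariances and the threshold are hypothetical placeholders; in the described system they would come from training on the skin database (k-means initialization, and, as an assumption here, the usual EM refinement, which is not shown).

```python
import math

def rgb_to_cbcr(r, g, b):
    """Chrominance (Cb, Cr) from the JFIF / BT.601 full-range equations; Y is dropped."""
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return cb, cr

def gaussian2d(x, mean, cov):
    """Density of a 2-D Gaussian with a symmetric 2x2 covariance ((a, c), (c, d))."""
    (a, c), (_, d) = cov
    det = a * d - c * c
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # Squared Mahalanobis distance via the closed-form inverse of a 2x2 matrix.
    m2 = (d * dx * dx - 2.0 * c * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * m2) / (2.0 * math.pi * math.sqrt(det))

# HYPOTHETICAL skin model: (weight, mean (Cb, Cr), covariance) per component.
SKIN_GMM = [
    (0.6, (110.0, 150.0), ((60.0, 10.0), (10.0, 50.0))),
    (0.4, (120.0, 140.0), ((80.0, -5.0), (-5.0, 70.0))),
]
THRESHOLD = 1e-4  # HYPOTHETICAL likelihood threshold

def skin_likelihood(cbcr, gmm=SKIN_GMM):
    """p(Cb, Cr | skin) as a weighted sum of the mixture components."""
    return sum(w * gaussian2d(cbcr, mu, cov) for w, mu, cov in gmm)

def is_skin(r, g, b, gmm=SKIN_GMM, threshold=THRESHOLD):
    """Classify one RGB pixel as skin using chrominance only."""
    return skin_likelihood(rgb_to_cbcr(r, g, b), gmm) > threshold
```

Dropping Y before the likelihood is evaluated is what makes the test insensitive to brightness: two pixels that differ only in luminance map to nearly the same (Cb, Cr) point.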

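Figure 1. Gesture recognition system. (Diagram blocks: stereo images capture; 2.1 segmentation & preprocessing; 2.2 feature extraction; 2.3 classification; skin color detection; hand tracking; gesture path; vector quantization; HMM gesture recognition; gestures database.)

Figure 2. Hands and face localization: (a) hand detection, (b) depth value, (c) search area.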
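The blob analysis and search-area tracking loop described above can be sketched in the same way. The sketch assumes that skin and depth segmentation have already produced a binary hand mask per frame (a list of rows of booleans); the 10-pixel search margin is a hypothetical value, and taking the largest 4-connected blob inside the search area as the hand is a simplification of the blob analysis described in the text.

```python
from collections import deque

def expand_box(box, margin, width, height):
    """Grow an inclusive (x0, y0, x1, y1) box by `margin` pixels, clipped to the image."""
    x0, y0, x1, y1 = box
    return (max(0, x0 - margin), max(0, y0 - margin),
            min(width - 1, x1 + margin), min(height - 1, y1 + margin))

def largest_blob(mask, window):
    """Largest 4-connected component of True pixels inside `window`."""
    x0, y0, x1, y1 = window
    seen, best = set(), []
    for y in range(y0, y1 + 1):
        for x in range(x0, x1 + 1):
            if not mask[y][x] or (x, y) in seen:
                continue
            blob, queue = [], deque([(x, y)])
            seen.add((x, y))
            while queue:  # breadth-first flood fill restricted to the window
                cx, cy = queue.popleft()
                blob.append((cx, cy))
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    if (x0 <= nx <= x1 and y0 <= ny <= y1
                            and mask[ny][nx] and (nx, ny) not in seen):
                        seen.add((nx, ny))
                        queue.append((nx, ny))
            if len(blob) > len(best):
                best = blob
    return best

def track_step(mask, prev_box, margin=10):
    """One tracking iteration: search around the previous box, return (box, centroid)."""
    height, width = len(mask), len(mask[0])
    blob = largest_blob(mask, expand_box(prev_box, margin, width, height))
    if not blob:
        return prev_box, None  # hand not found in the search area; keep the old box
    xs = [p[0] for p in blob]
    ys = [p[1] for p in blob]
    box = (min(xs), min(ys), max(xs), max(ys))
    centroid = (sum(xs) / len(xs), sum(ys) / len(ys))
    return box, centroid
```

Calling `track_step` once per frame and collecting the returned centroids yields the gesture path; restricting the search to the expanded box is what keeps the per-frame cost small and ignores skin regions far from the hand.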

2.2 Feature extraction

Selecting good features for recognizing the hand gesture path plays a significant role in system performance. There are three basic features: location, orientation and velocity. Previous research [2, 7] showed that orientation is the best of these in terms of accuracy. Therefore, we rely on it as the main feature in our system. A gesture path is a spatio-temporal pattern consisting of the hand centroid points (x_hand, y_hand). The orientation is determined between two consecutive points of the gesture path (Eq. 1):

$$\theta_t = \arctan\left(\frac{y_{t+1} - y_t}{x_{t+1} - x_t}\right), \qquad t = 1, 2, \ldots, T-1 \qquad (1)$$

where T is the length of the gesture path; the quadrant of θt follows from the signs of the numerator and denominator, so that θt covers 0° to 360° (Fig. 3). The orientation θt is quantized by dividing it by 20° to generate the codewords 1 to 18 (Fig. 3). In addition, the codeword 0 is used, notably for continuous gestures. The resulting discrete vector is then used as the input to the HMM.
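Eq. (1) and the 20° quantization can be sketched as follows. The sketch uses `math.atan2`, which resolves the quadrant from the signs of the two differences, and assumes a y axis pointing upward as in Fig. 3; with image coordinates, where y grows downward, dy would be negated. The zero codeword is left out here because it depends on the velocity test used for continuous gestures.

```python
import math

def orientation_deg(p, q):
    """Orientation of the step p -> q in degrees, mapped to [0, 360) (Eq. 1 via atan2)."""
    dx, dy = q[0] - p[0], q[1] - p[1]
    return math.degrees(math.atan2(dy, dx)) % 360.0

def codeword(theta):
    """Quantize an orientation into the codewords 1..18 (one per 20-degree sector)."""
    # min() guards against theta rounding up to exactly 360.0.
    return min(int(theta // 20.0), 17) + 1

def path_to_codewords(points):
    """Discrete observation vector: T - 1 codewords for a path of T centroids."""
    return [codeword(orientation_deg(points[t], points[t + 1]))
            for t in range(len(points) - 1)]
```

For example, a step with dx = -1 and dy = 2 has θ ≈ 116.6°, which falls in the sixth 20° sector and therefore maps to codeword 6.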

Figure 3. Orientation and its codewords. (Panels (a) and (b): orientation θt between consecutive points (xt, yt) and (xt+1, yt+1), with components dx and dy; codewords 1 to 18 over 0° to 360°, plus the zero codeword.)

2.3 Hand gesture path classification

An HMM is a mathematical model of a stochastic process with three parameters λ = (Π, A, B) [4], where Π is the initial state vector, A is the transition matrix and B is the emission matrix. Evaluation, decoding and training are the main problems of HMMs, and they can be solved by the Forward-Backward, Viterbi and BW algorithms, respectively [4]. Three HMM topologies are considered: the Fully Connected (Ergodic) model, in which any state can be reached from any other state; the LR model, in which each state can go back to itself or to any following state; and the LRB model, in which each state can go back to itself or to the next state only (Fig. 4). Isolated and continuous gesture paths are recognized from their discrete vectors by the HMM Forward algorithm, which selects the gesture model with maximal likelihood, in conjunction with the Viterbi best path. The BW algorithm is used to fully train the initialized HMM parameters [2] and build the gestures database.

To ensure that all states are used, we chose the LRB topology with 5 states (Fig. 4, Fig. 6) for the following reasons. Since each state in the Ergodic topology has many more transitions than in the LR and LRB topologies, structural information in the data can easily be lost. In addition, the LRB topology is more restricted than the LR topology: it has no backward transitions, and the state index either stays the same or increases by one as time advances. For continuous gestures, our system segments and recognizes the isolated gestures by zero-codeword detection (Fig. 3(b), Fig. 5). Each gesture ends with a line segment that is assigned the 0 codeword. Some gestures (e.g., 2, 4 and 7) contain zero codewords within their own segments, which would split a single gesture into parts. To overcome this problem, we use a static velocity as a threshold. Furthermore, the links between two gestures must be ignored; this is done by adaptively skipping some frames after the end point of a gesture is detected.
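The evaluation and decoding steps can be illustrated with a minimal discrete-HMM sketch, written from the standard formulation in Rabiner's tutorial [4] rather than from the authors' code. Each gesture model is assumed to be a hypothetical tuple (pi, A, B) whose emission columns are indexed by codeword; the Forward pass rescales alpha at every frame to avoid underflow on long gesture paths, Viterbi works in the log domain, and a path is assigned to the model with the largest Forward log-likelihood. Baum-Welch training is not shown.

```python
import math

def _log(x):
    """Natural log that maps 0 to -inf instead of raising."""
    return math.log(x) if x > 0.0 else float("-inf")

def forward_log_likelihood(obs, pi, A, B):
    """log P(O | lambda) with the Forward algorithm, rescaling alpha at every frame."""
    n = len(pi)
    log_p = 0.0
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    for t in range(len(obs)):
        if t > 0:
            alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                     for j in range(n)]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # the model cannot produce this sequence
        alpha = [a / scale for a in alpha]
        log_p += math.log(scale)  # log P(O) is the sum of the log scale factors
    return log_p

def viterbi(obs, pi, A, B):
    """Most likely state sequence and its log-probability (log-domain Viterbi)."""
    n = len(pi)
    delta = [_log(pi[i]) + _log(B[i][obs[0]]) for i in range(n)]
    backptrs = []
    for o in obs[1:]:
        # Best predecessor of each state j, then the updated path scores.
        ptr = [max(range(n), key=lambda i: delta[i] + _log(A[i][j])) for j in range(n)]
        delta = [delta[ptr[j]] + _log(A[ptr[j]][j]) + _log(B[j][o]) for j in range(n)]
        backptrs.append(ptr)
    last = max(range(n), key=lambda i: delta[i])
    best_log_p = delta[last]
    path = [last]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    return path[::-1], best_log_p

def classify(obs, models):
    """Name of the gesture model with the highest Forward log-likelihood."""
    return max(models, key=lambda name: forward_log_likelihood(obs, *models[name]))
```

With `models` mapping each digit to its trained (pi, A, B), `classify(codewords, models)` returns the recognized digit, and `viterbi` on the winning model gives the state sequence that can be inspected as the best path.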

Figure 4. Line segment for LRB topology. (Diagram labels: states s1 to s5; codewords 12, 0, 5, 14, 14.)

The number of states is based on the complexity of each gesture number (0-9) and is determined by mapping each straight-line segment to a single HMM state (Fig. 4).
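The three topologies differ only in which transitions are allowed, so they can be expressed as zero patterns in the initial transition matrix A. The sketch below assumes a uniform split over the allowed transitions and, for the Ergodic model, a uniform initial vector; the initialization actually used in [2] may differ. Because Baum-Welch re-estimation never turns a zero transition probability into a nonzero one, these zeros preserve the topology throughout training.

```python
def init_transitions(n_states, topology):
    """Uniform initial transition matrix that respects a topology's structural zeros.

    "ergodic": every state can reach every state.
    "lr":      state i can reach itself or any later state.
    "lrb":     state i can reach only itself or state i + 1.
    """
    A = []
    for i in range(n_states):
        if topology == "ergodic":
            allowed = list(range(n_states))
        elif topology == "lr":
            allowed = list(range(i, n_states))
        elif topology == "lrb":
            allowed = [i, i + 1] if i + 1 < n_states else [i]
        else:
            raise ValueError(f"unknown topology: {topology!r}")
        row = [0.0] * n_states
        for j in allowed:
            row[j] = 1.0 / len(allowed)
        A.append(row)
    return A

def init_start(n_states, topology):
    """Initial state vector: left-right models always start in the first state."""
    if topology == "ergodic":
        return [1.0 / n_states] * n_states
    return [1.0] + [0.0] * (n_states - 1)
```

For the 5-state LRB model, each row has at most two nonzero entries, which is why every state must be visited in order and none can be skipped.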

Figure 5. Continuous gesture path. (Diagram labels: linking two gestures; 0-codeword.)
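Zero-codeword segmentation can be sketched as follows, under our reading of the description above: a step whose speed falls below the static-velocity threshold receives codeword 0, a zero codeword that follows some motion ends the current gesture, and a few steps after that point are discarded as the link to the next gesture. At the 15 frames per second used in the experiments, the 56 pixels-per-second threshold corresponds to about 56 / 15 ≈ 3.7 pixels of centroid displacement between consecutive frames. The fixed `skip_frames` value is a hypothetical stand-in for the adaptive frame skipping, whose exact rule is not given.

```python
import math

def step_codeword(p, q, fps=15.0, v_static=56.0):
    """0 if the hand is static (speed below v_static px/s), else a 1..18 codeword."""
    dx, dy = q[0] - p[0], q[1] - p[1]
    if math.hypot(dx, dy) * fps < v_static:
        return 0
    theta = math.degrees(math.atan2(dy, dx)) % 360.0
    return min(int(theta // 20.0), 17) + 1

def segment(points, fps=15.0, v_static=56.0, skip_frames=3):
    """Split a continuous gesture path into per-gesture codeword lists.

    skip_frames is a HYPOTHETICAL fixed replacement for the adaptive skipping
    of the linking motion between two gestures.
    """
    gestures, current, skip = [], [], 0
    for t in range(len(points) - 1):
        if skip > 0:  # still inside the link between two gestures
            skip -= 1
            continue
        c = step_codeword(points[t], points[t + 1], fps, v_static)
        if c == 0:
            if current:  # a static step after some motion ends a gesture
                gestures.append(current)
                current = []
                skip = skip_frames
            continue  # leading or repeated static steps are ignored
        current.append(c)
    if current:
        gestures.append(current)
    return gestures
```

Each returned codeword list can then be scored against the trained gesture models, for example with the Forward algorithm. Making the zero codeword depend on speed rather than on direction is what keeps a gesture such as 7, whose strokes include a horizontal segment, from being cut in two.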

3. Experimental results

Our proposed system showed good results in recognizing the numbers in real time from color image sequences. In our experiments, each isolated gesture number (0-9) was based on 30 video sequences, of which 20 were used for training and nearly 10 for testing. In other words, our database contains 200 video sequences for training and 98 video sequences for testing isolated gestures. It also contains 70 video sequences for testing continuous gestures. The maximal likelihood over the gesture models was computed by the Forward algorithm in conjunction with the Viterbi path to recognize the numbers in real time, frame by frame. The system was implemented in Matlab and C++, and the input images were captured with a Bumblebee stereo camera system with a 6 mm focal length, for about 2 to 5 seconds at 15 frames per second with an image resolution of 240 × 320 pixels. We designed different HMM topologies with numbers of states ranging from 3 to 10. As Fig. 6 shows, the average recognition rate of the LRB topology over 3 to 10 states was 98.94%. Also, the LR and LRB topologies achieved the best recognition at 3 and 4 states. In general, the LRB topology with 5 states gives the best results empirically.

Figure 6. Isolated gesture recognition results for HMM topologies with the number of states ranging from 3 to 10.

The continuous gestures contain isolated gestures, which are segmented and recognized by our zero-codeword detection with static velocity as a threshold. In our system, hand motion slower than 56 pixels per second is treated as static. We tested 70 video sequences of continuous gestures, each containing more than one isolated gesture. The recognition rate for continuous gestures was 95.7%. The recognition ratio is the number of correctly recognized gestures divided by the number of test gestures (Eq. 2):

$$\text{Reco. ratio} = \frac{\#\ \text{recognized gestures}}{\#\ \text{test gestures}} \times 100\% \qquad (2)$$

Figure 7. Result of continuous gesture 90.

Fig. 7 shows the result for the continuous gesture "90": at t = 51 the first gesture ends and is recognized as 9, t = 68 marks the link between the two gestures, and at t = 129 the final result is 90.

4. Summary and conclusion

This paper proposes an automatic system that uses HMMs to recognize both isolated and continuous gestures for Arabic numbers (0-9) from the motion trajectory of a single hand in stereo color image sequences. The proposed system is suitable for real-time applications and relies on our novel idea of zero-codeword detection with static velocity to recognize continuous gestures. Our database contains 30 video sequences for each isolated gesture number from 0 to 9 and 70 video sequences for continuous gestures. The LRB topology with 5 states gives the best performance. Our results show average recognition rates of 98.94% and 95.7% for isolated and continuous gestures, respectively. In future work, the motion trajectory will be obtained from the fingertip instead of the hand centroid point, using a multi-camera system and combined features.

5. Acknowledgments

This work was supported by Bernstein-Group (BMBF: 01GQ0702) and NIMITEK grants (LSA: XN3621E/1005M).

References

[1] X. Deyou. A Network Approach for Hand Gesture Recognition in Virtual Reality Driving Training System of SPG. In ICPR 06 Conference, pp. 519-522, 2006.
[2] M. Elmezain, A. Al-Hamadi, and B. Michaelis. Real-Time Capable System for Hand Gesture Recognition Using Hidden Markov Models in Stereo Color Image Sequences. WSCG Journal, Vol. 16(1), pp. 65-72, 2008.
[3] E. Holden, R. Owens, and G. Roy. Hand Movement Classification Using Adaptive Fuzzy Expert System. Expert Systems Journal, Vol. 9(4), pp. 465-480, 1996.
[4] L. R. Rabiner. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, Vol. 77(2), pp. 257-286, 1989.
[5] H. Lee and J. Kim. An HMM-Based Threshold Model Approach for Gesture Recognition. IEEE Trans. PAMI, Vol. 21(10), pp. 961-973, 1999.
[6] D. B. Nguyen, S. Enokida, and E. Toshiaki. Real-Time Hand Tracking and Gesture Recognition System. IGVIP05 Conference, CICC, pp. 362-368, 2005.
[7] L. Nianjun, C. L. Brian, J. K. Peter, and A. D. Richard. Model Structure Selection & Training Algorithms for a HMM Gesture Recognition System. In International IWFHR, pp. 100-106, 2004.
[8] S. L. Phung, A. Bouzerdoum, and D. Chai. A Novel Skin Color Model in YCbCr Color Space and its Application to Human Face Detection. In IEEE International Conference on Image Processing (ICIP), pp. 289-292, 2002.
