Robust Modeling and Recognition of Hand Gestures with Dynamic Bayesian Network

Heung-Il Suk†, Bong-Kee Sin‡, and Seong-Whan Lee†

† Department of Computer Science and Engineering, Korea University, Korea ({hisuk, swlee}@image.korea.ac.kr)
‡ Department of Computer Engineering, Pukyong National University, Korea ([email protected])



Abstract

In this paper, we propose a new gesture recognition model for a set of both one-hand and two-hand gestures based on the dynamic Bayesian network (DBN) framework, which makes it easy to represent the relationships among features and to incorporate new information into the model. Unlike the coupled HMM, the proposed model has room for common hidden variables that are believed to be shared between the two hand processes. In an experiment with ten isolated gestures, we obtained a recognition rate of 99.59% with leave-one-out cross-validation. The proposed model is believed to have strong potential for successful application to other related problems such as sign language recognition.

1 Introduction

A hand gesture can be described by a locus of hand motion recorded in a sequence of frames. To model this kind of sequential input, hidden Markov models (HMMs) have been widely used in speech recognition, computer vision, and related fields. Recently, there has been increasing interest in a more general class of probabilistic models, called dynamic Bayesian networks (DBNs), which include the HMM and the Kalman filter as special cases. The DBN is a generalized version of the Bayesian network (BN) extended along the temporal dimension. Du et al. defined five actions that could happen between two persons and developed a DBN-based model that took local features such as contour, moment, and height, and global features such as velocity, orientation, and distance as observations [3]. Avilés-Arriaga et al. extracted the area and the center of a hand as input features and used a naïve DBN to recognize ten hand gestures [1].


Earlier, Pavlovic proposed the use of a DBN for gesture recognition that can be seen as a combination of an HMM and a dynamic linear system [7]. León, on the other hand, used a sliding window of 15 frames and represented the motion between contiguous frames by a random variable [4]. Last but not least, Nefian et al. compared several different methods for audio-visual speech recognition, including coupled HMMs and factorial HMMs, and showed that coupled HMMs outperformed all the other models in recognition performance [6].

In this paper, we describe a dynamic Bayesian network-based hand gesture recognition method that can be used to control a media player or PowerPoint™. Taking as observations the motion of each hand, the relative position between the two hands, and the relative positions between the face and the two hands, we propose a new gesture model built on the DBN framework. In an experiment with ten isolated gestures, the model achieved a recognition rate of 99.59% under cross-validation.

In the rest of the paper, we begin by defining the ten hand gestures and describing which features to use and how to extract them from an input video. The proposed DBN model and its inference algorithm are explained in Section 3, the experimental results are presented and analyzed in Section 4, and Section 5 concludes the paper.

2 Hand Gestures and Feature Extraction

2.1 Ten Hand Gesture Commands

With a potential application to controlling a media player or PowerPoint™, we define ten hand gestures: five two-hand gestures and five one-hand gestures, as shown in Figure 1. In the figure, each black dot represents the starting position of the hand and each directed curve the motion trajectory of the hand.

Figure 1. Ten hand gestures: (a) OP: open a file, (b) CL: close a file, (c) PL: play, (d) PA: pause, (e) ML: move to the last frame, (f) MF: move to the first frame, (g) TF: 10 seconds forward, (h) TB: 10 seconds backward, (i) FF: fast forward, (j) FR: fast rewind.

2.2 Feature Extraction

Among the variety of possible features, the most important information about a gesture is the motion of the hand. The motion can be described by the trajectory of the hand in space over time, which in turn is represented by a sequence of positions or, equivalently, motion vectors $x_t$, $t = 1, \cdots$. Each pair of successive hand locations defines a local motion vector. The whole motion trajectory is then represented by a sequence of motion vectors, each of which is encoded by a direction code using the scheme shown in Figure 2(a). The central code '0' denotes 'no motion'. Given a video, we extract two chain codes, one for each hand. With separate chain coding for each hand, however, ambiguities can arise between gestures. To remove the ambiguities incurred by representing the motion with the chain code alone, we introduce two more features: the relative position of the two hands (Figure 2(b)) and the position of each hand relative to the face (Figure 2(c)). For these features, the code '0' implies that the two hands, or a hand and the face, are overlapping.

Figure 2. Features: (a) 17 direction codes for hand motions, (b) hand-hand positional relation, (c) face-hand positional relation.
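To make the three feature streams concrete, here is a minimal sketch of how they could be computed from tracked hand and face center positions. The thresholds (min_move, overlap_dist), the 16 uniformly spaced directions behind the 17 motion codes, and the 8-direction relational coding are our assumptions; the paper does not spell out these details.

```python
import math

def direction_code(p_prev, p_curr, min_move=2.0):
    """Quantize the local motion vector between two successive hand
    positions into one of 17 codes: 0 for 'no motion', 1..16 for
    sixteen directions (thresholds are assumptions)."""
    dx, dy = p_curr[0] - p_prev[0], p_curr[1] - p_prev[1]
    if math.hypot(dx, dy) < min_move:          # too small: code '0'
        return 0
    angle = math.atan2(dy, dx) % (2 * math.pi)
    return 1 + int(round(angle / (2 * math.pi / 16))) % 16

def relative_position_code(p_a, p_b, overlap_dist=15.0):
    """Code the position of p_a relative to p_b: 0 when the two regions
    overlap, otherwise one of eight surrounding directions (1..8)."""
    dx, dy = p_a[0] - p_b[0], p_a[1] - p_b[1]
    if math.hypot(dx, dy) < overlap_dist:      # overlapping: code '0'
        return 0
    angle = math.atan2(dy, dx) % (2 * math.pi)
    return 1 + int(round(angle / (2 * math.pi / 8))) % 8

def observations(left, right, face):
    """One 5-tuple (O1..O5) per frame from tracked center positions:
    O1/O3 motion chain codes, O2/O4 face-hand codes, O5 hand-hand code."""
    return [(direction_code(left[t - 1], left[t]),
             relative_position_code(left[t], face),
             direction_code(right[t - 1], right[t]),
             relative_position_code(right[t], face),
             relative_position_code(left[t], right[t]))
            for t in range(1, len(left))]
```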

3 Hand Gesture Model

3.1 Dynamic Bayesian Network: DBN

Although the HMM [8] is a very useful tool for modeling variability, its power is limited to a simple state space with a single discrete hidden variable. The coupled HMM is an HMM variant tailored to represent the interaction of two independent processes [2]. It is essentially two HMMs with couplings between the state variables across the chains. Although useful for modeling simple interacting processes, this model has no room for common hidden variables that are believed to be shared between the two processes. The dynamic Bayesian network (DBN) [5] is a framework generalizing both the HMM and the Bayesian network (BN). With an appropriate design, it can make up for the weaknesses of the HMM by factorizing the hidden variable into a set of random sub-variables.

3.2 Proposed Model Architecture

The ten hand gestures defined in Section 2.1 include bimanual as well as monomanual gestures. We propose a new DBN design with three hidden variables and five observable variables. The two hidden variables $X^1$ and $X^2$ model the motion of the left and the right hand, respectively, and each is associated with two observations: the features of the corresponding hand's motion and its position relative to the face. The third hidden variable $X^3$ has been introduced to resolve the ambiguity between similar gestures; it models the spatial relation between the hands. Suppose that the relative position of the two hands has changed from Figure 3(a) to Figure 3(b). In this case, when the left hand is lowered, we can infer that the right hand has been either raised or stationary, as shown in Figure 3(c). Similarly, when the right hand is raised, the left hand has been either lowered or stationary, as shown in Figure 3(d). This means that, given the value of the node $X^3$, the two nodes $X^1$ and $X^2$ are conditionally dependent:

$$X^1 \not\perp X^2 \mid X^3. \quad (1)$$
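The conditional dependence in Eq. (1) can be verified numerically on a toy v-structure in which a common child variable depends on both $X^1$ and $X^2$, as in the proposed model; the probability tables below are invented purely for illustration.

```python
import numpy as np

# Toy v-structure X1 -> X3 <- X2: X1 and X2 are a priori independent,
# but observing the common child X3 makes them dependent.
p_x1 = np.array([0.5, 0.5])          # P(X1): e.g., lowered / raised
p_x2 = np.array([0.5, 0.5])          # P(X2)
p_x3_given = np.array([[0.1, 0.8],   # P(X3='apart' | X1, X2)
                       [0.8, 0.1]])

# Posterior joint P(X1, X2 | X3='apart') by Bayes' rule
joint = p_x1[:, None] * p_x2[None, :] * p_x3_given
joint /= joint.sum()

# Compare with the product of the posterior marginals
m1, m2 = joint.sum(axis=1), joint.sum(axis=0)
print(np.allclose(joint, np.outer(m1, m2)))   # False -> X1, X2 dependent
```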

We take this conditional relationship to be the backbone of the proposed DBN. Using the first-order Markov assumption to simplify the motion dynamics, we propose a new hand gesture model, shown in Figure 4 with hidden variables drawn as gray square nodes and observable variables as white circle nodes. In the figure, $O^1$ and $O^3$ denote the chain codes of the left and right hand motions, $O^2$ and $O^4$ the spatial relation between each hand and the face, and $O^5$ the spatial relation between the two hands. In this model the time dependency of the hidden variable $X^3$, that is, $\cdots \to X^3_{t-1} \to X^3_t \to X^3_{t+1} \to \cdots$, implicitly captures the correlation between the two hidden variables $X^1$ and $X^2$, as was done by the pair $(X^1, X^2)$ in the coupled HMM. This simplifies the proposed model and relieves it of the complications of the coupled HMM.

Figure 3. Changes of the relative position of two hands: (a) the initial position of the hands, (b) after a motion, (c) the left hand lowered, (d) the right hand raised.

Figure 4. The proposed dynamic Bayesian network model for hand gestures.
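For reference, the structure of Figure 4 can be written out as parent lists over one initial and one repeating time slice; each variable's distribution is conditioned on the listed parents. The bracketed variable names are ours, not the paper's.

```python
# Parent sets of the proposed DBN (Figure 4).
parents = {
    # initial slice (t = 1)
    "X1[1]": [],                        # left-hand motion state
    "X2[1]": [],                        # right-hand motion state
    "X3[1]": ["X1[1]", "X2[1]"],        # hand-hand relation state
    # repeating slice (t > 1)
    "X1[t]": ["X1[t-1]"],
    "X2[t]": ["X2[t-1]"],
    "X3[t]": ["X3[t-1]", "X1[t]", "X2[t]"],
    # observations (every slice)
    "O1[t]": ["X1[t]"],  # left-hand chain code
    "O2[t]": ["X1[t]"],  # face to left-hand relation
    "O3[t]": ["X2[t]"],  # right-hand chain code
    "O4[t]": ["X2[t]"],  # face to right-hand relation
    "O5[t]": ["X3[t]"],  # hand-hand relation
}
```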

3.3 Inference

Inference over a DBN amounts to computing the marginal probability $P(X_i \mid O_{1:\tau})$ of a hidden variable $X_i$ given an observation sequence $O_{1:\tau} = O_1 O_2 \cdots O_\tau$. The joint probability of the variables in a DBN can be factored into a product of local conditional probabilities, one per variable, through the conditional independencies or d-separation [5]. The full joint probability of $X_{1:T}^{1:3} = [X_1^1, X_1^2, X_1^3] \cdots [X_T^1, X_T^2, X_T^3]$ and $O_{1:T}^{1:5} = [O_1^1, \cdots, O_1^5] \cdots [O_T^1, \cdots, O_T^5]$ for the DBN in Figure 4 can be computed by multiplying three factored probabilities as follows:

$$P(X_{1:T}^{1:3}, O_{1:T}^{1:5}) = P(O_{1:T}^{1:5} \mid X_{1:T}^{1:3}) \, P(X_{1:T}^{1:3}) \quad (2)$$
$$= \Pi \times A \times B \quad (3)$$

where

$$\Pi = P(X_1^1) \, P(X_1^2) \, P(X_1^3 \mid X_1^1, X_1^2) \quad (4)$$
$$A = \prod_{t=2}^{T} P(X_t^1 \mid X_{t-1}^1) \, P(X_t^2 \mid X_{t-1}^2) \, P(X_t^3 \mid X_{t-1}^3, X_t^1, X_t^2) \quad (5)$$
$$B = \prod_{t=1}^{T} P(O_t^1, O_t^2 \mid X_t^1) \, P(O_t^3, O_t^4 \mid X_t^2) \, P(O_t^5 \mid X_t^3) \quad (6)$$
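As a sketch of how Eqs. (2)-(6) combine, the function below evaluates the log-joint probability for one fixed assignment of the hidden-state paths. The table names and shapes are our assumptions; an actual recognizer marginalizes over all hidden paths with a standard DBN inference routine (e.g., the junction-tree algorithm [5]) rather than scoring a single path.

```python
import numpy as np

def log_joint(x1, x2, x3, obs, cpt):
    """log P(X_{1:T}, O_{1:T}) of Eq. (3) as log(Pi) + log(A) + log(B)
    for one hidden-state path. cpt is a dict of (assumed) tables:
    pi1[i], pi2[j], pi3[i,j,k]; A1[i,i'], A2[j,j'], A3[k,i',j',k'];
    B1[i,o1,o2], B2[j,o3,o4], B3[k,o5]."""
    T = len(x1)
    # Pi, Eq. (4): initial-slice prior
    lp = (np.log(cpt["pi1"][x1[0]]) + np.log(cpt["pi2"][x2[0]])
          + np.log(cpt["pi3"][x1[0], x2[0], x3[0]]))
    # A, Eq. (5): transition factors
    for t in range(1, T):
        lp += (np.log(cpt["A1"][x1[t-1], x1[t]])
               + np.log(cpt["A2"][x2[t-1], x2[t]])
               + np.log(cpt["A3"][x3[t-1], x1[t], x2[t], x3[t]]))
    # B, Eq. (6): observation factors
    for t, (o1, o2, o3, o4, o5) in enumerate(obs):
        lp += (np.log(cpt["B1"][x1[t], o1, o2])
               + np.log(cpt["B2"][x2[t], o3, o4])
               + np.log(cpt["B3"][x3[t], o5]))
    return lp
```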

4 Experimental Results

4.1 Data Description

For each of the ten gestures, we captured seven videos from each of seven different subjects at different times, for a total of 490 video sequences for training and testing the baseline models. All videos were captured with a small CMOS camera at 30 frames per second, at a size of 320×240 pixels with 24-bit color. We carried out 7-fold leave-one-out cross-validation in which each seventh of the set was used in turn for testing while the rest were used for training. The overall performance was measured as the average of the rates over the seven tests.
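A sketch of this protocol, assuming the 490 clips have already been grouped into seven disjoint folds; train and evaluate stand in for model fitting and recognition-rate measurement, which the paper does not detail.

```python
from itertools import chain

def sevenfold_cv(folds, train, evaluate):
    """folds: a list of 7 disjoint sample lists (each 'seventh of the set').
    Each fold is held out once for testing; the overall performance is
    the average recognition rate over the seven tests."""
    rates = []
    for k in range(len(folds)):
        test_set = folds[k]
        train_set = list(chain.from_iterable(
            f for i, f in enumerate(folds) if i != k))
        model = train(train_set)
        rates.append(evaluate(model, test_set))
    return sum(rates) / len(rates)
```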

4.2 Modeling with Coupled HMM

In the first experiment, we created coupled HMMs for the ten gestures to compare their performance with that of the proposed DBNs. These models observe the same chain codes of each hand's trajectory. For the one-hand gestures, we assigned uniform values to the probability distributions of the non-participating hand in order to ignore its unintentional motion. When tested on a selected set of videos showing only one hand during the one-hand gestures, with the other hand out of the scene, the coupled HMMs recorded a hit ratio of up to 97.35%. The detailed cross-validation result is shown in Figure 5(a), where each bar represents the hit ratio for the corresponding subset of the data.
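A minimal sketch of the uniform-distribution trick, with assumed state and symbol counts: uniform tables make every idle-hand observation contribute the same constant factor to each gesture model's likelihood, so the idle hand cannot affect the ranking of the models.

```python
import numpy as np

n_states, n_motion_codes = 3, 17   # assumed model sizes, not the paper's

# Uniform transition and emission tables for the non-participating hand
A_idle = np.full((n_states, n_states), 1.0 / n_states)        # P(X_t | X_{t-1})
B_idle = np.full((n_states, n_motion_codes), 1.0 / n_motion_codes)  # P(O | X)

# Any idle-hand observation sequence of length T then contributes the same
# factor (1 / n_motion_codes) ** T to every gesture model's likelihood,
# leaving the classification decision to the active hand alone.
```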

Figure 5. Comparison of recognition rates between the coupled HMM and the DBN: (a) coupled HMM with chain-code observations, (b) DBN with chain-code and hand-hand relative position observations.

4.3 Dynamic Bayesian Network with Additional Information

Just like the coupled HMMs, we created ten DBNs for the target gestures, but with the addition of the relative position between the two hands, as required by the proposed model. This information, although not absolutely required as an input (it can be missing), is important for inference in the proposed DBN models. The DBN recorded a recognition rate of 98.98% (Figure 5(b)). One advantage of the proposed model is that it can accept both one-hand and two-hand gestures whether both hands are in view or not. When tested on an additional set of six one-hand gesture samples, which were not included in the data set of 490 video clips, we obtained the results detailed in Table 1.

Table 1. Comparison of the coupled HMM and the DBN on the one-hand gestures: ✓ (hit), × (miss); the mis-recognized labels are given in parentheses.

| One-hand gesture data | Coupled HMM | DBN |
|---|---|---|
| FF | × (ML) | ✓ |
| PA | ✓ | ✓ |
| TF | × (PA) | ✓ |
| TB | × (PA) | ✓ |
| FR | × (MF) | ✓ |
| TF | × (PA) | ✓ |
| Hits | 1 | 6 |

The coupled HMM recognized only one of the six gestures, while the DBN recognized all of them correctly. This tells us that the additional information helps disambiguate the gestures 'FF', 'ML', 'FR', and 'MF', which are ambiguous when only chain codes are used. With all the input features considered, the proposed model recognized 99.59% of the input gestures, as shown in the rightmost column of Table 2.

Table 2. Hand gesture recognition results.

| Gesture | # of test gestures | # of hits | # of misses | Recognition rate (%) |
|---|---|---|---|---|
| OP | 49 | 49 | 0 | 100 |
| CL | 49 | 49 | 0 | 100 |
| PL | 49 | 49 | 0 | 100 |
| PA | 49 | 49 | 0 | 100 |
| MF | 49 | 48 | 1 | 97.96 |
| ML | 49 | 48 | 1 | 97.96 |
| TF | 49 | 49 | 0 | 100 |
| TB | 49 | 49 | 0 | 100 |
| FF | 49 | 49 | 0 | 100 |
| FR | 49 | 49 | 0 | 100 |
| Sums and rates | 490 | 488 | 2 | 99.59 |

5 Conclusions

We proposed a new hand gesture model with three hidden variables that together take five observations: the chain codes of each hand's motion, the relative position between the face and each hand, and the relative position of the two hands. We tested the DBN-based system on a data set captured from seven different subjects at different times, 490 video sequences in total. The DBN model achieved a recognition rate of 99.59% in isolated gesture recognition under cross-validation.

Acknowledgments

This research was supported by the Intelligent Robotics Development Program, one of the 21st Century Frontier R&D Programs funded by the Ministry of Knowledge Economy of Korea.

References

[1] H. Avilés-Arriaga, L. Sucar, and C. Mendoza. Visual recognition of similar gestures. In Proc. of IEEE International Conference on Pattern Recognition, volume 1, pages 1100–1103, August 2006.
[2] M. Brand, N. Oliver, and A. Pentland. Coupled hidden Markov models for complex action recognition. In Proc. of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 994–999, June 1997.
[3] Y. Du, F. Chen, W. Xu, and Y. Li. Recognizing interaction activities using dynamic Bayesian network. In Proc. of IEEE International Conference on Pattern Recognition, pages 618–621, August 2006.
[4] R. León. Continuous activity recognition with missing data. In Proc. of IEEE International Conference on Pattern Recognition, volume 1, pages 439–446, August 2002.
[5] K. Murphy. Dynamic Bayesian Networks: Representation, Inference and Learning. PhD thesis, University of California, Berkeley, 2002.
[6] A. Nefian, L. Liang, X. Pi, X. Liu, and K. Murphy. Dynamic Bayesian networks for audio-visual speech recognition. Journal of Applied Signal Processing, 11(1):1–15, 2002.
[7] V. Pavlovic. Dynamic Bayesian Networks for Information Fusion with Applications to Human-Computer Interfaces. PhD thesis, University of Illinois at Urbana-Champaign, 1999.
[8] L. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.
