Dynamic Hand Gesture Recognition Based on SURF Tracking

Jiatong Bao, Aiguo Song, and Yan Guo
School of Instrument Science and Engineering, Southeast University, Nanjing, Jiangsu Province, China
[email protected]

Hongru Tang
School of Energy and Power Engineering, Yangzhou University, Yangzhou, Jiangsu Province, China
[email protected]

Abstract—A novel method of dynamic hand gesture recognition based on Speeded Up Robust Features (SURF) tracking is proposed. Its main characteristic is that the dominant movement direction of matched SURF points in adjacent frames is used to help describe the hand trajectory without detecting or segmenting the hand region. The dynamic hand gesture is then modeled by a series of trajectory direction data streams after time warping. Accordingly, a data stream clustering method based on correlation analysis is developed to recognize a dynamic hand gesture and to speed up the calculation. The proposed algorithm is tested on 26 alphabetical hand gestures and yields satisfactory recognition rates of 87.1% on the training set and 84.6% on the testing set.

Keywords—dynamic hand gesture recognition; SURF; feature tracking; correlation analysis; data stream

I. INTRODUCTION

Body language is an important part of human communication. The direct use of hand gestures to provide natural human-computer interaction (HCI), without additional devices, has been a major emphasis of HCI research. Common interaction methods include the recognition of faces, facial expressions, hand gestures, and body gestures. Owing to the rapid development of computing technology and the wide adoption of Virtual Reality (VR), more and more application systems take vision-based hand gesture recognition as their main HCI interface.

In general, vision-based hand gesture recognition has three basic processing stages: hand segmentation, gesture analysis, and gesture recognition. Hand segmentation detects the hand regions in the gesture sequence and separates them from the background, and the correctness of gesture recognition is closely related to the accuracy of this step. Many conventional segmentation methods [1-3] take advantage of color cues. However, color-based segmentation is easily affected by several factors, such as skin color differences between people, the sensitivity of color to illumination, and especially the presence of background objects with skin-like color. Therefore, many works [4-5] combine color with other cues (e.g., motion and shape) to provide more accurate hand detection or segmentation. With the intensive study of object tracking, tracking methods such as parameterized deformable templates [6] and particle filtering [7] have also been employed for hand localization. In addition, hand detection belongs to the broader scope of object detection, so classifier-based pattern recognition methods [8] are often used for hand detection and recognition.

In the gesture analysis stage, hand postures and motion patterns are computed from the frame sequence, and a hand gesture model is created accordingly. The final stage is recognition, in which the current gesture model is compared with each model in a gesture database, and the best-matching gesture is selected as the result. Different modeling methods lead to different recognition approaches. In [1], the gesture is recognized by counting the number of active fingers. In [9], the gesture is modeled as a star skeleton, and recognition is performed by a distance signature. Other features, such as hand position and direction, finger position and direction, and the distances between fingers, are commonly used to establish a spatial model of the gesture. For dynamic hand gestures, the changes of the spatial model over time must also be taken into account. By integrating temporal and spatial characteristics, many works [10] represent a dynamic gesture as a trajectory in a high-dimensional space and typically use a Hidden Markov Model (HMM) for recognition. Non-trajectory representations [11], which mainly use statistical features (e.g., statistical moments), are also used.

To our knowledge, most hand gesture recognition methods rely heavily on the result of hand segmentation, yet segmenting the hand exactly under complex circumstances is still a difficult problem. Using object tracking to locate the hand within complex backgrounds can give relatively better results, but two factors, initial object modeling and model updating during tracking, tend to make hand tracking fail. Pattern recognition algorithms for object detection usually have high computational complexity and have seldom been applied in real-time systems. Considering that hand gesture images contain a wealth of stable salient points whose movements reflect the gesture changes, we propose a novel method for dynamic hand gesture recognition based on SURF [12] tracking.

The novelties are as follows. First, the proposed algorithm observes the dominant movement of matched SURF points only in adjacent frames, rather than across the whole image sequence, because feature points do not persist unchanged over the whole gesture sequence. Second, the dominant movement direction is selected as the movement feature for gesture representation, so the accuracy of the representation does not depend on the correctness of hand segmentation. Third, the robust and efficient SURF algorithm is employed to extract salient feature points, which ensures that the overall movement of matched SURF points in adjacent frames characterizes the hand movement accurately. Based on this feature extraction method, the dynamic hand gesture is modeled by a series of trajectory direction data streams after time warping, and a data stream clustering method based on correlation analysis is developed to recognize it.

II. MODELING OF DYNAMIC HAND GESTURE

A. Feature Extraction

The dominant movement direction of matched SURF points in adjacent frames is used for dynamic hand gesture representation. The most appealing descriptor for salient points has long been SIFT [13], which performs well under scaling, rotation, viewpoint changes, etc. Inspired by SIFT, the SURF descriptor greatly outperforms SIFT in computational speed while attaining comparable robustness. This allows us to adopt SURF to describe the salient feature points of an object while satisfying the demands of both robustness and efficiency. Algorithmically, interest points are first found by the Fast-Hessian detector: an approximation of the Hessian determinant is used to find extrema in each scaled image. These interest points are then compared with their 26 neighbors in scale space; if they remain extrema, they are kept as feature point candidates, and accurate feature point locations are obtained by interpolation. For each feature point, Haar wavelet responses in the x and y directions are computed around it, and the dominant direction is chosen to achieve rotation invariance. Finally, Haar wavelet responses over the surrounding area of each feature point form its descriptor. The most significant speed improvement of SURF comes from the integral image, which allows fast calculation of the filter responses in almost all steps.

The next step is to match the extracted SURF points of adjacent frames. Let $S_t = \{f_t^i\}_{i=1}^{M}$ be the set of $M$ SURF points in the frame at time $t$, where $f_t^i$ is the feature vector of the $i$-th point, and let $S_{t+1} = \{f_{t+1}^j\}_{j=1}^{N}$ be the set of $N$ SURF points in the frame at time $t+1$. The matching method of [13] is adopted: for any point $f_t^i \in S_t$, its Euclidean distance to every point in $S_{t+1}$ is calculated, which yields the shortest distance $d_1$ and the second shortest distance $d_2$. If $d_1/d_2$ is lower than a preset threshold, the point inducing the shortest distance is selected as the match of $f_t^i$. The computational complexity of finding all matched points is $O(M \times N)$.
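For illustration, the extraction and matching stage could be implemented as in the following minimal sketch. It assumes an OpenCV build with the nonfree xfeatures2d module (SURF is patented and not in the default build); the Hessian threshold of 400 and the ratio of 0.7 are illustrative choices, not parameters fixed by this paper.

```python
# Sketch of SURF extraction and ratio-test matching between adjacent frames.
# Assumes opencv-contrib with the nonfree xfeatures2d module enabled.
import cv2
import numpy as np

surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)  # assumed threshold

def match_surf(prev_gray, curr_gray, ratio=0.7):
    """Return matched point coordinates (from S_t, from S_{t+1}) for two gray frames."""
    kp1, des1 = surf.detectAndCompute(prev_gray, None)
    kp2, des2 = surf.detectAndCompute(curr_gray, None)
    if des1 is None or des2 is None:
        return np.empty((0, 2), np.float32), np.empty((0, 2), np.float32)
    # Brute-force Euclidean matching: for each f_t^i keep the two nearest
    # neighbours in S_{t+1}, giving the distances d1 and d2 of the paper.
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    candidates = matcher.knnMatch(des1, des2, k=2)
    good = [p[0] for p in candidates
            if len(p) == 2 and p[0].distance < ratio * p[1].distance]
    if not good:
        return np.empty((0, 2), np.float32), np.empty((0, 2), np.float32)
    prev_xy = np.float32([kp1[m.queryIdx].pt for m in good])
    curr_xy = np.float32([kp2[m.trainIdx].pt for m in good])
    return prev_xy, curr_xy
```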

The final step is to compute the dominant movement direction of all pairs of matched SURF points. Pairs with no displacement are removed first. Suppose $K$ pairs remain, represented by $S_{t,t+1} = \{\langle X_t^i, X_{t+1}^i \rangle\}_{i=1}^{K}$, where $\langle X_t^i, X_{t+1}^i \rangle$ is the $i$-th pair. The dominant movement direction $dr(t)$ is then calculated as follows. First, a computational model is defined as $q = \{q_u\}_{u=1}^{m}$, where $u$ indexes direction angle intervals of length $l$, $m = 360/l$, and $q_u$ represents the probability of $dr(t)$ falling into interval $u$:

$$q_u = C \sum_{i=1}^{K} k\big(\|X_t^i\|^2\big)\,\delta\big[b(X_t^i) - u\big] \qquad (1)$$

In the above equation, $k(x)$ is an isotropic kernel function, $\delta[b(X_t^i) - u]$ is the Kronecker delta, which equals 1 if the direction angle of $\langle X_t^i, X_{t+1}^i \rangle$ falls into the $u$-th bin, and the constant $C = 1 \big/ \sum_{i=1}^{K} k(\|X_t^i\|^2)$ is a normalization factor. Finally, the dominant movement direction of all matched SURF point pairs is calculated by

$$dr(t) = \operatorname{mid}\Big(\arg\max_{u}\{q_u\}_{u=1}^{m}\Big) \qquad (2)$$

where $\operatorname{mid}(\cdot)$ returns the midpoint angle of the selected interval.
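A minimal sketch of the estimate in (1)-(2) is given below. The 10-degree bin width and the Epanechnikov-style kernel are assumptions for illustration only, since the paper does not fix $l$ or $k(\cdot)$.

```python
# Sketch of equations (1)-(2): a kernel-weighted direction histogram over
# the displacements of matched SURF point pairs.
import numpy as np

def dominant_direction(prev_xy, curr_xy, l=10):
    disp = curr_xy - prev_xy                       # displacement of each pair
    disp = disp[np.linalg.norm(disp, axis=1) > 0]  # drop zero-displacement pairs
    if len(disp) == 0:
        return None
    angles = np.degrees(np.arctan2(disp[:, 1], disp[:, 0])) % 360.0
    bins = (angles // l).astype(int)               # b(X): direction bin index
    r2 = (disp ** 2).sum(axis=1)
    w = np.maximum(1.0 - r2 / r2.max(), 1e-6)      # assumed kernel k(||X||^2)
    q = np.bincount(bins, weights=w, minlength=int(360 // l))
    q /= q.sum()                                   # normalization constant C
    u = int(np.argmax(q))                          # arg max over {q_u}
    return u * l + l / 2.0                         # mid(): midpoint of the bin
```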

B. Gesture Modeling

The dynamic hand gesture is modeled by a series of trajectory direction data streams. A trajectory direction data stream is a data sequence $Y = \{y_0, \ldots, y_{T-1}\}$, where $y_t$ $(t = 0, \ldots, T-1)$ is the data item at time $t$ and $T$ is the total length of the gesture sequence. The stream can thus also be represented by a one-variable real function $f(t)$, $t \in [0, T)$. Because the motion of a hand gesture is continuous in space, the ideal curve of $f(t)$ must be smooth; in practice, however, the curve often contains noise generated by complex backgrounds. Therefore, a smoothing method operating in the frequency domain is employed to remove the noise. First, the Fourier transform of $f(t)$ is computed:

$$F(u) = \frac{1}{T}\sum_{t=0}^{T-1} f(t)\exp\left[\frac{-j 2\pi u t}{T}\right] \qquad (3)$$

Then an ideal low-pass filter is used for smoothing,

$$H(u) = \begin{cases} 1 & \text{if } D(u) \le D_0 \\ 0 & \text{otherwise} \end{cases} \qquad (4)$$

where $D(u)$ is the distance from the frequency origin and $D_0$ is the cutoff. The smoothed curve is obtained by the inverse transform

$$\tilde{f}(t) = \sum_{u=0}^{T-1} F(u)\, H(u)\exp\left[\frac{j 2\pi u t}{T}\right] \qquad (5)$$
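Under a standard DFT convention, the smoothing of (3)-(5) reduces to a few lines; the cutoff $D_0$ below is an assumed tuning parameter, not a value given by the paper.

```python
# Sketch of equations (3)-(5): ideal low-pass smoothing of a trajectory
# direction stream in the frequency domain.
import numpy as np

def smooth_stream(y, D0=4):
    T = len(y)
    F = np.fft.fft(y)                    # forward DFT, eq. (3) (numpy puts
                                         # the 1/T factor in the inverse)
    u = np.arange(T)
    D = np.minimum(u, T - u)             # distance from the frequency origin
    H = (D <= D0).astype(float)          # ideal low-pass filter, eq. (4)
    return np.fft.ifft(F * H).real       # inverse DFT, eq. (5)
```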

Different people finish the same dynamic hand gesture in different amounts of time, so the smoothed curve needs further adjustment to a uniform length. Given a smoothed curve of length $L$ in the form $Y_k = f_k(t)$, $t \in [0, L)$, the time warping procedure is executed as $\tilde{Y}_k = f_k\!\left(t \times \frac{L-1}{T-1}\right)$, $t \in [0, T)$, where $T$ is set to the average length of all data streams in the gesture training set. To this extent, the model of any hand gesture $g$ in the gesture database is defined as

$$M_g = \langle D_g, N_g, T_g, \vec{C}_g \rangle \qquad (6)$$

where $D_g$ is the set of trajectory direction data streams, $N_g$ is the number of data streams (training samples), $T_g$ is the length of the data streams, and $\vec{C}_g$ is the center of $D_g$, i.e., the clustering center of gesture $g$.
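The warping and the per-gesture model of (6) could look as in the sketch below. The tuple layout mirrors $M_g = \langle D_g, N_g, T_g, \vec{C}_g \rangle$; taking the clustering center as the mean stream is an assumption, since the paper does not specify how $\vec{C}_g$ is computed.

```python
# Sketch of time warping to the average training length T (assumes T > 1)
# and of the gesture model (6); the mean as clustering center is assumed.
import numpy as np

def time_warp(y, T):
    """Resample a stream of length L onto [0, T) via t -> t*(L-1)/(T-1)."""
    L = len(y)
    src = np.arange(T) * (L - 1) / (T - 1)
    return np.interp(src, np.arange(L), y)   # linear interpolation of f_k

def build_model(streams, T):
    """Return (D_g, N_g, T_g, C_g) for one gesture from its training streams."""
    D_g = np.stack([time_warp(s, T) for s in streams])
    C_g = D_g.mean(axis=0)        # clustering center of D_g (assumed mean)
    return D_g, len(streams), T, C_g
```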

III. RECOGNITION OF DYNAMIC HAND GESTURE

Based on the above modeling method, correlation analysis is introduced to measure the similarity between two hand gestures. Given two gestures represented by the data streams $X = \{x_0, \ldots, x_{T-1}\}$ and $Y = \{y_0, \ldots, y_{T-1}\}$, their correlation coefficient is calculated by

$$\rho(X, Y) = \frac{\sum_{t=0}^{T-1}(x_t - \bar{x})(y_t - \bar{y})}{\sqrt{\sum_{t=0}^{T-1}(x_t - \bar{x})^2 \sum_{t=0}^{T-1}(y_t - \bar{y})^2}} \qquad (7)$$

where $\bar{x} = \frac{1}{T}\sum_{t=0}^{T-1} x_t$ and $\bar{y} = \frac{1}{T}\sum_{t=0}^{T-1} y_t$. From (7), $|\rho(X, Y)| \le 1$, and a larger value of $\rho(X, Y)$ means a stronger correlation between the two gestures; in particular, $X$ and $Y$ are uncorrelated when $\rho(X, Y) = 0$.

Suppose there are $K$ defined dynamic hand gestures in the database, modeled by $M_k$, $k = 1, \ldots, K$, each with the form of (6), and the current hand gesture sample is represented by a data stream $Y_c$. The recognition result for $Y_c$ is then calculated by

$$c = \arg\max_{k}\{\rho(\vec{C}_k, Y_c)\}_{k=1}^{K} \qquad (8)$$

Note that $Y_c$ needs to have the same length as $\vec{C}_k$, so $Y_c$ is adjusted each time (7) is evaluated.
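A compact sketch of the recognition rule (7)-(8) follows, reusing the time_warp helper sketched above; the models argument is an assumed structure mapping gesture labels to their clustering centers.

```python
# Sketch of equations (7)-(8): Pearson correlation against each gesture's
# clustering center, taking the best-correlated gesture as the result.
import numpy as np

def correlation(x, y):                    # eq. (7)
    x = x - x.mean()
    y = y - y.mean()
    return float(x @ y / np.sqrt((x @ x) * (y @ y)))

def recognize(y_c, models):               # eq. (8)
    """models: dict mapping gesture label -> clustering center C_g."""
    # y_c is warped to each center's length before (7) is evaluated.
    return max(models,
               key=lambda g: correlation(time_warp(y_c, len(models[g])),
                                         models[g]))
```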

IV. EXPERIMENTAL RESULTS

The proposed algorithm is implemented on a PC with a 2.20 GHz CPU and 2 GB of memory running Windows XP. The dynamic hand gesture sequences, in 24-bit true color, are captured by an ordinary webcam at a resolution of 176×144 pixels. The algorithm is coded in VC++ using the OpenCV SDK, with the gesture capture thread running in parallel with the gesture analysis thread. The overall processing speed is 8-16 frames/s, which satisfies the demand of real-time interaction.

In this paper, the 26 alphabetical dynamic hand gestures shown in Fig. 1 are used. Each gesture has 40 samples captured from 20 persons, each performing the gesture twice. Among the 40 samples of each gesture, 20 are used for training and the other 20 for testing. Since one gesture is finished in 1.5-4 seconds, the expected gesture sequence length is 16-40 frames, and the overall time taken to recognize one gesture is 1-3 seconds. Fig. 2 shows the average sample length and the average processing time of each alphabetical hand gesture (from left to right).

Figure 1. 26 alphabetical hand gestures

Figure 2. Average sample length (frames) and average processing time (s) of each alphabetical hand gesture

Fig. 3 shows the processing result for one sample of gesture h. Frames 1, 12, 16, and 20 of the gesture sequence are shown in the top row, overlaid with matched SURF points: the circles are SURF points that exist in the next frame, connected by black lines to their matches in the current frame. Clearly, there is a wealth of feature points whose movement reflects the motion of the hand gesture well. The bottom-left plot draws the trajectory direction curve of the gesture, while the bottom-right plot draws the smoothed curve. Note that the lighting changes greatly during this sequence, yet it has little impact on the proposed algorithm. Fig. 4 shows the processing result for another sample of gesture h; although other hand gestures appear in the background, and a large noise disturbance arises in the 15th frame, the algorithm still performs well.

Figure 3. Algorithm result of one sample of hand gesture h (trajectory direction y(t) in degrees vs. t in frames, raw and smoothed)

Figure 4. Algorithm result of another sample of hand gesture h (trajectory direction y(t) in degrees vs. t in frames, raw and smoothed)

For each alphabetical hand gesture, the training samples are processed as above, and the hand gesture database is constructed accordingly. The first four gesture reference models are visualized in Fig. 5 (from left to right, top to bottom); the blue solid lines are the trajectory direction data streams, while the red circled line is the clustering center of the gesture. Finally, every gesture sample is recognized, giving satisfactory overall recognition success rates of 87.1% on the training set and 84.6% on the testing set. The recognition success rate of each hand gesture is shown in Fig. 6. Some gestures, such as h, u, x, and y, have low recognition success rates due to the similarity of their hand movements.

Figure 5. Reference models of hand gestures a, b, c, and d

Figure 6. Hand gesture recognition success rate on the training and testing sets

V. CONCLUSION

Hand gesture recognition provides a natural way of HCI and requires both accuracy and real-time performance. A novel method based on SURF tracking is proposed to recognize dynamic hand gestures, and a satisfactory recognition success rate is obtained. The overall time taken to recognize one gesture is 1-3 seconds, so the proposed algorithm satisfies the demand of real time. However, the proposed algorithm still has some drawbacks. First, we assume the user pauses for a while before a meaningful gesture starts and after it ends, with no disturbance in those moments; this makes the algorithm not robust enough, so a robust hand gesture spotting algorithm needs further study. Second, only the trajectory direction is used for gesture representation, which induces low discrimination between different gestures, especially as the number of defined gestures grows. How to integrate multiple cues for hand gesture representation is therefore also our future work.

ACKNOWLEDGMENT

This work is supported by the National High Technology Research and Development Program (863) under Grant No. 2006AA04Z246, the Key Plan Project of the State Education Ministry of China under Grant No. 708045, and the Natural Science Foundation of Jiangsu Province under Grant No. BK2009183.

REFERENCES

[1] A. Malima, et al., "A fast algorithm for vision-based hand gesture recognition for robot control," Proc. IEEE International Conference on Signal Processing and Communications Applications, Antalya, Turkey, 2006, pp. 1-4.
[2] E. Sanchez-Nielsen, et al., "Hand gesture recognition for human-machine interaction," Journal of WSCG, 2003, 12(1-3).
[3] Ying Wu, et al., "An adaptive self-organizing color segmentation algorithm with application to robust real-time human hand localization," Proc. Asian Conference on Computer Vision, Taiwan, China, 2000, pp. 1106-1111.
[4] Yuanxin Zhu, et al., "A real-time approach to the spotting, representation, and recognition of hand gestures for human-computer interaction," Computer Vision and Image Understanding, 2002, 85(3): 189-208.
[5] Chuanbo Weng, et al., "Robust hand posture recognition integrating multi-cue hand tracking," Lecture Notes in Computer Science, 2010, 6249: 497-508.
[6] Yu Zhong, Anil K. Jain, "Object tracking using deformable templates," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2000, 22(5): 544-549.
[7] Caifeng Shan, et al., "Real-time hand tracking using a mean shift embedded particle filter," Pattern Recognition, 2007, 40(7): 1958-1970.
[8] Eng-Jon Ong, R. Bowden, "A boosted classifier tree for hand shape detection," Proc. Sixth IEEE International Conference on Automatic Face and Gesture Recognition, Seoul, Korea, 2004, pp. 889-894.
[9] S. Mohamed Mansoor Roomi, et al., "Hand gesture recognition for human-computer interaction," Journal of Computer Science, 2010, 6(9): 994-999.
[10] Ho-Sub Yoon, et al., "Hand gesture recognition using combined features of location, angle and velocity," Pattern Recognition, 2001, 34: 1491-1501.
[11] A. Bobick, J. Davis, "The recognition of human movement using temporal templates," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2001, 23(3): 257-267.
[12] Herbert Bay, et al., "SURF: Speeded Up Robust Features," Computer Vision and Image Understanding, 2008, 110(3): 346-359.
[13] David G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, 2004, 60(2): 91-110.