Fingertips Detection and Hand Gesture Recognition Based on Discrete Curve Evolution with a Kinect Sensor

Zhongyuan Lai #1, Zhijun Yao ∗2, Chun Wang †3, Hui Liang ‡4, Hongmei Chen #5, Wu Xia #6

# Institute for Interdisciplinary Research, Jianghan University, Wuhan 430056, P.R. China
  1 [email protected], 5 [email protected], 6 [email protected]
∗ The 723 Institute of China Shipbuilding Industry Corporation, Yangzhou, P.R. China
  2 [email protected]
† Institute of Computer Application, China Academy of Engineering Physics, Mianyang 621999, P.R. China
  3 [email protected]
‡ Institute of High Performance Computing, Agency for Science, Technology and Research (A*STAR), 138632, Singapore
  4 [email protected]

Abstract—In this paper, we propose a novel method that can detect fingertips as well as recognize hand gestures. Firstly, we collect hand curves with a Kinect sensor. Secondly, we detect fingertips based on discrete curve evolution. Thirdly, we recognize hand gestures using the evolved curves partitioned at the detected fingertips. Experimental results show that our method performs well in both fingertips detection and hand gesture recognition.

Index Terms—Hand gesture recognition; fingertips detection; Kinect sensor; discrete curve evolution; curve partition

I. INTRODUCTION

Fingertips and hand gestures play important roles in human-computer interaction [1]. Recently, with the rise of mobile Internet technology, vision-based fingertips detection and hand gesture recognition techniques, being contactless and cheap to implement, have become increasingly widely used in areas such as digital home entertainment, interactive television services, smart wear, and intelligent driving [2].

Traditional vision-based fingertips detection and hand gesture recognition systems use optical sensors to capture hand images. Generally, these images contain rich color, texture, shading, and context information as well as the shape information of the hands. For fingertips detection and hand gesture recognition, knowing the hand shapes is often enough [2]. This requires segmenting the hands from the captured images. However, the quality of images captured by an optical sensor is easily affected by illumination and background clutter, which may lead to inaccurate segmentation results [3].

The advent of the Kinect sensor has opened up a new path around this problem [4]. Besides an RGB camera in the middle, the Kinect device has an infrared laser projector on

the left side and a monochrome CMOS sensor on the right side, as shown in Fig. 1. These two sensors are collectively called the depth sensors and cooperate to capture depth information. With the help of this depth information, hand shapes can be accurately segmented from cluttered backgrounds under any ambient light conditions [5]. However, due to the nature of depth sensors, the captured depth images have insufficient spatio-temporal resolution to represent small, moving hands, which may result in substantial noise and distortion on the contour curves of the hands, as shown in Fig. 2. This brings a new challenge to the subsequent fingertips detection and hand gesture recognition [6].

To address the above problems, part-based hand gesture recognition methods were first proposed. These methods decompose the hand shape into palm and finger parts, and then match the finger parts with templates via the Finger-Earth Mover's Distance (FEMD) [6]. Typical decomposition methods include Thresholding Decomposition (TD) [6], Minimum Near-Convex Decomposition (MNCD) [7], and Perceptual Shape Decomposition (PSD) [8]. More recently, supervised learning-based methods were proposed. These methods extract various depth features and then recognize hand gestures via popular classification models such as the Support Vector Machine (SVM) [9], Sparse Representation Classification (SRC) [10], and Random Forest [11]. Typical depth features include the Histogram of 3D Facets (H3DF) [9], [10] and depth difference features [11]. However, neither the part-based representations nor the above depth features contain fingertips information. Therefore, the above-mentioned methods cannot detect fingertips, which restricts their applications.

In order to detect fingertips as well as recognize hand gestures, we propose a method based on Discrete Curve Evolution (DCE) [12].

Fig. 1. The structure of the Kinect sensor and the framework of our fingertips detection and hand gesture recognition system.

Fig. 2. The collection of a hand curve and the detection of fingertips. (a) The color map and (b) the depth map captured with a Kinect sensor; (c) the combination of the hand and partial forearm region obtained by thresholding the depth map; (d) the accurate hand region obtained by removing the partial forearm region with the black belt; (e) the extracted hand curve starting at the red circle; (f) the polygonal curve (solid blue) evolved by DCE and the fingertips (red dots) detected by turning angle thresholding.

As illustrated in Fig. 2, we collect hand shapes with a Kinect sensor and detect fingertips on the polygonal curves simplified by DCE. Then we recognize hand gestures using the evolved curves partitioned at the detected fingertips. Our method is a contour-based method that can detect fingertips as well as recognize hand gestures. It introduces the perceptual distortion measure [13] into the DCE process, which can effectively eliminate contour noise and resist contour distortions. Compared with the curvature-based method [14], our method does not change the original contour curve: the fingertips are detected directly on the original contour curve, which improves the detection precision. Moreover, our method has low computational complexity and is easy to implement.

II. OUR METHOD

In this section, we give a brief introduction to the hand curve collection with a Kinect sensor, and then give detailed descriptions of the fingertips detection and hand gesture recognition based on DCE.

A. Hand Curve Collection

Our hand curve collection method is very similar to the method proposed in [6]. The user is required to wear a black belt at the wrist and to keep the hand closest to the Kinect sensor. These requirements can easily be satisfied in practice. In this way, we can segment a rough hand from the depth map with a simple thresholding method, improve the accuracy of the hand segmentation with the help of the black belt detected from the color map, and extract the hand curve from the improved segmentation result, as shown in Fig. 2. This hand curve is an open eight-connected digital curve with no self-intersections. It starts at the left pixel of the hand wrist, traverses the hand contour in the clockwise direction, and ends at the right pixel of the hand wrist. A rough sketch of the thresholding step is shown below.
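To make the thresholding step concrete, here is a minimal sketch in Python/NumPy. It assumes the depth map arrives in millimeters with zero marking invalid pixels; the function name and the 150 mm depth band are illustrative assumptions rather than values from the paper, and the black-belt refinement from the color map is omitted.

```python
import numpy as np

def segment_rough_hand(depth_mm: np.ndarray, band_mm: float = 150.0) -> np.ndarray:
    """Rough hand mask by depth thresholding, relying on the paper's
    requirement that the hand is the object closest to the Kinect sensor."""
    valid = depth_mm > 0                       # Kinect reports 0 where depth is unknown
    nearest = depth_mm[valid].min()            # closest valid point lies on the hand
    return valid & (depth_mm <= nearest + band_mm)
```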

As we can see, the collected hand curve contains a lot of noise, which brings challenges to the subsequent fingertips detection.

B. Fingertips Detection

Our main idea is to find a simplified version of the collected hand curve that neglects noise and distortions while preserving the perceptual appearance at a level sufficient for fingertips detection. Here we use DCE [12] to implement this idea.

Let the ordered set C = {c_0, c_1, ..., c_{N_C−1}} denote the original hand curve, where N_C is the total number of pixels in C. Let the ordered set P = {p_0, p_1, ..., p_{N_P−1}} denote the evolved polygonal curve with P ⊆ C, where p_i is the ith vertex of P, N_P is the total number of vertices in P, and the ith line segment starts at p_{i−1} and ends at p_i. Note that p_0 = c_0 and p_{N_P−1} = c_{N_C−1}, i.e., the starting and ending points of P are always the same as those of C and are not changed by the DCE process. Let K_i denote the visual contribution of p_i to C. Let the ordered set T = {t_0, t_1, ..., t_{N_T−1}} denote the detected fingertips with T ⊂ P, where t_j is the jth fingertip on P and N_T is the total number of detected fingertips with N_T ∈ {0, 1, ..., 5}.

The procedure to evolve the hand curve is as follows (a code sketch follows this list).

1) Initialize the polygonal curve P with the original hand curve C.
2) Calculate the visual contributions K_1, K_2, ..., K_{N_C−2} and sort them in descending order to obtain an array A.
3) Delete the last element of A and its corresponding vertex p* in P. Connect the two neighbouring vertices of p*, recalculate their visual contributions, and insert them at the appropriate positions in A.
4) Repeat 3) until the last element of A is larger than a pre-specified threshold K_T.
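As a concrete reading of steps 1)–4), here is a minimal Python sketch under stated assumptions: it replaces the sorted array A with a lazy-deletion min-heap (equivalent in effect), computes the visual contribution of Eq. (1) below with an illustrative `scale` argument standing in for the maximum-inscribed-circle normalization, and appends the turning-angle fingertip selection described in the text; all function names are our own.

```python
import heapq
import math

def turning_angle(a, b, c):
    """Turning angle at vertex b between segments (a, b) and (b, c)."""
    d1 = math.atan2(b[1] - a[1], b[0] - a[0])
    d2 = math.atan2(c[1] - b[1], c[0] - b[0])
    return abs(math.atan2(math.sin(d2 - d1), math.cos(d2 - d1)))

def contribution(a, b, c, scale=1.0):
    """Visual contribution K of vertex b, Eq. (1); `scale` stands in for the
    maximum-inscribed-circle normalization (our assumption)."""
    l1, l2 = math.dist(a, b) / scale, math.dist(b, c) / scale
    return 2.0 * l1 * l2 / (l1 + l2) * math.sin(turning_angle(a, b, c) / 2.0)

def dce(curve, k_t=0.4):
    """Evolve an open digital curve by deleting the least-relevant vertex
    until every interior vertex contributes more than k_t (steps 1-4).
    Endpoints are never deleted."""
    n = len(curve)
    prev, nxt = list(range(-1, n - 1)), list(range(1, n + 1))
    alive = [True] * n
    heap = [(contribution(curve[i - 1], curve[i], curve[i + 1]), i)
            for i in range(1, n - 1)]
    heapq.heapify(heap)
    while heap:
        k, i = heapq.heappop(heap)
        if not alive[i]:
            continue                                   # vertex already deleted
        a, c = prev[i], nxt[i]
        k_now = contribution(curve[a], curve[i], curve[c])
        if k_now != k:                                 # stale entry: neighbours changed
            heapq.heappush(heap, (k_now, i))
            continue
        if k_now > k_t:                                # step 4): stop, all remaining salient
            break
        alive[i] = False                               # step 3): delete vertex and relink
        nxt[a], prev[c] = c, a
        for j in (a, c):                               # re-score the two neighbours
            if 0 < j < n - 1:
                heapq.heappush(
                    heap, (contribution(curve[prev[j]], curve[j], curve[nxt[j]]), j))
    return [curve[i] for i in range(n) if alive[i]]

def detect_fingertips(polygon, alpha_t=3 * math.pi / 5, max_tips=5):
    """Fingertips: at most five interior vertices of the evolved polygon
    whose turning angles are the largest and exceed alpha_t."""
    cand = [(turning_angle(polygon[i - 1], polygon[i], polygon[i + 1]), i)
            for i in range(1, len(polygon) - 1)]
    cand.sort(key=lambda t: t[0], reverse=True)
    keep = sorted(i for ang, i in cand[:max_tips] if ang > alpha_t)
    return [polygon[i] for i in keep]                  # returned in curve order
```

The heap with lazy deletion is one way to realize the O(N_C log N_C) behaviour cited below for the DCE-based method.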

The key to implementing the above-mentioned DCE is the definition of the visual contribution K_i: the larger the visual contribution of p_i to C, the larger the value of K_i. Here we use the perceptual distortion measure presented in [13]:

K_i = \frac{2\, l(p_{i-1}, p_i) \cdot l(p_i, p_{i+1})}{l(p_{i-1}, p_i) + l(p_i, p_{i+1})} \sin\frac{\alpha(p_i)}{2},   (1)

where α(p_i) is the turning angle at the common vertex p_i of the line segments (p_{i−1}, p_i) and (p_i, p_{i+1}), and l(·) is the length function normalized by the radius of the maximum inscribed circle of the hand shape.

Finally, we sort the turning angles of the vertices in P in descending order and detect as fingertips those vertices whose turning angles are among the top five and larger than a pre-specified threshold α_T. According to [15], our DCE-based method has a complexity of O(N_C log N_C).

C. Hand Gesture Recognition

In this section, we partition the evolved polygonal curve at the detected fingertips for hand gesture recognition.

For the testing hand gesture, the definitions of the evolved polygonal curve P and the detected fingertips T are the same as before. Here we represent P as p_0 ∪ p_1 ∪ · · · ∪ p_{N_T}, where p_j is the jth partition of P with fingertips t_j and t_{j+1}.

For the hand gesture templates, let the set 𝒫 = {P^0, P^1, · · · , P^{N_P−1}} denote the polygonal curve templates with N_T fingertips, where P^n is the nth template of 𝒫 and N_P is the total number of templates in 𝒫. Let the ordered set T^n = {t^n_0, t^n_1, · · · , t^n_{N_T−1}} denote the fingertips with T^n ⊂ P^n, where t^n_j is the jth fingertip on P^n. Similarly as before, we represent P^n as p^n_0 ∪ p^n_1 ∪ · · · ∪ p^n_{N_T}, where p^n_j is the jth partition of P^n with fingertips t^n_j and t^n_{j+1}.

For template matching, let D^n denote the distance between P and P^n, and d^n_j the partition distance between p_j and p^n_j. Here we have

D^n = \sum_{j=0}^{N_T} d^n_j,   (2)

i.e., the distance between the polygonal curve of the testing hand gesture and the polygonal curve of the template is the sum of their corresponding partition distances. Our DCE partition-based hand gesture recognition method then outputs the label of the polygonal curve template in 𝒫 having the minimum distance from P.

The key to implementing the above-mentioned template matching is the computation of the partition distance d^n_j. Here we use the arc similarity measure presented in [15]:

d^n_j = \int_0^1 \big( \alpha(p_j(s)) - \alpha(p^n_j(s)) \big)^2 \, ds \cdot \max\!\left( \frac{l(p_j)}{l(p^n_j)}, \frac{l(p^n_j)}{l(p_j)} \right) \cdot \max\big( l(p_j), l(p^n_j) \big),   (3)

where α(·(s)) is the turning function defined in [15] and l(·) is the arclength function normalized by the length of the polygonal curve. The first two items in (3) characterize the differences in the shapes and relative sizes of p_j and p^n_j, while the third item characterizes their relative sizes with respect to the whole shape. That is to say, the distance d^n_j between p_j and p^n_j depends not only on the differences between the two partitions themselves, but also on their difference relative to the whole. A code sketch of this matching step follows.
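As an illustration of how (2) and (3) combine into a classifier, the following Python sketch assumes each curve is already partitioned into point sequences at its fingertips; the uniformly resampled turning function is a crude stand-in for the turning function of [15], and all names (`turning_function`, `partition_distance`, `classify`) are our own illustrative choices.

```python
import math

def polyline_length(points):
    """Total arclength of a polyline given as a list of (x, y) points."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))

def turning_function(points, samples=64):
    """Segment direction as a function of normalized arclength, uniformly
    resampled; a simplified stand-in for the turning function of [15]."""
    angles = [math.atan2(b[1] - a[1], b[0] - a[0])
              for a, b in zip(points, points[1:])]
    lengths = [math.dist(a, b) for a, b in zip(points, points[1:])]
    total, out, seg, acc = sum(lengths), [], 0, 0.0
    for k in range(samples):
        s = (k + 0.5) / samples * total               # arclength sample position
        while seg < len(lengths) - 1 and acc + lengths[seg] < s:
            acc += lengths[seg]
            seg += 1
        out.append(angles[seg])
    return out

def partition_distance(p, q, whole_len_p, whole_len_q):
    """Eq. (3): squared turning-function difference, scaled by the relative
    sizes of the two partitions and by their size w.r.t. the whole curves."""
    fp, fq = turning_function(p), turning_function(q)
    integral = sum((a - b) ** 2 for a, b in zip(fp, fq)) / len(fp)
    lp = polyline_length(p) / whole_len_p             # l(.) normalized by whole length
    lq = polyline_length(q) / whole_len_q
    return integral * max(lp / lq, lq / lp) * max(lp, lq)

def classify(test_partitions, templates):
    """Eq. (2) plus argmin: `templates` maps a gesture label to its list of
    partitions, assumed to match the test curve's fingertip count."""
    def total_distance(parts_a, parts_b):
        la = sum(map(polyline_length, parts_a))
        lb = sum(map(polyline_length, parts_b))
        return sum(partition_distance(a, b, la, lb)
                   for a, b in zip(parts_a, parts_b))
    return min(templates, key=lambda lab: total_distance(test_partitions, templates[lab]))
```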

III. EXPERIMENTS

In this section, we conduct a series of experiments to substantiate the effectiveness and efficiency of our fingertips detection and hand gesture recognition method. We choose the challenging real-life NTU-Microsoft Kinect Hand Gesture Dataset [16]. This dataset contains 10 gestures, from number 1 to 10, each of which has 100 cases with variations in rotation, scale, articulation, occlusion, etc. Here we set the pre-specified threshold K_T = 0.4, as this can well discriminate non-salient fingertips from large shape variations and noise. All experiments are done on a 3.4-GHz Intel i7-4770 CPU with 8 GB of RAM. For the sake of comparison, we try our best to ensure that our implementation environment and parameter settings are the same as those in [6].

Firstly, we compare our DCE-based fingertips detection method with the curvature-based method [14] in terms of detection precision. Figure 3 shows the receiver operating characteristic (ROC) curves for fingertips detection.

Fig. 3. ROC curves for fingertips detection on the NTU-Microsoft Kinect Hand Gesture Dataset [16].

As we can see, our DCE-based method outperforms the curvature-based method [14]. Specifically, with the detection threshold α_T = 3π/5, our method achieves a true positive rate of 99.48% at a false positive rate of 0.66%, which is sufficient to meet the requirements of practical applications. Therefore, we use α_T = 3π/5 in the following experiments.

Secondly, we compare our DCE-based hand gesture recognition method with existing methods in terms of both accuracy and efficiency. Figure 4 shows the confusion matrix of our DCE-based method, and Table I lists the mean accuracy and mean time of Shape Context [17], Skeleton Matching [18], FEMD [6], Persistence Diagram (PD) [19], H3DF [9], [10], and our method.

Fig. 4. The confusion matrix of our hand gesture recognition method.

TABLE I
THE MEAN ACCURACY AND MEAN TIME FOR VARIOUS HAND GESTURE RECOGNITION METHODS

Method                                   | Mean Accuracy | Mean Time
Shape Context without bending cost [17]  | 83.2%         | 12.346 s
Shape Context with bending cost [17]     | 79.1%         | 26.777 s
Skeleton Matching [18]                   | 78.6%         | 2.4449 s
MNCD + FEMD [6]                          | 93.9%         | 4.0012 s
TD + FEMD [6]                            | 93.2%         | 0.0750 s
PSD + FEMD [8]                           | 94.1%         | 1.967 s
fcenter (10 templates) + PD [19]         | 86.4%         | 0.0750 s
Multi-functions (10 templates) + PD [19] | 90.1%         | 0.3750 s
fcenter (20 templates) + PD [19]         | 87.6%         | 0.1057 s
Multi-functions (20 templates) + PD [19] | 95.4%         | 0.5285 s
H3DF + SVM [9]                           | 95.5%         | N.A.
H3DF + SRC [10]                          | 97.4%         | N.A.
Our DCE-based method                     | 97.5%         | 0.0568 s

As we can see, our DCE-based hand gesture recognition method achieves the highest mean accuracy of 97.5% with the shortest mean time of 0.0568 s, demonstrating excellent performance in terms of both accuracy and efficiency.

IV. CONCLUSION

In this paper, we propose a novel fingertips detection and hand gesture recognition method with a Kinect sensor. With the help of depth information, our method can accurately collect hand curves, and then detect fingertips and recognize hand gestures using the DCE method. Experiments confirm that our method achieves good performance at a low cost, and it thus has the potential to be applied in areas such as digital home entertainment, interactive television services, smart wear, and intelligent driving.

ACKNOWLEDGMENT

The authors thank Shaojun Zhu from Rutgers University for helpful discussions. The authors also thank the four anonymous reviewers for their constructive comments. This work is supported in part by the National Natural Science Foundation of China (grant No. 61501208) and the Scientific Research Foundation of Jianghan University to Dr. Zhongyuan Lai. The corresponding author is Zhongyuan Lai.

REFERENCES

[1] S. Mitra and T. Acharya, "Gesture recognition: A survey," IEEE Trans. Syst., Man, Cybern. C, vol. 37, pp. 311–324, May 2007.
[2] J. P. Wachs, M. Kölsch, H. Stern, and Y. Edan, "Vision-based hand-gesture applications," Commun. ACM, vol. 54, pp. 60–71, Feb. 2011.
[3] A. Erol, G. Bebis, M. Nicolescu, R. D. Boyle, and X. Twombly, "Vision-based hand pose estimation: A review," Comput. Vision Image Understand., vol. 108, pp. 52–73, Oct./Nov. 2007.
[4] Kinect for XBOX 360, Microsoft, 2010.
[5] Project Natal 101, Microsoft, 2009.
[6] Z. Ren, J. Yuan, J. Meng, and Z. Zhang, "Robust part-based hand gesture recognition using Kinect sensor," IEEE Trans. Multimedia, vol. 15, pp. 1110–1120, Aug. 2013.
[7] Z. Ren, J. Yuan, and W. Liu, "Minimum near-convex shape decomposition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, pp. 2546–2552, Oct. 2013.
[8] C. Wang, Z. Lai, and H. Wang, "Hand gesture recognition based on perceptual shape decomposition with a Kinect camera," IEICE Trans. Inf. Syst., vol. E96-D, pp. 2147–2151, 2013.
[9] C. Zhang, X. Yang, and Y. Tian, "Histogram of 3D facets: A characteristic descriptor for hand gesture recognition," in Proc. IEEE FG'13, 2013, pp. 1–8.
[10] C. Zhang and Y. Tian, "Histogram of 3D facets: A depth descriptor for human action and hand gesture recognition," Comput. Vision Image Understand., vol. 139, pp. 29–39, Oct. 2015.
[11] H. Liang, J. Hou, J. Yuan, and D. Thalmann, "Random forest with suppressed leaves for Hough voting," in Proc. ACCV'16, 2016.
[12] L. J. Latecki and R. Lakämper, "Polygon evolution by vertex deletion," in Proc. Int. Conf. on Scale-Space Theories in Computer Vision, 1999, pp. 398–409.
[13] Z. Lai, W. Liu, F. Zhang, and G. Cheng, "Perceptual distortion measure for polygon-based shape coding," IEICE Trans. Inf. Syst., vol. E96-D, pp. 750–753, Mar. 2013.
[14] A. A. Argyros and M. I. Lourakis, "Vision-based interpretation of hand gestures for remote control of a computer mouse," in Proc. ECCV Workshop HCI, 2006, pp. 40–51.
[15] L. J. Latecki and R. Lakämper, "Application of planar shape comparison to object retrieval in image databases," Pattern Recognit., vol. 35, pp. 15–29, Jan. 2002.
[16] Z. Ren. (2011) NTU-Microsoft Kinect Hand Gesture Dataset. [Online]. Available: http://www.ntu.edu.sg/home/renzhou/HandGesture.htm
[17] S. Belongie, J. Malik, and J. Puzicha, "Shape matching and object recognition using shape contexts," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, pp. 509–522, Apr. 2002.
[18] X. Bai and L. J. Latecki, "Path similarity skeleton graph matching," IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, pp. 1–11, Jul. 2008.
[19] C. Li, M. Ovsjanikov, and F. Chazal, "Persistence-based structural recognition," in Proc. IEEE CVPR'14, 2014, pp. 2003–2010.