A Fast Algorithm for Hand Gesture Recognition Using Relief

2009 Sixth International Conference on Fuzzy Systems and Knowledge Discovery

Jianjie Zhang, Hao Lin, Mingguo Zhao
Department of Automation, Tsinghua University, Beijing, China
[email protected]


Abstract—This paper describes a gesture recognition algorithm that can effectively recognize single-hand gestures in real time in complex environments. The system involves a vocabulary of 20 gestures consisting of part of the Chinese sign language finger-spelling alphabet and digits. A new complexion model is proposed to extract hand regions under a variety of lighting conditions. Real-time performance is achieved through a novel combination of voting theory and the Relief algorithm. The system is used in our mobile robot control project.


Keywords-gesture recognition; complexion model; relief algorithm

I. INTRODUCTION

At present, gesture recognition is an area of active research in computer vision and pattern recognition. Mitra [1] defines gesture recognition as "a process by which the gestures made by users are recognized by the receiver". It is an important human-machine interaction technique in daily life, with a wide range of applications [2], such as developing aids for the elderly or the disabled, recognizing sign languages, and interacting with young children. Our work is to develop a program for aiding the elderly and the disabled: a mobile robot is controlled by hand gestures given by the user to help these people complete a variety of actions. Due to real-time system requirements, a fast algorithm with efficient computation is proposed in this paper.

A. Related Work
A template-based hand pose recognition system was presented by Stenger [3]: discriminative features of color and motion were used to extract likely hand regions, and a nearest-neighbor classifier recognized ten kinds of hand gestures. Lockton [4] took a deterministic boosting approach to distinguish between 46 single-hand gestures consisting of part of American Sign Language; that system used an RGB complexion model to extract hand regions and a wrist band to bring all gesture images to the same scale, but its experimental environment was simple. Chen and Tseng [5] proposed a multiple-angle gesture recognition system in which three cameras, from the left, right and front, captured a hand gesture, three support vector machine classifiers were trained, and the outputs of the three classifiers were fused to obtain a final result. Stergiopoulou [6] recognized 31 hand gestures using an innovative self-growing and self-organized neural gas network, with which the topology of a gesture was extracted as features and recognition was completed by a probability-based method.

B. System Description
The gesture vocabulary includes common Chinese sign language finger-spelling letters and digits, as shown in Fig. 1. The gesture "six" is very similar to the gesture "Y", so to distinguish the two, the gesture "six" is rotated 90 degrees when shown, as are some other similar gestures. When a gesture's rotation angle is considerably large, 45 degrees or more, it is treated as a different gesture. The experimental environment is relatively complex, with a variety of lighting conditions, as shown in Fig. 2. The recognition system is realized as a multi-threaded program: three major threads handle image capture, gesture extraction and gesture recognition respectively. The programming platform is LABVIEW, produced by National Instruments, and the system uses the queue mechanism provided by LABVIEW for communication and data transfer among the three threads (a sketch of this producer-consumer structure follows this subsection). A gesture is reported to the image processing module when it is detected as static in five successive frames, and is then sent to the recognition module to obtain the final result.
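The following minimal sketch illustrates the three-thread, queue-based pipeline in Python. The original system is a LABVIEW program, so this is only an analogy; the callables read_frame, extract and recognize (and the camera object in the usage comment) are hypothetical stand-ins.

```python
import itertools
import queue
import threading

frame_q = queue.Queue(maxsize=8)    # capture thread -> extraction thread
gesture_q = queue.Queue(maxsize=8)  # extraction thread -> recognition thread

def capture_loop(read_frame):
    for _ in itertools.count():
        frame_q.put(read_frame())           # blocking hand-off, as with a LABVIEW queue

def extraction_loop(extract):
    static_count, last = 0, None
    while True:
        g = extract(frame_q.get())          # segment the hand region from one frame
        static_count = static_count + 1 if g == last else 1
        last = g
        if static_count == 5:               # gesture static in five successive frames
            gesture_q.put(g)
            static_count = 0

def recognition_loop(recognize):
    while True:
        print(recognize(gesture_q.get()))   # final classification result

# one thread per stage, mirroring the paper's three major threads:
# threading.Thread(target=capture_loop, args=(camera.read,), daemon=True).start()
```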

C. Organization
The paper is organized as follows. Section 2 introduces the process of gesture image acquisition and proposes a new complexion model. Section 3 describes a new algorithm for gesture recognition. Experimental results and analysis are presented in Section 4. Section 5 gives conclusions and discussion.

Figure 1. The vocabulary of gestures


II. GESTURE IMAGE ACQUISITION

Gestures are acquired using a Logitech camera observing a relatively complex lab environment under room lighting. The resolution of acquired images is 640*480 and the capture rate is 15 fps. Demonstrators stand 60 to 100 centimeters away from the camera, because standing too near or too far decreases the quality of gesture images. Demonstrators show gestures with their right hands and make sure the hand is not at the edge of the image. To capture correct hand gestures, it is preferable for demonstrators to wear clothing that covers their arms.

Figure 2. Complexion model (raw images above; images below processed by the complexion model)

A. Adaptive Complexion Model Based on Mean Intensity
To extract hand pixels, the system adopts an adaptive complexion model to detect skin-like pixels. Traditional complexion models, such as the RGB model [7][8], do not work well when the lighting changes, especially when the environment is very dark or very bright. In the HSI complexion model, hue and saturation cluster well under some lighting conditions, but the usable range of lighting is usually narrow. Through a variety of experiments, we found that the hue and saturation of skin color always change subtly with intensity. Four complexion models are therefore derived from the mean intensity of an image: when the mean intensity falls in a given interval, such as [0, 40], the complexion model corresponding to that interval is used to detect skin-like pixels in the image. After tests on selecting proper intervals, four intervals proved relatively robust under a variety of lighting conditions: 0 to 40, 40 to 70, 70 to 100 and 100 to 255. The four complexion models are shown in Tab. I.

TABLE I. COMPLEXION MODEL

Mean in Intensity Plane | Hue     | Saturation | Intensity
0 to 40                 | 0 to 90 | 20 to 98   | 10 to 80
40 to 70                | 0 to 80 | 20 to 100  | 15 to 90
70 to 100               | 0 to 70 | 15 to 100  | 30 to 160
100 to 255              | 0 to 50 | 0 to 255   | 40 to 255

This complexion model yields a binary image in which skin-like pixels are represented by 1 and background pixels by 0. The result in Fig. 2, where skin-like pixels are marked by red points, demonstrates the effectiveness of our adaptive complexion model under a variety of lighting conditions.
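As a concrete illustration, the interval lookup and per-plane thresholding of Table I could be implemented as follows. This sketch assumes the hue, saturation and intensity planes are already scaled to the units used in Table I; the paper does not state the scaling explicitly.

```python
import numpy as np

# Intensity intervals and (hue, saturation, intensity) bounds from Table I.
MODELS = [
    (40,  (0, 90), (20, 98),  (10, 80)),
    (70,  (0, 80), (20, 100), (15, 90)),
    (100, (0, 70), (15, 100), (30, 160)),
    (256, (0, 50), (0, 255),  (40, 255)),
]

def skin_mask(hue, sat, inten):
    """Binary skin map: select the model whose interval contains the image's
    mean intensity, then threshold each HSI plane."""
    mean_i = inten.mean()
    for upper, (h0, h1), (s0, s1), (i0, i1) in MODELS:
        if mean_i < upper:
            return ((h0 <= hue) & (hue <= h1) &
                    (s0 <= sat) & (sat <= s1) &
                    (i0 <= inten) & (inten <= i1)).astype(np.uint8)
```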

B. Filter
The binary image obtained from the complexion model cannot be used for recognition directly; it needs further processing. First, skin-like pixels connected to the edge of the image are removed, because hand regions are seldom at the image edge. Second, many scattered skin-like pixels are removed by applying an erosion operation two or three times. Third, holes in the image are filled by a closing operation. Finally, the skin-like region with the maximal area is extracted, because the hand region is the region of interest and is always bigger than any other region. The result shown in Fig. 3 demonstrates that the filter removes regions regarded as noise.

Figure 3. The result of filtering (left: before filtering; right: after filtering)
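A sketch of these four filtering steps, using OpenCV morphology and connected-component analysis; the library choice and the 3*3 kernel are our assumptions, as the paper names neither.

```python
import cv2
import numpy as np

def filter_mask(mask):
    """Clean the binary skin map (uint8, 1 = skin-like) in four steps."""
    mask = mask.copy()
    # 1. drop skin-like components that touch the image border
    _, labels, _, _ = cv2.connectedComponentsWithStats(mask)
    edge_labels = set(np.unique(labels[0, :])) | set(np.unique(labels[-1, :])) \
                | set(np.unique(labels[:, 0])) | set(np.unique(labels[:, -1]))
    for lbl in edge_labels:
        mask[labels == lbl] = 0
    # 2. erode twice to remove scattered skin-like pixels
    kernel = np.ones((3, 3), np.uint8)
    mask = cv2.erode(mask, kernel, iterations=2)
    # 3. fill holes with a closing operation
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    # 4. keep only the largest remaining component (the hand)
    n, labels, stats, _ = cv2.connectedComponentsWithStats(mask)
    if n > 1:
        hand = 1 + int(np.argmax(stats[1:, cv2.CC_STAT_AREA]))
        mask = (labels == hand).astype(np.uint8)
    return mask
```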

C. Gesture Normalization
The scale of the hand region changes with the distance between the demonstrator and the camera, so it is necessary to normalize each image to a uniform size. The normalization proceeds as follows. First, we calculate the centroid C of the hand region. Second, we find the minimal rectangle centered at C that covers the whole hand region, with edges parallel to those of the original image; this step eliminates the effect of translation on gestures. Third, we shrink or enlarge the image to 160*120 through linear interpolation; this step guarantees that all images are at the same scale. The result is shown in Fig. 4. Each 160*120 image is then reshaped into a column vector (19200*1), and the image vectors obtained through these steps are used for recognition.

Figure 4. Normalization (the right image is the result of normalization)
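The three normalization steps could look like this — a minimal sketch that assumes the filtered mask contains a non-empty hand region.

```python
import cv2
import numpy as np

def normalize(mask):
    """Map the filtered hand mask to a 19200*1 column vector."""
    ys, xs = np.nonzero(mask)
    cy, cx = ys.mean(), xs.mean()                 # step 1: centroid C of the hand
    # step 2: minimal axis-aligned rectangle centred at C covering the hand
    half_h = max(cy - ys.min(), ys.max() - cy)
    half_w = max(cx - xs.min(), xs.max() - cx)
    y0, y1 = int(np.floor(cy - half_h)), int(np.ceil(cy + half_h)) + 1
    x0, x1 = int(np.floor(cx - half_w)), int(np.ceil(cx + half_w)) + 1
    crop = mask[max(y0, 0):y1, max(x0, 0):x1].astype(np.float32)
    # step 3: rescale to 160*120 by linear interpolation, then flatten
    resized = cv2.resize(crop, (160, 120), interpolation=cv2.INTER_LINEAR)
    return (resized >= 0.5).astype(np.uint8).reshape(-1, 1)   # 19200*1 vector
```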

III. GESTURE RECOGNITION

The recognition algorithm is a combination of voting theory [9] and the Relief algorithm [10]. Relief is used to select which pixels vote for each category, and the final result is produced by comparing the total scores that the categories obtain.

A. Voting Theory
In voting theory, each voter gives a score to each candidate, and the candidate with the highest score is the winner. The recognition is based directly on this idea: the pixels of a gesture image are treated as voters, and they give scores to each gesture category the image may belong to. The final result is the category obtaining the highest total score, $S_k = \sum_{i=1}^{19200} s_{ik}$, where $S_k$ is the total score of the $k$-th category and $s_{ik}$ is the score that the $i$-th pixel gives to the $k$-th category.

The key problem of recognition is how to produce a score, in other words, a criterion by which each pixel evaluates each category. The score is produced as follows. For category $D_k$ we cluster its training samples into several sub-clusters $D_{kt}$, $t = 1, 2, \dots$. This can be realized by a greedy algorithm or any other clustering approach [11], such as leader-follower or the sum-of-squared-error criterion. The greedy algorithm is as follows (see the sketch after these steps):

For category $D_k$, $k = 1, 2, \dots, 20$:
S1: for $i = 1$ to $\#D_k$
S1a: if $D_k$ is not empty, randomly select an element $x$ in $D_k$, put it into $D_{ki}$, and also put into $D_{ki}$ all elements $y \in D_k$ with $d(x, y) < \alpha$; remove these elements from $D_k$.
S1b: if $D_k$ is empty, break.

Here $\#D_k$ is the number of elements in $D_k$, $d(x, y)$ is the absolute distance between $x$ and $y$, and $\alpha$ is a threshold value.
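A direct sketch of steps S1a/S1b; the paper's "absolute distance" $d(x, y)$ is taken here as the L1 norm, which is our assumption.

```python
import numpy as np

def greedy_cluster(samples, alpha, rng=None):
    """Split one category's training vectors (a list of 19200-element
    binary vectors) into sub-clusters D_kt."""
    rng = rng or np.random.default_rng()
    remaining = list(samples)
    clusters = []
    while remaining:                                     # S1b: stop when D_k is empty
        x = remaining.pop(rng.integers(len(remaining)))  # S1a: random seed element
        near = [y for y in remaining if np.abs(x - y).sum() < alpha]
        remaining = [y for y in remaining if np.abs(x - y).sum() >= alpha]
        clusters.append(np.stack([x] + near))
    return clusters

# each sub-cluster is then represented by its mean image vector M_kt:
# M_k = np.stack([c.mean(axis=0) for c in greedy_cluster(category_samples, alpha)])
```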

Each sub-cluster $D_{kt}$ is represented by its mean image vector

$M_{kt} = \frac{1}{\#D_{kt}} \sum_{x \in D_{kt}} x$,

where $\#D_{kt}$ is the number of elements in $D_{kt}$. Each element of $M_{kt}$ is in fact the probability that the corresponding pixel is skin-like, because skin-like pixels are valued 1 and background pixels 0.

For the $i$-th element of a gesture vector $G$ to be recognized: if it is a skin-like pixel, the score it gives to category $D_k$ is the maximal probability of a skin-like pixel among the $i$-th elements of the $M_{kt}$, $t = 1, 2, \dots$; if it belongs to the background, the score is the maximal probability of a non-skin pixel among those $i$-th elements. The score given to category $D_k$ is therefore

$s_{ik} = \begin{cases} \max_t(M_{kti}), & \text{if } G_i \text{ is skin-like} \\ \max_t(1 - M_{kti}), & \text{if } G_i \text{ is non-skin} \end{cases}$

where $s_{ik}$ is the score that the $i$-th element of the gesture vector $G$ gives to $D_k$, $M_{kti}$ is the $i$-th element of the vector $M_{kt}$, and $G_i$ is the $i$-th element of $G$.

Because gestures in the same category always differ from each other due to rotation, the clustering step puts all similar gestures into one sub-cluster, so each category is actually divided into several sub-categories according to appearance. As long as the number of training samples is large enough to cover a variety of rotations for each category, the system is robust to rotation.
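Putting the two formulas above together, the voting score of one category could be computed as follows; the `means` list in the usage comment is a hypothetical container of the per-category sub-cluster means.

```python
import numpy as np

def category_score(G, M_k):
    """Total score S_k that binary gesture vector G gives to category k.
    M_k is a T*19200 array whose rows are the sub-cluster means M_kt."""
    G = G.ravel()
    p_skin = M_k.max(axis=0)             # max_t M_kti: skin-like evidence per pixel
    p_back = (1.0 - M_k).max(axis=0)     # max_t (1 - M_kti): background evidence
    return np.where(G == 1, p_skin, p_back).sum()   # each pixel casts one vote

# the recognized gesture is the category with the highest total score:
# result = max(range(20), key=lambda k: category_score(G, means[k]))
```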


B. Screening Pixels Based on Relief
It is time-consuming to compute scores for all 19200 (160*120) pixels, especially on embedded systems. Birk [12] used PCA for screening, but PCA rests on the assumption that pixels are linearly related. Lockton [4] used deterministic boosting, but the loss function in his paper did not represent the differences among categories well. In fact, what matters is finding pixels whose features are distinctive enough to distinguish one category from all the others, so the Relief algorithm [10] is adopted to choose distinctive pixels. Each pixel is given a weight evaluating its contribution to classification; when computing total scores, only pixels with high weights are used, while pixels with low weights are ignored. The weights must therefore both separate different gesture categories and keep samples of the same category together: the higher the weight, the more capable the pixel is of classifying different categories. The algorithm is as follows (see the sketch after these steps):

Input: training samples $X = \{x_i \in R^d\}$, $i = 1, 2, \dots, N$; the number of random samples $n$
S1: set the weight vector $w = [w_1, \dots, w_d]^T = 0$
S2: for $i = 1$ to $n$
S2a: select an element $x$ from the training samples at random
S2b: find the nearest sample $h$ belonging to the same category as $x$, and the nearest sample $m$ belonging to a different category
S2c: for $j = 1$ to $d$: $w_j = w_j - \mathrm{diff}(j, x, h)/n + \mathrm{diff}(j, x, m)/n$

Here $\mathrm{diff}(j, x_1, x_2)$ is the absolute distance between samples $x_1$ and $x_2$ in the $j$-th dimension.

Pixels with high weights matter more than those with low weights when judging a gesture, so a threshold is chosen and pixels whose weight falls below it are ignored. The score of category $D_k$ becomes

$S_k = \sum_{i=1,\, w_i > \alpha}^{d} s_{ik}$,

where $\alpha$ is the threshold value determining which pixels are ignored and which are computed. Through a variety of experiments, $\max(w) * 0.2$ is chosen as the threshold when both recognition speed and accuracy are taken into account. Recognition results are discussed in the following section.
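A sketch of steps S1-S2c; as above, "absolute distance" is taken as the L1 norm, and each category is assumed to contain at least two training samples so that a nearest hit always exists.

```python
import numpy as np

def relief_weights(X, y, n, rng=None):
    """Relief weights for the d pixel features. X is an N*d matrix of
    training vectors; y is an N-element array of category labels."""
    rng = rng or np.random.default_rng()
    N, d = X.shape
    w = np.zeros(d)                                    # S1: zero weight vector
    for _ in range(n):                                 # S2: n random rounds
        i = rng.integers(N)                            # S2a: random sample x
        x = X[i]
        hits = (y == y[i])
        hits[i] = False                                # x is not its own nearest hit
        misses = y != y[i]
        dists = np.abs(X - x).sum(axis=1)              # absolute (L1) distances
        h = X[np.where(hits)[0][np.argmin(dists[hits])]]      # S2b: nearest hit
        m = X[np.where(misses)[0][np.argmin(dists[misses])]]  # S2b: nearest miss
        w += (np.abs(x - m) - np.abs(x - h)) / n       # S2c: per-dimension update
    return w

# screening: keep only pixels whose weight exceeds the chosen threshold
# w = relief_weights(X, y, n); active = w > 0.2 * w.max()
```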

IV. EXPERIMENT RESULT

For each gesture category, 100 samples with a variety of small rotations (from -10 degrees to 10 degrees, performed at random by demonstrators) were captured; 50 serve as the training set and 50 as the testing set. The parameter $n$ of the Relief algorithm is set to 80% of the number of training samples. Average recognition results over several experiments are shown in Tab. II. When the threshold value is $\max(w) * 0.2$, the recognition rate declines very little while the computation cost decreases by close to 70%. Moreover, the recognition rate is still acceptable when computing only about 2500 pixels, which is 13% of the 19200 pixels.

TABLE II. RECOGNITION RESULT

Threshold value | Number of Pixels for Calculating | Recognition Rate
no threshold    | 19200                            | 99.4%
max(w) * 0.1    | near 9000                        | 99.2%
max(w) * 0.2    | near 6000                        | 98.6%
max(w) * 0.3    | near 4000                        | 96.4%
max(w) * 0.4    | near 2500                        | 93.0%

We also compare our results with Lockton [4]: his deterministic boosting algorithm was applied to our gesture database, the same number of pixels was selected for both algorithms, and the results are shown in Tab. III. Although some results of deterministic boosting are better than those of Relief, such as when calculating 40% of the overall pixels, most results of deterministic boosting are weaker than those of Relief. In fact, both algorithms are based on weak classifiers, so some small fluctuation in recognition results is normal; the overall trend demonstrates the effectiveness of our algorithm.

TABLE III. COMPARISON BETWEEN TWO ALGORITHMS

Number of Pixels for Calculating | Deterministic Boosting | Relief Algorithm
19200*50%                        | 99.3%                  | 99.6%
19200*40%                        | 98.7%                  | 99.0%
19200*30%                        | 97.9%                  | 97.9%
19200*20%                        | 96.2%                  | 95.7%
19200*15%                        | 93.5%                  | 94.4%
19200*10%                        | 91.0%                  | 91.9%

V. CONCLUSIONS AND DISCUSSION

The paper proposes a fast and simple algorithm for hand gesture recognition. Given images of the hand, the algorithm segments the hand region with a complexion model and obtains the final recognition result by the voting of high-weight pixels. The effectiveness and robustness of this algorithm on gesture images has been demonstrated, and to our knowledge this is the first use of the Relief algorithm in hand gesture recognition. The algorithm is applied in our mobile robot control project: eight gesture categories with high recognition rates were selected to control the robot, eight kinds of actions were defined for them, and recognition results are sent to the robot control program over TCP. When the mobile robot receives a result, it performs the action corresponding to that result. Our future work is to develop online retraining and to add more Chinese sign gestures to the vocabulary. For online retraining, once a gesture is misrecognized, it would be useful to add the misrecognized sample to the training set and correct the weights of the related pixels. Another task is to improve the complexion model: it works well when the mean intensity is far from the threshold points 40, 70 and 100, but its performance decreases slightly when the mean intensity is close to these points, so the model needs further modification to be more robust.

ACKNOWLEDGMENT
This research was performed for an intelligent human-robot interface project funded by the National High Technology Research and Development Program of China (863 Program 20060AA040203). It was also supported by the Intelligent Robot Laboratory, Tsinghua University.

REFERENCES

[1] S. Mitra and T. Acharya, "Gesture recognition: a survey," IEEE Transactions on Systems, Man, and Cybernetics—Part C: Applications and Reviews, 2007, pp. 311-324.
[2] C.L. Lisetti and D.J. Schiano, "Automatic classification of single facial images," Pragmatics & Cognition, 2000, pp. 185-235.
[3] B. Stenger, "Template-based hand pose recognition using multiple cues," ACCV 2006, 2006, pp. 551-560.
[4] R. Lockton and A.W. Fitzgibbon, "Real-time gesture recognition using deterministic boosting," BMVC 2002, 2002, pp. 1-10.
[5] Y.T. Chen and K.T. Tseng, "Multiple-angle hand gesture recognition by fusing SVM classifiers," Proceedings of the 3rd Annual IEEE Conference on Automation Science and Engineering, 2007, pp. 527-530.
[6] E. Stergiopoulou and N. Papamarkos, "A new technique for hand gesture recognition," ICIP 2006, 2006, pp. 2657-2660.
[7] A. Malima, E. Ozgur, and M. Cetin, "A fast algorithm for vision-based hand gesture recognition for robot control," IEEE Conference on Signal Processing and Communications 2006, 2006, pp. 1-4.
[8] J. Han, G.M. Awad, and A. Sutherland, "Automatic skin segmentation for gesture recognition combining region and support vector machine active learning," Proceedings of the 7th International Conference on Automatic Face and Gesture Recognition, 2006, pp. 57-64.
[9] http://www.sci.wsu.edu/math/Lessons/Voting/
[10] I. Kononenko, "Estimating attributes: analysis and extensions of RELIEF," ECML 1994, 1994, pp. 171-182.
[11] R.O. Duda, P.E. Hart, and D.G. Stork, Pattern Classification, 2nd ed., 2001.
[12] H. Birk, T. Moeslund, and C. Madsen, "Real-time recognition of hand alphabet gestures using Principal Component Analysis," SCIA 1997, 1997, pp. 261-268.

