BULETINUL INSTITUTULUI POLITEHNIC DIN IAŞI Publicat de Universitatea Tehnică „Gheorghe Asachi" din Iaşi Tomul LIX (LXIII), Fasc. 2, 2013 Secţia AUTOMATICĂ şi CALCULATOARE

A STUDY ON CLASSIFIERS ACCURACY FOR HAND POSE RECOGNITION BY

CONSTANTINA RALUCA MIHALACHE and BOGDAN APOSTOL* “Gheorghe Asachi” Technical University of Iaşi, Faculty of Automatic Control and Computer Engineering

Received: May 17, 2013 Accepted for publication: June 21, 2013

Abstract. This paper presents a comparative study between accuracy rates obtained by using different classification architectures for hand pose estimation in RGB-D data. The segmentation of a hand pose is optimized by using depth data in correlation with the grayscale image obtained from a Kinect sensor. We define an observation model composed of feature vectors obtained by calculating the histograms of oriented gradients on colour and depth data, together with fingertip positions. A contour tracking algorithm is applied to track the contour of the hand and find the fingertip positions. The most relevant features from the observation model are selected and serve as input to all the classifiers. For this work we have considered Linear, Random Forest (RF), Support Vector Machine (SVM) and Decision Tree (DT) classifiers for posture classification. Experimental results show that an 84.18% recognition accuracy is achieved by the RF classifier, a 79.29% recognition accuracy by the DT classifier and a 78.27% recognition accuracy by the SVM classifier. Multinomial regression is also used for classification purposes but shows a poor 44.26% recognition accuracy. Key words: hand pose recognition, RGB-D, Kinect, histogram of oriented gradients (HOG) features, decision tree, random forest, support vector machines (SVM), linear. 2010 Mathematics Subject Classification: 65D18, 68T05. *

Corresponding authors; e-mail: [email protected]


1. Introduction

The last few years have witnessed an increasing interest in the field of Human Computer Interaction (HCI). Among the wide variety of hand posture recognition techniques, markerless vision-based recognition has brought us non-restrictive systems for HCI (Rautaray & Agrawal, 2012). With the release of the Kinect sensor in 2010, capturing scene colour and depth data has become affordable. The device opens up whole new opportunities for applications in the domain of computer vision and pattern recognition. These techniques have applications in sign language recognition, gestural communication and automatic device control.

Classification of hand gestures is the problem of taking a given hand observation model for an instance whose category is unknown and finding the category that the model is closest to. Based on features extracted from the current model and the data learned from the training set, a prediction is made for the current instance. The classification process is the last operation that a recognition system performs. Before classification we need to make an initial segmentation to extract the region of interest, correlate depth and colour data, extract hand features from the selected region and decide which features are the most relevant to the recognition process.

In this paper we compare the classification accuracy of four different classifiers applied to the same dataset. The dataset contains features extracted for four different hand poses and a wide variety of camera angles.

a

b

c

d

Fig. 1 – Four hand poses classes contained by the training dataset: a − open hand, b − peace sign, c − ok pose, d − like sign.

The considered hand gestures are: open hand pose, peace pose, like pose and ok pose, as shown in Fig. 1. Decision Trees, Random Forests, Support Vector Machines and Multinomial Regression machine learning approaches are used for comparison purposes. The remainder of this paper is organized as follows: Section 2 explains the details of calculating and selecting the most important feature vectors obtained through fingertip recognition and through calculating the histograms of oriented gradients (HOG) on both colour and depth data; Section 3 reviews the four hand pose classification


techniques that we have chosen for our study; Section 4 describes the setup for each classifier and compares the accuracy rates obtained through experiments run on the same feature dataset; Section 5 concludes with a summary and discussion.

2. Feature Extraction from RGB-D Data

The input of the classification algorithm is a set of the most important features extracted from the raw video data. The RGB-D format of the stream captured by a Microsoft Kinect sensor combines visual (RGB colour) and geometric (depth) information in a synchronized format that allows features to be extracted from both. Segmentation is an important first step in the process of feature extraction, as it eliminates the background and keeps just the tracked object. In this paper we perform a segmentation based only on the depth data and create a validity mask that can be applied to both depth and colour data to extract the hand region. From the segmented data we extract fingertip positions and features that characterize the local appearance of the hand.

We track the contour of the hand in the valid depth data by classifying the pixels corresponding to the validity mask as either interior pixels or contour pixels. Contour tracking of objects in binary images is a well-known subject, and solutions such as those in (Ren et al., 2002; Chang et al., 2004; Yan & Min, 2011) scan the images pixel by pixel in different directions, with different starting points, to find the pixels on the contours. In this paper we start bottom-up and scan each line of the image for a valid contour point. A valid contour point is a point that has at least one neighbouring point that is not in the validity mask and is thus an exterior point. After finding the first contour point we use a 3 × 3 neighbourhood to look for the next pixel of the contour. Once this is found we add it to an ordered list and keep the search direction of the last valid contour point.
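A minimal sketch of this bottom-up contour scan, assuming the validity mask is a boolean NumPy array; function and variable names are ours, for illustration only, not the authors' implementation:

```python
import numpy as np

# 8-connected neighbourhood offsets (the 3 x 3 search window minus the centre).
NEIGHBOURS = [(0, 1), (-1, 1), (-1, 0), (-1, -1),
              (0, -1), (1, -1), (1, 0), (1, 1)]

def is_contour(mask, r, c):
    """A valid point is a contour point if at least one neighbour
    lies outside the validity mask (i.e. is an exterior point)."""
    h, w = mask.shape
    for dr, dc in NEIGHBOURS:
        nr, nc = r + dr, c + dc
        if not (0 <= nr < h and 0 <= nc < w) or not mask[nr, nc]:
            return True
    return False

def find_start(mask):
    """Scan the mask bottom-up, line by line, for the first valid
    contour point, as described in the text."""
    h, w = mask.shape
    for r in range(h - 1, -1, -1):
        for c in range(w):
            if mask[r, c] and is_contour(mask, r, c):
                return (r, c)
    return None

def trace_contour(mask):
    """Collect the contour as an ordered list by repeatedly searching
    the 3 x 3 neighbourhood of the last accepted point."""
    start = find_start(mask)
    if start is None:
        return []
    contour, visited, current = [start], {start}, start
    while True:
        nxt = None
        for dr, dc in NEIGHBOURS:
            cand = (current[0] + dr, current[1] + dc)
            if (0 <= cand[0] < mask.shape[0] and 0 <= cand[1] < mask.shape[1]
                    and cand not in visited and mask[cand]
                    and is_contour(mask, *cand)):
                nxt = cand
                break
        if nxt is None:
            break
        contour.append(nxt)
        visited.add(nxt)
        current = nxt
    return contour
```

For a solid rectangular mask this returns exactly the boundary pixels in traversal order, starting from the bottom-left valid point.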
A k-curvature algorithm similar to the one in (Ryan, 2012) is used to find the curves along the tracked contour of the hand. For each point P_i that is a valid contour point, the algorithm chooses two points P_{i−k} and P_{i+k} and calculates the angle ω between them. We consider that P_i is a valid curve point if ω is smaller than an empirically chosen value. In this paper we have used a k value of 20 and an ω threshold of 0.87 radians. The curvature points found are not all fingertip positions; these points can also correspond to valleys in the contour of the hand. We decide whether a point is a fingertip by calculating the bisector of P_iP_{i−k} and P_iP_{i+k} and choosing the points for which the bisector points to the interior of the hand, see Fig. 2.

Methods like Harris corner detection, the Scale Invariant Feature Transform (SIFT) and histograms of oriented gradients (HOG) successfully extract features from visual data by aggregating gradients. These methods split the grayscale images into equal regions and apply masks to calculate gradient orientations.


Fig. 2 – Choosing fingertips by calculating the bisector orientation of curvature points.
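The k-curvature fingertip test illustrated in Fig. 2 can be sketched as follows. This is a minimal illustration under our own assumptions (an ordered contour list and a precomputed hand centroid as inputs), not the authors' implementation:

```python
import numpy as np

def k_curvature_fingertips(contour, centroid, k=20, max_angle=0.87):
    """P_i is a curve point if the angle between (P_{i-k} - P_i) and
    (P_{i+k} - P_i) is below max_angle (radians); it is a fingertip if
    the bisector of the two vectors points towards the hand interior
    (for valleys the bisector points away from it)."""
    pts = np.asarray(contour, dtype=float)
    n = len(pts)
    tips = []
    for i in range(n):
        a = pts[(i - k) % n] - pts[i]
        b = pts[(i + k) % n] - pts[i]
        na, nb = np.linalg.norm(a), np.linalg.norm(b)
        if na == 0 or nb == 0:
            continue
        angle = np.arccos(np.clip(np.dot(a, b) / (na * nb), -1.0, 1.0))
        if angle < max_angle:
            bisector = a / na + b / nb           # points into the curve
            to_centroid = centroid - pts[i]
            # Fingertip: the curve opens away from the hand interior,
            # so the bisector is roughly aligned with the centroid direction.
            if np.dot(bisector, to_centroid) > 0:
                tips.append(tuple(pts[i].astype(int)))
    return tips
```

On a synthetic wedge-shaped contour, the sharp apex passes the test while points on the shallow sides do not.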

We calculate two sets of features corresponding to both visual and geometric data. From the colour image we obtain a grayscale image on which we compute HOG feature vectors. In order to obtain a well-defined grayscale image on which to apply HOG feature extraction, we also have to apply a transformation to the depth data: a Cumulative Distribution Function (CDF) is used to normalize the distribution of the depth values and represent them in a grayscale image. For the tests in this paper we split the grayscale images obtained from both the colour and depth sources into 6 × 6 pixel cells, organized in blocks composed of 3 × 3 cells, and quantize all the gradient directions in the image into 9 valid directions (Mihalache & Apostol, 2013).

We reduce the dimensionality of the HOG features extracted from RGB-D data and remove redundant features by using a Kernel Principal Component Analysis (KPCA) method. This method performs better than linear PCA as it ignores noise from input features and removes noise from test features by projecting the data onto the manifold (Cheng et al., 2009). For the experiments conducted in this paper we used a Gaussian Radial Basis Function (RBF) with a value for σ of 0.223. Similar to (Oikonomidis et al., 2011), we define the observation model that feeds the rest of the algorithm, composed of two arrays containing the selection of the most important HOG features from the segmented image of the hand and from the corresponding valid depth map, and an array of fingertip positions.

3. Hand Pose Recognition Classifiers

In recent years, the task of hand pose recognition has been studied extensively and rapid progress has been made in this area, including machine learning approaches such as Decision Trees, Random Forests, Support Vector Machines and Linear Multinomial Regression.
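Before turning to the classifiers, the Section 2 feature pipeline (per-cell gradient-orientation histograms followed by RBF-kernel KPCA with σ = 0.223) can be approximated as follows. This is a deliberately simplified sketch: block normalization is omitted, and all function names are ours:

```python
import numpy as np

def hog_features(gray, cell=6, bins=9):
    """Minimal HOG sketch: quantize unsigned gradient orientations into
    `bins` directions and accumulate gradient magnitudes per cell."""
    gy, gx = np.gradient(gray.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)          # unsigned orientation
    h, w = gray.shape
    feats = []
    for r in range(0, h - cell + 1, cell):
        for c in range(0, w - cell + 1, cell):
            m = mag[r:r + cell, c:c + cell].ravel()
            a = ang[r:r + cell, c:c + cell].ravel()
            idx = np.minimum((a / np.pi * bins).astype(int), bins - 1)
            feats.append(np.bincount(idx, weights=m, minlength=bins))
    return np.concatenate(feats)

def kpca_rbf(X, n_components, sigma=0.223):
    """Kernel PCA with a Gaussian RBF kernel: build and centre the
    kernel matrix, then project onto the leading eigenvectors."""
    sq = np.sum(X ** 2, axis=1)
    K = np.exp(-(sq[:, None] + sq[None, :] - 2 * X @ X.T) / (2 * sigma ** 2))
    n = K.shape[0]
    one = np.full((n, n), 1.0 / n)
    Kc = K - one @ K - K @ one + one @ K @ one       # centre in feature space
    vals, vecs = np.linalg.eigh(Kc)
    order = np.argsort(vals)[::-1][:n_components]    # largest eigenvalues first
    vals, vecs = vals[order], vecs[:, order]
    return Kc @ vecs / np.sqrt(np.maximum(vals, 1e-12))
```

A 12 × 12 image with 6 × 6 cells and 9 bins yields a 36-dimensional HOG vector; `kpca_rbf` then keeps only the leading principal components, as in the paper.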


In this section we detail the proposed method of hand pose estimation using the above-mentioned classifiers and present the training and testing steps. The training phase consists of three steps: (1) acquire RGB images and their associated depth information for m different hand poses from the Kinect sensor; (2) reduce the obtained HOG features of both colour and depth images using KPCA; (3) build the input patterns and train the classifier.

Training patterns are presented according to our observation model, where we define the set of p training patterns over m hand poses as I_train = {O_ij}, j = 1, …, m, i = 1, …, p. As we apply a supervised learning method for each classifier, we split this training set into m classes corresponding to the number of considered hand poses (Fig. 1). A training pattern that serves as input to the trained classifier is a vector that results from applying KPCA to the HOG features. We choose the set of l_rgb (l_rgb < n_rgb) and the set of l_d (l_d < n_d) eigenvectors which have the l_rgb and l_d largest eigenvalues (l_rgb and l_d are the numbers of largest eigenvalues and n_rgb and n_d the dimensions of the covariance matrices for the colour and depth HOG features, respectively). The proposed pattern representation includes the first l_rgb and l_d principal components of the HOG data applied to the ith colour and depth image, respectively. The last element in our training pattern is the set of positions of the identified fingers.

The trained model is tested on q testing patterns (I_test = {O_ij}, j = 1, …, m, i = 1, …, q, q >> p) over the same m hand poses. Testing is performed offline, and for each image received from the Kinect sensor an observation model is created in order to estimate its hand pose.

3.1. Decision Trees

The traditional decision tree algorithm is based on recursive greedy partitioning, which builds a decision tree in a top-down manner. The algorithm starts with the original set X as the root node, iterates through each unused attribute of the set X and computes the information gain IG, where IG(Y|X) = H(Y) − H(Y|X). The information gain is obtained by subtracting the conditional entropy H(Y|X) of the given attribute from the total entropy:

\[ H(Y \mid X) = -\sum_{i=1}^{m}\sum_{j=1}^{n} p(y_i, x_j)\,\log\!\left(\frac{p(y_i, x_j)}{p(x_j)}\right) \tag{1} \]

\[ H(X) = -\sum_{i=1}^{n} p(x_i)\,\log\big(p(x_i)\big) \tag{2} \]


The method used for attribute selection minimizes the value of the entropy and maximizes the information gain. The process of decision tree generation by repeatedly splitting on attributes is equivalent to partitioning the initial training set into smaller training sets, until the entropy of each of these subsets is zero. At any stage of this process, splitting on any attribute has the property that the average entropy of the resulting subsets will be less than that of the previous training set. The gain ratio (GR) heuristic (uncertainty coefficient) can also be used for choosing the best feature; it is calculated by dividing the information gain IG by the information value IV:

\[ IV(Y \mid X) = -\sum_{j=1}^{n} p(y, x_j)\,\log\big(p(y, x_j)\big) \tag{3} \]

\[ GR(Y \mid X) = \frac{IG(Y \mid X)}{IV(Y \mid X)} \tag{4} \]
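Equations (1)–(4) translate directly into code. A minimal sketch (function names are ours, for illustration):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(X) = -sum p(x) log p(x), Eq. (2)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def conditional_entropy(labels, attr):
    """H(Y|X): entropy of the labels within each attribute value,
    weighted by that value's probability, cf. Eq. (1)."""
    n = len(labels)
    by_value = {}
    for y, x in zip(labels, attr):
        by_value.setdefault(x, []).append(y)
    return sum(len(ys) / n * entropy(ys) for ys in by_value.values())

def information_gain(labels, attr):
    """IG(Y|X) = H(Y) - H(Y|X)."""
    return entropy(labels) - conditional_entropy(labels, attr)

def gain_ratio(labels, attr):
    """GR(Y|X) = IG(Y|X) / IV(Y|X), Eqs. (3)-(4); IV plays the role of
    the split information of the attribute itself."""
    iv = entropy(attr)
    return information_gain(labels, attr) / iv if iv > 0 else 0.0
```

For a perfectly predictive binary attribute the information gain equals the full label entropy (1 bit for a balanced two-class set), while an uninformative attribute yields zero gain.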

3.2. Random Forests

Random forests are an ensemble learning method for classification that operates by building a collection of decision trees at training time and outputting the class that is the mode of the classes output by the individual trees. Let (X, Y), (X_1, Z_1), …, (X_n, Z_n) be pairs of random variables, where X is a feature vector that takes its values in ℝ^d and Y (the label) is a binary {0, 1}-valued random variable. We define a random forest (Biau et al., 2008) with m trees as a classifier consisting of a set of randomized base tree classifiers g_n(x, Z_1), …, g_n(x, Z_m), which are identically distributed random vectors, independent conditionally on X, Y and D_n, where D_n is the training dataset, i.e. the collection (X_1, Z_1), …, (X_n, Z_n). The randomizing variable determines how the successive cuts are performed when building the tree, such as the selection of the node and of the coordinate to split, as well as the position of the split. The random forest classifier takes a majority vote among the random tree classifiers. If m is large, the random forest classifier is well approximated by the averaged classifier.

A random tree classifier g_n(x, Z) is constructed as follows. All nodes of the tree are associated with rectangular cells such that, at each step of the construction of the tree, the collection of cells associated with the leaves of the tree forms a partition of [0, 1]^d. The root of the random tree is [0, 1]^d itself. At each step of the construction of the tree, a leaf is chosen uniformly at random. The split variable J is then selected uniformly at random from the d candidates x(1), …, x(d). Finally, the selected cell is split along the randomly chosen variable at a random location, chosen according to a uniform random variable on the length of the chosen side of the selected cell. The procedure is repeated k times, where k ≥ 1 is a deterministic parameter, fixed beforehand by the user, and


possibly depending on n. The randomized classifier g_n(x, Z) takes a majority vote among all Y_i for which the corresponding feature vector X_i falls in the same cell of the random partition as x.

3.3. Support Vector Machines

Support vector machines are supervised learning models with associated learning algorithms that analyze data and recognize patterns, used for classification (Cheng et al., 2009). Given a set of training patterns (x_1, y_1), …, (x_n, y_n) in ℝ^n × ℝ, drawn according to an unknown probability distribution P(x, y), and an error function V(y, f(x)), where f(x) is the predicted output instead of the ideal output y for the input x, our problem consists in finding the function f that minimizes the expected error:

\[ \int V\big(y, f(\mathbf{x})\big)\, P(\mathbf{x}, y)\, d\mathbf{x}\, dy \tag{5} \]

Having non-linearly separable training patterns, we map them with a kernel function Φ(x) into a higher-dimensional space in order to make them linearly separable. The goal of SVM is to find a hyperplane w · x − b = 0 which separates the two classes accurately while maximizing the bilateral blank area (the margin) 2/‖w‖. Thus, our problem reduces to

\[ \min \Phi(\mathbf{w}) = \frac{1}{2}\lVert \mathbf{w} \rVert^{2} = \frac{1}{2}(\mathbf{w} \cdot \mathbf{w}), \quad \text{where } y_i(\mathbf{w} \cdot \mathbf{x}_i - b) \ge 1,\ i = 1, \dots, n. \]

3.4. Multinomial Regression

The generalized linear modelling technique of multinomial logistic regression can be used to model unordered categorical response variables. Multinomial regression is useful for situations in which it is desired to classify subjects based on the values of a set of predictor variables. This type of regression is similar to logistic regression, but it is more general because the dependent variable is not restricted to two categories. Parameter estimation is performed through an iterative maximum-likelihood algorithm. Let the response variable Y have r categories, X_1, …, X_k be explanatory variables, y_i = (y_i1, …, y_ir) be the response values in the ith subgroup, having multinomial distribution M_n(n_i, p_i1, …, p_ir), β_j = (β_0j, β_1j, …, β_kj)′ be the regression coefficients for the jth response category with respect to the j*th (reference) one, and x_i = (x_i1, …, x_ik)′ be the actual values of the explanatory variables for the ith subgroup; then the general multinomial regression model is:

\[ \log\!\left(\frac{p_{ij}}{p_{ij^{*}}}\right) = \mathbf{x}_i' \boldsymbol{\beta}_j, \quad j \ne j^{*} \tag{6} \]

and the log-likelihood function is:

\[ l(\boldsymbol{\beta}) = \sum_{i=1}^{n} \log\!\left(\frac{n_i!}{\prod_{j=1}^{r} y_{ij}!}\right) + \sum_{i=1}^{n}\sum_{j=1}^{r} y_{ij}\,\log(p_{ij}), \qquad \boldsymbol{\beta} = \big(\boldsymbol{\beta}_j',\ j = 1, \dots, r,\ j \ne j^{*}\big) \tag{7} \]
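Model (6) can be evaluated numerically as a softmax with the reference category's coefficients fixed to zero. A minimal sketch (the function name and array layout are our assumptions):

```python
import numpy as np

def multinomial_probs(x, betas):
    """Evaluate model (6): log(p_j / p_{j*}) = x' beta_j for every
    non-reference category j; the reference category j* has beta = 0.
    `betas` is an (r-1, k) array and `x` a length-k vector; returns the
    r category probabilities, with the reference category last."""
    logits = np.append(betas @ x, 0.0)   # log-odds; 0 for the reference
    e = np.exp(logits - logits.max())    # numerically stabilized softmax
    return e / e.sum()
```

With all coefficients zero the categories are equiprobable; a coefficient of log 2 for a single non-reference category doubles its odds against the reference.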

4. Experimental Results

This section presents some experimental hand pose estimation results obtained by applying the classifiers described in the previous sections. We compare the accuracy rates of four different classification algorithms: Decision Trees, Random Forests, Support Vector Machines and Linear (Multinomial Regression). The performance of the classification algorithms is affected by the quality of the data source. Features that are redundant or irrelevant to hand pose estimation have already been removed through the previous KPCA filtration, and the same set of filtered features is fed to each of the classification algorithms. In this paper we used features obtained from 1000 RGB-D images for building the training set and an additional set of 300 RGB-D images for the testing data. The images contain four different valid postures of the hand (open hand, peace sign, ok pose, like sign) and also images that show no hand object.

Experimental results were obtained by using the Rattle library from R (Zhao, 2012). The Rattle package provides a graphical user interface for data mining that uses the R language. The optimal parameters were chosen empirically and are:
a) 20 min splits, 10 min buckets, 30 max depth and a complexity of 0.01 for the Decision Trees classifier;
b) 400 trees and 10 variables tried at each split for the Random Forests classifier;
c) a Gaussian Radial Basis kernel function, a value of 0.06 for sigma and a cost of 1 for the Support Vector Machine classifier;
d) 1000 max iterations for the Linear classifier.
The performance of these four algorithms can be seen in the confusion matrices shown in Tables 1–4 for the tests conducted for each of the classifiers on the test dataset.
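The paper ran these configurations through Rattle/R; a rough scikit-learn equivalent is sketched below. The parameter mapping (e.g. min splits → min_samples_split) is our assumption, and rpart's complexity parameter of 0.01 has no exact counterpart here:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

def build_classifiers():
    """The four classifiers with (approximately) the quoted settings."""
    return {
        "DT": DecisionTreeClassifier(min_samples_split=20,
                                     min_samples_leaf=10, max_depth=30),
        "RF": RandomForestClassifier(n_estimators=400, max_features=10),
        "SVM": SVC(kernel="rbf", gamma=0.06, C=1.0),
        "Linear": LogisticRegression(max_iter=1000),
    }

def compare(X_train, y_train, X_test, y_test):
    """Fit every classifier on the same feature set and report its
    accuracy on the held-out test set."""
    return {name: clf.fit(X_train, y_train).score(X_test, y_test)
            for name, clf in build_classifiers().items()}
```

Feeding all four models the identical filtered feature matrix, as the paper does, makes the resulting accuracies directly comparable.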


Table 1
Confusion Matrix for the Decision Trees Classifier

                          Predicted
Actual         Open pose  Peace pose  Like pose  OK pose   Accuracy
Open pose          28          0           0         0       100%
Peace pose          0         22           3         0        88%
Like pose           0          2          22         0        91.67%
OK pose             3          9           3         9        37.5%
Average                                                       79.29%

Table 2
Confusion Matrix for the Random Forest Classifier

                          Predicted
Actual         Open pose  Peace pose  Like pose  OK pose   Accuracy
Open pose          28          0           0         0       100%
Peace pose          0         25           0         0       100%
Like pose           0          3          16         3        72.73%
OK pose             0          6           3        16        64%
Average                                                       84.18%

Table 3
Confusion Matrix for the Support Vector Machines Classifier

                          Predicted
Actual         Open pose  Peace pose  Like pose  OK pose   Accuracy
Open pose          28          0           0         0       100%
Peace pose          0         25           0         0       100%
Like pose           0         10          11         0        52.38%
OK pose             0          8           3        17        60.71%
Average                                                       78.27%

Table 4
Confusion Matrix for the Linear (Multinomial Regression) Classifier

                          Predicted
Actual         Open pose  Peace pose  Like pose  OK pose   Accuracy
Open pose          28          0           0         0       100%
Peace pose          0          9           0        16        36%
Like pose           0          9           6         6        28.57%
OK pose             6         12           3         3        12.5%
Average                                                       44.26%


We can see in Fig. 3 a comparative representation of the accuracy detection rates for all classifiers and each hand pose. One can easily see that all classifiers obtain 100% accuracy for the open hand pose. Both the peace pose and the ok pose are best detected by the SVM and RF classifiers, while the like pose is best detected by the DT and RF classifiers.

Fig. 3 – Accuracy detection rates for each hand pose.

Fig. 4 shows the average accuracy rates of the compared classifiers. As the figure shows, the Random Forest classifier obtains the best average recognition rate.

Fig. 4 – Comparison of classifiers average accuracy rates.


5. Conclusions

In this paper we presented a comparative study of four well-known classifiers for pattern recognition. We applied these algorithms to the same dataset, with application in the domain of hand gesture recognition. As previously detailed, we used four hand gesture classes and compared the accuracy rates for each pose as well as the overall accuracy rates. In the process of creating the dataset we used RGB-D data from a Kinect sensor and extracted features from both hand shape and appearance. By applying the KPCA algorithm we chose the most relevant features from the multitude of HOG features extracted from both colour and depth images. Experimental results show that an 84.18% recognition accuracy is achieved by the RF classifier, a 79.29% recognition accuracy by the DT classifier and a 78.27% recognition accuracy by the SVM classifier. Multinomial regression was also used for classification but shows a poor 44.26% recognition accuracy. The random forest classifier thus outperforms the other three algorithms.

REFERENCES

Biau G., Devroye L., Lugosi G., Consistency of Random Forests and Other Averaging Classifiers. The Journal of Machine Learning Research, 9, 2015−2033, 2008.
Chang F., Chen C., Lu C., A Linear-Time Component-Labelling Algorithm Using Contour Tracing Technique. Computer Vision and Image Understanding, 206−220, 2004.
Cheng P., Li W., Ogunbona P., Kernel PCA of HOG Features for Posture Detection. International Conference on Image and Vision Computing New Zealand, 415−420, 2009.
Mihalache C.R., Apostol B., Hand Pose Estimation Using HOG Features from RGB-D Data. 17th International Conference on System Theory, Control and Computing (ICSTCC), 2013.
Oikonomidis I., Kyriazis N., Argyros A., Efficient Model-Based 3D Tracking of Hand Articulations Using Kinect. Proceedings of the British Machine Vision Conference, 1−11, 2011.
Rautaray S.S., Agrawal A., Vision Based Hand Gesture Recognition for Human Computer Interaction: a Survey. Artificial Intelligence Review, 1−54, 2012.
Ren M., Yang J., Sun H., Tracing Boundary Contours in a Binary Image. Image and Vision Computing, 125−131, 2002.
Ryan D.J., Finger and Gesture Recognition with Microsoft Kinect. Master's Thesis in Computer Science (TN-IDE), 2012.
Yan L., Min Z., A New Contour Tracing Automaton in Binary Image. 2011 IEEE International Conference on Computer Science and Automation Engineering (CSAE), 577−581, 2011.


Zhao Y., R and Data Mining: Examples and Case Studies. Elsevier, 2012.

A STUDY ON CLASSIFIERS ACCURACY FOR HAND POSE RECOGNITION (Romanian Summary) In recent years, human-computer interaction (HCI) has become an increasingly studied field of interest. Capturing video sequences and correlating them with depth information (RGB-D) became affordable with the release of the Kinect sensor in 2010. This paper presents a comparative study of the test accuracies obtained by applying several classification techniques to data representing features extracted from RGB-D information. An observation model is defined, composed of feature vectors obtained by computing histograms of oriented gradients (HOG) on both colour and depth data. Information about the number and position of the fingers is also included in this model. The most important features of the observation model are selected and used as input for each of the considered classifiers. The following classifiers were considered for hand pose recognition: Linear, Random Forests (RF), Support Vector Machines (SVM) and Decision Trees (DT). Experimental results show a recognition accuracy of 84.18% for the RF classifier, 79.29% for the DT classifier and 78.27% for the SVM classifier. A linear classifier is also used, obtaining a recognition accuracy of only 44.26%.