A Multimodal Approach for Recognizing Human Actions Using Depth Information

Ali Seydi Keceli, Ahmet Burak Can
Department of Computer Engineering, Hacettepe University, Ankara, Turkey
{aliseydi, abc}@cs.hacettepe.edu.tr

Abstract—Human action recognition using depth information is a trending technology, especially in human-computer interaction. Depth information may provide more robust features that increase the accuracy of action recognition. This paper presents an approach to recognizing basic human actions using depth information from RGB-D sensors. Features obtained from a trained skeletal model and from raw depth data are studied. Angle and displacement features derived from the skeletal model were the most useful in classification. However, HOG descriptors of gradient images and depth history images derived from depth data also improved classification performance when used together with the skeletal model features. Actions are classified with the random forest algorithm. The model is tested on the MSR Action 3D dataset and compared with some recent methods in the literature. According to the experiments, the proposed model produces promising results.

Keywords—Action Recognition, Pattern Recognition, Random Forest, Microsoft Kinect, Depth Maps

I. INTRODUCTION

Real-time calculation of depth maps using RGB-D sensors has lower computational complexity than stereo camera systems. Therefore, RGB-D sensors have been widely used in human action recognition, object recognition, augmented reality, and environment modeling problems in recent years. In particular, game consoles supporting depth sensors and real-time depth map generation have enriched the interactive gaming experience by allowing recognition of user actions. Depth data provide more information about the human silhouette and increase the accuracy of action recognition. Gesture and activity recognition has been studied extensively for 2D videos; in these studies, grayscale intensity, color, texture, and motion based features are commonly used to recognize actions. The widespread use of depth sensors has enabled researchers to work with valuable features obtained from depth maps. Although some studies in this field [1,2,5,20,21,24] use only depth information to recognize actions, others [4,7] combine features extracted from RGB video data and depth sensors. The valuable features of depth data and the low cost and real-time performance of depth sensors keep this field attractive.

In this paper, an approach to recognizing basic human actions using depth information is presented. Features derived from both a trained skeletal model and raw depth data are combined to increase recognition accuracy. A joint skeleton model is derived from depth maps using Shotton et al.'s method [3]. The first set of features is calculated from this skeletal model: angles between some important joints and displacements of these joints. Histograms of the joint angle values and total displacements in the x, y, and z dimensions of 3D space are calculated. Additionally, HOG descriptors of gradient images (GI) and depth history images (DHI) are acquired from raw depth data. A GI is obtained by summing up gradient images in the x, y, and z dimensions; a DHI is calculated by summing up all depth images of an entire depth sequence. These features are studied in various combinations and classified with the random forest (RF) algorithm [19]. The developed models are tested on the MSR-Action3D [5] dataset. The best results are obtained with the joint angle and displacement features from the skeletal model together with the DHI features from the depth data. The test results show that the proposed model's classification accuracy reaches 95.4% and 98.0% in Test-1 and Test-2, and 89.4±6.86% in the cross-subject test, which indicates that the model performs well in recognizing basic actions.

The outline of the paper is as follows. Section 2 presents related research on action recognition using depth information. Section 3 presents the feature set and the details of the proposed method. Section 4 presents the results of the experiments. The results of the study are discussed in the last section.

II. RELATED WORK

Sung et al. [1] use body posture features, hand position features, and motion information obtained from the PrimeSense API [6] joint skeleton model to recognize complex actions. Body posture features are angles between the body and the hands and between the body and the feet. Hand position features are the maximum and minimum positions of the right and left hands in the last 60 frames. Motion features are the changes of eleven joints over time. Actions are recognized with a two-layer hidden Markov model (HMM) [14]: poses are recognized in the first layer, and complex activities are classified in the upper-layer HMM.

Li et al. [5] use projections of depth maps to recognize human activities. After computing projections of the depth map onto the three Cartesian planes, edge detection is applied to the projections. Then 3D edge points are selected for pose estimation. An action graph is constructed to recognize actions: a pose sequence for all actions and a transition matrix defining the probability of transitions between poses are obtained. The recognized poses are clustered into prominent poses, and actions are combinations of these prominent poses.

Xia et al. [20] propose a view-invariant method to recognize human activities. Their approach maps 3D joint locations from the Kinect sensor to a spherical coordinate system. Histograms of joint locations in this spherical coordinate system provide a view-invariant posture representation.

In another study, Raptis et al. [22] use the Kinect sensor to recognize dance gestures. Shotton et al.'s joint skeleton model [3] is used to extract features, and (u, r, t) components are obtained through principal component analysis. Wrists and knees are considered first-degree joints; hands and feet are second-degree joints. The angle between the left shoulder-left wrist segment and the u-r components is computed as a feature, and similarly the angle between the left hand-left wrist segment and the r component. In the recognition step, dynamic time warping [23] is used to recognize the dance figures.

Wang et al. [18] use features acquired from both the joint skeleton and the depth maps. Relative positions of joints are used as skeletal features. Additionally, local occupancy patterns (LOP) are calculated from depth map points around the skeletal joint coordinates: the local regions around joints are divided into grids, and the points in the grids are passed through a sigmoid normalization function. These features are represented with a Fourier Temporal Pyramid (FTP) to obtain equal-size feature sets in each frame. Finally, actionlets, selected subsets of joints that describe an activity, are obtained for action recognition: after training a classifier for each joint, the most descriptive joint set for all actions is found.

Wang et al. [16] also propose random occupancy patterns (ROP) to deal with occlusion and noise while recognizing actions. ROP-based features are acquired from randomly sampled 4D subvolumes (I(x,y,z,t)). The sum of the pixels in a subvolume is computed and passed through a sigmoid normalization function. After sampling a discriminative subset of subvolumes, Elastic-Net [11] regularization is applied to select a sparse subset of discriminative features. Finally, the features are classified with an SVM classifier.

In Yang et al.'s method [21], the depth maps of an entire sequence are projected onto orthogonal planes and accumulated to generate depth motion maps (DMM). HOG features of each projection are computed separately. Finally, a linear SVM classifier is trained with the HOG features, and the model is tested on MSR-Action3D.

Yang et al. [24] propose another method to recognize human actions using the 3D skeletal joint model obtained from RGB-D cameras. Their feature set is based on differences of skeletal joints. Pairwise differences between a joint and the others within a frame are calculated first. Then differences between a frame and its preceding frame are computed along the action sequence. Finally, the differences between each frame and the initial frame are computed. After these differences are computed, normalization is applied to set the range of the data, and PCA [25] is applied to reduce redundancy. Actions in the MSR-Action3D dataset are classified with the Naive-Bayes-Nearest-Neighbor method [9].

Oreifej et al. [12] propose descriptors acquired from 4D volumes for activity recognition. First, a 4D volume (x, y, t, z) is constructed from depth sequences, and the normals of the 4D surface are computed; the authors argue that normal orientation captures more information than gradient orientation. After the surface normals are computed using finite gray-value differences over all voxels of the depth sequences, a histogram of oriented 4D surface normals (HON4D) is constructed. Non-uniform quantization is applied to the bins of HON4D to find optimal histogram bins and avoid overfitting, and additional projectors derived from the vertex vectors of the 4D space make the 4D normals more discriminative. Finally, an SVM classifier is trained for recognition.

Doliotis et al. [4] propose a method to track hand gestures by combining RGB values and depth information. In this method, the hand position is detected first by combining skin and motion detection techniques. For skin detection, a histogram of RGB values is used: the human skin color range is predefined, and candidate regions are selected with this information. Hand motion is then detected from the difference between consecutive frames. The regions that give the maximum scores for motion and skin detection are taken as hand patches, and the subwindow that includes such a patch is considered a hand candidate. In the next step, depth information is used: connected component analysis is applied to the depth image, and the connected component with the minimum mean value among the five largest components is chosen. An intersection between the chosen component and one of the subwindows from the previous step is searched for; the intersected subwindow determines the hand position. Recognition is accomplished with dynamic time warping [23] using the 2D position information of the hand.

Popa et al. [7] combine features from RGB video data and depth data to recognize the activities of customers in a shop, such as searching for products, examining, buying, and putting items into a basket. First, background subtraction is applied to the RGB video data to obtain human silhouettes; the same process is also applied to the depth information. A motion matrix is constructed from the differences of consecutive silhouettes: changing points are marked with 1, non-changing points with 0, and starting positions with -1. Moment values [8] of this matrix are used as features computed from the RGB camera data. Additionally, a similar motion-matrix approach is applied to the depth data, but the convergence and divergence of a person obtained from the depth information (changes in distance from the depth sensor) are used instead of silhouettes. Using all of these features, the recognition stage is implemented with support vector machines, the k-nearest neighbor algorithm, a linear discriminant classifier, and a hidden Markov model.

III. METHOD

In this study, actions are modeled with 3D joint skeleton features and HOG descriptors obtained from depth maps. Complex activities are not considered; the study concentrates on recognizing basic actions, e.g., walk, wave, sit, pick up something, turn around, punch, and kick. Tracking human body movements and generating a skeleton model are generally the first steps of human activity recognition methods [2,3]. As a first step of this work, joint and body tracking is done using the joint skeleton model obtained from Shotton et al.'s method [3]. In this skeleton construction method, depth information from the Kinect sensor and the positions of all pixels are used as features to train a random forest classifier; the training set includes 300,000 images. The first step in determining joint positions is the recognition of body parts. 2D positions are transformed into 3D positions using the depth information, and joint positions are then found with the mean-shift algorithm [13]. This approach estimates the coordinates of human joints in real time and produces a skeleton model in 3D coordinate space. The skeleton model, which provides the 3D coordinates of 20 body joints, is shown in Fig. 1.

Three types of features are extracted from the skeleton model and the raw depth images. Joint angle histograms and displacements of joints are the features obtained from the skeleton model. HOG descriptors of gradient images and depth history images are obtained from the depth data. After extracting these features, our approach classifies actions with the random forest (RF) algorithm. In the following sections, these feature sets and the classification of actions are explained.
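For concreteness, the sketches in the following subsections assume that a capture is available as a NumPy array of shape (num_frames, 20, 3) holding the 3D joint coordinates of each frame. The joint indices and the loader below are hypothetical placeholders, not part of the original implementation:

```python
import numpy as np

# Hypothetical joint indices for the 20-joint skeleton; the actual index
# assignment depends on the SDK/dataset files used to extract the skeleton.
HIP_CENTER, SHOULDER_LEFT, ELBOW_LEFT, WRIST_LEFT = 0, 4, 5, 6

def load_skeleton_sequence(path):
    """Load one capture as an array of shape (num_frames, 20, 3).

    Placeholder loader: a real implementation would parse the skeleton
    files of the dataset in use (e.g., the MSR-Action3D skeleton files).
    """
    data = np.loadtxt(path)           # assumed: one row per frame, 20 joints x 3 coordinates
    return data.reshape(-1, 20, 3)
```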

Fig. 1. Calculated joint angles on the 20-joint skeleton model

A. Joint Angles

As the first type of features, we examined the 3D angle values between all joints. In our experiments, however, the most useful angles were the shoulder-elbow-arm and crotch-knee-foot angles; other angles were not very useful for recognizing actions. When calculating joint angles, the 3D coordinates of the elbow, shoulder, hand wrist, knee, and foot wrist joints are considered. Fig. 1 shows the joint angles used in the skeleton model. The angle values in Fig. 1 are calculated in 3D coordinate space for all frames of the captured sequences. All observed joint angle values are then put into histograms, which are used as feature sets in classification.

Pairs of 3D coordinates of consecutive joints are taken as line segments to calculate joint angles. A line segment is the part of a line between two points P1 and P2 in 3D coordinate space. If it is a directed line segment, it corresponds to the vector P1P2, whose positive direction is from P1 to P2. Since an angle can be defined by two line segments, two joint pairs are used to define each joint angle.

Before finding the angle between two line segments, the direction cosines of the segments are calculated. Let P1(x1, y1, z1) and P2(x2, y2, z2) be two points in 3D coordinate space that define the line segment P1P2. The direction cosines of P1P2 are calculated as follows:

cos α = (x2 − x1) / d,   cos β = (y2 − y1) / d,   cos γ = (z2 − z1) / d        (1)

d = √((x2 − x1)² + (y2 − y1)² + (z2 − z1)²)        (2)

In equation (1), α, β, and γ are the angles of the segment with the x, y, and z axes, respectively, and d in equation (2) is the Euclidean distance between the two points. An unknown cosine value can be found with equation (3) if the other two values are given:

cos²α + cos²β + cos²γ = 1        (3)

Considering the unit vector of a line segment, the cosine values of the segment can also be written as λ, µ, and ν, which are the projections of the unit vector onto the x, y, and z axes.

The angle between two line segments corresponds to the angle between the unit vectors of these segments, even if the segments do not intersect. Let (λ1, µ1, ν1) and (λ2, µ2, ν2) represent two line segments and θ be the angle between them. θ can be defined as in equation (4):

cos θ = λ1 λ2 + µ1 µ2 + ν1 ν2        (4)

Hence, the angle between two line segments with direction angles (α1, β1, γ1) and (α2, β2, γ2) can be given as below [17]:

cos θ = cos α1 cos α2 + cos β1 cos β2 + cos γ1 cos γ2        (5)

To calculate the angles given in Fig. 1, the coordinates of joint pairs are used. Each angle is defined by the line segments P1P2 and P2P3, which share the common point P2. Using equation (5), all joint angles in Fig. 1 are calculated for every frame of the activity. After computing the angles for all frames, a histogram is constructed for each joint angle, so the joint angle values observed during the whole activity are stored in these histograms. After testing various numbers of histogram bins, the maximum classification accuracy was reached with 10 bins; thus, the number of histogram bins is set to 10 in the final results.
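As a minimal sketch of equations (1)-(5) under the array layout assumed earlier, the following illustrative Python functions compute the angle at a middle joint from direction cosines and build a 10-bin histogram of that angle over a sequence; the function names are our own, not from the paper:

```python
import numpy as np

def direction_cosines(p_start, p_end):
    """Direction cosines (cos a, cos b, cos g) of the segment p_start -> p_end, eqs. (1)-(2)."""
    v = np.asarray(p_end, dtype=float) - np.asarray(p_start, dtype=float)
    return v / np.linalg.norm(v)                     # divide by the Euclidean distance d

def joint_angle(p1, p2, p3):
    """Angle (radians) at joint p2 between segments p2->p1 and p2->p3, via eq. (5)."""
    c1 = direction_cosines(p2, p1)
    c2 = direction_cosines(p2, p3)
    cos_theta = np.clip(np.dot(c1, c2), -1.0, 1.0)   # eq. (5): product of direction cosines
    return np.arccos(cos_theta)

def angle_histogram(seq, i1, i2, i3, bins=10):
    """10-bin histogram of one joint angle over all frames of a (num_frames, 20, 3) sequence."""
    angles = [joint_angle(frame[i1], frame[i2], frame[i3]) for frame in seq]
    hist, _ = np.histogram(angles, bins=bins, range=(0.0, np.pi))
    return hist
```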

Histograms of all joint angle values are concatenated into a one-dimensional array to obtain a compact representation of the angles. The histograms are placed in the array in the order of the angles in Fig. 1. This order is important, since a change in the position of a joint angle affects the success of the trained models.

B. Displacements of Joints

Joint angles may have similar distributions in some activities, such as checking a watch and crossing arms. Therefore, joint angle information alone may not be enough to distinguish some actions. How much each joint moves along the x, y, and z axes can be important for such actions.

Fig. 2. Displacement of the left wrist joint in the hand-waving action

In some actions, the subject moves and changes location during the capture. Therefore, a reference joint is selected when calculating displacement values, and each joint's coordinates are computed relative to this joint. Displacements of joints are calculated from these relative coordinates, so even if the subject moves, the displacements are independent of the subject's location. The central hip joint is selected as the reference point. Relative coordinates are calculated for all joints except the hand and foot joints: in Shotton et al.'s method, the coordinates of the hand and foot joints are frequently detected erroneously, whereas the wrist and ankle joints are detected more robustly. Due to this noise, only the displacements of the wrist and ankle joints are used to capture hand and foot movements. Relative coordinates are obtained by subtracting the coordinate values of the central hip joint from the x-y-z coordinates of each joint. The Euclidean distance each joint travels between consecutive frames is computed in the x, y, and z dimensions from these relative coordinates, and the total displacement of a joint in each dimension is the sum of these frame-to-frame distances. Displacements in the x, y, and z dimensions are kept separate to distinguish actions that differ by dimension. For example, in the hand-waving and punching actions, both the wrist and elbow joints move, but in different dimensions; considering displacements per dimension provides more information to distinguish these actions. In addition to the per-dimension displacements, the total displacement of each joint in 3D coordinate space is used as another feature. Fig. 2 shows the displacement of the left wrist joint across consecutive frames for the hand-waving action.
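A minimal sketch of these displacement features, assuming the (num_frames, 20, 3) layout introduced earlier and that the central hip joint sits at index 0 (an assumption about the joint ordering):

```python
import numpy as np

def displacement_features(seq, ref_joint=0):
    """Displacement features from a (num_frames, 20, 3) skeleton sequence.

    Coordinates are taken relative to a reference joint (assumed here to be
    the central hip joint at index 0). Frame-to-frame motion is then summed
    per axis and as a 3D path length for every joint.
    """
    rel = seq - seq[:, ref_joint:ref_joint + 1, :]      # coordinates relative to the hip
    step = np.diff(rel, axis=0)                         # motion between consecutive frames
    per_axis = np.abs(step).sum(axis=0)                 # (20, 3): total displacement in x, y, z
    total = np.linalg.norm(step, axis=2).sum(axis=0)    # (20,): total 3D displacement per joint
    return np.concatenate([per_axis.ravel(), total])
```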

C. Gradient Image

In addition to the features from the skeletal joint model, we investigated features from the depth image sequences. As an alternative feature, we studied gradient features of the depth information and computed gradient images (GI) of the depth image sequences. For each depth sequence, gradients are computed in the x, y, and z dimensions as shown in equation (6), where D is the three-dimensional depth data containing the N frames of the sequence. The gradient computation yields three components that capture changes in the different directions. Each component is summed over the sequence as shown in equations (7)-(9), and the summed components are then added together to obtain the final GI (equation 10). Fig. 3 illustrates the calculation of the GI for the hand-wave action.

[Fx, Fy, Fz] = ∇D = (∂D/∂x) î + (∂D/∂y) ĵ + (∂D/∂z) k̂        (6)

Fx,total = Σ_{i=1..N} Fx,i        (7)

Fy,total = Σ_{i=1..N} Fy,i        (8)

Fz,total = Σ_{i=1..N} Fz,i        (9)

GI = Fx,total + Fy,total + Fz,total        (10)

Fig. 3. Gradient image (GI) of the hand-wave action and its HOG features
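The GI computation of equations (6)-(10) could be sketched as follows, assuming the depth sequence is stored as an (N, H, W) array so that the three axes of D are the frame index and the image rows and columns; this is our reading of the volume layout, not the authors' code:

```python
import numpy as np

def gradient_image(depth_seq):
    """Gradient image (GI) of a depth sequence, following eqs. (6)-(10).

    depth_seq is assumed to be a float array of shape (N, H, W): N depth
    frames of H x W pixels. np.gradient returns the partial derivative of
    the volume along each axis (eq. 6); each component is summed over the
    sequence (eqs. 7-9), and the sums are added together (eq. 10).
    """
    d = depth_seq.astype(float)
    f_frame, f_row, f_col = np.gradient(d)   # derivatives along the three axes of the volume
    return f_frame.sum(axis=0) + f_row.sum(axis=0) + f_col.sum(axis=0)
```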

To characterize shape and local appearance, histogram of oriented gradients (HOG) descriptors [10] are computed from the GI and used in action classification. HOG computation uses 32x32 non-overlapping cells and 8 gradient orientations. HOG features usually account for changes in contrast and illumination, but in our case they account for changes in depth. The first step of HOG feature extraction is gradient computation: gradients in the vertical and horizontal directions are calculated with filter kernels. Orientation binning is then performed, which produces a histogram per cell. In the next step, block descriptors are extracted by grouping the cell histograms. The final HOG descriptor is a vector of the components of the normalized cell histograms.
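A hedged sketch of this HOG step using scikit-image's hog function with the reported cell size and orientation count; the original work does not state which HOG implementation was used:

```python
from skimage.feature import hog

def hog_descriptor(image, cell_size=32, orientations=8):
    """HOG descriptor of a GI or DHI with non-overlapping cells.

    The paper reports 32x32 cells and 8 orientation bins; using
    cells_per_block=(1, 1) keeps the cells non-overlapping.
    """
    return hog(image.astype(float),
               orientations=orientations,
               pixels_per_cell=(cell_size, cell_size),
               cells_per_block=(1, 1),
               feature_vector=True)
```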

D. Depth History Image

Inspired by motion history images [15] for 2D videos, we also construct depth history images (DHI) from depth image sequences. A DHI is a static image template whose pixel intensities are a function of frequency and depth information. It is constructed as a weighted sum of all depth frames in a sequence, as expressed in equation (11), where Di is the i-th depth image in the sequence and N is the total number of frames:

DHI = Σ_{i=1..N} (Di · i)        (11)

More recent depth images contribute higher intensity values to the final DHI, so the DHI not only stores information about where the action happens but also about which phase of the action is more recent. A sample DHI calculation for the hand-wave action is shown in Fig. 4.

Fig. 4. DHI of the hand-wave action (weighted sum of depth maps → DHI → HOG features)
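Equation (11) could be implemented as in the following sketch, again assuming an (N, H, W) depth array:

```python
import numpy as np

def depth_history_image(depth_seq):
    """Depth history image (DHI) of an (N, H, W) depth sequence, eq. (11).

    Each depth frame is weighted by its index, so later frames contribute
    higher intensities to the accumulated image.
    """
    n = depth_seq.shape[0]
    weights = np.arange(1, n + 1, dtype=float)                 # i = 1, ..., N
    return np.tensordot(weights, depth_seq.astype(float), axes=1)
```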

E. Action Classification

After obtaining all features, classification of the basic actions is performed. In the first experiments, the histograms of joint angles, the displacements, and the GI and DHI features are studied separately; these features are then combined in further experiments to observe their usefulness in recognition. The random forest (RF) algorithm is used for model training and testing. The Matlab environment is used to extract feature values from the capture samples, and the Orange tool is used to classify actions. For the RF algorithm, 100 trees are trained with a minimum depth of 8; on average, the best results are obtained with a depth of nearly 50. The models are tested on the MSR-Action3D [5] dataset, which contains 20 different types of actions performed by 10 subjects in 567 capture samples.
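The classification stage was run in the Orange tool; purely as an illustration of an equivalent setup, the sketch below trains a 100-tree random forest with scikit-learn on a concatenated feature matrix (the data-loading details are assumed):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def train_and_evaluate(features, labels, train_fraction=0.67, seed=0):
    """Illustrative random-forest setup (not the authors' Orange pipeline).

    features: (num_captures, num_features) matrix built by concatenating
    joint-angle histograms, displacement features, and HOG descriptors;
    labels: the action class of each capture. 100 trees are used, matching
    the tree count reported above.
    """
    x_tr, x_te, y_tr, y_te = train_test_split(
        features, labels, train_size=train_fraction,
        random_state=seed, stratify=labels)
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    clf.fit(x_tr, y_tr)
    return clf.score(x_te, y_te)       # classification accuracy on the held-out captures
```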

Table I. Performance of our method compared with other approaches (classification accuracy, %)

Method                               Test-1   Test-2   Cross-Subject Test
Bag of 3D Points [5]                  91.6     94.2     74.7*
HOJ3D [20]                            96.2     97.2     79.0*
EigenJoints [24]                      95.8     97.8     83.3*
Yang et al. [21]                      N/A      N/A      85.5*
Wang et al. [18]                      N/A      N/A      88.2*
HON4D [12]                            N/A      N/A      82.15±4.18
Joint Angles + Displacements          95.3     97.7     88.2±7.39
Joint Angles + Displacements + GI     93.1     97.7     89.0±7.06
Joint Angles + Displacements + DHI    95.4     98.0     89.4±6.86

* Accuracy of a single training/test subject split (see Section IV).

IV. EXPERIMENTS

Classification accuracy results for the experiments with various feature sets, together with some previous works, are shown in Table I. In Test-1, 33% of the dataset is used for training and the remaining 67% for testing; in Test-2, 67% is used for training and the remaining 33% for testing. In Test-1 and Test-2, each experiment is executed 10 times, and the averages of these runs are presented as the final classification accuracy. Table I also gives the classification accuracy for the cross-subject test on MSR-Action3D. In our cross-subject test, all C(10,5) combinations of subjects are tested, and the mean and standard deviation of the accuracy values are reported. The first five approaches in Table I do not consider all combinations and use only one subset of subjects for training; the results of these approaches (marked with a star) therefore represent the classification accuracy of a single subject-subset combination.

Table I lists the classification accuracies of three different models derived from our features. The first model uses only the histograms of joint angles and the displacements of joints. As can be seen from Table I, this model gives classification accuracy close to or better than some of the state-of-the-art methods. The classification accuracy of the cross-subject test is low compared to Test-1 and Test-2, because different subjects perform actions with varying characteristics (speed, body shape, etc.); actions of different subjects may thus have varying feature values, which makes classification harder. Nevertheless, the first model still produces 88.2% mean accuracy in the cross-subject test. To improve classification performance, we added HOG descriptors of GIs to the feature set in the second model. The GI features improved the accuracy of the cross-subject test, but a decrease in Test-1 was observed. In the last model, adding HOG descriptors of DHIs to the feature set improved the performance in all test cases. As can be observed from Table I, the third model produces the best results among the methods we know of in the literature.

As shown in Table I, most studies use one subset of subjects for the cross-subject test. With this approach, the best method we know of, Wang et al. [18], achieves 88.2% classification accuracy, which appears similar to our approach. However, different subject subsets may produce different classification accuracies. Since we test all subject-subset combinations in the cross-subject test, our approach accounts for the variance of the training sets more comprehensively, and we obtain 89.4±6.86% mean accuracy with this protocol. Only HON4D uses the same protocol and obtains 82.15±4.18% accuracy, which is lower than our results. These results show that our approach provides robust performance independent of the testing set. We also studied the GI and DHI features alone: when only the HOG descriptors of GIs are used as features, 45.7% accuracy is obtained in the cross-subject test, and with the HOG descriptors of DHIs the accuracy is 50.2%. Thus, the HOG descriptors of GI and DHI are not discriminative enough on their own to recognize actions. These experiments showed that our most useful features are the histograms of joint angles and the displacements of joints obtained from the skeletal model.
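As an illustration of the cross-subject protocol described above (all C(10,5) subject splits, mean and standard deviation of accuracy), a hedged scikit-learn sketch could look like this; the array names and subject encoding are assumptions:

```python
import numpy as np
from itertools import combinations
from sklearn.ensemble import RandomForestClassifier

def cross_subject_accuracy(features, labels, subjects):
    """Mean and standard deviation of accuracy over all C(10,5) subject splits.

    features: (num_captures, num_features); labels: action class per capture;
    subjects: subject id per capture. Illustrative protocol sketch only.
    """
    subject_ids = np.unique(subjects)
    scores = []
    for train_subjects in combinations(subject_ids, len(subject_ids) // 2):
        train_mask = np.isin(subjects, train_subjects)      # train on half of the subjects
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        clf.fit(features[train_mask], labels[train_mask])
        scores.append(clf.score(features[~train_mask], labels[~train_mask]))
    return np.mean(scores), np.std(scores)
```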

V. CONCLUSION

We presented an approach to recognizing human actions using various features obtained from depth maps. While the angle and displacement features are derived from a trained skeleton model, the HOG features are derived directly from the depth maps. We observed that the histograms of joint angles and the displacements of joints obtained from the trained skeleton model are the most useful features in classification. The HOG descriptors of gradient images and depth history images derived directly from depth maps were not as useful as the skeleton model features; however, when the HOG features from depth maps are combined with the skeleton model features, classification accuracy improves. In particular, HOG features from depth history images together with the histograms of joint angles and displacements of joints produce one of the best classification accuracies in the literature. Additionally, the computational requirements of our approach are very low compared to more complex methods, which makes it suitable for future real-time applications. In future work, we plan to explore more features from depth and RGB images to increase classification accuracy. Recognizing actions while they are happening is another direction of our future work.

REFERENCES

[1] J. Sung, C. Ponce, B. Selman, and A. Saxena, "Human Activity Detection from RGBD Images," AAAI Workshop on Pattern, Activity and Intent Recognition, 2011.
[2] L. Xia, C.-C. Chen, and J. K. Aggarwal, "Human detection using depth information by Kinect," in Computer Vision and Pattern Recognition Workshops (CVPRW), 2011 IEEE Computer Society Conference on, 2011, pp. 15–22.
[3] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake, "Real-time human pose recognition in parts from single depth images," CVPR 2011, pp. 1297–1304, 2011.
[4] P. Doliotis, A. Stefan, C. McMurrough, C. Eckhard, and V. Athitsos, "Comparing Gesture Recognition Using Color and Depth Information," Conference on Pervasive Technologies Related to Assistive Environments (PETRA), 2011.
[5] W. Li, Z. Zhang, and Z. Liu, "Action recognition based on a bag of 3D points," in Computer Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer Society Conference on, 2010.
[6] Internet: www.primesense.org. Last accessed: 14.06.2013.
[7] M. Popa, A. K. Koc, L. J. M. Rothkrantz, C. Shan, and P. Wiggers, "Kinect Sensing of Shopping related Actions," in Constructing Ambient Intelligence, vol. 277, 2012, pp. 91–100.
[8] S. P. Prismall, "Object reconstruction by moments extended to moving sequences," PhD thesis, Department of Electronics and Computer Science, University of Southampton, 2005.
[9] R. Behmo, P. Marcombes, A. Dalalyan, and V. Prinet, "Towards Optimal Naive Bayes Nearest Neighbor," in Proceedings of the European Conference on Computer Vision, 2010, pp. 171–184.
[10] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, 2005.
[11] H. Zou and T. Hastie, "Regularization and variable selection via the elastic net," J. R. Stat. Soc. Ser. B, vol. 67, pp. 301–320, 2005.
[12] O. Oreifej and Z. Liu, "HON4D: Histogram of oriented 4D normals for activity recognition from depth sequences," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
[13] D. Comaniciu and P. Meer, "Mean Shift: A Robust Approach Toward Feature Space Analysis," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, pp. 603–619, 2002.
[14] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, vol. 77, pp. 257–286, 1989.
[15] J. W. Davis, "Hierarchical motion history images for recognizing human motion," in Proc. IEEE Workshop on Detection and Recognition of Events in Video, 2001.
[16] J. Wang, Z. Liu, J. Chorowski, Z. Chen, and Y. Wu, "Robust 3D action recognition with random occupancy patterns," in Computer Vision – ECCV 2012, pp. 872–885, Springer Berlin Heidelberg.
[17] M. Artin, Algebra, Prentice Hall, 1991, ISBN 978-0-89871-510-1.
[18] J. Wang, Z. Liu, Y. Wu, and J. Yuan, "Mining actionlet ensemble for action recognition with depth cameras," in 2012 IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 1290–1297.
[19] L. Breiman, "Random forests," Mach. Learn., vol. 45, pp. 5–32, 2001.
[20] L. Xia, C. Chen, and J. Aggarwal, "View invariant human action recognition using histograms of 3D joints," in Computer Vision and Pattern Recognition Workshops (CVPRW), 2012 IEEE Computer Society Conference on, 2012, pp. 20–27.
[21] X. Yang, C. Zhang, and Y. Tian, "Recognizing actions using depth motion maps-based histograms of oriented gradients," in Proceedings of the 20th ACM International Conference on Multimedia, 2012, pp. 1057–1060.
[22] M. Raptis, D. Kirovski, and H. Hoppe, "Real-time classification of dance gestures from skeleton animation," in Proc. 2011 ACM SIGGRAPH/Eurographics Symposium on Computer Animation (SCA '11), vol. 1, p. 147, 2011.
[23] L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition, vol. 103, 1993, p. 507.
[24] X. Yang and Y. Tian, "EigenJoints-based action recognition using Naive-Bayes-Nearest-Neighbor," in Computer Vision and Pattern Recognition Workshops (CVPRW), 2012 IEEE Computer Society Conference on, 2012, pp. 14–19.
[25] I. T. Jolliffe, Principal Component Analysis, 2nd ed., vol. 98, 2002, p. 487.