Action Recognition using Multi-layer Depth Motion Maps and Sparse Dictionary Learning

Chengwu Liang, Enqing Chen, Lin Qi, Ling Guan

School of Information Engineering, Zhengzhou University, Zhengzhou, China
[email protected], [email protected], [email protected]

Department of Electrical and Computer Engineering, Ryerson University, Toronto, Canada
[email protected]

Abstract—In this paper, we propose a new spatio-temporal feature based method for human action recognition using depth image sequences. First, Layered Depth Motion maps (LDM) are utilized to capture the temporal motion feature. Next, multi-scale HOG descriptors are computed on the LDM to characterize the structural information of actions. Then sparse coding is applied for feature representation. A Sparse Fisher Discriminative Dictionary Learning (SDDL) model and its corresponding classification scheme are also introduced. In the SDDL model, the sub-dictionary is updated class by class, leading to class-specific compact discriminative dictionaries. The proposed method is evaluated on the public MSR Action3D dataset and demonstrates strong performance, especially in the cross subject test.

I. INTRODUCTION

As one of the active research topics in computer vision, human action recognition has been widely studied. It is a key issue in natural human-computer interaction, virtual reality, video surveillance, video retrieval, gaming and smart assistive living [1][2][3][4][5]. It is a challenging problem due to difficulties such as occlusions, camera view changes, cluttered backgrounds, and large variations in human motion and appearance. Successful action recognition is characterized by three aspects: effective feature extraction, appropriate feature representation or description, and a suitable classifier. Traditional action recognition methods are based on RGB videos. With the release of RGB-depth sensors (e.g., Microsoft Kinect), depth images and RGB videos can be captured simultaneously. Depth images provide 3D structural information, which gives a new view for human action recognition. Moreover, the positions of human body skeleton joints can be extracted from a single depth image [6]. Based on the data source, existing approaches are divided into two categories: traditional color RGB video-based and the more recent depth image-based. Many RGB video-based methods have been studied, such as the motion history image (MHI) method [7], space-time interest points (STIP) [8], bag of words (BOW) [9], trajectory-based methods [10] and the hidden conditional random field (HCRF) [11]. However, a key limitation of BOW is that it is not able to capture adequate spatial and temporal information due to the local nature of the model. Trajectory-based methods [10] are able to distinguish actions by tracking human joint positions but are computationally expensive. Although significant

Fig. 1. Examples of the depth images for actions of “High wave” (top) and “Draw X” (below).

progress has been made by RGB video-based methods, human action recognition still faces challenges such as occlusions, large intra-class variations and illumination changes. Depth images, which provide the 3D information of the scene, may mitigate these challenges. Action recognition approaches based on depth image features have grown rapidly in recent years [3]. In this paper, we focus on human action recognition based on the original depth data. Based on depth images (Fig. 1), a new spatio-temporal image motion feature, i.e. the Sparse coding-based multi-Layer Depth Motion maps feature (ScLDM) with multi-scale HOG descriptors, is proposed to characterize the local spatial structure (shape) of an action and the local temporal change of human motion. By preserving the temporal local motion and spatial structure of the human action occurring in a video sequence, and utilizing the complementary nature of the two types of information, the proposed multi-layered depth motion feature and multi-scale structure feature effectively capture the primary characteristics of a depth action. In addition, human action recognition has to deal with the sparsity of high-dimensional data distributions and the overlap of feature subspaces, which may degrade recognition performance, especially for certain similar actions. Dictionary learning-based classification is one way to address this problem. In order to cope with the intersection of

different feature sets, and in contrast to conventional Orthogonal Matching Pursuit (OMP) methods and the K-SVD algorithm [12], SDDL is coupled with ScLDM for action recognition. By adding certain constraints and a discriminative term on the sparse coefficients in the Sparse Dictionary Learning (SDL) model, the sparse coefficients have small intra-class variance and large inter-class variance, leading to a more discriminative representation. With the introduction of the SDDL model to action recognition, the proposed feature representation method is effective for depth image-based human action recognition. The experimental results demonstrate that human action recognition performance is improved by the proposed method.

Fig. 2. Feature extraction from a depth action video: each depth frame is projected onto three planes (front view: x-y, side view: y-z, top view: x-z), multi-layered LDM features are computed for each view, and multi-scale HOG descriptors (s = 1, 2, 3) are extracted. The LDMv of actions (a) "Golf swing" and (b) "High throw".

II. RELATED WORKS

According to the features extracted for action recognition, current depth sensor-based methods are roughly divided into three categories: original depth features-based, skeleton features-based and fusion of different features-based. For feature extraction based on original depth data, the first work on action recognition is [13]. They employed a bag of 3D points from the original 3D depth map and Gaussian mixture models to describe a salient posture, and proposed to model the dynamics of actions by an action graph. In this method, each depth map is projected onto three orthogonal Cartesian planes to select the representative 3D points. Ni et al. [14] proposed a Three-Dimensional Motion History Images (3DMHIs) approach and a Depth-Layered Multi-Channel STIPs (DLMC-STIPs) framework. However, this framework uses depth images only as auxiliary information for extracting STIPs in the RGB channel [3]. To address the problem of noisy and missing values in depth videos, Xia et al. [15] used noise suppression functions to extract Depth STIPs (DSTIP) and proposed a self-similarity depth cuboid feature (DCSF) to boost the performance. Based on depth maps, 3D local occupancy features are used in a 3D spatio-temporal volume [16][17], in which data points are projected to the 4D (x, y, z, t) space, for activity recognition. Yang et al. [18] employed the Depth Motion Maps (DMM) feature and the Histogram of Oriented Gradients (HOG) descriptor to characterize body shape and motion information. This method has good performance and is computationally simple. Chen et al. [19] utilized DMM and a collaborative representation classifier to achieve real-time action recognition. The histogram of oriented 4D normals (HON4D) feature [20] was used for activity recognition from depth sequences. The 3D motion trail model with pyramid HOG (3DMTM-PHOG) [21] was proposed to represent actions in depth maps. For skeleton features-based methods, some researchers use a skeleton tracker [6] to construct various skeleton joint features. The EigenJoints feature descriptor [22], based on the differences of skeleton joints, and the Naive Bayes Nearest Neighbour (NBNN) classifier were used for action recognition. Xia et al. [23] proposed to use histograms of 3D joint locations (HOJ3D) as a representation of static postures, and applied discrete hidden Markov models (HMMs) for action recognition. However, a limitation

of these methods is that the 3D joint positions extracted by [6] are not optimal due to challenges caused by occlusions or cluttered backgrounds. By fusing spatio-temporal features from colour images and 3D skeleton joint features from depth maps, [24] achieved good recognition results. In [25], three STIP-based features and six local descriptors are evaluated for depth-based action recognition. They proposed two schemes to refine STIP features and a fusion approach to evaluate the performance of combining STIP with skeleton features. However, these methods generate a considerable amount of data and have high computational complexity.

III. SPARSE CODING-BASED LOCAL DEPTH MOTION FEATURES (ScLDM)

In [18][19][26], based on depth images, Depth Motion Maps (DMM) are used to characterize the 3D local motion and shape information. Although it is computationally simple and has demonstrated good performance, the DMM [18][19] calculated from the entire sequence may not capture transitional motion cues: previous motion history may get overwritten when a more recent action occurs at the same location. This observation motivates us to divide a depth action sequence into several temporal layers and calculate the individual depth motion maps within each layer to better capture the detailed temporal local motion cues. By introducing ScLDM both in temporal multi-layer and spatial multi-scale form, the procedure of obtaining the depth motion feature and its representation becomes more effective. It has three components: multi-layer local temporal depth motion map feature extraction (LDM), multi-scale HOG descriptors on the LDM, and sparse coding-based feature representation.

A. Multi-layer Depth Motion Maps Feature

A depth action sequence contains 3D depth information. We first project each 3D depth frame onto three orthogonal 2D planes [13], as shown in Fig. 2. Each plane is a view, corresponding to a projected map, denoted by $DM_v$, where v ∈ {front, side, top}. LDM features are extracted from the three views to characterize the depth motion of an action. For each projected map, its motion history map energy $LDM_v^L$ is

obtained by computing the difference between two maps at a temporal interval L. Each fixed temporal interval is a layer. The binary map of motion energy is then obtained, indicating the motion regions, i.e. where movement occurs in each temporal interval; it provides a clue to the action category being performed. At each layer, we stack the motion energy through the entire video sequence to generate the local depth motion map feature for each projection view, then concatenate the three views to form the $LDM^L$ feature:
\[
LDM_v^L = \sum_{i=a}^{\max - b} \left( \left| DM_v^{\,i+L} - DM_v^{\,i} \right| \ge \varepsilon \right), \qquad
LDM^L = \left[ LDM_{front}^L,\; LDM_{side}^L,\; LDM_{top}^L \right]^T
\tag{1}
\]

where i is the frame index, max is the number of video frames, a and b are the numbers of frames trimmed from the beginning and the end of the sequence respectively (the summation runs from frame a to frame max − b), L is the temporal interval (in frames), and each fixed value of L defines a layer. When L = 1, $LDM^L$ is equal to the DMM described in [18]. ε is the noise threshold and N is the number of layers. As noted in [19], at the beginning and the end of each depth video sequence the subjects are mostly at a standstill with only small body movements, which do not contribute to the motion characteristics; the first a frames and the last b frames are therefore removed. The smaller L is, the more detailed the depth motion characteristics captured between frames. Different depth action video samples have different durations. For a short video clip, in each projection view, the difference between frames with a smaller L captures more motion information of an action; intuitively, for a long video clip, its motion information is well preserved by a larger L. We then concatenate the N layers of $LDM^L$ features to form the LDM feature:
\[
LDM = \left[ LDM^{L=L_1}, LDM^{L=L_2}, \ldots, LDM^{L=L_N} \right]^T
\tag{2}
\]

In our experiments, we set a = b = 5, the number of layers N = 3 with L1 = 1, L2 = 3, L3 = 5, and ε = 30 or 50.

B. Multi-scale Structure Feature (HOG Pyramid)

Although multi-layered LDM features are able to encode accurate motion information in the temporal dimension, they lack structural information about the action. Moreover, multi-layered LDM features are pixel-based, so the feature dimension can be fairly high. To better fit the sparse representation classification framework, we build a compact representation based on the HOG descriptor [27]. The edge and gradient information captured by the HOG descriptor effectively describes the action appearance and motion orientations. The structural information of LDM features is composed of different scales. To encode it, HOG descriptors at three scales of a spatial pyramid are computed on the LDM to characterize the multi-scale action shapes and motion orientations. Principal component analysis (PCA) or random projection (RP) is then employed to reduce the high dimensionality of the feature vectors. A sketch of the feature extraction pipeline is given below.
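The following Python sketch illustrates the extraction pipeline of Eqs. (1)-(2) and the HOG pyramid, using the parameters reported above (a = b = 5, layers L ∈ {1, 3, 5}, ε = 30). How the side and top projections are filled (storing the remaining coordinate as the pixel value), the 8-bit depth quantization, the HOG cell sizes and the per-scale downsampling are illustrative assumptions rather than the exact implementation used in the paper.

```python
# Minimal sketch of ScLDM feature extraction: multi-layer depth motion maps
# (Eq. (1)-(2)) followed by a multi-scale HOG pyramid. Projection filling,
# depth quantization, HOG cell sizes and resizing are illustrative assumptions.
import numpy as np
from skimage.feature import hog
from skimage.transform import resize

DEPTH_LEVELS = 256  # assume 8-bit quantized depth for the side/top projections

def project(depth):
    """Project one depth frame onto front (x-y), side (y-z) and top (x-z) planes.
    Each projected map stores the remaining coordinate as its pixel value."""
    h, w = depth.shape
    front = depth.astype(np.float32)                 # value = z
    side = np.zeros((h, DEPTH_LEVELS), np.float32)   # value = x
    top = np.zeros((DEPTH_LEVELS, w), np.float32)    # value = y
    ys, xs = np.nonzero(depth)
    zs = depth[ys, xs].astype(int)
    side[ys, zs] = xs
    top[zs, xs] = ys
    return front, side, top

def ldm_maps(video, layers=(1, 3, 5), a=5, b=5, eps=30):
    """Eq. (1): for every layer L and view v, accumulate thresholded differences
    |DM_v^{i+L} - DM_v^i| >= eps over the trimmed sequence."""
    proj = [project(f) for f in video[a:len(video) - b]]
    maps = []
    for L in layers:
        for v in range(3):                           # 0: front, 1: side, 2: top
            acc = np.zeros_like(proj[0][v])
            for i in range(len(proj) - L):           # stack binary motion energy
                acc += (np.abs(proj[i + L][v] - proj[i][v]) >= eps)
            maps.append(acc)
    return maps                                      # N layers x 3 motion maps

def hog_pyramid(m, scales=(1, 2, 3)):
    """Multi-scale HOG on one LDM map (spatial pyramid by downsampling)."""
    descs = [hog(resize(m, (m.shape[0] // s, m.shape[1] // s)),
                 orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
             for s in scales]
    return np.concatenate(descs)

def scldm_feature(video):
    """Concatenate the LDM maps (Eq. (2)) with their multi-scale HOG descriptors."""
    maps = ldm_maps(video)
    ldm = np.concatenate([m.ravel() for m in maps])
    hogs = np.concatenate([hog_pyramid(m) for m in maps])
    return np.concatenate([ldm, hogs])

# Example on a synthetic 40-frame 240x320 depth sequence
video = (np.random.rand(40, 240, 320) * 255).astype(np.uint8)
print(scldm_feature(video).shape)
```

In practice, the concatenated vector returned by this sketch would then be reduced with PCA or random projection before sparse coding, as described above.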

C. Sparse Coding-based Feature Representation

Having extracted both the temporal motion feature (the LDM feature) and the spatial shape feature (the multi-scale HOG feature), the next step is feature representation. Sparse coding plays an important role in human perception and has been used in pattern recognition and computer vision tasks. In [28], a predefined dictionary containing the training samples of all classes is used directly to code the query image. In this paper, sparse coding is applied to the extracted features to learn representative features. The whole feature representation process is referred to as Sparse coding of the LDM feature and multi-scale HOG feature (ScLDM). Assume that there are K action classes. Let $A = [A_1, A_2, \ldots, A_i, \ldots, A_K]$ be the extracted feature sets of all training samples, where $A_i$ is the feature set of training samples from class i. A structured dictionary $D = [D_1, D_2, \ldots, D_i, \ldots, D_K]$ is learned, with $D_i$ being a class-specific sub-dictionary associated with class i. Let y be the feature of a query sample. Sparse representation codes y over the dictionary D, so that $y \approx DX$, where $X = [X_1; X_2; \ldots; X_i; \ldots; X_K] \in \mathbb{R}^n$ is a vector of sparse coefficients and $X_i$ contains the sparse sub-coefficients associated with class i. The calculated sparse codes for one feature correspond to the responses of that feature to all the atoms in the dictionary. This is formulated as
\[
\hat{X} = \arg\min_{X} \left\{ \| y - DX \|_2^2 + \lambda \| X \|_1 \right\}
\tag{3}
\]
where λ is a scalar parameter.

IV. ACTION RECOGNITION USING SDDL

A predefined dictionary [28] has high coding complexity and cannot fully exploit the discriminative information hidden in the training samples. Although there are more general dictionary learning methods, such as K-SVD [12], they are not well suited to classification tasks because they only guarantee that the learned dictionary faithfully represents the training samples, not that it is discriminative, which is important for action recognition. In this paper, using the combined sparse coding LDM feature and multi-scale HOG feature (ScLDM) extracted from depth images, we construct an SDDL model, which was described in [29] for image classification, to classify different actions. To the best of our knowledge, it is the first time this model is applied to depth data-based action recognition. One advantage of SDDL is that the dictionary can be learned class by class offline and tested online: when a new class of training samples is added, the dictionary updates itself incrementally, without repeating the entire training process. Ideally, the dictionary D in Eq. 3 not only faithfully represents the query samples, but also has powerful discriminative power for action recognition. The SDDL model for action recognition is given by
\[
J_{(D,X)} = \arg\min_{(D,X)} \left\{ r(A, D, X) + \lambda_1 \| X \|_1 + \lambda_2 f(X) \right\}
\quad \text{s.t. } \| d_n \|_2 = 1, \; \forall n
\tag{4}
\]

where $r(A, D, X)$ is the data fidelity term, $\|X\|_1$ is the sparsity constraint, $f(X)$ is a discrimination term imposed on the coefficient matrix $X$, and $\lambda_1$ and $\lambda_2$ are scalar parameters. Each atom $d_n$ of $D$ is constrained to have unit $\ell_2$-norm.

A. Data Fidelity Term and Discriminative Coefficient Term

Let $X_i = [X_i^1; \ldots; X_i^j; \ldots; X_i^K]$, where $X_i^j$ is the representation coefficient of $A_i$ over $D_j$. First, the whole dictionary $D$ should represent $A_i$ as faithfully as possible, namely $A_i \approx D X_i = D_1 X_i^1 + \cdots + D_i X_i^i + \cdots + D_K X_i^K$. Second, since $D_i$ is the class-specific sub-dictionary associated with class $i$, it is expected that $X_i^i$ has the most significant coefficients in $X_i$ while the other $X_i^j$ ($j \ne i$) are as small as possible, i.e. $X_i$ is block sparse. The data fidelity term $r(A, D, X)$ is formulated as

\[
r(A_i, D, X_i) = \| A_i - D X_i \|_F^2 + \| A_i - D_i X_i^i \|_F^2 + \sum_{j=1, j \ne i}^{K} \| D_j X_i^j \|_F^2
\tag{5}
\]
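As a worked illustration of Eq. (5), the small NumPy sketch below evaluates the data fidelity term for one class, given the class feature matrix, the list of class sub-dictionaries and the corresponding coefficient blocks; the function name, shapes and toy values are illustrative assumptions.

```python
# Illustrative evaluation of the data fidelity term r(A_i, D, X_i) of Eq. (5).
# sub_dicts[j] is D_j (d x n_j); Xi_blocks[j] is X_i^j (n_j x m); Ai is d x m.
import numpy as np

def data_fidelity(Ai, sub_dicts, Xi_blocks, i):
    D = np.hstack(sub_dicts)              # whole dictionary D = [D_1, ..., D_K]
    Xi = np.vstack(Xi_blocks)             # X_i stacked over all sub-dictionaries
    r = np.linalg.norm(Ai - D @ Xi, "fro") ** 2             # ||A_i - D X_i||_F^2
    r += np.linalg.norm(Ai - sub_dicts[i] @ Xi_blocks[i], "fro") ** 2
    r += sum(np.linalg.norm(sub_dicts[j] @ Xi_blocks[j], "fro") ** 2
             for j in range(len(sub_dicts)) if j != i)      # sum_{j!=i} ||D_j X_i^j||_F^2
    return r

# Toy shapes: 3 classes, 4 atoms each, 64-dim features, 10 samples of class 1
rng = np.random.default_rng(0)
dicts = [rng.standard_normal((64, 4)) for _ in range(3)]
blocks = [rng.standard_normal((4, 10)) * 0.1 for _ in range(3)]
A1 = dicts[1] @ blocks[1]
print(round(data_fidelity(A1, dicts, blocks, 1), 3))
```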

The discriminative constraint $f(X)$ makes each class-specific sub-dictionary represent the corresponding action efficiently but other kinds of action less efficiently, leading to smaller and more compact dictionaries. It is formulated as
\[
f(X) = \mathrm{tr}\left( S_W(X) \right) - \mathrm{tr}\left( S_B(X) \right) + \eta \| X \|_F^2
\tag{6}
\]

where $S_W(X)$ and $S_B(X)$ are the intra-class and inter-class scatter of $X$, respectively, and $\eta \|X\|_F^2$ is an elastic term that makes $f(X)$ convex and stable for optimization. In Eq. 6, we set η = 1.

B. The Classification Strategy

The classification strategy is inspired by the recent success of [29] in face recognition. When the extracted action feature does not carry enough information to represent the sample feature space, the learned sub-dictionary $D_i$ may not be able to faithfully represent the query samples of its class. In that case we use collaborative sparse representation (CRC) over the whole dictionary D, called the Global Classifier (SDDL-GC), similar to [29]. Moreover, in the test stage the l1-norm regularization on the representation coefficients may be relaxed to l2-norm regularization for faster computation. When the extracted action feature carries enough information to represent the sample feature space, the sub-dictionary $D_i$ is able to span the subspace of class i well. In this case, we can represent y locally over each sub-dictionary $D_i$ instead of the whole dictionary D, which is called the Local Classifier (SDDL-LC) [29]. A sketch of the global classification step is given below.
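The following Python sketch illustrates the SDDL-GC test stage as described above: the query feature is coded over the whole dictionary with the l1 constraint relaxed to l2 regularization, and the label is assigned by the smallest class-wise reconstruction residual. The residual used in [29] additionally involves the learned class coefficient means; that term and all tuning are omitted here, so this is an illustrative simplification rather than the exact classifier.

```python
# Minimal sketch of the SDDL-GC test stage: l2-relaxed coding over the whole
# dictionary, then class-wise reconstruction residuals. The coefficient-mean
# term of [29] is omitted; values and shapes are illustrative.
import numpy as np

def sddl_gc_classify(y, sub_dicts, lam=0.005):
    """y: (d,) query feature; sub_dicts: list of K class sub-dictionaries D_i (d x n_i)."""
    D = np.hstack(sub_dicts)                                  # whole dictionary
    # l2-regularized coding: x = (D^T D + lam I)^{-1} D^T y
    x = np.linalg.solve(D.T @ D + lam * np.eye(D.shape[1]), D.T @ y)
    residuals, start = [], 0
    for Di in sub_dicts:                                      # class-wise residuals
        xi = x[start:start + Di.shape[1]]
        residuals.append(np.linalg.norm(y - Di @ xi))
        start += Di.shape[1]
    return int(np.argmin(residuals))                          # predicted class index

# Toy usage: 3 classes, 5 unit-norm atoms each, 64-dimensional features
rng = np.random.default_rng(0)
dicts = [rng.standard_normal((64, 5)) for _ in range(3)]
dicts = [Di / np.linalg.norm(Di, axis=0) for Di in dicts]     # unit l2-norm atoms (Eq. 4)
y = dicts[1] @ rng.standard_normal(5)                         # sample from class 1
print(sddl_gc_classify(y, dicts))                             # expected: 1
```

The local classifier (SDDL-LC) would instead code y over each sub-dictionary separately and compare the per-class residuals.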

TABLE I
THREE ACTION SUBSETS OF THE MSR ACTION3D DATASET

Action Set 1 (AS1)        Action Set 2 (AS2)      Action Set 3 (AS3)
Horizontal wave (How)     High wave (Hiw)         High throw (Ht)
Hammer (Hamm)             Hand catch (Hcat)       Forward kick (Fk)
Forward punch (Fp)        Draw x (Dx)             Side kick (Sk)
High throw (Ht)           Draw tick (Dt)          Jogging (Jog)
Hand clap (Hcla)          Draw circle (Dc)        Tennis swing (Tsw)
Bend (Bend)               Two hand wave (Thw)     Tennis serve (Tse)
Tennis serve (Tse)        Forward kick (Fk)       Golf swing (Gs)
Pickup throw (Pt)         Side boxing (Sb)        Pickup throw (Pt)

TABLE II
PERFORMANCE EVALUATION OF THE PROPOSED METHODS AND SOME EXISTING METHODS WITH THREE TESTS (TEST ONE, TEST TWO, CROSS SUBJECT TEST) ON THREE SUBSETS (AS1, AS2, AS3)

Tests       [13]   [23]   [18]   [22]   M1     M2     M3     M4
AS1 One     89.5   98.5   97.3   94.7   97.3   94.7   98.0   97.3
AS2 One     89.0   96.7   92.2   95.4   96.1   96.1   93.5   93.5
AS3 One     96.3   93.5   98.0   97.3   98.7   94.7   99.3   98.7
Overall     91.6   96.2   95.8   95.8   97.4   95.1   97.0   96.5
AS1 Two     93.4   98.6   98.7   97.3   98.7   98.7   97.3   98.7
AS2 Two     92.9   97.9   94.7   98.7   98.7   98.7   98.7   98.7
AS3 Two     96.3   94.9   98.7   97.3   100    98.7   100    100
Overall     94.2   97.1   97.3   97.7   99.1   98.7   98.7   99.1
AS1 Cross   72.9   88.0   96.2   74.5   96.2   97.2   97.2   97.2
AS2 Cross   71.9   85.5   84.1   76.1   83.2   83.2   88.5   89.4
AS3 Cross   79.2   63.5   94.6   96.4   91.6   92.9   93.8   94.5
Overall     74.7   79.0   91.6   82.3   90.5   91.1   93.2   93.7
Average     86.8   90.8   94.9   91.9   95.7   95.0   96.3   96.4

V. EXPERIMENT SETTINGS AND RESULTS

A. Datasets and Experiment Settings

MSR Action3D [13] is a public dataset with sequences of depth maps captured by an RGB-Depth sensor (Kinect), as shown in Table I. It includes 20 action categories performed by 10 subjects; each subject performed each action 2 or 3 times. There are 567 depth action sequences in total, with a resolution of 320x240. The 20 actions are divided into three subsets, AS1, AS2 and AS3. Each subset includes 8 actions with certain overlaps. AS1 and AS2 group actions with similar movements, while AS3 groups complex actions; accordingly, AS1 and AS2 have small inter-class variations, while AS3 has large intra-class variation. We use this dataset to evaluate the proposed method with the same experiment settings as described in [13]. For each subset, there are three different tests. In Test One, 1/3 of the subset is used for training and the rest for testing; in Test Two, 2/3 of the subset is used for training and the rest for testing; and in the Cross Subject Test, half of the subjects, namely subjects 1, 3, 5, 7, 9 (if present), are used for training and the remaining subjects for testing (a code sketch of this split is given below). As described in [13], the samples used for training are fixed: in Test One and Test Two, for each action and each subject, the first one or the first two action videos are chosen as training samples. In each projected map, the foreground region is normalized to a fixed size; this normalization reduces intra-class variations caused by subject heights and motion extents. In SDDL, the number of dictionary atoms is set to the number of training samples. The parameters of SDDL-GC are λ1 = 0.005, λ2 = 0.05, γ = 0.005, ω = 0.05; the parameters of SDDL-LC are λ1 = 0.1, λ2 = 0.001, µ1 = 0.1, µ2 = 0.005.

B. Action Recognition Results on Three Subsets and Analysis

As shown in Table II, the recognition rates of Test One and Test Two are higher than those of the Cross Subject Test, and those of Test Two are, in general, the highest. In the Cross Subject Test, because the test samples and training samples come from different subjects and the subjects performed the actions freely in their own styles, there are large intra-class variations, making it more challenging.
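As a concrete illustration of the cross-subject protocol above (subjects 1, 3, 5, 7, 9 for training), the minimal sketch below partitions samples by subject number. The sample-identifier pattern aAA_sSS_eEE assumed here mirrors the MSR Action3D file naming and is illustrative rather than part of the paper.

```python
# Minimal sketch of the cross-subject split: odd-numbered subjects for training,
# the rest for testing. The identifier pattern "aAA_sSS_eEE" is an assumption.
import re

def cross_subject_split(sample_names, train_subjects=(1, 3, 5, 7, 9)):
    """Partition sample identifiers like 'a01_s03_e02' by subject number."""
    train, test = [], []
    for name in sample_names:
        subject = int(re.search(r"_s(\d+)_", name).group(1))
        (train if subject in train_subjects else test).append(name)
    return train, test

samples = ["a01_s01_e01", "a01_s02_e01", "a05_s03_e02", "a05_s06_e01"]
print(cross_subject_split(samples))
# (['a01_s01_e01', 'a05_s03_e02'], ['a01_s02_e01', 'a05_s06_e01'])
```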

Fig. 3. Confusion matrices using the proposed method on the Cross Subject Test in subsets AS1 (left), AS2 (middle) and AS3 (right).

TABLE III
EVALUATION OF METHODS ON THE CROSS SUBJECT TEST

Methods                        Year   Data Channel   Cross Test
Bag-of-3D-points [13]          2010   DEP            74.7
Actionlet ensemble [16]        2012   DEP, SK        88.2
HOJ3D & DHMM [23]              2012   SK             79.0
STOP [17]                      2012   DEP            84.8
DMM HOG & SVM [18]             2012   DEP            91.6
HON4D [20]                     2013   DEP            88.9
DSTIP+DCSF & SVM [15]          2013   DEP            85.8
Eigen joints & NBNN [22]       2014   SK             82.3
3DMTM-PHOG & SVM [21]          2014   DEP            90.7
DMM & SDDL-GC+PCA              Ours   DEP            90.5
DMM & SDDL-LC+RP               Ours   DEP            91.1
ScLDM & SDDL-GC                Ours   DEP            93.2
ScLDM & SDDL-LC                Ours   DEP            93.7

DEP, SK denote depth images and skeleton joints, respectively.


Using the same experimental setup, four different configurations of the proposed method (M1, M2, M3, M4) are compared with the state of the art on the three subsets of MSR Action3D. The results are summarized in Table II. First, we use M1 (DMM with SDDL-GC) and M2 (DMM with SDDL-LC) to evaluate how well the DMM feature described in [18] fits the SDDL model. Then, we evaluate the performance of the proposed ScLDM feature with the two classification strategies, M3 (ScLDM + SDDL-GC) and M4 (ScLDM + SDDL-LC). Table II clearly shows that, under the same settings, the average recognition rates of the four configurations of the proposed method outperform [13], [18], [22] and [23]. We also observe that when using the combined sparse coding LDM feature and multi-scale HOG feature (ScLDM), the performance of both SDDL-GC and SDDL-LC is better than with the single-layer LDM feature (i.e. DMM). Among the four configurations, M4 achieves the best performance.

C. Cross Subject Test and Analysis

We further studied the Cross Subject Test. The results are shown in Table III and indicate that the proposed four methods are, in general, better than the other methods. The highest accuracy is 93.7% by M4 and the second highest is 93.2% by M3; both outperform state-of-the-art results.

Fig. 4. Average recognition rates of two features (single-layered LDM (L = 1) and ScLDM) with three classification methods (SVM, SDDL-GC and SDDL-LC) on the Cross Subject Test: (a) classifier comparison, (b) feature comparison.

The confusion matrices of the Cross Subject Test based on ScLDM (L = 1) and SDDL-GC are shown in Fig. 3. In AS1 and AS3, most actions are recognized correctly, despite the fact that different people may perform the same action differently, leading to large intra-class variation in each class. In AS2, most state-of-the-art methods achieve very low accuracies for some of the similar actions, such as "Hand catch", "Draw x", "Draw tick" and "Draw circle". For these four actions,

75%, 79%, 100% and 53% are achieved by the proposed method, indicating the effectiveness and distinctiveness of combining multi-layered LDM features in time with multi-scale HOG features in space.

D. Performance Evaluation of SDDL and SVM Classifiers on the Cross Subject Test

As illustrated in Fig. 4, using only the single-layered LDM feature, without computing the multi-scale structure feature (HOG pyramid), the cross-subject accuracies are 90.5% (SDDL-GC classifier) and 91.1% (SDDL-LC classifier), which outperform the 82.3% obtained with DMM HOG-SVM in [18] and the 90.7% obtained with 3DMTM-SVM in [21], respectively. This demonstrates the effectiveness of the proposed sparse representation with incremental discriminative dictionary learning. By imposing the Fisher discriminative criterion in dictionary learning, the performance of human action recognition is greatly enhanced.

VI. CONCLUSIONS

In this paper, based on depth images, we proposed an effective method to extract features of human actions. Using multi-layered LDM and multi-scale HOG, we captured the temporal local image motion feature and the action structure and shape feature, respectively. We then combined this complementary information into a holistic descriptor to form an effective representation of human actions. We also introduced the Sparse Discriminative incremental Dictionary Learning (SDDL) model to action recognition, which learns a sparse dictionary class by class. This dictionary makes the sparse coding coefficients sufficiently discriminative for the classification of similar actions. The experimental results on the MSR Action3D dataset demonstrate the effectiveness of ScLDM features and the classification performance of SDDL.

ACKNOWLEDGMENT

This work is supported by the National Natural Science Foundation of China (NSFC, No. 61331021), the Key International Collaboration Program of NSFC (No. 61210005) and the Canada Research Chair Program.

REFERENCES

[1] J. Aggarwal and M. S. Ryoo, "Human activity analysis: A review," ACM Computing Surveys (CSUR), vol. 43, no. 3, p. 16, 2011.
[2] C. Chen, R. Jafari, and N. Kehtarnavaz, "Improving human action recognition using fusion of depth camera and inertial sensors," Human-Machine Systems, IEEE Transactions on, vol. 45, no. 1, pp. 51–61, 2015.
[3] J. Aggarwal and L. Xia, "Human activity recognition from 3d data: A review," Pattern Recognition Letters, vol. 48, pp. 70–80, 2014.
[4] R. D. Green and L. Guan, "Quantifying and recognizing human movement patterns from monocular video images—part II: applications to biometrics," Circuits and Systems for Video Technology, IEEE Transactions on, vol. 14, no. 2, pp. 191–198, 2004.
[5] M. J. Kyan, G. Sun, H. Li, L. Zhong, P. Muneesawang, N. Dong, B. Elder, and L. Guan, "An approach to ballet dance training through ms kinect and visualization in a cave virtual reality environment," ACM Transactions on Intelligent Systems and Technology (TIST), vol. 6, no. 2, p. 23, 2015.
[6] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake, "Real-time human pose recognition in parts from single depth images," in CVPR, June 2011, pp. 1297–1304.

[7] A. Bobick and J. Davis, "The recognition of human movement using temporal templates," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 23, no. 3, pp. 257–267, Mar 2001.
[8] I. Laptev and T. Lindeberg, "On space-time interest points," International Journal of Computer Vision, vol. 64, no. 2, pp. 107–123, 2005.
[9] J. C. Niebles, H. Wang, and L. Fei-Fei, "Unsupervised learning of human action categories using spatial-temporal words," International Journal of Computer Vision, vol. 79, no. 3, pp. 299–318, 2008.
[10] H. Wang and C. Schmid, "Action recognition with improved trajectories," in Computer Vision (ICCV), 2013 IEEE International Conference on, Dec 2013, pp. 3551–3558.
[11] Y. Wang and G. Mori, "Hidden part models for human action recognition: Probabilistic versus max margin," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 33, no. 7, pp. 1310–1323, July 2011.
[12] M. Aharon, M. Elad, and A. Bruckstein, "K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation," Signal Processing, IEEE Transactions on, vol. 54, no. 11, pp. 4311–4322, Nov 2006.
[13] W. Li, Z. Zhang, and Z. Liu, "Action recognition based on a bag of 3d points," in CVPR Workshops (CVPRW), 2010 IEEE Computer Society Conference on, June 2010, pp. 9–14.
[14] B. Ni, G. Wang, and P. Moulin, "Rgbd-hudaact: A color-depth video database for human daily activity recognition," in Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on, Nov 2011, pp. 1147–1153.
[15] L. Xia and J. Aggarwal, "Spatio-temporal depth cuboid similarity feature for activity recognition using depth camera," in CVPR, 2013 IEEE Computer Society Conference on. IEEE, 2013, pp. 2834–2841.
[16] J. Wang, Z. Liu, Y. Wu, and J. Yuan, "Mining actionlet ensemble for action recognition with depth cameras," in CVPR, 2012 IEEE Conference on, pp. 1290–1297.
[17] A. W. Vieira, E. R. Nascimento, G. L. Oliveira, Z. Liu, and M. F. M. Campos, "Stop: Space-time occupancy patterns for 3d action recognition from depth map sequences," in Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications (CIARP), vol. 7441. Springer, 2012, pp. 252–259.
[18] X. Yang, C. Zhang, and Y. Tian, "Recognizing actions using depth motion maps-based histograms of oriented gradients," in 20th ACM International Conference on Multimedia, 2012, pp. 1057–1060.
[19] C. Chen, K. Liu, and N. Kehtarnavaz, "Real-time human action recognition based on depth motion maps," Journal of Real-Time Image Processing, pp. 1–9, 2013.
[20] O. Oreifej and Z. Liu, "Hon4d: Histogram of oriented 4d normals for activity recognition from depth sequences," in CVPR, 2013 IEEE Conference on. IEEE, 2013, pp. 716–723.
[21] B. Liang and L. Zheng, "3D motion trail model based pyramid histograms of oriented gradient for action recognition," in ICPR, 2014 22nd International Conference on, pp. 1952–1957.
[22] X. Yang and Y. Tian, "Effective 3d action recognition using eigenjoints," Journal of Visual Communication and Image Representation, vol. 25, pp. 2–11, Jan. 2014.
[23] L. Xia, C.-C. Chen, and J. Aggarwal, "View invariant human action recognition using histograms of 3d joints," in CVPRW, 2012 IEEE Computer Society Conference on. IEEE, 2012, pp. 20–27.
[24] J. Luo, W. Wang, and H. Qi, "Spatio-temporal feature extraction and representation for rgb-d human action recognition," Pattern Recognition Letters, vol. 50, pp. 139–148, 2014.
[25] Y. Zhu, W. Chen, and G. Guo, "Evaluating spatiotemporal interest point features for depth-based action recognition," Image and Vision Computing, vol. 32, no. 8, pp. 453–464, 2014.
[26] C. Chen, R. Jafari, and N. Kehtarnavaz, "Action recognition from depth sequences using depth motion maps-based local binary patterns," in Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), Jan 2015, pp. 1092–1099.
[27] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in CVPR, 2005 IEEE Computer Society Conference on, vol. 1, 2005, pp. 886–893.
[28] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, "Robust face recognition via sparse representation," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 31, no. 2, pp. 210–227, 2009.
[29] M. Yang, L. Zhang, X. Feng, and D. Zhang, "Sparse representation based fisher discrimination dictionary learning for image classification," International Journal of Computer Vision, pp. 1–24, 2014.
