3D ACTION RECOGNITION USING DATA VISUALIZATION AND CONVOLUTIONAL NEURAL NETWORKS
Mengyuan Liu†, Chen Chen‡ and Hong Liu†∗
†Key Laboratory of Machine Perception, Shenzhen Graduate School, Peking University, China
‡Center for Research in Computer Vision, University of Central Florida, USA
[email protected] [email protected] [email protected]
ABSTRACT

It remains a challenge to efficiently represent spatial-temporal data for 3D action recognition. To address this problem, this paper presents a new skeleton-based action representation using data visualization and convolutional neural networks, which contains four main stages. First, the skeletons of an action sequence are mapped to a set of five-dimensional points, with three dimensions of location, one dimension of time label and one dimension of joint label. Second, these points are encoded as a series of color images by visualizing the points as RGB pixels. Third, convolutional neural networks are adopted to extract deep features from the color images. Finally, the action class score is calculated by fusing selected deep features. Extensive experiments on three benchmark datasets show that our method achieves state-of-the-art results.

Index Terms— 3D action recognition, data visualization, skeleton data, convolutional neural networks

1. INTRODUCTION

Human action recognition has been used in various applications, e.g., human-computer interaction and intelligent surveillance [1, 2]. An intuitive way to analyse human actions is to estimate human poses from 2D images, which suffers from semantic ambiguities induced by cluttered backgrounds and the loss of depth data [3, 4]. The development of RGB-D cameras, in particular the Kinect [5], opens up opportunities for addressing the above problems [6, 7, 8, 9, 10]. With the ability to capture skeletons from the Kinect in real time [11], recent works have focused on skeleton-based action recognition [12]. Recently, convolutional neural networks (ConvNets) have been applied to skeleton data [13, 14], and the extracted deep features outperform most previous hand-crafted features, which are shallow and dataset-dependent. However, action recognition using skeleton data is still challenging for two main reasons. First, skeleton data suffers from noise and from variations in performers' speeds and habits. Second, it is difficult to efficiently represent spatial-temporal data [15].

To encode spatial-temporal data, [13, 14] convert a skeleton sequence into a color image, which implicitly involves the local coordinates, joint labels and time labels of skeleton joints. However, these methods overemphasize either the spatial or the temporal domain. This paper proposes a compact yet discriminative skeleton sequence representation. We represent a skeleton sequence as a series of color images which encode both the spatial and the temporal domain in an unbiased manner. Compared with [13, 14], our method captures more abundant spatial-temporal cues, since the generated color images extensively encode both spatial and temporal cues. After representing a skeleton sequence as a set of color images, we modify AlexNet [16] to extract deep features from these images. The final judgement of the action class is made by decision-level fusion. The main contributions are summarized as follows:

• We model a skeleton sequence as a set of five-dimensional points, which are further encoded as a series of color images using data visualization. These color images efficiently describe both the spatial and the temporal information of skeleton joints.

• After 3D actions are encoded as images, previous work on the popular convolutional neural networks can be applied to 3D action recognition. We propose a multi-stream convolutional neural network model to extract deep features from the color images. The action class score is calculated by fusing selected deep features.

• We achieve the highest accuracies on three benchmark datasets. On the NTU RGB+D dataset, our method even outperforms the most recent LSTM-based method, i.e., ST-LSTM [17], verifying the efficiency of our method in representing spatial-temporal skeleton joints.

The remainder of this paper is organized as follows. Section 2 briefly reviews related work. Section 3 presents the skeleton visualization method. Section 4 describes the structure of the multi-stream CNN model. Section 5 reports the experimental results. Section 6 concludes the paper.

∗This work is supported by the National High Level Talent Special Support Program, the National Natural Science Foundation of China (NSFC, No. 61340046, 61673030, 61672079, U1613209), the Specialized Research Fund for the Doctoral Program of Higher Education (No. 20130001110011), the Natural Science Foundation of Guangdong Province (No. 2015A030311034), and the Scientific Research Project of Guangdong Province (No. 2015B010919004). Hong Liu is the corresponding author.
Fig. 1. Pipeline of the proposed joint visualization method. Each skeleton joint is mapped to a point (x, y, z, f, n) in a five-dimensional space, i.e., three dimensions of coordinates, one dimension of time label and one dimension of joint label. These points are encoded as ten types of color images by mapping each point to a pixel with different settings. Taking the first type of image (c = 1) as an example, a point is encoded as a pixel with (x, y) as its coordinates and (z, f, n) as its color values.

2. RELATED WORK

ConvNets have achieved promising performance in many computer vision tasks, especially image-based recognition. Many methods [18, 19, 15, 20] have been developed to encode video sequences as images, which are then explored by ConvNets. Simonyan et al. [18] proposed a two-stream ConvNet architecture incorporating spatial and temporal networks. They chose one frame as the input of the spatial network and accumulated inter-frame optical flows into multi-frame dense optical flow, which serves as the input of the temporal network. Observing that [18] only sampled up to ten consecutive frames at inference time, Ng et al. [19] aggregated strong CNN image features over long periods of a sequence using multiple pooling methods. Bilen et al. [15] proposed the dynamic image representation, a single RGB image generated by applying an approximate rank pooling operator to the raw image pixels of a sequence. Wang et al. [20] accumulated motions between projected depth maps into depth motion maps (DMM), which serve as inputs of a ConvNet. To enhance the textures of the DMMs, they coded them into pseudo-color images by applying a rainbow transform to their pixel values. Generally speaking, these methods apply operators, e.g., subtraction, rank pooling and accumulation, to the raw pixels of a sequence in order to encode the sequence as an image. Despite their efficiency, these operators may lose distinct temporal information. To alleviate this problem, Du et al. [13] concatenated the skeleton joints in each frame according to their physical connections and used the three coordinates of each joint as the corresponding color values of each pixel. The generated image directly reflects the temporal evolutions and joint labels and implicitly involves the local coordinates of the skeleton joints.
Wang et al. [14] projected the local coordinates of skeleton joints onto three orthogonal planes. On each plane, the 2D trajectories of the joints construct a color image, where the time labels and joint labels are mapped to colors. The generated image directly reflects the local coordinates of the joints and implicitly involves the temporal evolutions and joint labels. Generally speaking, these methods overemphasize either the spatial information or the temporal information. This paper presents a new data visualization method to encode skeleton joints. Since we treat spatial and temporal cues in an unbiased manner, our method outperforms [13, 14].
3. SKELETON VISUALIZATION

Data visualization [21] refers to techniques that communicate data or information by encoding it as visual objects (e.g., points, lines or bars) contained in graphics. One goal of data visualization is to communicate information in a high-dimensional space clearly and efficiently to users. Diagrams used for data visualization include the bar chart, histogram, scatter plot, stream graph, tree map and heat map. A heat map is a graphical representation of data where the individual values contained in a matrix are represented as colors. We propose a new type of heat map to visualize spatial-temporal skeleton joints as a series of color images. The key idea is to express a 5D space as a 2D coordinate space plus a 3D color space. As shown in Fig. 1, each joint is first treated as a 5D point (x, y, z, f, n), where (x, y, z) are the coordinates, f is the time label and n is the joint label. A function $\Gamma$ is defined to permute the elements of the point:

$$(j, k, r, g, b) = \Gamma\big((\hat{x}, \hat{y}, \hat{z}, f, n), c\big), \qquad (1)$$

where c indicates that $\Gamma$ returns the c-th type of ranking. We use j and k as the local coordinates and r, g, b as the color values at location (j, k). To this end, r, g and b are normalized to [0, 255].
Fig. 2. Illustration of color images generated by different data visualization methods. (a) shows the skeletons of an action "throw"; (b), (c) and (d) respectively show the color images generated by [14], [13] and our method.

Fig. 3. Proposed skeleton-based action recognition using the multi-stream CNN model.
Using the c-th type of ranking, three gray images $I_c^R$, $I_c^G$ and $I_c^B$ are constructed as:

$$\big[\, I_c^R(j,k) \;\; I_c^G(j,k) \;\; I_c^B(j,k) \,\big] = \big[\, r \;\; g \;\; b \,\big], \qquad (2)$$

where $I_c^R(j,k)$ stands for the pixel value of $I_c^R$ at location $(j,k)$. Thus, the c-th color image is formulated as:

$$I_c = \big[\, I_c^R \;\; I_c^G \;\; I_c^B \,\big]. \qquad (3)$$
Operating the function $\Gamma$ on the point (x, y, z, f, n) can generate 5 × 4 × 3 × 2 × 1 = 120 types of ranking, each corresponding to a color image. However, generating so many images incurs a huge time and computation cost. Moreover, these images may contain redundant information. For example, suppose two images share the same color space (z, f, n) while their coordinate spaces are (x, y) and (y, x), respectively. One image can be transformed into the other by a 90-degree rotation; in other words, both images encode the same spatial-temporal cues of the skeleton joints. As another example, suppose two images share the same coordinate space (x, y) while their color spaces are (z, f, n) and (z, n, f), respectively. The two images have the same shape and differ only slightly in color, indicating that most of the spatial-temporal cues they encode are the same. Generally, we argue that a permutation within the coordinate space or within the color space generates similar images. Therefore, this paper uses the ten types of ranking shown in Fig. 1. These rankings ensure that each element of the point (x, y, z, f, n) can be assigned to the coordinate space as well as to the color space. Fig. 2 (d) shows the ten color images extracted from an action "throw", where both the spatial and the temporal information of the skeleton joints are encoded. Fig. 2 also compares our method with [14] and [13], which can be considered as two specific cases of our visualization method. As can be seen, the images in (b) are similar to sub-figures #1, #2 and #5 in (d); these images mainly reflect the spatial distribution of the skeleton joints. The image in (c) is similar to sub-figure #10 in (d); this image mainly reflects the temporal evolution of the skeleton joints. In (d), the sub-figures highlighted by red bounding boxes provide distinct spatial and temporal distributions which have never been explored by previous works, e.g., [14] and [13].
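To make the mapping of Eqs. (1)-(3) concrete, the following sketch (in Python with NumPy; the paper's own implementation is written in Matlab, cf. Section 5.7) builds the c-th color image from a skeleton sequence. The permutation table and all function names are illustrative assumptions; only type c = 1 is taken from Fig. 1, where (x, y) become the image coordinates and (z, f, n) the RGB values, and the 224-pixel output size follows Section 4.

```python
import numpy as np

# Ten permutations of (x, y, z, f, n): the first two entries give the image
# coordinates (j, k), the last three give the colors (r, g, b).  Only c = 1 is
# taken from the Fig. 1 example; the remaining rows are assumed for illustration.
RANKINGS = [
    ("x", "y", "z", "f", "n"),  # c = 1 (Fig. 1 example)
    ("x", "z", "y", "f", "n"),  # c = 2  (assumed)
    ("x", "f", "y", "z", "n"),  # c = 3  (assumed)
    ("x", "n", "y", "z", "f"),  # c = 4  (assumed)
    ("y", "z", "x", "f", "n"),  # c = 5  (assumed)
    ("y", "f", "x", "z", "n"),  # c = 6  (assumed)
    ("y", "n", "x", "z", "f"),  # c = 7  (assumed)
    ("z", "f", "x", "y", "n"),  # c = 8  (assumed)
    ("z", "n", "x", "y", "f"),  # c = 9  (assumed)
    ("f", "n", "x", "y", "z"),  # c = 10 (assumed)
]

def scale(v, size):
    """Linearly rescale a 1-D array to [0, size - 1]."""
    v = v.astype(np.float64)
    return (v - v.min()) / (v.max() - v.min() + 1e-8) * (size - 1)

def visualize(joints, c, size=224):
    """Encode a skeleton sequence as the c-th color image (cf. Eqs. (1)-(3)).

    joints: array of shape (F, N, 3) with (x, y, z) for N joints in F frames.
    Returns a (size, size, 3) uint8 image.
    """
    F, N, _ = joints.shape
    f_idx, n_idx = np.meshgrid(np.arange(F), np.arange(N), indexing="ij")
    # Every joint of every frame becomes a 5-D point (x, y, z, f, n).
    dims = {"x": joints[..., 0].ravel(), "y": joints[..., 1].ravel(),
            "z": joints[..., 2].ravel(),
            "f": f_idx.ravel().astype(np.float64),
            "n": n_idx.ravel().astype(np.float64)}
    coord_names, color_names = RANKINGS[c - 1][:2], RANKINGS[c - 1][2:]
    j = scale(dims[coord_names[0]], size).astype(int)   # row index
    k = scale(dims[coord_names[1]], size).astype(int)   # column index
    image = np.zeros((size, size, 3), dtype=np.uint8)
    for ch, name in enumerate(color_names):             # r, g, b in [0, 255]
        image[j, k, ch] = scale(dims[name], 256).astype(np.uint8)
    return image
```

For instance, `visualize(joints, c=1)` with a (40, 20, 3) joint array returns one 224 × 224 image; stacking the results for c = 1, ..., 10 produces the ten inputs of the multi-stream model described in the next section.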
4. MULTI-STREAM CNN FUSION

To obtain more discriminative features from spatial-temporal skeleton joints, we propose a multi-stream CNN-based model to extract deep features from the color images generated in the previous section. Inspired by two-stream deep networks [18], the proposed model (shown in Fig. 3) involves ten modified AlexNets [16], where each CNN takes one type of color image as input. The posterior probabilities generated by the individual CNNs are fused into the final class score.

For an input sequence $I^m$, we obtain a series of color images $\{I_c^m\}_{c=1}^{10}$. Each image is normalized to 224 × 224 pixels to take advantage of pre-trained models. Mean removal is applied to all input images to improve the convergence speed. Then, each color image is processed by a CNN. For the image $I_c^m$, the output $\Upsilon_c$ of the last fully-connected (fc) layer is normalized by the softmax function to obtain the posterior probability

$$\mathrm{prob}(l \mid I_c^m) = \frac{e^{\Upsilon_c^l}}{\sum_{k=1}^{L} e^{\Upsilon_c^k}}, \qquad (4)$$

which indicates the probability of image $I_c^m$ belonging to the l-th action class, where L is the total number of action classes. The objective of our model is to minimize the negative log-likelihood loss

$$\mathcal{L}(I_c) = -\sum_{m=1}^{M} \ln \sum_{l=1}^{L} \delta(l - s)\, \mathrm{prob}(l \mid I_c^m), \qquad (5)$$

where $\delta$ equals one if l = s and zero otherwise, s is the ground-truth label of $I_c^m$, and M is the batch size. For a sequence I, its class score is formulated as

$$\mathrm{prob}(l \mid I) = \frac{1}{10} \sum_{c=1}^{10} \mathrm{prob}(l \mid I_c), \qquad (6)$$

where prob(l | I) is the average of the outputs of all ten CNNs and prob(l | $I_c$) is the probability of image $I_c$ belonging to the l-th action class. To exploit the complementary property of the deep features generated by the individual CNNs, we also introduce a weighted fusion method:

$$\mathrm{prob}(l \mid I) = \frac{1}{10} \sum_{c=1}^{10} \eta_c\, \mathrm{prob}(l \mid I_c), \qquad (7)$$

where $\eta_c$ equals one or zero, indicating whether the c-th CNN is selected or not. In this case, prob(l | I) is the fused class score based on the selected CNNs. The method for choosing the parameters $\eta_c$ is discussed in Section 5.6. In the following, these two fusion strategies are referred to as the average fusion and the weighted fusion methods, respectively.
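As a minimal sketch of the two fusion rules in Eqs. (6) and (7), assume the per-stream posteriors of Eq. (4) are already collected in a 10 × L array; the variable names below are ours and the $\eta$ vector is only a hypothetical example (Section 5.6 explains how it is actually selected).

```python
import numpy as np

def average_fusion(probs):
    """Eq. (6): probs has shape (10, L); returns the averaged class score."""
    return probs.mean(axis=0)

def weighted_fusion(probs, eta):
    """Eq. (7): eta is a binary vector of length 10 selecting streams.
    The constant 1/10 matches Eq. (7) and does not affect the argmax."""
    eta = np.asarray(eta, dtype=np.float64).reshape(-1, 1)
    return (eta * probs).sum(axis=0) / probs.shape[0]

# Hypothetical example with L = 5 classes.
probs = np.random.dirichlet(np.ones(5), size=10)   # 10 streams of class posteriors
eta = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]               # assumed stream selection
predicted = int(np.argmax(weighted_fusion(probs, eta)))
```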
5. EXPERIMENTS

The proposed method is evaluated on three public benchmark datasets: the MSRC-12 dataset [22], the UTKinect-Action dataset [23] and the NTU RGB+D dataset [24]. The MSRC-12, UTKinect-Action and NTU RGB+D datasets have 20, 20 and 25 joint labels, respectively. These datasets contain noisy skeletons, speed variations and similar actions.

5.1. Datasets and Protocols

The MSRC-12 dataset [22] contains 594 sequences, comprising 719,359 frames (nearly 6 hours 40 minutes). It was collected from 30 subjects performing 12 actions. Following [14], we use the sequences performed by odd-numbered subjects for training and those by even-numbered subjects for testing. The UTKinect-Action dataset [23] contains ten actions performed twice by ten subjects, yielding 200 skeleton sequences. Following [25], we use cross-subject validation with subjects 1-5 for training and subjects 6-10 for testing. The NTU RGB+D dataset [24] is the largest dataset for skeleton-based action recognition. It contains 60 actions performed by 40 subjects from different views, yielding 56,880 skeleton sequences. Following the cross-subject protocol in [24], we split the dataset into 40,320 training samples and 16,560 testing samples.

5.2. Implementation Details

In our model, each ConvNet contains five convolutional layers and three fully-connected layers. The first and second fully-connected layers contain 4096 neurons each, and the number of neurons in the third equals the number of action classes. The filter sizes are set to 11 × 11, 5 × 5, 3 × 3, 3 × 3 and 3 × 3. Max pooling and ReLU neurons are adopted, and the dropout regularization ratio is set to 0.5. The network weights are learned using mini-batch stochastic gradient descent with the momentum set to 0.9 and the weight decay set to 0.00005. The learning rate is set to 0.001 and the maximum number of training cycles is set to 200. In each cycle, a mini-batch of 50 samples is constructed by randomly sampling 50 images from the training set. The implementation is based on the MatConvNet toolbox [26] and one NVIDIA GeForce GTX 1080 card. When we use Original Samples or Synthesized Samples for training, all layers are randomly initialized with values in [0, 0.01]. When we use Synthesized + Pre-trained + Fine-tuning for training, the third fully-connected layer is randomly initialized with values in [0, 0.01] and the other layers are initialized by models pre-trained on ILSVRC-2012. Synthesized samples are generated by a mirror operation.
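For reference, the sketch below expresses one per-stream network of Section 5.2 in PyTorch (the paper itself uses the MatConvNet toolbox). The filter sizes, fully-connected sizes, dropout ratio and SGD hyper-parameters come from the text above; the channel widths, strides and pooling positions follow the standard AlexNet [16] and should be read as assumptions.

```python
import torch
import torch.nn as nn

class SkeletonImageStream(nn.Module):
    """One AlexNet-style stream of the multi-stream model (Section 4, Fig. 3)."""
    def __init__(self, num_classes):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2),    # 11 x 11
            nn.ReLU(inplace=True), nn.LocalResponseNorm(5),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(64, 192, kernel_size=5, padding=2),              # 5 x 5
            nn.ReLU(inplace=True), nn.LocalResponseNorm(5),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(192, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Dropout(0.5), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),
            nn.Dropout(0.5), nn.Linear(4096, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),                  # class-score layer
        )

    def forward(self, x):                                  # x: (batch, 3, 224, 224)
        return self.classifier(torch.flatten(self.features(x), 1))

model = SkeletonImageStream(num_classes=12)                # e.g. MSRC-12
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,   # settings from Section 5.2
                            momentum=0.9, weight_decay=5e-5)
criterion = nn.CrossEntropyLoss()                          # softmax loss of Eqs. (4)-(5)
```

In the Synthesized + Pre-trained + Fine-tuning setting, all layers except the last would be loaded from an ILSVRC-2012 pre-trained model before fine-tuning, as described above.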
Fig. 4. Convergence curves on the MSRC-12 dataset (error rate versus training epoch), with training and testing curves for the Original Samples, Synthesized Samples and Synthesized + Pre-trained + Fine-tuning settings. The first type of color image is used as input for the ConvNet. The error rate has almost converged by training epoch 200.

Table 1. Experimental results on the MSRC-12 dataset (cross-subject protocol [14])
Method                                           Accuracy
ConvNets [13]                                    84.46%
ELC-KSVD [27]                                    90.22%
Cov3DJ [28]                                      91.70%
JTM [14]                                         93.12%
Original Samples (ours)                          91.22%
Synthesized Samples (ours)                       92.57%
Synthesized + Pre-trained + Fine-tuning (ours)   94.59%
5.3. MSRC-12 Dataset

Table 1 compares the performance of various methods on the MSRC-12 dataset. ConvNets [13] and JTM [14] are the most closely related to our method. In [13], a skeleton sequence is arranged as a color image, which mainly captures the temporal evolution of the skeleton joints. In [14], the frame-to-frame motions of the skeleton joints are accumulated into three joint trajectory maps (JTM), which efficiently encode the spatial distribution of the skeleton joints. By extracting deep features, ConvNets achieves 84.46% and JTM achieves 93.12% on this dataset. Our method with Original Samples for training achieves 91.22%. Synthesized Samples achieves a slightly higher accuracy than Original Samples, since more samples are synthesized for training. Similar to JTM, we use a pre-trained model and fine-tuning, which outperforms ConvNets and JTM by 10.13% and 1.47%, respectively. These improvements verify that our method encodes more abundant spatial-temporal cues of the skeleton joints. Fig. 4 shows the convergence curves on the MSRC-12 dataset.

5.4. UTKinect-Action Dataset

Table 2 compares the performance of various methods on the UTKinect-Action dataset.
Table 3. Experimental results on the NTU RGB+D dataset (cross-subject protocol [24])
Method                                           Accuracy
HON4D [32]                                       30.56%
SNV [33]                                         31.82%
Skeletal Quads [34]                              38.62%
LARP [31]                                        50.08%
HBRNN-L [12]                                     59.07%
Dynamic Skeletons [35]                           60.20%
Deep LSTM [24]                                   60.69%
Part-aware LSTM [24]                             62.93%
ST-LSTM [17]                                     69.20%
Original Samples (ours)                          71.47%
Synthesized Samples (ours)                       73.89%
Synthesized + Pre-trained + Fine-tuning (ours)   78.56%
Among previous skeleton-based methods, LARP [31] achieves the highest accuracy, 97.08%. Our method with Synthesized + Pre-trained + Fine-tuning for training achieves the highest accuracy, 98.99%, which outperforms LARP by 1.91% and JTM by 8.08%. Traditionally, LSTMs are specially designed to model temporal information, while ConvNets can barely encode temporal evolution. Our method nevertheless outperforms the most recent LSTM-based method, i.e., ST-LSTM [17]. This result shows that our method enables a traditional ConvNet to efficiently capture temporal information.

5.5. NTU RGB+D Dataset

Table 3 shows the performance of various methods on the NTU RGB+D dataset. The depth-based methods, i.e., HON4D [32] and SNV [33], work poorly due to the large view-point changes. Since this dataset provides rich samples for training deep models, the RNN-based methods, i.e., Deep RNN [24] and Deep LSTM [24], achieve high accuracies. Part-aware LSTM [24] achieves higher accuracy than Deep LSTM, since it captures the structure of human bodies. By extending LSTM to the spatial-temporal domain, ST-LSTM [17] achieves 69.20%, the highest accuracy among previous works. Our method with Synthesized + Pre-trained + Fine-tuning for training achieves the highest accuracy of 78.56%, which outperforms ST-LSTM by 9.36%. This improvement indicates the discriminative power of our method in jointly encoding both spatial and temporal cues.
Table 2. Experimental results on the UTKinect-Action dataset (cross-subject protocol [29])
Method                                           Accuracy
DCSF [30]                                        85.80%
JTM [20]                                         90.91%
ST-LSTM [17]                                     95.00%
LARP [31]                                        97.08%
Original Samples (ours)                          94.95%
Synthesized Samples (ours)                       95.96%
Synthesized + Pre-trained + Fine-tuning (ours)   98.99%
Fig. 5. Decision-level fusion of the probabilities from the ten channels on the MSRC-12 and UTKinect-Action datasets. The red bar stands for the proposed method, and the other bars stand for the results using single-channel data.

5.6. Evaluation of Decision-Level Fusion

Fig. 5 shows the effect of the fusion methods, where the yellow and red bars stand for average fusion and weighted fusion, respectively, and the other bars stand for the results using a single type of color image. Average fusion outperforms all individual types on both the MSRC-12 and UTKinect-Action datasets. This result indicates that the different types of data are complementary to each other. Interestingly, the significance of each type varies across datasets. For example, the 6-th type outperforms the 7-th type on the MSRC-12 dataset, while the opposite holds on the UTKinect-Action dataset. This observation motivates the weighted fusion method, which selects the proper types of features for fusion. In practice, five-fold validation is applied to the training samples to learn the values of $\eta_c$ that achieve the best accuracy. At the bottom of Fig. 5, the selected features are labeled with a tick and the discarded features with a cross. With these selected features, weighted fusion outperforms average fusion on both datasets, verifying the effect of feature selection.

5.7. Evaluation of Computation Time

On the MSRC-12 dataset, we measure the computation time of the proposed method using the first type of color image. The average time required to extract one color image is 0.04639 seconds on a 2.5 GHz machine with 8 GB RAM, using Matlab R2014a. We train our ConvNet model on one NVIDIA GeForce GTX 1080 card; the total training time on the MSRC-12 dataset is 2823.98 seconds. The speed of our method can be further increased by implementing it in C++ and using multiple GPUs.
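Relating to Section 5.6, the sketch below illustrates one way the binary weights $\eta_c$ could be selected; the exhaustive search over the $2^{10} - 1$ non-empty subsets and the function names are assumptions, since the paper only states that five-fold validation on the training set picks the values giving the best accuracy.

```python
import itertools
import numpy as np

def select_eta(val_probs, val_labels):
    """Pick the stream-selection vector eta maximizing validation accuracy.

    val_probs:  array of shape (10, M, L) with per-stream class posteriors for
                M held-out samples (in practice, pooled over the five folds).
    val_labels: array of shape (M,) with ground-truth class indices.
    """
    best_eta, best_acc = None, -1.0
    for eta in itertools.product([0, 1], repeat=10):
        if sum(eta) == 0:
            continue
        # Weighted fusion of Eq. (7) with the candidate eta.
        fused = np.tensordot(np.asarray(eta, dtype=float), val_probs, axes=1) / 10.0
        acc = float(np.mean(fused.argmax(axis=1) == val_labels))
        if acc > best_acc:
            best_eta, best_acc = np.array(eta), acc
    return best_eta, best_acc
```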
6. CONCLUSIONS

This paper presents a new data visualization method that describes a skeleton sequence as a series of color images, which compactly and distinctively encode both the spatial and the temporal cues of skeleton joints. Further, we modify AlexNet and propose a multi-stream ConvNet-based model to extract deep features from the color images. Finally, two types of decision-level fusion are used to fuse the deep features and generate the class score. Experimental results on the MSRC-12, UTKinect-Action and NTU RGB+D datasets show that our method outperforms state-of-the-art methods for skeleton-based recognition. On the largest dataset, NTU RGB+D, our ConvNet-based method achieves 9.36% higher accuracy than the most recent ST-LSTM method, verifying the efficiency of our method in encoding spatial-temporal skeleton joints. The high performance on the challenging NTU RGB+D dataset also indicates that our method can properly handle noisy skeletons. Future work will focus on applying our method to intelligent surveillance for the elderly.

7. REFERENCES

[1] Hong Liu, Mengyuan Liu, and Qianru Sun, "Learning directional co-occurrence for human action classification," in ICASSP, 2014, pp. 1235–1239.
[16] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton, “Imagenet classification with deep convolutional neural networks,” in NIPS, 2012, pp. 1097–1105. [17] Jun Liu, Amir Shahroudy, Dong Xu, and Gang Wang, “Spatio-temporal LSTM with trust gates for 3D human action recognition,” in ECCV, 2016, pp. 816–833. [18] Karen Simonyan and Andrew Zisserman, “Two-stream convolutional networks for action recognition in videos,” in NIPS, 2014, pp. 568–576. [19] Joe Yue-Hei Ng, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, and George Toderici, “Beyond short snippets: Deep networks for video classification,” in CVPR, 2015, pp. 4694–4702. [20] Pichao Wang, Wanqing Li, Zhimin Gao, Chang Tang, Jing Zhang, and Philip Ogunbona, “Convnets-based action recognition from depth maps through virtual cameras and pseudocoloring,” in ACM MM, 2015, pp. 1119–1122. [21] Will J Schroeder, Bill Lorensen, and Ken Martin, The visualization toolkit, 2004.
[2] Mengyuan Liu, Hong Liu, and Qianru Sun, “Action classification by exploring directional co-occurrence of weighted STIPs,” in ICIP, 2014, pp. 1460–1464.
[22] Simon Fothergill, Helena Mentis, Pushmeet Kohli, and Sebastian Nowozin, “Instructing people for training gestural interactive systems,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 2012, pp. 1737–1746.
[3] Meng Ding and Guoliang Fan, “Multilayer joint gait-pose manifolds for human gait motion modeling,” IEEE Trans. Cybern., vol. 45, no. 11, pp. 2413–2424, 2015.
[23] Lu Xia, Chia-Chih Chen, and JK Aggarwal, “View invariant human action recognition using histograms of 3d joints,” in CVPRW, 2012, pp. 20–27.
[4] Yong Xu, Jixiang Dong, Bob Zhang, and Daoyun Xu, “Background modeling methods in video analysis: A review and comparative evaluation,” CAAI Transactions on Intelligence Technology, vol. 1, no. 1, pp. 43–60, 2016.
[24] Amir Shahroudy, Jun Liu, Tian Tsong Ng, and Gang Wang, “NTU RGB+D: A large scale dataset for 3d human activity analysis,” in CVPR, 2016, pp. 1010–1019.
[5] Zhengyou Zhang, “Microsoft kinect sensor and its effect,” IEEE Multimedia, vol. 19, no. 2, pp. 4–10, 2012. [6] Chen Chen, Roozbeh Jafari, and Nasser Kehtarnavaz, “A survey of depth and inertial sensor fusion for human action recognition,” Multimed. Tools and Appl., pp. 1–21, 2015. [7] Chen Chen, Roozbeh Jafari, and Nasser Kehtarnavaz, “Improving human action recognition using fusion of depth camera and inertial sensors,” IEEE Trans. Human-Mach. Syst., vol. 45, no. 1, pp. 51–61, 2015. [8] Mengyuan Liu and Hong Liu, “Depth Context: A new descriptor for human activity recognition by using sole depth sequences,” Neurocomputing, vol. 175, pp. 747–758, 2016. [9] Chen Chen, Roozbeh Jafari, and Nasser Kehtarnavaz, “A real-time human action recognition system using depth and inertial sensor fusion,” IEEE Sens. J., vol. 16, no. 3, pp. 773–781, 2016. [10] Mengyuan Liu, Hong Liu, and Chen Chen, “3D action recognition using multi-scale energy-based global ternary image,” IEEE Trans. Circuits Syst. Video Technol., 10.1109/TCSVT.2017.2655521, 2017. [11] Jamie Shotton, Toby Sharp, Alex Kipman, Andrew Fitzgibbon, Mark Finocchio, Andrew Blake, Mat Cook, and Richard Moore, “Real-time human pose recognition in parts from single depth images,” Commun. ACM, vol. 56, no. 1, pp. 116–124, 2013. [12] Yong Du, Wei Wang, and Liang Wang, “Hierarchical recurrent neural network for skeleton based action recognition,” in CVPR, 2015, pp. 1110–1118. [13] Yong Du, Yun Fu, and Liang Wang, “Skeleton based action recognition with convolutional neural network,” in ACPR, 2015, pp. 579–583. [14] Pichao Wang, Zhaoyang Li, Yonghong Hou, and Wanqing Li, “Action recognition based on joint trajectory maps using convolutional neural networks,” in ACMMM, 2016, pp. 102–106. [15] Hakan Bilen, Basura Fernando, Efstratios Gavves, Andrea Vedaldi, and Stephen Gould, “Dynamic image networks for action recognition,” in CVPR, 2016, pp. 3034–3042.
[25] Yu Zhu, Wenbin Chen, and Guodong Guo, “Evaluating spatiotemporal interest point features for depth-based action recognition,” Image Vision Comput., vol. 32, no. 8, pp. 453–464, 2014. [26] Andrea Vedaldi and Karel Lenc, “Matconvnet: Convolutional neural networks for matlab,” in ACM MM, 2015, pp. 689–692. [27] Lijuan Zhou, Wanqing Li, Yuyao Zhang, Philip Ogunbona, Duc Thanh Nguyen, and Hanling Zhang, “Discriminative key pose extraction using extended lc-ksvd for action recognition,” in DICTA, 2014, pp. 1–8. [28] Mohamed E Hussein, Marwan Torki, Mohammad A Gowayyed, and Motaz El-Saban, “Human action recognition using a temporal hierarchy of covariance descriptors on 3d joint locations,” in IJCAI, 2013, pp. 2466–2472. [29] Yu Zhu, Wenbin Chen, and Guodong Guo, “Fusing spatiotemporal features and joints for 3d action recognition,” in CVPRW, 2013, pp. 486–491. [30] Lu Xia and JK Aggarwal, “Spatio-temporal depth cuboid similarity feature for activity recognition using depth camera,” in CVPR, 2013, pp. 2834–2841. [31] Raviteja Vemulapalli, Felipe Arrate, and Rama Chellappa, “Human action recognition by representing 3d skeletons as points in a lie group,” in CVPR, 2014, pp. 588–595. [32] Omar Oreifej and Zicheng Liu, “Hon4D: Histogram of oriented 4D normals for activity recognition from depth sequences,” in CVPR, 2013, pp. 716–723. [33] Xiaodong Yang and Ying Li Tian, “Super Normal Vector for human activity recognition with depth cameras,” IEEE Trans. Pattern Anal. Mach. Intell., 10.1109/TPAMI.2016.2565479, 2016. [34] G. Evangelidis, G. Singh, and R. Horaud, “Skeletal quads: Human action recognition using joint quadruples,” in ICPR, 2014, pp. 4513– 4518. [35] Jian-Fang Hu, Wei-Shi Zheng, Jianhuang Lai, and Jianguo Zhang, “Jointly learning heterogeneous features for RGB-D activity recognition,” in CVPR, 2015, pp. 5344–5352.