3D ACTION RECOGNITION USING MULTI-TEMPORAL SKELETON VISUALIZATION

Mengyuan Liu1, Chen Chen2, Fanyang Meng1,3 and Hong Liu1∗

1 Key Laboratory of Machine Perception, Shenzhen Graduate School, Peking University, China
2 Center for Research in Computer Vision, University of Central Florida, USA
3 Shenzhen Institute of Information Technology, China

[email protected] [email protected] [email protected] [email protected]

ABSTRACT

Action recognition using depth sequences plays an important role in many fields, e.g., intelligent surveillance and content-based video retrieval. Real applications require robust and accurate action recognition methods. In this paper, we propose a skeleton visualization method, which efficiently encodes the spatial-temporal information of skeleton joints into a set of color images. These images serve as inputs for convolutional neural networks to extract more discriminative deep features. To enhance the ability of the deep features to capture global relationships, we extend the color images into a multi-temporal version. Additionally, to mitigate the effect of viewpoint changes, a spatial transform method is adopted as a preprocessing step. Extensive experiments on the NTU RGB+D dataset and the ICME2017 challenge show that our method accurately distinguishes similar actions and is robust to view variations.

Index Terms— 3D action recognition, skeleton data, visualization, convolutional neural networks

This work is supported by National High Level Talent Special Support Program, National Natural Science Foundation of China (No. 61340046, 61672079, 61673030, U1613209), Specialized Research Fund for the Doctoral Program of Higher Education (No. 20130001110011), and Natural Science Foundation of Guangdong Province (No. 2015A030311034). Hong Liu∗ is the corresponding author.

1. INTRODUCTION

Human action recognition is a core task for many applications, including human-computer interaction and intelligent surveillance [1]. Estimating human poses is an intuitive way to analyse human actions. However, estimating human poses from traditional color images is difficult, due to semantic ambiguities induced by cluttered backgrounds and the loss of depth data. Recently, RGB-D sensors, especially the Kinect sensor, have opened up opportunities to address these problems [2, 3, 4]. With the depth data from the Kinect sensor, Shotton et al. developed a real-time and accurate skeleton estimation method [5], which facilitates the skeleton-based action recognition task.


Skeleton-based recognition methods can be roughly divided into three categories, i.e., hand-crafted methods, RNN-based methods and CNN-based methods.

Hand-crafted methods design hand-crafted features to represent spatial-temporal skeleton joints and use time-series models to capture the global temporal evolution [6, 7, 8]. Yang et al. [6] used joint differences to combine static postures and the overall dynamics of joints. To reduce redundancy and noise, they obtained the EigenJoints representation by applying Principal Component Analysis (PCA) to the joint differences. Beyond using joint locations or joint angles to represent a human skeleton, Vemulapalli et al. [7] modeled the 3D geometric relationships between skeleton joints using rotations and translations in 3D space. Ofli et al. [8] represented spatio-temporal skeleton joints as a sequence of the most informative joints (SMIJ). At each time instant, the skeletal joints most relevant to the current action are selected to describe the current skeleton, and the dynamic motion cues among skeleton joints are modeled by linear dynamical system parameters (LDSP). However, hand-crafted features can hardly model complex spatial-temporal distributions effectively, since these features are usually shallow and dataset-dependent.

To model the complex temporal evolution of actions, recurrent neural network (RNN) models and Long Short-Term Memory (LSTM) neurons have been used to model the temporal evolution of skeleton sequences [9, 10, 11]. Du et al. [9] proposed an end-to-end hierarchical RNN to encode the relative motion between skeleton joints. According to the body structure, the skeleton joints are divided into five main parts, which are fed into five independent subnets to extract local features. Since LSTM is able to learn representations from long input sequences using special gating schemes, many works chose LSTM to learn the complex dynamics of actions. To extract the derivatives of internal state (DoS), Veeriah et al. [10] proposed a differential RNN by adding a new gating mechanism to the original LSTM. By collecting a large-scale dataset, Shahroudy et al. [11] showed that LSTM outperforms RNN and several hand-crafted features.

To learn the common temporal patterns of partial joints independently, they proposed a part-aware LSTM with part-based memory sub-cells and a new gating mechanism. However, RNN-based methods tend to overstress the temporal information.

Observing the success of convolutional neural networks (CNN) in image recognition, many works encode video sequences as images, which are then explored by CNNs [12, 13]. Wang et al. [12] accumulated the motion between projected depth maps into depth motion maps (DMM), which serve as inputs for a CNN. Bilen et al. [13] proposed the dynamic image representation, a single RGB image generated by applying an approximate rank pooling operator to the raw image pixels of a sequence. Generally, these methods apply operators, e.g., accumulation or rank pooling, to the raw pixels of a sequence in order to convert the sequence into an image. Despite their efficiency, these operators coarsely compress the original data, leading to a loss of information.

To solve the above problem, this paper presents a multi-temporal skeleton visualization method, which encodes spatial-temporal skeleton joints as multi-temporal color images. Our main contributions are as follows:

• A spatial transform is developed to effectively cope with view variations, eliminating much of the effect of viewpoint changes.

• A skeleton visualization method is proposed to represent a skeleton sequence as a series of color images, which implicitly describe spatial-temporal skeleton joints in a compact yet distinctive manner.

• Color images are converted into multi-temporal versions, which facilitates the CNN model in capturing multi-scale information.

The remainder of this paper is organized as follows. Section 2 presents the spatial transform method. Section 3 provides the visualization method. Section 4 describes the multi-temporal version of the color images and the multi-stream CNN model. Section 5 reports the experimental results and discussions. Section 6 concludes the paper.

2. SPATIAL TRANSFORM

Given a skeleton sequence I with F frames, the n-th skeleton joint on the f-th frame is formulated as p_n^f = (x_n^f, y_n^f, z_n^f)^T, where f ∈ {1, ..., F}, n ∈ {1, ..., N}, and N denotes the total number of skeleton joints in each skeleton. The value of N is determined by the skeleton estimation algorithm; we use the joint configuration of the NTU RGB+D dataset [11], where N equals 25. The "hip center" (joint 1), "hip right" (joint 17) and "hip left" (joint 13) are used to build the spatial transform matrix, which involves three steps. First, the average location of the "hip center" is translated to the origin of the new coordinate system. Second, the average direction from "hip left" to "hip right", denoted v, is rotated to the X direction of the new coordinate system. Third, the Z direction is kept unchanged, and the Y direction of the new coordinate system is established from the X and Z directions.

Fig. 1. Spatial transform: (a) before; (b) after. The 3D plots mark the "hip center", "hip left" and "hip right" joints and the direction v.

Specifically, each joint p_n^f contains five components, i.e., three local coordinates x, y, z, the time label f and the joint label n. Since the three coordinates x, y, z are sensitive to view variations, we transform them to view-invariant values x̂, ŷ, ẑ by:

[\hat{x}, \hat{y}, \hat{z}, 1]^T = P(R_x^{\alpha}, \mathbf{0}) \, P(R_y^{\beta}, \mathbf{0}) \, P(R_z^{\gamma}, \mathbf{d}) \, [x, y, z, 1]^T,   (1)

where the transform matrix P is defined as:

P(R, \mathbf{d}) = \begin{bmatrix} R & \mathbf{d} \\ \mathbf{0} & 1 \end{bmatrix}_{4 \times 4},   (2)

where R ∈ R^{3×3} is a rotation matrix and d ∈ R^3 is a translation vector given as:

\mathbf{d} = -\frac{1}{F} \sum_{f=1}^{F} p_1^f,   (3)

which moves the original origin to the "hip center". The average direction from "hip left" to "hip right" is defined as:

\mathbf{v} = \frac{1}{F} \sum_{f=1}^{F} \left( p_{17}^f - p_{13}^f \right).   (4)

The angles between v and [1, 0, 0] are established by:

[1, 0, 0, 1]^T = P(R_x^{\alpha}, \mathbf{0}) \, P(R_y^{\beta}, \mathbf{0}) \, P(R_z^{\gamma}, \mathbf{0}) \, [\mathbf{v}, 1]^T,   (5)

where α, β and γ are calculated and then used in Eq. (1). Fig. 1 shows a transformed skeleton, where translation and rotations are removed to some extent.
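For concreteness, the following is a minimal sketch, not the authors' released code, of how this transform could be implemented with NumPy. The (F, N, 3) array layout, the 0-based joint indices and the function name are assumptions; for simplicity the sketch uses a single rotation about the Z axis (which keeps the Z direction unchanged, as in step 3) instead of the full α/β/γ factorization of Eqs. (1) and (5).

```python
import numpy as np

# Hypothetical layout: skeleton[f, n] = (x, y, z) for frame f and joint n (0-based).
HIP_CENTER, HIP_LEFT, HIP_RIGHT = 0, 12, 16   # joints 1, 13, 17 in the paper's 1-based numbering

def spatial_transform(skeleton):
    """Sketch of the view-invariant transform on an (F, N, 3) joint array.

    Follows the three steps of Section 2: translate the average hip center to
    the origin (Eq. 3), estimate the average hip direction v (Eq. 4), and
    rotate v onto the X axis while keeping the Z direction unchanged.
    """
    d = -skeleton[:, HIP_CENTER, :].mean(axis=0)                              # Eq. (3)
    v = (skeleton[:, HIP_RIGHT, :] - skeleton[:, HIP_LEFT, :]).mean(axis=0)   # Eq. (4)

    gamma = np.arctan2(v[1], v[0])            # angle between v and the X axis in the X-Y plane
    c, s = np.cos(-gamma), np.sin(-gamma)
    Rz = np.array([[c,  -s,  0.0],
                   [s,   c,  0.0],
                   [0.0, 0.0, 1.0]])          # rotation about Z by -gamma

    return (skeleton + d) @ Rz.T              # translate every joint, then rotate
```

Applied to every sequence before visualization, this removes the dependence of the joint coordinates on where the subject stands and which way the hips face.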

3. SKELETON VISUALIZATION

We propose a new type of heat map to visualize spatio-temporal skeleton joints as a series of color images. The key idea is to express a 5D space as a 2D coordinate space plus a 3D color space. Each joint is first treated as a 5D point (x, y, z, f, n), where (x, y, z) are the coordinates, f is the time label and n is the joint label. A function Γ is defined to permute the elements of the point:

(j, k, r, g, b) = \Gamma\big( (x, y, z, f, n), c \big),   (6)

where c indicates that Γ returns the c-th type of ranking. As shown in Table 1, we adopt ten types of ranking.


Table 1. Ten types of ranking method

Element   1   2   3   4   5   6   7   8   9   10
   j      x   x   x   x   y   y   y   z   z   f
   k      y   z   f   n   z   f   n   f   n   n
   r      z   y   y   y   x   x   x   x   x   x
   g      f   f   z   z   f   z   z   y   y   y
   b      n   n   n   f   n   n   f   n   f   z

Fig. 2. Skeleton visualization: (a) a skeleton sequence; (b) color images generated by skeleton visualization for rankings 1-10.

Fig. 3. Multi-stream CNN using multi-temporal images: the inputs (a) Level 0, (b) Level 1 and (c) Level 2 are each processed by a CNN (Layer 1, Layer 2, ..., Layer L, softmax), and the resulting class scores are fused.

Fig. 4. Samples from the NTU RGB+D dataset: (a) various view variations; (b) noisy skeletons.

The corresponding color images are shown in Fig. 2. These images capture both the spatial distribution and the temporal evolution of skeleton joints.
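To make the visualization concrete, here is a minimal sketch, under our own assumptions rather than the authors' released code, of how one ranking from Table 1 could be turned into a color image: the permuted components (j, k) index a pixel grid and (r, g, b) give its color. The image size, the min-max normalization and the overwrite rule for coinciding pixels are assumptions.

```python
import numpy as np

# Ten rankings from Table 1: for each c, the order in which (x, y, z, f, n)
# are assigned to (j, k, r, g, b).
RANKINGS = [
    "xyzfn", "xzyfn", "xfyzn", "xnyzf", "yzxfn",
    "yfxzn", "ynxzf", "zfxyn", "znxyf", "fnxyz",
]

def visualize(skeleton, c, size=224):
    """Render the c-th color image (c in 0..9) from an (F, N, 3) skeleton array."""
    F, N, _ = skeleton.shape
    f_idx, n_idx = np.meshgrid(np.arange(F), np.arange(N), indexing="ij")
    # One 5D point (x, y, z, f, n) per joint observation.
    points = {
        "x": skeleton[..., 0].ravel(), "y": skeleton[..., 1].ravel(),
        "z": skeleton[..., 2].ravel(), "f": f_idx.ravel().astype(float),
        "n": n_idx.ravel().astype(float),
    }
    # Permute according to ranking c: (j, k) -> pixel location, (r, g, b) -> color.
    j, k, r, g, b = (points[key] for key in RANKINGS[c])

    def norm(a):  # map each component to [0, 1]
        return (a - a.min()) / (a.max() - a.min() + 1e-8)

    img = np.zeros((size, size, 3))
    rows = (norm(j) * (size - 1)).astype(int)
    cols = (norm(k) * (size - 1)).astype(int)
    img[rows, cols] = np.stack([norm(r), norm(g), norm(b)], axis=1)  # later points overwrite earlier ones
    return img
```

The ten images produced for c = 0, ..., 9 correspond to the ten streams shown in Fig. 3.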

4. MULTI-TEMPORAL COLOR IMAGES

We use a multi-stream CNN model to extract deep features from the color images generated in the previous section. Inspired by two-stream deep networks [14], the proposed model (shown in Fig. 3) involves ten modified AlexNets [15], where each CNN takes one type of color image as input. The posterior probabilities generated by the individual CNNs are fused into the final class score. Specifically, each CNN contains five convolutional layers and three fully connected (fc) layers. The first and second fc layers contain 4096 neurons, and the number of neurons in the third one equals the total number of action classes. The filter sizes are set to 11 × 11, 5 × 5, 3 × 3, 3 × 3 and 3 × 3. Local Response Normalization (LRN), max pooling and ReLU neurons are adopted, and the dropout regularization ratio is set to 0.5.

Despite the discriminative power of the deep features extracted from the color images, the CNN model is limited in capturing temporal information at large scales. One remedy would be to equip the CNN with convolutional filters of different scales, but this complicates the implementation. Instead, this paper converts the original skeleton sequence into multi-temporal sequences and extracts deep features from each of them. As shown in Fig. 3, the resulting multi-temporal color images are fused by the CNN model. Given a sequence with F frames, "Level 0" denotes the original sequence, "Level 1" denotes the subsequence from the beginning to the F/2-th frame, and "Level 2" denotes the subsequence from the F/2-th frame to the end. Color images extracted from different levels exhibit specific local patterns, capturing multi-scale information. A sketch of this setup is given below.
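The following sketch is our illustration of the multi-temporal, multi-stream idea, not the released model: each temporal level is rendered into ten color images, each image feeds its own AlexNet-style stream, and the per-stream class probabilities are averaged. The torchvision backbone, the averaging fusion, the function names and the tensor shapes are assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import alexnet  # stand-in backbone; the paper uses a modified AlexNet in MatConvNet

# Temporal levels of Section 4, expressed as fractions of the sequence length F.
LEVELS = {"level0": (0.0, 1.0), "level1": (0.0, 0.5), "level2": (0.5, 1.0)}

def split_levels(skeleton):
    """Return the Level 0/1/2 subsequences of an (F, N, 3) skeleton array."""
    F = skeleton.shape[0]
    return {name: skeleton[int(a * F): int(b * F)] for name, (a, b) in LEVELS.items()}

def make_stream(num_classes):
    """One CNN stream: AlexNet-style backbone with the last fc layer resized."""
    net = alexnet(pretrained=True)                 # fine-tuned from ImageNet (Section 5.2)
    net.classifier[-1] = nn.Linear(4096, num_classes)
    return net

class MultiStreamCNN(nn.Module):
    """Ten streams, one per ranking in Table 1; class scores fused by averaging."""
    def __init__(self, num_classes, num_streams=10):
        super().__init__()
        self.streams = nn.ModuleList(make_stream(num_classes) for _ in range(num_streams))

    def forward(self, images):                     # images: (batch, 10, 3, H, W)
        probs = [torch.softmax(net(images[:, i]), dim=1)
                 for i, net in enumerate(self.streams)]
        return torch.stack(probs).mean(dim=0)      # fused posterior probabilities
```

Under this reading, the same ten-stream model is applied to the color images of each temporal level, and the per-level scores are fused in the same way to obtain the Level 0+1+2 results reported in Table 2.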

5. EXPERIMENTS

5.1. Datasets

The NTU RGB+D dataset [11] contains 60 actions performed by 40 subjects and captured from various views (Fig. 4 (a)), yielding 56880 skeleton sequences. The dataset also contains noisy skeleton joints (see Fig. 4 (b)), which bring an extra challenge for recognition. Following the cross-subject protocol in [11], we split the 40 subjects into training and testing groups; each group contains samples captured from different views and performed by 20 subjects. Under this evaluation, the training and testing sets have 40320 and 16560 samples, respectively. Following the ICME2017 Large Scale 3D Human Activity Recognition Challenge, we use samples from the NTU RGB+D dataset for training and the samples provided at http://rose1.ntu.edu.sg/Datasets/login.asp?DS=3 for testing.

5.2. Implementation and results

The network weights are learned using mini-batch stochastic gradient descent with the momentum set to 0.9 and the weight decay set to 0.00005. The learning rate is set to 0.001 and the maximum number of training cycles to 200. In each cycle, a mini-batch of 50 samples is constructed by randomly sampling 50 images from the training set. The implementation is based on MatConvNet [20] with one NVIDIA GeForce GTX 1080 card and 8 GB of RAM.
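As a reference for the optimization settings above, here is a hedged sketch written with PyTorch-style SGD instead of the MatConvNet setup actually used; the 60-class output, the re-initialization scheme for the last fc layer and the data loader are placeholders or assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import alexnet

# One stream, fine-tuned from an ImageNet-pretrained model (Section 5.2); the
# final fc layer is re-initialized with small random values for 60 classes.
model = alexnet(pretrained=True)
model.classifier[-1] = nn.Linear(4096, 60)
nn.init.normal_(model.classifier[-1].weight, std=0.01)

optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.001,              # learning rate 0.001
                            momentum=0.9,          # momentum 0.9
                            weight_decay=0.00005)  # weight decay 5e-5
criterion = nn.CrossEntropyLoss()

def train(train_loader, cycles=200):
    """`train_loader` is assumed to yield mini-batches of 50 (image, label) pairs."""
    model.train()
    for cycle in range(cycles):          # at most 200 training cycles
        for images, labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
```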

Table 2. Performances on the NTU RGB+D dataset

Proposed Method   Accuracy
Level 0           79.89%
Level 1           66.20%
Level 2           72.17%
Level 0+1+2       81.04%

Table 3. Results on the ICME2017 challenge

Proposed Method   Accuracy
Level 0           77.38%
Level 0+1+2       81.23%

Table 4. Comparisons on the NTU RGB+D dataset

Related Methods                          Accuracy
ST-LSTM + TG [16]                        69.20%
STA-LSTM [17]                            73.40%
Skeleton Visualization [18]              78.56%
Enhanced Skeleton Visualization [19]     80.03%
Multi-temporal Skeleton Visualization    81.04%

We take advantage of models pre-trained on large-scale image datasets such as ImageNet and fine-tune our model. Specifically, we fine-tune by initializing the third fc layer from [0, 0.01] and initializing the other layers from a model pre-trained on ILSVRC-2012 (Large Scale Visual Recognition Challenge 2012).

Table 2 and Table 3 show that our method using multiple levels achieves the best performance on the NTU RGB+D dataset and the ICME2017 challenge. Three levels are used here, taking both performance and time cost into consideration. Table 4 compares our method with related works: our method achieves an accuracy 7.64% higher than STA-LSTM [17], verifying the effectiveness of the spatial transform and skeleton visualization methods. Our method is also 2.48% and 1.01% higher than the original and enhanced skeleton visualization-based methods [18, 19], which shows that multi-temporal color images can properly capture multi-scale temporal information.

6. CONCLUSION

This paper presents a distinctive and robust method to encode the spatial-temporal information of skeleton sequences. Specifically, a multi-temporal skeleton visualization method is used to represent skeleton sequences as compact color images, which are further explored by a multi-stream CNN model to extract more distinctive deep features. To solve the problem of viewpoint variations, a spatial transform is introduced, which increases the robustness of our method. Experimental results on NTU RGB+D and the ICME2017 challenge show that our method works well against viewpoint changes and noisy data, and that the multi-temporal version does increase the ability of the CNN model to capture multi-scale information.

7. REFERENCES

[1] Hong Liu, Mengyuan Liu, and Qianru Sun, "Learning directional co-occurrence for human action classification," in ICASSP, 2014, pp. 1235–1239.

[2] Zhengyou Zhang, "Microsoft Kinect sensor and its effect," IEEE Multimedia, vol. 19, no. 2, pp. 4–10, 2012.

[3] Mengyuan Liu, Hong Liu, and Chen Chen, "3D action recognition using multi-scale energy-based global ternary image," IEEE Trans. Circuits Syst. Video Technol., doi: 10.1109/TCSVT.2017.2655521, 2017.

[4] Mengyuan Liu and Hong Liu, "Depth Context: A new descriptor for human activity recognition by using sole depth sequences," Neurocomputing, vol. 175, pp. 747–758, 2015.

[5] Jamie Shotton, Toby Sharp, Alex Kipman, Andrew Fitzgibbon, Mark Finocchio, Andrew Blake, Mat Cook, and Richard Moore, "Real-time human pose recognition in parts from single depth images," Commun. ACM, vol. 56, no. 1, pp. 116–124, 2013.

[6] Xiaodong Yang and YingLi Tian, "Effective 3D action recognition using EigenJoints," Journal of Visual Communication and Image Representation, vol. 25, no. 1, pp. 2–11, 2014.

[7] Raviteja Vemulapalli, Felipe Arrate, and Rama Chellappa, "Human action recognition by representing 3D skeletons as points in a Lie group," in CVPR, 2014, pp. 588–595.

[8] Ferda Ofli, Rizwan Chaudhry, Gregorij Kurillo, René Vidal, and Ruzena Bajcsy, "Sequence of the most informative joints (SMIJ): A new representation for human skeletal action recognition," Journal of Visual Communication and Image Representation, vol. 25, no. 1, pp. 24–38, 2014.

[9] Yong Du, Wei Wang, and Liang Wang, "Hierarchical recurrent neural network for skeleton based action recognition," in CVPR, 2015, pp. 1110–1118.

[10] Vivek Veeriah, Naifan Zhuang, and Guo-Jun Qi, "Differential recurrent neural networks for action recognition," in ICCV, 2015, pp. 4041–4049.

[11] Amir Shahroudy, Jun Liu, Tian-Tsong Ng, and Gang Wang, "NTU RGB+D: A large scale dataset for 3D human activity analysis," in CVPR, 2016, pp. 1010–1019.

[12] Pichao Wang, Wanqing Li, Zhimin Gao, Chang Tang, Jing Zhang, and Philip Ogunbona, "ConvNets-based action recognition from depth maps through virtual cameras and pseudocoloring," in ACM MM, 2015, pp. 1119–1122.

[13] Hakan Bilen, Basura Fernando, Efstratios Gavves, Andrea Vedaldi, and Stephen Gould, "Dynamic image networks for action recognition," in CVPR, 2016, pp. 3034–3042.

[14] Karen Simonyan and Andrew Zisserman, "Two-stream convolutional networks for action recognition in videos," in NIPS, 2014, pp. 568–576.

[15] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton, "ImageNet classification with deep convolutional neural networks," in NIPS, 2012, pp. 1097–1105.

[16] Jun Liu, Amir Shahroudy, Dong Xu, and Gang Wang, "Spatio-temporal LSTM with trust gates for 3D human action recognition," in ECCV, 2016, pp. 816–833.

[17] Sijie Song, Cuiling Lan, Junliang Xing, Wenjun Zeng, and Jiaying Liu, "An end-to-end spatio-temporal attention model for human action recognition from skeleton data," in AAAI, 2016.

[18] Mengyuan Liu, Hong Liu, and Chen Chen, "3D action recognition using data visualization and convolutional neural networks," in ICME, 2017.

[19] Mengyuan Liu, Hong Liu, and Chen Chen, "Enhanced skeleton visualization for view invariant human action recognition," Pattern Recognition, doi: 10.1016/j.patcog.2017.02.030, 2017.

[20] Andrea Vedaldi and Karel Lenc, "MatConvNet: Convolutional neural networks for MATLAB," in ACM MM, 2015, pp. 689–692.