Implementation of Human Action Recognition System Using Multiple Kinect Sensors

Beom Kwon, Doyoung Kim, Junghwan Kim, Inwoong Lee, Jongyoo Kim, Heeseok Oh, Haksub Kim, and Sanghoon Lee(B)

Department of Electrical and Electronic Engineering, Yonsei University, Seoul, Korea
{hsm260,tnyffx,junghwan.kim,mayddb100,jongky,angdre5,khsphillip,slee}@yonsei.ac.kr
http://insight.yonsei.ac.kr/
Abstract. Human action recognition is an important research topic with many potential applications, such as video surveillance, human-computer interaction, and virtual reality combat training. However, much research on human action recognition has been carried out with single-camera systems, which perform poorly because they are vulnerable to partial occlusion. In this paper, we propose a human action recognition system using multiple Kinect sensors to overcome the limitations of conventional single-camera-based human action recognition systems. To test the feasibility of the proposed system, we use snapshot and temporal features extracted from three-dimensional (3D) skeleton data sequences and apply a support vector machine (SVM) to classify human actions. The experimental results demonstrate the feasibility of the proposed system.
Keywords: Human action recognition · Multiple Kinect sensors · Support vector machine

1 Introduction
Human action recognition is one of the most actively researched topics in computer vision because of its potential applications in areas including video surveillance, human-computer interaction, and virtual reality combat training. In the past few years, most studies on human action recognition have assumed that only a single camera is available for recognizing human actions [1–3]. However, it is impractical to apply the methods of [1–3] to a human action recognition system because they are vulnerable to partial occlusion due to the fixed viewpoint. In addition, the pose ambiguity problem in a two-dimensional (2D) image still remains. For this reason, the authors in [4,5] proposed multi-view human action recognition methods using depth information. The authors in [6] proposed an RGB/depth/skeleton-based multi-view human action recognition
method. However, it is still not suitable to apply these methods to a practical human action recognition system because of their high complexity.

To alleviate the complexity problem of conventional human action recognition methods, instead of using depth information we use three-dimensional (3D) skeleton data obtained from the Kinect sensor in real time. Many studies on multi-view human action recognition, including [4–6], have been carried out using the Kinect sensor because of its convenience and low price. In addition, the Kinect sensor provides real-time skeleton data of the user without any markers attached to the human body. However, the Kinect sensor captures the user's position and movements under the assumption that the user faces the sensor. Therefore, if the user does not face the Kinect sensor, it may provide inaccurate position values, and these inaccurate values may degrade the performance of multi-view skeleton integration.

In order to obtain accurate skeleton data in a multi-view environment, the authors in [7] construct a silhouette of the user from depth images obtained from four Kinect sensors and then extract a skeletal representation of the user from the silhouette. However, the high complexity of this method impedes its practical implementation. In [8], an integrated skeleton is obtained from four Kinect sensors using an averaging method. To perform the integration, the authors in [8] select, for every joint, the points whose average distance from one another is less than a given threshold, and the joint of the integrated skeleton is then computed as the average of the selected points. However, it is impractical to apply this method to a human action recognition system because of the low accuracy of the integrated skeleton. To improve accuracy, we proposed a weighted integration method in [9], which obtains a more accurate skeleton by assigning higher weights to the skeletons captured by the Kinect sensors that the user faces [9].

In this paper, we propose a multi-view human action recognition system that utilizes 3D skeleton data to recognize the user's actions. In the proposed system, six Kinect sensors are used to capture the whole human body, and the skeleton data obtained from the six Kinect sensors are merged using the multi-view integration method of [9]. In addition, snapshot features and temporal features are extracted from the integrated 3D skeleton data and then used as the input to the classifier. To classify the human actions, a support vector machine (SVM) is employed. The experimental results show that the proposed system achieves high accuracy, so it can be stated that the proposed system is feasible for recognizing human actions in practice.

The remainder of this paper is organized as follows. In Sect. 2, we present the proposed human action recognition system. In Sect. 3, the experimental results are demonstrated. Finally, we conclude this paper in Sect. 4.
2 Proposed Human Action Recognition System
In this section, the proposed human action recognition system is explained. Figure 1 shows the block diagram of the proposed human action recognition system. In the proposed system, the 3D skeleton model provided by the Kinect sensor is used for the recognition of human actions. As shown in Fig. 2, this model is composed of a set of 20 joints and includes the spatial coordinates of each joint.

Fig. 1. Block diagram of the proposed human action recognition system.

Fig. 2. 3D skeleton model derived from Kinect sensor.

2.1 Multi-view Skeleton Integration
Figure 3(a) gives a schematic illustration of the omnidirectional six Kinects system. In order to capture the whole human body, six Kinect sensors are arranged in a ring with a radius of 3 m at 60° intervals, as shown in Fig. 3(b). In addition, each Kinect sensor is placed at a height of 1 m above the ground using a tripod.

Skeleton data captured by a Kinect sensor are expressed in the coordinate system of that sensor. Therefore, in order to integrate the skeleton data obtained from the six Kinect sensors, the camera matrix of each Kinect sensor must first be obtained through calibration, and the skeleton data from the six sensors must be transformed into a common coordinate system. To obtain the camera matrix of each Kinect sensor, the calibration method of [10] is employed.
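The paper calibrates each sensor with the method of [10]; how the per-sensor rotation and translation used in the transformation of Eq. (1) below are recovered is not detailed here. As a hedged illustration only, the sketch below shows one common way to obtain them with OpenCV by letting a shared checkerboard define the common coordinate system; the function name, board size, square size, and this particular procedure are assumptions, not taken from the paper.

```python
import cv2
import numpy as np

def estimate_extrinsics(gray, camera_matrix, dist_coeffs,
                        board_size=(9, 6), square_size=0.025):
    """Estimate the rotation R and translation T of Eq. (1) for one Kinect
    sensor from a single view of a checkerboard that defines the common frame.

    board_size / square_size are illustrative values, not from the paper.
    """
    # 3D corner positions in the board's (i.e. the common) coordinate system
    objp = np.zeros((board_size[0] * board_size[1], 3), np.float32)
    objp[:, :2] = np.mgrid[0:board_size[0], 0:board_size[1]].T.reshape(-1, 2)
    objp *= square_size

    found, corners = cv2.findChessboardCorners(gray, board_size)
    if not found:
        raise RuntimeError("checkerboard not visible from this sensor")

    # solvePnP yields the board-to-camera pose: X_cam = R_bc * X_board + t_bc
    _, rvec, tvec = cv2.solvePnP(objp, corners, camera_matrix, dist_coeffs)
    R_bc, _ = cv2.Rodrigues(rvec)
    t_bc = tvec.reshape(3)

    # Invert to get the camera-to-common mapping used in Eq. (1):
    # X_common = R * X_cam + T
    R = R_bc.T
    T = -R_bc.T @ t_bc
    return R, T
```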
Fig. 3. Schematic illustration (a) and top view (b) of the omnidirectional six Kinects system.
Let (Xi, Yi, Zi) be a coordinate in the coordinate system of the i-th Kinect sensor and (Xc, Yc, Zc) be the corresponding coordinate in the common coordinate system. A coordinate in each Kinect sensor's system is transformed into the common coordinate system as follows:

$$
\begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix} = R \begin{bmatrix} X_i \\ Y_i \\ Z_i \end{bmatrix} + T, \qquad (1)
$$

where R is the rotation matrix and T is the translation vector. The Kinect sensor captures the user's position and movements under the assumption that the user faces the sensor, so it may provide inaccurate position values when the user does not face it. In the omnidirectional six Kinects system, the user cannot face all of the Kinect sensors at once, so these inaccurate position values may degrade the performance of multi-view skeleton integration. To improve the integration performance, the weighted integration method of [9] is employed.
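To make the integration step concrete, the following minimal sketch applies Eq. (1) to all 20 joints of a skeleton and then fuses the six transformed skeletons with a per-sensor weighted average in the spirit of [9]. The array shapes and the example weights are illustrative assumptions; the actual weight assignment is defined in [9].

```python
import numpy as np

def to_common_frame(joints, R, T):
    """Apply Eq. (1) to a (20, 3) array of joint positions expressed in one
    Kinect sensor's coordinate system."""
    return joints @ R.T + T

def integrate_skeletons(skeletons, weights):
    """Weighted fusion of per-sensor skeletons already in the common frame.

    skeletons : list of six (20, 3) arrays, one per Kinect sensor.
    weights   : list of six non-negative scalars; larger values for sensors
                the user is facing (the exact weighting rule follows [9]).
    """
    w = np.asarray(weights, dtype=np.float64)
    w = w / w.sum()                          # normalize weights
    stacked = np.stack(skeletons, axis=0)    # (6, 20, 3)
    return np.tensordot(w, stacked, axes=1)  # weighted average, (20, 3)

# Usage sketch (R_i, T_i from calibration, raw_i captured by the i-th Kinect):
# common = [to_common_frame(raw_i, R_i, T_i) for raw_i, (R_i, T_i) in data]
# skeleton = integrate_skeletons(common, weights)
```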
2.2 Snapshot Feature Extraction
In the snapshot feature extraction step, various features (joint velocities, angles, and angular velocities) are extracted from the integrated skeleton data. The features are calculated as follows:

– Joint Velocities: the joint velocity is computed from two consecutive records of joint positions and frames. The velocity of joint j at frame n can be calculated as follows:

$$
V_j(n) = \frac{P_j(n) - P_j(n-1)}{\Delta n}, \qquad (2)
$$

where Pj(n) = [xj(n) yj(n) zj(n)]^T is the 3D coordinate vector expressing the position of joint j at frame n, the superscript [·]^T indicates the transpose of a vector, and Δn is the time interval between frame n and frame (n − 1). Since the frame rate of the Kinect sensor is set to 30 fps, Δn is 1/30 s ≈ 33.33 ms [11].
Fig. 4. Description of the angles derived from skeleton model.
– Angles: the angle is computed from the positions of three joints using the method in [12] (see Chap. 5). The angles derived from the skeleton model are described in Fig. 4, where Ak denotes the value of angle k.
– Angular Velocities: the angular velocity is computed from two consecutive records of angles and frames. The velocity of angle k at frame n can be calculated as follows:

$$
W_k(n) = \frac{A_k(n) - A_k(n-1)}{\Delta n}. \qquad (3)
$$
Through the snapshot feature extraction step, the feature vector for each frame contains 94 float values (3 × 20 joint velocities, 17 angles, and 17 angular velocities).
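The snapshot features can be computed directly from two consecutive integrated skeletons. The sketch below follows Eqs. (2) and (3); the 17 angle definitions of Fig. 4 are represented here by placeholder joint-index triples, and the angle is computed as the arccosine of a normalized dot product, which may differ in detail from the method of [12].

```python
import numpy as np

DT = 1.0 / 30.0   # Kinect frame interval, about 33.33 ms

def joint_velocities(P_curr, P_prev, dt=DT):
    """Eq. (2): (20, 3) joint velocities from two consecutive skeletons."""
    return (P_curr - P_prev) / dt

def joint_angle(a, b, c):
    """Angle at joint b formed by joints a-b-c, in radians.
    Simple dot-product formulation; the paper follows [12]."""
    u, v = a - b, c - b
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8)
    return np.arccos(np.clip(cos, -1.0, 1.0))

def snapshot_features(P_curr, P_prev, A_prev, angle_triples, dt=DT):
    """94-dimensional snapshot feature vector for one frame:
    60 joint-velocity components + 17 angles + 17 angular velocities."""
    V = joint_velocities(P_curr, P_prev, dt).reshape(-1)          # 60 values
    A = np.array([joint_angle(P_curr[i], P_curr[j], P_curr[k])
                  for i, j, k in angle_triples])                  # 17 angles
    W = (A - A_prev) / dt                                         # Eq. (3)
    return np.concatenate([V, A, W]), A   # features, plus angles for next frame

# angle_triples would list the 17 (joint, vertex, joint) index triples of Fig. 4;
# the exact triples depend on the skeleton model and are not reproduced here.
```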
2.3 Temporal Feature Extraction
In this step, in order to capture the temporal characteristics of human action, a buffer is used to store the snapshot features over L frames. In addition, by using the stored snapshot features, we calculate the following temporal features:

– Average of Joint Velocities: the average velocity of joint j at frame n can be calculated as follows:

$$
\bar{V}_j(n) = \frac{1}{L} \sum_{l=n-(L-1)}^{n} V_j(l). \qquad (4)
$$

– Average of Angles: the average of angle k at frame n can be calculated as follows:

$$
\bar{A}_k(n) = \frac{1}{L} \sum_{l=n-(L-1)}^{n} A_k(l). \qquad (5)
$$

– Average of Angular Velocities: the average velocity of angle k at frame n can be calculated as follows:

$$
\bar{W}_k(n) = \frac{1}{L} \sum_{l=n-(L-1)}^{n} W_k(l). \qquad (6)
$$

Through the temporal feature extraction step, 94 float values (3 × 20 averages of joint velocities, 17 averages of angles, and 17 averages of angular velocities) for each frame are added to the feature vector. Then, the input to the classifier is a vector of 188 float values, including the 94 snapshot features.
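The temporal features are simply running averages of the snapshot features over the last L frames, so a fixed-length buffer suffices. A minimal sketch is given below; the buffer length L is a system parameter whose value is not specified here, and the class name is illustrative.

```python
from collections import deque
import numpy as np

class TemporalFeatureBuffer:
    """Keeps the last L snapshot feature vectors and returns their average
    (Eqs. (4)-(6)), which is appended to the per-frame feature vector."""

    def __init__(self, L):
        self.buffer = deque(maxlen=L)

    def update(self, snapshot):
        """snapshot: 94-dim vector; returns the 94 temporal (averaged) features."""
        self.buffer.append(np.asarray(snapshot, dtype=np.float64))
        return np.mean(self.buffer, axis=0)

# Per frame, the classifier input would be:
# feature = np.concatenate([snapshot, buf.update(snapshot)])   # 188 values
```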
2.4 Classification
In this step, an SVM with a radial basis function (RBF) kernel is used to classify human actions. The SVM constructs a maximal-margin hyperplane in a high-dimensional feature space by mapping the original features through a kernel function [13], and then classifies the features using this hyperplane. In the next section, we evaluate the performance of the proposed system using the SVM. In the experiment, we employ the multi-class SVM implemented in the OpenCV library [14].
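The paper uses the multi-class SVM of OpenCV 2.4.9 (the C++ CvSVM interface) [14]. As a rough sketch only, the equivalent setup through the newer cv2.ml Python bindings is shown below; the C and gamma values are arbitrary placeholders that would need tuning (e.g., by cross-validation).

```python
import cv2
import numpy as np

def train_action_svm(features, labels, C=1.0, gamma=0.1):
    """Train a multi-class C-SVC with an RBF kernel on 188-dim feature vectors.
    C and gamma are illustrative values, not taken from the paper."""
    svm = cv2.ml.SVM_create()
    svm.setType(cv2.ml.SVM_C_SVC)
    svm.setKernel(cv2.ml.SVM_RBF)
    svm.setC(C)
    svm.setGamma(gamma)
    svm.train(np.float32(features), cv2.ml.ROW_SAMPLE, np.int32(labels))
    return svm

def predict_action(svm, feature):
    """Predict the action label for one 188-dim feature vector."""
    _, result = svm.predict(np.float32(feature).reshape(1, -1))
    return int(result[0, 0])
```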
3 Experiment and Results
In the experiment, we test the feasibility of the proposed human action recognition system. Figure 5 shows our experiment environment. To test the feasibility of the proposed system, we recorded a database containing 16 types of human actions in four different scenarios. In scenario 1 (2), the user walks clockwise (counter-clockwise) around the semicircle. In scenario 3 (4), the user walks in a crouching posture clockwise (counter-clockwise) around the semicircle. Figure 6 shows the path used in each scenario, and Fig. 7 shows the types of human actions contained in our database. The database contains 8354 frames (1816 frames of scenario 1, 1827 frames of scenario 2, 2310 frames of scenario 3, and 2401 frames of scenario 4) for each Kinect sensor.

Figure 8 shows the results of our experiments for scenarios 1 and 2. The average accuracy in scenario 1 is 87.75 %, with accuracy varying between 80 % and 91 %. The average accuracy in scenario 2 is 89 %, with accuracy varying between 80 % and 98 %. Figure 9 shows the results of our experiments for scenarios 3 and 4. The average accuracy in scenario 3 is 87 %, with accuracy varying between 84 % and 93 %. The average accuracy in scenario 4 is 90.75 %, with accuracy varying between 81 % and 96 %. As shown in Figs. 8 and 9, the proposed human action recognition system achieves an average accuracy of 88.625 %, so it can be stated that the proposed system is feasible for recognizing human actions.
Fig. 5. A partial view of our experiment environment.
Fig. 6. The path used in the experiment. (a) Clockwise and (b) counter-clockwise.
Fig. 7. Description of the action labels and names.
Fig. 8. Confusion matrix for walking. (a) Clockwise and (b) counter-clockwise.
Fig. 9. Confusion matrix for walking in a crouching posture. (a) Clockwise and (b) counter-clockwise.
4 Conclusion
In this paper, we proposed a human action recognition system using multiple Kinect sensors. To integrate the multi-view skeleton data, a weighted integration method is used. For recognizing human action, we use joint velocities, angles
and angular velocities as the snapshot features. In addition, in order to capture the temporal characteristics of human action, we use the averages of the joint velocities, angles, and angular velocities as the temporal features. We apply an SVM to classify human actions. The experimental results demonstrate the feasibility of the proposed system.

Acknowledgments. This work was supported by the ICT R&D program of MSIP/IITP [R0101-15-0168, Development of ODM-interactive Software Technology supporting Live-Virtual Soldier Exercises].
References

1. Lv, F., Nevatia, R.: Single view human action recognition using key pose matching and Viterbi path searching. In: Computer Vision and Pattern Recognition. IEEE (2007)
2. Liu, H., Li, L.: Human action recognition using maximum temporal inter-class dissimilarity. In: Proceedings of the Second International Conference on Communications, Signal Processing, and Systems, pp. 961–969. Springer International Publishing (2014)
3. Papadopoulos, G.T., Axenopoulos, A., Daras, P.: Real-time skeleton-tracking-based human action recognition using Kinect data. In: Gurrin, C., Hopfgartner, F., Hurst, W., Johansen, H., Lee, H., O'Connor, N. (eds.) MMM 2014, Part I. LNCS, vol. 8325, pp. 473–483. Springer, Heidelberg (2014)
4. Cheng, Z., Qin, L., Ye, Y., Huang, Q., Tian, Q.: Human daily action analysis with multi-view and color-depth data. In: Fusiello, A., Murino, V., Cucchiara, R. (eds.) ECCV 2012 Ws/Demos, Part II. LNCS, vol. 7584, pp. 52–61. Springer, Heidelberg (2012)
5. Ni, B., Wang, G., Moulin, P.: RGBD-HuDaAct: a color-depth video database for human daily activity recognition. In: Fossati, A., Gall, J., Grabner, H., Ren, X., Konolige, K. (eds.) Consumer Depth Cameras for Computer Vision, pp. 193–208. Springer, London (2013)
6. Liu, A.A., Xu, N., Su, Y.T., Lin, H., Hao, T., Yang, Z.X.: Single/multi-view human action recognition via regularized multi-task learning. Neurocomputing 151, 544–553 (2015)
7. Berger, K., Ruhl, K., Schroeder, Y., Bruemmer, C., Scholz, A., Magnor, M.A.: Markerless motion capture using multiple color-depth sensors. In: Vision, Modeling, and Visualization, pp. 317–324 (2011)
8. Haller, E., Scarlat, G., Mocanu, I., Trăscău, M.: Human activity recognition based on multiple Kinects. In: Botía, J.A., Álvarez-García, J.A., Fujinami, K., Barsocchi, P., Riedel, T. (eds.) EvAAL 2013. CCIS, vol. 386, pp. 48–59. Springer, Heidelberg (2013)
9. Junghwan, K., Inwoong, L., Jongyoo, K., Sanghoon, L.: Implementation of an omnidirectional human motion capture system using multiple Kinect sensors. In: Computer Science and Engineering Conference, Transactions on Fundamentals of Electronics, Communications and Computer Sciences, IEICE (2015, submitted)
10. Zhang, Z.: A flexible new technique for camera calibration. IEEE Trans. Pattern Anal. Mach. Intell. 22(11), 1330–1334 (2000)
11. Parisi, G.I., Weber, C., Wermter, S.: Human action recognition with hierarchical growing neural gas learning. In: Wermter, S., Weber, C., Duch, W., Honkela, T., Koprinkova-Hristova, P., Magg, S., Palm, G., Villa, A.E.P. (eds.) ICANN 2014. LNCS, vol. 8681, pp. 89–96. Springer, Heidelberg (2014)
12. Caillette, F., Howard, T.: Real-time Markerless 3-D Human Body Tracking. University of Manchester (2006)
13. Castellani, U., Perina, A., Murino, V., Bellani, M., Rambaldelli, G., Tansella, M., Brambilla, P.: Brain morphometry by probabilistic latent semantic analysis. In: Jiang, T., Navab, N., Pluim, J.P.W., Viergever, M.A. (eds.) MICCAI 2010, Part II. LNCS, vol. 6362, pp. 177–184. Springer, Heidelberg (2010)
14. Support Vector Machines – OpenCV 2.4.9.0 documentation. http://docs.opencv.org/2.4.9/modules/ml/doc/support_vector_machines.html