Two Person Interaction Detection Using Kinect Sensor

Sriparna Saha, Amit Konar and Ramadoss Janarthanan

Abstract This proposed work presents a novel two-person interaction modelling system using the Kinect sensor. A pentagon is formed for each person from the three-dimensional co-ordinate information provided by Microsoft's Kinect sensor. The five Euclidean distances between corresponding vertices of the two pentagons are taken as features for each frame, so the body gestures of the two persons are analysed through their pentagons. On this basis, eight interactions between two persons are modelled. Using a multi-class support vector machine, the system produces its best recognition rate (greater than 90 %) in the rotation invariance case, while for the rotation variance case the recognition rate is greater than 80 %.

Keywords Body gesture · Euclidean distance · Kinect sensor · Pentagon

1 Introduction

Full-body tracking is an emerging field of human–computer interaction. Human body gesture recognition is highly important for surveillance, cybernetics, information gathering from video, gaming and many other areas. In today's era, human body gestures [1, 2] are modelled using the skeleton structure of persons, and human motion [3, 4] is analysed using different sensors. Rapid development in the field of skeleton detection has been possible due to Microsoft's Kinect sensor [5–7].


Complex body gestures, such as pushing and hugging, are successfully tracked using the Kinect sensor, which comprises an RGB camera [8, 9] and a 3D depth sensor [10–12]. In [13], two-person interactions are recognised using the Kinect sensor; the device is attractive because it is low-cost yet highly effective. The problem with [13], however, is that the algorithm stated there does not incorporate view-invariant interactions: all the two-person interactions modelled there are parallel to the Kinect sensor. In our work, each interaction is captured at three different angles (0°, 45° and 135°).

In the proposed work, eight two-person interactions are taken into account, namely approaching, departing, kicking, punching, hugging, shaking hands, exchanging and pushing. We create a pentagon for each person in each frame of a sequence; the aim of the algorithm is to recognise the body gesture in each monocular frame [14–16]. Because interactions differ across individuals with age, sex and physical build, the gestures for a specific sequence vary greatly from person to person, which introduces fuzziness into the input.

As the Kinect sensor models the human body using 20 body-joint co-ordinates in three dimensions, the pentagon vertices produced by the algorithm also carry three-dimensional information. For a single person, the vertices are configured from the average positions of the head, right-hand, left-hand, right-leg and left-leg joints. As we are modelling interactions between two persons, two distinct pentagons are configured for each frame, and the Euclidean distances between corresponding vertices of the two pentagons are taken as features. For each interaction, a 6 s stream of skeleton co-ordinates is recorded, and five pairs of subjects acted to prepare the dataset.

For recognition, a multi-class support vector machine (SVM) [17, 18] is utilised, and a comparative study with the multi-layered perceptron [19, 20] and k-nearest neighbours (k-NN) [21, 22] classifiers is performed. Experimentally, SVM is found to give the best recognition rate in both the rotation invariance and rotation variance cases. The total time required for two-person interaction recognition is always less than 3 s using MATLAB R2011b.

In this paper, Sect. 2 overviews the fundamental ideas behind the Kinect sensor, SVM, the perceptron and k-NN. Section 3 explains the proposed algorithm, while Sects. 4 and 5 illustrate the experimental results and performance analysis. Section 6 concludes with ideas for future work.

2 Fundamental Ideas The subsections below explain the Kinect sensor and multi-class support vector machine algorithm briefly.


Fig. 1 Kinect sensor

2.1 Kinect Sensor

The Kinect sensor, consisting of an RGB (red, green, blue) camera, an infrared (IR) projector, an IR camera and a microphone array, is capable of full human-body tracking of up to two persons at a time [5, 6]. It looks like a webcam, as displayed in Fig. 1. It builds a 3D representation of an object using a depth sensor [10, 11] based on an infrared laser [12]. The Kinect sensor produces RGB video using an 8-bit VGA-resolution camera [7] and tracks the human body through 20 body-joint co-ordinates within a finite range of 1.2–3.5 m [8, 9]. Voice gestures can also be recorded with the help of the microphone array, and a light-emitting diode (LED) on the front of the sensor indicates that the Kinect is running properly.

The Kinect sensor brings forth 3D information about the human body, so the 3D co-ordinates of the 20 body joints are received. Figure 2a demonstrates a human standing at a 0° angle to the Kinect sensor, while Fig. 2b, c shows the scenarios where the human stands at 45° and 135° to the sensor. In our proposed work, we take into account interactions between two persons at a 0° angle to the Kinect sensor, as well as actions taking place at the two other angles. When the two-person interaction takes place parallel to the Kinect sensor (i.e., at a 0° angle), the case is treated as rotation invariance; when the angle changes, the case becomes rotation variance. In Fig. 2, the pentagon created for the proposed work is marked with red dotted lines and black star vertices. The formation of the pentagon is explained in Sect. 3.1.
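As a rough illustration of how such 20-joint skeleton streams can be acquired in MATLAB, the sketch below uses the Image Acquisition Toolbox's Kinect for Windows support. This support package appeared in releases newer than the R2011b used in this paper, and the paper does not state how the authors captured their data, so this is an assumption-laden sketch rather than the authors' pipeline.

```matlab
% Hypothetical skeleton acquisition sketch (Image Acquisition Toolbox with
% the Kinect for Windows support package; not necessarily the authors' setup).
depthVid = videoinput('kinect', 2, 'Depth_640x480'); % device 2 = depth stream
src = getselectedsource(depthVid);
src.TrackingMode = 'Skeleton';            % enable 20-joint skeleton tracking
start(depthVid);
[~, ~, metaData] = getdata(depthVid, 1);  % one depth frame plus metadata
tracked = find(metaData.IsSkeletonTracked);             % up to two tracked bodies
joints = metaData.JointWorldCoordinates(:, :, tracked); % 20-by-3 per skeleton
stop(depthVid); delete(depthVid);
```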

2.2 Multi-class Support Vector Machine

The support vector machine (SVM) [17, 18], also known as support vector networks, is a supervised learning algorithm based on the concept of dividing the set of inputs by a hyperplane. It is widely applied to classification, regression analysis and pattern recognition problems; here we use it for the classification of multi-class data points. The simplest form of SVM is the linear SVM, which works on the principle of separating two classes of data by constructing a hyperplane.


Fig. 2 Human skeleton in three different angles with Kinect sensor a 0◦ , b 45◦ , c 135◦

These classes are specified by support vectors within the training data points. The distance margin between the support vectors is taken into account, and the aim is to maximise this margin. A linear SVM can be used successfully only where the data are linearly separable; this limitation is overcome by mapping the data into a higher-dimensional space using a kernel function, making them linearly separable. A frequently used kernel function is the radial basis function (RBF) kernel.
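The paper does not include the authors' implementation; as a minimal sketch, a multi-class SVM with an RBF kernel can be trained on the five-distance features using MATLAB's fitcecoc (Statistics and Machine Learning Toolbox, available in releases newer than the R2011b used here). All variable names below are placeholders.

```matlab
% Minimal multi-class SVM sketch. X is an N-by-5 matrix of the five
% Euclidean-distance features, y an N-by-1 vector of interaction labels
% (1..8); Xtest/ytest are a held-out split.
t = templateSVM('KernelFunction', 'rbf', 'Standardize', true);
model = fitcecoc(X, y, 'Learners', t);        % one-vs-one multi-class SVM
pred = predict(model, Xtest);                 % classify held-out samples
recognitionRate = 100 * mean(pred == ytest)   % recognition rate in percent
```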

3 Proposed Algorithm

In the proposed algorithm, each person is modelled using a pentagon, and the skeleton stream produced for the whole interaction sequence is broken into frames. The body gestures are thus extracted for each monocular frame [14–16].

3.1 Pentagon Formation

The five vertices of the pentagon are formed using (1)–(5). The Kinect sensor is capable of modelling the human body in motion with 20 3D joint co-ordinates. Here we require information from 18 body joints, i.e. the spine and hip-centre joints are neglected in this proposed work. The first vertex is formed by averaging the head (H) and shoulder centre (SC) joints.


The second vertex is created using the mean values of shoulder right (SR), elbow right (ER), wrist right (WR) and hand right (HR). In the same way, the mean values of shoulder left (SL), elbow left (EL), wrist left (WL) and hand left (HL) produce vertex 3. Vertices 4 and 5 are the averages of the right- and left-leg co-ordinates: hip right (HiR), knee right (KR), ankle right (AR) and foot right (FR) are taken into account for vertex 4, and similarly hip left (HiL), knee left (KL), ankle left (AL) and foot left (FL) yield vertex 5. (The hip joints are abbreviated HiR and HiL to distinguish them from the hand joints HR and HL.)

\[
\text{vertex}_1 = \frac{H + SC}{2} \quad (1)
\]

\[
\text{vertex}_2 = \frac{SR + ER + WR + HR}{4} \quad (2)
\]

\[
\text{vertex}_3 = \frac{SL + EL + WL + HL}{4} \quad (3)
\]

\[
\text{vertex}_4 = \frac{HiR + KR + AR + FR}{4} \quad (4)
\]

\[
\text{vertex}_5 = \frac{HiL + KL + AL + FL}{4} \quad (5)
\]
As the Kinect sensor provides 3D information for each skeleton joint, each vertex obtained using the above equations is also three-dimensional. In Fig. 2, the calculated pentagon vertices are shown as black stars, and the red dotted lines depict the edges of the pentagons.
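To make the construction concrete, the following MATLAB sketch computes the five pentagon vertices from one skeleton frame. The joint row indices follow the Microsoft Kinect SDK's standard skeleton ordering, which is an assumption; adapt them to however the 20 joints are stored in a given recording.

```matlab
% Sketch of pentagon formation from one Kinect skeleton frame (save as
% pentagon.m). J is a 20-by-3 matrix of 3D joint co-ordinates; row indices
% below assume the Kinect SDK skeleton ordering.
function P = pentagon(J)
    HEAD = 4;  SHOULDER_CENTER = 3;
    right_arm = [9 10 11 12];    % shoulder, elbow, wrist, hand (right)
    left_arm  = [5 6 7 8];       % shoulder, elbow, wrist, hand (left)
    right_leg = [17 18 19 20];   % hip, knee, ankle, foot (right)
    left_leg  = [13 14 15 16];   % hip, knee, ankle, foot (left)

    P = zeros(5, 3);                                    % one 3D point per vertex
    P(1, :) = (J(HEAD, :) + J(SHOULDER_CENTER, :)) / 2; % Eq. (1)
    P(2, :) = mean(J(right_arm, :), 1);                 % Eq. (2)
    P(3, :) = mean(J(left_arm,  :), 1);                 % Eq. (3)
    P(4, :) = mean(J(right_leg, :), 1);                 % Eq. (4)
    P(5, :) = mean(J(left_leg,  :), 1);                 % Eq. (5)
end
```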

3.2 Calculation of Five Euclidean Distances

For each person, a pentagon is formed at the ith frame. Let the left and right persons be represented by L and R, respectively. Then the Euclidean distance (ED) between corresponding vertices j (where j ranges from 1 to 5) is calculated by (6).

\[
ED_{i,j} = \left\| L_{i,j} - R_{i,j} \right\| \quad (6)
\]
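A minimal sketch of this feature computation, assuming PL and PR are the 5-by-3 vertex matrices produced by the pentagon() sketch above:

```matlab
% Five Euclidean-distance features for one frame, Eq. (6) with j = 1..5.
ED = sqrt(sum((PL - PR).^2, 2));   % 5-by-1 vector of vertex-to-vertex distances
```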


4 Experimental Results

All videos are of 6 s duration. As the Kinect sensor captures video at 30 frames/s, a total of 180 frames are processed for each interaction between two persons when the interaction is performed at a 0° angle to the camera. We have also carried out the experiment for the rotation variance cases, i.e. at 45° and 135° angles to the Kinect sensor, so for each interaction we have three different skeleton sequences. Five different pairs of subjects participated in this work, and the starting and ending positions are neutral.

Figures 3 and 4 demonstrate the eight interactions for frame nos. 50, 100 and 150. The twenty body joints of each person are marked with red stars. The skeletons of the persons are drawn with blue lines, green dotted lines describe the pentagon formed for each person, and black stars mark the pentagon vertices. The red lines depict the Euclidean distances between the vertices of the two pentagons, which are the essence of this paper. In all the interactions shown in Figs. 3 and 4, the right person is acting and the left person is reacting to the situation.

The total dataset is split in a 4:1 ratio for testing and training purposes, respectively. Table 1 presents the experimental values of the five Euclidean distances for frame nos. 50, 100 and 150. The system achieves recognition rates of 93.7, 81.3 and 90.4 % using the multi-class support vector machine, the multi-layered perceptron and the k-nearest neighbours algorithm, respectively, when the actions are performed at a 0° angle to the Kinect sensor. When the angle of interaction varies, i.e. at 45° or 135°, the average performance degrades to 81.3, 69.7 and 80.4 % for SVM, perceptron and k-NN, respectively.
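As a sketch of how the per-frame feature matrix for one camera angle could be assembled, the snippet below combines the two previous sketches. The skeletons cell array and the per-frame labelling are illustrative assumptions; the paper does not state its storage format or whether classifier samples are formed per frame or per sequence.

```matlab
% Hypothetical dataset assembly for one camera angle. skeletons{p, a, f}
% is assumed to hold a frame's two 20-by-3 joint matrices {JL, JR} for
% pair p performing interaction a.
nFrames = 180;  nInteractions = 8;  nPairs = 5;
X = zeros(nPairs * nInteractions * nFrames, 5);   % five distances per frame
y = zeros(size(X, 1), 1);
row = 0;
for p = 1:nPairs
    for a = 1:nInteractions
        for f = 1:nFrames
            pair = skeletons{p, a, f};            % {JL, JR}
            PL = pentagon(pair{1});               % pentagon() from Sect. 3.1
            PR = pentagon(pair{2});
            row = row + 1;
            X(row, :) = sqrt(sum((PL - PR).^2, 2)).';   % Eq. (6)
            y(row) = a;                           % interaction label 1..8
        end
    end
end
```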

Fig. 3 Two person interaction modelling for approaching, departing, kicking and punching interactions


Fig. 4 Two person interaction modelling for hugging, shaking hands, exchanging and pushing interactions

Table 1 The five Euclidean distances for frame nos. 50, 100 and 150

Interaction      Frame no 50                          Frame no 100                         Frame no 150
Approaching      1.8254 1.7639 1.8385 1.8536 1.8701   1.5966 1.5704 1.6382 1.7058 1.5064   0.4663 0.5595 0.4822 0.4500 0.4758
Departing        0.5025 0.4567 0.5678 0.5227 0.5550   1.2556 1.1610 1.2340 1.1691 1.0693   1.8221 1.7328 1.9395 1.9082 1.7377
Kicking          1.1443 1.0948 1.2887 0.9518 1.2529   1.0961 1.1099 1.3010 0.7541 1.1838   1.2494 1.0866 1.4201 0.7646 1.2355
Punching         1.0687 1.2488 0.8550 1.0523 0.9950   0.8364 0.5797 0.9160 0.9285 0.8306   0.8520 0.6177 0.8544 0.8736 0.8059
Hugging          0.6978 0.6296 0.6832 0.7329 0.8629   0.5584 0.4213 0.5985 0.6992 0.6730   0.9491 0.4275 0.7425 0.7994 0.9934
Shaking hands    0.4456 0.2101 0.5654 0.4409 0.5648   0.2841 0.1434 0.6302 0.3566 0.4889   0.5084 0.1890 0.6283 0.3732 0.6152
Exchanging       1.1194 0.8327 1.1592 1.1008 1.1855   0.8227 0.4646 0.5806 0.8967 1.0744   0.8275 0.6171 0.7957 0.8966 1.0728
Pushing          1.2713 1.2022 1.3092 1.2965 1.3638   1.2740 0.6237 0.8901 1.3449 1.3964   1.2260 1.0299 1.3138 1.2863 1.3465

The recognition rate comparison is depicted in Fig. 5, where the darker bars correspond to the rotation invariance case and the lighter bars to the rotation variance case. The average computational times for SVM, perceptron and k-NN are 2.573, 2.820 and 2.934 s, respectively, on an Intel Pentium Dual Core processor running MATLAB R2011b, for both the rotation invariance and rotation variance cases.


Fig. 5 Accuracy comparison between SVM, perceptron and k-NN for rotation variance and invariance cases

In [13], two-person interactions are modelled using linear SVMs and multiple-instance learning (MILBoost), achieving maximum recognition rates of 87.6 and 91.1 %, respectively, but rotation variance is not taken into account there. The proposed work is not only able to produce higher recognition rates than [13] for the rotation invariance case, but is also capable of managing good results, i.e. greater than 80 % in the majority of cases, for the rotation variance cases, for which we used 45° and 135° angles. Thus the limitation of [13] is overcome with high efficiency and low timing complexity.

5 Performance Analysis

McNemar's test [23] is used to compare two algorithms. Here, we take the multi-class SVM as the reference algorithm (A) and compare it with the multi-layered perceptron and the k-nearest neighbours classifier (B), one at a time, using (7), where n_01 is the number of samples mapped to a wrong cluster by algorithm A but not by B, and n_10 is the number of samples mapped to a wrong cluster by B but not by A. The critical value of Z for a 95 % confidence level with one degree of freedom is 3.84. According to Table 2, the null hypothesis is rejected in both cases; hence, the algorithms are not equivalent. This validates our results.

\[
Z = \frac{(|n_{01} - n_{10}| - 1)^2}{n_{01} + n_{10}} \quad (7)
\]
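As a small illustration, the statistic of Eq. (7) can be computed directly from two classifiers' predictions; predA, predB and ytrue are placeholder names for the predicted and ground-truth label vectors.

```matlab
% Sketch of the McNemar statistic of Eq. (7).
n01 = sum(predA ~= ytrue & predB == ytrue);   % wrong by A only
n10 = sum(predB ~= ytrue & predA == ytrue);   % wrong by B only
Z = (abs(n01 - n10) - 1)^2 / (n01 + n10);     % compare against 3.84
```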

Table 2 Results of statistical test

Classifier used             n01    n10    Z          Comment
Multi-layered perceptron    17     63     25.3125    Reject
k-nearest neighbours        26     79     25.7524    Reject

6 Conclusion and Future Work

The proposed work recognises eight interactions between two persons. As the same gesture depicting a particular interaction varies widely across persons, the input is fuzzy in nature; hence, a multi-class support vector machine is employed. To the best of our knowledge, earlier papers do not address interaction modelling under rotation variance. Moreover, since the Kinect sensor is used, differences in weight, height and body type across persons do not hamper the results. We have achieved an accuracy of more than 80 % in all cases, so this proposed work can easily find its place in surveillance applications. In future, we will concentrate on more difficult interactions between two persons.

Acknowledgments We would like to thank the University Grants Commission, India, and the University with Potential for Excellence Program (Phase II) in Cognitive Science, Jadavpur University.

References

1. Aggarwal, J.K., Cai, Q.: Human motion analysis: a review. In: Proceedings of the IEEE Nonrigid and Articulated Motion Workshop, pp. 90–102 (1997)
2. Yang, H.-D., Park, A.-Y., Lee, S.-W.: Gesture spotting and recognition for human-robot interaction. IEEE Trans. Robot. 23(2), 256–270 (2007)
3. Moeslund, T.B., Granum, E.: A survey of computer vision-based human motion capture. Comput. Vis. Image Underst. 81(3), 231–268 (2001)
4. Wang, L., Hu, W., Tan, T.: Recent developments in human motion analysis. Pattern Recognit. 36(3), 585–601 (2003)
5. Leyvand, T., Meekhof, C., Wei, Y.-C., Sun, J., Guo, B.: Kinect identity: technology and experience. Computer 44(4), 94–96 (2011)
6. Solaro, J.: The Kinect digital out-of-box experience. Computer 44, 97–99 (2011)
7. Clark, J.: Object digitization for everyone. Computer 44, 81–83 (2011)
8. Zhang, Z.: Microsoft Kinect sensor and its effect. IEEE Multimed. 19(2), 4–10 (2012)
9. Ryden, F.: Tech to the future: making a "Kinection" with haptic interaction. IEEE Potentials 31(3), 34–36 (2012)


10. Cho, J.-H., Lee, K.H., Aizawa, K.: Enhancement of depth maps with alpha channel estimation for 3-D video. IEEE J. Sel. Top. Sig. Process. 6(5), 483–494 (2012)
11. Tong, J., Zhou, J., Liu, L., Pan, Z., Yan, H.: Scanning 3D full human bodies using Kinects. IEEE Trans. Vis. Comput. Graph. 18(4), 643–650 (2012)
12. Herrera, C., Kannala, J.: Joint depth and color camera calibration with distortion correction. IEEE Trans. Pattern Anal. Mach. Intell. 34(10), 2058–2064 (2012)
13. Yun, K., Honorio, J., Chattopadhyay, D., Berg, T.L., Samaras, D.: Two-person interaction detection using body-pose features and multiple instance learning. In: 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 28–35 (2012)
14. Agarwal, A., Triggs, B.: Recovering 3D human pose from monocular images. IEEE Trans. Pattern Anal. Mach. Intell. 28(1), 44–58 (2006)
15. Wachter, S., Nagel, H.-H.: Tracking of persons in monocular image sequences. In: Proceedings of the IEEE Nonrigid and Articulated Motion Workshop, p. 29 (1997)
16. Snidaro, L., Foresti, G.L., Chittaro, L.: Tracking human motion from monocular sequences. Int. J. Image Graph. 8(3), 455–471 (2008)
17. Tax, D.M.J., Duin, R.P.W.: Support vector domain description. Pattern Recognit. Lett. 20(11), 1191–1199 (1999)
18. Theodoridis, S., Pikrakis, A., Koutroumbas, K., Cavouras, D.: Introduction to Pattern Recognition: A Matlab Approach. Academic Press, New York (2010)
19. Konar, A.: Computational Intelligence: Principles, Techniques and Applications. Springer, The Netherlands (2005)
20. Duin, R.P.W., Juszczak, P., Paclik, P., Pekalska, E., De Ridder, D., Tax, D.M.J., Verzakov, S.: A Matlab toolbox for pattern recognition. PRTools version 3 (2000)
21. Coomans, D., Massart, D.L.: Alternative k-nearest neighbour rules in supervised pattern recognition. Fresenius Z. Anal. Chem. 313, 15–27 (1982)
22. Cover, T., Hart, P.: Nearest neighbor pattern classification. IEEE Trans. Inf. Theory 13(1), 21–27 (1967)
23. Dietterich, T.G.: Approximate statistical tests for comparing supervised classification learning algorithms. Neural Comput. 10(7), 1895–1923 (1998)
