Segmentation and Recognition of Motion Streams by Similarity Search
CHUANJUN LI, S. Q. ZHENG and B. PRABHAKARAN
The University of Texas at Dallas
Fast and accurate recognition of motion data streams from gesture sensing and motion capture devices has many applications and is the focus of this paper. Based on the analysis of the geometric structures revealed by singular value decompositions (SVD) of motion data, a similarity measure is proposed for simultaneously segmenting motion streams and recognizing them. A direction identification approach is explored to further differentiate motions with similar data geometric structures. Experiments show that the proposed similarity measure can segment and recognize motion streams of variable lengths with high accuracy without knowing beforehand the number of motions in a stream.

Categories and Subject Descriptors: I.5.3 [Pattern Recognition]: similarity measures
General Terms: Algorithms
Additional Key Words and Phrases: Motion capture, gesture recognition, pattern analysis, segmentation, similarity measures, principal component analysis, singular value decomposition
1. INTRODUCTION
Multi-attribute data streams generated by 3D motion capture systems and gesture sensing devices such as data gloves have a wide variety of real-world applications in life sciences and animation [Vicon], including gait analysis and rehabilitation [Watelain et al. 2000; Schollhorn et al. 2002], machine translation of sign languages [Liang and Ouhyoung 1998; Yang and Shahabi 2004; Li et al. 2004], and film and video games. For instance, new motions can be reconstructed from video sequences [Pullen and Bregler 2002], from images [Taylor 2000], or from different captured motions [Ikemoto and Forsyth 2004] for computer animation.

These data streams are a relatively new form of multimedia information and have their own characteristics. A motion stream usually has dozens of attributes describing the 3D motions of different subject joints. For instance, motion capture systems continuously generate positional and joint angular values of subject joints while capturing the continuous motions of reflective markers or sensors on human subjects, as shown in Figure 1.

Authors' addresses: C. Li, S. Q. Zheng and B. Prabhakaran, Department of Computer Science, the University of Texas at Dallas, Richardson, TX 75080; email: {chuanjun,sizheng,praba}@utdallas.edu.
Permission to make digital/hard copy of all or part of this material without fee for personal or classroom use provided that the copies are not made or distributed for profit or commercial advantage, the ACM copyright/server notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists requires prior specific permission and/or a fee. © 2007 ACM 0000-0000/2007/0000-0001 $5.00
ACM Transactions on Multimedia Computing, Communications and Applications, Vol. 03, No. 03, 08 2007, Pages 1–26.
Fig. 1. Capturing human motions by surrounding cameras.
Making the best use and re-use of a motion data stream requires recognizing the motions in the stream with high accuracy. Recognizing motions in such continuous multi-attribute streams also involves segmentation, i.e., identifying the beginning and the end of each motion in a stream. To recognize and segment motion streams with high accuracy, a similarity measure is needed for comparing motion segments, and several challenges need to be addressed:
—Similar motions can have variations. Different attributes can vary by different amounts and can have different temporal shifts due to motion variations.
—Similar motions can have different lengths. Motions performed at different times or by different subjects have different durations, and motion sampling rates can also differ from one capture to another. Unlike subsequence matching as in [Faloutsos et al. 1994], motions in a stream can be longer than similar reference motions.
—Complete motions need to be compared with both incomplete motion segments and over-complete motion segments. As shown in Figure 2, complete motions are joined by brief transitions, and motion candidates in a stream can contain these transitions. Hence, differences between complete motions and motion candidates with missing or extra segments should be captured.
—Motions can follow similar trajectories in different directions, and their semantic meanings can be different. For example, sitting down from a standing pose can follow a trajectory similar to that of standing up to that pose, yet the two motions are different.
Proposed Approach: The above challenges make the Euclidean distance between two motion data matrices unsuitable for recognizing motions in a stream.
We explore the geometric structures revealed by singular value decompositions (SVD) of motion data matrices, and observe that the first few right singular vectors (see Section 3) are the dominating factors of geometric structure similarity, and that singular values reflect the sensitivities of the associated singular vectors to data variations. Based on these SVD properties, we define a similarity measure that is effective in comparing any two multi-attribute motions. The similarity measure, however, is insensitive to motion direction or data generation order. Hence, for
Fig. 2. Multi-attribute captured motions SITUP, FLEX, PUNCH, STRETCH and SOMERSAULT. Transitions between neighboring motions are marked with the dotted boxes. [Plot: coordinates (mm), −1000 to 1000, against frames, 0 to 2000.]
motions of similar geometric structures or following similar trajectories, we utilize the left singular vectors from SVDs to further verify/identify motions in different directions. We propose a simple algorithm to segment and recognize multi-attribute motion streams using the proposed SVD-based similarity measure and the motion direction identification method.

The paper extends the preliminary work [Li et al. 2004] and has the following conceptual contributions:
—A new similarity measure that considers the geometrical properties of the motion data matrices: this measure can effectively recognize not only similar motion patterns that are isolated, but also similar motions in motion streams.
—Our approach generates projection vectors to identify motion generation orders, or motion directions. To the best of our knowledge, no previous work has considered motion generation orders.
Due to the above contributions, the proposed approach can find the most similar motions for 99.7% of 330 hand gesture motions, and can find the second most similar motions for 97.5% of the 330 hand gestures. It can also find the most similar motions for all of 310 human body motions, and the second most similar motions for 99.8% of them. For motion streams, the approach obtains 94.0% recognition accuracy for hand gesture streams and 94.6% recognition accuracy for captured human motion streams.

2. RELATED WORK
Multi-attribute pattern similarity search, especially in continuous motion streams, has been studied for sign language recognition and for motion synthesis in computer animation. The recognition methods usually include template matching by distance measures and hidden Markov models (HMM).

Template matching using similarity/distance measures has been employed for multi-attribute pattern recognition. Joint angles are extracted in [Qian et al. 2004] as features to represent different static human body poses, and the Mahalanobis distance measure is used for the joint angle features.
Similarly, momentum, kinetic
energy and force are constructed in [Kahol et al. 2003; Dyaberi et al. 2004] as activity measures and predictors of gesture boundaries for various segments of the human body, and the Mahalanobis distance function of composite features is solved by dynamic programming.

Similarity measures are defined for multi-attribute data in [Krzanowski 1979; Shahabi and Yan 2003; Yang and Shahabi 2004] based on principal component analysis (PCA). Inner products or angular differences of principal components (PCs) are used to define the similarity measures, with different weighting strategies for different PCs. Equal weights are used for different combinations of PCs in [Krzanowski 1979], so that different PCs contribute equally to the similarity measure. The similarity measure in [Shahabi and Yan 2003] takes the minimum of two weighted sums of PC inner products, with each sum using its own set of weights. A global weight vector is obtained in [Yang and Shahabi 2004] by taking into account all available isolated motion patterns, and this weight vector specifies the contributions of different PC inner products to the similarity measure Eros.

This paper extends the similarity measure proposed in [Li et al. 2004], in which the first right singular vector and a normalized singular value vector are considered to be dominating for pattern recognition, and the defined similarity measure captures the Main Angular Similarity (referred to as MAS hereafter) of two motions. In contrast, the similarity measure proposed in this paper considers the first few singular vectors associated with large singular values, and the angular differences or inner products of different singular vector pairs are given different weights, which depend on the data variations along the corresponding singular vectors.
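As a rough illustration of the kind of measure described above, the following Python sketch compares two motion matrices through their first k right singular vectors and weights each inner product by normalized singular values. The function name `svd_similarity`, the default k = 3, and the averaging of the two normalized singular value vectors are assumptions made for this example only; they are not the definition given later in the paper.

```python
import numpy as np


def svd_similarity(A, B, k=3):
    """Illustrative weighted similarity of two motions (frames x attributes).

    Hypothetical sketch: the weighting scheme below is an assumption,
    not the paper's definition.
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    if A.shape[1] != B.shape[1]:
        raise ValueError("motions must have the same number of attributes")

    # Rows of Vt are right singular vectors, ordered by decreasing singular value.
    _, sa, vta = np.linalg.svd(A, full_matrices=False)
    _, sb, vtb = np.linalg.svd(B, full_matrices=False)
    k = min(k, len(sa), len(sb))
    if sa[:k].sum() == 0.0 or sb[:k].sum() == 0.0:
        raise ValueError("motions must not be all zeros")

    # Assumed weights: mean of the two normalized leading singular value vectors.
    w = 0.5 * (sa[:k] / sa[:k].sum() + sb[:k] / sb[:k].sum())

    # Absolute inner products, since a singular vector is defined only up to sign.
    cos = np.abs(np.sum(vta[:k] * vtb[:k], axis=1))
    return float(np.dot(w, cos))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    motion = rng.normal(size=(50, 6))      # 50 frames, 6 attributes
    longer = np.repeat(motion, 2, axis=0)  # same motion sampled twice as densely
    print(svd_similarity(motion, longer))
    print(svd_similarity(motion, motion[::-1]))
```

Because right singular vectors live in attribute space, the two inputs may have different numbers of frames. Reordering the frames of a motion leaves the score unchanged, which is consistent with the remark in Section 1 that such a measure is insensitive to data generation order.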
The HMM technique has been widely used for sign language recognition, and different recognition rates have been reported for different sign languages and different feature selection approaches. Starner et al. [Starner et al. 1998] achieved 92% and 98% word accuracy respectively for two systems: the first used a camera mounted on a desk, and the second used a camera in a user's cap, to extract features of five-word sentences as the input of HMM. Similarly, Liang and Ouhyoung [Liang and Ouhyoung 1998] used HMM for features such as postures, orientations and motion primitives extracted from continuous Taiwan sign language streams, and an average recognition rate of 80.4% was achieved. In contrast, the approach proposed in this paper is an unsupervised approach, and no training of the kind required for HMM recognizers is needed.

For isolated motion data, support vector machines (SVM) are used in [Li et al. 2004] for classification, and up to 100% accuracy has been achieved in recognizing isolated motions. Learning techniques such as Decision Trees, Bayesian classifiers and Neural Networks are used in [Shahabi et al. 2001] to recognize static signs for a 10-sign vocabulary, with 84.66% accuracy achieved.

Dynamic Time Warping (DTW) and Longest Common SubSequence (LCSS) distances are used as similarity measures in [Vlachos et al. 2003]. Both DTW and LCSS have a computational complexity of O(wd(m + n)), where w is the size of a matching window, d is the number of dimensions, and m and n are the lengths of the two data sequences. If m and n are quite different, w has to be a significant portion of m or n, and the computation can even be quadratic in the length of the sequences.
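The dependence of the cost on the window size can be seen in a minimal windowed DTW sketch. The code below is a generic band-constrained DTW written for illustration, not the implementation of [Vlachos et al. 2003]; the function name `banded_dtw` and the Euclidean per-frame cost are assumptions. The window is widened to at least |m − n| so that a warping path exists, which is why very different lengths push the cost toward quadratic.

```python
import math


def banded_dtw(x, y, w):
    """DTW distance between two sequences of d-dimensional frames.

    Only cells with |i - j| <= w are evaluated (hypothetical sketch).
    """
    m, n = len(x), len(y)
    # The band must cover the length difference, or no path reaches (m, n).
    w = max(w, abs(m - n))

    prev = [math.inf] * (n + 1)
    prev[0] = 0.0  # empty prefix aligned with empty prefix
    for i in range(1, m + 1):
        cur = [math.inf] * (n + 1)
        # Roughly 2w + 1 cells per row, each costing O(d).
        for j in range(max(1, i - w), min(n, i + w) + 1):
            cost = math.sqrt(sum((a - b) ** 2 for a, b in zip(x[i - 1], y[j - 1])))
            cur[j] = cost + min(prev[j], cur[j - 1], prev[j - 1])
        prev = cur
    return prev[n]


if __name__ == "__main__":
    short = [(0.0,), (1.0,), (2.0,)]
    slow = [(0.0,), (1.0,), (1.0,), (2.0,)]  # same shape, one frame repeated
    print(banded_dtw(short, slow, 1))
```

Each row touches about 2w + 1 cells, so the total work grows with w, d and the sequence lengths, in line with the bound quoted above.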
3. GEOMETRIC STRUCTURES REVEALED BY SVD
For any general real matrix, singular value decomposition can be used to reveal its geometric structure, i.e., the distribution of the matrix row vectors. As proven in [Golub and Loan 1996], if A is a real m × n matrix, there exist orthogonal matrices $\bar{U} = [\bar{u}_1, \bar{u}_2, \ldots, \bar{u}_m]$ ∈