A EUCLIDEAN-GEODESIC SHAPE DISTRIBUTION FOR RETRIEVAL OF TIME-VARYING MESH SEQUENCES

Toshihiko Yamasaki†‡ and Kiyoharu Aizawa†
†Department of Information and Communication Engineering, The University of Tokyo
‡MSR IJARC Fellow
ABSTRACT

This paper proposes a Euclidean-geodesic shape distribution for the more accurate retrieval of time-varying meshes, which are 3D mesh sequences of real-world objects generated by multiple cameras. The Euclidean-geodesic shape distribution derives from a combination of the modified shape distribution algorithm, which analyzes the global shape features of 3D models, and the geodesic shape distribution algorithm, which is used to investigate topological changes. The optimal weighting for the two algorithms is investigated experimentally. Experimental results show that the performance for similar motion retrieval is better than that of conventional algorithms, being improved by 2% on average and by 5% in the best case.

Index Terms: Time varying meshes, 3D, retrieval, feature vectors, shape distributions

1. INTRODUCTION

Three-Dimensional (3D) geometric modeling of human appearance and motion based on computer vision and graphics has received much attention during the last decade [1]–[8]. A prototype for capturing human motion in the form of a 3D mesh was presented by Kanade et al. [1]. Since then, a number of systems have been developed, aiming at real-time modeling [2], high resolution and high quality modeling using a deformable mesh [3], stereo matching [4][5], and graph-cuts [6]. In most cases, frames are generated independently of each other because of the nonrigid nature of human bodies and clothes. Therefore, the vertices and the connections are not always time-consistent. In this paper, we shall refer to such data as Time-Varying Meshes (TVMs). In recent years, a few techniques of mesh deformation to generate 3D mesh sequences while retaining time-consistency have been proposed [7][8]. However, the generation of the optimal initial 3D model remains a difficult issue.
978-1-4244-4291-1/09/$25.00 ©2009 IEEE

Although TVM generation is still an emerging technology, efficient and effective retrieval systems will clearly be needed in the future to manage large-scale TVM databases. Retrieval systems have been reported for static 3D models [9] and motion capture data [10]. However, to the best of our knowledge, the only retrieval systems for TVMs reported so far are those by the authors [11]–[14].

The difficulty in developing TVM retrieval systems lies not only in the lack of TVM data available to researchers but also in the difficulty of locating and tracking feature points in TVMs. Because the vertices and the connections are not time-consistent, as discussed above, it is very difficult to track and analyze the motion of objects. The topology of the mesh is not time-consistent either. For instance, in TVMs, when a person claps his/her hands, the palms appear connected, changing the genus by one. Therefore, graph-based or skeleton-based motion analysis will fail in many cases.

In [11], a modified shape distribution was developed for robust shape feature representation in each frame, and the sequences of the extracted feature vectors were utilized in calculating the distance between TVM clips. The modified shape distribution was then extended to motion capture data, enabling motion capture data to be used as queries [12]. However, some false positives occurred because the modified shape distribution was designed for global shape features [11]. For instance, the difference in shape between the “standing still” and “clapping hands in front of stomach” motions lies in the positions of the arms and hands, which occupy only a small portion of the surface area of the 3D models. In such a case, few differences can be observed between the extracted feature vectors.

The purpose of this paper is to improve the retrieval performance of TVMs by taking advantage of the topology changes occurring in TVMs. We combine the modified shape distribution [11] and the geodesic shape distribution [15], which employs geodesic distance instead of Euclidean distance, to form feature vectors. The modified shape distribution is used to analyze the global shape features while the geodesic shape distribution is used to reflect the topology changes.
In addition, the weighted sum of the two distance measures is employed for similarity evaluation between TVM clips.

ICME 2009

2. DATA DESCRIPTION

The TVMs in the present work were obtained via the system developed in [4]. They were generated from multiple-view images taken by 22 synchronous cameras. The 3D object modeling is based on a combination of volume intersection and stereo matching. Similarly to Two-Dimensional (2D) video, TVMs are composed of a consecutive sequence of “frames.” Each TVM frame is represented in terms of a polygon mesh model. That is, each frame is expressed by three items of data as shown in Fig. 1, namely the coordinates of vertices, their connection (topology), and color. As shown in the figure, the sleeves of the woman’s kimono are connected to her body. Such connection and topology changes occur when two parts of the body touch each other.

Table 1. Summary of TVMs utilized in experiments.

Sequence                   #1-1     #1-2     #2
# of frames                613      612      1,981
# of vertices (average)    17k      17k      17k
# of patches (average)     34k      34k      34k
Resolution                 10 mm    10 mm    10 mm
Frame rate                 10 frames/s

Fig. 1. Example frame of our TVMs. Each frame comprises the coordinates of vertices, their connection, and color.

3. RELATED WORKS AND PROPOSED ALGORITHM

The shape distribution [16] is one of the most efficient algorithms for static 3D model retrieval. In [16], a number of points (e.g., 1,024) in the 3D model were randomly sampled, and the Euclidean distances between all possible pairs of the points were calculated. Then a histogram of the distance distribution was generated as a feature vector to express the shape characteristics of the 3D model. The shape distribution algorithm has the virtue of robustness with respect to object rotation, translation, etc. However, the original shape distribution cannot generate stable histograms because of the random sampling of the 3D surface. The modified shape distribution [11] was developed to extract the feature vectors robustly. In [11], all the vertices were clustered into 1,024 groups according to their geometrical distance. Then a feature vector was formed by calculating the Euclidean distances between the centers of the clusters. As an alternative, the geodesic shape distribution [15] was developed to capture the nonlinear geometric structure of 3D models. In [15], geodesic distance was introduced to calculate the distance between points instead of Euclidean distance. The geodesic shape distribution is very sensitive to topological (genus) changes.
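The contrast between the two distance measures can be illustrated with a toy example. The following Python sketch is not the implementation of [11] or [15]. It assumes that the cluster centers are already given, it approximates geodesic distance by the shortest path along a hypothetical edge list (Dijkstra's algorithm), and the helper names `histogram`, `euclidean_distances`, and `geodesic_distances` are introduced here for illustration only. Four points form an open chain whose two ends stand in for the hands; adding one edge between the ends mimics the palms touching.

```python
import heapq
import math
from itertools import combinations


def histogram(values, bins, max_value):
    """Normalized histogram of `values` over [0, max_value] with `bins` bins."""
    counts = [0] * bins
    for v in values:
        idx = min(int(v / max_value * bins), bins - 1)
        counts[idx] += 1
    return [c / len(values) for c in counts]


def euclidean_distances(points):
    """Straight-line distance for every unordered pair of points."""
    return [math.dist(p, q) for p, q in combinations(points, 2)]


def geodesic_distances(points, edges):
    """Shortest-path distance along `edges` for every unordered pair.

    Assumption: the edge graph is connected, so every pair is reachable.
    """
    adj = {i: [] for i in range(len(points))}
    for i, j in edges:
        length = math.dist(points[i], points[j])
        adj[i].append((j, length))
        adj[j].append((i, length))
    result = []
    for src in range(len(points)):
        dist = {src: 0.0}
        heap = [(0.0, src)]
        while heap:
            d, u = heapq.heappop(heap)
            if d > dist.get(u, math.inf):
                continue  # stale heap entry
            for v, length in adj[u]:
                nd = d + length
                if nd < dist.get(v, math.inf):
                    dist[v] = nd
                    heapq.heappush(heap, (nd, v))
        result.extend(dist[dst] for dst in range(src + 1, len(points)))
    return result


# Toy "body": points 0 and 3 play the role of the two hands.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0)]
open_edges = [(0, 1), (1, 2), (2, 3)]    # hands apart
closed_edges = open_edges + [(0, 3)]     # hands touching

# Index 2 of each list is the pair (0, 3), i.e. hand to hand.
print(euclidean_distances(points)[2])               # 1.0 in both cases
print(geodesic_distances(points, open_edges)[2])    # 3.0
print(geodesic_distances(points, closed_edges)[2])  # 1.0
print(histogram(geodesic_distances(points, open_edges), 4, 4.0))
print(histogram(geodesic_distances(points, closed_edges), 4, 4.0))
```

In this toy case the Euclidean distances are unchanged by the added edge, whereas the geodesic histogram shifts, which is the sensitivity to topological change described above.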
Fig. 2. Motion definitions for sequence #1-1 for the first 20 seconds. Labeled motions: stand still, clap hands twice, clap hands once, draw a big circle, twist to right, twist to left, jump three steps, stoop down, jump and spread hands.

The proposed similarity measure for TVMs is a combination of the modified shape distribution and the geodesic shape distribution, which we shall call the Euclidean-geodesic shape distribution. The dissimilarity between TVM frames $f_i$ and $f_j$ is calculated as follows:

$$D(f_i, f_j) = w \times D_{\mathrm{msd}}(f_i, f_j) + (1 - w) \times D_{\mathrm{gsd}}(f_i, f_j), \qquad (1)$$

where $D_{\mathrm{msd}}(f_i, f_j)$ and $D_{\mathrm{gsd}}(f_i, f_j)$ are the dissimilarities calculated by the modified shape distribution and the geodesic shape distribution, respectively, and $w$ is a weight parameter ranging from 0 to 1. The modified shape distribution analyzes the global shape feature of each frame, whereas the geodesic shape distribution is utilized to investigate the topological change. The geodesic distance between the centers of clusters is calculated as described in [16].

4. EXPERIMENTAL RESULTS

In our experiments, three TVM sequences generated by the system developed in [4] were utilized (five minutes in total, comprising 336 clips). The parameters for the data are summarized in Table 1. Sequences #1-1 and #1-2 are Japanese traditional dances called bon-odori, and sequence #2 is a Japanese warming-up exercise. Sequences #1-1 and #1-2 are the same dance performed by different persons. The frame rate was 10 frames/s. The detailed content of the TVM is shown in Fig. 2. The dimensions of the feature vectors of the modified shape distribution and the geodesic shape distribution were set as 1,024 and 128, respectively. Since the geodesic shape distribution is used only for analyzing the topology changes, its dimension is reduced to save memory.

Fig. 3. (a) Frames #69 and #70. The topology changes because the dancer claps his hands and the hands become connected by meshes. (b) Feature vectors by modified shape distribution. (c) Feature vectors by geodesic shape distribution. Panels (b) and (c) plot normalized frequency against the elements of the feature vectors for frames #69 and #70. The range of the normalized frequency values differs in (b) and (c), which is due to the difference in vector dimension.

Fig. 4. Performance comparison as a function of w. The plot shows accuracy (%) against the w value for the first tier, second tier, and nearest neighbor measures.

The TVM sequences were divided into segments in advance by the method used in [11], and the similar motion retrieval was conducted by applying dynamic time warping between the segmented clips. The dissimilarity between frames was evaluated using Equation (1).
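The clip-level matching described above can be sketched as follows. This Python example is illustrative only and makes several assumptions that the text does not specify: each frame is represented as a pair of histograms (modified and geodesic), the per-feature dissimilarity is taken to be the L1 distance between histograms, and the dynamic time warping uses the basic three-way step pattern without path-length normalization. The function names are hypothetical.

```python
def l1_distance(a, b):
    """Sum of absolute differences between two equal-length histograms."""
    return sum(abs(x - y) for x, y in zip(a, b))


def frame_dissimilarity(frame_a, frame_b, w=0.8):
    """Equation (1): weighted sum of the two histogram dissimilarities.

    Each frame is a (msd_histogram, gsd_histogram) pair.
    Assumption: both D_msd and D_gsd are L1 distances.
    """
    d_msd = l1_distance(frame_a[0], frame_b[0])
    d_gsd = l1_distance(frame_a[1], frame_b[1])
    return w * d_msd + (1.0 - w) * d_gsd


def dtw_distance(clip_a, clip_b, w=0.8):
    """Accumulated cost of the best monotonic alignment of two clips."""
    n, m = len(clip_a), len(clip_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_dissimilarity(clip_a[i - 1], clip_b[j - 1], w)
            # Extend the cheapest of: repeat a frame of either clip, or advance both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# Two made-up frames that differ only in the modified shape distribution.
f1 = ([1.0, 0.0], [1.0, 0.0])
f2 = ([0.0, 1.0], [1.0, 0.0])

print(dtw_distance([f1, f2], [f1, f1, f2]))  # same motion at a slower pace
print(dtw_distance([f1], [f2]))              # different single-frame clips
```

Dynamic time warping lets a clip match a slower or faster performance of the same motion, since repeated frames can be aligned to a single frame at no extra cost.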
The differences between the feature vectors generated with the modified shape distribution and the geodesic shape distribution are shown in Fig. 3. Frames #69 and #70 are very similar in terms of their global shape, but the dancer’s hands become topologically connected when he claps his hands in Frame #69 (Fig. 3(a)). Therefore, the difference between the feature vectors is greater for the geodesic shape distribution (Fig. 3(c)) than for the modified shape distribution (Fig. 3(b)).

The impact of changing the weight value w in Equation (1) is shown in Fig. 4. The figure shows the accuracy averaged over all the retrieval experiments (336 queries). The performance was evaluated by the method employed in [16]. The “first tier” in Fig. 4 is the averaged percentage of correctly retrieved clips among the k clips with the highest similarity scores, where k is the number of ground-truth similar motion clips defined by the authors. An ideal matching would give no false positives and return a score of 100%. The “second tier” gives the same type of result, but for the 2×k clips with the highest similarity scores. The “nearest neighbor” is the percentage of tests in which the retrieved clip with the highest score was correct.

Fig. 4 shows that the retrieval performance becomes optimal when w is around 0.8. It was also confirmed that w=0.6 to 0.8 was optimal for most of the queries. Under these conditions, the performance was equal to or better than that of w=1 (modified shape distribution alone) and w=0 (geodesic shape distribution alone), demonstrating the validity of the proposed algorithm. It is also observed that the performance degrades when w is decreased. As discussed above, the geodesic shape distribution is appropriate for analyzing topological change but not for comparing global shape similarity. Ideally, its feature vectors will be almost the same regardless of the posture of the 3D model, provided the topology does not change.
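The three measures can be made concrete with a short Python sketch. It assumes a ranked list of candidate clip identifiers (best match first, query excluded) and a set of ground-truth similar clips; the function names and clip labels are hypothetical. Dividing the second tier count by k rather than 2×k follows the usual reading of the convention in [16] and is an assumption here.

```python
def first_tier(ranked, relevant):
    """Fraction of the k relevant clips found in the top k results."""
    k = len(relevant)
    return sum(1 for clip in ranked[:k] if clip in relevant) / k


def second_tier(ranked, relevant):
    """Fraction of the k relevant clips found in the top 2*k results."""
    k = len(relevant)
    return sum(1 for clip in ranked[:2 * k] if clip in relevant) / k


def nearest_neighbor(ranked, relevant):
    """1.0 if the best-scoring clip is relevant, otherwise 0.0."""
    return 1.0 if ranked and ranked[0] in relevant else 0.0


# Made-up retrieval result for one query with k = 3 ground-truth clips.
ranked = ["c3", "c7", "c1", "c9", "c4", "c2"]
relevant = {"c3", "c1", "c4"}

print(first_tier(ranked, relevant))        # 2 of 3 relevant clips in the top 3
print(second_tier(ranked, relevant))       # all 3 relevant clips in the top 6
print(nearest_neighbor(ranked, relevant))  # top result "c3" is relevant
```

Averaging each measure over all queries gives one point per measure for a given value of w, which is how a curve such as those in Fig. 4 could be produced.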
The performance of the proposed algorithm (w=0.8) is compared with that of [11] (w=1) in Table 2. In the experiments, each clip from the sequences shown in each column was used as a query. The clips from the sequences shown in each row were used as candidates. The query itself was not included in the candidate list.

Table 2. Performance comparison of the proposed algorithm (w=0.8) with [11] (w=1): (a) first tier, (b) second tier, (c) nearest neighbor. The numbers in brackets are the performances in [11].

(a)     #1-1         #1-2         #2
#1-1    98% (98%)    74% (71%)    N.A.
#1-2    86% (81%)    86% (86%)    N.A.
#2      N.A.         N.A.         92% (90%)

(b)     #1-1         #1-2         #2
#1-1    81% (77%)    66% (62%)    N.A.
#1-2    74% (72%)    85% (85%)    N.A.
#2      N.A.         N.A.         81% (80%)

(c)     #1-1         #1-2         #2
#1-1    96% (94%)    88% (88%)    N.A.
#1-2    91% (89%)    95% (95%)    N.A.
#2      N.A.         N.A.         87% (86%)

Table 2 shows that the performance can be improved by several percent. In the best case, the accuracy is improved by 5% (2% on average). In addition, the performance is equal to or better than that of [11] in every case.

The disadvantage of the proposed Euclidean-geodesic shape distribution is that it does not contribute to performance improvement for topologically time-consistent mesh sequences such as those described in [7][8]. For such cases, skeleton-based motion analysis and retrieval [13] or the modified shape distribution [11] would be better solutions.

5. CONCLUSIONS

In this paper, we have developed a Euclidean-geodesic shape distribution for better similar motion retrieval of TVMs. By the combination of global shape features and topology change analysis, the retrieval performance was improved by 5% in the best case and by 2% on average. In addition, it was demonstrated experimentally that the optimal weighting value is w=0.8.

ACKNOWLEDGEMENTS

This work is supported by the Microsoft Institute for Japanese Academic Research Collaboration (IJARC), and the Ministry of Education, Culture, Sports, Science and Technology of Japan under the “Development of Fundamental Software Technologies for Digital Archives” project.

6. REFERENCES
[1] T. Kanade, P. Rander, and P. Narayanan, “Virtualized reality: constructing virtual worlds from real scenes,” IEEE Multimedia, vol. 4, no. 1, pp. 34–47, Jan./March 1997.
[2] W. Matusik, C. Buehler, R. Raskar, S. Gortler, and L. McMillan, “Image based visual hulls,” ACM SIGGRAPH2000, pp. 369–374, 2000.
[3] T. Matsuyama, X. Wu, T. Takai, and T. Wada, “Real-time dynamic 3-D object shape reconstruction and high-fidelity texture mapping for 3-D video,” IEEE TCSVT, vol. 14, no. 3, pp. 357–369, March 2004.
[4] K. Tomiyama, Y. Orihara, M. Katayama, and Y. Iwadate, “Algorithm for dynamic 3D object generation from multi-viewpoint images,” Proc. SPIE, vol. 5599, pp. 153–161, 2004.
[5] J. Starck and A. Hilton, “Surface capture for performance-based animation,” IEEE CGA, vol. 27, no. 3, pp. 21–31, May–June 2007.
[6] T. Tung, S. Nobuhara, and T. Matsuyama, “Simultaneous super-resolution and 3D video using graph-cuts,” Proc. IEEE CVPR2008, pp. 1–8, 2008.
[7] D. Vlasic, I. Baran, W. Matusik, and J. Popovic, “Articulated mesh animation from multi-view silhouettes,” ACM SIGGRAPH08, #97, 2008.
[8] E. de Aguiar, C. Stoll, C. Theobalt, N. Ahmed, H.P. Seidel, and S. Thrun, “Performance capture from sparse multi-view video,” ACM SIGGRAPH08, #98, 2008.
[9] J. Tangelder and R.C. Veltkamp, “A survey of content based 3D shape retrieval methods,” Proc. Shape Modeling International 2004, pp. 145–156, 2004.
[10] C.Y. Chiu, S.P. Chao, M.Y. Wu, S.N. Yang, and H.C. Lin, “Content-based retrieval for human motion data,” Journal of Visual Communication and Image Representation, vol. 15, no. 3, pp. 446–466, 2004.
[11] T. Yamasaki and K. Aizawa, “Motion segmentation and retrieval for 3D video based on modified shape distribution,” EURASIP Journal on Applied Signal Processing, vol. 2007, Article ID 59535, 11 pages, 2007.
[12] T. Yamasaki and K. Aizawa, “Content-based cross search for human motion data using time-varying mesh and motion capture data,” Proc. IEEE ICME 2007, pp. 2006–2009, 2007.
[13] R. Tadano, T. Yamasaki, and K. Aizawa, “Fast and robust motion tracking for time-varying mesh featuring Reeb-graph-based skeleton fitting and its application to motion retrieval,” Proc. IEEE ICME2007, pp. 2010–2013, 2007.
[14] D. Kasai, T. Yamasaki, and K. Aizawa, “Retrieval of time-varying mesh and motion capture data using 2D video queries based on silhouette shape descriptors,” Proc. ICME2009, 2009.
[15] A.B. Hamza and H. Krim, “Geodesic object representation and recognition,” Lecture Notes in Computer Science, vol. 2886, pp. 378–387, 2003.
[16] R. Osada, T. Funkhouser, B. Chazelle, and D. Dobkin, “Shape distributions,” ACM Transactions on Graphics (TOG), vol. 21, issue 4, pp. 807–832, 2002.
[17] M. Hilaga, Y. Shinagawa, T. Kohmura, and T. L. Kunii, “Topology matching for fully automatic similarity estimation of 3D shapes,” ACM SIGGRAPH01, pp. 203–212, 2001.