RETRIEVAL OF TIME-VARYING MESH AND MOTION CAPTURE DATA USING 2D VIDEO QUERIES BASED ON SILHOUETTE SHAPE DESCRIPTORS

Daisuke Kasai†, Toshihiko Yamasaki†‡, and Kiyoharu Aizawa†

† Department of Information and Communication Engineering, The University of Tokyo
‡ MSR IJARC Fellow

ABSTRACT

This paper presents a retrieval system for Time-Varying Mesh (TVM) and motion capture data using 2D video queries. Previous approaches have used other TVM or motion capture data as queries, making query generation itself a significant cost. Instead, the proposed system uses 2D video queries, which can be captured with a single camera, enabling end users to retrieve 3D motion sequences such as TVM and motion capture data easily and interactively. We introduce the P-type Fourier descriptor, a shape feature for 2D contour images. TVM frames and computer graphics sequences rendered from motion capture data are silhouetted from multiple viewpoints, and feature vectors are generated by applying the P-type Fourier descriptor to these silhouette images. Experimental results using four TVM sequences and motion capture data demonstrated an average retrieval accuracy of 88% in terms of nearest neighbors.

Index Terms— TVM, motion capture, 2D video, retrieval, silhouette, P-type Fourier descriptor

1. INTRODUCTION

Dynamic Three-Dimensional (3D) mesh model sequences (TVMs) generated by multiple cameras have been actively researched in recent years [1]–[4]. A TVM is represented as a 3D polygon mesh sequence comprising vertices, connections, and colors. One of the key technologies for building a TVM archive system is efficient and effective motion retrieval. Related work on similar-motion retrieval involves 3D “motion capture” data [5]–[6]. However, most existing algorithms are not appropriate for our purpose because they assume that the structural information of the human body, such as where the joints are or how they move, is specified in advance. Because a TVM is generated from multiple passive cameras, this structural information is not available. Only a few papers on similar-motion retrieval for TVM have been reported to date [7][8]. In [7], a modified shape distribution algorithm was proposed to represent the shape feature of each frame (3D mesh model) robustly. In [8], skeleton extraction and motion tracking to extract motion data compatible with motion capture data were investigated. However, the accuracy achieved was only fair because of changes in the topology of the mesh.


In all these cases, the queries were either TVM data or motion capture data, which are costly to generate. To reduce the retrieval cost and make retrieval easier to use, a simpler query generation method is required. Similar problems remain in retrieving motion capture data: conventional approaches employ query-by-example using other motion capture data, which is costly for end users. Keyword-based search is a possible solution, but it depends strongly on “proper” labeling.

This paper proposes an easy-to-use retrieval system for 3D motion sequences, such as TVM and motion capture data, that uses 2D video queries, which can be generated easily with a single camera. We define features that can be applied to both 2D video and 3D motion sequences. In addition, we demonstrate a prototype of the graphical user interface. Our retrieval system is inspired by [9], which was developed for “static” 3D model retrieval. In [9], 3D static models were rendered from 20 viewpoints and silhouetted; a sketch, used as a query, was compared with the silhouette images of the 3D models in the database using Zernike moments and Fourier descriptors [10]. In this paper, we introduce the P-type Fourier descriptor [11] for a more efficient and effective feature representation of silhouettes. Feature vectors based on the P-type Fourier descriptor are extracted from the query and from TVM images rendered from 10 or 20 viewpoints. Dynamic Programming (DP) matching is then conducted to evaluate the similarity. The proposed method can also be applied to motion capture data retrieval by rendering the data as a 3D Computer Graphics (CG) scene and extracting the feature vectors in the same way. In our experiment, it took 1.74 s to retrieve data from a database of 380 clips (420 s in total). The “nearest neighbor” accuracy, namely the percentage of tests for which the retrieved clip with the highest score was correct, was up to 88%.

2. DESCRIPTION OF PROPOSED 3D MOTION SEQUENCES

The TVM data in this work were obtained via the system described in [3]. They were generated from multiple-view images taken by 22 synchronized cameras in a dedicated studio. As distinct from 2D video, a TVM comprises a sequence of 3D models. Each TVM frame is represented as a polygon mesh model; that is, each frame is expressed in terms of three types of data, namely the coordinates of the vertices, their connectivity, and color. The motion capture data in this work were obtained from the Web site of the Carnegie Mellon University Graphics Lab (http://mocap.cs.cmu.edu/search.php).
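For concreteness, the following is a minimal sketch (in Python, with illustrative field names, not the actual format of [3]) of the per-frame data a TVM sequence carries:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class TVMFrame:
    """One frame of a Time-Varying Mesh: a plain polygon mesh.

    Field names are illustrative; the actual layout depends on the
    capture system described in [3].
    """
    vertices: np.ndarray  # (V, 3) float array of x, y, z coordinates
    faces: np.ndarray     # (F, 3) int array of vertex indices (connectivity)
    colors: np.ndarray    # (V, 3) per-vertex RGB colors

# A TVM sequence is then simply a list of frames, one per time step.
# Unlike motion capture data, no joint or skeleton structure is present.
```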


Fig. 1. Flow of TVM retrieval with 2D video queries.

3. PROPOSED METHOD

3.1. Overview

The processing flow of our system is as follows (see Fig. 1).

1. Normalization: all 3D videos in the database are translated to place their centers of gravity at the origin of the world coordinate system and scaled so that the distance from the origin to the farthest vertex is 1.
2. TVM rendering: every frame of each 3D video is rendered from 20 viewpoints (using 10 or fewer viewpoints is also possible for quicker processing), generating 20 2D videos. The viewpoints are at the 20 vertices of a regular dodecahedron.
3. Feature extraction for 3D motion sequences: all frames of the 2D videos are silhouetted and feature vectors (we use P-type Fourier descriptors) are generated, yielding a feature vector sequence for each viewpoint.
4. Feature extraction for a query: in the same way as for the 3D videos, the 2D video query is silhouetted and feature vectors are extracted.
5. Similarity calculation: the similarity between a 3D motion sequence and the query is calculated in the feature vector space by DP matching. The similarity of the most similar of the 20 viewpoints is defined as the similarity between the 3D video and the query.

3.2. P-Type Fourier Descriptor

Fig. 2. Broken line approximation.

The P-type Fourier descriptor [11] is a Fourier descriptor that can be applied to an open curve. As shown in Fig. 2, a broken-line approximation with segments of length $\delta$ is applied to the curve (or contour). Let $w[i] = \exp(j\theta[i])$, where $\theta[i]$ is the angle of the $i$-th segment. The relation between $w[i]$ and the complex coordinate $z[i]$ is

$$w[i] = \exp(j\theta[i]) = \cos\theta[i] + j\sin\theta[i]
      = \frac{x[i+1]-x[i]}{\delta} + j\,\frac{y[i+1]-y[i]}{\delta}
      = \frac{z[i+1]-z[i]}{\delta},$$

where $z[i]$ is the coordinate of the starting point of the $i$-th segment on the complex plane. The Fourier transform of $w[i]$ is called the P-type Fourier descriptor. In this paper, the 20 frequency components of lowest order were used as the feature vector.
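A minimal sketch of this computation, assuming the contour is already given as ordered 2D points resampled at (approximately) equal arc length $\delta$ (the function name and the use of NumPy's FFT are our own choices, not the paper's implementation):

```python
import numpy as np

def p_type_descriptor(contour: np.ndarray, n_coeffs: int = 20) -> np.ndarray:
    """P-type Fourier descriptor of a 2D contour (sketch of [11]).

    `contour` is an (N, 2) array of points from a broken-line
    approximation with roughly equal segment length delta.
    """
    z = contour[:, 0] + 1j * contour[:, 1]  # points on the complex plane
    diffs = np.diff(z)                      # z[i+1] - z[i]
    delta = np.abs(diffs).mean()            # segment length
    w = diffs / delta                       # w[i] = exp(j * theta[i]) for equal-length segments
    spectrum = np.fft.fft(w)                # P-type Fourier descriptor
    return spectrum[:n_coeffs]              # keep the 20 lowest-order components
```

The complex coefficients are returned as-is here; whether the paper matches complex values or their magnitudes is not specified, so this detail is an assumption.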

3.3. Query Silhouetting

Queries are generated by performing in front of a single camera. The silhouette images are obtained by a combination of background subtraction and graph cuts [12]. Feature vectors are then generated in the same manner as for the 3D motion sequences.

3.4. Matching between 2D Videos by DP Matching

DP matching [13], a well-known method for matching time-inconsistent sequences, is used to calculate the similarity between the query and candidate clips. Let the feature vector sequences of the query $Q$ and of the silhouette video from viewpoint $i$, denoted $Y^i$, be

$$Q = \{q_1, q_2, \ldots, q_s, \ldots, q_l\}, \qquad
Y^i = \{y^i_1, y^i_2, \ldots, y^i_t, \ldots, y^i_m\},$$

where $q_s$ and $y^i_t$ are the feature vectors of the $s$-th and $t$-th frames of $Q$ and $Y^i$, and $l$ and $m$ are the numbers of frames in $Q$ and $Y^i$, respectively. Define $d(s,t)$ as the Euclidean distance between $q_s$ and $y^i_t$:

$$d(s,t) = \lVert q_s - y^i_t \rVert.$$

The dissimilarity $D$ between the sequences $Q$ and $Y^i$ is then calculated as

$$D(Q, Y^i) = \frac{cost(l,m)}{\sqrt{l^2 + m^2}},$$

where $cost(s,t)$ is a cost function over the feature vectors of frames $s$ and $t$, defined by $cost(1,1) = d(1,1)$ and otherwise

$$cost(s,t) = k(s,t)\,d(s,t) + \min
\begin{cases}
cost(s, t-1) \\
cost(s-1, t) \\
cost(s-1, t-1),
\end{cases}$$

with

$$k(s,t) =
\begin{cases}
\exp(\lvert s/l - t/m \rvert) & (\lvert s/l - t/m \rvert > r) \\
1 & (\text{otherwise}).
\end{cases}$$

The path-limitation parameter $k(s,t)$ is introduced to improve retrieval performance by eliminating unnatural (extreme) path fittings. Here, the symbols $Q$ and $Y^i$ are omitted from $d(s,t)$ and $cost(l,m)$ for simplicity. Because the cost is a function of the sequence lengths, $cost(l,m)$ is normalized by $\sqrt{l^2 + m^2}$. The smaller the value of $D$, the more similar the sequences.
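A direct transcription of this recursion in Python is sketched below; 0-based indices replace the paper's 1-based ones, and the threshold $r$ is an assumed value, since the paper does not report it:

```python
import numpy as np

def dp_dissimilarity(Q: np.ndarray, Y: np.ndarray, r: float = 0.2) -> float:
    """DP-matching dissimilarity D(Q, Y) between two feature sequences,
    following the recursion in Section 3.4.

    Q: (l, d) query features; Y: (m, d) candidate features.
    r is the path-limitation threshold (value assumed, not from the paper).
    """
    l, m = len(Q), len(Y)
    cost = np.full((l, m), np.inf)
    for s in range(l):
        for t in range(m):
            d = np.linalg.norm(Q[s] - Y[t])     # d(s, t)
            dev = abs(s / l - t / m)            # deviation from the diagonal path
            k = np.exp(dev) if dev > r else 1.0 # path-limitation weight k(s, t)
            if s == 0 and t == 0:
                cost[s, t] = d                  # cost(1, 1) = d(1, 1)
                continue
            prev = min(
                cost[s, t - 1] if t > 0 else np.inf,
                cost[s - 1, t] if s > 0 else np.inf,
                cost[s - 1, t - 1] if (s > 0 and t > 0) else np.inf,
            )
            cost[s, t] = k * d + prev
    return cost[-1, -1] / np.sqrt(l**2 + m**2)  # length-normalized dissimilarity
```

The similarity between a query and a 3D clip would then be the minimum of $D(Q, Y^i)$ over the 20 viewpoints, per step 5 of the overview.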

Table 1. Database information for the TVM data.

    Sequence ID        #1    #2    #3    #4
    Number of frames   413   378   421   259
    Number of clips    28    26    26    27
    Rate               10 frames/s

Table 2. Database information for the motion capture data.

    Number of frames   8190
    Number of clips    273
    Rate               30 frames/s

Fig. 3. Silhouette images produced by the method of Section 3.3: (a) source, (b) background subtraction only, (c) proposed method.
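As an illustration of the silhouetting step of Section 3.3, the rough OpenCV-based sketch below seeds a trimap by background subtraction and refines it with a graph-cut segmenter. Note that the paper uses the energy minimization of [12]; OpenCV's grabCut is used here only as a related stand-in, and the thresholds are assumptions:

```python
import cv2
import numpy as np

def silhouette(frame: np.ndarray, background: np.ndarray) -> np.ndarray:
    """Rough silhouette extraction: background subtraction seeds a
    trimap, which a graph-cut segmenter then refines.

    Thresholds and the use of grabCut (rather than the exact method
    of [12]) are illustrative assumptions.
    """
    diff = cv2.absdiff(frame, background)
    gray = cv2.cvtColor(diff, cv2.COLOR_BGR2GRAY)
    mask = np.full(gray.shape, cv2.GC_PR_BGD, np.uint8)  # default: probably background
    mask[gray > 30] = cv2.GC_PR_FGD                      # likely foreground
    mask[gray > 80] = cv2.GC_FGD                         # confident foreground
    bgd = np.zeros((1, 65), np.float64)
    fgd = np.zeros((1, 65), np.float64)
    cv2.grabCut(frame, mask, None, bgd, fgd, 5, cv2.GC_INIT_WITH_MASK)
    fg = np.isin(mask, (cv2.GC_FGD, cv2.GC_PR_FGD)).astype(np.uint8)
    return fg * 255                                      # binary silhouette image
```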

4. EXPERIMENTAL RESULTS

4.1. Database

In our experiments, four TVM sequences (147 s in total) were generated by the system developed in [3] and divided into clips (107 clips in total) using the method described in [7]. The parameters of the data are summarized in Table 1. The TVMs were of Japanese traditional dances called Bon Odori. The sequences were identical but performed by different persons, and the same motion appears four times in each sequence. The frame rate of the TVMs was 10 frames/s; the detailed content of the TVMs is described in [7]. In addition, 273 motion capture clips were rendered using a standard human CG model and included in the database. The parameters of these data are summarized in Table 2. They represented the motions of running, basketball dribbling, being stationary, dancing, jumping, and turning Catherine wheels. In total, there were 380 3D motion clips in the database.

4.2. Query

In this experiment, eight 2D video queries were generated using a Web camera. The frame rate was 10 frames/s and the resolution was 640 × 480. A blue sheet was used as the background. The videos were silhouetted by the method described in Section 3.3. Fig. 3 shows an example of this method performing well: although the legs are not silhouetted well using background subtraction alone, they are silhouetted well when combined with graph cuts.

4.3. Retrieval Accuracy

In this experiment, the 3D motion sequences were rendered from 20 viewpoints for accuracy. Retrieval accuracy was evaluated in terms of the “nearest neighbor” and the “first tier” [14]. “Nearest neighbor” accuracy is the percentage of tests for which the retrieved clip with the highest score was correct. “First tier” accuracy is the average percentage of correctly retrieved clips among the k highest-scoring clips, where k is the number of ground-truth similar motion clips defined by the authors; the value of k therefore depends on the query.
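For clarity, a small sketch of how these two metrics would be computed (function names and the representation of results as ranked relevance flags are our own, hypothetical conventions):

```python
import numpy as np

def nearest_neighbor_accuracy(results: list) -> float:
    """Fraction of queries whose top-ranked retrieved clip is correct.
    `results` holds, per query, a ranked list of relevance flags (bools)."""
    return float(np.mean([r[0] for r in results]))

def first_tier(ranked_relevance: list, k: int) -> float:
    """Fraction of the k ground-truth-similar clips found among the top k
    results; k is query-dependent, as in Section 4.3."""
    return sum(ranked_relevance[:k]) / k
```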


Fig. 4. Retrieval result (Jump and spread hands).

Fig. 5. Retrieval result (Draw a big circle).

Fig. 6. Precision/recall curve (Draw a big circle).

Table 3. The matching time cost.

    Feature                       P-type (20)   P-type (10)   Z-type   Zernike
    Calculating query features    0.27 s        0.27 s        3.7 s    order of tens of minutes
    Similarity computation time   4.7 s         1.5 s         4.7 s    N.A.
    Total time                    4.9 s         1.7 s         8.4 s    N.A.

Fig. 4 shows the retrieval results for a query involving the motion “Jump and spread hands”; the first tier for this query was 50%. Fig. 5 shows a retrieval example for a query involving the motion “Draw a big circle”; the first tier was 75%. Fig. 6 shows the corresponding precision/recall tradeoff: the retrieved clips ranked 1st to 21st (75% of all the correct clips) are all correct, which is very good performance. Over all eight queries, the average first tier is 50%, the maximum is 75%, and the minimum is 20%. The nearest-neighbor accuracy over all eight queries is 88%. For comparison, the “nearest neighbor” and “first tier” accuracies using the Zernike moment were 70% and 41%, respectively, and those using the Z-type descriptor were 81% and 46%. Thus the validity of our algorithm is demonstrated. In [7], the mean “nearest neighbor” and “first tier” values were 67% and 75%, respectively; our present work therefore yields reasonable performance given its simplicity of query generation.

4.4. Retrieval Time

The processing times are summarized in Table 3. The retrieval time comprises the feature vector calculation time for a query and the similarity computation time (DP matching). To measure it, a retrieval system was implemented in C++, a 10-frame 2D video query was generated, and the retrieval time for the query was measured. This processing cost is reasonable for practical use. However, retrieval time will become a problem when handling large databases; it will then be necessary to use speed-up techniques in the feature vector space, such as Ball Partitioning [15].

5. CONCLUSIONS

In this paper, we have proposed a 3D motion sequence retrieval algorithm using 2D video queries. The 3D motion sequences were rendered from multiple viewpoints, and the P-type Fourier descriptor was employed to represent the shape features of the 3D models. By using 2D video for queries, silhouetting by background subtraction and graph-cut segmentation, and calculating the P-type Fourier descriptor, the query generation cost can be kept small. The validity of the proposed approach was demonstrated by experiments using dance sequences by four performers and motion capture sequences. The retrieval time cost was small, being 0.27 s for calculating the features of a 10-frame query and 1.47 s for similarity computation. The retrieval accuracy was 88% in terms of nearest-neighbor accuracy.

This work is supported by the Microsoft Institute for Japanese Academic Research Collaboration (IJARC), and by the Ministry of Education, Culture, Sports, Science and Technology of Japan under the “Development of Fundamental Software Technologies for Digital Archives” project.

6. REFERENCES

[1] T. Kanade, P. Rander, and P. J. Narayanan, “Virtualized reality: constructing virtual worlds from real scenes,” IEEE Multimedia, vol. 4, no. 1, pp. 34–47, 1997.
[2] T. Matsuyama, X. Wu, T. Takai, and T. Wada, “Real-time dynamic 3-D object shape reconstruction and high-fidelity texture mapping for 3-D video,” IEEE TCSVT, vol. 14, no. 3, pp. 357–369, 2004.
[3] K. Tomiyama, Y. Orihara, M. Katayama, and Y. Iwadate, “Algorithm for dynamic 3D object generation from multi-viewpoint images,” Proc. SPIE, vol. 5599, pp. 153–161, 2004.
[4] E. de Aguiar, C. Stoll, C. Theobalt, N. Ahmed, H.-P. Seidel, and S. Thrun, “Performance capture from sparse multi-view video,” ACM Transactions on Graphics, vol. 27, no. 3, 2008.
[5] C.-Y. Chiu, S.-P. Chao, M.-Y. Wu, S.-N. Yang, and H.-C. Lin, “Content-based retrieval for human motion data,” Journal of Visual Communication and Image Representation, vol. 15, no. 3, pp. 446–466, 2004.
[6] M. Muller, T. Roder, and M. Clausen, “Efficient content-based retrieval of motion capture data,” Proc. SIGGRAPH, pp. 677–685, 2005.
[7] T. Yamasaki and K. Aizawa, “Motion segmentation and retrieval for 3D video based on modified shape distribution,” EURASIP Journal on Applied Signal Processing, vol. 2007, Article ID 59535, 11 pages, 2007.
[8] R. Tadano, T. Yamasaki, and K. Aizawa, “Fast and robust motion tracking for time-varying mesh featuring Reeb-graph-based skeleton,” Proc. ICME, Th-P9.6, pp. 2010–2013, 2007.
[9] D.-Y. Chen, X.-P. Tian, Y.-T. Shen, and M. Ouhyoung, “On visual similarity based 3D model retrieval,” Computer Graphics Forum, vol. 22, no. 3, pp. 223–232, 2003.
[10] D. Zhang and G. Lu, “A comparative study of Fourier descriptors for shape representation and retrieval,” Proc. ACCV, pp. 646–651, 2002.
[11] Y. Uesaka, “Spectral analysis of form based on Fourier descriptors,” Proc. First International Symposium for Science on Form, pp. 405–412, 1986.
[12] Y. Boykov, O. Veksler, and R. Zabih, “Fast approximate energy minimization via graph cuts,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 23, no. 11, pp. 1222–1239, 2001.
[13] R. Bellman and S. Dreyfus, Applied Dynamic Programming, Princeton University Press, 1962.
[14] R. Osada, T. Funkhouser, B. Chazelle, and D. Dobkin, “Shape distributions,” ACM Transactions on Graphics, vol. 21, no. 4, pp. 807–832, 2002.
[15] J. K. Uhlmann, “Satisfying general proximity/similarity queries with metric trees,” Information Processing Letters, vol. 40, no. 4, pp. 175–179, 1991.
