Articulated Human Motion Capture from Segmented Visual Hulls and Surface Reconstruction

Weilan Luo∗, Toshihiko Yamasaki∗ and Kiyoharu Aizawa†∗
∗ Department of Information and Communication Engineering
† Interfaculty Initiative in Information Studies
The University of Tokyo, 7-3-1 Hongo, Bunkyo, Tokyo, 113-8656, Japan
E-mail: luoweilan, yamasaki, [email protected] Tel: +81-3-5841-6761
Abstract—In this paper, we propose a stochastic approach for tracking articulated 3D human motion in high-dimensional configuration spaces using synchronized multiple cameras. We seek globally optimal solutions of the likelihood, with local memorization of the "fitness" of each body segment, for a volume sequence directly in 3D space instead of projecting a roughly simplified body model onto 2D images. We have developed a modified annealed particle filtering algorithm that performs the global optimization while taking local constraints into consideration. The volumetric 3D models generated from the multiple cameras are segmented into 15 parts and assigned 42 degrees of freedom for motion estimation. The matching error is about 7% on average when tracking the posture between two neighboring frames. For frame t+1, we deform the surface model of frame t by the estimated rigid transformations, and then capture the non-rigid deformation by fitting the transformed surface to the 2D silhouettes.
I. INTRODUCTION

Kinematic body motion capture and 3D spatio-temporal surface reconstruction from synchronized multi-camera or multi-view video sequences remain challenging and fundamental problems for many applications, including 3D animation movies and games, medical diagnostic motion analysis, and robot motion simulation. Marker-based motion capture systems can quickly provide highly accurate motion capture data. However, they require people to wear skin-tight clothing with markers, need special capture hardware, and have difficulty capturing motion robustly when the subject wears loose apparel. In recent years, markerless motion capture has received increasing attention and has been used in commercial fields ranging from surveillance to character animation and 3D movie production. Some approaches capture only the rigid postures of an articulated body under strong restrictions such as tight clothing, which causes problems when generating realistic, complicated surfaces from the motion capture data. Other methods track temporally corresponding vertices and deform the 3D mesh surface without providing an articulated skeleton model.

In this paper, we propose a motion tracking method that captures human motion in two respects. A sequence of kinematic articulated models is captured by a stochastic search through the visual hulls, and a segmented template surface model combined with a skeleton is provided for motion tracking and surface estimation. The transformed template surface is deformed to match the silhouettes while preserving the topology and surface smoothness, and we recover the surface details by solving an optimization scheme. The volume data of frame t are labeled in accordance with the corresponding estimated surface model for motion extraction of the next frame.

A segmented template is provided for motion tracking and surface reconstruction. In our work, we construct a mesh model by marching cubes [1] and extract the articulated skeleton model from it for the first frame. Then we segment the mesh surface into 15 parts based on the skeleton and geodesic distances. The labeled model with an underlying skeleton is employed as a template for motion tracking and surface estimation.

Deutscher et al. [2] constructed an articulated body model simplified by cones with elliptical cross-sections, assigned it 29 degrees of freedom, and estimated the 3D posture by deforming the model to match 2D images with the annealed particle filter (APF) method. However, such oversimplified models make it difficult to recover complicated shape and motion precisely. We instead employ the volumetric models [3] generated from multi-view images directly for 3D pose estimation. Each voxel of the visual hull is labeled in accordance with the corresponding segmented surface model, and only 5% of the voxels are chosen for motion tracking in order to decrease the computational cost, as each model contains more than 100,000 voxels. Like APF, we capture the 3D posture of the human body by annealing to generate better particles; however, the APF method struggles to recover the 3D poses of hands or feet robustly with a 42 degrees of freedom (DOF) model. The fitness of some local body segments may be ignored while the APF algorithm seeks the globally optimal solution for the whole human body. Therefore, we develop a stochastic approach with memorization of local optimization to avoid this problem. Our proposed method is able to track quick movements, and subjects wearing general apparel, in high-dimensional spaces. An initial surface model can be constructed by deforming the template surface using the rigid transformation matrices, which is expected to be much closer to the real surface than one reconstructed by linear or quaternion blend skinning [4]. We recover the surface details to take the non-rigid deformation into consideration, which is similar to [5], [6], by refining the mesh surface to match the silhouettes.
Fig. 1. Model segmentation: (a) skeleton, (b) template, (c) visual hull.

Fig. 2. Sampling: (a) visual hull, (b) sample model, (c) skeleton.
This paper is organized as follows. Section II reviews related work in this field, Section III outlines our method for motion tracking and surface reconstruction, and Section IV introduces the model segmentation method that partitions the human body into 15 rigid parts, the articulated motion representation, and our method for tracking human motion through a sequence of volumetric models. Section V introduces the Laplacian deformation framework used to recover the surface details, which enforces that the 2D images obtained by projecting the estimated surface match the silhouettes. In Section VI, we show our experimental results, and we summarize the results and discuss future work in Section VII.

II. RELATED WORK

Marker-less human motion tracking has been a challenging problem in computer vision for years. It is intuitive to represent kinematic postures by articulated skeleton models, so several simple geometric representations have been used to replace the body segments of a human. Deutscher et al. [2], [7] reconstructed the subject's body shape by cones with elliptical cross-sections and 29 degrees of freedom. They modified conventional particle filters by layering the search space based on annealing to estimate articulated body motion in high-dimensional spaces. Plänkers et al. [8] developed a method to reconstruct articulated deformable objects based on metaballs and estimated human motion using the Levenberg-Marquardt least-squares estimator. In general, however, such models are too simple to recover shape and motion accurately. In contrast to these methods, Corazza et al. [9] proposed an automatic generation method for a subject-specific model with joint center locations instead of using simplified models; they utilized a training set to locate the optimal joint positions in a model. A marker-less motion tracking approach that took advantage of visual hulls, a subject-specific model, and articulated ICP with different numbers of cameras and datasets was presented in [10]. However, the articulated ICP method does not work well for fast motion clips, and the tracking results rely on the quality of the visual hulls.

Kinematic chains were extracted directly from time-varying meshes, which have time-varying topologies (vertices, faces, and colors), in [11], [12], [13]. The motion extraction method in [11] was based on a Reeb graph, and a geodesic function utilizing principal component analysis was given for pose estimation. Lee et al. [12], [13] utilized the extracted skeleton chains, segmented the time-varying mesh surfaces based on skeleton nodes by distance calculations, and refined the 3D pose using the decomposition results. Vlasic et al. [5] presented a method that pulls the template skeleton to fit the visual hull by minimizing an energy function; this approach does not always work well, so misaligned postures are adjusted by hand. Gall et al. [6] also extracted the 3D articulated model, registering texture correspondences by solving an energy minimization problem in the first stage; in addition, they detected misaligned limbs and refined the pose by a particle filter to seek the global optimum. In the surface estimation stage, both Vlasic and Gall utilized a skinning method to generate the surface model and then recovered the details by deforming it to match the silhouette rims.

As stochastic methods are appropriate for global optimization, we follow the idea proposed in [2], which combines simulated annealing with a particle filter for tracking. We previously employed the APF algorithm for 3D human motion tracking directly from the visual hulls [14]. However, it was difficult to track each body segment robustly in a high-dimensional configuration space of about 42 DOF. We therefore presented a local adjustment framework that detects misaligned body limbs and refines them using the ICP algorithm [15], but it could not ensure the articulated property of the motion, as the ICP method assigned each body segment six DOF. We modified the APF method with local constraints for human motion tracking to avoid this problem in [16]. In this paper, we develop a modified annealed particle filtering algorithm in which the annealing process finds the globally optimal solution while satisfying the local optimization. A 3D representation by volume data can be generated easily from multi-view images and segmented by distance estimation. Sample data from the template are chosen and deformed stochastically to match the volumetric model to be tracked.
Fig. 3. Comparison of the volumetric models: (a) the sample model X, (b) the target Z.

Fig. 4. Pose extraction by APF and our proposed approach: (a) visual hull, (b) APF, (c) our method.
In our proposed method, the weights for particles are calculated and sorted. Half of the particles are reconstructed by combining the 15 particles selected according to the sorted indices for all body segments. New particles for the next layer are generated using the global weights, and the process is repeated until layer 1. The solution converges to the global maximum while local optimality is also enforced. We further estimate the surface details by matching the surface to the 2D images. In this manner, we can generate a 3D mesh sequence whose topology is time-consistent throughout the sequence. In our approach, the 3D mesh models are generated based on the template and the extracted motion tracking data; therefore, the resultant 3D mesh models are closer to the real shape of the object than those produced by skinning methods.
III. OVERVIEW

The proposed approach tracks 3D rigid human motion from sequences of visual hulls constructed from multiple camera images with volume intersection methodologies [3], and generates a deformed sequence of mesh surfaces utilizing the template. The deformed surface is then projected to 2D silhouettes, and the non-rigid deformation is recovered by matching to the rims with constraints.

The time-varying surfaces are recovered based on the template model. In our approach, the template model is selected by the user from the generated 3D mesh sequence. As shown in Fig. 1, the template is segmented into 15 parts based on the underlying skeleton and geodesic distance. Then the corresponding visual hull is labeled in accordance with the template; each voxel is labeled by searching for the nearest vertex in the template surface (a minimal sketch of this step is given at the end of this section). The segmentation method is described in Section IV-A in detail.

We exploit an articulated volumetric model to track 3D human motion instead of a conventional skeleton-based method, since it provides surface information as well as interior data, which enforces robust posture estimation. Samples from each segment are selected as shown in Fig. 2: 5% of the voxels in the model (a) are used to construct the sample model (b), and the skeleton model (c) is utilized to define the rotation axes for 3D pose estimation. An appropriate articulated model is extracted by annealing the generated particles with local memorization. In addition, we realign the posture by registering corresponding 3D vertices to the silhouette rims by Laplacian deformation. The non-rigid surface is recovered by matching 2D images instead of 3D volume data. The estimated segmented surface and deformed skeleton pose are then used as the reference models for the next frame to be tracked. Segmented time-varying surfaces with underlying skeleton models are generated by repeating this process.
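To make the voxel-labeling step concrete, the following is a minimal sketch, assuming the template vertices already carry segment labels in 0..14; the function and variable names are ours, not the authors' implementation.

```python
# Label each voxel with the segment of its nearest template vertex.
# Illustrative sketch only, not the authors' code.
import numpy as np
from scipy.spatial import cKDTree

def label_voxels(voxel_centers, template_vertices, vertex_labels):
    """voxel_centers: (M,3); template_vertices: (V,3);
    vertex_labels: (V,) segment ids in 0..14. Returns (M,) labels."""
    tree = cKDTree(template_vertices)          # fast nearest-vertex queries
    _, nearest = tree.query(voxel_centers, k=1)
    return vertex_labels[nearest]

# Toy usage: three vertices in two segments, two voxels.
verts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
labels = np.array([0, 0, 1])
voxels = np.array([[0.1, 0.0, 0.0], [1.9, 0.0, 0.0]])
print(label_voxels(voxels, verts, labels))     # -> [0 1]
```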
IV. MODEL-BASED POSE ESTIMATION

The labeled visual hull with a corresponding articulated model of frame t is used to extract the 3D posture of the next frame. We select 5% of the voxels in the volume data, assign the model 42 degrees of freedom, and deform it to match the model to be tracked. The number of selected voxels of each body segment is equal in order to ensure the same importance for pose estimation.

A. Model Segmentation

We propose a model-based motion tracking method that extracts the 3D articulated kinematic chains by analyzing the time-varying visual hulls. The model decomposition is necessary because we intend to analyze and locate each body segment of the human body, and we abandon the simple model represented by cones with joints. Although a subject-specific model generation method is proposed in [9], it is not easy to achieve for a usual marker-less system. It is hard to segment the volumetric model directly in a time-consistent manner, as no explicit correspondence between frames is given in our time-varying visual hulls. We therefore first segment the template mesh surface; as long as a model with the same topology is generated for frame t, the corresponding visual hull can be labeled according to it. We prefer constructing the template model for the first frame directly from images to avoid the registration problem described in [6], where the surface is generated by a laser scanner. We assume that no body segments cross in the initial posture, as the quality of the template affects the reconstructed time-varying sequence. The template surface and the visual hulls are all constructed using the multi-camera system. A skeleton model, as shown in Fig. 1(a), is extracted from the mesh surface.
Fig. 5. The difference results of each body segment and the whole human body between the left visual hull and the deformed volumetric models as shown in Fig. 4, using the APF algorithm and our method, respectively. (Horizontal axis: the body segments Head, LHand, RHand, LFoot, RFoot, LLArm, RLArm, LLLeg, RLLeg, LUArm, RUArm, LULeg, RULeg, Belly, Wrest, and Whole; vertical axis: matching error (%); curves: our method, APF, and a raw comparison between neighboring frames.)
It can be obtained by hand or by the methods proposed in [11], [12]. The template model is segmented into 15 body parts based on geodesic distance and the underlying skeleton (b). Then the visual hull (c) is decomposed according to the segmented mesh surface. Our mesh surface segmentation algorithm proceeds in the following steps:
(1) Calculate the five start points, each being the vertex of the model nearest to the corresponding leaf joint of the skeleton.
(2) Label the body segments of the head, hands, and feet, respectively. We define a plane that passes through the joint of two articulated bones and splits them into two parts; this makes it easy to detect the vertices that do not belong to the corresponding segment. The geodesic distances to the start point are computed. Assume the start point is labeled i, and let k be the minimal geodesic distance of a vertex that does not belong to the same body segment according to the plane; then all vertices whose geodesic distances are less than k are labeled i (see the sketch after this list). We add a virtual point that is assumed to be adjacent to all the vertices whose geodesic distances equal k; this virtual point is used as the start point for the next segmentation step.
(3) Repeat Step 2 to segment the lower parts of the legs and arms, the upper parts, and the body in turn.
This method is robust for a uniform model in which each body segment is clearly represented. Since we recover the time-varying surfaces based on the template model, the volumetric sequence can be labeled according to the corresponding surface.
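The labeling rule of step (2) can be sketched as follows; the graph representation and function names are our own illustrative assumptions, with geodesic distances approximated by shortest paths along mesh edges.

```python
# Grow a segment from a start vertex by geodesic distance until the first
# vertex beyond the joint plane is reached. Illustrative sketch only.
import heapq

def geodesic_distances(adjacency, start):
    """adjacency: {vertex: [(neighbor, edge_length), ...]}; Dijkstra."""
    dist = {start: 0.0}
    heap = [(0.0, start)]
    while heap:
        d, v = heapq.heappop(heap)
        if d > dist.get(v, float("inf")):
            continue
        for n, w in adjacency[v]:
            nd = d + w
            if nd < dist.get(n, float("inf")):
                dist[n] = nd
                heapq.heappush(heap, (nd, n))
    return dist

def label_segment(adjacency, start, beyond_plane):
    """beyond_plane: vertices on the far side of the joint plane.
    Returns the vertex set labeled with the start vertex's segment."""
    dist = geodesic_distances(adjacency, start)
    # k = minimal geodesic distance of a vertex beyond the plane
    k = min((dist[v] for v in beyond_plane if v in dist),
            default=float("inf"))
    return {v for v, d in dist.items() if d < k}

# Toy usage: a path 0-1-2-3 where vertex 3 lies beyond the joint plane.
adj = {0: [(1, 1.0)], 1: [(0, 1.0), (2, 1.0)],
       2: [(1, 1.0), (3, 1.0)], 3: [(2, 1.0)]}
print(label_segment(adj, 0, {3}))   # -> {0, 1, 2}
```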
Each volumetric model is partitioned into 15 limbs according to the minimum Euclidean distances to the corresponding deformed template model. However, it would be time-consuming to calculate all the distances between points of the mesh surface and the visual hull. We therefore replace the vertices with their nearest volume data and represent the visual hull to be labeled by boolean values; for each voxel, the k-neighboring data are examined in turn until the surface data are met. The segmented body is then utilized to extract the articulated pose of the next frame.

B. Sampling Model

The voxel size is set to 10 mm in this paper, so the number of voxels in a model is about 100,000. Sample data are selected randomly from each body segment for motion estimation, and the number of samples in each body segment is the same in order to ensure the equal significance of each limb during tracking. About five percent of the data are chosen for tracking, as shown in Fig. 2.

The twist representation and exponential coordinates given in [17], [18] are employed to express the rigid motion of the sample model, which is restricted to move with articulated constraints in a high-dimensional configuration space of 42 DOF. The global translation and rotation are given six DOF; wrist, knee, and ankle joints are defined with two degrees of freedom; and shoulder, hip, neck, and upper body joints are given three degrees of freedom. We then define the state of the sample by a vector $\chi = (t_1, t_2, t_3, \theta_1, \theta_2, \ldots, \theta_{39})$ that consists of the three parameters of the global translation and 39 rotation angles.
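The per-segment sampling can be sketched as below, assuming the voxels already carry segment labels from Section IV-A; the names are ours, while the fraction and segment count follow the text.

```python
# Draw the same number of sample voxels from each of the 15 segments so
# that every limb carries equal weight in tracking. Illustrative sketch.
import numpy as np

def sample_model(voxels, labels, fraction=0.05, num_segments=15, seed=0):
    """voxels: (M,3) voxel coords; labels: (M,) ids in 0..num_segments-1."""
    rng = np.random.default_rng(seed)
    per_segment = int(len(voxels) * fraction / num_segments)
    picks = []
    for s in range(num_segments):
        idx = np.flatnonzero(labels == s)
        # each segment contributes (at most) the same number of samples
        take = min(per_segment, len(idx))
        picks.append(rng.choice(idx, size=take, replace=False))
    return np.concatenate(picks)
```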
The global translation is expressed by the following 4×4 matrix:

$$T_G = \begin{pmatrix} 0 & 0 & 0 & t_1 \\ 0 & 0 & 0 & t_2 \\ 0 & 0 & 0 & t_3 \\ 0 & 0 & 0 & 0 \end{pmatrix} \qquad (1)$$

For the joint angle $\theta_i$, the rotation axis $\omega_i$ and the corresponding joint $J_i$ are known from the skeleton model. The rotation $R(\theta_i)$ is given by

$$R(\theta_i) = \begin{pmatrix} e^{\hat{\omega}_i \theta_i} & (I - e^{\hat{\omega}_i \theta_i})(\omega_i \times J_i) + \omega_i \omega_i^T J_i \theta_i \\ 0 & 1 \end{pmatrix} \qquad (2)$$

where $\hat{\omega}_i$ is the matrix representation of $\omega_i$, as described in [19]. The rigid transformation of body segment $i$ is represented by

$$T_i = \prod_{j \in k(i)} R(\theta_j) + T_G \qquad (3)$$

where $k(i)$ is the ordered set of joint angles that affect the transformation of segment $i$. The transformation of a vertex $v_j$ associated with body segment $i$ is then obtained by $T_i \left(\begin{smallmatrix} v_j \\ 1 \end{smallmatrix}\right)$, i.e., by applying $T_i$ to $v_j$ in homogeneous coordinates. A corresponding deformed model of the sample can be constructed once the value of $\chi$ is given.

C. The Weighting Function

The articulated sample model is employed to estimate the pose of the following frame. Assume both X and Z are models represented by volume data. A difference function Diff(X, Z) is then given by

$$\mathrm{Diff}(X, Z) = \frac{1}{N} \sum_{i=1}^{N} \bigl(1 - p(x_i, Z)\bigr) \qquad (4)$$

where $x_i$ is a voxel and N is the number of voxels of X. If $x_i$ is also in Z, the value of $p(x_i, Z)$ is 1; otherwise it is 0. We compute the bounding box of the volumetric model Z as shown in Fig. 3, and denote the volume data in the bounding box by a binary vector. For each voxel in the labeled model (a), the voxel is in the target Z if it lies in the bounding box and the corresponding value is 1. Therefore, we compare each voxel in X to Z only once when computing Diff(X, Z).

The model shown in Fig. 2 has been segmented into 15 body segments, so we also measure the difference of two models by a function SDiff that takes each body limb into consideration. Assume $X_\chi$ is the model obtained from the configuration vector $\chi$ and Z is the visual hull to be tracked; the difference function $\mathrm{SDiff}(X_\chi, Z)$ is then

$$\mathrm{SDiff}(X_\chi, Z) = \frac{1}{15} \sum_{i=1}^{15} \mathrm{Diff}\bigl(X_\chi(i), Z\bigr) \qquad (5)$$

where $X_\chi(i)$ is body segment $i$. It is easy to show that $\mathrm{Diff}(X_\chi, Z) \le \mathrm{SDiff}(X_\chi, Z)$. The exponential weighting function for annealed particle filtering [2] is then given by

$$w(X_\chi, Z) = \exp\bigl(-\mathrm{SDiff}(X_\chi, Z)\bigr) \qquad (6)$$

In the process of motion tracking, new particles are generated by the annealing process according to this weighting function.

Fig. 6. The matching errors represented by the difference function Diff(X, Z), where X is the deformed model reconstructed by the motion tracking methods and Z is the target model. (Horizontal axis: frame number; vertical axis: Diff(X, Z) (%); curves: ORG and APF with local memorization.)

Fig. 7. The matching errors represented by the difference function SDiff(X, Z). (Horizontal axis: frame number; vertical axis: SDiff(X, Z) (%); curves: ORG and APF with local memorization.)

D. APF with Local Memorization

In this section, we describe the motion tracking approach, which seeks 3D optimal kinematic chains fitting each body segment. It is known that stochastic methods such as PF and APF can obtain globally optimal solutions. Condensation [20] has shown its robustness for tracking in low-dimensional configuration spaces, and APF improved on it by annealing to generate new particles, resulting in robust pose estimation with 29 DOF. However, it is hard to capture the motion of hands or feet when the human body is assumed to move in a higher-dimensional configuration space, for example the 42 DOF defined in this paper. It is also hard to extract 3D postures from rapid motion clips.
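To make the weighting of Section IV-C concrete before turning to the details of the local memorization, the following is a minimal sketch of equations (4)-(6), assuming the target Z is encoded as a boolean occupancy grid over its bounding box as described above; the function names are ours.

```python
# Voxel-overlap difference (4), its per-segment average (5), and the
# exponential particle weight (6). Illustrative sketch only.
import numpy as np

def diff(x_voxels, z_grid, origin):
    """x_voxels: (N,3) integer voxel coords of X; z_grid: boolean 3D
    occupancy of Z; origin: (3,) lower corner of Z's bounding box."""
    rel = x_voxels - origin
    inside = np.all((rel >= 0) & (rel < z_grid.shape), axis=1)
    hits = z_grid[tuple(rel[inside].T)]       # p(x_i, Z) for in-box voxels
    return 1.0 - hits.sum() / len(x_voxels)   # out-of-box voxels are misses

def sdiff(x_voxels, x_labels, z_grid, origin, num_segments=15):
    return np.mean([diff(x_voxels[x_labels == s], z_grid, origin)
                    for s in range(num_segments)])

def weight(x_voxels, x_labels, z_grid, origin):
    return np.exp(-sdiff(x_voxels, x_labels, z_grid, origin))
```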
113
100
100 ORG APF with local memorization
90
Error of The Right Foot (%)
90
Error of The Head (%)
80 70 60 50 40 30 20 10
ORG APF with local memorization
80 70 60 50 40 30 20 10
0 0
10
20
30
40
50
60
70
0 0
80
10
20
Frame Number
50
60
70
80
70
80
Fig. 10. The matching error of the right foot.
100
100
ORG APF with local memorization
90
Error of The Right Calf (%)
Error of The Left Hand (%)
40
Frame Number
Fig. 8. The matching error of the head.
90
30
80 70 60 50 40 30 20 10
ORG APF with local memorization
80 70 60 50 40 30 20 10
0 0
10
20
30
40
50
60
70
0 0
80
Frame Number
10
20
30
40
50
60
Frame Number
Fig. 9. The matching error of the left hand.
Fig. 11. The matching error of the right calf.
The reason for this difficulty is that the motion tracking process guards the global solution by the distribution of the weights, while the local fitness of each body segment is ignored to some extent. In contrast, our method enforces the local fitness of each segment while seeking the globally optimal solution. We represent a normalized weighted particle by splitting it into 15 elements:

$$(\chi, \pi) = \{(\chi_1, w_1), (\chi_2, w_2), \ldots, (\chi_{15}, w_{15})\} \qquad (7)$$

where $\pi$ is the corresponding weight and $w_i = \mathrm{Diff}(X_\chi(i), Z)$. If an element of $\chi$ affects the transformation of body segment $i$, the corresponding element of $\chi_i$ is set equal to it; all other elements are 0. The particle has the property that an element of $\chi_i$ equals the corresponding element of $\chi_j$, or one of them is 0.

The tracking algorithm proceeds in the following steps:
(1) The motion tracking process starts at layer M.
(2) We deform the labeled sample model to generate N unweighted particles.
(3) Assign the normalized global weights to all particles as described in [2] and represent them as in equation (7).
(4) Select some particles according to their local fitnesses and combine them to generate new ones (a sketch of this step is given below). Assume particle $\chi^i$ fits body segment $i$ well; then its component $\chi^i_i$ is needed for the combination of the new particle $\chi^{\mathrm{new}}$, so we represent the combined particle by $\{(\chi^1_1, w^1_1), (\chi^2_2, w^2_2), \ldots, (\chi^{15}_{15}, w^{15}_{15})\}$. However, this does not satisfy the particle property described above. We therefore define the $k$th element of $\chi_i$ as the mean of the nonzero $k$th values of the $\chi^i_i$, and represent the generated new particles as in equation (7).
(5) Choose N new particles randomly according to the normalized weights.
(6) The selected particles are used to initialize the next layer; the process is repeated until layer 1 is reached.
(7) The optimal solution is estimated by combining the particles of layer 1 according to the normalized weights.

In our experiments, we found that setting the layer number M = 10 with particle number N = 300 worked well for human motion tracking. We extract the articulated posture from the volumetric model of frame t+1 according to the estimated kinematic model of frame t: the labeled volume data of the previous frame are deformed to match the current model by our proposed method, and the deformed template model of frame t is then utilized to recover the surface details of the next frame.
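Step (4) can be sketched as follows, assuming each particle is a 42-element configuration vector and each segment has a boolean mask marking the elements that affect it; the averaging of shared elements stands in for the "mean of the nonzero kth values" above, and all names are ours.

```python
# Combine the per-segment best particles into one new particle.
# Illustrative sketch of the local-memorization combination step.
import numpy as np

def combine_particles(particles, local_errors, segment_masks):
    """particles: (N,42) configuration vectors chi;
    local_errors: (N,15) Diff(X_chi(i), Z) per particle and segment;
    segment_masks: (15,42) boolean, True where element k affects segment i."""
    best = np.argmin(local_errors, axis=0)   # best-fitting particle per segment
    total = np.zeros(particles.shape[1])
    count = np.zeros(particles.shape[1])
    for i in range(segment_masks.shape[0]):
        mask = segment_masks[i]
        total[mask] += particles[best[i], mask]
        count[mask] += 1
    count[count == 0] = 1                    # elements affecting no segment
    return total / count                     # average elements shared by segments
```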
Fig. 12. From top to bottom: the segmented visual hulls, motion capture data and the reconstructed surfaces.
V. SURFACE ESTIMATION

Blend skinning approaches such as linear blend skinning are widely used in shape reconstruction when 3D skeleton poses are provided. Although simple to implement, these methods are not good at recovering surface details, as they merely couple vertices to the underlying bones. Vlasic [5] and Gall [6] took the 3D models generated by skinning methods as initial surface estimates and iteratively deformed them to recover refined models.

In our surface reconstruction process, we apply the obtained transformations directly to the deformed template model of the previous frame, so the generated surface is already similar to the target. Non-rigid deformation should nevertheless be taken into consideration to recover the surface details. It is intuitive to take the surface points of the visual hulls as constraints; unfortunately, the visual hull suffers from noise caused not only by the quality of the silhouettes but also by the reconstruction method. We therefore deform the mesh surface to match the 2D silhouettes instead of the visual hull. The refined surface is generated by solving the following least-squares optimization problem:

$$\arg\min_{V} \left\{ \|LV - \delta\|^2 + \alpha \|V - TV^*\|^2 + \beta \|C_{sil} V - q_{sil}\|^2 \right\} \qquad (8)$$

where $L$ is the Laplacian matrix and $\delta$ are the differential coordinates of the mesh surface of the previous frame with vertices $V^*$. $T$ are the transformation matrices of the body segments, $C_{sil}$ is a parameter matrix expressing the constraints of the silhouette rims, and $q_{sil}$ are the corresponding constrained points, as described in [5], [6].
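A sketch of solving equation (8) by stacking the three terms into one overdetermined linear system (solved per coordinate) is given below; the dense matrices and the names are our illustrative assumptions, while a real system of this size would be sparse.

```python
# Solve Eq. (8): Laplacian term + rigid prior + silhouette constraints.
# Illustrative dense least-squares sketch, not the authors' implementation.
import numpy as np

def refine_surface(L, delta, TVstar, C_sil, q_sil, alpha=1.0, beta=1.0):
    """L: (n,n) Laplacian; delta: (n,3) differential coordinates of the
    previous frame's surface; TVstar: (n,3) rigidly transformed vertices
    T V*; C_sil: (m,n) selector of silhouette-rim vertices; q_sil: (m,3)
    target rim points."""
    A = np.vstack([L,
                   np.sqrt(alpha) * np.eye(L.shape[0]),
                   np.sqrt(beta) * C_sil])
    b = np.vstack([delta,
                   np.sqrt(alpha) * TVstar,
                   np.sqrt(beta) * q_sil])
    # lstsq minimizes ||A V - b||^2, which matches Eq. (8) after taking
    # square roots of the weights alpha and beta.
    V, *_ = np.linalg.lstsq(A, b, rcond=None)
    return V                                  # (n,3) refined vertices
```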
The refined surface is then used to segment the corresponding visual hull for 3D motion estimation of the next frame. By repeating this process, the time-varying sequence with underlying skeleton chains can be reconstructed.

VI. EXPERIMENTAL RESULTS

We first test our algorithm on the public dataset provided by Gall et al. [6]. The purpose of our approach is to extract 3D articulated kinematic chains directly from a time-varying volume sequence. In our implementation, the binary representation of the volumetric models makes it easy to compare models and compute distances. We test our method by extracting a quick motion of the human body with 42 DOF and comparing against the APF method [2]. As seen in Fig. 4, APF cannot locate the positions of the hands and feet correctly, while our method works well. The difference values Diff(Xχ(i), Z) and the global difference are depicted in Fig. 5. We first calculate the differences between the visual hull to be tracked and the previous one by comparing the voxels of the volume data; the matching errors obtained by the APF method and our algorithm are also presented.
As shown in Fig. 5, our method tracks the poses of the head, the left foot (LFoot), the left lower leg (LLLeg), and the right lower leg (RLLeg) especially well. For instance, it reduces the mismatching rate of the left foot from 35% to 7% compared to the APF method. Figs. 6-11 depict the motion tracking results for the segments of the visual hulls. The 3D volumetric models reconstructed from the motion tracking data cover about 93% of the target models on average. As seen in Figs. 6 and 7, the distributions of Diff(X, Z) and SDiff(X, Z) differ at the beginning; the difference decreases after motion tracking, as our method tends to average the mismatching rate of each body segment to obtain a globally optimal solution. The first five frames do not change because their errors are lower than 5%, in which case the pose is kept as that of the previous frame. In addition, each body segment of the human body is located correctly. We depict the mismatching rates of the head, the left hand, the right foot, and the right calf in Figs. 8-11. Our method demonstrates its ability to capture the 3D postures of the limbs, such as the hands and feet. It should be mentioned that the motion features of the body segments can be extracted to analyze the movement range of the human body, which can be used to extract key motions or to recognize motion. Fig. 12 shows the motion capture data and the time-varying surfaces reconstructed by our method.

VII. CONCLUSION

We proposed a model-based motion tracking method capable of extracting 3D articulated postures with 42 degrees of freedom through a sequence of visual hulls. Although the accuracy relies on the quality of the visual hulls, our tracking method works well for human tracking in high-dimensional configuration spaces compared to other methods such as the APF algorithm. We are convinced that if subject-specific models are provided, as mentioned in [9], the results will be more robust. A better model segmentation method for visual hulls is desirable to decrease the error caused by segmentation noise: the volume data should be analyzed first, or iterative methods for model segmentation are preferred. In addition, more attention should be paid to the transformation of vertices near joints, since the Laplacian deformation we use to refine the surface details may cause twisting artifacts near joints.

ACKNOWLEDGMENTS

Part of this work was supported by the National Institute of Information and Communications Technology (NICT), Japan.

REFERENCES

[1] W. E. Lorensen and H. E. Cline, "Marching cubes: A high resolution 3D surface construction algorithm," SIGGRAPH Comput. Graph., vol. 21, no. 4, pp. 163-169, July 1987.
[2] J. Deutscher and I. Reid, "Articulated body motion capture by stochastic search," Int. J. Comput. Vision, vol. 61, no. 2, pp. 185-205, February 2005.
[3] I. Mikić, M. Trivedi, E. Hunter, and P. Cosman, "Human body model acquisition and tracking using voxel data," Int. J. Comput. Vision, vol. 53, no. 3, pp. 199-223, July 2003.
[4] L. Kavan, S. Collins, J. Žára, and C. O'Sullivan, "Geometric skinning with approximate dual quaternion blending," ACM Trans. Graph., vol. 27, no. 4, pp. 1-23, October 2008.
[5] D. Vlasic, I. Baran, W. Matusik, and J. Popović, "Articulated mesh animation from multi-view silhouettes," in SIGGRAPH '08, New York, USA, 2008, pp. 1-9.
[6] J. Gall, C. Stoll, E. de Aguiar, C. Theobalt, B. Rosenhahn, and H.-P. Seidel, "Motion capture using joint skeleton tracking and surface estimation," IEEE Conf. Computer Vision and Pattern Recognition, pp. 1746-1753, June 2009.
[7] J. Deutscher, A. Davison, and I. Reid, "Automatic partitioning of high dimensional search spaces associated with articulated body motion capture," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 2, pp. 669-676, 2001.
[8] R. Plänkers and P. Fua, "Articulated soft objects for multiview shape and motion capture," IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 9, pp. 1182-1187, 2003.
[9] S. Corazza, L. Mündermann, and T. P. Andriacchi, "Automatic generation of a subject specific model for accurate markerless motion capture and biomechanical applications," IEEE Trans. Biomedical Engineering, vol. 57, no. 4, pp. 806-812, April 2010.
[10] S. Corazza, L. Mündermann, E. Gambaretto, G. Ferrigno, and T. Andriacchi, "Markerless motion capture through visual hull, articulated ICP and subject specific model generation," Int. J. Comput. Vision, vol. 87, no. 1, pp. 156-169, March 2010.
[11] R. Tadano, T. Yamasaki, and K. Aizawa, "Fast and robust motion tracking for time-varying mesh featuring Reeb-graph-based skeleton fitting and its application to motion retrieval," IEEE Int. Conf. Multimedia & Expo, Th-P9.6, pp. 2010-2013, July 2007.
[12] N. Lee, T. Yamasaki, and K. Aizawa, "Hierarchical mesh decomposition and motion tracking for time-varying-meshes," IEEE Int. Conf. Multimedia & Expo, pp. 1565-1568, June 2008.
[13] N. Lee, T. Yamasaki, and K. Aizawa, "Motion tracking of time-varying mesh through surface gradient matching with multi-temporal registration," ACM SIGGRAPH 2008, Posters, August 2008.
[14] W. Luo, T. Yamasaki, and K. Aizawa, "Marker-less human motion tracking using visual hulls and time-varying surfaces reconstruction," Meeting on Image Recognition and Understanding, IS3-51, July 2010.
[15] P. J. Besl and H. D. McKay, "A method for registration of 3-D shapes," IEEE Trans. Pattern Anal. Mach. Intell., vol. 14, no. 2, pp. 239-256, 1992.
[16] W. Luo, T. Yamasaki, and K. Aizawa, "3D pose estimation in high dimensional search spaces with local memorization," Picture Coding Symposium 2010, accepted for publication.
[17] C. Bregler, J. Malik, and K. Pullen, "Twist based acquisition and tracking of animal and human kinematics," Int. J. Comput. Vision, vol. 56, no. 3, pp. 179-194, 2004.
[18] C. Bregler and J. Malik, "Tracking people with twists and exponential maps," IEEE Conf. Computer Vision and Pattern Recognition, 1998.
[19] R. M. Murray, Z. Li, and S. S. Sastry, A Mathematical Introduction to Robotic Manipulation, 1st ed. CRC, March 1994.
[20] M. Isard and A. Blake, "CONDENSATION - conditional density propagation for visual tracking," Int. J. Comput. Vision, vol. 29, no. 1, pp. 5-28, August 1998.
[21] J. Starck and A. Hilton, "Model-based multiple view reconstruction of people," in Int. Conf. Computer Vision, October 2003.
[22] K. Varanasi, A. Zaharescu, E. Boyer, and R. Horaud, "Temporal surface tracking using mesh evolution," in European Conf. Computer Vision, pp. 30-43, October 2008.
[23] E. de Aguiar, C. Theobalt, C. Stoll, and H.-P. Seidel, "Marker-less 3D feature tracking for mesh-based human motion capture," in Proc. ICCV HUMO07, pp. 1-15, 2007.
[24] E. de Aguiar, C. Stoll, C. Theobalt, N. Ahmed, H.-P. Seidel, and S. Thrun, "Performance capture from sparse multi-view video," in SIGGRAPH '08, pp. 1-10, August 2008.
[25] E. de Aguiar, C. Theobalt, C. Stoll, and H.-P. Seidel, "Markerless deformable mesh tracking for human shape and motion capture," IEEE Conf. Computer Vision and Pattern Recognition, 2007.
[26] M. Botsch and O. Sorkine, "On linear variational surface deformation methods," IEEE Trans. Visualization and Computer Graphics, vol. 14, no. 1, pp. 213-230, January 2008.