3D Modeling of Humans with Skeletons from Uncalibrated Wide Baseline Views

Chee Kwang Quah1, Andre Gagalowicz2, Richard Roussel2, and Hock Soon Seah3

1 Nanyang Technological University, School of Computer Engineering, Singapore
[email protected]
2 INRIA, Domaine de Voluceau, BP105, 78153 Le Chesnay, France
{andre.gagalowicz, richard.roussel}@inria.fr
3 Nanyang Technological University, School of Computer Engineering, Singapore
[email protected]
Abstract. In order to achieve precise, accurate and reliable tracking of human movement, a 3D human model that is very similar to the subject is essential. In this paper, we present a new system to (1) precisely construct the surface shape of the whole human body, and (2) estimate the underlying skeleton. We use a set of images of the subject together with a generic anthropometric 3D model, made up of regular surfaces and a skeleton, which is adapted to the specific subject. We developed a three-stage technique that uses human shape feature points and limb outlines, working together with the generic 3D model, to yield the final customized 3D model. The first stage is an iterative camera-pose calibration and 3D characteristic-point reconstruction-deformation algorithm that gives an initial customized 3D model. The second stage refines this initial model by deformation via the silhouette-limb information, thus obtaining the surface skin model. In the final stage, we use the results of the skin deformation to estimate the underlying skeleton. Our final results demonstrate that the system is able to construct a quality human model, in which the skeleton is constructed and positioned automatically.
1 Introduction

In the context of sports science, augmented reality, and future applications such as free-viewpoint video [2], 3D television and media production, precise and accurate tracking of human movements is needed. To date, many computer-vision-based human tracking systems have been proposed, e.g. [5], [13], [19], [23]. However, these methods employ overly generic models, e.g. stick figures or cylinders. The tracking process is very sensitive to the shape model and animation used, and considerable effort is spent tuning these parameters [7]. The works of [6], [9] also stress the importance of the quality of the 3D model used for tracking. Thus, it is inappropriate to use, for example, a generic "average human" model for accurate and precise tracking of humans, who come in different shapes and sizes.

In this paper, we focus our attention on building a good customized model, since it is crucial to track the human subject with a very similar 3D human model. The surface skin and the underlying skeleton will be built and fitted to our subject. The key challenge for our system is to accomplish this task from a limited set of images acquired from various wide-baseline viewpoints. For reconstruction we use a maximum of 6 images. The resultant model maintains correct object-modeling topology, which is important for future usage, e.g. character skinning. In addition, our method does not need any special calibration tools. In section 2, we review some existing modeling systems. In section 3, we propose our modeling-system framework, and in sections 4, 5 and 6 we describe the system in detail. Finally, we show our results in section 7.

A. Gagalowicz and W. Philips (Eds.): CAIP 2005, LNCS 3691, pp. 379-389, 2005. © Springer-Verlag Berlin Heidelberg 2005
2 Existing Modeling Systems

Existing vision-based reconstruction systems that mainly deal with constructing the surface skin model fall into two categories: (1) 3D laser-scanner systems, and (2) passive multi-camera systems. The 3D laser-scanner systems [26], [27] capture the entire surface of the human body in about 15 to 20 seconds with a resolution of 1 to 2 mm. However, the drawbacks of such devices are that (1) they are highly priced, at a few hundred thousand dollars, and (2) they require the subject to stay still and rigid for the whole duration of the scan (about 15 seconds for full-body coverage), which is quite restrictive in practice. Passive multi-camera systems, on the other hand, are much cheaper, and video cameras are more easily available. Most existing methods, e.g. [8], [22], use shape-from-silhouette-related approaches, which require (1) the subject to be segmented from the image background, and (2) the cameras to be calibrated beforehand using calibration tools. Shape-from-silhouette approaches also give rise to 'blocky' results if there are insufficient views (as shown by the theoretical proof in [12]). A more recent approach, e.g. [16], proposes 3D reconstruction from uncalibrated views using feature correspondences, but requires the subject to remain still and rigid for about 40 seconds during the video capture of the whole body. Moreover, the reconstructed model may contain non-manifold problems, e.g. holes and open edges.

There is also research that attempts to estimate the joint locations more precisely, usually with optical, magnetic or mechanical motion-capture systems, e.g. [14], [15], [20]. However, these methods require tedious post-processing to clean up the motion-capture data. A more recent approach [21] estimates the skeleton from a sequence of volume data of rigid bodies, but the resultant skeleton is an estimated stick-figure-like structure. These structures do not contain sufficient anatomical detail for realistic character animation and skinning. Another alternative for acquiring the human skeleton is X-ray imaging; however, X-ray devices are not easily available, and tedious post-processing may be required to integrate the data from the cameras and the X-rays.
3 Our Modeling System

Our proposed human-model construction starts from a generic human model in a standing position (fig. 1). The generic 3D human model that we use is Ergoman, provided by MIRAGES, INRIA, France. The surface of the model is made up of about 17000 vertices and 34000 triangular faces. Inside this surface is the underlying generic skeleton. The anthropometric measurements of the subject are used to deform this generic model into a specific model. The strategy of our framework is motivated by the method in [17], which was used for the construction of human faces. In our system, the subject's body is used as the calibration tool. The 3D generic model guides the camera calibration, which, in turn, allows 3D point reconstruction to yield the camera poses and produce the customized 3D model. Our image acquisition for all views is instantaneous. The inputs to the system are:
1) 2D images from different views, i.e. wide baseline (ideally with good view coverage of the subject), acquired at a single time instance using several gen-locked cameras.
2) The generic 3D human model (i.e. surface and skeleton) and its 32 selected surface characteristic points (fig. 2).
3) 2D/3D feature-point matches between the image views and the 3D human-model points.
The outputs of the system are:
1) Calibrated camera poses of the different views.
2) A customized 3D surface model with a regular surface that overlays nicely onto the images of the subject's silhouette limbs in all views. This customized model has the geometry of the shape and size of our subject.
3) The estimated position and reconstruction of the customized skeleton of the subject.
Fig. 1. (a) Generic surface model, (b) generic skeleton, (c) overall generic model
The block diagram of our model-construction system is shown in fig. 3. The task can be performed off-line and comprises three main stages: (1) camera calibration and reconstruction of the model characteristic points (section 4), (2) refinement of the model via silhouette-limb deformation (section 5), and (3) skeleton estimation (section 6). The testing data are images acquired from different camera views, provided by MIRAGES, INRIA, France. Fig. 2 shows an example of the selected feature points on the 2D images corresponding to the 3D points. These correspondences can be established via an interactive point-matching tool that we have developed; this ensures that the correspondences are 100% correct, so that the calibration is always stable. Although automatic body-part recognition has been studied, e.g. in [24], in our wide-baseline and cluttered environment automatic feature detection becomes highly ill-posed. In our set-up, we utilize a set of 32 surface characteristic points. These characteristic points provide an over-determined set of information and sufficient view coverage for camera calibration and point reconstruction.
Fig. 2. Example of feature points on the 3D generic model and their correspondences on the 2D image
Fig. 4. Triangulation of projected rays when the rays do not intersect exactly (R is the reconstructed point)
Fig. 3. Block diagram for model construction
Fig. 5. Example showing some of the deformation vectors
4 Camera Calibration and Feature Reconstruction

This section describes the first stage of our model-adaptation system (first 3 blocks of fig. 3). Using the subject's 2D characteristic points from the images together with their corresponding points on the 3D generic model (fig. 3), we iterate the process comprising camera calibration and 3D generic-model point deformation (3D reconstruction) until convergence is attained. At the start, the 3D characteristic points of the generic model do not project correctly onto the images. As the process iterates, these 3D characteristic points converge together with the camera poses, yielding a set of sparse deformed 3D model points and calibrated camera poses. Using the sparse deformed model points, we complete our initial customized 3D model by interpolating the deformations with radial basis functions (RBF).

4.1 Camera Calibration

In this module, we use the POSIT (pose iteration) algorithm [3] to calibrate the camera extrinsic parameters. The intrinsic parameters can be obtained using simple camera-calibration software such as [25]. Another alternative that we studied is to add an additional layer above POSIT in order to search for the intrinsic parameters: we regard POSIT as a function of the intrinsic parameters, which we minimize using simplex minimization.

4.2 Feature Point Reconstruction

Using the calibrated camera parameters and the 3D/2D correspondences, we perform 3D point reconstruction to deform the 3D characteristic points toward their new positions. The 3D point reconstruction is achieved by triangulating the rays projected from the characteristic image points (fig. 4). The algorithm takes into account that the rays will not intersect exactly when the calibration is not perfect, by minimizing the sum of squared distances to the projected rays from all available views. We only reconstruct points seen in more than one image. When the process converges, we obtain a final set of reconstructed 3D points Ri. We also have the original set of 3D points Pi from the initial generic model. From Pi and Ri we form the set of deformation vectors PiRi (see fig. 5 for an example).
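The least-squares triangulation described above has a small closed-form solution: the point minimizing the sum of squared distances to a set of rays satisfies a 3x3 linear system built from the projectors orthogonal to each ray. A minimal sketch, assuming rays given as camera centres and directions (the function name and the toy rays are illustrative, not from the paper):

```python
import numpy as np

def triangulate(origins, directions):
    """Least-squares 3D point R minimizing the sum of squared
    distances to a set of rays (o_i, d_i), as in fig. 4."""
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for o, d in zip(origins, directions):
        d = d / np.linalg.norm(d)
        M = np.eye(3) - np.outer(d, d)   # projector orthogonal to the ray
        A += M
        b += M @ o
    return np.linalg.solve(A, b)

# Two rays that both pass through the point (1, 1, 0):
o1, d1 = np.array([0., 0., 0.]), np.array([1., 1., 0.])
o2, d2 = np.array([2., 0., 0.]), np.array([-1., 1., 0.])
R = triangulate([o1, o2], [d1, d2])   # -> approximately (1, 1, 0)
```

With imperfect calibration the rays no longer meet, and the same solve returns the point closest to all of them simultaneously.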
4.3 Interpolating the Deformation

The deformed characteristic 3D model points are very sparsely distributed, and these sparse points are not sufficient to represent the complete 3D model. Therefore, we use the sparse points together with the generic 3D model to complete the 3D model deformation via interpolation. The interpolation is done using radial basis functions (RBF); RBF data interpolation has been researched and used successfully in e.g. [4], [18]. We can write the linear system as:

[ σ(||P1 - P1||)  σ(||P1 - P2||)  ...  σ(||P1 - PN||) ] [ Ax1 Ay1 Az1 ]   [ PRx1 PRy1 PRz1 ]
[ σ(||P2 - P1||)       ...        ...        ...       ] [ Ax2 Ay2 Az2 ] = [ PRx2 PRy2 PRz2 ]
[      ...             ...        ...        ...       ] [     ...      ]   [       ...       ]
[ σ(||PN - P1||)       ...        ...  σ(||PN - PN||) ] [ AxN AyN AzN ]   [ PRxN PRyN PRzN ]   (1)
where:
1) PR is the set of deformation vectors computed via 3D reconstruction of the characteristic points; P and R are the original and reconstructed characteristic points.
2) σ(||Pi - Pj||) is the radial basis function; here we use σ(||Pi - Pj||) = ||Pi - Pj||.
3) A (i.e. Axi, Ayi, Azi) are the weights that we seek.

The weights A can be obtained by solving equation (1) with a standard linear-algebra method such as LU decomposition. Having obtained the deformation weights A, we use them to deform the rest of the model points with equation (2):

Fx,y,z(P) = Σ_{i=1}^{N} Axi,yi,zi · σ(||P - Pi||)   (2)

where P is any 3D point of the generic model that we need to deform.

4.4 Initial Results of the Customized Surface Model

Up to this point, we have an initial customized surface model (fig. 6). We can notice from the results that the projected local silhouette limbs of the initial customized model do not overlay exactly onto the images, e.g. on the inner legs of the subject. Fig. 7 shows the feature-point reprojection error in pixels plotted against the number of iterations. The process converges after about 30 iterations; the mean-square reprojection error at convergence is about 1.1 pixels with a standard deviation of 0.9 pixels.
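The RBF interpolation of section 4.3 amounts to one linear solve (equation 1) followed by kernel evaluation (equation 2). A minimal numerical sketch with σ(r) = r; the points below are toy data, not the paper's:

```python
import numpy as np

# Sparse characteristic points P of the generic model (N x 3) and
# their reconstructed target positions R; sigma(r) = r as in the paper.
P = np.array([[0., 0., 0.], [1., 0., 0.], [0., 1., 0.], [0., 0., 1.]])
R = P + np.array([[.1, 0., 0.], [.2, 0., 0.], [.1, .1, 0.], [0., 0., .2]])

# N x N matrix of eq. (1): entry (i, j) = ||Pi - Pj||.
K = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1)
A = np.linalg.solve(K, R - P)   # weights Axi, Ayi, Azi (LU under the hood)

def deform(X):
    """Eq. (2): displace arbitrary model points X by the RBF field."""
    phi = np.linalg.norm(X[:, None, :] - P[None, :, :], axis=-1)
    return X + phi @ A

# The interpolant reproduces the prescribed displacements exactly,
# so deform(P) returns R; any other model vertex is moved smoothly.
```

The same fitted weights can later be reused on any point set, which is how the generic skeleton is deformed in section 6.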
Fig. 6. Initial model - not precisely fitted
Fig. 7. Reprojection error vs. number of iterations
5 Surface Model Refinement

The point-based deformation restores the global surface geometry of the human subject. However, more local elements, such as the curves on the shoulders and legs of the subject, are not precisely reconstructed. To act on this set of local elements, we designed an algorithm to deform the human body based on its silhouette contours, called the limbs. In this stage, we automatically extract the silhouette edges of the model from the various views and deform them so that they correspond exactly to the respective silhouette curves in the images (fig. 8).
5.1 Silhouette Extraction

5.1.1 Silhouette from the Initial Model
This process extracts the silhouette curve from the initial surface model of section 4. We follow the method in [17] to extract the silhouette. We have sped up the process by (1) finding the contour edges [11] via an XOR operation, and (2) checking for possible intersections between the contour edges as we travel along the bounding silhouette (because our subject is highly concave).

5.1.2 Silhouette from Images
The segmentation and extraction of silhouette pixels from static 2D images may be done either (1) automatically, using edge detection, or (2) interactively. Many edge-detection algorithms for image segmentation have been proposed over the last decades, e.g. [1]. However, using edge detection to segment a continuous closed contour from a noisy image is very difficult. The only way to achieve this is to acquire the images in a well-controlled environment, e.g. by making the subject wear specially colored clothing. If a well-controlled environment is not available, we bring out the silhouette features interactively, using the curve-digitizing tools (e.g. Bezier-curve drawing) available in common commercial software such as Photoshop. We perform edge linking after obtaining the digitized contours with either of the above methods. This ensures that topological information is maintained when we match the two sets of silhouette curves.

Fig. 8. Model refinement via deformation of silhouette curves
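The XOR contour-edge test of section 5.1.1 can be sketched as follows: an edge belongs to the contour when exactly one of its two adjacent faces is front-facing with respect to the viewpoint. The mesh layout and function name below are illustrative; the paper's actual implementation follows [11], [17]:

```python
import numpy as np

def contour_edges(verts, faces, eye):
    """Contour edges of a triangle mesh seen from 'eye': edges whose
    two adjacent faces differ in visibility (an XOR of facing flags)."""
    facing = []
    edge_faces = {}
    for f, (i, j, k) in enumerate(faces):
        n = np.cross(verts[j] - verts[i], verts[k] - verts[i])
        facing.append(float(np.dot(n, eye - verts[i])) > 0.0)
        for a, b in ((i, j), (j, k), (k, i)):
            edge_faces.setdefault((min(a, b), max(a, b)), []).append(f)
    return [e for e, fs in edge_faces.items()
            if len(fs) == 2 and (facing[fs[0]] ^ facing[fs[1]])]

# Toy example: a tetrahedron with outward-oriented faces; seen from
# (3, 3, 3) only face (1, 2, 3) is front-facing, so its three edges
# form the contour.
verts = np.array([[0., 0., 0.], [1., 0., 0.], [0., 1., 0.], [0., 0., 1.]])
faces = [(0, 2, 1), (0, 1, 3), (0, 3, 2), (1, 2, 3)]
edges = contour_edges(verts, faces, np.array([3., 3., 3.]))
```

On a concave subject these contour edges can self-intersect in projection, which is why the intersection check mentioned above is needed while traversing the bounding silhouette.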
5.2 Silhouette Curve Matching

The aim of this module is to find a good correspondence for every projected silhouette vertex with respect to the image curves, while making sure that the matching takes place in the correct order. The bottom-left diagram of fig. 8 shows an example in which simply searching for the nearest point gives the wrong matching topology. We therefore proceed by subdividing the model curve at half its curve length and seeking the closest point on the image curve. When finding the closest point, we may impose simple constraints, e.g. a maximum angle difference between the curve directions. The subdivision and closest-point search continue recursively until no points are left to match. Since our curve matching is a one-pass algorithm, the outcome may not minimize the energy between the two curves. However, we found that the matching is sufficient to complete the final deformation (next section). If the energy minimization between the two curves is not satisfactory, active contours [10] can be used to refine the registration.

5.3 Reconstruction of the Model via Silhouette Curves

The matching of correspondences between the model and image curves enables us to compute the refinements needed for the model. For each correspondence in the 2D matching, we compute its deformation vector in 3D (bottom-left diagram of fig. 8). Once we have computed all the deformation vectors from the curve matching, we use them together with the reconstructed feature points (from section 4) to deform the whole model, again completing the model with RBF as before.
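The recursive subdivision matching of section 5.2 can be sketched on polyline curves as follows. The function and the toy curves are our own illustration, and the angle constraint mentioned above is omitted for brevity; restricting the closest-point search to the current sub-curve is what preserves the matching order:

```python
import numpy as np

def match_curves(model, image):
    """Order-preserving matching of a projected model silhouette curve
    to an image curve by recursive subdivision."""
    pairs = [(0, 0), (len(model) - 1, len(image) - 1)]   # anchor endpoints

    def recurse(m0, m1, i0, i1):
        if m1 - m0 < 2 or i1 - i0 < 2:
            return
        mid = (m0 + m1) // 2          # model point at half the sub-curve
        # Search the closest image point only inside the current image
        # sub-curve, so the matching topology can never flip.
        seg = image[i0 + 1:i1]
        j = i0 + 1 + int(np.argmin(np.linalg.norm(seg - model[mid], axis=1)))
        pairs.append((mid, j))
        recurse(m0, mid, i0, j)
        recurse(mid, m1, j, i1)

    recurse(0, len(model) - 1, 0, len(image) - 1)
    return sorted(pairs)

# A model curve and the same curve slightly displaced in the image:
model = np.array([[0., 0.], [1., 0.], [2., 0.], [3., 0.], [4., 0.]])
image = model + np.array([0., 0.1])
pairs = match_curves(model, image)   # one ordered pair per model vertex
```

Each resulting 2D pair then yields one 3D deformation vector for the refinement step of section 5.3.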
6 Skeleton Estimation

Up to this point we have constructed the surface of our subject; here we estimate the underlying skeleton. Once again we make use of the deformation vectors of (1) the feature points (from section 4), and (2) the silhouette points (from section 5). These deformations from the generic model to the customized model are used to compute the RBF function. Finally, we use the RBF weights to deform the generic skeleton (fig. 1b) to yield the customized skeleton. Using RBF ensures that the transition from the generic skeleton produces a smooth customized skeleton.
7 Results and Discussion

In our system, we used at least 4 images for reconstruction. Our algorithm was implemented in C++ (without optimization) running on a Pentium 4. The whole reconstruction process takes about 5 minutes. We noticed that the bulk of the computation time is due to the silhouette computation, because we are processing a fairly dense 3D model of about 50000 edges. Fig. 9 shows the results of the final surface model, reprojected and overlaid onto the testing images. As we can see from the results, refining the initial model using the silhouette curves improves the results tremendously; the feature points alone are too sparse and do not provide enough local information. Fig. 11 shows the visual results of the estimated skeleton inside the surface model. Fig. 10 shows the silhouette-curve reprojection error in pixels plotted against the number of iterations. It took about 20 iterations to converge. The mean reprojection error of the final model reprojected onto the testing images is about 0.5 pixels.
Fig. 9. Results for reconstruction of model surface
Fig. 10. Mean reprojection error vs. number of iterations
8 Conclusions

In this article, we proposed a new method to (1) construct the skin surface model, and (2) estimate the skeleton of a human from a limited set of images acquired from different views with wide baseline. We execute a three-stage algorithm using a set of images in combination with a generic human model. In the first stage, we establish an initial model by a camera-calibration/feature-point-reconstruction loop and by interpolating the sparsely reconstructed points. The second stage matches the silhouette edges of the initial model with the image silhouettes to obtain a refinement for the final deformation. Finally, we combine the deformation results from stages 1 and 2 to estimate the underlying skeleton. The final result is a regular-surface customized model incorporating its skeleton. In future work, we will use this customized model to track the targeted subject.
Fig. 11. Visual results of the estimated skeleton inside the surface model
References
1. J Canny, A computational approach to edge detection, IEEE Trans. PAMI, 8(6), pp 679-698, 1986.
2. J Carranza, C Theobalt, M A Magnor, H-P Seidel, Free-viewpoint video of human actors, Proceedings of the SIGGRAPH 2003 Conference, pp 569-577, 2003.
3. D F Dementhon and L S Davis, Model-based object pose in 25 lines of code, International Journal of Computer Vision, 15, pp 123-141, 1995.
4. S Fang, R Raghavan and J Richtsmeier, Volume morphing methods for landmark based 3D image deformation, SPIE Int. Symp. on Medical Imaging, CA, 1996.
5. D M Gavrila and L S Davis, 3-D model-based tracking of humans in action: a multi-view approach, In CVPR, San Francisco, USA, pp 73-80, 1996.
6. P Gérard and A Gagalowicz, Human body tracking using a 3D generic model applied to golf swing analysis, MIRAGE 2003 Conf., INRIA Rocquencourt, France, March 2003.
7. L Goncalves, E Di Bernardo, E Ursella and P Perona, Monocular tracking of the human arm in 3D, Proc. of ICCV 1995, Boston, USA, pp 764-770, 1995.
8. A Hilton, D Beresford, T Gentils, R Smith, W Sun, J Illingworth, Whole-body modelling of people from multi-view images to populate virtual worlds, The Visual Computer, 16(7), pp 411-436, 2000.
9. I A Kakadiaris and D Metaxas, Three-dimensional human body model acquisition from multiple views, International Journal of Computer Vision, 30, pp 191-218, 1998.
10. M Kass, A Witkin and D Terzopoulos, Snakes: Active contour models, International Journal of Computer Vision, 1, pp 321-331, 1988.
11. L Kettner and E Welzl, Contour edge analysis for polyhedron projections, Geometric Modeling: Theory and Practice, Springer, pp 379-394, 1997.
12. A Laurentini, How far 3D shapes can be understood from 2D silhouettes, IEEE Trans. PAMI, 17, pp 188-195, 1995.
13. T B Moeslund and E Granum, A survey of computer vision-based human motion capture, Computer Vision and Image Understanding, 81, pp 231-268, 2001.
14. J F O'Brien, R E Bodenheimer Jr, G J Brostow and J K Hodgins, Joint parameter estimation from magnetic motion capture data, Proceedings of Graphics Interface 2000, Montreal, Canada, pp 53-60, May 2000.
15. M M Panjabi, V K Goel, S D Walter, Errors in the centre and angle of rotation of a joint: An experimental study, Journal of Biomechanics, 15(7), pp 537-544, 1982.
16. F Remondino, 3-D reconstruction of static human body shape from an image sequence, Computer Vision and Image Understanding, 93, pp 65-85, 2004.
17. R Roussel, A Gagalowicz, Morphological adaptation of a 3D model of face from images, MIRAGE 2003 Conf., INRIA Rocquencourt, France, March 2003.
18. D Ruprecht and H Muller, Free form deformation with scattered data interpolation methods, Geometric Modelling (Computing Suppl. 8), Springer Verlag, editors G Farin, H Hagen and H Noltemeier, pp 267-281, 1993.
19. H Sidenbladh, M J Black and L Sigal, Implicit probabilistic models of human motion for synthesis and tracking, Proc. of ECCV, pp 784-800, Copenhagen, 2002.
20. M-C Silaghi, R Plankers, R Boulic, P Fua and D Thalmann, Local and global skeleton fitting techniques for optical motion capture, Modelling and Motion Capture Techniques for Virtual Environments, Lecture Notes in Artificial Intelligence, editors N Thalmann and D Thalmann, pp 26-40, 1998.
21. C Theobalt, E de Aguiar, M Magnor, H Theisel and H-P Seidel, Marker-free kinematic skeleton estimation from sequences of volume data, Proc. ACM Virtual Reality Software and Technology, Hong Kong, pp 57-64, November 2004.
22. S Weik, A passive full body scan using shape from silhouette, Proc. ICPR 2000, Barcelona, Spain, pp 99-105.
23. C Wren, A Azarbayejani, T Darrell and A Pentland, Pfinder: real-time tracking of the human body, IEEE Trans. PAMI, 19(7), pp 780-785, 1997.
24. C Yaniz, J Rocha and F Perales, 3D part recognition method for human motion analysis, Proceedings of the International Workshop on Modelling and Motion Capture Techniques for Virtual Environments, pp 41-55, 1998.
25. J Bouguet, Camera Calibration Toolbox for Matlab. http://www.vision.caltech.edu/bouguetj/calib_doc
26. Cyberware. http://www.cyberware.com
27. Hamamatsu. http://usa.hamamatsu.com