Automated Avatar Creation for 3D Games

Andrew Hogue
Faculty of Business and IT, University of Ontario Institute of Technology, Oshawa, Ontario, Canada
[email protected]

Sunbir Gill and Michael Jenkin
Dept. of Computer Science and Engineering, York University, Toronto, Ontario, Canada
{sunbir,jenkin}@cse.yorku.ca
ABSTRACT
Immersion is a key factor in video games and virtual reality simulation environments. A user's presence in a virtual environment depends strongly on their identification with their avatar, so creating more realistic-looking avatars enables a higher level of presence. Current video games allow character customization via techniques such as hue adjustment of stock models or the ability to select from a variety of physical features, clothing and accessories for existing player models. Occasionally a user-uploadable facial texture is available for avatar customization. We propose a dramatic leap forward in avatar customization through the use of an inexpensive, non-invasive, portable stereo video camera to extract the geometry of real objects, including people, and the use of the resulting textured 3D models to drive avatar creation. The system described here generates the textured and normal-mapped 3D geometry of a personalized photorealistic user avatar suitable for animation and real-time gaming applications.
Categories and Subject Descriptors
I.4.8 [Image Processing and Computer Vision]: Scene Analysis—sensor fusion, stereo, range data, photometry, motion, tracking
General Terms
Experimentation, Design, Algorithms, Measurement
1. INTRODUCTION
Immersion is key to creating a compelling modern video game experience. A user's presence in a virtual environment is highly dependent on their degree of immersion, and presence is further augmented by the user's identification with their avatar[19]. Creating more realistic-looking avatars enables a higher level of identification with the character and thus an increased level of presence. As a consequence, many current video games support some level of user customization of the playable character. Most notable are the Mii avatars
for the Nintendo™ Wii™ game console. These are 3D caricatures that the user can customize to fit his or her personality. Adjustable attributes include hair shape and color, eye shape and color, height, width, and the ability to add accessories to the player's avatar. Other existing game environments such as Tony Hawk's Underground 2, Second Life[14] and Sony's Home[30] for the PlayStation 3 allow a certain level of player customization. Ubisoft's game Tom Clancy's Rainbow Six Vegas is perhaps the leader in terms of photorealistic avatar customization: the user can capture two images of themselves (head-on and side profile) using the Xbox Live camera, and these images are then texture mapped onto an existing model in the game, enabling increased presence. This photo-realistic customization is limited to the head but is very effective at capturing the appearance of the player. In this paper we build on existing techniques to develop a system that creates an accurate 3D representation of the user. The system described in this paper generates the 3D model without relying on an existing stock model and can be used to extract 3D models not only of humans, but also of entire rooms and scenes, to aid in avatar and virtual scene construction.
1.1 Existing 3D Modelling Approaches
Although there is a growing need for such technology, to date few effective low-cost, eye-safe systems have been developed that can create compact, accurate, photorealistic 3D representations of users for use as personal avatars. The traditional technique for generating 3D surface models of humans is laser line scanning (see [32] for example). In these systems, a laser line scanner obtains local surface geometry, and this information is then merged with surface intensity (colour) information to obtain a 3D surface model. Although this technique has been used successfully to build 3D models of actors and agents, the approach is not without its problems. First, laser line scanners are relatively slow, obtaining only a single 1D stripe of the object under view at a time. To obtain more complete 3D models it is necessary to sweep the line across the object in various directions and from various starting points. Second, the laser scanning process itself does not capture colour information, so a second visible-light capture system must be used to obtain appropriate texture information. Third, although these laser scanners are normally considered 'eye safe' due to their use of low-power visible-light lasers, many users are uncomfortable using them because the laser is fired directly into their face.
Figure 1: Recovering 3D structure. (a) and (b) show a single frame from the stereo image sequence being captured. (c) shows local temporal matches between the 3D structure from the current frame and the existing model projected into the right camera view. These matches are used to estimate the camera egomotion between frames. (d) shows the current 3D model rendered to a particular viewpoint. See text for details.

Given the problems inherent in laser scanning approaches, this paper demonstrates how passive stereo video sequences obtained under natural illumination can be used to construct avatars safely and efficiently. We demonstrate how accurate point cloud models of heads and torsos can be acquired and how compact photorealistic polygon-based representations of these point clouds can be constructed. Stereo video 3D surface reconstruction, more broadly, is an emerging technology for scanning scenes to produce 3D models. Its uses are currently being investigated in the mining industry, as a means of producing mine maps, and by law enforcement agencies, as a forensic data collection tool (see [24]). This is in large part due to its advantages over its laser-based rivals. Stereo video is relatively inexpensive, fast and mobile, and is capable of producing photorealistic texture from the captured video during 3D model creation. These qualities, coupled with its non-invasiveness and portability, make stereo video an attractive technology for scanning people's faces for a range of applications.
1.2 Stereo Video Reconstruction
The problem of obtaining 3D surface models from a pair of cameras has been studied extensively (see [4] for a review of various approaches). As the technology has advanced, it is now possible to use these algorithms to extract a dense set of 3D measurements per image frame. Integrating these measurements from different viewpoints enables the construction of a 3D point model, which can be converted to a polygonal mesh representation for fast rendering and manipulation. In developing such a system there are a number of critical design decisions that influence the algorithm's performance. Perhaps the most important of these is how the ego-motion of the stereo sensor is estimated in order to integrate measurements taken from different viewpoints. The Instant Scene Modeler [24], for example, extracts highly salient SIFT features[17] with their corresponding 3D locations from the current frame and matches them against a large database of features extracted from previous frames. The position and orientation of the camera relative to some arbitrary world frame is estimated using these matches[25]. Although this technique works well, as the database of features grows, more efficient search algorithms must be developed to achieve real-time performance. This technique also discards the inter-frame camera motion information, which can be a huge asset when creating dense 3D models.
An alternative strategy is to estimate the ego-motion using more 'traditional' computer vision techniques based on the registration of subsequent 3D point clouds. This is the approach used by the stereo-inertial AQUASENSOR[5], which couples relative ego-motion with 3D point registration algorithms to create 3D models. The AQUASENSOR combines range information extracted from stereo imagery with 3DOF orientation from an inertial measurement unit (IMU). We utilize a terrestrial version of the AQUASENSOR which lacks the IMU and is a pure vision-based approach. Although forgoing the IMU may limit accuracy when reconstructing large arbitrary scenes (on the order of tens of meters), for the application considered here the sensor moves only a few feet. This allows us to use a pure stereo vision approach for extracting 3D information from the scene. The handheld sensor records full frame-rate (30fps) stereo video to disk, which is later analyzed to extract the 3D model and the camera trajectory. It is important to note that although the sensor does not process the collected data in real-time, the algorithms used to analyze the data were developed with real-time performance in mind and with optimization can indeed run at frame rate. Accurate depth is recovered from the stereo imagery by leveraging the optimized sum-of-squared-differences (SSD) algorithm implemented in the Point Grey Triclops library[20]. There are many different stereo algorithms that could be used for this step (see [23] for an overview of other methods); however, the SSD approach provides a good tradeoff between accuracy and speed. The output of the stereo module is a dense set of 3D points per acquired image frame. The unbounded handheld motion of the camera is then estimated by tracking 3D features temporally and computing a six degree-of-freedom (6DOF) transformation that aligns consecutive frames. The sensor position and orientation over multiple frames is first estimated using a linear least-squares algorithm and then refined using a non-linear least-squares minimization. The pose is integrated over time to recover the entire trajectory of the camera as it moves through the environment. This also implies that any error in the pose estimate will accumulate over time; however, in the application presented here the accumulated error can be ignored since the total distance traveled is small. The details of the technique are presented below and summarized in Figure 1.
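To make the geometry of the stereo step concrete, the sketch below back-projects a dense disparity map into a metric 3D point cloud using a standard pinhole/disparity model. It is a generic illustration rather than the Triclops API; the parameter names (focal_px, baseline_m, cx, cy) and the minimum-disparity cutoff are assumptions introduced for the example.

```python
import numpy as np

def disparity_to_points(disparity, focal_px, baseline_m, cx, cy, min_disp=1.0):
    """Back-project a dense disparity map into 3D points in the camera frame.

    disparity  : HxW array of disparities in pixels (<= 0 means 'no match')
    focal_px   : focal length in pixels
    baseline_m : stereo baseline in metres
    cx, cy     : principal point of the reference camera
    """
    h, w = disparity.shape
    us, vs = np.meshgrid(np.arange(w), np.arange(h))
    valid = disparity > min_disp                    # reject weak or missing matches

    z = focal_px * baseline_m / disparity[valid]    # depth from disparity
    x = (us[valid] - cx) * z / focal_px
    y = (vs[valid] - cy) * z / focal_px
    return np.stack([x, y, z], axis=1), valid       # N x 3 points plus validity mask
```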
Figure 2: A recovered point cloud. Four views are shown.

Aligning sets of 3D points requires that a rotation and translation be estimated between the model and the current set of points. Typically this is done using variants of the Iterative Closest Point (ICP) algorithm[2]. ICP works well for highly accurate data such as that extracted from laser range scans. However, the 3D information extracted from stereo can be quite noisy as points get further from the camera. This causes correspondence problems which can lead to divergence. This can be overcome by utilizing motion information from the image sequence as a guide for the correspondence problem. Since the camera is moving, there is a direct correlation with the motion of the 3D point sets between frames. Computing the 2D motion from the images provides cues as to how features are moving in the environment. We retain the 3D information corresponding to this computed motion and thus know the correspondences between the two images in 3D. The 3D rotation and translation can then be computed from these corresponding points. Estimating the camera trajectory is beneficial in two ways. First, it allows the algorithm to converge faster by providing knowledge of the corresponding points between the overlapping 3D point sets. Second, it enables us to map the triangles of our output mesh to actual camera images for texturing purposes. To estimate the camera trajectory, "good" features are extracted from the reference camera at time t using the Kanade-Lucas-Tomasi (KLT) feature tracking algorithm (see [28, 18]) and are tracked from the previously captured image at time t−1. Any type of image feature could be used to estimate the camera motion; we employ the KLT tracker due to its relatively good run-time performance. A survey of other techniques can be found in [37]. Using the disparity maps extracted for both time steps, tracked features that do not have a corresponding disparity at both time t and t−1 are eliminated. Surviving features are then triangulated to determine the metric 3D points associated with each disparity. To deal with potential confounding matches in the ego-motion estimation process, we employ robust statistical estimation techniques to label the feature tracks as belonging to either a static or non-static world model. This is achieved by estimating a six degree-of-freedom (6DOF) transformation model under the assumption that the scene is stationary. The resulting 3D temporal correspondences are associated with stable scene points and form the basis of later processing. The camera orientation is represented as a quaternion, and we estimate the least-squares best-fit rotation and translation
for the sequence in a two-stage process. First, using RANSAC [6] we compute the best linear least-squares transformation using Horn's absolute orientation method [11]. Given two 3D point clouds $r_{t_0}$ and $r_{t_1}$ at times $t_0$ and $t_1$ respectively, we estimate the rotation and translation required to bring $r_{t_1}$ into accordance with $r_{t_0}$. The centroids $\bar{r}_{t_0}$, $\bar{r}_{t_1}$ of each point cloud are computed and subtracted from the points to obtain two new point sets, $r'_{t_0} = r_{t_0} - \bar{r}_{t_0}$ and $r'_{t_1} = r_{t_1} - \bar{r}_{t_1}$. To compute the rotation, $R(\cdot)$, we minimize the error function

$$\sum_{i=1}^{n} \left\| r'_{t_0,i} - s\,R(r'_{t_1,i}) \right\|^2 \qquad (1)$$

The rotation, $R(\cdot)$, and scale, $s$, are estimated using a linear least-squares approach (detailed in [11]). After estimating the rotation, the translation is estimated by transforming the centroids into a common frame and subtracting. To achieve a higher registration accuracy we refine the rotation and translation using a nonlinear Levenberg-Marquardt minimization [21] over six parameters. This is a necessary step as the solution from the initial RANSAC estimate is prone to small errors negatively affecting the visual quality of the registration. The residual error is minimized through a non-linear minimization stage. For this final pose refinement stage, we parameterize the rotation as a Rodrigues vector [35] and simultaneously estimate the rotation and translation parameters by minimizing the transformation error

$$\sum_{i=1}^{n} \left\| r_{t_0,i} - \left( R(r_{t_1,i}) + T \right) \right\|^2 \qquad (2)$$
In practice, we find that the minimization takes only a few iterations (4-10) to reduce the error to acceptable levels and as such does not preclude real-time operation. Another factor aiding performance is that the number of parameters in the minimization is fixed. This approach to pose estimation differs from the traditional bundle-adjustment approach [31] in the structure-from-motion literature in that it does not refine the 3D locations of the features as well as the trajectory. We chose not to refine the 3D structure in order to limit the number of unknowns in the minimization and thus obtain a solution more quickly. The minimization reduces the registration error between subsequent frames only. Thus the time spent per iteration is bounded by the maximum number of features tracked per frame, which is set empirically to 1000 features to ensure the minimization time is not the limiting factor for performance. The algorithm is capable of estimating the full six degree-of-freedom handheld motion of the sensor as it moves throughout the environment.
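The first stage of this estimate can be sketched as follows: Horn's closed-form absolute orientation [11] (unit quaternion from the largest eigenvector of the 4x4 cross-covariance matrix) wrapped in a minimal RANSAC loop. Unit scale is assumed, the inlier threshold and iteration count are placeholder values, and the subsequent Levenberg-Marquardt refinement over a Rodrigues parameterization is only indicated by a comment; the sketch illustrates the approach rather than reproducing the exact implementation.

```python
import numpy as np

def horn_absolute_orientation(src, dst):
    """Closed-form rotation/translation aligning src (time t1) to dst (time t0),
    following Horn's unit-quaternion method [11]; unit scale assumed."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    a, b = src - src_c, dst - dst_c                    # centred point sets
    S = a.T @ b                                        # 3x3 cross-covariance
    Sxx, Sxy, Sxz = S[0]
    Syx, Syy, Syz = S[1]
    Szx, Szy, Szz = S[2]
    # Horn's 4x4 symmetric matrix; its largest-eigenvalue eigenvector is the
    # unit quaternion (w, x, y, z) of the best-fit rotation.
    N = np.array([
        [Sxx + Syy + Szz, Syz - Szy,        Szx - Sxz,        Sxy - Syx],
        [Syz - Szy,       Sxx - Syy - Szz,  Sxy + Syx,        Szx + Sxz],
        [Szx - Sxz,       Sxy + Syx,       -Sxx + Syy - Szz,  Syz + Szy],
        [Sxy - Syx,       Szx + Sxz,        Syz + Szy,       -Sxx - Syy + Szz]])
    eigvals, eigvecs = np.linalg.eigh(N)
    w, x, y, z = eigvecs[:, np.argmax(eigvals)]
    R = np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)]])
    t = dst_c - R @ src_c
    return R, t

def ransac_pose(src, dst, iters=200, thresh=0.01):
    """RANSAC over minimal 3-point samples (thresh in metres, an assumed value).
    The winning inlier set would then feed the Levenberg-Marquardt refinement
    over a Rodrigues-vector parameterization described in the text."""
    rng = np.random.default_rng(0)
    best_inliers = np.zeros(len(src), dtype=bool)
    for _ in range(iters):
        idx = rng.choice(len(src), 3, replace=False)
        R, t = horn_absolute_orientation(src[idx], dst[idx])
        err = np.linalg.norm((src @ R.T + t) - dst, axis=1)
        inliers = err < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return horn_absolute_orientation(src[best_inliers], dst[best_inliers])
```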
Figure 3: Polygonal mesh obtained from a point cloud. (a, c) show views of the generated triangle mesh and (b, d) show their respective wireframe renderings.

Since we use an area-based sum-of-squared-differences stereo algorithm, there is an artificial 'thickening' of the 3D points around the edges of the model. This creates unwanted effects such as haloing around the objects to be reconstructed. Using a smaller window size for the stereo algorithm alleviates the problem but does not eliminate it completely. The halo effect becomes more problematic when looking at the subject from a profile view and then rotating around the subject: the set of 3D points associated with the halo tends to occlude the subject's face, resulting in a noisy model. This can be alleviated in several ways, for example by using a stereo algorithm such as the one described in [29] or by moving the camera in a way that avoids the problem. Lateral movements from the left to the right of the subject seem to perform best for creating facial models, and vertical motions seem to perform best for anterior or posterior full-body models. The output of the stereo algorithm is a coloured 3D point cloud (see Figure 2). Although this cloud captures the salient scene structure, it is not directly useful as a head/face model given its large size (each stereo frame may provide tens of thousands of points, and at 30 frames per second even a minute of data collection provides hundreds of frames). In addition to providing too many (and redundant) data points, the data may contain holes that need to be filled and outliers that need to be pruned. Also, a set of 3D points is not the most efficient representation for rendering on modern graphics hardware. These issues are addressed in the following sections by extracting a polygonal mesh more suitable for real-time rendering.
2. MESHING POINT-CLOUDS
Creating polygonal meshes is a standard form of data compression for 3D models and has an extensive history. It has been over twenty years since the breakthrough publication of the classic (and highly used) Marching Cubes[16] algorithm, and a comprehensive survey of the topic has been published recently (see [12]). Other meshing techniques exist that alleviate several issues that plague the Marching Cubes algorithm; for example, Marching Cubes is subject to a stair-stepping effect on slanted data, generating noticeable visual artifacts. One technique we have found exceptionally useful for scanning face and body models is the Constrained Elastic Surface Nets algorithm (described in [7]). This meshing approach involves voxelizing the point-cloud into a discrete volume, meshing the volume by sub-dividing the voxels and creating a vertex centered in each of them, applying a relaxation function to the vertex positions, and then triangulating adjacent vertices. A more detailed description and analysis of our meshing strategy can be found in [8]. Prior to attempting to mesh the point-cloud, several preprocessing steps are performed. First, the point-cloud is culled to a cube that contains the volume of interest. This subset of points then goes through a merging step to reduce the number of points in the volume. Merging helps reduce redundant data resulting from overlapping 3D points produced from the range images, which improves the performance of the meshing step dramatically. The space encompassing the point-cloud is divided into a voxel grid at a specified resolution, and the points contained within each non-empty voxel are averaged to produce a single voxel value. This averaging is done for all point attributes: position, intensity and (normalized) normal information. The last pre-processing step involves removing spurious data points: floating outliers and points lying along the camera trajectory (artifacts of the stereo algorithm). These points are easy to detect by sweeping a small volume in space-time along the camera trajectory and removing the intersecting points. The voxel representation is then meshed using a variation of the Surface Nets algorithm (see [7]); however, we omit the relaxation step (and hence do not subdivide the voxels). The reason is that the subdivided voxels yield many more mesh faces, and in this case the relaxation function does not provide enough smoothing to justify the higher cost of the additional faces. Instead we post-process the mesh to fill small holes and smooth the resulting representation.
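A minimal sketch of the voxel-merging step follows: points falling into the same voxel cell have all their attributes (position, colour, normal) averaged, as described above. The voxel size is an assumed free parameter chosen here only for illustration.

```python
import numpy as np

def voxel_merge(points, colours, normals, voxel_size=0.005):
    """Average every point attribute over the points that share a voxel cell."""
    keys = np.floor(points / voxel_size).astype(np.int64)        # integer cell index
    _, inverse = np.unique(keys, axis=0, return_inverse=True)    # group points by cell
    n_cells = inverse.max() + 1
    counts = np.bincount(inverse, minlength=n_cells)[:, None]

    def average(attr):
        sums = np.zeros((n_cells, attr.shape[1]))
        np.add.at(sums, inverse, attr)                           # per-cell accumulation
        return sums / counts

    merged_p = average(points)
    merged_c = average(colours)
    merged_n = average(normals)
    merged_n /= np.linalg.norm(merged_n, axis=1, keepdims=True)  # re-normalize normals
    return merged_p, merged_c, merged_n
```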
Hole closing. The first post-processing step is to close any holes present in the mesh. These holes are primarily the result of highly specular or textureless surfaces, which yield areas of very sparse point-cloud data. The holes are fairly small and we take two approaches to closing them. To detect the holes we first identify boundary mesh edges and then enumerate the chains of boundary edges that form loops. Edge loops of size three can be closed trivially by creating a new triangle to fill them. Edge loops of more than three edges are filled by first computing the mean vertex of all the vertices in the edge loop chain and then constructing a triangle fan from this averaged vertex out to every edge in the loop. This works very well for small holes and does not require an advanced 3D tessellation algorithm such as Delaunay triangulation.
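The hole-closing step can be sketched as follows: boundary edges are those used by exactly one triangle, and loops of such edges are closed either with a single triangle or with a fan around the loop's mean vertex. The sketch assumes each boundary vertex lies on exactly one hole and ignores triangle winding, which is adequate for the small holes described above.

```python
import numpy as np
from collections import Counter, defaultdict

def close_holes(vertices, faces):
    """Fill holes bounded by edges that belong to exactly one triangle."""
    verts = [np.asarray(v, dtype=float) for v in vertices]
    edge_count = Counter()
    for a, b, c in faces:
        for u, v in ((a, b), (b, c), (c, a)):
            edge_count[frozenset((u, v))] += 1
    boundary = [tuple(e) for e, n in edge_count.items() if n == 1]

    # Chain boundary edges into loops (assumes one hole per boundary vertex).
    nbrs = defaultdict(list)
    for u, v in boundary:
        nbrs[u].append(v)
        nbrs[v].append(u)

    visited, new_faces = set(), list(faces)
    for start in nbrs:
        if start in visited:
            continue
        loop, cur, prev = [start], start, None
        while True:
            visited.add(cur)
            nxt = [n for n in nbrs[cur] if n != prev]
            if not nxt or nxt[0] == start:
                break
            prev, cur = cur, nxt[0]
            loop.append(cur)
        if len(loop) == 3:                      # trivial case: one triangle closes it
            new_faces.append(tuple(loop))
        elif len(loop) > 3:                     # fan around the mean of the loop
            centre = np.mean([verts[i] for i in loop], axis=0)
            verts.append(centre)
            c_idx = len(verts) - 1
            for i in range(len(loop)):
                new_faces.append((loop[i], loop[(i + 1) % len(loop)], c_idx))
    return verts, new_faces
```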
Figure 4: Mesh Unwrapping. (a) and (b) show the input mesh to the unwrapping algorithm. (c) shows the flattened version of the mesh and (d) its wireframe representation. (e) shows the computed normal map gradient.
Smoothing. With the mesh free of holes we address smoothing, which counteracts the noise in the point-cloud data. We tackle smoothing using a mesh subdivision algorithm. These are iterative algorithms that divide the faces of a mesh and apply a relaxation function to the vertices. We have found that applying a single iteration of Loop subdivision[15] improves visual quality significantly (see [22] for a recent survey of the topic). The discrete iterative nature of the meshing algorithm produces a fairly large uniform mesh when run at the resolutions required to extract a good facial profile, and the size increases again once subdivision has been applied. Since the rendering performance of a model is inversely proportional to its size, we reduce the number of faces in the mesh using a mesh decimation algorithm (see [9]): we iteratively find and collapse the edge that has the least impact on the model geometry. The result is a dramatically reduced non-uniform mesh with approximately the same 3D profile as the original input. The results of the point-cloud meshing can be seen in Figure 3.
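The decimation step can be illustrated with the greedy edge-collapse sketch below. For brevity it uses edge length as a stand-in for the least-geometric-impact cost discussed above (and surveyed in [9]); the target face count is an assumed parameter.

```python
import numpy as np

def collapse_shortest_edges(vertices, faces, target_faces):
    """Greedy edge-collapse decimation sketch: repeatedly collapse the shortest
    edge (a simple proxy for the 'least geometric impact' cost) until the face
    count drops to target_faces."""
    verts = np.asarray(vertices, dtype=float).copy()
    faces = [tuple(f) for f in faces]

    while len(faces) > target_faces:
        # Collect the unique undirected edges of the current face list.
        edges = set()
        for a, b, c in faces:
            for u, v in ((a, b), (b, c), (c, a)):
                edges.add((min(u, v), max(u, v)))
        # Pick the shortest edge to collapse.
        u, v = min(edges, key=lambda e: np.linalg.norm(verts[e[0]] - verts[e[1]]))
        # Move u to the edge midpoint and redirect every reference to v onto u.
        verts[u] = 0.5 * (verts[u] + verts[v])
        new_faces = []
        for f in faces:
            g = tuple(u if idx == v else idx for idx in f)
            if len(set(g)) == 3:               # drop triangles that became degenerate
                new_faces.append(g)
        faces = new_faces
    return verts, faces
```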
3. TEXTURE MAPPING WITH RANGE IMAGES
The task of texture-mapping a 3D model given real-world images and associated camera position and orientation information is very complex. Small changes in color between adjacent triangles appear unnatural and can easily catch the attention of the human eye. The discontinuity between different textures applied to adjacent mesh faces is referred to as seaming. Even with a near pixel-perfect geometric mapping of different captured images to adjacent faces, the brightness distortion from camera gain is enough to result in seaming, so many approaches to texture-mapping (see [36, 1, 33, 3, 34]) revolve around addressing this issue. (See [27] for a recent survey of mesh parameterization approaches.) Our approach is similar to the technique described in [36]. We compute the mapping of a triangle to a range image by projecting the triangle onto the image plane using the image's associated camera position and orientation. Our mapping process begins by selecting an "optimal" range image, i.e., the image that is able to map the most mesh triangles, and using it to texture the model. Triangles that do not map within the bounds of the image or have a zero
area mapping are then mapped with a best-fit strategy: the range image chosen is the one that yields the largest projected area on the texture. This fairly simple approach yields a texture free of any prominent seaming artifacts. The next step is to compose a single texture from the mesh mapping. Retaining all the range images as textures for the model would dramatically increase its size and preclude real-time performance. To produce an efficient model representation we use an automatic UV parameterization algorithm to map the vertices of the mesh from 3D to a 2D plane. This is often referred to as mesh unwrapping. We can then render the unwrapped mesh fully textured as our new single unified texture map; the texture UV-coordinates for the 3D mesh are simply the vertex positions of the unwrapped mesh (see Figure 4(a) and (b)). Angle-Based Flattening (ABF)[26] is used to perform the mesh unwrapping. The algorithm minimizes the difference in the interior angles of the triangles between the 3D mesh and its 2D parameterization, which effectively minimizes stretch and skew artifacts. It also maintains adjacency between faces in the rendered texture, i.e. internal mesh edges are consistent, which makes the texture much friendlier for post-processing such as scaling or image filtering.
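A sketch of the per-triangle best-view selection is given below: each triangle is projected into every candidate range image using that image's camera pose, and the image giving the largest in-bounds projected area wins. The simple pinhole model and the camera-to-world pose convention used here are assumptions for illustration, not the exact implementation.

```python
import numpy as np

def projected_area(tri_world, R, t, K, width, height):
    """2D area of a world-space triangle projected into one camera.
    Returns 0 if any vertex falls behind the camera or outside the image.
    Pose convention (assumed): R, t are camera-to-world, so x_cam = R^T (x_w - t)."""
    cam = (tri_world - t) @ R
    if np.any(cam[:, 2] <= 0):
        return 0.0
    uv = cam @ K.T
    uv = uv[:, :2] / uv[:, 2:3]
    if np.any(uv < 0) or np.any(uv[:, 0] >= width) or np.any(uv[:, 1] >= height):
        return 0.0
    (u0, v0), (u1, v1), (u2, v2) = uv
    return 0.5 * abs((u1 - u0) * (v2 - v0) - (u2 - u0) * (v1 - v0))

def best_view(tri_world, cameras, K, width, height):
    """Index of the range image whose projection of this triangle is largest."""
    areas = [projected_area(tri_world, R, t, K, width, height) for R, t in cameras]
    return int(np.argmax(areas))       # the caller treats area 0 as 'unmapped'
```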
4. PER-PIXEL ILLUMINATION
A potential defect of low-resolution polygon meshes is that the mesh may undersample the underlying geometry, resulting in illumination artifacts. In essence the mesh incorrectly samples the position and orientation of the surface of the object (here a face), and the lighting model then highlights these artifacts. One approach to addressing this defect is to explicitly model local surface normal information that might not be properly captured by the geometry of the polygon mesh. Bump mapping[13] is a technique similar to texture mapping; however, instead of obtaining intensity information from a 2D map, a displacement from the surface is taken into consideration when lighting calculations are performed, generating fine surface detail without the cost of additional geometry data. Normal mapping is the dual of bump mapping: instead of encoding a displacement in the 2D map, the surface normal is encoded directly. With current video hardware supporting multiple texture paths and dot-product blending, per-pixel diffuse illumination can be achieved without run-time performance overhead (see [10]).
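As a small illustration of the per-pixel diffuse term that normal mapping enables, the snippet below decodes an RGB-encoded normal map and evaluates the Lambertian dot product against a light direction. In a game this arithmetic would run in a fragment shader via dot-product blending; numpy stands in here only to show the computation, and the n*0.5+0.5 encoding is a common convention assumed for the example.

```python
import numpy as np

def diffuse_from_normal_map(normal_map_rgb, light_dir):
    """Per-pixel Lambertian term from an 8-bit RGB normal map.

    normal_map_rgb : HxWx3 uint8 image, normals encoded as n * 0.5 + 0.5
    light_dir      : 3-vector pointing from the surface towards the light
    """
    n = normal_map_rgb.astype(np.float32) / 255.0 * 2.0 - 1.0   # decode to [-1, 1]
    n /= np.linalg.norm(n, axis=2, keepdims=True)
    l = np.asarray(light_dir, dtype=np.float32)
    l /= np.linalg.norm(l)
    return np.clip(n @ l, 0.0, 1.0)          # HxW diffuse intensity per pixel
```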
Figure 5: Final Model. (a)-(e) provide multiple views of the textured and normal-mapped mesh.

The raw point cloud data obtained with the sensor provides dense samples of environmental structure. We compute local normal information from the point-cloud simply by averaging the normals between a pixel and its adjacent pixels with depth values. A vertex is assigned a normal that is the normalized sum of the point normals local to that vertex. This improves the surface detail of the model when lighting is applied. We propagate this normal information further by generating a normal map. We use the unwrapped mesh and subdivide it multiple times to achieve higher resolution (in the same way the meshing algorithm produces the mesh). Normals for the new vertices are sampled from the point-cloud according to their positions in 3D space. The normal map is produced by encoding the normals as vertex colors and rendering linearly interpolated gradient-filled triangles to an image (see Figure 4(c) and (d)). Figure 5 illustrates the quality of model that can be constructed from our stereo image sensor. After creating the photo-realistic 3D avatar, the model can be animated using traditional rigging techniques in commercial applications such as Maya, 3D Studio Max, or Blender 3D to attach existing stock animations to it. By applying a skeletal bone structure to the model, it can be animated and exported into the game environment, where the user can now play as, or against, themselves.
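The local normal computation can be sketched as a finite-difference operation over the organized range image: each pixel's normal is the cross product of its horizontal and vertical neighbour differences, and a vertex normal would then be the normalized sum of the per-pixel normals near that vertex, as described above. This is a generic formulation consistent with the text, not the exact sensor code.

```python
import numpy as np

def normals_from_range_image(points, valid):
    """Per-pixel surface normals from an organized HxWx3 point image.

    points : HxWx3 array of 3D points in the camera frame
    valid  : HxW boolean mask marking pixels that have a depth estimate
    """
    normals = np.zeros_like(points, dtype=float)
    # Central differences along rows and columns give two local tangent vectors.
    dx = points[1:-1, 2:, :] - points[1:-1, :-2, :]
    dy = points[2:, 1:-1, :] - points[:-2, 1:-1, :]
    n = np.cross(dx, dy)                                   # (H-2, W-2, 3)
    length = np.linalg.norm(n, axis=2, keepdims=True)
    ok = (length[..., 0] > 1e-9) & valid[1:-1, 1:-1]       # skip degenerate pixels
    n = n / np.maximum(length, 1e-9)
    normals[1:-1, 1:-1][ok] = n[ok]
    return normals
```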
5. RESULTS AND DISCUSSION
The system presented here can be used to extract accurate 3D information from stereo video capture. Advantages of using this type of system over existing laser scanning systems to acquire 3D information include: (i) it is a passive, eye-safe system in which no energy is emitted into the environment, a particularly important issue when generating 3D head models, as laser light would otherwise be projected towards the subject's eyes; (ii) the system is low cost, requiring only two computer cameras, and it would be possible, for example, to utilize off-the-shelf synchronized web cameras to collect the necessary imagery; (iii) the system generates photorealistic 3D models of human faces, heads, bodies, other objects or even entire rooms without requiring special markers to be placed in the environment; and (iv) there is no need to calibrate multiple camera positions or to calibrate multiple (e.g. laser and visible light) sensors together, since the camera motion is obtained from the visual motion alone. Models obtained with this technique can then be rigged and animated with commercial animation packages such as Maya and Morpheme and imported into the game or simulation environment to be used as an avatar in the 3D world. This technology is suitable for creating photo-realistic avatars for
use in games and simulation environments and can lead to a more immersive experience for the user.
Acknowledgements The support of the University of Ontario Institute of Technology, OGS, NSERC, CRTI and MDA is gratefully acknowledged.
6. REFERENCES
[1] A. Baumberg. Blending images for texturing 3D models. In Proceedings of the British Machine Vision Conference, 2002.
[2] P. Besl and N. McKay. A method for registration of 3-D shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 14(2):239-256, February 1992.
[3] P. Debevec, Y. Yu, and G. Boshokov. Efficient view-dependent image-based rendering with projective texture-mapping. Technical Report CSD-98-1003, 1998.
[4] U. Dhond and J. Aggarwal. Structure from stereo - a review. IEEE Transactions on Systems, Man, and Cybernetics, 19:1489-1510, 1989.
[5] G. Dudek, P. Giguere, C. Prahacs, S. Saunderson, J. Sattar, L. Torres, M. Jenkin, A. German, A. Hogue, A. Ripsman, J. Zacher, E. Milios, H. Liu, P. Zhang, M. Buehler, and C. Georgiades. AQUA: An amphibious autonomous robot. IEEE Computer, 40(1):46-53, January 2007.
[6] M. Fischler and R. Bolles. Random sample consensus: A paradigm for model fitting with application to image analysis and automated cartography. Communications of the ACM, 24(6):381-385, 1981.
[7] S. Frisken. Constrained elastic surface nets: Generating smooth surfaces from binary segmented data. In Proceedings of the First International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 888-898, 1998.
[8] S. Gill. Polygonal meshing for stereo video surface reconstruction. Master's thesis, York University, Toronto, Ontario, Canada, 2007.
[9] P. S. Heckbert and M. Garland. Survey of polygonal surface simplification algorithms. Technical Report CMU-CS-95-194, 1997.
[10] W. Heidrich and H.-P. Seidel. Realistic, hardware-accelerated shading and lighting. In A. Rockwood, editor, SIGGRAPH 1999, Annual Conference Proceedings, pages 171-178, Los Angeles, 1999. Addison Wesley Longman.
[11] B. Horn. Closed-form solution of absolute orientation using unit quaternions. Journal of the Optical Society of America A, 4(4):629-642, April 1987.
[12] L. Kobbelt and M. Botsch. A survey of point-based techniques in computer graphics. ACM Trans. Graph., 22(3):641-650, 2003.
[13] V. Krishnamurthy and M. Levoy. Fitting smooth surfaces to dense polygon meshes. Computer Graphics, 30(Annual Conference Series):313-324, 1996.
[14] Linden Research Inc. Second Life. http://www.secondlife.com, 2007.
[15] C. Loop. Smooth subdivision surfaces based on triangles. Master's thesis, University of Utah, USA, 1987.
[16] W. E. Lorensen and H. E. Cline. Marching cubes: A high resolution 3D surface construction algorithm. In Proceedings of the 14th Annual Conference on Computer Graphics and Interactive Techniques, pages 163-169, 1987.
[17] D. G. Lowe. Object recognition from local scale-invariant features. In International Conference on Computer Vision, pages 1150-1157, 1999.
[18] B. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In International Joint Conference on Artificial Intelligence (IJCAI), pages 674-679, 1981.
[19] M. Meehan, B. Insko, M. Whitton, and F. P. Brooks, Jr. Physiological measures of presence in stressful virtual environments. In SIGGRAPH '02: Proceedings of the 29th Annual Conference on Computer Graphics and Interactive Techniques, pages 645-652, New York, NY, USA, 2002. ACM Press.
[20] Point Grey Research Inc. http://www.ptgrey.com, 2007.
[21] W. Press, S. Teukolsky, W. Vetterling, and B. Flannery. Numerical Recipes in C. Cambridge University Press, 2002.
[22] M. A. Sabin. Recent progress in subdivision: a survey. In Advances in Multiresolution for Geometric Modelling, Mathematics and Visualization, pages 203-230, 2005.
[23] D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision, 47(1-3):7-42, April 2002.
[24] S. Se and P. Jasiobedzki. Photorealistic 3D model reconstruction. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA 2006), pages 3076-3082, 2006.
[25] S. Se, D. Lowe, and J. Little. Vision-based global localization and mapping for mobile robots. IEEE Transactions on Robotics, 21(3):364-375, June 2005.
[26] A. Sheffer, B. Levy, M. Mogilnitsky, and A. Bogomyakov. ABF++: fast and robust angle based flattening. ACM Trans. Graph., 24(2):311-330, 2005.
[27] A. Sheffer, E. Praun, and K. Rose. Mesh parameterization methods and their applications. Foundations and Trends in Computer Graphics and Vision, 2006.
[28] J. Shi and C. Tomasi. Good features to track. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 593-600, June 1994.
[29] M. Sizintsev and R. Wildes. Coarse-to-fine stereo vision with accurate 3-D boundaries. Technical Report CS-2006-07, York University, June 2006.
[30] Sony Entertainment Inc. Home beta. http://www.homebetatrial.com, 2007.
[31] B. Triggs, P. McLauchlan, R. Hartley, and A. Fitzgibbon. Bundle adjustment - a modern synthesis. Lecture Notes in Computer Science, pages 278-375. Springer-Verlag, 2000.
[32] P. Vanezis, M. Vanezis, G. McCombe, and T. Niblett. Facial reconstruction using 3-D computer graphics. Forensic Science International, 108:81-95, 2000.
[33] L. Wang, S. B. Kang, R. Szeliski, and H.-Y. Shum. Optimal texture map reconstruction from multiple views. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 1, pages 347-354, 2001.
[34] F. M. Weinhaus and V. Devarajan. Texture mapping 3D models of real-world scenes. ACM Comput. Surv., 29(4):325-365, 1997.
[35] E. Weisstein. Rodrigues' rotation formula. From MathWorld - A Wolfram Web Resource. http://mathworld.wolfram.com/RodriguesRotationFormula.html, 2006.
[36] S. Wuhrer, R. Atanossov, and C. Shu. Fully automatic texture mapping for image-based modeling. Technical Report NRC/ERB-1141, August 2006.
[37] B. Zitova and J. Flusser. Image registration methods: a survey. Image and Vision Computing, 21:977-1000, 2003.