3D Reconstruction of Environments for Virtual Collaboration

Ruzena Bajcsy, Reyes Enciso, Gerda Kamberova, Lucien Nocera
University of Pennsylvania, Philadelphia, PA 19146

Radim Šára
Czech Technical University, Prague, Czech Republic

Abstract

In this paper we address an application of computer vision which can in the future completely change our way of communicating over the network. We present our version of a testbed for telecollaboration. It is based on a highly accurate and precise stereo algorithm. The results demonstrate the live (on-line) recovery of 3D models of a dynamically changing environment and the simultaneous display and manipulation of the models.¹

¹This work is supported, in whole or in part, by: the National Science Foundation under grant numbers MIP94-20397 A03 SUB, IRI93-3980, IRI9307126, and GER93-55018; the U.S. Army Research Office under grant numbers P-34150-MA-AAS DAAH04-96-1-0007 and ARO/DURIP DAAG55-97-1-0064; and Advanced Networks and Services, Inc. Radim Šára was also supported by the Grant Agency of the Czech Republic, grants 102/97/0480, 102/97/0855, and 201/97/0437, by the European Union, grant Copernicus CP941068, and by the Czech Ministry of Education, grant VS96049.

1. Introduction

With the wide availability of distributed computing systems with multimedia capabilities, it is very natural to contemplate collaborations that were not previously possible. Scientists, businessmen, doctors, educators, and others need to share, exchange, and debate their work, findings, methodology, and experiences. This is commonly done by attending meetings, publishing in journals and books, and visiting each other's sites. The Internet is a common medium for communicating text, images, and sounds. Recently the need for transmitting video and audio in addition to text has become paramount. Teleimmersion is the emerging technology which will provide people with realistic experiences in virtual reality [19]. We are particularly interested in telecollaboration, i.e., immersing the participants at geographically distributed sites into a common virtual world where they can interact and communicate during a design process. Figure 1 illustrates the idea. During a telecollaboration session, each of the participants is situated in a telecubicle equipped with a variety of sensors and data acquisition systems. The participants are in control of their own viewpoints, and can contribute models of objects from their own worlds into the common virtual world. Each participant has a replica of this virtual world. Enabling the teleimmersion technology

Figure 1. Telecollaboration set up: each participant is in a telecubicle, and contributes to a common virtual world during the design process.

requires the joint efforts of researchers from many different areas: graphics, vision, and networking. In this paper we focus only on the vision part, i.e., the automatic acquisition of 3-D models of environments for virtual reality. Important questions related to the limitations and needs of transmitting video for telecollaboration must be addressed: Why not just transmit two-dimensional video (as in television or tele-conferencing)? If two-dimensional video is not sufficient, then what is missing, especially for achieving a more integrative/collaborative effort in the design of mechanical parts? Will video information (in any form) be sufficient to convey all the information necessary during

Figure 2. Telecollaboration testbed, system diagram.

Figure 3. 3D model acquisition, system diagram (FG denotes the frame grabber/interface).

the design process? Next, we address the above questions.

1.1. Limitations and needs of transmitting video

Two-dimensional video towards collaboration: This technology, called tele-conferencing, has been available for a while. It is suitable for two-site communications. If there are more than two sites, each sender is displayed in a different window on the receiver's site. There is no sharing of a common "virtual space"; every participant lives in his/her own space, and the communication is only point-to-point, pairwise. A similar version can be implemented on the Internet. While the TV technology works and is commercially available, it is not sufficient for realistic interactive design by multiple users. What is missing is that the receiver has no way of looking around the part that is being discussed (unless he/she explicitly asks the sender for that particular view), i.e., he/she cannot freely interact with the part that is being designed. Furthermore, this technology does not allow the users to overlay any information, real or synthetic, from their databases or other sources on the design object.

The designers in the virtual space: During collaboration a great deal of information is communicated by face-to-face observation. There is a need to bring the designers alive into the virtual space. Telecollaboration requires addressing technical challenges such as high spatial and color resolution; synchronized real-time capture at each site; real-time integration of the local 3D representation into the 3D virtual world; and a wide field of view, so that the viewer has a natural look at the design environment.

A virtual space for discussing and sharing a common design: One can easily imagine a common three-dimensional virtual space into which all the participants input their information, and then this common space is transmitted to all participants. In turn, on each receiver's site the proper view is generated as if the viewer were physically present in that space. Our interest is in automatic on-line model acquisition. The technical difficulties are: to ensure coherent accuracy and precision of the individual 3D representations; to register all the representations from the different sites in one virtual space; and to integrate real and virtual objects in the same virtual scene.

1.2. Immersive virtual environments

Our approach is motivated by Henry Fuchs's proposal on polynocular stereo using a "sea of cameras" [8]. Recently, the automatic extraction of 3-D models for virtual reality has been actively explored, e.g., at the IEEE and ATR Workshop on Computer Vision for Virtual Reality Based Human Communications [1], and in [16]. There are three main differences between our approach and that of [16]. The application in [16] is the entertainment industry; thus it is sufficient to achieve visually acceptable results. Wide field of view sensors are used covering a large volume, and only the image capture is on-line (using 51 cameras and 51 VCRs); all processing is done off-line (so far). Our application is telecollaboration for design, thus we are interested in high accuracy and precision, which we quantify with confidence intervals [10]; we use a much narrower field of view (we focus on facial expressions, gestures, and objects with detail, which can be manipulated by hand); image acquisition, all computations, and the display are on-line; and we use 2 cameras (in the future we will extend the number of cameras to 6). While our approach indeed makes some aspects of the stereo algorithm easier, we are still left with the following outstanding problems: ambiguities in the case of weak or repetitive texture, and errors around occluding boundaries (these must be detected by some independent means). So far we have concentrated on robust techniques for the analysis of stereo assuming textured surfaces. Furthermore, we have spent a great deal of effort on understanding the

Figure 6. Statistics of the residuals, for the plane in pixels (top row) and for the cylinder in mm (bottom row).

               min        max      mean      std     median
  plane     -1.6554     2.7778    0.0285   0.1668    0.0005
  cylinder -12.6129     4.9614   -0.0059   2.1477    0.1552


sensor noise and behavior. By understanding and modeling the sensor we can improve the accuracy and characterize the precision [10]. Stereo reconstruction has been an active area of research in computer vision. For a review on stereo and matching see [7], or more recently [21, 15, 4, 17, 18, 20]. The stereo algorithm we use is a classic area-based correlation approach. This class of algorithms is suitable for providing dense 3-D information which may in turn be used to accurately derive higher-level object descriptions. The advantages of our stereo algorithm are that it can be easily parallelized; it has precise and fast sub-pixel disparity computation; it is relatively insensitive to specularities (an advantage over single-pair stereo setups and/or matching algorithms that use the information from all cameras at once); and it relies on weaker-than-usual assumptions on scene smoothness (we use no explicit smoothness or continuity constraint). In the next sections we present our stereo algorithm, some implementation details, experimental results, and future directions for research.


Figure 7. Histograms of the residuals for the plane in pixels (left) and for the cylinder in mm (right).

2. Stereo algorithm

The input data is a set of images taken by multiple cameras displaced in space and gazing at the object. The cameras are strongly calibrated. To recover range data from (polynocular) stereo, the corresponding projections of the spatial 3-D points have to be found in the images. This is known as the correspondence (matching) problem. The epipolar constraint reduces the search-space dimension [12].

Rectification: To simplify the algorithms and their parallelization, the input images of each stereo pair are first rectified [2], so that corresponding points lie on the same image lines. Then, by definition, corresponding points have coordinates (u, v) and (u - d, v) in the left and right rectified images, respectively; u denotes the horizontal and v the vertical image coordinate, and d is known as the disparity.

Matching (disparity map computation): The degree of correspondence is measured by a modified normalized cross-correlation [14],

    c(I_L, I_R) = 2 cov(I_L, I_R) / (var(I_L) + var(I_R)),    (1)

where I_L and I_R are the left and right rectified image patches over the selected correlation windows. For each pixel (u, v) in the left image, the matching produces a correlation profile c(u, v, d), where d ranges over acceptable integer disparities.
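As a concrete illustration, the following sketch (not the authors' implementation; a minimal NumPy version assuming rectified, equal-size grayscale images and a fixed square window) computes the modified normalized cross-correlation of Eq. (1) for every pixel over a range of integer disparities, producing one correlation profile per pixel. The local maxima of each profile would then serve as disparity hypotheses, as described next.

```python
import numpy as np

def correlation_profiles(left, right, d_min, d_max, half_win=3):
    """Modified normalized cross-correlation c = 2*cov / (var_L + var_R)
    for each pixel of the rectified left image and each integer disparity.
    Returns an array of shape (H, W, n_disparities). A sketch only; image
    borders and the disparity search range are handled naively."""
    left = left.astype(np.float64)
    right = right.astype(np.float64)
    H, W = left.shape
    disparities = range(d_min, d_max + 1)
    profiles = np.full((H, W, len(disparities)), -np.inf)
    w = half_win
    for row in range(w, H - w):
        for col in range(w, W - w):
            win_l = left[row - w:row + w + 1, col - w:col + w + 1]
            for k, d in enumerate(disparities):
                c0 = col - d                      # matching column in the right image
                if c0 - w < 0 or c0 + w + 1 > W:
                    continue
                win_r = right[row - w:row + w + 1, c0 - w:c0 + w + 1]
                cov = np.mean((win_l - win_l.mean()) * (win_r - win_r.mean()))
                denom = win_l.var() + win_r.var()
                if denom > 1e-12:
                    profiles[row, col, k] = 2.0 * cov / denom
    return profiles
```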

We consider all peaks of the correlation profile as possible disparity hypotheses. We call the resulting list of hypotheses for all positions a disparity volume. The hypotheses in the disparity volume are pruned by a selection procedure that is based on the visibility constraint, the ordering constraint, and the disparity gradient constraint [22, 7]. The output of this procedure is an integer disparity map. The disparity map is the input to the reconstruction procedure. The precision of the reconstruction is proportional to the disparity error. To refine the 3-D position estimates, a subpixel correction of the integer disparity map is computed, which results in a subpixel disparity map. The subpixel disparity can be obtained either using a simple interpolation of the scores or using a more general approach as described in [6] (which takes into account the distortion between the left and right correlation windows, induced by the perspective projection, assuming that a planar surface patch is imaged). The first approach is faster while the second gives a more reliable estimate of the subpixel disparity. To achieve fast subpixel estimation and satisfactory accuracy we proceed as follows. Let δ be the unknown subpixel correction, and A(u, v) the transformation that maps the correlation window from the left to the right image (for a planar target it is an affine mapping that preserves image rows). For corresponding pixels in the left and right images,

    I_R(u - d + δ, v) = α I_L(A(u, v)),    (2)

where the coefficient α takes into account possible differences in camera gains.
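For reference, before continuing with the linearized solution of (2) below, here is a sketch of the simpler alternative mentioned above: interpolating the correlation scores around the integer peak with a parabola. This is a hedged illustration of the faster score-interpolation variant, not the least-squares method actually derived from Eq. (2).

```python
def parabolic_subpixel(profile, k):
    """Refine an integer disparity at index k of a per-pixel correlation
    profile by fitting a parabola through the three scores around the peak.
    Assumes k is a local maximum away from the ends of the search range;
    returns a subpixel offset in (-0.5, 0.5) to add to the integer disparity."""
    c_m, c_0, c_p = profile[k - 1], profile[k], profile[k + 1]
    denom = c_m - 2.0 * c_0 + c_p
    if abs(denom) < 1e-12:
        return 0.0            # flat profile: no reliable refinement
    return 0.5 * (c_m - c_p) / denom
```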

Figure 4. Original stereo pairs of images: a planar scene (left), a scene with a cylinder (right).


Figure 5. The results: left — the subpixel disparity map for the plane shown as a 3D mesh (row, column, disparity); right — the reconstructed points (x,y,z) in mm for the cylinder shown as a 3D mesh.

By taking a first-order linear approximation of (2) over the correlation window, with respect to δ and A, we obtain a linear system. The least-squares solution of the system gives the subpixel correction δ.

Reconstruction: From the disparity maps and the camera projection matrices, the spatial positions of the 3-D points are computed by triangulation [7]. The result of the reconstruction (from a single stereo pair of images) is a list of spatial points.

Verification: During this procedure, all the reconstructed points, from all stereo pairs, are re-projected back to the disparity spaces of all camera pairs, and it is verified whether the projected points match their predicted positions in the other image of each pair. Then the selection procedure is re-applied. The output of the verification procedure is a subpixel disparity map with associated weights. This disparity map is much denser, smoother (while at the same time preserving discontinuities), and has fewer outliers compared to the unverified maps. The verification eliminates near-outliers very effectively. These near-outliers typically appear in narrow strips and are usually artifacts of matching near occlusions. They are very hard to identify by any analysis without referring back to the input images. Finally, the verified points from the reference images are back-projected to 3-D Euclidean space to produce the 3D reconstructed points.
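As an illustration of the reconstruction step, the sketch below back-projects a verified subpixel disparity map to 3-D points for the special case of a rectified pair with a known focal length and baseline. The actual system triangulates from the full projection matrices, so the pinhole parameters here (f, baseline, principal point) are assumptions made for the example.

```python
import numpy as np

def reconstruct_rectified(disparity, f, baseline, cu, cv, min_disp=1e-3):
    """Back-project a rectified-pair disparity map to 3-D points.
    disparity : (H, W) subpixel disparities in pixels (NaN where rejected)
    f         : focal length in pixels; baseline in world units (e.g. cm)
    cu, cv    : principal point of the rectified left image
    Returns an (N, 3) array of points in the left rectified camera frame."""
    H, W = disparity.shape
    v, u = np.mgrid[0:H, 0:W]
    valid = np.isfinite(disparity) & (disparity > min_disp)
    d = disparity[valid]
    Z = f * baseline / d                 # depth from the rectified-stereo relation
    X = (u[valid] - cu) * Z / f
    Y = (v[valid] - cv) * Z / f
    return np.column_stack([X, Y, Z])
```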

3. On-line Reconstruction

Figure 2 shows the system diagram for the complete telecollaboration testbed. Currently we have implemented the submodule represented in Figure 3. The 3D model acquisition system consists of two parallel processes. The first process continuously acquires a set of images from all cameras and computes the 3D reconstruction from polynocular stereo based on various pairs. The second process continuously acquires images from one camera and maps the texture of each image onto the current 3D model. Each time a new reconstruction is completed, the 3D model is updated. An OpenGL interface provides a simple interactive way of viewing the dynamically updated 3D model on a graphics workstation. This 3D model can be transferred via the Internet for remote display or use in a virtual environment (this is part of our collaborative effort with the group of Prof. Henry Fuchs at the University of North Carolina at Chapel Hill). The Texture Projector (Figure 3) is an optional component which we may use in the future.
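The structure of the two concurrent processes can be sketched roughly as follows. This is an assumed outline, not the actual implementation: the names grab_all_cameras, stereo_reconstruct, grab_texture_camera, and render are hypothetical placeholders for the acquisition, stereo, and OpenGL display components.

```python
import threading

class OnlineModel:
    """Shared 3D model: the stereo thread replaces the geometry whenever a
    new reconstruction finishes; the texture thread keeps re-mapping the
    latest camera image onto whatever geometry is currently available."""
    def __init__(self):
        self._lock = threading.Lock()
        self.points = None      # (N, 3) reconstructed points
        self.texture = None     # latest image from the texture camera

    def update_geometry(self, points):
        with self._lock:
            self.points = points

    def update_texture(self, image):
        with self._lock:
            self.texture = image

def stereo_loop(model, grab_all_cameras, stereo_reconstruct):
    while True:
        images = grab_all_cameras()              # one frame from every camera
        model.update_geometry(stereo_reconstruct(images))

def texture_loop(model, grab_texture_camera, render):
    while True:
        model.update_texture(grab_texture_camera())
        render(model)                            # project texture on current 3D points
```

The stereo loop is much slower (seconds per geometry update in the experiments reported below), so a fast texture loop of this kind keeps the displayed model visually current between reconstructions.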

Figure 9. Two views of the reconstruction of the hands with natural texture.

Figure 8. Pairs of images used in the reconstruction.

4. Results

Figure 10. Two views of the reconstructed scene with natural texture.

4.1. Experimental Set-up

We used black-and-white SONY XC-77RR CCD cameras and a Data Translation DT1451 framegrabber (about 10 MHz sampling frequency). The host was a Sun Ultra 30, and we used an SGI Indigo2 workstation for display purposes. The input image size was 512 (H) × 480 (V), and the rectified image size was 256 × 256. The cameras were in a fixed configuration: baseline length approximately 8 cm, verging at approximately 40°; the cameras were viewing a volume of approximately 30 × 30 × 30 cm at a distance of 80 cm from the baseline. The cameras were strongly calibrated using a grid (their projection matrices are known).
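Strong calibration from a grid amounts to estimating the 3×4 projection matrix of each camera from known 3-D grid points and their image projections. A standard linear (DLT) estimate, given here as an assumed illustration rather than the calibration procedure actually used, looks like this:

```python
import numpy as np

def estimate_projection_matrix(points_3d, points_2d):
    """Direct Linear Transform: estimate the 3x4 camera projection matrix P
    from at least 6 correspondences between 3-D grid points and image points.
    points_3d : (N, 3), points_2d : (N, 2). Returns P up to scale."""
    rows = []
    for (X, Y, Z), (u, v) in zip(points_3d, points_2d):
        rows.append([X, Y, Z, 1, 0, 0, 0, 0, -u * X, -u * Y, -u * Z, -u])
        rows.append([0, 0, 0, 0, X, Y, Z, 1, -v * X, -v * Y, -v * Z, -v])
    A = np.asarray(rows, dtype=np.float64)
    # The solution is the right singular vector of A with the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1].reshape(3, 4)
```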

4.2. Quantitative Results

We achieve a mean geometric accuracy of less than 1 mm and a very good visual appearance for a 30 × 30 × 30 cm target, even for partially specular surfaces with almost no surface texture. Next, we describe two experiments that demonstrate the accuracy of the stereo algorithm. In the first experiment, the target was a planar white poster-board card (a Lambertian reflectance surface) exhibiting some minor surface roughness. Figure 4 (left pair) shows the original images. In the second experiment, we used a scene which contained a wooden cylinder with natural texture (Figure 4, right pair). The target was the brightest object, approximately in the center of the images.

Since the disparity map of a planar surface is a plane, in Figure 5 (left) we show the subpixel disparity map for the planar surface as a 3D mesh. For display purposes, the 3D plot shows a subsampled map. Irregular "holes" can be observed where the algorithm failed (matches were rejected). The "spikes" are outliers erroneously accepted as valid. For the cylinder we show the results in the world coordinate system: Figure 5 (right) shows the reconstructed 3D points. The results for the planar case were evaluated quantitatively by fitting a plane to the subpixel disparity map and reporting the residuals. For the cylinder, we fitted a cylinder to the 3D points. The statistics of the residuals are shown in the table in Figure 6, and the histograms of the residuals are given in Figure 7.
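The planar evaluation can be reproduced along the following lines: fit a plane d(u, v) ≈ a·u + b·v + c to the subpixel disparity map by least squares and summarize the residuals. This is only a sketch under that assumed model; the outlier handling used for the reported numbers is not specified, and the cylinder fit is analogous but nonlinear.

```python
import numpy as np

def plane_residuals(u, v, d):
    """Least-squares fit of d = a*u + b*v + c to subpixel disparities, and
    the residual statistics reported in Figure 6 (min, max, mean, std, median).
    u, v, d are 1-D arrays of equal length (valid pixels only)."""
    A = np.column_stack([u, v, np.ones_like(u)])
    coeffs, *_ = np.linalg.lstsq(A, d, rcond=None)
    residuals = d - A @ coeffs
    stats = {
        "min": residuals.min(), "max": residuals.max(),
        "mean": residuals.mean(), "std": residuals.std(ddof=1),
        "median": np.median(residuals),
    }
    return coeffs, residuals, stats
```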

4.3. 3D Reconstruction of Static Scenes

First, we present results of the reconstruction of a static scene with a human face and hands (Figure 8). The reconstruction results are given in Figures 9 and 10. The second experiment demonstrates reconstructions of plants (see Figures 11 and 12).

Figure 11. A basket with a plant with natural texture: original images (top) and the reconstruction (bottom).

Figure 12. A plant with projected texture: original images (top) and two views of the reconstruction (bottom).

4.4. 3D Reconstruction of Dynamic Scenes

Figure 13 shows a subsampled sequence of images from the interactive display interface during the reconstruction of a dynamic scene (for an MPEG movie of this example see http://www.cis.upenn.edu/grasp/tii/). We started with a reconstruction of a plane (Figure 13, picture 1). The reconstruction is based on rectified images of size 256 × 256. It takes 5 s for a disparity range of [−10, 10]. The larger image size (256 × 256), high accuracy, precision, and resolution are the main advantages of our approach over existing similar systems (for example, compared to the benchmarks given in [11]). We place different objects in front of a background plane. Images are continuously projected on the already reconstructed points. This is most noticeable in pictures 1-5 and 11, where the rendering of the left rectangular block is flat, i.e., the texture is projected on the points which correspond to the reconstruction of the plane. Simultaneously, reconstructions are computed, and each time a reconstruction is completed the 3D model is updated; see pictures 7-8 (where the 3D models of the cylinder and the block are clearly perceived at the bottom edges) and 12. Another sequence of 3D reconstruction of a dynamic scene is shown in Figure 14.

5. Conclusions and future work

We have presented an implementation of a system for 3D reconstruction from polynocular stereo which is part of a testbed for telecollaboration. We have discussed the implementation details and given quantitative results for static scene reconstruction. We have demonstrated the system performance for dynamic scene reconstruction. The 3D model is a set of 3D points with texture. We are working towards recovering higher-level models and hierarchies of representations with varying levels of interpretation (and resolution). Furthermore, because of the high volume of data resulting from the stereo process, we propose to use higher-level per-object models, for heads and hands in particular (for instance, using deformable models [3, 13, 9, 5]). Combined with real-time tracking, this will allow us to avoid the heavy computations involved in the stereo reconstruction process. The final lightweight description of the scene will then be suitable for the purpose of tele-immersion with Internet II capabilities.

References

[1] IEEE and ATR Workshop on Computer Vision for Virtual Reality Based Human Communications (CVVRHC'98), (http://www.mic.atr.co.jp/ tatsu/cvvrhc/cvvprog.html), January 1998.
[2] N. Ayache and C. Hansen. Rectification of images for binocular and trinocular stereovision. Proc. 9th International Conference on Pattern Recognition, 1:11-16, 1988.
[3] R. Bajcsy and F. Solina. Three dimensional object representation revisited. Proc. Int. Conf. on Computer Vision, 1987.




Figure 13. On-line reconstruction of a dynamic scene. We are placing different objects in front of a background plane. Continuously, images are acquired and reconstructions are computed. An interactive display is the interface through which a remote observer can view the dynamically updated reconstructions. The viewer's viewpoint changes during the sequence. In particular, pictures 1, 10, 11, and 12 show the currently reconstructed points from views exemplifying the 3-D structure. Refer also to Figures 2 and 3, and the notes in Section 4.4.

Figure 14. On-line reconstruction of a dynamic scene: We place different objects in front of a plane. The frame rate was 1Hz.

[4] P. Belhumeur. A Bayesian approach to binocular stereopsis. Intl. J. of Computer Vision, 19(3):237-260, 1996.
[5] R. Bowden, A. Heap, and D. Hogg. Real time hand tracking and gesture recognition as a 3D input device for graphical applications. Gesture Workshop, York, UK, March 1996.
[6] F. Devernay. Computing differential properties of 3-D shapes from stereoscopic images without 3-D models. INRIA, Sophia Antipolis, RR-2304, 1994.
[7] U. Dhond and J. Aggarwal. Structure from stereo: a review. IEEE Transactions on Systems, Man, and Cybernetics, 19(6):1489-1510, 1989.
[8] H. Fuchs and U. Neumann. A vision of telepresence for medical consultations and other applications. Proc. 6th Intl. Symp. on Robotics Research, 1993.
[9] A. Heap and D. Hogg. 3D deformable hand models. Gesture Workshop, York, UK, March 1996.
[10] G. Kamberova and R. Bajcsy. Sensor errors and the uncertainties in stereo reconstruction. In K. Bowyer and P. J. Phillips, eds., Empirical Evaluation Techniques in Computer Vision, IEEE Computer Society Press, pages 96-116, 1998.
[11] K. Konolige. Small vision system: hardware and implementation. Eighth International Symposium on Robotics Research, Hayama, Japan, October 1997.
[12] S. Maybank and O. Faugeras. A theory of self-calibration of a moving camera. Intl. J. of Computer Vision, 8(2):123-151, 1992.
[13] D. Metaxas and I. Kakadiaris. Elastically adaptive deformable models. Proc. European Conf. on Computer Vision, 1996.
[14] H. Moravec. Robot rover visual navigation. Computer Science: Artificial Intelligence, (13-15):105-108, 1980/1981.
[15] M. Okutomi and T. Kanade. A multiple-baseline stereo. IEEE Trans. on Pattern Analysis and Machine Intelligence, 15(4):353-363, 1993.
[16] P. J. Narayanan, P. W. Rander, and T. Kanade. Constructing virtual worlds using dense stereo. Proc. Intl. Conf. on Computer Vision (ICCV'98), pages 3-10, 1998.
[17] S. Roy and I. Cox. A maximum-flow formulation of the N-camera stereo correspondence problem. Proc. Int. Conf. on Computer Vision, 1998.
[18] D. Scharstein and R. Szeliski. Stereo matching with nonlinear diffusion. Proc. Int. Conf. on Computer Vision and Pattern Recognition, 1996.
[19] M. Slater and S. Wilbur. A framework for immersive virtual environments (FIVE): Speculations on the role of presence in virtual environments. Presence, 6(6):603-616, 1997.
[20] C. Tomasi and R. Manduchi. Stereo without search. Proc. European Conf. on Computer Vision, 1996.
[21] R. Šára and R. Bajcsy. On occluding contour artifacts in stereo vision. Proc. Int. Conf. on Computer Vision and Pattern Recognition, 1997.
[22] A. Yuille and T. Poggio. A generalized ordering constraint for stereo correspondence. MIT, Artificial Intelligence Laboratory Memo, (777), 1984.
