In Proc. International Conference on Computer Vision (ICCV’98) Bombay, India, January 4–7, 1998

Modeling Geometric Structure and Illumination Variation of a Scene from Real Images

Zhengyou Zhang

INRIA, BP 93, F-06902 Sophia-Antipolis Cedex, France
ATR HIP, 2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-02, Japan
Email: [email protected], [email protected]

Abstract

We present in this paper a system which automatically builds, from real images, a scene model containing both 3D geometric information of the scene structure and its photometric information under various illumination conditions. The geometric structure is recovered from images taken from distinct viewpoints. Structure-from-motion and correlation-based stereo techniques are used to match pixels between images of different viewpoints and to reconstruct the scene in 3D space. The photometric property is extracted from images taken under different illumination conditions (orientation, position and intensity of the light sources). This is achieved by computing a low-dimensional linear space of the spatio-illumination volume, and is represented by a set of basis images. The model that has been built can be used to create realistic renderings from different viewpoints and illumination conditions. Applications include object recognition, virtual reality and product advertisement. Keywords: Geometric modeling, Representation, 3D reconstruction, Shading (illumination), CAD/CAM, Virtual reality, Rendering, Object recognition.

1 Introduction

We present in this paper a technique which automatically builds, from real images, a scene model containing both the 3D geometric information of the scene structure and its photometric information under various illumination conditions. This work is motivated by two domains of application: object recognition in computer vision and image-based rendering in computer graphics. A review will be given elsewhere.

The remainder of this paper is structured as follows. In Sect. 2, we provide an overview of our system together with an example from a real experiment. In Sect. 3 and Sect. 4, we describe the details of the geometric and the photometric modeling, respectively. In Sect. 5, we discuss how to render the reconstructed scene from vantage points under virtual illumination. In Sect. 6, we conclude with a discussion of our future work.

2 Overview

In this section, we describe how our system is configured and how it operates.

2.1 System Configuration

Our system is composed of three pieces of hardware: an imaging system with a CCD camera, a light source and a computer (see Fig. 1).

Figure 1: Hardware configuration of the system: a movable light source (positions 1 to N), a CCD camera movable between viewpoints 1 and 2, the scene, and a computer.

The CCD camera can be moved in order to acquire images of the scene from at least two different viewpoints. The light source is also movable so that the scene can be illuminated under different conditions. At each viewpoint (the camera thus being fixed), a number of images of the static scene under various illumination conditions are recorded. An essential task to be accomplished by the computer is to bring the image pixels of the different viewpoints into correspondence. As will be explained later, only after establishing this correspondence are we able to build the geometric and photometric model of the scene. For the moment, we only consider two viewpoints.

In our current system, both the CCD camera and the light source are displaced manually by an operator, which gives great flexibility in bringing the system into operation. Alternatively, the displacements of the camera and of the light source could be controlled by a computer. We could even use synchronized cameras to form a stereo pair and use multiple light sources in an appropriate configuration, in which each light source is energized individually under computer control. A computer-controlled lighting system provides an accurate way to illuminate the scene under a specific condition, e.g., with a light source moving along a circular trajectory.

2.2 System Operation

In the following, we explain the system step by step together with an example from one of our real experiments. There are in total 10 steps.

Step 1: From the first viewpoint, take a sequence of images of the same scene by varying the illumination condition between successive shots. This sequence will be referred to as the first sequence. Figure 2 shows the first sequence of our example, which contains the 5 images taken in our experiment.

Figure 2: Example 1: First sequence of images under five different illumination conditions.

Step 2: Move the camera to a different position.

Step 3: From the second viewpoint, take another sequence of images of the same scene by varying the illumination condition between successive shots. This sequence will be referred to as the second sequence. Figure 3 shows five images of this sequence, which actually contains 22 images in total.

Figure 3: Example 1: Second sequence. Only five of the 22 images are shown. They are, from (a) to (e), the first, sixth, 11th, 16th and 21st images in the sequence, respectively.

Step 4: Select one image from the first sequence (referred to hereafter as the first reference image) and one image from the second sequence (referred to hereafter as the second reference image) such that their illumination conditions are approximately identical, with the aim of facilitating the correspondence process between the two viewpoints. In our experiment, the following "trick" is used: the illumination condition of the last image of the first sequence is maintained for the first image of the second sequence. Therefore, the last image of the first sequence (the one shown in Fig. 2e) is our first reference image, and the first image of the second sequence (the one shown in Fig. 3a) is our second reference image.

Step 5: Match characteristic pixels (called points of interest) of the first reference image with the points of interest of the second reference image, and estimate the displacement of the camera between the two viewpoints. This is done automatically by software available from the Internet [5]. Figure 4 shows the point matches established automatically and the estimated camera displacement, which is depicted by the epipolar lines.

Step 6: Match as many pixels as possible between the two reference images. A correlation-based stereo technique is used to establish dense correspondence, and the camera displacement determined in Step 5 is used to reduce the search space for correspondence. Figure 5 shows the disparity map obtained with our stereo algorithm.

Figure 4: Example 1: There are in total 427 matches established, and the epipolar lines shown correspond, from top to bottom, to matches 34, 73, 215 and 426, respectively.

Figure 5: Example 1: Disparity map computed by a correlation-based stereo technique. The grey level encodes the disparity value. White marks points that could not be paired between the two images.
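For readers who want to reproduce the geometric side of Steps 5 and 6, the sketch below shows how a robust least-median-of-squares fit of the epipolar geometry can be obtained today with OpenCV. This is not the software of [5]; the point matches are synthetic and all names are ours.

```python
# Hypothetical stand-in for Step 5: robust estimation of the epipolar geometry
# between the two reference images from sparse point matches (synthetic data).
import numpy as np
import cv2

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, (100, 3)) + np.array([0.0, 0.0, 5.0])   # 3D points in front of the cameras
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])                   # first viewpoint
R, _ = cv2.Rodrigues(np.array([0.0, 0.2, 0.0]))                     # small rotation for the second viewpoint
P2 = K @ np.hstack([R, np.array([[0.3], [0.0], [0.0]])])            # plus a small translation

def project(P, X):
    x = (P @ np.hstack([X, np.ones((len(X), 1))]).T).T
    return (x[:, :2] / x[:, 2:]).astype(np.float32)

pts1, pts2 = project(P1, X), project(P2, X)                          # the "point matches" of Step 5

# Least-median-of-squares fit of the fundamental matrix; false matches are rejected.
F, inlier_mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_LMEDS)

# Epipolar line in image 2 of the first matched point of image 1: the dense
# matching of Step 6 only needs to search along such lines.
l2 = F @ np.array([pts1[0, 0], pts1[0, 1], 1.0])
print("epipolar line (a, b, c):", l2)
```

The inlier mask returned by the LMedS fit plays the role of the false-match detection described in Sect. 3, and the epipolar line l2 is what restricts the dense search of Step 6 to a single line.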

Step 7: Reconstruct the 3D geometric structure for all matched pixels through triangulation. The particular representation we use is VRML [3]. Figure 6 shows several views of the geometric model of the scene.

Figure 6: Example 1: Several views of the reconstructed geometric model of the scene. Neighboring points are connected to obtain a wireframe representation.

Step 8: Select a pixel so modeled in the first reference image, and obtain the sequence of intensity values of this pixel in each of the images from both image sequences. The previously estimated disparity value is used to fetch the corresponding intensity values in the second image sequence. This sequence of intensity values is represented by a vector whose dimension is equal to the total number of images in both sequences.

Step 9: Repeat Step 8 for all modeled pixels and construct a matrix, each column of which is a vector obtained in Step 8. The number of columns is the number of matched pixels. This matrix can be considered as a representation of the spatio-illumination volume. Figure 7 shows this volume. (A short numerical sketch of this construction is given at the end of this overview.)

Figure 7: Example 1: The spatio-illumination volume, which is a stack of the image ensemble under different illumination conditions. The images of the second sequence have been transformed into the coordinate system of the first reference image using the disparity values estimated with the correlation-based stereo algorithm. Only matched pixels are shown. The volume is cut by a horizontal slice to show how the intensity varies.

Step 10: Decompose the matrix so obtained in order to obtain a set of basis images whose pixel values are representative of the photometric property of the scene. Figure 8 shows the 6 basis images so obtained with a tolerance of 5% in information loss.

Figure 8: Example 1: Six basis images obtained with a tolerance of 5% in information loss.

Steps 5 to 7 mainly concern the geometric modeling, although they are also essential to Step 8. The details of the geometric modeling are described in Sect. 3. Steps 8 to 10 deal with the photometric modeling, and the details will be described in Sect. 4.

Once we have the set of photometric basis images, we can choose a set of appropriate coefficients to obtain a texture image through linear combination of the basis images, which is integrated into the VRML model. This allows us to view the virtual (reconstructed) scene from any vantage point. We can therefore produce a realistic rendering, or even an animation. The details will be provided in Sect. 5. Figure 9 shows a few renderings from the same example.

Figure 9: Example 1: A few renderings from the VRML model using the VRweb viewer [4], which renders the texture only in six grey levels.
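As a concrete, purely illustrative reading of Steps 8 and 9, the following numpy sketch stacks two toy image sequences into a photometric matrix. The array names and sizes are ours, and a single horizontal disparity value is assumed for simplicity; the real system uses the dense disparity map of Step 6 and is not restricted to rectified pairs.

```python
# Toy construction of the photometric matrix M of Steps 8-9: every matched pixel
# contributes one column holding its intensities over all images of both sequences.
# Shapes, names and the single horizontal disparity value are all hypothetical.
import numpy as np

H, W = 240, 320
seq1 = np.random.rand(5, H, W) * 255    # first sequence  (5 images, first viewpoint)
seq2 = np.random.rand(22, H, W) * 255   # second sequence (22 images, second viewpoint)

disparity = np.full((H, W), np.nan)     # NaN marks unmatched pixels (white in Fig. 5)
disparity[60:180, 80:240] = 12.0        # toy disparity for the "matched" region

ys, xs = np.nonzero(~np.isnan(disparity))                        # modeled pixels in the first reference image
xs2 = np.clip((xs + disparity[ys, xs]).astype(int), 0, W - 1)    # their correspondents in the second view

M = np.vstack([seq1[:, ys, xs],        # intensities over the first sequence
               seq2[:, ys, xs2]])      # intensities fetched through the disparity
print(M.shape)                          # (27, number of matched pixels)
```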

3 Geometric Modeling

For geometric modeling, we follow a usual approach, which consists of first establishing pixel correspondence between the two viewpoints and then reconstructing the structure of the scene in 3D space. We only consider the two reference images because each image sequence contains the same geometric information as its corresponding reference image (recall that all images of a sequence are taken from the same viewpoint).

Step 5 (estimating the camera displacement) is done automatically by software available from the Internet [5]. It starts by finding candidate matches with correlation and relaxation techniques, and then uses a least-median-of-squares technique to detect false matches and thus estimate the camera displacement robustly and precisely. Step 5 also provides a sparse set of point matches, which may be sufficient for many applications. If we want a finer model of the scene, we need to establish more point matches. Since the displacement between the two viewpoints has now been estimated, the two reference images can be considered as a stereo pair, and the epipolar constraint can be used to reduce the search space for correspondence from the whole image to a single line. Step 6 then obtains a dense correspondence between the two images by using a correlation-based stereo technique [1]. Figure 5 shows the disparity map obtained with our correlation-based stereo technique. We have specified an appropriate disparity range during matching such that only the object in the front is correlated. The disparity value, encoded in pseudo-colors, roughly reflects the distance of the space point to the camera: distant points usually have small disparity values, while close points have large ones.

Now we have a set of point matches between the two images. For each pair of matched points, we can obtain four equations in three unknowns (the coordinates of the corresponding point in 3D space), and thus we are able to reconstruct it (Step 7). We have adopted the Virtual Reality Modeling Language (VRML) [3]. For the moment we only use a very limited number of the features offered by VRML: neighboring points, if they are not far from each other, are linked as triangles to obtain a polygon-based representation of the scene, with the reconstructed points as vertices. Texture mapping will be considered in Sect. 5. Once we have a VRML model, any VRML browser can be used to navigate in the virtual environment. Figure 6 shows three views of the reconstructed scene captured using the VRweb viewer [4]. We can observe that the shape has been well reconstructed except for a few points on the border.
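To make the "four equations in three unknowns" of Step 7 concrete, here is a generic linear (DLT-style) triangulation sketch. It is a minimal illustration under our own assumptions (known projection matrices, noise-free pixel coordinates), not the implementation used in the paper.

```python
# Generic linear triangulation (DLT) sketch for Step 7; hypothetical camera
# matrices and a noise-free test point, not the paper's actual code.
import numpy as np

def triangulate(P1, P2, x1, x2):
    """P1, P2: 3x4 projection matrices; x1, x2: matched pixel coordinates (u, v)."""
    # Each match gives two equations per view, i.e. four equations for the
    # homogeneous coordinates of the unknown 3D point.
    A = np.array([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]                          # least-squares solution of A X = 0
    return X[:3] / X[3]                 # back to Euclidean coordinates

# Hypothetical example: a point at (0, 0, 5) seen by two translated cameras.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-0.3], [0.0], [0.0]])])
X_true = np.array([0.0, 0.0, 5.0, 1.0])
x1 = (P1 @ X_true)[:2] / (P1 @ X_true)[2]
x2 = (P2 @ X_true)[:2] / (P2 @ X_true)[2]
print(triangulate(P1, P2, x1, x2))      # ~ [0, 0, 5]
```

In the system, one such point is reconstructed per matched pixel pair, which yields the vertices of the VRML model.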

4 Photometric Modeling

In this section, we present our method for modeling the photometric property of the object surface under various illumination conditions. We now assume that a matrix M, which is a representation of the spatio-illumination volume (see Fig. 7), has been constructed as explained in Sect. 2.2 through Step 8 and Step 9. We will call M the photometric matrix for convenience. Each row of this matrix corresponds to an image of the scene under a particular illumination condition, while each column corresponds to a physical point of the scene under the various illumination conditions. There is an information redundancy between the rows, and our objective is to decompose this matrix in order to extract the useful signal which reflects the photometric property of the scene. A particular technique to achieve this is eigenspace analysis. The idea is to compute a low-dimensional linear subspace of the photometric matrix M without losing the majority of the information contained in M. This can be achieved by performing the Singular Value Decomposition (SVD) [2]. Let M be an m-by-n real matrix (m is the total number of images and n is the number of modeled pixels; usually m ≪ n). Through the SVD, we have

    M = U Σ Vᵀ ,                                   (1)

where U is an m × m orthonormal matrix, Σ is an m × m diagonal matrix, and Vᵀ is an m × n matrix with orthonormal rows. Let Σ = diag(σ_1, σ_2, ..., σ_m), where the σ_i's are the singular values of matrix M, in non-increasing order (i.e., σ_i ≥ σ_{i+1}).

Matrix M contains all the observed photometric information of the scene, plus the noise introduced during image acquisition. We note that the singular values decrease quickly. Figure 10 depicts a graph of the singular values for the example described in Sect. 2. The last singular value σ_27 is very small with respect to σ_1 (less than 0.5%). Therefore, we can reasonably consider that the last column of V represents noise or information which is not very useful. If we tolerate a small information loss, we can retain only the first k singular values and replace the remaining ones by 0. That is, we approximate the original Σ by

    Σ̂ = diag(σ_1, ..., σ_k, 0, ..., 0) .           (2)

Then we obtain an approximation of M, given by

    M̂ = U Σ̂ Vᵀ = Û B ,

where Û is an m × k matrix whose columns are the first k columns of U, and B is a k × n matrix whose rows are the first k rows of Σ̂ Vᵀ. We can make this arrangement because the last m − k diagonal elements of Σ̂ are equal to 0. Each row of matrix B is then a photometric basis image, and we thus have k basis images which constitute the photometric model of the scene. Figure 8 displays the 6 basis images thus obtained for the example considered in Sect. 2.2.

A linear combination of these basis images allows us to generate a realistic image of the scene. If we use the coefficients from a particular row of Û, we obtain an image approximating the corresponding one really observed. If we use coefficients different from those in Û, we actually synthesize a new image under a virtual illumination, as we shall discuss later. Figure 11a shows the synthesized image corresponding to the first image of the first sequence (Fig. 2a). The difference between the real and the synthesized image, multiplied by a scale factor of 5 for a better perception of the difference, is shown in Fig. 11b (only modeled pixels are considered). The average error is only 2 grey levels over 255.
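The decomposition and truncation described above map directly onto a few lines of numpy. The sketch below uses a randomly filled stand-in for the photometric matrix and our own variable names; k is fixed to 6 here, and the selection criterion is discussed below.

```python
# Stand-in decomposition of the photometric matrix M into k photometric basis
# images (Step 10, Eqs. (1)-(2)); M is randomly filled here for illustration only.
import numpy as np

m, n = 27, 50000                        # 27 images in total, n modeled pixels (example sizes)
M = np.random.rand(m, n) * 255          # placeholder for the real photometric matrix

U, s, Vt = np.linalg.svd(M, full_matrices=False)   # M = U @ diag(s) @ Vt, s non-increasing

k = 6                                   # number of basis images kept (see the criterion below)
U_hat = U[:, :k]                        # m x k: per-image combination coefficients
B = np.diag(s[:k]) @ Vt[:k]             # k x n: each row is a photometric basis image

M_hat = U_hat @ B                       # rank-k approximation of M
print(np.abs(M - M_hat).mean())         # mean approximation error in grey levels
                                        # (large here, since random data have no low-rank structure)
```

Each row of B becomes one basis image once its values are mapped back onto the grid of modeled pixels.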

Figure 10: Example 1: Singular values of the photometric matrix, plotted against their index.

Figure 11: Approximating the real image. (a) Synthesized image corresponding to the real image shown in Fig. 2a; (b) Difference between the real and the synthesized image, multiplied by a scale factor of 5.

An important question to answer is: how many basis images are necessary? i.e., how is the value of k selected? If we use the Frobenius norm as an information measure¹, then we can define the information loss as

    I_loss(k) = sqrt( Σ_{i=k+1}^{m} σ_i² / Σ_{i=1}^{m} σ_i² ) .

By specifying a threshold of the tolerance in information loss, denoted by ε, we can easily compute the smallest k by requiring I_loss(k) ≤ ε. In our example, k = 6 when ε = 5%.

¹ In terms of the SVD, the Frobenius norm of M is simply ‖M‖²_F = Σ_{i=1}^{m} σ_i².
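The criterion above translates directly into code. The singular values in the sketch below are placeholders, not the measured ones of Fig. 10.

```python
# Computing the information loss I_loss(k) from the singular values and picking the
# smallest k within a 5% tolerance; the singular values below are placeholders.
import numpy as np

s = np.array([2.6e5, 9.0e4, 4.0e4, 2.2e4, 1.5e4, 1.1e4] + [1.5e3] * 21)  # 27 values, non-increasing
total = np.sum(s ** 2)                              # squared Frobenius norm of M

tails = np.cumsum((s ** 2)[::-1])[::-1]             # tails[k] = sum_{i>k} sigma_i^2 (1-based indices)
I_loss = np.sqrt(np.append(tails, 0.0) / total)     # I_loss[k] for k = 0 .. m
eps = 0.05                                          # tolerated information loss (5%)
k = int(np.argmax(I_loss <= eps))                   # smallest k with I_loss(k) <= eps
print(k, I_loss[k])
```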

5 Rendering and Animation

Now we have a set of basis images which capture the essential photometric property of the scene under various illumination conditions. Let b_i (i = 1, ..., k) be the k basis images. A texture image x can be obtained through a linear combination of the b_i:

    x = Σ_{i=1}^{k} c_i b_i + o·1 ,                (3)

where 1 is a vector whose elements are all equal to 1, and o is some scalar. The term o·1 is included to simulate the camera's offset. Figure 11a shows an example of the texture images generated in this way. The coordinate system of our texture image, by our choice, coincides with that of the first reference image.

The texture image is integrated into the VRML model. From the geometric modeling, we are able to connect each vertex of the scene (the reconstructed 3D points) with a corresponding spot on the texture image (the modeled image pixels). Figure 9 shows renderings of the scene from three different viewpoints and with three different textures; they roughly simulate illumination from the lower-right, bottom and top-left directions, respectively. By appropriately varying the coefficients c_i and o in (3), we can generate a sequence of texture images which simulates the appearance of the scene as a light source moves along a specific trajectory. The six texture images shown in Fig. 12 have been generated from the basis images shown in Fig. 8.

Figure 12: Example 1: Simulation of the effect of a light source moving horizontally from left to right.
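Equation (3) is a one-line operation once the basis images are available. The sketch below uses random stand-ins for the basis images and arbitrary coefficients; sweeping the coefficients over time is what produces animations such as the one in Fig. 12.

```python
# Illustrative implementation of Eq. (3): a texture image as a linear combination
# of the k basis images plus a constant offset (all names and values are ours).
import numpy as np

k, n = 6, 50000
B = np.random.rand(k, n) * 255                    # stand-in for the basis images of Sect. 4 (k x n)
c = np.array([0.8, -0.2, 0.1, 0.0, 0.05, 0.0])    # chosen combination coefficients c_i
o = 10.0                                          # scalar offset simulating the camera's offset

x = c @ B + o                                     # Eq. (3): x = sum_i c_i * b_i + o * 1
texture = np.clip(x, 0, 255)                      # clamp to valid grey levels before texture mapping
print(texture.shape)
```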

6 Conclusion and Future Work

In this paper, we have described a working system which automatically builds a VRML model of a scene from real images taken under various illumination conditions. In order to recover the 3D geometric structure of the scene, the images have to be taken from two distinct viewpoints. In order to facilitate the matching process between the two viewpoints, the illumination condition is kept the same for the last image from the first viewpoint and the first image from the second viewpoint. The camera displacement is automatically estimated with a robust least-median-of-squares technique. The full correspondence is established with a correlation-based stereo technique, and the scene is reconstructed for all matched pixels. The whole image ensemble is then expressed in the image coordinate system associated with the first viewpoint. The singular value decomposition is applied to extract the essential photometric information, which yields a set of basis images. These can later be combined linearly into a texture image to simulate the scene appearance under a particular illumination condition. The recovered 3D shape and the texture image allow us to create realistic renderings from different viewpoints under different illumination conditions.

There are several improvements and extensions to be made in our future work. First, color images will be considered. The use of color images will facilitate geometric modeling because matching between two views becomes easier; how to extend our current technique to color in photometric modeling is not yet clear. Second, our current system only uses images from two viewpoints, so the scene model is not complete; multiple viewpoints should be considered in the future. Lastly, we have observed (not shown due to space limitations) a tight relationship between the motion of the light source and the variation of the photometric coefficients. This needs to be studied in a more careful and detailed way. The relationship is very important for rendering and animation, because it can be used to predict the appearance of the scene under an arbitrary light source.

References

[1] O. Faugeras. Three-Dimensional Computer Vision: A Geometric Viewpoint. MIT Press, 1993.
[2] G. Golub and C. van Loan. Matrix Computations. The Johns Hopkins University Press, 1989.
[3] J. Hartman and J. Wernecke. VRML 2.0 Handbook. Addison-Wesley, Aug. 1996.
[4] VRweb: a multi-system VRML viewer. Software available from http://www.iicm.tu-graz.ac.at/vrweb/, 1996.
[5] Z. Zhang, R. Deriche, O. Faugeras, and Q.-T. Luong. A robust technique for matching two uncalibrated images through the recovery of the unknown epipolar geometry. Artificial Intelligence Journal, 78:87–119, Oct. 1995.
