
IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 16, NO. 1, JANUARY 2007


Reconstructing Dense Light Field From Array of Multifocus Images for Novel View Synthesis

Akira Kubota, Member, IEEE, Kiyoharu Aizawa, Member, IEEE, and Tsuhan Chen, Senior Member, IEEE

Abstract—This paper presents a novel method for synthesizing a novel view from two sets of differently focused images taken by an aperture camera array for a scene consisting of two approximately constant depths. The proposed method consists of two steps. The first step is a view interpolation to reconstruct an all-in-focus dense light field of the scene. The second step is to synthesize a novel view by a light-field rendering technique from the reconstructed dense light field. The view interpolation in the first step can be achieved simply by linear filters that are designed to shift different object regions separately, without region segmentation. The proposed method can effectively create a dense array of pin-hole cameras (i.e., all-in-focus images), so that the novel view can be synthesized with better quality.

Index Terms—Blur, image-based rendering (IBR), light-field rendering (LFR), spatial invariant filter, view interpolation.

I. INTRODUCTION

Most view synthesis methods using multiview images involve the problem of estimating scene geometry [1]. A number of methods have been investigated for solving this problem (e.g., stereo matching [2]); however, it is still difficult to obtain accurate geometry for arbitrary real scenes. Using an inaccurate geometry for view synthesis induces visible artifacts in the synthesized novel image. As an alternative to such a geometry-based approach, image-based rendering (IBR) [3], [4] has been studied for view synthesis. IBR does not need to estimate the scene geometry and enables synthesis of photo-realistic novel images, independent of the scene complexity. The idea of IBR is to sample light rays in the scene by capturing multiple images densely enough that novel views can be created without aliasing artifacts through resampling of the sampled light rays [5]–[7]. The sampling density (i.e., camera spacing density) required for nonaliasing resampling is impractically high, however. To reduce the required sampling density, some geometric information is needed [8]. One problem is that, in practice, it is difficult to realize both of these approaches,

Manuscript received November 3, 2005; revised February 12, 2006. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Benoit Macq. A. Kubota is with the Department of Information Processing, Tokyo Institute of Technology, Yokohama 226-8502, Japan (e-mail: [email protected]). K. Aizawa is with the Department of Electrical and Electronic Engineering, University of Tokyo, Tokyo 113-8656, Japan (e-mail: [email protected]). T. Chen is with the Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213-3890 USA (e-mail: [email protected]). Color versions of Figs. 1, 2, 4–8, and 10–14 are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TIP.2006.884938

that is, accurately obtaining the scene geometry and densely capturing light rays, without special equipment such as a laser scanner or a plenoptic camera [9]. Therefore, a more practical and effective solution to this problem is required. To address it, recent IBR research has mainly concentrated on how to accurately obtain the geometric information needed to reduce the required sampling density. One approach is to find feature correspondences between reference images. This is a conventional approach, but it has recently been improved for application to IBR. With this approach, high-confidence feature points that are required to provide a synthesized view of sufficient quality can be detected and matched. Aliaga et al. [10] presented a robust feature detection and tracking method in which potential features are redundantly labeled in every image and only high-confidence features are used for matching. Siu et al. [11] proposed an image registration method between a sparse set of three reference images, allowing feature matching in a large search area. They introduced an iterative matching scheme based on the confidence level of the extracted feature points in terms of topological constraints that hold between triangles composed of three feature points in the reference images. Another approach is to estimate a view-dependent depth map at the novel viewpoint based on the color consistency of corresponding light rays or pixel values between the reference images, adopting a concept used in volumetric techniques [12]–[14], such as space sweeping for scene reconstruction. In this approach, the color consistency is measured at every pixel, or at selected pixels, with respect to different hypothetical depths in the scene, and the depth value is estimated to be the depth that gives the highest consistency. The average of the color values with the highest consistency is then rendered at the pixel of the novel image. This approach is very suitable for real-time processing [15]–[18]. In this paper, we present a novel IBR approach that differs from the conventional ones. The concept of our approach is illustrated schematically in Fig. 1. We deal with a scene consisting of foreground and background objects at two approximately constant depths, and we capture two sets of images with different focuses, as reference images, using a 1-D array of real aperture cameras. Unlike the conventional methods using pin-hole cameras, our proposed method uses aperture cameras to capture differently focused images with each camera: one focused on the foreground and the other focused on the background. Our proposed method consists of two steps. In the first step, we interpolate intermediate images that are focused on both objects at densely sampled positions between all of the capturing cameras. In the second step, we synthesize a novel



Fig. 2. View interpolation problem in the first step of our approach. We interpolate an all-in-focus intermediate image, f, that would be seen from a virtual camera position between the two nearest cameras. We use four reference images for this interpolation: the near-focused and far-focused images at camera C_i, and the near-focused and far-focused images at camera C_{i+1}.

Fig. 1. Concept of our approach. Assuming a scene containing two objects at different depths, we capture near- and far-focused images at each camera position of a 1-D aperture camera array. In the first step, we densely interpolate all-in-focus intermediate images between the cameras by linear filtering of the captured images. In the second step, we apply LFR to synthesizing novel views using the dense light-field data obtained in the first step.

view by light-field rendering (LFR) [6] using the intermediate images (i.e., the dense light-field data) obtained in the first step. The view synthesis in the second step can be easily achieved with adequate quality only if, in the first step, the light-field data are obtained with sufficient accuracy and density to achieve nonaliasing LFR. This paper mainly addresses the problem of view interpolation between the capturing cameras in the first step and presents an efficient view interpolation method using spatially invariant linear filters. The reconstruction filters we present in this paper make it possible to shift each object region according to the parallax required for the view interpolation, without region segmentation. For the simple scene we assume here, conventional vision-based methods such as stereo matching [2] or depth from focus/defocus [19]–[21] can estimate the scene geometry, but they depend on the scene complexity and have a high computational load. In contrast, our approach needs only filtering and is much simpler than such vision-based approaches. In addition, it is independent of the scene complexity as long as the scene consists of two approximately constant depths.

II. RECONSTRUCTING DENSE LIGHT FIELD

A. Problem Description

In the first step of our method, we interpolate intermediate images densely between every camera pair. The view interpolation problem we deal with here is illustrated in Fig. 2. Our goal is to generate an all-in-focus intermediate image that would be captured at an arbitrary position (the virtual camera in Fig. 2) between the adjacent cameras C_i and C_{i+1} (where i is referred to as the camera index number) nearest to the view position. We

use the four reference images captured with these cameras: the image g_{1,i} at camera C_i focused on the foreground; the image g_{2,i} at C_i focused on the background; the image g_{1,i+1} at C_{i+1} focused on the foreground; and the image g_{2,i+1} at C_{i+1} focused on the background (see the example in Fig. 6). The view position of the intermediate image is represented as an intermediate position between the cameras, parameterized by α for 0 ≤ α ≤ 1 (α = 0 at C_i and α = 1 at C_{i+1}), and the spacing between adjacent cameras is constant. We assume that the focal lengths of all cameras, including the virtual camera, are the same.

B. Imaging Model

We model the four reference images and the desired intermediate image by a linear combination of foreground and background textures. Consider the model of the two differently focused images g_{1,i} and g_{2,i} at camera C_i. First, we introduce the foreground texture f_{1,i} and the background texture f_{2,i} that are visible from camera C_i; we define them in the same way as those in [22]:

f_{1,i}(x, y) = \bar{f}_i(x, y) if z_i(x, y) = z_1, and 0 otherwise    (1)

f_{2,i}(x, y) = \bar{f}_i(x, y) if z_i(x, y) = z_2, and 0 otherwise    (2)

where \bar{f}_i is the ideal all-in-focus image that is supposed to be captured with camera C_i, z_i(x, y) is the depth map at the camera position, denoting the depth value corresponding to the pixel coordinate (x, y), and z_1 and z_2 are the depths of the foreground and the background objects, respectively. Note that these textures and the depth map are unknown. The two differently focused images g_{1,i} and g_{2,i} can be modeled by the following linear combination of the defined textures:

g_{1,i} = f_{1,i} + h * f_{2,i},   g_{2,i} = h * f_{1,i} + f_{2,i}    (3)

where h is the point spread function (PSF) and * denotes a 2-D convolution. We assume that the PSF can be modeled as a Gaussian function

h(x, y) = (1 / (2π σ²)) exp(−(x² + y²) / (2σ²))    (4)


where r is the amount of blur, which is related to the standard deviation σ of the Gaussian function [23]. In (3), the same PSF can be used for both defocused regions because the amounts of blur in the two regions become the same after correction of the image magnification caused by the different focus settings (for details, see [24]). The PSF does not depend on the camera position and can be used in common for all the reference images, since we assume that the depths of the two objects are constant for all camera positions. In addition, since the amount of blur is estimated in a preprocessing step using our previously presented method [24], the PSF is given. The linear imaging model in (3) is not correct at occluding boundaries, as reported in [25]. However, it has the advantage that it enables us to use convolution operations, which can be represented by product operations in the frequency domain, resulting in a simpler imaging model [see (8)]. Moreover, satisfactory results are obtained with this model, as shown by our experimental results in Section III. The limitations of this model and their effect on the quality of the intermediate images will be tested on real acquired images in Section IV.

For modeling the all-in-focus intermediate image, we have to consider two things: how to model the parallax and how to model the occluded background texture. To model the parallax, we use a combination of textures that are appropriately shifted according to the intermediate position, parameterized by α. When the textures f_{1,i} and f_{2,i} are used, the intermediate image, say f_A, is modeled by

f_A(x, y; α) = f_{1,i}(x − α d_1, y) + f_{2,i}(x − α d_2, y)    (5)

where d_1 and d_2 are, respectively, the disparities of the foreground and the background objects between the adjacent cameras. These disparities can be estimated in the preprocessing step. The shift amounts we have to apply to the foreground and the background textures are α d_1 and α d_2, respectively. Similarly, when the textures f_{1,i+1} and f_{2,i+1} are used, the intermediate image (say f_B) at the same position is modeled by

f_B(x, y; α) = f_{1,i+1}(x + (1 − α) d_1, y) + f_{2,i+1}(x + (1 − α) d_2, y)    (6)

where −(1 − α) d_1 and −(1 − α) d_2 are the shift amounts to be applied to the foreground and the background textures, respectively. To fill in the occluded background in either one of the two images modeled by (5) and (6), we simply take their weighted average, using weighting values (1 − α) and α for f_A and f_B, respectively, and finally model the desired intermediate image as

f(x, y; α) = (1 − α) f_A(x, y; α) + α f_B(x, y; α).    (7)
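To make the imaging model concrete, the short sketch below synthesizes a near/far-focused pair from two layer textures with a Gaussian PSF, in the spirit of (3) and (4), and forms the shifted and blended intermediate-image model of (5)–(7). The variable names, the shift sign convention, and the use of SciPy's generic Gaussian-blur and shift routines are illustrative assumptions rather than code from the paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, shift as nd_shift

def focused_pair(f1, f2, sigma):
    """Model (3): the near-focused image keeps the foreground texture sharp and
    blurs the background with the Gaussian PSF of (4); the far-focused image is
    the converse."""
    g_near = f1 + gaussian_filter(f2, sigma)
    g_far = gaussian_filter(f1, sigma) + f2
    return g_near, g_far

def intermediate_model(f1_i, f2_i, f1_j, f2_j, d1, d2, alpha):
    """Models (5)-(7): shift each layer texture by its parallax and blend the two
    single-camera predictions with weights (1 - alpha) and alpha."""
    # Prediction from camera i textures, shifted by alpha times each disparity.
    fA = nd_shift(f1_i, (0, -alpha * d1)) + nd_shift(f2_i, (0, -alpha * d2))
    # Prediction from camera i+1 textures, shifted the opposite way by (1 - alpha).
    fB = nd_shift(f1_j, (0, (1 - alpha) * d1)) + nd_shift(f2_j, (0, (1 - alpha) * d2))
    return (1 - alpha) * fA + alpha * fB

# Toy usage with random layer textures standing in for the unknown f_1 and f_2.
rng = np.random.default_rng(0)
f1, f2 = rng.random((64, 64)), rng.random((64, 64))
g_near, g_far = focused_pair(f1, f2, sigma=2.5)
f_half = intermediate_model(f1, f2, f1, f2, d1=18.5, d2=11.4, alpha=0.5)
```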

C. View Interpolation With Linear Filters in the Frequency Domain

Our goal is to generate the intermediate image f in (7) from the four reference images g_{1,i}, g_{2,i}, g_{1,i+1}, and g_{2,i+1} modeled in (3). Note that the PSF and the disparities d_1 and d_2 are given, but the textures f_{1,i}, f_{2,i}, f_{1,i+1}, and f_{2,i+1} are unknown. In this section, we derive reconstruction filters that can generate the desired image directly from the reference images without estimating the textures (i.e., without region segmentation).

First, consider the problem of generating f_A in (5) from g_{1,i} and g_{2,i} in (3). The same problem was already dealt with in our previous paper [22], in which we presented an iterative reconstruction in the spatial domain. In the present paper, we present a much more efficient method using filters in the frequency domain. We take the Fourier transform of (3) and (5) to obtain the imaging models in the frequency domain, as follows:

G_{1,i} = F_{1,i} + H F_{2,i},   G_{2,i} = H F_{1,i} + F_{2,i}    (8)

and

F_A(u, v; α) = exp(−j α d_1 u) F_{1,i}(u, v) + exp(−j α d_2 u) F_{2,i}(u, v)    (9)

where u and v denote the horizontal and vertical frequencies, respectively. We use capital-letter functions to denote the Fourier transforms of the corresponding small-letter functions. These imaging models in the frequency domain are simpler than their spatial-domain counterparts because the convolution and shifting operations are transformed into product operations. By eliminating F_{1,i} and F_{2,i} from (8) and (9), we derive the following sum-of-products formula:

F_A = K_1 G_{1,i} + K_2 G_{2,i}    (10)

where K_1 and K_2 can be considered as the frequency characteristics of linear filters that are applied to G_{1,i} and G_{2,i} in the frequency domain, respectively. The forms of K_1 and K_2 are as follows:

K_1 = [exp(−j α d_1 u) − H exp(−j α d_2 u)] / (1 − H²),   K_2 = [exp(−j α d_2 u) − H exp(−j α d_1 u)] / (1 − H²).    (11)

These filter values cannot be determined stably at u = v = 0 (i.e., the DC component), since the denominator, 1 − H², equals 0 and both limit values of (11) diverge at the DC. This divergence causes visual artifacts in the interpolated image when there is noise or modeling error in the low-frequency components of G_{1,i} and G_{2,i}. To avoid this problem, regularization is needed based on a constraint or prior information about the intermediate image. In this paper, we propose a frequency-dependent shifting method that is designed to have shift amounts that gradually decrease to zero at DC, as shown in Fig. 3. We do not interpolate the DC component of the intermediate image. This is reasonable given that the low-frequency components, including DC, do not cause many visual artifacts in the image, even if they are not shifted appropriately.

Let the amounts of frequency-dependent shifting applied to F_{1,i} and F_{2,i} for F_A be δ_1 and δ_2, respectively. Using a cosine function, we formulate δ_1 as follows:

δ_1(ω) = (α d_1 / 2) [1 − cos(π ω / ω_T)] for ω < ω_T, and α d_1 for ω ≥ ω_T    (12)

where ω is the frequency magnitude and ω_T is a threshold frequency up to which the shift amount changes frequency dependently. We formulate δ_2 similarly to the above equation, but using d_2 instead of d_1.


Fig. 3. Frequency-dependent shift amount. The shift amount is designed to gradually increase from zero at DC to its full value at a preset threshold frequency ω_T; for frequencies larger than ω_T, it has a constant value.
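As a small numerical illustration of the design in (12) and Fig. 3, the sketch below ramps the shift amount from zero at DC to its full value at a threshold frequency with a half-cosine, and holds it constant above the threshold. The half-cosine form follows the reconstruction given above ("using a cosine function") and should be read as an assumption, not a transcription of the original equation.

```python
import numpy as np

def freq_dependent_shift(omega, d_full, omega_t):
    """Shift amount versus frequency magnitude: zero at DC, rising with a
    half-cosine to d_full at omega_t, and constant above the threshold."""
    omega = np.abs(omega)
    ramp = 0.5 * d_full * (1.0 - np.cos(np.pi * omega / omega_t))
    return np.where(omega < omega_t, ramp, d_full)

# Example profile: target shift alpha * d1 with alpha = 0.5 and d1 = 18.5 pixels.
omega = np.linspace(0.0, 0.05, 11)
print(freq_dependent_shift(omega, d_full=0.5 * 18.5, omega_t=0.01))
```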

By using these frequency-dependent shift amounts, we can make the reconstruction filters stable and design them as follows:

K_1 = [exp(−j δ_1 u) − H exp(−j δ_2 u)] / (1 − H²),   K_2 = [exp(−j δ_2 u) − H exp(−j δ_1 u)] / (1 − H²).    (13)

Note that both limit values of (13) converge to 0.5 at the DC.

Second, consider the problem of generating F_B, which is the Fourier transform of (6), using G_{1,i+1} and G_{2,i+1}. Similarly to the first case, we can derive the following formula for generating F_B:

F_B = K_3 G_{1,i+1} + K_4 G_{2,i+1}    (14)

with the stable reconstruction filters K_3 and K_4, which are designed to be

K_3 = [exp(j δ'_1 u) − H exp(j δ'_2 u)] / (1 − H²),   K_4 = [exp(j δ'_2 u) − H exp(j δ'_1 u)] / (1 − H²)    (15)

using the frequency-dependent shift amounts δ'_1 and δ'_2. In this case, δ'_1 is formulated to be

δ'_1(ω) = ((1 − α) d_1 / 2) [1 − cos(π ω / ω_T)] for ω < ω_T, and (1 − α) d_1 for ω ≥ ω_T    (16)

and δ'_2 is formulated in the same manner. Note that both limit values of (15) also converge to 0.5 at the DC.

Fig. 4. Block diagram of the proposed view interpolation method using filters.

Finally, by substituting (10) and (14), with the stable reconstruction filters in (13) and (15), into the Fourier transform of (7), we derive the filtering method for generating the desired intermediate image in the frequency domain:

F(u, v; α) = (1 − α) [K_1 G_{1,i} + K_2 G_{2,i}] + α [K_3 G_{1,i+1} + K_4 G_{2,i+1}].    (17)

By taking the inverse Fourier transform of (17), we can obtain the all-in-focus intermediate image f. This suggests that interpolation of the desired intermediate image can be achieved simply by linear filtering of the reference images. The reconstruction filters we designed in (13) and (15) consist of the PSF and shifting operators (i.e., exponential functions) that are spatially invariant; therefore, view interpolation without region segmentation is possible. We can form these filters because the amount of blur and the disparities are known or can be estimated.

Fig. 4 shows a block diagram of the proposed view interpolation method based on (17). We use filtering in the frequency domain for faster calculation. First, we take the Fourier transform (FT) of the four reference images and multiply them by the reconstruction filters in (13) and (15). We then sum up all the filtered images and finally take the inverse FT of the summed image. It should be noted that region segmentation is not performed in this diagram.

Using the proposed view interpolation method, we can generate an all-in-focus intermediate image at any position between two adjacent cameras. Therefore, we can virtually create images identical to those that would be captured with densely arranged pin-hole cameras. The created set of images is considered as the dense light field, from which novel views from arbitrary positions can be rendered with sufficient quality by LFR. This will be demonstrated in Section III.
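The block diagram of Fig. 4 maps almost directly onto FFT-based code. The sketch below is one possible reading of (11)–(17) under the notation reconstructed above: it builds the four stable reconstruction filters from the Gaussian transfer function and the frequency-dependent shifts, applies them to the FFTs of the four reference images, and inverse-transforms the weighted sum. The exact filter expressions, the DC handling, and the parameter values are assumptions consistent with the derivation, not code from the authors.

```python
import numpy as np

def freq_dependent_shift(omega, d_full, omega_t):
    """Half-cosine ramp of Fig. 3: zero at DC, d_full at and above omega_t."""
    ramp = 0.5 * d_full * (1.0 - np.cos(np.pi * omega / omega_t))
    return np.where(omega < omega_t, ramp, d_full)

def interpolate_view(g1_i, g2_i, g1_j, g2_j, sigma, d1, d2, alpha, omega_t):
    """All-in-focus view interpolation by linear filtering in the frequency domain.

    g1_i, g2_i : near- and far-focused images at camera i
    g1_j, g2_j : near- and far-focused images at camera i+1
    sigma      : standard deviation of the Gaussian PSF (pixels)
    d1, d2     : foreground / background disparities between the cameras (pixels)
    alpha      : normalized position of the virtual camera in [0, 1]
    omega_t    : threshold frequency of the frequency-dependent shift
    """
    rows, cols = g1_i.shape
    u = 2.0 * np.pi * np.fft.fftfreq(cols)[None, :]   # horizontal angular frequency
    v = 2.0 * np.pi * np.fft.fftfreq(rows)[:, None]   # vertical angular frequency
    H = np.exp(-0.5 * sigma**2 * (u**2 + v**2))       # Gaussian PSF transfer function
    omega = np.sqrt(u**2 + v**2)

    def stable_pair(s1_full, s2_full, sign):
        """Filters of the form (e^{sign*j*u*s1} - H*e^{sign*j*u*s2}) / (1 - H^2),
        with shifts tapered to zero at DC so that both filters tend to 0.5 there."""
        s1 = freq_dependent_shift(omega, s1_full, omega_t)
        s2 = freq_dependent_shift(omega, s2_full, omega_t)
        e1, e2 = np.exp(sign * 1j * u * s1), np.exp(sign * 1j * u * s2)
        denom = 1.0 - H**2
        safe = np.where(np.abs(denom) > 1e-8, denom, 1.0)
        ka = np.where(np.abs(denom) > 1e-8, (e1 - H * e2) / safe, 0.5)
        kb = np.where(np.abs(denom) > 1e-8, (e2 - H * e1) / safe, 0.5)
        return ka, kb

    k1, k2 = stable_pair(alpha * d1, alpha * d2, sign=-1.0)              # camera i
    k3, k4 = stable_pair((1 - alpha) * d1, (1 - alpha) * d2, sign=+1.0)  # camera i+1

    G = [np.fft.fft2(g) for g in (g1_i, g2_i, g1_j, g2_j)]
    F = (1 - alpha) * (k1 * G[0] + k2 * G[1]) + alpha * (k3 * G[2] + k4 * G[3])
    return np.real(np.fft.ifft2(F))

# Toy usage with random stand-ins for the four registered reference images.
rng = np.random.default_rng(1)
imgs = [rng.random((256, 256)) for _ in range(4)]
f_mid = interpolate_view(*imgs, sigma=2.5, d1=18.5, d2=11.4, alpha=0.5, omega_t=0.05)
```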

III. EXPERIMENT

A. Experimental Setup

Our experimental setup is shown in Fig. 5. We assume X and Z coordinates as a 2-D world coordinate system. In our experiment, we captured ten sets of two differently focused images (20 reference images in total) at ten different positions on the horizontal X axis with the F-number fixed at 2.4. When capturing the reference images, we used a horizontal motion stage to move a single digital camera (Nikon D1) in the X-axis direction at equal intervals of 8 mm from 0 to 72 mm. Let us refer to the camera at each position as C_i for i = 0, 1, ..., 9. The test scene we set up in this experiment consisted of foreground objects (balls and a cup containing a pencil) and


TABLE I ESTIMATED AMOUNT OF BLUR AND DISPARITIES

Fig. 5. Configuration of scene and camera array used in the experiment.

background objects (textbooks), located roughly at the constant depths z_1 and z_2 on the Z axis, respectively.

B. Preprocessing Steps

Three preprocessing steps were needed: 1) image registration; 2) estimation of the blur amount r; and 3) estimation of the disparities d_1 and d_2. Image registration is needed to correct the difference in displacement and magnification between the two differently focused images acquired at every camera position. The magnification is caused by the difference in focus between the images. We used the image registration method presented in [24], which is based on a hierarchical matching technique.

Parameter estimation in steps 2) and 3) can be performed via camera calibration using captured images of a test chart. Once these parameters are estimated, we do not need to perform the estimation again and can use the same parameters for other scenes so long as they consist of objects at the same two depths. In this experiment, we estimated these parameters by the following image processing using the captured reference images of the test scene. This is because we used a single camera, not a camera array, and it was difficult to set each camera position exactly to the same position used during calibration.

For the blur estimation based on the captured images, we also used our previously presented method [24], which estimates the amount of blur in the foreground and background regions independently, say r_1 and r_2, respectively. The basic idea used in the method is as follows. A blurred version of the near-focused image was created by applying the PSF with a blur amount r and compared with the far-focused image to evaluate the similarity (here we used the absolute difference) between them at each pixel. After evaluating the similarity for various blur amounts r, the blur amount that gave the maximum similarity at a pixel was taken as the estimate at that pixel; in principle, such a pixel belongs to the foreground region. Finally, the estimated blur amounts were averaged over this region. The same idea was applied to estimating r_2. Note that this blur estimation could not precisely segment each region, but it could effectively estimate the blur in the regions with texture patterns, excluding uniform regions and object boundary regions (see [24]).
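A rough sketch of the blur-estimation idea just described: reblur the near-focused image with candidate blur amounts, compare it pixel-by-pixel against the far-focused image, and average the best-matching blur over well-textured pixels. The gradient-based texture mask and the candidate grid are illustrative choices and are not the exact procedure of [24].

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def estimate_blur(g_near, g_far, sigmas, texture_thresh=0.02):
    """For each candidate blur, blur the near-focused image and measure the
    per-pixel absolute difference against the far-focused image; the candidate
    with the smallest difference at a pixel is that pixel's blur estimate."""
    diffs = np.stack([np.abs(gaussian_filter(g_near, s) - g_far) for s in sigmas])
    best = np.asarray(sigmas)[np.argmin(diffs, axis=0)]

    # Keep only well-textured pixels (high local gradient), since uniform
    # regions and object boundaries give unreliable estimates.
    gy, gx = np.gradient(g_near)
    mask = np.hypot(gx, gy) > texture_thresh
    return float(best[mask].mean()) if mask.any() else float(best.mean())

# Toy usage with a synthetic registered near/far-focused pair.
rng = np.random.default_rng(2)
g_near = rng.random((128, 128))
g_far = gaussian_filter(g_near, 2.0)   # synthetic "far-focused" counterpart
print(estimate_blur(g_near, g_far, sigmas=np.linspace(0.5, 4.0, 8)))
```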

The estimated blur amounts at every camera position, together with their averages (ave.) and standard deviations (std.), are shown in Table I. The two estimated values were not exactly the same in most cases, and the estimates of r_1 had some variation (their std. was 0.29 pixels). This is because of estimation errors and depth variation on the surface of the foreground objects. Nevertheless, we could use the average of all the estimates of r_1 and r_2, namely 3.6 pixels, as the representative blur amount for forming the PSF h. This approximation was possible because the proposed view interpolation method is robust to blur estimation errors, which we confirmed experimentally, as described in Section IV-B.

Disparity estimation was performed using template matching with subpixel accuracy. In this experiment, we manually specified a block region to be used as a template. When estimating the disparity of the foreground, d_1, we specified a region of a face pattern drawn at the center of the cup in one of the near-focused images, and we found the horizontal position of the template that yielded the best approximation of the corresponding region in all the other near-focused images. When estimating the disparity of the background, d_2, we specified the region containing the word “SIGNAL” on the textbook and carried out template matching with all the other far-focused images. The estimated disparities d_1 and d_2 between adjacent cameras are shown in the last two columns of Table I. Since the maximum deviation was within 0.5 pixels for both estimated values over all camera pairs, we determined the final estimates of d_1 and d_2 to be the averaged values 18.5 and 11.4 pixels, respectively.
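The disparity estimation can be sketched as one-dimensional template matching along the horizontal direction with a three-point parabolic fit around the best integer shift for subpixel accuracy. The SSD score and the parabolic refinement are common choices assumed here, not necessarily those used in the experiment.

```python
import numpy as np

def horizontal_disparity(template, target_row_block, x0, search_range):
    """Slide `template` horizontally over `target_row_block` (same height),
    score each integer shift by SSD, and refine the best shift with a
    three-point parabolic fit for subpixel accuracy."""
    th, tw = template.shape
    shifts = range(-search_range, search_range + 1)
    scores = []
    for s in shifts:
        patch = target_row_block[:, x0 + s: x0 + s + tw]
        scores.append(np.sum((patch - template) ** 2))
    scores = np.asarray(scores, dtype=float)
    k = int(np.argmin(scores))
    if 0 < k < len(scores) - 1:                      # parabolic refinement
        a, b, c = scores[k - 1], scores[k], scores[k + 1]
        k = k + 0.5 * (a - c) / (a - 2 * b + c)
    return k - search_range                          # signed horizontal disparity

# Toy usage: a template cut at column x0 of one image, matched in the next image.
rng = np.random.default_rng(3)
img_a = rng.random((64, 400))
img_b = np.roll(img_a, 18, axis=1)                   # synthetic shift of 18 pixels
x0 = 150
tmpl = img_a[20:40, x0:x0 + 32]
print(horizontal_disparity(tmpl, img_b[20:40, :], x0, search_range=40))
```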


Fig. 6. Examples of captured real images. (a) Near-focused image at X = 16 mm. (b) Far-focused image at X = 16 mm. (c) Near-focused image at X = 24 mm. (d) Far-focused image at X = 24 mm.

C. Reconstructed Dense Intermediate Images

We constructed the reconstruction filters using the estimated blur amount and disparities described in the previous subsection. Based on the proposed filtering method in Fig. 4, we interpolated nine all-in-focus intermediate images between every pair of adjacent cameras, for a total of 91 images, including the all-in-focus images at the same positions as the cameras (that is, α = 0 and 1). The parameter α was varied from 0 to 1 in equal increments of 0.1 for each interpolation. The threshold frequency ω_T was set to 0.01 for all experiments; this value was determined empirically. The same reconstruction process was carried out in each color channel of the intermediate images.

Examples of captured images are shown in Fig. 6: (a) and (b) are, respectively, the near-focused and far-focused images captured with the camera at X = 16 mm, and (c) and (d) are those captured with the camera at X = 24 mm. These images are 24-bit color images of 280 × 256 pixels. We observed different disparities in the regions of the foreground and background objects; these were measured to be 18.9 pixels and 11.8 pixels, respectively, as shown in Table I. Although the distance between the cameras was set to 8 mm, this was not small enough to interpolate the intermediate images with adequate quality using LFR. This is because the difference between the disparities was about eight pixels, which is larger than the ideal maximum difference of two pixels allowed for nonaliasing LFR [8].

Fig. 7 shows all-in-focus intermediate images interpolated by our proposed method from the reference images in Fig. 6 for α = 0, 0.2, 0.4, 0.6, 0.8, and 1, which correspond to camera positions of 16, 17.6, 19.2, 20.8, 22.4, and 24 mm, respectively. In all interpolated images, it can be seen that both foreground and background regions appear in focus and are properly shifted. Expanded regions (64 × 64 pixels in size) including occluded boundaries in these images are shown in the right column of Fig. 7. At the occluded boundaries, although the imaging model we used is not correct, no significant artifacts are visible. Instead of artifacts, background textures (letters) that should have been hidden are partially visible because the object regions are shifted and blended in our method. This transparency at the boundaries does not produce noticeable artifacts in the final image obtained in the novel view synthesis, as seen in Fig. 10. Filling in the hidden background precisely is generally a difficult problem even for state-of-the-art vision-based approaches. Estimation errors in the segmentation

Fig. 7. Intermediate images interpolated by the proposed method from the four reference images in Fig. 6.

cause unnaturally distorted or broken boundaries, which would be highly visible in a novel image sequence created by successively changing the novel viewpoint.


Fig. 8. Example of epipolar plane images (EPIs) of the reference images and of the images densely interpolated by the proposed method. Here, white regions in (a) and (b) indicate regions of value zero. The proposed method can generate the dense EPI in (c) from the sparse EPIs in (a) and (b) by linear filtering. (a), (b) EPIs of the two sets of differently focused reference images. (c) EPI of the interpolated all-in-focus images.

Our proposed method and the conventional vision-based approaches both need a smoothing process on the boundaries to prevent visible artifacts; in our approach, blending the textures serves this purpose.

Fig. 8(a) and (b) show epipolar plane images (EPIs) constructed from the two sets of differently focused reference images for all ten camera positions. Each horizontal line corresponds to a scan-line image (one fixed scan line is chosen) of each reference image, and its vertical coordinate indicates the corresponding camera position. These EPIs are very sparse: each EPI has only ten scan lines, according to the number of cameras. An EPI of the intermediate images interpolated by the proposed method is shown in Fig. 8(c), where each horizontal line corresponds to a scan-line image of each interpolated image. The interpolated EPI is much denser (91 horizontal scan lines) than the EPIs of the reference images, and it contains straight, sharp line textures. This indicates that the all-in-focus intermediate images were generated accurately by the proposed method. The slope of the lines indicates the depth of the objects: the striped regions with the larger slope correspond to the foreground regions, and the striped regions with the smaller slope, which are partially occluded by the foreground regions, correspond to the background regions. Some conventional view-interpolation methods based on interpolation of EPIs [26]–[28] attempt to detect stripe regions with the same slope. In contrast, our method needs no such region detection but only spatially invariant filtering, under the specific condition that two differently focused images are captured at each camera position for a scene with two depths.

D. Synthesizing a Novel View by LFR

This section describes the second step of our approach, i.e., novel view synthesis by LFR using the interpolated intermediate images obtained in the first step. We also show some experimental results. LFR is performed based on pixel matching between the obtained intermediate images, assuming that the scene lies on a plane (see Fig. 9) [9].

Fig. 9. LFR for synthesizing the novel image.

Let us index the interpolated images by their positions along the camera line; the images between cameras C_i and C_{i+1}, for example, correspond to the interpolated images for 0 ≤ α ≤ 1. Let the novel image be taken from a viewpoint at coordinates (X, Z), with the assumed plane at a constant depth. Each pixel value of the novel image is considered as an intensity value of the light ray passing through the novel viewpoint and the pixel position. We find the intersection (indicated in Fig. 9) of the novel light ray with the assumed plane and project it onto the imaging plane of every interpolated image to obtain the corresponding pixel positions, that is, the corresponding light rays, from which we can determine the two nearest light rays. The intensity value of the novel light ray is finally synthesized as a weighted average of the intensity values of the two nearest light rays (say P_1 and P_2) based on the following linear interpolation:

P = (1 − w) P_1 + w P_2    (18)

where the weight w is given by the coordinate of the novel light ray on the camera line relative to the two nearest interpolated camera positions, as shown in Fig. 9.


Fig. 10. Examples of the novel views synthesized by LFR from the densely interpolated images. (X, Z) are the coordinates of the novel viewpoint along the horizontal and depth axes. In (a) and (b), a zooming effect is demonstrated by changing only the depth position of the viewpoint. In (c) and (d), a parallax effect is demonstrated by changing only the horizontal position of the viewpoint. (a) (X, Z) = (20, −300). (b) (X, Z) = (20, 300). (c) (X, Z) = (20, 100). (d) (X, Z) = (70, 100).

The optimal depth of the assumed plane [8] is determined so that the novel image has the least artifacts due to pixel mismatching over the entire image, and it is given by

Z_opt = 2 / (1/z_1 + 1/z_2).    (19)

Fig. 10 shows examples of novel view images synthesized from various viewpoints. The coordinates (X, Z) denote the novel viewpoint. In order to clearly see the perspective effects of the novel view images, we changed the viewpoint along the horizontal and depth axes independently. In Fig. 10(a) and (b), a zooming (close-up) effect is demonstrated by changing only the depth position of the viewpoint while the horizontal position was fixed at 20 mm. We can see that the image in Fig. 10(b) is not merely a magnified version of the image in Fig. 10(a); the foreground objects (the cup and pencil) were magnified more than the background object in Fig. 10(b). Fig. 10(c) and (d) demonstrates the parallax effect by changing only the horizontal position of the viewpoint. Different shifting (displacement) effects were observed on the objects, and the hidden background texture in one image was visible in the other image.
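The per-ray synthesis of (18) and the optimal focal-plane depth of (19) can be sketched as follows for a single scanline light field. The harmonic-mean depth rule, the pin-hole reprojection geometry, and the nearest-pixel sampling are assumptions made for illustration, and the scene depths below are stand-ins rather than the values used in the experiment.

```python
import numpy as np

def optimal_plane_depth(z_fg, z_bg):
    """Constant focal-plane depth minimizing pixel mismatch for a scene bounded
    by depths z_fg and z_bg (harmonic-mean rule from plenoptic sampling [8],
    assumed form of (19))."""
    return 2.0 / (1.0 / z_fg + 1.0 / z_bg)

def render_pixel(views, cam_x, x_view, z_view, px, z_plane, focal):
    """Synthesize one pixel of the novel view in the spirit of (18).

    views : dense set of interpolated scanline images, shape (n_views, width)
    cam_x : horizontal positions of those views (mm), all assumed at depth 0
    px    : pixel coordinate of the novel view on its image plane
    """
    # Intersection of the novel ray with the assumed plane z = z_plane.
    x_plane = x_view + px * (z_plane - z_view) / focal
    # Two nearest interpolated cameras and the blending weight between them.
    k = int(np.clip(np.searchsorted(cam_x, x_plane) - 1, 0, len(cam_x) - 2))
    w = float(np.clip((x_plane - cam_x[k]) / (cam_x[k + 1] - cam_x[k]), 0.0, 1.0))
    width = views.shape[1]
    def sample(view_idx):
        xp = (x_plane - cam_x[view_idx]) * focal / z_plane   # reproject onto that view
        col = int(np.clip(round(xp + width / 2), 0, width - 1))
        return views[view_idx, col]
    return (1.0 - w) * sample(k) + w * sample(k + 1)         # linear interpolation (18)

# Toy usage: 91 random scanline views spaced 0.8 mm apart along X.
rng = np.random.default_rng(4)
views = rng.random((91, 280))
cam_x = np.linspace(0.0, 72.0, 91)
z_opt = optimal_plane_depth(950.0, 1750.0)                   # stand-in scene depths
print(render_pixel(views, cam_x, x_view=20.0, z_view=-300.0, px=10.0,
                   z_plane=z_opt, focal=50.0))
```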

IV. DISCUSSION

We have shown that we can effectively interpolate all-in-focus intermediate images between two neighboring cameras simply by filtering the reference images captured with these cameras with different focuses. In this section, we experimentally evaluate the performance of our view interpolation method in terms of its accuracy and its robustness against errors in estimating the blur amount.

A. Performance Evaluation

We tested the accuracy of the proposed method for the case of interpolating an image at the center of two camera positions. At each of the two camera positions, near- and far-focused images were captured with an F-number of 2.4 for the same scene used in the experiments. To measure the accuracy of the interpolated image, a ground truth image is required. In this test, we captured an image at the center of the camera positions with a small aperture (F-number: 13) as the ground truth image, so that the image was focused on both regions at different depths.

Fig. 11 shows the ground truth image and the interpolated images obtained by the proposed method with α = 0.5 for different distances between the cameras, namely, 4, 8, 12, and 16 mm. For comparison, we also generated the center image by LFR based on the focal plane at the optimal depth of 1230 mm calculated by (19). The reference images used for LFR were the all-in-focus images generated at the two camera positions by the proposed method with α = 0 and 1. A comparison between the results in Fig. 11 shows the advantage of our method over LFR. In the images obtained by our method, both object regions are in focus and properly shifted, without noticeable artifacts. In contrast, both regions in the images interpolated by LFR suffer from blur or ghosting artifacts. These artifacts are caused by incorrect pixel correspondences due to the large distance between the cameras. Our method can prevent such incorrect pixel correspondences by properly shifting each object region. In addition, this shifting operation can be achieved simply by spatially invariant filtering of the reference images and does not require region segmentation.

Mean-square errors (MSEs) in all color channels were computed as a quality measure between the interpolated images and the ground truth image, as shown in Fig. 12. The MSEs of the LFR method were larger than those of the proposed method. The quality of the images interpolated by our method was sufficiently good in all cases, whereas that of the images interpolated by LFR was much lower, becoming worse as the distance between the cameras increased.

In the interpolated images, when the distance was 16 mm (bottom left in Fig. 11), occluded boundaries (the pencil in the foreground and letters in the background) appeared transparent due to the shifting and blending of different textures by our method. Although our proposed method could not prevent these transparency effects, the blending of shifted textures in our approach had the effect of canceling out errors in f_A and f_B, which are the intermediate images modeled in (5) and (6) before blending. These images are shown in Fig. 13. They contain noticeable artifacts in color value because the reconstruction filters used to generate them, K_1, K_2, K_3, and K_4, had much larger values at lower frequencies and amplified the errors at those frequencies. However, the finally interpolated image at the bottom left of Fig. 11, which is a blend (average) of the two intermediate images, has less visible error, showing the error-canceling effect.
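The quality measure used above is simply a per-channel mean-square error between an interpolated view and the small-aperture ground truth; a minimal version is given below for completeness.

```python
import numpy as np

def per_channel_mse(interpolated, ground_truth):
    """MSE in each color channel between two HxWx3 images of the same size and range."""
    err = (interpolated.astype(float) - ground_truth.astype(float)) ** 2
    return err.reshape(-1, err.shape[-1]).mean(axis=0)   # one MSE per channel

# Toy usage with random 8-bit color images.
rng = np.random.default_rng(5)
a = rng.integers(0, 256, (256, 280, 3), dtype=np.uint8)
b = rng.integers(0, 256, (256, 280, 3), dtype=np.uint8)
print(per_channel_mse(a, b))   # three values, one per stored color channel
```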


Fig. 12. Performance evaluation of the proposed method and the conventional LFR for center view interpolation. MSEs were evaluated between the interpolated view images and the actually captured all-in-focus image (the ground truth image) at the center position.

Fig. 13. Generated intermediate images f_A(x, y; 0.5) and f_B(x, y; 0.5) that are averaged to produce the final interpolated image shown at the bottom left in Fig. 11.

Fig. 11. Comparison between the proposed method and LFR method. (a) The ground truth image, which is the image actually captured at the center of the two cameras with a small aperture (F-number: 13). (b) Interpolated images by (left) the proposed method and (right) LFR method at the center of the two cameras for various distances between the cameras, namely (from top to bottom) 4, 8, 12, and 16 mm.

B. Effect of Blur Estimation Error

In this section, we examine the robustness of our proposed view interpolation method against estimation error of the blur amount for the same scene. Setting various amounts of blur from 1.0 to 6.0 pixels, we interpolated intermediate images at the center of two cameras placed 8 mm apart and measured the MSEs between the interpolated images and the ground truth image. The measured MSEs of the three color channels are shown in Fig. 14. The MSEs in the red and green channels were minimized at a blur amount that is not the same as the estimated blur amount of 3.6 pixels described in Section III-B. This error is, however, tolerable and does not degrade the quality of the intermediate image, as shown in Section IV-A: the MSE in the red channel is smaller than 30 over the wide range of blur amounts from 3.2 to 5.3 pixels, and the MSEs in the other color channels are lower than 23 in that range. Therefore, our method is robust against blur estimation errors. This result also shows that our method allows the scene to have some degree of depth variation, so long as the amount of blur caused in a region is within that range. The MSE in the blue channel becomes lower as the blur amount increases because of the low blue component in the images.

V. CONCLUSION

This paper has presented a novel two-step approach to IBR. Unlike conventional IBR methods, the presented method uses aperture cameras to capture reference images with different focuses at different camera positions for a scene consisting of two depth layers. The first step is a view interpolation for densely generating all-in-focus intermediate images


Fig. 14. Error evaluation of the proposed method for the center view interpolation under various blur amounts. MSEs were evaluated between the interpolated view image and the ground truth image at the center position.

between the camera positions. The obtained set of intermediate images can be used as dense light-field data for high-quality view synthesis via LFR in the second step. We showed that the view interpolation can be achieved effectively by spatially invariant filtering of the reference images, without requiring region segmentation. The presented view interpolation method works well even when the camera spacing is sparser than that required for nonaliasing LFR.

REFERENCES

[1] M. Oliveira, “Image-based modeling and rendering techniques: A survey,” RITA—Rev. Inf. Teor. Apl., vol. IX, pp. 37–66, 2002.
[2] U. R. Dhond and J. K. Aggarwal, “Structure from stereo: A review,” IEEE Trans. Syst., Man, Cybern., vol. 19, no. 6, pp. 1489–1510, Jun. 1989.
[3] C. Zhang and T. Chen, “A survey on image-based rendering—representation, sampling and compression,” EURASIP Signal Process.: Image Commun., vol. 19, pp. 1–28, 2004.
[4] H.-Y. Shum, S. B. Kang, and S.-C. Chan, “Survey of image-based representations and compression techniques,” IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 11, pp. 1020–1037, Nov. 2003.
[5] E. H. Adelson and J. R. Bergen, “The plenoptic function and the elements of early vision,” in Computat. Models of Visual Processing, M. S. Landy and J. A. Movshon, Eds. Cambridge, MA: MIT Press, 1991, pp. 3–20.
[6] M. Levoy and P. Hanrahan, “Light field rendering,” in Proc. ACM SIGGRAPH, 1996, pp. 31–42.
[7] H.-Y. Shum and L.-W. He, “Rendering with concentric mosaics,” in Proc. ACM SIGGRAPH, 1999, pp. 299–306.
[8] J.-X. Chai, X. Tong, S.-C. Chan, and H.-Y. Shum, “Plenoptic sampling,” in Proc. ACM SIGGRAPH, 2000, pp. 307–318.
[9] R. Ng, M. Levoy, M. Bredif, G. Duval, M. Horowitz, and P. Hanrahan, “Light Field Photography With a Hand-Held Plenoptic Camera,” Tech. Rep. CSTR 2005-02, Comput. Sci. Dept., Stanford Univ., Stanford, CA, 2005.
[10] D. G. Aliaga, T. Funkhouser, D. Yanovsky, and I. Carlbom, “Sea of images,” IEEE Comput. Graph. Appl., vol. 23, no. 6, pp. 22–30, Jun. 2003.
[11] A. M. K. Siu and R. W. H. Lau, “Image registration for image-based rendering,” IEEE Trans. Image Process., vol. 14, no. 2, pp. 241–252, Feb. 2005.
[12] R. T. Collins, “Space-sweep approach to true multi-image matching,” in Proc. CVPR, 1996, pp. 358–363.

[13] S. M. Seitz and C. R. Dyer, “Photorealistic scene reconstruction by voxel coloring,” in Proc. CVPR, 1997, pp. 1067–1073.
[14] K. Kutulakos and S. M. Seitz, “A theory of shape by space carving,” Int. J. Comput. Vis., vol. 38, no. 3, pp. 199–218, Jul. 2000.
[15] R. Yang, G. Welch, and G. Bishop, “Real-time consensus-based scene reconstruction using commodity graphics hardware,” in Proc. Pacific Conf. Computer Graphics and Applications, 2002, pp. 225–235.
[16] I. Geys, T. P. Koninckx, and L. V. Gool, “Fast interpolated cameras by combining a GPU based plane sweep with a max-flow regularisation algorithm,” in Proc. Int. Symp. 3D Data Processing, Visualization and Transmission, 2004, pp. 534–541.
[17] K. Takahashi, A. Kubota, and T. Naemura, “A focus measure for light field rendering,” in Proc. Int. Conf. Image Processing, 2004, pp. 2475–2478.
[18] C. Zhang and T. Chen, “A self-reconfigurable camera array,” in Proc. Eurographics Symp. Rendering, 2004, pp. 243–254.
[19] J. Ens and P. Lawrence, “An investigation of methods for determining depth from focus,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 15, no. 5, pp. 97–107, May 1993.
[20] S. K. Nayar and Y. Nakagawa, “Shape from focus: An effective approach for rough surfaces,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 16, no. 8, pp. 824–831, Aug. 1994.
[21] N. Rajagopalan, S. Chaudhuri, and U. Mudenagudi, “Depth estimation and image restoration using defocused stereo pairs,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 11, pp. 1521–1525, Nov. 2004.
[22] K. Aizawa, K. Kodama, and A. Kubota, “Producing object based special effects by fusing multiple differently focused images,” IEEE Trans. Circuits Syst. Video Technol., vol. 10, no. 2, pp. 323–330, Feb. 2000.
[23] M. Subbarao, T.-C. Wei, and G. Surya, “Focused image recovery from two defocused images recorded with different camera settings,” IEEE Trans. Image Process., vol. 4, no. 12, pp. 1613–1627, Dec. 1995.
[24] A. Kubota and K. Aizawa, “Reconstructing arbitrarily focused images from two differently focused images using linear filters,” IEEE Trans. Image Process., vol. 14, no. 11, pp. 1848–1859, Nov. 2005.
[25] N. Asada, H. Fujiwara, and T. Matsuyama, “Seeing behind the scene: Analysis of photometric properties of occluding edges by the reversed projection blurring model,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 2, pp. 155–167, Feb. 1998.
[26] R. Hsu, K. Kodama, and K. Harashima, “View interpolation using epipolar plane images,” in Proc. IEEE Int. Conf. Image Processing, 1994, pp. 745–749.
[27] A. Katayama, K. Tanaka, T. Oshino, and H. Tamura, “A viewpoint dependent stereoscopic display using interpolation of multi-viewpoint images,” in Proc. SPIE Int. Conf. Stereoscopic Displays and Virtual Reality Systems II, 1995, vol. 2409, pp. 11–20.
[28] M. Droese, T. Fujii, and M. Tanimoto, “Ray-space interpolation constraining smooth disparities based on loopy belief propagation,” in Proc. Int. Workshop Systems, Signals, Image Processing, 2004, pp. 247–250.
[29] A. Isaksen, L. McMillan, and S. J. Gortler, “Dynamically reparameterized light fields,” MIT-LCS-TR-778, 1999.

Akira Kubota (S’01–M’04) received the B.E. degree in electrical engineering from Oita University, Oita, Japan, in 1997, and the M.E. and Dr.E. degrees in electrical engineering from the University of Tokyo, Tokyo, Japan, in 1999 and 2002, respectively. He is currently an Assistant Professor with the Department of Information Processing at the Tokyo Institute of Technology, Yokohama, Japan. He was a Postdoctoral Research Associate at the High-Tech Research Center of Kanagawa University, Japan, from August 2004 to March 2005. He was also a Research Fellow with the Japan Society for the Promotion of Science from April 2002 to July 2004. From September 2003 to July 2004, he visited the Advanced Multimedia Processing Laboratory, Carnegie Mellon University, Pittsburgh, PA, as a Research Associate. His research interests include image representation, visual reconstruction, and image-based rendering.


Kiyoharu Aizawa (S’83–M’88) received the B.Eng., M.Eng., and Dr.Eng. degrees in electrical engineering from the University of Tokyo, Tokyo, Japan, in 1983, 1985, and 1988, respectively. He is currently a Professor with the Department of Electrical Engineering and the Department of Frontier Informatics, University of Tokyo. He was a Visiting Assistant Professor at the University of Illinois at Urbana-Champaign, Urbana, from 1990 to 1992. His current research interests are in image coding, image processing, image representation, computational image sensors, multimedia capture, and retrieval of life log data. Dr. Aizawa is a member of IEICE, ITE, and ACM. He received the 1987 Young Engineer Award, the 1990 and 1998 Best Paper Awards, the 1991 Achievement Award, the 1998 Fujio Frontier Award from ITE Japan, the 1999 Electronics Society Award from the IEICE Japan, and the IBM Japan Science Prize 2002. He serves as Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY and is on the editorial board of the Journal of Visual Communications and the IEEE Signal Processing Magazine. He has served for many national and international conferences, including IEEE ICIP, and he was the General Chair of the 1999 SPIE VCIP. He is a member of the Technical Committee on Image and Multidimensional Signal Processing and the Technical Committee on Visual Communication of the SP and CAS societies, respectively.


Tsuhan Chen (S’90–M’93–SM’04) received the B.S. degree in electrical engineering from the National Taiwan University, Taipei, Taiwan, R.O.C., in 1987, and the M.S. and Ph.D. degrees in electrical engineering from the California Institute of Technology, Pasadena, in 1990 and 1993, respectively. He has been with the Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA, since October 1997, where he is currently a Professor and directs the Advanced Multimedia Processing Laboratory. From August 1993 to October 1997, he was with the Visual Communications Research Department, AT&T Bell Laboratories, Holmdel, NJ, and later at AT&T Labs-Research, Red Bank, NJ, as a Senior Technical Staff Member and then a Principal Technical Staff Member. He has co-edited a book titled Multimedia Systems, Standards, and Networks. He has published more than 100 technical papers and holds 15 U.S. patents. His research interests include multimedia signal processing and communication, implementation of multimedia systems, multimodal biometrics, audio-visual interaction, pattern recognition, computer vision and computer graphics, and bioinformatics. Dr. Chen helped create the Technical Committee on Multimedia Signal Processing, as the founding Chair, and the Multimedia Signal Processing Workshop, both in the IEEE Signal Processing Society. His endeavor later evolved into his founding of the IEEE TRANSACTIONS ON MULTIMEDIA and the IEEE International Conference on Multimedia and Expo, both joining the efforts of multiple IEEE societies. He was appointed the Editor-in-Chief of the IEEE TRANSACTIONS ON MULTIMEDIA for 2002 to 2004. Before serving as the Editor-in-Chief of the IEEE TRANSACTIONS ON MULTIMEDIA, he also served on the Editorial Board of the IEEE Signal Processing Magazine and as an Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, the IEEE TRANSACTIONS ON IMAGE PROCESSING, the IEEE TRANSACTIONS ON SIGNAL PROCESSING, and the IEEE TRANSACTIONS ON MULTIMEDIA. He received the Benjamin Richard Teare Teaching Award in recognition of excellence in engineering education from Carnegie Mellon University. He received the Charles Wilts Prize for outstanding independent research in electrical engineering leading to a Ph.D. degree at the California Institute of Technology. He was a recipient of the National Science Foundation CAREER Award, titled “Multimodal and Multimedia Signal Processing,” from 2000 to 2003. He is a member of the Phi Tau Phi Scholastic Honor Society.
