TEMPORALLY CONSISTENT ADAPTIVE DEPTH MAP PREPROCESSING FOR VIEW SYNTHESIS

Martin Köppel‡, Mehdi Ben Makhlouf‡, Marcus Müller∗, and Patrick Ndjiki-Nya∗

‡ Image Communications Group, Technical University of Berlin, Einsteinufer 17, 10587 Berlin, Germany
∗ Fraunhofer Institute for Telecommunications, Heinrich-Hertz-Institut, Einsteinufer 37, 10587 Berlin, Germany

ABSTRACT

In this paper, a novel Depth Image-based Rendering (DIBR) method, which generates virtual views from a video sequence and its associated Depth Maps (DMs), is presented. The proposed approach is especially designed to close holes in extrapolation scenarios, where only one original camera is available or the virtual view is placed outside the range of a set of original cameras. In such scenarios, large image regions become uncovered in the virtual view and need to be filled in a visually pleasing way. In order to handle such disocclusions, a depth preprocessing method is proposed, which is applied prior to 3-D image warping. As a first step, adaptive cross-trilateral median filtering is used to align depth discontinuities in the DM to color discontinuities in the textured image and to further reduce estimation errors in the DM. Then, a temporally consistent and adaptive asymmetric smoothing filter is designed and subsequently applied to the DM. The filter is adaptively weighted in such a way that only the DM regions that may reveal uncovered areas are filtered. Thus, strong distortions in other parts of the virtual textured image are prevented. By smoothing the depth map, objects are slightly distorted and disocclusions in the virtual view are completely or partially covered. The proposed method shows considerable objective and subjective gains compared to the state-of-the-art method.

Index Terms— Depth Image-based Rendering (DIBR), 3-D Video (3DV), Adaptive Cross-Trilateral Median Filtering, Adaptive Asymmetric Gaussian Filtering

1. INTRODUCTION

The goal of view synthesis techniques is to generate virtual views from original camera perspectives. These methods can be used in applications such as 3-D Video (3DV), free viewpoint video, 2-D to 3-D conversion and virtual reality. Given a set of captured images of a real scene, the synthesis of photo-realistic virtual views of the same scene at slightly different viewpoints by processing the original images is also referred to as Image-based Rendering (IBR) [1]. State-of-the-art 3-D world IBR representation methods can be classified into three categories according to the amount of geometric information used [2]: (1) rendering without geometry, (2) rendering with implicit geometry and (3) rendering with explicit geometry. Methods belonging to category (1) utilize several aligned images from different view angles in a scene to generate virtual views using ray-space geometry, without requiring geometric information [3].

This work was funded in part by the German Research Foundation [Deutsche Forschungsgemeinschaft (DFG)] under grant WI 2032/4-1.

The methods belonging to category (2) rely on implicit geometry, typically expressed in terms of feature correspondences among the known images [4]. The methods belonging to category (3) utilize explicit geometry information, often available in the form of depth maps or 3-D geometry [5]. Methods of category (3) usually offer the most flexibility in view synthesis, as they allow almost any virtual view to be rendered, independently of the camera position and angle. If Depth Maps (DMs) or disparity maps are used as explicit 3-D geometry, the methods of category (3) are also called Depth Image-based Rendering (DIBR). The method proposed in this paper utilizes dense DMs as explicit 3-D geometry and thus belongs to the DIBR methods.

A fundamental problem of the DIBR concept is that not every sample in the virtual view necessarily exists in the original textured image. Therefore, unknown image regions become uncovered in the virtual view. This is especially problematic in extrapolation scenarios, where only one original camera and its associated DMs are available or the virtual view position is placed outside of the original camera range. Several methods have been proposed in the literature to address this disocclusion problem. They can be classified into three main categories: (1) DM preprocessing [5, 6], (2) disocclusion/hole filling [7, 8, 9] and (3) image domain warping [10]. DM preprocessing methods smooth the DM in such a way that disocclusions (holes) in the virtual view are diminished or removed. Hole filling methods use existing texture samples to predict the holes in the virtual view. Image domain warping approaches distort non-salient image regions so that no disocclusions are revealed.

In this paper, a method from the DM preprocessing category is presented. DM preprocessing methods reduce the number and the size of the disoccluded areas by smoothing depth discontinuities, usually by means of a Gaussian filter [6]. Nevertheless, smoothing the whole DM can introduce strong distortions. In order to minimize filter-induced distortions, several adaptive filter methods have been proposed [5, 11]. These filters smooth the DM only in the vicinity of salient edges in the DM. In [5], Daribo et al. extracted salient contours through a directional edge detection method, so that only one edge side is detected. Based on the warping direction, they infer on which side of the Foreground (FG) object disocclusions will occur. Daribo et al. [5] then computed the shortest city-block distance to a detected contour. Given the sample-wise distance information, a symmetric Gaussian smoothing filter is adaptively weighted and then applied to the DM.

In this paper, a new DM preprocessing method is presented. Our approach can be used to close small holes when short baselines need to be compensated. In a wide baseline scenario, disocclusions can still be diminished with the proposed method in order to

Fig. 1: Overview of the proposed framework. [Block diagram: the depth maps and original images enter the preprocessing stage (depth map optimization, edge detection, distance map computation, temporal correction and edge dependent filtering), followed by image warping and hole filling, which yield the warped images.]

improve the outcome of a subsequent hole filling method.

2. PROPOSED FRAMEWORK

The proposed depth preprocessing framework is shown in Fig. 1. Our approach is based on the framework proposed by Daribo et al. [5]. Each preprocessing module proposed in [5] (gray blocks in Fig. 1) is substantially optimized in this work. Additionally, an advanced DM optimization method is incorporated and the preprocessing method is extended to 2D+t (blue blocks in Fig. 1) in our workflow.

In a first step, the depth values in the DMs are improved in order to better align them to the textured frames, using adaptive cross-trilateral median filtering [12] (cf. Sec. 3.1). Afterward, the edges that may reveal uncovered areas are extracted with a new directional edge detection method (cf. Sec. 3.2). Then, a Distance Map (DiM) is computed using a more accurate distance measurement than the one proposed in [5] (cf. Sec. 3.3). Next, the DiM is updated taking the DiMs of the neighboring frames into account (cf. Sec. 3.4). In this way, the temporal consistency of the virtual view is maintained. In a final preprocessing step, an asymmetric Gaussian filter is adaptively weighted according to the distance function in the DiM and applied to the DMs (cf. Sec. 3.5). Then, the textured pictures are warped to the virtual position using the preprocessed DMs. Subsequently, the remaining holes in the virtual image are closed with a fast line-wise hole filling method [13] (cf. Sec. 4).

In the following, the image to be filled will be denoted as F, the associated depth map as D and the associated disparity map as E. A sample position in a picture or in a DM will be denoted as (x, y) or (u, v).

3. PREPROCESSING OF DEPTH MAPS

In this section, the problem of closing/diminishing the disoccluded regions in a synthesized view by means of a preprocessing step is considered. As discussed in Sec. 1, one solution to the disocclusion problem consists of smoothing strong discontinuities in the DM. However, smoothing the whole depth map can introduce severe distortions. Hence, this problem is addressed by using an adaptive filter, which applies a stronger smoothing in the vicinity of important depth discontinuities. By further selecting optimized filter coefficients and window sizes, geometrical distortions in the warped pictures can be further reduced [5]. Temporal consistency is achieved by considering several adjacent frames in the adaptive filtering process. Disocclusions may still remain in the virtual view; these are tackled as described in Sec. 4. Before applying the steps described above, depth discontinuities are aligned to corresponding color discontinuities in the textured picture in order to improve the DM (cf. Sec. 3.1).
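Before the individual modules are described, the following condensed sketch (Python with NumPy/SciPy) summarizes the per-frame pipeline under simplifying assumptions: the DM optimization of Sec. 3.1 and the temporal correction of Sec. 3.4 are omitted, a plain mean threshold stands in for the automatic threshold of Sec. 3.2, and the distance falloff radius is invented for illustration.

```python
import numpy as np
from scipy import ndimage

def preprocess_depth_map(D, sigma_v=12.0, sigma_h=4.0):
    """Condensed, illustrative sketch of the per-frame preprocessing chain
    (Secs. 3.1-3.5). D is a depth map where brighter means closer. Parameter
    values and conventions are assumptions, not the paper's settings."""
    D = D.astype(np.float64)
    # FG/BG separation (a plain mean threshold stands in for Otsu, Sec. 3.2)
    mask = D >= D.mean()
    # directional edges: FG samples whose left neighbor is BG
    # (a left-to-right warping direction is assumed)
    ci = mask & ~np.roll(mask, 1, axis=1)
    ci[:, 0] = False                              # undo np.roll wrap-around
    # Euclidean distance to the nearest CI, turned into a [0, 1] weight that
    # is 1 on a CI and falls off with distance (Sec. 3.3)
    dist = ndimage.distance_transform_edt(~ci)
    dim = np.clip(1.0 - dist / 20.0, 0.0, 1.0)    # 20-sample falloff (assumed)
    # edge-dependent blend with an asymmetric Gaussian low-pass (Sec. 3.5)
    Df = ndimage.gaussian_filter(D, sigma=(sigma_v, sigma_h))
    return (1.0 - dim) * D + dim * Df
```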

3.1. Adaptive Cross-Trilateral Median Filtering

Stereo correspondence estimation methods often produce noisy and imprecise DMs [cf. Fig. 2 (b)]. Therefore, adaptive cross-trilateral median filtering [12] is utilized to improve the estimated DMs by aligning depth discontinuities to color discontinuities [cf. Fig. 2 (c)]. Given the improved DMs, we assume that the subsequent preprocessing steps can be applied more reliably. Additionally, more accurate disparities for the 3-D image warping are expected.

The bilateral filter is used to smooth images while preserving edges. Similar to the Gaussian filter, the bilateral filter is defined as a weighted average of samples. The main difference between the two filters is that the bilateral filter takes the variation of intensities into account, in order to preserve edges in the image. If the filter weights of a bilateral filter are calculated in one image and then applied to another one, the filter is called cross-bilateral. Nevertheless, one problem of cross-bilateral filters is the computation of new depth values due to averaging. We prevent the computation of new values by using a weighted median instead of a weighted mean.

Furthermore, two filter parameters must be considered: the photometric spread and the geometric spread. The selection of the geometric spread is a critical issue, as a large geometric spread leads to a strong blur which can remove details in the DM, while a small spread may not be sufficient to eliminate noise in large homogeneous image regions. Hence, we adapt the geometric spread automatically according to local texture properties. For this, a simple and efficient texture measure is applied which examines the intensity deviations of three neighboring horizontal and vertical samples in order to decide whether the image region is homogeneous or textured [12]. If the maximum intensity difference of the samples is bigger than a predefined threshold, a small geometric spread is chosen.

To minimize the influence of badly estimated depth values on the filtering result, an additional confidence kernel is introduced that takes the reliability of the depth estimates into account. For this, two reliability measures, the correlation outcome and the left-right consistency, are used to compute two confidence values per sample. The correlation confidences are calculated as follows:

$$c_{corr}(x, y) = \begin{cases} 0, & \text{if } corrval(x, y) < t_{corr} \\ corrval(x, y), & \text{otherwise} \end{cases} \qquad (1)$$

with $corrval(x, y) = zncc((x, y), E(x, y))$, $corrval \in [-1, 1]$, where $zncc$ is the zero-normalized cross-correlation [12] and $t_{corr}$ is a predefined threshold. The left-right confidences are computed as follows:

$$c_{lr}(x, y) = \begin{cases} 0, & \text{if } diff(x, y) > t_{diff} \\ 1, & \text{if } diff(x, y) = t_{diff} \\ 1/diff(x, y), & \text{otherwise} \end{cases} \qquad (2)$$

with

$$diff(x, y) = \left| E_{lr}(x, y) + E_{rl}(x + E_{lr}(x, y), y) \right|,$$

where $E_{lr}(x, y)$ is the disparity from the left to the right image and $E_{rl}(x, y)$ is the disparity from the right to the left image at position $(x, y)$, and $t_{diff}$ is a predefined threshold. The weights of the confidence kernel $w_{ck}$ are then computed as follows:

$$w_{ck}(x, y) = c_{lr}(x, y) \cdot c_{corr}(x, y). \qquad (3)$$
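A minimal sketch of Eqs. (1)-(3) in NumPy follows. The threshold values, the disparity sign convention and the handling of $diff = 0$ (which Eq. (2) leaves implicit) are assumptions of this sketch.

```python
import numpy as np

def confidence_kernel(corrval, E_lr, E_rl, t_corr=0.5, t_diff=1.0):
    """Per-sample confidence weights w_ck (Eqs. 1-3). `corrval` holds the
    ZNCC matching scores in [-1, 1]; E_lr and E_rl are the left-to-right
    and right-to-left disparity maps (same shape as corrval)."""
    # Eq. (1): correlation confidence
    c_corr = np.where(corrval < t_corr, 0.0, corrval)
    # diff(x, y) = |E_lr(x, y) + E_rl(x + E_lr(x, y), y)|
    h, w = corrval.shape
    ys, xs = np.mgrid[0:h, 0:w]
    x_shift = np.clip(xs + np.rint(E_lr).astype(int), 0, w - 1)
    diff = np.abs(E_lr + E_rl[ys, x_shift])
    # Eq. (2): left-right confidence; diff = 0 is mapped to full
    # confidence here (an assumption)
    inv = np.ones_like(diff, dtype=np.float64)
    nz = diff > 0
    inv[nz] = 1.0 / diff[nz]
    c_lr = np.where(diff > t_diff, 0.0, np.where(diff == t_diff, 1.0, inv))
    # Eq. (3): combined confidence kernel
    return c_lr * c_corr
```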

Subsequently, the confidence kernel and the color kernel are used to determine the optimal filter weights for each filter window [12]. Note that the confidence kernel is computed from the disparity map that is derived from the DM, while the color kernel is computed from the textured image. The filter is then applied iteratively and converges after a few iterations. Due to the usage of an additional confidence kernel, the filter is called trilateral instead of bilateral. The optimized DM fits the objects and regions of the image better than the original one (cf. Fig. 2).

Fig. 2: Results of the depth map optimization method. (a) Original texture image (“Balloons”, Cam. 3, Frame 172). (b) Depth map provided by MPEG. (c) Adaptive cross-trilateral median filtered depth map.

Using adaptive cross-trilateral median filtering aligns the depth edges to the edges in the frame. Nevertheless, the edges in the textured images can be transitions over several samples, and adaptive cross-trilateral median filtering may place the edges in the DMs into these edge transitions [cf. Fig. 3 (a)]. This leads to ghosting effects in the virtual view [cf. Fig. 3 (b)]. One solution to this problem is to dilate the DMs in such a way that the FG objects in the DM are slightly bigger than the corresponding objects in the textured image [cf. Fig. 3 (c)]. Ghosting effects are thus prevented [cf. Fig. 3 (d)]. A 6 × 6 structuring element is used for this gray-scale dilation operation, as sketched below.
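The dilation step can be expressed in a few lines with SciPy; this is a minimal sketch, assuming the depth map is a 2-D uint8 array with brighter values for the foreground.

```python
import numpy as np
from scipy import ndimage

# Gray-scale dilation of the depth map with the 6 x 6 structuring element:
# every sample is replaced by the maximum inside its window, so the brighter
# FG objects grow by a few samples and the depth edges are pushed out of the
# blurred color-edge transitions.
def dilate_depth(depth):
    return ndimage.grey_dilation(depth, size=(6, 6))
```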

As can be seen in Fig. 2, adaptive cross-trilateral median filtering [12] leads to improved DMs and therefore the further (pre)processing steps can be applied more reliably. In Sec. 3.2, the edges that may reveal holes in a single virtual frame are extracted from the optimized DM using a new directional edge detection method. These edges will be referred to as Contours of Interest (CI) in the following.

Fig. 3: Improvements through depth map dilation. [Each panel plots texture sample intensities and depth samples (intensity axis 30-210) along a horizontal line over the image row positions 0-36.] (a) Sample intensities and depths for a horizontal line. (b) Sample values after 3-D image warping; ghosting effects appear at the hole border. (c) Sample intensities and dilated depths for a horizontal line. (d) Sample values after 3-D image warping; ghosting effects are prevented.

3.2. Extracting the Contours of Interest

Edges capture local intensity variations and are an important image property. They indicate the boundaries of objects and show the transitions between segments from the FG and the Background (BG). In the virtual views, disocclusions mostly appear close to strong depth discontinuities. Hence, given the warping direction, the CIs can be detected before operating the 3-D image warping by applying a directional edge detector. Methods such as the hysteresis approach used in [5] aim to find appropriate contour information by utilizing three fixed thresholds. Therefore, this method cannot adapt itself to changing depth conditions. Additionally, some important edges that are capable of revealing disocclusions are ignored by this approach [5] [cf. Fig. 4 (c)]. To tackle this problem, an appropriate threshold that separates FG and BG areas in the DM is automatically computed utilizing the method proposed in [14]. The threshold used in this method is the gray level at which the within-class variance is at its minimum [14], i.e., the chances of a sample falling into FG or BG are the same. Using this approach, the edge extraction method can adapt itself to depth changes. Once the threshold is computed, it serves to separate the FG areas from the BG texture by creating a binary mask [cf. Fig. 4 (b)]. Subsequently, a directional edge detector is applied to the binary mask to find the appropriate CI, which is stored in a Contours of Interest Map (CIM).
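A minimal sketch of this contour extraction, using scikit-image's Otsu implementation for [14]; which FG border side reveals disocclusions depends on the camera setup, so the left/right convention below is an assumption.

```python
import numpy as np
from skimage.filters import threshold_otsu

def contours_of_interest(depth, left_to_right=True):
    """Binarize the DM with an automatically computed threshold [14], then
    keep only the FG border on the side that can reveal disocclusions for
    the given warping direction (Sec. 3.2)."""
    t = threshold_otsu(depth)        # gray level minimizing within-class variance
    mask = depth > t                 # binary FG mask [cf. Fig. 4 (b)]
    if left_to_right:
        ci = mask & ~np.roll(mask, 1, axis=1)   # FG samples with BG left neighbor
        ci[:, 0] = False                        # undo np.roll wrap-around
    else:
        ci = mask & ~np.roll(mask, -1, axis=1)  # FG samples with BG right neighbor
        ci[:, -1] = False
    return ci                        # Contours-of-Interest Map (CIM)
```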

3.3. Computing the Distance Map

If the contour information is given, the shortest distance of each sample to a CI can be determined. The distance information is then stored in a DiM [cf. Fig. 4 (e), (f)]. All values in the DiM are normalized to lie in the range [0, 1]. In our approach, the sample-wise distances to a CI are computed using the Euclidean metric to achieve more precise measures as compared to the city-block metric used in [5]. The DiM is thus computed as follows:

$$DiM(x, y) = \min\{d(D(x, y), CIM(u, v))\}, \quad (x, y) \in D;\; (u, v) \in CI, \qquad (4)$$

where $d(\cdot, \cdot)$ represents the Euclidean distance of each sample in D to a CI in CIM. For a given distance $r$ to a CI, the error $\eta$ induced by the city-block geometry compared to the Euclidean geometry can reach:

$$\eta = \frac{r}{\sqrt{2}} \cdot (\sqrt{2} - 1). \qquad (5)$$

This error affects the precision of the generated DiM and therefore has a significant impact on the weighted smoothing filter. Thus, avoiding this error by using the Euclidean distance yields improved rendering results [cf. Fig. 4 (h)]. The distance information in the DiM [cf. Fig. 4 (e), (f)] is then temporally corrected (cf. Sec. 3.4) and subsequently used as a weight function to adapt the smoothing filter (cf. Sec. 3.5).
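Both distance transforms are available in SciPy; a minimal sketch, in which the linear falloff radius used for the [0, 1] normalization is an assumption (the paper does not state how the normalization is performed):

```python
import numpy as np
from scipy import ndimage

def distance_map(cim, falloff=20.0):
    """Shortest distance of every sample to a CI (Sec. 3.3). `cim` is the
    boolean Contours-of-Interest Map; distance_transform_edt yields the
    exact Euclidean distance, while the taxicab variant corresponds to the
    coarser city-block metric used in [5]."""
    d_euclid = ndimage.distance_transform_edt(~cim)   # distance to nearest CI
    d_city = ndimage.distance_transform_cdt(~cim, metric='taxicab')
    # weight in [0, 1] that is 1 on a CI and falls off linearly
    dim = np.clip(1.0 - d_euclid / falloff, 0.0, 1.0)
    return dim, d_euclid, d_city
```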

Fig. 4: Depth preprocessing steps pictured on the sequence “Balloons” (frame 142, camera 3). (a) The adaptive cross-trilateral median filtered depth map. (b) Binary mask created from the depth map. (c) CI detected by the hysteresis approach. (d) CI detected by the proposed approach. (e) Distance map computed with the city-block method. (f) Distance map computed with the Euclidean method. (g) and (h) Enlarged sub-frames of the virtual image after 3-D image warping using (g) the city-block and (h) the Euclidean method.

3.4. Temporal Correction

The CIs are extracted for each frame separately. This means that detected CIs in adjacent frames may not be consistent. The DiM is, however, used as a weighting function for the adaptive smoothing filter and is derived from the CI. Hence, temporal inconsistencies in the CIs lead to temporal deviations in the DiM and therefore also in the virtual view. This can lead to pumping effects. Such effects occur when a CI is detected in one frame but not in a subsequent one.

This typically leads to object distortions in one frame at the CI position, while in the following frame the same image region appears non-distorted. The Human Visual System is very susceptible to temporal changes and thus these errors should be prevented or at least reduced in order to compute subjectively pleasing results. Therefore, the actual DiM is temporally corrected by considering the DiMs of the adjacent frames. This can be formalized as follows:

$$DiM^{\tau}(x, y) = \varpi_1 \cdot DiM_{t-1}(x, y) + \varpi_2 \cdot DiM_{t}(x, y) + \varpi_3 \cdot DiM_{t+1}(x, y), \qquad (6)$$

where $DiM^{\tau}$ is the temporally corrected DiM, $DiM_{t-1}$ is the DiM of the previous frame, $DiM_t$ is the DiM of the current picture and $DiM_{t+1}$ is the DiM of the subsequent frame. $\varpi_1$, $\varpi_2$ and $\varpi_3$ are parameters to weight the three DiMs, with $\varpi_1 + \varpi_2 + \varpi_3 = 1$. By merging the actual DiM with the DiMs of adjacent frames, the temporal transitions between adjacent DiMs are smoothed and the pumping effect is reduced.
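Eq. (6) amounts to a per-sample weighted average over three frames; a minimal sketch, where the concrete weight values are placeholders since the paper does not state them:

```python
import numpy as np

# Temporal correction of the distance map (Eq. 6): the current DiM is
# blended with its temporal neighbors. The weights must sum to one.
def temporal_correction(dim_prev, dim_cur, dim_next, w=(0.25, 0.5, 0.25)):
    assert abs(sum(w) - 1.0) < 1e-9       # varpi_1 + varpi_2 + varpi_3 = 1
    return w[0] * dim_prev + w[1] * dim_cur + w[2] * dim_next
```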

3.5. Edge Dependent Weighting of the Smoothing Filter

Once the DiM is computed, a low-pass filter can be applied adaptively to the CI neighborhood. The smoothing strength increases the closer a CI is. The smoothing filter is thus adaptively weighted according to the following operation:

$$\tilde{D}(x, y) = (1 - DiM^{\tau}(x, y)) \cdot D(x, y) + DiM^{\tau}(x, y) \cdot D_f(x, y), \qquad (7)$$

where $D_f$ represents the fully low-pass filtered DM, which is computed by applying firstly the vertical and secondly the horizontal Gaussian convolution kernel $h$ (cf. Sec. 5):

$$D_f = D * h, \quad \text{with} \quad h(u, v) = \frac{h_g(u, v)}{\sum_u \sum_v h_g(u, v)}, \quad h_g(u, v) = e^{-\frac{u^2 + v^2}{2\sigma^2}}. \qquad (8)$$
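A minimal sketch of Eqs. (7) and (8): SciPy's gaussian_filter applies the separable vertical and horizontal kernels in one call; it truncates the kernel internally, so the fixed window sizes ω of Sec. 5 are not modeled exactly. The asymmetric choice of σ is discussed next.

```python
import numpy as np
from scipy import ndimage

def edge_dependent_smoothing(D, dim_tau, sigma_v=12.0, sigma_h=4.0):
    """Blend the DM with its fully low-pass filtered version D_f, weighted
    per sample by the temporally corrected distance map dim_tau in [0, 1]
    (Eq. 7). sigma follows the asymmetric setting of [6] (cf. Sec. 5)."""
    # Eq. (8): separable Gaussian, vertical (axis 0) then horizontal (axis 1)
    Df = ndimage.gaussian_filter(D.astype(np.float64), sigma=(sigma_v, sigma_h))
    # Eq. (7): edge-dependent blending
    return (1.0 - dim_tau) * D + dim_tau * Df
```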

In contrast to [5], an asymmetric Gaussian filter is utilized in this work to smooth the DM. This is based on the fact that the Human Visual System is more sensitive to the depth cues received from horizontal gradients than to those received from vertical ones in images projected to the left and right eyes [6]. Hence, a stronger filter can be applied in the vertical direction than in the horizontal one. Furthermore, the asymmetric nature of the filter allows distortions in the synthesized view to be reduced by assigning a stronger smoothness to the vertical direction [cf. Fig. 5 (d), (h)]. The advantages of using an asymmetric Gaussian filter instead of a symmetric one have been evaluated for DIBR scenarios where the whole DM was smoothed [6]. In this work, it is shown that using an asymmetric filter for partially smoothing DMs also leads to strong improvements. We utilized the filter settings proposed in [6] for the horizontal (σh, ωh) and the vertical direction (σv, ωv) (cf. Sec. 5).

4. WARPING AND HOLE FILLING

In order to warp the original sample values to the new virtual image position, an existing state-of-the-art renderer [13] is utilized. The sample positions have to be shifted along the epipolar lines according to the preprocessed disparity values. Due to the fact that all utilized Moving Pictures Expert Group (MPEG) 3DV test sequences are rectified, the epipolar lines are horizontally aligned with the image rows. The warping process is therefore simplified to a 1-D problem: the original samples only need to be shifted along the image rows.
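For illustration, a minimal sketch of such a 1-D shift for a grayscale frame; it is not the renderer of [13], and its simple overwrite-based occlusion handling is an assumption.

```python
import numpy as np

def warp_1d(frame, disparity):
    """Shift every sample along its image row by its (preprocessed)
    disparity. Positions that receive no sample keep -1 and mark the
    disocclusions to be filled afterwards (Sec. 4)."""
    h, w = frame.shape
    virt = np.full((h, w), -1, dtype=np.int32)
    for y in range(h):
        for x in range(w):
            xv = x + int(round(disparity[y, x]))
            if 0 <= xv < w:
                # samples written later overwrite earlier ones; depending on
                # the warp direction this resolves simple occlusions
                virt[y, xv] = frame[y, x]
    return virt
```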

Table 1: Objective results for SPSNR and TPSNR.

                     SPSNR (dB)             TPSNR (dB)
Seq.   Cam.    Daribo [5]   Prop.     Daribo [5]   Prop.
S1     3→1       50.60      50.74       49.20      49.33
       3→2       51.11      51.31       49.80      49.92
       5→6       52.31      52.38       50.74      50.95
       5→7       51.71      51.98       50.03      50.45
S2     2→0       46.03      47.22       48.04      48.75
       2→1       46.86      48.01       49.69      50.82
       4→5       47.98      48.11       52.04      52.55
       4→6       47.41      46.91       50.84      49.45

Finally, the remaining disocclusions in the virtual image are covered using the fast line-wise filling method proposed in [13]. The filling method cannot recover high-frequency textures or vertical/tilted edges. However, since the focus of this work is the preprocessing step, a fast and simple hole filling method [13] is chosen. Nevertheless, advanced texture synthesis methods [7, 8] can lead to further improvements.

5. EXPERIMENTAL RESULTS

The proposed approach is compared to the DM preprocessing method proposed in [5]. For the evaluation of the proposed algorithm, two MPEG MVD test sequences are used: “Balloons” (S1, 300 frames) and “Newspaper” (S2, 300 frames). S1 and S2 have a resolution of 1024 × 768 samples and a camera spacing of 5 cm. We evaluated our framework with the filter settings that are proposed and evaluated in [6]. Hence, the parameters are chosen as follows. Symmetric Gaussian: σh = σv = 4, ωh = ωv = 13; asymmetric Gaussian: σh = 4, σv = 12, ωh = 13, ωv = 41. The parameter settings for t_diff and t_corr are chosen according to [12]. The extrapolation capabilities of the proposed approach are evaluated as follows: the extrapolated virtual frame is rendered from the original view with its associated preprocessed DM. The baseline between the virtual view and the original camera is set to one or two camera distances (cf. Table 1; Cam. “3 → 1” means, for example: original camera three warped to virtual camera position one).

In this work, each module is compared to its corresponding module in [5]. In order to obtain meaningful results, only the modules to be compared receive different parameter settings. The remaining parameters are assigned the same values. Hence, it is possible to judge each module individually and to determine the best option for it.

As shown in [15] and [16], classical full-reference objective measurements such as PSNR and SSIM do not match the perceived quality of test subjects for rendered virtual views. This is especially the case when image distortions are utilized to cover the holes. Hence, our framework was evaluated with the no-reference objective measurements SPSNR [17], which measures the spatial consistency, and TPSNR [17], which measures the temporal consistency. These measures were especially designed to evaluate the performance of view synthesis algorithms. The objective results for the evaluated sequences are shown in Table 1. As can be seen, our method outperforms the approach proposed in [5] in terms of both spatial and temporal consistency. The method proposed by Daribo et al. [5] performs better only for the virtual view “S2: 4 → 6”, because our result is sharper but quite noisy.

Fig. 4 and 5 show subjective results for some frames and sub-frames from S1 and S2.

Electronic magnification may be required to see all the details. In Fig. 4 (a)-(h), results for frame 142 of S1 are depicted. Fig. 4 (a) shows the adaptive cross-trilateral median filtered DM and Fig. 4 (b) the binary mask created from the DM using [14]. The result of our edge detection method is depicted in Fig. 4 (d) and that of the method proposed in [5] in Fig. 4 (c). As can be seen in Fig. 4 (d), more contours that are capable of revealing disocclusions are found using our method. Fig. 4 (e) and (f) show two DiMs computed with the (e) city-block and the (f) Euclidean distance. The brighter the luminance values in the DiM, the shorter the distance to the nearest CI. Fig. 4 (g) and (h) show sub-frame results of the virtual view 2 computed from the original view 3. Fig. 4 (g) shows a sub-frame result using an adaptive asymmetric smoothing filter weighted with a DiM that has been computed utilizing the city-block distance. As can be seen in Fig. 4 (g), annoying stair effects appear, which can be avoided using a DiM that has been computed utilizing the Euclidean distance [cf. Fig. 4 (h)].

In Fig. 5 (a)-(d), the superiority of adaptive asymmetric filtering compared to adaptive symmetric filtering is pictured. The DM of frame 131 of S2 of camera 4 is shown in Fig. 5 (a), and the adaptively smoothed DM using an asymmetric Gaussian filter is shown in Fig. 5 (b). Sub-frame results of the virtual view 5 rendered from the original camera 4 are shown in Fig. 5 (c) and (d). Hereby, (c) is rendered using a symmetrically Gaussian filtered DM, while Fig. 5 (d) is rendered using the DM shown in Fig. 5 (b) to warp the original view to the virtual position. By utilizing an asymmetrically filtered DM, strong distortions in the virtual view can be reduced [cf. Fig. 5 (d)].

In Fig. 5 (f)-(h), virtual sub-frame results of the virtual camera 1 rendered from the original view 3 for frame 130 of the sequence S1 are shown. The original sub-frame (camera 3) is pictured in Fig. 5 (e). Fig. 5 (f) shows the virtual result utilizing the original DM to warp the original frame to the virtual position. Disocclusions are marked white in Fig. 5 (f). The sub-frame result using the framework proposed in [5] is shown in Fig. 5 (g), while the final sub-frame result utilizing the proposed method is shown in Fig. 5 (h). By using the proposed approach, temporal inconsistencies, filter-induced artifacts and distortions can be substantially diminished.

6. CONCLUSIONS AND FUTURE WORK

In this paper, a novel DIBR method is presented. The approach is mainly developed to close or diminish uncovered areas in extrapolation scenarios, where only one camera and its associated DMs are available or the virtual camera lies outside the range of a set of originally captured views. The proposed method preprocesses the DMs in such a way that objects in the virtual image are slightly distorted and thus holes are covered or diminished. As a first step, the DMs are optimized using adaptive cross-trilateral median filtering. Then, an adaptive and temporally consistent smoothing filter is designed, which only processes DM regions that are capable of revealing disocclusions. By using an adaptive smoothing filter, filter-induced artifacts are reduced. The remaining uncovered areas are finally covered with a fast line-wise filling method. The proposed method shows significant objective and subjective gains compared to the state of the art. In future work, the results will be evaluated in detailed subjective experiments. Furthermore, advanced texture synthesis methods will be incorporated into the framework.


Fig. 5: (a)-(d) Results for the sequence “Newspaper”, frame 131, and (e)-(h) “Balloons”, frame 130. (a) The adaptive cross-trilateral median filtered depth map of camera 4. (b) Edge dependent smoothing using an asymmetric Gaussian filter. (c) and (d) Sub-frame results of camera 6 rendered from the original camera 4, where a symmetric Gaussian filter is used in (c) and an asymmetric Gaussian filter is used in (d) to smooth the DM in the preprocessing step. (e) Original sub-frame of camera 3. (f) Virtual sub-frame of camera 1 warped from camera 3 using the original depth map. Disocclusions are marked white. (g) Virtual sub-frame using the method proposed in [5] and (h) using the proposed approach.

7. REFERENCES

[1] L. McMillan and G. Bishop, “Plenoptic modeling: An image-based rendering system,” in Proc. of ACM SIGGRAPH, Los Angeles, USA, Aug. 1995, pp. 39–46.

[2] H.-Y. Shum and S. B. Kang, “A review of image-based rendering techniques,” in Proc. of SPIE, USA, Dec. 2000, pp. 2–13.

[3] M. Levoy and P. Hanrahan, “Light field rendering,” in Proc. of ACM SIGGRAPH, New York, USA, Aug. 1996, pp. 31–42.

[4] S. Chen and L. Williams, “View interpolation for image synthesis,” in Proc. of ACM SIGGRAPH, USA, Aug. 1993, pp. 279–288.

[5] C. Zhu, L. Yu, and M. Tanimoto, 3D-TV Systems with Depth-Image-Based Rendering, chapter 6 written by Daribo et al.: Hole Filling for View Synthesis, Springer, 2012.

[6] L. Zhang and W. A. Tam, “Stereoscopic image generation based on depth images for 3D TV,” IEEE Trans. on Broadcasting, vol. 51, no. 2, pp. 191–199, 2005.

[7] P. Ndjiki-Nya, M. Köppel, D. Doshkov, H. Lakshman, P. Merkle, K. Müller, and T. Wiegand, “Depth image-based rendering with advanced texture synthesis for 3-D video,” IEEE Trans. on Multimedia, vol. 13, no. 3, pp. 453–465, 2011.

[8] Y. Mori, N. Fukushima, T. Yendo, T. Fujii, and M. Tanimoto, “View generation with 3D warping using depth information for FTV,” Elsevier Signal Processing: Image Communication, vol. 24, pp. 65–72, 2009.

[9] K. Müller, A. Smolic, K. Dix, P. Merkle, P. Kauff, and T. Wiegand, “View synthesis for advanced 3D video systems,” EURASIP Journal on Image and Video Processing, 2008, Article ID 438148, 11 pages.

[10] M. Farre, O. Wang, M. Lang, N. Stefanoski, A. Hornung, and A. Smolic, “Automatic content creation for multiview autostereoscopic displays using image domain warping,” in Proc. IEEE International Conference on Multimedia & Expo (ICME), Barcelona, Spain, July 2011, pp. 1–6.

[11] S.-B. Lee and Y.-S. Ho, “Discontinuity-adaptive depth map filtering for 3D view generation,” in Proc. IMMERSCOM, Brussels, Belgium, Jun. 2009.

[12] M. Mueller, F. Zilly, and P. Kauff, “Adaptive cross-trilateral depth map filtering,” in Proc. of 3DTV-CON, Tampere, Finland, Jun. 2010.

[13] H. Schwarz, C. Bartnik, S. Bosse, H. Brust, T. Hinz, H. Lakshman, D. Marpe, P. Merkle, K. Müller, H. Rhee, G. Tech, M. Winken, and T. Wiegand, “Description of 3D video technology proposal by Fraunhofer HHI (HEVC compatible; configuration B),” JTC1/SC29/WG11 Doc. m22571, Nov. 2011.

[14] N. Otsu, “A threshold selection method from gray-level histograms,” IEEE Trans. on Systems, Man, and Cybernetics, vol. 9, no. 1, pp. 62–66, 1979.

[15] E. Bosc, R. Pepion, P. Le Callet, M. Köppel, P. Ndjiki-Nya, M. Pressigout, and L. Morin, “Towards a new quality metric for 3-D synthesized view assessment,” IEEE Journal of Selected Topics in Signal Processing, vol. 5, no. 7, pp. 1332–1343, 2011.

[16] E. Bosc, R. Pepion, P. Le Callet, M. Köppel, P. Ndjiki-Nya, L. Morin, and M. Pressigout, “Perceived quality of DIBR-based synthesized views,” in Proc. of SPIE Optical Engineering + Applications, San Diego, USA, Aug. 2011.

[17] K.-J. Oh, S. Yea, A. Vetro, and Y.-S. Ho, “Virtual view synthesis method and self-evaluation metrics for free viewpoint television and 3D video,” Int. J. Imaging Syst. Technol., vol. 20, no. 4, pp. 378–390, Dec. 2010.
