2010 International Conference on Pattern Recognition

Spatial and Temporal Enhancement of Depth Images Captured by a Time-of-flight Depth Sensor

¹Sung-Yeol Kim, ²Ji-Ho Cho, ¹Andreas Koschan, and ¹Mongi A. Abidi
¹Imaging, Robotics and Intelligent Systems Lab, The University of Tennessee, Knoxville
²Dept. of Mechatronics, Gwangju Institute of Science and Technology
¹{sykim, akoschan, abidi}@utk.edu, ²[email protected]

Abstract

In this paper, we present a new method to enhance depth images captured by a time-of-flight (TOF) depth sensor spatially and temporally. In practice, depth images obtained from TOF depth sensors have critical problems, such as optical noise, unmatched boundaries, and temporal inconsistency. In this work, we improve depth quality by performing a newly-designed joint bilateral filtering, color segmentation-based boundary refinement, and motion estimation-based temporal consistency. Experimental results show that the proposed method significantly minimizes the inherent problems of the depth images so that we can use them to generate a dynamic and realistic 3D scene.

1. Introduction

As a 3D video representation, video-plus-depth [1], which is an image sequence of synchronized color and depth images, is widely accepted as visual media for future 3D multimedia applications. For a practical use of video-plus-depth in industry, it is very important to estimate accurate depth information from a real scene. A time-of-flight (TOF) depth sensor [2] directly provides depth information from a natural scene by integrating a high-speed pulsed infrared (IR) light source with a conventional video camera. The depth sensor produces more accurate depth information on textureless, depth-discontinuous, and occluded regions than conventional depth estimation methods [3]. Nevertheless, handling depth images captured by TOF depth sensors can be difficult due to their inherent problems, such as optical noise, unmatched boundaries between a depth image and its color image, lost depth information on shiny and dark surfaces, and temporal depth flickering artifacts on static and stationary objects. The spatial and temporal problems of the depth images usually arise because of the different reflectivity of IR light according to color variation in objects and lighting conditions. Figure 1 shows color and depth images captured by a TOF depth sensor and the spatial problems of raw depth images.

Figure 1. A frame captured by a TOF depth sensor and inherent problems of depth images in the spatial domain: (a) color and depth images captured by a TOF depth sensor; (b) optical noise; (c) unmatched boundary; (d) lost depth data.

These problems limit the use of TOF depth sensors in applications involving motion detection and motion tracking. The goal of this work is to provide a solution that improves depth quality in a TOF depth sensor system and to show that its application can be extended to the generation of a realistic and dynamic 3D scene. The contributions of this paper are a newly-designed joint bilateral filter for optical noise reduction, boundary refinement using a color segment set, and motion estimation-based temporal consistency.

2. Related Works

During the past years, a variety of solutions have been developed to enhance depth images captured by a TOF depth sensor. For optical noise minimization, a method used adaptive sampling, mesh triangulation, and Gaussian smoothing [4]. For depth data recovery of hair regions of human actors, a method employed face detection and quadratic Bézier curves [5]. Recently, a hybrid camera system that combines a stereoscopic camera and a TOF depth sensor has been introduced to provide high-quality depth images [6]. However, the previous works have concentrated on handling optical noise of the depth image in the spatial domain and have mainly focused on reconstructing a static 3D scene, not a dynamic 3D scene. In this paper, we introduce a spatial and temporal enhancement method of depth images to generate a dynamic 3D scene.

3. Depth Image Enhancement

3.1 Spatial Enhancement

In order to reduce optical noise in depth images, bilateral filtering has been widely used in previous works. A bilateral filter reduces the noise while preserving important sharp edges. We note that the regions of depth discontinuity in a depth image D usually correspond to edges in its corresponding color image C. In this paper, we design a new joint bilateral filter considering the color and depth information at the same time. Formally, for a pixel position p of depth image D, the filtered value J_p is represented by Eq. 1:

J_p = \frac{1}{k_p} \sum_{q \in \Omega} G_s(\|p - q\|) \, G_{r1}(\|C_p - C_q\|) \, G_{r2}(|D_p - D_q|) \, D_q    (1)

where G_s is the space weight, and G_{r1} and G_{r2} are the weights of the color and depth differences, respectively. These weights are calculated by Gaussian kernels. The term \Omega denotes the spatial support of the weight G_s, and k_p is a normalizing factor.
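As an illustration of Eq. 1, the following is a minimal, unoptimized sketch of the joint bilateral filter in Python/NumPy. The function name, the single-channel color input, and the default parameters are our own assumptions; the experiments in Section 4 use a 3×3 window with Gaussian standard deviations of 3, 0.1, and 0.1 for G_s, G_r1, and G_r2, which corresponds to radius=1 below.

import numpy as np

def joint_bilateral_filter(depth, color, radius=1, sigma_s=3.0, sigma_r1=0.1, sigma_r2=0.1):
    """Sketch of the joint bilateral filter of Eq. 1.
    depth, color: 2-D float arrays of the same size (color assumed single-channel here).
    radius: half window size; radius=1 gives the 3x3 support used in Section 4.
    """
    h, w = depth.shape
    out = np.zeros(depth.shape, dtype=np.float64)
    for y in range(h):
        for x in range(w):
            acc, k_p = 0.0, 0.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    qy, qx = y + dy, x + dx
                    if not (0 <= qy < h and 0 <= qx < w):
                        continue
                    # Space weight G_s, color weight G_r1, depth weight G_r2 (Gaussian kernels)
                    g_s = np.exp(-(dx * dx + dy * dy) / (2.0 * sigma_s ** 2))
                    g_r1 = np.exp(-((color[y, x] - color[qy, qx]) ** 2) / (2.0 * sigma_r1 ** 2))
                    g_r2 = np.exp(-((depth[y, x] - depth[qy, qx]) ** 2) / (2.0 * sigma_r2 ** 2))
                    w_q = g_s * g_r1 * g_r2
                    acc += w_q * depth[qy, qx]
                    k_p += w_q               # normalizing factor k_p
            out[y, x] = acc / k_p if k_p > 0 else depth[y, x]
    return out

With the settings of Section 4, this would correspond to a call such as joint_bilateral_filter(depth, color, radius=1, sigma_s=3.0, sigma_r1=0.1, sigma_r2=0.1), under the additional assumption that depth and color values are normalized to comparable ranges.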

In order to minimize the unmatched boundary problem between a depth image and its corresponding color image, we develop a framework for boundary refinement using a color segment set. After color-segmenting a color image [7], we extract the color segment set to detect object boundaries. The segment set is computed by Eq. 2:

R_S(x, y) = \begin{cases} S_i(x, y), & \text{if } n(D(S_i)) \ge T \, n(S_i) \\ 0, & \text{otherwise} \end{cases}    (2)

where R_S indicates an image represented by a color segment set and S_i is the i-th color segment of color image C. The term n(S_i) is the total count of pixels in S_i, and n(D(S_i)) is the total count of depth pixels on the region S_i in depth image D when we fold it onto color image C. We only count depth pixels whose intensity is greater than 0 in depth image D. T is a threshold. Figure 2 shows an example of color segmentation and its color segment set.

Figure 2. Object boundary detection: (a) color segmentation; (b) color segment set.
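To make the segment-set test concrete, here is a small sketch in Python/NumPy. It assumes the segmentation of [7] has already produced an integer label map, that missing depth is encoded as 0, and that the comparison in Eq. 2 reads "at least a fraction T of the segment is covered by valid depth", which is our reading of the reconstructed equation; T = 0.5 is the value used in Section 4, and color_segment_set is a hypothetical helper name.

import numpy as np

def color_segment_set(labels, depth, T=0.5):
    """Sketch of the segment-set test of Eq. 2.
    labels: 2-D integer array of color segment ids (e.g., from a graph-based segmentation).
    depth:  2-D array folded onto the color image; 0 marks pixels with no depth.
    Returns a boolean mask that is True on segments kept in the color segment set R_S.
    """
    segment_set = np.zeros(labels.shape, dtype=bool)
    for seg_id in np.unique(labels):
        region = labels == seg_id
        n_total = region.sum()                      # n(S_i)
        n_depth = np.count_nonzero(depth[region])   # n(D(S_i)): depth pixels with intensity > 0
        if n_depth >= T * n_total:                  # Eq. 2 membership test (reconstructed direction)
            segment_set |= region
    return segment_set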

After detecting the color segment set, we refine the boundary of the depth image from the color segment set by removing pixels outside the object boundary or extending pixels inside toward the boundary using linear interpolation. The linear interpolation is performed per color segment in the color segment set by applying Eq. 3:

D(x, y)_k = \frac{1}{n} \sum_{i=-W/2}^{W/2} \sum_{j=-W/2}^{W/2} D(x+i, y+j)_k    (3)

where D(x, y)_k is the interpolated depth pixel value at the hole position (x, y) of the k-th color segment, using the valid neighboring depth pixel values D(x+i, y+j)_k in the k-th color segment. The term n is the number of valid pixels within the W×W window.
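The per-segment interpolation of Eq. 3 can be sketched as follows. The window size and the convention that a zero depth value marks a hole are assumptions on our part, and fill_segment_holes is a hypothetical helper name.

import numpy as np

def fill_segment_holes(depth, segment_mask, window=5):
    """Sketch of the segment-wise linear interpolation of Eq. 3.
    depth:        2-D float array; 0 marks missing (hole) depth pixels.
    segment_mask: boolean array selecting the pixels of one color segment S_k.
    window:       side length W of the averaging window (assumed value).
    """
    half = window // 2
    h, w = depth.shape
    filled = depth.copy()
    hole_ys, hole_xs = np.where(segment_mask & (depth == 0))
    for y, x in zip(hole_ys, hole_xs):
        y0, y1 = max(0, y - half), min(h, y + half + 1)
        x0, x1 = max(0, x - half), min(w, x + half + 1)
        patch = depth[y0:y1, x0:x1]
        valid = segment_mask[y0:y1, x0:x1] & (patch > 0)   # valid depth pixels of the same segment
        n = valid.sum()
        if n > 0:
            filled[y, x] = patch[valid].sum() / n          # Eq. 3: average of the n valid neighbors
    return filled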

3.2 Temporal Enhancement

In order to minimize temporal depth flickering artifacts on stationary objects, we detect stationary regions by applying block matching to the previous and current frame color images. Block matching predicts the movement of objects by estimating the similarity between blocks in the temporal domain. As a similarity measurement, we use the mean absolute difference (MAD). For an M×N block at the position (x, y) in the t-th frame color image C_t, the MAD value is calculated between this block and another M×N block at the position (x+k, y+l) in the (t-1)-th frame color image C_{t-1}. Therefore, a motion vector v(x, y) of the block, which is the factor that determines motion existence, can be represented by Eq. 4:

v(x, y) = \arg\min_{(k, l)} \mathrm{MAD}_{(k, l)}(x, y)    (4)

In block matching, we use a full search method and assume that block regions having zero motion vectors in both the x- and y-directions are stationary. The motion image M_t is represented by Eq. 5:

M_t(x, y) = \begin{cases} 0, & \text{if } v_t(x, y)_x \ne 0 \text{ or } v_t(x, y)_y \ne 0 \\ 255, & \text{otherwise} \end{cases}    (5)

where v_t(x, y)_x and v_t(x, y)_y indicate the x- and y-direction motion vectors, respectively. Then, a stationary region image S_t is extracted from the t-th frame depth image D_t by Eq. 6:

S_t(x, y) = D_t(x, y) \,\&\, M_t(x, y)    (6)

where the operator "&" indicates the BIT-AND operation. Finally, the enhanced depth image D'_t considering temporal consistency is calculated by Eq. 7:

D'_t(x, y) = \begin{cases} S_{t-1}(x, y), & \text{if } S_{t-1}(x, y) \ne 0 \text{ and } S_t(x, y) \ne 0 \\ D_t(x, y), & \text{otherwise} \end{cases}    (7)
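A compact sketch of Eqs. 4-7 is given below, again in Python/NumPy. The block size, search range, and the reading of the BIT-AND in Eq. 6 as keeping depth values only inside stationary (M_t = 255) blocks are our assumptions, since the paper does not fix these details; temporal_consistency is a hypothetical helper name.

import numpy as np

def temporal_consistency(depth_t, stationary_prev, color_t, color_prev, block=16, search=7):
    """Sketch of the temporal enhancement of Eqs. 4-7.
    depth_t:          current (spatially enhanced) depth image D_t.
    stationary_prev:  stationary region image S_{t-1} from the previous frame.
    color_t, color_prev: current and previous grayscale color frames.
    Returns (D'_t, S_t).
    """
    h, w = color_t.shape
    motion = np.zeros((h, w), dtype=np.uint8)
    for by in range(0, h - block + 1, block):
        for bx in range(0, w - block + 1, block):
            cur = color_t[by:by + block, bx:bx + block].astype(np.float64)
            best, best_kl = np.inf, (0, 0)
            for k in range(-search, search + 1):              # full search over offsets (k, l)
                for l in range(-search, search + 1):
                    y0, x0 = by + l, bx + k
                    if y0 < 0 or x0 < 0 or y0 + block > h or x0 + block > w:
                        continue
                    ref = color_prev[y0:y0 + block, x0:x0 + block].astype(np.float64)
                    mad = np.mean(np.abs(cur - ref))          # MAD similarity (Eq. 4)
                    if mad < best:
                        best, best_kl = mad, (k, l)
            if best_kl == (0, 0):                             # zero motion vector -> stationary block
                motion[by:by + block, bx:bx + block] = 255    # Eq. 5
    stationary_t = np.where(motion == 255, depth_t, 0)        # Eq. 6: keep depth on stationary blocks
    keep_prev = (stationary_prev != 0) & (stationary_t != 0)  # Eq. 7 condition
    enhanced = np.where(keep_prev, stationary_prev, depth_t)  # Eq. 7
    return enhanced, stationary_t

In a sequence, the S_t returned for frame t would be passed in as stationary_prev for frame t+1, so that depth values on regions that remain stationary are carried forward and flickering is suppressed.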

Figure 3(a) and Figure 3(b) show the motion image M_t and the stationary region image S_t, respectively.

Figure 3. Motion and stationary region images: (a) motion image M_t; (b) stationary region image S_t.

4. Experiment Results

We have tested the performance of our method using two test image sequences, ACTOR1 and ACTOR2, obtained from a TOF depth sensor [8]. The ACTOR1 and ACTOR2 sequences were composed of 200 and 100 frames, respectively. Figure 4 shows some example frames of the ACTOR1 and ACTOR2 sequences. The image resolution of the test image sequences was 720×480.

Figure 4. Test image sequences: (a) ACTOR1 sequence; (b) ACTOR2 sequence.

Figure 5 shows the results of depth images after noise reduction in the spatial domain for the 1st frame of the ACTOR1 sequence. In the experiment, we used a 3×3 joint bilateral filter and set the standard deviations of the Gaussian kernels for the weights G_s, G_r1, and G_r2 in Eq. 1 to 3, 0.1, and 0.1, respectively. In Fig. 5, the rectangular regions in the first row are shown magnified in the second row. We notice that the proposed joint bilateral filter reduced the optical noise of the depth image efficiently while preserving important sharp features without any unintended visual artifacts.

Figure 5. Results of noise reduction: (a) original; (b) bilateral filtering; (c) our method.

Figure 6 shows the results of depth images after boundary refinement for the 1st frame of the ACTOR2 sequence. In the experiment, we set the threshold T to 0.5 for color segment set generation and employed the previous work [5] to recover the lost hair region of the woman. The rectangular regions are shown enlarged after we folded them onto the color image. As shown in Fig. 6(b), the boundary mismatch problem could be minimized since we compensated the unmatched region with neighboring depth data in a color segment set.

Figure 6. Results of boundary matching: (a) original; (b) boundary matching.

Figure 7 shows the results of temporal consistency for the 1st, 10th, 20th, and 30th frames of the ACTOR1 sequence. The left person's knee and a table in the scene are stationary regions. As shown in Fig. 7(a), the optical noise and temporal inconsistency in the original depth images caused serious distortions when we generated a 3D scene from them. In the experiment, some flickering artifacts on the regions of the table and the knee still appeared after spatial depth enhancement, as shown in Fig. 7(b). Our temporal consistency method using motion estimation reduced the distortions significantly, as shown in Fig. 7(c).

Figure 7. Results of temporal enhancement: (a) original; (b) spatial enhancement only; (c) spatial-temporal enhancement.

Figure 8 shows the results of 3D scenes generated from the two test image sequences. Since our method minimized the spatial and temporal inherent problems of depth images, we could successfully generate dynamic and realistic 3D scenes.

Figure 8. Dynamic 3D scene generation: (a) 3D scene from the ACTOR1 sequence; (b) 3D scene from the ACTOR2 sequence.

However, as shown in Fig. 9, a general algorithm still needs to be developed to recover lost depth information on shiny and dark surfaces, such as black hair and black pattern regions. In addition, we need a measurement for an objective evaluation of 3D scene reconstruction.

Figure 9. Lost depth information on a black color region.

Table 1 shows the average computational time for depth image enhancement using a PC with a 2.4 GHz CPU and 1.5 GB of RAM. On average, 3.12 s/frame and 2.51 s/frame were needed for depth image enhancement in the spatial and temporal domain for the ACTOR1 and ACTOR2 sequences, respectively.

Table 1. Computational time for depth enhancement.

Processing             ACTOR1     ACTOR2
Noise reduction        0.41 s/f   0.18 s/f
Region recovery        0.55 s/f   0.53 s/f
Boundary matching      1.45 s/f   1.13 s/f
Temporal consistency   0.71 s/f   0.67 s/f
Total                  3.12 s/f   2.51 s/f

5. Conclusions

In this paper, we have proposed a new method to enhance depth images captured by a TOF depth sensor spatially and temporally. As shown in the experimental results, the proposed method successfully minimizes the inherent problems of depth images. In addition, we showed that TOF depth sensors can be used to generate dynamic and realistic 3D scenes.

Acknowledgements

This work was supported by DOE-URPR (Grant DOE-DEFG02-86NE37968) in the USA and in part by the National Research Foundation of Korea Grant funded by the Korean Government (NRF-2009-352-D00277).

References

[1] C. Fehn, R. Barré, and S. Pastoor. Interactive 3-DTV: Concepts and Key Technologies. Proceedings of the IEEE, 94(3): 524-538, 2006.
[2] G.J. Iddan and G. Yahav. 3D Imaging in the Studio and Elsewhere. Proc. of Videometrics and Optical Methods for 3D Shape Measurements, pp. 48-55, 2001.
[3] D. Scharstein and R. Szeliski. A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence Algorithms. International Jour. of Computer Vision, 47(1-3): 7-42, 2002.
[4] S.M. Kim, J. Cha, J. Ryu, and K.H. Lee. Depth Video Enhancement of Haptic Interaction Using a Smooth Surface Reconstruction. IEICE Trans. on Information and Systems, E89-D: 37-44, 2006.
[5] J.H. Cho, I.Y. Chang, S.M. Kim, and K.H. Lee. Depth Image Processing Technique for Representing Human Actors in 3DTV Using Single Depth Camera. Proc. of IEEE 3DTV Conference, paper no. 15, 2007.
[6] J. Zhu, L. Wang, R. Yang, and J. Davis. Fusion of Time-of-flight Depth and Stereo for High Accuracy Depth Maps. Proc. of IEEE Computer Vision and Pattern Recognition, pp. 231-236, 2008.
[7] P.F. Felzenszwalb and D.P. Huttenlocher. Efficient Belief Propagation for Early Vision. International Jour. of Computer Vision, 70(1): 41-54, 2006.
[8] Realistic Broadcasting Research Center, Gwangju Institute of Science and Technology, http://rbrc.gist.ac.kr.
