An Efficient Algorithm for Depth Image Rendering

Cesar Palomo (e-mail: [email protected])
Computer Science Department, PUC-Rio

Marcelo Gattass (e-mail: [email protected])
Computer Science Department, PUC-Rio

Figure 1: A pair of reference images, (a) left view and (b) right view, is warped to a new viewpoint and composited in real time into (c) the composite view.
Abstract

As depth sensing devices become popular and affordable, end users can be given the freedom to choose a different point of view in a multi-camera transmission. In this paper, we propose an image-based rendering (IBR) algorithm to generate perceptually accurate virtual views of a scene in real time. The proposed algorithm was implemented and tested with publicly available datasets, and the results allow its efficiency and quality to be evaluated.

CR Categories: I.3.3 [Computer Graphics]: Picture/Image Generation—Viewing algorithms; I.3.7 [Computer Graphics]: Three-Dimensional Graphics and Realism—Virtual reality; I.4.9 [Image Processing and Computer Vision]: Applications

Keywords: Image-based rendering, GPU programming, depth images

1 Introduction

The generation of novel views from acquired imagery is motivated by several applications in computer games, sports broadcasting, TV advertising, cinema and the entertainment industry. In the case of an ambiguous situation in a soccer game, for instance, many input views may be used to synthesize a new view at a different angle to help referees inspect events such as fouls or offsides.

With the recent development and promising popularity of depth sensing devices, such as depth range sensors and time-of-flight cameras, practical techniques for segmentation, compression, filtering and rendering of depth images arise as interesting and useful research topics. This work proposes an algorithm for rendering synthetic novel views of a scene captured by real cameras, capable of generating visually accurate images at interactive rendering rates using a relatively small number of input views. Calibration data and depth images, i.e., color images along with their dense depth maps, are the sole input of our algorithm.

We aim to provide an IBR algorithm that runs entirely on the GPU, not only to guarantee good performance but also to leave the CPU free to perform other tasks, e.g., user interface tasks. An important property of the proposed algorithm is that it allows for automatic processing of the imagery, with no need for pre- or post-processing stages.

The rest of this document is organized as follows. Section 2 presents a review of related research in IBR. Section 3 describes the depth image representation and how images can be composited for virtual view synthesis. Section 4 details all the steps involved in the proposed method and their implementation on the GPU. Test results are shown in Section 5, and concluding remarks and future work directions are presented in Section 6.
2 Related Work
After the pioneering work on IBR in the mid-90s [Levoy and Hanrahan 1996][McMillan and Bishop 1995][Gortler et al. 1996], several systems have been proposed that extract models from images and use them as geometric proxies during rendering. This made it possible to considerably reduce the density of input images necessary for high-quality rendering. A good review of dense two-frame stereo, a commonly used technique to obtain models from input images, can be found in [Scharstein et al. 2002].
[Pulli et al. 1997] represent the scene as a collection of textured depth maps and blend the three closest depth maps to generate an image from a novel viewpoint. They also introduce a soft z-buffer to deal with occlusions and blend the reference cameras' pixels based on their proximity to the new viewpoint. [Debevec et al. 1998] propose an efficient method for view-dependent texture mapping that allows for real-time implementations using built-in graphics hardware functionality. [Buehler et al. 2001] present a principled study of the compositing stage.
High-quality rendering results have been reported by [Zitnick et al. 2004] using a modest number of input reference cameras. Inspired by Layered Depth Images [Shade et al. 1998], they augment the depth maps with a second layer that contains information only at locations near depth discontinuities. This layer holds matting information used to improve the rendering quality at object borders. The identification of depth discontinuities and the matting calculation are performed in an offline pre-processing stage. They also build a separate mesh on the CPU to deal with discontinuities in depth.

This paper proposes an algorithm based on the ideas presented in [Zitnick et al. 2004]. The devised method does not require pre- or post-processing stages and can be fully implemented on the graphics hardware, avoiding costly CPU-GPU transfers.

3 Depth Image Rendering

To establish notation, this section reviews basic concepts of depth image rendering.

3.1 Pinhole Camera Model

The image acquisition process can be represented by a pinhole camera model [Hartley and Zisserman 2000], in which imaging is a sequence of transforms between different coordinate systems. Each input camera has associated calibration data: a view matrix V (4×4) that determines the camera's position and pose, and a calibration matrix K (3×4) with the intrinsic properties used for perspective projection: field of view, aspect ratio, skew factor and image optical center [Hartley and Zisserman 2000].

Using the pinhole camera model, a point p(x, y, z) originally written in a global coordinate system can be converted to the camera's coordinate system and then to the image coordinate system p_i(u, v) through the sequence of transforms defined in Equation 1 (using homogeneous coordinates):

p_i(uw, vw, w)^T = K V p(x, y, z, 1)^T.    (1)
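To make the notation concrete, here is a minimal NumPy sketch of the projection in Equation 1. The intrinsic values in the example matrix K are hypothetical placeholders, not calibration data from any dataset used in this paper.

```python
import numpy as np

def project(p_world, K, V):
    """Project a 3D point in global coordinates to image coordinates
    using Equation 1: p_i = K V p (homogeneous coordinates)."""
    p_h = np.append(p_world, 1.0)          # (x, y, z, 1)
    uvw = K @ V @ p_h                      # (u*w, v*w, w)
    return uvw[:2] / uvw[2]                # perspective division -> (u, v)

# Example with a hypothetical calibration: identity pose, simple intrinsics.
V = np.eye(4)
K = np.array([[500.0,   0.0, 512.0, 0.0],
              [  0.0, 500.0, 384.0, 0.0],
              [  0.0,   0.0,   1.0, 0.0]])
print(project(np.array([0.1, -0.2, 2.0]), K, V))
```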
3.2 Depth Map Representation

The representation composed of a color image and a dense depth map, with one depth value associated with each pixel of the image, is referred to throughout this paper as a depth image. An example is shown in Figure 2.

Figure 2: Example of a depth image: color image + dense depth map. Darker pixels in the depth map indicate greater depth. Courtesy of Zitnick et al.

The depth stored in a depth map can be relative to any selected coordinate system. Let us assume that the depth in a collection of depth maps is written in a common global coordinate system. This means that each pixel p_i(u, v) in a depth map stores an associated z value, and Equation 1 can be used to retrieve the pixel's global coordinate p(x, y, z).

One detail to notice is that depth maps are usually gray-level images, commonly in 8-bit format. For that reason, a linearization of the actual depth z into a depth level d is usually applied before storage. Equation 2 shows a possible linearization of d into the range [0, 255]:

d = 255 · (1/z − 1/z_max) / (1/z_min − 1/z_max).    (2)

Here z_min and z_max represent, respectively, the minimum and the maximum value of z in each depth map. The inverse of Equation 2 can be used to derive z from a pixel p_i(u, v) fetched from the depth map.
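The following sketch illustrates one possible encoding and decoding of depth according to Equation 2 and its inverse; the near/far values z_min and z_max used in the example are arbitrary.

```python
import numpy as np

def depth_to_level(z, z_min, z_max):
    """Linearize depth z into an 8-bit level d (Equation 2)."""
    d = 255.0 * (1.0 / z - 1.0 / z_max) / (1.0 / z_min - 1.0 / z_max)
    return np.round(d).astype(np.uint8)

def level_to_depth(d, z_min, z_max):
    """Invert Equation 2 to recover z from the stored level d."""
    inv_z = (d / 255.0) * (1.0 / z_min - 1.0 / z_max) + 1.0 / z_max
    return 1.0 / inv_z

# Round-trip check with hypothetical near/far planes.
z = np.array([1.5, 3.0, 8.0])
d = depth_to_level(z, z_min=1.0, z_max=10.0)
print(d, level_to_depth(d, z_min=1.0, z_max=10.0))
```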
3.3 Forward Mapping

View-dependent texture mapping [Debevec 1996][Debevec et al. 1998] has been proposed to render a virtual image by compositing multiple views of a given scene. It is a forward mapping method: a textured 3D mesh is unprojected from a reference camera's coordinate system to a global coordinate system, and finally warped to the virtual viewpoint. Essentially, Equation 1 can be used to unproject a pixel p_i(u, v) together with its depth z, and the virtual camera's projection matrix is then applied to warp the point to the novel view.

3.3.1 Occlusions

Although straightforward to implement in graphics hardware, the 3D warping technique fails to produce good results in occluded regions, which are not sampled in a given reference viewpoint but may be revealed as the virtual viewpoint moves. Although additional reference cameras can be used to fill in the missing information, occlusions still need to be identified and handled to avoid the rubber-sheet problem [Mark et al. 1997].

3.3.2 Compositing

A soft z-buffer and weighting are generally used to compose multiple views [Buehler et al. 2001]. The soft z-buffer uses a tolerance for the z-test among warped pixels: pixels with similar z values are composited through weighting based on proximity to the virtual viewpoint; otherwise, only the closest pixel contributes to the final color of the virtual image.

4 GPU Algorithm

The steps of the proposed method are depicted in Figure 3. The following subsections present each step and give details on how it can be implemented in graphics hardware.

Figure 3: After an initial 3D mesh creation on the CPU, the 3D warping, occlusion identification, soft z-buffering and blending stages run entirely on the graphics hardware. The other steps represent inexpensive render calls by the CPU.

4.0.3 Mesh Creation

The first step of the method is to create a 3D mesh for warping. A set of W × H vertices, corresponding to the resolution of the input images, is stored in a vertex buffer. This data structure is created once and reused at every render cycle, optimizing for speed. If all input images from the different reference cameras have the same resolution, only a single buffer needs to be created.

The x and y components of the vertex buffer are the actual pixel locations in the depth map, i.e., x ∈ [0, W − 1], y ∈ [0, H − 1]. This defines a regular grid whose z coordinates will be calculated on the GPU before warping.
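A CPU-side sketch of how such a vertex grid could be assembled before being uploaded to a vertex buffer is shown below; the helper name and the use of NumPy are illustrative only.

```python
import numpy as np

def make_grid_vertices(width, height):
    """Build the W x H grid of (x, y) vertex positions that indexes the
    depth map; z is left to be fetched on the GPU before warping."""
    xs, ys = np.meshgrid(np.arange(width, dtype=np.float32),
                         np.arange(height, dtype=np.float32))
    return np.stack([xs.ravel(), ys.ravel()], axis=1)   # shape (W*H, 2)

vertices = make_grid_vertices(1024, 768)
print(vertices.shape)   # (786432, 2), created once and reused every frame
```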
4.0.4 Render Reference Cameras

Similarly to [Zitnick et al. 2004], the pair of reference cameras closest to the position of the new viewpoint is chosen and used for rendering. The use of only two reference cameras may cause some visible artifacts in regions not sampled by either camera, but this problem should not occur when the baseline between the input cameras is not too wide.

The pair of reference cameras is rendered one at a time: the input camera's calibration and viewing matrices are pre-multiplied and sent to the GPU as a single matrix, and the vertex buffer created during setup is rendered. Both the color image and the depth map are sent to the GPU as textures for use during warping.
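As an illustrative sketch, the selection of the reference pair could be implemented as a nearest-neighbor query on the camera centers, as below. Using the Euclidean distance between camera centers is an assumption, since the text only states that the closest pair is chosen; the example rig is hypothetical.

```python
import numpy as np

def closest_camera_pair(virtual_center, camera_centers):
    """Pick the indices of the two reference cameras whose centers are
    nearest to the virtual viewpoint (one possible selection criterion)."""
    dists = np.linalg.norm(camera_centers - virtual_center, axis=1)
    return np.argsort(dists)[:2]

# Hypothetical rig: 8 camera centers roughly along a line.
centers = np.array([[i, 0.0, 0.0] for i in range(8)], dtype=np.float32)
print(closest_camera_pair(np.array([3.4, 0.0, 0.5]), centers))  # e.g. [3 4]
```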
4.0.5 3D Warping and Occlusion Identification

These steps are implemented in a programmable vertex shader and are performed once for each reference camera. For a vertex (u, v), the depth level d is fetched at the corresponding position in the depth map and converted to the actual z value. The world coordinates p(x, y, z) of that vertex are then calculated, and the point is finally warped to the new viewpoint using the current view and projection matrices of the new vantage point.
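The per-vertex warp can be sketched on the CPU with NumPy as follows. The sketch assumes that the stored depth z plays the role of the homogeneous scale w of Equation 1 for the reference camera; if depth is stored under a different convention, the unprojection step must be adapted. The matrices in the usage example are placeholders.

```python
import numpy as np

def warp_pixel(u, v, z, M_ref, M_virt):
    """Warp a reference-image pixel (u, v) with depth z to the virtual view.
    M_ref and M_virt are the combined K*V matrices (3x4) of the reference
    and virtual cameras. Assumes z equals the homogeneous scale w of
    Equation 1 for the reference camera (an assumed convention)."""
    # Unproject: solve M_ref[:, :3] @ p_xyz = z * (u, v, 1) - M_ref[:, 3]
    rhs = z * np.array([u, v, 1.0]) - M_ref[:, 3]
    p_xyz = np.linalg.solve(M_ref[:, :3], rhs)
    # Reproject into the virtual camera (Equation 1) and divide by w.
    uvw = M_virt @ np.append(p_xyz, 1.0)
    return uvw[:2] / uvw[2]

# Hypothetical intrinsics and poses: virtual camera shifted along x.
K = np.array([[500.0, 0.0, 512.0, 0.0],
              [0.0, 500.0, 384.0, 0.0],
              [0.0, 0.0, 1.0, 0.0]])
V_ref, V_virt = np.eye(4), np.eye(4)
V_virt[0, 3] = -0.2
print(warp_pixel(512.0, 384.0, 2.0, K @ V_ref, K @ V_virt))
```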
In the algorithm presented here, occlusions are identified using the gradient of the depth z(u, v) in the depth map, defined in Equation 3. The vertex shader makes three texture accesses to retrieve the depth levels and then converts them to actual depth values. We define a threshold τ_G such that, when ||∇z(u, v)|| > τ_G, the vertex is considered occluded and its alpha channel is set to 0; otherwise, its alpha is set to 1.

∇z(u, v) = (z(u + 1, v) − z(u, v), z(u, v + 1) − z(u, v)).    (3)
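A NumPy sketch of this occlusion test over a whole depth map is given below; it mirrors the forward differences of Equation 3, and the threshold value in the example is arbitrary.

```python
import numpy as np

def occlusion_alpha(z_map, tau_g):
    """Flag vertices near depth discontinuities (Equation 3): alpha = 0 where
    the magnitude of the forward-difference depth gradient exceeds tau_g."""
    gx = np.diff(z_map, axis=1, append=z_map[:, -1:])   # z(u+1, v) - z(u, v)
    gy = np.diff(z_map, axis=0, append=z_map[-1:, :])   # z(u, v+1) - z(u, v)
    grad_mag = np.hypot(gx, gy)
    return np.where(grad_mag > tau_g, 0.0, 1.0)

# Toy depth map with a sharp discontinuity between two planes.
z = np.concatenate([np.full((4, 4), 2.0), np.full((4, 4), 6.0)], axis=1)
print(occlusion_alpha(z, tau_g=1.0))
```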
The output of the vertex shader consists of the occlusion label stored in the alpha channel, the vertex position and the texture coordinates for the color texture, which are interpolated by the hardware rasterizer. A fragment shader then performs the perspective division and calculates the pixel's depth and color (along with the alpha channel for occlusion handling). The results of this step are stored in a frame buffer object for further compositing.

4.0.6 Compositing

After the results of both reference cameras have been rendered to textures with color and depth information, a quad is drawn to the screen to activate the compositing step, which is done entirely by a programmable fragment shader.

4.0.7 Soft Z-Buffering and Blending

The basic functioning of the soft z-buffer is as follows. The depths from both reference cameras are fetched from the corresponding depth textures. When they differ by more than a threshold τ_z, only the closest pixel contributes to the final color of the rendered image, at full opacity. Otherwise, blending is performed.

The blending contribution weights are based on angular distances [Porquet et al. 2005], as shown in Figure 4. We unproject the pixel to the world coordinate system and define line segments linking that point to the reference cameras' centers. The angular distances θ_i and θ_{i+1} are used to measure the influence of views i and i + 1. For smooth color interpolation, [Buehler et al. 2001] suggest a cosine-based weight. To avoid two cosine calculations, we compute only the weight w_i for view i, with w_i ∈ [0, 1], using Equation 4, and set the other reference camera's weight to w_{i+1} = 1 − w_i.

Figure 4: Visibility, angular distances and occlusions are used to determine the final color.

w_i = 0.5 · (1 + cos(π θ_i / (θ_i + θ_{i+1}))).    (4)

We incorporate into the soft z-buffer algorithm the occlusion information α stored in the alpha channel of the color image's pixels. The final pixel color can be calculated with:

color = (α_i w_i color_i + α_{i+1} w_{i+1} color_{i+1}) / (α_i w_i + α_{i+1} w_{i+1}).    (5)

When the denominator of Equation 5 is zero (i.e., when the pixels from both cameras are marked as occluded), we arbitrarily use one of the cameras' pixel colors to fill the occlusion. Figure 5 illustrates some steps of the proposed framework.

Figure 5: Results for a reference camera and the final result. The first image shows rubber sheets at object borders, which are identified using the magnitude of the depth gradient and encoded in the alpha channel, shown in the second image. Those regions appear erased in the third image. The last image shows the final result after compositing the pair of reference images.
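For clarity, the soft z-buffer and blending logic of Equations 4 and 5 can be summarized for a single pixel as in the sketch below; the function signature and the example values are illustrative, not part of the actual shader implementation.

```python
import numpy as np

def composite_pixel(z, color, alpha, theta, tau_z):
    """Soft z-buffer plus blending for one pixel (Equations 4 and 5).
    z, alpha, theta are length-2 arrays for views i and i+1; color is 2x3."""
    z, color, alpha, theta = map(np.asarray, (z, color, alpha, theta))
    if abs(z[0] - z[1]) > tau_z:                 # depths disagree: keep the
        return color[int(np.argmin(z))]          # closest pixel at full opacity
    w_i = 0.5 * (1.0 + np.cos(np.pi * theta[0] / (theta[0] + theta[1])))
    w = np.array([w_i, 1.0 - w_i]) * alpha       # Equation 4 + occlusion alpha
    if w.sum() == 0.0:                           # both views occluded: fall back
        return color[0]                          # arbitrarily to the first camera
    return (w[:, None] * color).sum(axis=0) / w.sum()   # Equation 5

print(composite_pixel(z=[2.0, 2.02], color=[[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],
                      alpha=[1.0, 1.0], theta=[0.3, 0.6], tau_z=0.1))
```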
5 Experimental Results

To verify the effectiveness of the proposed method, we used the Ballet and Breakdancers sequences generated and provided by the Interactive Visual Media Group at Microsoft Research. Each sequence consists of 100 color images of a real dynamic scene, taken from 8 different stationary vantage points, along with high-quality depth maps [Zitnick et al. 2004]. All depth images have a resolution of 1024 × 768 pixels, captured at 15 FPS.

The visual accuracy metric used in the experiments was the peak signal-to-noise ratio (PSNR). PSNR is commonly used as a proxy for the human perception of reconstruction quality, working as an evaluation function based on the squared error between two images, calculated with Equation 6.

PSNR = 10 · log10(255² / MSE).    (6)

The test methodology consisted in using a reference camera's color image I as the ground truth and recreating the color image I_v from that vantage point. I_v, with resolution W × H, could then be compared against I using the PSNR with the pixelwise mean squared error (luminance values from 0 to 255 were used), as shown in Equation 7. A 10-pixel-wide border of the resulting image was cropped prior to the PSNR calculation, since most algorithms do not deal well with image borders.

MSE = (1 / (W H)) · Σ_{i=0}^{W−1} Σ_{j=0}^{H−1} [I(i, j) − I_v(i, j)]².    (7)
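A sketch of the PSNR computation of Equations 6 and 7, including the 10-pixel border crop, is shown below; the synthetic images in the example merely stand in for dataset frames.

```python
import numpy as np

def psnr(reference, rendered, crop=10):
    """PSNR between two 8-bit luminance images (Equations 6 and 7),
    cropping a border as described above."""
    a = reference[crop:-crop, crop:-crop].astype(np.float64)
    b = rendered[crop:-crop, crop:-crop].astype(np.float64)
    mse = np.mean((a - b) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)

# Toy example with synthetic images; real tests would load the dataset frames.
rng = np.random.default_rng(0)
ref = rng.integers(0, 256, size=(768, 1024), dtype=np.uint8)
noisy = np.clip(ref.astype(np.int16) + rng.integers(-5, 6, ref.shape), 0, 255)
print(round(psnr(ref, noisy.astype(np.uint8)), 2))
```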
We ran tests on each of the 100 frames of both depth image sequences, setting camera 4 as the virtual viewpoint and using reference cameras 3 and 5 during rendering (refer to [Zitnick et al. 2004] for more information on their camera setup). Figure 6 depicts the PSNR results for the color images of the Ballet sequence, while Figure 7 shows the results for the Breakdancers sequence. In both cases, compared to an implementation using 3D warping followed by blending based on angular distances, the proposed algorithm considerably increases the PSNR quality while maintaining highly interactive rendering rates.

The proposed algorithm was implemented and tested on a workstation equipped with an Intel Core 2 Quad 2.4 GHz, 2 GB of RAM and an NVidia GeForce 9800 GX2 GPU with 512 MB of memory. Although a more systematic test should be performed to attest to the interactive performance, the observed rendering rates (above 100 FPS for the tested images) suggest that the proposed system could be applied to higher-resolution images while keeping real-time performance.

Figure 6: Results for tests with all 100 frames of Breakdancers sequence.

Figure 7: Results for tests with all 100 frames of Ballet sequence.

6 Conclusion and Future Work

In this paper, we proposed an integrated algorithm to render virtual viewpoints using depth images and calibration data, which makes massive use of graphics hardware to maintain highly interactive rates. The perceptual visual quality of images rendered using the proposed method is comparable to that of state-of-the-art work [Zitnick et al. 2004]. In our evaluation, possible visible artifacts could be significantly diminished through color balancing of the input images. The algorithm could also benefit from matting information, but since current matting methods run offline, they were not added to our framework.

The presented algorithm can be seen as a good tradeoff between visual quality and speed, and can be readily deployed to give the user free virtual-viewpoint control in live transmissions with fairly good results, enhancing the viewer experience. The proposed pipeline also allows for easy extension and improvement.
References

Buehler, C., Bosse, M., McMillan, L., Gortler, S., and Cohen, M. 2001. Unstructured lumigraph rendering. In Computer Graphics, SIGGRAPH 2001 Proceedings, 425-432.

Debevec, P., Yu, Y., and Boshokov, G. 1998. Efficient view-dependent image-based rendering with projective texture-mapping. Tech. Rep. UCB/CSD-98-1003, EECS Department, University of California, Berkeley.

Debevec, P. E. 1996. Modeling and Rendering Architecture from Photographs. PhD thesis, University of California at Berkeley, Computer Science Division, Berkeley, CA.

Gortler, S. J., Grzeszczuk, R., Szeliski, R., and Cohen, M. F. 1996. The lumigraph. In SIGGRAPH '96: Proceedings of the 23rd annual conference on Computer graphics and interactive techniques, ACM, New York, NY, USA, 43-54.

Hartley, R. I., and Zisserman, A. 2000. Multiple View Geometry in Computer Vision. Cambridge University Press, ISBN: 0521623049.

Levoy, M., and Hanrahan, P. 1996. Light field rendering. In SIGGRAPH '96: Proceedings of the 23rd annual conference on Computer graphics and interactive techniques, ACM, New York, NY, USA, 31-42.

Mark, W. R., McMillan, L., and Bishop, G. 1997. Post-rendering 3D warping. In 1997 Symposium on Interactive 3D Graphics, 7-16.

McMillan, L., and Bishop, G. 1995. Plenoptic modeling: an image-based rendering system. In SIGGRAPH '95: Proceedings of the 22nd annual conference on Computer graphics and interactive techniques, ACM, New York, NY, USA, 39-46.

Porquet, D., Dischler, J.-M., and Ghazanfarpour, D. 2005. Real-time high-quality view-dependent texture mapping using per-pixel visibility. In GRAPHITE '05: Proceedings of the 3rd international conference on Computer graphics and interactive techniques in Australasia and South East Asia, ACM, New York, NY, USA, 213-220.

Pulli, K., Cohen, M., Duchamp, T., Hoppe, H., Shapiro, L., and Stuetzle, W. 1997. View-based rendering: Visualizing real objects from scanned range and color data. In Eurographics Rendering Workshop, 23-34.

Scharstein, D., Szeliski, R., and Zabih, R. 2002. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision 47, 7-42.

Shade, J., Gortler, S., He, L.-W., and Szeliski, R. 1998. Layered depth images. In SIGGRAPH '98: Proceedings of the 25th annual conference on Computer graphics and interactive techniques, ACM, New York, NY, USA, 231-242.

Zitnick, L. C., Kang, S. B., Uyttendaele, M., Winder, S., and Szeliski, R. 2004. High-quality video view interpolation using a layered representation. In SIGGRAPH '04: ACM SIGGRAPH 2004 Papers, ACM, New York, NY, USA, 600-608.