Depth map generation for the automatic 2D–3D conversion

Miroslav Galabov
Computer Systems and Technologies Department
St. Cyril and St. Methodius University
Veliko Turnovo, Bulgaria
[email protected]
Abstract— This article describes methods developed for 2D-3D conversion of images based on depth map generation for multiview autostereoscopic displays. We use the Mean Squared Error (MSE) metric to evaluate depth-based 2D-to-3D video-conversion algorithms. In the objective evaluation, we measure the differences between the propagated depth maps and the ground truth.
Keywords - 2D-to-3D image conversion, depth maps, stereoscopic image.
I. INTRODUCTION
3D can be regarded as the next revolution for many applications such as television, movies, and electronic games. However, the lack of 3D content has become a severe bottleneck for the entire 3D system chain, which covers the generation of 3D content and its representation, coding, transmission, and visualization. Various schemes have been proposed to reproduce 3D content through 3D displays, such as holographic techniques, volumetric 3D displays, and multiview and binocular autostereoscopic displays. Compared with the prosperous development of 3D displays, unfortunately there is little 3D content to show on them. The production cost of 3D content is still very high, and the demand for efficient 2D-to-3D conversion techniques is urgent. The task of 2D-to-3D conversion is to infer 3D information from monocular 2D images. Most 2D-to-3D conversion algorithms for generating stereoscopic videos, and the associated ad-hoc standards, are based on the generation of a depth map. 2D–3D conversion involves creating the second image of a stereoscopic pair from the first image. The process may be divided into four stages [1]:
Depth map generation: the depth map is a bitmap image of the same size as, and in exact correspondence with, the starting image. Each pixel encodes the distance between the object visible at that point of the scene and the camera, rather than a color. This map may easily be shown in gray scale and manipulated visually in the same way as any monochrome image;
Segmentation: in this crucial stage, we define the contours of the different objects that make up a scene and must appear in different depth planes in the image. Thus, we separate the main objects or characters, often situated in the medium plane, from the background. While this operation is not particularly difficult for rigid objects [2], it is more complex for blurred or semi-transparent images, such as smoke clouds, and for all objects which do not cover the whole of the pixels in which they appear, such as hair, fur and netting. The segmentation operation must precisely define discontinuity lines in the depth map;

Missing image generation: a new image is generated by lateral shifting of the pixels in the original image, over a distance defined by the value of the depth map at each specific point (see the sketch after this list). The depth level corresponding to a null parallax, which places the pixel in the screen plane, is defined in the depth script; this decision is essentially artistic in nature, while the image generation stage itself is automatic. The values of the maximal positive parallax respect the limit imposed for the background in order to avoid divergence of the spectator's eyes. The image generation procedure using a depth map is often known as depth image based rendering (DIBR). Various open-source implementations are available, notably for MATLAB [3]. As the horizontal shifts used to generate new pixels are generally fractional, the source and destination images are usually oversampled before DIBR. Upsampling by a factor of 5 is often used, as the human visual system is able to detect shifts of the order of one-fourth of a pixel (for high-definition images with a width of 2K pixels) [4]. Note that it is also possible to generate two symmetrical left and right images from a "central" view, creating lateral views by shifts with opposite signs and a distance reduced by half. This solution offers advantages in that artifacts are distributed equally across the two views, and their size is reduced;

Artifact removal: the pixel shifts involved in the previous stage remove certain pixels, generating holes in the image. The missing pixels must be recreated by a disocclusion-filling procedure using a variety of possible sources, including the previous or following images in the temporal sequence, or adjacent pixels in the current image. Pixels are recreated by duplication, interpolation or extrapolation using these sources.
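The DIBR shifting and the naive disocclusion filling described in the last two stages can be illustrated with a short sketch. This is not the paper's implementation: the disparity scaling, the function name and the left-neighbor hole fill are illustrative assumptions, and the 5x upsampling for sub-pixel shifts is omitted for brevity.

```python
import numpy as np

def dibr_render(image, depth, max_disparity=16):
    """Render a second view by shifting pixels horizontally (DIBR sketch).

    image: (H, W, 3) uint8 source view.
    depth: (H, W) uint8 depth map (255 = near, 0 = far).
    max_disparity: largest horizontal shift in pixels (assumed scaling).
    """
    h, w = depth.shape
    out = np.zeros_like(image)
    filled = np.zeros((h, w), dtype=bool)
    # Nearer pixels (brighter in the depth map) receive larger shifts.
    shift = (depth.astype(np.float32) / 255.0 * max_disparity).astype(int)
    for y in range(h):
        for x in range(w):
            nx = x + shift[y, x]
            if 0 <= nx < w:
                # Overlaps resolved naively (last write wins); a real
                # renderer composites in depth order and upsamples by 5.
                out[y, nx] = image[y, x]
                filled[y, nx] = True
    # Artifact removal, crudest form: fill holes from the left neighbor.
    for y in range(h):
        for x in range(1, w):
            if not filled[y, x] and filled[y, x - 1]:
                out[y, x] = out[y, x - 1]
    return out
```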
II. THE AUTOMATIC 2D-3D CONVERSION
2D–3D conversion combines complex algorithms with the knowledge and abilities of experienced human operators.
To date, no fully automatic procedure has been able to provide wholly satisfactory results in all cases. A certain number of professional 2D–3D converters are available on the market, but none of them can guarantee perfect results for video sequences of unknown origin. Certain specific error types are generally encountered with 2D–3D converters. Objects in the foreground of an image may appear in the background and vice versa. Transparency and fine structures that are difficult to segment, such as rain, hair or the holes in a tennis racket, are generally badly interpreted or completely ignored by converters. The segmentation process may treat fine structures as noise and merge them into the background, or over-segment them, pulling in parts of the background. In both cases, the results are unconvincing in terms of depth.

A variety of simple and inefficient converters are available to the public, both as software [5] and as dedicated devices [6]. The simplest of these methods takes the following or previous image of a video sequence as the second image of the stereoscopic pair (sketched below). If the camera captures a regular lateral panoramic view, the disparity between two consecutive images produces a 3D effect; the faster the camera moves, the stronger the effect. This approach rarely works and, if the camera moves in a vertical plane, the results are very uncomfortable. Moreover, if the camera changes direction, the depth effect is inverted and the background switches to the foreground. A more sophisticated variation of this method conserves only the horizontal component of the image shift, reducing undesirable effects, but this does not prevent the 3D effect from disappearing when the camera ceases to move.
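The frame-delay method described above is simple enough to sketch in a few lines. This is a toy illustration, not a recommended converter; the delay parameter and the frame representation are assumptions.

```python
def frame_delay_stereo(frames, delay=1):
    """Naive 2D-3D conversion: pair each frame with a temporally shifted one.

    frames: list of video frames (the original left views).
    delay: temporal offset used to fake the right view. The 3D effect only
    appears during steady lateral pans and inverts when the pan reverses.
    """
    pairs = []
    for i, left in enumerate(frames):
        right = frames[min(i + delay, len(frames) - 1)]
        pairs.append((left, right))
    return pairs
```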
2.1. CONVERSION STAGES

The purpose of the industrial video conversion workflow is to create the second view of a stereoscopic pair using the first view. For reasons of efficiency, the various stages in the procedure are carried out by dedicated workers or programs. As automatic algorithms are unable to produce perfect results at each stage, visual checks by experienced staff are necessary after each of the stages listed below:

1) Detection of key images (in the original, left view), between which movements in different depth planes are sufficiently linear and/or predictable.
2) Segmentation of foregrounds, characters and midshots.
3) Evaluation of the depth Z of the center of gravity of each element.
4) Spatial propagation of Z to all of the pixels of each element in the image.
5) Temporal propagation of the segmentation and depth maps of each element to images between two key images.
6) Generation of right images for the whole sequence.
7) Correction of disocclusion artifacts produced during the previous stage.

All of these stages, with the exception of step 6, which is completely automatic, are semi-automatic, meaning that they use algorithms and computing tools manipulated by human operators. One example is the automatic detection of contours, which are then corrected using spline tools.
III. DEPTH MAPS: CALCULATION AND PROPAGATION
There are three commonly used depth estimation methods for 2D-to-3D conversion applications:
Depth from blur: the basic idea is to estimate the depth information from the amount of blur on each object;

Vanishing point based depth estimation: the main idea is to find the vanishing point, which is taken to be the farthest point of the whole image;

Depth from motion parallax: this is based on the fact that objects with different motions usually have different depths. Near objects move faster than far objects, so relative motion can be used to estimate the depth map (see the sketch below). This method is widely used for depth estimation in 2D-to-3D video conversion.
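As a rough illustration of depth from motion parallax, the sketch below maps dense optical-flow magnitude to a gray-scale depth map. It assumes a laterally translating camera and a static scene, and uses OpenCV's Farneback flow estimator; the parameter values are ordinary defaults, not tuned ones.

```python
import cv2
import numpy as np

def depth_from_motion(prev_gray, next_gray):
    """Relative depth from two consecutive grayscale frames.

    Under a laterally moving camera and a static scene, larger apparent
    motion means a nearer object, so flow magnitude acts as inverse depth.
    """
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        pyr_scale=0.5, levels=3, winsize=15,
                                        iterations=3, poly_n=5,
                                        poly_sigma=1.2, flags=0)
    magnitude = np.linalg.norm(flow, axis=2)
    # Normalize to an 8-bit map: fast-moving (near) pixels become bright.
    depth = cv2.normalize(magnitude, None, 0, 255, cv2.NORM_MINMAX)
    return depth.astype(np.uint8)
```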
Once the relative distances between the elements of the scene have been obtained, each element is assigned an absolute distance in relation to the camera. This stage may, of course, be carried out manually, but an intelligent program can prove effective by exploiting a number of depth cues used by the human brain: a priori knowledge of objects, the relative size of several identical objects, etc. A human head in close-up, for example, will clearly be close; a human silhouette one-fourth of the height of the screen will be approximately 10 m away; a car a few pixels long will be in the background; and the sky will be considered to be in the background.
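The kind of a-priori reasoning just described can be caricatured as a small rule table. The labels, the detector producing them and the numeric rules are hypothetical; only the silhouette rule of thumb comes from the text above.

```python
def absolute_distance(label, height_fraction):
    """Rough camera distance in meters from simple a-priori cues.

    label: coarse object class from some (hypothetical) detector.
    height_fraction: object height relative to the frame height.
    """
    if label == "sky":
        return float("inf")  # the sky is always treated as background
    if label == "person":
        # Rule of thumb from the text: a silhouette 1/4 of the frame ~ 10 m.
        return 10.0 * (0.25 / max(height_fraction, 1e-3))
    if label == "face" and height_fraction > 0.5:
        return 1.0           # a head in close-up is clearly close
    return 50.0              # push unknown elements toward the background
```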
Figure 1. 2D and Depth map images.
An automatic algorithm begins by seeking a horizon, as exterior shots are common. It then applies a first estimation: the sky is in the background, and the ground approaches linearly from the horizon to the foot of the camera (a minimal sketch of this initialization is given after the list below). All possible depth cues are then explored and used, whether automatically or manually:
If the camera offers a limited depth of field, blurring may be used to evaluate which elements form the center of interest in the image and are thus close to the zero parallax distance;
Perspective, receding lines and the position of vanishing points are very useful in locating the relative depth of buildings, roads, the edges of sports pitches, rooms, etc.;
For objects with a strong incline relative to the image plane, we define not a single depth but a depth gradient. This is typically the case for the ground, or for walls;
Movements in highly dynamic scenes are used to determine the distance of an object of known size, such as a vehicle, a ball or a character. If the position of the ground has already been determined, for example, moving persons can be located at the distance where their feet touch the ground.
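The initial horizon-based estimation mentioned before this list can be sketched as follows, assuming the horizon row has already been detected; the 8-bit encoding (bright = near) matches the gray-scale depth maps used throughout.

```python
import numpy as np

def horizon_ramp_depth(height, width, horizon_row):
    """Initial depth estimate for an exterior shot.

    Rows above horizon_row are sky (far, dark in the map); below it, the
    ground approaches linearly from the horizon to the foot of the camera.
    """
    depth = np.zeros((height, width), dtype=np.uint8)
    ground_rows = height - horizon_row
    for y in range(horizon_row, height):
        # Linear ramp: horizon row = far (0), bottom edge = near (255).
        depth[y, :] = int(255 * (y - horizon_row) / max(ground_rows - 1, 1))
    return depth
```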
At this stage, we obtain a depth map made up of several planes that can be recognized by their different brightnesses. Each element in the image then needs to be refined; these elements are generally memorized in several distinct superposed layers, and the depth of each is determined by imposing a light-level variation in the depth map, giving an inclination or roundness. Once again, monocular depth cues are used to add detail to the result, such as shadow and light effects revealing the shape of a face. A priori knowledge of objects and characters is also very important. This knowledge is used, for example, to round out faces or balls, either by directly modifying the depth map by hand or by creating computer models of these elements from which the depth map may be extracted. This operation is generally supervised and corrected by hand. Thus, the operator ensures that characters are correctly anchored to the ground by giving their feet the same depth (and thus the same shade in the depth map) as the ground below them. At the end of this stage, the key image has a perfect depth map, and a first result may be visualized on screen before proceeding to the next step.

Next, using the depth maps of the key images of a video sequence, interpolation techniques are applied to create depth maps for the intermediate images. Simple linear interpolation may be used if the movement is sufficiently uniform. In all cases, the aim is for the depth values between two key images to follow the movement, while remaining coherent with the depth specified in the key images. The quality of segmentation and of motion estimation is crucial to successful interpolation. Temporal filtering of depth maps may also be used to improve quality, increasing the coherence of depth across a whole sequence [7]. Manual validation is always helpful, as a movement that initially appears linear may not, in fact, be completely linear; segmentation contours, depth values and other parameters may then need to be adjusted.
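For the uniform-motion case, the linear interpolation of depth maps between two key images reduces to a per-pixel blend. The sketch below assumes aligned, same-sized 8-bit maps; production pipelines propagate along motion vectors rather than blending in place.

```python
import numpy as np

def interpolate_depth(key_a, key_b, t):
    """Linearly interpolate the depth maps of two key images.

    key_a, key_b: uint8 depth maps of the surrounding key images.
    t: position of the intermediate image in [0, 1].
    Valid only when motion between the key images is close to uniform.
    """
    a = key_a.astype(np.float32)
    b = key_b.astype(np.float32)
    return ((1.0 - t) * a + t * b).round().astype(np.uint8)
```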
IV. QUALITY ASSESSMENT

We acquired eight 3D video clips with associated ground-truth depth maps, which we used to evaluate depth-based 2D-to-3D video-conversion algorithms. In the objective evaluation, we employ the Mean Squared Error (MSE) metric to measure the differences between the propagated depth maps and the ground truth. The MSE results, listed in Table I, show that the efficacy of the algorithms can be clearly distinguished.

TABLE I. MSE EVALUATION RESULTS OF THE DEPTH-BASED 2D-TO-3D CONVERSION ALGORITHMS

Methods             Seq1   Seq2  Seq3   Seq4    Seq5    Seq6    Seq7   Seq8
Bilateral Filter    42.59  7.76  87.14  607.15  131.40  71.18   85.70  387.67
Philips IDP [8]     40.91  7.55  94.83  548.94  124.75  70.01   79.78  360.47
Cao's Method [9]    47.46  8.51  40.04  245.50  249.98  191.53  40.48  227.29
Li's Method [10]    16.89  5.51  41.98  190.77  86.97   69.24   19.27  105.81
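A per-sequence score of the kind reported in Table I could be computed as below, assuming the propagated and ground-truth depth maps are same-sized 8-bit arrays; this is a generic MSE sketch, not the paper's evaluation code.

```python
import numpy as np

def sequence_mse(estimated_maps, ground_truth_maps):
    """MSE between propagated and ground-truth depth maps, averaged
    over all frames of a sequence."""
    errors = [np.mean((est.astype(np.float64) - gt.astype(np.float64)) ** 2)
              for est, gt in zip(estimated_maps, ground_truth_maps)]
    return float(np.mean(errors))
```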
V. CONCLUSION
2D–3D conversion lies on the border between technical and artistic activity, and human participation remains essential. A depth script is established for each sequence to be converted, prior to the execution of a variety of partly manual and partly automated stages. The 2D–3D conversion workflow is broken down into a series of stages, notably depth map generation, segmentation, missing image generation and artifact suppression. Certain commercial solutions offer fully automated 2D–3D conversion, but the results are generally unsatisfactory, with the exception of very specific cases where the geometry of the scene is subject to strong constraints, movements are linear and predictable, and segmentation is simple. Not all content is equally suited to 2D–3D conversion. Conversion is notably facilitated when it is planned during the filming phase. In such cases, the costs and complexity of filming are identical to those of 2D filming, and the conversion stage, with the benefit of favorable framing, staging and other conditions, does not have an excessive effect on the production budget.

REFERENCES

[1] MICHEL B., "La conversion 2D–3D", in La Stéréoscopie Numérique, Eyrolles, Chapter 5, 2011.
[2] XU F., LAM K.-M., DAI Q., "Video-object segmentation and 3D-trajectory estimation for monocular video sequences", Image and Vision Computing Journal, vol. 29, no. 2–3, pp. 190–205, 2011.
[3] DA SILVA V., "Depth image based stereoscopic view rendering for MATLAB", available at http://www.mathworks.com/matlabcentral/fileexchange/27538-depthimage-based-stereoscopic-view-rendering, 2010.
[4] GRAZIOSI D., TIAN D., VETRO A., "Depth map up-sampling based on edge layers", Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Hollywood, CA, pp. 1–4, 3–6 December 2012.
[5] GMBH E.M., "MakeMe3D software", available at http://www.makeme3d.net/convert_2d_to_3d.php, 2010.
[6] IPP, "3D media converter box", available at http://ippstore.com/3D_Media_Converter_Box.html, 2011.
[7] BLEYER M., GELAUTZ M., "Temporally consistent disparity maps from uncalibrated stereo videos", Proceedings of the 6th International Symposium on Image and Signal Processing and Analysis (ISPA), Salzburg, pp. 383–387, 16–18 September 2009.
[8] VAREKAMP C., BARENBRUG B., "Improved depth propagation for 2D to 3D video conversion using key-frames", 4th European Conference on Visual Media Production (IETCVMP), pp. 1–7, 2007.
[9] CAO X., LI Z., DAI Q., "Semi-automatic 2D-to-3D conversion using disparity propagation", IEEE Transactions on Broadcasting, vol. 57, pp. 491–499, 2011.
[10] LI Z., XIE X., LIU X., "An efficient 2D to 3D video conversion method based on skeleton line tracking", 3DTV Conference: The True Vision - Capture, Transmission and Display of 3D Video, 2009.