A MODULAR APPROACH TO VIRTUAL VIEW CREATION FOR A SCALABLE IMMERSIVE TELECONFERENCING CONFIGURATION

Eddie Cooke, Ingo Feldmann, Peter Kauff, Oliver Schreer
Fraunhofer-Institut für Nachrichtentechnik, Heinrich-Hertz-Institut, Germany
Email: {cooke, feldmann, kauff, schreer}@hhi.de

ABSTRACT

This paper details a new modular approach to virtual view creation designed to work in conjunction with a proposed scalable teleconferencing configuration. This scalable system is built from building blocks defined as Stereo Camera Analysis Blocks (SCABs). A SCAB consists of axis-parallel, narrow-baseline stereo cameras, which provide design flexibility and improve the image analysis process. Virtual view creation is modular in the sense that we can add or remove SCABs to suit particular requirements without having to modify the view synthesis algorithm. A new approach to virtual view creation from multiple images is defined. It contains two separate processes: surface segmentation, which identifies surfaces, and surface selection and merging, which selects and integrates the best view of a required surface into the virtual view. The surfaces are identified via a sampling density function, while the surface selection procedure is based on a weighting scheme.

1. INTRODUCTION

Teleconferencing systems endeavour to provide a level of user participation within a virtual environment that is comparable to the real-world conferencing scenario, i.e. gestures, eye contact, parallax viewing, etc. [1]. Today's state-of-the-art systems are constructed using the popular wide-baseline convergent stereo camera configuration, with virtual views being created for each camera pair, fig. 1(a). This approach, however, tends to create very system-specific solutions: if the number of conferees is extended or the system set-up changes, then both a new system configuration and a new synthesis algorithm are required.
In such a stereo camera set-up the wide baseline allows a maximum amount of information to be captured, while the camera convergence is required to ensure enough overlap in the reference images for disparity estimation. The inherent problem with this configuration is that it aspires to achieve too much through each camera pair: maximum information and reliable disparity maps [2]. The disparity map generation is error-prone simply due to the differences in the critical surface information each reference image captures, e.g.,
hand gestures. Such problems can be reduced through the use of segmentation masks or 3D models [3], but these lead to the development of very inflexible solutions.

Fig. 1. (a) Wide-baseline convergent camera pairs (A,B) and (C,D). (b) Scalable 4-SCAB configuration.

Several methods for virtual view creation in teleconferencing systems have been proposed. Kauff et al. [4] concentrate on maximising sampling density by combining images along a disparity-based separation line, while Lei and Hendriks [5] produce a virtual view at the midpoint of the baseline through a simplified 3D warp. Both of these approaches, however, are restrictive: they create an initial virtual view along the baseline and therefore require a hidden layer to handle occlusions that may arise due to further 3D warps; they use a maximum of two reference images for virtual view creation; and they have only been validated on wide-baseline systems. In this paper we present a teleconferencing system that is both scalable and modular. In section 2 we introduce the building block of our scalable system configuration, the Stereo Camera Analysis Block (SCAB). We present a new modular approach to virtual view creation in section 3. In section 4 experimental results are provided. Finally, section 5 provides conclusions.

2. SCALABLE CONFIGURATION

As stated, the inherent problem of the wide-baseline convergent stereo camera set-up is that it aspires both to generate reliable disparity maps and to maximise the information captured by the individual cameras. The problems generated in trying to obtain this trade-off have been mentioned. We
will now present a new system configuration devised to achieve both these goals efficiently. This configuration is based on a building block called the SCAB, fig. 2(a).

Fig. 2. Block diagram of (a) a SCAB and (b) the modular system.

A SCAB consists of stereo cameras that are axis-parallel and have a narrow baseline. This set-up reduces the disparity estimation complexity by ensuring that differences in surface orientation are minimised via a more consistent overlap between reference images. In order to ensure that we still have enough visual information to create the required virtual views, we use multiple SCABs. Comparing the set-up of fig. 1(a) with that of fig. 1(b), we can see that camera A has been replaced with the SCAB (A1, A2) and camera B with the SCAB (B1, B2). Whereas previously disparities were estimated between cameras (A,B), using SCABs disparities are estimated between cameras (A1, A2) and between cameras (B1, B2). This approach guarantees reliable disparity maps for each camera in a SCAB. Due to the similarities between the individual reference images of a SCAB, we need only select one image from each SCAB for the virtual view creation process. The result is a scalable system in which SCABs can be added or removed depending on the display size and the number of conferees. In the following subsection we determine the relationship between SCAB disparity estimation precision and virtual view synthesis quality.

2.1. Required Level of Disparity Precision

Our teleconferencing scene is limited to a depth range between z_min and z_max, the near and far depth clipping planes respectively. The relationship between the depth z, the baseline length Δt, and the disparity d is expressed in eq. (1), where f is the focal length:

d = f Δt / z    (1)

It has been shown in [6] that equal disparity values define a specific depth layer within this depth range.
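Eq. (1) can be sketched as follows; the focal length, baseline, and depth values below are invented purely for illustration and are not taken from the paper:

```python
def disparity(f_px: float, baseline: float, depth: float) -> float:
    """Eq. (1): d = f * dt / z, with f in pixels, baseline and depth in metres."""
    return f_px * baseline / depth

# Illustrative values only: a 5 cm SCAB baseline and f = 1000 px.
d_near = disparity(1000.0, 0.05, 1.0)  # 50.0 px at a 1 m near clipping plane
d_far = disparity(1000.0, 0.05, 3.0)   # ~16.7 px at a 3 m far clipping plane
```

Note how the narrow SCAB baseline keeps the disparity range small, which is what makes the estimation problem tractable.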
It is also known that, in order to avoid depth-dependent artifacts during synthesis, the movement of the virtual camera should be restricted to the baseline. Hence, there is a dependency between the scene depth, the disparity estimation precision, and
virtual view synthesis quality. In [6], Chai et al. examine the dependencies between the baseline, the image and depth resolution N_pix, and the number of depth layers N_Layers for the case of an equidistant infinite camera array. Such a camera array corresponds to a special version of a light field, introduced in [7, 8], where the cameras are parallel to each other and placed regularly on a straight line. One of the basic results Chai et al. present is that between two neighbouring reference cameras any virtual view can be reconstructed without loss of information if:

Δt = 2 w_h z_min z_max / (N_Layers N_pix f (z_max − z_min))    (2)

where w_h is the horizontal sensor size. In order to determine the correct disparity precision for our SCAB module, the maximal required depth resolution must be determined for the renderer. In this case the head of the user can be interpreted as the center of the virtual camera, the display as its image plane, and the distance between them as the focal length. The movement of the user's head corresponds to that of the virtual camera. Applying eq. (2), it is possible to determine the minimal required number of depth layers according to the maximum expected navigation range of the virtual camera and the near and far depth clipping planes. Using this depth layer quantity, obtained for the virtual camera, together with the information defined by our SCAB module with its narrow baseline, we can determine what level of precision, pixel or sub-pixel, our disparity estimation process requires in order to render artifact-free virtual views at positions outside the baseline.

3. MODULAR VIRTUAL VIEW CREATION

Modular virtual view creation implies that we are able to create a virtual view based on the input provided by an N-SCAB set-up. Hence, we can add or remove SCABs based on our particular requirements without having to modify the synthesis algorithm. The Multi-View Synthesis (MVS) block, fig. 2(b), indicates our virtual view creation process.
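To make the use of eq. (2) in section 2.1 concrete: solving it for N_Layers gives the minimal depth-layer count for a given navigation range Δt. A minimal sketch follows; the function name and signature are our own, and the text prescribes no specific units or values:

```python
import math


def min_depth_layers(w_h, z_min, z_max, dt, n_pix, f):
    """Solve eq. (2) for N_Layers: the minimal number of depth layers needed
    so that any virtual view within a navigation range dt along the baseline
    can be reconstructed without loss of information.
    w_h: horizontal sensor size, f: focal length (same units as w_h),
    z_min/z_max: near/far clipping planes, n_pix: image/depth resolution."""
    return math.ceil(2.0 * w_h * z_min * z_max
                     / (dt * n_pix * f * (z_max - z_min)))
```

From N_Layers the required disparity step for the SCAB's narrow baseline follows via eq. (1), telling us whether pixel or sub-pixel estimation precision is needed.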
The most important task of the MVS process is to discern from which reference image each surface required in the virtual view should be taken. Current approaches to this selection are virtual view independent, meaning they restrictively identify and select surfaces in the reference images without ever considering the actual virtual view position [4, 5, 9]. We adopt a virtual view dependent approach to surface identification and selection: all reference images are initially warped to the virtual viewpoint before the surface identification and selection process occurs. The MVS block contains two separate processes: (i) surface segmentation, which allows us to identify surfaces within the warped reference images, and (ii) surface selection and merging, which allows us to select and integrate the best view of a surface into the final virtual view.
3.1. Surface Segmentation

The warped reference images are segmented into surfaces using sampling density information. This is possible based on the observation that each pixel position represents a point sample of a 2D surface that lies on our object in 3-space. In order to compute the sampling density of a warped pixel q, we examine its movement with respect to that of its four diagonal neighbours q_1, q_2, q_3, q_4 after the 3D warping. The sampling density ρ(q) is computed using eq. (3):

ρ(q) = mean_{i=1,…,4} |q − q_i|    (3)
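A minimal numpy sketch of eq. (3); the array layout (per-pixel warped coordinate maps) and the NaN border handling are our own choices, not prescribed by the paper:

```python
import numpy as np


def sampling_density(xw, yw):
    """Eq. (3): for each interior pixel, the mean Euclidean distance between
    its warped position and those of its four diagonal neighbours.
    xw, yw: HxW arrays holding the warped x/y coordinate of every source pixel."""
    p = np.stack([xw, yw], axis=-1).astype(float)
    centre = p[1:-1, 1:-1]
    diagonals = [p[:-2, :-2], p[:-2, 2:], p[2:, :-2], p[2:, 2:]]
    dists = [np.linalg.norm(centre - d, axis=-1) for d in diagonals]
    rho = np.full(p.shape[:2], np.nan)  # border pixels lack all 4 neighbours
    rho[1:-1, 1:-1] = np.mean(dists, axis=0)
    return rho
```

Note that with this distance convention an unwarped grid yields ρ = √2 everywhere, so a normalisation (e.g. division by √2) would be needed for a perfectly sampled surface to score 1 as in section 3.2; the exact convention is not specified in the text.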
A sampling density value is computed for each pixel in the warped reference image and stored in a sampling density map. Surfaces are located based on the notion that adjacent samples in the reference image which belong to one surface are warped to adjacent positions in the virtual view, thereby creating neighbourhoods of similar ρ values. This grouping process functions as follows: let q be the current pixel under investigation, and N_i its set of immediate neighbouring pixels. For each q we compute δ(q), a measure of the difference between its sampling density value and that of its neighbouring pixels, eq. (4):

δ(q) = ρ(q) − mean_{r ∈ N_i} ρ(r)    (4)
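A sketch of eq. (4) and its use for boundary detection; the 8-connected neighbourhood and the threshold are assumptions on our part, as the text does not fix either:

```python
import numpy as np


def density_difference(rho):
    """Eq. (4): delta(q) = rho(q) - mean of rho over the immediate
    (here: 8-connected) neighbours of q. Edge pixels are left as NaN."""
    delta = np.full_like(rho, np.nan, dtype=float)
    neigh = [rho[:-2, :-2], rho[:-2, 1:-1], rho[:-2, 2:],
             rho[1:-1, :-2],                rho[1:-1, 2:],
             rho[2:, :-2],  rho[2:, 1:-1],  rho[2:, 2:]]
    delta[1:-1, 1:-1] = rho[1:-1, 1:-1] - np.mean(neigh, axis=0)
    return delta


def boundary_mask(rho, thresh):
    """Mark pixels whose density spike exceeds a (system-dependent) threshold."""
    d = density_difference(rho)
    return np.abs(d) > thresh  # NaN comparisons are False, so edges stay unmarked
```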
We compute the value δ(q) for each pixel q in order to locate where spikes in movement occur during the warping of an image. These spikes indicate the boundary positions of overlapping surfaces at different depths. We use these boundaries to divide the warped image into surfaces based on the severity of their individual displacements. This leads to a visually better, view-related surface division, as opposed to a strict depth- or colour-based surface segmentation. Fig. 3 illustrates how this approach functions. Fig. 3(a) and (b) are two reference images taken from a 2-SCAB set-up. Both images are warped to the same location. Fig. 3(c) and (d) illustrate the identified hands and arms in the warped images. Note that the filled (red) area indicates a surface which is occluded in the reference image. The output of the surface segmentation process is a sampling density map which, for every pixel, provides a corresponding sampling surface bitmask. This bitmask contains the following information: a surface identifier; a surface boundary indicator for boundary positions; a surface sampling density value; and a value to indicate whether this surface has already been processed.

Fig. 3. (a) & (b) Reference images from a 2-SCAB set-up. (c) & (d) Close-ups of the sampling density maps of the hands after the warping of images (a) & (b) respectively.

3.2. Surface Selection and Merging

Our approach to specifying the quality of a surface Q is based on a weighting scheme that depends on three separate weights. We will now explain the significance of these individual weights. Across the SCABs, the reference images are all rectified with respect to the same plane; this implies that the reference viewpoint with the closest proximity to the virtual viewpoint will provide the best initial virtual view. The weight α_d favours reference views closer to the virtual view and is therefore surface independent. The weight α_ρ is based on the current surface's sampling density, as indicated by its bitmask information. It is designed to favour surfaces with sampling densities equal or close to one, i.e. perfectly represented surfaces. The final weight α_t depends on the complexity of the surface's texture. Clearly, if a texture is very homogeneous, then interpolation of an under-sampled surface will produce a final surface representation of the same quality as a perfectly sampled view of the surface. This weight is designed to identify such situations. The surface merging process defines how the weights are used to specify the best view of a required surface and how the chosen surfaces are then combined in order to create the final virtual view. We identify three approaches to surface merging, presented in increasing order of complexity. (1) Hierarchical α_d ordering: accept all the identified surfaces in the sampling density map from the reference view with the largest α_d weight. Areas where surfaces are missing are then selected from the reference view with the next largest α_d. (2) Best α_ρ: iterate through the sampling density maps, selecting on each pass the unprocessed surface with the highest α_ρ weight. (3) Best combined weight: commencing with the sampling density map from the reference view with the largest α_d weight, we identify an unprocessed surface via the sampling surface bitmask information.
We compute the combined weight α_Q for each identified instance of the surface, eq. (5). We choose the surface with the largest α_Q as the best selection for our virtual view:

α_Q = α_d · α_ρ,Q · α_t,Q    (5)
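The "best combined weight" selection of eq. (5) can be sketched as follows; the dictionary fields and the numeric weights are invented for illustration only:

```python
def best_surface(candidates):
    """Return the candidate view of a surface maximising eq. (5):
    a_Q = a_d * a_rho * a_t."""
    return max(candidates, key=lambda c: c["a_d"] * c["a_rho"] * c["a_t"])


# Two hypothetical views of the same surface from a 2-SCAB set-up:
views = [
    {"view": "A1", "a_d": 0.9, "a_rho": 0.7, "a_t": 1.0},  # closer viewpoint
    {"view": "B1", "a_d": 0.6, "a_rho": 1.0, "a_t": 1.0},  # better sampled
]
chosen = best_surface(views)  # A1 scores 0.63, B1 scores 0.60
```

The proximity weight α_d can thus outvote a slightly better-sampled but more distant view, which matches the ordering rationale of approach (3).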
Integrating the best view of each required surface means that neighbouring surfaces in the final virtual view may be supplied from different original reference images. In order to lessen the effects of specularity, we implement a weighted blending at the surface boundaries.

4. EXPERIMENTAL RESULTS

Here we compare the virtual view created using reference images and disparity maps taken from a wide-baseline convergent camera set-up, (A,B) in fig. 1(a), with that of a 2-SCAB system, (A1, A2) and (B1, B2), illustrated in fig. 1(b). Fig. 4(a) shows the results of the former set-up. The view synthesis algorithm used is that defined in [5], where the virtual camera is placed at the midpoint of the baseline. The encircled area, enlarged in fig. 4(b), indicates the shortcomings of this system set-up: due to critical differences in the reference images, the disparity estimation process is unable to correctly identify the hand depth information; hence, after warping, the hand is incorrectly reconstructed. In the virtual view of fig. 4(c), created using the best α_ρ approach, the hand is correctly reconstructed, fig. 4(d). This is due to the fact that (i) each SCAB provides more image overlap and hence a more reliable disparity map, and (ii) using 2 SCABs we have at least the same reference image information as the wide-baseline approach.
Fig. 4. (a) Virtual view created on the wide-baseline system using the algorithm defined in [5]. (b) Critical region enlarged. (c) Virtual view created using our algorithm on a 2-SCAB set-up. (d) Critical region enlarged.

6. REFERENCES

[1] O. Schreer and P. Kauff, "An Immersive 3D Videoconferencing System based on a Shared Virtual Table Environment", Proc. Int. Conf. on Media Futures, 2001.
5. CONCLUSIONS

In this paper we presented a scalable teleconferencing system solution. To achieve this we divided the system into two separate building blocks, based on image analysis and synthesis respectively. We defined SCABs, which contain stereo cameras that are axis-parallel, have a narrow baseline, and produce more reliable disparity maps. We then defined an MVS block to which we can add or remove SCABs without having to modify the virtual view creation algorithm. We presented a new virtual view dependent approach to surface selection for virtual view creation. It is divided into two processes: surface segmentation, which enables us to identify surfaces within the warped reference images; and surface selection and merging, which allows us to determine and integrate the best view of each such surface. The surfaces are identified via a sampling density function, while the surface selection procedure is based on a weighting scheme. Using these building blocks we have the potential to create a system that is both scalable, in that SCABs can be added or removed depending on the display size and the number of conferees, and modular, in that the MVS algorithm is dynamic enough to produce virtual views from an N-SCAB set-up.
[2] P. Pritchett and A. Zisserman, "Wide Baseline Stereo Matching", Proc. Int. Conf. on Computer Vision, pp. 754-760, 1998.
[3] Y. Wu and T.S. Huang, "Hand Modeling, Analysis, and Recognition for Vision-Based Human Computer Interaction", IEEE Signal Processing Magazine, 2001.
[4] P. Kauff et al., "Advanced Incomplete 3D Representation of Video Objects using Trilinear Warping for Novel View Synthesis", Proc. PCS '01, 2001.
[5] B.J. Lei and E.A. Hendriks, "Middle View Stereo Representation", Proc. ICIP '01, 2001.
[6] J.-X. Chai et al., "Plenoptic Sampling", Proc. SIGGRAPH '00, pp. 307-318, USA, July 2000.
[7] S.J. Gortler et al., "The Lumigraph", Proc. SIGGRAPH '96, New Orleans, USA, August 1996.
[8] M. Levoy and P. Hanrahan, "Light Field Rendering", Proc. SIGGRAPH '96, USA, August 1996.
[9] E. Cooke et al., "Image-based rendering for teleconference systems", Proc. WSCG '02, Czech Republic, Feb. 2002.