Int. Conf. on Computer Vision, Bombay, 1998.
[5] J.-R. Ohm et al.: "A real-time hardware system for stereoscopic video conferencing with viewpoint adaptation".
Hybrid Recursive Matching and Segmentation-Based Postprocessing in Real-Time Immersive Video Conferencing
Oliver Schreer, Nicole Brandenburg, Serap Askar, Peter Kauff
Heinrich-Hertz-Institut für Nachrichtentechnik Berlin GmbH, Image Processing Department, Einsteinufer 37, D-10587 Berlin, Germany
Email: {schreer, brandenburg, askar, kauff}@hhi.de
Abstract
We present a novel, real-time disparity analysis framework developed for immersive teleconferencing. This two-stage method computes a limited number of highly reliable disparities in the first step and then fills the remaining holes using segmentation information. The disparity algorithm is a hybrid recursive matching (HRM) approach. Its computational effort is minimised by the efficient selection of a small number of candidate vectors, guaranteeing both spatial and temporal consistency of disparities. To improve the disparity fields in the case of occlusions, a segmentation-based postprocessing is applied. The teleconferencing application requires video processing at ITU-Rec. 601 resolution. In the current version, the algorithm generates disparity maps in real time for both directions (left-to-right and right-to-left) on an 800 MHz Pentium III processor.
1 Introduction
An immersive teleconferencing system enables conferees in different geographical locations to meet around a virtual table, appearing at each station in such a way as to create a convincing impression of presence [1]. The purpose is to let the participants use communication modalities as similar as possible to those of a face-to-face meeting (e.g., gestures and eye contact) and to eliminate the limits of non-immersive teleconferencing (e.g., face-only images in separate windows, unrealistic avatars). In the system we are developing, two stereo pairs capture images of each conferee at each station; from these, 3D disparity maps are computed at frame rate and used to synthesise views of the remote participants, adapted to the viewpoint of the local participant.

A key ingredient in such a system is the computation of the disparity field. The requirements are exacting: full-resolution video processing according to ITU-Rec. 601; computation in real time; no constraints on participants (e.g., clothing, visual markers); and cameras mounted around a wide screen, yielding a wide-baseline stereo geometry (see Figure 1). These requirements are not all met by existing approaches, which are typically based on hierarchical block matching [2] or optical flow [3]. Wide-baseline stereo has been investigated [4], but not necessarily in real-time applications. Several real-time stereo systems have been built around the parallel-camera configuration to minimise computational complexity (see [2], [5]).

The proposed algorithm, called hybrid recursive matching (HRM), provides sparse disparity maps in real time, reduced by a factor of 8 with respect to full-size ITU-Rec. 601. HRM combines the advantages of block-recursive matching and pixel-recursive optical flow estimation. The algorithm has already been used successfully for fast motion estimation in MPEG coding [6] and standards conversion [7], and has recently been applied to disparity estimation.

The crucial problem in 3D video conferencing is to produce an accurate and perspectively correct view of the conferee in the virtual scene. Occlusions caused by the hands and arms, in particular, result in visible artifacts in the final synthesis of the virtual view. Therefore, a two-stage method is proposed to generate spatially and temporally consistent disparity maps (Figure 1). The first stage is the fast and robust hybrid recursive disparity analysis, in which mismatches are allowed. In the second stage, mismatches are detected by an efficient consistency check between the LR and RL matches, and inconsistent disparities are replaced by interpolation or extrapolation. This step uses the result of an efficient segmentation algorithm that detects the hand regions.

VMV 2001, Stuttgart, Germany, November 21-23, 2001
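The LR/RL consistency check and the replacement of rejected disparities can be illustrated on a single scanline of a rectified pair. The following is a minimal sketch, not the authors' implementation: the function names, the tolerance `tol`, the sign convention (an LR disparity `d` at column `x` points to column `x + d` in the right image, and the RL disparity there should be close to `-d`) and the purely linear hole filling are all assumptions. The segmentation-guided part of the postprocessing is left out entirely.

```python
def consistency_mask(d_lr, d_rl, tol=1):
    """Return True where the LR disparity is confirmed by the RL disparity.

    Assumes integer disparities along one scanline of a rectified pair.
    """
    width = len(d_lr)
    mask = []
    for x, d in enumerate(d_lr):
        xr = x + d  # matching column in the right image
        ok = 0 <= xr < width and abs(d + d_rl[xr]) <= tol
        mask.append(ok)
    return mask


def fill_inconsistent(d_lr, mask):
    """Replace rejected disparities by linear inter-/extrapolation along the row."""
    valid = [x for x, ok in enumerate(mask) if ok]
    if not valid:
        return list(d_lr)  # nothing reliable to interpolate from
    out = list(d_lr)
    for x, ok in enumerate(mask):
        if ok:
            continue
        left = max((v for v in valid if v < x), default=None)
        right = min((v for v in valid if v > x), default=None)
        if left is None:
            out[x] = d_lr[right]  # extrapolate from the nearest valid value
        elif right is None:
            out[x] = d_lr[left]
        else:
            t = (x - left) / (right - left)
            out[x] = d_lr[left] + t * (d_lr[right] - d_lr[left])
    return out


# Toy scanline: the disparity at column 3 is an outlier.
d_lr = [1, 1, 1, 3, 1, 1, 1, 0]
d_rl = [0, -1, -1, -1, -1, -1, -1, -1]
mask = consistency_mask(d_lr, d_rl)
filled = fill_inconsistent(d_lr, mask)
```

In this toy row only column 3 fails the check, and it is filled from its two confirmed neighbours.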
Figure 1: Block diagram of the two-stage approach based on rectified images. (Diagram labels: original images, hybrid recursive matching, sparse disparity maps, postprocessing, dense disparity maps, LR, RL.)

This paper is organised as follows. Section 2 briefly reviews the HRM algorithm. Section 3 sketches the issues of postprocessing. Section 4 presents experimental results. Finally, we summarise and discuss our work.

2 Hybrid Recursive Matching (HRM)

The main idea of this new hybrid block- and pixel-recursive disparity estimator is to unite the advantages of block-recursive disparity matching and pixel-recursive optical flow estimation in one common scheme. The new hybrid analysis scheme has two main advantages over previous approaches. First, the recursive structure speeds up the analysis dramatically and allows a sparse disparity vector field to be computed for a full-size ITU-Rec. 601 stereo image pair on a common Pentium III processor. Second, the combined choice of spatial and temporal candidates yields reliable disparity vectors and spatially and temporally consistent disparity fields, owing to an efficient strategy of testing particular vector candidates. The latter aspect is important to avoid temporal inconsistencies in disparity sequences, which may cause strongly visible and very annoying artifacts in virtual views synthesised on the basis of these disparities. The epipolar constraint is exploited in the core of the algorithm, leading to a fast algorithm that works in arbitrary stereo camera configurations [8].

Figure 2: Outline of the HRM algorithm. (Diagram labels: left image, right image, block-recursive stage, block vector memory, 3 candidate vectors, selection of start vector (smallest DBD), start vector, pixel-recursive stage, pixel-recursion and clamping onto the epipolar line, update vector, selection of update vector (smallest DPD), selection of final vector (smallest DBD), block vector.)

In more detail, the structure of the whole algorithm can be outlined in three subsequent processing steps (see Figure 2):
1. Three candidate vectors are evaluated for the current block position by recursive block matching.
2. The candidate vector with the best result is chosen as the start vector for the pixel-recursive algorithm, which yields an update vector.
3. The final vector is obtained by comparing the update vector from the pixel-recursive stage with the start vector from the block-recursive one.

2.1 Block-Recursion
Block recursion is performed in the spatial and temporal directions on the grid of a sparse disparity vector field, usually with a grid size of 8x8 or 4x4. To cope with arbitrarily shaped video objects and to determine the spatial candidate vectors isotropically, the video frames are scanned in two interleaved, meandering paths that change their scan direction between even and odd lines and change their order from frame to frame, guided by a binary mask representing the shape of the video object. Three candidates are tested to select the best one for the current block-vector position: a vertical and a horizontal predecessor, which depend on the current direction of the scan path, and a temporal predecessor, taken from the previous reference frame (Figure 3). If no candidate is available, a default vector is taken.
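The candidate test described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: DBD is read here as the displaced block difference and implemented as a sum of absolute differences, the images are plain nested lists of grey values, unavailable candidates are passed as `None`, and the default vector is taken to be `(0, 0)`. The function and parameter names are hypothetical.

```python
def dbd(left, right, x0, y0, vec, size=4):
    """Sum of absolute differences between a left-image block and the
    right-image block displaced by vec (assumed meaning of DBD)."""
    dx, dy = vec
    height, width = len(right), len(right[0])
    total = 0
    for y in range(y0, y0 + size):
        for x in range(x0, x0 + size):
            xr, yr = x + dx, y + dy
            if not (0 <= xr < width and 0 <= yr < height):
                return float("inf")  # candidate points outside the image
            total += abs(left[y][x] - right[yr][xr])
    return total


def select_start_vector(left, right, x0, y0, horizontal, vertical, temporal,
                        default=(0, 0), size=4):
    """Test up to three candidates and return the one with the smallest DBD.

    Candidates that are not available are passed as None; if none is
    available, the default vector is returned.
    """
    candidates = [v for v in (horizontal, vertical, temporal) if v is not None]
    if not candidates:
        return default
    return min(candidates, key=lambda v: dbd(left, right, x0, y0, v, size))


# Toy pair: the right image is the left image shifted by two columns.
left = [[7 * x + 13 * y for x in range(8)] for y in range(8)]
right = [[left[y][x - 2] if x >= 2 else 0 for x in range(8)] for y in range(8)]

best = select_start_vector(left, right, 0, 0,
                           horizontal=(1, 0), vertical=(0, 0), temporal=(2, 0))
fallback = select_start_vector(left, right, 0, 0, None, None, None)
```

Because only three vectors are compared per block, the cost per block stays constant regardless of the disparity search range, which is the point of the recursive candidate scheme.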
Multiple pixel-recursion processes are started at every first pixel position of the odd lines in the current block (see Figure 4 for a 4x4 block).
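How several pixel-recursion processes per block might feed the final vector decision can be sketched as below. This is a hedged illustration only: "odd lines" is read as the 1st, 3rd, ... line of the block (rows 0, 2, ... when counting from zero), the pixel recursion itself is not reproduced and is passed in as a caller-supplied function, and the DPD and DBD measures are likewise supplied as cost functions on a vector. All names, including the toy stand-ins at the end, are hypothetical.

```python
def recursion_start_positions(x0, y0, size=4):
    """First pixel of every odd line (1st, 3rd, ...) of the block at (x0, y0)."""
    return [(x0, y0 + row) for row in range(0, size, 2)]


def hybrid_final_vector(start_vec, positions, run_pixel_recursion, dpd, dbd):
    """Select the update vector with the smallest DPD, then keep whichever of
    start vector and update vector has the smaller DBD (start vector on a tie)."""
    updates = [run_pixel_recursion(pos, start_vec) for pos in positions]
    update_vec = min(updates, key=dpd)
    return min((start_vec, update_vec), key=dbd)


# Toy stand-ins so the sketch runs; they do not model real image data.
target = (3, 0)


def toy_recursion(pos, vec):
    # Pretend each process moves the vector one or two steps towards the target.
    step = 1 if pos[1] % 4 == 0 else 2
    return (min(vec[0] + step, target[0]), vec[1])


def toy_cost(vec):
    return abs(vec[0] - target[0]) + abs(vec[1] - target[1])


positions = recursion_start_positions(8, 4)
final = hybrid_final_vector((1, 0), positions, toy_recursion, toy_cost, toy_cost)
```

For a 4x4 block this starts two processes, one per odd line, and the better of their results competes with the block-recursive start vector.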
\[ u_x = \begin{cases} 0 \\ \left( \dfrac{\delta f(x,y)}{\delta x} \right)^{-1} \end{cases} \]
\[ u_y = \begin{cases} 0 \\ \left( \dfrac{\delta f(x,y)}{\delta y} \right)^{-1} \end{cases} \]
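Figure 3: Choice of candidates for left-to-right/top-to-bottom (a) and right-to-left/bottom-to-top (b) scan directions. (Diagram labels: frame N, frame N-1, current pixel position, spatial preceding candidates, temporal preceding candidate.)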
, if
δf ( x, y )