Combining Stereo and Monocular Information to Compute ... - IJCAI

C o m b i n i n g Stereo and M o n o c u l a r I n f o r m a t i o n to C o m p u t e Dense D e p t h M a p s t h a t Preserve D e p t h Discontinuities Pascal Fua ( f u a @ m i r s a . i n r i a . f r ) * INRIA Sophia-Antipolis SRI International 2004 Route des Lucioles 333 Ravenswood Avenue 06565 Valbonne Cedex Menlo Park, CA 94025 France USA Abstract In this paper, we show how simple and parallel techniques can be efficiently combined to compute dense depth maps and preserve depth discontinuities in complex real world scenes. Our algorithm relies on correlation followed by interpolation. During the correlation phase the two images play a symmetric role and we use a validity criterion for the matches that eliminates gross errors: at places where the images cannot be correlated reliably, due to lack of texture or occlusions for example, the algorithm does not produce wrong matches but a very sparse disparity map as opposed to a dense one when the correlation is successful. To generate dense depth map, the information is then propagated across the featureless areas but not across discontinuities by an interpolation scheme that takes image grey levels into account to preserve image features. We show that our algorithm performs very well on difficult images such as faces and cluttered ground level scenes. Because all the techniques described here are parallel and very regular they could be implemented in hardware and lead to extremely fast stereo systems.

1

Introduction

Over the years numerous algorithms for passive stereo have been proposed, they can roughly be classified in two main categories [Barnard et al. 82]: 1. F e a t u r e B a s e d . Those algorithms extract features of interest from the images, such as edge segments or contours, and match them in two or more views. *This research was supported in part under the Centre National d'Etudes Spatiales VAP contract and in part under a Defense Advanced Research Projects Agency contract.

1292

Vision

These methods are fast because only a small subset of the image pixels are used, but may fail if the chosen primitives cannot be reliably found in the images; furthermore they usually only yield very sparse depth maps. 2. A r e a B a s e d . In these approaches, the system attempts to correlate the grey levels of image patches in the views being considered, assuming that they present some similarity. The resulting depth map can then be interpolated. The underlying assumption appears to be a valid one for relatively textured areas; however if may prove wrong at occlusion boundaries and within featureless regions. Alternatively the map can be computed by directly fitting a smooth surface that accounts for the disparities between the two images. This is a more principled approach since the problem can be phrased as an optimization one; however the smoothness assumptions that are required may not always be satisfied. A l l these techniques have their strengths and weaknesses and it is difficult to assess their compared merits since few researchers work on similar data sets. However, one can get a feel for the relative performances of these systems from the study by Guelch [Guelch 88]. In this work, the author has assembled a standardized data set and sent it to 15 research institutes across the world. It appears that the correlation based system developed at SRI by Hannah [Hannah 88] has produced the best results both in terms of precision and reliability. Unfortunately this system only matches a very small proportion, typically less than 1 % , of the image points. In this paper we propose a correlation algorithm that reliably produces far denser maps with very few false matches and can therefore be effectively interpolated. In the next section we describe our hypothesis generation mechanism that attempts to match every point in the image and uses a consistency criterion to reject invalid matches. This criterion is designed so that when the cor-

r e l a t i o n f a i l s , i n s t e a d o f y i e l d i n g a n i n c o r r e c t answer, the a l g o r i t h m r e t u r n s N O answer. A s a r e s u l t , t h e d e n s i t y o f t h e c o m p u t e d d i s p a r i t y m a p i s a v e r y g o o d measure o f i t s r e l i a b i l i t y . T h e i n t e r p o l a t i o n t e c h n i q u e described i n t h e section t h a t follows combines the d e p t h m a p produced b y c o r r e l a t i o n a n d t h e g r e y level i n f o r m a t i o n present i n the image itself to introduce d e p t h discontinuities and f i t a piecewise s m o o t h surface. T h e s e a l g o r i t h m s have p r o v e n v e r y effective o n r e a l d a t a . T h e i r p a r a l l e l i m p l e m e n t a t i o n o n a C o n n e c t i o n M a c h i n e t m 1 relies o n l y o n l o c a l o p e r a t i o n s a n d o n nearest n e i g h b o r c o m m u n i c a t i o n ; they could be ported to a dedicated architecture, t h e r e b y m a k i n g fast a n d cheap systems possible.

2

Correlation

M o s t c o r r e l a t i o n based a l g o r i t h m s a t t e m p t t o f i n d i n t e r est p o i n t s o n w h i c h t o p e r f o r m t h e c o r r e l a t i o n . W h i l e this approach is justified when only limited computing resources are a v a i l a b l e , w i t h m o d e r n h a r d w a r e architectures a n d m a s s i v e l y p a r a l l e l c o m p u t e r s i t becomes possible t o p e r f o r m t h e c o r r e l a t i o n over t h e w h o l e image a n d retain only results t h a t appear to be " v a l i d . " T h e hard p r o b l e m i s t h e n t o p r o v i d e a n effective d e f i n i t i o n o f w h a t w e call v a l i d i t y a n d w e w i l l propose one b e l o w . I n o u r a p p r o a c h , w e c o m p u t e c o r r e l a t i o n scores for every p o i n t i n t h e i m a g e b y t a k i n g a f i x e d w i n d o w i n t h e f i r s t image a n d a s h i f t i n g w i n d o w i n t h e second. T h e second w i n d o w is m o v e d in t h e second image by integer i n c r e m e n t s a l o n g t h e e p i p o l a r line a n d a n a r r a y o f correl a t i o n scores is g e n e r a t e d . In t h i s w o r k we use c o r r e l a t i o n o f grey level values a n d t a k e t h e c o r r e l a t i o n score t o b e s = m a x ( 0 , 1 — c) where (1) I 1 a n d I 2 b e i n g t h e left a n d r i g h t image i n t e n s i t i e s , X t h e average v a l u e of X over t h e c o r r e l a t i o n w i n d o w a n d dx,dy the d i s p l a c e m e n t a l o n g t h e e p i p o l a r l i n e . 2 The m e a s u r e d d i s p a r i t y can t h e n b e t a k e n t o b e the one t h a t p r o v i d e s t h e highest v a l u e o f s . I n f a c t , t o c o m p u t e the d i s p a r i t y w i t h s u b p i x e l accuracy, we f i t a second degree c u r v e t o t h e c o r r e l a t i o n scores i n the n e i g h b o r h o o d o f the m a x i m u m a n d c o m p u t e t h e o p t i m a l d i s p a r i t y b y i n terpolation. 2.1

Validity of the Disparity Measure

A s s h o w n b y N i s h i h a r a [ N i s h i h a r a e t a l . 83], t h e p r o b a b i l i t y of a m i s m a t c h goes d o w n as t h e size of t h e correl a t i o n w i n d o w a n d t h e a m o u n t o f t e x t u r e increase. H o w ever, u s i n g large w i n d o w s leads to a loss of accuracy and 1

T M C Inc. We remove the mean to offset transformations of the i m ages t h a t may result from slightly different settings of the cameras. 2

Figure 1: Consistent vs inconsistent matches: the two rows represent pixels along two epipolar lines of I1 and I2 and thearrows go from a point in one of the images towards the point in the other image that maximizes the correlation score. The match on the left is consistent because correlating from l\ to I2 and from I2 to I\ yields the same match unlike the matches on the right that are inconsistent.

the possible loss of important image features. For smaller windows, the simplest definition of validity would call for a threshold on the correlation score; unfortunately such a threshold would be rather arbitrary and, in practice, hard to choose. Another approach is to build a correlation surface by computing disparity scores for a point in the neighborhood of a prospective match and checking that the surface is peaked enough [Anandan 89]. It is more robust but also involves a set of relatively arbitrary thresholds. Here we propose a definition of a valid disparity measure in which the two images play a symmetric role and that allows us to reliably use small windows. We perform the correlation twice by reversing the roles of the two images and consider as valid only those matches for which we measure the same depth at corresponding points when matching from I 1 into I 2 and I 2 into I 1 . As shown in Figure 1, this can be defined as follows. Given a point P\ in I 1 , let P 2 be the point of I 2 located on the epipolar line corresponding to P\ such that the windows centered on P\ and P 2 yield the optimal correlation measure. The match is valid if and only if P\ is also the point that maximizes the score when correlating the window centered on P 2 w i t h windows that shift along the epipolar line of I 1 corresponding to P2 For example, the validity test is likely to fail in presence of an occlusion. Let us assume that a portion of a scene is visible in I 1 but not I 2. The pixels in l\ corresponding to the occluded area in I 2 will be matched, more or less at random, to points of I 2 that correspond to different points of I 1 and are likely to be matched with them. The matches for the occluded points will therefore be declared invalid and rejected. We illustrate this behaviour using the portion of the tree scene of Figure 2 outlined in Figure 2(a). Different parts of the ground be-

Fua

1293

(a)

(b)

(c)

F i g u r e 2: (a) An outdoor scene w i t h t w o trees and a stump.. Tl T h e same scene seen from the left (b) and the right (c) so t h a t different parts of the ground are occluded by the trees.

tween t h e t w o trees a n d between t h e trees a n d t h e s t u m p are o c c l u d e d in F i g u r e s 2 ( b ) a n d ( c ) . In F i g u r e s 3 (a) a n d ( b ) , w e show t h e c o m p u t e d d i s p a r i t i e s for t h i s i m a g e w i n d o w after c o r r e l a t i o n w i t h t h e images s h o w n i n F i g ures 2 ( b ) a n d 2(c) respectively. T h e p o i n t s for w h i c h n o v a l i d m a t c h can b e f o u n d appear i n w h i t e a n d t h e areas where t h e i r d e n s i t y becomes very h i g h c o r r e s p o n d v e r y closely t o t h e o c c l u d e d areas for b o t h pairs o f images. These results have been o b t a i n e d u s i n g 3 x 3 c o r r e l a t i o n w i n d o w s ; these s m a l l w i n d o w s are sufficient in t h i s case because t h e scene is v e r y t e x t u r e d a n d gives o u r v a l i d i t y test e n o u g h d i s c r i m i n a t i n g power t o a v o i d e r r o r s . We use t h e face s h o w n in F i g u r e 4 to d e m o n s t r a t e ano t h e r case in w h i c h t h e v a l i d i t y test rejects false m a t c h e s . T h e e p i p o l a r lines are h o r i z o n t a l a n d i n F i g u r e 4 ( d ) w e show t h e r e s u l t i n g d i s p a r i t y i m a g e , using 7x7 w i n d o w s , i n w h i c h t h e i n v a l i d m a t c h e s a p p e a r i n black. I n F i g u r e 4(e) we show a n o t h e r d i s p a r i t y i m a g e c o m p u t e d after h a v i n g s h i f t e d one o f t h e images v e r t i c a l l y b y t w o pixels, thereby degrading the calibration and the correlation. N o t e t h a t t h e d i s p a r i t y m a p becomes m u c h sparser b u t t h a t n o gross errors are i n t r o d u c e d . I n p r a c t i c e , w e t a k e a d v a n t a g e o f t h i s b e h a v i o u r for p o o r l y c a l i b r a t e d images: we c o m p u t e several d i s p a r i t y m a p s by s h i f t i n g one of the images up or d o w n a n d r e t a i n i n g t h e same epip>olar lines, 3 t h e r e b y r e p l a c i n g t h e line by a b a n d , and r e t a i n t h e highest s c o r i n g v a l i d m a t c h e s . 4 I n t h e t w o examples described a b o v e , w e have s h o w n t h a t w h e n t h e c o r r e l a t i o n b e t w e e n the t w o images o f a stereo p a i r is degraded o u r a l g o r i t h m t e n d s , i n s t e a d o f m a k i n g m i s t a k e s , t o y i e l d sparse m a p s . Generally 3

assumed not to be exactly vertical T h e precision of the computed distance is then obviously degraded, but remains qualitatively correct for large enough baselines. 4

1294

Vision

s p e a k i n g , c o r r e l a t i o n based a l g o r i t h m s r e l y o n t h e f a c t t h a t t h e same t e x t u r e can b e f o u n d a t c o r r e s p o n d i n g p o i n t s in the t w o images of a stereo p a i r a n d are k n o w n to fail when: • T h e areas to be c o r r e l a t e d have l i t t l e t e x t u r e . • T h e disparities vary rapidly w i t h i n the correlation window. • T h e r e is an o c c l u s i o n . If we consider t h e l o c a l image t e x t u r e as a s i g n a l to be f o u n d in b o t h images, we can m o d e l these p r o b l e m s as noise t h a t c o r r u p t s t h e s i g n a l . I n a c o m p a n i o n r e p o r t [ F u a 9 l ] , w e use s y n t h e t i c d a t a t o f o r m a l i z e t h i s argum e n t a n d show t h a t as t h e noise to s i g n a l r a t i o increases, or e q u i v a l e n t l y , as the p r o b l e m s m e n t i o n e d above become m o r e a c u t e , t h e p e r f o r m a n c e o f our c o r r e l a t i o n a l g o r i t h m degrades g r a c e f u l l y i n t h e f o l l o w i n g sense: As the signal is being degraded, t h e density of m a t c h e s decreases a c c o r d i n g l y b u t t h e r a t i o o f correct t o false m a t c h e s r e m a i n s h i g h u n t i l t h i s p r o p o r t i o n has d r o p p e d v e r y l o w . In o t h e r w o r d s , a r e l a t i v e l y dense d i s p a r i t y m a p is a guarantee t h a t the m a t c h e s are c o r r e c t , at least up to the precision allowed b y t h e r e s o l u t i o n b e i n g used. I n the r e p o r t [ F u a 9 l ] , we also s h o w t h e effectiveness of a v e r y s i m p l e h e u r i s t i c : i f w e reject n o t o n l y i n v a l i d m a t c h e s b u t also isolated v a l i d m a t c h e s we can increase even m o r e the r a t i o c o r r e c t / i n c o r r e c t m a t c h e s w i t h o u t losing a large n u m b e r o f t h e c o r r e c t answers. O t h e r stereo systems i n c l u d e a v a l i d i t y c r i t e r i o n s i m i l a r t o ours b u t use i t a s o n l y one a m o n g m a n y o t h e r s . In o u r case, because we c o r r e l a t e over t h e w h o l e image and not only at interest or contour points, we do not need t h e o t h e r c r i t e r i a a n d can r e l y o n d e n s i t y alone. H o w e v e r , o u r v a l i d i t y test depends o n t h e f a c t t h a t i t i s

Figure 3: (a) The result of matching 2(a) and 2(b) for window of 2(a) delimited by the white rectangle, (b) The result of matching 2(a) and 2(c) for the same windows, (c) The merger of four disparity maps computed using the image of Figure 2 (a) as a reference frame, the two other images of 2 and two additional images. Invalid matches appear in white and become almost dense in occluded areas of (a) and (b). The closest areas are darker; note that they are few false matches although the correlation windows used in this case are very small (3x3).

improbable to make the same mistake twice when correlating in both directions and can potentially be fooled by repetitive patterns, which is a problem we have not addressed yet. 2.2

Hierarchical Approach and Additional Images

To increase the density of our potentially sparse disparity map, we use windows of a fixed size to perform the matching at several levels of resolution, 5 which is almost equivalent to matching at one level of resolution with windows of different sizes as suggested by Kanade [Kanade et al. 90] but computationally more efficient. More precisely, as shown by Burt [Burt et al. 82], it amounts to performing the correlation using several frequency bands of the image signal. We then merge the disparity maps by selecting, for every pixel, the highest level of resolution for which a valid disparity has been found. In Figure 4 (c) we show the merger of the disparity maps for two levels of resolution that is dense and exhibits more of the fine details of the face than the map of figure 4 (e) computed using only the coarsest level of resolution. The reliability of our validity test allows us to deal very simply with several resolutions without having to introduce, as in [Kanade et al. 90] for example, a correction factor accounting for the fact that correlation scores for large windows tend to be inferior to those for small windows. The computation proceeds independently at all levels of resolution and this is a departure from traditional hierarchical implementations that make use of the results generated at low resolution to guide the search at 5

Computed by subsampling gaussian smoothed images.

higher resolutions. While this is a good method to reduce computation time, it assumes that the results generated at low resolution are more reliable, if less precise, than those generated at high resolution; this is a questionable assumption especially in the presence of occlusions. For example in the case of the trees of Figure 2, it could lead to a computed distance for the area between the trunks that would be approximately the same as that of the trunks themselves, which would be wrong. Furthermore, it can be shown [Fua 9 l ] that, in the absence of repetitive patterns, the output of our algorithm is not appreciably degraded by using the large disparity ranges that our approach requires. As suggested by several researchers, more than two images can and should be used whenever practical. When dealing w i t h three images or more, we take the first one to be our reference frame, compute disparity maps for all pairs formed by this image and one of the others and merge these maps in the same way as those computed at different levels of resolution. In this way we can generate a dense disparity map, such as the one of Figure 3 (c): the three images of Figure 2 belong to a series of five taken by an horizontally moving camera. Taking the image of 2(a) as our reference frame, we merge the four resulting disparity maps, each of them relatively sparse, to produce a dense map with few errors. In this section we have presented an hypothesis generation mechanism that produces depth maps that are correct where they are dense and unreliable only where they become very sparse. Typically these sparse measurements occur in featureless areas that are usually smooth and at occlusion boundaries where one expects to find an image intensity edge. To fit a surface, one must therefore

Fua

1295

Figure 4: (a) (b) Left and right 256x256 images of a face, (c) The disparity map obtained by merging the results computed at two levels of resolution, (d) Disparity map computed at the highest resolution, (e) Disparity map computed at the highest resolution after shifting the right up by one pixel, (f) Disparity map computed at the lower resolution. interpolate those measures in such a way as to propagate the depth information in the featureless areas and preserve depth discontinuities. In the next section, we describe the model and algorithm we use to perform the interpolation.

3

Simple Interpolation M o d e l

Ideally, if we could measure with absolute reliability the depth, w0, at a number of locations in the image, we could compute a depth image w by minimizing the fol-

1296

(2)

Interpolation

We model the world as made of smooth surfaces separated by depth discontinuities. We also assume that these depth discontinuities produce gradients in grey level intensities due to changes in orientation and surface material. We first describe a simple interpolation model that is well suited to images with sharp contrasts and then propose a refinement of that scheme for lower contrast scenes. 3.1

lowing criterion:

Vision

where c x and c y are two real numbers that control the amount of smoothing. As discussed in the previous section, when a valid dis parity can be found it is reliable and can be used, along with the camera models, to estimate ; we then take .s to be the normalized correlation score of Equation 1. Assuming that changes in reflectance can be found at depth discontinuities, we replace the and of Equation 2, by terms that are inversely proportional to the image gradients in the x and y directions. In fact, we have observed that the absolute magnitudes of the gradients are not as relevant to our analysis as their local relative mag nitudes: boundaries can be adequately characterized as

the locus of the strongest local gradients, independent of the actual value of these gradients. We therefore write:

interpolated image: (6)

(3)

where is a constant equal to 2 in our examples. (b) Interpolate again the raw disparity map using the new and coefficients

where FNorm is the piecewise linear function defined by (4) and x 0 and x 1 are two constants. In all our examples, x 0 IS the median value of x in the image and x 1 its maximum value . The result is quite insensitive to the value chosen for x0 as long as it does not become so large as to force the algorithm to ignore all edges. What really matters is the monotoniaty of FNorm that allows the depth information to propagate faster in the directions of least image gradient and gives to the algorithm a behaviour somewhat similar to that of adaptative diffusion schemes (e.g [Perona et al. 87]). To compute W, the vector of all values of w, we discretize the quadratic criterion C of Equation 2; we then solve, using a conjugate gradient method [Szeliski 90; Terzopoulos 86], the linear equation

(70 In Figure 3.1, we show the depth map computed by interpolating the disparity map of Figure 4(c). Note that the main features of the face, nose, eyebrows and mouth have been correctly recovered. This simple interpolation technique is appropriate for the face of Figure 4 that presents few low-contrast depth discontinuities but produces a somewhat blurry result for the tree scene of Figure 2, as can be seen in Figure 6(a) To improve upon this situation, we propose a slightly more elaborate interpolation scheme that takes depth discontinuities explicitly into account. 3.2

Introducing Depth Discontinuities

The and coefficients defined by Equation 3 introduce "soft" discontinuities: when the contrast is low, some smoothing occurs across the discontinuity. The depth image, however, is less smoothed than in the complete absence of an edge resulting in a strong w gradient. We take advantage of this property of our "adaptative" smoothing by defining the following iterative scheme: 1. Interpolate using the

and

defined above.

2. Iterate the following procedure: (a) Recompute and as functions of both the intensity gradient and the depth gradient of the

The algorithm converges after a few iterations resulting in a much sharper depth map such as the one of Figure 6(h). This algorithm can be regarded as a continuation method on the depth discontinuities. We start without knowing their location, use the grey level information to hypothesize them and then propagate the results.

4

Conclusion

In this work we have described a correlation based algor i t h m that combines two simple and parallel techniques to yield reliable depth maps in the presence of depth discontinuities, occlusions and featureless areas: • The correlation is performed twice over the two images by reversing their roles and only matches that are consistent in both directions are retained, thereby guaranteeing a very low error rate • The disparity map is then interpolated using a technique that, takes advantage of the grey level information present in the image to preserves depth discontinuities and propagate the information across featureless areas. The depth maps that we compute are qualitatively correct and the density of acceptable matches provides us with an excellent estimate of their reliability. Because of the great regularity and simplicity of the techniques described here, 6 we hope to be able to build dedicated hardware that would implement them and could, for example, be used by a mobile robot in an outdoor environment.

References [Anandan 89] P. Anandan. A computational framework and an algorithm for the measurement of motion. International Journal of Computer Vision, 2(3):283 310, 1989. [Barnard et al. 82] S T . Barnard and MA Fischler. Computational stereo. Computational Surveys, 14(4):553-572, 1982. 6

All the algorithms described here are implemented in *lisp on a Connection Machine and use exclusively local operations and nearest neighbor communication.

Fua

1297

(a) Figure 5:

(b)

(a) Interpolated depth map for the face of Figure 4 (b) Shaded views generated by shining a light on the surface.

Figure 6: (a) Trees depth image computed by smoothing using the algorithm of section 3.1. (b) Depth image after four iterations of the iterative scheme of section 3.2. We have stretched the depth images to enhance the contrast so that the furthest areas appear completely white. Note that the trunks and the stump clearly stand out. [Burt et al. 82] P.J. B u r t , C. Yen, and X. X u . Local correlation measures for motion analysis. In IEEE PRIP Conference, pages 269-274, 1982.

[Nishihara et al. 83] H.K. Nishihara and T.Poggio. Stereo vision for robotics. In 1SRR83 Conference, Bretton Woods, New Hampshire, 1983.

[Fua 91]

[Perona et al. 87] P. Perona and J. Malik, Scale space and edge detection using anisotropic diffusion. In IEEE Computer Society Workshop on Computer Vision, pages 16-22, M i a m i , Florida, 1987.

P. Fua. A Parallel Stereo Algorithm that Produces Dense Depth Maps and Preserves Image Features. Research Report 1369, I N R I A , January 1991.

[Guelch 88] E. Guelch. Results of test on image matching of isprs wg iii / 4. International Archives of Photogrammetry and remote sensing, 27(III):254-271, 1988. [Hannah 88] M.J. Hannah. Digital stereo image matching techniques. International Archives of Photogrammetry and remote sensing, 27(III).280293, 1988. [Kanade et al. 90] T. Kanade and M. Okutomi. A stereo matching algorithm with an adaptative window: theory and experiment. In Image Understanding Workshop, September 1990.

1298

Vision

[Szeliski 90] R. Szeliski. Fast surface interpolation using hierarchical basis functions. IEEE lYansactions on Pattern Analysis and Machine Intelligence, 12(6):513-528, June 1990. [Terzopoulos 86] D. Terzopoulos. Image analysis using multigrid relaxation methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8(2):129- 139, March 1986.