Robust Real-Time Stereo Edge Matching by Confidence-based Refinement

Jonas Witt and Uwe Weltin

IZT at the Hamburg University of Technology, Eißendorfer Straße 40, 21073 Hamburg, Germany
{jonas.witt,weltin}@tuhh.de
Abstract. Edges are important features for tasks like object detection and vision-based navigation. In this paper, a novel real-time capable stereo edge refinement technique is presented. It propagates confidence and consistency along the detected edges, which reduces false matches significantly. Unmatched pixels are safely recovered by interpolation. We also investigate suitable support regions for edge-based matching. In the proposed solution, depth discontinuities are specifically accounted for. All approaches are extensively tested with the Middlebury benchmark datasets¹ and compared to a sparse and several popular dense stereo algorithms.

Keywords: Robotic vision, edge-based, real-time, sparse stereo
1 Introduction
Dense stereo correspondence algorithms have been thoroughly studied in the last decades, and many different approaches with individual performance characteristics exist [12]. However, for robotic applications like object detection and navigation, dense information is often not required. Point-based systems are currently most common, but fail in sparsely textured scenes. Edge-based systems can fill this gap, as shown e.g. by Tomono [13] and Chandraker et al. [3] with their SLAM systems. In [5] an object detection system was presented which uses only edges with depth information.

Matching edge segments across two views poses different challenges than dense matching. Many edges lie on object borders, which can be a problem for correlation-based algorithms if the matching window is not carefully chosen; horizontal segments are particularly difficult. Also, since the matching is only sparse, one cannot gain confidence in disparities over homogeneous surfaces. On the other hand, the search space is significantly simplified by the restriction of disparities to edge loci, resulting in less computational effort. Thus, an evaluation of the performance of sparse methods versus dense methods that are sparsified to edge locations is interesting.
¹ Thanks to Daniel Scharstein and Richard Szeliski for maintaining their vision homepage http://vision.middlebury.edu/stereo/ and providing many benchmark datasets.
Previous papers present very different approaches to the problem of matching edges in two or three views. Different algorithms for straight lines have been proposed by [9], [8] and [1], the latter of which was also extended to parametric curves in [11]. A more recent publication proposes a multi-scale phase based algorithm with a probabilistic model for matching [14]. Here, we discuss a correlation-based approach which uses winner-takes-all (WTA) matching and an efficient confidence-based refinement technique which enforces smooth disparities along edges.
2 Stereo Edge Matching
For sparse edge matching one cannot simply apply well-known refinement techniques like median filtering or other local consensus-based methods, due to sparsity. Also, edges often lie on object borders, which specifically needs to be accounted for, as we will examine in Section 2.2. On the other hand, edges are in most cases more distinctive than an ordinary pixel on a smooth surface, as in dense matching, and we can still gain matching confidence by incorporating knowledge about adjacent edge pixels.

The common steps involved in stereo matching are preprocessing, cost aggregation, matching and refinement. Since we are interested in real-time capable algorithms for navigation purposes, we investigate simple winner-takes-all matching with a more sophisticated but fast refinement step that significantly improves the matching performance. The preprocessing step basically consists of Gaussian filtering and Canny edge detection [2] with subpixel refinement [4]. While still matching at the pixel level, we gain subpixel-accurate disparities for vertical and diagonal edges with minimal additional computations.
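The subpixel refinement step can be illustrated with the standard three-point parabola fit over the gradient magnitude. This is a simplified one-dimensional stand-in, not Devernay's exact method [4]; the function name and interface are our own:

```python
def subpixel_offset(g_minus, g_0, g_plus):
    """Fit a parabola through three gradient-magnitude samples at
    positions -1, 0, +1 and return the offset of its apex relative
    to the center sample. When g_0 is a strict local maximum the
    result lies in (-0.5, 0.5)."""
    denom = g_minus - 2.0 * g_0 + g_plus
    if denom == 0.0:  # flat neighbourhood: no refinement possible
        return 0.0
    return 0.5 * (g_minus - g_plus) / denom
```

A symmetric peak yields an offset of zero; a stronger right-hand neighbour pulls the estimate toward positive offsets.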
2.1 Matching Cost
Many different matching cost functions, similarity measures and transforms exist for the purpose of stereo matching [12]. Common matching costs for real-time stereo matching are the squared intensity difference and the absolute intensity difference. Both measures can be truncated to improve robustness in the face of outliers. Several tests on the Middlebury stereo sets and with a stereo camera in an office environment led us to select the truncated sum of absolute differences (SAD) as the measure of choice. Additionally, we subtract the mean intensity difference µ(x, d) to improve the matching of mildly shiny objects with specular reflections and to cope with different camera sensor sensitivities. We truncate this mean at tµ = 10 so that uniform surfaces with arbitrary intensity differences are not matched. The matching cost m(x, d) at location x with support region Γ and disparity d = (d, 0)^T accordingly is:
\mu(\mathbf{x}, d) = \max\left( \min\left( \frac{1}{|\Gamma|} \sum_{\mathbf{s} \in \Gamma} \big( I_L(\mathbf{x}+\mathbf{s}) - I_R(\mathbf{x}+\mathbf{s}-\mathbf{d}) \big),\ t_\mu \right),\ -t_\mu \right) \qquad (1)

m(\mathbf{x}, d) = \frac{1}{|\Gamma|} \sum_{\mathbf{s} \in \Gamma} \min\big( \left| I_L(\mathbf{x}+\mathbf{s}) - I_R(\mathbf{x}+\mathbf{s}-\mathbf{d}) - \mu(\mathbf{x}, d) \right|,\ t_{trunc} \big) \qquad (2)
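Equations (1) and (2) can be sketched in NumPy as follows. This is a minimal illustration, not the paper's optimized real-time implementation; patch extraction and layout concerns are omitted, and the patches are assumed to already be sampled at x+s and x+s−d:

```python
import numpy as np

T_MU, T_TRUNC = 10.0, 30.0  # truncation thresholds from the text

def matching_cost(left_patch, right_patch):
    """Mean-subtracted truncated SAD over a support region (Eqs. 1-2).
    left_patch and right_patch hold the support-region pixels of the
    left and right image for one disparity hypothesis."""
    diff = left_patch.astype(np.float64) - right_patch.astype(np.float64)
    mu = np.clip(diff.mean(), -T_MU, T_MU)                # Eq. (1)
    return np.minimum(np.abs(diff - mu), T_TRUNC).mean()  # Eq. (2)
```

A uniform brightness offset of up to tµ between the cameras yields zero cost, while larger per-pixel deviations are truncated at ttrunc.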
The term 1/|Γ| normalizes the matching score by dividing by the number of pixels in the support region Γ. The truncation parameter was empirically adjusted and finally set to ttrunc = 30. Different matching windows are investigated in the following section.
2.2 Cost Aggregation
For the aggregation of the matching costs, the simple symmetric support regions often used in real-time dense matching are not suitable. This is due to the nature of edges: they are intensity gradients which divide homogeneous intensity surfaces. These intensity gradients can occur on textured planar surfaces, but also at depth discontinuities as a result of overlapping surfaces of different intensity. Accordingly, many edge pixels lie on object borders and have an intensity that is usually a mixture of both surface intensities. In effect, the edge pixel intensity depends on the subpixel location α ∈ [0, 1] of the edge and both surface intensities I1 and I2:

I_{edge} = (1 - \alpha) I_1 + \alpha I_2 \qquad (3)

which results in an arbitrary value Iedge ∈ [I1, I2] that depends on the orientation and position of the camera. Basically this is true for every pixel, but by definition edge pixels mark the locations where this effect has the biggest impact on pixel intensities. Accordingly, edge pixels themselves are not very suitable for inclusion in an intensity-based matching score.

Figure 1 shows block matching on an edge segment. Here, 20% of the pixels in the support region belong to the edge, which can have a significant influence on the overall matching cost. Making the support region larger reduces this effect, but also decreases the ability to match small objects. The common occurrence of depth discontinuities at edge locations also has to be specifically incorporated into the design of the support region. Consider again the edge depicted in Figure 1. If the region to the left of the edge belongs to a foreground object and the region to the right to the background, the actually matching pixels of the background will be shifted by the difference in disparities, which is three in this case. This can be accounted for, as described in [6], but at a computational cost. Shiftable filters as evaluated in [12] are a more efficient possibility, though they need to be adapted for use on edges. With simple block matching we may end up with fewer than 50% of the pixels in the support region being suitable matching candidates on object borders. If an object border
Fig. 1. Block matching with a 5x5 window. The window overlaps the edge, which is problematic if the edge belongs to a depth discontinuity.
Fig. 2. Shifted pixel-block matching windows for two edges with different angles. The pixel-blocks do not include the edge pixel itself (here 3x1 pixel-blocks are matched).
even extends in the depth direction, we can expect disparity offsets even on the same object within the support region (an example of this are the two top rows of the support region in Figure 1). For these reasons we propose simplified shifted pixel-blocks, which do not suffer from any of these problems and introduce almost no computational overhead. These shifted pixel-blocks are matched on either side of a candidate edge pixel, as shown in Figure 2. Only edges that differ by no more than αmatch in orientation between the left and right image are considered. The actual edge pixel is not included in the pixel-block, due to its intrinsic unsuitability for intensity-based matching. Depending on the edge orientation, either left and right or top and bottom pixel-blocks are matched; this helps in disambiguating horizontal edge disparities. In either case only the minimum matching cost is taken. If an edge lies on an object border, the foreground disparity is retrieved and the consistency of the support region is preserved. Increasing the width of the support regions (e.g. from one to three or more pixels) makes the individual matches more robust; however, in combination with the adjacent edge pixels this yields essentially no additional information, since the support regions of adjacent edge pixels overlap in this case. Figure 3 shows the performance of several different support regions. For the combined support region (11x11 + 11x11 Block) the minimum cost of the three is taken.
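A minimal sketch of the shifted pixel-block idea for a near-vertical edge might look as follows. The function name, the plain-SAD cost and the row-block shape are illustrative simplifications (the paper combines the shifted blocks with the mean-subtracted truncated SAD of Section 2.1, and uses top/bottom blocks for near-horizontal edges):

```python
import numpy as np

def shifted_block_cost(left, right, x, y, d, n=11):
    """Cost for a near-vertical edge pixel at (x, y) under disparity d:
    match an n-pixel row block strictly left of the edge and one
    strictly right of it, and keep the cheaper of the two. The edge
    pixel itself is excluded from both blocks."""
    sad = lambda a, b: np.abs(a.astype(np.float64) - b.astype(np.float64)).mean()
    left_block = sad(left[y, x - n:x],
                     right[y, x - n - d:x - d])
    right_block = sad(left[y, x + 1:x + n + 1],
                      right[y, x + 1 - d:x + n + 1 - d])
    return min(left_block, right_block)  # foreground side wins on borders
```

Taking the minimum means that if one side of the edge belongs to an occluded background, the consistent foreground side still produces a low cost.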
[Figure 3 plots, one per panel, each comparing the 11x1, 11x5, 11x11, 11x11 + 11x11 Block and 11x11 Block support regions: (a) number of initial matches; (b) percentage of edge pixels in the left image that were matched; (c) number of correct initial matches; (d) percentage of correct initial matches; (e) number of correct refined matches; (f) percentage of correct refined matches.]
Fig. 3. Comparison of the influence of several different support regions on matching performance. Shifted pixel blocks (11x1, 11x5, 11x11), a simple block match (11x11 Block) and a combined variant (11x11 + 11x11 Block) are tried. Initial matches are straight WTA-matches while refined matches refer to the technique in Section 2.3.
Finally, if the resulting matching score is below the threshold tmatch, the match is considered valid. Despite being quite fast, pure winner-takes-all (WTA) matching does not yet produce convincing results, as the percentage of correct matches in Figure 3(d) shows. The size of the support region has a considerable influence on the quality of the initial matches. Figures 3(e) and 3(f) show the results of the refinement algorithm introduced in the next section. Particularly interesting is that the dependency on the support region seen in Figure 3(d) has lessened significantly, which is due to the incorporation of edge connectivity information: when disparity smoothness is enforced, adjacent edge pixels effectively form one big virtual support region along the edge. It is also visible that unshifted block matching (11×11 Block) yields the lowest total number of correct matches (see Figure 3(e)). While this seems insignificant, the missing disparities often lie on object borders, which are very interesting for robotic vision tasks. Another observation is that matching an additional unshifted pixel block does not yield much extra information, since the results of the 11×11 shifted support window and the combined 11×11 + 11×11 Block window are basically indistinguishable. The best robustness/performance trade-off is offered by the 11×5 and, in most cases, even the 11×1 window.

2.3 Confidence-Based Refinement
In this section we introduce the novel refinement algorithm, which enforces consistency and smoothness among the disparities of edge segments. The improvement over the initial WTA matches stems from the fact that many individual edge points are ambiguous, which leads to isolated and unsmooth disparities if they are matched independently. The discriminative power of a whole edge segment, in contrast, is much higher. However, since common edge detectors do not yield perfect edges that never cross object borders or produce other "glitches", it is not trivial to take full advantage of the connectivity information. In order to refine the initial disparities we first rank the reliability of the found matches, using the ratio of the best match cost m_{1st}(x) and the second best match cost m_{2nd}(x) at a location x:

C(\mathbf{x}) = \begin{cases} 3, & \text{if } m_{2nd}(\mathbf{x})/m_{1st}(\mathbf{x}) > 2 \\ 2, & \text{if } m_{2nd}(\mathbf{x})/m_{1st}(\mathbf{x}) > 1.5 \\ 1, & \text{if } m_{1st}(\mathbf{x}) < t_{match} \\ 0, & \text{if no match found} \end{cases} \qquad (4)
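The confidence assignment of Eq. (4) is straightforward to implement. The handling of a missing second-best candidate and of a zero best cost below are our own assumptions, as is the use of the paper's t_match = 12 as a default:

```python
def confidence(m_first, m_second, t_match=12.0):
    """Match confidence from the two best WTA costs (Eq. 4).
    m_first is the best (lowest) cost, m_second the runner-up;
    either may be None when no such candidate exists."""
    if m_first is None or m_first >= t_match:
        return 0  # no valid match at all
    if m_second is None or m_first == 0.0:
        return 3  # assumption: a lone or perfect candidate is unambiguous
    if m_second / m_first > 2.0:
        return 3  # runner-up is far worse: clearly unambiguous
    if m_second / m_first > 1.5:
        return 2
    return 1      # ambiguous, e.g. horizontal edges or repetitive texture
```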
If the second best match has more than double the matching score, we know that our best match is probably the right one. We reward this with the highest confidence. A confidence value of one is usually assigned to ambiguous matches like horizontal edges or repetitive patterns. If no valid match is found, the confidence is zero. Edge connectivity can be enforced by a simple consistency check: if an edge is traversed in the left image, the corresponding pixels in the right image have to be connected. This can be checked for the edge pixels x1 = (x1, y1, d1)^T and x2 = (x2, y2, d2)^T that are adjacent in the left image. The disparities are consistent if |(x1 − d1) − (x2 − d2)| ≤ 1, meaning that the distance in x-direction of the corresponding edge pixels in the right image is less than or equal to one. In the following pseudo-code listing this check is referred to by the isConsistent(...)
    function insertNeighbour(p, edge, curConf)
        if visited(p) return false;
        add p to edge;
        if NOT isConsistent(p, parent(p)):      // pixels inconsistent!
            curConf := 0;
            refinedDisp(p) := UNKNOWN;          // don't trust this disparity
        else if curConf >= minFixConf:
            refinedDisp(p) := initialDisp(p);   // confident disparity!
        else                                    // consistent, but not enough confidence yet
            curConf += confidence(p);           // build up confidence
            refinedDisp(p) := -initialDisp(p);  // save disparity with neg. sign
            if curConf >= minFixConf:
                for all previous invalid pixels:
                    change sign of negative disparities
                    linearly interpolate UNKNOWN disparities
        return true;

    function followEdge(p, parentEdge, curConf)
        edge := new Edge;
        link edge to parentEdge;
        curNeighbours := [p];
        while sizeOf(curNeighbours) > 0:
            if sizeOf(curNeighbours) > 1:       // spawn child edges
                for each neighbour n of curNeighbours:
                    childEdge := followEdge(n, edge, curConf);
                    link edge to childEdge;
                break;
            if NOT insertNeighbour(curNeighbours[0], edge, curConf):
                break;
            curNeighbours := neighbours(curNeighbours[0]);
        return edge;

    function refineDisparities()
        for each matched pixel p:               // search for confident start points
            if confidence(p) == 3:
                for each neighbour n of p:
                    if confidence(n) >= 2 AND confidence(nextNeighbour(n)) >= 2
                            AND isConsistent(p, n) AND isConsistent(n, nextNeighbour(n)):
                        e := followEdge(n, 0, minFixConf);  // good confidence!
                        if length(e) > 1:
                            add e to edges;
        return edges;

Listing 1. Pseudo-code of the confidence-based refinement algorithm.
function call. The function neighbours(p) searches the 8-connected neighbourhood of the pixel for adjacent edge pixels. It disregards the direction of its parent pixel, so we exclusively move forward along the edge. The underlying idea of the refinement algorithm is to propagate a confidence level along the edge (named curConf in Listing 1). First, groups of three adjacent and consistent high-confidence edge pixels are searched for as starting points. Then, starting with maximum confidence, the edge is traversed, checking each pixel for consistency with its predecessor. If an unmatched pixel or an inconsistency is encountered, the confidence is dropped to zero. With each consistent pixel pair the confidence value recovers until it is greater than the tuning parameter minFixConf. Then, the algorithm tries to recover the intermediate
disparities. For inconsistent or unmatched pixels, linear interpolation between the enclosing confident disparities is performed. This way it is possible to keep the total number of matches high and at the same time boost the percentage of correct matches.
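For a single unbranched edge chain, the propagation and interpolation described above can be sketched as follows. This is a simplification of Listing 1 that omits edge spawning and the sign-flip bookkeeping; the data layout (lists of (x, y, d) tuples with a parallel per-pixel confidence list) is our own:

```python
UNKNOWN = None

def is_consistent(p, q):
    """Adjacent left-image edge pixels p, q = (x, y, d): their right-image
    correspondences must also be adjacent, i.e. |(x1-d1)-(x2-d2)| <= 1."""
    return abs((p[0] - p[2]) - (q[0] - q[2])) <= 1

def refine_chain(pixels, conf, min_fix_conf=8):
    """pixels: (x, y, d) tuples along one edge; conf: per-pixel confidence.
    Returns refined disparities with distrusted stretches interpolated."""
    out = [UNKNOWN] * len(pixels)
    cur_conf = 0
    pending = []  # indices accepted but not yet backed by enough confidence
    for i, p in enumerate(pixels):
        if i > 0 and not is_consistent(pixels[i - 1], p):
            cur_conf = 0      # inconsistency: drop confidence, distrust pixel
            pending = []
            continue
        cur_conf += conf[i]   # build up confidence along the edge
        pending.append(i)
        if cur_conf >= min_fix_conf:  # enough support: commit the stretch
            for j in pending:
                out[j] = pixels[j][2]
            pending = []
    # linearly interpolate UNKNOWN disparities between committed neighbours
    known = [i for i, d in enumerate(out) if d is not None]
    for a, b in zip(known, known[1:]):
        for j in range(a + 1, b):
            t = (j - a) / (b - a)
            out[j] = (1 - t) * out[a] + t * out[b]
    return out
```

An isolated outlier disparity in an otherwise smooth chain is rejected by the consistency check and then recovered by the interpolation pass.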
3 Experimental Results
In the following, we benchmark the proposed edge matching by confidence-based refinement (EMCBR) on the Middlebury database. Figure 5 shows the disparity errors of EMCBR with an 11×5 matching window on the previously used selection of 12 image sets, whose quantitative results were shown in Figures 3(e) and 3(f). The parameterization was empirically investigated and set as follows: αmatch = π/16, tmatch = 12, minFixConf = 8. To yield suitable sparse ground truth, the Middlebury ground truth images were dilated with a 3×3 structuring element to always yield foreground disparities on object borders. Subsequently, the images were sparsified by masking with the edge locations. A comparison with probabilistic phase-based sparse stereo (PPBSS, [14]) and several popular dense methods is given in Table 1 and visually in Figure 4. The results of scanline optimization (SO), dynamic programming (DP) and graph cuts (GC) refer to [12], while semi-global matching (SemiGlob) refers to [6], ADCensus (currently ranked first in the Middlebury benchmark) to [10] and graph cuts with occlusions (GC+occl) to [7]. Thus, a diverse mix of algorithms, from scanline-based methods to complex global optimization techniques, is compared.

Table 1. Edge matching performance comparison of sparse (EMCBR and PPBSS) and dense algorithms that have been sparsified to edge loci.
Algorithm   Tsukuba          Teddy            Cones            Venus            Sawtooth
            Matches  Errors  Matches  Errors  Matches  Errors  Matches  Errors  Matches  Errors
EMCBR        8550     8.8%   10514     5.3%   16147     5.3%   11816     2.0%   14202     2.4%
PPBSS        2350    17.0%       -       -        -       -     1310     6.0%    3079     4.0%
DP          13775    14.6%   17192    11.1%   23590    12.6%   14601     6.7%   19393     6.0%
SO          13778    16.9%   17709    20.8%   24489    19.3%   14770     8.2%   19598     7.1%
SemiGlob    13765     8.6%   18001    10.6%   24868     9.0%   14930     1.5%   19560     4.9%
GC          13801     9.0%   17696    17.4%   24399    15.0%   14737     2.9%   19777     0.8%
GC+occl     13803     6.3%   17995    15.8%   24868    11.6%   14929     1.7%       -       -
ADCensus    13900    20.8%   18001     6.9%   24868     6.8%   14930     0.4%       -       -
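The ground-truth preparation described in the text (3×3 grey-value dilation so that object borders carry the foreground disparity, followed by masking to edge locations) can be sketched in NumPy as follows. The edge-replicating border padding is our own assumption:

```python
import numpy as np

def sparse_edge_ground_truth(gt, edge_mask):
    """Dilate the dense ground-truth disparity map with a 3x3 structuring
    element so that borders carry the (larger) foreground disparity, then
    keep values only at edge locations; non-edge pixels become 0."""
    padded = np.pad(gt, 1, mode='edge')
    # grey dilation: maximum over every 3x3 neighbourhood
    stack = [padded[dy:dy + gt.shape[0], dx:dx + gt.shape[1]]
             for dy in range(3) for dx in range(3)]
    dilated = np.max(stack, axis=0)
    return np.where(edge_mask, dilated, 0)
```

An edge pixel on the border between a foreground and a background region thus receives the foreground (larger) disparity, matching the evaluation convention in the text.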
The most obvious difference between the matching results of the sparse and the dense methods is the number of matches. This stems from inconsistently detected edges in the left and right images. For example the upper bound for correctly matched pixels in the Tsukuba image set without gap filling is 8529 matches. This number is calculated by taking the ground truth disparities at edge loci in the left image and checking if an edge with an edge angle difference smaller than αmatch exists at the corresponding location in the right image. Since
[Figure 4 image grid: rows Cones (450×375), Teddy (450×375) and Tsukuba (384×288); columns EMCBR, DP, GC+occl and ADCensus.]
Fig. 4. Comparison of EMCBR to sparsified popular dense algorithms. Green pixels depict disparity errors ≤ 1, red pixels are errors > 1 and white pixels correspond to unmatched pixels. The dense methods usually match almost all edge pixels.
the dense algorithms do not restrict their disparity search to edge loci, this is the main reason for the difference in match counts. However, this is irrelevant for the applications of sparse methods; it is much more important to extract consistent edge segments on sparsely textured objects. The Middlebury stereo sets can be regarded as a stress test for edge-based stereo matchers, since they are highly textured, leading to many inconsistently detected edges. Nevertheless, EMCBR performs very well in terms of error percentages, especially considering that (except for DP and SO) very sophisticated dense algorithms are compared, which take at least seconds to execute on a modern CPU. Except for some long horizontal edge segments as in Tsukuba, the algorithm efficiently and reliably finds the correct disparities while being much less of a computational burden. In comparison to PPBSS, many more features are detected, while error rates are superior on all three comparable image sets. In further research, the number of matches has been found to be critical for tasks like edge-based SLAM. Also, since PPBSS uses a set of Gabor filters with different rotations and scales for detection, one can expect it to be computationally more costly than EMCBR. The experiments were run on an Intel i7-2640M CPU (2.8 GHz) using both cores. For a maximum disparity of 64, the computation times of EMCBR per image set were between 20 ms and 45 ms with the 11×5 support region, including preprocessing and refinement. Cost aggregation and matching take about 10 ms to 15 ms, and about the same amount of time is spent on refinement.
4 Discussion and Future Work
In this paper we presented a robust, real-time capable stereo edge refinement technique and investigated how depth discontinuities at the matching location should be accounted for. We showed that good results are possible with sparse matching, producing better error rates than most dense algorithms while being computationally less demanding. The refinement algorithm enforces consistent disparities along edges and is able to reliably interpolate missing ones. The main remaining source of errors and unmatched edges is the quality of the edge detector. Often, object edges are distorted or disrupted by adjacent surfaces of similar intensity; this is especially true for highly textured objects. The situation could be improved by more costly edge detectors that take color, edge biases and scale-space into account. Unfortunately, sophisticated edge detectors would have a severe impact on the execution time. However, in further research with edge-based SLAM and real-world indoor scenes, the algorithm already showed good performance with the current edge detector.
References

1. Ayache, N., Faverjon, B.: Efficient registration of stereo images by matching graph descriptions of edge segments. IJCV 1(2), 107–131 (1987)
2. Canny, J.: A computational approach to edge detection. PAMI 8(6), 679–698 (1986)
3. Chandraker, M., Lim, J., Kriegman, D.: Moving in stereo: Efficient structure and motion using lines. In: Proc. of ICCV, pp. 1741–1748. IEEE (2009)
4. Devernay, F.: A non-maxima suppression method for edge detection with sub-pixel accuracy. Tech. rep., INRIA (1995)
5. Helmer, S., Lowe, D.: Using stereo for object recognition. In: Proc. of ICRA, pp. 3121–3127. IEEE (2010)
6. Hirschmüller, H., Innocent, P.R., Garibaldi, J.: Real-time correlation-based stereo vision with reduced border errors. IJCV 47(1), 229–246 (2002)
7. Kolmogorov, V., Zabih, R.: Computing visual correspondence with occlusions using graph cuts. In: Proc. of ICCV, pp. 508–515. IEEE Computer Society (2001)
8. Li, Z.N.: Stereo correspondence based on line matching in Hough space using dynamic programming. TSMC 24(1), 144–152 (1994)
9. Medioni, G., Nevatia, R.: Segment-based stereo matching. Computer Vision, Graphics, and Image Processing 31(1), 2–18 (1985)
10. Mei, X., Sun, X., Zhou, M., Jiao, S., Wang, H., Zhang, X.: On building an accurate stereo matching system on graphics hardware. In: ICCV Workshops, pp. 467–474. IEEE (2011)
11. Robert, L., Faugeras, O.D.: Curve-based stereo: Figural continuity and curvature. In: Proc. of CVPR, pp. 57–62. IEEE (1991)
12. Scharstein, D., Szeliski, R.: A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. IJCV 47(1), 7–42 (2002)
13. Tomono, M.: Robust 3D SLAM with a stereo camera based on an edge-point ICP algorithm. In: Proc. of ICRA, pp. 4306–4311. IEEE (2009)
14. Ulusoy, I., Halici, U., Hancock, E.: Probabilistic phase based sparse stereo. In: Proc. of ICPR, vol. 4, pp. 84–87. IEEE (2004)
Fig. 5. Middlebury benchmark disparity results of EMCBR with an 11×5 shifted pixel-block support region. Green pixels depict disparity errors ≤ 1, red pixels are errors > 1, and white pixels correspond to unmatched pixels. The test images in the left column are Teddy (450×375), Sawtooth (434×380), Barn1 (432×381), Bull (433×381), Venus (434×383) and Poster (435×383). The images in the right column are Cones (450×375), Tsukuba (384×288), Barn2 (430×381), Reindeer (447×370), Laundry (447×370) and Art (463×370).