Accuracy in Real-Time Depth Maps

John Morris
School of Electrical and Electronics Engineering, Chung-Ang University, Seoul 156-756, Korea, and Department of Electrical and Electronic Engineering, The University of Auckland

Philippe Leclercq
Department of Electrical and Computer Engineering, University of Western Australia, Nedlands, WA 6009, Australia

Abstract

If all vehicles were equipped with active sensors, distinguishing between extremely weak reflections and primary pulses from distant vehicles would present a daunting - and potentially insurmountable - problem for intelligent sensors. Passive systems, on the other hand, are much less sensitive to environmental interference. Stereo vision - the use of pairs of images taken from different viewpoints - can provide detailed three-dimensional maps of the environment. Typical cameras can provide 30 or more images per second, and each pair of images can provide an almost complete map of the environment¹. However, processing even small, low resolution images takes more than a second in software - ignoring any post-processing needed to determine the velocity of an approaching object and the optimum strategy for avoiding it. This rate is well below the frame rates obtainable with commodity cameras and may be far too slow to enable even relatively slow moving objects to avoid each other. Furthermore, the accuracy of depth measurements depends on the range of disparities (differences between the projections of the same object point on the left and right image planes). With conventional parallel camera axes, a point, $P$, at distance, $D$, from the camera optical centres has disparity:

$$d = \frac{bf}{\delta D} \qquad (1)$$

where $\delta$ is the width of a sensor pixel, $d$ is the disparity measured in pixels, $b$ is the baseline (the distance between the two optical centres) and $f$ is the camera focal length. In collision avoidance applications, factors such as the inertia and maximum speed of the vehicle and of ‘opposing’ objects will determine a closest approach permitted for any object, $D_{min}$, in order to allow collision avoidance strategies to be implemented. This effectively defines the maximum disparity,

$$d_{max} = \frac{bf}{\delta D_{min}}$$
Use of stereo vision systems for collision avoidance has been discussed extensively but only cursory attention has been paid to the accuracy of the depth maps obtained. We show that simply increasing the baseline to increase depth accuracy results in a significant drop in matching quality which may lead to failure of a collision avoidance system. Based on these observations, we propose the use of verging cameras to focus on a critical region directly in front of the moving vehicle. The camera alignment is chosen so that objects within the critical region appear with zero disparity (i.e. in the same position in left and right images) - or very small disparities. This allows one to constrain the search of disparity space to a narrow region, permitting faster searches in high resolution images which are necessary to obtain accurate measurements. Experiments with active illumination show that, unlike active ranging systems, stereo systems can actually benefit from random optical noise in the environment - or impressed active illumination, using, for example, eye-safe IR patterns.

1 Introduction

A collision avoidance system for any mobile device - from a robot to a large vehicle - requires the ability to build a three-dimensional ‘map’ of its environment. Traditionally this has been accomplished by active sensors which send a pulse - either electromagnetic or sound - and detect the reflected return. Such active systems work well in low density environments, where the number of moving vehicles - and hence the probability that active sensors will interfere - is small, so that simple techniques can prevent a sensor from being confused by sensing pulses from other vehicles. For example, radar systems which detect impending aircraft collisions appear to be effective, as do ultrasonic systems in small groups of robots. However, when the density of autonomous - and potentially colliding - objects becomes large, active systems create ‘noisy’ environments. For example, heavy traffic brings large numbers of vehicles with a wide range of speeds and directions into close proximity.

1 Whilst complete maps are, in general, not attainable because some parts of the environment are not visible to both cameras simultaneously, this does not present a significant problem for the collision avoidance application.


Note that we measure $d$ and $d_{max}$ disparities in pixels (using $\delta$, the dimension of a single pixel, to convert to physical distances). This provides direct correlation with the complexity of matching algorithms, which typically work in pixels and evaluate all possible integer values of the disparity.² Thus $d_{max}$ in pixels is usually directly related to the speed and space complexity of both software and hardware[1] algorithms. (The physical extent of the image sensor also requires that $d_{max} \le w_{max}$ for a sensor with $w_{max}$ pixels, but this limit is rarely tested because when it is reached there is only one common pixel in each image.) The maximum disparity, $d_{max}$, is increased by increasing camera resolution (decreasing $\delta$) or increasing $b$ - for any given camera configuration, i.e. fixed $f$. Increasing $f$ without increasing camera resolution effectively narrows the field of view and thus is not generally an option in environments with mutually moving objects.
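For illustration, equation (1) and the resulting search range can be evaluated with a minimal Python sketch; every numerical value below (baseline, focal length, pixel width, $D_{min}$) is an assumption chosen for the example, not a parameter of our system.

# Sketch of equation (1): d = b*f / (delta * D), disparity in pixels.
def disparity_pixels(b, f, delta, D):
    """Disparity (pixels) of a point at distance D, for baseline b, focal
    length f and pixel width delta (all in the same length unit)."""
    return (b * f) / (delta * D)

b = 0.10        # baseline: 10 cm (assumed)
f = 0.008       # focal length: 8 mm (assumed)
delta = 6e-6    # pixel width: 6 um (assumed)
D_min = 2.0     # closest permitted approach: 2 m (assumed)

# d_max fixes the range of shifts a matching algorithm must search:
d_max = disparity_pixels(b, f, delta, D_min)
print(f"d_max = {d_max:.1f} pixels")    # ~66.7 pixels for these values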

Figure 1. Correlation-based matching: the window centred on pixel $P$ in the right image is moved through the disparity range until the best match (correlation) is found with a window centred at $P$ in the left image. Aligning the cameras to meet the epipolar constraint ensures that $P$ must lie on the same scan line in each image.

1.1 Depth Accuracy

Assuming that the matching algorithm does not employ sub-pixel matching[2], so that the depth map generated produces integer values of the disparities, the accuracy for depth corresponding to a $d$-pixel disparity is

$$\Delta D = \frac{bf}{\delta}\left(\frac{1}{d} - \frac{1}{d+1}\right) = \frac{bf}{\delta\, d(d+1)}$$

Thus, depth accuracy may be increased by increasing $b$ or increasing camera resolution (decreasing $\delta$ to increase the measured $d$).
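The depth step between adjacent integer disparities can be tabulated directly; a short sketch, using the same assumed parameter values as before:

# Depth step between the distances reported for disparities d and d+1:
#   Delta_D = b*f / (delta * d * (d+1))
def depth_step(b, f, delta, d):
    return (b * f) / (delta * d * (d + 1))

b, f, delta = 0.10, 0.008, 6e-6    # assumed values, as before
for d in (5, 20, 60):
    print(f"d = {d:2d} px -> depth step = {depth_step(b, f, delta, d):.3f} m")
# The step shrinks from ~4.4 m at d = 5 to ~36 mm at d = 60: nearby
# objects (large d) are located far more accurately than distant ones.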

1.2 Matching

Matching has been extensively studied: dozens of algorithms have been proposed and compared[3, 4]. In the baseline experiments reported here, we focussed on two algorithms which are fast, have good matching performance and are good candidates for real-time hardware implementation - a necessary criterion, because software implementations operating on high resolution (Mpixel) images are too slow to produce real-time disparity maps. Matching is simplified if the system is aligned so as to meet the epipolar constraint, implying that matching pixels must be found in the same scan line in both images. We always assume either accurate camera alignment or post-processing (rectification) of images to meet the epipolar constraint.

Figure 2. Percentage of Good Matches vs Baseline in pixels, for P2P($\kappa_{occ} = 5$, $\kappa_r = 40$) and SAD($w = 4$).


SAD

‘Sum of Absolute Differences’ is an area-based correlation algorithm which attempts to find the best match between a window of pixels in one image and a window in the other. The matching process is illustrated in Figure 1, which shows how a right image window is shifted by an amount known as the disparity until the best match is found with the reference window in the left image.

2 Sub-pixel matching algorithms have been proposed; they effectively introduce a smaller pixel size, $\delta/n$, and do not affect the discussion here since $n$ is usually a small integer, e.g. 2.


Figure 3. Standard Deviation of Matching Errors vs Baseline in pixels, for P2P($\kappa_{occ} = 5$, $\kappa_r = 40$) and SAD($w = 4$).

In the SAD algorithm, the criterion for the best match is minimization of the sum of the absolute differences of corresponding pixels in a window, $W$:

$$SAD(x, y, d) = \sum_{(i,j) \in W} \left| I_L(x+i+d,\, y+j) - I_R(x+i,\, y+j) \right|$$

$SAD(x, y, d)$ is evaluated for all possible values of the disparity, $d$, and the minimum chosen. For parallel camera axes, $d$ ranges from 0, for objects at infinity, to $d_{max}$, corresponding to the closest possible approach to the camera. With verging axes, however, $d$ may take negative and positive values.

Pixel-to-Pixel

The Pixel-to-Pixel (P2P) algorithm, proposed by Birchfield and Tomasi[5], is a dynamic programming algorithm which attempts to find a best path through disparity space using a cost function based on a ‘reward’ for a sequence of matches and a ‘penalty’ for an occlusion. It generally produces good matches[3, 6] and is currently being evaluated for hardware implementation in our laboratory.
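As a concrete software illustration of the SAD search defined above - our sketch, assuming rectified grayscale inputs, not the hardware implementation under evaluation - a brute-force NumPy version follows.

import numpy as np
from scipy.ndimage import uniform_filter

def sad_disparity(left, right, d_max, w=4):
    # For each pixel, choose the shift d in [0, d_max] minimising the SAD
    # over a (2w+1) x (2w+1) window; rectification guarantees candidate
    # matches lie on the same scan line, so only horizontal shifts occur.
    left = left.astype(np.float32)
    right = right.astype(np.float32)
    best = np.full(left.shape, np.inf, dtype=np.float32)
    disp = np.zeros(left.shape, dtype=np.int32)
    for d in range(d_max + 1):
        shifted = np.roll(right, d, axis=1)    # right-image pixel x-d -> x
        # uniform_filter returns the window mean, proportional to the SAD
        # sum, so the argmin over d is unchanged.
        cost = uniform_filter(np.abs(left - shifted), size=2 * w + 1)
        better = cost < best
        disp[better] = d
        best[better] = cost[better]
    disp[:, :d_max] = 0    # discard columns corrupted by roll's wrap-around
    return disp

The explicit loop over $d$ shows why $d_{max}$ dominates the time and space complexity noted earlier, and why constraining the disparity search range pays off directly.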

2 Long Baselines and Matching Quality

The MRT Stereo system[7] produces accurate ray-traced images from a model description. The geometry is known, so a precise ground truth is also generated which may be used to measure matching performance accurately. We generated a series of image pairs for the ‘Corridor’ scene[7, 4] with various baseline values. We then ran the two best performing matching algorithms on these images and obtained the results shown in Figure 2 and Figure 3. These results show a clear degradation in matching quality (lower percentages of good matches and larger standard deviations) as the baseline is increased. Several factors contribute to this:

1. A purely statistical one: increasing the number of possible disparity values increases the probability that an incorrect match will be made - particularly in regions of low texture or where there are too few features on which matching can work.

2. Perspective: matching usually assumes that all objects present plane surfaces, parallel to the line joining the camera optical centres, which subtend the same number of pixels on both image planes. As the angle between the two viewpoints increases, the apparent widths of angled surfaces in the left and right images start to differ significantly.

3. Scattering: simple matching models depend on the Lambertian - or ideal scattering - assumption. Whilst non-reflective real surfaces can be expected to differ little in their reflectivity over small angles, the wider angles implied by longer baselines will cause greater deviations from ideal scattering behaviour.

4. Occlusions: the number of pixels which become monocularly visible will increase as the baseline increases. Half occlusions present difficulties for all matching algorithms, even those, such as Pixel-to-Pixel[5], which allow for them.

The combination of these factors results in the observed steep drops in matching quality as the baseline increases. The increase in the standard deviation seen in Figure 3 also shows that increasing the baseline increases the distance error through poor matching. Thus, whilst increasing the baseline should increase depth accuracy at any point in the scene, the decrease in matching quality will reduce the number of correct matches available to build a map of the environment, and the ‘correct’ matches will have larger errors associated with them, resulting in a less precise environment map. Thus increasing the baseline will certainly present problems for systems which rely on fast automated matching.

3 Verging Optics

We have just shown how conventional arrangements with parallel camera axes provide highly degraded matching if the baseline is increased to provide better depth accuracy. Increasing the camera resolution has a similar effect, because the disparity range in pixels increases. However, it is known that human eyes focus on a point of interest in the scene and thus use ‘verging optics’ - a system in which the camera axes are not parallel but set at a vergence angle, $\theta$. This complicates the reconstruction process in which actual depths are recovered from disparity maps. In parallel axis systems, objects appearing at the same disparity in images lie on planes parallel to the image planes, and recovering depth from disparity values is simply a matter of calculating a reciprocal, see equation (1). With verging axes, objects with the same disparity lie on Vieth-Müller circles circumscribed about the chord connecting the two camera optical centres[8]. Noting that depth information can only be recovered from regions in the common field of view (CFoV) of both cameras, we also observe that the parallel camera axis alignment ‘wastes’ a large portion of both image sensors, because regions outside the CFoV are imaged onto them, see Figure 5. In contrast, verging axis systems can be configured so that objects in a region of interest occupy a CFoV which is imaged onto the full extent of both sensors.

3.1 Depth accuracy

Depth accuracy along the line bisecting the chord connecting the two optical centres (and passing through the centre of the Vieth-Müller circle) can be derived from the relations for the positions, $x_L$ and $x_R$, at which an arbitrary point $(x, z)$ in the scene is projected onto the left and right image

Figure 4. Depth resolution ($\times 10^3$) vs distance for verging optics without focal length adjustment, for parallel axes and vergence angles of 5, 10 and 15 degrees. If the unit is metres, the parameter values are reasonable for commercially available digital cameras.

planes, respectively:

$$x_L = f\,\frac{(x + b/2)\cos\theta - z\sin\theta}{(x + b/2)\sin\theta + z\cos\theta} \qquad (2)$$

$$x_R = f\,\frac{(x - b/2)\cos\theta + z\sin\theta}{-(x - b/2)\sin\theta + z\cos\theta} \qquad (3)$$

(For $\theta = 0$ these reduce to $x_L = f(x + b/2)/z$ and $x_R = f(x - b/2)/z$, so that the disparity $x_L - x_R = fb/z$, consistent with equation (1).)

From these, we derive the displacement in the $z$ direction necessary to cause the disparity (the difference in the projected points) to shift by one pixel, i.e. the smallest measurable distance change. The depth accuracy along the principal axis is plotted for several values of the vergence angle $\theta$ (including $\theta = 0$, the parallel axis case) in Figure 4. Although an increase in accuracy (represented by smaller depth resolution) is apparent in Figure 4, it is hardly dramatic, and possibly not worth the additional computational complexity of working with the Vieth-Müller circles instead of straight lines of constant disparity.
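The derivation can be checked numerically: the sketch below projects a midline point through equations (2) and (3) and estimates the $z$ displacement producing a one-pixel disparity shift; the camera parameter values are again assumptions for illustration.

import math

def projections(x, z, b, f, theta):
    # Image-plane positions x_L, x_R of scene point (x, z) for pinhole
    # cameras at (-b/2, 0) and (b/2, 0), each verged inward by theta.
    xl = f * ((x + b / 2) * math.cos(theta) - z * math.sin(theta)) \
           / ((x + b / 2) * math.sin(theta) + z * math.cos(theta))
    xr = f * ((x - b / 2) * math.cos(theta) + z * math.sin(theta)) \
           / (-(x - b / 2) * math.sin(theta) + z * math.cos(theta))
    return xl, xr

def disparity(z, b, f, delta, theta):
    xl, xr = projections(0.0, z, b, f, theta)    # point on the midline
    return (xl - xr) / delta                     # disparity in pixels

def depth_resolution(z, b, f, delta, theta, dz=1e-6):
    # Smallest z change giving a one-pixel disparity shift, estimated from
    # the numerical derivative of disparity with respect to z.
    slope = (disparity(z + dz, b, f, delta, theta)
             - disparity(z, b, f, delta, theta)) / dz
    return 1.0 / abs(slope)

b, f, delta = 0.10, 0.008, 6e-6                  # assumed parameters
for deg in (0, 5, 10, 15):
    r = depth_resolution(2.0, b, f, delta, math.radians(deg))
    print(f"vergence {deg:2d} deg: resolution {1000 * r:.1f} mm at 2 m")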


Figure 5. Change in the common field of view as the axes are inclined: note that $D_{min}$ becomes smaller, but the CFoV becomes relatively larger. $O_L$ and $O_R$ are the camera optical centres.

However, if we consider the collision avoidance problem, we have already noted that there will generally be a $D_{min}$, or closest approach distance. The primary purpose of a


Figure 6. Depth resolution for verging optics with focal length adjustment. Note that the curve for the largest vergence angle is truncated by the smaller depth extent of the CFoV, due to the much longer focal length used to maintain a common $D_{min}$ for all four curves.

collision avoidance system is to prevent objects breaking through this limit: if they do, the stereo algorithm has failed and other damage control systems need to be invoked! If we simply take identically configured cameras and tilt them inwards (increasing $\theta$), then the closest point in the CFoV approaches closer than $D_{min}$, see Figure 5. This effectively wastes part of the image planes (and contributes to the small differences observed in Figure 4). For verging axes, with half field of view $\phi$, the inner edges of the two fields of view cross on the midline at

$$D = \frac{b}{2\tan(\theta + \phi)}$$

so the vergence angle can be chosen so that this nearest point of the CFoV coincides with $D_{min}$.
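A small sketch of this geometry follows; it is our reconstruction (the closing relation above is our reading of an equation truncated in the source), and the numerical values are assumed vehicle-scale parameters.

import math

def nearest_cfov_point(b, theta, phi):
    # Distance at which the inner field-of-view edges cross the midline:
    # D = b / (2 * tan(theta + phi)).
    return b / (2.0 * math.tan(theta + phi))

def vergence_for_dmin(b, phi, d_min):
    # Vergence angle placing the nearest CFoV point exactly at d_min, so
    # no sensor area images the region closer than the permitted approach.
    return math.atan(b / (2.0 * d_min)) - phi

b, phi, d_min = 0.5, math.radians(5), 2.0   # assumed values (metres, 5 deg)
theta = vergence_for_dmin(b, phi, d_min)
print(f"vergence = {math.degrees(theta):.2f} deg")    # ~2.13 deg
print(f"nearest CFoV point = {nearest_cfov_point(b, theta, phi):.2f} m")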