A CNN Solution for Depth Estimation from Binocular Stereo Imagery

A. G. Radványi*, T. Kozek**, L. O. Chua**

* Analogical and Neural Computing Laboratory, Computer and Automation Institute of the Hungarian Academy of Sciences, P.O.B. 63, Budapest, H-1502, Hungary, FAX: (+36) 1 2095264, E-mail: [email protected]

** Electronics Research Laboratory, University of California at Berkeley, Berkeley, CA 94720, USA, FAX: (+510) 643 8869, E-mail: [email protected], [email protected]
ABSTRACT: Novel results and experiments are presented on the application of cellular neural networks to binocular stereo vision. A CNN universal machine algorithm is described for depth estimation as part of a stereo-vision-based guidance system for autonomous vehicles. Vertical edges, being the features most amenable to revealing stereo correspondence, are extracted first; their distance from the observer in 3-D space is then established through a stereo matching scheme. The performance of the algorithm is demonstrated on real-life highway imagery, and it is shown that very low latency real-time operation is attainable via the CNN-UM.
1. Introduction

Inferring depth information from a pair of images taken at the same time of the same scene is a classic problem of machine vision. It is normally solved by systematically comparing segments of the left and right stereo images (in the case of a binocular system) in order to find matching segments. The spatial displacement of the best match of a particular image segment gives an estimate of the depth of the matching features and, in turn, of the object(s) they belong to. A great number of stereo correlation algorithms with well-proven capabilities have been proposed (e.g. [1]), and many innovative methods have also been suggested for efficient implementation (e.g. [2,3,11,12]). Establishing stereo correspondence, however, remains one of the most computationally intensive imaging tasks for machine vision systems.

Here we examine how to utilize the massive parallelism of cellular neural networks [4,5] in solving a stereo vision problem in a relatively well-defined environment, with the intention of later extending the results to more general frameworks. In particular, we are investigating the application of high-speed CNN computing [6,7,8] to the real-time environment of autonomous vehicle control in highway traffic. Optical guidance systems are the direction systems of choice for autonomous vehicles as they do not require costly road improvements. Given a compact and reliable realization, these systems can be used first to improve driving safety as early alarm systems and later to create automated highways that increase the road capacity of congested highways. Our solution, based on the CNN universal machine (CNN-UM) architecture, is designed to fit into the particular framework demonstrated by Taylor et al. [9] and is aimed at providing a compact, low-cost, low-power, real-time implementation of the sensor (2-D) to feature (3-D) mapping, the most computationally intensive processing step.
Subsequent processing, involving the grouping of features into objects, object tracking, path planning and vehicle control, is easily performed in real time by conventional architectures. In the following, we present a CNN-UM algorithm for extracting vertical edges and determining their 3-D position by stereo correlation. Simulation results on actual traffic scenes are presented.
2. CNN-UM Algorithm for Depth Detection

The inputs to the CNN-UM algorithm are the left and right stereo images, as shown in Figure 3. The task is to extract features suitable for stereo matching and to determine their depth, i.e. their distance from the imaging plane. These features are later combined by subsequent processing into objects, which in turn are tracked to supply information for the path planning and vehicle control mechanisms.
2.1 Preprocessing

The scheme of the first stage of CNN processing is shown in Figure 1. To provide invariance under various imaging conditions and to eliminate disparities in camera settings, information belonging to low spatial frequencies is removed from the input images. As the two cameras used for stereo imaging are located next to each other horizontally, depth information can be obtained from features containing sharp contrast in the vertical direction. A vertical edge detector template is therefore employed, followed by the removal of single-pixel noise, which tends to mislead the later matching algorithm. The templates used in the algorithm are as follows:

Diffusion:

          0.5    1    0.5
    A =    1    −4     1                                                (1)
          0.5    1    0.5

Vertical edge detector:

           0    −0.25   0.5   −0.25    0
    B =  −1.5    −3      9     −3    −1.5  ,    Z = −3                  (2)
           0    −0.25   0.5   −0.25    0

Single-pixel noise extractor:

          0  0  0          −0.1  −0.1  −0.1
    A =   0  1  0 ,   B =  −0.1    0   −0.1 ,   Z = −0.8                (3)
          0  0  0          −0.1  −0.1  −0.1

Figure 1: Flow diagram of the CNN-UM algorithm generating the binary edge maps from the left and right stereo images (input image, diffusion, image difference, vertical edge detection, single-pixel noise extraction, logical XOR, binary edge map)

2.2 Stereo Correlation
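The preprocessing chain (smoothing, high-pass filtering, vertical edge detection, isolated-pixel removal) can be sketched in conventional software. This is a plain convolution approximation of the templates' steady-state effect with hypothetical kernels, not a simulation of the CNN dynamics or of templates (1)-(3) themselves:

```python
import numpy as np

def conv2(img, k):
    """'Same'-size 2-D correlation with zero padding (plain helper, not a CNN)."""
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    p = np.pad(img, ((ph, ph), (pw, pw)))
    out = np.zeros_like(img, dtype=float)
    for i in range(kh):
        for j in range(kw):
            out += k[i, j] * p[i:i + img.shape[0], j:j + img.shape[1]]
    return out

# Hypothetical stand-in kernels (NOT the paper's templates):
SMOOTH = np.ones((3, 3)) / 9.0          # stands in for the diffusion step
V_EDGE = np.array([[-1.0, 2.0, -1.0]])  # horizontal 2nd difference -> vertical edges

def edge_map(img, thresh=0.5):
    """Binary vertical-edge map: high-pass, edge detection, noise removal."""
    highpass = img - conv2(img, SMOOTH)               # drop low spatial frequencies
    edges = np.abs(conv2(highpass, V_EDGE)) > thresh  # strong vertical contrast
    # single-pixel noise removal: keep a pixel only if an 8-neighbour is also set
    nbrs = conv2(edges.astype(float), np.array([[1, 1, 1], [1, 0, 1], [1, 1, 1]]))
    return edges & (nbrs > 0)
```

On a synthetic vertical step edge, `edge_map` marks the two columns straddling the step while leaving uniform regions empty.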
The stereo correlation algorithm proceeds by executing the following steps, also shown in Figure 2:

1) Through the fattened binary edge masks, in a suitable neighbourhood of the edges, we produce a series of gray-scale correlation images, each one corresponding to a certain depth. Shifting the right stereo image to the left by one pixel at a time, we measure the similarity of the left and right images at every point by calculating their pixel-wise absolute difference. By properly shifting the gray-level of the input edge masks, the procedure generates several local maxima as we proceed towards larger disparities/depths, gradually shifting the two input images with respect to each other.

2) Taking camera specifications into account, we threshold the correlation images to produce corresponding-point candidates that likely belong to the same 3-D feature. The neighbourhood of each local maximum of the similarity measure along the depth axis is converted into a series of first growing, then shrinking horizontal mask segments, the longest of which marks the possible depth being searched for.

3) Supposing disjoint mask segments (a supposition that is not always true), an overlay of all masks results in an aggregate mask showing the longest mask segments, each obtained from the correlation step belonging to its corresponding depth.

4) To find the locally longest mask segments, we repeat the depth cycle of steps 1 and 2 and compare the resulting mask segments to the aggregate mask of step 3. In the case of disjoint mask segments, the segments matching the aggregate mask exactly are identified as belonging to a certain depth map and are output as the result.

The workings of the stereo correlation algorithm can be further explained through an example. For this we employ the synthetically generated stereo image pair shown in Figure 3a. The algorithm starts out from the extracted edges in the left and right images (Figure 3b).
To generate the first piece of the correlation image sequence, the left and right input images are added to the fattened edge masks, and a proper offset is applied to produce the image pair to be compared (Figure 3c). These images contain the original inputs “showing through” the edge masks, as illustrated by the magnified detail in Figure 3d.
Through a series of horizontal shifts, unmasked points of equal gray-level are identified and collected. At each step, the two images are subtracted from each other, and the resulting pixels with zero gray-level should reveal the location of matching points. Given, however, that the (black) pixels corresponding to the masks also produce zero differences, this method would create extraneous areas of zero gray-level, which would necessitate an additional masking step. To avoid this, the gray-level of one of the masked images is offset by a value that does not appear in the other image, so that pixels with zero gray-level mark exclusively the matching-point candidates. Output from this process is shown in Figure 3e for 0, 7, 14, and 21 pixel disparities. (Given the parameters of the stereo imaging setup, there are 22 of these depth ranges, of which only four are shown here.) Matching segments appear as black patches on the dark-gray background. By simple thresholding at zero, these areas of corresponding-point candidates can be extracted to form a series of masks, as shown in Figure 3f for the same disparities and, equivalently, depth values. The black horizontal segments in these images mark points in their spatial vicinity. To find where exactly they lie in space, the point of maximal correspondence has to be determined. As shown in Figure 3g, this point is identified by the longest horizontal segment of the diamond-shaped black region. Now the task of depth estimation is reduced to finding these longest segments. (It has to be noted that this is not a fool-proof procedure, as overlapping diamonds can introduce false matches.) Locating the maximal matching segments is done in two steps.

Figure 2: Flow diagram of the stereo correlation algorithm (left and right edge masks, 1st correlation, aggregate mask of possible pairs of corresponding points, 2nd correlation, fitting horizontal segments, resulting depth masks)
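The gray-level offset trick described above can be sketched as follows (a hypothetical helper; a mask value of True means the pixel lies in an edge neighbourhood, and images are assumed to lie in [0,1], so an offset of 2 can never occur in the other image):

```python
import numpy as np

def masked_difference(left, right_shifted, mask_l, mask_r, offset=2.0):
    """Absolute difference where zeros mark only true gray-level matches.
    Masked-out pixels of one image receive an out-of-range offset so that mask
    pixels can never accidentally produce a zero difference."""
    a = np.where(mask_l, left, offset)          # masked-out -> out-of-range value
    b = np.where(mask_r, right_shifted, 0.0)    # masked-out -> 0, never equals `a`
    return np.abs(a - b)
```

With this offset, a zero in the result can only arise where both pixels are unmasked and have equal gray-levels, so no additional masking step is needed.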
First, all the entries in all disparity/depth planes are aggregated into a single mask through a cycle of local logical OR operations. This aggregate mask (Figure 3h) is then used in a subsequent round of matching to localize their spatial/depth position. In order to do that, the procedure for generating the individual corresponding-point masks is repeated but this time an additional operation is executed at every disparity/depth level. This operation erases all point candidates that do not exactly match the aggregate mask. The CNN template for this fitting operation is the following:
    A = [1  2  1] ,   B = [−1  −1  −1] ,   Z = 0                        (4)
Masks produced by this final round of matching represent the output of the stereo correlation algorithm (Figure 3i). Each of the final masks mark features located at increasing depth measured from the imaging plane. The exact depth range of the extracted features can be readily computed given the geometry and optical parameters of the stereo imaging system.
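The two-step selection, OR-aggregation of all disparity masks followed by keeping only the candidate runs that cover an aggregate run completely, can be illustrated with a small software stand-in (hypothetical helper functions, not the template-based CNN implementation of (4)):

```python
import numpy as np

def runs(row):
    """Start/end indices (half-open) of consecutive True runs in a 1-D bool array."""
    idx = np.flatnonzero(np.diff(np.r_[0, row.astype(int), 0]))
    return list(zip(idx[::2], idx[1::2]))

def depth_masks(cand):
    """cand: (n_disp, h, w) boolean candidate masks, one plane per disparity.
    Aggregate all planes with a logical OR, then keep at each disparity only
    those candidate segments that exactly match a segment of the aggregate
    mask, i.e. the locally longest matching segments."""
    agg = cand.any(axis=0)                    # overlay of all disparity masks
    out = np.zeros_like(cand)
    for d in range(cand.shape[0]):
        for r in range(cand.shape[1]):
            for s, e in runs(agg[r]):
                if cand[d, r, s:e].all():     # candidate covers the whole run
                    out[d, r, s:e] = True
    return out
```

A disparity plane whose segment only partially covers an aggregate run is rejected, mirroring the erasure of non-fitting point candidates in the second correlation cycle.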
3. Simulation Results

We have tested the proposed algorithm on stereo image sequences obtained from test drives on California highways. The images were captured at 20 frames/s with a resolution of 305 by 240 pixels. Figures 4, 5, 6, 7 and 8 below show the input images, three intermediate stages, and the output of the algorithm, respectively. Moving output from the proposed stereo correlation algorithm reveals robust registration in space for real-life imagery. At the same time, however, there is a noticeable amount of variance in determining the depth of a particular detail, and misclassifications are also observed.

Inaccuracy in the output is partly caused by temporal misalignment of some of the stereo image pairs: in the stereo image sequences available to the authors, corresponding left and right images are sometimes taken at slightly different instants, causing inaccuracy in determining the displacement of corresponding features and their depths. Further errors are caused by image noise, which can mislead the matching algorithm. In the vehicle guidance framework [9] for which the proposed algorithm was designed, however, the observed amount of depth error is tolerable and can be eliminated by the subsequent feature grouping process, which takes into account the temporal history of the spatial position of each feature. Such functions can be implemented on the CNN universal machine itself, but these options are not considered here.
Figure 3: Stages of the stereo correlation algorithm (panels a-i)
4. Implementation Issues

Based on performance data obtained from measurements of the first CNN universal chip prototypes [6,7], a reliable estimate can be given for the execution speed of the above stereo correlation algorithm when realized on a CNN universal chip. As much of the processing is done along the horizontal dimension, a suitable CNN system can be constructed from a long and narrow CNN processor array and suitably sized analog buffers to temporarily store the incoming video signals. Using an input image resolution of 300x240 pixels as a basis for calculation and a CNN chip of 300x16 processors, input images can be processed in 16 segments (stripes) with a single-pixel overlap between segments. The described algorithm requires three local analog and three local logic memories. These parameters are within the range of feasibility using current CMOS technologies. Given that the stereo image pairs require 21 horizontal shifts, and assuming 30 µs loading time (300 cells loaded sequentially at 10 MSa/s), 40 ns binary read-out from an LLM cell (much like a static RAM), a 100 ns time constant, and typical running times of 2 µs for analog and 1 µs for logic operations, the algorithm can execute in 60+212+252=524 µs per stripe. Processing of an entire frame then takes 8.4 ms, which is also the data latency introduced by the system. This allows for a processing frame rate of 119.3 Hz, two orders of magnitude better than a comparable single-chip digital solution. It is worth noting that, by adding more on-chip memory to the CNN processor, it is possible to reduce the execution time of the algorithm by performing the two correlation cycles in the same loop. The high speed and low latency of the implementation make the CNN solution highly desirable, as low data latency translates into improved stability of the control loop incorporating the image sensor and the stereo correlation stage.

Figure 4: Left and right input images

Figure 6: Left and right edge masks

Figure 7: Aggregate mask
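The timing budget quoted in Section 4 can be checked by simple arithmetic. The three-term per-stripe breakdown is taken from the text; its interpretation as loading plus the two correlation cycles is an assumption:

```python
# Figures from the text; the labelling of the three terms is an assumption.
stripe_us = 60 + 212 + 252          # per-stripe execution time in microseconds
frame_ms = stripe_us * 16 / 1000.0  # 16 stripes cover one 300x240 frame
rate_hz = 1000.0 / frame_ms         # attainable processing frame rate

print(stripe_us, round(frame_ms, 1), round(rate_hz, 1))  # → 524 8.4 119.3
```

This reproduces the 524 µs per stripe, 8.4 ms per frame, and 119.3 Hz frame rate stated in the text.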
5. Conclusions We have presented a CNN-UM algorithm for extracting vertical edges and determining their 3-D position through a stereo correlation algorithm. Simulation results on artificial images and actual traffic scenes show that the proposed method is a viable solution for realizing stereo matching in a real-life environment. By implementing the proposed algorithm on a CNN universal chip, stereo correlation with very low data latency is achievable. This property leads to significant improvement in system performance and control stability. Moreover, the small size and low power consumption of a CNN-based design may pave the way for the more widespread use of optical vehicle guidance systems.
6. Acknowledgement

Support for this work from ARO MURI Grant No. DAAH04-96-1-0341 is gratefully acknowledged.
7. References

[1] A. Cozzi, B. Crespi, F. Valentinotti, F. Wörgötter: “Performance of phase-based algorithms for disparity estimation”, Machine Vision and Applications, Vol. 9, pp. 334-340, 1997.
[2] H. R. Arabnia: “Distributed stereo-correlation algorithm”, Computer Communications, Vol. 19, pp. 707-711, 1996.
[3] M. Pilu: “A direct method for stereo correspondence based on singular value decomposition”, Proc. of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 261-266, San Juan, Puerto Rico, June 1997.
[4] L. O. Chua, T. Roska: “The CNN paradigm”, IEEE Trans. on Circuits and Systems-I, Vol. 40, pp. 147-156, March 1993.
[5] T. Roska, L. O. Chua: “The CNN Universal Machine: An analogic array computer”, IEEE Trans. on Circuits and Systems-I, Vol. 40, pp. 163-173, March 1993.
[6] J. M. Cruz, L. O. Chua, T. Roska: “A fast, complex and efficient test implementation of the CNN Universal Machine”, Proc. of IEEE Int. Workshop CNNA’94, pp. 61-66, Rome, 1994.
[7] R. Domínguez-Castro, S. Espejo, A. Rodríguez-Vázquez, R. Carmona: “A CNN Universal Chip in CMOS Technology”, Proc. of IEEE Int. Workshop CNNA’94, pp. 91-96, Rome, 1994.
[8] L. O. Chua, T. Roska, T. Kozek, Á. Zarándy: “CNN Universal Chips Crank up the Computing Power”, IEEE Circuits and Devices Magazine, pp. 18-28, July 1996.
[9] C. J. Taylor, J. Malik, J. Weber: “A real-time approach to stereopsis and lane-finding”, Proc. of the 1996 IEEE Intelligent Vehicles Symposium, pp. 207-212, Tokyo, Japan, 1996.
[10] T. Kozek, K. R. Crounse, T. Roska, L. O. Chua: “Multi-scale image processing on the CNN Universal Machine”, Proc. of IEEE Int. Workshop CNNA’96, pp. 69-74, Seville, June 1996.
[11] A. G. Radványi: “Solution of stereo correspondence in real scene: An analogic CNN algorithm”, Proc. of IEEE Int. Workshop CNNA’94, Rome, 1994.
[12] A. G. Radványi: “Spatial depth extraction in analogic CNN framework”, Int. Journal on Circuit Theory and Applications, Vol. 14, No. 1, pp. 69-92, 1996.
Figure 8: Output of stereo matching for a single image pair. Black pixels in the output binary masks identify features extracted at different depths. Each mask image represents a slice of the 3-D space in front of the cameras: features appearing in the first frames are far away while ones in the last frames are immediately in front of the vehicle. The distance range attributed to each frame is determined by the physical parameters of the stereo imaging system.