AN ALTERNATIVE TO SEQUENTIAL ARCHITECTURES TO IMPROVE THE PROCESSING TIME OF PASSIVE STEREOVISION ALGORITHMS Abdelelah Naoulou, Jean-Louis Boizard, Jean Yves Fourniols, Michel Devy Laboratoire d’Analyse et d’Architecture des Systèmes du CNRS, 7 Avenue du colonel Roche, 31077 Toulouse cedex 04, France email:
[email protected],
[email protected],
[email protected],
[email protected] ABSTRACT This paper describes an architecture dedicated to the realtime processing of census correlation in the context of the realization of passive stereovision sensors. Although DSP circuits have dramatically increased their performances in terms of frequency (about 600 MHz today), DSP cores (several Multipliers Accumulators) and pipelines (Super Harvard Architectures for example), FPGA circuits remain the best way to design massive parallel architectures when ultra fast algorithms computation are needed like it is the case in real time vision systems for collision avoidance.
1. INTRODUCTION

Nowadays, due to the constant evolution of FPGAs and CMOS-APS image sensors, passive stereovision is one of the best ways to achieve three-dimensional mapping without requiring costly components. We participate in the project PICAS$O, which aims to design an integrated platform adapted to multi-sensory fusion. The final PICAS$O contribution is to design and develop the prototype of an integrated camera, with a sensor board (two CMOS matrices to acquire 3D data using stereo and one micro-bolometer matrix to acquire FIR images in the 8-12 µm wavelength band), several FPGA boards (image processing, stereo, 3D/FIR fusion ...) and a USB2 interface to connect this device to a host computer. This paper presents the results of the second stage of the PICAS$O project, focusing on the integration of a pixel-based stereovision algorithm on a dedicated hardware architecture for obstacle avoidance applications. Obstacle detection from stereovision data has been proposed by numerous authors; the more classical approach exploits the stereo-correspondence between pixels or features extracted from the left and right images [1], [2], [5], [6]; for every match, a 3D point can be reconstructed and obstacles are detected from the 3D image. Section 2 will very briefly recall the basics of stereo reconstruction and will focus on the specificity of our algorithm, especially the Census Transform. In section 3, we present an architecture designed to perform the stereovision algorithm using the Census Transform at a 3D image rate of 130 frames/sec, and we compare its performances with those of a conventional DSP (ADSP-21161N).

2. THE PASSIVE STEREOVISION

The passive stereovision principle is based on the acquisition, with two cameras, of two images (left and right) of the same scene from two different viewpoints, as shown in figure 1. A particular point P(X,Y,Z) of the scene is then observed in the left and right images at different pixel coordinates according to its 3D position with respect to the cameras.
© 1-4244-0312-X/06/$20.00 2006 IEEE.
Fig. 1. Principle of stereovision with aligned images. The coordinate difference D measured between the two images is used, through a simple triangulation, to compute the three coordinates of P.

2.1. Basic Principle

Using the calibration parameters (intrinsic parameters of the two cameras, left-right transform), the initial images are corrected for the lens distortion and rectified so that the epipolar lines are the image lines. Every pixel (u,v) of the left image is then compared with the pixels of the same line of the right image (see figure 2) in the interval (u,v-Dmax),(u,v). The difference between the column coordinates v of the matched pixels is called the disparity D. For every possible disparity D in this interval, the similarity between the left pixel (u,v) and the right one (u,v-D) is evaluated, using either a numerical ZNCC score or a CT score computed after applying a Census Transform to the two images.
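To make the matching step concrete, here is a minimal software sketch (ours, not part of the paper) of the two similarity scores and of the search over the disparity interval. Window contents are flattened lists, census codes are bit strings, and all helper names are our own:

```python
import math

def zncc(left, right):
    """Zero-mean Normalized Cross-Correlation between two equal-size windows."""
    ml, mr = sum(left) / len(left), sum(right) / len(right)
    num = sum((l - ml) * (r - mr) for l, r in zip(left, right))
    den = math.sqrt(sum((l - ml) ** 2 for l in left) *
                    sum((r - mr) ** 2 for r in right))
    return num / den if den else 0.0

def census(window, center):
    """Census Transform: one bit per neighbour, '1' if it is darker than the centre."""
    return ''.join('1' if p < center else '0' for p in window)

def ct_score(code_l, code_r):
    """CT score: number of identical bits between the two census codes."""
    return sum(a == b for a, b in zip(code_l, code_r))

def best_disparity(code_left, right_codes):
    """Scan the interval D = 0..Dmax-1; right_codes[D] is the code at (u, v-D)."""
    return max(range(len(right_codes)),
               key=lambda d: ct_score(code_left, right_codes[d]))
```

The CT score only needs bit comparisons and a counter, which is why it maps to hardware so much more cheaply than the multiplications and square root of the ZNCC.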
Fig. 2. Pixel matching.
Fig. 4. Basic architecture of passive stereovision.
The 3D reconstruction in the left camera reference frame (as in figure 1) is simply obtained, using B (the distance between the two optical centers) and the intrinsic parameters common to the two rectified images (αu, αv, u0, v0), by the equations:
X = αv·B·(u1 − u0) / (αu·d) ;   Y = B·(v1 − v0) / d ;   Z = αv·B / d

In the first stage we acquire the pixels of our images at the pixel frequency of the cameras (i.e. 40 MHz) and store three successive image lines in a circular buffer in order to compute, in the second stage, the means over a 3 x 3 sub-window. The result of this step is stored in another memory to compute the Census transformation (step 3), providing two images from which a pixel-matching algorithm extracts the disparities D. Herein we compare the sequential (DSP) and parallel (FPGA) architectures in order to evaluate the benefits of the second solution.
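The triangulation of section 2.1 can be sketched as follows (our illustration; the argument names follow the calibration parameters B, αu, αv, u0, v0 and the disparity d defined above):

```python
def reconstruct(u1, v1, d, B, alpha_u, alpha_v, u0, v0):
    """Triangulate P(X, Y, Z) from rectified pixel (u1, v1) and disparity d (d != 0)."""
    Z = alpha_v * B / d                          # depth from baseline and disparity
    X = alpha_v * B * (u1 - u0) / (alpha_u * d)  # row offset scaled to metric units
    Y = B * (v1 - v0) / d                        # column offset scaled to metric units
    return X, Y, Z
```

Note that the depth Z is inversely proportional to the disparity, so the depth resolution degrades for distant points.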
In order to realize our real-time architecture we have developed a specific experimental setup (Figure 3) combining two CMOS cameras (JAI CV-A33) and an uncooled IR camera realized in our laboratory.
Fig. 3. PICAS$O experimental setup.

3. REAL TIME COMPUTATION OF THE DISPARITY IMAGE

First we describe the basic architecture of passive stereovision, in which all operations are realized sequentially, as is the case in microprocessors or DSP circuits. Then we describe the parallel architecture that we developed in our project. Figure 4 represents the basic architecture, which contains four main steps: the acquisition of images, the calculus of means, the Census Transformation and the calculus of disparities (Census correlation).
3.1. Sequential architecture

After steps 1 and 2, where the benefits are not significant, we get a pixel from the left census image and compare it sequentially with the pixels of its corresponding window in the right census image to evaluate all the scores (degrees of matching). Then we look for the maximum score to locate the associated pixel and disparity. Although this architecture is simple, it takes a lot of time because of the loops we have to scan: for each pixel in the left census image we need to scan all pixels in the range of corresponding pixels in the right image (e.g. 64 pixels if we take this number as the maximum disparity). In the same manner, for each eligible pixel in the right image, we have to compute a score, which requires a loop whose length depends on the census window (e.g. 7²−1 = 48 iterations for a 7x7 census window). These two loops are mainly responsible for the processing time.

3.2. Full parallel architecture

The parallel architecture is based on a completely parallel test between the left census pixels and their correspondents in the right census image. It consumes many more storage blocks but yields better real-time performance. In this method we apply, at the same time, the XNOR function between the pixel of the left census image and all pixels in the range of the corresponding pixels in the right census image. With this architecture, we can calculate the disparity image at the image acquisition frequency (i.e. up to 130 images/sec for a 40 MHz pixel clock and a 640x480 image size).

4. PARALLEL ARCHITECTURE

In this part, we describe the stages of the massively parallel architecture that we have developed in our PICAS$O project and their real-time performances. In the parallel architecture the acquisition of images is achieved by a sequencer which passes the right and left image pixels to the means calculus unit according to the Camera Link protocol of the camera, i.e. this sequencer respects its synchronization signals (FVAL: frame clock, LVAL: line clock, DVAL: data clock, Pclock: pixel clock).

4.1. The means calculus unit

The calculus of means is achieved in two steps: a horizontal addition and a vertical one. To realize the horizontal addition we use three shift registers. The means window is a 3 x 3 pixel matrix; a parallel adder with three 8-bit inputs produces a sum coded on 10 bits. The result of the horizontal addition is stored in three memory lines whose size is the image width. After having stored three lines containing the sums of three horizontally successive pixels, we apply, on the next cycle of LVAL, the vertical addition on the rising edge of PCLK (pixel clock) to get the sum of the nine pixels of the means window, coded on 12 bits. This value is taken as the mean of the nine-pixel window. So in this stage we only need to store three lines of 10 bits times the image width, with a latency depending on this storage step.

4.2. Census transformation unit (CT)

In this stage we scan the means image with a window which determines the searching resolution of the correspondence between the left and right images.
To get the value of the census transformation we have built a series of shift registers (49 in the case of a 7 x 7 census), which output the census code as a string of 48 bits. The result of the previous stage must therefore be stored, in this case, in seven memory lines; so we need a storage zone of seven lines of 12 bits times the image width, with a latency depending on this storage. The census correlation unit represents the heart of the algorithm and is, as shown in figure 5, the most complex node in the parallel architecture of stereovision. It is divided into two principal stages: the calculus of the scores and the search for the maximum one.
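These two stages can be modelled in software as follows (our sketch, standing in for the parallel hardware): the candidate scores are evaluated "at once" by a comprehension, the 1-counting LUT of the score unit is precomputed, and the maximum is found by a tournament of pairwise comparison units, six layers deep for 64 inputs:

```python
# LUT counting the '1' bits of a byte, analogous to the hardware score LUT.
POPCOUNT = [bin(n).count('1') for n in range(256)]

def ct_scores(left, rights, bits=48):
    """Census codes as 48-bit ints; a score is the number of identical bits
    (XNOR of the two codes = complement of XOR, counted via the byte LUT)."""
    def score(l, r):
        x = l ^ r  # differing bits
        return bits - sum(POPCOUNT[(x >> s) & 0xFF] for s in range(0, bits, 8))
    return [score(left, r) for r in rights]  # all candidates "in parallel"

def tournament_max(scores):
    """Layers of pairwise comparison units; returns (best score, its index)."""
    layer = [(s, i) for i, s in enumerate(scores)]
    while len(layer) > 1:  # 64 -> 32 -> 16 -> 8 -> 4 -> 2 -> 1
        layer = [max(layer[i], layer[i + 1]) for i in range(0, len(layer), 2)]
    return layer[0]
```

In the FPGA each tournament layer is one pipeline stage, so a new pixel enters the tree every pixel clock while results emerge a few cycles later.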
Fig. 5. Simultaneous searching of the best score.

The inputs of this unit are the left census pixel and its corresponding range of right census pixels (see figure 5). The first step is realized by applying the XNOR function, bit by bit, in parallel between the left census pixel and all pixels of the corresponding right range. Then a LUT (Look-Up Table) computes the score, which is the number of '1's in the XNOR output. So we get two elements at the output: the score and the index of its associated right census pixel in the searching range. The second step is the search for the maximum score. As shown in figure 5, this step is composed of a system of comparison units; each unit has four inputs: two scores and their corresponding indexes. Each unit multiplexes the scores and their indexes, controlled by the comparison between the two scores. In this architecture the maximum disparity is 64 pixels, so the searching step needs six layers of comparison units: 32 units in the first layer, then 16, 8, 4, 2, and the output unit, which delivers the maximum score and its index, i.e. the disparity D of the corresponding right census pixel. To meet the real-time constraints (a pixel clock of 40 MHz and a rate of 130 frames/sec for an image resolution of 640 x 480), we have pipelined this searching system. Each comparison level needs one cycle of the pixel clock, which adds a latency of seven clock cycles, i.e. about 0.175 µs.

4.3. Verification unit

When we search for the best scores, we get in some cases several identical maximum values that will give false disparities. To avoid this phenomenon we have added a checking stage to eliminate the false disparities. To achieve this step, we have implemented in parallel another disparity calculus stage: the previous one looked for the correspondent of the left census pixel in the right image, and this one looks for the correspondent of the right census pixel in the left image. The results of these two stages (left-right and right-left disparities) are stored in two memory lines whose size is the image width. The contents of these two memories are read sequentially and feed two blocks composed of 64 shift registers each (if Dmax = 64 pixels); we then apply the verification condition to keep the right disparities and eliminate the false ones, as shown in figure 6.
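The verification condition amounts to a left-right consistency check, which can be sketched as follows (our code; the sentinel value marking a rejected disparity is our assumption, not the paper's):

```python
def verify(disp_lr, disp_rl, invalid=255):
    """Keep a left->right disparity only if the right->left pass points back
    to the same pixel; otherwise mark it with the (assumed) sentinel value."""
    out = []
    for v, d in enumerate(disp_lr):  # v: column index, d: left->right disparity
        ok = 0 <= v - d < len(disp_rl) and disp_rl[v - d] == d
        out.append(d if ok else invalid)
    return out
```

Matches produced by ambiguous maxima rarely survive this check, since the reverse search usually lands on a different pixel.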
Fig. 6. Bidirectional verification of disparity.

The architecture is based on a completely parallel test between the left census pixels and their correspondents in the right census image. It consumes many more resources, but this test is very effective and eliminates many wrong disparity values. In spite of the bidirectional checking, we can still calculate the disparity image at the image acquisition frequency, with a latency depending on the number of storage lines and on the use of pipelined architectures.

4.4. Resources consumption

To estimate the consumption of our architecture in terms of memory and logic cells, we have simulated all the stages in the ALTERA Quartus II development tool [4]. The architecture was described in the VHDL language. The evaluation is based on two characteristics: an estimate of the number of consumed logic elements and an estimate of the memory size used in the FPGA. For an image of 640 x 480 pixels, a Census window of 7 x 7 and a maximum disparity of 64 pixels, our architecture uses about 11100 logic elements and a memory size of 174 Kbits.

5. FPGA/DSP COMPARISON

Although it is worth saying that FPGA circuits are better suited than DSPs for passive stereovision algorithms, we have compared the two solutions (parallel and sequential) in order to get an idea of the performance benefits. The evaluation is based on the Analog Devices ADSP-21161N processor (working frequency of 100 MHz, application program in assembly language), for which we only give the results. Figure 7 gives, for each solution, the performances obtained for an image size of 640x480 pixels and a Dmax of 32 pixels, according to the Census window size.

Census window    FPGA (Altera Stratix 1S40)    DSP (ADSP-21161N)
3x3              >130 images/sec               0.35 images/sec
5x5              >130 images/sec               0.13 images/sec
7x7              >130 images/sec               0.05 images/sec

Fig. 7. Image throughput rate (FPGA versus DSP).

6. CONCLUSION

This paper has described a real-time parallel architecture developed for the PICAS$O project to compute depth maps at a rate better than 130 images/s, as needed in many applications such as computer-aided medical surgery, robotics and vehicle driving assistance. We have shown the interest of our real-time parallel architecture for passive stereovision by explaining the real-time aspects of its steps and comparing them to DSP solutions. Finally, this work will also be used in the frame of the realization of a qualitative and quantitative stereo-laparoscope for robotic surgery, which is the goal of our current project.

7. REFERENCES
[1] A. Bensrhair, M. Bertozzi, A. Broggi, P. Miche, S. Mousset, and G. Toulminet, "A cooperative approach to vision-based vehicle detection", in Proc. ITSC, Japan, October 2001.

[2] P. Grandjean and P. Lasserre, "Stereo Vision Improvements", IEEE International Conference on Advanced Robotics, Barcelona, Spain, September 1995.

[3] R. Zabih and J. Woodfill, "Non-parametric local transforms for computing visual correspondence", Third European Conference on Computer Vision, Stockholm, Sweden, June 1994.

[4] Altera Quartus reference manual (http://www.altera.com/literature/lit-qts.jsp).

[5] U. Franke and A. Joos, "Real-time stereo vision for urban traffic scene understanding", IEEE Intelligent Vehicle Symposium, Dearborn, USA, October 2000.

[6] J. Woodfill and B. Von Herzen, "Real-Time Stereo Vision on the PARTS Reconfigurable Computer", 5th IEEE Symposium on FPGAs for Custom Computing Machines, pp. 201-210, Napa Valley, USA, April 1997.