FPGA Design and Implementation of a Real-Time Stereo Vision System

S. Jin, J. Cho, X. D. Pham, K. M. Lee, S.-K. Park, M. Kim, and J. W. Jeon, Member, IEEE

Abstract—Stereo vision is a well-known ranging method because it resembles the basic mechanism of the human eye. However, its computational complexity and large amount of data access make real-time processing of stereo vision challenging because of the inherent instruction cycle delay within conventional computers. In order to solve this problem, the past 20 years of research have focused on the use of dedicated hardware architectures for stereo vision. This paper proposes a fully pipelined stereo vision system that provides a dense disparity image with additional sub-pixel accuracy in real-time. The entire stereo vision process, including rectification, stereo matching, and post-processing, is realized using a single field programmable gate array (FPGA) without the need for any external devices. The hardware implementation is more than 230 times faster than a software program running on a conventional computer, and it outperforms previously reported hardware designs.

Index Terms—Field programmable gate arrays, integrated circuit design, stereo vision, video signal processing.

I. Introduction

Stereo vision is a traditional method for acquiring 3-D information from a stereo image pair. Stereo vision has many advantages over other 3-D sensing methods in terms of safety, undetectable characteristics, cost, operating range, and reliability [1]. For these reasons, stereo vision is widely used in many application areas including intelligent robots, autonomous vehicles, human–computer interfaces, and security and defense applications [2]–[5].

Manuscript received October 29, 2008; revised January 16, 2009. First version published July 7, 2009; current version published January 7, 2010. This research was performed for the Intelligent Robotics Development Program, one of the 21st Century Frontier Research and Development Programs, funded by the Ministry of Commerce, Industry and Energy, Korea. This paper was recommended by Associate Editor Y.-K. Chen. S. Jin and J. W. Jeon are with the School of Information and Communication Engineering, Sungkyunkwan University, Suwon, Gyeonggi-do 440-746, Korea (e-mail: [email protected]; jwjeon@yurim.skku.ac.kr). J. Cho is with the Department of Computer Science and Engineering, University of California, San Diego, CA 92093-0404 USA (e-mail: [email protected]). X. D. Pham is with the Information Technology Faculty, Saigon Institute of Technology, Hochiminh City, Vietnam (e-mail: [email protected]). K. M. Lee is with the Department of Electrical and Computer Engineering, Seoul National University, Seoul 151-742, Korea (e-mail: [email protected]). S.-K. Park and M. Kim are with the Center for Intelligent Robotics, Korea Institute of Science and Technology, Seoul 136-791, Korea (e-mail: [email protected]; [email protected]). Digital Object Identifier 10.1109/TCSVT.2009.2026831

In spite of its usefulness as a range sensor, stereo vision has limitations for real-life applications due to its considerable computational expense [6]. The instruction cycle delay caused by numerous repetitive operations makes real-time processing of stereo vision difficult on a conventional computer. For example, several seconds are required to execute a medium-sized stereo vision algorithm for a single pair of images on a 1 GHz general-purpose microprocessor [7]. This low frame rate limits the applicability of stereo vision, especially for real-time applications, because quick decisions must be made based on the vision data [1]. To overcome this limitation, various approaches have been developed since the late 1980s to perform the 3-D depth calculation of stereo vision in real-time using hardware-based systems such as digital signal processors (DSPs), field programmable gate arrays (FPGAs), and application-specific integrated circuits (ASICs). Kimura et al. [8] designed a convolver-based nine-eye stereo machine called SAZAN, which generates dense stereo depth maps with 25 stereo disparities for 320 × 240 images at 20 frames per second (f/s). Woodfill et al. [1] developed the DeepSea stereo vision system based on the DeepSea processor, an ASIC, which computes absolute depth at 200 f/s with a 512 × 480 input image pair and 52 stereo disparities. FPGA-based stereo vision systems have also been introduced due to the rapid development of programmable devices. Darabiha et al. [9] developed a phase-based stereo vision design which generates 20 stereo disparities using 256 × 360 pixel images at 30 f/s on four Xilinx XCV2000E FPGAs. Jia et al. [10] developed MSVM-III, which computes trinocular stereopsis using one Xilinx XC2V2000 FPGA. This system runs at approximately 30 f/s with 640 × 480 pixel images within a 64 pixel disparity search range. Even though considerable progress has been made, improvements are still needed because the previously proposed systems have various deficiencies in regard to the stereo matching method, frame rate, scalability, and one-chip integration of pre- and post-processing functions. From an architectural point of view, the real-time stereo vision systems developed so far have only partially satisfied these requirements. This paper proposes a dedicated hardware architecture for real-time stereo vision and integrates it within a single chip to overcome these limitations. The entire stereo vision process, including rectification, census transform, stereo matching, and post-processing, is designed and implemented using a single FPGA.

Regarding the issue of scalability, the proposed system is intensively pipelined and synchronized with a pixel sampling clock. In general, a sequential bottleneck is eliminated when the entire system is synchronized with individual data elements rather than with the flow of control, and the overall performance improves as a result [11]. In the same manner, for vision-related systems, the throughput of the implemented system increases with the input pixel frequency, as long as the pixel frequency does not exceed the maximum allowed frequency of the design. For this reason, the implemented system can generate disparity images of VGA (640 × 480) resolution at conventional video-frame rates, 30 and 60 f/s, with pixel clocks of 12.2727 and 24.5454 MHz, respectively. Because no other processing clock is used, the implemented system can generate disparity images with the same resolution at the designated frame rate when connected to cameras with different pixel clock frequencies, without any adaptation of the internal system design. As a result, the implemented system is flexible with respect to the camera input parameters, such as frame rate and image size.

The remainder of this paper is organized as follows: Section II presents common background information about the stereo vision process, including pre- and post-processing; Section III gives a detailed description of the implementation of the proposed real-time stereo vision system; Section IV discusses and evaluates the experimental performance results; and Section V draws the conclusion.

II. Stereo Vision Background

Stereo vision refers to the problem of determining the 3-D structure of a scene from two or more images taken from distinct viewpoints [13]. In the case of binocular stereo, the depth information for the scene is determined by searching for the corresponding pair of each pixel within the image. Since this search is based on a pixel-by-pixel comparison, which consumes much computational power, most stereo matching algorithms make assumptions about the camera calibration and epipolar geometry [14]. An image transformation, known as rectification, is applied in order to obtain a pair of rectified images from the original images as in (1)-(2). In the equations, (x, y) and (x', y') are the coordinates of a pixel in the original images and the rectified images, respectively, and the hatted symbols denote the intermediate homogeneous coordinates. To avoid problems such as reference duplication, reverse mapping is used with interpolation. Once the image pair is rectified, a 1-D search along the corresponding line is sufficient to evaluate the disparity [15]–[18]

$$\begin{bmatrix} \hat{x}_l \\ \hat{y}_l \\ \hat{z}_l \end{bmatrix} = H_l^{-1} \begin{bmatrix} x'_l \\ y'_l \\ 1 \end{bmatrix}, \qquad \begin{bmatrix} \hat{x}_r \\ \hat{y}_r \\ \hat{z}_r \end{bmatrix} = H_r^{-1} \begin{bmatrix} x'_r \\ y'_r \\ 1 \end{bmatrix} \tag{1}$$

$$\begin{bmatrix} x_l \\ y_l \end{bmatrix} = \begin{bmatrix} \hat{x}_l / \hat{z}_l \\ \hat{y}_l / \hat{z}_l \end{bmatrix}, \qquad \begin{bmatrix} x_r \\ y_r \end{bmatrix} = \begin{bmatrix} \hat{x}_r / \hat{z}_r \\ \hat{y}_r / \hat{z}_r \end{bmatrix}. \tag{2}$$

After rectification, stereo matching (given a pixel in one image, find the corresponding pixel in the other image) is applied to solve the correspondence problem. Since pixels with the same intensity value can appear many times within the image, a group of pixels, called a window, is used to compare the corresponding points. Several stereo vision algorithms have been introduced based on two classes of correspondence methods: local and global. The local methods include block matching, feature matching, and gradient-based optimization, while the global methods include dynamic programming, graph cuts, and belief propagation. Because the searching range of stereo matching can be restricted to one dimension, local methods can be very efficient compared with global methods. Moreover, even though local methods experience difficulties in locally ambiguous regions, they provide acceptable depth information for the scene with the aid of accurate calibration [13]. For this reason, we employed a local stereo method as the cost function of the proposed system. In particular, we used census-based correlation because of its robustness to random noise within a window and its bit-oriented cost computation [19]. The census transform maps the window surrounding the pixel p to a bit vector representing the local information about the pixel p and its neighboring pixels. If the intensity value of a neighboring pixel is less than the intensity value of pixel p, then the corresponding bit is set to 1; otherwise it is set to 0. The dissimilarity between two bit strings can be measured through the hamming distance, which is the number of bits that differ between the two bit strings. To compute the correspondence, the sum of these hamming distances over the correlation window is calculated as in (3), where $\hat{I}_1$ and $\hat{I}_2$ represent the census transforms of the template window $I_1$ and the candidate window $I_2$

$$\sum_{(u,v)\in W} \mathrm{Hamming}\big(\hat{I}_1(u, v),\ \hat{I}_2(x + u,\ y + v)\big). \tag{3}$$
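For illustration, the census transform and the hamming-distance cost of (3) can be sketched in C, in the spirit of the software reference implementation described later in Section IV. The 7 × 7 window used here is chosen only so that the census vector fits in one 64 bit word; the hardware described in Section III uses an 11 × 11 census window and a 15 × 15 correlation window, and the image layout and border handling below are assumptions of this sketch.

```c
#include <stdint.h>

#define CENSUS_R 3   /* 7 x 7 census window, so the vector fits in 64 bits */

/* Census transform at (x, y): each neighbor darker than the center pixel
 * contributes a '1' bit to the census vector (border handling omitted).   */
static uint64_t census(const uint8_t *img, int width, int x, int y)
{
    uint64_t vec = 0;
    uint8_t  center = img[y * width + x];
    for (int dy = -CENSUS_R; dy <= CENSUS_R; dy++)
        for (int dx = -CENSUS_R; dx <= CENSUS_R; dx++) {
            if (dx == 0 && dy == 0) continue;          /* skip the center pixel */
            uint8_t neighbor = img[(y + dy) * width + (x + dx)];
            vec = (vec << 1) | (neighbor < center ? 1u : 0u);
        }
    return vec;
}

/* Hamming distance between two census vectors: XOR and count the set bits. */
static int hamming(uint64_t a, uint64_t b)
{
    uint64_t x = a ^ b;
    int count = 0;
    while (x) { x &= x - 1; count++; }   /* clear the lowest set bit each step */
    return count;
}
```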

The result of stereo matching is a disparity, which indicates the distance between the two corresponding points. For all possibilities, the disparity should be computed and evaluated at every pixel within the searching range. However, reliable disparity estimates cannot be obtained on surfaces with no texture or repetitive texture when a local stereo method is used [20]. In this case, the disparities at multiple locations within one image may point to the same location in the other image, even though each location within one image should be assigned, at most, one disparity. The uniqueness test tracks the three smallest matching results, $v_1$, $v_2$, and $v_3$, instead of seeking only the minimum. The pixel has a unique minimum if the minimum, $v_1$, is sufficiently smaller than both $v_2$ and $v_3$, as in (4), where N is an experimental parameter. Usually, the value of N lies between 1.25 and 1.5; we used a value of 1.33 in all our experiments

$$v_2 > N v_1 \quad \text{and} \quad v_3 > N v_1. \tag{4}$$
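The uniqueness test of (4) amounts to tracking the three smallest costs while scanning the disparity range, which might look as follows in software form (the integer 4/3 ratio approximates N = 1.33; the cost array layout is an assumption of the sketch).

```c
#include <limits.h>

/* Uniqueness test of (4): v1 is accepted only if both runners-up exceed
 * N * v1. N = 1.33 is approximated by 4/3 to stay in integer arithmetic.  */
static int is_unique(const int *cost, int range)     /* cost[d], d = 0..range-1 */
{
    int v1 = INT_MAX, v2 = INT_MAX, v3 = INT_MAX;
    for (int d = 0; d < range; d++) {
        int c = cost[d];
        if      (c < v1) { v3 = v2; v2 = v1; v1 = c; }
        else if (c < v2) { v3 = v2; v2 = c; }
        else if (c < v3) { v3 = c; }
    }
    return (3 * v2 > 4 * v1) && (3 * v3 > 4 * v1);
}
```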

Fig. 1. High-level hardware architecture of the proposed real-time stereo vision system.

Occlusions can also mislead the disparity result, since the scene is captured from two different viewpoints and no correlated information exists between the image pair for the occluded regions. The left-right consistency check (LR-check) marks as occluded the points at which the two matching directions are not inverses of each other [21]. An experimental parameter, T, is compared with the Error in (5) and (6) in order to evaluate the disparity result, where $x_r$ is the right coordinate, $x'_l$ is the estimated left match at disparity $d^r_{x_r}$, and $d^l_{x'_l}$ is the left-based estimated disparity

$$x'_l = x_r + d^r_{x_r} \tag{5}$$

$$\mathrm{Error} = x_r - \left( x'_l + d^l_{x'_l} \right). \tag{6}$$

The sub-pixel estimation adds additional accuracy to the disparity result using parabola fitting, $f(x) = a + bx + cx^2$. Since the minimum of the parabola is located at $x = -b/(2c)$, the sub-pixel position x can be obtained as in (7), where C(d) and C(d ± n) are the correlation values at positions d and d ± n, respectively

$$x = d + \frac{7\left(2C(d-2) + C(d-1) - C(d+1) - 2C(d+2)\right)}{20\left(2C(d-2) - C(d-1) - 2C(d) - C(d+1) + 2C(d+2)\right)}. \tag{7}$$

For further enhancing the disparity image, spike removal can be used together with the sub-pixel estimation, as described in [22] and [26]. The spike removal eliminates small spikes in the disparity map by regarding them as useless information. A spike is locally stable but not large, and has no support from the surrounding surfaces. Since a true surface within the stereo image is commonly part of a larger surface, a surface label, L, can be assigned to each pixel. Equation (8) shows the assignment process for a specific pixel i, where N(i) is the neighborhood of pixel i and $d_i$ is the disparity of pixel i. After the labels are assigned, the number of pixels with a given label L is counted. A pixel is regarded as a spike and removed while generating the disparity result if the number of pixels with its label is smaller than a threshold. The threshold usually depends on the size of the image; in our experiments, we used 400 as the threshold value

$$i = L \quad \text{given} \quad j \in N(i),\ j = L,\ \left| d_j - d_i \right| \le 1. \tag{8}$$
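Written out in software form, (7) is a direct five-point evaluation; the following sketch uses floating point for clarity, whereas the hardware of Section III-C replaces the multiplications and the division with shift-and-add operations. Boundary handling at the ends of the disparity range is omitted.

```c
/* Sub-pixel refinement of (7): refine the integer disparity d using the
 * correlation values C(d-2)..C(d+2). Returns d unchanged on a flat cost. */
static double subpixel(int d, const int *C)
{
    double num = 7.0  * (2.0 * C[d - 2] + C[d - 1] - C[d + 1] - 2.0 * C[d + 2]);
    double den = 20.0 * (2.0 * C[d - 2] - C[d - 1] - 2.0 * C[d] - C[d + 1] + 2.0 * C[d + 2]);
    if (den == 0.0)
        return (double)d;
    return (double)d + num / den;
}
```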

III. Hardware Implementation of the Stereo Vision

As described above, the computational complexity of stereo vision can be estimated directly, especially for a census transform-based method. Assuming an n × n window for the census transform, an m × m window for the correlation, and a searching range r, the operations required to compute the disparity of a single pixel are as follows: (r + 1) · m² · (n² − 1) pixel subtractions, r · m² hamming distance measurements on bit vectors of length n² − 1, and r · m² hamming distance summations. A number of additional operations are required for pre- and post-processing, as described above. Considering the image size, these operations are very repetitive and time-consuming, especially for a conventional computer. This sequential bottleneck, which is caused by the many memory references, makes it difficult to meet real-time performance for the entire stereo vision system.
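For concreteness, substituting the parameters adopted later in Section III-B (an 11 × 11 census window, a 15 × 15 correlation window, and r = 64) gives, per output pixel,

$$(r+1)\, m^2 (n^2 - 1) = 65 \cdot 225 \cdot 120 \approx 1.76 \times 10^6 \ \text{pixel subtractions}, \qquad r\, m^2 = 64 \cdot 225 = 14{,}400 \ \text{hamming distances}.$$

At 640 × 480 resolution and 30 f/s, a naive computation that does not reuse census vectors or window sums between neighboring pixels would therefore need on the order of 10^13 elementary operations per second, far beyond what a conventional processor can sustain.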

To achieve real-time performance while handling the stereo vision procedure presented above, we designed a dedicated hardware architecture implemented in an FPGA. Compared with the instruction cycle mechanism of a conventional computer, an FPGA can process data more efficiently through a customized datapath. Fig. 1 describes the overall system architecture. The entire stereo vision process is performed by passing through a single datapath controlled by a controller module. The datapath is composed of three modules: rectification, stereo matching, and post-processing. For the implementation of the dedicated datapath, the following design decisions were made: intensive use of parallelism and pipelining, synchronization with a single pixel clock, and elimination of any external image buffer.

Because parallelism and pipelining are intrinsic resources of an FPGA, it is important to fully utilize them to improve computational performance. To achieve parallelism, we divide each module into a set of simpler functional elements. Each functional element is then replicated and executed in parallel as much as possible. For example, we place r hamming distance modules to obtain the r hamming distances, instead of repeatedly executing a single hamming distance module r times. Since stereo matching contains many repetitive operations because of its window-based image processing, parallelizing these repetitions greatly improves the overall performance.

Synchronization of all modules with a single pixel clock is another important design decision for achieving good performance and scalability. From the viewpoint of the camera, the frequency of the pixel clock is related to the number of pixels transmitted, which in turn depends on the frame rate and the image resolution. Assuming an imaginary sensor that uses a 12 MHz pixel clock to generate 640 × 480 gray-level images at 30 f/s, a 48 MHz pixel clock would theoretically allow this sensor to generate images four times that size at the same frame rate, or images of the same size at four times the original frame rate. That is, the image size and frame rate depend directly on the pixel clock. Therefore, the implemented system becomes flexible with respect to the resolution and frame rate of the camera if we synchronize the system with the pixel clock of the camera. Once the design is synchronized with the pixel clock, it is important to increase the maximum allowed frequency in order to maximize the throughput of the system. For this reason, the parallelized modules are divided again into sets of pipeline stages, considering the size and complexity of each operation. In addition, an external image buffer is excluded, because fetching data from external memory devices requires more than one clock cycle. Even though this intensive pipelining consumes additional routing resources, it improves the overall throughput. After the initial pipeline latency, the disparity for the corresponding input is generated in real-time, synchronized with the pixel clock. A detailed description of each module is given below.

A. Rectification Module

As described in Section II, rectification is performed based on the matrices generated when the camera is calibrated. Since the camera calibration is done off-line, the rectification module takes the user input through an external communication port and initiates the rectification process.

Fig. 2. Realization of the rectification process using reverse mapping.

Fig. 2 shows the procedure for calculating the reverse mapping coordinates, as described in (1)-(2). The resulting coordinate can be larger or smaller than the reference coordinate due to the characteristics of reverse mapping. However, storing the entire image is wasteful, both in terms of memory usage and latency, because the common assumption about the camera alignment has already been made in binocular stereopsis. For this reason, we assume that the pixels are adjusted vertically within ±50 scan-lines at maximum. We use dual-port memory as a circular line buffer to minimize waste while compensating for the coordinate difference. In addition, the reading point in time, $T_s$, in (9) is determined by measuring the maximum and minimum reference address values in order to maximize memory utilization. $B_u$ and $B_l$ represent the upper and lower boundary values of the pixel adjustment range, $y'$ is the output pixel coordinate and $H(y')$ is its reference pixel coordinate, $L_n$ is the maximum number of lines that can be stored in the line buffer, and $I_h$ and $I_w$ represent the image height and width, respectively. Since $T_s$ is constant as long as the rectification matrix is constant, $T_s$ is calculated off-line and transferred to the system during the configuration of the rectification matrix after calibration

$$B_u = \max\!\left( \max_{y'=0}^{I_h}\big(y' - H_r(y')\big),\ \max_{y'=0}^{I_h}\big(y' - H_l(y')\big) \right)$$

$$B_l = \max\!\left( \min_{y'=0}^{I_h}\big(y' - H_r(y')\big),\ \min_{y'=0}^{I_h}\big(y' - H_l(y')\big) \right)$$

$$T_s = \left( B_l + \frac{L_n - (B_u + B_l)}{2} \right) \times I_w. \tag{9}$$

After the initial latency $T_s$, rectification using backward mapping can be performed in synchronization with the pixel input, using the pixel coordinates of the reference image (x, y), which are calculated by the rectification process from the incremental output pixel coordinates (x', y'). Based on the calculated coordinate, the pixel output for the rectified image is read from the line buffer and sent to the stereo matching module. As a result, the stereo matching module is guaranteed to receive a rectified image input which satisfies the epipolar geometry.
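In software form, the reverse mapping of (1)-(2) for one output pixel might be sketched as follows; the row-major 3 × 3 inverse homography, the bilinear interpolation, and the whole-image addressing are assumptions of this sketch, whereas the hardware reads only from the circular line buffer described above.

```c
#include <stdint.h>

/* Reverse mapping of (1)-(2): map the rectified coordinate (xp, yp) through
 * the inverse homography Hinv (row-major 3x3), dehomogenize, and sample the
 * original image with bilinear interpolation.                              */
static uint8_t remap_pixel(const uint8_t *src, int width, int height,
                           const double Hinv[9], int xp, int yp)
{
    double xs = Hinv[0] * xp + Hinv[1] * yp + Hinv[2];
    double ys = Hinv[3] * xp + Hinv[4] * yp + Hinv[5];
    double zs = Hinv[6] * xp + Hinv[7] * yp + Hinv[8];
    double x = xs / zs, y = ys / zs;                  /* eq. (2): dehomogenize */

    int x0 = (int)x, y0 = (int)y;
    if (x0 < 0 || y0 < 0 || x0 + 1 >= width || y0 + 1 >= height)
        return 0;                                     /* outside the source image */
    double fx = x - x0, fy = y - y0;                  /* bilinear weights */
    double p = (1 - fx) * (1 - fy) * src[y0 * width + x0]
             + fx       * (1 - fy) * src[y0 * width + x0 + 1]
             + (1 - fx) * fy       * src[(y0 + 1) * width + x0]
             + fx       * fy       * src[(y0 + 1) * width + x0 + 1];
    return (uint8_t)(p + 0.5);
}
```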


Fig. 3. Hardware architecture for stereo matching (census transform/correlation).

Fig. 4. Hardware architecture for the parallel processing of window pixels.

Fig. 5. Hamming distance computation using the adder tree.

B. Stereo Matching Module

After the rectification process, stereo matching is performed between the two corresponding images. As shown in Fig. 3, stereo matching is divided into two stages: the census transform stage and the correlation stage. In the census transform stage, the left and right images are transformed into images whose pixel values are census vectors instead of gray-level intensities. We use an 11 × 11 window for the census transform, generating a 120 bit transformed bit vector per pixel. Since the census transform is a type of window-based processing, the neighboring pixels of the pixel being processed need to be accessed simultaneously. To achieve this, we use a scan-line buffer and a window buffer. A scan-line buffer is a buffer able to contain several scan-lines of the input image, and a window buffer is a set of shift registers which contain the pixels in the window. Since the window buffer consists of registers, it guarantees instant access to its elements. The scan-line buffer used in the proposed system consists of 10 dual-port memories, and each memory can store one scan-line of an input image. Assuming that the coordinates of the current input pixel are (x, y) and the intensity value of the pixel is I(x, y), the connections between the memories are shown in Fig. 4. Fig. 4 also shows the scan-line buffer conceptually converting a single pixel input into a pixel column vector output. A window buffer is a square-shaped set of shift registers, and each register can store one pixel intensity value of the input image. The right-most column of the window buffer acquires the pixel column vector generated by the scan-line buffer, and the intensity values in the registers are shifted from right to left at each pixel clock. As a result, all 11 × 11 pixels exist in the window buffer, and each pixel can be accessed simultaneously.
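The buffering scheme can be modeled in software as follows: the ten previous scan-lines are held in a circular buffer, and an 11 × 11 window of registers is shifted by one column per pixel clock so that every window pixel is available at once. This is only a behavioral model of the HDL design, and the image width W = 640 is an assumption of the sketch.

```c
#include <stdint.h>

#define WIN 11      /* census window size */
#define W   640     /* image width assumed for the sketch */

static uint8_t lines[WIN - 1][W];   /* scan-line buffer: the 10 previous lines */
static uint8_t win[WIN][WIN];       /* window buffer: shift registers */

/* Model of one pixel clock: shift the window left by one column and load the
 * right-most column from the scan-line buffer plus the new pixel. Valid once
 * the first WIN-1 lines have been stored.                                    */
static void push_pixel(uint8_t p, int x, int y)
{
    for (int r = 0; r < WIN; r++)                 /* shift window one column left */
        for (int c = 0; c < WIN - 1; c++)
            win[r][c] = win[r][c + 1];

    for (int r = 0; r < WIN - 1; r++)             /* column above the new pixel */
        win[r][WIN - 1] = lines[(y + r) % (WIN - 1)][x];
    win[WIN - 1][WIN - 1] = p;                    /* the new pixel itself */

    lines[y % (WIN - 1)][x] = p;                  /* keep it for the following lines */
}
```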

Fig. 6. Hardware architecture of the LR-check module.

Fig. 7. Realization of the sub-pixel estimation using parabola fitting method.

Fig. 8. Hardware architecture of the spike removal module.

Fig. 9. Implemented real-time stereo vision system based on FPGA.

After a pixel intensity has been assigned to each register within the window buffer, the census vector can be obtained by comparing the intensity of the center pixel, Win(x_c, y_c), with those of its neighboring pixels in the window. As a result, an image whose pixels carry local information about the relation between each pixel and its neighborhood is generated and delivered to the correlation stage. The correlation stage evaluates the correlation between the census vectors generated by the left and right census transforms using a window-based method similar to the one used in the census transform stage. By gathering the census vectors within a window of size 15 × 15, we can build a more robust representation of each pixel contained within the image. In the correlation stage, 64 hamming distances, for the preset range r = 64, are evaluated using a template window for a pixel in the left image and the corresponding 64 correlation windows for pixels in the right image. After the comparison, the two pairs with the shortest hamming distances are used to define the resulting disparity. Since the windows being compared can be regarded as bit vectors, the hamming distance can be obtained by counting the '1' bits in the vector produced by an XOR operation. In order to decide upon the disparity result, the template window in the left image is compared with all 64 candidate windows from the right image. Since the size of the window is 15 × 15 and each element of the window is a census vector, the hamming distance between any two windows can be obtained as shown in Fig. 5. First, the census vector from the census transform module is delayed for 64 pixel clocks. Next, the distance between any two census vectors is calculated instead of building a correlation window. Since the hamming distance between two pixels can range from 0 to 120, the result is represented by 7 bits and regarded as the input of the correlation window. The window hamming distances are then obtained by accumulating all the pixel values in the correlation window, similar to the census transform stage. We use the tournament selection method to find the shortest distance among these 64 hamming distances. The candidate window which has the shortest distance from the template window is selected as the closest match, and the coordinate difference of the selected windows along the x-axis is extracted as the disparity result.
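Put together, the correlation stage corresponds to the following software form, reusing the census() and hamming() helpers sketched in Section II; the hardware evaluates all 64 sums in parallel with adder trees and a tournament comparator, while this sketch simply loops. The sign convention for the candidate column (here x - d in the right image) depends on which image is taken as the reference, and border handling is again omitted.

```c
#include <stdint.h>
#include <limits.h>

#define RANGE 64    /* disparity search range */
#define CORR  15    /* correlation window size */

/* For one left-image pixel (x, y), accumulate the hamming distances of (3)
 * over the correlation window for every candidate disparity and return the
 * disparity with the smallest sum.                                         */
static int best_disparity(const uint64_t *census_l, const uint64_t *census_r,
                          int width, int x, int y)
{
    int best_d = 0, best_cost = INT_MAX;
    for (int d = 0; d < RANGE; d++) {
        int cost = 0;
        for (int v = -CORR / 2; v <= CORR / 2; v++)
            for (int u = -CORR / 2; u <= CORR / 2; u++)
                cost += hamming(census_l[(y + v) * width + (x + u)],
                                census_r[(y + v) * width + (x + u) - d]);
        if (cost < best_cost) { best_cost = cost; best_d = d; }
    }
    return best_d;
}
```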

TABLE I
Device Utilization Summary

Logic utilization                     Used      Available   Utilization
  Number of slice flip flops          53,616    178,176     30%
  Number of four input LUTs           60,598    178,176     34%
Logic distribution
  Number of occupied slices           51,191    89,088      57%
    Rectification                     4,035                 4%
    Census transform                  7,076                 8%
    Hamming distance                  3,792                 4%
    Correlation                       24,526                27%
    Post-processing                   10,744                12%
Number of FIFO/RAMBs                  322       336         95%
    Rectification                     64                    19%
    Census transform                  10                    2%
    Correlation                       182                   54%
    Post-processing                   66                    17%
Number of DSP48s                      12        96          12%
Number of DCM-ADVs                    3         12          25%

C. Post-Processing Module

The post-processing module contains four sub-modules that increase the accuracy of the resultant disparity images: LR-check, uniqueness test, spike removal, and sub-pixel estimation. As described in Section II, the LR-check is required to remove mismatches caused by occlusion. Since the left-to-right matching has already been completed, only the search for the right-to-left disparities has to be performed, by reusing the matching results [23], [27]. For example, the stereo matching is performed using the pixel in the left image, p_l(x, y), and the set of pixels in the right image, p_r(x, y)-p_r(x + 64, y), at a given time t. At the next time step, the matching is performed using p_l(x + 1, y) and p_r(x + 1, y)-p_r(x + 65, y), and so on. For this reason, the matching results of p_r(x, y) with p_l(x − 64, y)-p_l(x, y) are available at time t. As shown in Fig. 6, the disparity values are aligned vertically with respect to the reference image direction. Conversely, the disparity values for a pixel within the other image are aligned in a diagonal direction. In the proposed system, the possible pixel range is ±64 due to the disparity range of 0-63. For this reason, 64 dual-port memories, which can store 128 hamming distance values, are used to perform the LR-check. The address for the output port of each dual-port memory is obtained from the disparity value found in the stereo matching. During each pixel clock cycle, the outputs of the dual-port memories are compared with each other in order to determine the shortest hamming distance and its index. By comparing the two disparities, one from the vertical direction and the other from the diagonal direction, we can decide whether the disparity passes the LR-check. Fig. 6 shows the block diagram of the LR-check module.

The uniqueness test is also performed to validate the disparity result generated by the stereo matching module. In order to evaluate the uniqueness of the disparity at each position, we keep track of the three lowest-ranking hamming distances, C(d_i), C(d_i − 1), and C(d_i + 1), where C(d_i) is the hamming distance at the selected disparity result. Following the simple comparison described in Section II, we can check whether the selected disparity is a unique minimum, a double minimum, or a non-unique minimum [22], [24].
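In software form, the reuse expressed by (5)-(6) reduces to a round-trip test between the two disparity maps; the following sketch assumes the right-based map dR and the left-based map dL are stored with signs such that each one points toward the other image, which is the convention implied by (5)-(6).

```c
#include <stdlib.h>

/* LR-check of (5)-(6): from a right-image column xr, follow the right-based
 * disparity to the estimated left match, follow the left-based disparity
 * back, and accept only if the round trip returns to within T pixels.      */
static int lr_check(const int *dL, const int *dR, int width, int xr, int T)
{
    int xl = xr + dR[xr];                    /* eq. (5): estimated left match */
    if (xl < 0 || xl >= width)
        return 0;                            /* match falls outside the image */
    int err = abs(xr - (xl + dL[xl]));       /* eq. (6) */
    return err <= T;
}
```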


TABLE II
Performance Summary of Reported Stereo Vision Systems

Implemented system               | Image size | Matching method | Disparity range | Window size                      | Rectification | Frames per second
SAZAN (FPGA)                     | 320 × 240  | SSAD            | 25              | 13 × 9                           | No            | 20
Kuhn et al. (ASIC)               | 256 × 192  | SSD/Census      | 25              | Census (3 × 3), Corr (10 × 3)    | No            | 50
Darabiha et al. (FPGA)           | 256 × 360  | Local weighted  | 20              | N/A                              | No            | 30
MSVM-III (FPGA)                  | 640 × 480  | SSAD/Trinocular | 64              | N/A                              | Hard-wired    | 30
DeepSea (ASIC)                   | 512 × 480  | Census          | 52              | Census (N/A), Corr (N/A)         | Firmware      | 200
Software program (3.2 GHz, SSE2) | 640 × 480  | Census          | 64              | Census (11 × 11), Corr (15 × 15) | Software      | 1.1
This paper (FPGA)                | 640 × 480  | Census          | 64              | Census (11 × 11), Corr (15 × 15) | Hard-wired    | 230

TABLE III
Evaluation Result of the Proposed System Using Middlebury Stereo-Pairs
(ground truth, disparity, and bad-pixel images omitted; entries are percentages of bad pixels, δd = 1.0)

Stereo-pair | nonocc | all   | disc
Tsukuba     | 9.79   | 11.56 | 20.29
Venus       | 3.59   | 5.27  | 36.82
Teddy       | 12.50  | 21.50 | 30.57
Cones       | 7.34   | 17.58 | 21.01

Average percent of bad pixels: 17.24

If a disparity result passes the LR-check and uniqueness test, sub-pixel estimation and spike removal are applied to increase its reliability and accuracy. The sub-pixel estimation again utilizes C(d_i), C(d_i − 1), and C(d_i + 1), which were tracked in the uniqueness test module, with C(d_i − 2) and C(d_i + 2) giving additional precision. Using the parabola fitting method in (7), the sub-pixel estimation is performed with ease. To decrease the complexity and latency of the operations, a combination of shifting, addition, and subtraction is used for each multiplication and division, as shown in Fig. 7.

Fig. 10. Resultant disparity images with different post-processing levels. Images are captured online in a common indoor scene. (a) Raw image (left), (b) disparity result, (c) LR-check, uniqueness test, sub-pixel, and (d) final disparity with spike removal.

After a total latency of 23 clock cycles, the calculated sub-pixel value is obtained, synchronized with the pixel clock. Spike removal is implemented based on the window processing architecture described as part of the stereo matching module. Starting with the label "1," the disparity output generated by the stereo matching module is compared with its neighboring disparity values and labeled following (8). Generally, a spike occupies only a small region and does not have concave behavior. For this reason, we can skip the sorting of equivalent label pairs to simplify the removal process. Fig. 8 shows a block diagram of the implemented spike removal module. While the number of disparities with identical labels is being counted, each disparity value is stored in the line buffer because the vertical size of the region must be considered. If the non-spike condition is satisfied, the resultant disparity value is delivered to the system output, after the latency caused by the line buffer. Otherwise, the disparity value is replaced with a blank.
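A software equivalent of the spike removal is sketched below as a flood fill over regions whose neighboring disparities differ by at most one (eq. (8)); the streaming, line-buffer-based labeling of the hardware is replaced by an explicit stack, and the 400-pixel threshold mentioned in Section II is passed in as min_region. The INVALID marker value is an assumption of the sketch.

```c
#include <stdint.h>
#include <stdlib.h>

#define INVALID 255   /* marker for blanked (removed) disparities */

/* Remove spikes: blank every connected region (4-neighborhood, |dj-di| <= 1)
 * that contains fewer than min_region pixels.                               */
static void remove_spikes(uint8_t *disp, int w, int h, int min_region)
{
    int  *stack   = malloc(sizeof(int) * (size_t)w * h);
    int  *members = malloc(sizeof(int) * (size_t)w * h);
    char *seen    = calloc((size_t)w * h, 1);

    for (int s = 0; s < w * h; s++) {
        if (seen[s] || disp[s] == INVALID) continue;
        int top = 0, count = 0;
        stack[top++] = s; seen[s] = 1;
        while (top > 0) {                               /* flood fill one region */
            int p = stack[--top];
            members[count++] = p;
            int x = p % w, y = p / w;
            const int nx[4] = { x - 1, x + 1, x, x };
            const int ny[4] = { y, y, y - 1, y + 1 };
            for (int k = 0; k < 4; k++) {
                if (nx[k] < 0 || nx[k] >= w || ny[k] < 0 || ny[k] >= h) continue;
                int q = ny[k] * w + nx[k];
                if (seen[q] || disp[q] == INVALID) continue;
                if (abs((int)disp[q] - (int)disp[p]) <= 1) {   /* eq. (8) */
                    seen[q] = 1;
                    stack[top++] = q;
                }
            }
        }
        if (count < min_region)                         /* spike: blank the region */
            for (int k = 0; k < count; k++)
                disp[members[k]] = INVALID;
    }
    free(stack); free(members); free(seen);
}
```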

IV. Experimental Results

The proposed real-time stereo vision system was designed and coded in VHDL and implemented on a Virtex-4 XC4VLX200-10 FPGA from Xilinx. Fig. 9 shows the implemented system. The implemented system interfaces with two VCC-8350CL cameras from the CIS Corporation through the standard Camera Link format, or directly controls a pair of MT9M112 CMOS sensors from Micron as a stereo camera pair. Table I summarizes the device utilization reports from the Xilinx synthesis tool in ISE release 9.1.03i, XST J.33. The number of occupied slices and the maximum allowed frequency of the proposed system are 51,191 (about 57% of the device) and 93.0907 MHz, as shown in Table I. Since the system is based on local pixel processing, the size of the input image does not directly affect the device utilization. However, both the disparity range and the window size need to be adjusted to maintain the quality of the resulting disparity if the image size is changed. Among the submodules listed in Table I, the amount of logic resources consumed by the census transform and correlation modules increases linearly as the disparity range and window size increase. The other modules of the system are relatively less affected.

The performance of the proposed system was verified using cameras with different frame rates. Both the 30 and 60 f/s cameras passed the working test at pixel clocks of 12.2727 and 24.5454 MHz, respectively. In the same manner, we can expect the maximum performance of the proposed system to be reached with a camera running at 230 f/s, especially when considering the underestimation characteristic of the synthesis tool [12]. Since the maximum frame rate of the cameras available for this research is only 60 f/s, the performance of the proposed system was verified through timing simulation, using Mentor Graphics ModelSim 6.1f as the simulation environment with test vectors describing the actual behavior of the camera. A software program corresponding to the proposed system was implemented in ANSI-C for comparison of performance and disparity quality. Several optimization methods and processor-specific functions, such as multimedia extensions (MMX) and streaming single instruction multiple data extensions 2 (SSE2), were considered for full utilization of the hardware resources [25]. The experimental environment for the software program was as follows: Microsoft Visual C++ 7.0, Microsoft Windows XP SP2 Professional, Pentium 4 3.2 GHz, and 2 GB DDR SDRAM. Table II shows the performance comparison result. As shown in Table II, the proposed system, even though it was implemented in an FPGA, shows improvements over the previous systems in both performance and integration.

Table III shows the evaluation results of the cost function used in the proposed system. We measured the percentage of bad pixels on the Middlebury stereo-pairs, which consist of Tsukuba, Venus, Teddy, and Cones [26]. The percentage of bad pixels was obtained as in (10), where d_C(x, y) and d_T(x, y) are the computed disparity map and the ground truth map, respectively, N is the total number of pixels, and δ_d is the disparity error tolerance. We used δ_d = 1.0 for comparison with the state-of-the-art stereo vision algorithms on the Middlebury stereo evaluation page

$$B = \frac{1}{N} \sum_{(x, y)} \big( \left| d_C(x, y) - d_T(x, y) \right| > \delta_d \big). \tag{10}$$

Since the hardware was built for real-time processing of an incoming image stream, the disparity results of the proposed design were generated through HDL functional simulation, using the disparity search range limitation for each stereo-pair.
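The evaluation metric of (10) is straightforward to compute; the sketch below reports it as a percentage over all N pixels, leaving out the separate nonocc/all/disc masks used by the Middlebury evaluation.

```c
#include <math.h>

/* Percentage of bad pixels of (10): a pixel is bad when its computed
 * disparity differs from the ground truth by more than delta_d.         */
static double bad_pixel_rate(const float *dC, const float *dT, int n, float delta_d)
{
    int bad = 0;
    for (int i = 0; i < n; i++)
        if (fabsf(dC[i] - dT[i]) > delta_d)
            bad++;
    return 100.0 * bad / n;
}
```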

Fig. 11. Online results of the proposed stereo vision system. (a) Raw image (left), (b) disparity result (proposed), (c) disparity result (Bumblebee), (d) raw image (left), (e) disparity result (proposed), and (f) disparity result (Bumblebee).

As shown in the figure, the proposed system provided acceptable disparity accuracy considering its real-time performance.

Fig. 10 shows the resultant disparity images captured in a common indoor environment. The images were processed in real-time and obtained online from the implemented system at different post-processing levels. As the post-processing level advanced, uncertain pixels were removed to increase the confidence of the disparity result. Because the disparity ground truth of a common scene is difficult to obtain in real-time, a qualitative comparison was performed against a commercial product, the Digiclops stereo vision software with a Bumblebee camera from Point Grey Research, Inc. (PGR). To compare the disparity results under a similar configuration, the Digiclops stereo vision system was configured to have a disparity range of 0-63 and a 21 × 21 stereo mask. As shown in Fig. 11, the proposed system generated disparity results comparable to those of the commercial product, with real-time performance.

V. Conclusion

In this paper, we have built an FPGA-based stereo vision system using the census transform, which can provide dense disparity information with additional sub-pixel accuracy in real-time. The proposed system was implemented within a single FPGA, including all the pre- and post-processing functions such as rectification, LR-check, and uniqueness test. The real-time performance of the system was verified in a common real-life environment to assess its further applicability. To achieve the targeted performance and flexibility, we focused on the intensive use of pipelining and modularization while designing the dedicated datapath of the proposed stereo vision system. We made all the functional elements operate in sync with a single pixel clock, and they use the same format of video signals as their input and output. Because the video signals define the format of the image, each module can be considered an independent imaging sensor. As a result, the flexibility of the entire system, as well as of the individual modules, increased with respect to resolution and frame rate. In addition, we can expect a performance enhancement since the proposed system is designed to be synchronized with the individual pixel data flow, which eliminates the sequential bottleneck caused by instruction cycle delay.

In the future, the proposed real-time stereo vision system will be implemented as an ASIC for the purpose of commercial use. Considering the efficiency of implementing gate logic within an ASIC, we can expect performance enhancements in both quality and speed. The proposed system can be used for higher level vision applications such as intelligent robots, surveillance, automotive systems, and navigation. The additional application areas in which the proposed stereo vision system can be used will continue to be evaluated and explored.

References

[1] J. I. Woodfill, G. Gordon, and R. Buck, "Tyzx DeepSea high speed stereo vision system," in Proc. IEEE Comput. Soc. Workshop Real-Time 3-D Sensors Use Conf. Comput. Vision Pattern Recog., Washington, D.C., Jun. 2004, pp. 41–46.
[2] M. Bertozzi and A. Broggi, "GOLD: A parallel real-time stereo vision system for generic obstacle and lane detection," IEEE Trans. Image Process., vol. 7, no. 1, pp. 62–81, Jan. 1998.
[3] A. Bensrhair, M. Bertozzi, A. Broggi, A. Fascioli, S. Mousset, and G. Toulminet, "Stereo vision-based feature extraction for vehicle detection," in Proc. IEEE Intell. Vehicle Symp., Paris, France, vol. 2, Jun. 2002, pp. 465–470.
[4] N. Uchida, T. Shibahara, T. Aoki, H. Nakajima, and K. Kobayashi, "3-D face recognition using passive stereo vision," in Proc. IEEE Int. Conf. Image Process., Genoa, Italy, vol. 2, Sep. 2005, pp. 950–953.
[5] D. Murray and C. Jennings, "Stereo vision-based mapping and navigation for mobile robots," in Proc. IEEE Int. Conf. Robotics Autom., Albuquerque, NM, vol. 2, Apr. 1997, pp. 1694–1699.
[6] J. Woodfill, B. von Herzen, and R. Zabih, Frame-rate robust stereo on a PCI board. Available: http://www.cs.cornell.edu/rdz/Papers/Archive/fpga.pdf

[7] M. Kuhn, S. Moser, O. Isler, F. K. Gurkaynak, A. Burg, N. Felber, H. Kaeslin, and W. Fichtner, "Efficient ASIC implementation of a real-time depth mapping stereo vision system," in Proc. IEEE Int. Circuits Syst. (MWSCAS '03), Cairo, Egypt, vol. 3, Dec. 2003, pp. 1478–1481.
[8] S. Kimura, T. Shinbo, H. Yamaguchi, E. Kawamura, and K. Nakano, "A convolver-based real-time stereo machine (SAZAN)," in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit., Fort Collins, CO, vol. 1, Jun. 1999, pp. 457–463.
[9] A. Darabiha, J. Rose, and W. J. Maclean, "Video-rate stereo depth measurement on programmable hardware," in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit., Madison, WI, vol. 1, Jun. 2003, pp. 203–210.
[10] Y. Jia, X. Zhang, M. Li, and L. An, "A miniature stereo vision machine (MSVM-III) for dense disparity mapping," in Proc. 17th Int. Conf. Pattern Recognit., Cambridge, U.K., vol. 1, Aug. 2004, pp. 728–731.
[11] W. J. Dally and S. Lacy, "VLSI architecture: Past, present, and future," in Proc. 20th Anniv. Conf. Adv. Res. VLSI, Atlanta, GA, Mar. 1999, pp. 232–241.
[12] J. Diaz, E. Ros, F. Pelayo, E. M. Ortigosa, and S. Mota, "FPGA-based real-time optical-flow system," IEEE Trans. Circuits Syst. Video Technol., vol. 16, no. 2, pp. 274–279, Feb. 2006.
[13] M. Z. Brown, D. Burschka, and G. D. Hager, "Advances in computational stereo," IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 8, pp. 993–1008, Aug. 2003.
[14] D. Scharstein and R. Szeliski, "A taxonomy and evaluation of dense two-frame stereo correspondence algorithms," Int. J. Comput. Vision, vol. 47, pp. 7–42, 2002.
[15] L. D. Stefano, "A PC-based real-time stereo vision system," Mach. Graphics Vision Int. J., vol. 13, no. 3, pp. 197–220, Jan. 2004.
[16] M. Sonka, V. Hlavac, and R. Boyle, "3-D vision, geometry," in Image Processing, Analysis, and Machine Vision, 2nd ed. Pacific Grove, CA: PWS Publishing, 1999, ch. 11.
[17] R. I. Hartley and A. Zisserman, "Epipolar geometry and the fundamental matrix," in Multiple View Geometry. Cambridge, U.K.: Cambridge University Press, 2000, ch. 9.
[18] D. A. Forsyth and J. Ponce, Computer Vision: A Modern Approach, 1st ed. Englewood Cliffs, NJ: Prentice Hall, 2002.
[19] R. Zabih and J. Woodfill, "Non-parametric local transforms for computing visual correspondence," in Proc. 3rd Eur. Conf. Comput. Vision, Stockholm, Sweden, May 1994, pp. 151–158.
[20] M. H. Lin and C. Tomasi, "Surfaces with occlusions from layered stereo," IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 8, pp. 1073–1078, Aug. 2004.
[21] G. Egnal and R. P. Wildes, "Detecting binocular half-occlusions: Empirical comparisons of five approaches," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 8, pp. 1127–1133, Aug. 2002.
[22] K. Mühlmann, D. Maier, J. Hesser, and R. Männer, "Calculating dense disparity maps from color stereo images, an efficient implementation," Int. J. Comput. Vision, vol. 47, nos. 1–3, pp. 79–88, Apr. 2002.
[23] R. Yang, M. Pollefeys, and S. Li, "Improved real-time stereo on commodity graphics hardware," in Proc. Conf. Comput. Vision Pattern Recognit., Jun. 2004, pp. 36–43.
[24] C. L. Zitnick and T. Kanade, "A cooperative algorithm for stereo matching and occlusion detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 7, pp. 675–684, Jul. 2000.
[25] Middlebury stereo vision page [Online]. Available: http://vision.middlebury.edu/stereo/
[26] D. Murray and J. Little, "Using real-time stereo vision for mobile robot navigation," Auton. Robots, vol. 8, no. 2, pp. 161–171, 2000.
[27] K. Mühlmann, D. Maier, J. Hesser, and R. Männer, "Calculating dense disparity maps from color stereo images, an efficient implementation," Int. J. Comput. Vision, vol. 47, pp. 79–88, 2002.

Seunghun Jin received the B.S. and M.S. degrees in electrical and computer engineering from Sungkyunkwan University, Suwon, Korea, in 2005 and 2006, respectively. He is currently working toward the Ph.D. degree at Sungkyunkwan University. His research interests include image and speech signal processing, embedded systems, and real-time applications.


Junguk Cho received the B.S., M.S., and Ph.D. degrees from the School of Information and Communication Engineering, Sungkyunkwan University, Suwon, Korea, in 2001, 2003, and 2006, respectively. From 2006 to 2007, he was a Research Instructor in the School of Information and Communication Engineering, Sungkyunkwan University. In 2008, he was with the Department of Computer Science and Engineering, University of California, San Diego, as a Postdoctoral Scholar. His research interests include motion control, image processing, and embedded systems.

Xuan Dai Pham received the B.S. and M.S. degrees in computer science from the University of Hochiminh City, Hochiminh City, Vietnam, in 1994 and 2001, respectively, and the Ph.D. degree in electrical and computer engineering from Sungkyunkwan University, Suwon, Republic of Korea, in 2008. In 2008, he was with the Information Technology Faculty, Saigon Institute of Technology, Hochiminh City, Vietnam, as a Lecturer and a Researcher. His research interests include image processing and computer vision.

Kyoung Mu Lee received the B.S. and M.S. degrees in control and instrumentation engineering from Seoul National University, Seoul, Korea, in 1984 and 1986, respectively, and the Ph.D. degree in electrical engineering from the University of Southern California, Los Angeles, in 1993. From 1994 to 1995, he was with Samsung Electronics Company Limited, Korea as a Senior Researcher. Since September 2003, he has been with the Department of Electrical and Computer Engineering, Seoul National University, as a Professor. His primary research interests include computer vision, image understanding, pattern recognition, robot vision, and multimedia applications.

Sung-Kee Park received the B.S. and M.S. degrees in mechanical engineering from Seoul National University, Seoul, Korea, in 1987 and 1989, respectively, and the Ph.D. degree from the Department of Automation and Design Engineering, Korea Advanced Institute of Science and Technology, Daejeon, Korea, in 2000. Since 2000, he has been with the Center for Intelligent Robotics, Korea Institute of Science and Technology, Seoul, Korea, as a Research Scientist. His current research interests include computer vision, robot vision, and intelligent robots.

Munsang Kim received the B.S. and M.S. degrees in mechanical engineering from Seoul National University, Seoul, Korea, in 1980 and 1982, respectively, and the Ph.D. degree in engineering robotics from the Technical University of Berlin, Berlin, Germany, in 1987. Since 1987, he has been with the Center for Intelligent Robotics, Korea Institute of Science and Technology, Seoul, Korea, as a Research Scientist. He has led the Advanced Robotics Research Center since 2000 and has been the Director of the Intelligent Robotics Development Program, Frontier 21 Program, since 2003. His current research interests are the design and control of novel mobile manipulation systems, haptic device design and control, and sensor applications for intelligent robots.

Jae Wook Jeon (S’82–M’84) received the B.S. and M.S. degrees in electronics engineering from Seoul National University, Seoul, Korea, in 1984 and 1986, respectively, and the Ph.D. degree in electrical engineering from Purdue University, West Lafayette, IN, in 1990. From 1990 to 1994, he was a Senior Researcher at Samsung Electronics, Suwon, Korea. In 1994, he joined the School of Electrical and Computer Engineering, Sungkyunkwan University, Suwon, Korea, as an Assistant Professor, where he is currently a Professor. His research interests include robotics, embedded systems, and factory automation.
