FPGA-based Rectification of Stereo Images

João Rodrigues¹, João Canas Ferreira²

¹ PhD Student, FEUP

² Assistant Professor, DEEC, FEUP

[email protected], [email protected]

Abstract. In order to obtain depth perception in computer vision, one needs to process pairs of stereo images. This process is computationally challenging to carry out in real time, as it requires searching for matches between objects in both images. The search is significantly simplified if the images are rectified first, so that objects are horizontally aligned across the two images. The rectification of stereo images involves steps with very different computational requirements, which are therefore not usually implemented in the same system: 2D searches for high-fidelity matches, precise matrix calculations, and fast pixel-coordinate transformations and interpolations. In this project, the complete process is implemented on a single Spartan-3 FPGA, taking advantage of a MicroBlaze soft core for slow but precise calculations and of fast dedicated hardware for the real-time requirements. The implemented system successfully performs real-time rectification of the images from two video cameras, with a resolution of 640 x 480 pixels at a frame rate of 25 FPS, and is easily configured for videos with higher resolutions. The results are quite satisfactory, with output images having a maximum vertical disparity of 2 pixels, showing that stereo image rectification can be achieved efficiently on a low-resource FPGA (64 KB of instruction memory).

1. Introduction

Depth information about objects in an image is essential in applications such as video surveillance, military systems, cinematography, robotics and even some medical fields. This information is commonly recovered through stereo vision, which uses two images and triangulation to determine the depth of the represented objects. Triangulation between two cameras requires finding, in the image of one camera, the point or object corresponding to a point viewed by the other. In most camera configurations, finding these correspondences requires a two-dimensional search, but it becomes a one-dimensional search if the images are rectified beforehand. In rectified images, objects have the same vertical position in both images, so finding the same point in a pair of images only requires looking along a horizontal line. Even an imperfect rectification is very useful: the more precise the rectification, the smaller the search area for correspondences can be.

The rectification process is normally divided into two main phases: calculation of the required transformation matrices and application of those transformations to the images. Several authors, such as Richard Hartley [Hartley 1999] and Andrea Fusiello [Fusiello 2000], have proposed methods and provided a solid mathematical background for the calculation of the necessary transformations, which are represented as two 3 x 3 matrices, one for each camera.

Some image rectification systems have been proposed and implemented before, such as the MSVM-III by Jia et al. [Jia et al. 2004] and an auto-rectification system based on the IC3D platform by Xinting Gao [Gao et al. 2008]. However, these systems have specific restrictions that make them unusable in many situations. For example, the MSVM-III requires the transformation matrix to be given, and the IC3D-based system performs rectification only with image translations, without rotation or scaling.

Our main goal is to implement both phases in an FPGA-based system that outputs the rectified video streams of two cameras in real time. The next section describes the implementation and the chosen methods. Sections 3 and 4 present the results along with brief conclusions about the proposed system.

2. Implementation

Every method and algorithm described was implemented on an FPGA, either in C on a MicroBlaze soft core or directly in Verilog. The whole system thus fits on a single board, making it more useful and portable. The video streams to be rectified are obtained from a stereo kit with two CMOS sensors with a resolution of 640 x 480 pixels.

Several methods to compute the required image transformations are described in the literature [Hartley 1999] [Fusiello 2000]. We implemented the most general one, in which every image parameter except lens distortion is rectified. This method is based on epipolar geometry and is recommended for arbitrary camera placement, with non-coplanar objects in view relatively near the cameras and therefore exhibiting high horizontal disparity. This requirement is important because it provides valuable spatial information, which allows the rotations needed for both images to be estimated accurately. An example of a scenario where this method should not be applied is satellite photography (e.g., Google Maps): the objects (e.g., houses) are practically coplanar, lying on the plane of the Earth's surface. In such cases, where almost no 3D information exists, a simpler rectification method should be used, like the one described and implemented by Jia [Jia et al. 2004].

The chosen method consists of finding enough correspondences between the images and then applying an iterative least-mean-squares procedure to minimize the error of the estimated matrices. These steps do not have real-time requirements, but demand high precision, and are therefore implemented in C on the MicroBlaze. The calculated matrices are then applied to both video streams in real time: an FPGA-based bilinear interpolation was implemented in Verilog to estimate and reconstruct the rectified video streams.

The full system is shown in Figure 1. The auxiliary modules are represented in white, the supporting hardware modules in red, and the three main phases in blue and green. The three steps needed to implement a complete rectification process are described next.

Figure 1. Global description of the system.

2.1. Correspondence Problem

The correspondence problem consists of finding points in one image and the corresponding points in the other image. Because of the limited instruction memory available on the FPGA (64 KB), an advanced correspondence method was not feasible. Instead, the problem was simplified by dividing it into a set of small weighting functions, applied in the following order:

1. Every 5 x 5 pixel block is analysed with a new non-linear algorithm and the best candidates are chosen. The algorithm computes a block's quality as the larger of two values: the sum over the pixels darker than the central pixel and the sum over the pixels lighter than it (a sketch of this metric follows the list). Pixels with a luminosity value similar to the central pixel are ignored in order to suppress the effect of noise. This algorithm proved very efficient at detecting blocks with interesting features such as corners.

2. The images are divided into zones (80 x 60 pixels), and the best candidates from each zone are selected. This division is important to obtain correspondence matches throughout the entire image, improving the precision of the transformation matrices.

3. For each candidate, a search for a match is performed in the other image. The search area is defined as a rectangular block around the same coordinates and is iteratively reduced around the epipolar line. The similarity of the candidate blocks between both images is calculated from a weighted combination of simple functions:
   • a linear comparator, sensitive to luminosity but susceptible to noise;
   • a non-linear comparator, similar to the one used previously, sensitive to feature changes and insensitive to noise;
   • the distance to the estimated epipolar line;
   • as a tie-breaker, the block quality calculated previously.

4. The matches for each candidate are refined, using the same algorithms but with a block size of 15 x 15 pixels.

5. The uniqueness of each candidate is calculated, which represents the trust in the correspondence. This eliminates errors caused by patterns and repetitive textures, giving more importance to unique features. It is calculated for each candidate as the difference between the confidence of the first match and that of the second, resulting in a high uniqueness when there is only one clear match.

6. For each zone of the reference image, only the two best (highest-uniqueness) correspondences are kept, producing the final correspondence pairs.
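To make step 1 concrete, the following C sketch shows one way the block-quality metric could be computed. It is only an illustration under our own assumptions, not the original MicroBlaze code: the function name, the noise margin and the use of summed intensity differences (rather than pixel counts) are ours.

#include <stdint.h>

#define NOISE_MARGIN 8  /* assumed threshold for "similar to the central pixel" */

/* Quality of the 5 x 5 block centred at (cx, cy): the larger of the summed
   differences of the pixels darker than the centre and of the pixels
   lighter than it; near-equal pixels are ignored to suppress noise. */
static uint32_t block_quality(const uint8_t *img, int stride, int cx, int cy)
{
    uint8_t centre = img[cy * stride + cx];
    uint32_t darker = 0, lighter = 0;

    for (int dy = -2; dy <= 2; dy++) {
        for (int dx = -2; dx <= 2; dx++) {
            int diff = (int)img[(cy + dy) * stride + (cx + dx)] - centre;
            if (diff > NOISE_MARGIN)
                lighter += (uint32_t)diff;
            else if (diff < -NOISE_MARGIN)
                darker += (uint32_t)(-diff);
        }
    }
    return darker > lighter ? darker : lighter;
}

A block scores high when many of its pixels differ markedly from the centre in one direction, which is why corners and similar features stand out.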

2.2. Transformation Matrix

The transformation required to rectify the video is given in the form of a 3 x 3 matrix for each camera. The coordinates of at least eight correspondence pairs are needed to estimate the epipolar geometry and the transformation matrices. The epipolar geometry is estimated with the well-known eight-point algorithm, and the result is used to refine the correspondences until they stop changing for a few iterations. The algorithm iteratively repeats from step 3 until a minimum number of unique, high-quality matching blocks is found. If the images contain little spatial information, as explained before, and the algorithm cannot find enough matching blocks, the system grabs new images from the cameras and restarts from step 1.

In this project we want to know, for each coordinate of the final rectified images, the coordinate at which to interpolate in the captured (unrectified) images. For this we calculate a different matrix for each camera, using Equation 1:

H = D \cdot [C \cdot T \cdot G \cdot R]^{-1} \cdot N \qquad (1)
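As an illustration of Equation 1, the following C sketch composes the per-camera homography from its component matrices. All names (mat3, mul3, inv3, rect_homography) are ours, and the use of double precision is an assumption; the original MicroBlaze code may be organized differently.

typedef struct { double m[3][3]; } mat3;

/* 3 x 3 matrix product */
static mat3 mul3(mat3 a, mat3 b)
{
    mat3 r;
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++) {
            r.m[i][j] = 0.0;
            for (int k = 0; k < 3; k++)
                r.m[i][j] += a.m[i][k] * b.m[k][j];
        }
    return r;
}

/* 3 x 3 inverse via the adjugate matrix divided by the determinant */
static mat3 inv3(mat3 a)
{
    mat3 r;
    double det = 0.0;
    for (int i = 0; i < 3; i++)
        det += a.m[0][i] * (a.m[1][(i+1)%3] * a.m[2][(i+2)%3]
                          - a.m[1][(i+2)%3] * a.m[2][(i+1)%3]);
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++)
            r.m[i][j] = (a.m[(j+1)%3][(i+1)%3] * a.m[(j+2)%3][(i+2)%3]
                       - a.m[(j+1)%3][(i+2)%3] * a.m[(j+2)%3][(i+1)%3]) / det;
    return r;
}

/* H = D * inv(C*T*G*R) * N, as in Equation 1 */
mat3 rect_homography(mat3 D, mat3 C, mat3 T, mat3 G, mat3 R, mat3 N)
{
    return mul3(mul3(D, inv3(mul3(mul3(mul3(C, T), G), R))), N);
}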

The matrix H represents the final transformation to be applied to the video stream. N and D are the normalizing and denormalizing matrices; they map the coordinates into the [-1, 1] range, improving the precision of the eight-point algorithm as described by Hartley [Hartley 1999]. R and G are the matrices of the same name described by Hartley [Hartley 1999]: they send the epipole of the image to the point at infinity on the horizontal axis, making the epipolar lines horizontal and parallel to each other. They are calculated using the eight-point algorithm and a C implementation of a homogeneous-equation solver based on Singular Value Decomposition (SVD). Performing the SVD on the coordinates of the correspondence pairs yields the fundamental matrix, which describes the epipolar geometry of the images. A further SVD on this matrix and on its transpose yields the coordinates of the epipoles of both images. These matrices have the form

R = \begin{bmatrix} \cos\theta & \sin\theta & 0 \\ -\sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{bmatrix}
\quad\text{and}\quad
G = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ -1/f & 0 & 1 \end{bmatrix}
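A minimal C sketch of how R and G could be built from an epipole (ex, ey) given in inhomogeneous coordinates follows; it assumes the epipole is not at the origin, and the function name is ours.

#include <math.h>
#include <string.h>

void build_R_G(double ex, double ey, double R[3][3], double G[3][3])
{
    double theta = atan2(ey, ex);   /* angle of the line epipole-origin */
    double f = hypot(ex, ey);       /* distance of the epipole to the origin */
    double c = cos(theta), s = sin(theta);

    /* R rotates the epipole onto the positive x-axis, to (f, 0, 1) */
    double Rv[3][3] = {{ c, s, 0 }, { -s, c, 0 }, { 0, 0, 1 }};
    /* G then sends (f, 0, 1) to the point at infinity (f, 0, 0) */
    double Gv[3][3] = {{ 1, 0, 0 }, { 0, 1, 0 }, { -1.0 / f, 0, 1 }};

    memcpy(R, Rv, sizeof Rv);
    memcpy(G, Gv, sizeof Gv);
}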

where f is the distance of the epipole from the origin and θ is the angle of the line passing through the epipole and the origin.

T is a matrix of scaling and vertical translation that makes the epipolar lines coincide between both images. To calculate it, we apply the previous matrices to the original coordinates and then find k and d such that Y·k + d = Y′, where Y and Y′ are the vertical coordinates of a candidate and of its match, k is the scaling factor and d the vertical translation. This is equivalent to Y·k − Y′ + d = 0, a homogeneous system solvable by SVD. A sketch of this fit follows.
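The paper solves this small system by SVD, like the others; for brevity, this C illustration uses the equivalent closed-form least-squares solution instead, and the function name is ours.

#include <stddef.h>

/* Find k and d minimizing the squared error of y[i]*k + d = yp[i]. */
void fit_vertical_scale(const double *y, const double *yp, size_t n,
                        double *k, double *d)
{
    double sy = 0, syp = 0, syy = 0, syyp = 0;

    for (size_t i = 0; i < n; i++) {
        sy   += y[i];
        syp  += yp[i];
        syy  += y[i] * y[i];
        syyp += y[i] * yp[i];
    }
    double nn = (double)n;
    *k = (nn * syyp - sy * syp) / (nn * syy - sy * sy);
    *d = (syp - *k * sy) / nn;
}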

C is a matrix that maximizes the visibility of the area common to the images of both videos. It is very useful for stereoscopy, since only the common area can be analyzed. This matrix is the same for both cameras and consists only of a scaling factor and a translation along both axes.

The matrix calculations described above were simulated and proved very reliable. In the simulations, a list of random pairs of variable size was created and then gently distorted with a given random matrix. Using the described methods, that matrix was successfully recovered from the distorted pairs alone. To make the simulations more realistic, the distorted coordinates were rounded to the nearest integer; this step alone introduced the errors in the recovered coordinates reported in Table 1. As can be seen, the rectification process improves when the correspondences found are dispersed across the image and located at various depths.

Table 1. Simulated precision using different numbers of pairs

Number of pairs:                 9           13           25           100          250
Maximum error in pixels of:
  Almost coplanar image       50 - 57     13 - 16      7 - 11       1.7 - 2.3    0.2 - 0.4
  Depth-rich image            10 - 14     4 - 6        1.8 - 2.9    0.6 - 0.75   0.1 - 0.25
  Dispersed points            5 - 8       2.0 - 2.9    1.5 - 2.6    0.5 - 0.7

Since the algorithms developed for the correspondence problem already address these issues, the method was implemented on the FPGA. To obtain the best possible precision without taking too much time, the more complex mathematical functions are performed in an auxiliary support module.

2.3. Rectification

Unlike the previous steps, the system must apply the calculated transformation matrices to both videos in real time, at 25 frames per second with 640 x 480 pixels per frame. In this project a bilinear interpolation method was chosen to reconstruct the rectified images, but other methods can easily be used. This interpolation produced much better-looking video than no interpolation at all. The FPGA implementation of this process consists of the following steps (a behavioural sketch follows the list):

• For each coordinate in [0-639; 0-479], multiply it by the transformation matrix. The result is a homogeneous coordinate of the point to interpolate from the source images.
• Transform the homogeneous coordinates into Cartesian coordinates. This requires a division.
• Read the four nearest pixels surrounding the calculated coordinates and perform the bilinear interpolation.
• Send the rectified images to the monitor, memory, or another output.
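The hardware for these steps was written in Verilog; the C function below is only a behavioural model of the per-pixel data path, written under our own assumptions (grayscale frames, row-major layout, floating point instead of the hardware's arithmetic, our function name), to make the sequence of operations explicit.

#include <stdint.h>
#include <math.h>

/* For each output pixel, map its coordinate through H, convert the
   homogeneous result to Cartesian (the division step), and bilinearly
   interpolate the four surrounding source pixels. */
void rectify_frame(const uint8_t *src, uint8_t *dst,
                   int w, int h, const double H[3][3])
{
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            /* homogeneous transform of (x, y, 1) */
            double u = H[0][0]*x + H[0][1]*y + H[0][2];
            double v = H[1][0]*x + H[1][1]*y + H[1][2];
            double s = H[2][0]*x + H[2][1]*y + H[2][2];
            u /= s;
            v /= s;

            int x0 = (int)floor(u), y0 = (int)floor(v);
            double fx = u - x0, fy = v - y0;

            dst[y*w + x] = 0;  /* source coordinate falls outside the frame */
            if (x0 < 0 || y0 < 0 || x0 + 1 >= w || y0 + 1 >= h)
                continue;

            /* read the four nearest pixels and blend them */
            double p00 = src[y0*w + x0],       p01 = src[y0*w + x0 + 1];
            double p10 = src[(y0+1)*w + x0],   p11 = src[(y0+1)*w + x0 + 1];
            double val = p00*(1-fx)*(1-fy) + p01*fx*(1-fy)
                       + p10*(1-fx)*fy     + p11*fx*fy;
            dst[y*w + x] = (uint8_t)(val + 0.5);
        }
    }
}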

3. Results

The FPGA used for the implementation was a Xilinx XC3S1500, but the system is adaptable and can easily be ported to other FPGAs or to cameras with different characteristics. The complete process was successfully implemented, meeting the timing requirements thanks to the parallelism offered by FPGAs.

The proposed correspondence algorithm was thoroughly tested on the development system. The kit's cameras had significant blur and lens distortion; even so, the algorithm was capable of detecting enough correspondences of good quality: it showed less than 15% error in the correspondence pairs, a figure that usually became much lower after a few iterations.

Figure 2. Images taken and displayed using the new method: (a) unrectified; (b) rectified.

A new way of evaluating pairs of images was used to analyse the results. It is based on a bi-colour image in which each source image fills a different colour: red or blue. This allows us to easily compare the positions of objects in both images, and thus the precision of the rectification method. It also allows the images to be viewed with coloured 3D glasses, confirming the increase in quality after rectification.

Although the cameras are mounted in a stereoscopy kit and are visually aligned, the original images are clearly unrectified, as can be seen in Figure 2. The calculation of the matrices achieved the precision expected from the simulations, and the interpolation produced a visually lossless rectified image. When about 50 good point pairs were found, the system rectified the videos with a maximum error of 1 pixel. In general, there were always between 30 and 60 pairs, with 1 or 2 bad correspondences, and the resulting error was below 2 pixels, as seen in Figure 2.

4. Conclusion

The implemented algorithms are capable of rectifying the videos in real time with good precision. The maximum error of 2 pixels is small enough to practically reduce the correspondence search area to a line, which makes the system very useful for stereoscopy. We showed that a reliable stereo image rectification process can be implemented on an FPGA using only 64 KB of memory. This means, for example, that a cheap personal 3D camcorder could easily be built, saving the rectified 3D video in real time.

References

Fusiello, A. (2000). Epipolar rectification. http://profs.sci.univr.it/~fusiello/rectif_cvol/rectif_cvol.html.

Gao, X., Kleihorst, R., and Schueler, B. (2008). Implementation of auto-rectification and depth estimation of stereo video in a real-time smart camera system. In Computer Vision and Pattern Recognition Workshops, pages 1–7, Anchorage, AK.

Hartley, R. I. (1999). Theory and practice of projective rectification. International Journal of Computer Vision, 35(2):115–127.

Jia, Y., Zhang, X., Li, M., and An, L. (2004). A miniature stereo vision machine (MSVM-III) for dense disparity mapping. In ICPR '04: Proceedings of the 17th International Conference on Pattern Recognition.
