Thesis for the degree of Master of Science

Online Stereo Calibration using FPGAs
Niklas Pettersson

Division of Physical Resource Theory - Complex Systems
Chalmers University of Technology
Göteborg, Sweden, 2005

© Niklas Pettersson 2005

Division of Physical Resource Theory - Complex Systems
Chalmers University of Technology
SE-412 96 Göteborg, Sweden
Telephone +46 (0)31-772 1000

Chalmers Reproservice
Göteborg, Sweden, 2005

Abstract

Stereo vision is something that most people do every day without even realizing it. As we walk around our environment, our two eyes constantly take in a pair of images of the world. Our brain fuses these two images and unconsciously computes the approximate depth of the objects we see around us. In the field of computer vision, our two eyes are replaced by two cameras and our brain is replaced by a computer. However, the aim of stereopsis remains the same: given a pair of stereo images, we want to compute the scene depth.

With human vision, in order to calculate scene depth, the brain needs to know how far apart the eyes are and what orientations they are in. Similarly, in computer vision, we need to know the translation and rotation between the cameras before we can calculate scene depth. Determining this translation and rotation is known as stereo calibration.

This thesis deals with the problem of stereo calibration. More specifically, given a continuous stream of image pairs, we want to perform stereo calibration in real time. However, due to the computational intensity of these calibration calculations, today's computers are not yet fast enough to do this in real time. In order to solve this problem, we suggest using programmable logic, i.e. Field Programmable Gate Arrays (FPGAs), for parts of the calibration process. The reason for using programmable logic is that many of the steps in the stereo calibration algorithm can be done independently of each other. This means that a parallel approach, such as using FPGAs, has a clear advantage. By implementing steps of the algorithm in parallel on FPGAs, we can reduce the computational time needed per frame and achieve real-time performance.

The resulting stereo calibration achieved in this thesis is very accurate. For example, we can determine the verge angle between the two cameras to within one degree, with the system running at speeds exceeding the frame rate needed for real-time performance. Hence, this thesis clearly shows the advantage of using FPGAs to solve computationally intensive computer vision tasks.

Keywords: FPGA, Computer Vision, Multiple View Geometry, Stereo Vision, Essential Matrix

Acknowledgements

I would like to thank my supervisor in Australia, Lars Petersson, for all his support throughout this work. Without his help and continuous support, this would never have been possible. Thanks also to Kristian Lindgren, my supervisor in Sweden, for his advice and guidance. Kristy, thanks for all the support; without your superior knowledge of the English language, I would be lost. I would also like to thank Andrew Dankers for help with the data acquisition.

Niklas Pettersson
Somewhere over the Pacific Ocean, 2005


Contents

1 Introduction
  1.1 Motivation
  1.2 What is Computer Vision
  1.3 Problem specification
  1.4 Contributions
  1.5 Roadmap

2 Previous work
  2.1 Automated Stereo Calibration
    2.1.1 Real Time Motion and Stereo Cues for Active Visual Observers
  2.2 Related work on FPGAs
    2.2.1 Edge detection, using the Sobel operator
    2.2.2 Convolution
    2.2.3 General Image processing
    2.2.4 Calculation of Arctan in hardware

3 Theory
  3.1 Stereo Vision
    3.1.1 Notation
    3.1.2 2D Projective Transformations
    3.1.3 Stereo and Epipolar Geometry
    3.1.4 The Essential Matrix
    3.1.5 The Sobel edge detector
    3.1.6 RANSAC
  3.2 Convolutions
  3.3 Kalman filtering
  3.4 Field Programmable Gate Arrays (FPGAs)
    3.4.1 Programming language, VHDL

4 Implementation
  4.1 System Overview
    4.1.1 Experimental setup
    4.1.2 CeDAR head
    4.1.3 Frame grabber
    4.1.4 Videoserver
  4.2 Gaussian Pyramid and Local min/max detection
    4.2.1 Gaussian stage
    4.2.2 Linebuffers
    4.2.3 Convolutions
    4.2.4 Choosing the size of the kernel
    4.2.5 Thoughts on discretisation of Gaussian filter
  4.3 Sobel Operator
    4.3.1 First step of implementation
    4.3.2 Second step of implementation
  4.4 Finding and matching points in images
    4.4.1 Feature detection
    4.4.2 Keypoint descriptor extraction
    4.4.3 Keypoint matching
    4.4.4 Profiling of SIFT in C-implementation
  4.5 Stereo calibration
    4.5.1 Calculating the Essential Matrix
    4.5.2 Dealing with ambiguity
    4.5.3 Calculating the vergence angle

5 Results
  5.1 Experimental setup
    5.1.1 Indoor, lab
    5.1.2 Outdoor, car
  5.2 Stereo calibration
    5.2.1 Moving Cameras
    5.2.2 Fixed Cameras
    5.2.3 Using RANSAC
  5.3 Distribution of matched keypoints

6 Conclusions
  6.1 Feature detector
  6.2 Computer Vision in FPGAs
  6.3 Future work

A Paper presented at the IEEE Intelligent Vehicles Symposium, Las Vegas, 2005

Chapter 1

Introduction

The aim of this chapter is to introduce the reader to the vast area that is known as Computer Vision. We also give an outline of the thesis so that the reader will get a feeling for what is to come. The thesis is written in such a way that it can be read on various levels, the top one being quite broad, whilst the bottom level gives a detailed view of how things have been implemented.

1.1 Motivation

The work conducted in computer vision research has many applications. For example:

- Robotics and autonomous agents. In robotics, the aim of the vision system is to gather as much information as possible using the least number of sensors and the lowest computing power. Examples of autonomous agents include the Mars rover (NASA), vehicles in the DARPA Grand Challenge and various humanoid robots.

- Driver assistance. As of yet, we cannot create an autonomous driver that gets even close to a human in performance. Still, we know that a human driver is far from perfect. In the area of Driver Assistance, we aim to improve the driver using advanced driver assistance systems (ADAS). These systems include pedestrian detection using cameras, sign detection, sign recognition and lane departure warning.

- Entertainment. Computer vision techniques can also be used to create 3D maps of interesting places such as historic buildings and tourist attractions. This lets you virtually visit places you never would have visited otherwise.

The work in this thesis has applications in the area of Driver Assistance.


1.2 What is Computer Vision

As humans we are equipped with five senses - sight, hearing, taste, touch and smell. Out of these five senses, most of us rely most heavily on our sight. We use our eyes to recognise objects and friends, read text, and see where we put our feet in order not to fall over. The simple task of dressing becomes so much more difficult if you try it without using your eyes. The area of computer vision aims to give computers the ability to "see". In the foreword of [10], Olivier Faugeras says this about computer vision:

Making a computer see was something that leading experts in the field of Artificial Intelligence thought to be at the level of difficulty of a summer student's project back in the sixties. Forty years later the task is still unsolved and seems formidable. A whole field, called Computer Vision, has emerged as a discipline in itself with strong connections to mathematics and computer science and looser connections to physics, the psychology of perception and the neurosciences.

In computer vision, a camera is analogous to the human eye whilst the computer itself is analogous to the brain. The camera takes pictures of the world, which are then fed into the computer for processing and interpretation. There are many ways in which these images may be processed and interpreted. These include:

- Segmentation. From a single image, it is easy for a human to see where one object ends and another begins. In computers, this problem is often solved using edges and differences in colours and textures.

- Tracking. Given an image sequence, computer vision techniques may also be used to track a particular object over several frames. It is also possible to calculate the object's velocity and distance from the camera as well as predict the future path of the tracked object.

- Detection/Classification/Recognition. Humans can easily detect the presence of certain objects. Further, humans can classify these objects as being tables, chairs, humans etc. Also, a particular object can be recognised as your favourite chair or perhaps a friend.

- Stereo Vision. By using several images of a stationary scene, we can find matching points between the images, determine the camera positions and orientations and reconstruct the three-dimensional shapes in the scene.

This is only a subset of the vast area of computer vision. Henceforth, we concentrate on the area of stereo vision.


1.3 Problem specification

Suppose we have a scene, and we have two cameras - a left camera and a right camera. These cameras both take a picture of the scene simultaneously. In doing so, we go from a 3D scene to a pair of 2D images. In other words, one dimension is lost.

Figure 1.1: The top two images are the input images from the left and right cameras. From these two input images, the bottom image can be calculated. This image is known as a depth map or a disparity map. A lighter shade of grey indicates that an area is closer to the camera while a darker shade indicates an area that is further away. These images were taken from [6].

The aim of stereo vision is to recover this third dimension. More specifically, given the left and right images, we want to calculate the scene depth. In order to calculate depth, the images must first be processed in a certain way, so that the search for an object in the left image can be simplified to a search along a horizontal line in the right. The difference in position along that horizontal line is called the disparity. This image processing step is usually referred to as "rectification". To rectify a pair of images, we need to

(i) find matching image points in the pair of images,


(ii) calculate something called the essential matrix (described in more detail in Chapter 3), and

(iii) use the essential matrix to "warp" or rectify the images into the desired form.

Much research has been done on step (iii). In this thesis, we concentrate on the first two steps.

Now, in the application we are interested in (i.e. driver assistance), we are continuously capturing new pairs of stereo images. This means that we need to calculate our depth maps from these stereo images in real time. In this application, the cameras can move actively, or due to vibrations or other external factors. Hence, we also have to find the matching image points, calculate the essential matrix and rectify the images in real time.

To achieve this real-time performance, we observed that a lot of the computations could be reformulated and implemented directly in hardware. Normally, these computations are done in a serial fashion in a standard computer. However, these computations can actually be done in parallel since they are independent. By implementing them in dedicated hardware, it is possible to perform many operations simultaneously.

Thus, the problem may be stated as follows: given a pair of images of a scene, we want to find matching image points and calculate the essential matrix in real time by implementing many of these computations in dedicated hardware.

1.4 Contributions

The task of finding matching image points and calculating the essential matrix is also called stereo calibration. Our contribution is to formulate this problem in such a way that it is possible to perform these calculations in real time. These calculations are naturally quite complex and computationally intensive. Our suggestion is to use computer vision algorithms implemented in dedicated hardware to accelerate this process.

The implementation has been done using VHDL, a commonly used hardware description language, to program an FPGA. An FPGA is an integrated circuit in which you can combine the gates and logic to perform specific tasks. This is also referred to as programmable logic. An FPGA is more versatile than a microcontroller or a digital signal processor (DSP), but also more difficult to program.


The approach of using dedicated hardware to perform automated stereo calibration has to our knowledge not been done before.

1.5 Roadmap

Chapter 2 gives a brief survey of related work. Chapter 3 introduces the reader to the notation used as well as the theory necessary for understanding the later parts of the thesis. This chapter also presents a model for calculating the essential matrix in the case of a stereo head with three degrees of freedom. We provide details of our implementation in Chapter 4. Section 4.1.1 gives an overview of the hardware and software used in the experimental setup. Sections 4.2 and 4.3 discuss issues related to the hardware implementation of the Gaussian pyramid and the Sobel filter respectively. The Gaussian pyramid and Sobel filter are used as parts of an algorithm for detecting points in the left and right stereo images. In Section 4.4.4, we provide a complexity analysis of the algorithm used to detect these feature points. We also describe how matches between these feature points are found. In Section 4.5, we show how the point correspondences are used to calculate the essential matrix. Chapter 5 presents our experimental results, and Chapter 6 concludes with a discussion of future work and improvements.


Chapter 2

Previous work

This chapter gives some background on the work related to this thesis. Section 2.1 discusses previous work on automated stereo calibration. In Section 2.2, we give references to some previous work on the implementation of computer vision algorithms in hardware such as Field Programmable Gate Arrays (FPGAs). As of yet, no work combining the two fields has been found.

2.1 Automated Stereo Calibration

This section aims to give a brief understanding of what has been done in the area of automated stereo calibration. A number of methods for performing the actual calibration are presented by Hartley and Zisserman in [10]. Indeed, in theory, the problem is considered to be "solved". However, in practice, it is not yet possible to do the calibration in real time, especially if you include the problem of finding and matching points between the images.

2.1.1 Real Time Motion and Stereo Cues for Active Visual Observers

In his PhD thesis [4], Mårten Björkman describes the theory involved in automatically calibrating a stereo head with three degrees of freedom. The setup he used is similar to the one described in Section 4.1.2 of this thesis. He also describes a system capable of stereo calibration in real time. A Harris corner detector [8] was used to detect feature points, while intensity correlation between image patches was used for matching. This resulted in a relatively large number of false matches. Björkman dealt with this by employing a robust iterative algorithm for estimating the Essential matrix. We believe that this method is well worth using in a future version of our system. This particular part of his thesis is also published as a paper [5].


2.2 Related work on FPGAs

Here, we sample related work in the image processing and signal processing domain that uses dedicated hardware (i.e. FPGAs).

2.2.1 Edge detection, using the Sobel operator

In an application note [2], Atmel describes the implementation of a Sobel edge detector in a small FPGA. Atmel also gives a brief introduction to the Sobel operator and an explanation of how to pipeline the implementation. The basic idea is to perform as many computations as possible in parallel. This reduces the time needed to perform each convolution. The article [16] gives another implementation of the Sobel operator; here the Sobel operator is used as a step towards performing template matching.

2.2.2 Convolution

Atmel has another interesting application note on how to efficiently perform convolutions in FPGAs, entitled "3x3 Convolver with Run-time Reconfigurable Vector Multiplier in Atmel AT6000 FPGAs" [3]. The basic idea here is to avoid multiplications by using pre-calculated look-up tables. By performing convolutions in this way, we can reduce the area used in the FPGA.

2.2.3 General Image processing

A system for FPGA-based image processing is described in [17]. The authors show ways of implementing image weight calculations and a Hough line transform, although not in any detail.

2.2.4 Calculation of Arctan in hardware

"Optimisation and implementation of the arctan function for the power domain" [1] describes how to implement the arctan function in hardware using various methods. The methods were also compared with respect to the power consumed by the FPGA. However, these comparisons were made based on the assumption that only the angle between the two vectors was important. In this thesis, we are also interested in the magnitude of the sum of the vectors. Thus, the implementation with the worst performance with regard to power was used, the CORDIC [20], since it calculates the angle and magnitude simultaneously.
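To illustrate why CORDIC suits this use case, the following is a minimal floating-point behavioural sketch of vectoring-mode CORDIC (the function name and parameter choices are illustrative; the actual hardware uses fixed-point shift-and-add stages). It returns both the magnitude and the angle of a vector in the right half-plane, which is exactly the pair of quantities we need from the gradient responses.

```python
import math

def cordic_vectoring(x, y, iterations=16):
    """Vectoring-mode CORDIC: rotate (x, y) onto the positive x-axis using
    only shift-and-add style micro-rotations.  The accumulated rotations
    give the angle, and the final x value divided by the constant CORDIC
    gain gives the magnitude.  Assumes x > 0 (right half-plane)."""
    angle = 0.0
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))
        d = -1.0 if y > 0 else 1.0            # rotate towards the x-axis
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        angle -= d * math.atan(2.0 ** -i)
    return x / gain, angle

# Example: magnitude and orientation of a gradient vector (gx, gy) = (3, 4)
mag, theta = cordic_vectoring(3.0, 4.0)        # approx. (5.0, 0.927)
```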


Chapter 3

Theory

The purpose of this chapter is to provide the reader with the background and theory necessary for understanding the later parts of this thesis. Section 3.1 introduces the mathematical notation used as well as the basic concepts of stereo vision. Sections 3.2 and 3.3 describe convolutions and the Kalman filter respectively. Finally, specific information regarding FPGAs and how to program them using VHDL is presented in Section 3.4.

3.1 Stereo Vision

As explained in Chapter 1, the aim of stereo vision is to calculate scene depth given a pair of stereo images.

3.1.1 Notation

We will use bold-face symbols, such as x, to denote a column vector. These definitions and notations are from [18] and [10].

Homogeneous coordinates

The world we live in obeys Euclidean geometry. Only Euclidean transformations such as translations and rotations are allowed. However, when we take an image of the world with a camera, the transformation mapping the 3D scene to a 2D image is not a Euclidean transformation but a projective transformation. Consequently, we need to introduce some concepts from projective geometry.

One concept from projective geometry is homogeneous coordinates. In Euclidean geometry, a 2D point is usually represented by a 2-vector x̃ = (x, y)^T. This is known as an inhomogeneous vector. In projective geometry, a 2D point is represented by a 3-vector x = (x_1, x_2, x_3)^T. This is known as a homogeneous vector. (Throughout this report, homogeneous vector quantities are denoted as x, while the corresponding non-homogeneous quantities are denoted with a tilde, as in x̃.) Homogeneous vectors are only defined up to a scale factor. Therefore, the homogeneous vectors (x_1, x_2, x_3)^T and k(x_1, x_2, x_3)^T represent the same point for any non-zero k.

If a 2D point has inhomogeneous coordinates x̃ = (x, y)^T and homogeneous coordinates x = (x_1, x_2, x_3)^T, then the relationship between the inhomogeneous and homogeneous coordinates is given by

\[ x = \frac{x_1}{x_3}, \qquad y = \frac{x_2}{x_3} \]

Note that if x_3 → 0, then x → ∞ and y → ∞. Therefore, any point with homogeneous coordinates x = (x_1, x_2, 0)^T is a point at infinity.

Skew symmetric matrix

A matrix is skew symmetric if A = −A^T. From here on we use the following notation to describe a skew symmetric matrix:

\[ [\mathbf{t}]_\times = \begin{pmatrix} 0 & -t_3 & t_2 \\ t_3 & 0 & -t_1 \\ -t_2 & t_1 & 0 \end{pmatrix} \tag{3.1} \]

This can also be thought of as a cross product between two 3-vectors a and b:

\[ \mathbf{a} \times \mathbf{b} = [\mathbf{a}]_\times \mathbf{b} = (\mathbf{a}^T [\mathbf{b}]_\times)^T \tag{3.2} \]

Inliers and Outliers

When referring to data, we need to know when a measured data point complies with the model and when it does not. The standard terminology for this is inliers and outliers respectively. Inliers are data points that are consistent with the model we are using plus some kind of Gaussian noise. Outliers, on the other hand, are data points that are inconsistent with the chosen model. For example, when you want to estimate the geometry of a camera setup, you have to be able to minimize the effect of outliers in order to get a stable result.
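As a small illustration of this notation (not part of the original derivations), the following sketch converts between inhomogeneous and homogeneous coordinates and checks the scale-invariance property:

```python
import numpy as np

def to_homogeneous(p):
    """(x, y) -> (x, y, 1)^T"""
    return np.append(np.asarray(p, dtype=float), 1.0)

def from_homogeneous(x):
    """(x1, x2, x3)^T -> (x1/x3, x2/x3); x3 = 0 would be a point at infinity."""
    return np.asarray(x[:2], dtype=float) / x[2]

# Homogeneous vectors are only defined up to scale:
x = np.array([2.0, 4.0, 2.0])
assert np.allclose(from_homogeneous(x), from_homogeneous(5.0 * x))   # both give (1, 2)
```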

3.1.2 2D Projective Transformations

A 2D projective transformation is an invertible transformation that maps points in the 2D projective space P² to points in P² in such a way that straight lines remain straight lines.



Figure 3.1: Examples of 2D projective transformations. (a) The projective transformation between the image of a plane (the end of the building) and the image of its shadow onto another plane (the ground plane). (b) The projective transformation between two images induced by a world plane. These images were taken from [10].

More precisely, a 2D projective transformation is an invertible mapping H from P² to P² such that three points x_1, x_2 and x_3 lie on the same line if and only if H(x_1), H(x_2) and H(x_3) do. A projective transformation is also called a projectivity, a collineation or a homography. Examples of such mappings can be seen in Figure 3.1.

3.1.3 Stereo and Epipolar Geometry

Epipolar geometry is the geometry relating two views of the same scene. Consider two cameras with camera centres c_l and c_r (Figure 3.2). The camera at c_l captures an image of the world, resulting in image I_l. Similarly, the camera at c_r captures an image of the world, resulting in image I_r. Consider an arbitrary point X in the scene. This scene point X will project to image point x_l in I_l. The point X will also project to image point x_r in I_r. Thus, x_l and x_r are matching, or corresponding, image points.

Figure 3.2: Epipolar geometry is the geometry between two views. This image was taken from [18].

To further explain Figure 3.2, it is necessary to first introduce some terminology:

- The baseline is the line joining the two camera centres.

- The epipolar plane Π is the plane containing the scene point X and the camera centres c_l and c_r.

- The epipole e_l is the point where the baseline intersects image I_l. From Figure 3.2, we can see that e_l is simply the image of the second camera centre c_r in the first image I_l. Similarly, the epipole e_r is the point where the baseline intersects image I_r; equivalently, e_r is the image of the first camera centre c_l in the second image I_r.

- The epipolar line l_l is the line where the epipolar plane Π intersects image I_l. The epipolar line l_r is the line where the epipolar plane Π intersects image I_r.

3.1.4 The Essential Matrix

For a stereo camera setup to be calibrated, we need to know the relation between image points in the two images. This relation is a correlation transferring a point in one image to a line in the other. The correlation is known as the Essential matrix. Using this relation, and searching for an object from the first image along the epipolar line in the other, one can calculate the offset (or disparity). From this disparity, calculating the distance to the object in question is a straightforward task. Objects with a large positive disparity are close, objects with zero disparity are on what is known as the fovea, and objects with negative disparity are further away. Performing this calculation over a number of surfaces gives an estimate of the distance to various parts of an image. An example of this was shown earlier in Figure 1.1.

Derivation

Consider the stereo camera setup described earlier with camera centres positioned at c_l and c_r and the baseline t = c_r − c_l. Also, consider a scene point X. This point X together with the two camera centres forms a plane, as shown in Figure 3.2. The vectors x̂_l = X − c_l and x̂_r = X − c_r represent the projection of the point X onto the cameras. Since the three vectors x̂_l, x̂_r and t all lie in the same plane, they must be linearly dependent. Thus, the determinant det(x̂_l, t, x̂_r) = 0. This property is known as the epipolar constraint and was introduced independently by [13] and [19] in 1981. It implies that the transformation relating an image point in one image to the other image must be of rank 2. Thus, the transformation transfers a point onto a line, the epipolar line. A transformation with these properties is also known as a correlation.

Now, using the rotation matrices R_l and R_r to represent the rotation of each camera from a setup with the camera pointing perpendicular to the baseline, the two projections x̂_l and x̂_r can be given in the local reference frame of each camera. Thus, we form:

\[ \mathbf{x}_l = \mathbf{R}_l \hat{\mathbf{x}}_l \quad \text{and} \quad \mathbf{x}_r = \mathbf{R}_r \hat{\mathbf{x}}_r \tag{3.3} \]

Using these relations, and writing T = [t]_× for the skew symmetric matrix of the baseline, we can write the epipolar constraint as:

\[ \det(\hat{\mathbf{x}}_l, \mathbf{t}, \hat{\mathbf{x}}_r) = \hat{\mathbf{x}}_l \cdot (\mathbf{t} \times \hat{\mathbf{x}}_r) = \hat{\mathbf{x}}_l \cdot ([\mathbf{t}]_\times \hat{\mathbf{x}}_r) = \hat{\mathbf{x}}_l^T \mathbf{T} \hat{\mathbf{x}}_r = \mathbf{x}_l^T \mathbf{R}_l \mathbf{T} \mathbf{R}_r^T \mathbf{x}_r = \mathbf{x}_l^T \mathbf{E} \mathbf{x}_r = 0 \tag{3.4} \]

where E = R_l T R_r^T. E is also known as the Essential matrix. This derivation has been taken from [5].
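The following sketch (illustrative only; the verge rotation, baseline and scene point are made-up values) builds E from the relation above and checks that the epipolar constraint x_l^T E x_r = 0 holds on a synthetic scene point:

```python
import numpy as np

def skew(t):
    """[t]_x such that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def rot_y(a):
    """Rotation about the vertical axis, used here as a simple verge angle."""
    return np.array([[np.cos(a), 0.0, np.sin(a)],
                     [0.0, 1.0, 0.0],
                     [-np.sin(a), 0.0, np.cos(a)]])

c_l, c_r = np.array([0.0, 0.0, 0.0]), np.array([0.3, 0.0, 0.0])   # 0.3 m baseline
R_l, R_r = rot_y(0.05), rot_y(-0.05)                              # small verge angles
X = np.array([0.4, 0.2, 3.0])                                     # arbitrary scene point

x_l = R_l @ (X - c_l)                # projection rays in each camera's local frame
x_r = R_r @ (X - c_r)
E = R_l @ skew(c_r - c_l) @ R_r.T    # E = R_l [t]_x R_r^T
assert abs(x_l @ E @ x_r) < 1e-12    # epipolar constraint holds
```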

3.1.5 The Sobel edge detector

The Sobel operator is a way of finding edges in images. This is done by applying two separate filters, that is, convolving the image with two kernels:

\[ S_x = \frac{1}{8}\begin{pmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{pmatrix}, \qquad S_y = \frac{1}{8}\begin{pmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{pmatrix} \tag{3.5} \]

S_x gives a large response on vertical edges while S_y detects horizontal edges. Treating the two filter responses as the components of a vector, the derivatives in the respective directions, one can calculate the gradient field of the image, with the magnitude and orientation of the gradients.
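As an illustration (a plain software version, not the pipelined FPGA implementation described in Section 4.3), the kernels in Eq. (3.5) can be applied to a greyscale image and combined into gradient magnitude and orientation as follows:

```python
import numpy as np

# Sobel kernels from Eq. (3.5); the 1/8 factor normalises the response.
SX = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]) / 8.0
SY = np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]]) / 8.0

def sobel_gradient(img):
    """Return gradient magnitude and orientation for a 2D greyscale array
    (border pixels are left at zero for simplicity)."""
    h, w = img.shape
    gx = np.zeros((h, w))
    gy = np.zeros((h, w))
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            patch = img[r - 1:r + 2, c - 1:c + 2]
            gx[r, c] = np.sum(SX * patch)   # response of the x-derivative kernel
            gy[r, c] = np.sum(SY * patch)   # response of the y-derivative kernel
    return np.hypot(gx, gy), np.arctan2(gy, gx)
```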



Figure 3.3: Line estimation from a set of points. (a) Solving this problem using least squares has problems with the outliers (hollow circles). (b) Using RANSAC, each estimate of the line is formed from only two points. The support for the two different hypotheses is measured as the number of points within a threshold distance from the line. These images were taken from [10].

3.1.6 RANSAC

RANSAC, which stands for RANdom SAmple Consensus, is a robust algorithm for solving problems otherwise solved with least squares methods. To illustrate the method, we present a simple example of fitting a straight line to a set of points. Using normal least squares we get the line in Figure 3.3(a). Here, the inliers are filled circles and the hollow circles represent outliers. The solution given by least squares is clearly not the correct one, and we see the sensitivity of the least squares approach to outliers. Least squares only works in a situation where the number of outliers is low compared to the number of inliers.

The RANSAC algorithm works in a different manner. Here, we form hypotheses from as few points in the data as possible. In Figure 3.3(b), two lines are constructed. The first line is constructed from two randomly selected points a and b, while the second line is constructed from two other randomly selected points c and d. The support for each hypothesis is measured as the number of data points within a threshold distance from the line. In this example, it is easy to see that the first line, a-b, has far greater support in the data than the second line, c-d. Because each hypothesis uses only a few randomly selected points, and hypotheses are generated only until some threshold is reached, the complexity of the method stays low; it is not an exhaustive search. Thus, RANSAC is a more robust method when you expect to have many outliers in your dataset.
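To make the line-fitting example concrete, here is a minimal RANSAC sketch (the function name, iteration count and threshold are illustrative choices, not values from the thesis): each hypothesis is built from a minimal sample of two points and scored by the number of points within a threshold distance.

```python
import math
import random

def ransac_line(points, n_iter=100, threshold=1.0):
    """Fit a line a*x + b*y + c = 0 to 2D points with RANSAC."""
    best_line, best_inliers = None, []
    for _ in range(n_iter):
        (x1, y1), (x2, y2) = random.sample(points, 2)
        a, b = y2 - y1, x1 - x2                # normal of the line through the two samples
        norm = math.hypot(a, b)
        if norm == 0.0:
            continue                           # the two samples coincide
        c = -(a * x1 + b * y1)
        inliers = [p for p in points
                   if abs(a * p[0] + b * p[1] + c) / norm <= threshold]
        if len(inliers) > len(best_inliers):   # keep the hypothesis with the most support
            best_line, best_inliers = (a, b, c), inliers
    return best_line, best_inliers             # optionally refit by least squares on the inliers
```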


3.2 Convolutions

The general definition of a convolution is

\[ s_c = \int_S f(x) g(x) \, dx \tag{3.6} \]

In computer vision, this is interpreted as applying the kernel g(x) to the image data f(x). Performing this two-dimensional convolution can be interpreted as moving a small window representing the kernel over the image and taking the dot product of the kernel and the image patch to form a new image. Computationally, this is a very expensive operation. For each pixel in the image data, you have to perform as many multiplications as there are sites in the kernel. For a square image of size N×N and a kernel of size n×n, this results in O((Nn)²) operations per image.

However, a kernel is separable if it can be written as

\[ K = \mathbf{k}_1 \mathbf{k}_2^T \tag{3.7} \]

where K is the 2D kernel and k_1 and k_2 are 1D kernels. A convolution in one dimension is defined as

\[ s_c = \int_{-\infty}^{\infty} f(x) g(x) \, dx \tag{3.8} \]

\[ s_d = \sum_{i=0}^{n} f(i) g(i) \tag{3.9} \]

Performing the two separate one-dimensional convolutions instead gives the complexity O(N²n), which is a significant improvement for a large kernel.
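The complexity argument can be checked with a small sketch (illustrative, ignoring boundary handling): convolving the rows with k_2 and then the columns with k_1 gives the same result as a full 2D convolution with K = k_1 k_2^T, at O(N²n) cost instead of O(N²n²).

```python
import numpy as np

def conv2d_separable(img, k1, k2):
    """Apply the separable kernel K = outer(k1, k2) as two 1D passes:
    first along each row with k2, then along each column with k1."""
    rows = np.apply_along_axis(lambda r: np.convolve(r, k2, mode='same'), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, k1, mode='same'), 0, rows)

# Example: a 5-tap binomial (Gaussian-like) kernel used in both directions
k = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0
blurred = conv2d_separable(np.random.rand(240, 640), k, k)
```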

3.3 Kalman filtering

In 1960, R.E. Kalman published his famous paper [11] describing a recursive solution to the discrete-data linear filtering problem. Since that time, the Kalman filter has been the subject of extensive research and application, particularly in the area of target tracking. The online encyclopedia Wikipedia [21] has this to say about the Kalman filter:

The Kalman filter is an efficient recursive filter which estimates the state of a dynamic system from a series of incomplete and noisy measurements. An example of an application would be to provide accurate continuously-updated information about the position and velocity of an object given only a sequence of observations about its position, each of which includes some error.


It is used in a wide range of engineering applications from radar to computer vision. Kalman filtering is an important topic in control theory and control systems engineering.

Basically, Kalman filtering is a recursive process for filtering a signal. It uses two steps: measurement update and time propagation. The time propagation builds on a model of how the signal can change; estimating the next value from this model and previous data gives a predicted measurement. When the value is updated, the measurement and the prediction are weighted in proportion to how much one "trusts" the measurement and the model respectively.
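A minimal scalar example of these two steps (the noise values and model are illustrative only):

```python
def kalman_1d(measurements, q=1e-3, r=0.1, x0=0.0, p0=1.0):
    """Scalar Kalman filter with a constant-value model.  q is the process
    noise (how much the signal is allowed to change between samples), r is
    the measurement noise, and the gain k expresses how much the
    measurement is trusted relative to the model prediction."""
    x, p = x0, p0
    estimates = []
    for z in measurements:
        p = p + q                  # time propagation: uncertainty grows
        k = p / (p + r)            # measurement update: compute the gain ...
        x = x + k * (z - x)        # ... and blend prediction and measurement
        p = (1.0 - k) * p
        estimates.append(x)
    return estimates
```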

3.4 Field Programmable Gate Arrays (FPGAs)

An FPGA is an Integrated Circuit (IC) with a large number of logical gates inside. These gates can be AND, OR and XOR gates and small lookup tables. The advantage of an FPGA is that these gates can be connected in different patterns to perform all sorts of operations. The behaviour of the FPGA can therefore be changed quickly to suit a new application. This is almost like a small processor, except that all gates can operate independently of each other and in parallel, whereas a processor performs operations in a sequential manner. The number of gates (the area) in the FPGA is limited, so there is a trade-off between speed and area. Another issue is to set up the gates in such a way that the delay from input to output of a module is shorter than the given clock period, so that we can perform the calculations fast enough to keep up with the input data stream.

3.4.1 Programming language, VHDL

The way of instructing the FPGA to do what we want is to program it. This can be done in a number of ways, from manipulating single gates to programming using one of the higher-level programming languages. The two most used languages for programming FPGAs are the VHSIC Hardware Description Language (VHDL) and Verilog. VHDL is an Ada-like language with very strict type checking, while Verilog is more like C. In the work related to this thesis, VHDL was used. The first thing one has to learn when programming in these kinds of languages is that things happen all at once, not like in a normal computer program where things have a more sequential nature.


Chapter 4

Implementation

This chapter works its way from our experimental setup, via how to perform a number of computer vision processes in hardware, to the actual calculation of the stereo calibration parameters. Details are given on how to implement this in an FPGA using VHDL and various techniques. Firstly, we present our experimental setup and where this work was conducted. In Sections 4.2 and 4.3, we describe how feature points are extracted from the input stereo images. We also describe how this is achieved in real time by moving a number of steps in the point extraction process from a software implementation to a hardware implementation in an FPGA. Section 4.4 shows how we obtain point correspondences between the images, and Section 4.5 describes how we calculate the essential matrix from these point correspondences.

4.1 System Overview

The information path is shown in Figure 4.1. We obtain two images from the two cameras on the stereo head. These are the input to the frame grabber board.


Figure 4.1: Information path and overview of the system used.


Figure 4.2: Toyota Landcruiser, the research platform used

On this board resides an FPGA which takes the camera images as input, creates a Gaussian pyramid, performs local max/min calculations and computes the edge image using the Sobel operator. It then outputs this information together with the unprocessed raw images to the host computer. Using this information, the software in the host computer computes the point correspondences. Finally, the essential matrix is calculated using these correspondences.

The stereo calibration calculations are extremely sensitive to outliers in the data. This leads to the selection of the Scale Invariant Feature Transform (SIFT), presented by David Lowe in 1999 [14], for feature extraction and detection of the point correspondences. The SIFT algorithm is able to robustly find corresponding points in the two images, thereby minimising the number of outliers.

4.1.1 Experimental setup

This work was conducted at the Computer Vision and Robotics lab at ANU/NICTA in Canberra, Australia. For practical tests and evaluation, this lab possesses a 4WD Toyota Landcruiser equipped with a wide range of sensors and actuators. The aim of the research performed is to create driver assistance systems that aid the driver, giving him a safer and more relaxed driving experience. To achieve this, a number of algorithms are being developed, including Pedestrian Detection, Sign Detection, Lane Tracking and Obstacle Detection, to mention a few.

The sensors used for this work are a pair of cameras mounted in a stereo system named CeDAR [7]. The cameras output analog NTSC signals which are digitized by a frame grabber equipped with an FPGA. The raw and filtered images are then transferred via the PCI bus to the videoserver in the host computer.


Figure 4.3: LEFT: CeDAR head. RIGHT: CeDAR head from above, showing the cameras' degrees of freedom (verge angles α_l and −α_r) and the baseline d = 0.3 m.

4.1.2 CeDAR head

CeDAR [7] is a high-speed, high-precision stereo active vision system. The mechanism is lightweight, yet capable of motions that exceed human performance. Figure 4.3 shows the standard CeDAR unit. In our system, the unit is inverted (upside down) and mounted for use with driver assistance in the vehicle shown in Figure 4.2. The head has three degrees of freedom, with encoders on each axis. The cameras are mounted 0.3 m apart and can rotate around an axis through their optical centre independently of each other. The third degree of freedom is the tilt axis.

The cameras used for this work were normal NTSC cameras. These deliver an analog signal consisting of Red, Green and Blue (RGB) information about the scene in front of the cameras. This scene is divided into 640 by 480 pixels. Every second line is delivered in a batch known as a field, at a rate of 60 fields per second. Therefore, an image of size 640 by 240 pixels is delivered 60 times per second through the analog signal. This signal is then sampled by the frame grabber.

4.1.3 Frame grabber

The frame grabber is a PicProdigy development board from Atlantek Microsystems equipped with a Xilinx Virtex II FPGA with 2M gates. Atlantek uses this FPGA for buffering and to interface with the host computer via the PCI bus. Currently they do not use the entire FPGA. Our image processing is inserted in between the buffering and the PCI interface, diverting the pipeline in the FPGA. Once the image input and processing is done, the images are transferred to the host computer and its videoserver.


4.1.4 Videoserver

In the host computer, the videoserver includes the software used to capture the images from the board and deliver them to the different client programs used in the vehicle. With this work, it is possible to provide not only raw image data, but also filtered images as well as stereo calibration information. This can then be used by the clients in the system.



Figure 4.4: Overview of pipeline used to create the Gaussian pyramid and its outputs.

4.2 Gaussian Pyramid and Local min/max detection

A Gaussian pyramid is a pyramid of increasingly blurred and subsampled images with the original image as the base. Another name for this is scale space. The idea was introduced by Lindeberg in 1993 [12] and is used to detect blob-like structures. The idea is that these structures only match the kernel used in the convolutions at a certain scale. It is then possible to extract information about them at that particular scale and thereby reduce the effect of scaling. This is mainly used in object recognition, and the concept is often used in scale-invariant algorithms such as SIFT.

Our approach is to implement the calculations in a parallelised pipeline in order to parallelise the image blurring. To blur an image, one convolves it with a Gaussian kernel. In the following, this kernel is characterized by a kernel size (n) and a standard deviation (σ). Figure 4.4 shows an overview of the calculations involved.

4.2.1 Gaussian stage

Figure 4.5: One pyramid octave. im. = input image stream, LB = line buffer, RB = row buffer, ∆ = delay, Gn = Gaussian vector convolution with standard deviation as in Table 4.2.


The pixels from the camera images are pipelined from the frame grabber's buffering stage to our algorithms. The input image data is a 60 frames per second stream containing interlaced odd and even fields. In Figure 4.5, we show the components needed to create an octave in the image pyramid. In the top left corner, the input image is pipelined in. The first block is a line buffer which outputs five de-interlaced lines. This is split into three pipelines leading to one delay buffer, one convolver and a Sobel stage.

The convolvers are connected in a cascade. That is, the output of one convolver is used as input to the next, creating an increasingly blurred image. This reduces the errors introduced from having a kernel that is too small compared to the standard deviation. This problem is addressed and explained further in Section 4.2.4. The output from the last convolver is subsampled to form the input to the next octave. Our pyramid consists of four octaves.

The convolver stages input five pixels per clock cycle and output a resulting pixel. For a row convolution, the input buffering is a series of simple latches. The column convolution needed a bit more consideration. In order to avoid applying an asymmetric kernel, we interpolate the missing lines from the existing ones. That is, we recreate every second line in a field in order to simulate having the full uninterlaced image in every field. A bonus effect of this is that it reduces the number of lines we have to buffer from four to two. These line buffers are further described in Section 4.2.2.

From the pyramid stage, the four images are pipelined into the candidate detection stage. Here, adjacent scales are subtracted to form the Difference of Gaussian (DoG) images. In order to compare these images and detect local maxima/minima, the DoG images have to be buffered to allow all 27 pixels to be clocked into the comparator at once. If the current point is a local maximum or a local minimum, it is reported to the host computer with its coordinates. An example of the output of this stage is shown in Figure 4.6. Performing this last stage in the FPGA removes a huge number of memory operations from the host computer. The only extra data transferred from the frame grabber are the four Sobel images and the coordinates of the candidate points.
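As a software model of this candidate-detection stage (not the VHDL itself, and assuming three adjacent DoG scales of one octave are available as 2D arrays), the 27-pixel comparison can be written as:

```python
import numpy as np

def dog_extrema(dog_below, dog_mid, dog_above):
    """Report pixels of the middle DoG image that are a local maximum or
    minimum among the 27 pixels of their 3x3x3 neighbourhood, as the
    hardware comparator does.  Border pixels are skipped."""
    h, w = dog_mid.shape
    points = []
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            block = np.stack([img[r - 1:r + 2, c - 1:c + 2]
                              for img in (dog_below, dog_mid, dog_above)])
            v = dog_mid[r, c]
            if v == block.max() or v == block.min():
                points.append((r, c))        # candidate reported with its coordinates
    return points
```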

4.2.2 Linebuffers

Since we need to operate on columns, we need a way of accessing pixels from previous lines. In fact, we often need to apply an operation or a filter to a rectangular window in an image. To do this, we implement line buffers. The line buffer is implemented as a circular memory. Looking at the illustration in Figure 4.7, we follow a pixel's path from the input pipeline at the bottom.


Figure 4.6: Example of interest points detected by simulated hardware in a typical road scene.


Figure 4.7: Principles of a linebuffer. Linebuffers are needed in order to be able to apply a filter or a convolution to a rectangular patch in a pipelined image. One pixel moves from bottom right, through to the top left as the window moves over the pipelined image.

From the bottom right position (3,3), the pixel is latched left with every clock cycle, to (3,2) and then (3,1). Then, as the window moves forward in the image, our pixel is not needed until we get to the next line, and it is therefore stored in the line buffer. One line later, the pixel is retrieved from the line buffer and the process is repeated, now in the second row ((2,3) to (2,2) to (2,1)). These line buffers were implemented in hardware using Xilinx native SelectRAM blocks, an area of the FPGA specialised to act as memory.
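The following behavioural sketch (purely illustrative of the circular-buffer idea; the actual design is written in VHDL using SelectRAMs) produces one 3x3 window per incoming pixel once the two line buffers are primed:

```python
from collections import deque

def windows_3x3(pixels, width):
    """Yield 3x3 windows from a pixel stream of a given image width.
    Two circular line buffers hold the two previous lines, and a small
    register window holds the three most recent columns.  Border windows
    are skipped for simplicity."""
    line1 = deque([0] * width, maxlen=width)   # previous line
    line2 = deque([0] * width, maxlen=width)   # line before that
    cols = deque(maxlen=3)                     # last three columns of the window
    for i, p in enumerate(pixels):
        # line1[0] / line2[0] are the pixels directly above p, one and two lines back.
        cols.append((line2[0], line1[0], p))
        line2.append(line1[0])                 # the pixel leaving line1 enters line2
        line1.append(p)
        if i >= 2 * width and i % width >= 2:  # buffers primed, window fully inside the row
            yield [list(row) for row in zip(*cols)]   # three rows of three pixels
```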

4.2.3 Convolutions

To be able to implement many computer vision algorithms in VHDL, there are a number of smaller problems to overcome. One of the first and most basic is how to perform convolutions without using a large portion of the available area in the FPGA. Convolutions are used in image filtering, edge detection and pattern matching, to mention a few. Gaussian blur is done by convolving the image with what is called a Gaussian kernel.


As previously described in Chapter 3, convolution is a computationally complex operation. However, the Gaussian kernel is separable, which means that the computations per pixel are reduced to two vector multiplications and a number of additions. In hardware, additions are not expensive, but multiplications are. Our way of addressing this problem is to remove the explicit multiplications by using a Lookup Table (LUT). Atmel describes in [3] how to do this efficiently, and we give a short description of how it works here.

To compute the dot product between f and g in the standard way, you would first compute all the multiplications and then do the additions. The idea here is to instead sum partial products of the same bit level. To illustrate this, consider an example with a two-bit kernel and a kernel size of n = 3. Suppose we want to calculate:

\[ s = f(1) \cdot g(1) + f(2) \cdot g(2) + f(3) \cdot g(3) = 1 \cdot 3 + 3 \cdot 0 + 2 \cdot 2 = 7 \tag{4.1} \]

We calculate the partial products by copying the bits of the kernel where the corresponding data bit is 1, and setting them to zeros where the data bit is 0. The first partial product p1 is formed from the least significant bit of each data value f(i), and the second partial product p2 from the next bit.

    kernel g(i)            11        00        10
    data f(i)            x 01      x 11      x 10
    partial prod. p1       11      + 00      + 00     = 011
    partial prod. p2       00      + 00      + 10     = 010

To give the final result, p2 is shifted up one bit and added to p1:

\[ p = p_1 + (p_2 \ll 1) = 3 + 2 \cdot 2 = 7 \]
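A software model of this trick (illustrative only; in the FPGA the inner sums come from small pre-computed look-up tables addressed by the data bits, so no multipliers are needed):

```python
def dot_product_bitlevel(f, g, data_bits=2):
    """Compute sum(f[i]*g[i]) by summing bit-level partial products:
    for each bit position b, add g[i] wherever bit b of f[i] is set,
    then weight the partial sum by 2**b (a shift)."""
    result = 0
    for b in range(data_bits):
        partial = sum(gi for fi, gi in zip(f, g) if (fi >> b) & 1)
        result += partial << b
    return result

# The worked example above: f = (1, 3, 2), g = (3, 0, 2)
assert dot_product_bitlevel([1, 3, 2], [3, 0, 2]) == 7
```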
