Computer Science
Nadia Kanwal
Motion Tracking in Video using the Best Feature Extraction Technique
Master's Thesis
Motion Tracking in Video Captured by Moving Camera using Best Feature Extraction Technique
By
Nadia Kanwal
A thesis submitted in partial fulfilment of the requirements for the degree of Master of Science, University of Essex, 2009
School of Computer Science and Electronic Engineering
SURNAME:
KANWAL
OTHER NAMES:
NADIA
QUALIFICATION SOUGHT:
MSc COMPUTER SCIENCE
TITLE OF THE PROJECT:
“MOTION TRACKING IN VIDEO CAPTURED BY MOVING CAMERA USING BEST FEATURE EXTRACTION TECHNIQUE”
SUPERVISOR:
Dr. ADRIAN CLARK
DATE:
August 26, 2009
Abstract
This project explores the use of local features for motion tracking in order to estimate the direction of a moving object. The availability of a number of feature extraction and matching algorithms in the literature makes it difficult to select any one of them; it is therefore appropriate to assess the suitability of a technique for a particular application before actually putting it into use. The project begins with a comparative study that analyzes two state-of-the-art techniques for feature extraction (SIFT and SURF) along with two of the best-known matching algorithms (RANSAC and Hough Transform). The performance evaluation is focused on measuring the correctness of the algorithms for tracking applications, and McNemar's statistical test has been applied to find the more effective method for feature extraction and matching. Using the results obtained from this analysis, a combination of feature extractor and matching technique has been employed to estimate the direction of a moving object from videos captured using a handheld camera as well as a camera fixed on a moving vehicle. The proposed method is capable of detecting left, right, up, and down movements with reasonable accuracy in real-world videos. The results are not one hundred percent accurate but are encouraging enough for further investigation. The system can identify the direction of the moving object with more than 90% accuracy if the object changes its direction independently of the surroundings, and with less than 30% accuracy otherwise.
This project dissertation is in accordance with Examination Regulations 6.12 and 6.13.
Table of Contents

1 Chapter: Introduction ........................................................ 1
2 Chapter: Background .......................................................... 3
   2.1.1 Principal Issues ...................................................... 4
3 Chapter: Literature Review ................................................... 6
4 Chapter: Algorithms for Feature Extraction .................................. 12
   4.1.1 SIFT ................................................................. 12
   4.1.2 SURF ................................................................. 16
   4.2 Matching Algorithms .................................................... 19
   4.2.1 Nearest Neighbor Algorithm ........................................... 19
   4.2.2 RANSAC ............................................................... 20
   4.2.3 Hough Transform ...................................................... 21
   4.3 Testing Methods ........................................................ 22
5 Chapter: Performance Evaluation ............................................. 26
   5.1 Evaluation Data ........................................................ 26
   5.1.1 Code Implementation .................................................. 29
   5.1.2 RANSAC vs Hough Transform ............................................ 29
   5.1.3 SIFT vs. SURF ........................................................ 39
   5.1.4 SIFT vs SURF Conclusion .............................................. 55
6 Chapter: Application on real world video sequences .......................... 56
   6.1.1 Proposed Method for Direction Estimation of a Moving Vehicle ......... 57
   6.1.2 Real World Videos for Experiment ..................................... 59
   6.1.3 Experimental Setup ................................................... 60
   6.1.4 Verification of Results .............................................. 61
   6.1.5 Results and discussion ............................................... 61
   6.1.6 Application's appraisal .............................................. 75
7 Chapter: Conclusion and Future Work ......................................... 76
References .................................................................... 78
Appendix A: Graphs Comparing RANSAC and Hough Transform ....................... 80
List of Tables

Table 1: Feature detectors ..................................................... 7
Table 2: Local descriptors .................................................... 10
Table 3: Z-Score table ........................................................ 25
Table 4: McNemar's Test for Hough and RANSAC using SIFT features .............. 35
Table 5: McNemar's Test for Hough and RANSAC using SURF features .............. 35
Table 6: Algorithm's parameter values ......................................... 39
Table 7: McNemar's Test for Blurred images .................................... 42
Table 8: McNemar's Test for Compressed images ................................. 44
Table 9: McNemar's Test for Images with change in Illumination ................ 46
Table 10: McNemar's Test for Images with change in view point ................. 48
Table 11: McNemar's Test for Images with change in view point ................. 50
Table 12: McNemar's Test for Zoomed and Rotated Images ........................ 52
Table 13: McNemar's Test for Zoomed and Rotated Images ........................ 54
Table 14: Parameter values used for application ............................... 57
Table 15: Real world video data ............................................... 59
Table 16: Camera Specifications ............................................... 60
Table 17: Video # 1 Results verification ...................................... 63
Table 18: Video # 2 Results verification ...................................... 65
Table 19: Video # 3 Results verification ...................................... 67
Table 20: Video # 4 Results verification ...................................... 69
Table 21: Video # 5 Results verification ...................................... 71
Table 22: Video # 6 Results verification ...................................... 73
Table 23: Video # 7 Results verification ...................................... 75
List of Figures

Figure 1: Scale Space (Lowe, 2004) ............................................ 13
Figure 2: Extrema Detection (Lowe, 2004) ...................................... 13
Figure 3: SIFT Key Point Descriptor (Lowe, 2004) .............................. 15
Figure 4: Original image ...................................................... 17
Figure 5: Integral image ...................................................... 17
Figure 6: Haar Wavelet Filter for x and y direction ........................... 18
Figure 7: SURF Descriptor (Bay, Ess, Tuytelaars, & Luc, 2006) ................. 19
Figure 8: ROC Curves .......................................................... 23
Figure 9: Datasets and their description (a) .................................. 27
Figure 10: Datasets and their description (b) ................................. 28
Figure 11: Graphs comparing Number of Matches by Hough Transform and RANSAC .. 31
Figure 12: ROC Curves: Hough Transform and RANSAC matching Graffiti, Boat and UBC Images ... 33
Figure 13: ROC Curves: Hough Transform and RANSAC matching Bike and Leuven Images ... 35
Figure 14: Graphs comparing number of matched feature points detected by SIFT and SURF ... 38
Figure 15: Graphs comparing SIFT and SURF features for Bikes dataset .......... 41
Figure 16: Graphs comparing SIFT and SURF features for UBC dataset ............ 43
Figure 17: Graphs comparing SIFT and SURF features for Leuven dataset ......... 45
Figure 18: Graphs comparing SIFT and SURF features for Graffiti dataset ....... 47
Figure 19: Graphs comparing SIFT and SURF features for Wall dataset ........... 49
Figure 20: Graphs comparing SIFT and SURF features for Boat dataset ........... 51
Figure 21: Graphs comparing SIFT and SURF features for Bark dataset ........... 53
Figure 22: Direction detection in two sample frames ........................... 60
Figure 23: Frames for video # 1 (Car moving on straight road) ................. 62
Figure 24: Graphs indicating Left and Right motion of a moving vehicle ........ 62
Figure 25: Frames from video # 2 (Car Moving Up and Down the Hill) ............ 64
Figure 26: Graphs showing Vehicle's motion in Left, Right, Up and Down direction ... 64
Figure 27: Sample frames from indoor video .................................... 66
Figure 28: Graph showing left motion in Indoor video sequence ................. 66
Figure 29: Sample frames from outdoor sequence ................................ 68
Figure 30: Graph showing left movement in outdoor sequence .................... 68
Figure 31: Aircraft Taking-off Video Sequence ................................. 70
Figure 32: Graphs presenting motion detection of an aircraft taking-off ....... 70
Figure 33: Aircraft Flying over Trees ......................................... 72
Figure 34: Graphs presenting Upward and Straight Motion of an aerial vehicle .. 72
Figure 35: Air to Air Video Sequence .......................................... 74
Figure 36: Graphs presenting Left, right, up and down movement of an aircraft . 74
Acknowledgements
I am pleased to thank Dr. Adrian Clark, whose endless support and guidance was available to me throughout this project. I acknowledge his personal efforts in collecting real-world videos for the application part of this project. I would like to thank Mr. Shoaib Ehsan for valuable discussions on the topic. I would like to express my heartfelt gratitude to my husband and my immediate family, who have been a constant source of love and support. I am also indebted to my colleagues at Lahore College for Women University for their good wishes.
1 Chapter: Introduction
Today, the focus of research is not only to develop sophisticated hardware to capture the best quality images and videos, but also to process this information efficiently and use it purposefully. Recent advances in object detection, recognition and tracking have supported many vision applications such as "Human Behaviour Analysis through videos", "Security Surveillance Systems", "Recognition Systems to facilitate people with disabilities", "Image Mosaicing", "Image Rendering" and many more. The performance of these applications depends on accurately characterizing the image data into classes representing specific properties, which facilitates matching and recognition tasks.
Image points that have some property distinguishing them from their neighbourhood can be collected as interest points, or features, of an image. This property can be information about colour, gradient, contrast, etc. An interest point is only useful if it can be matched between images under occlusion, clutter and varying imaging conditions. Therefore the goal of any vision algorithm is to find, efficiently and accurately, those features which are insensitive to noise, illumination and other transformations.
Different approaches for detecting and matching objects and images against a set of other images have been proposed in the literature, such as optical flow, image segmentation and region-based methods, but techniques based on local features are the most popular for detection, recognition and tracking applications. Feature extraction is mainly a two-step process: detection and description of an interest point. Detection means locating image points with some distinguishable property, while the description contains information (such as derivatives) about the neighbourhood of each point, which provides a means of establishing point-to-point correspondences and improves matching results. Both of these steps play a vital role in the performance of the algorithm: the efficiency and accuracy of a technique lie in accurately detecting feature points and efficiently generating their descriptors.
There are a number of techniques developed for feature detection and description; some well-known ones are Harris-Laplace, Hessian-Laplace, Steerable filters, Spin Images, SIFT and SURF. These techniques can be characterized as good or bad based on how well they detect stable features in images with noise, illumination changes and affine transformations. All of them describe features differently, and as reported in (Mikolajczyk & Schmid, October 2005), region-based descriptors appear to give better performance. Therefore, scale- and rotation-invariant feature extraction algorithms like SIFT (Scale Invariant Feature Transform), SURF (Speeded-Up Robust Features), Harris-Laplace and Hessian-Laplace are considered good for applications dealing with real images and videos. SIFT and SURF, however, are considered state-of-the-art and are the most frequently referred to in the literature so far. The features extracted by these techniques are robust to a significant range of affine distortion, changes in illumination, 3-D viewpoint changes and noise (Lowe, 2004). Therefore, I have selected them to carry out a detailed comparative study.
Feature detection and description without matching is incomplete for any vision application; therefore, we also need an efficient method to associate related points in a pair of images. In the literature, there are two widely used matching algorithms known as RANSAC and the Hough Transform. Just like other vision techniques, the performance of matching algorithms also varies with the application, and therefore no technique can be characterized as best (in terms of accuracy and efficiency). The purpose of analyzing the detectors' and descriptors' performance along with the matching algorithms is to find a complete set of techniques for recognition or tracking applications with good performance. The remaining part of this report explains the background of the project (chapter 2), gives an overview of the work already done in this field (chapter 3), followed by an introduction to the techniques (for feature extraction and matching) and the testing methods used for evaluation (chapter 4). The next part presents the performance evaluation results and their discussion (chapter 5), and the application of the best feature extraction and matching technique
for tracking the motion of a moving object (chapter 6). The report ends with conclusions and future work (chapter 7).
2 Chapter: Background
Image and video analysis has a great influence on most of the research work going on in computer vision these days. This is due to its application in a vast variety of areas, such as entertainment (software to mimic human actions on a virtual actor) and sports (robotic players that give practice sessions to human players, such as a football player, tennis-ball tracker or badminton player). Different kinds of automatic vehicles can be developed which move autonomously in hazardous situations and difficult areas. In medical science, imaging can be very helpful to locate specific cells associated with a particular disease in the human body, and it helps doctors with their diagnoses and surgeries. Similarly, one cannot deny the importance of efficient analysis of human behaviour for security surveillance systems monitoring high-risk areas, in virtual reality to allow the user to interact with a computer-simulated environment, in robotics to allow machines to interact with humans and assist them in day-to-day life (for example as attendants for elderly people and patients), and, last but not least, in intelligent machines for people with disabilities, such as wearable glasses that track objects for blind people. For all these applications the requirement is to have efficient object detection and tracking methods which can provide real-time results with great accuracy. The application developed for this project is not new; however, it is unique in the sense that no sensor other than a camera has been used to obtain real-time data, while the camera is allowed to move along with the object/vehicle instead of having a fixed position. The basic idea is to use local image information from frame to frame to identify the direction of the moving object carrying the camera. It is, therefore, different from object detection or tracking using a fixed camera. The purpose of this activity is
to identify the suitability of a local feature matching method for applications like the control system of an unmanned vehicle, which could be an automobile or an airplane.
2.1.1 Principal Issues
Matching two images together and finding exact correspondences between their feature points is the most challenging task, but also the most fundamental one, for any recognition and tracking application. The prime objective of this application is to find local information in images that is sufficient to give information about a moving object's pose. Researchers in the robotics and vision communities have presented solutions to similar problems. For example Lowe, the creator of the SIFT technique, along with his colleagues used SIFT features to identify natural landmarks for SLAM; this method appeared to be a better solution than finding artificial landmarks (Stephen, Lowe, & little, October, 2002). The use of local gradient information of video frames to estimate the global position of an object has been a great motivation. Sim and Dudek have also suggested the use of landmarks for pose estimation (Sim & Dudek, September 1999). These two applications showed that local feature based methods are a better approach to finding the global position and pose of a mobile robot; therefore, estimating the direction of a moving object using the same kind of technique seems quite feasible. Another issue is the selection of the feature detector and descriptor. A number of local feature descriptors have been proposed in the literature, and an evaluation of these is given in (Mikolajczyk & Schmid, October 2005). The problem with that analysis is that it still cannot tell in unequivocal terms whether any algorithm has superiority over the others for matching features between images with affine transformations. The study also lacks the use of error bars to clearly establish the reliability of the reported results, and the number of correct matches does not make an algorithm the best unless that claim is accompanied by the number of false matches as well as true and false negatives. It has been mentioned in the
referenced paper that it is difficult to identify false matches, and hence also difficult to find the number of true and false negatives; this issue has therefore been taken up in this study. Another reason for revising this kind of evaluation is to include the more recent technique SURF, which claims performance equal to or better than SIFT or any other distribution-based descriptor. The SURF descriptor stores the Haar wavelet response of the region surrounding the pixel in 64 or 128 bins. Bay et al. themselves analyzed the performance of SURF-128 and SURF-64 and found SURF-64 to be both efficient and accurate. After acknowledging SIFT as the best performer, the focus of research in the vision community has now shifted to comparing SIFT with SURF; it is the claim that SURF is better as well as more time efficient which has grabbed the attention of many scientists. The work done by (Danny, Shane, & Enrico) on comparing local descriptors for image registration, and the comparison of different implementations of local descriptors by (Bauer, Sunderhauf, & Protzel, 2007), are good starting points on which to build the present study. Valgren and Lilienthal have also evaluated the performance of the SIFT and SURF algorithms in outdoor scenes for robot localization (Valgren & Lilienthal); according to their study SURF gives comparable results to SIFT but with lower computational time. None of this comparison and evaluation work carries substantial statistical proof of the results claimed, so a detailed investigation of the behaviour of the two state-of-the-art algorithms under different imaging conditions, or when matching between images with affine transformations, is needed. Most of the research papers referred to above used precision-recall curves to compare the performance of the local descriptors. The approach is not wrong, but other statistical tests can also be applied to perform an in-depth analysis of the behaviour of the techniques: for example, ROC curves can be used to compare two or more algorithms and establish the superiority of one technique over another, and a statistical test such as McNemar's test can be applied to accurately identify the behaviour of two techniques under similar conditions. The detail of this test is provided in chapter 4.
3 Chapter: Literature Review
Detection and recognition tasks have been the focus of research for more than two decades. It would not be out of place to say that the field is maturing, and researchers have produced impressive algorithms to solve these problems, some of which are considered landmarks in computer vision and image processing. The main approaches for object detection and tracking can broadly be divided into three categories: top-down approaches, bottom-up approaches, and combinations of the two (Wang, Jianbo, Gang, & I-fan, 2007). In top-down approaches, the image as a whole is taken as one region and then segmented into sub-regions to find a specific object. This model is inspired by the pattern-matching technique used by human beings to identify and locate objects (Oliva, Antonio, & Monica, Sept, 2003), but it is important to note that humans also use their knowledge of object classes, which helps them identify different objects belonging to the same class (such as different types of cups, pens, humans and animals), something that machines lack. To achieve this intelligence, software needs two types of information: the image and the object itself. The object can be modelled using some representation method, and image information can be collected using methods like colour histograms, wavelet histograms, etc. Segmentation of the image into sub-regions based on these colour or wavelet histograms then helps to identify the one that holds the object. The difficulty with this approach is that it does not give importance to the object's appearance or its local features, and therefore fails to differentiate between different objects of the same class, which can result in more false matches. Alternatively, the bottom-up approach focuses on finding object features like edges, corners, segments and colours. Algorithms then use these features, or groups of them, to correspond to the object of interest, which are then matched by evaluating certain mathematical functions such as contour matching (Vittorio, Tuytelaars, & Gool, July, 2006), corner detection, colour histograms, etc. The use of features for object detection has been quite popular, but the limitation of the bottom-up approach is that more time has to be spent finding local features and then grouping them to correspond to one object. Therefore, many
researchers combined these two approaches to take advantage of both, and a number of algorithms have been developed using this concept. Before going into the discussion of these algorithms, it is important to discuss the basic techniques developed to identify an image point as a feature. An image point can qualify as a feature only if it holds the properties of uniqueness and repeatability. Identifying features with these two properties can increase the accuracy of a detection algorithm, but it is at the same time a difficult task, especially due to possible changes in imaging conditions and viewpoints. Simple features ('interest points') can be extracted from images by applying binary thresholding to find differences in pixel values which indicate the presence of interest points, or by using more complex mathematical functions like 1st/2nd order derivatives expressed in terms of Gaussian derivative operators, in other words as local maxima or zero-crossings of image pixels in 2nd order derivatives. Using these mathematical methods, edges, corners and blobs can be computed. A brief summary of features and their detection methods is given in Table 1. These detectors have been commonly used in algorithms for object detection and tracking.

Table 1: Feature detectors

Feature Detector                                                        | Features              | Year
Moravec's operator                                                      | Corner                | 1977
Canny                                                                   | Edge                  | 1986
Harris and Stephens                                                     | Edge and corner       | 1988
SUSAN                                                                   | Edge and corner       | 1997
FAST                                                                    | Corner                | 1998
Laplacian of Gaussian, Difference of Gaussians, Determinant of Hessian  | Edge, Corner and Blob | 1980
Moravec's operator was the first known operator to find 'interest points': he looked for points with a large intensity variation in every direction. He used overlapping square windows from 4x4 to 8x8 and
calculated sums of squares of differences of adjacent pixels in each of four directions (horizontal, vertical, diagonal and anti-diagonal) within the patch. He selected the minimum of the four variations as the representative value for the selected window (Moravec, 1977). The 'interest points' defined by Moravec were afterwards classified as corner points in an image. Moravec's intention was to find image points having different properties than their neighbouring pixels; the idea was then used by others to find edges and blobs. Edges make up the boundary of an object and are therefore the most basic characteristic defining its shape; the most common object representation used in computer vision and image processing is the contour (based on edges). An edge in an image is a pixel location where there is a strong contrast in pixel values. Many researchers have defined ways of finding edges, such as Sobel, Laplace and Canny (Canny, 1986). Canny's technique has been considered the most effective one: the image is smoothed first to eliminate noise, the image gradient is then used to highlight regions with high spatial derivatives, non-maximum edge responses are suppressed, and finally hysteresis with two threshold values is used to suppress the remaining non-edge pixels, keeping pixels above the high threshold as edge pixels. Harris and Stephens' corner detector (Harris, Chris, & Stephens, 1988) is an improvement of Moravec's detector. It finds interest points using derivatives in the x and y directions to capture the intensity variation around a pixel in a small neighbourhood; a second-moment matrix encodes this variation, an interest point is identified using the determinant and trace of this matrix, and non-maxima suppression is then applied. In brief, it uses gradient information to find corners. The technique is more robust than Moravec's operator and provides good results, but it has some shortcomings as well. Firstly, it is sensitive to noise due to its dependence on gradient information. Secondly, it gives poor performance on junction points and ideally identifies corner points on L-junctions only. At the same time, its computational cost is higher than that of most other corner detectors.
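To make the idea concrete, the following minimal sketch (an illustration in Python/NumPy, not the C++/OpenCV code used later in this project) computes a Harris-style corner response from image gradients; the window size and the constant k = 0.04 are commonly used but assumed values here.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def harris_response(image, k=0.04, window=3):
    """Harris-style corner response: det(M) - k * trace(M)^2, where M is the
    second-moment matrix of the image gradients averaged over a small window."""
    image = image.astype(float)
    dy, dx = np.gradient(image)              # derivatives in y and x
    # Average the derivative products over the window around each pixel.
    Sxx = uniform_filter(dx * dx, window)
    Syy = uniform_filter(dy * dy, window)
    Sxy = uniform_filter(dx * dy, window)
    det = Sxx * Syy - Sxy * Sxy
    trace = Sxx + Syy
    return det - k * trace * trace

# Corners are local maxima of the response above a threshold; a bright square
# on a dark background gives strong responses at its four corners.
img = np.zeros((20, 20))
img[5:15, 5:15] = 1.0
R = harris_response(img)
print(np.unravel_index(np.argmax(R), R.shape))
```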
The SUSAN corner detector (Smith & Brady, 1997) uses a circular mask over the pixel to be tested. The approach is to sum the differences in intensity between all pixels in the mask and the pixel under consideration; if that sum is less than a threshold, the pixel's response is set to the difference between the threshold and the sum, otherwise it is set to 0. The technique can be adapted to find edges along with corners: a large threshold detects edges, while a small threshold finds corners.

FAST (Trajkovic & Hedley, 1998), another approach to finding local features, considers 16 pixels on a circle of radius 3 around the candidate pixel. If 'n' contiguous pixels are all brighter than the central pixel by at least some threshold, or all darker than it, then this pixel is considered to be a feature point. The best value of 'n' can vary; most researchers found 'n' = 9 to produce good results in terms of corner detection. FAST is more efficient than other corner detectors, but it is quite sensitive to noise along edges and readily identifies diagonal edges as corners.

Difference of Gaussians (DoG) is a greyscale image enhancement algorithm in which one blurred image is subtracted from another, less blurred, image to find edges and remove noise. The blurring is done by applying two different Gaussian filters (with low and high blurring values), and the two blurred images are then subtracted. The resulting image contains only points with higher greyscale values than their surroundings and therefore highlights edges. It is also used as a smoothing step by some other feature detectors.

The Laplacian operator uses the second derivative to find zero crossings for edge detection (Morse, 1998-2000). The Laplacian of Gaussian was the first method used to find blobs in an image: a Gaussian kernel is convolved with the image at a certain scale and the Laplacian is then applied to identify blob regions, where a result less than zero shows the presence of bright blobs and a result greater than zero shows the presence of dark blobs. The problem with second-derivative methods is that they are very sensitive to noise.
The Hessian matrix contains the 2nd order partial derivatives of a function and is useful for finding blobs and edges. First, the 2nd order partial derivatives over a window of pixel values are calculated; the determinant of this matrix is then checked to categorize the point. If the determinant of the Hessian at a point 'x' is positive, 'x' is a local extremum (a maximum or a minimum, useful for finding blobs); if the determinant is negative, 'x' is a saddle point; and a zero value means the test is indeterminate. These critical points indicate the presence of blobs. The Hessian matrix has been used to develop algorithms for feature point extraction: the determinant of a Hessian matrix approximated with Haar-like box filters is used as the interest point locator in SURF (Bay, Ess, Tuytelaars, & Luc, 2006).

Table 2: Local descriptors

Descriptor        | Data                                                                                    | Proposed by              | Year
Local Jet         | Image derivatives                                                                       | Koenderink and van Doorn | 1987
Steerable Filters | Derivatives computed by Gaussian filter                                                 | Freeman and Adelson      | 1991
Moment Invariants | Shape and intensity distribution in a region                                            | Van Gool et al.          | 1996
Shape context     | 3D histogram of edge point locations and orientations                                   | Belongie et al.          | 2002
Spin Image        | Histogram of point positions and intensity in the neighbourhood of a 3D interest point  | Lazebnik et al.          | 2003
SIFT              | 3D histogram of gradient locations and orientations                                     | David Lowe               | 2004
PCA-SIFT          | Vector of image gradients in x and y direction                                          | Y. Ke and Sukthankar     | 2004
GLOH              | Gradient location-orientation histogram                                                 | Mikolajczyk and Schmid   | 2005
SURF              | Wavelet response of region surrounding the interest point                               | Bay et al.               | 2006
All of the above-mentioned detectors are capable of detecting feature points based on some characteristic, but in order to match these points in other images there is still a need for distinguishing data about each
feature point that can make it unique and robust to varying imaging conditions. Therefore many descriptors have been developed to store information about the region around a feature point. A descriptor plays an important role in matching because of the amount of information it contains about the interest point and its neighbourhood. Table 2 presents different descriptors proposed in the literature along with the type of data they store. A detailed performance evaluation of these descriptors has been done by (Mikolajczyk & Schmid, October 2005), which not only explores their ability to produce correct matches but also provides a detailed analysis of their behaviour under different image transformations such as scale, rotation, JPEG compression and illumination. It is interesting to see that differential-based descriptors had the worst performance for affine transformations, while distribution-based descriptors, such as those based on histograms and shape context, performed very well for almost all kinds of transformations.
4 Chapter: Algorithms for Feature Extraction
The goal of a good recognition system is to find image features which are repeatable, distinctive, invariant, and efficient to compute and match. The techniques selected for this study follow a broadly similar concept of identifying features and describing their neighbourhood information. A brief description of the two techniques used in this project, SIFT and SURF, is given below.
4.1.1 SIFT
This technique combines a scale invariant region detector and a descriptor based on the histogram of the gradient distribution around each detected point (Lowe, 2004). The main steps defined for detection and description are as follows.

4.1.1.1 Scale Space Extrema Detection
The purpose of this step is to find interest points which are stable across all possible scales, using a scale space. The scale space is generated by convolving the input image with a variable-scale Gaussian function:

$L(x, y, \sigma) = G(x, y, \sigma) * I(x, y)$

where $*$ is the convolution operation in x and y, and

$G(x, y, \sigma) = \frac{1}{2\pi\sigma^2} e^{-(x^2 + y^2)/2\sigma^2}$

The Laplacian of Gaussian is a computationally expensive operation; an efficient alternative is to compute the Difference-of-Gaussian of smoothed images, so that the operation becomes a subtraction of two images only:

$D(x, y, \sigma) = \left(G(x, y, k\sigma) - G(x, y, \sigma)\right) * I(x, y) = L(x, y, k\sigma) - L(x, y, \sigma)$
A scale space contains at least four octaves with 3 intervals (3 samples per octave); further octaves can then be generated by down-sampling the Gaussian image by a factor of 2.
Figure 1: Scale Space (Lowe, 2004)
The next step is to find interest points in this scale space. In every octave, each image point is compared with its 26 neighbouring points in the three-dimensional space (x, y and scale), and it qualifies as a distinctive pixel if it has the maximum or minimum value with respect to all 26 neighbours.
Figure 2: Extrema Detection (Lowe, 2004)
The question which arises here is how many octaves, and how many samples per octave, are enough to generate stable keypoints. Lowe's paper, as well as Bay et al., found that taking 3 samples per octave with the multiplicative scale difference gives the maximum number of keypoints, and that with a larger number of samples the number of stable keypoints starts to decrease. This point is strengthened by the fact that blurring the image with large filters causes image points to lose their characteristics; the same problem occurs if we increase the number of octaves. Therefore 4 octaves with 3 samples each are considered to be the optimal choice.
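As an illustration of the scale-space construction described above, the sketch below (a simplified Python/NumPy version, not Lowe's implementation) builds one octave of Gaussian-smoothed images and their differences and tests a point against its 26 neighbours; the starting sigma of 1.6 is an assumed value, and 3 intervals per octave follows the discussion above.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_octave(image, sigma0=1.6, intervals=3):
    """Build one octave of the Difference-of-Gaussian scale space."""
    k = 2 ** (1.0 / intervals)                       # multiplicative scale step
    sigmas = [sigma0 * (k ** i) for i in range(intervals + 3)]
    gaussians = [gaussian_filter(image, s) for s in sigmas]
    # Each DoG image is a simple subtraction of adjacent smoothed images.
    dogs = [g2 - g1 for g1, g2 in zip(gaussians, gaussians[1:])]
    return gaussians, dogs

def is_extremum(dogs, s, y, x):
    """A point qualifies as a candidate keypoint if it is the maximum or
    minimum of its 26 neighbours across position and scale."""
    val = dogs[s][y, x]
    cube = np.stack([d[y - 1:y + 2, x - 1:x + 2] for d in dogs[s - 1:s + 2]])
    return val == cube.max() or val == cube.min()

# The next octave would be built by down-sampling the most-smoothed image by 2.
img = np.random.rand(64, 64)
gaussians, dogs = dog_octave(img)
print(len(dogs), "DoG images in this octave")
print(is_extremum(dogs, 2, 32, 32))
```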
4.1.1.2 Accurate Keypoint Localization
After identifying a keypoint, the next step is to find its local information such as location, scale and ratio of principal curvatures. At this point it is important to discard points which have very low contrast (selected as keypoints due to noise). Here Lowe suggests using a 3D quadratic function to determine the interpolated location of each selected point; in order to keep only points with sufficient contrast, pixels whose interpolated DoG value is less than 0.03 are discarded, assuming pixel values in the range (0, 1). The keypoints selected so far are the ones with considerable contrast compared to their neighbouring pixels, but we still have to deal with noise and false edge responses. For this, Lowe suggested using the Hessian matrix rather than the Harris and Stephens corner detector, to avoid the unnecessary calculation of eigenvalues. The method is to examine the principal curvatures in two directions: if the point lies on an edge it will have one principal curvature much higher than the other, but if it is a corner the difference will be small and the point will be kept. The trace and determinant of the 2x2 Hessian matrix of the DoG image,

$H = \begin{bmatrix} D_{xx} & D_{xy} \\ D_{xy} & D_{yy} \end{bmatrix}$

give the sum and product of the principal curvatures, so their ratio can be tested without computing the eigenvalues explicitly:

$\frac{\mathrm{Tr}(H)^2}{\mathrm{Det}(H)} < \frac{(r+1)^2}{r}$

Keypoints are rejected when the ratio of principal curvatures is greater than the threshold r (r = 10 in Lowe's implementation).
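A minimal sketch of this edge-response check is given below; it approximates the Hessian of a DoG image with finite differences, and r = 10 is assumed following Lowe's paper.

```python
import numpy as np

def passes_edge_test(dog, y, x, r=10.0):
    """Reject keypoints on edges: the ratio of principal curvatures, checked
    via the trace and determinant of the 2x2 Hessian of the DoG image, must
    satisfy tr(H)^2 / det(H) < (r + 1)^2 / r."""
    dxx = dog[y, x + 1] + dog[y, x - 1] - 2 * dog[y, x]
    dyy = dog[y + 1, x] + dog[y - 1, x] - 2 * dog[y, x]
    dxy = (dog[y + 1, x + 1] - dog[y + 1, x - 1]
           - dog[y - 1, x + 1] + dog[y - 1, x - 1]) / 4.0
    det = dxx * dyy - dxy * dxy
    trace = dxx + dyy
    if det <= 0:                 # curvatures of opposite sign: reject outright
        return False
    return trace * trace / det < (r + 1) ** 2 / r

# Example: a point on a blob-like bump passes, a point on a straight ridge fails.
yy, xx = np.mgrid[0:21, 0:21]
blob = np.exp(-((yy - 10) ** 2 + (xx - 10) ** 2) / 20.0)
ridge = np.exp(-((xx - 10) ** 2) / 20.0)
print(passes_edge_test(blob, 10, 10), passes_edge_test(ridge, 10, 10))
```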
4.1.1.3 Orientation Assignment
Another important property that a feature point should possess is invariance to orientation. An orientation histogram is formed by recording all gradient orientations of sample points within a region
around the keypoint. Peaks in the histogram reflect the dominant directions of the local gradients. The highest peak, along with any other local peak within 80% of the highest peak, is used to create keypoints with those orientations; therefore multiple keypoints may be created at the same location and scale but with different orientations. Finally, to improve accuracy, each peak position is interpolated from the three histogram values closest to it to find a stable gradient orientation.
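The following sketch illustrates the idea of the orientation histogram; it is a simplification, with an assumed 36-bin histogram and the 80% peak rule, and it omits the Gaussian weighting and peak interpolation of the full method.

```python
import numpy as np

def dominant_orientations(patch, num_bins=36, peak_ratio=0.8):
    """Build a gradient-orientation histogram for the region around a keypoint
    and return every peak within 80% of the highest peak."""
    dy, dx = np.gradient(patch.astype(float))
    magnitude = np.hypot(dx, dy)
    orientation = np.rad2deg(np.arctan2(dy, dx)) % 360.0
    hist, _ = np.histogram(orientation, bins=num_bins, range=(0, 360),
                           weights=magnitude)
    threshold = peak_ratio * hist.max()
    bin_width = 360.0 / num_bins
    # Each qualifying peak yields a keypoint copy with that orientation.
    return [(i + 0.5) * bin_width for i, v in enumerate(hist) if v >= threshold]

patch = np.outer(np.arange(16), np.ones(16))   # intensity increases downwards
print(dominant_orientations(patch))            # a single peak near 90 degrees
```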
4.1.1.4 Descriptor
The final step is to define a descriptor that also adds invariance to illumination and 3D viewpoint. The SIFT descriptor starts by calculating the gradient magnitude and orientation at each image point in a region around the keypoint, as shown in Figure 3. The implementation uses a 4x4 descriptor array computed from a 16x16 sample array; the results are therefore stored as a 4x4 array of histograms with 8 orientation bins each, giving a 128-element feature vector for each keypoint.
Figure 3: SIFT Key Point Descriptor (Lowe, 2004)
A Gaussian weighting function is used to assign more weight to the magnitudes of sample points near the centre of the window and less weight to points far from the centre; this avoids sudden changes in the descriptor caused by small shifts in the window position. In order to make the descriptor invariant to brightness changes, the feature vector is normalized to unit length. The output of the SIFT detector and descriptor is then invariant to scale, rotation, illumination and 3D viewpoint, and can thus be used for recognition, tracking, registration or 3D reconstruction applications.
4.1.2 SURF
SURF is a technique based on the use of local information for feature extraction. It differs from SIFT both in how it detects feature points and in how it calculates their descriptors. In SURF, Bay et al. proposed alternative methods to compute a similar kind of information to SIFT; the details are given below.

4.1.2.1 Integral Image
Bay et al. proposed that the computational load of applying filters can be reduced by using an integral image instead of the original image, an idea originally proposed by Viola and Jones for face detection (Viola & Jones, 2001). Integral images allow very fast computation of any box-type convolution filter. It is very simple to calculate an integral image: each location (x, y) holds the sum of all pixels in the rectangular region formed by the origin and (x, y).
$I_{\Sigma}(x, y) = \sum_{i \le x} \sum_{j \le y} I(i, j)$

It then takes only four operations to calculate the sum of a rectangular region of any size. If A, B, C and D are the integral image values at the four corners of the region, the sum is

$\Sigma = A - B - C + D$

Figure 4 and Figure 5 show example pixel values of an original and an integral image.
Figure 4: Original image

1 1 1
1 1 1
1 1 1

Figure 5: Integral image

1 2 3
2 4 6
3 6 9
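A small sketch reproducing this example is shown below; integral_image and box_sum are illustrative helper names, not part of any particular library.

```python
import numpy as np

def integral_image(image):
    """Each location holds the sum of all pixels above and to the left,
    inclusive (the 3x3 image of ones gives the values shown in Figure 5)."""
    return image.cumsum(axis=0).cumsum(axis=1)

def box_sum(ii, top, left, bottom, right):
    """Sum of any rectangular region using four look-ups (A - B - C + D)."""
    A = ii[bottom, right]
    B = ii[top - 1, right] if top > 0 else 0
    C = ii[bottom, left - 1] if left > 0 else 0
    D = ii[top - 1, left - 1] if top > 0 and left > 0 else 0
    return A - B - C + D

img = np.ones((3, 3), dtype=int)
ii = integral_image(img)
print(ii)                        # [[1 2 3] [2 4 6] [3 6 9]]
print(box_sum(ii, 1, 1, 2, 2))   # sum of the bottom-right 2x2 block -> 4
```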
Once an integral image is computed, the remaining steps in SURF, such as computing the Haar wavelet responses and the Hessian matrix used to find interest points, become computationally cheap.
4.1.2.2 Interest Point Detection
To detect interest points, a popular choice of detector has been used, the Hessian matrix, because of its good performance and accuracy when combined with integral images. A point is selected as an interest point where the determinant of the matrix is largest. For a point $\mathbf{x} = (x, y)$ in an image I, the Hessian matrix $H(\mathbf{x}, \sigma)$ at scale $\sigma$ is defined as:

$H(\mathbf{x}, \sigma) = \begin{bmatrix} L_{xx}(\mathbf{x}, \sigma) & L_{xy}(\mathbf{x}, \sigma) \\ L_{xy}(\mathbf{x}, \sigma) & L_{yy}(\mathbf{x}, \sigma) \end{bmatrix}$

where $L_{xx}(\mathbf{x}, \sigma)$ is the convolution of the second-order Gaussian derivative with the image at point $\mathbf{x}$, and similarly for $L_{xy}$ and $L_{yy}$. The authors followed the argument made by Lindeberg (Lindeberg, 1988) that the Laplacian-of-Gaussian is optimal for scale space, as compared to the Difference-of-Gaussian, which causes a loss in repeatability under image rotation (Bay et al.). An interest point is selected if a pixel has the maximum or minimum value within a window of 26 neighbouring pixels, and the determinant of the Hessian matrix is used to find points in scale space. The main focus of this technique is to optimize performance; the authors therefore explored every possibility to reduce the computational cost, one example being the Fast-Hessian detector, which uses the efficient non-maximum suppression proposed by (Neubeck & Gool, 2006).
4.1.2.3 Localization of Interest Points
As the scale space is built using different values of the Gaussian filter and the difference between two smoothed images is substantial, it is important to identify each point's position by interpolation.
4.1.2.4 Orientation Assignment
To find features which are invariant to rotation, Bay et al. suggested using the Haar wavelet responses in the x and y directions in a circular neighbourhood of radius 6s around the interest point, where s is the scale at which the interest point was detected. Again, the integral image optimizes the computation of the Haar wavelets, since box filters can be used to compute the responses in the x and y directions.
Figure 6: Haar Wavelet Filter for x and y direction
The dark region of each filter has weight -1 and the white region has weight +1, so the result can be obtained with only six operations. The dominant orientation is calculated by summing all responses within a sliding orientation window of size π/3.
4.1.2.5 SURF Descriptor
After identifying the points with their scales and orientations, the remaining task is to calculate the descriptor. The technique divides the region around the point into 4 x 4 sub-regions and calculates the Haar wavelet responses of each sub-region in the horizontal and vertical directions, together with the sums of the absolute values of the responses.
Figure 7: SURF Descriptor (Bay, Ess, Tuytelaars, & Luc, 2006)
The results for each sub-region are stored as a four-dimensional vector, so the descriptor length comes to 64 for the 4 x 4 sub-regions. Similar to SIFT, the SURF descriptor is normalized to unit length so that it becomes invariant to illumination changes.
4.2 Matching Algorithms

Matching two images requires finding point-to-point correspondences between the pair of images. The similarity between features can be found using their descriptors, which hold the content information of each feature point: if the distance between the descriptors of two features is less than some threshold, they are considered matched features. Taking only the Euclidean distance between two feature points does not always give true matches because of the presence of some highly distinctive features (Lowe, 2004). Lowe suggested that the distance to the nearest neighbour should be compared with that to the second-nearest neighbour, and the match accepted only if the nearest distance is sufficiently lower than the second. In SURF, Bay et al. followed the same approach to find point-to-point correspondences.

4.2.1 Nearest Neighbor Algorithm
The nearest neighbour can be found simply by using the Euclidean distance between two descriptors: if the distance between them is less than some threshold, they qualify as nearest neighbours. But to apply this
technique in practice, some data structure is required, and its choice plays a critical role in efficiently finding corresponding features between two images. A data structure is also necessary because the descriptor is normally a multidimensional array of data, and matching all of it exhaustively is time consuming. The original implementations of SIFT and SURF use similar data structures. The SIFT implementation uses the Best-Bin-First (BBF) algorithm, which is a modified search ordering for the k-d tree algorithm; an added benefit of this method is its priority-queue-based search, which can be stopped after checking a certain number of bins (for SIFT the search is stopped after the first 200 candidates, to reduce the computational load of comparing unimportant matches). SURF also uses k-d trees to find nearest neighbours, but here again SURF claims efficiency over SIFT because of information collected during the detection phase, i.e. the sign of the Laplacian (the trace of the Hessian matrix), which distinguishes between blob types. Here blobs are regions around feature points, and a feature point is detected on the basis of a blob response that is either a dark spot on a light background or a light spot on a dark background; only points having the same sign of the Laplacian are considered for matching. Similarity based on descriptor distance alone is not sufficient to obtain correct matches; therefore, we need an additional robust algorithm that can also estimate the transformation between matched points and identify the false matches. Two well-suited techniques for this are RANSAC (Bolles, Martin, & Robert, 1981) and the Hough Transform (Ballard & Brown, 1982).
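The sketch below illustrates nearest-neighbour matching with Lowe's ratio test described above, using a brute-force search for clarity rather than the BBF/k-d tree search of the real implementations; the 0.8 ratio is an assumed threshold.

```python
import numpy as np

def match_descriptors(desc1, desc2, ratio=0.8):
    """Brute-force nearest-neighbour matching with the ratio test.

    desc1, desc2 : arrays of shape (n1, d) and (n2, d) of feature descriptors.
    Returns a list of (index_in_desc1, index_in_desc2) putative matches.
    """
    matches = []
    for i, d in enumerate(desc1):
        # Euclidean distance from this descriptor to every descriptor in image 2.
        dists = np.linalg.norm(desc2 - d, axis=1)
        nearest, second = np.argsort(dists)[:2]
        # Accept only if the best match is clearly better than the second best.
        if dists[nearest] < ratio * dists[second]:
            matches.append((i, nearest))
    return matches

rng = np.random.default_rng(0)
d2 = rng.random((50, 64))
d1 = d2[:10] + 0.01 * rng.random((10, 64))   # noisy copies of ten descriptors
print(len(match_descriptors(d1, d2)))
```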
4.2.2 RANSAC
RANSAC (Random Sample Consensus) is an iterative method to partition a dataset into inliers (the consensus set) and outliers (the remaining data); it also delivers an estimate of the model computed from the minimal set with maximum support. The concept behind RANSAC is to select four point correspondences at random; these define a projective transformation (homography matrix), and the support for this transformation is measured by the number of points that lie within a certain distance of it. The random selection is then repeated
a number of times, and the transformation with the greatest support is considered the best model fit (Richard & Andrew). The algorithm contains the following steps; an illustrative sketch is given after the list.

1. Compute interest points in each image.
2. Compute a set of interest point matches based on descriptor similarity.
3. For N samples (where N is the number of iterations):
   a. Select a random sample of 4 correspondences and compute the homography matrix H.
   b. Calculate the distance d for each pair of corresponding feature points under H.
   c. Compute the number of inliers consistent with H, i.e. those for which d < t (where t is a threshold distance for the similarity measure).
4. Choose the H with the largest number of inliers.
5. For an optimal solution, re-estimate H from all correspondences classified as inliers by minimizing the cost function.
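The following sketch illustrates steps 3 to 5 with a direct linear transform (DLT) homography fit; it is an illustrative Python/NumPy version, not the project's C++ implementation, and the iteration count and inlier threshold are assumptions.

```python
import numpy as np

def fit_homography(src, dst):
    """Direct Linear Transform: estimate H from >= 4 point correspondences."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, Vt = np.linalg.svd(np.asarray(A, dtype=float))
    return Vt[-1].reshape(3, 3)

def transfer_error(H, src, dst):
    """Distance between projected source points and destination points."""
    pts = np.column_stack([src, np.ones(len(src))]) @ H.T
    pts = pts[:, :2] / pts[:, 2:3]
    return np.linalg.norm(pts - dst, axis=1)

def ransac_homography(src, dst, n_iters=500, threshold=3.0, rng=None):
    """Repeatedly fit H to 4 random correspondences and keep the hypothesis
    with the largest consensus (inlier) set, then re-estimate from the inliers."""
    rng = rng or np.random.default_rng()
    best_inliers = np.zeros(len(src), dtype=bool)
    for _ in range(n_iters):
        sample = rng.choice(len(src), 4, replace=False)
        H = fit_homography(src[sample], dst[sample])
        inliers = transfer_error(H, src, dst) < threshold
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return fit_homography(src[best_inliers], dst[best_inliers]), best_inliers

rng = np.random.default_rng(1)
src = rng.random((40, 2)) * 100
dst = src + [5.0, -3.0]                      # correspondences related by a translation
dst[:5] = rng.random((5, 2)) * 100           # five gross outliers
H, inliers = ransac_homography(src, dst, rng=rng)
print(inliers.sum(), "inliers found")
```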
4.2.3 Hough Transform
The Hough Transform was originally developed to identify lines by representing each candidate pixel as a line in Hough space; the bin with the maximum number of pixels voting for a line predicts the position of that line in the image. This idea of using a parametric space to find geometric shapes was later extended to general shapes and to matching. For matching applications, the Hough Transform is used to find clusters of features that vote for a particular model pose. Each feature votes for all object/model poses that are consistent with the feature's location, scale and orientation, and the peak in the Hough parameter space identifies the feature points with a consistent transformation, which are then marked as matched feature points. The advantage of the Hough Transform is that it searches all possible poses of the object simultaneously by accumulating votes in the parameter space. The problem with this implementation is that as the number of parameters increases, the dimensions of the accumulator also increase. This not
only increases the time complexity of filling in the votes and then parsing the whole space to find the highest peak, but also uses a lot of memory. Despite its high time and space complexity, its accuracy is unobjectionable. The procedure for feature matching is listed below, followed by an illustrative sketch.

Hough Transform for Feature Matching
1. Generate a four-dimensional accumulator array A(x, y, scale, θ).
2. Initialize the array to zero.
3. For all matched feature points in the pair of images:
   a. Calculate the transformation between the pair of matched feature points in terms of x position (x), y position (y), scale (s) and orientation (θ).
   b. Increment A(x, y, scale, θ).
4. Find the maximum value in the accumulator array; its x, y, scale and θ give the image transformation.
5. For all matched feature points, mark a feature point as a true match if it follows the image transformation identified by the Hough peak.
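The sketch below illustrates this voting scheme in a simplified form: each putative match votes for the quantised translation, scale change and rotation it implies, and only the matches in the most popular bin are kept. The bin widths are assumptions, and the broad-bin, multiple-vote refinements used in practice are omitted.

```python
import numpy as np
from collections import Counter

def hough_filter_matches(matches, tx_bin=20.0, scale_bin=0.5, angle_bin=30.0):
    """Cluster putative matches by the similarity transform they imply and keep
    those that vote for the most popular pose.

    Each match is a pair of keypoints, given here as (x, y, scale, angle_deg).
    """
    votes = Counter()
    keys = []
    for (x1, y1, s1, a1), (x2, y2, s2, a2) in matches:
        # Quantise the implied translation, log-scale change and rotation.
        key = (int((x2 - x1) // tx_bin),
               int((y2 - y1) // tx_bin),
               int(np.log2(s2 / s1) // scale_bin),
               int(((a2 - a1) % 360) // angle_bin))
        votes[key] += 1
        keys.append(key)
    best_key, _ = votes.most_common(1)[0]
    return [m for m, k in zip(matches, keys) if k == best_key]

# Matches consistent with a single translation survive; the outlier is rejected.
good = [((10, 10, 1, 0), (60, 12, 1, 0)), ((30, 40, 1, 0), (80, 42, 1, 0))]
bad = [((5, 5, 1, 0), (200, 300, 2, 90))]
print(len(hough_filter_matches(good + bad)))   # -> 2
```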
4.3 Testing Methods

Algorithms developed for image processing and vision problems may work very well for one application but fail to give impressive results for another. Some object detection techniques, for example, may work well to identify rigid objects (with simple geometric shapes) but fail to detect non-rigid or deformable objects. It is, therefore, essential to assess the performance of an algorithm before applying it to a particular problem. To characterize the performance of algorithms, many methods have been introduced. The most common and logical of these compare the outcome of a method with known results and then tabulate the counts (TP, FP, FN, TN):
TP | FP
FN | TN
where
TP is the number of test predictions that correctly report a match,
TN is the number of test predictions that correctly report a non-match,
FN is the number of test predictions that falsely report a non-match,
FP is the number of test predictions that falsely report a match.

Because these figures do not separately convey the information we need about the performance of an algorithm, different methods have been developed for combining them and presenting them graphically. Some of these are described and used in this project.

4.3.1.1 ROC Curve (Receiver Operating Characteristic Curve)

An ROC curve (obtained by varying one or more parameters) is a plot of false positive rate versus true positive rate. This curve shows the correct predictions as well as the false predictions; the closer the curve is to the top-left corner of the graph, the better the algorithm, as shown in Fig. 8(a). The two plots show ROC curves for two matching algorithms represented by two different colours, red and green.
Figure 8: ROC Curves, (a) and (b)
Fig. 8(a) clearly shows that the algorithm represented by the red line performed much better than the algorithm represented by the green line, as evidenced by its higher true positive rate. The graph on the right, however, gives no clear information about the relative performance of the two algorithms, making it impossible to say which one is better. In such a situation, some other, more sophisticated method needs to be applied; McNemar's test is one such method.
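The following sketch shows how the (false positive rate, true positive rate) points of such a curve can be computed from labelled match scores; the score values and threshold grid are made-up examples.

```python
import numpy as np

def roc_points(scores, labels, thresholds):
    """Compute (false-positive rate, true-positive rate) pairs for an algorithm
    that declares a match when its score exceeds a threshold.

    scores : similarity score for each candidate match
    labels : 1 if the candidate is a true correspondence, 0 otherwise
    """
    scores, labels = np.asarray(scores), np.asarray(labels)
    points = []
    for t in thresholds:
        predicted = scores >= t
        tp = np.sum(predicted & (labels == 1))
        fp = np.sum(predicted & (labels == 0))
        fn = np.sum(~predicted & (labels == 1))
        tn = np.sum(~predicted & (labels == 0))
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points   # plot FPR on x and TPR on y to obtain the ROC curve

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
print(roc_points(scores, labels, thresholds=np.linspace(0, 1, 5)))
```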
4.3.1.2 McNemar's Test

McNemar's test is a form of chi-square test for matched-pair data. In this test, the results of the two algorithms are stored as follows:

                       | Algorithm A failed | Algorithm A succeeded
Algorithm B failed     |        Nff         |         Nsf
Algorithm B succeeded  |        Nfs         |         Nss

McNemar's test statistic is then calculated as:

$\chi^2 = \frac{(|N_{sf} - N_{fs}| - 1)^2}{N_{sf} + N_{fs}}$
(the -1 is the continuity correction). "If the number of tests is greater than 30 then the central limit theorem applies. This states that if the sample size is moderately large and the sampling fraction is small to moderate, then the distribution is approximately Normal" (Clark & Clark, 1999). In this case, we calculate the Z-score of the algorithms using the following equation:
    Z = (|Nsf - Nfs| - 1) / sqrt(Nsf + Nfs)
The Z-score should be interpreted using a Z-score table so that algorithms 'A' and 'B' can be categorized as similar/different and better/worse with a confidence limit. If algorithms 'A' and 'B' are similar, the Z-score will tend towards zero; if they are different, a two-tailed prediction can be used with a confidence limit to show that the two algorithms differ, whereas a one-tailed prediction can be used to establish the superiority of one algorithm over the other, with the following confidence limits.

Table 3: Z-Score table

Z Score    Degree of Confidence (Two-Tailed Prediction)    Degree of Confidence (One-Tailed Prediction)
1.645      90%                                             95%
1.960      95%                                             97.5%
2.326      98%                                             99%
2.576      99%                                             99.5%
If the Z-score is 11.314, for example, then using the above table the following results can be deduced: we are 99% confident that algorithms 'A' and 'B' give different results, and 99.5% confident that algorithm 'A' is superior to algorithm 'B'.
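A minimal sketch of this calculation, assuming the disagreement counts are collected as in the table above (function and variable names are illustrative):

```python
import math

def mcnemar_z(n_sf, n_fs):
    """Z-score for McNemar's test with continuity correction.
    n_sf: cases where A succeeded and B failed; n_fs: cases where A failed and B succeeded."""
    if n_sf + n_fs == 0:
        return 0.0
    return (abs(n_sf - n_fs) - 1) / math.sqrt(n_sf + n_fs)

# e.g. 32 and 3 disagreements give Z ~ 4.73, above the 2.576 threshold,
# so the two algorithms differ with 99% (two-tailed) confidence.
z = mcnemar_z(32, 3)
```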
5 Chapter: Performance Evaluation
5.1 Evaluation Data
The data selected to evaluate the performance of the matching and feature extraction algorithms are the standard datasets widely used for comparing vision algorithms, provided on the University of Oxford website1. The datasets cover five different imaging conditions and synthetic transformations, listed below:
Increase in Blur
JPEG Compression
Change in Illumination
Change in View Point
Change in Scale and Orientation
Each dataset contains six images with varying geometric or photometric transformations. This means that the effect of a change in imaging conditions can be separated from the effect of changing the scene type. Each scene type contains homogeneous regions with distinctive edge boundaries (e.g. graffiti, buildings), while only some contain repeated textures of different forms. All images are of medium resolution, approximately 800 x 640 pixels.
1 http://www.robots.ox.ac.uk/~km/
Trees and Bikes datasets: sequences of images with an increase in blur from left to right; the blur has been introduced by varying the camera focus.
UBC dataset (University of British Columbia building): images with an increase in JPEG compression; the sequence is generated using a standard xv image browser with the image quality parameter varying from 40% to 2%.
Leuven dataset: a sequence of six images with a decrease in brightness; the light changes are introduced by varying the camera aperture.
Figure 9: Datasets and their description (a)
The first dataset is known as the Graffiti dataset, while the second contains images of a wall; both are sequences of images taken from different viewpoints. In the viewpoint change test, the camera varies from a fronto-parallel view to one with significant foreshortening, at approximately 60 degrees to the camera.
The Boat and Bark datasets are images with differences in scale and orientation. The scale change sequences are acquired by varying the camera zoom; the scale changes by a factor of 4, and the images were taken by rotating the camera up to a 60-degree angle, each image in a sequence being rotated 10 degrees from the previous one.
Figure 10: Datasets and their description (b)
5.1.1 Code Implementation
There are a few open source implementations available for both the SIFT and SURF algorithms. Recently, Bauer et al. (2007) compared several of them (Bauer, Sunderhauf, & Protzel, 2007). These include three SIFT codes (David Lowe's original implementation2, SIFT++3 and LTI-lib4 SIFT) and two SURF codes (SURF5 and SURF-d). According to this study, Lowe's implementation and SIFT++ give the best results, while SURF and SURF-d give results very close to the former two. Lowe's original SIFT implementation is distributed only in binary form and therefore could not be used. There are some other implementations of both algorithms available, such as SIFT6 by Rob Hess and the open source SURF7 feature extraction library (under a GNU license). These two are written in C++ with OpenCV in Microsoft Visual Studio and have therefore been selected for use in this project. They do not come with matching code such as RANSAC or Hough Transform (except for SIFT by Rob Hess, which already includes a RANSAC implementation); the rest of the code has therefore been developed in this study.

2 http://www.cs.ubc.ca/~lowe/keypoints/
3 http://vision.ucla.edu/~vedaldi/code/siftpp/siftpp.html
4 http://ltilib.sourceforge.net
5 http://www.vision.ee.ethz.ch/~surf/
6 http://web.engr.oregonstate.edu/~hess/
7 http://code.google.com/p/opensurf1/
5.1.2 RANSAC vs Hough Transform
Matching features between a pair of images is a complex problem. It involves comparing the features of a reference (model) image with all the features of the test image in order to find the relative matches, and the complexity of the problem increases with the number of features in both images. Hough space has traditionally been used to accumulate votes of image pixels to find lines, circles, or other shapes. This concept (the accumulation of votes) has been extended to match feature points by collecting
the votes for all sets of poses in an image to match an object. There are other techniques, such as alignment techniques, in which a subset of test and reference image features is used to determine the perspective transformation and orientation (Lowe, 2004). Most alignment-based methods use RANSAC to find the number of features which share a similar model or transformation. Tree-based methods are also used for different applications, such as k-d trees, which arrange points in a k-dimensional structure to facilitate key search or nearest neighbour search. Because of the high complexity of such algorithms, the most popular approaches are computational methods like RANSAC and Hough Transform. Both give good performance under similar conditions, which is why researchers have also proposed combining the two techniques (Brown, Szeliski, & Winder, 2006) to get better results. Although Hough Transform and RANSAC are different approaches, as described in sections 4.2.2 and 4.2.3, they produce comparable results for different types of image data. It is therefore essential to analyze the performance of these two matching strategies before conducting a comparative study of the two outstanding feature extraction algorithms. A good evaluation of the two algorithms in terms of computational time has been given by (Stephen, Lowe, & Little, October 2002); that comparison mainly focuses on the efficiency of the two techniques.
Although they showed Hough Transform to be computationally more expensive than RANSAC, it is worth exploring whether the quality of the results it generates is also comparable. To carry out this comparison, a large number of experiments have been performed on test images taken from the datasets with different imaging conditions and affine transformations. SIFT and SURF features have been matched using RANSAC and Hough Transform in pairs of images, and the results compared in terms of true positive matches against a scale error threshold. The step-by-step procedure was as follows:
1. Feature detection using SIFT and SURF
2. Finding point-to-point correspondences between feature points in a pair of images based on a similarity measure over their descriptors (using the nearest neighbour algorithm)
3. Identifying true positive and false positive matches based on the difference from the actual transformation (the peak value in Hough space)
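Step 2 above relies on nearest-neighbour matching of the descriptors. The following sketch shows one common way of doing it with a distance-ratio check; the 0.7 threshold mirrors the nearest neighbour distance used later for the application (Table 14), and everything else in the snippet is an illustrative assumption rather than the project's exact code.

```python
import numpy as np

def nn_matches(desc1, desc2, ratio=0.7):
    """Nearest-neighbour matching of descriptors with Lowe's distance-ratio check."""
    matches = []
    for i, d in enumerate(desc1):
        dists = np.linalg.norm(desc2 - d, axis=1)   # Euclidean distance to every descriptor in image 2
        nearest, second = np.argsort(dists)[:2]     # indices of the two closest descriptors
        if dists[nearest] < ratio * dists[second]:  # accept only clearly unambiguous matches
            matches.append((i, nearest))
    return matches
```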
Figure 11: Graphs comparing the number of matches found by Hough Transform and RANSAC; panels (a) to (h) cover the Bikes, Trees, UBC, Leuven, Graffiti, Wall, Bark and Boat datasets (y-axis: number of matches; x-axis: image number 1 to 5).
The analysis starts by comparing the number of matches identified by both algorithms. The results are presented as bar charts in Fig. 11. It is obvious from these graphs that, owing to the different imaging conditions and the differences in scale and orientation, Hough Transform sometimes fails to find a sufficient number of matches, for example in the 'Graffiti', 'Wall', 'Bark' and 'Boat' datasets. The Graffiti and Wall datasets contain images with a change in viewpoint, while 'Bark' and 'Boat' are the datasets with different zoom and rotation. The Hough transform collects votes from features to identify similar transformations in position, scale and orientation, giving four degrees of freedom, i.e. a position on the x-axis and y-axis, a scale and an orientation. It appears that when the image is rotated or the viewpoint changes, the features lose consistency and therefore fail to create a single significant peak in Hough space. Hough did find the four or more matches needed to qualify under the matching criterion, but it did not prove to be the best option. In the remaining datasets, Hough and RANSAC give comparable numbers of matches, allowing their performance to be analyzed. The number of matches alone cannot be the criterion for judging the performance of an algorithm; therefore, true positive matches and error bars have also been calculated to complete the evaluation. Before investigating the quality of the results obtained using both algorithms, it is important to describe the method and purpose of calculating the actual transformations between pairs of images, and the method used to calculate the error bars. Verifying matching results without human consultation is the most difficult task: the number of local features identified by a good detector and descriptor lies in the thousands, and if only 50% of them match we would need to verify at least 500 matches, which is impossible to do visually. This demands a robust method for deciding the correctness of the matched points. Mikolajczyk and Schmid have suggested a mathematical method to find precision and recall values using the number of matches and correspondences (Mikolajczyk & Schmid, October 2005); however, it is unclear how they categorize a matched point as a correct or a false match.
Figure 12: ROC curves for Hough Transform and RANSAC matching Graffiti, Boat and UBC images; left column: matching SIFT features, middle column: matching SURF features, right column: SIFT results with error bars (x-axis: false positive rate).
To solve this issue, it is suggested that the Hough Transform can be used to find the exact transformation between feature points: the transformation followed by the maximum number of feature points is selected as the true difference between the features, and is then used to separate true positive from false positive matches. Features that do not follow the true transformation are considered false positive matches.
5.1.2.1 ROC Curves
To plot ROC curves we need a parameter which, when varied, shows the changing performance of the techniques. Different values of scale produce different numbers of matched features, and scale has been selected as the varying parameter because almost all datasets contain images with differences in scale level (except for the datasets with rotated images, 'Boat' and 'Bark', where angle has been selected instead). As the scale error threshold is changed from lower to higher values, the number of true positive matches also changes. The algorithm whose ROC curve approaches the top-left corner is considered best, as it has the maximum true positive matches with respect to false positive matches.
5.1.2.2 Error Bar Graphs
The method used to find the error bars is as follows (a small sketch of the computation follows the list):
1. Find the number of matches in a pair of images using RANSAC and Hough Transform.
2. Extract the true matches and repeat the process a number of times.
3. Calculate the mean and standard deviation of the number of true matches.
4. Calculate the standard error by dividing the standard deviation by the square root of the number of iterations, and plot mean vs. standard error.
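A minimal sketch of this error-bar computation, assuming the true-match counts from the repeated runs are collected in a list (names are illustrative):

```python
import math
import statistics

def error_bar(true_match_counts):
    """Mean and standard error of the number of true matches over repeated runs."""
    mean = statistics.mean(true_match_counts)
    sd = statistics.stdev(true_match_counts)        # sample standard deviation
    se = sd / math.sqrt(len(true_match_counts))     # standard error of the mean
    return mean, se
```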
Hough Transform and RANSAC have been compared for all sets of images and are discussed in the forthcoming paragraphs. Only one image per dataset is selected for discussion here; the results for the rest of the images are attached as Appendix A.
Figure 13: ROC curves for Hough Transform and RANSAC matching Bike and Leuven images; left column: matching SIFT features, middle column: matching SURF features, right column: SIFT matching results with error bars (x-axis: false positive rate).

Table 4: McNemar's test for Hough and RANSAC using SIFT features

                RANSAC Failed    RANSAC Passed
Hough Failed    46               32
Hough Passed    3                19
The Z-score is 4.73; we are therefore 99% confident that RANSAC and Hough produce different results, and 99.5% confident that RANSAC is better than Hough Transform.
Table 5: McNemar's test for Hough and RANSAC using SURF features

                RANSAC Failed    RANSAC Passed
Hough Failed    63               15
Hough Passed    5                18
The Z-score is 2.01; according to the Z-score table we are 95% confident that RANSAC and Hough are different algorithms, and 97.5% confident that RANSAC is better than Hough Transform.
For almost all the images presented here, RANSAC outperformed Hough Transform not only in finding a good number of matched points but also in obtaining the maximum number of true positive matches in the case of SIFT features. For SURF features, however, Hough performs almost as well as RANSAC, as is obvious from the graphs in Fig. 13 (for the 'Bike' and 'Leuven' datasets), so it is hard to select the better performer. In the next step, error bars have been calculated and the results presented in the third column of Figs. 12 and 13, which show that Hough, despite having fewer matches, shows no standard error and therefore produces accurate results. At the same time, RANSAC also shows negligible error bars, so it cannot be categorized as the worse matching technique when compared with Hough Transform. In the case of SURF features the ROC curves (shown in the second column of Figs. 12 and 13) overlap each other, making it impossible to tell which method gives better results, and even the error bars fail to distinguish the superior method. Therefore, another test (McNemar's test) is applied to resolve the issue. RANSAC and Hough Transform give contradictory results for the Bikes dataset: if the features are extracted using SIFT, RANSAC obtains more true positive matches than Hough Transform, but with SURF features Hough Transform produces better results. Because of this contradiction, McNemar's test is applied first to SIFT features and then to SURF features. The data for McNemar's test has been collected over all six images of each dataset and is therefore much more reliable than the ROC curves, where matching results between two images are presented. To sort the data into 'fail' or 'pass' for an algorithm, a pass limit of 15% has been selected, meaning that if an algorithm produces more than 15% of the total matches for a specific scale error threshold it is considered a pass; otherwise it is counted as a failure. The Z-scores calculated for both feature types have been interpreted using the Z-score table (Table 3), with the results shown alongside Tables 4 and 5. Both McNemar's tests prove that RANSAC is a better algorithm than Hough Transform. Further, the images selected as test data are either of planar scenes or the camera position
is fixed during acquisition (Oxford). Therefore, in all cases the images are related by a projective transformation. RANSAC works out the inliers using a homography matrix and is therefore well suited to removing outliers from matched pairs of image data. Hough Transform, on the other hand, uses all four parameters (position, scale and orientation) to identify the correct transformation, which sometimes results in considerably fewer votes from the feature points. In view of the above, RANSAC has been selected as the more effective algorithm and is used for the further evaluation of SIFT and SURF features. In addition, RANSAC has also been selected as the feature matching technique for the application developed.
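As an illustration of the homography-based inlier selection mentioned above, the following sketch uses OpenCV's RANSAC-driven homography estimation. The project's implementations are in C++ with OpenCV; this Python call and its reprojection threshold are assumptions made for the example, not the project's exact code.

```python
import cv2
import numpy as np

def ransac_inliers(pts1, pts2, reproj_thresh=3.0):
    """pts1, pts2: lists of (x, y) coordinates of matched keypoints in the two images.
    Returns the homography and a boolean mask marking the RANSAC inliers."""
    src = np.float32(pts1).reshape(-1, 1, 2)
    dst = np.float32(pts2).reshape(-1, 1, 2)
    H, mask = cv2.findHomography(src, dst, cv2.RANSAC, reproj_thresh)
    return H, mask.ravel().astype(bool)
```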
Figure 14: Graphs comparing the number of matched feature points detected by SIFT and SURF; panels (a) to (h) cover the Bikes, Trees, UBC, Leuven, Graffiti, Wall, Bark and Boat datasets (y-axis: number of matches; x-axis: image number 1 to 5).
5.1.3 SIFT vs. SURF
After selecting the best matching algorithm, the next step is to put the SIFT and SURF algorithms on trial and select the one with better performance on images with varying imaging conditions and different transformations, such as changes in brightness, compression, viewing conditions, scale and rotation. The algorithms are first compared on the basis of the number of corresponding feature points identified between pairs of images, and then more sophisticated tests are applied to determine the most suitable algorithm. The graphs in Fig. 14 clearly indicate the superiority of SIFT over SURF in finding more consistent feature points. It is important to mention that the number of features extracted by the two techniques depends greatly on the choice of parameter values, such as the number of octaves, samples per octave, blob threshold and non-maxima suppression threshold; however, this does not significantly influence the number of correct matches. For this comparison the parameter values are kept as prescribed in the implementations, as tuning them is beyond the scope of this project. The following table summarizes these parameters and their values.
Table 6: Algorithm parameter values
Parameter             Value for SIFT                      Value for SURF
Number of Octaves     log2(min(image width, height))      4
Samples per Octave    3                                   3
Threshold             Contrast threshold = 0.04           Blob response = 0.0004
It has been observed by (Bay, Ess, Tuytelaars, & Luc, 2006) that the maximum number of feature points can be extracted using up to the 4th octave; increasing the number of octaves further only increases processing time and does not contribute to performance. Keeping the parameter values static for all
types of images gives us the chance to monitor the behavior of the algorithms under varying imaging conditions.
The performance evaluation of feature extraction is divided into four testing areas. First, the number of true matches identified by the two techniques is compared for similar images; this highlights the method that gives the best performance under all imaging conditions. Then the percentage of correct matches is analyzed to check the strength of the feature detector and descriptor (Fig. 14): if the descriptor is good, the matches found by the algorithm are mostly true matches. The performance of both techniques has been evaluated for changes in blur, scale, rotation, viewpoint and JPEG compression.
Results for Blurred Images
Figure 15: Graphs comparing SIFT and SURF features for the Bikes dataset (increase in blur); (a) percentage of correct matches per image, (b) to (f) ROC curves (x-axis: false positive rate).
5.1.3.1 Blurring
The Bikes dataset presents a sequence of images with increasing blur, introduced by changing the camera focus; the amount of blurring ranges from 2 to 6% of the original image. It is interesting to observe that although the percentage of correct matches found by SURF is lower than for SIFT, as shown in Fig. 15(a), all of these feature points still have a consistent transformation in this case and are comparable to the SIFT features. The results of matching images 2 and 3 in Fig. 15(c) and images 4 and 5 in Fig. 15(e) indicate that SURF features are more consistent than SIFT, considering that these images are more blurred than the previous ones. Furthermore, there is no geometric transformation in these images, but the pixel intensities change unpredictably across regions; both descriptors therefore manage to find most of the true positive matches. The overall results of matching all six images of the dataset do not help in selecting the most suitable algorithm, so McNemar's test is performed (Table 7) on the empirical data to determine the more appropriate one.
Table 7: McNemar's test for blurred images
               SIFT Failed    SIFT Passed
SURF Failed    5              4
SURF Passed    0              31

The Z-score is 1.5: both algorithms are similar for the Bikes dataset.
Interpreting the Z-score shows that the two algorithms perform similarly, as the score falls below the 1.645 needed for even 90% confidence. For the images with more blurred regions, SIFT attained the maximum score when evaluated as in (Mikolajczyk & Schmid, October 2005). It is therefore fair to say that SIFT and SURF are equally suitable for matching and tracking applications on blurred images.
Results for JPEG Compression
Figure 16: Graphs comparing SIFT and SURF features for the UBC dataset (increase in JPEG compression); (a) percentage of correct matches per compression level (60%, 80%, 90%, 95%, 98%), remaining panels: ROC curves (x-axis: false positive rate).
5.1.3.2 JPEG Compression
In the UBC (University of British Columbia building) dataset the images have been JPEG compressed, the last image by almost 98%. Matching compressed images is important to analyze because most hardware (such as still and video cameras) nowadays compresses image data; vision algorithms need to be able to process compressed images or they will suffer from the loss of information introduced when image data is compressed and decompressed. Feature extraction therefore needs to be effective on these images before any real-time recognition or tracking application can be developed. In terms of the percentage of correct matches the two algorithms compete closely, with SURF having a slight edge over SIFT. The ROC curves (Fig. 16), however, show that the SIFT features, though fewer in number, are more consistent than the SURF features. The results depict the performance of the two techniques as nearly equal on compressed images, as both have a high true positive rate for a very small scale error threshold.
Table 8: McNemar's test for compressed images
               SIFT Failed    SIFT Passed
SURF Failed    5              10
SURF Passed    2              31

The Z-score is 2.02: 95% confident that SIFT and SURF are different; 97.5% confident that SIFT is better than SURF.
The Z-score for the SIFT and SURF techniques, as shown in the table above (Table 8), gives us more than 90% confidence in selecting SIFT over SURF for images with JPEG compression.
Results for Images with Change in Illumination
Figure 17: Graphs comparing SIFT and SURF features for the Leuven dataset (decrease in light); (a) percentage of correct matches per image, (b) to (f) ROC curves (x-axis: false positive rate).
5.1.3.3 Illumination
The Leuven dataset is a sequence of images with decreasing brightness. Most vision algorithms work on greyscale images, i.e. they convert colour images to greyscale before operating on them. Images with a change in brightness can affect the performance of these algorithms, as there may be little contrast between pixels in some parts of the image and more contrast in other parts. Nevertheless, SIFT manages to find more consistent features (as shown in Fig. 14(d)) which match correctly across images with varying pixel intensities.
Table 9: McNemar's test for images with change in illumination
               SIFT Failed    SIFT Passed
SURF Failed    1              20
SURF Passed    8              57

The Z-score is 2.07: 95% confident that the two algorithms are different; 97.5% confident that SIFT is better than SURF.
To make features illumination invariant, both algorithms follow the same method of converting the descriptor into a unit vector (intensity normalization), which accounts for overall brightness change; it is therefore not unexpected that both produce very similar results. The ROC curves (Fig. 17) indicate no significant difference between the two algorithms, although the SIFT features have a slight edge over SURF. McNemar's test supports this assessment with a Z-score above 2, meaning that SIFT is better than SURF for images with differences in illumination.
Results for Images with Change in View Point
Figure 18: Graphs comparing SIFT and SURF features for the Graffiti dataset (change in view point, 20 to 60 degrees); (a) percentage of correct matches per view point, (b) to (f) ROC curves (x-axis: false positive rate).
5.1.3.4 View Point (a)
The Graffiti and Wall datasets contain sequences of images with a change in viewpoint, with viewing angles in the range of 20 to 60 degrees. Because the viewing angle changes, the features are expected to undergo transformations in scale and orientation. The number of matches shown in Fig. 14(e) indicates that fewer SURF features are matched compared with SIFT, and on the basis of the percentage of correct matches SIFT produces the more accurate results. The true positive match results for the whole dataset in Fig. 18 indicate that SURF and SIFT are both equally good at detecting angular transformations; McNemar's test has therefore been applied to select one with a stated confidence limit.
Table 10: McNemar's test for images with change in view point
               SIFT Failed    SIFT Passed
SURF Failed    1              12
SURF Passed    4              50

The Z-score is 1.75: 90% confident that the two algorithms are different; 95% confident that SIFT is better than SURF.
McNemar's test shows the supremacy of SIFT over SURF in finding features that are robust to a change in view point; we can therefore select SIFT with 95% confidence for matching images with varying view points.
Results for Images with Change in View Point
Figure 19: Graphs comparing SIFT and SURF features for the Wall dataset (change in view point, 20 to 60 degrees); (a) percentage of correct matches per view point, (b) to (f) ROC curves (x-axis: false positive rate).
5.1.3.5 View Point (b)
Another dataset with a similar transformation has been used to test the behaviour of the two algorithms. The Wall dataset contains the same kind of viewpoint transformation as the Graffiti images, so it is not surprising that the results are also similar. Again both algorithms produce overlapping ROC curves (Fig. 19) and have been put through McNemar's test.
Table 11: McNemar's test for images with change in view point
               SIFT Failed    SIFT Passed
SURF Failed    1              5
SURF Passed    2              50

The Z-score is 0.75: both algorithms appear to be the same, as the Z-score approaches zero.
Contrary to the Graffiti dataset, the two algorithms appear to be similar in this case, showing that both techniques can give good results for textured images as well as for planar surfaces.
Results for Images with Difference in Zoom and Rotation
Figure 20: Graphs comparing SIFT and SURF features for the Boat dataset (change in zoom and rotation); (a) percentage of correct matches per scale change, (b) to (f) ROC curves (x-axis: false positive rate).
5.1.3.6 Zoom and Rotation (a)
The images tested in this section contain two transformations introduced by changing the camera angle and zoom. This is a complex task for both algorithms, and it is obvious from Fig. 14(g & h) that SURF struggles to find consistent features, mostly finding fewer than 500 matches, while SIFT performs better, keeping the number of matches above 1000. Irrespective of the number of matches, the percentage of correctness (Fig. 20(a)) shows the performance of both algorithms to be almost on par. Only the matching results for the first two images from the 'Boat' dataset show SURF performing better than SIFT; the rest of the images are better matched with SIFT features, as shown in Fig. 20(c), (d), (e) and (f). To draw the ROC curves for these datasets, the varying parameter has been changed from scale to angle, because these images have a rotational transformation and it is therefore more appropriate to assess their performance against an angle error threshold; each image is rotated 10 degrees more than the previous one. To further increase the degree of confidence in these results, McNemar's test has been applied to both datasets, with the result given below.
Table 12: McNemar's test for zoomed and rotated images
               SIFT Failed    SIFT Passed
SURF Failed    50             18
SURF Passed    2              80

The Z-score is 3.35: 99% confident that the two algorithms are different; 99.5% confident that SIFT is better than SURF.
Once again, SIFT features appear to be more consistent in the case of complex transformations, as indicated by the higher Z-score.
Results for Zoomed and Rotated Images
Figure 21: Graphs comparing SIFT and SURF features for the Bark dataset (change in zoom and rotation); (a) percentage of correct matches per angle, (b) to (f) ROC curves (x-axis: false positive rate).
5.1.3.7 Zoom and Rotation (b)
The sequence of images in the 'Bark' dataset also contains zoom and rotation transformations. The ROC curves in Fig. 21 show the poor performance of both algorithms: Fig. 21(a) indicates that SIFT and SURF both manage to find a good percentage of correct feature points, yet very few of them are true positive matches. Although the SIFT features appear to gain more true positive matches, their curve (Fig. 21) remains far from the top-left corner. Neither algorithm can be regarded as best in this scenario, but if a selection must be made then SIFT is the better option; the result of McNemar's test given below strengthens this claim.
Table 13: McNemar's test for zoomed and rotated images
               SIFT Failed    SIFT Passed
SURF Failed    50             15
SURF Passed    3              80

The Z-score is 2.59: 99% confident that the two algorithms are different; 99.5% confident that SIFT is better than SURF.
A Z-score above 2 again indicates the supremacy of SIFT over SURF features.
5.1.4 SIFT vs SURF Conclusion
The above comparative analysis shows that both techniques are equally good; however, SIFT features are more consistent under complex transformations such as scale change and rotation. If an application demands efficiency more than accuracy, SURF is the better option; if accuracy of the results is more important, SIFT is the better choice. The speed of the SURF algorithm has been emphasized since its development, and the vision community seems convinced of its suitability for real-time applications; however, the implementation of SURF used in this study was not found to be fast enough for real-time use, so this study only suggests choosing the technique according to the type of images. This comparative study differs from, and is more reliable than, previous studies because more sophisticated statistical tools (ROC curves and McNemar's test) have been used to analyze the performance of the algorithms on the same test data. Although the outcome is in line with the majority of previous studies, it is relatively more dependable.
6 Chapter: Application on real world video sequences
The whole exercise of evaluating the performance of SIFT and SURF along with the matching algorithms (RANSAC and Hough Transform) was carried out in order to choose the best algorithms for a recognition or tracking application. The conclusion which emerges from this study is that RANSAC is the best available option for local feature matching, and that SIFT or SURF can be used depending on the requirements of the application; SIFT features, however, appeared to be more robust to all kinds of imaging conditions and transformations.
To verify these results, it is important to use the techniques on real-world video rather than simulated data. They are therefore applied here to two different applications. The first uses local feature extraction and matching to track the motion of a moving object, such as a car or an aerial vehicle, and to identify its direction. The second identifies the movement of the subject carrying a handheld camera. The directions to be estimated are left, right, up and down. The proposed applications could support far more sophisticated systems, such as a control system for an autonomous vehicle that governs its movement and collects information about the surrounding environment, or a control system for a UAV (Unmanned Aerial Vehicle).
A control system for autonomous vehicles is an important requirement and demands both accuracy and efficiency. A shortcoming of current autonomous vehicles is that they do not carry high-speed processors and large memory, and are therefore unable to process data on their own to make decisions; instead, the data is sent back to a control system, which analyzes it and sends the command for the next action. With image data the problem becomes worse because of the large amount of image/video information: it takes considerable time to send and receive the data, and efficiency and accuracy are lost in the process. Therefore, a lot of work is being carried out to develop efficient and economical
algorithms for these kinds of applications. The use of local features for the global localization of a robot has been shown to give promising results (Stephen, Lowe, & Little, October 2002).
6.1.1 Proposed Method for Direction Estimation of a Moving Vehicle
SIFT has been used to identify local features in video data captured with a handheld camera and with a camera mounted on an automotive vehicle. As described previously, the threshold values of the different parameters play a vital role in the quality of the results. For use on real data, the contrast threshold has been adjusted to obtain better results; the parameter values used are as follows.
Table 14: Parameter values used for the application
Parameter                     SIFT
Contrast Threshold            0.0005
Nearest Neighbor Distance     0.7
The algorithm has been modified to calculate the direction of the vehicle. The basic method and modifications are as follows.
1. A set of matched feature points is obtained by applying SIFT along with RANSAC to consecutive frames of the video.
2. For all matched feature points:
   a. Calculate the accumulative difference in the features' positions between the two frames:
      Drift in x-axis = sum over i = 1..n of (x_i - x'_i)
      Drift in y-axis = sum over i = 1..n of (y_i - y'_i)
where x and y are a feature point's position in Frame 1, x' and y' are the corresponding feature's position in Frame 2, and n is the total number of matched features.
3. Calculate the average difference over all feature positions.
4. Estimate the direction of the vehicle under the following conditions (a minimal code sketch follows the list):
   a. If the average change along the x-axis is greater than 1, the vehicle is moving in the left direction.
   b. If the average change along the x-axis is less than -1, the vehicle is moving in the right direction.
   c. If the average change along the y-axis is less than -1, the vehicle is moving downward.
   d. If the average change along the y-axis is greater than 1, the vehicle is moving upward.
   e. For any change between -1 and 1 on either axis, the vehicle is considered to be moving straight.
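A minimal sketch of the drift computation and classification rules above, using the thresholds stated in the text (the function name and input format are illustrative assumptions):

```python
def estimate_direction(matches):
    """matches: list of ((x, y), (x2, y2)) giving a feature's position in frame 1 and frame 2.
    Returns (horizontal, vertical) direction labels."""
    n = len(matches)
    dx = sum(x - x2 for (x, _), (x2, _) in matches) / n   # average drift along the x-axis
    dy = sum(y - y2 for (_, y), (_, y2) in matches) / n   # average drift along the y-axis
    horiz = 'L' if dx > 1 else ('R' if dx < -1 else 'S')  # left / right / straight
    vert = 'U' if dy > 1 else ('D' if dy < -1 else 'S')   # up / down / straight
    return horiz, vert
```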
Using this data, further information can also be derived, such as the time for which the vehicle moved in one direction: count the number of consecutive frames for a particular movement (N), then

    time of movement = N / frame rate.

If the speed of the vehicle is known, the length of the movement can also be calculated as

    length = speed x time.

Unfortunately, the data collected for this project lacks the speed of the vehicle in the road videos as well as other metadata for the aerial videos, so it is not possible to compute the proposed 'time' and 'length' values for specific movements.
6.1.2 Real World Videos for Experiment
Two types of videos have been used to test the algorithm: first, video recorded by a camera fixed in an automotive vehicle, and second, video recorded with a handheld camera. Both types were recorded by Dr. Adrian Clark. Some aerial videos have also been used to check the system; as no aerial video data was available for experimentation, these sequences were taken from the "Proaerialvideo"8 website, where royalty-free videos are available to download. The following videos have been recorded and are discussed below.
Table 15: Real world video data
Video # 1    Car moving on a straight path with left, right and straight movements
Video # 2    Car moving on a hill with left, right, up, down and straight movements
Video # 3    Indoor video made using a handheld camera
Video # 4    Outdoor video made using a handheld camera
Video # 5    Video of an aircraft taking off
Video # 6    Video of an aircraft flying over trees
Video # 7    Air-to-air video
OpenCV with Visual Studio (Express Edition) has the limitation of being unable to read compressed video files; the video frames were therefore stored as JPG images so that the program could read the image data. Hence both the data and the results are stored in JPG image format.
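As a small illustration of this workaround, the frames can be read back in order from the image files; the directory and file-name pattern below are assumptions for the example, not the project's actual paths.

```python
import glob
import cv2

# Read back the frames that were dumped as JPG images, in order
# (assumes zero-padded file names such as frame_0001.jpg so that the sort is numerical).
frame_files = sorted(glob.glob('frames/frame_*.jpg'))
for path in frame_files:
    frame = cv2.imread(path, cv2.IMREAD_GRAYSCALE)   # one greyscale frame at a time
    # feature extraction and matching on consecutive frames would go here
```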
8 http://www.proaerialvideo.com/categories.php
6.1.3 Experimental Setup
A video sequence has been recorded using a camera mounted at the front of the car. The camera used for this purpose has the following specifications.
Table 16: Camera specifications
Model:           Canon PowerShot S5 IS
Resolution:      8 Megapixels
Sensor:          5.8 x 4.3 mm CCD
Lens:            12x optical zoom
Image size:      640 x 480 (used)
Video format:    AVI (Motion JPEG compression)
Camera Positioning: The camera was fixed at the front dash board of the car and was zoomed out to take the video of the road and surroundings. To minimize the bumping effect (due to uneven roads) and to keep the camera still it was fixed to the metal using glue.
Figure 22: Direction detection in two sample frames
6.1.3.1 Data Recorded
The output of the program is stored in a text file, one record per frame, with the following fields:

Frame # | # of features found | # of matches | # of inliers | drift in x-axis | drift in y-axis | x-axis direction | y-axis direction
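A minimal sketch of how such records can be parsed to measure per-direction accuracy over a range of frames, as done in the verification described below; the exact file layout (whitespace-separated fields, direction labels in the last two columns) is an assumption for the example.

```python
def correctness(log_lines, start, end, expected, column=-2):
    """Percentage of frames in [start, end] whose predicted direction label
    (by default the x-axis direction, the last-but-one field) equals `expected`."""
    hits = total = 0
    for line in log_lines:
        fields = line.split()
        if not fields:
            continue
        frame = int(fields[0])
        if start <= frame <= end:
            total += 1
            hits += fields[column] == expected
    return 100.0 * hits / total if total else 0.0
```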
'Matches found' (red circles in Fig. 22) are the corresponding features identified by the nearest neighbour algorithm, whereas '# of inliers' (blue circles around red circles in Fig. 22) are the features whose transformation is consistent with the homography matrix calculated by RANSAC. The program calculates the drift along the x and y axes to predict the movement as 'L' (left), 'R' (right) or 'S' (straight) in the last two columns (shown by the red line and circle in the middle of the images in Fig. 22).
6.1.4 Verification of Results
The movement of the vehicle is identified by the change in the position of feature points as they are tracked from frame to frame. If the features have an average drift of less than -1 along the x-axis, the vehicle is moving to the right, and otherwise to the left; similarly, if the average drift along the y-axis is less than -1, the object is moving downward, and otherwise upward. For both axes, if the average drift lies between -1 and 1 the motion is considered straight. To verify the results and calculate the correctness, the number of frames is counted for a particular direction of motion and compared with the results generated by the system.
6.1.5 Results and Discussion
The following sections present and discuss the tracking results for predicting an object's direction in real-world videos.
Video # 1
Figure 23: Frames from video # 1 (car moving on a straight road)
Figure 24: Graphs indicating the left and right motion of a moving vehicle; (a) left/right movement using SIFT, (b) up/down movement using SIFT (left and right turns annotated).
6.1.5.1 Video 1: Car moving on a straight road with left/right turns
The video has been captured using the on-board camera. It is approximately five minutes long, with 9794 frames; Fig. 23(a & b) shows two sample frames. The road is almost flat and most of the time the car is moving straight. The significant left movement appears in the first set of frames and the right movement in the last set of frames, so the verification of these two movements has been done on those frames.
Table 17: Video # 1 results verification
Movement    Threshold    Frame Numbers    Total Number of Frames    Correctly Predicted Frames    Correctness
Left        > 1          1873-2170        297                       207                           69%
Left        > 1          8360-8680        320                       308                           96%
Right       < -1         45-180           135                       133                           98.5%
Right       < -1         2240-2910        670                       454                           68%
The results obtained are quite encouraging, as shown in the graphs of Fig. 24(a & b). The accuracy lies roughly in the range of 60 to 90% for the right and left turns. The system is unable to precisely predict the left and right motion on some occasions in this particular video. This may be for two reasons: first, the camera position (which may be tilted towards the right); and second, the lack of overlap between adjacent regions. It is visible in the video that when the car takes a left turn the view is not broad enough to capture the scene beyond the turn, so consecutive frames have very little overlapping region. With such a small proportion of overlapping region between frames, feature tracking becomes difficult and results in false predictions.
Video # 2
Figure 25: Frames from video # 2 (car moving up and down a hill)
Figure 26: Graphs showing the vehicle's motion in the left, right, up and down directions (right turn, upward motion and downward motion annotated).
6.1.5.2 Video 2: Car moving up and down the hill
This video has also been captured using the on-board camera. It is approximately three minutes long, with 5257 frames.
Table 18: Video # 2 results verification
Movements evaluated: straight, left, right, up and down, with the thresholds defined in section 6.1.1 (straight: -1 to 1; left and up: > 1; right and down: < -1). Results for the evaluated segments:

Frame Numbers    Total Number of Frames    Correctly Predicted Frames    Correctness
480-640          160                       86                            54%
2930-3225        295                       102                           34.5%
4045-4408        363                       118                           32.5%
01-190           190                       62                            32.6%
3325-4060        735                       14                            2%
2700-2870        170                       170                           100%
310-470          161                       123                           76.4%
862-1070         210                       123                           58.6%
1860-2350        490                       123                           25%
3325-4370        1045                      23                            2%
45-435           390                       158                           40.5%
2930-3225        295                       106                           36%
The system successfully identified the right movement with the highest accuracy (Table 18), and was roughly 40% reliable in detecting straight and downward movement, but it almost failed (with less than 20% accuracy) to find the left and upward movements. During a left movement the road itself is turning along with the vehicle, which causes all feature points to be identified as moving to the left (Fig. 26), so the system ends up classifying the motion as a right movement of the vehicle instead of a left one. In the case of upward movement the system does detect the upward motion, but the difference in feature locations is so small that it cannot differentiate between a level and a steep path. Another important factor in the poor performance for left and upward movements is the reduction in the matching percentage: for right and downward motion the matching percentage stays above 60%, but when the vehicle moves left or climbs, the background keeps changing and the matching
percentage drops below 50%, which causes nearby features to be matched and increases the count of straight motion.
Video # 3
Figure 27: Sample frames from the indoor video
Figure 28: Graphs showing the left and straight motion in the indoor video sequence.
6.1.5.3 Video 3: Indoor motion sequence
This is an indoor video captured using a handheld camera. The sequence lasts a few seconds, with 421 frames. The subject carrying the camera is moving along an 'L'-shaped corridor and therefore takes a sharp left turn during the sequence.
Table 19: Video # 3 results verification
Movement          Threshold    Actual # of frames in which movement occurs    Correctly predicted    Correctness
Left in x-axis    > 1          65                                             58                     89%
Straight          -1 to 1      100                                            80                     80%
Threshold -1 to 1: 217 frames, 157 correctly predicted (72%)
Up-ward in y-axis, threshold > 1: 139 frames, 37 correctly predicted (48%)
The video shows that initially the aircraft is moving from right to left along the runway, and so is the camera; there is therefore only left motion up to about the 80th frame, after which the upward movement starts, and the system has identified this movement with reasonable accuracy. The graphs in Fig. 32 clearly show that most of the results are consistent with the actual data, with the positive values representing left movement (Fig. 32(a)) and upward movement (Fig. 32(b)). In this clip the camera and the object in focus move in the same direction, so the features identified by SIFT in the background help to predict the correct movement. Another important factor is that the background has a large overlapping region between frames, and hence the matched features have a sufficiently large translational transformation.
Video # 6
Figure 33: Aircraft flying over trees
Figure 34: Graphs presenting the upward and straight motion of an aerial vehicle; (a) detecting up/down movement (upward motion), (b) detecting left/right movement (straight motion).
6.1.5.6 Video 6: Video of an aircraft flying over trees
This video has been taken from an aerial vehicle which flies over some trees and past a monument. The scene contains dense regions in which SIFT finds a large number of local features, which in turn helps to predict the direction.
Table 22: Video # 6 results verification
Movement              Threshold    Actual # of frames in which movement occurs    Correctly predicted    % Correctness
Straight in x-axis    -1 to 1      378                                            378                    100%
Up-ward in y-axis     > 1          378                                            378                    100%
The system shows 100% reliability for this particular sequence. The obvious reason is the presence of dense regions and a broad view, which gives the frames a high proportion of overlapping region; the large number of features results in a high matching rate, enabling the system to detect the direction easily on both axes (as shown in Fig. 34(a & b)).
Video # 7
Figure 35: Air-to-air video sequence
Figure 36: Graphs presenting the left, right, up and down movement of an aircraft.
6.1.5.7 Video 7: Air-to-air video of an aircraft
An on-board camera has been used to film another aircraft flying in front. The system is unable to detect the motion properly: because both the subject carrying the camera and the object in front of it are moving, it is difficult for the system to resolve the relative motion. If the object moves upward while the subject stays in its original position, the system considers the motion of the subject to be downward, because the features identified on the object move upward (as shown in Fig. 36); the same problem occurs for the left and right directions. The correctness of the algorithm can be seen in the following table.
Table 23: Video # 7 results verification
Movement              Threshold       Actual # of frames in which movement occurs    Correctly predicted    % Correctness
Straight in x-axis    -1 to 1         45                                             5                      11%
Right in x-axis       Less than -1    100                                            54                     54%
Up-ward in y-axis     -1 to 1         145                                            97                     67%
6.1.6 Application's Appraisal
The use of local features is quite helpful for tracking a moving object: the feature extractor (SIFT) finds features in two consecutive frames, and these points are then matched to find the transformation in pixels. Problems arise when the frames share only a small percentage of overlapping region, as happened when the car moved uphill; features present in one frame often do not exist in the next, which results in less accurate localization. This was not the case for the aerial vehicle, as can be seen in video # 4 (Fig. 36), because the camera on the aircraft has a broader view and the overlap between frames is much larger than on a road. It can therefore be concluded that, to make the system more reliable, only the features which occur in the overlapping regions should be considered for direction estimation.
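One way of realizing this suggestion, sketched under the assumption that the frame-to-frame homography estimated by RANSAC is available, is to keep only the matches whose points project into the area shared by both frames; the function name and arguments are illustrative.

```python
import cv2
import numpy as np

def overlap_mask(pts1, H, width, height):
    """Boolean mask over the frame-1 points whose projection by the frame-to-frame
    homography H still lands inside frame 2, i.e. points in the overlapping region."""
    src = np.float64(pts1).reshape(-1, 1, 2)
    proj = cv2.perspectiveTransform(src, H).reshape(-1, 2)
    return ((proj[:, 0] >= 0) & (proj[:, 0] < width) &
            (proj[:, 1] >= 0) & (proj[:, 1] < height))
```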
7 Chapter: Conclusion and Future Work
In this project, the performance of two state-of-the-art feature extraction algorithms has been evaluated, along with the two most popular algorithms for identifying clusters of features with similar transformations. The evaluation shows that SIFT combined with RANSAC is the best combination for finding and matching local features in images with varying imaging conditions and affine transformations. At the same time, SURF is not much worse than SIFT and can be used for applications where time efficiency is essential and accuracy comes second.
The proposed solution of using local features for direction estimation during motion tracking proved successful. In situations where the moving object carrying the camera changes its direction independently of the surroundings, the system captures the direction very easily, for example a car making a U-turn or a subject turning left or right in a corridor; the results show that the prediction is then more than 90% accurate. However, if the change in the object's direction occurs together with a change in the direction of the surroundings, such as a car travelling along a road with gentle bends, the system receives unclear information. This happens because the frames captured during this kind of motion have a small overlapping region, so the features in the overlapping region are outnumbered by new features; by taking the average drift of all feature positions, the system then predicts the direction incorrectly, and the accuracy falls to the range of 30 to 50%.
Future work will focus on increasing the system's reliability, for which it is suggested that only the features in the overlapping regions should contribute to the direction estimation of the moving object. Further, by recording additional information, the system could be upgraded to calculate the length, depth, steepness and duration of the object's movement.
The system so developed may suggest solutions to various vision problems. One option is to develop a control system for aerial vehicles that could be used for aerial surveillance in high-risk areas. Similarly, an automotive warning system could be developed to help drivers in adverse weather and poor visibility, and there is also scope for developing intelligent devices which recognize the environment for blind people.
References
Ballard, D., & Brown, C. (1982). Computer Vision (Chapter 8). Prentice-Hall.
Bauer, J., Sunderhauf, N., & Protzel, P. (2007). Comparing Several Implementations of Two Recently Published Feature Detectors. Proc. of the International Conference on Intelligent and Autonomous Systems, IAV. Toulouse, France.
Bay, H., Ess, A., Tuytelaars, T., & Luc, G. V. (2006). Speeded-Up Robust Features (SURF). Computer Vision and Image Understanding, Vol. 110, No. 3.
Fischler, M. A., & Bolles, R. C. (1981). Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography. Communications of the ACM.
Brown, M., Szeliski, R., & Winder, S. (2006). Multi-Image Matching using Multi-Scale Oriented Patches. IEEE Conference on Computer Vision.
Canny, J. (1986). A Computational Approach to Edge Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8, No. 6.
Clark, A., & Clark, C. (1999). Performance Characterization in Computer Vision: A Tutorial.
Danny, C., Shane, X., & Enrico, H. Comparison of Local Descriptors for Image Registration of Geometrically-complex 3D Scenes.
Harris, C., & Stephens, M. (1988). A Combined Corner and Edge Detector. Proceedings of the 4th Alvey Vision Conference.
Lindeberg, T. (1998). Feature Detection with Automatic Scale Selection. IJCV, 30(2): 79-116.
Lowe, D. G. (2004). Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision.
Mikolajczyk, K., & Schmid, C. (October 2005). A Performance Evaluation of Local Descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 27, No. 10.
Moravec, H. P. (1977). Towards Automatic Visual Obstacle Avoidance. 5th International Conference on Artificial Intelligence.
Morse, B. S. (1998-2000). Lecture Notes: Edge Detection and Gaussian Related Mathematics. Edinburgh.
Neubeck, A., & Gool, V. (2006). Efficient Non-Maximum Suppression. ICPR, pp. 2161-2168.
Oliva, A., Antonio, T., & Monica, S. C. (Sept. 2003). Top-Down Control of Visual Attention in Object Detection. IEEE Proceedings of the International Conference on Image Processing.
Oxford, U. (n.d.). Affine Covariant Features.
Richard, H., & Andrew, Z. Multiple View Geometry in Computer Vision.
Sim, R., & Dudek, G. (September 1999). Learning and Evaluating Visual Features for Pose Estimation. Proceedings of the Seventh International Conference on Computer Vision (ICCV'99). Kerkyra, Greece.
Smith, S. M., & Brady, J. M. (1997). SUSAN: A New Approach to Low Level Image Processing. International Journal of Computer Vision, 23, No. 1.
Stephen, S., Lowe, D., & Little, J. (October 2002). Global Localization using Distinctive Visual Features. Proceedings of the 2002 IEEE/RSJ Conference on Intelligent Robots and Systems. Lausanne, Switzerland.
Trajkovic, M., & Hedley, M. (1998). Fast Corner Detection. Image and Vision Computing, 16, No. 2.
Valgren, C., & Lilienthal, A. SIFT, SURF and Seasons: Long-term Outdoor Localization Using Local Features.
Viola, P., & Jones, M. (2001). Rapid Object Detection using a Boosted Cascade of Simple Features. IEEE Computer Vision and Pattern Recognition, 1:511-518.
Vittorio, F., Tuytelaars, T., & Gool, L. V. (July 2006). Object Detection by Contour Segment Networks. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer.
Wang, L., Jianbo, S., Gang, S., & I-fan, S. (2007). Object Detection Combining Recognition and Segmentation. Computer Vision, ACCV.
Appendix A: Graphs Comparing RANSAC and Hough Transform
ROC curves (x-axis: false positive rate) for every image pair in each dataset, matching SIFT features and SURF features with RANSAC and Hough Transform:
Bark Dataset (matching SIFT features / matching SURF features)
Bike Dataset (matching SIFT features / matching SURF features)
Boat Dataset (matching SIFT features / matching SURF features)
Leuven Dataset (matching SIFT features / matching SURF features)
Graffiti Dataset (matching SIFT features / matching SURF features)
UBC Dataset (matching SIFT features / matching SURF features)
Wall Dataset (matching SIFT features / matching SURF features)