BINARY IMAGE FEATURES DESIGNED TOWARDS VISION-BASED LOCALIZATION AND ENVIRONMENT MAPPING FROM MICRO AERIAL VEHICLE (MAV) CAPTURED IMAGES

by

JACO CRONJE

A dissertation submitted in partial fulfilment of the requirements for the degree of MAGISTER PHILOSOPHIAE in ELECTRICAL AND ELECTRONIC ENGINEERING SCIENCE in the Faculty of Engineering and the Built Environment at the UNIVERSITY OF JOHANNESBURG

SUPERVISOR: PROF. A.L. NEL

FEBRUARY 2012

DECLARATION

• I, J. Cronje, hereby declare that this dissertation is wholly my own work and has not been submitted anywhere else for academic credit either by myself or by another person.

• I understand what plagiarism implies and declare that this dissertation consists of my own ideas, words, phrases, arguments, graphics, figures, results and organization except where reference is explicitly made to another's work.

• I understand further that any unethical academic behaviour, which includes plagiarism, is seen in a serious light by the University of Johannesburg and is punishable by disciplinary action.

Signed………………......

Date…………………

ABSTRACT

This work proposes a fast local image feature detector and descriptor that is implementable on a GPU. The BFROST feature detector is the first published GPU implementation of the popular FAST detector. A simple but novel method of feature orientation estimation which can be calculated in constant time is proposed. The robustness and reliability of the orientation estimation is validated against rotation invariant descriptors such as SIFT and SURF. Furthermore, the BFROST feature descriptor is robust to noise, scalable, rotation invariant, fast to compute in parallel and maintains low memory usage. It is demonstrated that BFROST is usable in real-time applications such as vision-based localization and mapping of images captured from micro aerial platforms.

TABLE OF CONTENTS

ABSTRACT
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES

CHAPTER
1 Introduction
   1.1 Background information
   1.2 Problem statement
   1.3 Outline
2 Literature Review
   2.1 Related work
   2.2 SIFT
   2.3 SURF
   2.4 FAST
   2.5 FAST-ER
   2.6 Binary feature descriptors
      2.6.1 BRIEF
      2.6.2 BRISK
   2.7 Integral Image
      2.7.1 Serial implementations
      2.7.2 Parallel implementations
   2.8 Discussion
3 The BFROST feature extraction method
   3.1 Introduction
   3.2 Feature detection
      3.2.1 Location detection
      3.2.2 Rotation estimation
      3.2.3 Scale estimation
   3.3 Feature description
      3.3.1 Sampling points
      3.3.2 Descriptor construction
   3.4 Feature matching
   3.5 Discussion
4 Results
   4.1 Testing environment
   4.2 Feature detector
      4.2.1 Rotation invariance
      4.2.2 Noise sensitivity
      4.2.3 Illumination invariance
      4.2.4 Statistics
   4.3 Feature descriptor
      4.3.1 Rotation invariance
      4.3.2 Noise sensitivity
      4.3.3 Illumination invariance
      4.3.4 Statistics
   4.4 Discussion
5 Applications
   5.1 Optical flow
   5.2 Video stabilization
   5.3 Camera pose estimation
   5.4 Object recognition
   5.5 Object tracking
   5.6 Image indexing and retrieval
   5.7 Discussion
6 Conclusion
   6.1 Contributions
   6.2 Limitations
   6.3 Future research

LIST OF REFERENCES

APPENDIX
A List of terminologies
B List of abbreviations

LIST OF TABLES

Table 1. Detection comparisons between BFROST and other feature extractors.
Table 2. Rotation estimation comparisons between BFROST and other feature extractors.
Table 3. Scale detection comparisons between BFROST and other feature extractors.
Table 4. Descriptor comparisons between BFROST and other feature extractors.
Table 5. Memory utilization comparisons between BFROST and other feature extractors.
Table 6. Average execution time for detecting 10000 features.
Table 7. Average execution time for describing 10000 features.

LIST OF FIGURES

Figure 1. Micro Aerial Vehicle with a HD GoPro camera attached.
Figure 2. Aerial image taken from an ArduCopter.
Figure 3. Aerial images taken from different viewpoints. Matching landmarks are highlighted.
Figure 4. Vision based feature matching system flow diagram.
Figure 5. The 9 point segment test corner detection, taken from [1]. The highlighted squares are the 16 pixels under inspection. The pixel at p is the possible keypoint. The dashed line passes through 9 continuous pixels which are all brighter than p by more than t.
Figure 6. FAST classifier construction.
Figure 7. The 48 offset positions used by the FAST-ER detector.
Figure 8. The BRIEF description process.
Figure 9. The BRISK detection process.
Figure 10. The BRISK description process.
Figure 11. Calculating the sum over area D with only four look-ups.
Figure 12. BFROST feature detection process.
Figure 13. BFROST detection with different threshold values.
Figure 14. Examples of the simple BFROST rotation estimation technique.
Figure 15. BFROST description process.
Figure 16. 64 Sampling point locations with their associated square regions.
Figure 17. Complete testing pattern, generating a 256 bit descriptor. The image on the right illustrates a zoomed-in version of the pattern.
Figure 18. Sampling point comparisons.
Figure 19. Images used for testing.
Figure 20. Repeatability results for the detector rotation invariance tests.
Figure 21. Repeatability results for the detector under additive Gaussian noise.
Figure 22. Repeatability results for the detector under brightness variations.
Figure 23. Repeatability results for the detector under contrast variations.
Figure 24. The number of features detected by the different feature detectors.
Figure 25. Normalized execution time for detecting features. Timings were normalized by the BFROST timing.
Figure 26. The number of features detected with various BFROST detection thresholds.
Figure 27. Matching scores for the descriptor rotation invariance tests.
Figure 28. Matching scores for the descriptor under additive Gaussian noise.
Figure 29. Matching scores for the descriptor under brightness variations.
Figure 30. Matching scores for the descriptor under contrast variations.
Figure 31. Normalized execution time for describing features. The speed of BFROST is shown as 1 and the other descriptors were normalized by the BFROST timings.
Figure 32. Feature based optical flow.
Figure 33. Feature based video stabilization.
Figure 34. Video stabilization values over time.
Figure 35. Zoomed-in stabilization values over time.
Figure 36. Upside down box successfully recognized.
Figure 37. Book cover successfully recognized under rotation transformation.
Figure 38. A building tracked across multiple viewpoints.
Figure 39. A vessel being tracked.
Figure 40. Feature based image indexing and retrieval.
Figure 41. Real-time image indexing and retrieval with BFROST.

CHAPTER 1
Introduction

1.1 Background information

Micro Aerial Vehicles (MAVs) have become a popular platform for performing low altitude remote sensing. Most MAVs have the capability of carrying some payload. Although the load carrying capacity is limited, carrying multiple cameras and various sensors such as a Global Positioning System (GPS), an Inertial Measurement Unit (IMU) and an altimeter is quite possible. The versatility and low cost of these MAVs make them feasible for use in a wide range of domains such as surveillance applications [2], inspection of power-lines [3], search and rescue operations [4], tracking of wildlife [5] and situation awareness for border control. Video footage captured on-board the MAV can be transferred in real-time to a ground station for online processing, or recorded for offline processing at a later stage. The use of lightweight, low cost cameras makes it feasible to explore image processing algorithms that process images captured on-board a MAV.

Figure 1 shows an ArduCopter [6] with a HD GoPro camera [7] attached as payload. The ArduCopter is a low-cost open source hardware and software platform. The platform consists of an Arduino-based autopilot for multirotor craft, from quadcopters to helicopters. The GoPro camera supports recording of high-resolution video at high frame-rates. The camera and casing weigh in at just below 150 g, which makes the GoPro a feasible payload and an attractive solution for remote-sensing applications.

Figure 1. Micro Aerial Vehicle with a HD GoPro camera attached.


Figure 2. Aerial image taken from an ArduCopter.

The location of a MAV needs to be known for the images captured on-board to be useful. GPS data is not always available, and sensor data might be insufficient or not accurate enough to determine the exact location and orientation of the platform. Furthermore, to localize the MAV purely based on vision, a map of the surrounding environment is required. Once a vision-based environment map has been created, localizing the MAV inside the map becomes a simpler task. The task of vision-based localization and mapping is known as Visual SLAM [8]. Figure 2 shows a typical image captured with a GoPro camera attached to an ArduCopter.

Landmarks need to be recognized in order to determine the 3D location of the MAV by using only images captured from an attached camera. Figure 3 shows two images taken from different viewpoints. Typical landmarks that can be identified in both images by a human observer are highlighted with ellipses. The arrows between the ellipses indicate the matching landmarks. These matching landmarks can be used to estimate the pose of the camera relative to a previous frame. The geo-referenced camera pose can be estimated by taking into account the known GPS coordinates of static landmarks. The accuracy of these estimates depends heavily on the accuracy of the estimated locations of the detected landmarks. It is therefore very important to be able to detect a landmark accurately and repeatably across multiple frames.

In the image processing and computer vision community, landmarks are sometimes called image features. A local image feature is an area within an image that consists of:

• A 2-dimensional location (x, y) in the image.

• A region surrounding the location. This can be described by a width and a height for rectangular regions and by a radius for circular regions.


• Some form of n-dimensional description that can be used to uniquely define the feature. The description should be adequate to separate different landmarks within an n-dimensional feature space.

Figure 3. Aerial images taken from different viewpoints. Matching landmarks are highlighted.

1.2 Problem statement

The problem is that the location of a MAV needs to be known for the images captured on-board to be useful. In order to determine the location of the MAV, local image features need to be detected in each captured frame. These features need to be described in such a way that they can be matched between successive frames and matched against features of known landmarks. Furthermore, the accuracy and repeatability of the feature detection are important for the localization estimation. It is difficult to determine the location of the MAV robustly if the features are not detected across multiple frames with good repeatability. Inaccurate feature location estimates introduce errors into the system and therefore lead to inaccurate MAV localization.

The robustness of the feature description is also important. Features need to be described unambiguously under different conditions. These problematic conditions include features observed under different illumination conditions, rotation, scale, translation and viewpoint transformations. All of these conditions can occur simultaneously, which is challenging for the feature detector and descriptor. The accuracy of the system will degrade if the features cannot be described robustly under the mentioned conditions.

A further problem is that, in order to have a real-time system and to determine the relative camera pose between frames, features need to be matched both efficiently and quickly. Therefore, a system is required that can detect the features, describe them


and then match a set of features against another set as fast as possible. The process can become computationally expensive when matching thousands of features against thousands of features across multiple frames. The matching process needs to execute rapidly enough that multiple image matches can be performed during each frame to increase the localization accuracy and keep the system running in real-time. A real-time system requires a frame rate of at least 15 frames per second to prevent noticeable sluggishness for the user and to maintain fresh feedback for an autopilot navigation system.

The MAV might revisit approximately the same location several times during a flight. In order to close the localization loop and reduce drift errors, previously visited locations need to be recognized as well. This process entails matching the currently detected features with features detected at known or previously visited key locations. To keep the system running in real-time, the problem of indexing these key locations arises. The system needs to define a distance metric that can be computed efficiently in order to determine the likelihood that the current frame will match a previously known key frame successfully. Such a metric allows the system to efficiently sort the database of key locations according to their similarity to the current frame. A full feature matching process can then be applied to the most relevant key frames, instead of matching against all the key frames in the database.

A high-level diagram of the desired system is shown in Figure 4. The diagram shows the flow of execution for each captured image. The captured image is fed into the feature detection process, after which the features are described. A list of the most relevant key frames from the known locations is retrieved. For each of the key frames a complete feature matching process is performed, after which a camera pose estimation algorithm can be used to estimate the pose based on the matched features. The database can then be updated with the adjusted feature locations and possible new key frames. The output of the whole system is an accumulated absolute world camera pose that can further be used by a navigation, visualization or mapping system.

This work focuses mainly on the feature detection and description part of the problem. The main contribution of the work is a new feature detector and descriptor that can be implemented on a GPU. The proposed feature extractor is called Binary Features from Robust Orientated Segment Tests, or BFROST for short. Comparisons with related feature extractors are given. The work also proposes a method for fast feature matching and brings feature extraction applications into perspective by providing insight into various practical feature based approaches. These applications range from optical flow and video stabilization to object recognition and image indexing and retrieval.


Figure 4. Vision based feature matching system flow diagram.

1.3 Outline

An introduction to the work and background information were given in Chapter 1. The problem statement and the part of the problem that this work focuses on have been specified. Furthermore, a brief overview of the work to follow in the document is provided in this section.

A literature review on related local image feature detectors and descriptors is given in Chapter 2. Although a wide variety of feature extractors exist, the literature review focuses only on the closely related work. The proposed feature extractor is placed into context with the latest available extractors in the literature. Integral image computation techniques that are required by the work are also discussed; both serial and parallel implementations are reviewed.

Chapter 3 presents the proposed feature detector and descriptor for a vision-based localization system. The feature detection method as well as the location, rotation and scale estimation methods are presented. The efficient feature descriptor is explained with additional illustrations to aid the explanation. Finally, a method for fast feature matching is presented. Throughout Chapter 3 the focus is on a

GPU parallel implementation.

Comparisons between related feature extractors in the literature and the proposed extractor are presented in Chapter 4. The rotation invariance is compared and tested in order to prove the feasibility of the proposed rotation estimation technique. Furthermore, the robustness against image noise and illumination variations is tested. Finally, the execution speed is tested in order to validate whether the proposed extractor is feasible for real-time applications.

Chapter 5 provides some practical applications that can be implemented by using the proposed feature extractor. Each application is presented with a brief summary and some illustrations to explain its practical implications. From the applications shown in Chapter 5, it can be seen that feature extraction is an important and useful process in image processing and computer vision applications.

Finally, Chapter 6 concludes the work. The main contributions of the work are highlighted. Limitations of the method are discussed and possible future research opportunities are presented.


CHAPTER 2
Literature Review

2.1 Related work

Image feature extractors are a widely studied computer vision subject. A wide variety of feature detectors and descriptors exist. This is due mainly to their wide use in different applications and the continuous effort to improve their accuracy and performance. The OpenCV library [9] is a very popular open source computer vision library that has been ported to a large variety of platforms. The fact that it is open source allows the library to be used by many research and academic institutions as well as in commercial products. The library contains implementations of various feature detectors and descriptors, which makes it suitable for comparison studies without the need to re-implement popular feature extractors.

Only the most relevant feature extractors are reviewed in this chapter. The highly popular SIFT [10] and SURF [11] feature extractors are discussed very briefly. SIFT and SURF describe a feature point with an n-dimensional floating point vector, whereas BFROST describes a feature point with n bits. Therefore, the literature review focuses more on the binary feature descriptors, because they are more directly related to BFROST than SIFT and SURF. The results in Chapter 4 will however include comparisons with SIFT and SURF to make the results understandable in a broader sense.

Sections 2.2 to 2.5 review the feature extractors. Thereafter, Section 2.6 reviews the latest binary feature descriptors that were recently published. The proposed feature extractor discussed in Chapter 3 relies on the computation of an integral image. Different integral image computation implementations are reviewed in Section 2.7. Finally, a summary of the literature review is provided in Section 2.8.

2.2 SIFT

SIFT [10] is one of the most referenced feature extraction methods. Each keypoint is described by 128 floating point values, which amounts to 512 bytes. Thousands of keypoints are detected and described across multiple scales. SIFT provides a method for detecting and describing local image features.


The process of detecting and describing the keypoints can be described briefly as follows:

• Firstly, the detector searches for stable features across multiple scales. For each octave, the input image is repeatedly smoothed with a Gaussian convolution function. Each Gaussian image is subtracted from the next Gaussian image in the series to create a set of Difference-of-Gaussian (DoG) images. The resolution of the image is divided by two for each consecutive octave.

• The local extrema detection process involves a search over the DoG images for a local minimum or maximum value. A non-maximal and non-minimal suppression algorithm scans through the DoG images. The test involves comparing the values of the eight neighbors on the same scale and the nine neighbors on the scale above and below the current scale.

• The location and scale are refined by interpolating between the neighboring values of the detected extrema keypoints. Low contrast features and features along edges are rejected by using a threshold value.

• To achieve rotation invariance, an orientation is assigned to each keypoint. An orientation histogram is constructed by sampling gradient orientations around the keypoint. The histogram contains 36 bins; each bin accumulates the sum of the magnitudes of the gradient orientations that fall within its 10 degree range. The highest peak in the histogram is used as the dominant orientation. If another peak occurs within 80% of the highest peak, an additional keypoint is created with that peak as its orientation. Therefore, multiple keypoints with the same location and scale but different orientations can be detected (a sketch of this step is given after this list).

• Next, the keypoints are described. Image gradients with their magnitudes are computed around each detected keypoint according to the detected scale. Gradients are rotated according to the detected orientation of the keypoint.

• The region around the keypoint is divided into 4 by 4 sample areas. An orientation histogram is calculated for each of the sampling areas. Each histogram contains 8 bins. A Gaussian weighting is applied to the magnitudes before they are accumulated into the histogram. The values of all the histograms are placed into a feature vector.

• The feature vector is normalized to reduce the effect of image illumination changes. To reduce saturation effects, each element of the vector is clamped against a threshold and the vector is normalized again.

• The resulting normalized vector forms the 128 floating point feature descriptor.
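To make the orientation assignment step concrete, the sketch below is a minimal C++ illustration rather than the original SIFT code: assuming a simple grayscale Image wrapper (an assumption of this example), it accumulates magnitude-weighted gradient orientations into a 36-bin histogram of 10 degrees each and returns the centre of the dominant bin. Gaussian weighting, peak interpolation and the creation of secondary keypoints are omitted.

```cpp
// Minimal sketch of SIFT-style orientation assignment (36 bins of 10 degrees).
// The Image type and its at() accessor are assumptions for this illustration.
#include <algorithm>
#include <cmath>
#include <vector>

struct Image {
    int width = 0, height = 0;
    std::vector<float> data;                       // row-major grayscale pixels
    float at(int x, int y) const { return data[y * width + x]; }
};

float dominantOrientation(const Image& img, int cx, int cy, int radius) {
    const float kPi = 3.14159265358979f;
    std::vector<float> hist(36, 0.0f);
    for (int y = cy - radius; y <= cy + radius; ++y) {
        for (int x = cx - radius; x <= cx + radius; ++x) {
            if (x < 1 || y < 1 || x >= img.width - 1 || y >= img.height - 1) continue;
            float dx = img.at(x + 1, y) - img.at(x - 1, y);     // horizontal gradient
            float dy = img.at(x, y + 1) - img.at(x, y - 1);     // vertical gradient
            float mag = std::sqrt(dx * dx + dy * dy);
            float ang = std::atan2(dy, dx) + kPi;               // shift to [0, 2*pi]
            int bin = std::min(35, static_cast<int>(ang / (2.0f * kPi) * 36.0f));
            hist[bin] += mag;                                   // magnitude-weighted vote
        }
    }
    int peak = static_cast<int>(std::max_element(hist.begin(), hist.end()) - hist.begin());
    return (peak + 0.5f) * (2.0f * kPi / 36.0f);                // bin centre, in radians
}
```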


Various modifications of SIFT are available in the literature. A reduced version of SIFT has been proposed in [12], where the complexity of the rotation invariance has been removed to cater for indoor localization. The reduced version removes all the processes involved with the feature orientation. This involves calculating the keypoint orientation, generating additional keypoints that contain multiple dominant orientations, and aligning the descriptor with the orientation.

Another variation called PCA-SIFT [13] applies Principal Component Analysis (PCA) on the normalized gradient patch instead of using the original SIFT weighted histogram. The detection phase of PCA-SIFT is the same as with SIFT, but the description phase differs. PCA reduces the dimensionality of the feature vector, which reduces the 128 element SIFT vector to a 20 element PCA-SIFT vector, saving a significant amount of memory. However, the accuracy of PCA-SIFT depends heavily on the accuracy of the location, scale and orientation estimations. Small errors in these estimations result in a decrease in the accuracy of the PCA-SIFT descriptor.

The fast approximated SIFT [14] method attempts to speed up the computation process of the original SIFT by using integral images. The DoG process is replaced with a Difference-of-Mean (DoM) process. The DoM images can be calculated efficiently by using an integral image. The description process is approximated by replacing the orientation histogram with the computation of an integral histogram [15]. The downside of these approximations is that the integral image can only be used for rectangular regions. The integral image contains the sum of rows and columns and can therefore not be used to sample non-rectangular regions accurately. The rectangular regions are not as accurate and robust as the original SIFT circular and exact structures. The approximation is reported to be about 8 times faster than the original SIFT implementation.

There is an active effort to implement SIFT on the GPU as accurately as possible and with the original robustness. The SiftGPU project [16] seems to be the most active and successful at the moment. The project involves a combination of CPU and GPU processing. The generation of DoG pyramids is performed on the GPU, as well as the orientation and description process. The CPU manages the keypoint selection and GPU tasks. The implementation uses CUDA [17] and requires at least an NVidia 8800 GTX GPU. The project has been successfully used in visual structure-from-motion projects.

SIFT is robust to almost all common image transformations. The downside of SIFT is that it is computationally expensive and the descriptor uses a lot of memory.


2.3 SURF

SURF [11] was inspired by SIFT [10], with the main goal of improving the execution speed of the detector and descriptor. SURF depends mainly on an integral image to approximate computations and speed up the execution time. The main points of the SURF implementation can be briefly listed as follows:

• The detector relies on the determinant of the Hessian matrix. The Hessian matrix is approximated by sampling rectangular regions that approximate the Gaussian derivatives. The rectangular regions are quickly sampled with the use of an integral image.

• The local extrema of the approximate determinant of the Hessian matrix are located across different scales. The same integral image can be used to process each scale, therefore there is no need to create an image pyramid for scale detection.

• Haar wavelets are used to calculate the orientation of the sampling points around the keypoint. These Haar wavelets can be calculated by using the same integral image. The dominant keypoint orientation is detected by examining the magnitude of the orientations within a sliding arc window. The arc direction with the highest resulting magnitude is chosen as the dominant orientation.

• The region surrounding the keypoint is divided into 4 by 4 sub-regions. Haar wavelets for 25 regularly distributed sampling points in each sub-region are calculated. The horizontal sum, horizontal absolute sum, vertical sum and vertical absolute sum of the wavelet responses are used for the descriptor of each sub-region.

• The resulting descriptor contains 64 floating point values.

In some applications rotation invariance is not that important. U-SURF [11] is a modified version that assumes an upright orientation. The rotation estimation process is completely removed and the detector sub-regions do not need to be rotated. Therefore, U-SURF provides an even faster implementation if little or no rotation invariance is required.

An open source SURF implementation [18] is available. The OpenCV library [9] also contains an implementation of SURF, and the recent addition of GPU implementations to OpenCV includes a GPU version of SURF. MIT has a partial GPU implementation of SURF, namely GpuSURF, which is based on the algorithm described in [19].


SURF has been shown to be less robust than SIFT, but executes much faster and is more usable for real-time applications.

2.4 FAST

The FAST feature detector [20] [1] detects keypoints by inspecting the pixel intensities of sixteen pixels on a circle surrounding the possible keypoint p. Let I(p) be the pixel intensity of the pixel at location p. A positive classification occurs when there exists a set of n continuous pixels on the circle which are all brighter than the intensity I(p) + t, or all darker than the intensity I(p) − t, where t is the detection threshold value (illustrated in Figure 5). The radius of the circle surrounding the keypoint p is set to 3, as seen in Figure 5. Increasing the value of t decreases the number of detected features, while decreasing t increases the number of detected features.

Figure 5. The 9 point segment test corner detection, taken from [1]. The highlighted squares are the 16 pixels under inspection. The pixel at p is the possible keypoint. The dashed line passes through 9 continuous pixels which are all brighter than p by more than t.

A number of possible values exist for n. The authors of FAST found that the feature detector obtains the best repeatability when n = 9. However, with a setting of n = 12, it is possible to perform high-speed tests to quickly reject keypoints that do not classify as corners. For example, consider the pixel intensities at locations 1, 5, 9 and 13. If at least three of them pass the intensity test, the keypoint can still pass classification; otherwise it is not possible to form a continuous segment of at least 12 passing tests and the location can be rejected immediately.
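Before looking at the learned classifier, the segment test itself can be sketched directly. The fragment below is a brute-force C++ illustration, not the optimized FAST implementation: the Image wrapper and the radius-3 circle offsets are assumptions of the example, the caller is expected to stay at least 3 pixels away from the image border, and the contiguity check wraps around the circle.

```cpp
// Brute-force 16-pixel segment test (a sketch, not the optimized FAST classifier).
#include <vector>

struct Image {
    int width = 0, height = 0;
    std::vector<unsigned char> data;               // row-major grayscale pixels
    int at(int x, int y) const { return data[y * width + x]; }
};

// The Bresenham circle of radius 3 used by FAST, clockwise from 12 o'clock.
static const int kCircle[16][2] = {
    { 0,-3},{ 1,-3},{ 2,-2},{ 3,-1},{ 3, 0},{ 3, 1},{ 2, 2},{ 1, 3},
    { 0, 3},{-1, 3},{-2, 2},{-3, 1},{-3, 0},{-3,-1},{-2,-2},{-1,-3}};

// Returns true when n continuous circle pixels are all brighter than I(p)+t
// or all darker than I(p)-t. (x, y) must be at least 3 pixels from the border.
bool segmentTest(const Image& img, int x, int y, int t, int n) {
    int p = img.at(x, y);
    int state[16];                                 // +1 brighter, -1 darker, 0 neither
    for (int i = 0; i < 16; ++i) {
        int c = img.at(x + kCircle[i][0], y + kCircle[i][1]);
        state[i] = (c > p + t) ? 1 : (c < p - t) ? -1 : 0;
    }
    for (int s = -1; s <= 1; s += 2) {             // check brighter and darker arcs
        int run = 0;
        for (int i = 0; i < 32; ++i) {             // scan twice to handle wrap-around
            run = (state[i % 16] == s) ? run + 1 : 0;
            if (run >= n) return true;
        }
    }
    return false;
}
```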


Machine learning is applied to construct a decision tree classifier that can detect a feature at high speed. The pre-process of constructing the classifier is shown in Figure 6 and works as follows:

• A set of training images needs to be collected. They should preferably be from the target application domain.

• The threshold value t is selected, as well as the number of continuous pixels n required for classification.

• For each training image, all features are detected using a slow, full segment test that examines all 16 pixels on the circle. The result for all pixels in the training set is now available. Note that the status of all the pixels that do not classify as a feature is also recorded and forms part of the training data.

• From the training data, a decision tree is constructed by applying the ID3 method [21].

• The decision tree is converted into compilable code. The code consists of a set of nested if-else statements that simulates the decision tree. OpenCV [9] contains C code that was generated from such a decision tree.

• The compiled function performs classification on a location in an image. The code therefore classifies a pixel location as a feature or not.

Figure 6. FAST classifier construction.

The classifier will classify the features from the training data correctly, but might classify more general features that are not contained in the training set incorrectly. This means that it is important to train the decision tree on images from the target application domain. The strength of a detected feature is given by the maximal value of t that still classifies p as a keypoint; a simple binary search is performed to determine this maximum. The feature strength is used to perform non-maximal suppression of the detected features. The suppression process removes the weak features that are adjacent to each other, which ensures that the detection process does not output a cluster of features very close to each other. On average, FAST can quickly reject non-corner locations with only a few if-statements.


A drawback of the FAST feature detector is that it is sensitive to noise. Discrete pixel look-ups are performed on the 16 surrounding pixels on the circle, and these discrete read operations are sensitive to noise. A single noisy pixel on the segment can prevent the segment from being detected. FAST is also rotation invariant: an image patch viewed from a different in-plane rotation should contain the same segments for all rotations. The advantages of FAST are that it is very fast to compute and that it shows very good repeatability in detecting features. The disadvantages are that the detection is sensitive to image noise and that the nested if-else statements cannot be used on GPU hardware.

2.5 FAST-ER

The FAST-ER [22] detector was inspired by the FAST detector that was reviewed in Section 2.4. While the FAST detector relies on a continuous segment that needs to be detected to classify the pixel as a corner, FAST-ER completely removes this notion. FAST-ER relies on machine learning to learn when a pixel needs to be classified as a feature or not. The aim of the detector was to improve the detection speed and repeatability of the FAST detector.

Just like FAST, a decision tree is constructed. Instead of examining the 16 pixels on the circle around the pixel under classification, FAST-ER examines 48 surrounding pixels. The offsets of these pixels are illustrated in Figure 7. The idea is to optimize the decision tree such that the resulting repeatability rate is maximized. A set of training images with different transformations, used to test the repeatability score, needs to be created. A random decision tree is created and optimized through simulated annealing [23]. The decision tree can be evaluated by a cost function specified in [22]. The cost function depends on the calculated repeatability score, the number of nodes in the tree and the number of features detected. Sixteen different symmetries of the offsets shown in Figure 7 are used when evaluating the tree, and their results are combined with a logical OR. The simulated annealing process works as follows (a generic sketch is given at the end of this section):

• Randomly mutate the decision tree.

• Evaluate the mutated decision tree by calculating the score of the cost function.

• Determine the acceptance probability according to the cost and current temperature.


• Accept the mutated decision tree as the current decision tree if the acceptance probability test passes.

• Reduce the temperature and repeat the process until the temperature reaches the lower limit.

Figure 7. The 48 offset positions used by the FAST-ER detector.

FAST-ER showed slightly better repeatability than FAST. The complexity of FAST-ER is higher and the decision tree contains approximately 30,000 nodes. The drawback of having to train on images from the application domain still remains. Sensitivity to noise remains a problem, although some improvement has been shown.
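The annealing loop above can be sketched generically as follows. The tree representation, the mutation and the cost function below are toy placeholders and not the FAST-ER definitions from [22]; the point of the sketch is the accept/reject mechanism, where a worse candidate is accepted with probability exp(-(new cost - old cost)/T).

```cpp
// Generic simulated-annealing skeleton (a sketch: the tree representation,
// mutate() and cost() are toy placeholders, not the FAST-ER definitions).
#include <cmath>
#include <random>
#include <vector>

struct DecisionTree { std::vector<int> nodes; };   // stand-in for a real tree

DecisionTree mutate(DecisionTree tree, std::mt19937& rng) {
    // Toy mutation: flip one node value; tree.nodes must be non-empty.
    std::uniform_int_distribution<int> pick(0, static_cast<int>(tree.nodes.size()) - 1);
    tree.nodes[pick(rng)] ^= 1;
    return tree;
}

double cost(const DecisionTree& tree) {            // toy cost; FAST-ER combines
    double c = 0.0;                                // repeatability, size and count
    for (int v : tree.nodes) c += v;
    return c;
}

DecisionTree anneal(DecisionTree current, double temperature,
                    double cooling, double minTemperature) {
    std::mt19937 rng(12345);
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    double currentCost = cost(current);
    while (temperature > minTemperature) {
        DecisionTree candidate = mutate(current, rng);
        double delta = cost(candidate) - currentCost;
        // Accept improvements always; accept worse candidates with
        // probability exp(-delta / T), which shrinks as T cools down.
        if (delta <= 0.0 || uniform(rng) < std::exp(-delta / temperature)) {
            current = candidate;
            currentCost = currentCost + delta;
        }
        temperature *= cooling;                    // e.g. cooling = 0.999
    }
    return current;
}
```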

2.6 Binary feature descriptors

Recently, binary feature descriptors have started to receive more attention. These descriptors describe features with n binary values instead of n floating point values. Binary descriptors can be compacted and represented with much less memory. A combination of 32 binary values may have the same or more descriptive power than a single floating point value.

It is important to have a metric defined for the distance between two feature vectors. In the case of floating point descriptors, the distance can be calculated by computing the Euclidean distance between the two vectors. The distance between two binary strings can be measured by the Hamming distance. The Hamming distance can be rapidly computed by performing the bit-wise XOR operation between the two strings and then counting the number of set bits within the result. Modern hardware architectures support instructions that count the number of set bits in a word rapidly. CUDA [17] devices with compute capability 2.x map the


__popc() intrinsic to a single hardware instruction that counts the number of set bits in a given word within one clock cycle. Therefore, matching binary descriptors can be computed rapidly.
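As an illustration, the sketch below computes the Hamming distance between two 256-bit descriptors stored as four 64-bit words. The Descriptor256 layout is an assumption made for this example; on the CPU std::bitset supplies the population count, while a CUDA kernel would use the __popc()/__popcll() intrinsics mentioned above inside the same loop.

```cpp
// Hamming distance between two 256-bit binary descriptors: XOR then popcount.
#include <array>
#include <bitset>
#include <cstdint>

struct Descriptor256 {
    std::array<std::uint64_t, 4> words{};          // 4 x 64 bits = 256 bits
};

int hammingDistance(const Descriptor256& a, const Descriptor256& b) {
    int distance = 0;
    for (int i = 0; i < 4; ++i) {
        std::uint64_t diff = a.words[i] ^ b.words[i];                // differing bits
        distance += static_cast<int>(std::bitset<64>(diff).count()); // population count
    }
    return distance;                               // 0 = identical, 256 = all bits differ
}
```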

2.6.1 BRIEF

The BRIEF implementation [24] only focuses on feature description and does not provide a method for feature detection. The method assumes that the keypoint location has already been detected. BRIEF describes each keypoint with a binary string. The length of the string is adjustable, and lengths such as 128, 256 and 512 bits are normally used depending on the application and desired accuracy. The more bits are used, the more robust the descriptor becomes, at the cost of increased memory usage and computation time. A reference implementation is included with the latest OpenCV [9] distribution.

The description process is illustrated with the diagram in Figure 8. The process works as follows:

• The input image as well as the detected keypoint location, scale and rotation need to be given to BRIEF. The scale and rotation are optional; a default scale and an upright rotation are used if none are provided.

• The patch surrounding the keypoint is smoothed with a Gaussian smoothing kernel with σ = 2 and a 9 × 9 kernel window.

• Sampling points relative to the keypoint are determined by using a fixed pattern. The pattern points are predetermined with an isotropic Gaussian distribution, also shown in Figure 8. If a rotation estimation was provided, the whole pattern needs to be rotated around the keypoint. If a scale was provided, the whole pattern needs to be scaled accordingly. Note that a lot of information gets lost when the pattern is scaled up, because comparatively few pixels are sampled.

• The binary descriptor is built by comparing pairwise pixel intensities for each binary description value. The lines shown in the pattern in Figure 8 indicate sampling points that are compared with each other. The bit is set when the first pixel intensity is greater than the second pixel intensity (a sketch of this step is given at the end of this section).

• Finally, all the resulting bits are combined to form the descriptor.


Figure 8. The BRIEF description process.

The BRIEF descriptor is fast to compute, the length of the descriptor is adjustable and matching between descriptors can be performed efficiently. However, image patches need to be smoothed to be more robust against noise, the spatial sampling pattern needs to be rotated for the descriptor to be rotation invariant, and the descriptor does not scale well because of the discrete pixel sampling.
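A minimal sketch of the pairwise-comparison idea is given below. It is not the published BRIEF implementation: the Gaussian pre-smoothing, pattern rotation and scaling are omitted, the Image wrapper is an assumption of the example, and the test pattern is generated randomly here instead of using the fixed published pattern.

```cpp
// BRIEF-style binary description: one bit per pairwise intensity comparison.
// Sketch only: smoothing, rotation and scaling of the pattern are omitted.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

struct Image {
    int width = 0, height = 0;
    std::vector<unsigned char> data;               // row-major grayscale pixels
    int at(int x, int y) const { return data[y * width + x]; }
};

struct TestPair { int x1, y1, x2, y2; };           // offsets relative to the keypoint

// Random isotropic test pattern (the published BRIEF pattern is fixed instead).
std::vector<TestPair> makePattern(int bits, int patchRadius) {
    std::mt19937 rng(7);
    std::normal_distribution<float> gauss(0.0f, patchRadius / 2.0f);
    auto clampOffset = [&](float v) {
        int i = static_cast<int>(v);
        return std::max(-patchRadius, std::min(patchRadius, i));
    };
    std::vector<TestPair> pattern;
    for (int i = 0; i < bits; ++i)
        pattern.push_back({clampOffset(gauss(rng)), clampOffset(gauss(rng)),
                           clampOffset(gauss(rng)), clampOffset(gauss(rng))});
    return pattern;
}

// Builds a descriptor of pattern.size() bits, packed into 64-bit words.
// The keypoint (kx, ky) must lie at least patchRadius pixels from the border.
std::vector<std::uint64_t> describe(const Image& img, int kx, int ky,
                                    const std::vector<TestPair>& pattern) {
    std::vector<std::uint64_t> desc((pattern.size() + 63) / 64, 0);
    for (std::size_t i = 0; i < pattern.size(); ++i) {
        const TestPair& p = pattern[i];
        if (img.at(kx + p.x1, ky + p.y1) > img.at(kx + p.x2, ky + p.y2))
            desc[i / 64] |= std::uint64_t(1) << (i % 64);      // set bit i
    }
    return desc;
}
```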

2.6.2 BRISK

The BRISK implementation [25] describes a method for detecting and describing image features. The feature detection part uses the improved version of FAST [1], namely AGAST [26], to detect keypoints. Furthermore, the feature detection phase searches for features in different scale-spaces, with the aim of determining the correct scale for each keypoint. The feature descriptor of BRISK also generates a binary string similar to BRIEF [24], but with a different sampling strategy. The length of the descriptor is set to 512 bits. BRISK, inspired by BRIEF, also computes pairwise comparisons to build the descriptor for an image patch. Instead of using randomly selected sampling points, the method uses a fixed sampling pattern consisting of 60 sampling points. The BRISK detection process is illustrated with the diagram in Figure 9.

(The reference implementation and an accompanying video demonstration were recently placed on the BRISK authors' website; the link can be found in the BRISK paper [25].)


Figure 9. The BRISK detection process.


The process works as follows:

• Firstly, BRISK creates a scale-space image pyramid from the input image. The pyramid needs to be created to detect features at different scales. n octaves are created with intra-octaves between the octaves; usually n = 4.

• The octaves are created by progressively half-sampling the input image. The first octave is the input image, the next octave is 2 times smaller than the input image, the next one 4 times smaller, and so forth. The down-sampling factors for the octaves therefore form the sequence 2^0, 2^1, 2^2, 2^3, ..., 2^n.

• The input image is down-sampled by a factor of 1.5 to create the first intra-octave. The same progressive half-sampling process generates the following intra-octaves. The down-sampling factors for the intra-octaves therefore form the sequence 1.5 · 2^0, 1.5 · 2^1, 1.5 · 2^2, 1.5 · 2^3, ..., 1.5 · 2^n.

• The AGAST feature detector is applied on all the octaves and intra-octaves with the same threshold value. The segment test for 9 continuous pixels is used.

• A non-maximal suppression algorithm extracts all the features that have a maximal strength among their neighboring features. The neighbor search is performed on the octaves adjacent to the feature under consideration. This ensures that the search is performed across multiple scales.

• Next, a parabola is fitted to the strength and scale of the remaining maximal features. The equation is solved to determine the true scale of each feature.

• The position of the feature is recalculated by using the detected scale.

• The output of the whole detection process is a list of feature positions with their appropriate scale and strength.

The rotation estimation is left for the description phase. The description process is illustrated with the diagram in Figure 10 and works as follows:

• As a pre-process, a look-up table for each rotation and scale is created. According to [25], this look-up table uses 40MB of memory.

• Only the features detected during the detection phase are described.


• A Gaussian smoothing is applied on each image patch surrounding each sampling point. Figure 10 shows the sampling points with blue circles. The size of the Gaussian smoothing is indicated with the dashed red circles. In total, the sampling pattern contains 60 sampling points.

• The local image gradient is calculated between all sampling point pairs that are further than a predetermined distance from each other. The sum of all gradients is used as the rotation estimation.

• The estimated rotation and scale are used to find the final sampling points from the precomputed look-up table.

• The binary descriptor is built by comparing pairwise, smoothed pixel intensities for each binary description value. BRISK defines a short distance sampling point pair as a pair of sampling points that is less than a predetermined distance apart. All short distance pairs are used to generate the pixel intensity tests, and the distance threshold is set to limit the number of short distance pairs to 512 (a sketch of this selection is given at the end of this section). Just like BRIEF, BRISK compares sampling point intensities with each other. Each bit is set when the first pixel intensity is greater than the second pixel intensity.

• Finally, all the resulting bits are combined to form the descriptor.

Figure 10. The BRISK description process.

The BRISK detector provides a method to accurately determine the scale of the detected features. The feature detection process relies on the AGAST detector. However, a scale-space image pyramid needs to be generated.

The feature descriptor uses far fewer sampling points than BRIEF. The descriptor is rotation invariant; however, local gradients need to be calculated. BRISK performs better than BRIEF on scalability, because each sampling point is smoothed according to the scale.
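The selection of short-distance pairs can be sketched as follows. The sampling-point coordinates and the distance threshold are placeholders in this example; BRISK uses its fixed 60-point pattern and chooses the threshold so that 512 short-distance pairs remain.

```cpp
// Sketch: enumerate all sampling-point pairs and keep the "short-distance"
// ones, i.e. pairs closer than a threshold. BRISK builds one descriptor bit
// from each such pair (512 of them); the points and threshold here are
// placeholders, not the published BRISK pattern.
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

struct Point { float x, y; };

std::vector<std::pair<int, int>> shortDistancePairs(const std::vector<Point>& pts,
                                                    float maxDistance) {
    std::vector<std::pair<int, int>> pairs;
    for (std::size_t i = 0; i < pts.size(); ++i) {
        for (std::size_t j = i + 1; j < pts.size(); ++j) {
            float dx = pts[i].x - pts[j].x;
            float dy = pts[i].y - pts[j].y;
            if (std::sqrt(dx * dx + dy * dy) < maxDistance)
                pairs.emplace_back(static_cast<int>(i), static_cast<int>(j));
        }
    }
    return pairs;                                  // each pair yields one descriptor bit
}
```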

2.7 Integral Image

The integral image provides a method to quickly calculate the sum over a rectangular area within an image with only four read operations from the integral image. The integral image is sometimes referred to as a summed area table. The integral image II is structured in such a way that each element at location (x, y) contains the sum of the pixel values to the left of and above the location (x, y). Let I be the input image and II the integral image; then the integral image can be represented by:

    II(x, y) = Σ_{x' ≤ x, y' ≤ y} I(x', y')    (1)

Figure 11 illustrates how the integral image can be used. The sum of the values in the area D within the image is calculated by sampling the integral image at points 1, 2, 3 and 4. The value at 1 contains the sum of A, 2 contains the sum of A + B, 3 contains the sum of A + C and 4 contains the sum of A + B + C + D. In order to retrieve only the sum of D, 1 and 4 can be added together to form A + A + B + C + D. Then 2 can be subtracted to form A + C + D. Finally, 3 can be subtracted to form D.


Figure 11. Calculating the sum over area D with only four look-ups. The integral image has been used in many applications. In [27], it was used to extract responses from rectangular regions to perform face detection. SURF [11] used the integral image to speed-up feature detection and calculate feature orientations and to approximate the Hessian matrix determinant. 20

Box filters can be implemented by using the integral image. Many applications related to graphics are listed in [28], such as real-time environment map creation, translucency effects, simulated depth-of-field and approximating a phong BRDF. Section 2.7.1 will review serial implementations of the integral image while Section 2.7.2 will review parallel implementations in the literature.

2.7.1

Serial implementations

The simplest and most straight forward method to calculate the integral image in serial, is to calculate the values row by row in a sequential order. Each value depends on the previous value. The integral image can then be computed in a single serial pass over an image by: II(x, y) = II(x − 1, y) + II(x, y − 1) − II(x − 1, y − 1) + I(x, y)

(2)

More serial implementations do exist and they are described in [29]. Their performance is not that good, because multiple passes are required to calculate the whole integral image. The single pass algorithm is the fastest serial algorithm available and performs well on a single core CPU. The single pass algorithm can not be ported directly to a parallel algorithm, because each value depends on the previous values, which violates the most basic requirement of parallel computations. The next section will investigate parallel integral image algorithms.

2.7.2

Parallel implementations

The parallel algorithm problem for integral images involves creating multiple threads that simultaneously sum the rows and columns of the input image without interfering or conflicting with each other. In [29], an analysis on serial and parallel implementations for multi-core processors was done. The problem is related to methods of adding up the values of a single row. After each row has been added up, the process can be repeated for each column instead of each row. The summation process is also called prefix sum or scan. 21

The easiest way to compute the prefix sum, is to add the sum of the successor of each element to the element. This creates a dependency between elements which can not run in parallel. The method can be adjusted to group elements together and compute the sum of groups in parallel. The sum of the groups can then be accumulated over multiple passes. The classical parallel prefix-sum solution is described in [30] and the algorithm has been implemented on the GPU with the CUDA architecture in [31]. The method applies the concept of a binary tree. The algorithm consists of two phases. In the first phase called the up-sweep phase, the binary tree is traversed from the leaves to the root and the sum of the internal nodes is propagated. The process is also known as reduction. The second phase is called the down-sweep phase, where the tree is traversed from the root down to the leaves and the totals are propagated again. An integral image algorithm that can run in real-time on a GPU has been presented in [32]. Their algorithm computes the parallel prefix-sum over all the rows. Each row is treated independently. Next, the image or matrix is transposed so that the same prefix-sum kernel can be used for the columns. The prefix-sum is again calculated on the rows which are actually the transposed columns. Finally, the image can be transposed again if the original indexing has to be used, otherwise the transposed index can be used to read values from the integral image.

2.8

Discussion

The most popular feature detectors and descriptors were reviewed. SIFT and the faster but less robust SURF have been reviewed. Both of them are floating point descriptors and they are the most commonly used feature extractors to perform benchmarking comparisons with. The FAST and modified FAST-ER feature detectors have been reviewed, because the BFROST extractor implements an exact segment test detector which is closely related. It is important to understand what inspired the designed feature detector and why FAST can not be implemented on the GPU as it stands originally. BRIEF and BRISK are binary descriptors as opposed to the floating point descriptors. The shortfalls of BRIEF have been identified. The descriptor is sensitive to noise due to the discrete pixel sampling. The descriptor does not scale well, because more information is not sampled as the size of the scale increases. There are also issues with rotation invariance, because the complete sampling pattern needs to

22

be rotated. BRISK has the disadvantage of using large look-up tables. The descriptor is also sensitive to noise and needs a great deal of Gaussian smoothing to be more robust. Local gradients need to be calculated at every sampling point to estimate the feature orientation, which in turn increases the computation time.

The integral image forms an important part of many image processing algorithms. It provides a useful way to speed up rectangular region operations on images. Several different implementations exist in the literature. It has been shown that GPU-friendly implementations can be written and used efficiently. In the next chapter, the information reviewed will be used to synthesize a new feature extractor called BFROST.


CHAPTER 3
The BFROST feature extraction method

3.1 Introduction

The aim of the BFROST feature extractor is to provide a fast implementation for local image feature detection and description that is implementable on a GPU. The GPU can execute a great number of instructions in parallel and is therefore useful for solving many general purpose computational problems, including image processing problems.

The performance and repeatability of the FAST detector inspired the design of the BFROST detector. The BFROST detection process provides a way of implementing the FAST detector on a GPU. FAST has been proven to be a reliable feature detector with good repeatability. The original decision tree of FAST is replaced with a look-up table that resides in the GPU constant memory for fast access. A direct implementation of FAST in CUDA would not be feasible, because parallel execution performance decreases dramatically when different branches are executed within a block of threads on the GPU. The technique presented also extends the FAST detector by providing a simple feature orientation estimation. The simplicity and performance of the BRIEF and BRISK descriptors inspired the implementation of a binary descriptor that can be competitive. BFROST implements a binary descriptor which utilizes little memory, and comparisons between descriptors are calculated rapidly because of the binarization of the feature vector.

The BFROST feature detector is described in Section 3.2. The process of feature location estimation is described in Section 3.2.1, the feature orientation estimation in Section 3.2.2 and the simple scale measurement method in Section 3.2.3. The BFROST binary feature descriptor, which works similarly to BRIEF and BRISK, is described in Section 3.3. An explanation of how binary features can be compared and matched is given in Section 3.4. Finally, the chapter concludes with a discussion in Section 3.5, with results following in Chapter 4.

(A conference paper on BFROST [33] has been published and presented at the 22nd Annual Symposium of the Pattern Recognition Association of South Africa.)


3.2 Feature detection

The aim of the feature detection process is to locate image patches or areas that can be detected repeatedly under various image transformations. The feature detection process involves detecting all the features in the image. The location of each detected feature needs to be determined. The attributes such as the scale, rotation and strength of each feature are calculated. These attributes are used for further processing such as the feature description process. Figure 12 illustrates the high level BFROST feature detection process.

Figure 12. BFROST feature detection process. During application initialization, the look-up tables are created and uploaded. For each detection scale the location of the features is determined. The scale or size of the features is determined to achieve scale invariance. The estimated feature scale is then used to resize the sampling pattern when the feature is described. When viewing an object from a closer distance, the region of interest that is described for the feature needs to be larger. The same logic applies when a feature is viewed from a further distance. The strength of each detected feature is determined. The chances that the feature will be detected under various image transformations are better when the strength of a feature is stronger. Features that are detected adjacent to each other, need to be merged or removed until one dominant feature remains within the pixel neighborhood. The process is


This process is called non-maximal suppression: features whose strength is not maximal within their neighborhood are suppressed. After non-maximal suppression, an orientation estimate is assigned to each feature and used when describing it. Determining the dominant orientation of a feature is important for achieving rotation invariance. The following sections describe the feature location detection, rotation estimation and scale estimation processes.

3.2.1 Location detection

The BFROST detection process performs the exact continuous segment test of the FAST detector. The decision tree of FAST is created with machine learning techniques, which means that it may occasionally classify features incorrectly. The BFROST detector instead relies on a deterministic detection process: the same continuous segment test classification is performed, but with no false classifications, and the process is easy to implement and GPU-friendly. A direct port of the FAST decision tree to CUDA is not feasible, as executing thousands of nested if-else statements on a GPU is impractical, if even compilable. GPUs are very sensitive to branch instructions, especially when different branches execute within the same warp. Every branch taken within a warp is executed by all 32 threads of that warp, and since each thread is likely to follow a different branch of the nested if-else statements, execution would be roughly 32 times slower per thread. To determine the location of a feature, each pixel in the image is tested to determine whether it classifies as a feature. All pixel locations that pass the classification are stored in an array for further processing. Just like FAST, BFROST requires a detection threshold t that determines the detection sensitivity. Consider the bit string B created by comparing the pixel intensity I(p) of pixel p with the sixteen pixels on a circle around the pixel p under classification. B_i denotes the bit value of the ith pixel comparison on the circle, so i is in the range [0..15]. Let C_i be the ith pixel position on the circle and t the detection threshold.


The bit string B can then be constructed with equation 3:

B_i = \begin{cases} 1, & \text{for } I(p) + t < I(C_i) \\ 0, & \text{otherwise} \end{cases} \qquad (3)

The bit string B contains sixteen binary elements and, when converted to a decimal value, falls in the range [0..65535]. To remove the need for the thousands of if-else statements used by FAST, a look-up table is created. Note that the maximum number of possible configurations is only 65536: for each of these 2^16 possible configurations, the algorithm determines whether the configuration classifies p as a corner and stores the binary result in the table. This look-up table of 2^16 binary values is precomputed and stored as a binary string T. The length of T is 2^16 bits = 8192 bytes. T is uploaded to constant memory on the GPU. Constant memory is a special kind of GPU memory that remains constant throughout the execution of kernels; access to it is fast because it is optimized in hardware. NVidia GPUs contain only 64 KB of constant memory, so it is important to compress the look-up table by storing the classifications as bits rather than as one byte per classification. The pre-computation of the look-up table simply involves iterating over all 2^16 possible configurations and determining whether 9 or more continuous set bits exist within B. Figure 13 illustrates the detection process with different detection threshold values. The same image patch is considered in all cases. Pixel intensities range within [0..255], with 0 being black and 255 being white. A green block around a pixel indicates that the pixel passed the test of equation 3; a red block indicates that it failed. The blue block in the middle indicates the pixel p under classification. Consider the 16 pixels on the circle, starting from the top middle pixel at the 12 o'clock position and advancing in clockwise order. Figure 13 shows the bit string B created in each case. The first row shows the result when the detection threshold t = 0: only two tests failed, which resulted in a continuous set of 14 pixels that passed the test. The resulting binary value stored in the look-up table for each pattern is shown in the far right column. B is converted into a base 10 decimal value, which is the index used to access the look-up table. For example, if B = 1101111111110001₂, then the decimal value is 57329₁₀ and the 57329th bit of the look-up table contains the desired classification value.
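As an illustration, the classification table could be precomputed on the host roughly as follows (a minimal sketch; the function names buildClassificationTable and hasContiguousRun are hypothetical). Every 16-bit configuration is tested for a contiguous circular run of at least 9 set bits and the results are packed into 8192 bytes, ready to be copied into constant memory with cudaMemcpyToSymbol.

#include <cstdint>
#include <vector>

// True if the 16-bit circular pattern contains a run of at least 9 contiguous set bits.
static bool hasContiguousRun(uint16_t pattern, int minRun = 9)
{
    int best = 0, run = 0;
    for (int i = 0; i < 32; ++i) {             // walk the circle twice to handle wrap-around
        if (pattern & (1u << (i & 15))) {
            if (++run > best) best = run;
        } else {
            run = 0;
        }
    }
    return best >= minRun;
}

// Builds the packed 2^16-bit (8192 byte) classification table T described above.
std::vector<uint8_t> buildClassificationTable()
{
    std::vector<uint8_t> table(8192, 0);
    for (uint32_t b = 0; b < 65536; ++b) {
        if (hasContiguousRun(static_cast<uint16_t>(b)))
            table[b >> 3] |= static_cast<uint8_t>(1u << (b & 7));   // set bit b of T
    }
    return table;
}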


Figure 13. BFROST detection with different threshold values.


A notable implementation optimization is the memory storage location of the input image on the GPU:
• A CudaArray the size of the image is allocated and the image data is bound to it. The tex2D texture sampling instruction is used on the CudaArray, which is faster than reading from global memory; texture sampling caches the sampled data and increases memory access performance. Global memory is the slowest kind of memory on the GPU and is only partly cached on the latest hardware.
• The texture filtering mode is set to point filtering so that discrete points are sampled and no interpolation is performed. The original FAST detector also performs discrete pixel look-ups. Switching to bi-linear filtering might improve the repeatability on noisy images and needs to be investigated in future work.
• The texture addressing mode is set to clamp any address that falls outside the image; a small border around the image is not processed.

The feature location detection process can be described with the following steps:
• Only at the initial setup stage, the classification look-up table is created and uploaded into constant GPU memory.
• The input image of the current frame is bound to the allocated CudaArray.
• The feature detection CUDA kernel is executed, with one thread assigned to each pixel in the image.
• Each thread loads the intensities of the center pixel p and the sixteen surrounding pixels C_0 to C_15 into local thread memory. Each thread owns a small amount of local thread memory that only it can access; access to local thread memory is faster than shared and global memory.
• Each thread builds the binary string B by performing the tests of equation 3 for all sixteen pixels on the circle of radius 3.
• The binary classification value for B is read from the classification look-up table. If the image patch classifies as a keypoint, the (x, y) location of the pixel p is encoded into a 32 bit value and written into a 1-dimensional buffer containing all the detected keypoints. The lower 16 bits contain the value of x and the higher 16 bits contain the value of y. The atomicInc() CUDA instruction is used to retrieve the index at which the keypoint information should be written, because more than one thread could request to write keypoint information into the buffer at the same time.


• The classification test is performed twice: the binary string B is created once with the I(p) + t < I(C_i) test and once with the I(p) − t > I(C_i) test. The same look-up table T is used for both tests.
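The detection kernel outlined above might look roughly as follows; this is a sketch, not the exact implementation, and the names detectKeypoints, d_classTable, d_circle, outKeypoints and outCount are hypothetical. The texture object is assumed to be bound to an 8-bit image with point filtering and clamp addressing, as described earlier.

#include <cstdint>

__constant__ uint8_t d_classTable[8192];   // packed 2^16-bit classification table T
__constant__ int2    d_circle[16];         // offsets of the 16 pixels on the radius-3 circle

__device__ inline bool isCorner(uint16_t B)
{
    return (d_classTable[B >> 3] >> (B & 7)) & 1;   // read bit B of the packed table
}

__global__ void detectKeypoints(cudaTextureObject_t img, int width, int height,
                                int t, uint32_t *outKeypoints, unsigned int *outCount)
{
    const int x = blockIdx.x * blockDim.x + threadIdx.x;
    const int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < 3 || y < 3 || x >= width - 3 || y >= height - 3) return;   // skip the border

    const int Ip = tex2D<unsigned char>(img, x + 0.5f, y + 0.5f);
    uint16_t brighter = 0, darker = 0;
    for (int i = 0; i < 16; ++i) {             // build the bit string B of equation 3
        const int Ic = tex2D<unsigned char>(img, x + d_circle[i].x + 0.5f,
                                                 y + d_circle[i].y + 0.5f);
        if (Ip + t < Ic) brighter |= (uint16_t)(1u << i);
        if (Ip - t > Ic) darker   |= (uint16_t)(1u << i);
    }
    if (isCorner(brighter) || isCorner(darker)) {
        const unsigned int idx = atomicInc(outCount, 0xFFFFFFFFu);    // reserve a slot
        outKeypoints[idx] = (uint32_t)x | ((uint32_t)y << 16);        // low 16 bits x, high 16 bits y
    }
}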

Non-maximal suppression

The detected features may end up at locations right next to each other. In such cases it is preferable to keep only the strongest feature; the process of non-maximal suppression eliminates the weaker ones. The strength of each feature needs to be known for this to work. The strength of each detected feature is calculated by finding the maximum value of t for which the image patch still classifies as a corner. The classification result is monotonic in t, so a simple binary search over t quickly finds the maximal value. Each thread in the CUDA kernel calculates the feature strength for one keypoint. A keypoint index map is created that maps an image location (x, y) to a keypoint index; the index map ensures that the eight neighboring keypoints of any given image location can be found rapidly by an index look-up. The non-maximal suppression kernel executes one thread for each keypoint. Each thread compares the feature strength of its own keypoint with the feature strengths of the eight neighboring keypoints, using the index map to determine the indices of the adjacent keypoints and whether they exist. All keypoints with a strength greater than all of their neighbors are written into the final detected keypoint buffer. The atomicInc() instruction is used again to avoid memory write conflicts.
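The strength computation can then be sketched as a binary search over the threshold, since the classification result is monotonic in t; classifiesAsCorner is a hypothetical device helper that repeats the segment test of equation 3 at the given threshold.

// Hypothetical helper: repeats the segment test of equation 3 at threshold t.
__device__ bool classifiesAsCorner(cudaTextureObject_t img, int x, int y, int t);

// Feature strength: the largest threshold for which the pixel still classifies as a corner.
__device__ int featureStrength(cudaTextureObject_t img, int x, int y)
{
    int lo = 0, hi = 255;                      // valid threshold range for 8-bit images
    while (lo < hi) {
        const int mid = (lo + hi + 1) >> 1;
        if (classifiesAsCorner(img, x, y, mid))
            lo = mid;                          // still a corner: try larger thresholds
        else
            hi = mid - 1;                      // no longer a corner: try smaller thresholds
    }
    return lo;
}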

3.2.2 Rotation estimation

The original FAST detector does not provide an orientation estimate. This work proposes a very simple method to estimate the rotation of a feature, considering only 16 possible rotation estimates. Consider a feature detected with continuous brighter intensities from C_a to C_b, where the detected segment length is at least 9. The rotation index τ and rotation angle θ can then be estimated with the following equations:


\tau = \begin{cases} \dfrac{a + b}{2}, & \text{for } a < b \\[4pt] \left(\dfrac{a + b + 16}{2}\right) \bmod 16, & \text{for } a \geq b \end{cases} \qquad (4)

\theta = \dfrac{2\pi\tau}{16} \qquad (5)

Figure 14 shows examples of the rotation estimation. The green blocks indicate the detected continuous segment. The estimated orientation is simply the center of the segment. The red line indicates the direction of the orientation estimation.

Figure 14. Examples of the simple BFROST rotation estimation technique.

Firstly, during the initial setup stage, a look-up table that maps the binary string B to the rotation index τ is created. This orientation look-up table is uploaded to the GPU as a 2^16-byte vector. The table is created by iterating over all possible combinations of B and, whenever a combination classifies as a feature, finding the center of the detected segment. The binary string B has already been generated for each detected feature during the location estimation stage, so after non-maximal suppression a single look-up into the orientation table provides the rotation estimate. Although the method is very simple, the accuracy of the rotation estimation works well in practice, as shown in Chapter 4. The method is very fast to compute and requires only one memory look-up per detected feature, whereas other feature extractors such as BRISK, SIFT and SURF use far more sophisticated methods that are much slower to compute.
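A host-side sketch of how such an orientation table could be built is shown below (buildOrientationTable is a hypothetical name). For every pattern that passes the segment test, the start and end of the longest contiguous run are located and equation 4 yields the rotation index τ; non-corner patterns are simply mapped to zero.

#include <cstdint>
#include <vector>

// Maps every 16-bit pattern B to its rotation index in [0..15].
std::vector<uint8_t> buildOrientationTable()
{
    std::vector<uint8_t> table(65536, 0);
    for (uint32_t B = 0; B < 65536; ++B) {
        int bestLen = 0, bestStart = 0, run = 0, start = 0;
        for (int i = 0; i < 32; ++i) {                     // walk the circle twice for wrap-around
            if (B & (1u << (i & 15))) {
                if (run == 0) start = i;
                if (++run > bestLen) { bestLen = run; bestStart = start; }
            } else {
                run = 0;
            }
        }
        if (bestLen >= 9) {                                // pattern classifies as a corner
            const int a = bestStart & 15;                  // first index of the segment
            const int b = (bestStart + bestLen - 1) & 15;  // last index of the segment
            const int tau = (a < b) ? (a + b) / 2
                                    : ((a + b + 16) / 2) % 16;   // equation 4
            table[B] = static_cast<uint8_t>(tau);
        }
    }
    return table;
}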


3.2.3 Scale estimation

The current implementation of BFROST does not provide a complex method of determining the scale of the features; the process is very simple. The input image is used to detect features at scale σ = 1. Thereafter, the image resolution is halved and the scale is doubled; features are detected again on this smaller image and their scale is set to σ = 2. The feature detection process concatenates the resulting features into the same buffer on the GPU. The half-sizing of the image is repeated until the desired maximum scale is reached and is done efficiently on the GPU without memory transfers back to the CPU. More complex scale determination methods, such as those used in BRISK, could be used along with the other components of BFROST.

3.3 Feature description

The aim of a feature descriptor is to describe the area surrounding a detected feature as uniquely and unambiguously as possible. The BFROST feature descriptor describes the area around a detected feature point with a binary string. Just like BRIEF [24] and BRISK [25], the descriptor is built by comparing intensities in the keypoint region. The main difference between the method presented in this work and those of BRIEF and BRISK is that no single pixel intensities are used for the comparison tests; instead, the average pixel intensity of an area is used. The average pixel intensity of a region can be determined quickly by using an integral image, sometimes called a summed-area table. Figure 15 illustrates the description process. At initialization, a look-up table for the base sampling point positions is created and uploaded to the GPU constant memory. Another constant look-up table, containing the pairs of sampling points that need to be compared, is created and uploaded to form the base sampling pattern. The sampling points and pattern are discussed in Section 3.3.1. Firstly, an integral image is created from the input image; different implementation methods were discussed in the literature review. For each feature that needs to be described, the area intensities at the sampling points are sampled and pairwise comparisons are performed to build the binary descriptor. Information calculated across multiple GPU threads is combined and the feature description is produced.


Figure 15. BFROST description process.

The process is executed in parallel for each detected feature. The descriptor construction process is discussed in Section 3.3.2.

3.3.1 Sampling points

The sampling pattern used by BFROST extracts information about the feature by sampling the average pixel intensities surrounding it. The sampling pattern stays fixed for all rotations and can be scaled accordingly. The base pattern is uploaded into constant GPU memory, which avoids the run-time calculation of sin and cos functions, which are expensive to evaluate. The pattern contains 64 sampling points, whereas the BRISK [25] pattern contains 60 sampling points and the number of BRIEF [24] sampling points equals the length of its descriptor. The pattern was designed so that it does not need to be rotated for different orientations. The BFROST detector estimates only 16 possible orientations, therefore the sampling pattern consists of 16 sampling points at the same radius from the feature center. Four such sets, each with a different radius, form the 64 sampling points.


Figure 16. 64 Sampling point locations with their associated square regions. The sampling point offsets (X(i), Y (i)) for each i in [0..63], with a feature scale (σ), is calculated by: r(i) = σ22+(i

mod 4)

i φ(i) = d e 4

(7)

2πφ(i) ) 16

(8)

2πφ(i) ) 16

(9)

X(i) = r(i) cos(

Y (i) = r(i) sin(

(6)

At each sampling point, the sum of all pixel intensities within a square region is calculated. Figure 16 illustrates the square regions associated with each sampling point. The half-width Z(i) of the square region associated with sampling point i is given by:

Z(i) = \frac{\pi r(i)}{8} \qquad (10)

The size Z(i) increases as the radius of the sampling point increases. This allows the captured feature information to cover the entire region around the feature without leaving too many unsampled gaps. The area of the square region is given by n(i), as shown in equation 11.

n(i) = (2Z(i) + 1)^2 \qquad (11)

Let the location of the feature point j be x(j) and y(j) in image space. Let S(i) be the average intensity of the square region sampled from sampling point i and let II be the integral image computed from the input image I. Equation 12 describes how to calculate the average pixel intensity over the region at sampling point i. The four corners of the region are sampled from the integral image and used to calculate the sum of the pixel intensities of the region. The total is then divided by the area to obtain the average pixel intensity.

S(i) = \big( II(x(j) + X(i) + Z(i),\; y(j) + Y(i) + Z(i)) + II(x(j) + X(i) - Z(i),\; y(j) + Y(i) - Z(i))
\; - \; II(x(j) + X(i) + Z(i),\; y(j) + Y(i) - Z(i)) - II(x(j) + X(i) - Z(i),\; y(j) + Y(i) + Z(i)) \big) \div n(i) \qquad (12)
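As a device-side sketch, the area average of equation 12 can be evaluated with four reads from the integral image; boxAverage is a hypothetical helper, the integral image is assumed to be stored row-major as 32-bit sums, and the caller is assumed to keep the region inside the image (the border is clamped or skipped, as described for the detector).

// Average intensity of the square region at one sampling point (equation 12).
// cx, cy: feature location; X, Y: sampling-point offset; Z: half-width; n: region area n(i).
__device__ float boxAverage(const unsigned int *integral, int imageWidth,
                            int cx, int cy, int X, int Y, int Z, float n)
{
    const int x0 = cx + X - Z, x1 = cx + X + Z;
    const int y0 = cy + Y - Z, y1 = cy + Y + Z;
    const unsigned int sum = integral[y1 * imageWidth + x1] + integral[y0 * imageWidth + x0]
                           - integral[y0 * imageWidth + x1] - integral[y1 * imageWidth + x0];
    return static_cast<float>(sum) / n;        // divide by the area to obtain the average
}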

The CUDA kernel that executes on the GPU creates the descriptor in parallel as follows:
• 32 threads are created for each feature j. The CUDA block size is set to 512 threads, so each block computes the descriptors of 16 features in parallel.
• Each thread samples two square regions, S(k) and S(k + 32), where k is the thread index for feature j. The normalized sampled results are stored in shared memory so that the other threads describing the feature can access them.
• A __syncthreads() call is issued in the kernel to ensure that all 32 threads see the full set of 64 samples for the feature through shared memory. Shared memory is accessible across threads within the same block.

All the information for each feature and each sampling point has now been captured. The next step is to construct the binary descriptor, which is described in the following section.


Figure 17. Complete testing pattern, generating a 256 bit descriptor. The image on the right illustrates a zoomed-in version of the pattern.

3.3.2 Descriptor construction

The BFROST descriptor is 256 bits long. Each of the 64 samples is compared with four other samples to form the 256 bit descriptor. Figure 17 illustrates the complete testing pattern with 256 comparisons. Of the four samples each sample is compared with, three fall on the same circle surrounding the feature and the last falls on an inverted circle. Figure 18 illustrates the testing pattern for single samples; the blue lines indicate the comparisons between sampling points. To construct the 256 bit descriptor, each sampling point i is compared with four other sampling points. The pattern is designed to capture as much detail from the region as possible, so samples that are mostly far away from each other are compared with each other. The indices of these four sampling points are:
• ((i + 8) mod 64)
• ((i + 24) mod 64)
• ((i + 36) mod 64)
• (4φ(i) + 4 + (3 − (i mod 4)))
To achieve rotation invariance, the indices of the sampling points are simply shifted by adding 4τ when performing the comparisons.

Figure 18. Sampling point comparisons.

The pattern is artificially rotated by adding a multiple of four to the index; a multiple of four is needed because the sampling points are arranged on four circles surrounding the feature. Each GPU thread computes 8 binary values by comparing samples according to the test pattern. The 8 bits are compacted into a byte and written into the descriptor's global memory. The entire descriptor uses only 32 bytes of memory per feature and all features are computed entirely on the GPU.
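A sketch of the comparison stage under these conventions follows. The names are hypothetical: d_cmpPairs is the constant table of 256 precomputed sampling-point pairs mentioned in Section 3.3, S holds the 64 area averages of one feature in shared memory, and tau is the rotation index from Section 3.2.2; the direction of the inequality is an assumption, since only the pairing is specified above.

// Each of the 32 threads per feature packs 8 comparison results into one descriptor byte.
__constant__ uchar2 d_cmpPairs[256];           // precomputed sampling-point pairs to compare

__device__ unsigned char descriptorByte(const float *S, int tau, int threadInFeature)
{
    unsigned char byte = 0;
    for (int b = 0; b < 8; ++b) {
        const uchar2 pair = d_cmpPairs[threadInFeature * 8 + b];
        const int ia = (pair.x + 4 * tau) & 63;            // rotate the pattern by adding 4*tau
        const int ib = (pair.y + 4 * tau) & 63;
        if (S[ia] < S[ib]) byte |= (unsigned char)(1u << b);
    }
    return byte;                                           // written as descriptor byte threadInFeature
}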

3.4 Feature matching

The feature matching process matches two sets of features with each other. A distance metric needs to be defined that measures the distance between two features; a distance of zero indicates that two features are exactly the same. Consider a feature set A containing m features and a feature set B containing n features. The closest matching feature to A_i is the feature j in set B for which the distance between A_i and B_j is the smallest. The closest matching feature is also called the nearest-neighbor.


A matching threshold can be applied to the distance of the nearest-neighbor; the threshold determines whether the nearest-neighbor is accepted as an actual match. It may happen that a feature from A has no match in B, in which case the distance to the nearest-neighbor should be larger than the matching threshold. The distance metric between two binary strings, such as BFROST descriptors, can be defined by the Hamming distance: the number of elements that differ between the two strings. The Hamming distance between "ABCD" and "AZZD" is 2, because two characters do not match when the strings are compared. For binary strings, the Hamming distance is simply the number of differing bits; the distance between 011001₂ and 000000₂ is 3, because three bits differ. Consider two binary strings D_1 and D_2. Applying the XOR operator to D_1 and D_2 yields a binary string D_xor that contains a 1 at each position where D_1 and D_2 differ. To calculate the Hamming distance, the number of set bits in D_xor is counted. CPU architectures provide a popcount instruction and CUDA provides the __popc intrinsic; both return the number of set bits in the given variable, which is exactly the Hamming distance. Executing an XOR followed by __popc is far more efficient than computing the squared distance between two floating-point vectors. The matching threshold applied to a binary Hamming distance is expressed as a number of bits: a match is rejected when the binary strings differ by more than the specified number of bits. BFROST uses an adjustable matching threshold with a default value of 16; if more than 16 of the 256 bits differ, the match is not accepted. Increasing the matching threshold results in more false matches, while decreasing it results in fewer matches. A high threshold means that more bits are allowed to differ for a match to be accepted; a low threshold means that the features may only differ by a few bits. The number of features detected versus the matching threshold is presented in the results chapter. The current BFROST feature matching implementation uses a simple brute-force nearest-neighbor search: for each feature in set A, the distance to every feature in set B is calculated and the closest feature is selected as the nearest-neighbor. The matching threshold is then applied to filter out most false matches. More false matches could be removed with a RANSAC [34] step, but to test the precision of BFROST itself, RANSAC was not implemented.


For each feature in set A, a CUDA thread is created that iterates through the features in set B. The Hamming distance is calculated with the XOR and __popc instructions. If the closest feature's distance is less than the matching threshold, the index of the matching feature is written into global memory. The process is repeated in both directions: set A is matched with set B and set B is matched with set A, and a list of matches that is consistent in both directions is created. A matching pair is added to the list if feature B_j is the best match for feature A_i and feature A_i is the best match for feature B_j. This simple matching scheme works well, but further improvements are possible. Faster descriptor matching structures such as BK-trees [35] could be investigated, and approximate string matching algorithms apply naturally to the binary feature matching problem in computer vision.
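A minimal CUDA sketch of the brute-force matcher is given below, assuming each 256-bit descriptor is stored as eight 32-bit words (32 bytes per feature); the kernel name and buffer layout are assumptions. Running the kernel a second time with the two sets swapped and keeping only mutually consistent pairs yields the cross-checked match list described above.

// Hamming distance between two 256-bit descriptors stored as 8 x 32-bit words.
__device__ int hammingDistance256(const unsigned int *a, const unsigned int *b)
{
    int d = 0;
    for (int w = 0; w < 8; ++w)
        d += __popc(a[w] ^ b[w]);              // count the differing bits of each word
    return d;
}

// One thread per feature of set A searches all of set B for its nearest neighbor.
__global__ void matchFeatures(const unsigned int *descA, int numA,
                              const unsigned int *descB, int numB,
                              int threshold, int *matches)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numA) return;

    int best = 257, bestIdx = -1;              // any distance <= 256 beats the initial value
    for (int j = 0; j < numB; ++j) {
        const int d = hammingDistance256(&descA[i * 8], &descB[j * 8]);
        if (d < best) { best = d; bestIdx = j; }
    }
    matches[i] = (best <= threshold) ? bestIdx : -1;   // -1 indicates no acceptable match
}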

3.5 Discussion

A simple and fast binary feature detector and descriptor have been proposed that are scalable, rotation and translation invariant, and robust to noise. The detector, descriptor and matcher can all be implemented on a GPU to exploit its parallel computing power. The time spent transferring memory between the CPU and the GPU is critical for real-time applications and should be kept to a minimum: while data is being transferred in either direction, the CPU or GPU may sit idle waiting for it, so both the number of transfers and the amount of data transferred should be limited and the GPU should be kept busy as much as possible. Within the feature extraction pipeline it is therefore important to perform most operations on the GPU and to transfer only the end result to the CPU, if required. Algorithms that consume the extracted features can also be implemented on the GPU and access the features directly from GPU memory without additional transfers. The detected feature information is kept on the GPU, non-maximal suppression is performed and the features are described with the binary descriptor, all without transferring excessive data between the GPU and the CPU. The rotation estimate of a feature is extracted at negligible extra computational cost, and its robustness is demonstrated by comparison against SIFT and SURF in Chapter 4. The descriptor maintains a low memory requirement and does not use excessive look-up tables to obtain rotation invariance as BRISK does.


Fast binary descriptor matching can be performed, which makes BFROST usable in real-time applications. Binary features can be matched efficiently by using the Hamming distance as the distance metric, which can be computed efficiently on a GPU with the appropriate hardware instructions. The main differences between BFROST and comparable feature extractors are highlighted in Tables 1 to 5. All of the feature extractors use similar non-maximal suppression techniques.

Table 1. Detection comparisons between BFROST and other feature extractors.
  BFROST   FAST implemented on GPU.
  BRIEF    None.
  BRISK    Uses the AGAST method.
  SIFT     Difference-of-Gaussian (DoG).
  SURF     Approximate Hessian matrix.

Table 2. Rotation estimation comparisons between BFROST and other feature extractors.
  BFROST   Continuous detected segment center.
  BRIEF    None.
  BRISK    Local gradients at sampling points.
  SIFT     Orientation histogram.
  SURF     Haar wavelets compute orientation on a sliding arc.

Table 3. Scale detection comparisons between BFROST and other feature extractors.
  BFROST   Detection on multiple down-sampled images.
  BRIEF    None.
  BRISK    Interpolation between octaves and intra-octaves.
  SIFT     Search over scale-space image pyramid created by DoG.
  SURF     Integral image replaces image pyramid search.


Table 4. Descriptor comparisons between BFROST and other feature extractors.
  BFROST   Rectangular area intensity comparisons.
  BRIEF    Discrete pixel intensity comparisons.
  BRISK    Gaussian smoothed pixel intensity comparisons.
  SIFT     Gradient histogram over multiple sub-regions.
  SURF     Haar wavelet responses over multiple sub-regions.

Table 5. Memory utilization comparisons between BFROST and other feature extractors.
  BFROST   Binary. 32 bytes for each feature.
  BRIEF    Binary. 16, 32 or 64 bytes for each feature.
  BRISK    Binary. 32 bytes for each feature. 40 MB look-up table.
  SIFT     Floating point. 512 bytes for each feature.
  SURF     Floating point. 256 bytes for each feature.


CHAPTER 4
Results

This chapter presents results from tests performed on the BFROST detector and descriptor. Section 4.1 briefly explains the testing environment and the data set used. Results of the repeatability tests performed on the feature detector are presented in Section 4.2. Section 4.3 presents results of the BFROST feature descriptor robustness tests. Finally, Section 4.4 concludes the chapter.

4.1 Testing environment

The hardware used in the testing environment consists of an NVidia GeForce GTX 460 for the GPU processing and an Intel Core i7 2.67 GHz for the OpenCV and CPU tests. Figure 19 shows the test images, which were captured with a GoPro camera on board an ArduCopter. The images were left in their original distorted state; camera distortions were not removed. All the feature extractors were tested with the same distorted images, so the lens distortions did not influence the results. The aerial images range over industrial, farmland and mixed scenes.

4.2 Feature detector

A metric that measures how well a feature detector performs is required; the repeatability metric is used in this work, computed in the same way as in [36]. Features are detected in one image. The image is then transformed by a known homography, noise is added or the illumination conditions are modified, and features are detected in the modified image. For each feature in the original image, the exact expected position in the modified image can be calculated, since a known transformation has been used. The repeatability is then the percentage of features that were detected in both images and that lie within a two pixel distance of the expected position in the modified image. The BFROST, FAST and BRISK detection thresholds were set to 20. The following subsections present results for the case of in-plane rotations, additive noise


Figure 19. Images used for testing.

and illumination variations. Finally, the execution time of the BFROST detector is compared with the execution time of the original FAST detector.
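A host-side sketch of this repeatability measure is shown below, assuming a hypothetical Keypoint type and a row-major 3x3 homography H; a feature counts as repeated when a feature detected in the transformed image lies within two pixels of its projected position.

#include <vector>

struct Keypoint { float x, y; };

// Percentage of reference features re-detected within `tol` pixels of their
// expected position under the known homography H (row-major 3x3).
float repeatability(const std::vector<Keypoint> &reference,
                    const std::vector<Keypoint> &transformed,
                    const float H[9], float tol = 2.0f)
{
    int repeated = 0;
    for (const Keypoint &k : reference) {
        const float w  = H[6] * k.x + H[7] * k.y + H[8];
        const float px = (H[0] * k.x + H[1] * k.y + H[2]) / w;   // expected position
        const float py = (H[3] * k.x + H[4] * k.y + H[5]) / w;
        for (const Keypoint &d : transformed) {
            const float dx = d.x - px, dy = d.y - py;
            if (dx * dx + dy * dy <= tol * tol) { ++repeated; break; }
        }
    }
    return reference.empty() ? 0.0f
                             : 100.0f * static_cast<float>(repeated) / reference.size();
}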

4.2.1 Rotation invariance

The test image shown in Figure 19a was transformed with planar rotations. The image was rotated with intervals of 5 degrees from 0 to 360 degrees. At each interval, features were detected and the repeatability was calculated. Figure 20 shows the repeatability of various popular detectors compared with BFROST. It can be seen that BFROST performed the best, while FAST, BRISK and SIFT performed slightly worse. The SURF detector performed the worst. Note the spikes near orthogonal rotations in the repeatability of all the detectors. These results show that the repeatability of the BFROST detector performs well under rotation transformations.



Figure 20. Repeatability results for the detector rotation invariance tests.

4.2.2 Noise sensitivity

Gaussian noise was added to the test image, after which features were detected. The repeatability was then calculated between the features detected in the original input image and those detected in the image with artificially added noise. Figure 21 illustrates the repeatability for varying amounts of added Gaussian noise; the standard deviation is expressed in pixel intensities (the input image intensities range from 0 to 255). BFROST performs well at low noise levels, but its repeatability degrades as noise increases because the image is not smoothed as a pre-processing step. BFROST behaves similarly to FAST, as expected; BRISK performs slightly better, while SURF performs the best.

4.2.3 Illumination invariance

The feature detectors were evaluated under different illumination conditions. The first condition consisted of a variation in image brightness.



Figure 21. Repeatability results for the detector under additive Gaussian noise.

Features detected in the input image were compared with features detected in the same image but with a variation in brightness. Figure 22 shows the repeatability results. All detectors behaved very similarly, with FAST being slightly worse than the rest.

The second condition consisted of a variation in image contrast. The contrast in the input image was reduced and increased, after which the repeatability between detected features was computed again. Figure 23 shows the repeatability results under different contrast variations. All detectors behaved very similarly, with only a slight variation at high contrast; the repeatability at high contrast values is not significant.

The intensity J is the modified pixel intensity under contrast (α) and brightness (β) variations. The original pixel intensity is I and δ is an intermediate variable used to compute J with equations 13 and 14.

\delta = \begin{cases} I (1 + \beta), & \text{for } \beta < 0 \\ I + (1 - I)\beta, & \text{for } \beta \geq 0 \end{cases} \qquad (13)

J = (\delta - 0.5) \tan\!\left((\alpha + 1)\frac{\pi}{4}\right) + 0.5 \qquad (14)
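A sketch of equations 13 and 14 applied to a single pixel is given below, with intensities normalized to [0..1] to match the α and β ranges on the graph axes; the final clamp to the valid range is an added assumption.

#include <cmath>

// Applies the brightness (beta) and contrast (alpha) variation of equations 13 and 14.
float adjustPixel(float I, float alpha, float beta)
{
    const float kPi   = 3.14159265358979f;
    const float delta = (beta < 0.0f) ? I * (1.0f + beta)        // darken   (equation 13)
                                      : I + (1.0f - I) * beta;   // brighten (equation 13)
    float J = (delta - 0.5f) * std::tan((alpha + 1.0f) * kPi / 4.0f) + 0.5f;   // equation 14
    if (J < 0.0f) J = 0.0f;                                      // clamp to the valid range (assumed)
    if (J > 1.0f) J = 1.0f;
    return J;
}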


Figure 22. Repeatability results for the detector under brightness variations.

4.2.4 Statistics

The number of features detected by each feature detector was measured. Figure 24 shows the total number of features detected while performing the rotation invariance tests. Different feature detectors use different detection threshold techniques, so the graph only indicates how many features were detected; it cannot be used to compare performance, because each detector can be adjusted to detect more or fewer features. The execution times of the feature detectors were also measured. Figure 25 shows the average normalized execution time for detecting 10000 features, where each timing was divided by the BFROST timing. Table 6 shows the actual timing values in seconds. The graph in Figure 25 indicates how the detectors performed relative to each other.



Figure 23. Repeatability results for the detector under contrast variations.

BFROST detects features much faster than the other feature detectors. FAST is approximately 6 times slower and BRISK 9 times slower than BFROST when detecting features on the specific input image.

Table 6. Average execution time for detecting 10000 features.
            Time (s)
  BFROST    0.0125531
  FAST      0.0761088
  BRISK     0.1142020
  SURF      1.8587700
  SIFT      2.9600700

The BFROST detector uses a detection threshold that determines the detection sensitivity. Figure 26 shows the number of features detected as the detection threshold increases. The number of features decreases as the threshold increases. The test was repeated with all four test images shown in Figure 19.

47


Figure 24. The number of features detected by the different feature detectors.


Figure 25. Normalized execution time for detecting features. Timings were normalized by the BFROST timing.


Figure 26. The number of features detected with various BFROST detection thresholds.

4.3 Feature descriptor

In order to compare the robustness of feature descriptors, another metric is required. For descriptor evaluation, the test image is transformed with a known homography or color intensity effect to produce a transformed image. Features are detected and described in both images. For each feature in the test image, the nearest neighbor in the transformed image is found based on the distance between the feature descriptors. A cross-check is performed by also matching the features in the transformed image against the features in the test image and counting only the features that match in both directions. The matching score is the ratio between the number of inliers and the total number of matched features. The BFROST matching threshold was set to 20 bits: any match that differed by more than 20 bits was not considered a match. The following subsections present results for planar rotations, additive noise and illumination variations, followed by the execution time of the BFROST descriptor.

4.3.1 Rotation invariance

Each test image was transformed with in-plane rotations parallel to the viewing plane. The image was rotated at intervals of 5 degrees from 0 to 360 degrees; at each interval, features were detected and described and the matching score was calculated. BFROST features were detected for the BRIEF descriptor. Figure 27 shows the matching scores of various popular descriptors compared to BFROST. SIFT performed the best, while BRIEF performed the worst, because it was not designed for rotation invariance and only performs well for small rotations. The BFROST descriptor performs similarly to the SURF descriptor under planar rotations. BRISK does not perform very well under rotations compared to the other descriptors.


Figure 27. Matching scores for the descriptor rotation invariance tests.

4.3.2 Noise sensitivity

Gaussian noise was added to the input image, after which features were detected and described. The matching score was calculated between features described in the original image and those described in the image with artificially added noise.


Figure 28 shows the matching scores for varying amounts of Gaussian noise. The standard deviation is expressed in pixel intensities (the input image intensities range from 0 to 255). BRIEF and BRISK performed poorly, because their descriptors are built from a small number of discrete pixel intensities. SURF and BFROST performed the best, while SIFT also performed well. These results show that the sampling areas used by BFROST filtered out some of the noise.


Figure 28. Matching scores for the descriptor under additive Gaussian noise.

4.3.3 Illumination invariance

The feature descriptors were evaluated under different illumination conditions. The first condition consisted of a variation in image brightness. Features described in the input image were matched with features described in the same image but with a variation in brightness. Figure 29 shows the matching score results; all the descriptors performed similarly. Note that the scale of the graph differs from the graphs shown for the feature detectors. Under very low brightness the image becomes totally black and the matching scores quickly fall to zero.


Figure 29. Matching scores for the descriptor under brightness variations.

At very high brightness the image becomes totally white and again the matching scores quickly fall to zero. Only the relevant range of brightness values is shown.

The second condition consisted of a variation in image contrast. The contrast in the input image was reduced and increased, after which the matching scores between the described features were computed again. Figure 30 shows the matching score results under different contrast variations. The scale of the graph does not show the extreme contrast variations used in the detector graphs; the descriptors do not behave well under extreme contrast variations. BRIEF and BRISK performed poorly as the contrast increases, while BFROST, SIFT and SURF performed similarly for high contrast variations. The poor performance of BRIEF and BRISK is due to their use of discrete pixel sampling instead of area sampling.



Figure 30. Matching scores for the descriptor under contrast variations.

4.3.4 Statistics

The execution times of the feature descriptors were measured. Figure 31 shows the average execution time for describing 10000 features, normalized by the BFROST timing. Table 7 shows the actual timing values in seconds. The graph in Figure 31 indicates how the descriptors perform relative to each other.

Table 7. Average execution time for describing 10000 features.
            Time (s)
  BFROST    0.00164335
  BRIEF     0.77243700
  BRISK     1.68828000
  SURF      14.13420000
  SIFT      45.76700000



Figure 31. Normalized execution time for describing features. The speed of BFROST is shown as 1 and the other descriptors were normalized by the BFROST timings.

BFROST describes features much faster than the other feature descriptors. BRIEF was approximately 470 times slower and BRISK 1027 times slower than BFROST when describing features on the specific input image; SIFT was 27849 times slower. These results show that the parallel GPU implementation and the use of integral images greatly improve the execution time of BFROST relative to the other descriptors.

4.4 Discussion

Tests were performed on feature detectors such as BFROST, FAST, SIFT, SURF and BRISK. The rotation invariance, noise sensitivity and illumination invariance of these detectors were evaluated. It was shown that BFROST compares well with the other well known feature detectors. The repeatability of the detectors was calculated and compared.


The execution time of the BFROST detector clearly shows an advantage over the other detectors and it is worthwhile to use a GPU implementation of the popular FAST feature detector. The feature descriptors of BFROST, BRIEF, BRISK, SIFT and SURF were tested. The performance was measured by calculating the percentage of inliers over all matching features. The matching score results showed that the BFROST descriptor performs well under most typical transformations. The simplistic rotation estimation of BFROST proves to be valid. The execution time of the BFROST descriptor is superior in comparison to the other feature descriptors. The BFROST features can be described in a fraction of the time.


CHAPTER 5
Applications

Feature based methods are important in computer vision. The popular feature extractors mentioned in the literature review have been used by many developers in useful applications. Likewise, the BFROST feature extractor can be applied to various applications. The following sections briefly describe some of the applications to which the BFROST feature extractor can be applied. BFROST has been applied to each of them and the results are illustrated. Note that these are not the only applications that could be implemented.

5.1 Optical flow

The motion in a scene can be described by the optical flow or image velocity field. Imagine images produced by a stationary security camera: the motion of objects passing through the field of view of the camera can be described by a 2D motion vector, parallel to the image plane when seen in 3D. The orientation of the vector describes the direction in which the object is moving and the magnitude describes its speed. With a static scene and a stationary camera, the motion vectors of the background and of stationary objects remain zero. Furthermore, all motion vectors in a scene can be clustered into groups, where each cluster corresponds to objects or features moving in approximately the same direction; the clusters can help to segment the scene into different objects. The optical flow of images captured with a moving camera provides information on the camera movement and on the scene itself; more detail on camera pose estimation is provided in Section 5.3. The same clustering can be applied even if the camera is moving, but all motion vectors then include the camera motion, which needs to be removed before the motion of each object in the world frame can be calculated. An aerial vehicle that follows a moving target on the ground at exactly the same speed will observe a zero motion vector for the target, while the background motion vectors will reflect the speed and direction of the aerial platform. BFROST can be used to compute correlation-based optical flow: the number of detected features is sparse and features need to be matched or correlated between consecutive frames. Figure 32 illustrates two consecutive image frames captured from an aerial vehicle.


Figure 32. Feature based optical flow.

The BFROST features detected on a single scale are indicated with red circles. A full feature matching process is performed, and the difference in location along the horizontal and vertical axes provides the motion vector in the image plane for each matched feature. The motion vectors are displayed in green on the right-hand image in Figure 32. More detail on different methods of computing the optical flow in images can be found in [37].

5.2 Video stabilization

Long range camera systems with a typically narrow field of view are subject to image jitter. Even the slightest angular motion caused by wind or vibration can cause undesired image translations between frames. The video sequence can be stabilized if the translation between the frames is known; the process of determining the translation is sometimes referred to as image registration. Tracking of SIFT features [10] has been used to stabilize video in [38], and the same method can be applied by tracking BFROST features across video frames. The process of feature based video stabilization can be described as follows:
• A sparse feature based optical flow is calculated for each frame as described in Section 5.1.
• All the motion vectors detected in the frame are added together to calculate the frame motion vector. This frame motion vector describes the relative translation between the frames.
• The absolute translation vector over time is calculated by adding the frame motion vector to the absolute translation vector.
• The filtered translation vector over time is calculated by adding a filtered frame motion vector to the filtered translation vector. The filter used in this work was a low-pass filter, whose function is to smooth the translation over time (a minimal sketch of this update follows the list).
• The difference between the absolute and the filtered translation vector is used as the translation when displaying the video frame. The filter ensures that intentional camera movement is not regarded as image jitter.
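A minimal per-frame sketch of the accumulation and filtering steps is given below; the exponential low-pass filter and its smoothing factor are assumptions, since the exact filter is not specified, and the sign convention of the returned display offset is likewise assumed.

struct Vec2 { float x, y; };

// Per-frame stabilization state and update.
struct Stabilizer {
    Vec2  absolute{0.0f, 0.0f};      // accumulated raw translation
    Vec2  filtered{0.0f, 0.0f};      // accumulated filtered translation
    Vec2  smoothMotion{0.0f, 0.0f};  // low-pass filtered frame motion
    float smoothing = 0.9f;          // hypothetical low-pass factor (closer to 1 = smoother)

    // frameMotion: motion vector between the previous and current frame.
    // Returns the offset applied when displaying the current frame.
    Vec2 update(Vec2 frameMotion)
    {
        absolute.x += frameMotion.x;
        absolute.y += frameMotion.y;
        smoothMotion.x = smoothing * smoothMotion.x + (1.0f - smoothing) * frameMotion.x;
        smoothMotion.y = smoothing * smoothMotion.y + (1.0f - smoothing) * frameMotion.y;
        filtered.x += smoothMotion.x;            // accumulate the filtered frame motion
        filtered.y += smoothMotion.y;
        return Vec2{filtered.x - absolute.x, filtered.y - absolute.y};
    }
};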

Figure 33. Feature based video stabilization.

Figure 33 illustrates a sequence of video frames. The top row displays the original unstable video frames, and the bottom row displays the calculated motion vectors that were used to stabilize the video; time elapses from left to right. The detection threshold was set to 25 and the matching threshold to 8. Figure 34 shows a graph with the horizontal and vertical motion vector elements, together with their corresponding filtered values. The results in the graph were generated from a long range maritime surveillance system; the camera was scanning in a horizontal direction, indicated by the rising X values. The difference between the filtered and absolute values can be seen on the graph, and Figure 35 shows a zoomed-in version of the same graph. The results show that BFROST can be successfully used to stabilize video in real-time applications.



Figure 34. Video stabilization values over time.


Figure 35. Zoomed-in stabilization values over time.


5.3 Camera pose estimation

Camera pose estimation is useful for many applications; stereo vision algorithms, for example, need to determine the pose of the two cameras relative to each other to perform depth estimation. The pose of a camera is sometimes referred to as the extrinsic characteristics of the camera and consists of a 6-degrees-of-freedom (6-DOF) transformation relative to a world frame: translation along three axes and rotation in yaw, pitch and roll. Camera calibration methods determine the intrinsic and extrinsic characteristics of the camera. Intrinsic characteristics are determined by the focal length and lens distortions, which remain fixed, while the extrinsic values change with the camera pose. The relative pose transformation between two cameras can be described with the fundamental matrix; various methods for computing it are described in [39]. The fundamental matrix transforms the locations of features detected in one frame to match the same features visible in a second frame and conforms to the epipolar geometry [40]. To compute the fundamental matrix, feature correspondences between the two frames are required. The fundamental matrix can be used when the camera intrinsic values are unknown; the essential matrix [41] is used when the intrinsic values are known, and the camera transformation can be extracted from the essential matrix as shown in [42]. The absolute camera pose can be estimated when the 3D locations of the detected feature points are known. This is called the Perspective-n-Point problem (PnP): the 3D locations of n features and the corresponding 2D normalized rays from the camera need to be known, and the solutions of the PnP problem provide the absolute camera pose. The recent work of [43] can be used to directly calculate the four possible solutions of P3P. A feature based approach would be to keep track of features with known 3D locations, detect them and then estimate the camera pose by solving P3P. The 3D locations of new features can be estimated by triangulation [44] from the estimated camera positions and matched features; these new features can then be added to the list of features with known locations. Both the relative and the absolute camera pose estimation techniques require local image features matched between multiple frames. BFROST provides a fast and efficient way to find corresponding features, and it is therefore feasible to use BFROST in an application that needs to estimate the location of a MAV.


5.4 Object recognition

Automatic object recognition from images is the process of recognizing an object in an image from a set of predefined objects; usually, such a system will also be able to locate the object in the image. The system can learn the appearance of an object by first observing a training image of it. During testing, a similar object is shown to the system, and the goal of object recognition is to classify the shown object into the correct class. Imagine training the system by showing it the cover of a book, a cup, a teddy bear and a toy car. After the training process, the system should be able to identify the toy car as a toy car and not a book, even if the toy car is rotated slightly during testing. The authors of SIFT have shown in [45] that local image features can be used to perform object recognition, and BFROST can likewise be used to perform real-time object recognition. Firstly, the system is trained with the desired objects: each object is shown to the BFROST test application, where features are extracted and stored as the appearance model of the object, and the process is repeated for several objects. During the testing phase an object is shown to the system, possibly from a different viewpoint. The features are extracted again and matched with the appearance models of all the objects the system was trained on. The object with the highest matching score is selected as the recognized object, and a positive recognition is registered when the matching score is above a specified threshold. Figure 36 illustrates the BFROST test application successfully recognizing a remote control box that was turned upside down. The red circles show the features detected in the live image. The closest matching objects from the database are shown at the right of the image, with the top right image showing the best match with the appearance model of the object. The lines connecting the features from the live image with the appearance model indicate the matching features; in Figure 36 it can be seen that the box has been turned upside down and the matching features reflect this. Figure 37 shows a linear algebra book turned on its side that was successfully recognized over the image processing book. During the localization process of the MAV, objects or landmarks with known GPS coordinates can be recognized in a similar fashion, and the locations of the detected landmarks can then be used to determine the MAV location. The application shows that BFROST can be used in a simple object recognition system and that objects can be recognized even under different image transformations.


Figure 36. Upside down box successfully recognized.

Figure 37. Book cover successfully recognized under rotation transformation.


5.5 Object tracking

Object tracking can be useful in many applications. Security surveillance applications may want to detect an intruder and track the intruder's location. Maritime applications may want to track all the vessels near a harbor and report suspicious behavior autonomously. Sport applications may want to track the players on the field to calculate statistics for post-match analysis. Object tracking with SIFT features and the mean-shift algorithm was shown to be possible in [46]. BFROST was used to implement a simplistic short-term object tracker. Firstly, a rectangular region that contains the object to track is selected in the test application. All the BFROST features within the region are then extracted and stored as the appearance model of the object. The tracking phase consists of performing a full feature matching scheme between the live image and the appearance model of the tracked object. The object location is updated when more than 50% of the features in the appearance model match with the scene; the new location is simply the mean location of all the matched features. The size of the rectangular region remains the same, and the appearance model is replaced by the features within the newly detected object region. The tracker is very simplistic for demonstration purposes and the updating of the appearance model needs to be done more intelligently if long-term tracking is desired. The system can track multiple objects simultaneously.
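A minimal sketch of the location update, using a hypothetical Point2f type: the position is replaced by the mean of the matched feature locations only when at least half of the appearance-model features matched.

#include <cstddef>
#include <vector>

struct Point2f { float x, y; };

// Returns true when the tracked location was updated.
bool updateTrack(const std::vector<Point2f> &matchedLocations,
                 std::size_t modelSize, Point2f &location)
{
    if (modelSize == 0 || matchedLocations.size() * 2 < modelSize)
        return false;                                      // fewer than 50% matched: keep the old location
    Point2f mean{0.0f, 0.0f};
    for (const Point2f &p : matchedLocations) { mean.x += p.x; mean.y += p.y; }
    const float n = static_cast<float>(matchedLocations.size());
    location = Point2f{mean.x / n, mean.y / n};            // new location: mean of the matched features
    return true;
}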


Figure 38. A building tracked across multiple viewpoints.

Figure 39. A vessel being tracked.


5.6 Image indexing and retrieval

Image databases may contain thousands or even millions of images or photographs. Searching such a database for a specific image can be computationally expensive and time consuming. For example, the location of a photograph might be determined by searching through all the available images from Google street view [47] and selecting the closest match as the predicted location. Firstly, a similarity measure for comparing two images should be defined. A feature based approach would be to detect the image features in both images and perform a complete match between the features; outliers can be removed with a RANSAC [34] approach, and the percentage of matching features then indicates the similarity between the two images. Performing such a complete match against thousands of images is not feasible and requires a lot of computation power. The idea of image indexing and retrieval is to quickly index a given image and quickly retrieve similar images from a potentially huge database. The image needs to be described in a low dimensional space so that it can be unambiguously retrieved among other images. In computer vision, an image can be described as a bag-of-visual-words or bag-of-features [48]: each local image feature is classified as a visual word, and a histogram of all the visual words in the image is calculated. Usually, the number of visual words in the visual vocabulary is limited and far smaller than the number of possible feature descriptions. The histogram of visual words can then be used to describe the image at a high level. Matching between images can be performed by first computing the distance between the histograms; only images with histograms closer than a threshold are considered further, before a complete feature matching process is performed. BFROST is a binary feature descriptor. For image indexing and retrieval, the visual vocabulary first needs to be defined: using a set of training images from the application domain, a set of BFROST features is described and clustered into a binary hierarchical k-means tree or vocabulary tree [49]. The tree is then used to assign a visual word to each feature described in an image. Lastly, a histogram is computed from the visual words, and this histogram is used to sort the images in the database according to similarity. Figure 40 illustrates the BFROST image indexing and retrieval application in action. The blue histogram at the top left is the histogram of the live image; it contains 64 bins, indicating that the visual vocabulary consists of only 64 possible visual words. The small images on the right are sorted according to their similarity to the live image.
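A host-side sketch of the histogram step is given below; assignVisualWord is a hypothetical quantizer that descends the vocabulary tree to map a 32-byte BFROST descriptor to one of the 64 visual words, and the L1 distance between normalized histograms is one possible similarity measure.

#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical quantizer: maps a 32-byte binary descriptor to a visual word in [0..63].
int assignVisualWord(const uint8_t *descriptor);

// Builds the normalized 64-bin bag-of-visual-words histogram of one image.
std::vector<float> buildHistogram(const std::vector<const uint8_t*> &descriptors)
{
    std::vector<float> hist(64, 0.0f);
    for (const uint8_t *d : descriptors)
        hist[assignVisualWord(d)] += 1.0f;
    if (!descriptors.empty())
        for (float &bin : hist)
            bin /= static_cast<float>(descriptors.size());
    return hist;
}

// L1 distance between two histograms; smaller values indicate more similar images.
float histogramDistance(const std::vector<float> &a, const std::vector<float> &b)
{
    float d = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i)
        d += std::fabs(a[i] - b[i]);
    return d;
}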


Figure 40. Feature based image indexing and retrieval.

In Figure 41 it can be seen that the images containing the tower are more similar to the live image, which also contains the tower. Within a MAV localization application, image indexing and retrieval can be used to quickly search for areas over which the MAV has previously flown. This improves the localization accuracy, because the position estimate can then be refined from previously observed areas rather than only on a frame-by-frame basis. It was shown that BFROST can be useful in image indexing and retrieval applications.

Figure 41. Real-time image indexing and retrieval with BFROST.


5.7 Discussion

Local image features are important building blocks for many image processing algorithms, as shown in this chapter. Feature-based methods can be applied to sparse optical flow, as shown in Section 5.1. Video stabilization can be performed by shifting the output image towards the most stable direction, as shown in Section 5.2. The camera pose can be estimated with various methods, as discussed in Section 5.3. Basic object recognition can be performed by matching features, as discussed in Section 5.4. Section 5.5 described how a basic feature-based object tracker can be implemented to track multiple objects, and Section 5.6 showed that images can be efficiently indexed and retrieved by using a bag-of-visual-words. Many such applications have been implemented with popular feature extractors such as SIFT; as illustrated throughout this chapter, BFROST can be used to implement the same kinds of applications.


CHAPTER 6 Conclusion

This chapter concludes the study with an overview of what has been reviewed, implemented and achieved. The main contributions are listed in Section 6.1. Section 6.2 discusses the limitations of the BFROST feature extractor, while Section 6.3 proposes future research.

6.1 Contributions

A literature review of popular and recent local image feature detectors and descriptors has been completed, and the strengths and weaknesses of each feature extractor have been highlighted. The information gained from the literature review was used to design and implement a new feature extractor, namely BFROST, and comparison studies between BFROST and other feature extractors were performed. The main contributions are listed below:

• The first GPU implementation of the FAST detector has been produced.
• A novel and simple feature orientation estimation method has been implemented and tested. The results prove that the estimation performs well, and the very fast rotation estimation of BFROST requires no local gradient calculations.
• The BFROST feature detector competes with the most popular detectors, such as SIFT, SURF and BRISK, in terms of repeatability and performance.
• The BFROST feature descriptor describes a feature faster than the other feature descriptors, as shown in the results.
• The feature extractor is robust to rotation, illumination variations and noise.
• Training data is not required, and the extractor does not depend on the target application domain to be robust.
• BFROST scales well and the descriptor is robust to noise, because region intensities are used for the comparison tests instead of discrete pixels.


• Unlike BRISK and BRIEF, the sampling pattern does not need to be rotated to achieve rotation invariance.
• BFROST does not require a large look-up table for the sampling pattern, and it does not need to perform Gaussian smoothing on the sampling patches.
• BFROST uses an integral image to sample area intensities when constructing the descriptor, which benefits both speed and noise reduction. Because of the integral image, the descriptor can be calculated at different scales without any performance penalty (a minimal sketch of this constant-time area sampling is given after this list).
• The BFROST feature extractor has been published and presented at a South African conference [33].
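The constant-time area sampling referred to above relies on the standard integral-image identity: the sum of any axis-aligned rectangle is obtained from four table look-ups. The sketch below is a generic CPU illustration of that identity, not the CUDA code used by BFROST; the function names are chosen for this example.

```cpp
#include <cstdint>
#include <vector>

// Integral image: I(x, y) holds the sum of all pixels above and to the left
// of (x, y), inclusive. Built once per image in a single pass.
std::vector<uint32_t> buildIntegralImage(const std::vector<uint8_t>& img,
                                         int width, int height)
{
    std::vector<uint32_t> integral(width * height, 0);
    for (int y = 0; y < height; ++y) {
        uint32_t rowSum = 0;
        for (int x = 0; x < width; ++x) {
            rowSum += img[y * width + x];
            integral[y * width + x] =
                rowSum + (y > 0 ? integral[(y - 1) * width + x] : 0);
        }
    }
    return integral;
}

// Sum of the rectangle with top-left (x0, y0) and bottom-right (x1, y1),
// both inclusive, using four look-ups. The cost is constant regardless of
// the patch size, which is why sampling at larger scales costs nothing extra.
uint32_t boxSum(const std::vector<uint32_t>& integral, int width,
                int x0, int y0, int x1, int y1)
{
    uint32_t A = (x0 > 0 && y0 > 0) ? integral[(y0 - 1) * width + (x0 - 1)] : 0;
    uint32_t B = (y0 > 0) ? integral[(y0 - 1) * width + x1] : 0;
    uint32_t C = (x0 > 0) ? integral[y1 * width + (x0 - 1)] : 0;
    uint32_t D = integral[y1 * width + x1];
    return D - B - C + A;
}
```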

6.2 Limitations

The BFROST feature extractor was implemented in CUDA, so only CUDA-enabled hardware can execute the current implementation. The scale detection of BFROST is very simplistic and cannot determine the exact scale of a feature; only discrete scales are detected. A more sophisticated scale detector, such as the one used in BRISK, should be implemented instead. Like FAST, the BFROST detector is sensitive to noise, and the image needs to be smoothed to improve repeatability in the presence of noise. The results might also improve if bi-linear interpolation is used when sampling the image during detection. In low contrast scenes the BFROST detector does not detect many features, because the detection threshold is static throughout the detection process; the threshold should be adjusted automatically to match the scene conditions.

6.3 Future research

Possible improvements and future research possibilities are listed below:

• Improve the detection of features in low contrast scenes, for example by adjusting the detection threshold dynamically.
• Very noisy scenes produce many false detections; these should be reduced by applying a smoothing function to the image before detection.
• A more sophisticated and exact scale determination method needs to be researched.


• Different sampling patterns can be analyzed in more detail and compared with each other.
• The brute-force feature matching process can be optimized by investigating tree-based search methods as well as approximate nearest-neighbor algorithms (a sketch of one such approach for binary descriptors follows this list).
• The robustness of the BFROST feature extractor could be tested in real-time visual SLAM applications.
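As an illustration of the approximate nearest-neighbor direction suggested above, the sketch below implements bit-sampling locality-sensitive hashing for binary descriptors: a handful of randomly chosen descriptor bits form a bucket key, and the exact Hamming comparison is performed only within the query's bucket. This is one possible approach and not part of BFROST; all class and function names are hypothetical, and several such tables would normally be combined to improve recall.

```cpp
#include <array>
#include <cstdint>
#include <random>
#include <unordered_map>
#include <vector>

using Descriptor = std::array<uint64_t, 4>;   // 256-bit binary descriptor

static int hamming(const Descriptor& a, const Descriptor& b) {
    int d = 0;
    for (size_t i = 0; i < a.size(); ++i)
        d += __builtin_popcountll(a[i] ^ b[i]);   // GCC/Clang intrinsic
    return d;
}

static bool getBit(const Descriptor& d, int bit) {
    return (d[bit / 64] >> (bit % 64)) & 1ULL;
}

// One bit-sampling LSH table: the key is formed from 'keyBits' randomly
// chosen bit positions (keyBits <= 64), so descriptors with small Hamming
// distance are likely to share a bucket.
class BitSamplingIndex {
public:
    BitSamplingIndex(int keyBits, int descriptorBits = 256, unsigned seed = 42) {
        std::mt19937 rng(seed);
        std::uniform_int_distribution<int> pick(0, descriptorBits - 1);
        for (int i = 0; i < keyBits; ++i) bits_.push_back(pick(rng));
    }

    void add(const Descriptor& d, int id) {
        data_.push_back(d);
        ids_.push_back(id);
        buckets_[key(d)].push_back(data_.size() - 1);
    }

    // Approximate nearest neighbour: exact Hamming search, but only inside
    // the query's bucket instead of over the whole database. Returns -1
    // when no candidate within 'maxHamming' bits is found.
    int query(const Descriptor& q, int maxHamming) const {
        auto it = buckets_.find(key(q));
        if (it == buckets_.end()) return -1;
        int bestId = -1, bestDist = maxHamming + 1;
        for (size_t idx : it->second) {
            int d = hamming(q, data_[idx]);
            if (d < bestDist) { bestDist = d; bestId = ids_[idx]; }
        }
        return bestId;
    }

private:
    uint64_t key(const Descriptor& d) const {
        uint64_t k = 0;
        for (int b : bits_) k = (k << 1) | (getBit(d, b) ? 1ULL : 0ULL);
        return k;
    }
    std::vector<int> bits_;
    std::vector<Descriptor> data_;
    std::vector<int> ids_;
    std::unordered_map<uint64_t, std::vector<size_t>> buckets_;
};
```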


LIST OF REFERENCES

[1] E. Rosten and T. Drummond, "Machine learning for high-speed corner detection," in European Conference on Computer Vision, vol. 1, May 2006, pp. 430-443.
[2] A. Puri, "A Survey of Unmanned Aerial Vehicles (UAV) for Traffic Surveillance," Department of Computer Science and Engineering, University of South Florida, 2005.
[3] P. Campoy, J. Correa, I. Mondragón, C. Martínez, M. Olivares, L. Mejías, and J. Artieda, "Computer vision onboard UAVs for civilian tasks," Unmanned Aircraft Systems, pp. 105-135, 2009.
[4] L. Young, "Small Autonomous Air/Sea System Concepts for Coast Guard Missions," in USCG Maritime Domain Awareness Requirements, Capabilities, and Technology (MDA RCT) Forum, Santa Clara, CA, May, vol. 2, 2005.
[5] K. Lee, "Development of Unmanned Aerial Vehicle (UAV) for Wildlife Surveillance," Ph.D. dissertation, University of Florida, 2004.
[6] "Arducopter platform," http://code.google.com/p/arducopter/.
[7] "HD GoPro," http://www.gopro.com.
[8] Ó. Mozos, A. Gil, M. Ballesta, and O. Reinoso, "Interest point detectors for visual SLAM," Current Topics in Artificial Intelligence, pp. 170-179, 2007.
[9] G. Bradski, "The OpenCV Library," Dr. Dobb's Journal of Software Tools, 2000.
[10] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91-110, 2004.
[11] H. Bay, T. Tuytelaars, and L. V. Gool, "SURF: Speeded up robust features," in ECCV, 2006, pp. 404-417.
[12] L. Ledwich and S. Williams, "Reduced SIFT features for image retrieval and indoor localisation," in Australian Conference on Robotics and Automation, vol. 322, 2004.
[13] Y. Ke and R. Sukthankar, "PCA-SIFT: A more distinctive representation for local image descriptors," in Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, vol. 2. IEEE, 2004, pp. II-506.
[14] M. Grabner, H. Grabner, and H. Bischof, "Fast approximated SIFT," Computer Vision - ACCV 2006, pp. 918-927, 2006.
[15] F. Porikli, "Integral histogram: A fast way to extract histograms in cartesian spaces," in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 1. IEEE, 2005, pp. 829-836.
[16] C. Wu, "SiftGPU: A GPU implementation of scale invariant feature transform (SIFT)," http://cs.unc.edu/~ccwu/siftgpu, 2007.
[17] NVIDIA Corporation, "NVIDIA CUDA programming guide version 4.0."
[18] C. Evans, "Notes on the OpenSURF library: Speeded up robust features," University of Bristol, Tech. Rep. CSTR09001, January 2009.
[19] N. Cornelis and L. Van Gool, "Fast scale invariant feature detection and matching on programmable graphics hardware," in Computer Vision and Pattern Recognition Workshops, 2008. CVPRW'08. IEEE Computer Society Conference on. IEEE, 2008, pp. 1-8.
[20] E. Rosten and T. Drummond, "Fusing points and lines for high performance tracking," in IEEE International Conference on Computer Vision, vol. 2, October 2005, pp. 1508-1511.
[21] J. Quinlan, "Induction of decision trees," Machine Learning, vol. 1, no. 1, pp. 81-106, 1986.
[22] E. Rosten, R. Porter, and T. Drummond, "Faster and better: A machine learning approach to corner detection," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 32, no. 1, pp. 105-119, 2010.
[23] S. Kirkpatrick, C. Gelatt, and M. Vecchi, "Optimization by simulated annealing," Science, vol. 220, no. 4598, p. 671, 1983.
[24] M. Calonder, V. Lepetit, C. Strecha, and P. Fua, "BRIEF: Binary robust independent elementary features," Computer Vision - ECCV 2010, pp. 778-792, 2010.
[25] S. Leutenegger, M. Chli, and R. Siegwart, "BRISK: Binary robust invariant scalable keypoints," in International Conference on Computer Vision (ICCV 2011), 2011.
[26] E. Mair, G. Hager, D. Burschka, M. Suppa, and G. Hirzinger, "Adaptive and generic corner detection based on the accelerated segment test," Computer Vision - ECCV 2010, pp. 183-196, 2010.
[27] P. Viola and M. Jones, "Robust real-time face detection," International Journal of Computer Vision, vol. 57, no. 2, pp. 137-154, 2004.
[28] J. Hensley, T. Scheuermann, G. Coombe, M. Singh, and A. Lastra, "Fast summed-area table generation and its applications," in Computer Graphics Forum, vol. 24, no. 3. Wiley Online Library, 2005, pp. 547-555.
[29] N. Zhang, "Working towards efficient parallel computing of integral images on multi-core processors," in Computer Engineering and Technology (ICCET), 2010 2nd International Conference on, vol. 2. IEEE, 2010, pp. V2-30.
[30] G. Blelloch, "Prefix sums and their applications," Synthesis of Parallel Algorithms, pp. 35-60, 1990.
[31] M. Harris, S. Sengupta, and J. Owens, "Parallel prefix sum (scan) with CUDA," GPU Gems, vol. 3, no. 39, pp. 851-876, 2007.
[32] B. Bilgic, B. Horn, and I. Masaki, "Efficient integral image computation on the GPU," in Intelligent Vehicles Symposium (IV), 2010 IEEE. IEEE, 2010, pp. 528-533.
[33] J. Cronje, "BFROST: Binary Features from Robust Orientation Segment Tests accelerated on the GPU," in 22nd Annual Symposium of the Pattern Recognition Association of South Africa, 2011.
[34] M. Fischler and R. Bolles, "Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography," Communications of the ACM, vol. 24, no. 6, pp. 381-395, 1981.
[35] R. Baeza-Yates and G. Navarro, "Fast approximate string matching in a dictionary," in String Processing and Information Retrieval: A South American Symposium, 1998. Proceedings. IEEE, 1998, pp. 14-22.
[36] C. Schmid, R. Mohr, and C. Bauckhage, "Evaluation of interest point detectors," International Journal of Computer Vision, vol. 37, no. 2, pp. 151-172, 2000.
[37] S. Beauchemin and J. Barron, "The computation of optical flow," ACM Computing Surveys (CSUR), vol. 27, no. 3, pp. 433-466, 1995.
[38] S. Battiato, G. Gallo, G. Puglisi, and S. Scellato, "SIFT features tracking for video stabilization," in Image Analysis and Processing, 2007. ICIAP 2007. 14th International Conference on. IEEE, 2007, pp. 825-830.
[39] Q. Luong, R. Deriche, O. Faugeras, T. Papadopoulo, et al., "On determining the fundamental matrix: Analysis of different methods and experimental results," 1993.
[40] Z. Zhang, "Determining the epipolar geometry and its uncertainty: A review," International Journal of Computer Vision, vol. 27, no. 2, pp. 161-195, 1998.
[41] J. Philip, "A non-iterative algorithm for determining all essential matrices corresponding to five point pairs," The Photogrammetric Record, vol. 15, no. 88, pp. 589-599, 1996.
[42] W. Wang and H. Tsui, "A SVD decomposition of essential matrix with eight solutions for the relative positions of two perspective cameras," in Pattern Recognition, 2000. Proceedings. 15th International Conference on, vol. 1. IEEE, 2000, pp. 362-365.
[43] L. Kneip, D. Scaramuzza, and R. Siegwart, "A novel parametrization of the perspective-three-point problem for a direct computation of absolute camera position and orientation," in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, 2011, pp. 2969-2976.
[44] R. Hartley and P. Sturm, "Triangulation," Computer Vision and Image Understanding, vol. 68, no. 2, pp. 146-157, 1997.
[45] D. Lowe, "Object recognition from local scale-invariant features," in ICCV. IEEE Computer Society, 1999, p. 1150.
[46] H. Zhou, Y. Yuan, and C. Shi, "Object tracking using SIFT features and mean shift," Computer Vision and Image Understanding, vol. 113, no. 3, pp. 345-352, 2009.
[47] D. Anguelov, C. Dulong, D. Filip, C. Frueh, S. Lafon, R. Lyon, A. Ogale, L. Vincent, and J. Weaver, "Google street view: capturing the world at street level," Computer, vol. 43, no. 6, pp. 32-38, 2010.
[48] Y. Jiang, C. Ngo, and J. Yang, "Towards optimal bag-of-features for object categorization and semantic video retrieval," in Proceedings of the 6th ACM International Conference on Image and Video Retrieval. ACM, 2007, pp. 494-501.
[49] D. Nister and H. Stewenius, "Scalable recognition with a vocabulary tree," in Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, vol. 2. IEEE, 2006, pp. 2161-2168.

APPENDIX A List of terminologies

Integral Image: The integral image is an image used to quickly calculate the sum over a rectangular area within an image.

Matching score: The ratio between the number of inliers and the total number of matched features.

Repeatability: The percentage of features that were detected in two images and which are within a two pixel distance from the expected calculated position in the modified image.

Image feature: An interesting part in an image that can be easily recognized from multiple viewpoints or color transformations.

Non-maximal suppression: Elements that do not contain a maximal value within the neighborhood of elements are suppressed or removed.


APPENDIX B List of abbreviations

2D: Two Dimensional
3D: Three Dimensional
AGAST: Adaptive and Generic Accelerated Segment Test
BFROST: Binary Features from Robust Orientation Segment Test
BRIEF: Binary Robust Independent Elementary Features
BRISK: Binary Robust Invariant Scalable Keypoints
CPU: Central Processing Unit
CUDA: Compute Unified Device Architecture
DOF: Degrees-of-freedom
DoG: Difference-of-Gaussian
DoM: Difference-of-Mean
FAST: Features from Accelerated Segment Test
GPU: Graphics Processing Unit
GPS: Global Positioning System
HD: High-Definition
IMU: Inertial Measurement Unit
MAV: Micro Aerial Vehicle
MB: Mega byte
PCA: Principal Component Analysis
PnP: Perspective-n-point-Problem
RANSAC: Random Sample Consensus
SIFT: Scale-Invariant Feature Transform
SLAM: Simultaneous Localization And Mapping
SURF: Speeded Up Robust Feature
XOR: Exclusive or

