
Performance Comparison of Point Feature Detectors and Descriptors for Visual Navigation on Android Platform

Michał Nowicki, Piotr Skrzypczyński
Institute of Control and Information Engineering, Poznań University of Technology
ul. Piotrowo 3A, PL-60-965 Poznań, Poland
Email: [email protected], [email protected]

Abstract: Consumer electronics mobile devices, such as smartphones and tablets, are quickly growing in computing power and are becoming equipped with advanced sensors. This makes a modern mobile device a viable platform for many computation-intensive, real-time applications. In this paper we present a study on the performance and robustness of point feature detection and description in images acquired by a mobile device in the context of visual navigation. This is an important step towards infrastructure-less indoor self-localization and user guidance using only a smartphone or tablet. We rigorously evaluate the performance of several interest point detector and descriptor pairs on three different Android devices, using image sequences from publicly available robotics-related data sets, as well as our own data set obtained using a smartphone.

Keywords: point features, detectors, descriptors, mobile devices

I. INTRODUCTION

The computing power of modern mobile devices is increasing steadily as new models arrive on the market. Moreover, contemporary mobile devices are equipped with high-resolution cameras and other sensors, such as an Inertial Measurement Unit (IMU) and a magnetometer. These improvements make it possible to use smartphones or tablets in new roles. A task particularly interesting from the point of view of potential applications is self-localization in GPS-denied environments. Knowing the exact position of the user in the environment enables a multitude of location-aware services, from advertisement to guidance in large buildings.

The most powerful sensor readily available in a mobile device is the camera. Visual navigation has received much attention in robotics research. The Simultaneous Localization and Mapping (SLAM) problem [21] has been solved using a monocular camera as the only sensor [6]. Another approach to visual navigation is visual odometry [7], which estimates the egomotion of a camera attached to a moving agent (e.g. a human carrying a mobile device). While the accuracy and robustness of visual feature detectors and descriptors are crucial to the performance of visual navigation algorithms, the speed of the extraction and description operations decides whether a given detector-descriptor pair can be applied in real-time self-localization.

Although performance evaluation of point feature detectors/descriptors has recently been a popular research topic in computer vision and robotics, as far as we know there is no study available on the performance of point feature detectors and descriptors implemented on mobile devices in the context of visual navigation. With this paper we try to bridge this gap between computer vision research in robotics and the area of mobile computing. We present an evaluation of classic and recent detectors, descriptors, and their pairs implemented on the popular Android platform on three devices: two smartphones and a tablet. The outcome of this research makes it possible to select the detector-descriptor pair that is most suitable for navigation-related applications on a mobile device.

II. RELATED WORK

The visual navigation problem becomes much more complicated if a handheld or portable device is used for indoor self-localization. It is possible to determine the coarse location of the user from images received from a mobile phone worn by this person [14]. However, the precision of appearance-based self-localization is insufficient for navigation and user guidance. Therefore, recent approaches to indoor self-localization using sensors typically found on smartphones and tablets employ methods and algorithms known from mobile robotics. The system described in [17] uses a smartphone's camera to detect simple artificial landmarks, while the approach of [13] works in an unmodified environment, using SURF point feature detectors/descriptors for camera-based tracking of a person wearing the device.

The literature on point feature detectors/descriptors is rich in papers concerning performance evaluation and comparison of various algorithms [11], [23] and implementations [3]. Although few researchers evaluate point detectors/descriptors with respect to their applicability in visual navigation, the comparative study presented in [18] was the main source of inspiration for the evaluation methodology in our work.

The rapid proliferation of mobile devices has brought with it a demand for efficient feature detectors and descriptors on the new hardware and software platforms. However, the literature on the performance of feature detectors/descriptors on mobile devices is still scarce. An example directly related to our research is the work of Wagner et al. [24], which evaluates two classic approaches to point feature description in the context of tracking on mobile devices. However, this comparison was performed on older hardware/software platforms. For the Android platform, the recent work [25] introduces a new binary descriptor and compares its performance to several known solutions on a data set from a mobile phone. In [12] the speed of several feature detectors and descriptors implemented in native code and in Java is compared.

III. ALGORITHMS AND IMPLEMENTATION

A. Detectors and Descriptors

We evaluate seven detectors used to extract distinctive point-like features from images. Due to the tight page limit we omit a full characterization of these algorithms and only point out the differences. Two classic corner detectors, Harris [8] and Shi-Tomasi (also known as GFTT, from the abbreviation of the title of the paper [20]), differ only in the measure of "cornerness" and are not associated with any particular descriptor. In contrast, the SIFT, SURF and ORB point feature detection algorithms are closely related to the respective descriptors. SIFT (Scale Invariant Feature Transform) [10] and SURF (Speeded Up Robust Features) [4] are both based upon the concept of scale-space, which makes them theoretically the most robust with respect to changes in scale, rotation, and viewpoint. They are, however, also the slowest. The Oriented FAST and Rotated BRIEF (ORB) algorithm [16] is an attempt to build a robust detector/descriptor by combining the efficient FAST (Features from Accelerated Segment Test) point feature detector [15] with the recent binary descriptor BRIEF. This detector estimates the orientation of the feature, which makes it possible to construct a descriptor robust to in-plane rotation. The relatively recent STAR is a version of the algorithm presented in [1], a multiscale point feature detector with full spatial resolution. The detectors used with descriptors may be divided into two categories: corner-like (FAST, ORB) and blob-like (SIFT, SURF, STAR).

We evaluate the performance of six point feature descriptors. SIFT, SURF, and ORB have their own detectors, while the remaining three are recently introduced low-complexity binary descriptors optimized for speed, which may be used with any of the detectors. BRIEF (Binary Robust Independent Elementary Features) [5] is a fast algorithm that describes a feature with a binary string. However, BRIEF is not invariant to in-plane rotation.
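To illustrate what a binary string descriptor of this kind looks like, the following minimal sketch builds a BRIEF-like descriptor from pairwise intensity comparisons and compares two descriptors with the Hamming distance. It is not the OpenCV implementation evaluated in this paper: the function names, the uniformly random test-pair layout, the patch size, and the 64-bit length are assumptions made only for this example.

```python
import random


def make_test_pairs(n_bits, patch_size, seed=0):
    """Draw n_bits random pixel pairs inside a patch_size x patch_size patch.

    The uniform sampling used here is an assumption of this sketch; the
    original BRIEF paper discusses several sampling strategies.
    """
    rng = random.Random(seed)
    return [
        ((rng.randrange(patch_size), rng.randrange(patch_size)),
         (rng.randrange(patch_size), rng.randrange(patch_size)))
        for _ in range(n_bits)
    ]


def binary_descriptor(patch, pairs):
    """Pack one intensity comparison per pixel pair into the bits of an int."""
    bits = 0
    for k, ((r1, c1), (r2, c2)) in enumerate(pairs):
        if patch[r1][c1] < patch[r2][c2]:
            bits |= 1 << k
    return bits


def hamming(a, b):
    """Number of differing bits between two binary descriptors."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    pairs = make_test_pairs(n_bits=64, patch_size=8)
    patch = [[7 * r + 13 * c for c in range(8)] for r in range(8)]
    # A uniform brightness offset leaves every comparison unchanged.
    brighter = [[v + 10 for v in row] for row in patch]
    d1 = binary_descriptor(patch, pairs)
    d2 = binary_descriptor(brighter, pairs)
    print("Hamming distance after brightness offset:", hamming(d1, d2))
```

Because each bit depends only on the ordering of two intensities at fixed patch positions, a uniform brightness change does not alter the descriptor, whereas rotating the patch moves different pixels under the same test pairs. This is one way to see why plain BRIEF is sensitive to in-plane rotation.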
The BRISK (Binary Robust Invariant Scalable Keypoints) [9] and FREAK (Fast Retina Keypoint) [2] descriptors also belong to the family of low-complexity binary descriptors, but differ from BRIEF and ORB by having a fixed pixel sampling pattern consisting of concentric circles centered at the point feature. FREAK has a biologically inspired sampling pattern with a higher density of pixels in the center.

We evaluate the speed of all these detectors and descriptors. Then, we run feature matching efficiency tests on four selected pairs: the classic SIFT-SIFT and SURF-SURF, the more recent, low-complexity ORB-ORB, and the FAST-BRIEF combination, which we consider the most promising one due to its high speed and the stability of the implementation on all tested platforms. In our tests all the descriptors are used with their default vector length.

B. Implementation on Android Platform

As our comparison is application-oriented, we tested the detection and description of point features on images with the algorithms implemented in the commonly used OpenCV library (version 2.4.3) for the x64/Linux and Android platforms. The desktop version of OpenCV contains most of the current state-of-the-art image processing algorithms, while its Android counterpart does not provide the whole functionality of the desktop version. The vast majority of Android applications are written in the Java programming language and run on the Dalvik virtual machine. The OpenCV library provides Java wrappers for most of the considered feature detection algorithms, excluding the SIFT and SURF detectors/descriptors, which are patented in the US and thus cannot be used in commercial applications.

We decided to investigate the speed of the detectors and descriptors in both Java and native C++ code. While writing the whole Android application in native code is unnecessarily complicated, one can combine both programming languages in one application, joining the simplicity of Java with the performance of C++. A further advantage of the native code approach is the possibility to re-use code written in C++ for x86/x64 libraries that are in widespread use. This possibility was exploited in our research by re-using a part of the x86/x64 platform OpenCV library containing the SURF and SIFT detectors and descriptors. The required functions were easily compiled to a shared object library and used in the experiments.

In the presented comparison the possibility of parallel computing on multiple cores of the mobile device's CPU was not used, because the application context of navigation with a mobile device assumes that the remaining cores may be busy with tasks related to other sensors (e.g. IMU) or the navigation algorithm itself (e.g. SLAM). Also, computations on the general purpose GPU (Graphics Processing Unit), which are supported by the OpenCV version for the x86/x64 platform and can greatly reduce the computation time, are not used in this research, as the equivalent libraries and hardware for the Android platform (e.g. Nvidia Tegra 5) are still under development.

IV. EXPERIMENTAL EVALUATION

A. Experiments and Data Sets

For a visual navigation system, either SLAM or visual odometry, the most important properties of the examined detectors/descriptors are their speed, reliability, and invariance to varying scale and viewpoint. For navigation using mobile computing, the computational burden of the investigated algorithms is significantly more important than for typical desktop-based applications, due to the limited computing power of mobile devices.

The performance comparison of the selected detectors and descriptors has been performed off-line, using two data sets containing images acquired in a scenario of indoor visual navigation. A part of the experiments, mostly related to investigating the speed of detectors and descriptors, was accomplished using the data set recorded at the Freiburg University [22]. The data set contains RGB and depth images recorded using the Kinect sensor, moved around by a person or mounted on a mobile robot, which is controlled to move inside a building. This data set was chosen because the RGB camera in Kinect is a close equivalent to the cameras commonly found in mobile devices, with respect to both the image quality and the frame rate. Moreover, the sequence in which the sensor is moved freely by a person can imitate the motion of a mobile device held by the user. The Freiburg data set contains precise ground truth, obtained from a professional motion capture system. These data were essential to estimate the accuracy and robustness of the point feature matching.

While the Freiburg data set is commonly used in the robotics community, for our purposes it has an obvious drawback: it has not been captured using a mobile device. In order to perform at least a part of the comparison on images acquired from the camera of an actual mobile device, we prepared our own data set at the robotic vision lab of the Poznań University of Technology (PUT). The experiments involved the small WiFibot mobile robot and the Samsung Galaxy Note 3 smartphone (Fig. 1a). The collected image sequences display scaling and affine transformation, but little in-plane rotation (Fig. 1b), as we assume that the user holds the device in an upright pose while navigating. Because the tested methods can cope with changing illumination conditions [3], we collected all sequences under natural lighting.

To gather the ground truth for the moving smartphone, the device was mounted on an aluminium mast attached to the WiFibot's frame. This setup allowed us to have the smartphone at an elevation of about 1.5 m above the ground, which is a good approximation of the position of a mobile device held by an adult person. This ensures that our data set contains images that are compatible with our application scenario with regard to both the quality and the viewpoint. To obtain trustworthy ground truth, the smartphone was synchronized during the experiment with a system of ceiling-mounted cameras, which observed the robot and captured images at the same moment. The images from the ceiling-mounted cameras were used to precisely localize the robot, which was equipped with a large marker pattern (cf. Fig. 1). The spatial transformation between the coordinate system of the robot and the coordinate system of the smartphone was estimated by calibration, which allowed us to calculate the ground truth motion of the mobile device [19].

Fig. 1. The WiFibot mobile robot with a mast holding the smartphone and the landmark for collecting ground truth poses (a), example view of the lab from the smartphone's camera (b)

B. Evaluation Methodology

The experiments started with the comparison of the execution speed of the detection and description methods on the selected hardware platforms. The implementations used come from the same OpenCV library, which should allow for a fair comparison as to the computation speed. Several hardware platforms were used for the tests: a laptop PC with an Intel i5 Core2Duo processor as a reference desktop machine, and three Android-based devices: the Asus Nexus 7 tablet and two smartphones, the Samsung Galaxy S Advance and the more recent Samsung Galaxy S2.

Besides the execution speed, a key feature of a detector-descriptor pair used in visual navigation is the point feature matching efficiency. This parameter is defined with respect to the number of point features successfully matched between two different images of the same scene. In line with the application context of our comparison, we matched pairs of consecutive images obtained along the trajectory of a moving sensor (either the smartphone or the Kinect). To measure the performance of each detector-descriptor pair, the following procedure was performed:

1) The point features were detected using the chosen detector on both images.

2) The detected point features were described by the chosen descriptor.

3) The coordinates of the features were normalized and undistorted using the camera intrinsic calibration parameters.

4) The ground truth information was used to calculate the essential matrix E describing the epipolar geometry of the images. The known relative translation t = (t_x, t_y, t_z) and rotation R between the poses of the considered images are used to calculate this matrix:

$$E = R \begin{bmatrix} 0 & -t_z & t_y \\ t_z & 0 & -t_x \\ -t_y & t_x & 0 \end{bmatrix} \qquad (1)$$

5) The point features from both images were matched by minimizing the appropriate distances between their respective descriptors.

6) Describing the i-th point coordinates on the first image as (u_i, v_i) and the j-th point coordinates on the second image as (U_j, V_j), the symmetrical reprojection error was computed:

$$\mathrm{error} = \max\{\, d(e_j, (u_i, v_i)),\; d(e_i, (U_j, V_j)) \,\} \qquad (2)$$

where d(x, y) stands for the Euclidean distance of a point y to the line x, and e_i, e_j are epipolar lines calculated using the essential matrix:

$$[e_{ix}, e_{iy}, e_{iz}]^T = E\,[u_i, v_i, 1]^T, \qquad [e_{jx}, e_{jy}, e_{jz}] = [U_j, V_j, 1]\,E \qquad (3)$$

7) If the error was smaller than a fixed threshold, the match was considered an inlier. The threshold was chosen as the value in normalized coordinates that corresponds to a 2 pixel error in the actual image. This particular value was motivated by our experience in feature matching for visual navigation: any larger error (e.g. 3 pixels) usually renders the matching too imprecise for calculation of the transformation between frames.

8) The ratio of inlier matches to the total number of matches was used as a measure of the strength of the detector-descriptor pair.
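The geometric part of this procedure (steps 4 and 6 to 8) can be sketched as follows. This is a minimal illustration written directly from Eqs. (1) to (3), not the code used in the experiments. The function names are hypothetical, and the conversion of the 2 pixel threshold by dividing by a focal length in pixels is an assumption of this sketch (the value 525.0 is only a placeholder).

```python
import math


def skew(t):
    """Skew-symmetric matrix of the translation vector t, as in Eq. (1)."""
    tx, ty, tz = t
    return [[0.0, -tz, ty],
            [tz, 0.0, -tx],
            [-ty, tx, 0.0]]


def matmul3(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def essential_matrix(R, t):
    """E = R [t]_x, following Eq. (1)."""
    return matmul3(R, skew(t))


def point_line_distance(line, point):
    """Euclidean distance of a 2D point to the line a*x + b*y + c = 0."""
    a, b, c = line
    x, y = point
    return abs(a * x + b * y + c) / math.hypot(a, b)


def symmetric_epipolar_error(E, p1, p2):
    """Eqs. (2)-(3): p1 = (u_i, v_i) in image 1, p2 = (U_j, V_j) in image 2."""
    h1 = (p1[0], p1[1], 1.0)
    h2 = (p2[0], p2[1], 1.0)
    e_i = [sum(E[r][c] * h1[c] for c in range(3)) for r in range(3)]  # E h1
    e_j = [sum(h2[r] * E[r][c] for r in range(3)) for c in range(3)]  # h2^T E
    return max(point_line_distance(e_j, p1), point_line_distance(e_i, p2))


def normalized_threshold(pixels, focal_length_px):
    """Assumed conversion of a pixel threshold to normalized coordinates."""
    return pixels / focal_length_px


def inlier_ratio(E, matches, threshold):
    """Fraction of matches (p1, p2) whose error is below the threshold."""
    if not matches:
        return 0.0
    inliers = sum(1 for p1, p2 in matches
                  if symmetric_epipolar_error(E, p1, p2) < threshold)
    return inliers / len(matches)


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    E = essential_matrix(identity, (1.0, 0.0, 0.0))
    matches = [((0.25, 0.1), (-0.25, 0.1)),   # consistent with the geometry
               ((0.25, 0.1), (-0.25, 0.3))]   # deliberately wrong match
    thr = normalized_threshold(2.0, 525.0)    # placeholder focal length
    print("inlier ratio:", inlier_ratio(E, matches, thr))
```

With E = R [t]_x, the epipolar constraint holds for a pose convention in which a point X_1 in the first camera frame maps to X_2 = R (X_1 - t) in the second. The toy example uses R = I and t = (1, 0, 0), so a correct match keeps the same normalized vertical coordinate in both images.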

V. RESULTS

The comparison of the execution time was performed on the whole freiburg1/desk data set, which consists of 613 images. The feature detection and description time was averaged over the whole sequence. From the speed comparison of the feature detectors shown in Tab. I it is evident that the simpler, corner-like detectors (Harris, GFTT, and FAST) have much shorter execution times than the others. The fastest one is by far the FAST detector, which however seems to be the least discriminative one, producing the largest number of keypoints. The execution times on mobile devices are substantially longer than on the PC, with the Nexus 7 being the fastest mobile computer in the test. The Nexus 7 is approximately 10 times slower than the PC laptop. This means that while the SIFT and SURF detectors can be computed in a reasonable time on a PC, they take approximately 1.7 s and 0.5 s, respectively, on a mobile device. A somewhat surprising result is the very small gain from the use of native code in Android: for some device and algorithm combinations the Java version was even faster. However, the differences are so small that either one can be used.

TABLE I. DETECTOR EXECUTION TIMES IN MILLISECONDS ON VARIOUS DEVICES FOR freiburg1/desk

| detector algorithm | average number of keypoints | Nexus 7 Java | Nexus 7 native | Galaxy S2 Java | Galaxy S2 native | Galaxy Advance Java | Galaxy Advance native | PC Linux |
|---|---|---|---|---|---|---|---|---|
| Harris | 361 | 89.88 | 89.81 | 158.48 | 155.45 | 190.61 | 200.42 | 8.06 |
| GFTT | 993 | 94.12 | 93.90 | 145.63 | 146.70 | 199.22 | 218.04 | 12.67 |
| SIFT | 1105 | 1723.36 | 1725.23 | 2091.06 | 2050.66 | 3099.98 | 2779.41 | 110.32 |
| SURF | 1526 | 477.05 | 550.86 | 969.36 | 967.17 | 1310.02 | 1300.72 | 51.02 |
| FAST | 2570 | 12.14 | 12.07 | 37.36 | 35.34 | 67.67 | 62.90 | 1.80 |
| STAR | 274 | 165.15 | 165.58 | 217.07 | 214.95 | 468.23 | 354.76 | 11.38 |
| ORB | 498 | 47.49 | 49.33 | 95.88 | 98.12 | 109.61 | 113.87 | 6.04 |

The descriptor is an algorithm that describes the neighborhood of each detected keypoint. Thus, the time needed to describe the content of an image depends on the number of detected keypoints, while the time the algorithm spends on the description of a single keypoint is largely independent of the detector being used. However, it should be noted that some detector-descriptor pairs, namely SIFT, SURF, and ORB, have been designed to work together, and there is some exchange of information between the detector and the descriptor. This is the reason why the execution time of the descriptors should be determined and compared per keypoint. The results of the description speed tests are presented in Tab. II. The most complicated descriptor, SIFT, is by far the slowest one, while SURF is 2 to 3 times faster, depending on the device. The recent low-complexity descriptors are much faster on all platforms, but it is worth noting that ORB, being nominally a combination of the concepts of the FAST detector and the BRIEF descriptor, is much slower than the actual FAST-BRIEF pair. The missing results for the BRISK descriptor are caused by a memory leak in the OpenCV BRISK implementation. While for Java the garbage collector helps with memory management, for the native code further calculations (on a series of images) resulted in a growing memory leak, which eventually led to the application being killed on the Nexus 7 and Galaxy S2, or to a system shutdown on the Galaxy S Advance.

TABLE II. DESCRIPTOR EXECUTION TIMES IN MILLISECONDS PER KEYPOINT ON VARIOUS DEVICES FOR freiburg1/desk

| detector-descriptor pair | Nexus 7 Java | Nexus 7 native | Galaxy S2 Java | Galaxy S2 native | Galaxy Advance Java | Galaxy Advance native | PC Linux |
|---|---|---|---|---|---|---|---|
| SIFT-SIFT | 2.032 | 2.453 | 3.179 | 3.017 | 3.916 | 3.936 | 0.00025 |
| SURF-SURF | 0.697 | 0.702 | 1.497 | 1.441 | 1.856 | 1.869 | 0.00018 |
| ORB-ORB | 0.198 | 0.200 | 0.238 | 0.249 | 0.307 | 0.333 | 1.9×10^-5 |
| FAST-BRIEF | 0.027 | 0.026 | 0.037 | 0.030 | 0.045 | 0.049 | 5.0×10^-5 |
| STAR-BRIEF | 0.061 | 0.056 | 0.115 | 0.088 | 0.126 | 0.141 | 9.6×10^-6 |
| FAST-BRISK | 0.029 | n/a | 0.038 | n/a | 0.048 | n/a | 0.0009 |
| FAST-FREAK | 0.036 | 0.151 | 0.044 | 0.228 | 0.058 | 0.258 | 9.3×10^-5 |
| STAR-FREAK | 0.133 | 1.606 | 0.125 | 2.246 | 0.276 | 2.727 | 9.8×10^-6 |

TABLE III. MATCHING RATIOS VS. THE NUMBER OF KEYPOINTS FOR THE SELECTED DETECTOR-DESCRIPTOR PAIRS IN freiburg1/desk

| number of keypoints | FAST-BRIEF | ORB | SURF-SURF | SIFT-SIFT |
|---|---|---|---|---|
| 100 | 83.73% | 84.54% | 85.63% | 73.88% |
| 200 | 85.23% | 85.12% | 84.54% | 73.03% |
| 300 | 86.04% | 85.18% | 83.43% | 71.42% |
| 500 | 86.19% | 86.06% | 81.34% | 69.11% |
| 1000 | 85.91% | 85.85% | 76.90% | 66.75% |

TABLE IV. MATCHING RATIO FOR SELECTED DETECTOR-DESCRIPTOR PAIRS WITH 500 KEYPOINTS FOR FULL FRAME RATE SEQUENCES

| data set | FAST-BRIEF | ORB | SURF-SURF | SIFT-SIFT |
|---|---|---|---|---|
| freiburg1/desk | 80.04% | 80.35% | 77.92% | 66.75% |
| freiburg1/room | 86.33% | 88.14% | 82.78% | 73.10% |
| put/phone1 | 78.24% | 78.98% | 74.55% | 64.60% |
| put/phone2 | 77.25% | 77.50% | 73.59% | 63.39% |
| put/phone3 | 81.49% | 78.29% | 75.77% | 65.76% |
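The timing results of Tab. I and Tab. II can be combined into a rough per-frame processing budget: the detection time plus the per-keypoint description time multiplied by the number of keypoints. The sketch below does this for the Nexus 7 native timings and 500 keypoints. It is only an estimate under stated assumptions: the detection times in Tab. I were measured for the average keypoint counts listed there rather than for exactly 500 keypoints, and matching, image acquisition and all other processing are ignored, so the resulting rates are optimistic upper bounds. The function and variable names are hypothetical.

```python
def frame_time_ms(detect_ms, describe_ms_per_keypoint, n_keypoints):
    """Detection time plus per-keypoint description time, in milliseconds."""
    return detect_ms + describe_ms_per_keypoint * n_keypoints


def max_rate_hz(frame_ms):
    """Upper bound on the processing rate for a given per-frame time."""
    return 1000.0 / frame_ms


# (detector time [ms] from Tab. I, descriptor time [ms/keypoint] from Tab. II),
# both taken from the "Nexus 7 native" columns.
NEXUS7_NATIVE = {
    "FAST-BRIEF": (12.07, 0.026),
    "ORB-ORB": (49.33, 0.200),
    "SURF-SURF": (550.86, 0.702),
    "SIFT-SIFT": (1725.23, 2.453),
}

if __name__ == "__main__":
    for name, (detect_ms, describe_ms) in NEXUS7_NATIVE.items():
        total = frame_time_ms(detect_ms, describe_ms, n_keypoints=500)
        print(f"{name}: {total:.1f} ms per frame, "
              f"at most {max_rate_hz(total):.2f} Hz")
```

Under these assumptions the estimated budgets are about 25 ms for FAST-BRIEF, about 150 ms for ORB-ORB, about 0.9 s for SURF-SURF and about 3 s for SIFT-SIFT. The spread of roughly two orders of magnitude between the fastest and the slowest pair is the main point of the estimate.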

A useful detector-descriptor pair has to robustly match the features in two images in order to calculate the transformation. While robustness (or invariance) to factors such as changing scale, viewpoint, or illumination is a property of the descriptor [11], the quality of the obtained transformation depends not only on the matching error (2), but also on the number of keypoints. Thus, an experiment was conducted to investigate the relation between the successful matching ratio and the number of detected point features. The 100, 200, 300, 500 and 1000 strongest keypoints were used on the same sequence to determine the effect of the number of keypoints on the matching ratio. The results are presented in Tab. III. For SURF and SIFT the matching ratio drops as the number of keypoints increases, because the additional accepted keypoints are less descriptive than the already detected ones. This effect is not visible for FAST-BRIEF and ORB, whose matching ratios are stable regardless of the number of keypoints. We can conclude that there is no significant gain in matching 1000 features when compared to a smaller number of keypoints. Using only 100 keypoints can be dangerous, as a low matching ratio might then leave too few matches to calculate the transformation between frames. It can be safely assumed that a typical visual odometry method needs to detect 300–500 keypoints to successfully match and compute the transformation between consecutive frames [7]. The successful matching ratios for all five data sets used in our experiments and the four selected, most characteristic detector-descriptor pairs are shown in Tab. IV, using the 500 strongest keypoints in each case.

While matching images with a high percentage of shared view is relatively easy, the experiment on freiburg1/room was performed to determine the effect of a decreased processing frame rate on the matching ratio. As can be seen from Tab. V, all examined methods perform worse as the time difference between the processed images increases. However, when visual odometry methods are considered, one needs to find a minimum of 5 or 8 correct matches to calculate the relative transformation between frames with the n-point algorithm [7]. This observation implies that when comparing an 80% to a 20% matching ratio, the difficulty of finding correct matches increases not 4 times, but at least 4 to the power of 5, which means that it is crucial to perform the computation at a high frame rate. The performance of the same four detector-descriptor pairs that are considered in Tab. V is shown in Fig. 2 along the trajectory of the freiburg1/room data set at full frame rate. From this figure one can conclude that the SIFT and SURF algorithms score a slightly lower matching ratio than ORB and FAST-BRIEF; however, their performance never drops to very low values, which occasionally happens to FAST-BRIEF.

TABLE V. SUCCESSFUL MATCHING RATIO VS. THE FRAME RATE FOR freiburg1/room

| frame rate | FAST-BRIEF | ORB-ORB | SURF-SURF | SIFT-SIFT |
|---|---|---|---|---|
| 1 (30 Hz) | 87.03% | 89.11% | 84.27% | 72.96% |
| 2 (15 Hz) | 68.12% | 68.75% | 64.38% | 54.08% |
| 3 (10 Hz) | 54.58% | 54.38% | 50.55% | 41.86% |
| 4 (7.5 Hz) | 43.37% | 42.28% | 39.61% | 32.76% |
| 5 (6 Hz) | 35.74% | 34.39% | 32.12% | 26.30% |
| 6 (5 Hz) | 29.39% | 27.64% | 26.09% | 21.06% |
| 7 (4.3 Hz) | 25.53% | 23.54% | 22.90% | 18.61% |
| 8 (3.75 Hz) | 21.33% | 19.92% | 19.47% | 15.80% |
| 9 (3.33 Hz) | 18.90% | 16.84% | 16.84% | 13.57% |
| 10 (3 Hz) | 17.42% | 15.12% | 15.21% | 12.63% |

Fig. 2. Successful matching ratio at full frame rate for the four considered solutions along the freiburg1/room sequence [plot: matching ratio [%] vs. frame number; curves: FAST/BRIEF, ORB, SURF, SIFT]

As we have shown that the frame rate achieved by the detector-descriptor pair is an important factor for comparison in the context of visual navigation, we show results demonstrating the impact of this frame rate on point feature matching performance on mobile devices. Native code timing on the very popular Nexus 7 tablet was used as the reference computing time for mobile devices. To satisfy the requirement of real-time work, the maximum frequency of the detector and descriptor computations for 500 features was calculated. It turned out that in a real-time application on the mobile device the FAST-BRIEF detector-descriptor pair can be computed at a rate of at least 15 Hz, ORB-ORB at 5 Hz, and SURF-SURF at only 1 Hz, while SIFT was the slowest one, reaching 0.3 Hz. These frame rates are realistic taking into account the limited resources of a mobile device. Computation of the FAST-BRIEF variant at 5 Hz on the Nexus 7 took 33% of the CPU time and 331 mWh of energy, while at 10 Hz these figures were 67% and 346 mWh, respectively, as measured using the PowerTutor program. With this constraint for real-time work, the matching ratio experiment for 500 features was repeated on freiburg1/room; the results are presented in Fig. 3. The low computation frequency of SURF and SIFT in most cases rendered these algorithms practically unusable, as they had to match images with very small or non-existent overlap. Thus, any faster translation or rotation of the sensor made it impossible to find the transformation.

Fig. 3. Successful matching ratio at realistic frame rates for the four considered solutions along the freiburg1/room sequence [plot: matching ratio [%] vs. frame number; curves: FAST/BRIEF, ORB, SURF, SIFT]
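The "4 to the power of 5" remark made above for the 80% versus 20% matching ratios can be checked with a short calculation. The sketch below assumes a RANSAC-style estimator that repeatedly draws s random matches and needs one all-inlier sample. This robust scheme and the 0.99 success probability are assumptions of the example and are not stated in this paper. The function names are hypothetical.

```python
import math


def all_inlier_probability(inlier_ratio, sample_size):
    """Probability that a random minimal sample contains only inliers."""
    return inlier_ratio ** sample_size


def ransac_iterations(inlier_ratio, sample_size, success_prob=0.99):
    """Standard RANSAC estimate of the iterations needed to draw at least
    one all-inlier sample with probability success_prob."""
    p_good = all_inlier_probability(inlier_ratio, sample_size)
    return math.ceil(math.log(1.0 - success_prob) / math.log(1.0 - p_good))


if __name__ == "__main__":
    for s in (5, 8):  # sample sizes of the 5-point and 8-point algorithms
        ratio = all_inlier_probability(0.8, s) / all_inlier_probability(0.2, s)
        print(f"s={s}: an all-inlier sample is {ratio:.0f} times less likely "
              f"at 20% than at 80%")
        print("  iterations at 80%:", ransac_iterations(0.8, s))
        print("  iterations at 20%:", ransac_iterations(0.2, s))
```

For the 5-point case the ratio of the two probabilities is (0.8 / 0.2)^5 = 4^5 = 1024, which is the factor quoted in the text. For the 8-point case it grows to 4^8.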

The same experimental concept was also applied to the data sets obtained from the smartphone. Figure 4 depicts the successful matching ratio for the put/phone3 sequence. The performance of the four compared detector-descriptor pairs is very similar to the results achieved with the constrained, realistic frame rates on the Freiburg data set, which also indirectly validates the use of Kinect images in our experiments. Interesting results were obtained for the put/phone1 sequence (Fig. 5), which involved quick changes of the WiFibot's orientation, resulting in jerky motion and induced oscillations of the mast with the smartphone. Under such conditions the successful matching ratio of all detector-descriptor pairs dropped to almost zero for some parts of the trajectory. This was caused by significant motion blur in some frames, as shown by the inset image (A). For images of normal quality (inset B) the FAST-BRIEF pair, even on a mobile device, was able to achieve a high number of correct matches, enough for visual navigation.

Fig. 4. Successful matching ratio at realistic frame rates for the four considered solutions along the put/phone3 sequence obtained using a smartphone [plot: matching ratio [%] vs. frame number; curves: FAST/BRIEF, ORB, SURF, SIFT]

Fig. 5. Successful matching ratio at realistic frame rates for the four considered solutions along the put/phone1 sequence with example matchings [plot: matching ratio [%] vs. frame number; curves: FAST/BRIEF, ORB, SURF, SIFT; insets A and B]

VI. CONCLUSIONS

The presented experimental results confirm that detection and description of point features on a mobile device is possible in real time, which opens the possibility to use these algorithms in visual navigation systems, such as visual odometry or SLAM. The results demonstrate how important it is to have a very fast detector and descriptor, even if the chosen pair does not work perfectly for larger viewpoint changes between images. We recommend the pair of the FAST detector and the BRIEF descriptor as the best combination when it comes to speed and performance on mobile devices. Moreover, the results suggest that there is no significant speed difference between the Java and native C++ implementations of the algorithms from OpenCV for Android. The chosen feature extraction and description method will be used in our further work, which focuses on implementing and testing a real-time visual odometry system on mobile devices.

ACKNOWLEDGMENT

The authors would like to thank M. Fularz and A. Schmidt for their assistance with the experiments involving the PUT Vision Ground Truth System. This work is financed by the Polish Ministry of Science and Higher Education in the years 2013-2015 under the grant DI2012 004142.

REFERENCES

[1] M. Agrawal, K. Konolige, M. Blas, "CenSurE: Center surround extremas for realtime feature detection and matching", Proc. ECCV'08, Marseille, 2008, 102–115.
[2] A. Alahi, R. Ortiz, P. Vandergheynst, "FREAK: Fast retina keypoint", Proc. IEEE CVPR, Providence, 2012, 510–517.
[3] J. Bauer, N. Sünderhauf, P. Protzel, "Comparing several implementations of two recently published feature detectors", Proc. IAV'07, Toulouse, 2007, 143–148.
[4] H. Bay, A. Ess, T. Tuytelaars, L. Van Gool, "SURF: Speeded up robust features", Comp. Vis. and Image Underst., 110(3), 2008, 346–359.
[5] M. Calonder, V. Lepetit, C. Strecha, P. Fua, "BRIEF: Binary robust independent elementary features", Proc. ECCV'10, Hersonissos, 2010, 778–792.
[6] A. J. Davison, I. Reid, N. Molton, O. Stasse, "MonoSLAM: Real-Time Single Camera SLAM", IEEE Trans. on PAMI, 29(6), 2007, 1052–1067.
[7] F. Fraundorfer, D. Scaramuzza, "Visual odometry: Part II – matching, robustness and applications", IEEE Rob. & Aut. Mag., 19(2), 2012, 78–90.
[8] C. Harris, M. Stephens, "A combined corner and edge detector", Proc. 4th Alvey Vision Conf., 1988, 147–151.
[9] S. Leutenegger, M. Chli, R. Siegwart, "BRISK: Binary robust invariant scalable keypoints", Proc. ICCV'11, Barcelona, 2011, 2548–2555.
[10] D. G. Lowe, "Distinctive image features from scale-invariant keypoints", Int. J. of Comp. Vis., 60(2), 2004, 91–110.
[11] K. Mikolajczyk, C. Schmid, "A performance evaluation of local descriptors", IEEE Trans. on PAMI, 27(10), 2005, 1615–1630.
[12] K. Muzzammil bin Saipullah, A. Anuar, N. A. binti Ismail, Y. Soo, "Real-time video processing using native programming on Android platform", Proc. IEEE 8th Int. Col. on Signal Proc. and its App., 2012, 276–281.
[13] M. Quigley, D. Stavens, A. Coates, S. Thrun, "Sub-meter indoor localization in unmodified environments with inexpensive sensors", Proc. IEEE/RSJ Int. Conf. on IROS, Taipei, 2010, 2039–2046.
[14] N. Ravi, P. Shankar, A. Frankel, A. Elgammal, L. Iftode, "Indoor localization using camera phones", Proc. 7th IEEE Work. on Mobile Comp. Sys. and App., 2006, 1–7.
[15] E. Rosten, T. Drummond, "Machine learning for high-speed corner detection", Proc. ECCV'06, Graz, 2006, 430–443.
[16] E. Rublee, V. Rabaud, K. Konolige, G. R. Bradski, "ORB: An efficient alternative to SIFT or SURF", Proc. ICCV'11, Barcelona, 2011, 2564–2571.
[17] A. Santos, L. Tarrataca, J. Cardoso, "Analysis of navigation algorithms for smartphones using J2ME", in: Mobile Wireless Middleware, Operating Systems, and Applications, Berlin, Springer, 2009, 266–279.
[18] A. Schmidt, M. Kraft, M. Fularz, Z. Domagala, "Comparative assessment of point feature detectors in the context of robot navigation", JAMRIS, 7(1), 2013, 11–20.
[19] A. Schmidt, M. Kraft, M. Fularz, Z. Domagala, "The registration system for the evaluation of indoor visual SLAM and odometry algorithms", JAMRIS, 7(2), 2013, 46–51.
[20] J. Shi, C. Tomasi, "Good features to track", Proc. IEEE CVPR, Seattle, 1994, 593–600.
[21] P. Skrzypczyński, "Simultaneous localization and mapping: A feature-based probabilistic approach", Int. J. of App. Math. and Comp. Sci., 19(4), 2009, 575–588.
[22] J. Sturm, N. Engelhard, F. Endres, W. Burgard, D. Cremers, "A benchmark for the evaluation of RGB-D SLAM systems", Proc. IEEE/RSJ Int. Conf. on IROS, Vilamoura, 2012, 573–580.
[23] T. Tuytelaars, K. Mikolajczyk, "Local invariant feature detectors: a survey", Found. and Trends in Comp. Graph. and Vis., 3(3), 2008, 177–280.
[24] D. Wagner, G. Reitmayr, A. Mulloni, T. Drummond, D. Schmalstieg, "Pose tracking from natural features on mobile phones", Proc. ISMAR'08, Cambridge, 2008, 125–134.
[25] X. Yang, K.T. Cheng, "LDB: An ultra-fast feature for scalable augmented reality on mobile devices", Proc. ISMAR'12, Atlanta, 2012, 49–57.
