Natural, Salient Image Patches for Robot Localization

Friedrich Fraundorfer, Horst Bischof, Sandra Ober
Graz University of Technology
Institute for Computer Graphics and Vision, Austria
{fraunfri,bischof,ober}@icg.tu-graz.ac.at
Abstract

This paper addresses a major problem in mobile robot localization. In most approaches to simultaneous localization and mapping (SLAM), only line and point features are used, which easily leads to ambiguities in loop-closing and global relocalization. In this work we propose to use natural, salient image patches as additional, highly discriminative landmarks to aid in unclear localization scenarios, which may occur in modern, highly symmetric indoor environments. A new method of landmark extraction is presented, as well as a matching method for storing and retrieving the landmarks in a location database.
1 Introduction
Simultaneous localization and mapping (SLAM) is a hot topic in mobile robotics. While there are elaborate solutions for the SLAM problem using sensors like laser range finders (see [6]), SLAM using vision sensors is not yet as mature. One promising approach to SLAM using vision is the work by Bosse [2], where he introduces a shape-from-motion algorithm integrated into a SLAM framework (see [1]). The main features in this work are vanishing points and 3D lines. Localization is mainly done by matching the 3D lines extracted and reconstructed from a scene. The work also addresses loop-closing and global relocalization. Loop-closing must be performed if the robot encounters an already mapped environment, for instance by traveling a loop; this case has to be detected so that the robot does not build two maps of a single location. In the case of global relocalization, the robot must be able to determine its location in a mapped environment from scratch if it loses track or if it is moved while switched off. Both cases are difficult to solve using 3D lines only, which can easily lead to ambiguities in repetitively structured indoor scenes. Such ambiguities could be resolved by incorporating more discriminative features than 3D lines. In fact, by using 3D lines only, most of the information gathered by the vision sensors is discarded.

Therefore this paper addresses the possibility of using natural, salient image patches as additional features to resolve ambiguities which occur when using a sparse 3D structure for localization, especially in loop-closing and global relocalization. We present a method to detect natural, salient image patches for use as landmarks. The landmarks consist of clusters of interest points, and we show how to normalize them in an affine-invariant way by creating a local affine frame. We present a highly reliable method for finding corresponding landmarks within images from different viewpoints, which is necessary for localization. Detected landmarks are stored in a database together with the currently estimated robot position. If the robot loses track, ambiguities in position can be resolved by querying the database with the currently visible landmarks. If corresponding landmarks are found, the robot can perform a coarse localization.

In section 2, we describe the proposed landmarks. Section 3 addresses landmark matching. Experimental results are shown in section 4, and a summary concludes the paper.
2 Landmark detection
The detection of reliably re-recognizable landmarks is a key issue not only in robot localization. It is vital to detect image regions which have a discriminative description and which are stable against viewpoint changes. In this work we investigate the possibility of using clusters of interest points (of various kinds) for landmark detection. This is based on the assumption that image regions with a high concentration of interest points are more descriptive than other parts of the image.
Figure 1. Regions detected by MST clustering.

Figure 2. LAF for corresponding regions. (a) Detected left. (b) Normalized left. (c) Normalized right. (d) Detected right.
2.1 Interest region detection by clustering interest points
In a first step, local interest points are detected on the whole image. Areas of high interest point concentration are then found by a graph-based clustering algorithm. The interest points with coordinates x_i = (x_i1, x_i2) represent the nodes of an undirected weighted graph in 2D. Clustering is performed by a method proposed by Zahn [7], using a minimal spanning tree (MST) and removing inconsistent edges to split the tree into several clusters. The weight of the edge between two graph nodes i, k is their geometric distance

d_ik = sqrt((x_i1 − x_k1)^2 + (x_i2 − x_k2)^2),

which we will also refer to as the edge length. The resulting MST mainly contains edges between geometrically nearby nodes. To obtain several clusters, the MST is split into sub-trees by removing inconsistent edges. Edges with a length significantly longer than that of neighboring edges are deemed inconsistent. We calculate a threshold T as the average edge length of the k-nearest neighbors of a graph node. Every edge connected to one of the k-nearest neighbors with length L > T is removed from the graph. This criterion is local and self-adaptive: it accounts for clusters of different sizes and does not assume cluster shapes in advance. Fig. 1 shows an example of regions detected with this method.
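The clustering step can be sketched as follows. This is a minimal pure-Python variant: the inconsistency test here compares each MST edge against the average length of its adjacent edges via a hypothetical `factor` parameter, rather than the paper's k-nearest-neighbor threshold T.

```python
import math
from collections import defaultdict

def mst_clusters(points, factor=2.0):
    """Cluster 2-D interest points: build a minimal spanning tree
    (Kruskal) over Euclidean edge lengths, then remove 'inconsistent'
    edges, i.e. edges much longer than their neighboring edges.
    `factor` is an illustrative knob, not the paper's adaptive T."""
    n = len(points)
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i in range(n) for j in range(i + 1, n)
    )
    parent = list(range(n))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    mst = []
    for d, i, j in edges:                  # Kruskal: shortest edges first
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            mst.append((d, i, j))
    adj = defaultdict(list)                # edge lengths incident to each node
    for d, i, j in mst:
        adj[i].append(d); adj[j].append(d)
    kept = []
    for d, i, j in mst:                    # drop edges far longer than neighbors
        neigh = [x for x in adj[i] + adj[j] if x != d] or [d]
        if d <= factor * (sum(neigh) / len(neigh)):
            kept.append((i, j))
    parent = list(range(n))                # components of pruned tree = clusters
    for i, j in kept:
        parent[find(i)] = find(j)
    groups = defaultdict(list)
    for i in range(n):
        groups[find(i)].append(i)
    return list(groups.values())
```

Two well-separated groups of points are then split at the long bridging edge of the MST, while edges inside each group survive the pruning.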
2.2 Local affine frame
We aim to create a local affine frame (LAF) for every detected cluster to simplify region matching as discussed in detail in [4]. Once a LAF is constructed the region can be transformed by an affine transformation to a canonical frame which allows direct comparison of all the detected regions in one common frame. In our case we construct the LAF by describing a region by an ellipse which is normalized by transforming it to the
unit circle. Ellipses are fitted to the detected clusters by calculating the eigenvectors and eigenvalues of the covariance of the interest point set. For normalization of the LAF, the affine transformation which maps the ellipse to a unit circle is calculated. The image region is then resampled into a fixed-size window of 60 × 60 pixels using the calculated normalization transformation. Fig. 2 demonstrates the normalization of the LAF on two corresponding regions. The normalization works only for planar image patches. Under the assumption of planarity, the perspective distortion of a viewpoint change can be locally approximated by an affine transformation. For non-planar patches the normalization gives incorrect results, and therefore the subsequent matching method will only find matches for the planar patches.
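The ellipse-to-unit-circle map can be sketched as follows; `whitening_transform` is a hypothetical helper that fits the covariance ellipse to a 2-D point set and returns the affine map (the inverse square root of the covariance) taking that ellipse to the unit circle, using the closed-form eigendecomposition of a symmetric 2 × 2 matrix.

```python
import math

def whitening_transform(points):
    """Affine map taking the covariance ellipse of a 2-D point set to
    the unit circle (the LAF normalization step).  Returns (W, mean)
    with W a 2x2 matrix such that W @ (x - mean) has identity
    covariance."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # covariance entries: [[a, b], [b, d]]
    a = sum((p[0] - mx) ** 2 for p in points) / n
    b = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    d = sum((p[1] - my) ** 2 for p in points) / n
    # closed-form eigendecomposition of the symmetric 2x2 covariance
    t = (a + d) / 2.0
    s = math.sqrt(((a - d) / 2.0) ** 2 + b * b)
    l1, l2 = t + s, t - s                       # eigenvalues
    if abs(b) > 1e-12:
        v1 = (b, l1 - a)
    else:
        v1 = (1.0, 0.0) if a >= d else (0.0, 1.0)
    norm = math.hypot(*v1)
    v1 = (v1[0] / norm, v1[1] / norm)
    v2 = (-v1[1], v1[0])                        # orthogonal eigenvector
    # W = V diag(1/sqrt(l)) V^T, i.e. the inverse square root of the covariance
    i1, i2 = 1.0 / math.sqrt(l1), 1.0 / math.sqrt(l2)
    w11 = i1 * v1[0] * v1[0] + i2 * v2[0] * v2[0]
    w12 = i1 * v1[0] * v1[1] + i2 * v2[0] * v2[1]
    w22 = i1 * v1[1] * v1[1] + i2 * v2[1] * v2[1]
    return ((w11, w12), (w12, w22)), (mx, my)
```

Resampling the 60 × 60 window with this transformation (plus interpolation) is omitted here; the sketch only covers the geometric normalization.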
2.3 Interest point selection
Although the proposed method works for all kinds of interest points, the quality of the results depends on the interest point detection method used. The interest point evaluation of Schmid et al. [5] provides a good basis for this decision. In this work we detect interest points by making use of the structure tensor. For every image pixel we evaluate the structure tensor and calculate a Harris corner strength measure (cornerness) as given by Harris and Stephens [3]. We then calculate, for every pixel, the standard deviation of the logarithmic cornerness value (cornerness values ≤ 1 are set to zero) within a 3 × 3 mask. We select the N pixels which show the highest values as interest points. Generally speaking, we detect interest points by selecting pixels whose cornerness varies by several orders of magnitude within a small local area. This avoids the critical step of defining a single global threshold on the cornerness value.
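This selection rule can be sketched as follows, assuming the Harris cornerness map has already been computed from the structure tensor (the function name and top-N interface are illustrative):

```python
import math

def select_interest_points(cornerness, n_points):
    """Pick interest points as the pixels whose 3x3 neighborhood shows
    the largest standard deviation of the *logarithmic* cornerness
    (values <= 1 are treated as log-cornerness 0, following the text).
    `cornerness` is a precomputed 2-D map (list of rows)."""
    h, w = len(cornerness), len(cornerness[0])
    logc = [[math.log(v) if v > 1 else 0.0 for v in row] for row in cornerness]
    scored = []
    for y in range(1, h - 1):              # interior pixels only
        for x in range(1, w - 1):
            win = [logc[y + dy][x + dx]
                   for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
            m = sum(win) / 9.0
            std = math.sqrt(sum((v - m) ** 2 for v in win) / 9.0)
            scored.append((std, (x, y)))
    scored.sort(reverse=True)              # highest local variation first
    return [pt for _, pt in scored[:n_points]]
```

Because the score is a local contrast of log values, an isolated strong corner lights up all windows overlapping it, and no global cornerness threshold is needed.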
3 Region matching
Region matching is done in a two-step approach. In the first step, tentative matches are detected by correlation of the whole LAF. In the second step, point correspondences are established iteratively to verify possible matches.

We use normalized cross-correlation to detect tentative corresponding regions. To speed up the correlation and to alleviate the effects of an inaccurate normalization, the correlation is done on a low-resolution image patch created with a Gaussian image pyramid. Regions with a correlation value higher than some threshold T_A are considered tentative matches. Note that the selection of this threshold is not critical; the main purpose of this step is to speed up the matching process by quickly discarding non-matching regions.

The second step verifies the tentative correspondences detected in the previous step. In one patch of a tentatively corresponding pair, a new set of interest points (we use Harris corners) is detected, now in the normalized patch. Within the normalized frame, the coordinates of the corresponding points in the other region are given instantly if the affine transform used for normalization is accurate. This can be verified by calculating a correlation value for a correlation window around every point correspondence. For mismatches the correlation value will be low, so they can be detected. In the case of truly matching regions, we can improve the normalization by searching for the coordinate with the highest correlation value within a search window around the estimated corresponding coordinate. We define

d_i = sqrt((x'_i1 − x_i1)^2 + (x'_i2 − x_i2)^2)

as the distance from the estimated coordinate x_i = (x_i1, x_i2) to the coordinate x'_i = (x'_i1, x'_i2) with the highest correlation value within the search window. The sum D = Σ_i d_i over all N point correspondences (x'_i, x_i) gives a measure of the accuracy of the transformation. Iteratively, the estimated point correspondences are refined by area-based sub-pixel matching.
A homography H_i can be estimated from these updated correspondences and applied to one of the image patches to improve the normalization. The iteration is stopped when D is sufficiently small and will not improve with additional iterations. After N iterations, the normalized image coordinates are given by x_nl^i = A_l x_l^i for the left image and x_nr^i = H_N H_{N−1} ... H_0 A_r x_r^i for the right image, where A_l and A_r are the affine transformations used for normalizing the image regions. Fig. 3 shows detected corresponding regions in images from different viewpoints.
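The first, correlation-based stage of this matching can be sketched as follows; `ncc` and `tentative_matches` are illustrative helpers, and the default threshold value stands in for T_A, which the text notes is not critical.

```python
import math

def ncc(patch_a, patch_b):
    """Normalized cross-correlation between two equally sized patches
    (flattened lists of grey values); close to 1.0 for matching
    regions, low or negative for mismatches."""
    n = len(patch_a)
    ma = sum(patch_a) / n
    mb = sum(patch_b) / n
    num = sum((a - ma) * (b - mb) for a, b in zip(patch_a, patch_b))
    da = math.sqrt(sum((a - ma) ** 2 for a in patch_a))
    db = math.sqrt(sum((b - mb) ** 2 for b in patch_b))
    return num / (da * db) if da and db else 0.0

def tentative_matches(regions_a, regions_b, threshold=0.8):
    """First matching stage: keep region pairs whose (low-resolution)
    NCC exceeds the non-critical threshold T_A."""
    return [(i, j)
            for i, ra in enumerate(regions_a)
            for j, rb in enumerate(regions_b)
            if ncc(ra, rb) > threshold]
```

NCC is invariant to affine changes of intensity, which is why a brightened copy of a patch still correlates perfectly while an unrelated patch is rejected.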
4 Experiments
We demonstrate the capabilities of the proposed landmark extraction and matching method on a localization example. We gathered data from 13 different locations of the institute's hallway using a standard video camera (floorplan: see Fig. 4).

Figure 3. Detected corresponding regions between two views.

Fig. 5 shows one image of each location. While the structure of the hallway is highly repetitive, the appearance of every location is quite different due to attached posters. The images show that the illumination conditions vary strongly. Due to large windows, a mixture of natural and artificial lighting produces difficult conditions such as highlights and specularities. From the video we extracted 4255 frames with a resolution of 720 × 288 pixels. From these images we created three training sets with 2, 3 and 5 images per location. All other images are used as test images. From the images in the training set, landmarks are extracted, stored in the database and annotated with the location. In order to obtain stable and reliable landmarks, they have to be found in two consecutive frames.

Coarse localization is done by detecting landmarks in the images of the test set and searching for corresponding regions in the database. The criterion for determining the location is based on the number of corresponding regions and the achieved average cross-correlation value of the interest points within the matched regions.

The first part of the experiment evaluates the influence of the number of training images. For this we selected 6 images per location to build a small test set. We calculated the location recognition rates for training sets with 2, 3 and 5 images per location. The resulting graph is shown in Fig. 6, and Table 1 provides the corresponding figures. The rate for not recognized locations is based on the number of all frames, while the rates for correct and false localization are based on the number of detected matching frames. The number of correct matches is very high and almost stable for all three cases, which demonstrates the viewpoint insensitivity of the proposed method. The number of false positives is very low. The number of not recognized locations decreases with an increasing number of training images. In our experiment, 5 images per location are enough to reduce the number of not recognized frames to a minor fraction.

In the second part of the experiment, localization was performed for all 4255 frames using 5 training images per location. Table 2 shows the rates for correct and false matches.
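The localization criterion described above (number of corresponding regions plus their average cross-correlation) might be sketched like this; since the exact weighting is not specified in the text, the ranking rule here, match count first with mean correlation as tie-breaker, is an assumption, and `best_location` is a hypothetical helper.

```python
def best_location(query_matches):
    """Coarse localization sketch.  `query_matches` maps each candidate
    location to the list of correlation values of its matched regions.
    Rank by match count first, mean correlation second (assumed
    weighting), and return the winning location name."""
    def score(item):
        loc, corrs = item
        mean_corr = sum(corrs) / len(corrs) if corrs else 0.0
        return (len(corrs), mean_corr)
    ranked = sorted(query_matches.items(), key=score, reverse=True)
    return ranked[0][0] if ranked else None
```

A location backed by several consistent region matches thus beats a single accidental high-correlation match, which is the point of combining both cues.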
Figure 4. Floorplan of the institute showing the trained locations.
Figure 5. Images of the different locations (from 0 to 12).
                          T. set 2   T. set 3   T. set 5
Loc. correct recognized     0.97       0.94       0.95
Loc. false recognized       0.03       0.06       0.05
Loc. not recognized         0.26       0.10       0.01

Table 1. Recognition rates (correct, false, not recognized) for the small test set and different training sets.
          Loc00 Loc01 Loc02 Loc03 Loc04 Loc05 Loc06 Loc07 Loc08 Loc09 Loc10 Loc11 Loc12
#frames     335   317   349   376   339   258   270   252   431   494   235   288   311
correct    0.93  0.96  0.96  0.77  0.94  0.93  0.93  0.99  1.00  0.99  0.96  0.82  0.81
false      0.07  0.04  0.04  0.23  0.06  0.07  0.07  0.01  0.00  0.01  0.04  0.18  0.19

Table 2. Location recognition rates (correct, false) for a training set with 5 images per location.
Figure 6. Location recognition rates (correct, false, not recognized) for the small test set and different training sets.

5 Summary

We have presented an approach to using natural, salient image patches as landmarks for mobile robot localization. The landmarks can be integrated in a SLAM framework, which will benefit from the additional information especially in the case of loop-closing and global relocalization. The presented landmark extraction and matching method proved reliable in a real-world indoor environment with uncontrolled illumination. A next step will be to integrate the detected landmarks in the position estimation system.

References
[1] M. Bosse, P. Newman, J. Leonard, and S. Teller. An atlas framework for scalable mapping. In IEEE International Conference on Robotics and Automation, 2003.
[2] M. Bosse, R. Rikoski, J. Leonard, and S. Teller. Vanishing points and 3D lines from omnidirectional video. In ICIP02, pages III: 513–516, 2002.
[3] C. Harris and M. Stephens. A combined corner and edge detector. In Alvey Vision Conference, 1988.
[4] S. Obdrzalek and J. Matas. Object recognition using local affine frames on distinguished regions. In Proc. 13th British Machine Vision Conference, Cardiff, UK, volume 1, pages 113–122, 2002.
[5] C. Schmid, R. Mohr, and C. Bauckhage. Evaluation of interest point detectors. International Journal of Computer Vision, 37(2):151–172, 2000.
[6] H. Surmann, A. Nuechter, and J. Hertzberg. An autonomous mobile robot with a 3d laser range finder for 3d exploration and digitalization of indoor environments. 45:181–198, 2003.
[7] C. T. Zahn. Graph theoretical methods for detecting and describing gestalt clusters. IEEE Transactions on Computers, 20(1):68–86, January 1971.