
Visual Place Recognition for Autonomous Robots

Hemant D. Tagare
Department of Diagnostic Radiology and Department of Electrical Engineering

Drew McDermott and Hong Xiao
Department of Computer Science

Yale University
[email protected]

Abstract: The problem of place recognition is central to robot map learning. A robot needs to be able to recognize when it has returned to a previously visited place, or at least to be able to estimate the likelihood that it has been at a place before. Our approach is to compare images taken at two places, using a stochastic model of changes due to shift, zoom, and occlusion to predict the probability that one of them could be a perturbation of the other. We have performed experiments to gather the value of a χ² statistic applied to image matches from a variety of indoor locations. Image pairs gathered from nearby locations generate low χ² values, and images gathered from different locations generate high values. The rate of false positive and false negative matches is low.

I. Introduction

This paper presents a new visual place recognition algorithm. The algorithm accepts two images taken from two poses, and tests whether the two poses are (probably) close to each other. The success of the algorithm is judged by its statistical misclassification rates (false positive and false negative rates) against ground truth in real-world experiments. These values suggest that an autonomous mobile robot can use the algorithm to recognize whether it is close to a location that it had visited in the past.

Visual place recognition is only a part of the larger problem of map building. We intend to use this algorithm in the context of a map-learning algorithm that represents maps as networks of places with probabilistic transitions between them (Kuipers [10], [11]; Engelson [4]; Simmons and Koenig [18]). A "place" is a small area, no bigger than the robot itself, from which the view in a particular direction is relatively stable. A robot can go from place to place by running action routines. The space between two places need not itself be resolvable into places; it depends on whether there is a reason to stop along the way.

The advantages of this kind of map representation are several: The map is expressed in terms of actions that fit with the robot's effector system. Once the robot knows the map, it also knows the actions that will take it from one place to another. Further, no more places are created than the robot can get to or cares about. A place that can't or needn't be reached reliably won't be represented at all. Finally, there is no need for a theory of the shape of places, only a theory of the shape of the network.

Inferring a map of this kind requires two information sources: (1) Odometry, the approximate displacements and headings of the robot as it goes from place to place; and (2) Similarity, the ability to tell, with reasonable accuracy, that two places are similar as measured by some sensor, such as a sonar or camera. The two sources are complementary: places which look similar but are far apart can be distinguished on the basis of odometry, and places which are close can be distinguished on the basis of image similarity.

Our focus in this paper is on visual sensors for place recognition. Furthermore, we have focused on matching images directly, as opposed to attempting to recover the three-dimensional structure of the world and matching the structures recovered from two different images. We anticipate using metric information about the shape of the network, but not about the shape of places the robot sees.

All image-based techniques can be classified according to how they address three fundamental issues: how they estimate the correspondence between the images, how they estimate the domain of the correspondence, and the relation they assume between image gray levels at the corresponding pixels. The three issues are illustrated in Figure 1, which shows the robot with a pin-hole camera obtaining an image I from pose P and an image I' from a nearby pose P'.¹

[Fig. 1. Correspondence and its domain. The figure shows the poses P and P', the images I and I', a pixel pair whose rays image the same point of the external world, and the domain of the correspondence.]

¹ The poses are drawn as if the camera is oriented in completely different directions at the two points; this is simply because the diagram would be too cluttered if we drew the two poses approximately coinciding in position and orientation, which is the case we are really interested in.


The correspondence between I and I' is a pairing of the pixels of the two images such that a pixel pair images the same point of the external world. A particular pixel pair in the correspondence is illustrated in Figure 1. Since, in general, the camera axes at P and P' may not be aligned, a precise estimate of correspondence has to allow for projective transformation under arbitrary rotation of the camera axis. Further, since a part of the external world which is visible from P may be occluded from P', not every pixel need have a corresponding pixel. The subset of the pixels at P' that have corresponding pixels in P is called the domain of the correspondence. Any estimate of the correspondence requires an estimate of its domain. Finally, if P and P' are close to each other, the corresponding pixels of the two images view the external world from similar directions. This implies that the amount of change in gray levels at corresponding pixels is small. The exact nature of this relation is the third issue in image comparison.

The simplest image-based algorithms use correlation to compare the two images [8], which makes sense only if the correspondence is the identity function. More sophisticated approaches attempt to estimate the correspondence during image comparison. These approaches tend to be expensive, and usually require further simplifying assumptions. [<bibref>]

Although our approach is also image-based and ignores the three-dimensional geometry of the external world, it differs significantly from previous approaches. Instead of calculating the actual three-dimensional geometry at P or at P', we assume that the geometry belongs to a class of geometries (i.e., it is a stochastic process). We use knowledge only of the class of geometries and make no attempt to identify which member of the class occurs in a given situation. The class of geometries is specified as non-informatively as possible (analogous to models for white noise). Finally, we pose the recognition problem in the framework of statistical decision theory.

II. Related Work

The basic idea of treating maps as networks of places is from [10], [11]. A recent program uses this approach to learn office-building layouts very effectively [12]. It has recently been refined by treating the networks as Markov decision processes with probabilistic transitions [18], [17]. [13] investigates a map representation in which nodes are regions (more generally, behaviors), and the edges indicate transitions among them. In [19], nodes are obstacle boundaries from which landmarks are visible. Most workers in this area have used sonar as their main sensor for place recognition. Others have used data from cameras, including [9], [8] and especially [16], [19], [4], [5], [6].

The major competing paradigm for map representation is the use of certainty grids, in which space is resolved into a grid of squares, and sonar data are used to assign a probability of occupancy to each square [14], [3], [2]. Data from stereo vision can be used in a similar way [15].

Matching panoramic views (as we describe below) to achieve place recognition has been pursued before. Hong et al. [7] describe a system for taking a one-dimensional panorama for use in robot localization. It uses a spherical mirror to produce a 360-degree image of a location. One slice through this image remains at the same height as the robot moves around. This slice is used to produce a one-dimensional list of features used as landmarks. Our approach uses a full two-dimensional panorama. In [20] a scheme is described for taking a panorama of what the robot sees to its side, and looking for features to use to retrace the same route. That panorama is obtained as a result of translation and does not model either perspective transformation or camera-axis rotation, as we do, albeit probabilistically.

III. Our Approach

Using the notation developed in the last section, we consider a location L randomly chosen in the neighborhood of P', and ask what the chances are that P is one such L. To answer this question, we need a stochastic model of how images are perturbed by small camera motions.

Before we provide that answer, we point out that it may seem backward to measure the probability of generating the image at P, the old pose, as a perturbation of the image at P', the new one. It should be clear, however, that the two derivations are symmetric. There are two technical reasons for doing it in this direction. One can be explained only after we present our algorithm. The other we explain now.

The largest perturbation to contend with is camera rotation around the y (vertical) axis. Even a slight rotation between the pose P and the pose P' can cause the overlap between the two images to be small. To avoid this problem, we have the robot collect a panorama at a location it may want to match later. This is a "virtual wide-angle" image obtained by gluing several ordinary images together. Now when we collect a new image, I', at a pose P', all we need to do is find the slice of the panorama taken at P that matches best. In what follows, we will treat image I as such a slice. This eliminates the registration problem. Of course, most of the correspondence problem remains, because different parts of the image will have shifted by different amounts, and some (the parts outside the domain Ω) are completely lost.

One consequence of the use of panoramas is that we must use spherical coordinates for our images, so that camera rotations are simple x-coordinate translations. Space does not allow us to go into details in this paper, but in the rest of the paper we will use longitude and latitude (e and l) instead of x and y.
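The slice-selection step is easy to make concrete. Below is a minimal sketch (Python with NumPy; the function name and the uniform-longitude storage convention are our assumptions, not the paper's) of extracting the slice I_φ of a stored panorama corresponding to a y-axis rotation φ. Because the panorama is stored in longitude-latitude coordinates, the rotation is just a horizontal, wrap-around column shift.

```python
import numpy as np

def slice_panorama(panorama, phi, fov, lon_min=-np.pi, lon_max=np.pi):
    """Extract the slice of a 360-degree panorama seen by a camera
    rotated phi radians about the y axis, with horizontal field of
    view `fov` (radians). Columns are assumed uniformly spaced in
    longitude, so a y-axis rotation is a horizontal shift."""
    n_cols = panorama.shape[1]
    cols_per_rad = n_cols / (lon_max - lon_min)
    width = int(round(fov * cols_per_rad))
    center = int(round((phi - lon_min) * cols_per_rad))
    cols = (center - width // 2 + np.arange(width)) % n_cols  # wrap around
    return panorama[:, cols]

# Usage: a synthetic 9-shot panorama and a 40-degree slice at phi = 30 deg.
pano = np.random.rand(120, 9 * 80)
I_phi = slice_panorama(pano, np.deg2rad(30.0), np.deg2rad(40.0))
```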

We now present our stochastic model of image perturbation; see Figure 2.

[Fig. 2. The Model. Upper arm: the image I' passes through the stochastic correspondence (driven by a stochastic external geometry) and additive noise n(·) to give the warped image process; partial occlusion restricts it to the domain Ω, with the extra image E filling the rest, yielding the observed image process at a random location. Lower arm: the images I₁, …, Iₙ pass through rotation and interpolation T(·) with rotation angle φ and are multiplied by the aperture/illumination factor m. The two arms are compared.]

We let L be a random variable that ranges over poses near to P', all of which we assume have a z (depth) axis parallel to that at P'. Every point of the external world that is visible from L is visible from P', except for unpredictable occlusions. We use C for the correspondence from R, the image domain at L, to R', the domain at P'. Hence we derive the following expression for the warped image process at L:

$$\tilde{I}(e, l) = I'(C(e, l)) + n(e, l), \qquad (e, l) \in R.$$

As explained in Section I, occlusions limit the domain Ω of the correspondence; outside of Ω, image pixels are provided by an unknown process E. Hence the observed image process at L can be written as

$$W_\Omega \tilde{I} + (1 - W_\Omega) E,$$

where $W_\Omega$ is the window function

$$W_\Omega(u) = \begin{cases} 1 & \text{if } u \in \Omega, \\ 0 & \text{otherwise}, \end{cases}$$

and E(·) is a stochastic process which describes the unknown image in R − Ω. This expression is summarized by the upper arm of Figure 2. Sampling the image process I on a discrete grid of pixels allows us to write down a joint probability distribution for it:

$$p_{I, \Omega, E}(I, \Omega, E) = p_{I \mid \Omega, E}(I)\, p_\Omega(\Omega)\, p_E(E),$$

where $p_\Omega$ and $p_E$ are prior distributions on Ω and E (and where we have taken advantage of the fact that Ω and E are independently distributed).

Turning to the set of images obtained at P, we let the relative orientation between the coordinate systems at P and L be φ. As explained above, the image I at P is actually chosen by selecting a slice of a panorama built from several images I₁, …, Iₙ obtained at P. Slices correspond to y-axis rotations of angle φ, so we denote them by I_φ. We use a factor m to model the aperture and overall illumination change. All of this is summarized by the lower arm of Figure 2.

Given m and φ, the likelihood that m I_φ comes from the observed image process is

$$\int p_{I \mid \Omega, E}(m I_\phi)\, p_\Omega(\Omega)\, p_E(E)\, d\Omega\, dE.$$

Further, if $p_\phi(\phi)$ and $p_m(m)$ are the prior distributions of the camera orientation and the aperture/illumination factor, then the likelihood $\Lambda(I_1, \ldots, I_n)$ that some interpolated image between I₁, …, Iₙ comes from the observed image process is

$$\Lambda(I_1, \ldots, I_n) = \int p_{I \mid \Omega, E}(m I_\phi)\, p_\Omega(\Omega)\, p_E(E)\, p_m(m)\, p_\phi(\phi)\, d\Omega\, dE\, dm\, d\phi. \qquad (1)$$

To complete the model, we have to specify the prior distributions. We use the simplest plausible priors. For $p_\Omega(\Omega)$, we use an exponential distribution that makes high levels of occlusion less likely. Because we know nothing about E, we assume all E's are equally likely. The aperture factor m has a uniform distribution over the range [0.5, 2]. The rotation prior $p_\phi(\phi)$ is uniform over the range [−π, π].
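To make the generative side of the model concrete, here is an illustrative sketch of sampling from it; this is our simplification, not the authors' code. A single global shift stands in for the stochastic correspondence C, the occluded region is drawn as a vertical band whose extent follows the exponential occlusion prior, and E is uniform noise.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_perturbation(I_prime, shift_max=5, noise_sigma=4.0, occl_rate=10.0):
    """Draw one sample from a simplified version of the generative model:
    warp I' (a single global shift stands in for the stochastic
    correspondence C), add Gaussian noise n, then occlude part of the
    image, filling R - Omega with the unknown process E."""
    h, w = I_prime.shape
    dy, dx = rng.integers(-shift_max, shift_max + 1, size=2)
    warped = np.roll(I_prime, (dy, dx), axis=(0, 1)) \
        + rng.normal(0.0, noise_sigma, (h, w))
    frac = min(rng.exponential(1.0 / occl_rate), 1.0)  # exponential occlusion prior
    W = np.ones((h, w), dtype=bool)                    # window function W_Omega
    W[:, : int(frac * w)] = False                      # occluded band = R - Omega
    E = rng.uniform(0.0, 255.0, (h, w))                # unknown image process E
    return np.where(W, warped, E), W

m = rng.uniform(0.5, 2.0)         # aperture/illumination prior, uniform on [0.5, 2]
phi = rng.uniform(-np.pi, np.pi)  # rotation prior, uniform on [-pi, pi]
I_prime = rng.uniform(0.0, 255.0, (120, 80))
observed, W = sample_perturbation(I_prime)
```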

To proceed in a strictly Bayesian manner, we would evaluate the log-likelihood $\log \Lambda(I_1, \ldots, I_n)$ that some interpolated image at P comes from the observed image process at L; if the log-likelihood is high enough, we can declare that P is in fact L. However, although this alternative is theoretically appealing, it is practically infeasible, because the integral in equation (1) does not have a closed-form expression and is expensive to evaluate numerically. To obtain a more practical decision procedure, we make a key simplification: rather than think about the correspondence function C as a whole, we think about the probability of individual pixel-to-pixel correspondences. In principle, we can eventually recover the distribution of the entire correspondence by considering pairs of pixels, triples, and so forth. But we get good results by considering the simplest possible approximation, namely, the distribution for a single pixel.

We therefore ask the question: for a given pixel (e, l) of the image I, what range of pixels might it correspond to in the image I'? Assuming that the displacement of P' with respect to L is (δx, δz) and that the external point is at a distance r from L, the correspondence function C(e, l) = (e', l') can be written as shown in Equation (2) below. (In this equation f is the focal length of the camera.)

$$e' = f \tan^{-1}\!\left( \frac{\sin\frac{e}{f}\,\sin\frac{l}{f} - \frac{\delta x}{r}}{\cos\frac{e}{f}\,\sin\frac{l}{f} - \frac{\delta z}{r}} \right), \qquad
l' = f \cos^{-1}\!\left( \frac{\cos\frac{l}{f}}{\sqrt{\cos^2\frac{l}{f} + \left(\sin\frac{l}{f}\sin\frac{e}{f} - \frac{\delta x}{r}\right)^2 + \left(\sin\frac{l}{f}\cos\frac{e}{f} - \frac{\delta z}{r}\right)^2}} \right). \qquad (2)$$

Notice that the correspondence function depends only on the relative displacements δx/r and δz/r, and in the limit δx/r, δz/r → 0 the correspondence function tends to the identity function. To get a sense of how the correspondence function departs from identity, it is useful to plot the ranges of (e', l') for a given range of δx/r and δz/r as a function of (e, l). Figure 3 shows this plot. The figure was constructed as follows. First the relative displacements were set to zero, so that the extended correspondence function was the identity; the images of (e, l) at increments of 22.5° were plotted as cross hairs. Next the relative displacements (δx/r, δz/r) were varied over the square [−0.2, 0.2] × [−0.2, 0.2], and the extreme values of the extended correspondence were plotted as curves surrounding the cross hairs. These are the bent quadrilaterals in Figure 3. They show the region within which the extended correspondence varies as the distance r of the imaged point varies from 5.0 · max(δx, δz) to infinity.
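The same construction is easy to reproduce numerically. Below is a direct transcription of Equation (2) plus the displacement sweep used to build Figure 3 (a sketch: arctan2 is our choice for resolving the quadrant of the inverse tangent, and all names are ours).

```python
import numpy as np

def correspondence(e, l, dx_over_r, dz_over_r, f):
    """Equation (2): map pixel (e, l) (longitude/latitude, in the same
    units as the focal length f) at pose L to (e', l') at pose P'."""
    se, ce = np.sin(e / f), np.cos(e / f)
    sl, cl = np.sin(l / f), np.cos(l / f)
    e_p = f * np.arctan2(se * sl - dx_over_r, ce * sl - dz_over_r)
    l_p = f * np.arccos(cl / np.sqrt(cl**2 + (sl * se - dx_over_r)**2
                                           + (sl * ce - dz_over_r)**2))
    return e_p, l_p

# Trace the region P(e, l): sweep the displacements over [-0.2, 0.2]^2
# and record the extreme images of a fixed pixel, as in Figure 3.
f = 1.0
e0, l0 = np.deg2rad(22.5) * f, np.deg2rad(22.5) * f
grid = np.linspace(-0.2, 0.2, 21)
pts = np.array([correspondence(e0, l0, dx, dz, f)
                for dx in grid for dz in grid])
print("e' range:", pts[:, 0].min(), pts[:, 0].max())
print("l' range:", pts[:, 1].min(), pts[:, 1].max())
```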

[Fig. 3. Limits of the correspondence process. (Axes: e/f horizontal and l/f vertical, each running from −45° to +45°.)]

The sides of the quadrilaterals in Figure 3 are images of segments of lines corresponding to δx/r = constant or δz/r = constant. Although these sides appear bent, their curvature is not high, and this suggests that the correspondence function C can be approximated well by its first-order Taylor series, which space does not allow us to display. We will use the notation P(e, l) to refer to the parallelogram of points in I' that might be mapped to (e, l) in I.

To proceed, we have to specify the distribution of the displacement (δx, δz) and of the depth r, and calculate the statistics of the correspondence function via the approximation of equation (2). Since we are not willing to assume anything about the three-dimensional structure of the world, the most conservative approach is to (1) assume that the correspondence process maps (e, l) uniformly from P(e, l), and (2) ignore all correlations, i.e., to assume that the mappings of (e₁, l₁) and (e₂, l₂) are statistically independent when (e₁, l₁) ≠ (e₂, l₂).

Given these independence assumptions, the probability that a pixel (e, l) actually corresponds to some pixel in P(e, l) depends on the distribution of gray levels in P(e, l). Although more complex models are possible, we assume that the gray levels have a Gaussian distribution, and hence that the probability is

$$P_{(e,l)}(I(e, l)) = \frac{1}{\sqrt{2\pi \sigma^2(e, l)}} \exp\!\left( -\frac{(I(e, l) - \mu(e, l))^2}{2\sigma^2(e, l)} \right), \qquad (3)$$

where μ(e, l) is the average gray level over the parallelogram P(e, l), and σ²(e, l) is the variance of the gray levels over P(e, l). The probability of the entire unoccluded image is then simply

$$P_{\tilde{I}}(\tilde{I} \mid \Omega) = \prod_{(e, l) \in \Omega} P_{(e,l)}(I(e, l)),$$

where Ω is a suitably discretized subset of the image.
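Equation (3) needs only the per-pixel mean and variance of I' over P(e, l). A minimal sketch follows; approximating the parallelogram by a fixed square window is our simplification, the paper computes μ and σ² over the true parallelograms.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def mean_variance_images(I_prime, win=7):
    """Mean image mu(e, l) and variance image sigma^2(e, l) of I'.
    The parallelogram P(e, l) is approximated by a win x win window."""
    I = I_prime.astype(float)
    mu = uniform_filter(I, win)
    var = uniform_filter(I**2, win) - mu**2
    return mu, np.maximum(var, 1e-6)   # guard against zero variance

def pixel_log_prob(I, mu, var):
    """Log of Equation (3): Gaussian probability that the gray level
    I(e, l) was drawn from the gray levels in P(e, l)."""
    return -0.5 * np.log(2 * np.pi * var) - (I - mu)**2 / (2 * var)
```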

The MAP estimates of Ω, E, φ, and m are obtained by maximizing the log likelihood. The constant term and p_E(E) (which is also a constant) in the expression for the log likelihood can be dropped during the maximization. Since we have discretized the images, the camera rotation φ need be estimated only to the resolution of the discretization. There is no closed-form solution to the maximization, so we use a numerical method to calculate the maximum.

The numerical procedure uses a multi-resolution strategy. It is called with the limits of the search range of φ and an increment Δφ. The procedure iterates through all feasible values of φ and finds the maximum log likelihood, say at φ*. The procedure is then called recursively with the search limits set to φ* ± Δφ and the increment set to Δφ/2. The recursion stops when the increment corresponds to one pixel, at which point the maximum of the log likelihood found by the procedure is returned.

During the search for the maximum likelihood, the maximizing values of Ω and m are found by an iterative coordinate ascent procedure, which alternately maximizes over Ω and over m. For a given interpolated image and m, the correspondence domain Ω which maximizes the log likelihood is easily found using standard methods [1]:

$$\Omega^* = \left\{ (e, l) \;\middle|\; \frac{(m I(e, l) - \mu(e, l))^2}{2\sigma^2(e, l)} < \lambda \right\},$$

where λ is a threshold determined by the occlusion prior. Similarly, for a given Ω, we pick m (the aperture factor) as the least-squares estimate

$$m^* = \frac{\sum_{(e, l) \in \Omega} I(e, l)\, \mu(e, l) / \sigma^2(e, l)}{\sum_{(e, l) \in \Omega} I^2(e, l) / \sigma^2(e, l)}.$$
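The sketch below shows one way the alternation and the multiresolution φ-search could be organized (Python/NumPy; λ = 10, the iteration counts, and all names are our illustrative choices, and get_slice(φ) is assumed to return the slice I_φ resampled to the size of I', e.g. via the slice_panorama sketch above).

```python
import numpy as np

def coordinate_ascent(I_phi, mu, var, lam=10.0, iters=5):
    """Alternately maximize over the domain Omega (thresholding the
    normalized residual) and the aperture factor m (least squares)."""
    m = 1.0
    for _ in range(iters):
        resid2 = (m * I_phi - mu) ** 2 / (2 * var)
        omega = resid2 < lam                          # update Omega
        if not omega.any():
            return omega, m, -np.inf
        # Least-squares update of m, clipped to the prior range [0.5, 2].
        num = (I_phi[omega] * mu[omega] / var[omega]).sum()
        den = (I_phi[omega] ** 2 / var[omega]).sum()
        m = float(np.clip(num / den, 0.5, 2.0))
    resid2 = (m * I_phi - mu) ** 2 / (2 * var)
    ll = (-0.5 * np.log(2 * np.pi * var[omega]) - resid2[omega]).sum()
    return omega, m, ll

def search_phi(get_slice, mu, var, lo=-np.pi, hi=np.pi, steps=16, min_step=1e-3):
    """Multiresolution search over the rotation phi: scan coarsely, then
    recurse on a halved increment around the best value found."""
    best = 0.5 * (lo + hi)
    step = (hi - lo) / steps
    while step > min_step:                            # stand-in for "one pixel"
        phis = np.arange(lo, hi + 0.5 * step, step)
        scores = [coordinate_ascent(get_slice(p), mu, var)[2] for p in phis]
        best = float(phis[int(np.argmax(scores))])
        lo, hi, step = best - step, best + step, 0.5 * step
    return best
```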

At the end of the MAP procedure outlined above, we have the MAP estimates Ω*, φ*, and m*. The model in Figure 2 indicates that the interpolated image agrees with the warped image process in the correspondence domain. Once we have the MAP estimates, we can evaluate the similarity of the interpolated image I* = T(φ*; I₁, …, Iₙ) and the warped image process by calculating the standard χ²-statistic:

$$\chi^2 = \frac{1}{|\Omega^*|} \sum_{(r, s) \in \Omega^*} \frac{(I^*(r, s) - \mu(r, s))^2}{\sigma^2(r, s)},$$

where |Ω*| is the area of the estimated correspondence domain in pixels. We expect to get low values of χ² when the images I₁, …, Iₙ and I' are obtained from nearby locations and "look similar." Conversely, we can expect to get high values of χ² when the images are dissimilar.

At this point we can explain the second technical reason for inquiring whether P is derived from P' rather than the other way around. Typically our map-learning algorithm will acquire a single new image, and then check it against a set of candidate stored panoramas. Each such check requires computing, for each pixel (e, l), the probability of its gray level assuming that it was drawn from a Gaussian distribution with mean μ(e, l) and variance σ²(e, l), as discussed in connection with Equation (3). That requires computing two derived arrays, μ and σ², which we call the mean image and the variance image. These arrays can be computed for I' once and for all, and then used repeatedly as we compare I' to each panorama in turn.
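Given the MAP estimates and the precomputed mean and variance images, the statistic itself is a few lines (a sketch; the names are ours, with the match threshold of 10 taken from the experiments below).

```python
import numpy as np

def chi_square(I_star, mu, var, omega):
    """The chi^2 statistic: variance-normalized squared residual between
    the interpolated image I* and the mean image mu, averaged over the
    estimated correspondence domain Omega*."""
    return (((I_star[omega] - mu[omega]) ** 2) / var[omega]).mean()

def is_match(I_star, mu, var, omega, threshold=10.0):
    """Decision rule used in the experiments: chi^2 < 10 declares a match."""
    return chi_square(I_star, mu, var, omega) < threshold
```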

[Fig. 5. Histogram of χ² values for matches. (χ² on the horizontal axis, 0 to 50; counts on the vertical axis, 0 to 15.)]

[Fig. 6. Histogram of χ² values for mismatches. (Same axes.)]

IV. Experiments

In our experiments, we created three panoramas from three different locations on the ground floor of the Watson building of the Yale Computer Science Department. Each panorama was constructed by taking 9 shots as the camera rotated, then gluing them together. In addition, we took 14 more images from points near the original camera position of one of the panoramas. Figure 4 shows one of the panoramas. Figure 7 shows one of the images, from the set of 14, that matched this panorama.

Each image was matched twice: once to a panorama taken near where it was taken, and once to a different panorama. Running on a Sparc 2, the program, written in C, took 3 seconds to produce the mean and variance images for an image, and 3 seconds to match it to each panorama. We computed the χ² statistic for each comparison. Figure 5 shows the distribution for the case where the two images were taken from nearby locations. Figure 6 shows the distribution for the other case. Obviously, the values are much closer to zero for the first case. If we classify all comparisons with a χ² of 10 or less as "matches" and all those with χ² > 10 as "mismatches," then the false positive rate is 5.6% and the false negative rate is 0%.

Obviously, further work is required to flesh these statistics out. We intend to rerun the same comparisons using a simple correlation test, to see if our technique is more reliable, and if so by how much. We have also done some preliminary experiments to find the area over which the camera can be moved while preserving the criterion that χ² < 10. The area depends on the depth of objects in the scene, but it appears that, when objects are two meters or more away, the camera can be moved over an area of about one square meter without causing a mismatch.

V. Conclusions

A promising approach to place recognition is to employ a stochastic model of image perturbations in order to decide if one image of a place could be produced by moving the camera within an area near where a previous image was taken. The model is based on the assumption that the only motions of the camera are rotation around the y axis, plus translation in the x-z plane. The resulting set of image perturbations includes large horizontal shifts, scale changes, and unpredictable occlusions.

We factor all of these into a model of where pixels in the perturbed image can come from in the unperturbed image. For each pixel, there is an area in the unperturbed image it might have been drawn from, and we can use a simple Gaussian distribution to estimate the probability that it actually did come from there. We take the unperturbed image to be the one just taken, and the perturbed image to be a slice of a panorama from a previously visited location.


[Fig. 4. An example panorama.]

Taking panorama slices solves the problem of sensitivity to y-axis rotation. Using the new image as the unperturbed image allows us to compute the mean and variance of the unperturbed image just once before matching it to several candidates. Preliminary experiments show that the method does a good job of distinguishing previously visited locations from new ones, at least in indoor environments. Further work includes a detailed comparison with other methods, investigation of how far the camera can be moved before a match breaks down, and adaptation of the method to outdoor conditions and other conditions where lighting is more variable.

References

[1] Andrew Blake and Andrew Zisserman. Visual Reconstruction. MIT Press, 1987.
[2] Joachim Buhmann, Wolfram Burgard, Armin B. Cremers, Dieter Fox, Thomas Hofmann, Frank E. Schneider, Jiannis Strikos, and Sebastian Thrun. The mobile robot Rhino. AI Magazine, 16(2):31-38, 1995.
[3] Alberto Elfes. Using occupancy grids for mobile robot perception and navigation. IEEE Computer, pages 46-58, June 1989. Special issue on Autonomous Intelligent Machines.
[4] Sean Engelson. Passive Map Learning and Visual Place Recognition. Technical Report 1032, Yale Computer Science Department, 1994.
[5] Sean P. Engelson and Drew V. McDermott. Image signatures for place recognition and map construction. In Proceedings of the SPIE Symposium on Intelligent Robotic Systems, Sensor Fusion IV, 1991.
[6] Sean P. Engelson and Drew V. McDermott. Error correction in mobile robot map learning. In Proc. IEEE Conf. on Robotics and Automation, pages 2555-2560, 1992.
[7] Jiawei Hong, Xianan Tan, Brian Pinette, Richard Weiss, and Edward M. Riseman. Image-based homing. In Proc. IEEE Conf. on Robotics and Automation, pages 620-625, 1991.
[8] Ian Horswill. Specialization of Perceptual Processes. PhD thesis, 1993.
[9] David Kortenkamp, L. Douglas Baker, and Terry Weymouth. Using gateways to build a route map. In Proc. IEEE/RSJ Int'l Workshop on Intelligent Robots and Systems, 1992.
[10] Benjamin Kuipers and Yung-tai Byun. A robust, qualitative method for robot spatial reasoning. In Proc. AAAI-88, pages 774-779, 1988.
[11] Benjamin Kuipers and Yung-tai Byun. A robot exploration and mapping strategy based on a semantic hierarchy of spatial representations. Robotics and Autonomous Systems, 8:47-63, 1991.
[12] Clayton Kunz, Thomas Willeke, and Illah R. Nourbakhsh. Automatic mapping of dynamic office environments. In Proc. IEEE Conf. on Robotics and Automation, pages 1681-1687, 1997.
[13] Maja J. Mataric. Environment learning using a distributed representation. In Proc. IEEE Int'l Conf. on Robotics and Automation, pages 402-406, 1990.

[14] Hans P. Moravec and Alberto Elfes. High-resolution maps from wide-angle sonar. In Proc. IEEE Int'l Conf. on Robotics and Automation, 1985.
[15] Don Murray and Cullen Jennings. Stereo vision based mapping and navigation for mobile robots. In Proc. IEEE Int'l Conf. on Robotics and Automation, pages 1694-1699, 1997.
[16] Randall Nelson. Visual Navigation. PhD thesis, 1989.
[17] Illah Nourbakhsh, Rob Powers, and Stan Birchfield. DERVISH: an office-navigating robot. AI Magazine, 16(2):53-60, 1995.
[18] Reid Simmons and Sven Koenig. Probabilistic navigation in partially observable environments. In Proc. IJCAI, 1995.
[19] Camillo J. Taylor and David J. Kriegman. Exploration strategies for mobile robots. In Proc. IEEE Int'l Conf. on Robotics and Automation, 1993.
[20] Jian Yu Zheng, Matthew Barth, and Saburo Tsuji. Autonomous landmark selection for route recognition by a mobile robot. In Proc. IEEE Conf. on Robotics and Automation, pages 2004-2009, 1991.


[Fig. 7. An image matching the panorama.]
