An appearance-based visual compass for mobile robots
J. Sturm, A. Visser ∗
Intelligent Systems Lab Amsterdam, Informatics Institute, Universiteit van Amsterdam, The Netherlands
Abstract

Localization is one of the basic skills of a mobile robot. Until now, however, most approaches still rely either on special sensors or on artificial environments. In this article, a novel approach is presented that provides compass information purely based on the visual appearance of a room. A robot using such a visual compass can quickly learn an appearance map, independent of the existence of artificial landmarks. The visual compass algorithm is efficient and scalable and can therefore run in real time on almost any contemporary robotic platform. Extensive experiments on a Sony Aibo robot have validated that the approach works in a wide variety of environments.

Key words: appearance-based, mobile robot localization, active vision, machine learning
PACS: 42.30.Tz, 89.20.Ff, 45.40.Ln
2000 MSC: 68T40, 68T45
∗ Corresponding address: Arnoud Visser, Intelligent Systems Lab Amsterdam, Kruislaan 403, 1098 SJ Amsterdam, The Netherlands. http://www.science.uva.nl/research/isla
Email address: [email protected] (A. Visser).
Preprint submitted to Robotics and Autonomous Systems
18 December 2006
1 Introduction
A central goal of robotics and AI is to be able to deploy a mobile robot that acts autonomously in the real world [1,2]. Much progress has been made in this field over the past years. From the spectrum of possible sensors, laser range scanners are often used for robot localization. However, these sensor modalities have a number of problems. Laser range scanners are relatively heavy, and their measurements lack, especially in natural environments (like corridors or similar offices), sufficient features of distinguishable quality.

Therefore, vision has long been advertised as providing a solution to these problems. From a conceptual perspective, it is desirable that a robot could perceive the world in the same way that humans do. The real world is full of visual indications that have been installed especially to facilitate (human) localization, such as room personalization (with photos and posters), public signage, arrows and other salient landmarks (like high buildings).

Classical solutions [3] rely on fixed, artificial landmarks. Static algorithms then allow landmarks to be detected and extracted efficiently from the observations, which can subsequently be used by a Kalman filter for robot localization. In natural environments, by contrast, no assumptions about specific landmarks can be made, and it is therefore commonly asserted that robots must be able to learn a map of their surroundings autonomously. Contemporary publications indicate that promising solutions exist, ranging from the visual SLAM domain (i.e. SIFT features [4,5]) and visual homing techniques [6,7] to purely appearance-based approaches [8,9]. However, current implementations only work at low frame rates or make simplifying assumptions about the environment.

Outline

In this article, a novel approach to robot localization is described. First, existing solutions are examined (section 2). Our novel approach is subsequently described in section 3: color frequency patterns (extracted from camera images) are used for finding the robot's orientation, given a corresponding map. The robot can also use these patterns to learn a map of its environment autonomously. The necessary operations are computationally extremely efficient, and the approach has therefore been implemented on a Sony Aibo robot (section 4). It is shown (section 5) that orientation and location can be estimated with an accuracy better than 10° and a standard deviation of 33 centimeters, respectively. Finally, the approach's characteristics are summarized and the results are discussed in comparison to the work of others (section 6).
2 Related work
Vision confronts a mobile robot with an enormous amount of data. Therefore, the robot needs to transform the images into a much smaller feature space. In what follows, the most important classes of feature spaces used for vision-based localization are presented.

Geometric approaches, on the one hand, extract geometrical information about landmarks from the camera images. Classical approaches feed this landmark information to a localization filter that iteratively estimates the robot's pose. A distinct subclass is formed by (biologically inspired) approaches that circumvent explicit localization by mapping the visual input from the camera directly to motor commands, and thus moving the robot iteratively towards its target position (visual homing).

Appearance-based approaches, on the other hand, try to avoid complex image preprocessing by analyzing the images directly with statistical means. This usually results in low computational complexity, and such approaches can therefore be classified as low-level vision.
2.1 Geometric approaches
Geometric approaches assume the presence of landmarks in the environment. These landmarks can either be statically known to the robot (as is the case for RoboCup soccer robots, or in successful industrial applications), or be detected in a generic manner from the environment. In either case, the robot extracts geometrical landmark information from the camera images and applies localization filtering to find and track the robot's pose.

The approaches relying on static landmarks are by definition useless in unstructured and natural environments, where no assumptions about specific landmarks can be made. Generic approaches (using SIFT features [4]) have been applied successfully, allowing the robot to dynamically create a map of its environment as it moves (simultaneous localization and mapping, SLAM). Such approaches, however, require orders of magnitude more computational power, resulting in a significantly reduced frame rate in implementations on contemporary mobile robots [5].

Subsequent localization estimation can be done using standard filtering techniques, like Kalman filters or particle filters. Kalman filters are extremely efficient in computation, but particle filters allow the explicit representation of ambiguity in the robot's beliefs and are therefore the typical choice in symmetrical or ambiguous environments [3].
2.1.1 Visual homing
Other approaches support the landmark paradigm, but reduce its complexity significantly. The source of inspiration is that small animals (such as insects) navigate through natural environments seemingly with little effort. Honey bees and desert ants, despite their relatively simple nervous systems (and implicitly small data storage capacity), are able to find their way back to their hive or anthill. The difference here is that those animals are assumed to lack an absolute pose estimate; rather, they seem to be able to return to certain spots without explicit localization. Such an approach is described in [10,11], where a robot correlates the current image with a stored snapshot and directly derives a motion vector that is assumed to bring the robot closer to its target pose.
2.2 Appearance-based approaches
The lack of absolute pose estimates is certainly a drawback, but such approaches prove that relatively simple image conversion is in principle sufficient for successful robot navigation. The class of appearance-based approaches emphasizes this aspect by analyzing the image (and subsequently the feature space) only with direct statistical means. Such approaches aim at providing full (or partial) robot localization at extremely low computational cost.

One such approach is described in [6,7], where a robot directly compares the current and previous camera image to estimate the robot's (relative) rotation, which the robot can subsequently use as a visual compass. Another interesting example is given in [8,9]. A robot with an omni-directional camera took thousands of pictures along a known path through a shopping environment and stored their (direction-dependent) average color values. Subsequently, this map was sufficient to localize the robot within a few camera frames using a particle filter. This proves that robot localization is possible with simple means, at low computational cost and without the need for artificial or natural landmarks.
3 Approach
The basic idea behind the visual compass approach is intuitive: as the robot should be able to adapt to any given environment, a direct approach would be to store a 360° panoramic image of its surroundings and use it subsequently as a map. The robot could then match every new incoming image with this map to estimate its orientation.

Typical cameras produce, despite their limited field of view, enormous amounts of data that need to be reduced significantly before they can be used for localization. Therefore, a feature space is constructed that represents the captured images by their frequency pattern of color class transitions. The effects originating from projective distortions thereby disappear thanks to the probabilistic nature of the feature space and can subsequently be treated as regular noise.

The approach is introduced incrementally: it starts by assuming a single-direction, single-transition sensor (section 3.1) and is subsequently extended to be applicable to camera images of a mobile robot (sections 3.2, 3.4). Additionally, it is shown how this approach can be used to estimate (learn) a map $\hat{m}$ in an unknown environment (section 3.3). Finally, the localization filter is constructed (section 3.5) that a robot can use to estimate its pose $x_t$ from incoming camera images $f_t$ (given a map $m$).
3.1 Single transition
The approach is based on the idea that the colors in a natural environment are not equally distributed. More specifically, the approach exploits the assumption that the occurrence pattern of color class transitions strongly depends on the orientation.

For the moment, it is assumed that a robot has access to a probabilistic sensor that can directly measure the relative frequency of a transition between two (preliminarily fixed) color classes $i$ and $j$ given an orientation $\phi$. A measurement, taken at time $t$ and under orientation $\phi$, will be denoted by $z_t^{ij}(\phi)$. These assumptions will be relaxed in the following subsections so that finally a regular color camera can be used.

The relative occurrence frequency of a transition between the two color classes $i$ and $j$ given a direction $\phi$ is then modelled by the random variable $Z^{ij}(\phi)$ (see fig. 1). Each time the robot takes a measurement, it reads a slightly different frequency, indicated by the series $(z_1^{ij}, z_2^{ij}, \ldots)$. For localization purposes, the robot needs to estimate the likelihood that these readings were produced by the underlying distribution $Z^{ij}(\phi)$.
Fig. 1. A robot is standing at pose $x_t = \langle \phi_t, \tau_t \rangle$. It measures the frequency patterns $z_t(\phi)$ in different directions $\phi$ in the current image $f_t$ from its camera.
In this approach, it has been chosen to approximate $Z^{ij}(\phi)$ by a histogram distribution, defined by the absolute frequencies of $n$ logarithmically scaled bins $m^{ij}_{(1)}(\phi), m^{ij}_{(2)}(\phi), \ldots, m^{ij}_{(n)}(\phi)$:

$$Z^{ij}(\phi) \sim H\left[ m^{ij}_{(1)}(\phi), m^{ij}_{(2)}(\phi), \ldots, m^{ij}_{(n)}(\phi) \right] \quad (1)$$

This histogram distribution is then defined by the following probability distribution:

$$p\left(z^{ij}(\phi) \mid m^{ij}(\phi)\right) = \frac{1}{\sum_{k=1}^{n} m^{ij}_{(k)}(\phi)}
\begin{cases}
m^{ij}_{(1)}(\phi) & \text{if } 2^{-1} < z^{ij}(\phi) \le 2^{0} \\
m^{ij}_{(2)}(\phi) & \text{if } 2^{-2} < z^{ij}(\phi) \le 2^{-1} \\
\quad\vdots & \quad\vdots \\
m^{ij}_{(n)}(\phi) & \text{if } z^{ij}(\phi) \le 2^{-(n-1)}
\end{cases} \quad (2)$$

The parameters of this distribution, $m^{ij}(\phi) = \left( m^{ij}_{(1)}(\phi), m^{ij}_{(2)}(\phi), \ldots, m^{ij}_{(n)}(\phi) \right)$, can be seen to define the map $m^{ij}(\phi)$ concerning the transition frequency sensor $Z^{ij}(\phi)$.
If the robot now gets the measurement $z_t^{ij}$ from its frequency sensor of unknown direction, it can deduce the conditional probability (or posterior, or likelihood) that this observation was taken under a certain heading $\phi$, given a map $m^{ij}$, by evaluating

$$p\left(z^{ij} \mid \phi, m^{ij}(\phi)\right) \quad (3)$$

using eq. (2). For a single direction, the evaluation complexity of this expression is linear in the number of histogram bins $n$: $p\left(z^{ij} \mid m^{ij}(\phi)\right) \in O(n)$.
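To make the log-binned histogram model concrete, the following Python sketch evaluates eqs. (2)-(3) for one transition sensor. It is illustrative only and not the authors' implementation; the names `bin_index` and `likelihood` and the uniform fallback for an empty map are assumptions made here.

```python
import numpy as np

def bin_index(z, n):
    """Return the 1-based index of the logarithmically scaled bin hit by z.

    Bin k covers 2^-k < z <= 2^-(k-1); the last bin absorbs everything
    with z <= 2^-(n-1), as in eq. (2).
    """
    if z <= 2.0 ** -(n - 1):
        return n
    return int(np.floor(-np.log2(z))) + 1

def likelihood(z, counts):
    """Eqs. (2)-(3): likelihood of reading z under the bin counts m^ij(phi)."""
    counts = np.asarray(counts, dtype=float)
    n = len(counts)
    total = counts.sum()
    if total == 0:
        return 1.0 / n                         # empty map: uniform likelihood (illustrative choice)
    return counts[bin_index(z, n) - 1] / total

counts = np.array([2.0, 5.0, 9.0, 3.0, 1.0])   # hypothetical learned bin counts for one (i, j, phi)
print(likelihood(0.3, counts))                  # 0.3 falls into bin 2 -> 5 / 20 = 0.25
```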
3.2 Transition patterns
The approach so far only relates to the frequencies $z_1^{ij}(\phi), z_2^{ij}(\phi), \ldots$ of one specific color class transition $ij$, given the map $m^{ij}$. This is now generalized by introducing a compound sensor $Z(\phi)$ that measures the relative frequencies of all color class transitions simultaneously:

$$Z(\phi) = \begin{pmatrix} Z^{11}(\phi) & \cdots & Z^{1n}(\phi) \\ \vdots & \ddots & \vdots \\ Z^{n1}(\phi) & \cdots & Z^{nn}(\phi) \end{pmatrix} \quad (4)$$

If the robot now reads a measurement $z$ of its sensor, it can compute the posterior $p\left(Z = z \mid m(\phi)\right)$, expressing the likelihood that the robot was facing in direction $\phi$ while measuring $z$, assuming that all transition frequency sensors $Z^{ij}(\phi)$ are mutually independent:

$$p\left(z \mid \phi, m(\phi)\right) = \prod_{i,j}^{n} p\left(z^{ij} \mid \phi, m^{ij}(\phi)\right) \quad (5)$$

The evaluation complexity of this expression grows quadratically with the number of color classes $c$: $p\left(z \mid m(\phi)\right) \in O(c^2 \cdot n)$.
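A minimal sketch of eq. (5), reusing the `likelihood` helper from the previous sketch; the $(c, c)$ measurement layout and the $(c, c, n)$ map layout are assumptions made here for illustration.

```python
import numpy as np

def compound_likelihood(z, map_counts):
    """Eq. (5): product of single-transition likelihoods over all class pairs (i, j).

    z          -- (c, c) array of measured relative transition frequencies for one direction
    map_counts -- (c, c, n) array of learned bin counts for the same direction
    Reuses the likelihood() helper defined in the sketch above.
    """
    c = z.shape[0]
    p = 1.0
    for i in range(c):
        for j in range(c):
            p *= likelihood(z[i, j], map_counts[i, j])
    return p
```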
3.3 Map learning
In an unknown environment, the map $m$ is of course not available, and therefore has to be learned by the robot from the measurement series $z_1^{ij}(\phi), z_2^{ij}(\phi), \ldots$. The robot can learn the map by estimating its constituting parameters $m^{ij}(\phi)$. The estimated map and the estimated parameters will be denoted by $\hat{m}$ and $\hat{m}^{ij}(\phi)$, respectively.

Fortunately, histogram distributions are particularly easy to estimate. Their parameters $\hat{m}^{ij}_{(1)}(\phi), \hat{m}^{ij}_{(2)}(\phi), \ldots, \hat{m}^{ij}_{(n)}(\phi)$ are direct estimators, as they actually represent the absolute frequencies of the corresponding bins. Each time a measurement $z^{ij}(\phi)$ falls into a certain bin $k$, its corresponding counter $\hat{m}^{ij}_{(k)}$ of the map is increased by 1:

$$\text{if } 2^{-k} \le z^{ij}(\phi) < 2^{-(k-1)}, \text{ then } \hat{m}^{ij}_{(k)} \leftarrow \hat{m}^{ij}_{(k)} + 1 \quad (6)$$

to update the part of the map describing this underlying distribution. The evaluation complexity of learning from a full feature vector $z(\phi)$ is of the same complexity class as the evaluation of a feature vector ($\hat{m}_t(z_t) \in O(c^2 \cdot n)$, see section 3.2).
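A self-contained sketch of the update rule of eq. (6); the array layout of the map (directions x classes x classes x bins) and the example sizes are illustrative assumptions.

```python
import numpy as np

def update_map(map_counts, z, i, j, phi_idx):
    """Eq. (6): increment the histogram bin hit by measurement z of transition (i, j)
    for map direction phi_idx.

    map_counts -- (directions, c, c, n) array of absolute bin frequencies (the map m-hat)
    z          -- measured relative frequency in [0, 1]
    """
    n = map_counts.shape[-1]
    if z <= 2.0 ** -(n - 1):
        k = n                                   # the last bin absorbs all very small frequencies
    else:
        k = int(np.floor(-np.log2(z))) + 1      # bin k covers 2^-k < z <= 2^-(k-1)
    map_counts[phi_idx, i, j, k - 1] += 1

# Illustrative map layout: 80 sectors of 4.5 degrees, 10 color classes, 8 histogram bins
m_hat = np.zeros((80, 10, 10, 8))
update_map(m_hat, z=0.3, i=2, j=5, phi_idx=17)
```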
3.4 Image feature extraction
Of course, a robot typically is not able to measure the color class transition frequencies directly. However, such a sensor can easily be constructed if the robot is equipped with a camera.

The image a robot perceives when pointing its camera in a direction $\phi$ can be denoted by $f_t(\phi)$. Without loss of generality, it is assumed that the images can be continuously sampled, both in horizontal and vertical direction, within the (physical) opening angle $FOV$ of the camera: any pixel $\langle \alpha, \beta \rangle \in [-FOV/2, +FOV/2] \times [0, +FOV/2]$ above the horizon can then be addressed by $f_t(\phi)\langle \alpha, \beta \rangle$. In the case of a color camera, the evaluation of a point in the picture will yield a color from a three-dimensional color space $[0,1]^3$.

In order to measure the transition frequencies between two color classes, a function $g$ is required that maps a color $c = f_t(\phi)\langle \alpha, \beta \rangle$ onto its color class $[c]$:

$$g : [0,1]^3 \rightarrow \{0, \ldots, c\} \quad (7)$$
$$c \mapsto g(c) \quad (8)$$
A mobile robot can create such a mapping autonomously by clustering the color space, for example with an Expectation-Maximization algorithm and a Gaussian mixture model. An efficient and working approach used by the Dutch Aibo Team has been presented in [12,13].
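As an illustration of such a color-class mapping $g$ (this is not the Dutch Aibo Team's calibration code from [12,13]), one could cluster sampled pixel colors with a Gaussian mixture model, for example via scikit-learn; the training data and the choice of 10 components are assumptions for the example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical training data: pixel colors sampled from a few camera frames,
# scaled to the unit cube [0, 1]^3 (e.g. YUV or RGB).
pixels = np.random.rand(5000, 3)

c = 10                                           # one mixture component per color class
gmm = GaussianMixture(n_components=c, covariance_type="full", random_state=0)
gmm.fit(pixels)

def g(color):
    """The mapping g of eqs. (7)-(8): a color in [0,1]^3 -> its color class index."""
    return int(gmm.predict(np.asarray(color, dtype=float).reshape(1, 3))[0])

print(g([0.2, 0.5, 0.8]))                        # class index of one example color
```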
It was chosen to measure the transition frequencies in the vertical direction (from bottom to top). A transition between $i$ and $j$ at $\langle \alpha, \beta \rangle$ is then defined as a change (in vertical direction) of color class in the image at this point, or mathematically:

$$\mathrm{trans}^{ij}(\langle \alpha, \beta \rangle) = 1 \iff \lim_{\epsilon \to 0}\; g\left(f_t(\phi)\langle \alpha, \beta - \epsilon \rangle\right) = i \;\wedge\; g\left(f_t(\phi)\langle \alpha, \beta + \epsilon \rangle\right) = j \quad (9)$$
With this definition available, the absolute frequencies of a particular color class transition over the complete image (above the horizon) can simply be counted by

$$z_t'^{ij}(\phi) = \int_{-FOV/2}^{+FOV/2} \int_{0}^{+FOV/2} \mathrm{trans}^{ij}(\langle \alpha, \beta \rangle)\, d\beta\, d\alpha. \quad (10)$$

These absolute frequencies can be divided by their sum in order to obtain relative frequencies:

$$z_t^{ij}(\phi) = \frac{1}{\sum_{i',j'}^{m} z_t'^{i'j'}(\phi)}\, z_t'^{ij}(\phi) \quad (11)$$
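A discrete counterpart of eqs. (9)-(11), sketched under the assumption that the image has already been mapped to color class labels on a pixel grid; the function name and array layout are illustrative.

```python
import numpy as np

def transition_frequencies(labels, c):
    """Discrete version of eqs. (9)-(11): relative frequencies of vertical color
    class transitions in an image whose pixels are already mapped to color classes.

    labels -- (rows, cols) integer array of color class indices; row 0 is the bottom row
    c      -- number of color classes
    Returns a (c, c) array z with z[i, j] = relative frequency of an i -> j transition
    when scanning each column from bottom to top.
    """
    counts = np.zeros((c, c))
    lower, upper = labels[:-1, :], labels[1:, :]          # vertically adjacent pixel pairs
    changed = lower != upper                              # a transition is a change of class
    np.add.at(counts, (lower[changed], upper[changed]), 1)
    total = counts.sum()
    return counts / total if total > 0 else counts

labels = np.random.randint(0, 3, size=(120, 160))         # hypothetical classified 160x120 frame
z = transition_frequencies(labels, c=3)
print(z.sum())                                            # the relative frequencies sum to 1
```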
Now the robot can measure $z_t(\phi)$ from a camera image $f_t(\phi)$ by evaluating eq. (11). The conversion of eq. (11) from a camera image into transition frequencies is rather coarse: the whole image $f_t(\phi)$ is collapsed into a single frequency measurement $z_t(\phi)$. A camera image, however, holds much more information, and it is therefore reasonable to measure $z_t(\phi)$ at a finer resolution. From now on, it is assumed that the camera points at a horizontal angle $\psi$ from which it takes images $f_t(\psi)$, while the frequency measurement $z_t(\psi + \phi)$ takes place within the picture, in a slightly different direction $\psi + \phi$. Because a regular camera has a limited opening angle, a captured image limits the relative measurement direction $\phi - \psi$ to a certain subinterval around the absolute camera direction $\psi$:

$$\phi - \psi \in [-FOV/2, +FOV/2] \quad (12)$$

with $FOV$ being the opening angle of the camera's field of view. The transitions can then be convolved with an underlying distribution $g_{\phi, \Delta\phi_{sensor}}(\langle \alpha, \beta \rangle)$ (centered on the desired sensor direction $\phi$ and with a particular width, or resolution, $\Delta\phi_{sensor}$):

$$z_t''^{ij}(\phi) = \int_{-FOV/2}^{+FOV/2} \int_{0}^{+FOV/2} \mathrm{trans}^{ij}(\langle \alpha, \beta \rangle) \cdot g_{\phi, \Delta\phi_{sensor}}(\langle \alpha, \beta \rangle)\, d\beta\, d\alpha \quad (13)$$

expressing the absolute frequency of a color class transition between $i$ and $j$ in direction $\phi$ with a certain angular resolution $\Delta\phi_{sensor}$. Subsequently, eq. (11) can be applied to obtain the relative frequencies $z_t(\phi)$.

When the feature vectors $z_t''^{ij}(\phi)$ are sampled from camera images at an angular resolution $\Delta\phi_{sensor}$, it is sufficient to evaluate $p\left(z \mid \phi, m(\phi)\right)$ from eq. (5) (and therefore implicitly the map $m^{ij}_t(\phi)$) only at the same resolution: the sample directions can then be reduced to the set

$$\phi \in \Phi = [\psi - FOV/2, \psi + FOV/2] \cap \Delta\phi_{sensor} \cdot \mathbb{N} \quad (14)$$
This implies that now, per camera image $f_t(\psi)$, several transition frequency measurements $\left\{ z_t(\phi) \mid \phi \in [\psi - FOV/2, \psi + FOV/2] \cap \Delta\phi_{sensor} \cdot \mathbb{N} \right\}$ in different directions $\phi$ can be sampled. Consequently, the robot can evaluate all of them simultaneously after it has taken an image $f_t$ that was captured in an unknown direction. Because the measurement directions lie sufficiently far apart (namely at a distance of $\Delta\phi_{sensor}$), these measurements can be assumed to be independent, which yields:

$$p\left(f_t(\psi) \mid \psi, m\right) = p\left(\left\{ z_t(\phi) \mid \phi \in \Phi \right\} \mid \phi, m(\phi)\right) \quad (15)$$
$$= \prod_{\phi \in \Phi} \prod_{i,j}^{n} p\left(Z^{ij}(\phi) = z^{ij}(\phi) \mid m^{ij}(\phi)\right) \quad (16)$$
Given a limited-field-of-view camera, the robot would only perceive a small fraction ($FOV/2\pi$) of its surroundings. Active vision can be used to increase this perceptive field: if the robot disposes of a turnable camera, it can scan its surroundings to actively remove ambiguities. Additionally, mobile robots can turn their whole body to see other parts of the scene, allowing them to learn a full circular map $m$. The evaluation complexity of evaluating camera images given a map $m$ then depends reciprocally on the angular resolution $\Delta\phi_{sensor}$: $p\left(f_t(\psi) \mid \psi, m\right) \in O\left(c^2 \cdot n \cdot \frac{1}{\Delta\phi_{sensor}}\right)$.
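To illustrate how eqs. (15)-(16) turn a sectored image into heading likelihoods, the following sketch scores every candidate heading of a circular map against the transition frequencies measured in the image sectors. It is illustrative only: the map layout, the log-domain scoring and the zero-count floor are assumptions made here, not the paper's implementation.

```python
import numpy as np

def bin_index(z, n):
    """Bin k (1-based) covers 2^-k < z <= 2^-(k-1); the last bin absorbs z <= 2^-(n-1)."""
    if z <= 2.0 ** -(n - 1):
        return n
    return int(np.floor(-np.log2(z))) + 1

def heading_log_likelihoods(sector_freqs, map_counts):
    """Eqs. (15)-(16) in log form: score every candidate camera heading against the map.

    sector_freqs -- (s, c, c) relative transition frequencies of the s sectors of one image
    map_counts   -- (S, c, c, n) learned bin counts for all S map directions (S >= s)
    Returns an array of log p(f_t | psi, m) values, one per candidate map offset psi.
    """
    S, c, _, n = map_counts.shape
    s = sector_freqs.shape[0]
    totals = np.maximum(map_counts.sum(axis=-1), 1.0)       # per-direction bin-count sums
    scores = np.zeros(S)
    for psi in range(S):                                    # candidate heading = map offset
        logp = 0.0
        for k in range(s):                                  # image sector k aligns with map sector psi + k
            m_dir = map_counts[(psi + k) % S]
            t_dir = totals[(psi + k) % S]
            for i in range(c):
                for j in range(c):
                    b = bin_index(sector_freqs[k, i, j], n) - 1
                    p = max(m_dir[i, j, b], 0.5) / t_dir[i, j]   # small floor avoids log(0)
                    logp += np.log(p)
        scores[psi] = logp
    return scores

# Hypothetical sizes: an 80-sector map (4.5 degree sectors), a 12-sector camera image,
# 3 color classes and 8 histogram bins.
m = np.random.randint(0, 20, size=(80, 3, 3, 8)).astype(float)
f = np.random.dirichlet(np.ones(9), size=12).reshape(12, 3, 3)
print(int(np.argmax(heading_log_likelihoods(f, m))))        # most likely heading index
```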
3.5 Localization filter
Robot localization [14] aims at estimating the pose at time $t$, i.e. $x_t$, taking into account the movements of the robot $(u_1, \ldots, u_t)$ and the observations of the environment taken by the robot $(z_1, \ldots, z_t)$. The corresponding Markov process then goes through the sequence $x_0 \xrightarrow{u_1} (x_1, z_1) \xrightarrow{u_2} \cdots \xrightarrow{u_t} (x_t, z_t)$. The Markov assumption states that the robot's pose can be updated from the previous state probability $p(x_{t-1})$, the last executed action $u_t$, and the current observation $z_t$. Applying Bayes' rule to $p(x_t \mid u_t, z_t)$ gives

$$p(x_t \mid u_t, z_t) \propto p(z_t \mid x_t)\, p(x_t \mid u_t), \quad (17)$$

where the probability $p(x_t \mid u_t)$ can be computed by propagating $p(x_{t-1} \mid u_{t-1}, z_{t-1})$ using the action model

$$p(x_t \mid u_t) = \int p(x_t \mid u_t, x_{t-1})\, p(x_{t-1} \mid u_{t-1}, z_{t-1})\, dx_{t-1}. \quad (18)$$
In eq. (18), $p(x_t \mid u_t, x_{t-1})$ is called the motion model, and $p(z_t \mid x_t)$ in eq. (17) is the sensor model of the Bayes filter. Eqs. (17) and (18) then define a recursive system for estimating the position of the robot. Both the continuous representation and the integral in eq. (18) are difficult to implement and are therefore typically approximated (for example using a probability grid, a Kalman filter or Monte Carlo methods).

In the following two subsections, the corresponding motion and sensor models will be constructed. Subsequently, the first part of the localization filter is presented, which estimates the robot's heading. To additionally estimate the robot's translation, a single compass sensor is not sufficient. The idea is to use multiple compass sensors: if each sensor is trained at a different spot, the robot can interpolate its (translational) pose by comparing the reported likelihoods.
3.5.1 Motion model
The motion of the robot is simply assumed to be sufficiently described by the differential data from odometry, $\Delta x_t$. The motion model can then be written as

$$x_t = x_{t-1} + \Delta x_t \quad (19)$$

or equivalently, by splitting the pose $x = \langle \phi, \tau \rangle$ into its rotational component $\phi$ and translational component $\tau$,

$$\langle \phi_t, \tau_t \rangle = \langle \phi_{t-1}, \tau_{t-1} \rangle + \langle u^\phi_t, u^\tau_t \rangle \quad (20)$$
$$= \langle \phi_{t-1} + u^\phi_t,\; \tau_{t-1} + u^\tau_t \rangle \quad (21)$$
3.5.2 Sensor model
The posterior probability distribution that a compass sensor (given a map $m$) yields when supplied with an image $f_t$ can directly be used as a sensor model:

$$p\left(f_t \mid x_t, m\right) = p\left(f_t \mid \langle \phi, \tau \rangle, m\right) = p\left(f_t \mid \phi, m\right) \quad (22)$$

using eq. (15). The sensor model can then express possible ambiguities in a natural way, thanks to the underlying posterior distribution. The posterior stated in eq. (22) forms the sensor model of the visual compass by relating the captured images $f_t$ to the world state $\phi$, given a map $m$. It can subsequently be used by a localization filter to integrate the likelihoods over time, merge them with data from odometry, and eventually select the orientation with the highest likelihood, which can be provided to higher-level layers.
3.6 Orientational belief filter
The filter then has to track the orientational beliefs $p(\phi_t \mid u^\phi_t, f_t)$ of the robot. The belief distribution is represented by a one-dimensional circular grid with an angular resolution of $\Delta\phi_{filter}$.

At each time step, the beliefs $p(\phi_t \mid u^\phi_t, f_t, m)$ are updated according to the motion and sensor model, using the update rules from Bayesian filtering (see eq. (18)) and a mixing constant $\lambda$:

$$p^-(\phi_t \mid u^\phi_t, f_t, m) = p(\phi_t - u^\phi_t \mid f_t, m) \quad (23)$$
$$p(\phi_t \mid u^\phi_t, f_t, m) = \lambda\, p\left(f_t \mid x, m\right) p^-(\phi_t \mid u^\phi_t, f_t, m) + (1 - \lambda)\, p^-(\phi_t \mid u^\phi_t, f_t, m) \quad (24)$$

This belief distribution may in principle have an arbitrary shape. In order to extract an expedient single rotational estimate that the robot can use (for example for planning and navigation), the belief distribution can be approximated by a Gaussian distribution with its mean at the estimated heading $\hat{\phi}_t$ and a variance $\hat{\sigma}_t$:

$$p(\phi_t \mid u^\phi_t, f_t, m) = \frac{1}{\hat{\sigma}\sqrt{2\pi}} \exp\left(-\frac{(\phi_t - \hat{\phi})^2}{2\hat{\sigma}^2}\right) \quad (25)$$
In fig. 2, this process is sketched graphically. The evaluation complexity of both belief updates is reciprocally linear in the angular resolution $\Delta\phi_{filter}$, so $p(\phi_t \mid u^\phi_t, f_t, m) \in O\left(\frac{1}{\Delta\phi_{filter}}\right)$. The robot can now estimate its current heading by estimating the Gaussian parameters:

$$\hat{\phi}_t = \mathrm{mean}\left( p(\phi_t \mid u^\phi_t, f_t, m) \right) \quad (26)$$
$$\hat{\sigma}_t = \mathrm{var}\left( p(\phi_t \mid u^\phi_t, f_t, m) \right) \quad (27)$$
Fig. 2. Visualization of the Bayesian filtering process. In the first step, the belief distribution $Bel_{t-1}(x)$ is updated according to the motion model (odometry correction). In the second step, the sensor model is applied and a new belief distribution $Bel_t(x)$ becomes available.
3.6.1 Four-spot sensor model
With the sensor model as constructed in eq. (22), it is possible to estimate the robot's orientation (eq. (27)) by learning the map $\hat{m}$ at only a single training spot. In order to additionally estimate the translational component of the robot's pose, multiple training spots are required: the (global) map then consists of $q$ patches $\hat{m}^{(1)}, \ldots, \hat{m}^{(q)}$ learned at the respective training spots $T^{(1)}, \ldots, T^{(q)}$.

For each patch, the maximal likelihood in the posterior probability distribution can be regarded as an indicator of the goodness of the match between the current patch and the current image (which is inversely related to the distance). A single goodness-of-fit indicator can thus be extracted by taking the maximum likelihood $L^{(i)}_t$ from the corresponding posterior distribution $p\left(f_t \mid \phi, \hat{m}^{(i)}\right)$:

$$L^{(i)}_t = \max_\phi\, p\left(f_t \mid \phi, \hat{m}^{(i)}(\phi)\right) \quad (28)$$
From the combined sensor reading $z'_t$ (consisting of the indicators of all $q$ compass sensors)

$$z'_t = \langle L^{(1)}_t, L^{(2)}_t, \ldots, L^{(q)}_t \rangle \quad (29)$$

the translation can be estimated directly by computing the weighted sum of the training spots $T^{(1)}, \ldots, T^{(q)}$ according to the corresponding likelihoods $\hat{L}^{(1)}, \ldots, \hat{L}^{(q)}$:

$$\hat{\tau}(z'_t \mid \hat{m}^{(1)}, \ldots, \hat{m}^{(q)}) = \frac{1}{\sum_i^q \hat{L}^i_t} \sum_i^q \hat{L}^i_t\, T^i \quad (30)$$

The evaluation complexity of such a sensor is additionally linear in the number of patches used: $\hat{\tau}(z'_t \mid \hat{m}^{(1)}, \ldots, \hat{m}^{(q)}) \in O\left(q \cdot c^2 \cdot n \cdot \frac{1}{\Delta\phi_{filter}}\right)$. A Bayesian filter can subsequently use eq. (30) directly as a translational sensor model in the sensor belief update step.
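A small sketch of eqs. (29)-(30), assuming the per-patch maximum likelihoods have already been computed (for example from the heading scores of each trained map patch); the coordinates in the example are hypothetical.

```python
import numpy as np

def estimate_translation(likelihoods, training_spots):
    """Eq. (30): likelihood-weighted interpolation between the training spots.

    likelihoods    -- length-q vector of per-patch maximum likelihoods L_t^(i), eqs. (28)-(29)
    training_spots -- (q, 2) array of the training spot positions T^(i) in world coordinates
    """
    L = np.asarray(likelihoods, dtype=float)
    T = np.asarray(training_spots, dtype=float)
    return (L[:, None] * T).sum(axis=0) / L.sum()

# Hypothetical example: four training spots 1 m around the origin (as in the paper's experiment)
spots = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
print(estimate_translation([0.9, 0.4, 0.1, 0.4], spots))   # pulled towards the first spot
```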
3.6.2 Translational belief filter
The translational belief of the robot is expressed by a direct estimate $\hat{\tau}_t$. In each time step, the belief is updated first by the motion model as defined in eq. (19) and then by the sensor model constructed in eq. (30), using a mixing constant $\lambda$:

$$\hat{\tau}^-_t = \hat{\tau}_{t-1}(\tau - u^\tau_t) \quad (31)$$
$$\hat{\tau}_t = \lambda\, \hat{\tau}(z'_t) + (1 - \lambda)\, \hat{\tau}^-_t(\tau) = \lambda\, \frac{1}{\sum_i^n \hat{L}^i_t} \sum_i^n \hat{L}^i_t\, T^i + (1 - \lambda)\, \hat{\tau}^-_t(\tau) \quad (32)$$

Its complexity class is simply linear in the number $q$ of compass sensors used, so $\hat{\tau}_t \in O(q)$. At each time step $t$, the translational filter outputs a stable translational estimate $\hat{\tau}_t = Bel_t(\tau)$. Together, the orientational filter and the translational filter yield, per time step $t$, a full estimate of the robot's pose:

$$\hat{x}_t = \langle \hat{\phi}_t, \hat{\tau}_t \rangle \quad (33)$$
This estimate can subsequently be used by the robot for planning and navigation.
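A minimal sketch of the translational update of eqs. (31)-(32), treating the belief as a point estimate that is shifted by odometry and then blended with the sensor estimate of eq. (30); the point-estimate reading of eq. (31) is an assumption made here.

```python
import numpy as np

def translational_update(tau_prev, u_tau, tau_sensor, mixing=0.1):
    """Eqs. (31)-(32): odometry prediction followed by lambda-blending with the sensor estimate.

    tau_prev   -- previous position estimate (2-vector)
    u_tau      -- translational odometry since the last step (2-vector)
    tau_sensor -- position estimate from eq. (30), i.e. the likelihood-weighted training spots
    """
    tau_pred = np.asarray(tau_prev, float) + np.asarray(u_tau, float)   # motion model, eq. (19)
    return mixing * np.asarray(tau_sensor, float) + (1.0 - mixing) * tau_pred

print(translational_update([0.5, 0.2], [0.1, 0.0], [0.4, 0.1]))
```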
3.7 Summary
The main contribution of this section is the mathematical construction of the compass sensor, which can provide a mobile robot with heading likelihoods for each supplied camera image.
The approach has been constructed incrementally. At first, a model $p\left(z^{ij} \mid \phi, m^{ij}(\phi)\right)$ was described that applies to a single transition frequency sensor $Z^{ij}$, using a map $m^{ij}(\phi)$ of frequency histograms. It was then extended to a sensor $Z(\phi)$ that simultaneously reports the transition frequencies between every two color classes $i$ and $j$. Subsequently, it was explained how these transition frequencies can be sampled from the robot's camera images $f_t(\phi)\langle x, y \rangle$: the frequencies can be measured at an arbitrary angular resolution $\Delta\phi_{sensor}$. Finally, an algorithm was devised that can be used to learn new maps $\hat{m}$ autonomously in unknown environments. The derived sensor model $p\left(f_t \mid x_t, m\right)$ can be used by a robot to determine the likelihood that a captured picture $f_t$ originates from a certain heading $\phi$.

Two localization filters have been introduced that a robot can use to estimate its orientation $\hat{\phi}$ and translation $\hat{\tau}$, respectively (together forming the robot's pose $x = \langle \phi, \tau \rangle$). As a prerequisite, a motion model has been defined that the filters can use to update their beliefs according to the data from odometry. The orientational filter maintains the robot's beliefs in the form of a circular belief buffer $Bel_t(x)$ at a resolution $\Delta\phi_{filter}$. It uses the motion and sensor models to update its beliefs by applying the update rules of Bayesian belief filters. A mixing constant $\lambda$ has been introduced that regulates how much of the robot's beliefs are replaced by new sensor readings. The translational filter requires four sensors, and uses the maximal likelihoods $L^{(i)}_t$ of the individual sensor posteriors $p\left(f_t \mid \phi, \hat{m}^{(i)}(\phi)\right)$ to interpolate the robot's translational pose $\hat{\tau}$.
4 Implementation on a Sony Aibo
The Sony Aibo is an entertainment robot used by many institutes as a robust research platform: it has a turnable head with a camera, four legs with 3 DOF each, and an internal 64-bit MIPSEL processor running at 567 MHz. The continuous approach described in the previous section was discretized in order to become applicable in an implementation (see fig. 3). A few parameters have been chosen as a starting point for exploration.
Fig. 3. The robot virtually divides its view into sectors of $\Delta\phi_{sensor} = 4.5°$. Left: camera image with sector boundaries superimposed; the dominant color class transitions are displayed below each sector. Right: visualization of a map learned at the UvA Robolab; per sector, only the most frequent color class transition is displayed.
4.1 Sensor model
An angular resolution of $\Delta\phi_{sensor} = 4.5°$ was chosen for the map and the sensor model. A finer resolution is not realistic, because the camera is located in the head of the robot: the head can be steered along three axes, but there is some variance in the control of the corresponding motors [15]. The number of color classes was set to $m = 10$, equal to the number of (artificial) colors on a RoboCup soccer field. Parameter studies revealed that this number can be reduced [15]. Further, a scanning grid evaluates by default 25% of the pixels of each incoming camera image. A further reduction is possible [15].
4.2 Localization filters
The circular belief grid of the orientational filter operates at an angular resolution of $\Delta\phi_{filter} = 1°$, and the mixing constant $\lambda$ was chosen such that half of the beliefs are replaced after 0.3 s. The translational filter uses a pattern of 4 training spots to estimate the robot's position. Due to the limited processing power of the Aibo, it was decided to update only a single sensor model $p\left(f_t \mid \phi, \hat{m}^{(j)}\right)$ per frame, leaving the others as they were. The filter updates its beliefs with the same mixing rate $\lambda$ as the orientational filter.
4.3 Behaviors
Several behaviors have been implemented that allow the robot to calibrate and learn natural environments autonomously. The learning behavior allows the robot to learn a full circle in 8 steps of 45◦ . Per default, the learning time in each direction is 5s. Additionally, two localization behaviors were created that can be used to measure and demonstrate the capabilities of the visual compass approach. A simple button interface was implemented that allows the user to activate the most important functions.
5 Results
The localization filter as constructed in section 3 consists of two parts: an orientational and a translational filter. Both filters have been evaluated separately, in sections 5.1 and 5.2. Further experiments have been conducted to evaluate the robustness of the approach, validate its applicability in several natural environments, measure the actual computation times and evaluate the initial choice of parameters as proposed in section 4.

The quantitative experiments presented here have been conducted in the Robolab of the Universiteit van Amsterdam, a small room of about 7x7 meters. On one side of the room there are tables, chairs and computers; posters on the walls create a colorful panorama. The experiments have taken place on a soccer field of 6x4 meters. The artificial landmarks around the field are small and have a minor influence on the results; they can be removed without problems [15]. The applicability of this approach to environments other than our Robolab has been presented elsewhere [16].
5.1 Orientational filter
The robot learned a map at the center of the rear field half, and was then moved manually across the field while its rotational estimates were recorded. The robot was moved across an exhaustive grid of 3x3 meters at a resolution of approximately 33 cm. The results are displayed in fig. 4 (left). All heading estimates point towards a single point on the opposite wall. Moreover, the variances increase the further the robot gets away from the training spot. The same results are displayed in fig. 4 (right) in numerical form: the average true error of the rotational estimate indeed increases only slightly, and exceeds the 10° mark only after the robot is more than 2 m away from its training spot. The average variance crosses the 50° mark at approximately this point.

Fig. 4. Exhaustive compass experiment conducted at the UvA Robolab. Left: the heading estimates point towards the opponent goal; close to the training spot, the variance is relatively small, but it increases steadily the further the robot is moved away. Right: quantitative results of the orientation filter experiments; true error and variance plotted versus the distance from the training spot.
5.2 Translational filter
A visual homing experiment was carried out on the soccer field of our Robolab. The kickoff circle was chosen as the target spot, and the robot autonomously learned the panorama of four adjacent spots at a distance of 1 meter from the center spot. Although the robot walked to the training spots using the latest map in its orientational filter to support odometry, it sometimes strayed a bit and then had to be aligned manually before map learning on the next spot started. Finally, the robot walked back to the target spot in order to perform a normalization of the individual likelihoods (so that unity likelihoods $L^{(i)}_t = 1$ were reported by all compass sensors at the target spot).

After learning was complete, the robot was kidnapped and placed somewhere randomly on the field (with random orientation). From there, the robot had to walk back to the center using both the orientational and translational filter, and the position $\tau_t$ where the robot stopped (for longer than 5 seconds) was marked with a sticker on the carpet.

In total, 33 kidnappings were executed and the reached positions were measured (see fig. 5, left). In most cases the robot stopped close to the target spot. However, it can also be seen that the robot walked (on average) to a spot slightly behind and to the right of the center spot. Finally, because it is not visible in the graph, it has to be stated that the robot never left the field or stopped at a point outside the displayed area.

The measured positions $\tau_t$ of this experiment can also be expressed numerically, by computing their statistical mean and variance. The average position where the robot stopped can be expressed as $\mathrm{mean}(\tau_t) = (-22\,\mathrm{cm},\ +12\,\mathrm{cm})^T$ with a variance of $\mathrm{var}(\tau_t) = (17\,\mathrm{cm},\ 15\,\mathrm{cm})^T$. A standard deviation of this magnitude is acceptable when compared to other techniques. The Dutch Aibo Team [17], for example, reports a standard deviation of 13.5 cm. It should be noted, however, that in the current state of the translational filter, visual homing is only possible to a single (central) spot, while other localization approaches directly yield a positional estimate for each position on the field.
In order to get more than a qualitative estimate of the current position for spots other than the center, it would be necessary to gain more insight into the shape of the likelihoods when the robot is moved away from the training spot (see fig. 5, right). It remains a problem to interpret the likelihoods more precisely in order to deduce the robot's absolute position. In the figure, a monotonic decrease in all directions exists. However, it is also visible that this decrease in the likelihoods does not seem to have a linear shape. Therefore, in order to achieve full translational localization, it would be necessary to find ways to link the likelihoods more accurately to the robot's distance. The systematic deviation to the bottom right can be interpreted such that the likelihoods of the individual sensor models (or maps) do not decrease equally fast; the likelihood normalization at only a single spot does not seem to be sufficient.

Considering these results, it can be concluded that the visual compass approach can in principle be used for translational localization. Several aspects remain for further research.
Fig. 5. Experiments of translational filtering. Left: measured positions in a visual homing experiment; the four training spots are marked with a yellow cross, and the blue dots correspond to the positions where the robot stopped. Right: reported likelihoods $L^{(i)}_t\left(f_t \mid m^{(i)}\right)$ when the robot is moved across a grid with a resolution of 50 cm. The robot was trained at the spot $T = (100\,\mathrm{cm},\ 0\,\mathrm{cm})^T$.
5.3 Further experiments
Several more experiments have been conducted that validated other important properties [15]:

• The applicability in other natural environments has been proven by repeating the experiment of section 5.1 at different places: an outdoor (human) soccer field, a home environment and an office.
• By modifying the illumination, it could be shown that the algorithm still performs well with only half the illumination present during learning.
• Parameter studies revealed that three color classes are sufficient for accurate compassing. Moreover, the pixel coverage of the scanning grid could be reduced to less than 1% of the camera image without substantial impact on the variance estimate.
• Runtime experiments proved the approach to be particularly efficient: with the original set of parameters, a single camera frame can be processed in less than 4 ms for learning and 13 ms for localization.
6 Discussion and future work
In what follows, the most important features of our approach that have been validated by the experiments are summarized in short propositions:

(1) The visual compass approach can supply a mobile robot with accurate heading information.
(2) Visual compassing is applicable in a vast variety of different environments.
(3) The reported heading information is robust to environmental changes.
(4) Robots using visual compassing can learn quickly (and almost autonomously) to orientate themselves in new environments.
(5) Multiple learned spots can provide (absolute) positional information.
(6) The visual compass algorithm is efficient and economic; it can be executed in real time on board a mobile robot, and even works with low-quality cameras.

The characteristics of the visual compass can be compared to several existing approaches. A natural comparison is the original soccer localization module [17] of the Dutch Aibo Team itself. It relies both on artificial landmarks and on manual color calibration, and can therefore be categorized as a feature-based approach (and is thus the opposite of the visual compass approach). The resulting performance, however, turns out to be comparable: the rotational accuracy of the Dutch Aibo Team soccer code was measured to be around 5°, while its positional accuracy was within 13.5 cm. The relatively complex algorithm resulted in a reduced frame rate of (suboptimal) 17.2 fps. All in all, and neglecting all the differences between the two approaches, the resulting accuracies are of comparable magnitude.

Other approaches in the literature use stable mobile platforms and panoramic cameras, and therefore allow the robot to take 360° panorama pictures directly with a constant horizon. The visual compass as described in [6,7] yields comparable results (error below 10°), but shows a long-term drift, as at each time step it only compares the two most recent camera images. Interesting about this research is the comparison in accuracy with a traditional magnetic compass, which was easily outperformed by their method: magnetic compasses turned out to be useless in their indoor experiments, and only comparable with their results outdoors. Therefore, the conclusion can be drawn that the approach presented in this article also outperforms a traditional magnetic compass in accuracy. Moreover, their proposed distance measure is computationally very expensive, and so their visual compass could only be evaluated at 2 fps.

Another approach is described in [8,9]. Incoming images from a panoramic camera are split into sectors, and only each sector's average color is subsequently used for learning and matching¹. However, instead of relying on a single (or small number of) training spot(s), they used a set of thousands of training images associated with the true pose, which were shot beforehand on long tours through the test environment (a small supermarket). The achieved accuracy was relatively low, namely never better than 31 cm; on average the localization error was around 50 cm. Although their approach is probably lightweight enough to be used in real time on a mobile robot, the evaluation was done offline (using the datasets).

It can be concluded that the visual compass is a lightweight alternative to classical landmark-based localization, which can be used on almost any visually guided mobile robot due to its good scalability and extreme efficiency.
¹ Note that in our approach an average color can be derived from the matrix $Z(\phi)$.
7 Acknowledgement
The work described in this paper has been carried out at the Intelligent Systems Lab Amsterdam of the Informatics Institute of the Universiteit van Amsterdam. It builds on software developed by the Dutch Aibo Team. At the RoboCup World Championship 2006, the Dutch Aibo Team [18] won the 3rd prize in the Technical Challenges of the 4-Legged League by demonstrating the implementation of the visual compass on a Sony Aibo as described in this paper.
References
[1] R. A. Brooks, How to build complete creatures rather than isolated cognitive simulators, Architectures for Intelligence (1991) 225–240.
[2] A. K. Mackworth, On Seeing Robots, World Scientific Press, Singapore, 1993, Ch. 1, pp. 1–13.
[3] S. Thrun, W. Burgard, D. Fox, Probabilistic Robotics (Intelligent Robotics and Autonomous Agents), The MIT Press.
[4] D. G. Lowe, Object recognition from local scale-invariant features, in: Proceedings of the International Conference on Computer Vision (ICCV), Corfu, 1999, pp. 1150–1157.
[5] S. Se, D. Lowe, J. Little, Mobile robot localization and mapping with uncertainty using scale-invariant visual landmarks (2002).
[6] F. Labrosse, Visual compass, in: Proceedings of Towards Autonomous Robotic Systems.
[7] F. Labrosse, The visual compass: performance and limitations of an appearance-based method, Journal of Field Robotics 23 (10) (2006) 913–941.
[8] H.-M. Gross, A. Koenig, C. Schroeter, H.-J. Boehme, Vision-based Monte Carlo self-localization for a mobile service robot acting as shopping assistant in a home store, in: Proceedings of the International Conference on Intelligent Robots and Systems (IROS 2002), Vol. 1, IEEE omnipress, Piscataway NJ, USA, 2003, pp. 256–262.
[9] H.-M. Gross, A. Koenig, C. Schroeter, H.-J. Boehme, Omnivision-based probabilistic self-localization for a mobile shopping assistant continued, in: Proceedings of the International Conference on Intelligent Robots and Systems (IROS 2003), Vol. 2, IEEE omnipress, Las Vegas, USA, 2003, pp. 1505–1511.
[10] R. Moller, D. Lambrinos, T. Roggendorf, R. Pfeifer, R. Wehner, Insect strategies of visual homing in mobile robots, in: Proceedings of the Computer Vision and Mobile Robotics Workshop (CVMR'98), Heraklion, Greece, 1998.
[11] A. Vardy, F. Oppacher, Low-level visual homing, in: Proceedings of the 7th European Conference on Artificial Life (ECAL), Springer Verlag, Berlin, 2003, pp. 875–884.
[12] D. van Soest, M. de Greef, J. Sturm, A. Visser, Autonomous color learning in an artificial environment, in: Proc. 18th Dutch-Belgian Artificial Intelligence Conference (BNAIC'06), Namur, Belgium, 2006.
[13] J. J. Verbeek, Mixture models for clustering and dimension reduction, Ph.D. thesis, Universiteit van Amsterdam (December 2004).
[14] J. M. Porta, B. J. A. Kröse, Appearance-based concurrent map building and localization, Robotics and Autonomous Systems 54 (2) (2006) 159–164.
[15] J. Sturm, An appearance-based visual compass for mobile robots, Master's thesis, Universiteit van Amsterdam (December 2006).
[16] A. Visser, J. Sturm, F. Groen, Robot companion localization at home and in the office, in: Proc. 18th Dutch-Belgian Artificial Intelligence Conference (BNAIC'06), Namur, Belgium, 2006.
[17] J. Sturm, A. Visser, N. Wijngaards, Dutch Aibo Team: Technical report RoboCup 2005, Tech. rep., Dutch Aibo Team (October 2005).
[18] A. Visser, J. Sturm, P. van Rossum, J. Westra, T. Bink, Dutch Aibo Team: Technical report RoboCup 2006, Tech. rep., Dutch Aibo Team (December 2006).