Initialisation for Visual Tracking in Urban Environments

Gerhard Reitmayr∗   Tom W. Drummond†

Engineering Department, Cambridge University, Cambridge, UK
∗e-mail: [email protected]   †e-mail: [email protected]

ABSTRACT

Outdoor augmented reality systems often rely on GPS to cover large environments. Visual tracking approaches can provide more accurate location estimates but typically require a manual initialisation procedure. This paper describes the combination of both techniques to create an accurate localisation system that does not require any additional input for (re-)initialisation. The 2D GPS position together with the average user height is used as an initial estimate for the visual tracking. The large gap between the accuracy available from GPS and the accuracy required for initialisation is overcome through a search procedure that tries to minimise search time by improving the likelihood of finding the correct estimate early. Re-initialisation of the visual tracking system after catastrophic failures is further improved by modelling the GPS error with a Gaussian process to provide a better estimate of the current location, thereby decreasing search time.

Index Terms: H.5.1 [Information Systems]: Multimedia Information Systems—Augmented Reality; I.4.8 [Image Processing and Computer Vision]: Scene Analysis—Tracking, Sensor Fusion

1 INTRODUCTION

Augmented reality in large-scale environments poses a challenging problem for localisation technologies. The simultaneous requirements of accuracy, real-time performance, robustness and coverage are not met by any single approach and call for a solution that combines the strengths of different approaches. Vision-based tracking delivers accuracy at interactive frame rates [10], feature-based localisation achieves accuracy and coverage [12], while GPS provides robustness and coverage. Combining any of these approaches with additional sensors such as inertial measurement units improves robustness (see e.g. [17] for the combination of vision-based tracking with inertial sensors or [15] for the combination with GPS).

Outdoor augmented reality setups have long made use of the simple combination of GPS and magnetic compasses for localisation [6, 9]. However, such systems are inherently limited by the accuracy of GPS. While GPS can deliver good performance in open areas, its accuracy typically degrades severely in urban environments. This is due to a number of contributing factors, such as general signal attenuation through foliage and roofs, buildings blocking the line-of-sight to satellites in urban canyons, and multipath signals reflected from building surfaces [7]. More advanced GPS techniques, including Differential GPS and RTK, cannot counter these factors because they depend highly on the local environment.

Vision-based tracking technologies usually meet the accuracy requirements of augmented reality applications. However, initialisation is usually the missing link between an accurate visual tracking system and a fully operational augmented reality system.



Figure 1: Sample camera track (black) with uncorrected GPS track (blue) and corrected GPS track (red), all projected into the ground plane as east and north map coordinates. The corrected estimate allows quicker re-initialisation.

Often, an initialisation step is required that assumes knowledge of the current position to within close bounds. This initial position could be provided by GPS, thus allowing the combined system to cover any area. However, the fundamental difficulty in combining GPS with vision-based methods lies in the difference between measurement accuracies. GPS only provides position measurements to within 5–10 m in urban environments, while our vision-based tracking methods require sub-meter accuracy to initialise correctly.

Several approaches to improve the accuracy of GPS estimates under urban canyon conditions have been proposed. Using a precise external clock, fewer than four satellites are required for 3D localisation [13]; combination with inertial sensors can bridge gaps in GPS reception [15]; and map knowledge makes it possible to impose additional constraints and create a fix with fewer satellites [3]. The first approach depends on additional hardware support in the receiver and is not yet standard in simple COTS devices. Pedestrians move independently of coarse map features such as street centre lines, making it impossible to use these as additional constraints. Pedestrian localisation using a hybrid GPS/INS system [11] or shoe-mounted inertial sensors [5] has recently been investigated as well, but these systems still require multiple sensors worn at various locations on the body to accurately detect human motion.

This work discusses the extension of a hybrid tracking system with a GPS sensor to add initialisation and robustness against catastrophic failures of the original system. The GPS-based measurements provide initial 2D position information at start-up that allows the vision-based component to initialise correctly. To overcome the inaccuracy of the GPS position, a local search in the neighbourhood of the estimate is performed, in a manner similar to the approach by Coors et al. [2]. During operation, the same procedure can also carry out re-initialisation to recover from any catastrophic failures of the vision-based component.

To improve the user experience, initialisation time should be as short as possible. Several methods to reduce the (re-)initialisation time are discussed and evaluated. While the vision-based tracking component is operating correctly, information about the GPS error is collected. If a re-initialisation is required, the system uses the acquired information to extrapolate the GPS error and produce a more correct estimate of the user's position. The improved estimate allows the initialisation procedure to restart the vision-based component in less time than from a GPS cold start.

2 ROBUST MODEL-BASED TRACKING

The work described in this paper builds on a hybrid, model-based tracking system developed in earlier work [10] that provides a robust tracking solution for urban environments. The system combines an efficient edge-based tracking component with a recovery mechanism to overcome transient failures of the visual tracking. Here we briefly review some key aspects of the original system that are important for the following technical discussion regarding the radius of convergence of the visual tracking system and the GPS-based (re-)initialisation.

2.1 Edge-based tracking from a textured model

The model-based tracking system follows the original edge-tracking method described by Drummond & Cipolla [4]. One-dimensional measurements of the motion of sample points placed on model edges are made in the live video frame. A non-linear optimisation of the camera pose parameters minimises the distances between all measured offsets and the re-projected sample points and calculates a new estimate for the camera pose given the current frame. Given a camera pose C = (R, T) as a rigid transformation from the world coordinate system into the camera coordinate system, a point x = (x, y, z, 1)^T is projected into the view at the point (u, v) through the following observation function

\[ \begin{pmatrix} u \\ v \end{pmatrix} = \operatorname{proj}(C\mathbf{x}). \tag{1} \]

The rigid transformation is represented as a 4 × 4 matrix such that Cx = Rx + T. The function proj(·) models the projection from the camera frame to image coordinates as

\[ \operatorname{proj}\begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix} = \begin{pmatrix} f_u & 0 & c_u \\ 0 & f_v & c_v \end{pmatrix} \begin{pmatrix} r'\,x/z \\ r'\,y/z \\ 1 \end{pmatrix}, \tag{2} \]

where r = \sqrt{(x/z)^2 + (y/z)^2} and r' = 1 + \alpha r^2 + \beta r^4 are terms to compensate for radial lens distortion. The parameters f_u, f_v, c_u, c_v, α and β are determined in an off-line camera calibration step.
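To make the projection model concrete, the following sketch evaluates equations (1) and (2) with NumPy. All calibration values are placeholders chosen for illustration, not the paper's actual parameters:

```python
import numpy as np

# Placeholder calibration parameters (determined off-line in the paper).
f_u, f_v = 420.0, 420.0    # focal lengths in pixels (assumed values)
c_u, c_v = 160.0, 120.0    # principal point for a 320x240 frame (assumed)
alpha, beta = -0.25, 0.05  # radial distortion coefficients (assumed)

def proj(x_cam):
    """Project a 3D point in camera coordinates to pixel coordinates,
    applying the radial distortion model of equation (2)."""
    x, y, z = x_cam[:3]
    xn, yn = x / z, y / z                  # normalised image coordinates
    r2 = xn * xn + yn * yn                 # r^2 = (x/z)^2 + (y/z)^2
    rp = 1.0 + alpha * r2 + beta * r2**2   # r' = 1 + alpha r^2 + beta r^4
    return np.array([f_u * rp * xn + c_u, f_v * rp * yn + c_v])

# Pose C = (R, T): world point -> camera frame, then projection (equation (1)).
R = np.eye(3)
T = np.array([0.0, 0.0, 19.0])   # e.g. a camera 19 m in front of the model
x_world = np.array([1.0, -0.5, 0.0])
u, v = proj(R @ x_world + T)
```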

A set of sample points x_i placed at edges in the model image is projected into the video frame. Then a one-dimensional search for the best edge correspondence is performed along the projected normal n_i of the local edge, starting from the projected point location. The distance d_i of the correspondence from the projected point is a single measurement to be optimised.

The tracking step then computes a motion M of the camera such that the projected points under C⁺ = MC align with the observed correspondences. The motion M is parametrised by the six-vector µ corresponding to the exponential map parametrisation of the Lie group SE(3). Through differentiation of equation (1) with respect to µ and projection onto the normal of the edge, we obtain a Jacobian matrix J as

\[ J_{i,j} = \frac{\partial d_i}{\partial \mu_j} = \mathbf{n}_i \cdot \begin{pmatrix} \partial u / \partial \mu_j \\ \partial v / \partial \mu_j \end{pmatrix} \tag{3} \]

for the observations d_i. The motion parameters are then found by solving the equation

\[ J\mu = \mathbf{d}, \tag{4} \]

where d is the vector of measurements d_i. A robust M-estimator is used to solve for the parameter vector µ in the presence of outlier measurements.
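One tracking iteration then reduces to a robustly reweighted least-squares solve of equation (4). The paper does not name the influence function used; the sketch below uses Huber weights as one plausible choice:

```python
import numpy as np

def tracking_step(J, d, huber_k=1.345):
    """One robust step: solve J mu = d for the 6-vector mu with iteratively
    reweighted least squares (Huber M-estimator). mu parametrises the
    camera motion M in SE(3) via the exponential map."""
    mu = np.zeros(J.shape[1])
    for _ in range(5):                            # a few reweighting passes
        r = d - J @ mu                            # current residuals
        s = 1.4826 * np.median(np.abs(r)) + 1e-9  # robust scale (MAD)
        a = np.abs(r) / s
        w = np.where(a <= huber_k, 1.0, huber_k / a)   # Huber weights
        JW = J * w[:, None]
        mu = np.linalg.solve(JW.T @ J, JW.T @ d)  # weighted normal equations
    return mu

# Example: 200 edge measurements against 6 pose parameters, with outliers.
rng = np.random.default_rng(0)
J = rng.normal(size=(200, 6))
d = J @ rng.normal(size=6) + rng.normal(scale=0.1, size=200)
d[:20] += 5.0                                     # simulated bad correspondences
mu = tracking_step(J, d)
```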

The edge sample points used in the measurement step are not determined a priori in a CAD model of the environment but rather extracted during operation from a rendered view of the textured 3D model. Using the prior pose estimate C, the system renders a view of the textured model. The resulting image is subjected to edgel extraction, and the corresponding 3D points are obtained through back-projection of the 2D edgel locations onto the model. Then a random, fixed-size subset of the sample points is used to obtain edge displacement measurements. The rendering step provides automatic filtering of the appearance model based on the current estimate of camera pose. Features such as small edges or discontinuities in the model with similar colours on either side do not contribute to the visual appearance and therefore are not used for measurements, reducing wrong data associations and incorrect pose estimates.

2.2 Sensor fusion with orientation sensor

The model-based tracking system is integrated with a sensor pack measuring 3D rotation rates, 3D acceleration and 3D magnetic field strength to be more robust against fast motions and to have an absolute orientation reference through gravity and the magnetic field. Sensor fusion is implemented with a standard extended Kalman filter using a constant velocity model for the camera pose dynamics. Different inputs, such as a camera pose from the tracking system or measurements from the sensor pack, are incorporated using individual measurement functions in a SCAAT-style approach [16]. Details of the filtering and sensor fusion can be found in [10].

2.3 Tracking quality test and recovery

The visual tracking system alone provides accurate and efficient estimation of camera pose in the urban environment. However, the system assumes a static world, frame-to-frame coherence and a complete and precise model to operate successfully. These assumptions are often violated in real environments, through transient occluders such as cars, buses and passing pedestrians, through quick and erratic motion, or by pointing the camera at featureless or unmodelled parts of the environment such as the ground, sky or a blank wall. To overcome these limitations, the system includes a statistical test for failure and a recovery method based on stored video frames.

After a new pose has been estimated, a test compares the similarity of the expected features to the image intensities in the video image at the re-projected feature locations. The test compares the distribution of individual edge signature normalised cross-correlation scores with trained distributions for correct and failed tracking and decides which distribution is more likely to have generated the observed scores. It computes the log-likelihood ratio between the two hypotheses, yielding a real number with a threshold value of 0 between the two outcomes. Once the system has determined that a frame could not be successfully tracked, it engages an additional recovery system that provides a new camera pose estimate.

As the test takes only differences in appearance into account, a video image can be compared to a rendered image from any pose, not only from the last posterior pose after optimising the image measurements. Therefore, the test also allows the system to decide whether a given camera pose hypothesis is likely to be correct or not. Moreover, the test score makes it possible to order posterior poses according to their likelihood of explaining the observed video image. These properties will be used in the search strategy for initialisation described further in section 4.
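As a concrete illustration, such a test can be sketched as a log-likelihood ratio over per-edgel normalised cross-correlation scores. The Gaussian score models below are placeholders; the paper trains these distributions but does not state their form:

```python
import numpy as np
from scipy.stats import norm

# Assumed trained score models for correct and failed tracking
# (illustrative parameters, not the paper's trained distributions).
good = norm(loc=0.8, scale=0.15)   # NCC scores when tracking is correct
bad  = norm(loc=0.2, scale=0.25)   # NCC scores when tracking has failed

def tracking_quality(ncc_scores):
    """Log-likelihood ratio test over all edgel scores.
    A positive value means the 'tracking correct' hypothesis wins."""
    llr = np.sum(good.logpdf(ncc_scores) - bad.logpdf(ncc_scores))
    return llr, llr > 0.0
```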

3 CONVERGENCE OF VISUAL TRACKING

Due to its iterative nature, the visual tracking process only converges to the correct pose estimate if the prior pose estimate is sufficiently close. Otherwise, the process may converge to a local minimum that is rather arbitrarily determined by random similarities between the rendered view and the video frame. The appearance of the model changes in a highly non-linear manner with respect to the camera pose, and thus local search cannot provide a path to the correct global minimum from outside a restricted neighbourhood.

Through linearisation of the observation function, an estimate of the neighbourhood of convergence can be obtained. Let J be the differential of the observation function (1) with respect to the 6 camera pose parameters, evaluated at a point x in the world. Then δp = Jδµ is the linear approximation to the change in the projection of the point x given a change δµ in camera pose. We can estimate the maximal change using the matrix norm of J induced by the Euclidean distance:

\[ \|\delta p\| = \|J\,\delta\mu\| \le \|J\|\,\|\delta\mu\| = \sqrt{\lambda_{\max}(J^T J)}\;\|\delta\mu\|, \tag{5} \]

where λ_max(A) denotes the largest eigenvalue of A, which corresponds here to the largest singular value of J. The visual tracking employs a local search with a maximal distance of r pixels. To ensure that the search will localise the correct correspondence, and therefore that the algorithm converges to the right minimum, we have to ensure that ‖δp‖ ≤ r. Through equation (5) we obtain the following series of inequalities to arrive at a bound for ‖δµ‖:

\[ \|\delta p\| \le \|J\|\,\|\delta\mu\| \le r \tag{6} \]
\[ \|\delta\mu\| \le \frac{r}{\|J\|}. \tag{7} \]

Thus, given a point on our model, we can provide a neighbourhood of the correct camera pose within which convergence is guaranteed, provided that the data correspondences found by the search are correct. This neighbourhood is only an underestimate of the real convergence area.
Fig. 2 shows a numerical simulation of the convergence area. The camera is 19 m in front of the model, the search radius within the image corresponds to 20 pixels, and the maximal convergence radius given by equation (7) is about 0.79 m. A horizontal, regular grid of starting positions with 0.1 m spacing around the true camera pose of the corresponding video frame was used to initialise one iteration of tracking per position. The cross denotes the true camera pose and the circle the convergence area of 0.79 m around it. The light grey area shows points for which the posterior pose estimate was outside the area of convergence, while the white area denotes points for which the posterior estimate was inside. The dark grey outer area denotes points where the tracking quality test failed. Note that the convergence area almost touches the border of the set of converging points, indicating that it is indeed a close estimate of the maximal distance in that direction.

Two important conclusions can be drawn from the simulation. Firstly, the convergence area estimated by equation (7) is a close but reliable estimate of the maximal error in the initial pose that the tracking algorithm can tolerate. Secondly, the set of points with positive tracking quality test results (where the converged pose was deemed correct) almost coincides with the points yielding a result within the computed convergence area, indicating that we can immediately discard a starting point as a wrong hypothesis if the test fails.
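In code, the bound of equation (7) is simply the image search radius divided by the largest singular value of the point Jacobian. A minimal sketch (the Jacobian values are illustrative, not taken from the paper):

```python
import numpy as np

def convergence_radius(J, search_radius_px=20.0):
    """Bound (7): ||delta_mu|| <= r / ||J||, where ||J|| is the largest
    singular value of J, i.e. sqrt(lambda_max(J^T J)) from equation (5)."""
    J_norm = np.linalg.svd(J, compute_uv=False)[0]
    return search_radius_px / J_norm

# Illustrative 2x6 Jacobian of one projected point w.r.t. the pose parameters.
J = np.array([[25.3, 0.0, 1.2, 0.4, 22.1, 3.0],
              [0.0, 25.3, 0.8, 21.7, 0.2, 2.5]])
print(convergence_radius(J))  # upper bound on the tolerable initial pose offset
```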

Figure 2: Numerical evaluation of the convergence area. The true camera pose is marked with a cross and the convergence area with the circle centred on the cross. Dark grey denotes areas where the quality test fails and light grey denotes areas where the posterior pose is not within the convergence circle.

The observed convergence shape is actually more ellipsoidal than circular, due to the two different eigenvalues present in the matrix J^T J. However, the area shown is valid for a single point only. As the system uses multiple points across the width of the image to compute a pose estimate, the true convergence area is the intersection of several individual ones and thus smaller. Therefore, the system assumes a circular convergence area to ensure convergence.

4 INITIALISATION FROM GPS

The visual tracking system requires a rather accurate initial camera pose estimate, as discussed in section 3. The integrated magnetic field and gravity sensors can provide an accurate estimate of the camera's orientation, while the average height of a person is a useful estimate for the camera height. However, we have no good prior for the remaining two dimensions, the location on the ground plane. Here, GPS provides a robust and simple sensor to fill in the missing parameters for the 2D ground position.

Unfortunately, the accuracy provided by GPS in urban environments does not meet the tight bounds required to robustly initialise the visual tracking system. In our own experiments we found the GPS error to have a standard deviation σ = (1.9 m, 4.3 m) in the east-west and north-south directions respectively. The anisotropic error distribution was observed in all data sets, independent of the influence of multipath disturbances; therefore we chose to represent it explicitly in the system. To initialise with high likelihood, an area described by an ellipse with axes 3σ = (5.7 m, 12.9 m) around the reported GPS location has to be taken into account. This is far too large for the convergence radius demonstrated for the visual tracking system.
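A sketch of how such an initial pose prior could be assembled; the function and the 1.6 m eye height are our assumptions for illustration, with the orientation taken from the sensor fusion of section 2.2:

```python
import numpy as np

AVERAGE_EYE_HEIGHT = 1.6  # metres above ground; assumed nominal value

def initial_pose(gps_east, gps_north, R_world_to_cam, ground_height=0.0):
    """Assemble a 6-DOF prior: 2D position from GPS, height from an average
    person model, orientation from the magnetic field and gravity sensors."""
    t = np.array([gps_east, gps_north, ground_height + AVERAGE_EYE_HEIGHT])
    C = np.eye(4)
    C[:3, :3] = R_world_to_cam
    C[:3, 3] = -R_world_to_cam @ t  # so that C x = R x + T, camera centre at t
    return C
```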


Figure 3: Sample GPS track for a stationary camera with the search grid overlaid for a selected point. The ellipse denotes the 3σ area of the GPS estimate, the red circle the convergence area for the true camera pose.

4.1 Search pattern

To overcome this discrepancy, we employ a search strategy. From an initial search location, such as the current GPS estimate, the visual tracking is deployed iteratively. If the initial search location is too far away from the true camera location, we cannot expect the local minimum to lie close to the true location, because the relationship between visual appearance and camera pose is too complex. However, using the statistical test described in section 2.3, we can decide whether the tracking converged to the true location or not, as long as the true location is no further away than the maximal convergence radius described in section 3. Thus, if the initial search location is within the convergence area of the true position, the tracking quality test should be positive.

To cover the whole area indicated by a GPS measurement, a grid of initial search locations is used. Given the maximal convergence radius, a 2D search grid is created to cover the 3σ area given by the GPS standard deviation. Using separate standard deviations for latitude and longitude provides a tighter bound on the search area than just using the largest deviation. The search grid has a hexagonal pattern to minimise the number of locations required to cover the whole area (see Fig. 3); a possible construction of such a grid is sketched at the end of this subsection. For each of the search locations, one iteration of the visual tracking algorithm is computed: an image of the model is rendered, edges are extracted, and measurements in the video image are made to find a camera pose that minimises the measurement errors. Finally, the resulting pose and the score given by the tracking quality test are recorded.

For the example pose given in section 3, about 180 search locations are required to cover the uncertainty area. This requires rendering and tracking as many frames, totalling a run-time cost of about 12 s. This may be acceptable for the initial startup of the system. However, any reduction of this search time improves the usability of the system. Therefore, we investigated various heuristics to reduce the search time.
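One way such a grid could be built: hexagonal packing with spacing derived from the convergence radius, clipped to the 3σ ellipse. The function and all spacing choices are our own illustration, not the paper's implementation:

```python
import numpy as np

def hex_search_grid(center, sigma_en, conv_radius):
    """Hexagonally packed search locations covering the 3-sigma GPS ellipse.
    Circles of radius conv_radius cover the plane when grid points are
    sqrt(3)*conv_radius apart within rows spaced 1.5*conv_radius apart."""
    dx = np.sqrt(3.0) * conv_radius              # column spacing
    dy = 1.5 * conv_radius                       # row spacing
    a, b = 3.0 * sigma_en[0], 3.0 * sigma_en[1]  # ellipse semi-axes
    pts, row = [], 0
    for y in np.arange(-b, b + dy, dy):
        offset = 0.5 * dx if row % 2 else 0.0    # stagger alternate rows
        for x in np.arange(-a - offset, a + dx, dx):
            px = x + offset
            if (px / a) ** 2 + (y / b) ** 2 <= 1.0:   # keep inside ellipse
                pts.append(center + np.array([px, y]))
        row += 1
    return np.array(pts)

# Sigma from the paper's measurements (1.9 m, 4.3 m); radius from section 3.
grid = hex_search_grid(np.zeros(2), np.array([1.9, 4.3]), 0.79)
print(len(grid))   # on the order of the ~180 locations quoted in the text
```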

Figure 4: Search locations clipped against accessible areas. The light grey area denotes the street and sidewalk; light green points are valid locations, while dark red points are clipped.

4.2 Clipping

Using more information about the user's environment, the set of possible camera locations can be restricted and impossible ones removed from the search grid. For example, if the GPS estimate indicates that the user is standing close to a building wall, the system assumes that the user is not actually inside the building adjacent to that wall. Therefore, any search locations within the intersection of the building's footprint and the uncertainty area of the GPS measurement are excluded. A model of the accessible areas is included in the system, and every candidate search location is tested against it: a ray is projected downwards along the vertical from the 3D position of the given location and intersected with a polygonal model of the accessible ground surface. If an intersection is found, the location is above a valid area and is included in the set of locations to search (see Fig. 4).

4.3 Stopping early

Another approach to reducing the search time is to trade off the likelihood of obtaining the correct result against search time. Instead of searching all locations and finding the location with the maximal test score, the search can also be stopped as soon as a location with a good enough test score is found. The test already uses a threshold value of 0 to decide during operation whether tracking was successful or not (see section 2.3). Using the same threshold for the search, the stop-early method stops when a search location is found that generates an optimised location with a score above the threshold.

However, the result of the search now depends on the order in which the locations are processed. To stabilise the search order and to increase the probability of finding a good location early, the search locations are sorted by their likelihood given the current distribution of the camera pose as generated by the Kalman filter described in section 2.2. This leads to a search outwards from the current mean of the camera pose. However, the location found in this way may still not converge to the correct location: Fig. 2 shows that starting locations (shown in light grey) with a positive test score may still not converge to the correct pose. Therefore, to stabilise the search result, we also search the 6 neighbours of the first location above the threshold to improve the final estimate. The quality test score typically decreases monotonically with distance to the true location, so using the maximum in the neighbourhood of the first positive location yields a better estimate.
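A sketch of this stop-early search with neighbour refinement; here `try_tracking` stands in for one render-and-track iteration returning a pose and its quality score, and `neighbours_of` returns the 6 hexagonal neighbours (both are assumed helpers, not the paper's API):

```python
import numpy as np

def stop_early_search(locations, mean, cov, try_tracking, neighbours_of):
    """Visit locations in decreasing likelihood under the Kalman filter's
    current position distribution; stop at the first score > 0 and refine
    over its 6 hexagonal neighbours."""
    cov_inv = np.linalg.inv(cov)
    d2 = [(p - mean) @ cov_inv @ (p - mean) for p in locations]
    order = np.argsort(d2)                    # most likely locations first
    for idx in order:
        pose, score = try_tracking(locations[idx])
        if score > 0.0:                       # passed the quality test
            best_pose, best_score = pose, score
            for nb in neighbours_of(locations[idx]):
                p2, s2 = try_tracking(nb)     # refine over the neighbours
                if s2 > best_score:
                    best_pose, best_score = p2, s2
            return best_pose
    return None                               # no converging start found
```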

5 RECOVERY USING GPS


Figure 6: Example of a Gaussian process interpolating 3 sample points and extrapolating to the right. The black line denotes the mean function, while the grey area shows the 3σ interval.


Figure 5: GPS error for one coordinate at different time scales. (a) Slowly varying over a couple of hundred seconds. (b) Distinct bias changing only over hours.

The above initialisation procedure works well for startup. The user only needs to turn on the system, wait for GPS lock while pointing it at some visible building, and press the “Init” button. The system will quickly search through a set of initial locations and switch to the online tracking mode if the search for a converging starting location was successful.

Nevertheless, the search procedure can take up to 12 s if the GPS error is large. While this performance seems acceptable for system startup, it is rather slow for recovery from catastrophic failures during operation. Re-initialisation is required if the system fails to recover from a tracking failure through transient occluders or other disturbances despite the recovery mechanisms already built in. For example, if the user moves the device to a comfortable resting position occluding the camera's view and proceeds to walk to another location several tens of metres away, a quicker initialisation of the tracking system is desirable. Ideally, it should recover immediately and without any further manual intervention. However, as discussed in section 4, this seems impossible without further information to reduce the error in the GPS estimate.

Analysing the GPS position error, we find that while a normal distribution with mean 0 describes it well in the long term, it is also highly correlated over shorter time intervals. Fig. 5 shows that the GPS error typically varies slowly over a few tens of seconds, with some rapid changes interspersed. Similarly, there is a noticeable bias which changes only over multiple hours. Typically, applications model the GPS error as a normal distribution with zero mean and effectively treat GPS as a very noisy but unbiased sensor source. A more complex model is presented by Wang et al. [14], who add a cyclic term that roughly corresponds to the low-frequency bias seen in Fig. 5. Their model is used to create a realistic simulation of a virtual GPS source. However, there is no attempt to estimate the parameters of the model from the given data.

5.1 Modelling the GPS error

Re-initialisation can benefit from this time-local behaviour of GPS measurements because the error can be predicted and therefore reduced. To accomplish this, samples of the GPS error are recorded during operation of the tracking system by storing the offset between the camera pose estimate and the GPS estimate. Then, after a failure, the GPS error is extrapolated beyond the last known error sample and added to new GPS measurements to create a better estimate of the current position. We propose to model the GPS error using a Gaussian process [8].

A Gaussian process is a probability distribution over a set of functions f : x → y. If two values of x are considered, x₁ and x₂, then the probability distribution over f induces a probability distribution over y₁ = f(x₁) and y₂ = f(x₂). A Gaussian process specifies that the distribution of (y₁, y₂) is a multivariate Gaussian distribution where the covariance between y₁ and y₂ is a function of x₁ and x₂: cov(x₁, x₂). Most commonly this covariance is simply a function of the difference, cov(|x₁ − x₂|), defining a stationary Gaussian process. The covariance matrix for (y₁, y₂) can then be computed as

\[ \Sigma = \begin{pmatrix} \mathrm{cov}(0) & \mathrm{cov}(|x_1 - x_2|) \\ \mathrm{cov}(|x_1 - x_2|) & \mathrm{cov}(0) \end{pmatrix}. \tag{8} \]

The mean of (y₁, y₂) is usually chosen to be (ȳ, ȳ), where ȳ is the mean value of y over the range of x and may be zero. Similarly, the covariance matrix for a set of values y₁, …, yₙ can be built from the individual covariances, which are functions of x₁, …, xₙ.

If f(x) is known for several values x₁, …, xₙ and takes values y₁, …, yₙ, then the distribution for y = f(x) can be computed by building the distribution for (y, y₁, y₂, …, yₙ) and marginalising out y₁, …, yₙ, yielding a Gaussian distribution P(y | y₁, …, yₙ) = N(µ_y, σ_y²). The parameters are given by

\[ \mu_y = k^T \Sigma^{-1} \mathbf{y} \tag{9} \]
\[ \sigma_y^2 = \mathrm{cov}(0) - k^T \Sigma^{-1} k, \tag{10} \]

where

\[ \Sigma_{ij} = \mathrm{cov}(|x_i - x_j|) \tag{11} \]
\[ k_i = \mathrm{cov}(|x - x_i|). \tag{12} \]
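Equations (9)-(12) translate directly into a few lines of linear algebra. A minimal sketch, assuming a covariance function `cov_fn` that returns cov(|lag|) (the squared-exponential used in the example is only an illustration, not the paper's data-based function):

```python
import numpy as np

def gp_predict(x_train, y_train, x_query, cov_fn, jitter=1e-9):
    """Condition a zero-mean stationary GP on (x_train, y_train) and return
    the predictive mean and variance at x_query (equations (9)-(12))."""
    lags = np.abs(x_train[:, None] - x_train[None, :])
    Sigma = cov_fn(lags) + jitter * np.eye(len(x_train))   # equation (11)
    k = cov_fn(np.abs(x_query - x_train))                  # equation (12)
    mu = k @ np.linalg.solve(Sigma, y_train)               # equation (9)
    var = cov_fn(0.0) - k @ np.linalg.solve(Sigma, k)      # equation (10)
    return mu, var

# Example with an assumed squared-exponential covariance function.
cov_fn = lambda lag: 4.0 * np.exp(-(np.asarray(lag) / 60.0) ** 2)
t = np.array([0.0, 1.0, 2.0])
e = np.array([3.1, 3.0, 2.8])             # recorded GPS error samples (m)
mu, var = gp_predict(t, e, 30.0, cov_fn)  # extrapolate 30 s ahead
```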

The evaluation at a new sample point can be applied either to interpolate between measured values of f or to extrapolate beyond the interval of known samples [1]. In the latter case the distribution of a function value tends towards the normal distribution with mean 0 and variance cov(0), if the covariance function cov(|x_i − x_j|) converges towards 0 for |x_i − x_j| → ∞. This represents a model where the distribution has zero mean in the long run, which is a reasonable assumption for the error of the GPS estimate.

Fig. 6 demonstrates the principle of a Gaussian process. The centre line indicates the mean of the function distribution at all points, while the grey area shows the 3σ interval of the distribution. The mean function interpolates smoothly between the known sample values, and the overall distribution converges back to the general distribution given by N(0, cov(0)) for values far away from the samples.

5.2 Estimation of the covariance function


Figure 7: Covariance functions for the east and north directions.


The important parameter in the formulation of the Gaussian process is the covariance function cov, which determines how values close in time are correlated with each other. Typical applications of Gaussian processes use an analytical covariance function cov(x, y) = f(x, y; µ) and estimate the free parameters µ from a training data set. Possible choices are radial basis functions or Gaussian mixtures. For example, it would be possible to model the state evolution of the prediction step of a Kalman filter by choosing an appropriate covariance function. However, we found that a purely data-based covariance function outperforms analytical representations in our problem.

We determined an appropriate function by estimating the covariance function directly from a set of measurements. The estimated camera position as given by the visual tracking system and the GPS position estimate were recorded simultaneously, and the error was calculated as the difference between the two, with the variance equal to the camera pose variance plus additional noise to account for quantisation in the GPS measurement. The covariance function is then estimated as

\[ \mathrm{cov}(t_1, t_2) = \mathrm{cov}(|t_2 - t_1|) = \frac{\sum_{t=1}^{N - |t_2 - t_1|} d_t\, d_{t + |t_2 - t_1|}}{N - |t_2 - t_1|}, \tag{13} \]

where d_t is the sample at time t and N the number of available samples. Due to the fixed frequency of GPS updates at 1 Hz, the samples are taken at regular intervals and the sample times are effectively replaced with the indices 1 … N. The biased estimate of the covariance is used, because the mean is assumed to be 0 independent of the data. The estimated covariance function depends only on the difference in time, not on the actual time itself.

In our application, it is possible to represent the covariance function as a set of samples, because it will only be evaluated at regular intervals of t. These intervals coincide with the distances between the training points. Therefore, we can estimate the value of the covariance function at t directly using the estimator of the autocovariance (13). If evaluation of the covariance function at other points t were required, an analytical representation would be more appropriate. The estimation was done separately for the two dimensions of the GPS error. Fig. 7 shows a plot of the resulting two covariance functions.

5.3 Predicting the GPS error

Using the estimated covariance function, the local behaviour of the GPS error can be modelled given a set of measurements of the error. A fixed history of the last n error measurements is kept in the system. If visual tracking fails, the Gaussian process model is evaluated at the time stamp of each new GPS measurement, and the measurement is offset by the predicted mean error. n was chosen to be 80 by testing a range of values between 0 and 180 (3 minutes) and using the result with the lowest mean prediction error, as described in section 6.2.

Fig. 8 shows the predicted error starting from a set of assumed failure points. The predicted error does not track the GPS error in detail but typically captures the mean bias present in the estimate. This bias is usually due to systematic errors in the pseudo-range measurements generated by atmospheric effects and local multipath signals. Sudden changes such as in Fig. 8(c) are not captured well by the model and may lead to estimates well outside the error bounds. The Gaussian process generates a full normal distribution for the error, which is used in the integration of the GPS measurement with the system: the GPS measurement, corrected by the predicted error and augmented with the covariance returned by the Gaussian process model, is fed to the Kalman filter to update the pose estimate appropriately.
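A sketch of the empirical estimator (13) together with a prediction step that reuses the conditioning of equations (9)-(12); `autocovariance` and `predict_error` are our own illustrative helpers, with the n = 80 history and 1 Hz sampling taken from the text:

```python
import numpy as np

def autocovariance(d, max_lag):
    """Equation (13): empirical autocovariance of zero-mean error samples d,
    one value per integer lag (samples arrive at 1 Hz)."""
    N = len(d)
    return np.array([np.dot(d[:N - lag], d[lag:]) / (N - lag)
                     for lag in range(max_lag + 1)])

def predict_error(history, horizon, acov):
    """Predict the GPS error 'horizon' seconds after the last of the stored
    error samples, via the GP mean of equation (9). Lags beyond the sampled
    covariance are clipped to its last value (a sketch-level simplification)."""
    n = len(history)                      # e.g. the last n = 80 samples
    lags = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :])
    Sigma = acov[lags] + 1e-9 * np.eye(n)
    k = acov[np.minimum(np.arange(n)[::-1] + horizon, len(acov) - 1)]
    return k @ np.linalg.solve(Sigma, history)   # predicted mean error

# Usage: estimate acov from training error samples, then correct new GPS fixes.
train = np.random.default_rng(1).normal(size=2000).cumsum() * 0.05
acov = autocovariance(train - train.mean(), max_lag=1000)
err = predict_error(train[-80:], horizon=30, acov=acov)
```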

6 RESULTS

The initialisation and GPS error models were integrated with the model-based tracking system on a handheld AR platform built around a VAIO U-71 tablet PC with a 1.1 GHz Pentium III M CPU. A USB 2.0 camera with a 3.6 mm lens and an Xsens MTx sensor are mounted rigidly to the back of the device. A u-blox AEK-4H super-sensitive GPS receiver with an external antenna mounted on a cap worn by the user provides GPS measurements. The camera captures video frames with a resolution of 640 × 480 at 30 Hz, which are processed at quarter frame resolution of 320 × 240. The system is able to perform at 15–17 Hz using these parameters.

6.1 Initialisation

To evaluate the basic initialisation, several trial runs in different locations were recorded. The camera pose estimated by the visual tracking system was used as ground truth for the GPS estimate. The initialisation procedure was then run on every uncorrected GPS measurement, and the returned initial camera pose was compared with the recorded camera pose. Initialisation was considered successful if the estimated position was within the convergence area of the true pose. Three modes were evaluated to compare their robustness and runtime: try all search points; stop at the first one that passes the tracking test; and search the neighbouring points of the first one to pass the test and return the best one.

Fig. 9 shows a comparison of the three approaches on a single data set. Table 1 shows the failure rate of the initialisation, the number of returned poses outside the convergence area of the true camera pose, the root mean square error of the inlier poses, and the average number of search points visited. Failure rates and RMS error may differ between the three methods because a fixed number of edgels is drawn randomly from all found ones and used in the optimisation. This randomisation makes the results of the optimisation dependent on the number of sample points visited before.

Strategy            no result   outlier   RMS       # points
Search all          0.05%       0.16%     0.154 m   159
Stop first          0.05%       16.0%     0.449 m   27
Search neighbours   0.05%       0.11%     0.148 m   31

Table 1: Comparison of the initialisation methods. A desirable method always returns a result, has small mean error, no outliers, and searches a minimal number of points.


Figure 8: Prediction of the GPS error in east and north directions. The blue line shows the measured GPS error and the red stippled line the predicted error. The grey area denotes the 3σ interval computed by the Gaussian process. (a) Error prediction for the track of Fig. 1, (b) an example from the Trumpington Street data set, (c) an example of a fast change from the St. Catherine's College data set.


Figure 9: Evaluation of initialisation success. The red circle denotes the true camera position and convergence area, while the blue track shows the uncorrected GPS estimates. (a) Estimated camera poses when testing all points, (b) poses returned for the first successful initialisation, (c) poses for the best among the first successful point and its neighbours.

While the Stop first and Search neighbours methods visit only a fraction of the points the Search all method has to test, the performance of the stop-early method falls below the other two in terms of outlying results and RMS error. This is plausible, because the first successful optimisation encountered need not be a particularly good one. However, searching the neighbours around this first one usually yields a much better estimate while visiting only a fixed number of additional points. The third method therefore combines robustness and speed and is clearly preferable to the other two.

6.2 Re-initialisation

A set of runs through two test areas was recorded to evaluate the prediction of GPS error for re-initialisation. The test runs included stationary operation and walking around the area. For each run, the GPS estimate and the vision-based tracking estimate of the camera position were recorded. A covariance function as described in section 5.2 was estimated from a subset of the data for the first test area in Trumpington Street, representing about 1 hour of data from both stationary and moving users.

A direct comparison of the predicted GPS error with the true GPS error still shows large fluctuations for a single track, because the prediction captures only the average error and furthermore drifts back to 0 to emulate the long-term behaviour of the GPS error (see Fig. 8). To evaluate the improvements to the position estimate over larger data sets, the GPS error was predicted from every single GPS observation 300 s into the future and used to generate corrected GPS tracks.

Both the raw GPS values and the corrected GPS values were then processed by the constant velocity Kalman filter to generate predicted camera tracks. These predicted tracks were compared with the recorded camera track. Computing the predicted track starting from a single GPS measurement effectively simulates a visual tracking failure at that point and subsequent tracking based only on the corrected GPS and inertial measurements.

If the user re-initialises at a point in the future, the time to a successful pose estimation depends on the distance to the true camera pose. Thus, the distance to the true camera track is a measure of the time required to re-initialise the system. Averaging the distances over all predicted tracks shows the average error as a function of the time since the failure. The smaller the error, the faster the re-initialisation will complete.

Fig. 10 shows the resulting plots for different trial runs. Plots (a)-(b) are part of the training set for the covariance function. Plots (c)-(d) are from the same test site but were not used in the training, for cross-validation purposes. Similarly, plots (e)-(h) represent trials from the second test site at St. Catherine's College and were also not part of the training set. The plots show that the corrected error is almost always smaller than the GPS error. In the long run it usually converges back to the GPS error, which is to be expected because the Gaussian process model tends towards the zero mean.


Figure 10: The plots show the average distance of the predicted position to the true position for the raw GPS measurements in blue and for the corrected GPS measurements in red, as a function of the time after the catastrophic failure of the visual tracking system. (a)-(d) Trial runs from the Trumpington Street site, (e)-(h) trial runs from the St. Catherine's College site. Only plots (a) & (b) were part of the training data.

The correction is particularly successful in the first minute, because it simply simulates the GPS error at the failure point. In contrast, using the raw GPS measurements effectively introduces a large step function in the error, and a small number of wrong raw GPS measurements are enough to quickly pull the Kalman filter estimate towards them. The constant velocity model even leads to overshooting, explaining the small peaks in the GPS error shortly after the failure.

Knowing an estimate that is closer to the true position has a direct impact on the initialisation time if the Stop first or Search neighbours methods are used (see section 6.1). The distance of the initial position to the true position maps directly to the number of search points visited until the true position is found, because the points are searched in order of their distance from the initial position. Fig. 11 shows the number of search points and the time required to perform the search iterations as a function of the position error of the initial position, for the example data of Fig. 3. The function resembles a quadratic function for small distances and tapers off as the total number of search points is limited.

Figure 11: The maximum number of points visited and the corresponding search time as a function of the distance of the initial position to the true position. The search time is the number of points times 0.67 s, the time for a single frame.

Fig. 12 shows the result of applying the mapping from position error to search time to the position predictions from the trial run in Fig. 10(a). It provides an example of the time savings available through predicting the GPS error: for the first minute, the search time is halved, while on average about one second is saved.

Figure 12: The mapping of position error to time applied to the evaluation data of Fig. 10(a). This is a simple transformation of the distance using the monotonic function from Fig. 11.

7 CONCLUSIONS AND FUTURE WORK

The initialisation procedure depends on accurate information about the dimensions that are not searched. The orientation needs to be known rather accurately or the initialisation will not find any converging starting point. The magnetic field may be distorted by electrical installations, and the estimate of gravity contains a large amount of noise during motion. To counter these effects, the system waits for a period of little or no motion before attempting an initialisation. Another method would be to extend the search into more dimensions, at the cost of vastly increased search time.

There are several directions in which to improve the statistical model of the GPS error. The two prime error sources in an urban environment are atmospheric signal delays and multipath signals. Atmospheric delay errors can be reduced through the use of Differential GPS.






However, in the absence of the necessary infrastructure, the presented statistical method captures the introduced bias well and makes it possible to correct for it. Multipath errors depend more on the local environment and may change quickly if the user moves away from the original failure location. Thus, conditioning the Gaussian process on both time and travelled distance could model such behaviour. The current model also depends on fixed, learned data. An interesting extension would be to learn the current variance of the GPS measurements and the covariance function online during operation.

The integration with GPS mostly impacts the area in which the system can be deployed. The models required for tracking grow with the size of the area and are potentially beyond the storage capabilities of the hand-held device. Therefore, a distributed approach to storing and accessing the model is desirable. For example, a database of 3D building models stored in Google Earth could serve the part of the city currently required for tracking to the hand-held device. As the user moves about, the local model residing on her device is updated as required.

ACKNOWLEDGEMENTS

The authors thank Dr. Duncan Robertson and Prof. Roberto Cipolla for making the model of Trumpington Street available for experiments. This work is supported by a grant from the Boeing Company.

REFERENCES

[1] Sofiane Brahim-Belhouari and Amine Bermak. Gaussian process for nonstationary time series prediction. Computational Statistics & Data Analysis, 47(4):705–712, November 2004.
[2] Volker Coors, Tassilo Huch, and Ursula Kretschmer. Matching buildings: Pose estimation in an urban environment. In Proc. ISAR 2000, pages 89–92, Munich, Germany, October 5–6 2000. IEEE and ACM.
[3] Youjing Cui and Shuzhi Sam Ge. Autonomous vehicle positioning with GPS in urban canyon environments. IEEE Trans. Robotics and Automation, 19(1):15–25, February 2003.
[4] Tom W. Drummond and Roberto Cipolla. Visual tracking and control using Lie algebras. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition 1999, Ft. Collins, CO, USA, June 23–25 1999. IEEE.
[5] Eric Foxlin. Pedestrian tracking with shoe-mounted inertial sensors. IEEE Comp. Graph. Appl., 25(6):38–46, November–December 2005.
[6] Tobias Höllerer, Steven Feiner, Tachio Terauchi, Gus Rashid, and Drexel Hallaway. Exploring MARS: developing indoor and outdoor user interfaces to a mobile augmented reality system. Computers & Graphics, 23(6):779–785, 1999.
[7] Gérard Lachapelle. GNSS indoor location technologies. Journal of Global Positioning Systems, 3(1–2):2–11, 2004.
[8] David J. C. MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003.
[9] Wayne Piekarski and Bruce H. Thomas. Tinmith-Metro: New outdoor techniques for creating city models with an augmented reality wearable computer. In Proc. ISWC'01, pages 31–38, Zurich, Switzerland, October 8–9 2001. IEEE.
[10] Gerhard Reitmayr and Tom W. Drummond. Going out: Robust tracking for outdoor augmented reality. In Proc. ISMAR 2006, pages 109–118, Santa Barbara, CA, USA, October 22–25 2006. IEEE and ACM, IEEE CS.
[11] Valérie Renaudin, Okan Yalak, and Phillip Tomé. Hybridization of MEMS and assisted GPS for pedestrian navigation. Inside GNSS, January/February:34–42, 2007.
[12] Duncan P. Robertson and Roberto Cipolla. An image-based system for urban navigation. In Proc. BMVC 2004, London, UK, September 7–9 2004.
[13] Mark A. Sturza. GPS navigation using three satellites and a precise clock. Navigation: Journal of the Institute of Navigation, 30(2), 1983.
[14] Jiangxin Wang, Susan Y. Chao, and Alice M. Agogino. Sensor noise model development of a longitudinal positioning system for AVCS. In Proc. American Control Conference, volume 6, pages 3760–3764, San Diego, CA, USA, 1999.
[15] J. David Weiss and Frank Shields. GPS/INS integration in a severe urban environment. In Proc. IEEE Position Location and Navigation Symposium, pages 432–440, Palm Springs, CA, USA, April 20–23 1998.
[16] Greg Welch and Gary Bishop. SCAAT: incremental tracking with incomplete information. In Proc. SIGGRAPH '97, pages 333–344, New York, NY, USA, 1997. ACM Press/Addison-Wesley.
[17] Suya You, Ulrich Neumann, and Ronald Azuma. Orientation tracking for outdoor augmented reality registration. IEEE Comp. Graph. Appl., 19(6), November 1999.
