Advanced tracking through efficient image processing and visual-inertial sensor fusion Gabriele Bleser ∗
Didier Stricker†
Interactive Graphics System Group, Technical University Darmstadt, and Department of Virtual and Augmented Reality, Fraunhofer IGD, Darmstadt, Germany
ABSTRACT
We present a new visual-inertial tracking device for augmented and virtual reality applications. The paper addresses two fundamental issues of such systems. The first concerns the definition and modelling of the sensor fusion. Much work has been done in this area, and several models for exploiting the data of the gyroscopes and linear accelerometers have been proposed. However, the respective advantages of each model, and in particular the benefits of integrating the accelerometer data in the filter, are still unclear. The paper therefore provides an evaluation of different models, with special investigation of the effects of using accelerometers on the tracking performance. The second contribution is the development of an image processing approach that does not require special landmarks but uses natural features. Our solution relies on a 3D model of the scene that enables prediction of the appearances of the features by rendering the model using the prediction data of the sensor fusion filter. The feature localisation is robust and accurate, mainly because local lighting is also estimated. The final system is evaluated with the help of ground truth and real data. High stability and accuracy are demonstrated, also for large environments.
Keywords: augmented reality, markerless camera tracking, model-based tracking, inertial sensors, sensor fusion, extended Kalman filter
Index Terms: I.4.8 [Image processing and computer vision]: Scene analysis—Motion, Photometry, Sensor fusion, Tracking; I.2.10 [Artificial intelligence]: Vision and Scene Understanding—3D/stereo scene analysis, Intensity, color, photometry, and thresholding, Motion, Texture, Video analysis; I.5.5 [Pattern recognition]: Applications—Computer vision; G.3 [Probability and statistics]: Markov processes, Probabilistic algorithms, Robust regression, Stochastic processes; I.2.9 [Artificial intelligence]: Robotics—Kinematics and dynamics, Sensors; I.3.m [Computer graphics]: Miscellaneous—Augmented Reality
General Terms: Algorithms, Experimentation, Performance, Theory, Verification
1 INTRODUCTION
User tracking is an enabling technology for augmented reality and is also crucial for immersive virtual reality. If a camera is rigidly attached to the user, e.g. to a head-mounted display (HMD), vision-based tracking methods can be applied. As pure vision-based systems usually handle only very limited motion velocities, the fusion of vision-based and inertial tracking technology has been proposed several times in the past. Inertial sensors (gyroscopes and accelerometers) are suitable for capturing fast movements, while the slower vision sensor provides absolute references to reset the error, which accumulates rapidly by integrating noisy and biased inertial data (dead reckoning).
∗ e-mail: [email protected]    † e-mail: [email protected]
While it has been shown often enough in the past that inertial measurements do improve pure vision-based tracking in several ways [31, 18, 12, 1, 15], it is still unclear how to optimally exploit the information in the inertial data. In particular, the integration of accelerometer data into the position estimation is often argued to introduce instabilities rather than improvements. This is due to the fact that accelerometers measure not only the free acceleration, but also the acceleration due to gravity, which has to be subtracted before the double integration involving the estimated orientation. Thus, small errors in the orientation result in unbounded errors in the position. Accelerometers are therefore often used only for stabilising the camera attitude with respect to gravity (roll and tilt angle), rather than for estimating attitude and position, by modelling the free acceleration as noise, an approximation which holds under continuous motion [21]. This is also the common operating mode of the commercially available 3 DOF orientation trackers, which internally fuse measurements from accelerometers, gyroscopes and magnetometers and deliver an absolute orientation [28, 27]. [1] uses such a device to re-initialise the natural feature registration when the track has been lost by the vision sensor. Finally, because of the special nature of the accelerometer measurements, many researchers use only gyroscopes to support the vision-based tracking. [19] integrates gyroscope measurements and point-based vision tracking via a particle filter. During the particle generation step the rotation angles are sampled according to the measured angular velocities, while a random walk is applied to the translation. In most cases the fusion is done in an Extended Kalman Filter (EKF). [12] uses an EKF to combine gyroscopes with a line-based vision tracking system. The linear velocities are filtered from previous vision-based position estimates. The final camera pose is computed solely from the vision measurements and in return utilised to estimate the gyroscope biases. [32] fuses gyroscope data and vision measurements from fiducial tracking in an EKF. The 2D/3D correspondences obtained from the vision sensor are directly passed as measurements to the EKF, without an intermediate vision-based pose computation step as in [12, 21]. [10] extends this approach to outdoor augmented reality, replacing the fiducial tracking by a system based on natural line features. Some systems have also been proposed which exploit gyroscopes and accelerometers for both orientation and position estimation. Most of them use an EKF to fuse all measurements uniformly into an optimal pose estimate. The most popular representative is certainly the commercially available InterSense VIS-Tracker [4], which is based on artificial markers. [20] provides a theoretical framework for combining such a system with markerless line-based tracking technology, and [23] presents ideas with point-based vision tracking, but both base their experiments on simulated data. [8] applies the model of [23] to realistic inertial data and simulated vision measurements and proposes to estimate the biases of the inertial sensors online as parts of the state vector. Some recent SLAM (Simultaneous Localisation And Mapping) systems provide experimental results on realistic data, but within simple test environments [24, 15].
Our system combines model-based tracking of natural point features with inertial sensors in an EKF. The sensor fusion core is able to estimate the camera pose, velocities, accelerations and sensor biases by processing measurements obtained from gyroscopes (angular velocities), accelerometers (linear accelerations) and the vision system (2D/3D correspondences). Our contributions in the area of sensor fusion consist in (a) robustly integrating a fusion model that fully exploits gyroscope and accelerometer data for both orientation and position estimation with markerless tracking technology, and (b) proving in controlled and in real-world environments that this model indeed provides a better tracking performance than modelling the accelerometer measurements as static gravity vectors or using only gyroscopes.
Natural feature detection and tracking are an essential research area for augmented reality. Many works focus on the development of invariant and robust feature detectors, descriptors and alignment methods suitable for wide-baseline matching [14, 13, 3, 5]. In our system the problem of feature association is simplified, as a prediction of the camera pose is always available. A technique that exploits the pose prediction in a very elegant manner consists in rendering the appearances of natural features from a 3D model of the environment. In [12], a 3D wireframe model of the target scene is rendered from the predicted camera pose and a 1D search for local gradient maxima is carried out orthogonal to the rendered edges at equally distributed sample points. [30] enhances this approach by learning a visual appearance state for each sample point during the tracking, thus achieving increased discrimination. [21] works on a textured CAD model rather than learning the appearances of the edge sample points from the images. Our vision system also uses a simplified textured CAD model of the environment, but in contrast tracks feature points instead of edges. Our contribution in this area consists in developing a technique for registering small patches of a rendered image in a live camera frame even under severe changes of illumination. This is mainly achieved by estimating not only 2D displacements but also local lighting changes using an advanced Kanade-Lucas feature tracker [25]. The technique of using a CAD model for the registration has many advantages: (1) the graphics hardware can be used for efficiently generating a rendering from the predicted pose, (2) the correct level of detail is guaranteed and (3) the feature tracking is free of drift.
2 APPROACH
We assume the sensors providing physical inputs to the system, i.e. the camera and the Inertial Measurement Unit (IMU), to be synchronised in hardware and to run at different rates (in our experiments, the IMU runs at 100 Hz and provides a trigger signal to the camera at 25 Hz). Figure 1 outlines the workflow of our system. A frame is captured at timestep t and the IMU readings from the previous time interval ]t−1, t] are buffered. The image is resampled to compensate for distortion effects based on the model of [7]. A divergence test is then performed on the Kalman Filter state, based on which the system either initialises or processes the measured angular velocities and accelerations up to timestep t. The (re-)initialisation is semi-automatic. A rough orientation is obtained from the IMU (we use an XSens MT9-C [28], which besides the measured angular velocities and accelerations also delivers an absolute orientation, as mentioned in section 1) and serves as starting point for the orientation states in the EKF. The initial position is entered manually, and the user has to move the camera close enough to this position so that the vision-based tracking snaps on.
(Re-)initialisation and filter update both yield a pose prediction for timestep t, which is then used to generate a rendering of the textured CAD model on the graphics card.
Figure 1: System workflow.
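To make the workflow concrete, the following sketch outlines the per-frame loop of figure 1 in Python. All helper objects and functions (camera, imu, renderer, feature_map, ekf, undistort, register) are hypothetical stand-ins for the components described in this paper, not the authors' implementation; the concrete registration and filter steps behind them are detailed in sections 3 and 4.

```python
def process_frame(t, camera, imu, renderer, feature_map, ekf, undistort, register):
    """One iteration of the tracking loop sketched in figure 1 (hypothetical interfaces)."""
    image = undistort(camera.capture(t))           # frame triggered by the IMU at 25 Hz
    imu_samples = imu.read_interval(t - 1, t)      # buffered 100 Hz gyro/accelerometer data

    if ekf.diverged():                             # divergence test, see section 4.6
        ekf.reinitialise(imu.coarse_orientation())  # initial position is entered manually
    else:
        for sample in imu_samples:                 # process inertial data up to timestep t
            ekf.process_inertial(sample)

    pose_prediction = ekf.predicted_pose(t)
    rendering, depth = renderer.render(pose_prediction)   # textured CAD model on the GPU

    feature_map.extract_new_features(rendering, depth)    # FAST corners in free areas
    visible = feature_map.select_visible_subset(pose_prediction)
    correspondences = [register(f, rendering, image) for f in visible]

    for c in correspondences:                      # 2D/3D correspondences as EKF measurements
        if c is not None:
            ekf.update_with_correspondence(c)
    return ekf.current_pose()
```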
On demand, new features are extracted in free areas of the rendered image using FAST (Features from Accelerated Segment Test) [22]. They are stored with their 3D positions in a feature map, which is managed by a set of simple rules described in an earlier work [2]. At each timestep, a fixed-size subset of the features predicted to be visible in the given frame is registered. The 2D/3D correspondences are then used as vision measurements to update the Kalman Filter.
The paper is organised as follows. Section 3 describes the natural feature tracking. Section 4 details the fusion of vision and inertial measurements, presenting three fusion models: model 1 (gyro) uses only gyroscopes to support the camera orientation estimate, model 2 (gravity) additionally uses accelerometers to stabilise the camera attitude, and model 3 (acc, our proposed model) exploits the accelerometers also for estimating the free acceleration and thus the position. Moreover, outlier rejection and divergence monitoring are addressed. Section 5 compares the different system models with respect to their tracking performance, demonstrating that model (3) is superior to (1) and (2). The high stability and accuracy achieved by a full integration of the accelerometer data into orientation and position estimation is shown in different real environments. Section 6 draws a final conclusion.
3 NATURAL FEATURE TRACKING
A natural feature point is tracked by cropping a small window of its surrounding texture from a rendering of the textured CAD model and registering it in the live camera image. The position and perspective distortion of the feature in the rendering can be assumed to be very similar to those in the live camera image, an optimal condition for estimating a 2D displacement using an iterative registration method like the Kanade-Lucas tracker [25]. However, due to changing lighting conditions and the photometric response function of the camera used, the overall appearance of the feature in the rendering can differ significantly from that in the live frame. An example is given in figure 2 (a,b).
Figure 2: The feature registration process over an image pyramid: (a) shows a live camera image from a simple target object and texture patches cropped around a feature position from three levels of the image pyramid. (b) shows the rendering, the patch pyramid and the binary masks. The upper two levels of the patch pyramid contain parts of the black background, which would disturb the registration. The masks on the right side mark exactly these parts as invalid. The registration therefore succeeds. (c) shows on the left side the texture patches from the rendered (left) and live (right) image (bottom level of image pyramid) in comparison. The top row demonstrates the differing appearances of the patches before the registration. After the registration, both patches look quite similar. In the image on the right side the four registered feature positions are marked, which have been used to compute the camera pose for the augmentation.
Those effects can be compensated by using the relatively simple photometric model proposed in [34]. Let m_p = [x, y]^T be a feature position in an image, let I(m_p) be the intensity value at this position in the rendering and J(m_p) the intensity value in the live camera image. We describe the appearance change of a feature from the rendering to the live camera image by

I(m_p) = \lambda \cdot J(m_p + d) + \delta \quad \forall m_p \in W,  (1)

where d = [d_1, d_2]^T is a 2D displacement vector, \lambda can be regarded as a parameter adjusting the contrast and \delta as a parameter adjusting the brightness of a region W around the feature point. The registration consists in finding the parameter vector \alpha = [d_1, d_2, \lambda, \delta]^T that minimises

\varepsilon = \sum_{W} \left( I(m_p) - \lambda \cdot J(m_p + d) - \delta \right)^2.  (2)
Substituting \lambda \cdot J(m_p + d) + \delta with its first-order Taylor expansion and setting the derivatives of equation (2) with respect to all parameters \alpha to zero results in a linear system G \cdot \alpha = b, from which \alpha can be computed. A Newton-Raphson-style iteration of this procedure is performed until a maximum number of iterations is exceeded or the change in \alpha becomes negligible. To speed up the convergence, we initialise the illumination parameters \lambda and \delta from the mean and standard deviation of the texture around the predicted feature position in the rendering and in the live image, respectively. A registration is assumed to be valid if the sum of squared differences (SSD) over the registered feature windows is below a threshold.
In order to increase the convergence radius of the feature registration, we apply a coarse-to-fine strategy over an image pyramid (see figure 2). This enables the system to (re-)initialise from a very rough initial pose and to continue the tracking if the pose prediction has drifted after some frames in which no features were visible at all. To give the feature registration the maximum feasible robustness, two further issues arising from the fact that a rendered image is used for the registration are handled: If the CAD model used for generating the rendering has holes or does not fill the entire image, the OpenGL background colour appears and disturbs the registration. As the information about pixel validity is easily extracted from the depth buffer, we generate a binary mask for the feature window in the rendered image and use it during the registration for masking out invalid pixels (see figure 2 (b)). A second issue concerns the aliasing effects introduced by the rendering process. In order to reduce those and to stabilise the registration, we blur the synthetic image.
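The registration described above can be sketched as follows. This is an illustrative single-level Gauss-Newton formulation of equations (1) and (2), not the authors' implementation; the bilinear sampler, the parameter initialisation and the assumption that patches lie well inside both images are simplifications.

```python
import numpy as np

def bilinear(img, x, y):
    """Sample image img at float coordinates (x, y) with bilinear interpolation."""
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    ax, ay = x - x0, y - y0
    return ((1 - ax) * (1 - ay) * img[y0, x0] + ax * (1 - ay) * img[y0, x0 + 1]
            + (1 - ax) * ay * img[y0 + 1, x0] + ax * ay * img[y0 + 1, x0 + 1])

def register_feature(I, J, center, half=7, iters=20, tol=1e-3, mask=None):
    """Estimate displacement d and the gain/bias parameters lambda/delta of equation (1)
    for a (2*half+1)^2 patch around 'center', cropped from the rendering I and
    registered in the live image J.  Single pyramid level, illustrative only."""
    cx, cy = center
    ys, xs = np.mgrid[cy - half:cy + half + 1, cx - half:cx + half + 1]
    tmpl = I[ys, xs].astype(float)                      # patch from the rendering
    if mask is None:
        mask = np.ones_like(tmpl)                       # depth-buffer validity mask
    d = np.zeros(2)
    # initialise gain/bias from the patch statistics, as described in the text
    lam = tmpl.std() / (J[ys, xs].std() + 1e-9)
    delta = tmpl.mean() - lam * J[ys, xs].mean()
    for _ in range(iters):
        xw, yw = xs + d[0], ys + d[1]
        Jw = bilinear(J, xw, yw)
        Jx = 0.5 * (bilinear(J, xw + 1, yw) - bilinear(J, xw - 1, yw))
        Jy = 0.5 * (bilinear(J, xw, yw + 1) - bilinear(J, xw, yw - 1))
        r = (tmpl - (lam * Jw + delta)) * mask          # residual of equation (2)
        g = np.stack([lam * Jx, lam * Jy, Jw, np.ones_like(Jw)], axis=-1) * mask[..., None]
        G = np.tensordot(g, g, axes=([0, 1], [0, 1]))   # 4x4 normal equations G * da = b
        b = np.tensordot(g, r, axes=([0, 1], [0, 1]))
        da = np.linalg.solve(G, b)
        d += da[:2]; lam += da[2]; delta += da[3]
        if np.linalg.norm(da[:2]) < tol:
            break
    ssd = float((((tmpl - (lam * bilinear(J, xs + d[0], ys + d[1]) + delta)) * mask) ** 2).sum())
    return np.array([cx, cy]) + d, ssd                  # registered position and SSD validity value
```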
Figure 2 demonstrates the full potential of the natural feature tracking presented here. Although the view onto one side of the small two-sided target object is quite oblique and the live image is very dark due to the short integration time and low gain settings of the camera, feature points on both sides are registered and the camera pose is computed as shown in the augmentation.
4 SENSOR FUSION FOR OPTIMAL POSE ESTIMATION
The Extended Kalman Filter is used to fuse the 2D/3D correspondences from the image analysis and the inertial measurements from the IMU into an optimal pose estimate. The equations of the (Extended) Kalman Filter are well known and are therefore not repeated here [11, 29, 26]. First we introduce the different coordinate systems involved and the notation used subsequently, then we present the three different system models announced in section 2.
4.1 Notation
We refer to the following coordinate systems: the world frame w (fixed to the target scene model), the camera frame c (fixed to the moving camera), the sensor frame s (fixed to the moving IMU), the pixel coordinate system p (assumed ideal, as we compensate for distortion effects before the feature registration) and the normalised image frame n, which is obtained from p as

m_n = \begin{bmatrix} \frac{1}{f_x} & \frac{-s}{f_x f_y} & \frac{-c_x f_y + c_y s}{f_x f_y} \\ 0 & \frac{1}{f_y} & \frac{-c_y}{f_y} \end{bmatrix} \cdot \begin{bmatrix} x_p \\ y_p \\ 1 \end{bmatrix},  (3)

where f_x, f_y, c_x, c_y, s are the internal camera parameters.
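As a small illustration, the pixel-to-normalised-coordinate mapping of equation (3) can be written as follows (a sketch assuming the intrinsic parameters are known; it is equivalent to multiplying with the inverse intrinsic matrix):

```python
import numpy as np

def normalise_pixel(m_p, fx, fy, cx, cy, s):
    """Map a pixel position m_p = (x_p, y_p) to normalised image coordinates m_n
    as in equation (3), i.e. by applying the inverse of the intrinsic camera matrix."""
    x_p, y_p = m_p
    y_n = (y_p - cy) / fy
    x_n = (x_p - s * y_n - cx) / fx   # identical to the first row of equation (3)
    return np.array([x_n, y_n])
```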
The reference frame in which a quantity is resolved is indicated by a subscript with the abbreviation introduced above. The subscript of a transformation contains two letters denoting the mapping. Unit quaternions are used for describing a rotation: q_{ba} is a quaternion describing the rotation from a to b and \bar{q}_{ba} is the inverse quaternion. The quaternion product is denoted by ⊙. The abbreviation is also used to indicate the origin of a reference frame: a_b is the origin of a given in b.
4.2 Model 1 (gyro)
Gyroscopes are used to support the vision-based tracking. The state vector x comprises

x = [s_w, \dot{s}_w, q_{sw}, \omega_s, b_{\omega s}]^T,  (4)

where s_w is the position [m], \dot{s}_w the velocity [m/s], q_{sw} the orientation quaternion, \omega_s the angular velocity [rad/s] and b_{\omega s} the gyroscope biases [rad/s].
Note that, for simplifying the equations, we actually estimate the translation s_w and orientation q_{sw} of the IMU with respect to the world frame. However, the camera pose is easily obtained from the state vector:

c_w = s_w + \bar{q}_{sw} \odot c_s \odot q_{sw},  (5)
q_{cw} = q_{cs} \odot q_{sw},  (6)

where q_{cs} and c_s are the hand-eye rotation and translation between the camera and the IMU, which are rigidly coupled. q_{cs} is calibrated once offline using Horn's method [9] and c_s is measured.
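A possible implementation of equations (5) and (6) is sketched below; the quaternion layout [w, x, y, z] and the Hamilton product convention are assumptions, since the paper does not fix a convention.

```python
import numpy as np

def qmul(a, b):
    """Quaternion product a ⊙ b for quaternions stored as [w, x, y, z] (assumed convention)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return np.array([aw*bw - ax*bx - ay*by - az*bz,
                     aw*bx + ax*bw + ay*bz - az*by,
                     aw*by - ax*bz + ay*bw + az*bx,
                     aw*bz + ax*by - ay*bx + az*bw])

def qconj(q):
    """Inverse (conjugate) of a unit quaternion."""
    return np.array([q[0], -q[1], -q[2], -q[3]])

def camera_pose_from_state(s_w, q_sw, c_s, q_cs):
    """Equations (5) and (6): camera position c_w and orientation q_cw from the IMU
    position s_w, the IMU orientation q_sw and the fixed hand-eye offsets c_s, q_cs."""
    c_s_quat = np.concatenate(([0.0], c_s))                  # embed the 3-vector as a pure quaternion
    c_w = s_w + qmul(qconj(q_sw), qmul(c_s_quat, q_sw))[1:]  # \bar{q}_sw ⊙ c_s ⊙ q_sw
    q_cw = qmul(q_cs, q_sw)
    return c_w, q_cw
```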
The motion model assumes constant velocity and constant angular velocity. The equations for the time update are (subscripts are omitted here for better readability)

\begin{bmatrix} s^{t+T} \\ \dot{s}^{t+T} \\ q^{t+T} \\ \omega^{t+T} \\ b_\omega^{t+T} \end{bmatrix} = \begin{bmatrix} s^t + \dot{s}^t \cdot T + v_{\ddot{s}} \cdot \frac{T^2}{2} \\ \dot{s}^t + v_{\ddot{s}} \cdot T \\ \exp\!\left(-\frac{T}{2} \cdot (\omega^t + v_\omega)\right) \odot q^t \\ \omega^t + v_\omega \\ b_\omega^t + v_{b_\omega} \end{bmatrix},  (7)

where v_{\ddot{s}}, v_\omega and v_{b_\omega} denote the process noise and

\exp(v) = \left[ \cos\|v\|, \; \frac{v^T}{\|v\|} \sin\|v\| \right]^T.  (8)
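For illustration, the noise-free mean of the time update (7) together with the quaternion exponential (8) could look as follows (reusing qmul from the sketch above); this is a simplified sketch, not the filter implementation, which also propagates the covariance.

```python
import numpy as np

def quat_exp(v):
    """Equation (8): exp(v) = [cos||v||, (v^T/||v||) sin||v||]^T, with a small-angle guard."""
    n = np.linalg.norm(v)
    if n < 1e-12:
        return np.array([1.0, 0.0, 0.0, 0.0])
    return np.concatenate(([np.cos(n)], np.sin(n) * v / n))

def time_update_gyro(s, s_dot, q, omega, b_omega, T):
    """Constant-velocity / constant-angular-velocity prediction of equation (7),
    evaluated at the mean (process noise terms set to zero)."""
    s_new = s + s_dot * T                        # the v_sdd * T^2/2 term vanishes at the mean
    q_new = qmul(quat_exp(-T / 2.0 * omega), q)  # orientation driven by the angular velocity state
    return s_new, s_dot.copy(), q_new, omega.copy(), b_omega.copy()
```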
We assume uncorrelated noise in all components. The gyroscope measurement model is

y_\omega = \omega_s + b_{\omega s} + v_{y_\omega},  (9)
where y_\omega is the measured angular velocity and v_{y_\omega} denotes the measurement noise.
The image analysis initially provides a set of 2D/3D correspondences (m_p, m_w) with covariances P_p and P_w attached (P_p and P_w are diagonal matrices with standard deviations v_{m_p} and v_{m_w}). m_p and P_p are first transformed to the normalised image coordinate system (3), giving the measurement c = (m_n, m_w) with covariances P_n and P_w, which is then used in the implicit correspondence measurement model:

0 = \begin{bmatrix} I_2 & -m_n \end{bmatrix} \cdot R(q_{cs}) \cdot \left( R(q_{sw}) \cdot (m_w - s_w) - c_s \right) + v_c,  (10)

with v_c \sim N(0, P_c), where P_c is a combination of P_n and P_w linearly propagated through (10). q_{cs} and c_s are as in (5, 6) and R(q) denotes the conversion of quaternion q to a rotation matrix [33]. The correspondences are all processed sequentially.
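The residual of the implicit measurement model (10) for a single correspondence can be evaluated as below; in the EKF this residual and its Jacobian would be processed per correspondence, with P_c obtained by propagating P_n and P_w through the same function. The quaternion-to-matrix conversion follows one common convention and is an assumption.

```python
import numpy as np

def quat_to_R(q):
    """Rotation matrix R(q) for a unit quaternion q = [w, x, y, z] (assumed convention)."""
    w, x, y, z = q
    return np.array([[1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
                     [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
                     [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)]])

def correspondence_residual(m_n, m_w, s_w, q_sw, q_cs, c_s):
    """Implicit measurement function of equation (10): the 2-vector residual is ideally
    zero when the 3D point m_w projects onto the normalised image point m_n."""
    m_c = quat_to_R(q_cs) @ (quat_to_R(q_sw) @ (m_w - s_w) - c_s)  # point in the camera frame
    A = np.hstack([np.eye(2), -np.asarray(m_n).reshape(2, 1)])     # [I_2  -m_n]
    return A @ m_c
```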
4.3 Model 2 (gravity)
Accelerometers are additionally used to stabilise the camera attitude. The state vector comprises additional parameters for the accelerometer bias b_{as}:

x = [s_w, \dot{s}_w, q_{sw}, \omega_s, b_{as}, b_{\omega s}]^T.  (11)
The motion model still assumes constant velocity and constant angular velocity, now including the additional equation

b_a^{t+T} = b_a^t + v_{b_a},  (12)

where v_{b_a} denotes the process noise of the accelerometer bias. The accelerometer measurement model involves only the estimated orientation q_{sw}:

y_a = q_{sw} \odot g_w \odot \bar{q}_{sw} + b_{as} + v_{y_a},  (13)
where g_w denotes the gravity direction in the world coordinate system and v_{y_a} denotes the accelerometer measurement noise, which in this formulation also includes the free accelerations. The gyroscope and the correspondence measurement model both stay unchanged.
4.4 Model 3 (acc)
Our proposed model uses all information given in the accelerometer measurements, i.e. information about the camera attitude and about the free acceleration as the second derivative of the position. The state vector comprises additional parameters for the free acceleration \ddot{s}_w:

x = [s_w, \dot{s}_w, \ddot{s}_w, q_{sw}, \omega_s, b_{as}, b_{\omega s}]^T.  (14)
The motion model assumes constant acceleration and constant angular velocity, changing the upper three equations of the time update to

\begin{bmatrix} s^{t+T} \\ \dot{s}^{t+T} \\ \ddot{s}^{t+T} \end{bmatrix} = \begin{bmatrix} s^t + \dot{s}^t \cdot T + (\ddot{s}^t + v_{\ddot{s}}) \cdot \frac{T^2}{2} \\ \dot{s}^t + (\ddot{s}^t + v_{\ddot{s}}) \cdot T \\ \ddot{s}^t + v_{\ddot{s}} \end{bmatrix}.  (15)
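Analogously to the sketch for model 1, the mean of the constant-acceleration prediction (15) reads:

```python
def time_update_acc(s, s_dot, s_ddot, T):
    """Equation (15) evaluated at the mean (process noise v_sdd set to zero)."""
    s_new = s + s_dot * T + s_ddot * T**2 / 2.0
    s_dot_new = s_dot + s_ddot * T
    return s_new, s_dot_new, s_ddot.copy()
```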
The accelerometer measurement model now also involves the free acceleration:

y_a = q_{sw} \odot (\ddot{s}_w - g_w) \odot \bar{q}_{sw} + b_{as} + v_{y_a}.  (16)

This has the effect that an EKF measurement update changes not only estimates related to the orientation of the IMU but also estimates related to the position, mainly the acceleration, which of course propagates to velocity and position in the subsequent time update. Clearly, this measurement function requires a certain prior estimate of the orientation, as small errors in the orientation result in unbounded errors in the position. Note that this model includes the gravity model. The gyroscope and the correspondence measurement model still stay unchanged.
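The two accelerometer measurement functions (13) and (16) differ only in whether the free acceleration enters the prediction. A sketch (reusing quat_to_R from above; the numerical world gravity vector is an assumed value):

```python
import numpy as np

G_W = np.array([0.0, 0.0, -9.81])   # assumed gravity vector g_w in the world frame

def accel_measurement_gravity(q_sw, b_as):
    """Model 2, equation (13): free acceleration is treated as measurement noise."""
    return quat_to_R(q_sw) @ G_W + b_as

def accel_measurement_acc(q_sw, s_ddot_w, b_as):
    """Model 3, equation (16): the free acceleration enters the prediction, so an EKF
    update also corrects acceleration and, via the time update, velocity and position."""
    return quat_to_R(q_sw) @ (s_ddot_w - G_W) + b_as
```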
4.5 Outlier rejection
The Kalman Filter is not robust: a single erroneous measurement can cause a total filter divergence. The inertial sensors generally produce no outliers, but the image analysis does. As mentioned in section 3, we use a Newton-Raphson method over multiple image resolutions in order to also overcome larger feature displacements. This gives the whole system essential robustness, but on the other hand lets through outliers that pass the simple SSD-based validity test after the registration. As our system always has a very good prediction of where the features should occur in the images, a simple statistical test is performed to detect outliers. Let z_i be the EKF innovation computed for the i-th 2D/3D correspondence and let S_i be its innovation covariance with z_i \sim N(0, S_i), both based on the prior state vector. The outlier test becomes

s_i = z_i^T \cdot S_i^{-1} \cdot z_i > \chi^2_{\alpha,2},  (17)
where s_i is referred to as the normalised residual and \alpha is the significance level. A common value for \alpha is 0.05, leading to a threshold of 5.991. During our experiments we discovered that this threshold is too rigorous and that real outliers produce considerably larger residuals. We therefore apply a much looser threshold.
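The gating test (17) amounts to a Mahalanobis distance check on the innovation; a minimal sketch (the default threshold is the nominal chi-square value quoted above, which, as noted, is replaced by a looser value in practice):

```python
import numpy as np

def innovation_gate(z, S, threshold=5.991):
    """Equation (17): normalised residual s_i = z^T S^{-1} z of an EKF innovation z with
    covariance S, compared against a chi-square threshold (5.991 = chi^2_{0.05,2})."""
    s_i = float(z @ np.linalg.solve(S, z))
    return s_i, s_i > threshold
```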
Figure 3: Integrated camera and inertial sensor package subsequently referred to as CamIMU.
4.6 Divergence monitoring
In order to be applicable in practice, a tracking system needs reliable failure detection. A failure of the vision sensor is easily recognised from the number of registered features. However, thanks to the inertial sensors, our system is able to bridge some frames without any visible features. A (re-)initialisation is therefore only desired if the dead reckoning has introduced a drift that inevitably leads to filter divergence. Thus, we base the divergence check on a combination of simple tests:
• A filter divergence manifests itself in the normalised residuals of measurements that provide an absolute reference, such as the vision measurements. The quantity computed in (17) for recognising outliers can therefore also be used for divergence detection [6]. We maintain a low-pass filter of the normalised residuals, g = λ · g + (1 − λ) · s_i with 0 ≪ λ < 1, and compare it to a threshold.
• Obviously, the above test does not detect a divergence if no vision measurements are delivered. This case can, however, be observed through the increase in the state uncertainty. We therefore compute the Frobenius norm of the state covariance and compare it to a threshold.
• Finally, the norm of the orientation quaternion in the state vector is checked. The quaternion is normalised before each time update to keep it at unit length, as each measurement update causes a slight change in the norm. A significant deviation from unit length indicates an estimation error.
If one of these tests fails, the system invokes a (re-)initialisation.
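The three checks can be bundled as sketched below; the concrete thresholds and the forgetting factor are illustrative assumptions, not values from the paper.

```python
import numpy as np

class DivergenceMonitor:
    """Sketch of the divergence tests described above (illustrative thresholds)."""

    def __init__(self, lam=0.95, residual_limit=50.0, cov_limit=1e3, quat_tol=1e-2):
        self.lam = lam            # forgetting factor, 0 << lam < 1
        self.g = 0.0              # low-pass filtered normalised residual
        self.residual_limit = residual_limit
        self.cov_limit = cov_limit
        self.quat_tol = quat_tol

    def add_residual(self, s_i):
        """g = lam * g + (1 - lam) * s_i, updated for each vision measurement."""
        self.g = self.lam * self.g + (1.0 - self.lam) * s_i

    def diverged(self, P, q):
        """Combine the residual, covariance and quaternion-norm tests."""
        residual_test = self.g > self.residual_limit
        covariance_test = np.linalg.norm(P, 'fro') > self.cov_limit       # state uncertainty
        quaternion_test = abs(np.linalg.norm(q) - 1.0) > self.quat_tol    # unit-norm check
        return residual_test or covariance_test or quaternion_test
```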
Table 1: Desktop: noise settings used in the experiments. v_s̈ [m/s²] denotes the process noise for the time update introduced in equations (7, 12, 15) and v_ya [m/s²] denotes the accelerometer measurement noise introduced in (13, 16). Note that v_s̈ has been increased significantly for the models assuming constant velocity to cope with the fast sequence.

                         gyro     gravity    acc
fast sequence   v_s̈      4.0      4.0        0.1
                v_ya     -        1.79       0.14
slow sequence   v_s̈      0.19     0.19       0.1
                v_ya     -        0.28       0.14
Figure 4: Desktop: free linear accelerations and angular velocities of the slow and the fast sequence as estimated by our system running in the acc mode.
Table 2: Desktop: average prediction errors in pixels (mean [std]) for the different sensor fusion models on the slow and the fast sequence.

                    gyro           gravity        acc
fast sequence       3.83 [4.03]    3.82 [4.02]    0.93 [0.62]
slow sequence       0.48 [0.46]    0.48 [0.46]    0.72 [0.42]
5 EXPERIMENTAL SETUP AND RESULTS
The camera-IMU system used for the experiments is shown in figure 3. It combines a monochrome PGR camera with an XSens MT9-C IMU in one housing and is subsequently referred to as CamIMU. We considered three different test cases in our experiments: a small-scale test scenario with movements controlled by an industrial robot (Desktop), and a realistic mid-scale (Room) and a large-scale (Foyer) scenario, both with uncontrolled movements, where the CamIMU was held in the hand. In all test scenarios, our system was able to track the trajectory of the CamIMU online and stably at 25 Hz on 320×240 resolution images. The image analysis revealed no problems with textures from synthetic images or from cameras other than the one used for the tracking. For a deeper evaluation we captured representative data sequences (synchronised images and IMU data) from each scenario. The results are shown in the videos accompanying this paper and are presented subsequently.
Figure 5: Desktop, fast sequence: the prediction errors of the registered features per frame clearly show the superiority of the acc model. The outliers have already been removed.
5.1 Test case: Desktop
In order to evaluate the tracking performance of the three sensor fusion models introduced in section 4, the CamIMU was mounted on a robotic arm observing the small two-sided target object shown previously in figure 2. The robotic arm was programmed to perform a continuous figure-of-eight movement at various speeds. We considered two data sets for the tests, a slow sequence and a fast sequence; the acceleration and angular velocity profiles of both are shown in figure 4. In order to maintain the tracking throughout both sequences with all models, the process and measurement noise settings related to the accelerations used in the EKF had to be adapted appropriately (see table 1). We derived optimal noise settings for each model from the ground truth trajectory of the robot in order to ensure meaningful evaluation results that do not just depend on parameter settings. The prediction errors of the features in the image plane, that is, the Euclidean distances between the predicted feature positions and the registered positions, are used as indicators of the tracking performance. Table 2 clearly shows how the acc model outperforms the other models on the fast sequence, giving a significantly smaller prediction error. This also becomes apparent in figure 5, where the prediction errors produced by the acc model stay constant, while the gyro and gravity models produce errors highly correlated with the acceleration peaks in figure 4. On the slow sequence, the acc model performs slightly worse than the others. However, the performance decrease is only minor, and this effect is explained by vibrations, which are introduced through the mechanical movement of the robotic arm and which disturb the accelerometers. We also see that on both sequences the gravity model, which uses accelerometers to stabilise two angles of the orientation estimate, gives nearly no improvement over the gyro model. This is explained by the fact that a relatively high feature density is used for the experiments, which already provides sufficient information about the rotation. Figure 6 shows results on the slow sequence, where only two features are used for the vision-based tracking, thus revealing the benefits of using the accelerometers in the gravity as well as in the acc model. The experimental results presented in this section allow the following conclusions. The gravity model is superior to the gyro model in that it reduces the number of features needed for correcting the drift in the gyroscopes, but both models show, as expected, a poor tracking quality in the presence of changing linear velocities. The acc model also reduces the number of features needed and provides a significantly better prediction quality. Under slow movements, where the constant acceleration model is actually overparameterised, it still performs just as well without introducing instabilities. In order to give an idea of the absolute accuracy of the overall system under quick motions, we transformed the ground truth trajectory of the robot into the spatial and time reference of our system and compared it to the camera trajectory produced by the acc model on the fast sequence. The average Euclidean position error obtained from this comparison was 1 cm and the average errors in the Euler angles were [1.0, 0.39, 0.9] degrees. Note that the transformations used to synchronise the trajectories of course contribute to the measured estimation errors, so that the actual accuracy can be assumed to be higher.
Figure 6: Desktop: after 22.32 s of tracking with only two features marked by green crosses on the slow sequence, the gyro model (a) has drifted significantly, while the gravity and acc model (b,c) show comparably stable results. This can be seen from the pose of the augmented coordinate system.
Figure 7: Room: the CAD model used for the feature registration (a) and some overlaid frames demonstrating the range of possible movements (b,c,d). Note that the wall observed in (d) is not part of the model, but the augmentation in the lower left corner stays in its place.
5.2 Test case: Room
In this scenario a person carrying the CamIMU walks in a room with a base area of 5 × 4 m. Three walls of the room are modelled as shown in figure 7 (a). The captured data sequences contain a significant range of rotations and translations, demonstrated in figure 9, and some other challenging parts, which are all handled robustly by our system running in the acc mode. Two 360 degree rotations are performed, whereby one wall is not included in the CAD model. The tracking survives these periods of pure dead reckoning, as indicated in figure 7 (d). When the user moves the camera very close to the target scene, seeing only few features, the augmentation still stays in its place (figure 7 (c)). Finally, some fast movements with
Figure 8: Room: some overlaid frames demonstrating the stable operation of the system under occlusion (a) and changing illumination (b,c). In (a) the green crosses indicate registered features and the red ones mark features where the registration failed due to the occlusion.
Figure 9: Room: camera trajectory in world coordinates as estimated by our system. The coordinate systems show the camera orientation for every fiftieth frame.
Figure 12: Foyer: prediction errors for the features produced by the acc model with outliers marked by stars. The mean prediction error is not more than 1.42 pixels.
Figure 11: Foyer: free linear accelerations and angular velocities as estimated by our system running in the acc mode.
high free accelerations of up to 12.9 m/s² and rotations of up to 9.6 rad/s are also performed and tracked. Figure 8 demonstrates the stable operation of the system under occlusion and severe lighting changes.
5.3 Test case: Foyer
In this scenario a person carrying the CamIMU is standing on the first floor of a building behind a handrail, looking down into a large open foyer. A part of the opposite wall (4.5 × 8.5 m) at a distance of approximately 7 m is used for the natural feature registration. Figure 10 shows the textured CAD model and some overlaid live camera images. Note that, because of the large viewing distance, the features on the posters look quite similar. A very close prediction is therefore crucial to maintain the tracking. We did not succeed in processing this sequence with the gravity or the gyro model, even by adapting the noise settings appropriately. The track gets lost repeatedly between time 68 and 92, although the camera movements are indeed mainly rotational. But the rotations are off-centre and therefore high linear accelerations are also produced. Figure 11 shows the estimated free linear accelerations and angular velocities based on the acc model, which tracks robustly through the whole sequence. Figure 12 presents the prediction errors for the features using the acc model, with the detected outliers marked by stars. The average prediction error is no more than 1.42 pixels, with a standard deviation of 1.4695 pixels. Some increased displacements between time 60 and 70 are explained by several gaps in the vision data of up to half a second, where the pose is based solely on dead reckoning. The reason for these gaps is that the camera repeatedly observes non-modelled parts of the environment. A better impression of the robustness achieved by our system is obtained from the augmented video.
6 CONCLUSION AND FUTURE WORK
This paper presented a markerless visual-inertial tracking system that works robustly in small-scale and large-scale environments, under varying lighting conditions and fast camera movements, surviving even short gaps in the vision data. Different sensor fusion models have been evaluated, first under controlled movements, clearly showing the benefits of fully exploiting all information given in the accelerometer measurements. These benefits are a better tracking quality and a reduced demand for vision measurements. The results were also verified under uncontrolled movements in realistic mid- and large-scale environments. Moreover, a markerless tracking method has been developed, which exploits the running pose prediction in a very efficient way by first rendering a simplified textured 3D model of the environment for predicting the feature appearances and second registering the features in the live camera image by estimating their 2D displacements and local illumination. Altogether, the developed methods result in an exceptionally robust and efficient final system. One restriction is, however, the offline preparation needed for generating a textured CAD model of the environment. This can be done manually from some photos using a commercial modelling tool or automatically from an image sequence using structure from motion algorithms [17, 16]. In the future we plan to move towards simultaneous localisation and mapping [24, 15] in order to reduce the required preparation time. Moreover, the currently semi-automatic (re-)initialisation is not sufficient for considerably larger environments. We will therefore investigate methods based on Bluetooth and Wi-Fi for obtaining a rough initial position estimate.
ACKNOWLEDGEMENTS
This work has been performed within the EU research program MATRIS (IST-002013). The authors would like to thank the project partners within the consortium and the EU for the financial support.
Figure 10: Foyer: (a) CAD model used for the feature registration and virtual objects, (b) rectified live image with feature positions and augmentations, (c) another live image with significant motion blur, where features are nevertheless registered.
The data sequences controlled by the industrial robot have been provided by the University of Linköping. For more information, please visit the website www.ist-matris.org.
REFERENCES
[1] M. Aron, G. Simon, and M.-O. Berger. Handling uncertain sensor data in vision-based camera tracking. In International Symposium on Mixed and Augmented Reality (ISMAR), pages 58–67, Arlington, VA, November 2004.
[2] G. Bleser, H. Wuest, and D. Stricker. Online camera pose estimation in partially known and dynamic scenes. In International Symposium on Mixed and Augmented Reality (ISMAR), pages 56–65, Santa Barbara, CA, October 2006.
[3] A. Boffy and Y. Genc. Real-time feature matching using adaptive and spatially distributed classification. In British Machine Vision Conference (BMVC), 2006.
[4] E. Foxlin and L. N. I. Inc. Vis-tracker: A wearable vision-inertial self-tracker. In Proceedings of the IEEE Virtual Reality (VR), 2003.
[5] M. Grabner, H. Grabner, and B. Horst. Tracking via discriminative online learning of local features. In Conference on Computer Vision and Pattern Recognition (CVPR), 2007.
[6] F. Gustafsson. Adaptive Filtering and Change Detection. John Wiley & Sons, September 2000.
[7] J. Heikkilae and O. Silven. A four-step camera calibration procedure with implicit image correction. In Conference on Computer Vision and Pattern Recognition (CVPR), 1997.
[8] J. Hol, T. Schön, F. Gustafsson, and P. Slycke. Sensor fusion for augmented reality. In International Conference on Information Fusion, Florence, Italy, August 2006.
[9] B. K. P. Horn. Closed-form solution of absolute orientation using unit quaternions. Journal of the Optical Society of America A, 4:629–642, April 1987.
[10] B. Jiang, U. Neumann, and S. You. A robust hybrid tracking system for outdoor augmented reality. In IEEE Virtual Reality Conference 2004 (VR 2004), 2004.
[11] R. E. Kalman. A new approach to linear filtering and prediction problems. Transactions of the ASME–Journal of Basic Engineering, 82(Series D):35–45, 1960.
[12] G. Klein and T. Drummond. Tightly integrated sensor fusion for robust visual tracking. In British Machine Vision Conference (BMVC), volume 2, pages 787–796, Cardiff, 2002.
[13] V. Lepetit and P. Fua. Keypoint recognition using randomized trees. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 28:1465–1479, 2006.
[14] K. Mikolajczyk and C. Schmid. A performance evaluation of local descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), pages 1615–1630, 2005.
[15] P. Pinies, T. Lupton, S. Sukkarieh, and J. D. Tardos. Inertial aiding of inverse depth slam using a monocular camera. In International Conference on Robotics and Automation (ICRA), 2007.
[16] M. Pollefeys, L. V. Gool, M. Vergauwen, F. Verbiest, K. Cornelis, J. Tops, and R. Koch. Visual modeling with a hand-held camera. International Journal of Computer Vision, pages 207–232, 2004.
[17] M. Pollefeys, R. Koch, and L. J. V. Gool. Self-calibration and metric reconstruction in spite of varying and unknown internal camera parameters. In International Conference on Computer Vision (ICCV), pages 90–95, 1998.
[18] G. Qian, R. Chellappa, and Q. Zheng. Robust structure from motion estimation using inertial data. Optical Society of America, 18:2982–2997, December 2001.
[19] G. Qian, R. Chellappa, and Q. Zheng. Bayesian structure from motion using inertial information. In International Conference on Image Processing (ICIP), June 2002.
[20] H. Rehbinder and B. Ghosh. Pose estimation using line-based dynamic vision and inertial sensors. IEEE Transactions on Automatic Control, 48:186–199, February 2003.
[21] G. Reitmayr and T. Drummond. Going out: Robust model-based tracking for outdoor augmented reality. In International Symposium on Mixed and Augmented Reality (ISMAR), 2006.
[22] E. Rosten and T. Drummond. Fusing points and lines for high performance tracking. In IEEE International Conference on Computer Vision (ICCV), volume 2, pages 1508–1511, October 2005.
[23] T. Schön and F. Gustafsson. Integrated navigation of cameras for augmented reality. In International Federation of Automatic Control (IFAC) World Congress, Prague, Czech Republic, July 2005.
[24] T. Schön, R. Karlsson, D. Törnqvist, and F. Gustafsson. A framework for simultaneous localization and mapping utilizing model structure. In International Conference on Information Fusion, Quebec, Canada, July 2007.
[25] J. Shi and C. Tomasi. Good features to track. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 593–600, 1994.
[26] S. Thrun, W. Burgard, and D. Fox. Probabilistic Robotics. Massachusetts Institute of Technology, 2005.
[27] http://www.isense.com/
[28] http://www.xsens.com/
[29] G. Welch and G. Bishop. An introduction to the Kalman filter. Technical report, University of North Carolina at Chapel Hill, Department of Computer Science, 2001.
[30] H. Wuest, F. Vial, and D. Stricker. Adaptive line tracking with multiple hypotheses for augmented reality. In International Symposium on Mixed and Augmented Reality (ISMAR), pages 62–69, 2005.
[31] Y. Yokokohji, Y. Sugawara, and T. Yoshikawa. Accurate image overlay on video see-through HMDs using vision and accelerometers. In IEEE Virtual Reality Conference 2000 (VR 2000), pages 247–254, 2000.
[32] S. You and U. Neumann. Fusion of vision and gyro tracking for robust augmented reality registration. In Virtual Reality (VR), 2001.
[33] Z. Zhang and O. Faugeras. 3D Dynamic Scene Analysis. Springer, 1992.
[34] T. Zinßer, C. Gräßl, and H. Niemann. Efficient feature tracking for long video sequences. In Deutsche Arbeitsgemeinschaft Mustererkennung (DAGM), pages 326–333, 2004.