Machine Vision and Applications (2011) 22:323–335 DOI 10.1007/s00138-009-0214-y
ORIGINAL PAPER
Tracking rigid objects using integration of model-based and model-free cues

Ville Kyrki · Danica Kragic
Received: 22 December 2008 / Revised: 24 June 2009 / Accepted: 19 July 2009 / Published online: 12 August 2009 © Springer-Verlag 2009
Abstract  Model-based 3-D object tracking has earned significant importance in areas such as augmented reality, surveillance, visual servoing, and robotic object manipulation and grasping. Key problems for robust and precise object tracking are the outliers caused by occlusion, self-occlusion, cluttered background, reflections and complex appearance properties of the object. Two of the most common solutions to these problems have been the use of robust estimators and the integration of visual cues. The tracking system presented in this paper achieves robustness by integrating model-based and model-free cues together with robust estimators. As a model-based cue, a wireframe edge model is used. As model-free cues, automatically generated surface texture features are used. The particular contribution of this work is an integration framework that is not limited to polyhedral objects. In particular, we also deal with spherical, cylindrical and conical objects, for which the complete pose cannot be estimated using wireframe models alone. Using the integration with the model-free features, we show how a full pose estimate can be obtained. Experimental evaluation demonstrates robust system performance in realistic settings with highly textured objects and natural backgrounds.

Keywords  Model-based tracking · Model-free tracking · Cue integration · Iterated extended Kalman filter

V. Kyrki (B)  Department of Information Technology, Lappeenranta University of Technology, P.O. Box 20, 53851 Lappeenranta, Finland
e-mail: [email protected]

D. Kragic  Centre for Autonomous Systems, School of Computer Science and Communication, Royal Institute of Technology, 10044 Stockholm, Sweden
e-mail: [email protected]
1 Introduction

Reliable and stable pose tracking is today one of the most important building blocks of systems in augmented reality (AR) [1], activity interpretation [2], and robotic object manipulation and grasping [3]. It is interesting to notice that there are almost as many proposed tracking algorithms as there are applications. One reason for this is the multitude of camera-object configurations: moving camera/static object (visual servoing, visual navigation, AR), static camera/moving object (activity interpretation, surveillance), and moving camera/moving object (visual servoing, AR). Another reason is the appearance of the objects considered: some approaches have been designed specifically for tracking textured objects [1,3], while others, based mainly on gradient information, have been evaluated mostly on non-textured objects [4–8].

Despite their number, pose-tracking systems are still prone to drift and jitter, and can lose track if the geometrical model of the object is simple but the appearances of the object and background are complex. In applications such as robotic object manipulation, tracking is typically model-based, because grasping can only be performed after precisely aligning the manipulator and the object if no additional sensory modalities are available. Thus, the absolute pose of the object with respect to the manipulator-mounted camera needs to be recovered. Some visual servoing applications solve this problem by using the teaching-by-showing approach, which requires an image of the target in the desired pose [9].

In general, the use of only a wire-frame model for tracking is difficult when the background and the object appearance properties are complex, as it is difficult to distinguish between background and object edges, as well as between multiple edges on the object itself.
Fig. 1 Performance of the object-based tracking system when (upper) only a wire-frame model is used, and (lower) both model-based and model-free features are used for tracking
Tracking of textured objects can also be problematic since the "signal-to-noise ratio" is small, that is, only a fraction of the detected edges really belong to the outer edges of the object. The confusion between object and background edges is illustrated in Fig. 1, where the sequence of images on the top row shows the tracking result of the Lie algebra method proposed in [8]. Here, the tracker easily gets stuck on a leg of a table in the background.

In this paper, we propose a solution to the 3-D tracking problem by integrating model-based and model-free features. The benefit of the integration can be seen in the second row of images in Fig. 1. We show that the different types of features can be integrated in a Kalman filter framework that operates in real-time. In addition, we demonstrate that the model-free features can be used to track the full pose of objects which are only partially observable using the CAD model, for example, solids of revolution. Finally, three different motion models are developed and their characteristics studied.

The paper is organized as follows: we begin by reviewing related literature in Sect. 2. Then, in Sect. 3 we describe the model-based tracking approach, followed by the model-free tracking in Sect. 3.2. Section 4 presents our approach of integrating cues using the iterated extended Kalman filter. Experiments are presented in Sect. 5, and finally, the method is discussed and conclusions drawn in Sect. 6.
2 Related work

Although there have been examples of appearance-based 3-D pose tracking systems [10], most current systems for 3-D pose tracking are based on tracking of object boundaries. One of the early systems, RAPID [4], uses the dynamic vision approach presented by Dickmanns [11], which is based on the use of extended Kalman filtering to integrate image measurements through a non-linear measurement function to estimate the pose. However, these papers do not consider the modeling of the motion in detail, which is one of the main problems considered here. The same applies to most of the other approaches presented, such as [12,13]. Drummond and Cipolla presented an approach using the Lie group and Lie algebra formalism as the basis for representing the motion of a rigid body [8]. The approach has been shown to give excellent results in the case of non-textured objects. Our work applies and extends these ideas to textured objects with a textured background.
Considering tracking based on texture, an approach for model-based tracking based on local bundle adjustment has been presented [1]. It relies on the use of a CAD model of the object and requires off-line matching of model points to their 2-D projections in a set of reference keyframes, which is in contrast to our approach, where the texture properties are determined on-line. In [1], the matching between the current frame and a keyframe is based on homographies, which is suitable for locally planar (polyhedral) objects. In contrast, our approach also considers curved surfaces. In [14], Masson et al. propose a 3-D tracking algorithm based on a fast patch registration that provides 3D-2D correspondences for pose estimation. An off-line stage using a textured model is used to learn a Jacobian for the patch registration. Compared to these approaches, in our work we are interested in integrating CAD models with on-line generated surface features for tracking both polyhedral and non-polyhedral objects. By generating the features on-line, we do not need the off-line registration process required in the above work. It can be argued that the registration process is not a complex one, but it still has to be redone every time the appearance of the object changes, even if there is only a small change in texture. In practice, this can happen rather often, for example for food items such as a rice or cereal box. In the robotics community, one envisions a robot that manipulates food items in a kitchen, and for such an application this constitutes a significant problem.

Comport et al. have presented a tracking system based on a non-linear pose computation formulated by means of a virtual visual servoing approach [5]. Tracking of different features including lines, circles, cylinders and spheres is demonstrated using the interaction matrix formulation. Similar to our approach, a local moving edge tracker is used for real-time tracking of points normal to the object contours. Robustness is obtained by integrating an M-estimator into the visual control law via an iteratively re-weighted least squares approach. Different from our work, only an edge model of the object is considered. Work presented in [15] demonstrates a tracking system based on the integration of visual and inertial sensors. Good performance is achieved for fast camera movements due to the integration with an inertial sensor, but it is argued that, in order to have a robust system, more stable visual features should be considered. We believe that our approach provides a suitable framework for retrieving such features.

Integration of visual cues has been found to provide increased robustness and has been used successfully in
applications such as object tracking and scene segmentation [16–19]. In tracking, multiple cues have been applied mostly to image space tracking. Recently they have been proposed for 3-D tracking of polyhedral objects for which a textured model is available [3]. In this paper, we do not need a textured model; instead, the texture features are generated on-line. This idea has also been proposed by Vacchetti et al. [20]. In contrast to that paper, we extend the idea by formulating tracking for solids of revolution and spherical objects using both CAD models and texture. Also, in [20] the texture features are tracked only between two consecutive images, which might cause increased drift over long tracking sequences if the edge information is weak. A similar strategy was proposed by Rosten and Drummond in [21], which applies expectation maximization to combine the estimates based on edge and texture information. The use of a Kalman filter to integrate model-based and model-free cues was presented for objects composed of planar surfaces in [22], and different integration models for model-free cues were studied in [23]. This paper collects these works and extends them to curved surfaces where some degrees of freedom are not observable using the edge information.

Brox et al. have proposed to use optical flow-based correspondences together with an extracted object contour for pose tracking [24], taking a somewhat different point of view on the use of contour information: the contour is extracted by a segmentation-like approach that does not employ the object geometry, and the iterated closest point algorithm is then used to obtain correspondences between the model and the extracted contour. The virtual visual servoing approach has also been extended to account for model-free cues [25], which is accomplished by using an image intensity based part in the Jacobian of the virtual visual servoing. Thus, their model-free part closely resembles the 2-D tracking approach of [26]. No motion modeling is used. This is in contrast to this paper, where motion is modeled and interest points are first tracked in the image, and the image motion is then used to infer information about the 3-D motion. Similar to the approach presented in this paper, the approaches above rely on M-estimators to increase robustness. Lu et al. [27] have also proposed a method combining edge and optical flow cues for tracking an articulated hand model. Their approach differs from the approach presented here in that they do not consider motion modeling, which is here achieved through a Kalman filter.
3 Tracking system

The tracking system is composed of two parts, model-based and model-free. These are described next in more detail, emphasizing the extensions to non-planar objects while keeping the presentation self-contained.
3.1 Tracking using a Lie group formulation

The model-based tracking system is based on the ideas proposed by Drummond and Cipolla in [8]. The approach relies on the estimation of normal flow for points (nodes) along the edges, and preserves the rigid structure of the object. The estimation of the relative pose change between two consecutive frames is based on measuring the distance between sample point features and the closest edge. This normal flow is then projected in the direction of the contour normal so as to represent an error to be minimized. To estimate the relative change in pose between two consecutive frames, the derivative of SE(3) corresponding to the tangent velocity vector space (Lie algebra) is used. Velocity basis matrices G_i are chosen in a standard way to represent translations in the x, y and z directions (G_1, G_2, G_3) along with rotations around the x, y and z axes (G_4, G_5, G_6) [8]. The transformation matrices M describing the object pose are related to the Lie algebra of velocities via the exponential map, such that

M = exp( Σ_i α_i G_i )   (1)

where the α_i correspond to the elements of the kinematic screw, or twist, representing the inter-frame transformation. Computing the motion is performed by solving a weighted least-squares (M-estimation) problem for determining the α_i. For details, see [8].

Next, we propose how the system can be used to track solids of revolution and spherical objects. Solids of revolution, such as cylinders and truncated cones, have one unobservable degree of freedom, namely the rotation around the axis of revolution. By aligning the Lie generators with the coordinate system attached to the object, the effect of the generator corresponding to this rotation can be removed (in our case G_5, as the axis of revolution corresponds to y). We note here that the least-squares problem used to solve for the α_i only loses one dimension, and otherwise the estimation of the object pose remains the same. A spherical object has three unobservable degrees of freedom, as no rotations can be observed. In this case, all three generators related to the rotational motion (G_4, G_5, G_6) are disregarded.
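For illustration, a minimal sketch of this parameterization is given below: it constructs the six velocity basis matrices and composes the inter-frame transformation of Eq. (1) through the matrix exponential. This is our reading of the formulation, not the implementation used in the paper, and all function names are ours.

```python
# Minimal sketch of the SE(3) exponential-map parameterization of Eq. (1);
# illustrative only, not the authors' implementation.
import numpy as np
from scipy.linalg import expm

def se3_generators():
    """Return the six 4x4 velocity basis matrices G1..G6:
    translations along x, y, z and rotations about x, y, z."""
    G = [np.zeros((4, 4)) for _ in range(6)]
    G[0][0, 3] = G[1][1, 3] = G[2][2, 3] = 1.0   # translations along x, y, z
    G[3][1, 2], G[3][2, 1] = -1.0, 1.0           # rotation about x
    G[4][2, 0], G[4][0, 2] = -1.0, 1.0           # rotation about y
    G[5][0, 1], G[5][1, 0] = -1.0, 1.0           # rotation about z
    return G

def interframe_motion(alpha):
    """Compose the inter-frame transformation M = exp(sum_i alpha_i G_i)."""
    G = se3_generators()
    return expm(sum(a * Gi for a, Gi in zip(alpha, G)))

# For a solid of revolution with its axis along the object y axis, the
# generator G5 (rotation about y) would simply be dropped from the estimate.
```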
Fig. 2 Sphere model (sphere of radius r; R and t relate the camera and object frames; p is a surface point observed by the camera)

Fig. 3 Cone model (truncated cone with bottom radius r_b, top radius r_t and height h)

3.2 Model-free features

The image plane tracker is initialized using interest points extracted with the Harris corner detector [28], which was selected because it is suitable for real-time operation and describes interest points of a textured surface well. The current state estimate is used to initialize points only on the visible part of the object. After the interest points are chosen, the local neighborhood around each interest point is stored.
It is sufficient to perform the corner detection every 10th frame, because the rotation between the object and the camera frames changes only moderately fast and the appearance of the local neighborhood changes gradually. The point tracking is based on minimizing the sum of squared differences in the RGB values. It would be possible to use a more complex (e.g., affine) motion model for the tracked areas, but this simple approach is useful because it is computationally light and suitable for region tracking over shorter time intervals. It should be noted that our integration method does not depend on tracking the same feature through the whole sequence of images. Tracked features are rejected on two conditions: (i) if the sum of squared differences grows above a predetermined threshold, or (ii) if the feature moves out of the estimated image region of the object. Also, we found that an RGB template greatly outperforms a gray scale template. The output of the model-free tracker is thus the image coordinates of the tracked points.

3.3 Initialization of model-free features

As mentioned, when a new model-free point is initialized, its 3-D location in the object coordinate frame is determined and stored. If the point lies on a planar patch, determining the location corresponds to calculating the intersection of the model plane and the line of sight through the point's projection on the camera image plane. In this case the solution is unique. In contrast, if the object model is a sphere or a truncated cone, there may be several intersections. Ray casting is used to solve the intersections.

The sphere model used is shown in Fig. 2. Let r be the radius of the sphere, t the location of its origin (located at its center), R the rotation matrix corresponding to its current orientation (together R and t describe the transformation between the camera and object coordinate systems), and p the camera frame homogeneous coordinates of a point on the object surface. When an image ray passes through the sphere, there are two intersections. The desired intersection closest to the camera can be solved in the object frame from

q = R^T (k_2 p − t),   (2)

k_2 = ( p·t − √( (p·t)² − ‖p‖² (‖t‖² − r²) ) ) / ‖p‖².   (3)
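A minimal sketch of this back-projection, using the quantities defined above, could look as follows; it is illustrative only, and the function name and the handling of rays that miss the sphere are our choices.

```python
import numpy as np

def backproject_to_sphere(p, R, t, r):
    """Back-project an image ray onto a sphere, Eqs. (2)-(3).
    p: camera-frame homogeneous ray direction (x, y, 1),
    R, t: object-to-camera rotation and translation, r: sphere radius.
    Returns the object-frame point of the intersection closest to the
    camera, or None if the ray misses the sphere."""
    p = np.asarray(p, dtype=float)
    t = np.asarray(t, dtype=float)
    disc = np.dot(p, t) ** 2 - np.dot(p, p) * (np.dot(t, t) - r ** 2)
    if disc < 0.0:                                        # ray misses sphere
        return None
    k2 = (np.dot(p, t) - np.sqrt(disc)) / np.dot(p, p)    # Eq. (3)
    return R.T @ (k2 * p - t)                             # Eq. (2)
```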
For a truncated cone, let r_b and r_t be the radii of the bottom and top, respectively, and let h be the height (see Fig. 3). Note that this model also applies to cylinders, for which r_b = r_t. Let the origin of the object frame be at the center of the bottom, and let t and R describe its position and orientation. The line corresponding to the image point p can then be written in the object frame as l = −R^T t + k R^T p ≡ w + k v. With a = (r_t − r_b)/h denoting the slope of the cone side, the intersections are the solutions of the second-order polynomial

k² (v_x² + v_z² − a² v_y²) + 2k (v_x w_x + v_z w_z − a² v_y w_y − a r_b v_y) + w_x² + w_z² − a² w_y² − 2 a r_b w_y − r_b² = 0.   (4)

The second-order polynomial has up to two solutions, k_1 and k_2, giving the intersections −R^T t + k_i R^T p. The intersections are thus directly expressed in the object frame as desired, but to select the one closest to the camera, they must be transformed to the camera frame, where the closest one can be chosen.
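The cone case can be sketched in the same spirit; the use of the root's depth along the ray to select the closer intersection, and all names, are our reading of the model rather than code from the paper.

```python
import numpy as np

def backproject_to_cone(p, R, t, rb, rt, h):
    """Back-project an image ray onto a truncated cone (cylinder when
    rb == rt) by solving the quadratic of Eq. (4). Illustrative sketch.
    Returns the object-frame intersection closest to the camera, or None."""
    p, t = np.asarray(p, float), np.asarray(t, float)
    w = -R.T @ t                 # ray in object frame: l = w + k v
    v = R.T @ p
    a = (rt - rb) / h            # slope of the cone side
    A = v[0]**2 + v[2]**2 - a**2 * v[1]**2
    B = 2.0 * (v[0]*w[0] + v[2]*w[2] - a**2 * v[1]*w[1] - a*rb*v[1])
    C = w[0]**2 + w[2]**2 - a**2 * w[1]**2 - 2*a*rb*w[1] - rb**2
    disc = B**2 - 4.0*A*C
    if disc < 0.0:
        return None
    # With p = (x, y, 1), the parameter k equals the camera-frame depth of
    # the ray point, so the smaller positive root is closest to the camera.
    roots = sorted(k for k in ((-B - np.sqrt(disc)) / (2*A),
                               (-B + np.sqrt(disc)) / (2*A)) if k > 0.0)
    return w + roots[0]*v if roots else None
```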
4 Integration of motion with IEKF

The measurements of the two types of features are integrated using an iterated extended Kalman filter (IEKF). It is used instead of the basic extended Kalman filter because it has been shown to perform better when the measurement functions are strongly nonlinear [29]. An alternative approach would be to use particle filtering for the integration [30]. However, in a model-based tracking system, we would need to generate a large number of sample hypotheses of the pose and verify each of them given the image data. Projecting the model for each of the hypotheses is a time-consuming process, and it would force us to use a comparatively small number of edge measurements per sample, which can cause problems in the case of highly textured objects.

To present the actual models proposed and studied, we now introduce the notation used. For the IEKF estimation equations, the reader is asked to consult a standard text, for
example [31]. The system state is denoted by x, the system model (motion model) by f(x), its gradient by F, the measurement model by h(x), and its gradient by H. The uncertainty of the state is described by the covariance matrix P, the uncertainty of the system model by Q, and the uncertainty of the measurement by S.
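For reference, one iterated measurement update in this notation can be sketched as below; this is the generic textbook form of the IEKF [31], not the authors' implementation, and the function names are placeholders.

```python
import numpy as np

def iekf_update(x_pred, P_pred, z, h, H_jac, S, n_iter=3):
    """One iterated EKF measurement update in the paper's notation:
    x_pred, P_pred come from the motion model; z is the measurement;
    h(x) is the measurement function with Jacobian H_jac(x); S is the
    measurement covariance. Generic textbook form, illustrative only."""
    x = x_pred.copy()
    for _ in range(n_iter):                     # relinearize h around x
        H = H_jac(x)
        K = P_pred @ H.T @ np.linalg.inv(H @ P_pred @ H.T + S)
        x = x_pred + K @ (z - h(x) - H @ (x_pred - x))
    P = (np.eye(len(x)) - K @ H) @ P_pred
    return x, P
```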
4.1 Motion models

We study three different motion models. The first two are zeroth order, i.e., they do not include a velocity model, so that only the integration itself is investigated. The difference between the two is the center point of the rotation. In the first (Model 1), the object rotates around its own origin, which corresponds to a moving object and a stationary camera. In the second (Model 2), the rotation is around the camera frame origin, corresponding to a moving camera and a stationary object. The third model (Model 3) is a constant velocity model. We use all three models in order to study how much the choice of a model affects the tracking accuracy. All models are linear, but differ in the gradient of the prediction function F and the prediction covariance Q.

Models 1 and 2 use the 6-DOF pose vector of the tracked object as the system state, i.e., x = (X, Y, Z, φ, θ, ψ)^T. Due to the well-known problems with the non-uniqueness of Euler angles [3], we adopt the approach proposed in [32], where the orientation is represented externally, outside the Kalman filter state, and the angles φ, θ and ψ only describe incremental changes. Unlike their approach of using a quaternion, we represent the external orientation using a rotation matrix. Thus, after each time step the rotation angles are integrated into the matrix R_0 and reset to zero.

Models 1 and 2 predict the internal state according to x_{i+1|i} = x_i. Thus, the gradient of the state update is the identity matrix, F = I_6. The motion is modeled as a Wiener process, with independent uncorrelated noise sources for both translational and rotational motion. Thus, for Model 1, where the motion is with respect to the object origin, the state prediction covariance is

Q_1(\Delta t) = \begin{pmatrix} \Delta t\,\sigma_p^2 I_3 & 0 \\ 0 & \Delta t\,\sigma_\phi^2 I_3 \end{pmatrix}   (5)

where Δt is the length of the time step, and σ_p² and σ_φ² are the variances of the translational and rotational motion, respectively.

In Model 2, the translation is affected by the rotation around the camera center and the system can be written as

p_{i+1} = R(w_φ) p_i − w_p,   φ_{i+1} = φ_i − w_φ   (6)

where p and φ represent the translation and rotation vectors, respectively, and w_p and w_φ are the noise sources for camera translation and rotation. Writing R(w_φ, p_i) ≡ R(w_φ) p_i, we can approximate it with a first-order Taylor series as

R(w_φ, p) ≈ R(0, p) + \frac{\partial R(w_φ, p)}{\partial w_φ} w_φ = p − [p]_× w_φ   (7)

where

[p]_× = \begin{pmatrix} 0 & −p_z & p_y \\ p_z & 0 & −p_x \\ −p_y & p_x & 0 \end{pmatrix}   (8)

is the skew-symmetric matrix corresponding to the cross (vector) product with p. Equations (6) and (7) allow us to write Q_2 as

Q_2 = \begin{pmatrix} Q_{pp} & Q_{pφ} \\ Q_{φp} & Q_{φφ} \end{pmatrix}   (9)

where

Q_{pp} = Δt σ_p² I_3 + Δt σ_φ² [p]_× [p]_×^T,
Q_{pφ} = Δt σ_φ² [p]_×,
Q_{φp} = Δt σ_φ² [p]_×^T,
Q_{φφ} = Δt σ_φ² I_3.

Thus, the translation is affected by the rotation around the camera center in addition to the pure translational component. Also, translation and rotation are correlated.

In Model 3 the system state includes velocities in addition to the 6-DOF pose,

x = (X, Y, Z, φ, θ, ψ, Ẋ, Ẏ, Ż, φ̇, θ̇, ψ̇)^T.

The model has the prediction equation

x_{i+1|i} = \begin{pmatrix} I_6 & \Delta t I_6 \\ 0 & I_6 \end{pmatrix} x_i.   (10)

The noise is considered to only affect the velocities, which are modeled as a Wiener process. Note that in this context the noise describes the stimulus driving the system, not the measurement uncertainty, and thus Model 3 can be understood such that acceleration is modeled as a zero-mean Gaussian random variable. The noise covariance is thus

Q_3 = \begin{pmatrix} \frac{1}{3}\Delta t^3 \sigma_p^2 I_3 & \frac{1}{2}\Delta t^2 \sigma_p^2 I_3 & 0 & 0 \\ \frac{1}{2}\Delta t^2 \sigma_p^2 I_3 & \Delta t\,\sigma_p^2 I_3 & 0 & 0 \\ 0 & 0 & \frac{1}{3}\Delta t^3 \sigma_\phi^2 I_3 & \frac{1}{2}\Delta t^2 \sigma_\phi^2 I_3 \\ 0 & 0 & \frac{1}{2}\Delta t^2 \sigma_\phi^2 I_3 & \Delta t\,\sigma_\phi^2 I_3 \end{pmatrix}.   (11)
4.2 Measurement models

Two types of measurements are integrated in the Kalman filter, model-based pose measurements and model-free point measurements. Neither of these is used to directly observe velocity. Thus, in the following, we can assume without
loss of generality that the system state is described by the 6 degree-of-freedom pose vector, without the velocity information. Also, it is assumed that the camera is calibrated, that is, we use calibrated camera frame coordinates instead of image pixel coordinates. This assumption can be made as efficient camera calibration procedures are readily available, e.g. [33,34].

4.2.1 Pose measurements

The measurement model h(·) describes how the measurements are related to the system state. In the case when all degrees of freedom are observable using the model, the measurement model for the model-based tracking can be written simply as

h_F(x) = x = (X Y Z φ θ ψ)^T.   (12)

To convert the measured translation t̂ and rotation R̂ into the same form, a function g_F(t̂, R̂) can be defined as

g_F(t̂, R̂) = \begin{pmatrix} t̂ \\ α(R̂ R_0^T) \end{pmatrix}   (13)

where α(R) is a function converting the matrix representation of rotation to incremental angles. Using this formulation, the gradient of the measurement function, needed to compute the Kalman gain, becomes simply an identity matrix:

∂h_F(x)/∂x = I_6.   (14)

When all rotational degrees of freedom are unobservable using the model, they can be ignored, and the measurement functions become

h_S(x) = (X Y Z)^T,   g_S(t̂) = t̂   (15)

and the gradient is

∂h_S(x)/∂x = (I_3  0_3).   (16)

When there is only one unobservable degree of freedom, such as in the case of a truncated cone, the conversion functions become more complex. The measurement function for the cone model presented in Sect. 3.3 is

h_C(x) = (X Y Z α_x(R_0^T R_p) α_z(R_0^T R_p))^T   (17)

where R_p is the predicted rotation. Thus, only rotations around the x and z axes of the object frame can be measured. These angles are also represented as instantaneous, that is, as incremental angles compared to the previous time instant. It should also be noticed that, because the model has no predictive function, R_p = R_0, and both α_x and α_z are zero. While only two degrees of freedom of rotational motion are measured, these two do not necessarily correspond to rotations around two particular coordinate axes of the camera
frame, but instead to some combination of the three axes. As the Kalman filter is a linearized model of the system, this is taken into account by the linearization. In addition, the orientation of the object must be transformed from the camera frame to the object frame. Therefore, the conversion function for the measurement is

g_C(x) = \begin{pmatrix} t̂ \\ α_x(R_0^T R̂) \\ α_z(R_0^T R̂) \end{pmatrix}.   (18)

While the absence of prediction causes both of the measured angles to be zero, the form of the function in (17) is still necessary for the calculation of the measurement gradient. The gradient of the measurement function at R_p = R_0 then becomes

\frac{∂h_C(x)}{∂x} = \begin{pmatrix} I_3 & 0_3 \\ 0 & R_p(1,1)\; R_p(2,1)\; R_p(3,1) \\ 0 & R_p(1,3)\; R_p(2,3)\; R_p(3,3) \end{pmatrix},   (19)

where R_p(i, k) is the (i, k) element of R_p.

4.2.2 Model-free measurements

For the integration of point measurements, two approaches are considered: (1) directly integrating the 2-D point measurements in the Kalman filter using a non-linear measurement model corresponding to the perspective projection, and (2) using M-estimators to robustly estimate the current pose and integrating the pose measurement using a linear measurement model. The two approaches have different computational complexities and error characteristics. It should be noted that in both approaches the following simplification is made compared to full structure-from-motion: for all point features, the 3-D object frame location is recorded during the initialization using the current pose estimate, as presented in Sect. 3.3. Therefore, the point location is not included in the Kalman filter state, but the associated uncertainty is modeled as part of the measurement uncertainty. This choice makes real-time operation possible even with a large number of points.

4.2.3 Direct integration

In the direct integration approach, the image measurements of model-free points are used directly as IEKF measurements. The 3-D locations of the points with respect to the object origin, stored during the point initialization, are used as references. Therefore, the measurement function h_j(·) is the perspective projection of a known 3-D point q_j:

h_j(x) = (X_j/Z_j  Y_j/Z_j)^T,   where (X_j Y_j Z_j)^T = R_p(x, R_0) q_j + t(x)   (20)

where R_p(x, R_0) is the rotation matrix taking into account the accumulated rotation R_0 and the rotational degrees of freedom of the state x, and t(x) represents the translational component of the state. The same function is used for each of the measured points. The gradient of the measurement function can be calculated from (20), but is not shown here for the sake of brevity.

The measurement errors are assumed to be independent with respect to the coordinate axes. They are also assumed to be independent for each point. Thus, the measurement covariance matrix for point measurements can be written as S_I = σ_i² I, where σ_i² is the image measurement variance.
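Since the gradient of Eq. (20) is not reproduced in the text, the sketch below shows one way the point measurement function could be evaluated, together with a finite-difference approximation of its 2×6 Jacobian. The first-order (small-angle) application of the incremental angles to R_0 and all names are our assumptions, not the paper's implementation, which uses an analytic gradient.

```python
import numpy as np

def skew(w):
    """Skew-symmetric matrix of a 3-vector (cf. Eq. (8))."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def point_measurement(x, R0, q):
    """Eq. (20): perspective projection of a stored object-frame point q.
    x = (X, Y, Z, phi, theta, psi); the incremental angles are applied to
    the accumulated rotation R0 with a first-order update (our assumption
    about R_p(x, R0))."""
    Rp = (np.eye(3) + skew(x[3:6])) @ R0
    Xc = Rp @ q + x[:3]
    return np.array([Xc[0] / Xc[2], Xc[1] / Xc[2]])

def point_measurement_jacobian(x, R0, q, eps=1e-6):
    """Numerical 2x6 Jacobian of the point measurement function."""
    h0 = point_measurement(x, R0, q)
    J = np.zeros((2, 6))
    for i in range(6):
        dx = np.zeros(6)
        dx[i] = eps
        J[:, i] = (point_measurement(x + dx, R0, q) - h0) / eps
    return J
```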
4.2.4 M-estimators

The direct integration approach above suffers from measurement outliers. To increase the tolerance to outliers, a robust M-estimator can be used [35]. To outline the approach, the M-estimator is first used to find the best pose corresponding to the current measurements. Then, the model-free pose measurement is integrated in the Kalman filter with the model-based measurement. It should be noted that M-estimators could also be used to weight image errors in an integration approach similar to the one presented in Sect. 4.2.3; however, the approach presented here can be computationally more efficient.

The M-estimators are used to minimize a robust error function, based on the sum of the squared errors between image measurements and the recorded 3-D points. The sum-of-squared-errors measure of the 3D-2D projection error can be written as

d = \sum_j e_j^2,\qquad e_j^2 = \left(x_j − \frac{X_j}{Z_j}\right)^2 + \left(y_j − \frac{Y_j}{Z_j}\right)^2   (21)

where (x_j, y_j) is the measured position, and (X_j, Y_j, Z_j) are its coordinates in the camera frame, from (20). Instead of solving the optimization problem for the squared error criterion in (21), the squared residual is replaced with a robust weighting function. A Tukey weight function [36] is used, and thus the optimization problem can be written

\min_{R,t} \sum_j w_{TUK}(e_j)\, e_j^2   (22)

with the weight function defined as

w_{TUK}(x) = \begin{cases} \left(1 − (x/c)^2\right)^2 & \text{if } x \le c \\ 0 & \text{if } x > c \end{cases}   (23)

where c is a parameter describing the outlier rejection threshold. The problem is solved as iteratively reweighted least squares, such that for each measurement a weight is first determined, then a weighted least squares optimization is performed, and these two steps are repeated until convergence.

The least squares optimization is performed by a Polak–Ribiere variant of the conjugate gradient method [37]. Each line search along the gradient is performed by first bracketing the minimum by golden section bracketing, and then locating it by Brent's method [38]. A constant value of c is used for the outlier rejection threshold, as it directly relates to the image pixel error and can reasonably easily be estimated for a particular application.

The estimated pose (R, t) is finally integrated in the Kalman filter using a measurement function identical to that of model-based tracking without unobservable degrees of freedom, presented in (12). The conversion function is also identical to (13). The measurement uncertainty can be written as

S_M = \begin{pmatrix} σ_t^2 I_3 & 0_3 \\ 0_3 & σ_φ^2 I_3 \end{pmatrix}.

By adjusting the translational and rotational uncertainties σ_t and σ_φ, the effect of model-free measurements can be controlled. For example, using a very large σ_t makes the effect of model-free measurements negligible with respect to the translation of the object, such that only rotation is estimated using them.
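A compact sketch of the resulting iteratively reweighted least squares loop is given below. The two callables and their names are placeholders for the residual computation and the conjugate-gradient pose refinement described above; this is not the authors' code.

```python
import numpy as np

def tukey_weight(e, c):
    """Tukey weight function of Eq. (23)."""
    return (1.0 - (e / c) ** 2) ** 2 if abs(e) <= c else 0.0

def irls_pose(residual_fn, solve_weighted_ls, pose0, c, n_iter=10):
    """Iteratively reweighted least squares for the robust pose of Eq. (22).
    residual_fn(pose) returns the per-point reprojection errors e_j;
    solve_weighted_ls(weights, pose) performs one weighted least-squares
    pose refinement (the paper uses a Polak-Ribiere conjugate gradient)."""
    pose = pose0
    for _ in range(n_iter):
        e = residual_fn(pose)                       # current residuals e_j
        w = np.array([tukey_weight(ej, c) for ej in e])
        pose = solve_weighted_ls(w, pose)           # reweighted refinement
    return pose
```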
5 Experimental evaluation

Accuracy experiments were performed on two recorded sequences for which we also estimated the ground truth, to allow repeated tests and a statistical evaluation of accuracy. For these experiments, two different rectangular target objects, seen in Fig. 4, were used. The lengths of the sequences are 173 s (Sequence 1) and 157 s (Sequence 2). The sequences were recorded by moving a camera mounted on a robot arm. Ground truth was generated by recording the robot trajectory calculated using the forward kinematics and determining the hand-eye calibration by the method of Tsai and Lenz [39]. A linear projective model was used as the intrinsic model of the camera.

5.1 Tracking accuracy

Tracking accuracy was studied for both sequences and the two zeroth-order motion models described in Sect. 4.1. The typical number of tracked point features for the model-free tracker was 25–50 for Sequence 1 and 10 for Sequence 2. Figures 5, 6, 7 and 8 show the error magnitudes for the two sequences for both motion models. For Sequence 1, it can be seen that the moving camera model is less prone to sudden changes in rotation error. This is because the model restricts the rotation of the object to co-occur with suitable translations. Both models had similar average errors, the moving object model having a slightly lower error of 1.7 cm in translation and 3.8° in rotation compared to 1.9 cm and 3.9° with the moving camera model.
Fig. 4 Test sequences 1 (top) and 2 (bottom)

Fig. 5 Moving object model, Sequence 1 (translation error [m] and rotation error [deg] versus time [s])

Fig. 6 Moving camera model, Sequence 1 (translation error [m] and rotation error [deg] versus time [s])

Fig. 7 Moving object model, Sequence 2 (translation error [m] and rotation error [deg] versus time [s])

Fig. 8 Moving camera model, Sequence 2 (translation error [m] and rotation error [deg] versus time [s])
Table 1 Integration results. Average errors in meters/degrees

                 Model       Model-free   Both
Moving object    0.018/4.1   0.25/24.8    0.017/3.8
Moving camera    0.052/4.0   0.075/27.7   0.019/3.9

Fig. 9 Drift in model-free tracking (translation error [m] and rotation error [deg] versus time [s])

Table 2 Errors with lower sampling rates (average errors in meters/degrees)

          Moving object   Moving camera   Constant velocity
30 Hz     0.017/3.8       0.019/3.9       0.018/4.0
15 Hz     0.019/4.4       0.020/4.8       0.019/4.5
10 Hz     0.038/13.8      0.037/13.4      0.019/4.8
7.5 Hz    0.038/13.8      0.038/13.6      0.020/5.1
6 Hz      0.039/13.9      0.041/17.6      0.038/13.8
For Sequence 2, the errors were 6.5 cm, 12.2° (moving object) and 6.5 cm, 12.1° (moving camera). Here, the rotation error was remarkably higher. This is because in this sequence one of the visible edges cannot be detected, as the lighting angle causes the pixels on both sides of the edge to have the same color. Thus, the edge is mismatched to a neighboring edge, which causes the pose to rotate somewhat. This is a problem that cannot be solved with the model-free features and requires further study.

The use of robust optimization for determining the pose was studied with Sequences 1 and 2. It was found that there is no significant difference in accuracy compared to the direct integration of point measurements. Both of these sequences have a clear texture on the object visible all the time, and thus there are no spurious measurement outliers which would cause the direct integration to give an erroneous estimate.

The use of integration was also inspected by ignoring one of the cues. It should be noted that the model-based approach used alone typically results in a failure if there are strong edges in the background, as demonstrated in Fig. 1. In order to study the accuracy, neither of the sequences had a strong background edge coinciding with one of the object edges, which would have made the model tracker fail. The results for Sequence 1 can be seen in Table 1. It can be seen that the use of only model-free features results in poor accuracy. This is because the pose drifts, as the features only describe relative pose changes without respect to the absolute error. This drifting is demonstrated in Fig. 9. Thus, the integration did not have a dramatic effect in the cases where the model-based approach did not fail completely, but it still improved the accuracy slightly.

The need for motion modeling in tracking fast moving targets was inspected by lowering the sampling rate of the sequence to integer fractions of the full rate. The errors for different sampling rates are shown in Table 2. The errors for the two zeroth-order models increase suddenly at the 10 Hz
Fig. 10 Error at 6 Hz sampling rate
sampling rate, because at that rate the object motion at one point of the sequence is fast enough that one of the tracked edges is confused with another. After that point, the tracker has a constant error offset, which can be seen in Fig. 10, which
Fig. 11 Cone test sequence—here, the cup with the Santa Claus motif is tracked. The estimated pose is overlaid in white
Fig. 12 Sphere test sequence—the estimated pose is overlaid in white
shows the evolution of the error at the 6 Hz sampling rate. The constant velocity motion model can handle sampling rates down to 7.5 Hz, after which it exhibits the same phenomenon. Thus, the use of the constant velocity motion model allowed the tracking of objects with twice as high velocities. The important conclusion here is that while the constant velocity assumption is not exactly correct, which can be seen as slightly greater errors at high sampling rates, the assumption
makes it possible to track objects with significantly higher velocities, corresponding to lower sampling rates.
5.2 Partially observable models

Two test sequences are presented in Figs. 11 and 12, one for tracking a truncated cone and another for a sphere.
Fig. 13 Estimated sphere orientation with M-estimators (rotation about the camera X, Y and Z axes versus frame number)

Fig. 14 Estimated sphere orientation with direct integration (rotation about the camera X, Y and Z axes versus frame number)

Fig. 15 Estimated cone orientation with M-estimators (rotation about the camera X, Y and Z axes versus frame number)
The sequences are overlaid with the projected model edges, using the M-estimator approach for integration. Figures 13 and 14 present the tracked angles for the sphere model undergoing the motion shown in Fig. 12, for both M-estimators and direct integration. The three curves correspond to the rotations around the three principal axes of the camera frame. For that reason, the rotation is not exactly around the vertical axis of the camera during the beginning of the scene, where the object is rotated around the world vertical axis. The rotation reaches its maximum near the 400th frame, corresponding to the fourth image of Fig. 12. After that, the object is rotated backwards until frame 700, after which the object is moved and rotated along a more complex trajectory.

The graph in Fig. 13 represents the true motion relatively well, while the direct integration approach presented in Fig. 14 detects a rotation around the z-axis which is not present. This demonstrates qualitatively that the M-estimator approach is able to detect the rotation even though the orientation is unobservable from the model point of view. One particular thing to notice is that the distance to the object is somewhat overestimated for a short while during the tracking (see the fourth image in Fig. 12). The distance is mostly estimated by the model-based tracker, and at this point one of the object edges has a gradual shadow and the edge cannot be reliably detected, which causes the overestimation of the distance.
Fig. 16 Estimated cone orientation with direct integration
An important point to notice is that this does not seem to cause significant problems for the estimation of the orientation, as can be seen from the symmetry of the rotation trajectories in Fig. 13. In contrast, there are spurious changes present in Fig. 14, which result from the bad quality of some of the tracked points. This sequence is used here to demonstrate that the M-estimators are important in reducing the effect of outliers, especially when the texture of the object does not present easy tracking targets.

Figures 15 and 16 present the tracked angles for the cone model under the sequence shown in Fig. 11 for both integration approaches. Consistent behavior can be seen in this sequence for both of the methods. In this sequence, the average number of tracked points is larger, and there are no spurious errors in the points. It should be noticed that the model-based tracker has here a significant contribution to the observable rotational degrees of freedom, because the model-free features are initialized on-line and thus measure only relative changes, and for that reason they do not suppress consistently incorrect estimates of the model-based tracker.

5.3 Performance

The computational feasibility of the methods was examined by measuring the average processing times over long sequences of images. The average per-frame times are presented in Table 3.
Table 3 Average processing times

Method               Sequence 1 (ms)   Sphere (ms)   Cone (ms)
Direct integration   61.1              2.8           5.4
M-estimators         13.1              3.1           5.0
The time required for retrieving the images is not included in the processing time. The tests were run on a 1.7 GHz Pentium M laptop. With Sequence 1, the direct integration approach is over four times slower than the M-estimator optimization. This is because the number of tracked features is high, causing a high-dimensional measurement to be integrated in the Kalman filter when the direct integration approach is used. In the M-estimator approach, this computational load is moved to solving the optimization problem, and the Kalman filter only integrates the 6-DOF pose. For the sphere and cone sequences, the number of tracked features is much lower. With the sphere sequence, tracking is slightly faster using the direct integration model, because the number of image features is small, keeping the measurement of the Kalman filter low-dimensional. When this dimensionality grows, the processing time increases. With the cone sequence, the M-estimator approach has a lower processing time than the direct integration, as its computational complexity grows more slowly with the number of measurements. However, both approaches seem to be computationally feasible for real-time use if the number of tracked features is comparably low, with the optimization approach gaining an advantage when the number of features increases.
6 Conclusion

Applications of model-based tracking range from intelligent user interfaces to augmented reality and robotic object manipulation. Although it is a widely studied problem, there is still a lack of a unified framework or a very robust system suitable for different application areas. In this paper, we have presented a method to integrate information from model-based and model-free cues for 3-D object tracking. The method is based on an iterated extended Kalman filter, and three different motion models have been presented. Most approaches in the literature use the moving object motion model without much further thought, but we suggest that it is important to consider the physical configuration of a particular application in the choice of the motion model. For example, if the apparent target motion is caused by a moving observer, taking this into account in the motion model can improve results. The system run-time may differ between iterations, which is taken into account in the
changes of the state covariance. Two types of image features, object edges and interest points, are used to compensate for each other's weaknesses. We believe that the integration of model-free cues can benefit a wide variety of tracking approaches.

The presented approach is based on a Kalman filter. As an alternative, particle filter approaches are also popular in the vision community, as they allow tracking of multiple hypotheses. However, it is not clear if such approaches are useful for real-time 3-D tracking, as the state space has 6 degrees of freedom compared to 2 in image plane tracking, requiring the number of particles to be increased significantly to preserve the same particle density. An approach parallel to the use of pose tracking is the direct use of image measurements in visual control, for example, in image-based visual servoing. However, we believe that the use of multiple visual cues is very important in avoiding the weaknesses of each individual cue. In that context it is important to note that the work presented in this paper also applies to visual control strategies other than basic position-based servoing, e.g., [40].

Acknowledgment  V. Kyrki was supported by Academy of Finland grant No. 114646.
References

1. Vacchetti, L., Lepetit, V., Fua, P.: Stable real-time 3D tracking using online and offline information. IEEE Trans. Pattern Anal. Mach. Intell. 26, 1385–1391 (2004)
2. Vincze, M., Ayromlou, M., Ponweiser, M., Zillich, M.: Edge projected integration of image and model cues for robust model-based object tracking. Int. J. Robot. Res. 20(7), 533–552 (2001)
3. Taylor, G., Kleeman, L.: Fusion of multimodal visual cues for model-based object tracking. In: Australasian Conf. on Robotics and Automation, Brisbane, Australia (2003)
4. Harris, C.: Tracking with rigid models. In: Blake, A., Yuille, A. (eds.) Active Vision, Ch. 4, pp. 59–73. MIT Press, Cambridge (1992)
5. Comport, A., Marchand, E., Chaumette, F.: A real-time tracker for markerless augmented reality. In: IEEE Int. Symp. on Mixed and Augmented Reality, pp. 36–45 (2003)
6. Vincze, M., Ayromlou, M., Kubinger, W.: An integrating framework for robust real-time 3D object tracking. In: Int. Conf. on Comp. Vis. Syst., ICVS'99, pp. 135–150 (1999)
7. Koller, D., Daniilidis, K., Nagel, H.: Model-based object tracking in monocular image sequences of road traffic scenes. Int. J. Comput. Vis. 10(3), 257–281 (1993)
8. Drummond, T., Cipolla, R.: Real-time visual tracking of complex structures. IEEE Trans. PAMI 24(7), 932–946 (2002)
9. Malis, E., Chaumette, F., Boudet, S.: 2-1/2-D visual servoing. IEEE Trans. Robot. Autom. 15(2), 238–250 (1999)
10. Jurie, F., Dhome, M.: Real time tracking of 3D objects: an efficient and robust approach. Pattern Recognit. 35, 317–328 (2002)
11. Dickmanns, E.D., Graefe, V.: Dynamic monocular machine vision. Mach. Vis. Appl. 1, 223–240 (1988)
12. Lowe, D.G.: Robust model-based motion tracking through the integration of search and estimation. Int. J. Comput. Vis. 8(2), 113–122 (1992)
13. Wunsch, P., Hirzinger, G.: Real-time visual tracking of 3-D objects with dynamic handling of occlusion. In: IEEE Int. Conf. on Robotics and Automation, ICRA'97, Albuquerque, New Mexico, USA, pp. 2868–2873 (1997)
14. Masson, L., Jurie, F., Dhome, M.: Robust real time tracking of 3D objects. In: Int. Conf. Pattern Recognit., vol. 4, pp. 252–255 (2004)
15. Klein, G., Drummond, T.: Robust visual tracking for non-instrumented augmented reality. In: Proc. 2nd IEEE and ACM International Symposium on Mixed and Augmented Reality, pp. 113–122 (2003)
16. Rasmussen, C., Hager, G.: Probabilistic data association methods for tracking complex visual objects. IEEE Trans. PAMI 23(6), 560–576 (2001)
17. Triesch, J., von der Malsburg, C.: Self-organized integration of adaptive visual cues for face tracking. In: Fourth IEEE International Conference on Automatic Face and Gesture Recognition, pp. 102–107 (2000)
18. Kragic, D., Christensen, H.I.: Cue integration for visual servoing. IEEE Trans. Robot. Autom. 17(1), 18–27 (2001)
19. Hayman, E., Eklundh, J.-O.: Statistical background subtraction for a mobile observer. In: Proceedings of the Ninth IEEE International Conference on Computer Vision, pp. 67–74 (2003)
20. Vacchetti, L., Lepetit, V., Fua, P.: Combining edge and texture information for real-time accurate 3D camera tracking. In: Proceedings of International Symposium on Mixed and Augmented Reality, Arlington, VA, USA (2004)
21. Rosten, E., Drummond, T.: Fusing points and lines for high performance tracking. In: IEEE International Conference on Computer Vision, vol. 2, pp. 1508–1511. Beijing, China (2005)
22. Kyrki, V., Kragic, D.: Integration of model-based and model-free cues for visual object tracking in 3D. In: IEEE International Conference on Robotics and Automation, ICRA'05, pp. 1566–1572 (2005)
23. Kyrki, V., Schmock, K.: Integration methods of model-free features for 3D tracking. In: Scandinavian Conference on Image Analysis, pp. 557–566 (2005)
24. Brox, T., Rosenhahn, B., Cremers, D., Seidel, H.P.: High accuracy optical flow serves 3-D pose tracking: exploiting contour and flow based constraints. In: European Conference on Computer Vision, ECCV 2006, Graz, Austria, pp. 98–111 (2006)
25. Pressigout, M., Marchand, E.: Real-time hybrid tracking using edge and texture information. Int. J. Robot. Res. 26(7), 689–713 (2007)
26. Hager, G., Belhumeur, P.: Efficient region tracking with parametric models of geometry and illumination. IEEE Trans. Pattern Anal. Mach. Intell. 20(10), 1025–1039 (1998)
27. Lu, S., Metaxas, D., Samaras, D., Oliensis, J.: Using multiple cues for hand tracking and model refinement. In: Computer Vision and Pattern Recognition, vol. 2, pp. 443–450. Madison, Wisconsin, USA (2003)
28. Harris, C.J., Stephens, M.: A combined corner and edge detector. In: Proc. 4th Alvey Vision Conference, Manchester, UK, pp. 147–151 (1988)
29. Lefebvre, T., Bruyninckx, H., De Schutter, J.: Kalman filters for non-linear systems: a comparison of performance. Int. J. Control 77(7), 639–653 (2004)
30. Isard, M., Blake, A.: CONDENSATION-Conditional density propagation for visual tracking. Int. J. Comput. Vis. 29(1), 5–28 (1998)
31. Bar-Shalom, Y., Li, X.R., Kirubarajan, T.: Estimation with Applications to Tracking and Navigation. Wiley-Interscience, New York (2001)
32. Welch, G., Bishop, G.: SCAAT: incremental tracking with incomplete information. In: Proc. Computer Graphics and Interactive Techniques, pp. 333–344. Los Angeles, CA, USA (1997)
33. Heikkila, J., Silven, O.: A four-step camera calibration procedure with implicit image correction. In: IEEE Computer Vision and Pattern Recognition, pp. 1106–1112 (1997)
34. Zhang, Z.: A flexible new technique for camera calibration. IEEE Trans. Pattern Anal. Mach. Intell. 22(11), 1330–1334 (2000)
35. Huber, P.J.: Robust estimation of a location parameter. Ann. Math. Stat. 35, 73–101 (1964)
36. Huber, P.J.: Robust Statistics. Wiley, New York (1981)
37. Press, W.H., Teukolsky, S.A., Vetterling, W.T., Flannery, B.P.: Numerical Recipes in C++. Cambridge University Press, Cambridge (2002)
38. Brent, R.P.: Algorithms for Minimization without Derivatives. Prentice-Hall, Englewood Cliffs (1973)
39. Tsai, R., Lenz, R.K.: A new technique for fully autonomous and efficient 3D robotics hand/eye calibration. IEEE Trans. Robot. Autom. 5(3), 345–358 (1989)
40. Kyrki, V., Kragic, D., Christensen, H.I.: New shortest-path approaches to visual servoing. In: IEEE/RSJ Int. Conf. Intell. Robots and Systems, Sendai, Japan (2004)