Pers Ubiquit Comput (2003) 7: 287–298 DOI 10.1007/s00779-003-0241-z

ORIGINAL ARTICLE

Ke Xu · Simon J. D. Prince · Adrian David Cheok · Yan Qiu · Krishnamoorthy Ganesh Kumar

Visual registration for unprepared augmented reality environments

Received: 31 January 2003 / Accepted: 2 April 2003 / Published online: 11 September 2003
© Springer-Verlag London Limited 2003

Abstract Despite the increasing sophistication of augmented reality (AR) tracking technology, tracking in unprepared environments remains an enormous challenge according to a recent survey. Most current systems are based on a calculation of the optical flow between the current and previous frames to adjust the label position. Here we present two alternative algorithms based on geometrical image constraints. The first is based on epipolar geometry and provides a general description of the constraints on image flow between two views of a static scene. The second is based on the calculation of a homography relationship between the current frame and a stored representation of the scene. A homography can exactly describe the image motion when the scene is planar, or when the camera movement is a pure rotation, and provides a good approximation when these conditions are nearly met. We assess all three styles of algorithm with a number of criteria, including robustness, speed and accuracy. We demonstrate two real-time AR systems based on the estimation of the homography: one is an outdoor geographical labelling/overlaying system, and the other is an AR Pacman game application.

Keywords Augmented reality · Fundamental matrix · Homography · Optical flow · Vision-based tracking

K. Xu · S. J. D. Prince · A. D. Cheok (✉) · Y. Qiu · K. G. Kumar
Department of Electrical and Computer Engineering, National University of Singapore, 117576 Singapore, Singapore
E-mail: [email protected]

1 Introduction

A number of papers have introduced the notion of direct scene annotation for augmented reality (AR) systems. Applications have included a general real-world "windows" system, the annotation of equipment for instruction and maintenance, and the labelling of geographical landmarks for navigation [1, 2, 3, 4, 5] (see Fig. 1). The simplest technique for registering virtual information is to introduce artificial "fiducial" markers into the scene, which can easily be detected by a computer vision system, e.g., [6, 7]. Since the positions and/or shapes of these markers are known in advance, it is relatively easy to establish the position of the camera relative to the scene and introduce the virtual content. Recent studies have attempted to replace these fiducial markers with "natural feature tracking". This is particularly important for applications where the environment is necessarily unprepared, e.g., outdoor geographical labelling applications [8, 9, 10]. The aim of these applications is to calculate the image motion field between two different pictures of the same geographical object. If we know where the text label should go in the first image, then the motion field tells us where to place it in the second image. The goal of this paper is to introduce two algorithms which compute highly reliable approximations to this motion field.

Most current techniques are based on an initial optical flow calculation. For example, in [2], pointwise optical flow measurements are used to establish affine mappings of individual image regions. Hence, the position of a given feature in the image may be tracked from frame to frame. However, there are a number of disadvantages to such techniques. Optical flow calculation is not robust and may give erroneous velocity estimates. It can also only measure small velocities and is not well suited to large image motion, such as that created by camera rotation. Finally, and most importantly, optical flow calculation does not take into account global geometric constraints on the image flow field (see Fig. 2).

Geographical annotation applications involve a camera moving in a (mostly) static, rigid environment. Under these circumstances the image flow is constrained by epipolar geometry. When the camera motion is a pure rotation or the scene is planar, even stricter constraints are placed on the motion flow field. In this paper, we exploit these constraints to make fast, robust estimates of the image flow field and to label points in the scene with sub-pixel resolution. In the following section we describe these mathematical constraints on image motion. We then discuss how to estimate these flow constraints before considering how this information can be applied to geographic labelling. Finally, we provide a detailed analysis of the advantages and limitations of these techniques and of how to combine them into a full wearable computing application.

Fig. 1 Geographic labelling refers to the real-time annotation of outdoor scenes via augmented reality displays

2 Image motion constraints

It is common to restrict the estimation of the velocity field to particular "features of interest" in the image. These are chosen to exhibit stability when viewed from different angles, and to provide unambiguous motion estimates. The general approach is to estimate the motion of these points and then interpolate between these known motions to estimate the velocity at general points in the image. This technique also has the benefit of reducing the image motion estimation problem to establishing matches for just a few points, which is computationally less intensive than estimating a dense velocity map.

2.1 The detection of the feature corners

The "features of interest" used in this paper are based on the image structure tensor:

A = \sum_{W} \begin{pmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{pmatrix}

where I_x and I_y are the derivatives of the luminance in the x and y directions and W is a small patch of the image surrounding the point in question. We denote points in the image where both eigenvalues of this matrix are large as features of interest, or "corners". In practice we use a Harris corner detector, which is based on this structure tensor, to identify the points of interest in each image; more information about this detector can be found in [11].
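In a system like this, the corner detection step might look like the following Python sketch, which uses OpenCV's Harris-based corner detector; the function name and the specific parameter values (corner count, quality threshold, minimum spacing) are our own illustrative choices rather than the authors' exact settings.

```python
import cv2
import numpy as np

def detect_corners(gray, max_corners=60, min_distance=20):
    """Detect Harris-style 'features of interest' in a grayscale image.

    Returns an (N, 2) array of (x, y) corner positions. The minimum
    distance keeps the corners spread out across the image, in the
    spirit of the 20 pixel spacing used later in the paper.
    """
    corners = cv2.goodFeaturesToTrack(
        np.float32(gray),
        maxCorners=max_corners,
        qualityLevel=0.01,          # relative threshold on the corner response
        minDistance=min_distance,   # enforce spatial separation between corners
        useHarrisDetector=True,     # Harris response rather than min-eigenvalue
        k=0.04)
    if corners is None:
        return np.empty((0, 2))
    return corners.reshape(-1, 2)
```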

2.2 The epipolar constraint

Consider a camera viewing the same scene from two different angles (see Fig. 3). Given the image of a point in the first camera, we know only its direction from the optical centre, and not its depth. Hence, it may lie anywhere along a line in space. It follows that the image of this point in the second camera must lie somewhere along the projection of this line. Every such line in three-dimensional space passes through the optical centre of camera one, and hence every projected line in the second camera passes through the projection of the first camera's optical centre. This projection is termed the epipole.

These constraints can be expressed mathematically by the fundamental matrix, F. It can be shown that corresponding (homogeneous) points x in image 1 and x' in image 2 are related by the 3×3 matrix F such that:

x'^T F x = 0

The fundamental matrix, F, maps any point in one image to a line in the second image. This reduces the search for the correct velocity to a one-dimensional problem, and the constraint can be used as a criterion for rejecting plausible, but false, matches.
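As a concrete illustration, the point-to-line mapping and the distance used to reject false matches can be written as the short Python sketch below; it assumes F is already known, and the helper names are ours.

```python
import numpy as np

def epipolar_line(F, x):
    """Map an image-1 point x = (x1, x2) to its epipolar line in image 2.

    Returns line coefficients (a, b, c) such that a*u + b*v + c = 0
    for points (u, v) on the line."""
    return F @ np.array([x[0], x[1], 1.0])

def epipolar_distance(F, x, x_prime):
    """Perpendicular distance of the image-2 point x' from the epipolar
    line of x. Large values indicate a plausible but false match."""
    a, b, c = epipolar_line(F, x)
    u, v = x_prime
    return abs(a * u + b * v + c) / np.hypot(a, b)
```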

Fig. 2a–c Image motion constraints. In each case we are attempting to calculate the motion or flow between the left and right images. a Results of optical flow calculation. A noisy estimate of the image movement is detected independently at each corner point. A given point, such as the one denoted by a square, can map to anywhere in the second image. b An epipolar constraint. For arbitrary movement in a static scene, a given point in the first image is constrained to lie on a line in the second image. The mapping from point to line is described by the fundamental matrix. c In certain cases, the image flow is well described by a homography. This maps a point in the first image to a unique point in the second image

2.3 The homography

Under some special circumstances, image motion between two views of the same rigid scene is further constrained (see Fig. 4). Two images are related by a homography if their (homogeneous) point positions are related by a 3×3 matrix H such that:

x' = H x

Notice that this constraint is much stronger than the epipolar constraint: if we know H, we can predict exactly where a feature in the first image will appear in the second image. The homography is effectively a linear transformation of the rays projecting from the camera centre. Equivalently, one can consider linearly transforming the image plane of the first camera so that it cuts the rays at a different angle. Real-world situations where image motion is exactly modelled by a homography include pure camera rotations, and general camera movements when viewing a planar structure. Although these conditions may seem restrictive, the homography provides a good approximation when they are nearly met. For example, when the scene is mostly planar (e.g., when looking at the front of a building), or when the camera translation is small relative to the distance to the object (e.g., when looking at something far away), the image motion may be well described by a homography.
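For the homography the prediction is a single point rather than a line; a minimal sketch (our own helper, assuming H is already known) is:

```python
import numpy as np

def map_point(H, x):
    """Predict where the image-1 point x = (x1, x2) appears in image 2
    under x' = Hx, renormalising by the third homogeneous coordinate."""
    xp = H @ np.array([x[0], x[1], 1.0])
    return xp[:2] / xp[2]
```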

2.4 Estimating F and H

We have argued that the velocity field between two images is constrained, and that given the matrices F or H, we can restrict or predict the possible motions of a given point. We now turn to the estimation of F and H. In each case, the elements of the matrix can be solved for using a linear algorithm, given a certain number of known point matches between the two images. For the fundamental matrix, each pair of matched points gives us one linear constraint on the elements of F:

x'_1 x_1 F_{11} + x'_1 x_2 F_{12} + x'_1 F_{13} + x'_2 x_1 F_{21} + x'_2 x_2 F_{22} + x'_2 F_{23} + x_1 F_{31} + x_2 F_{32} + F_{33} = 0
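For concreteness, one such constraint row (with the nine unknowns ordered F11, F12, ..., F33) could be built as in the sketch below; the helper name is ours.

```python
import numpy as np

def fundamental_constraint_row(x, x_prime):
    """One linear constraint on the nine elements of F from a single
    point match x -> x'. Eight such rows, stacked and solved as a
    homogeneous system, determine F up to scale."""
    x1, x2 = x
    xp1, xp2 = x_prime
    return np.array([xp1 * x1, xp1 * x2, xp1,
                     xp2 * x1, xp2 * x2, xp2,
                     x1, x2, 1.0])
```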



Fig. 3 Epipolar constraint. Consider two cameras viewing the same scene from different positions. The three-dimensional point corresponding to a point in one image must lie somewhere along the ray through that image point and the optical centre. Its position in the second image is therefore constrained to lie on the projection of this ray, which is known as an epipolar line. This mapping from points to lines is described by the fundamental matrix

Fig. 4 A geometric description of the homography. On the left, several camera planes intersect the rays from a three-dimensional cube. Each pair of these images is related by a homography. This includes rotations: the images on planes 2a (solid) and 2b (dashed) would be related by a homography. Arbitrary views of planar scenes are also covered: both views of the photo are related to the original (and hence to each other) by a homography

It takes eight such independent constraints to estimate the elements of F, since the matrix is only defined up to scale; hence, we need eight point matches. For the case of the homography, we take the cross product of the left-hand side with the right-hand side of the previous equation to generate three constraint equations:

x' \times H x = \begin{pmatrix} x'_2 \, h_3^T x - h_2^T x \\ h_1^T x - x'_1 \, h_3^T x \\ x'_1 \, h_2^T x - x'_2 \, h_1^T x \end{pmatrix} = 0

where h_i^T denotes the i-th row of H and x' is scaled so that its third homogeneous coordinate is 1. Only two of these three equations are independent, so four point matches are required to solve for the elements of H up to scale. Although these provide linear solutions for F and H, the measurement uncertainty lies in the image positions, which enter these equations quadratically. Hence, the solutions to these sets of equations are generally used as the initial point in a nonlinear optimisation procedure.
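A minimal, unnormalised direct linear transform for the homography case is sketched below (the fundamental-matrix case stacks the constraint rows shown earlier in exactly the same way); a practical implementation would add coordinate normalisation and the nonlinear refinement mentioned above, and the function name is ours.

```python
import numpy as np

def homography_dlt(src, dst):
    """Estimate H up to scale from N >= 4 point matches.

    src, dst: (N, 2) arrays of corresponding (x, y) positions.
    Stacks two linear constraints per match into A h = 0 and takes the
    singular vector of A with the smallest singular value as the nine
    elements of H."""
    rows = []
    for (x, y), (xp, yp) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, xp * x, xp * y, xp])
        rows.append([0, 0, 0, -x, -y, -1, yp * x, yp * y, yp])
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    H = vt[-1].reshape(3, 3)
    return H / np.linalg.norm(H)   # the overall scale is arbitrary
```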

2.5 A robust estimation

The question remains as to how exactly we might identify corresponding points in order to form these linear constraints. As mentioned earlier, we use a Harris corner detector [11] to identify the points of interest in each image. We constrain these points to be at least 20 pixels apart to ensure that we get a good spread of corners across each image. If there are 50 corners in each image, we have 2500 potential matches between the two images. However, some of these matches are much more probable than others. We form a prediction for the movement of each corner from the first image to the second. If the potential match in the second image is not close to this predicted position, we discount it as a possibility. The prediction may be based on previous estimates of camera motion, or it might come from other tracking devices such as accelerometers.

An initial set of matches is selected from the remaining possibilities by examining the similarity of the local 15×15 pixel regions surrounding the corners. For each point in the first image, we find the corner in the second image with the highest local cross-correlation. We now have an initial set of matches (Fig. 5). However, many of these are erroneous, and would render a brute-force least-squares solution ineffective. We employ a robust statistical method known as random sample consensus (RANSAC) [12] to identify the correct matches.

The RANSAC procedure involves choosing a minimal subset of matches at random (i.e., four for the homography or eight for the fundamental matrix) and calculating the associated homography or fundamental matrix. We then note how many of the other matches are in close agreement with this estimate (the inliers). We repeat this process a number of times while keeping track of the mapping that had the greatest support. We can define an upper limit on the number of RANSAC iterations; alternatively, we can accept a result as soon as it reaches a predefined level of support, in order to avoid unnecessary computation, e.g., terminating the RANSAC loop when 75% of the other matches are in close agreement with the estimated homography or fundamental matrix. Finally, we re-estimate the matrix from all of these inliers. In the case of homography estimation, the pseudo-code can be written as follows:

    maxInliers = 0
    for n = 1 to nIterations
        choose a random subset of four matches
        calculate the homography
        count how many of the other matches agree (inliers)
        if inliers > maxInliers
            maxInliers = inliers
            bestHomography = thisHomography
    recalculate bestHomography from all of its inliers
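A direct Python translation of this pseudo-code, reusing the hypothetical homography_dlt helper from the previous sketch, might look as follows; the iteration limit, inlier threshold and early-acceptance rule mirror the figures quoted in the paper but are otherwise illustrative.

```python
import numpy as np

def ransac_homography(src, dst, n_iterations=70, threshold=3.0,
                      early_support=0.75, min_iterations=15):
    """Robustly estimate a homography from putative matches.

    src, dst: (N, 2) arrays of corresponding corner positions, many of
    which may be wrong. Returns (H, inlier_mask)."""
    n = len(src)
    src_h = np.hstack([src, np.ones((n, 1))])            # homogeneous source points
    best_inliers = np.zeros(n, dtype=bool)
    for it in range(n_iterations):
        sample = np.random.choice(n, 4, replace=False)   # minimal subset
        H = homography_dlt(src[sample], dst[sample])
        proj = src_h @ H.T
        proj = proj[:, :2] / proj[:, 2:3]                # predicted positions in image 2
        inliers = np.linalg.norm(proj - dst, axis=1) < threshold
        if inliers.sum() > best_inliers.sum():           # keep the mapping with most support
            best_inliers = inliers
        if it + 1 >= min_iterations and best_inliers.sum() >= early_support * n:
            break                                        # accept early, as described above
    H = homography_dlt(src[best_inliers], dst[best_inliers])  # re-estimate from all inliers
    return H, best_inliers
```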


Fig. 5 A robust calculation of a homography between two images. Corner points are identified in the two images (yellow dots). We choose initial matches based on the similarity of the areas around these corners and on prior knowledge about the likely match direction (pink lines indicate the vector from each corner to its matched corner in the other image). This initial set contains many incorrect matches. We pick four matches (blue lines) and calculate the associated homography. We then count the number of other matches that are in agreement (inliers are shown as pink lines) and repeat this procedure. We choose the estimate with the most support and recalculate the homography using all of the inliers

The right-hand side of Fig. 5 shows the results of this procedure. The final point matches are in close agreement, indicating that image movement is predominantly horizontal. These methods are well established in the computer vision literature [13], and we will talk more about the details of RANSAC implementation later in the paper.

3 The geographic labelling/overlaying application

After either of these procedures, we are left with a correct set of image matches and a rule that helps predict where in the second image any unmatched feature from the first image will appear. For the case of the homography, we can predict the exact position. For the fundamental matrix, we can predict a line along which the matching feature must lie.

Given the positions of the feature points in the first image, how then can we establish their positions in the second image? For the homography, this is self-evident, since the homography itself predicts the exact position. For the more general case, the simplest method would be to interpolate based on the closest known image velocities, subject to the epipolar constraint. This is equivalent to the assumption that the depth map is locally similar. For very accurate positioning, the final position of the feature can be established by maximising cross-correlation along the epipolar line.

We are able to use the homography and fundamental matrix to estimate the flow field for two-dimensional feature points in the image, which means that real-time two-dimensional geographical labelling is achievable. However, these mathematical entities can also be used to form estimates of the full three-dimensional camera motion.

Indeed, Simon et al. [14] have used the calculation of the homography to compute the camera position relative to a planar surface and integrate fully three-dimensional objects. An example of this is shown in Fig. 6. For more general scenes, the fundamental matrix encompasses information about camera movement, but there are a number of problematic aspects to calculating three-dimensional camera motion in real time, including the ambiguity of scale, and the fact that the calculation is unstable when the translational component is small or the scene structure is not sufficiently general. In these cases a homography may suffice, and a number of methods have been proposed for selecting between these models.

In this project, we developed a wide-scale geographical labelling/overlaying system based on the homography. We store a number of key frames (see Fig. 7) of the geographical features in question. For each key frame we store the positions of the corners in the image and the local image intensities around them. At a given position in space, we may sample more viewing directions than is possible with a single image by applying mosaicing techniques. The system attempts to find the velocity map between the current frame and the stored frame. The positions of the geographical annotations are then mapped from the stored frame to the current frame and drawn into the field of view. This method has a major advantage over previous systems [2] in that it always matches back to the reference frame rather than repeatedly updating the position of the labels based on the previous frame. Hence, errors do not accrue over time.
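Concretely, the per-frame labelling step can be sketched with OpenCV's robust homography estimator as below; the function and variable names are hypothetical, and the 35% inlier threshold echoes the failure criterion given in Sect. 5, but the essential point is that every live frame is matched back to the stored key frame rather than to the previous frame.

```python
import cv2
import numpy as np

def label_current_frame(key_corners, frame_corners, key_labels):
    """Map annotations from a stored key frame into the current frame.

    key_corners, frame_corners: (N, 2) float32 arrays of putative matches
    between the key frame and the live frame (e.g. from the correlation-
    based matching of Sect. 2.5).
    key_labels: (M, 2) float32 array of label anchor positions defined
    once in the key frame.
    Returns the label positions in the current frame, or None on failure.
    """
    H, inlier_mask = cv2.findHomography(key_corners, frame_corners,
                                        cv2.RANSAC, 3.0)
    if H is None or inlier_mask.sum() < 0.35 * len(key_corners):
        return None   # too few inliers: treat this frame as a failure
    pts = key_labels.reshape(-1, 1, 2)
    return cv2.perspectiveTransform(pts, H).reshape(-1, 2)
```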

Fig. 6 Three-dimensional graphical content superimposed onto a real notice board


Fig. 7 Information for geographical annotation can be stored in the form of corner points, their surrounding regions and their directions in space. These points may be stored, for a given position, across a wide range of viewing angles (displayed at the top as a mosaic). The input frame (bottom) can be compared to this stored representation to establish the position of the label

Moreover, because the matching procedure is robust, it will not fail if part of the image has changed in the time between initial capture and the present.

The user, with a wearable computer (Pentium III, 1 GHz) in his jacket and an HMD on his head (Fig. 8), stands in front of different buildings on the National University of Singapore campus. From there, he would be able to see a virtual text label or a three-dimensional virtual object (a virtual three-dimensional plan of the building, for example) attached to the building in front of him. When he moves his head around, the label should always stick to the same part of the building. Because the translation of the user's head is negligible compared with the distance to the building, the head motion can be regarded as a pure rotation, and we can use the estimated homography to integrate fully three-dimensional objects into the scene.

A reliable calculation of the fundamental matrix is currently rather slow for real-time wearable computing applications, operating at under 8 Hz on our wearable platform, which is based on a 1 GHz 786 single-board computer from Inside Technologies and a Sony Glasstron display. However, another wearable system we have developed, based on a Dell 2.4 GHz Pentium IV laptop, can calculate homography matrices at 25 Hz between 320×240 pixel grayscale images. Fig. 9 shows the two-dimensional text labelling results of our system, and Fig. 10 shows the three-dimensional virtual building overlaid on the real building. By pressing a predefined key on the Twiddler [15], the user can adjust the transparency level of the virtual building in front of him.

4 A comparison to previous work

Before presenting detailed results for these two algorithms, we briefly consider previous work in the field. Neumann et al. [2] developed the "closed-loop motion tracking architecture", which is based on an optical flow calculation. Corners are sought in the image and tracked. For regions where there are many corners, the velocity of the corners is used to establish a local affine warping model. This is then fed back to improve the corner velocity estimates. The algorithm proceeds from a coarse to a fine resolution so that large image velocities may be measured. This basic optical flow approach has been variously combined with artificial markers in the scene [3] and with compasses, tilt sensors and inertial gyroscopes to form hybrid tracking systems [10, 16]. Other systems [8] have relied primarily on gyroscopic measurements and merely used a template matching process to fine-tune the image correspondence.

5 System implementation details and performance

Fig. 8 The wearable computer used in the geographical labelling/overlaying system


We restrict the remaining performance assessments to the real-time system based on calculating homographies. The fundamental-matrix-based matching produces similar results over a wider range of image conditions, but with a considerably increased computational cost. Typically 50–60 corners per image are identified, of which more than 80% are inliers to the final solution, depending on the amount of overlap between the two images.

Fig. 9 The geographical labelling of different buildings at the National University of Singapore Kent Ridge campus

If fewer than 35% of the corners are inliers, we consider the calculation to have failed. We apply a number of heuristics to increase the system's speed and accuracy: we repeat the RANSAC sampling up to 70 times, but immediately accept the solution if it has greater than 75% support and at least 15 iterations have already been computed. Using this criterion, the majority of frames terminate early. In order to increase the average quality of the solutions, we ensure that the initial four points in the first image are spatially separated by at least 60 pixels. We also reject homographies that are nearly singular by testing the determinant; singular matrices map the entire first image to a line or point in the second image.

For a static camera in an indoor environment, the homography was successfully calculated for 500/500 test frames in a 20 second sequence. For each frame, the centre point of the image was transformed by the incoming homography. Since the camera is static, we expect it to remain in the same place. The mean deviation was
