Submitted to the First International Workshop on Augmented Reality, November 1, 1998, San Francisco, California.

Fusion of Data from Head-Mounted and Fixed Sensors

William A. Hoff
Center for Robotics and Intelligent Systems
Engineering Division, Colorado School of Mines
Golden, CO 80401
Telephone: 303-273-3761  Fax: 303-273-3602  Email: [email protected]

This work was supported by a grant from Johnson & Johnson Professional, Inc.

Abstract

A methodology is developed to explicitly fuse sensor data from a combination of fixed and head-mounted sensors, in order to improve the registration of objects in an augmented reality system. The methodology was applied to the analysis of an actual experimental augmented reality system, incorporating an optical see-through head-mounted display, a head-mounted CCD camera, and a fixed optical tracking sensor. The purpose of the sensing system was to determine the position and orientation (pose) of a movable object with respect to the head-mounted display. A typical configuration was analyzed and it was shown that the hybrid system produces a pose estimate that is significantly more accurate than that produced by either sensor acting alone. Using only the fixed sensor, the maximum translational error in the location of an object with respect to the head-mounted display in any direction was 27.4 mm (corresponding to a 97% confidence interval). Using only the head-mounted sensor, the maximum translational error in any direction was 20.4 mm. By combining data from the two sensors, the maximum translational error was reduced to 3.3 mm. In order to fuse the pose estimates, the uncertainties are explicitly calculated, in the form of covariance matrices. A capability was also developed to visualize the uncertainties as 3D ellipsoids.

1 Introduction

Where is an object of interest with respect to the user’s head? In augmented reality systems that use head-mounted displays (HMD’s), knowing the relative position and orientation (pose) between object and head is crucial in order to display a virtual object that is aligned with the real object. For example, a virtual wire-frame model of an object could be rendered as a graphical overlay on top of the real world object. If the estimated pose of the object is inaccurate, the real and virtual objects may not be registered correctly. For example, the virtual wire-frame model could appear to float some distance away from the real object. This is clearly unacceptable in applications where the user is trying to understand the relationship between real and virtual objects. Registration inaccuracy is one of the most important problems limiting augmented reality applications today [1].

To determine the pose of an object with respect to the user’s head, tracking sensors are necessary. Sensor technologies that have been used in the past include mechanical, magnetic, acoustic, and optical [2]. Of these, optical sensors promise to have the best overall combination of speed, accuracy, and range [3] [4] [5]. In this paper, I will concentrate on optical sensors, although the methods can be applied to other sensors.

Optical sensors use cameras or photo-effect sensors to track optical targets, such as light emitting diodes (LED’s) or passive fiducial markings. Using two or more sensors (stereo vision), the three-dimensional (3D) position of a target point can be determined directly via triangulation. The accuracy of locating the point is improved by increasing the separation (baseline) between the sensors. The full six degree-of-freedom (DOF) pose of a rigid body can be determined by measuring three or more target points on the body, assuming the geometry of the points on the body is known. This procedure is known as the “absolute orientation” problem in the photogrammetry literature.


Alternatively, a single sensor can be used to measure the 2D (image) locations of three or more target points on a rigid body. If the geometry of the points is known, the full 6 DOF pose of the rigid body can be estimated, by a procedure known as “exterior orientation” [6].

One issue is where to locate the sensor and target. One possibility is to mount the sensor at a fixed known location in the environment, and put targets on both the HMD and on the object of interest (a configuration called “outside in” [4]). We measure the pose of the HMD with respect to the sensor, and the pose of the object with respect to the sensor, and derive the relative pose of the object with respect to the HMD. Another possibility is to mount the sensor on the HMD, and the target on the object of interest (a configuration called “inside out”). We measure the pose of the object with respect to the sensor, and use the known sensor-to-HMD pose to derive the relative pose of the object with respect to the HMD. Both approaches have been tried in the past, and each has advantages and disadvantages.

With a fixed sensor (outside-in approach), there is no limitation on the size and weight of the sensor. Multiple cameras can be used, with a large baseline, to achieve highly accurate 3D measurements via triangulation. For example, commercial optical measurement systems such as Northern Digital’s Optotrak have baselines of approximately 1 meter and are able to measure the 3-D positions of LED markers to an accuracy of less than 0.5 mm. The orientation and position of a target pattern is then derived from the individual point positions. A disadvantage with this approach is that head orientation must be inferred indirectly from the point positions. Any error in these positions gives rise to an orientation error. For example, consider two target points on the HMD separated by 100 mm. A 0.5 mm error in one of the points causes an error of 0.5/100 = 0.005 radians in the derived orientation of the HMD.

The inside-out approach has good registration accuracy, because a slight rotation of a head-mounted camera causes a large shift of a fixed target in the image. For example, consider a head-mounted sensor with a 30° field of view. It is quite easy with an ordinary TV camera to resolve the 2D position of a target point to 1 part in 500 or better, which would correspond to an orientation error of the HMD of 0.001 radians. However, a disadvantage of this approach is that it is impossible to put multiple cameras with a large baseline separation on the head. Either a small baseline separation must be used, or alternatively a single camera can be used with the exterior orientation algorithm. Either method gives rise to large translation errors along the line of sight of the cameras.

A question arises – is it possible to fuse the data from a head-mounted sensor and a fixed sensor, and derive a more accurate estimate of object-to-HMD pose? If the data from these two types of sensors are complementary, then the resulting pose can be much more accurate than that from each sensor used alone. In order to do this, a mathematical analysis is required of all errors and uncertainties associated with the measurements and derived poses. Effectively, we can create a hybrid system that combines the “inside-out” and “outside-in” approaches. This paper describes a methodology to explicitly compute uncertainties of pose estimates, propagate these uncertainties from one coordinate system to another, and fuse pose estimates from multiple sensors.
The contribution of this work is the application of this methodology to the registration problem in augmented reality. It is shown that a hybrid sensing system, combining both head-mounted and fixed sensors, can improve registration accuracy over that from either sensor used alone. Only quasi-static registration is considered in this paper; that is, objects are stationary when viewed, but can freely be moved. The remainder of this paper is organized as follows. Section 2 provides a background on pose estimation, with a description of the terminology used in the paper. Section 3 develops the methodology for estimating the uncertainty of a pose, transforming it from one coordinate frame to another, and fusing two pose estimates. Section 4 illustrates the application of the methodology to an actual augmented reality system that was developed for a surgical aid application. A typical configuration is analyzed, and the predicted accuracy of the combined (hybrid) pose estimate is found to be much improved over that obtained by either sensor alone. Finally, Section 5 provides a discussion.


2 Background on Pose Estimation

2.1 Representation of Pose

The notation in this section follows that of Craig [7]. The pose of a rigid body {A} with respect to another coordinate system {B} can be represented by a six-element vector ${}^B_A X = \left({}^B x_{Aorg},\, {}^B y_{Aorg},\, {}^B z_{Aorg},\, \alpha,\, \beta,\, \gamma\right)^T$, where ${}^B P_{Aorg} = \left({}^B x_{Aorg},\, {}^B y_{Aorg},\, {}^B z_{Aorg}\right)^T$ is the origin of frame {A} in frame {B}, and (α, β, γ) are the angles of rotation of {A} about the (z, y, x) axes of {B}. An alternative representation of orientation is to use three elements of a quaternion; the conversion between xyz angles and quaternions is straightforward.
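To make that conversion concrete, the short sketch below (NumPy) builds the unit quaternion equivalent of the rotation of {A} about the (z, y, x) axes of {B}; it is not code from the paper, and the function names are illustrative. The three-element representation mentioned above would keep the vector part (x, y, z) after fixing the sign of the scalar part.

    import numpy as np

    def axis_quat(axis, angle):
        # Unit quaternion (w, x, y, z) for a rotation of `angle` about a principal axis (0=x, 1=y, 2=z).
        q = np.zeros(4)
        q[0] = np.cos(angle / 2.0)
        q[1 + axis] = np.sin(angle / 2.0)
        return q

    def quat_mult(a, b):
        # Hamilton product of two quaternions given as (w, x, y, z).
        w1, x1, y1, z1 = a
        w2, x2, y2, z2 = b
        return np.array([
            w1*w2 - x1*x2 - y1*y2 - z1*z2,
            w1*x2 + x1*w2 + y1*z2 - z1*y2,
            w1*y2 - x1*z2 + y1*w2 + z1*x2,
            w1*z2 + x1*y2 - y1*x2 + z1*w2,
        ])

    def xyz_angles_to_quat(alpha, beta, gamma):
        # Quaternion of the rotation about the (z, y, x) axes: q = qz(alpha) * qy(beta) * qx(gamma).
        q = quat_mult(axis_quat(2, alpha), quat_mult(axis_quat(1, beta), axis_quat(0, gamma)))
        return q if q[0] >= 0 else -q   # fix the sign so the three elements (x, y, z) determine q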

Equivalently, pose can be represented by a 4x4 homogeneous transformation matrix:

$${}^B_A H = \begin{pmatrix} {}^B_A R & {}^B P_{Aorg} \\ 0 & 1 \end{pmatrix} \qquad (1)$$

Here, ${}^B_A R$ is the 3x3 rotation matrix corresponding to the angles (α, β, γ):

$${}^B_A R = \begin{pmatrix} c\alpha & -s\alpha & 0 \\ s\alpha & c\alpha & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} c\beta & 0 & s\beta \\ 0 & 1 & 0 \\ -s\beta & 0 & c\beta \end{pmatrix} \begin{pmatrix} 1 & 0 & 0 \\ 0 & c\gamma & -s\gamma \\ 0 & s\gamma & c\gamma \end{pmatrix} \qquad (2)$$

where cα is shorthand for cos α and sα for sin α, etc. Given a homogeneous transformation matrix, the corresponding six-element pose vector can easily be recovered (the equations are in [7]). In this paper, we shall use the letter X to designate a six-element pose vector and the letter H to designate the equivalent 4x4 homogeneous transformation matrix.

Homogeneous transformations are a convenient and elegant representation. Given a homogeneous point ${}^A P = \left({}^A x_P,\, {}^A y_P,\, {}^A z_P,\, 1\right)^T$, represented in coordinate system {A}, it may be transformed to coordinate system {B} with a simple matrix multiplication ${}^B P = {}^B_A H\;{}^A P$. The homogeneous matrix representing the pose of frame {B} with respect to frame {A} is just the inverse of the pose of {A} with respect to {B}; i.e., ${}^A_B H = {}^B_A H^{-1}$. Finally, if we know the pose of {A} with respect to {B}, and the pose of {B} with respect to {C}, then the pose of {A} with respect to {C} is easily given by the matrix multiplication ${}^C_A H = {}^C_B H\;{}^B_A H$.
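The following sketch (NumPy) is a minimal illustration of the pose-to-matrix mapping of Equations 1 and 2 together with the inversion and chaining rules just described. It is not code from the paper; the function names are my own, and `H_to_pose` assumes the standard ZYX angle extraction referred to in [7].

    import numpy as np

    def pose_to_H(x, y, z, alpha, beta, gamma):
        # Eq. (1)-(2): homogeneous matrix with R = Rz(alpha) Ry(beta) Rx(gamma) and origin (x, y, z).
        ca, sa = np.cos(alpha), np.sin(alpha)
        cb, sb = np.cos(beta), np.sin(beta)
        cg, sg = np.cos(gamma), np.sin(gamma)
        Rz = np.array([[ca, -sa, 0], [sa, ca, 0], [0, 0, 1]])
        Ry = np.array([[cb, 0, sb], [0, 1, 0], [-sb, 0, cb]])
        Rx = np.array([[1, 0, 0], [0, cg, -sg], [0, sg, cg]])
        H = np.eye(4)
        H[:3, :3] = Rz @ Ry @ Rx
        H[:3, 3] = [x, y, z]
        return H

    def H_to_pose(H):
        # Recover the six-element pose (x, y, z, alpha, beta, gamma) from a homogeneous matrix.
        R, t = H[:3, :3], H[:3, 3]
        beta = np.arctan2(-R[2, 0], np.hypot(R[0, 0], R[1, 0]))
        alpha = np.arctan2(R[1, 0], R[0, 0])
        gamma = np.arctan2(R[2, 1], R[2, 2])
        return np.array([t[0], t[1], t[2], alpha, beta, gamma])

    def invert_H(H):
        # Rigid-transform inverse: [R t; 0 1]^-1 = [R^T, -R^T t; 0 1].
        R, t = H[:3, :3], H[:3, 3]
        Hinv = np.eye(4)
        Hinv[:3, :3] = R.T
        Hinv[:3, 3] = -R.T @ t
        return Hinv

    # Chaining: if H_CB is the pose of {B} in {C} and H_BA is the pose of {A} in {B},
    # then the pose of {A} in {C} is H_CB @ H_BA, and a point transforms as P_C = H_CA @ P_A.
    H_BA = pose_to_H(0.1, 0.0, 0.5, np.radians(30), 0.0, 0.0)
    H_CB = pose_to_H(0.0, 0.2, 0.0, 0.0, np.radians(-10), 0.0)
    H_CA = H_CB @ H_BA
    P_A = np.array([1.0, 2.0, 3.0, 1.0])   # homogeneous point in {A}
    P_C = H_CA @ P_A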

2.2 2D-to-3D Pose Estimation

In this section, we describe the problem of determining the pose of a rigid body, given an image from a single camera (the “exterior orientation” problem). Specifically, we are given a set of known 3D points on the object (in the coordinate frame of the object), and the corresponding set of 2D measured image points from the camera, which are the perspective projections of the 3D points. The internal parameters of the camera (focal length, principal point, etc.) are known. The goal is to find the pose of the object with respect to the camera, ${}^{cam}_{obj}X$. There are many solutions to the problem; this section describes the solution in Haralick [6], which uses an iterative non-linear least squares method. The algorithm effectively minimizes the squared error between the measured 2D point locations and the predicted 2D point locations.

Using a simple perspective projection model, a 3D point ${}^{cam}P_i = \left({}^{cam}x_i,\, {}^{cam}y_i,\, {}^{cam}z_i,\, 1\right)^T$ in the camera’s coordinate frame is projected onto the image plane to the 2D point $q_i = (u_i, v_i)$:

$$q_i = \begin{pmatrix} u_i \\ v_i \end{pmatrix} = \frac{f}{{}^{cam}z_i} \begin{pmatrix} {}^{cam}x_i \\ {}^{cam}y_i \end{pmatrix} \qquad (3)$$

Here, f is the camera constant (which is the focal length of the lens if it is focused at infinity). The corresponding point on the object is ${}^{obj}P_i$, where

$${}^{cam}P_i = {}^{cam}_{obj}H\;{}^{obj}P_i \qquad (4)$$

Combining the two equations above, the 2D projection of the ith point is given by a function $g_i$, which is a function of the unknown pose parameters X:

$$q_i = g_i(X) \qquad (5)$$

Stacking all measurement equations into a single matrix equation, we have

$$Q = \begin{pmatrix} q_1 \\ \vdots \\ q_N \end{pmatrix} = \begin{pmatrix} g_1(X) \\ \vdots \\ g_N(X) \end{pmatrix} = G(X) \qquad (6)$$

Using a Taylor series expansion, we linearize about the current estimate $X_0$, and get $Q_0 + \Delta Q = G(X_0) + J\,\Delta X$, where J is the Jacobian of G. Solving, we get $\Delta X = \left(J^T J\right)^{-1} J^T \Delta Q$. Thus, the algorithm proceeds as follows. We start with an initial estimate of the pose. At each iteration, ∆Q is the error between the predicted image point locations and the actual point locations, and ∆X is the adjustment to be added to the unknowns. The algorithm typically converges within five or ten iterations.
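A minimal sketch of the iterative scheme just described is given below (NumPy). It reuses `pose_to_H` from the earlier sketch and uses a forward-difference Jacobian for simplicity; the paper does not specify how J is computed, so that choice, like the function names, is an assumption.

    import numpy as np

    def project(X, obj_pts, f):
        # G(X): predicted image points (u_i, v_i) for pose X = (x, y, z, alpha, beta, gamma).
        H = pose_to_H(*X)                                        # from the earlier sketch
        cam = (H @ np.c_[obj_pts, np.ones(len(obj_pts))].T).T    # object points in the camera frame
        return (f * cam[:, :2] / cam[:, 2:3]).ravel()            # stacked into a 2N vector

    def estimate_pose_2d3d(obj_pts, img_pts, f, X0, n_iter=10, eps=1e-6):
        # Iterate X <- X + (J^T J)^-1 J^T dQ, where dQ is the measured-minus-predicted image error.
        X = np.asarray(X0, dtype=float).copy()
        Q_meas = np.asarray(img_pts, dtype=float).ravel()
        for _ in range(n_iter):
            pred = project(X, obj_pts, f)
            dQ = Q_meas - pred
            J = np.zeros((dQ.size, 6))
            for k in range(6):                                   # forward-difference Jacobian of G
                Xp = X.copy(); Xp[k] += eps
                J[:, k] = (project(Xp, obj_pts, f) - pred) / eps
            X += np.linalg.solve(J.T @ J, J.T @ dQ)
        return X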

2.3 3D-to-3D Pose Estimation

In this section, we describe the problem of determining the pose of a rigid body, given a set of 3D point measurements (the “absolute orientation” problem). These 3D point measurements may have been obtained from a previous triangulation process, using a sensor consisting of multiple cameras. Specifically, we are given a set of known 3D points on the object $\{{}^{obj}P_i\}$, and the corresponding set of 3D measured points from the sensor $\{{}^{sen}P_i\}$. The goal is to find the pose of the object with respect to the sensor, ${}^{sen}_{obj}X$. There are many solutions to the problem; this section describes the solution in Jain [8], which uses a quaternion-based method.

The algorithm starts by subtracting the centroid from each set of points. This has the effect of centering each set of points at the origin. The points are expressed as rays about the origin. The problem then reduces to finding the rotation R that will align the two bundles of rays, $\{{}^{obj}r_i\}$ and $\{{}^{sen}r_i\}$. After the rotation has been determined, the translation can be found by $T = {}^{obj}P_{centroid} - R\;{}^{sen}P_{centroid}$. We can achieve the best alignment in a least squares sense by finding the rotation R that maximizes the scalar product of each ray pair:

$$\chi^2 = \sum_i {}^{obj}r_i \cdot R\;{}^{sen}r_i \qquad (7)$$

In quaternion notation, this sum is expressed as

$$\sum_{i=1}^{N} {}^{obj}r_i \cdot \left(q\;{}^{sen}r_i\;q^*\right) = \sum_{i=1}^{N} \left(q\;{}^{sen}r_i\right) \cdot \left({}^{obj}r_i\;q\right) \qquad (8)$$

This sum can be successively changed into the notation of a quadratic form to get $\chi^2 = q^T N q$, where N is a 4x4 matrix which is a function of the ray coordinates (details are in [8]). The unit quaternion that maximizes this quadratic form is the eigenvector corresponding to the most positive eigenvalue. Once the rotation has been determined, the translation component of the pose is computed by the equation above.

Although this algorithm works by maximizing the orientation alignment of the two sets of rays, it appears to have the same result as an algorithm that directly minimizes the squared error between the measured 3D point positions and the predicted 3D point positions [6]. Both algorithms were implemented; the Jain algorithm appears to be slightly faster.
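The sketch below (NumPy) illustrates the quaternion-based solution. The explicit 4x4 matrix N is the standard one built from the cross-covariance of the two ray bundles; since the paper defers those details to [8], treat the construction, like the function name, as an assumption rather than a transcription of the author's code.

    import numpy as np

    def absolute_orientation(obj_pts, sen_pts):
        # Returns R, T such that obj_i is approximately R @ sen_i + T, in the least squares sense.
        obj = np.asarray(obj_pts, float)
        sen = np.asarray(sen_pts, float)
        obj_c = obj - obj.mean(axis=0)                  # rays about the object centroid
        sen_c = sen - sen.mean(axis=0)                  # rays about the sensor centroid
        S = sen_c.T @ obj_c                             # 3x3 cross-covariance of the ray bundles
        Sxx, Sxy, Sxz = S[0]; Syx, Syy, Syz = S[1]; Szx, Szy, Szz = S[2]
        N = np.array([
            [Sxx + Syy + Szz, Syz - Szy,        Szx - Sxz,        Sxy - Syx],
            [Syz - Szy,       Sxx - Syy - Szz,  Sxy + Syx,        Szx + Sxz],
            [Szx - Sxz,       Sxy + Syx,       -Sxx + Syy - Szz,  Syz + Szy],
            [Sxy - Syx,       Szx + Sxz,        Syz + Szy,       -Sxx - Syy + Szz]])
        w, V = np.linalg.eigh(N)
        q = V[:, np.argmax(w)]                          # unit quaternion (w, x, y, z) maximizing q^T N q
        qw, qx, qy, qz = q
        R = np.array([                                  # rotation matrix of the unit quaternion
            [1 - 2*(qy*qy + qz*qz), 2*(qx*qy - qz*qw),     2*(qx*qz + qy*qw)],
            [2*(qx*qy + qz*qw),     1 - 2*(qx*qx + qz*qz), 2*(qy*qz - qx*qw)],
            [2*(qx*qz - qy*qw),     2*(qy*qz + qx*qw),     1 - 2*(qx*qx + qy*qy)]])
        T = obj.mean(axis=0) - R @ sen.mean(axis=0)     # T = obj centroid - R * sen centroid
        return R, T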

3 Determination and Manipulation of Pose Uncertainty

Given that we have estimated the pose of an object, using one of the methods above, what is the uncertainty of the pose estimate? Knowing the uncertainty is critical to fusing measurements from multiple sensors. We can represent the uncertainty of a six-element pose vector X by a 6x6 covariance matrix $C_X$, which is the expectation of the square of the difference between the estimate and the true vector:

$$C_X = E\left(\Delta X\,\Delta X^T\right) \qquad (9)$$

This section describes methods to estimate the covariance matrix of a pose, given the estimated uncertainties in the measurements, transform the covariance matrix from one coordinate frame to another, and combine two pose estimates.

3.1 Computation of Covariance

Assume that we have N measured data points from the sensor {P_1, P_2, ..., P_N}, and the corresponding points on the object {Q_1, Q_2, ..., Q_N}. The object points Q_i are 3D; the data points P_i are either 3D (in the case of 3D-to-3D pose estimation) or 2D (in the case of 2D-to-3D pose estimation). The noise in each measured data point is independent, and the noise distribution of each point is given by a covariance matrix $C_P$. Let $P_i = H(Q_i, X)$ be the function which transforms object points into data points. In the case of 3D-to-3D pose estimation, this is just a multiplication of $Q_i$ by the corresponding homogeneous transformation matrix. In the case of 2D-to-3D pose estimation, the function is composed of a transformation followed by a perspective projection. The noiseless model is $f(P_i, Q_i, X) = P_i - H(Q_i, X) = 0$. In the noisy case, f is non-zero and represents the error between the measured point position and the predicted point position. An algorithm that solves for $X_{est}$ minimizes the sum of the squared errors.

Assume that we have solved for $X_{est}$ using the appropriate algorithm (i.e., 2D-to-3D or 3D-to-3D). We then linearize the equation about the estimated solution $X_{est}$:

$$f(P_i + \Delta P_i, Q_i, X_{est} + \Delta X) = f(P_i, Q_i, X_{est}) + \left(\frac{\partial f}{\partial P_i}\right)_{P_i, Q_i, X_{est}} \Delta P_i + \left(\frac{\partial f}{\partial X}\right)_{P_i, Q_i, X_{est}} \Delta X \qquad (10)$$

Since $f(P_i, Q_i, X) = 0$ and $f(P_i + \Delta P_i, Q_i, X + \Delta X) \approx 0$, the equation reduces to

$$-\left(\frac{\partial f}{\partial P_i}\right)_{P_i, Q_i, X_{est}} \Delta P_i = \left(\frac{\partial f}{\partial X}\right)_{P_i, Q_i, X_{est}} \Delta X \qquad (11)$$

which simplifies (since $\partial f / \partial P_i$ is the identity) to

$$-\Delta P_i = \left(\frac{\partial f}{\partial X}\right)_{P_i, Q_i, X_{est}} \Delta X = M_i\,\Delta X \qquad (12)$$

where $M_i$ is the Jacobian of f, evaluated at $(P_i, Q_i, X_{est})$. Combining all the measurement equations, we get the matrix equation:

$$-\begin{pmatrix} \Delta P_1 \\ \vdots \\ \Delta P_N \end{pmatrix} = \begin{pmatrix} M_1 \\ \vdots \\ M_N \end{pmatrix} \Delta X \quad\Longrightarrow\quad -\Delta P = M\,\Delta X \qquad (13)$$

Solving for ∆X, we get $\Delta X = -\left(M^T M\right)^{-1} M^T \Delta P$. The covariance matrix of X is given by the expectation of the outer product:

$$
\begin{aligned}
C_X = E\left(\Delta X\,\Delta X^T\right)
&= E\left[\left(M^T M\right)^{-1} M^T\,\Delta P\,\Delta P^T\,\left(\left(M^T M\right)^{-1} M^T\right)^T\right] \\
&= \left(M^T M\right)^{-1} M^T\, E\left(\Delta P\,\Delta P^T\right)\,\left(\left(M^T M\right)^{-1} M^T\right)^T \\
&= \left(M^T M\right)^{-1} M^T \begin{pmatrix} C_P & & 0 \\ & \ddots & \\ 0 & & C_P \end{pmatrix} \left(\left(M^T M\right)^{-1} M^T\right)^T
\end{aligned}
\qquad (14)
$$

Here, we have assumed that the errors in the data points are independent; i.e., $E\left(\Delta P_i\,\Delta P_j^T\right) = 0$ for i ≠ j. The above analysis was verified with Monte Carlo simulations, using both the 3D-to-3D algorithm and the 2D-to-3D algorithm.
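As an illustration of Equation 14, the sketch below (NumPy) computes the covariance of a 3D-to-3D pose estimate. It reuses `pose_to_H` from the earlier sketch and forms M by numerical differentiation of H(Q_i, X); the isotropic per-point noise model and the helper names are assumptions, not details taken from the paper.

    import numpy as np

    def predict_points(X, obj_pts):
        # H(Q_i, X): the object points mapped into the sensor frame for pose X.
        H = pose_to_H(*X)                                        # from the earlier sketch
        return (H @ np.c_[obj_pts, np.ones(len(obj_pts))].T).T[:, :3]

    def pose_covariance_3d3d(X_est, obj_pts, sigma_pt, eps=1e-6):
        # Eq. (14): C_X = (M^T M)^-1 M^T blockdiag(C_P, ..., C_P) [(M^T M)^-1 M^T]^T.
        base = predict_points(X_est, obj_pts).ravel()            # 3N-vector of predicted data points
        M = np.zeros((base.size, 6))
        for k in range(6):                                       # M_i = df/dX = -dH/dX (the sign cancels in Eq. 14)
            Xp = np.asarray(X_est, float).copy(); Xp[k] += eps
            M[:, k] = (predict_points(Xp, obj_pts).ravel() - base) / eps
        A = np.linalg.inv(M.T @ M) @ M.T                         # (M^T M)^-1 M^T
        C_P_block = (sigma_pt ** 2) * np.eye(base.size)          # independent, isotropic point noise
        return A @ C_P_block @ A.T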

3.2 Interpretation of Covariance

A useful interpretation of the covariance matrix can be obtained by assuming that the errors are jointly Gaussian. The joint probability density for the N-dimensional error vector ∆X is [9]:

$$p(\Delta X) = \left[(2\pi)^{N/2}\,\lvert C_X\rvert^{1/2}\right]^{-1} \exp\left(-\tfrac{1}{2}\,\Delta X^T C_X^{-1}\,\Delta X\right) \qquad (15)$$

If we look at surfaces of constant probability, the argument of the exponent is a constant, given by the relation $\Delta X^T C_X^{-1}\,\Delta X = z^2$. This is the equation of an ellipsoid in N dimensions. Incidentally, the value $d = z^2$ is known as the Mahalanobis distance, and is a measure of the magnitude of the error vector, normalized by covariance [10]. For a given value of z, the cumulative probability of an error vector being inside the ellipsoid is P. For N=3 dimensions, the ellipsoid defined by z=3 corresponds to a cumulative probability P of approximately 97% (the exact formula for the cumulative probability in N dimensions is $1 - P = \int_z^{\infty} \frac{N}{2^{N/2}\,\Gamma(N/2+1)}\, X^{N-1} e^{-X^2/2}\, dX$ [9]).

For a six-dimensional pose X, the covariance matrix $C_X$ is 6x6, and the corresponding ellipsoid is six dimensional (which is difficult to visualize). However, we can select only the 3D translational component of the pose, and look at the covariance matrix corresponding to it. Specifically, let $Z = (x, y, z)^T$ be the translational portion of the pose vector $X = (x, y, z, \alpha, \beta, \gamma)^T$. We obtain Z from X using the equation $Z = M X$, where M is the matrix

$$M = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 \end{pmatrix} \qquad (16)$$

The covariance matrix for Z is given by $C_Z = M\,C_X\,M^T$ (which is just the upper left 3x3 submatrix of $C_X$). We can then visualize the three-dimensional ellipsoid corresponding to $C_Z$.
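The selection and visualization just described reduce to a few lines; the sketch below (NumPy, with my own helper name) extracts $C_Z$ and returns the axis lengths ($z\sqrt{\lambda_i}$) and axis directions of the z = 3 ellipsoid used in the figures throughout this paper.

    import numpy as np

    def translational_ellipsoid(C_X, z=3.0):
        # Axis lengths and directions of the ellipsoid dZ^T C_Z^-1 dZ = z^2 for the (x, y, z) part of the pose.
        M = np.hstack([np.eye(3), np.zeros((3, 3))])   # Eq. (16): selects the translational components
        C_Z = M @ C_X @ M.T                            # upper-left 3x3 block of C_X
        eigval, eigvec = np.linalg.eigh(C_Z)
        return z * np.sqrt(eigval), eigvec             # columns of eigvec are the ellipsoid axis directions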

3.3 Transformation of Covariance

We can transform a covariance matrix from one coordinate frame to another. Assume that we have a six-element pose vector X and its associated covariance matrix $C_X$. Assume that we apply a transformation, represented by a six-element vector W, to X to create a new pose Y. Denote $Y = g(X, W)$. A Taylor series expansion yields $\Delta Y = J\,\Delta X$, where $J = \partial g / \partial X$. The covariance matrix $C_Y$ is found by:

$$C_Y = E\left(\Delta Y\,\Delta Y^T\right) = E\left[(J\,\Delta X)(J\,\Delta X)^T\right] = J\,E\left(\Delta X\,\Delta X^T\right) J^T = J\,C_X\,J^T \qquad (17)$$

A variation on this method is to assume that the transformation W also has an associated covariance matrix $C_W$. In this case, the covariance matrix $C_Y$ is:

$$C_Y = J_X\,C_X\,J_X^T + J_W\,C_W\,J_W^T \qquad (18)$$

where $J_X = \partial g / \partial X$ and $J_W = \partial g / \partial W$. The above analysis was verified with Monte Carlo simulations, using both the 3D-to-3D algorithm and the 2D-to-3D algorithm.

To illustrate these concepts, a simulation of a pose estimation process was performed. A simulated target pattern was created, attached to a coordinate frame A. The pose of coordinate frame A with respect to a sensor S, ${}^S_A H$, was estimated using a 3D-to-3D algorithm. The covariance matrix of the resulting pose, $C_A$, was computed using Equation 14. Figure 1 shows a rendering of the ellipsoid corresponding to the uncertainty of the translational component of the pose. The ellipsoid is shown centered at the origin of frame A. Next, another object associated with a coordinate frame B was created and rigidly attached to A. The pose of B with respect to A was known exactly. The pose of frame B with respect to S was derived via ${}^S_B H = {}^S_A H\;{}^A_B H$. The covariance matrix of this pose, $C_B$, was then estimated using Equation 17. The uncertainty of the translational component of the pose is shown by the larger ellipsoid in Figure 1.

Note the ellipsoid associated with $C_B$ is much larger than that associated with $C_A$, even though the relative pose of B with respect to A is known exactly. The reason is that the pose of A with respect to S has an uncertainty in orientation as well as translation. When deriving the pose of B with respect to S, this orientation uncertainty gives rise to an uncertainty in the location of B. The uncertainty is greatest in the plane perpendicular to the line between A and B (hence the flattened shape of the ellipsoid associated with $C_B$). The component of translational uncertainty in B that is caused by the orientation error in A can be estimated by $\Delta P = d\,\Delta\theta$, where $\Delta\theta$ is the orientation error and d is the distance between A and B. Thus, the uncertainty in the derived location of B grows with the orientation uncertainty in A, and also with the distance between A and B. If one needs to track an object using a sensor, and attaches a target pattern to the object in order to do this, then the target pattern should be placed as close as possible to the object to minimize the derived pose error.
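A sketch of Equations 17 and 18 is given below (NumPy), reusing `pose_to_H` and `H_to_pose` from the earlier sketches. It assumes the transformation W is appended on the right of X (the case used in Section 4, where a known target-to-object offset is applied to a measured pose) and uses forward-difference Jacobians of g; both choices are illustrative, not taken from the paper.

    import numpy as np

    def compose_pose(X, W):
        # g(X, W): six-element pose of the composed transform H(X) @ H(W).
        return H_to_pose(pose_to_H(*X) @ pose_to_H(*W))          # helpers from the earlier sketches

    def transformed_covariance(X, C_X, W, C_W=None, eps=1e-6):
        # Eq. (17)-(18): C_Y = J_X C_X J_X^T (+ J_W C_W J_W^T), with numerical Jacobians of g.
        def jac(func, v):
            v = np.asarray(v, float)
            base = func(v)
            J = np.zeros((base.size, v.size))
            for k in range(v.size):
                vp = v.copy(); vp[k] += eps
                J[:, k] = (func(vp) - base) / eps
            return J
        J_X = jac(lambda x: compose_pose(x, W), np.asarray(X, float))
        C_Y = J_X @ C_X @ J_X.T
        if C_W is not None:                                      # Eq. (18): the offset W is itself uncertain
            J_W = jac(lambda w: compose_pose(X, w), np.asarray(W, float))
            C_Y = C_Y + J_W @ C_W @ J_W.T
        return C_Y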

3.4 Combining Pose Estimates

Two vector quantities may be fused by averaging them, weighted by their covariance matrices. Let $X_1$, $X_2$ be two N-dimensional vectors, and $C_1$, $C_2$ be their NxN covariance matrices. Assuming $X_1$ and $X_2$ are uncorrelated, the combined estimate X and the combined covariance matrix C may be found by the following equations:

$$X = C_2 (C_1 + C_2)^{-1} X_1 + C_1 (C_1 + C_2)^{-1} X_2, \qquad C = C_2 (C_1 + C_2)^{-1} C_1 \qquad (19)$$

This is the method of sensor fusion used in the hybrid augmented reality system. If the pose of the object with respect to the HMD can be estimated using data from the head-mounted sensor, and the same pose can be estimated using data from the fixed sensor, then a combined estimate can be produced using Equation 19.
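Equation 19 translates directly into code; the short sketch below (NumPy, with a hypothetical function name) fuses two uncorrelated pose estimates. Averaging the angular components element-wise implicitly assumes the two orientation estimates are close together, which is the situation here since both estimates refer to the same physical pose.

    import numpy as np

    def fuse_poses(X1, C1, X2, C2):
        # Eq. (19): covariance-weighted average of two uncorrelated estimates of the same pose.
        S_inv = np.linalg.inv(C1 + C2)
        X = C2 @ S_inv @ X1 + C1 @ S_inv @ X2      # combined estimate
        C = C2 @ S_inv @ C1                        # combined covariance (no larger than either input)
        return X, C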

4 Experiments

The methodology described in the previous sections was applied to an actual experimental augmented reality system developed in our lab. The main function of the system is to display a graphical overlay on an HMD, such that the overlay is registered to a movable object in the scene. Only quasi-static registration is considered in this paper; that is, objects are stationary when viewed, but can freely be moved. The system incorporates both head-mounted and fixed sensors. The hybrid system was developed for a surgical aid application, but its capabilities are such that it could be used in many other applications. The first sub-section below describes the experimental setup, including the sensor and display characteristics. Additional details of the system are described in [11]. In the second sub-section, the task itself is described. Finally, an analysis of registration accuracy is performed.

4.1 Description of Experimental AR System

The prototype augmented reality system incorporates an HMD mounted on a helmet (Figure 2). The helmet chassis upon which all components are affixed is simply a hard hat with an adjustable head strap. This works well as it allows the head strap of the helmet to be adjusted for each user while maintaining inter-camera and camera-to-display rigid calibrations. The optical see-through HMD is a commercially available product called i-glasses™, manufactured by Virtual i-o Corporation. A video signal (VGA format) is generated by a desktop PC and transmitted to the HMD through a cable tether. A stereo display electronics module separates the odd and even fields from the video image and displays only one field for each eye, thus allowing different images to be displayed for each eye. Shifting the graphics overlays presented to each eye provides three-dimensional overlay capabilities. The field-of-view of the HMD is 30 degrees in each eye. In Figure 2, the helmet is shown mounted on a bust. The bust incorporates a video camera at the location of the user’s left eye, so that we can record the actual view (with overlays) that would be seen by a user.

4.1.1 Head-mounted Sensor

Three micro-head CCD TV color cameras (Panasonic GP-KS162) are mounted on the helmet. The camera lenses have a nominal focal length of 7.5 mm, providing a nominal field of view of 44 degrees. For the work described in this paper, we only used one camera (the left camera). The NTSC-format video signal from the camera is transmitted to the PC through a cable tether. An optical target is affixed to the object of interest. For this work, we used a pattern of 5 green LED’s, in a rectangular planar configuration (Figure 3a). The distinctive geometric pattern of the LED’s enables the correspondence to be easily determined [11]. The video signal is digitized by an image processing board (Sharp GPB-1) in the PC to a resolution of 512x480 pixels, with 8 bits of gray scale intensity per pixel. This board also performs low-level image processing to extract the image locations of the LED targets. The noise in the 2D measured image point locations was assumed to be isotropic, with an estimated standard deviation of 0.5 pixels. Pose estimation is done by the PC, using a 2D-to-3D algorithm. The system uses a 100 MHz 80486 IBM compatible PC running QNX version 4.22. The throughput currently achieved with the system described above is approximately 120 ms per update iteration, or 8.3 Hz. This update rate is too slow for applications where dynamic registration is important, but it is sufficient to prove the feasibility of concepts and algorithms.

4.1.2 Fixed Sensor

An optical measurement sensor (Northern Digital Optotrak 3020) was fastened to one wall of the laboratory. The sensor consists of three linear array CCD cameras (Figure 4). The distance between the outermost two cameras is approximately 90 cm. The cameras are sensitive to infrared light. An optical target, consisting of a set of six infrared LED’s, is fastened to each object of interest (Figure 3b). A system controller module, attached to a Silicon Graphics Indigo2 workstation through a SCSI interface, sequentially illuminates each LED. For each LED target that is illuminated, the controller acquires the data from the cameras and calculates (via triangulation) the 3D location of the target with respect to the sensor. From the set of 3D point positions on a target body, the controller also calculates the pose of the body with respect to the sensor. The update rate of the system is dependent on the number of target points being tracked. For 18 target points, we measured an update rate of approximately 4 Hz.

Infrared LED’s were also placed on the helmet, to form an optical target. A set of six LED’s were mounted in a semi-circular ring around the front half of the helmet, as shown in Figure 2. Not all six LED’s could be simultaneously tracked by the sensor. In most typical head poses, only four LED’s were visible. For the purpose of the analyses in this paper, the noise in the measured 3D point locations was assumed to be isotropic, with an estimated standard deviation of 0.5 mm.

4.2 Description of Task

The hybrid augmented reality system was developed for a surgical aid application; specifically, total hip joint replacement. A hip implant consists of a femoral component and an acetabular component (Figure 5). The purpose of the augmented reality system is to guide the placement of the acetabular component. As part of this task, the system must track the acetabular implant and display a graphical overlay on the HMD that is registered to the implant. Optical targets were attached to the acetabular implant to enable sensor tracking, as shown in Figure 3. Separate LED targets were used for the head-mounted camera, and for the Optotrak sensor. The following is a description of the principal coordinate frames in the system:

1. HMD. This frame is centered at the point of projection of the left eyepiece of the display.

2. Implant. This frame is centered on the acetabular implant component. The goal of the system is to determine the implant-to-HMD pose. If this pose is known accurately, then a graphical overlay can be displayed that is accurately aligned to the implant. In the nominal configuration, the distance from the HMD to the implant was approximately 70 cm (i.e., arm’s length). The distance from the Optotrak to the implant was approximately 2.5 m.

3. HMD target. This is the optical target that is mounted on the helmet, consisting of six infrared LED’s. It is tracked by the Optotrak sensor. The transformation between the HMD target frame and the HMD frame is known.

4. Camera. This is the camera that is mounted on the helmet. The transformation between the camera and the HMD is known.

5. Implant target. This is the optical target that is attached to the implant, consisting of six infrared LED’s. It is tracked by the Optotrak sensor. The transformation between the implant target and the implant is known.

6. Camera target. This is the other optical target that is attached to the implant, consisting of five green LED’s. It is tracked by the head-mounted camera. The transformation between the camera target and the implant is known.

These coordinate frames are depicted schematically in Figure 6. Even though this figure shows all frames as coplanar, the transformations between frames are actually fully six-dimensional (i.e., three translational and three rotational components). To aid in visualizing these coordinate frames, a 3D graphical display system was developed using a Silicon Graphics computer and the “Open Inventor” software package. Figure 7 (a) shows a simplified representation of the coordinate frames on the head: the HMD, the HMD target, and the head-mounted camera. These coordinate frames are rigidly mounted with respect to each other on the helmet. The real helmet assembly is shown in Figure 2. Figure 7 (b) shows a simplified representation of the coordinate frames attached to the implant: the implant, the implant target, and the camera target. These coordinate frames are also rigidly mounted with respect to each other. The real implant assembly is shown in Figure 3. The coordinate axes of all frames are also shown. Figure 8 (a) shows the entire room scene, consisting of the fixed sensor on the back wall, the observer with the HMD, and the patient on the table with the hip implant. Figure 8 (b) shows a 3D visualization of the same scene.


4.3 Analysis of Registration Accuracy

The analysis methodology described earlier was applied to the experimental augmented reality system to estimate the accuracy of the derived implant-to-HMD pose. The simulation was implemented in the software application Mathematica. The processing consists of three main steps. First, an estimate of implant-to-HMD pose is derived using data obtained from the Optotrak (fixed) sensor alone. Second, an estimate of implant-to-HMD pose is derived using data obtained from the head-mounted camera alone. Finally, the two estimates are fused to produce a single, more accurate estimate. These steps are described in detail below.

4.3.1 Pose Estimation from Fixed Sensor

Using data from the fixed sensor (Optotrak), we estimated the pose of the HMD target (${}^{Optotrak}_{HmdTarg}H$) with respect to the sensor, using the 3D-to-3D algorithm described earlier. From the estimated error in each 3D point measurement (0.5 mm), the covariance matrix of the resulting pose was determined. Using the known pose of the HMD with respect to the HMD target (${}^{HmdTarg}_{Hmd}H$), the pose of the HMD with respect to the sensor was estimated, using the equation ${}^{Optotrak}_{Hmd}H = {}^{Optotrak}_{HmdTarg}H\;{}^{HmdTarg}_{Hmd}H$. The covariance matrix of the resulting pose was also estimated. The ellipsoids corresponding to the uncertainties in the translational components of the poses are shown in Figure 9. In all figures in this paper, the ellipsoids are drawn corresponding to a normalized distance of z=3; i.e., corresponding to a cumulative probability of 97%. However, during rendering the ellipsoids are scaled up by a factor of 10 in order to make them more easily visible. The major axis of the small ellipsoid in Figure 9 is actually 1.1 mm; that of the larger ellipsoid is 6.1 mm.

Next, the fixed sensor estimated the pose of the implant target (${}^{Optotrak}_{ImpTarg}H$) with respect to the sensor, along with the corresponding covariance matrix. Using the known pose of the implant with respect to the implant target (${}^{ImpTarg}_{Implant}H$), the pose of the implant with respect to the sensor was estimated, using ${}^{Optotrak}_{Implant}H = {}^{Optotrak}_{ImpTarg}H\;{}^{ImpTarg}_{Implant}H$. The covariance matrix of the resulting pose was also estimated. The ellipsoids corresponding to the uncertainties in the translational components of the poses are shown in Figure 10. The major axis of the small ellipsoid in Figure 10 is 1.1 mm; that of the larger ellipsoid is 3.7 mm.

Finally, the pose of the implant with respect to the HMD was estimated via ${}^{Hmd}_{Implant}H(opto) = {}^{Hmd}_{Optotrak}H\;{}^{Optotrak}_{Implant}H$. The covariance matrix of this pose was estimated using Equation 18. The corresponding ellipsoid is shown in Figure 11. Note that the shape of this ellipsoid is elongated in the plane perpendicular to the line of sight from the HMD to the implant, due to the orientation uncertainty in the HMD. The major axis of this ellipsoid is 27.4 mm.

4.3.2 Pose Estimation Using Head-Mounted Sensor

Using data from the head-mounted camera, we estimated the pose of the camera target (${}^{Camera}_{CamTarg}H$) with respect to the camera, using the 2D-to-3D algorithm described earlier. From the estimated error in each 2D point measurement (0.5 pixel), the covariance matrix of the resulting pose was determined. Then, using the known pose of the implant with respect to the camera target (${}^{CamTarg}_{Implant}H$), the pose of the implant with respect to the camera was estimated, via ${}^{Camera}_{Implant}H = {}^{Camera}_{CamTarg}H\;{}^{CamTarg}_{Implant}H$. The covariance matrix of the resulting pose was also estimated. The ellipsoids corresponding to the uncertainties in the translational components of the poses are shown in Figure 12.

Note the large uncertainty of ${}^{Camera}_{CamTarg}H$ along the line of sight to the camera, and very small uncertainty perpendicular to the line of sight. This is typical of poses that are estimated using the 2D-to-3D method. Intuitively, this may be explained as follows. A small translation of the object parallel to the image plane results in an easily measurable change in the image, meaning that the uncertainty of translation is small in this plane. However, a small translation of the object perpendicular to the image plane generates only a very small image displacement, meaning that the uncertainty of translation is large in this direction. The major axis of the ellipsoid corresponding to ${}^{Camera}_{CamTarg}H$ is 25.1 mm. The major axis of the ellipsoid corresponding to the derived pose, ${}^{Camera}_{Implant}H$, is 20.6 mm.

Next, the pose of the implant with respect to the HMD is estimated, via ${}^{Hmd}_{Implant}H(cam) = {}^{Hmd}_{Camera}H\;{}^{Camera}_{Implant}H$. The covariance matrix of this pose was estimated using Equation 18. The corresponding ellipsoid is shown in Figure 13. The major axis of this ellipsoid is 20.6 mm.

4.3.3 Fusion of Data from Fixed and Head-Mounted Sensors

The two pose estimates, which were derived from the fixed and head-mounted sensors, can now be fused. Using Equation 19, we produce a combined estimate of the implant-to-HMD pose, along with the covariance matrix. The ellipsoids corresponding to the three poses, ${}^{Hmd}_{Implant}H(opto)$, ${}^{Hmd}_{Implant}H(cam)$, and ${}^{Hmd}_{Implant}H(hybrid)$, are shown in Figure 14. Note that the large ellipsoids, corresponding to ${}^{Hmd}_{Implant}H(opto)$ and ${}^{Hmd}_{Implant}H(cam)$, are nearly orthogonal. The ellipsoid corresponding to the combined pose, ${}^{Hmd}_{Implant}H(hybrid)$, is much smaller and is contained within the intersection volume of the larger ellipsoids. Figure 15 is a wire-frame rendering of the ellipsoids, which allows the smaller interior ellipsoid to be seen. The major axis corresponding to the uncertainty of the combined pose is only 3.3 mm.

5 Discussion

This paper has developed a methodology to explicitly fuse sensor data from a combination of fixed and head-mounted sensors, in order to improve the registration of objects with respect to an HMD. The methodology was applied to an actual experimental augmented reality system. A typical configuration was analyzed and it was shown that the hybrid system produces a pose estimate that is significantly more accurate than that produced by either sensor acting alone. Using only the fixed sensor, the maximum translational error in any direction was 27.4 mm (corresponding to a 97% confidence interval). Using only the head-mounted sensor, the maximum translational error in any direction was 20.4 mm. By combining data from the two sensors, the maximum translational error was reduced to 3.3 mm. In order to fuse the pose estimates, the uncertainties are explicitly calculated, in the form of covariance matrices.

By visualizing the uncertainties as 3D ellipsoids, new insights can be gained. For example, given the measured pose of an object (such as an optical target), in many cases it is necessary to compute the pose of a second object attached to the first object. From the calculated ellipsoids, it is easy to see how the uncertainty in the position of the second object can grow as a result of the orientation uncertainty in the measured pose of the first object. This uncertainty grows with the distance between the first and second objects. Therefore, it is important to mount optical targets as close as possible to the object of interest.

Pose estimates produced from either sensor acting alone have uncertainties that are not isotropic. The uncertainty of the pose derived from the fixed sensor has its largest component perpendicular to the line of sight from the HMD. The uncertainty of the pose derived from the head-mounted sensor has its largest component along the line of sight from the HMD. This orthogonality greatly reduces the uncertainty of the fused pose estimate.

For future work, it would be interesting to look at the accuracy of the 2D image overlay, instead of the full 6D pose accuracy. From the analysis in this paper, one might expect that the fixed sensor would produce the worst overlay accuracy, due to the fact that it has the greatest translational uncertainty perpendicular to the line of sight. One might expect that the head-mounted sensor would have a more accurate overlay, and the hybrid system the best. Also, this paper has only considered quasi-static registration accuracy; that is, where objects are stationary when viewed, but can freely be moved. It would be useful to extend this work to study dynamic registration accuracy.

6 Acknowledgments

The support of Johnson & Johnson Professional, Inc. is gratefully acknowledged. Also, the author would like to thank Dr. Tyrone Vincent for many helpful discussions, and Khoi Nguyen for implementing many of the components of the experimental augmented reality system.


7 References

[1] R. T. Azuma, “A Survey of Augmented Reality,” Presence, Vol. 6, No. 4, pp. 355-385, 1997.

[2] K. Meyer, et al., “A Survey of Position Trackers,” Presence, Vol. 1, No. 2, pp. 173-200, 1992.

[3] R. Azuma and G. Bishop, “Improving static and dynamic registration in an optical see-through HMD,” Proc. of 21st International SIGGRAPH Conference, ACM, Orlando, FL, July 24-29, pp. 197-204, 1994.

[4] J.-F. Wang, R. Azuma, G. Bishop, V. Chi, J. Eyles, and H. Fuchs, “Tracking a Head-Mounted Display in a Room-Sized Environment with Head-Mounted Cameras,” Proc. of Helmet-Mounted Displays II, Vol. 1290, SPIE, Orlando, FL, April 19-20, pp. 47-57, 1990.

[5] D. Kim, S. W. Richards, and T. P. Caudell, “An optical tracker for augmented reality and wearable computers,” Proc. of IEEE 1997 Annual International Symposium on Virtual Reality, IEEE Computer Society Press, Albuquerque, NM, March 1-5, pp. 146-150, 1997.

[6] R. Haralick and L. Shapiro, Computer and Robot Vision, Addison-Wesley, 1993.

[7] J. Craig, Introduction to Robotics: Mechanics and Control, 2nd ed., Addison-Wesley, 1990.

[8] R. Jain, R. Kasturi, and B. G. Schunck, Machine Vision, New York, McGraw-Hill, 1995.

[9] H. L. Van Trees, Detection, Estimation, and Modulation Theory, New York, Wiley, 1968.

[10] J. L. Crowley, “Mathematical Foundations of Navigation and Perception for an Autonomous Mobile Robot,” in Reasoning with Uncertainty in Robotics, Lecture Notes in Artificial Intelligence, L. Dorst, M. van Lambalgen, and F. Voorbraak, Eds., New York, Springer-Verlag, pp. 9-51, 1996.

[11] W. A. Hoff, T. Lyon, and K. Nguyen, “Computer Vision-Based Registration Techniques for Augmented Reality,” Proc. of Intelligent Robots and Computer Vision XV, Vol. 2904, SPIE, Boston, MA, Nov. 19-21, pp. 538-548, 1996.


Figure 1 A depiction of the 3D uncertainty ellipsoids associated with the locations of coordinate frames. The uncertainty in the origin of frame A (as measured by sensor S) is shown by the small ellipsoid. The pose of B with respect to S is then derived. The uncertainty in the origin of frame B (with respect to S) is shown by the large ellipsoid.


Figure 2 The augmented reality system incorporates an HMD, cameras, and LED targets, all mounted on a helmet.


Figure 3 (a) A pattern of 5 green LED’s (indicated by arrows) forms the optical target for the head-mounted camera. (b) A pattern of 6 infrared LED’s (indicated by circles) forms an optical target for the Optotrak sensor. Both targets are mounted on a plastic box, which can then be attached to any object of interest; in this case, a hip implant.

Figure 4 A fixed optical sensor (Northern Digital Optotrak) is mounted on the wall. The baseline between the outer two cameras is approximately 90 cm.


Figure 5 The total hip joint replacement consists of an acetabular component (hemispherical socket, left) and a femoral component (long stem with ball, right).

[Figure 6 diagram: coordinate frames {Optotrak}, {HMD}, {HMD target}, {Camera}, {Implant}, {Implant target}, and {Camera target}, with a legend distinguishing known, measured, and derived transformations.]

Figure 6 The principal coordinate frames are shown, along with the transformations between them.



Figure 7 A 3D visualization of coordinate frames was developed. (a) The coordinate frames on the head are shown: The HMD (two circular disks); the HMD target (flat plate); and the head-mounted camera (cylinder). (b) The coordinate frames on the implant: The acetabular implant (sphere); the implant target (six dot pattern); and the camera target (five dot pattern).


Figure 8 A visualization of the entire scene, showing the fixed sensor on the wall, the HMD, and the object of interest, which is the hip implant. (a) The real scene. (b) A 3D visualization.


Figure 9 The fixed sensor estimates the pose of the HMD target and its covariance matrix (small ellipsoid). Using the known pose of the HMD with respect to the HMD target, the pose of the HMD with respect to the sensor is then estimated, along with its covariance matrix (large ellipsoid).


Figure 10 The fixed sensor estimates the pose of the implant target and its covariance matrix (small ellipsoid). Using the known pose of the implant with respect to the implant target, the pose of the implant with respect to the sensor is then estimated, along with its covariance matrix (large flattened ellipsoid).


Figure 11 Using data from the fixed sensor, the pose of the implant with respect to the HMD is derived, along with its covariance matrix (large ellipsoid).


Figure 12 Using the data from the head-mounted camera, the pose of the camera target with respect to the camera is estimated, along with its covariance matrix (long narrow ellipsoid). Using the known pose of the implant with respect to the camera target, the pose of the implant with respect to the camera is then estimated, along with its covariance matrix (wide ellipsoid to the left).


Figure 13 Using data from the head-mounted camera alone, the pose of the implant with respect to the HMD is computed, along with its covariance matrix (large ellipsoid).


Figure 14 This figure depicts the fusion of the data. Note that the ellipsoids from the fixed sensor and the head-mounted sensor are nearly orthogonal. The ellipsoid corresponding to the resulting pose estimate is much smaller and is contained in the volume of intersection.


Figure 15 This wire-frame rendering of the uncertainty ellipsoids allows the smaller (combined estimate) ellipsoid to be seen, which is contained in the intersection of the two larger ellipsoids.
