Camera based Relative Motion Tracking for Hand-held Virtual Reality

Jane Hwang, Gerard J. Kim, Namgyu Kim
Virtual Reality and Perceptive Media Laboratory
Dept. of Computer Science and Engineering
Pohang University of Science and Technology (POSTECH), Pohang, Korea
[email protected]

ABSTRACT

In this paper, we present vision based relative tracking for use in hand-held virtual reality systems. A hand-held virtual reality system refers to a virtual reality system that is self-contained, with a computer, multimodal sensors (e.g. 3D tracking, voice recognition, motion sensing), and feedback (e.g. graphic, sound, and tactile display) all holdable in one hand. Relative tracking refers to an approximate tracking of the hand-held VR system (thus, the user's hand or body) in relation to the environment. Such relative tracking of the user's body can be used for various interaction tasks and can induce a whole body experience, in contrast to simply using the "arrow" buttons. The vision based relative tracker uses one camera (without any environment markers or landmarks) to first select and capture distinct features from the incoming images and then calculate the approximate directions and amounts of their motion, including translation, rotation, and zoom. This in turn can be used to infer the motion of the user. Using such a tracker, the user can, for instance, navigate through a virtual environment seen through the hand-held VR display and select objects through physical metaphors. We also compare such a body based interface to the usual button-based one in terms of performance and preference through a usability study.

CR Categories and Subject Descriptors: Interaction, Interface, Vision based tracking, Usability

Additional Keywords: Hand held, Virtual Reality, Tracking, Navigation, Selection

1 INTRODUCTION: HAND-HELD VIRTUAL REALITY

Hand-held computers have recently experienced extreme growth in terms of their performance and capabilities. Hand-held computers refer to those that are small and light enough to be held in one hand, such as personal digital assistants (PDA), cell phones, and some of the recent game consoles (GBA or PSP¹). Today's hand-held computers are equipped with processors approaching 1 GHz performance, graphics accelerating chips, sound processing modules, and a slew of sensors (e.g. camera, gyros, light sensors) and displays (e.g. vibrators, auto-stereoscopic displays). We can now easily find kids, or even adults, in the streets, seemingly immersed in and enjoying the contents or services provided through such devices. From the point of view of traditional virtual reality (which has pursued the replication of rich sensory experience at large scale), it is somewhat odd that people can feel immersed in such "small" environments. However, it can be explained easily from the on-going debate about "form vs. content" [1] in user felt presence. The "content" proponent would say that this is one piece of evidence that "mind-catching" content (no matter how simplistic the interaction is) is sufficient to induce presence or the feeling of immersion.

We are interested in knowing whether, and how, it is possible to make these ever-advancing hand-held devices more "multi-modal," so that the user experience can be further enriched and perhaps their interaction performance even improved. Interaction on hand-held devices is still mostly button (and finger) based. Thus, one way to enrich the user experience is to provide whole body interaction for certain interactive tasks such as navigation and selection; in order to do so, there must exist a 3D tracking sensor appropriate for the hand-held virtual reality constraint (self-contained and not tied to any ground based system). In this paper, we present a vision based relative tracking method for this purpose. Relative tracking refers to an approximate tracking of the hand-held VR system (thus, the user's hand or body) in relation to the environment. The vision based relative tracker uses one camera (without any environment markers or landmarks) to first select and capture distinct features from the incoming images and then calculate the approximate directions and amounts of their motion, including translation, rotation, and zoom. This in turn can be used to infer the motion of the user. Using such a tracker, the user can, for instance, navigate through a virtual environment seen through the hand-held VR display and select objects through physical metaphors (see Figure 1). We also compare such a body based interface to the usual button-based one in terms of performance and preference through a usability study. Even though the tracking is only approximate (mostly due to hardware constraints such as the limited computing power, the use of a single camera, its resolution, etc.), the user is still able to interact quite naturally, relying on hand-eye coordination and quickly adapting to the small inconsistency between the scale of movement in the real and virtual worlds.

2 RELATED WORK

The possibilities for virtual reality with hand-held devices (palmtop computers) were first investigated in 1993 by Fitzmaurice et al. [2]. In that paper, the authors suggested several principles for display and interface design for 3D interaction in palmtop computer VR. Due to the limited technology at the time, the prototype was demonstrated and tested with a wired 6-DOF sensor and a display generated by a workstation. The concept of hand-held VR is somewhat similar to how the boom display or the WindowVR² works. The difference is that with hand-held VR everything (computing system, other displays, sensors) must be self-contained. Hand-held devices were perhaps first used in the context of virtual reality mostly as interaction devices.

¹ GBA and PSP are registered trademarks of the Nintendo and SONY corporations.
² WindowVR is a registered trademark of Virtual Research Systems, Inc. WindowVR is a boom-like display, with a large handled LCD display hanging from a fixture with motion sensors.

Watsen et al. used a PDA to interact in a virtual environment, but the interaction was mostly button or touch screen based and no tracking was used or necessary [3]. Kukimoto et al. also developed a similar PDA based interaction device for VR, but with a 6-DOF tracker attached to it. This way, they were able to demonstrate 3D interaction such as 3D drawing through the PDA (moving it and pressing the button or touch screen) [4]. Wagner developed a hand-held (PDA) augmented reality system [5] and applied it to the Signpost project (e.g. a smart space with augmentation for navigation and information browsing). The processing for the camera based recognition for marker-object binding was still shared with a separate server (communicating wirelessly), and an additional 6-DOF tracker was used for person tracking within the environment [6]. Paelke et al. presented a foot-based interaction for a soccer game on a mobile device using a camera [7]. Hachet et al. introduced a novel interaction method using a hand-held computer equipped with a camera [8][9]. In their system, the hand-held camera recognizes the movement and poses of a special marker held in the other hand. The user can interact in the virtual environment (seen through a separate large display in front of him) through the motion of the marker detected by the camera. Our system is similar to theirs except that we dynamically select and use features in the environment (without any predefined markers or features) and all of the interaction (display and sensing) occurs on the hand-held device. Researchers have long been interested in using the camera (or computer vision) as an interface for computers, particularly for 3D interaction [3][4]. Generally, robust tracking with as few environment constraints as possible is a hard problem, and much more so on hand-held devices, which lack the needed computational power [8][9]. However, the computational power of hand-held devices keeps increasing relative to their physical size and cost. In our case, we use a new SONY hand-held PC that is equipped with a 733 MHz Pentium processor, perhaps one of the most powerful hand-held computers, yet still short of a desktop computer.

Figure 1: Camera based interaction for hand-held VR. (a) Initial viewpoint; (b) rotating the view; (c) translation along the z axis (forward); (d) selecting and manipulating a virtual object.

3 RELATIVE MOTION TRACKING

Relative motion tracking refers to an approximate tracking of the hand-held VR system (thus, the user's hand or body) in relation to the environment. Our vision based relative tracker uses only one camera, without any environment markers or landmarks. It first selects and captures distinct or "good" features from the incoming images and calculates the approximate directions and amounts of their motion, including translation, rotation, and zoom. Even though the tracking is only approximate (mostly due to hardware constraints such as the limited computing power, the use of a single camera, its resolution, etc.), we believe that the user would still be able to interact quite naturally and without much difficulty, relying on hand-eye coordination and quickly adapting to the small inconsistency between the scale of movement in the real and virtual worlds. On the other hand, we expect such errors would be less tolerable in a hand-held AR system.

3.1 Feature finding

To make our hand-held VR system as self-contained as possible, we integrate a vision based motion tracker without any predefined marker in the environment. Instead, we attempt to select good, distinct features from the incoming images and use them for motion analysis. Generally, corners are regarded as "good" features. The matrix below is the spatial image gradient matrix for a pixel located at (x, y):

\[
\begin{bmatrix}
\sum_{\text{neighborhood}} \left(\dfrac{\partial I}{\partial x}\right)^{2} & \sum_{\text{neighborhood}} \dfrac{\partial^{2} I}{\partial x\,\partial y} \\[6pt]
\sum_{\text{neighborhood}} \dfrac{\partial^{2} I}{\partial x\,\partial y} & \sum_{\text{neighborhood}} \left(\dfrac{\partial I}{\partial y}\right)^{2}
\end{bmatrix}
\qquad I:\ \text{image intensity}
\]

The two eigenvalues of this matrix reflect the amounts of intensity variation in the respective directions. Thus we can use the smaller eigenvalue to represent the "corner-ness" of the given pixel (see Figure 2). We have used OpenCV, a public domain image processing library that provides a function to find the corner-ness of image pixels based on this principle in real time. In order to extract the motion parameters (see Section 3.3), we need at least 7 feature points; the accuracy improves with more corresponding feature points. If enough can be found, we take the 15 "best" corner features for the subsequent motion calculation. If there are only a few features with good enough corner-ness (this threshold is also set empirically), an approximated method is used (see Section 3.4).
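As an illustration, the sketch below shows how such a minimal-eigenvalue ("corner-ness") feature selection can be done with OpenCV in Python. The quality threshold and minimum spacing values are assumptions for illustration, not the exact settings of our implementation.

```python
import cv2
import numpy as np

def find_corner_features(gray_frame, max_corners=15, quality=0.01, min_distance=10):
    """Select up to `max_corners` "good" corner features from a grayscale frame.

    With useHarrisDetector=False, OpenCV ranks pixels by the smaller eigenvalue
    of the local spatial gradient matrix (the corner-ness measure described
    above).  The quality and spacing thresholds here are illustrative.
    """
    corners = cv2.goodFeaturesToTrack(
        gray_frame,
        maxCorners=max_corners,
        qualityLevel=quality,      # relative to the strongest corner response
        minDistance=min_distance,  # keep selected features spread apart
        useHarrisDetector=False,   # use the minimal-eigenvalue criterion
    )
    if corners is None:
        return np.empty((0, 2), dtype=np.float32)
    return corners.reshape(-1, 2)
```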

Figure 2: Forty "corner" feature points found using the spatial gradient matrix.

3.2 Feature tracking

Once "corner" features are identified and selected, the next problem is to track them, in other words, to find the corresponding feature in the next image frame. We define the problem as finding the displacement vector d that minimizes the residual function ε defined as follows:

\[
\varepsilon(\mathbf{d}) = \varepsilon(d_x, d_y) = \sum_{x = u_x - w_x}^{u_x + w_x} \; \sum_{y = u_y - w_y}^{u_y + w_y} \bigl(I(x, y) - J(x + d_x,\, y + d_y)\bigr)^{2}
\]

where I is the first image and J is the second image.

To solve this problem, we used a pyramidal implementation of the Lucas-Kanade feature tracker for matching the features between two sequential images [10][11]. This algorithm searches for the corresponding pixel in the target image by down-sampling the image and hierarchically searching for the corresponding feature (see Figure 3).

Figure 3: Motion field found with the pyramidal implementation of the Lucas-Kanade feature tracker.
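For reference, a minimal Python sketch of this tracking step with OpenCV's pyramidal Lucas-Kanade implementation follows; the window size and pyramid depth are illustrative assumptions rather than the exact values used in our system.

```python
import cv2
import numpy as np

def track_features(prev_gray, next_gray, prev_pts):
    """Track feature points from `prev_gray` to `next_gray` using the
    pyramidal Lucas-Kanade tracker.  Points whose status flag is 0 (no
    reliable correspondence found) are discarded, so the function returns
    matched (previous, next) point pairs only.
    """
    next_pts, status, _err = cv2.calcOpticalFlowPyrLK(
        prev_gray, next_gray,
        prev_pts.reshape(-1, 1, 2).astype(np.float32), None,
        winSize=(15, 15),  # per-level search window (assumed value)
        maxLevel=3,        # number of pyramid levels (assumed value)
    )
    good = status.reshape(-1) == 1
    return prev_pts.reshape(-1, 2)[good], next_pts.reshape(-1, 2)[good]
```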

3.3 Extracting motion parameters: General Case

Given the set of corresponding feature points between two consecutive image frames, the following formulation [15] is used to derive some of the motion parameters, namely two rotational and two translational degrees of freedom. The scale (translation in z) is inferred by a separate step explained in Section 3.4.2. Note that, given factors such as noise, camera distortion, errors in feature correspondence, approximations in the motion model, the real time requirement, etc., it is generally difficult to achieve highly accurate feature tracking, especially without any fiducials [12][13][14]. However, we believe that sufficient interaction performance can still be obtained with approximate tracking (through hand-eye coordination).

Given a set of corresponding feature points (at least seven), we can solve for the fundamental matrix F such that

\[
\mathbf{x}'^{\top} F\, \mathbf{x} = 0, \qquad \mathbf{x} = P X, \qquad \mathbf{x}' = P' X
\]

where x and x' are the corresponding image points (between two consecutive frames) of the 3D point X, and P and P' are the projection matrices. Thus, by setting P (of the previous image) to the identity, P' will contain the motion parameters relative to the previous pose of the camera. P' can be found from F using the SR decomposition and the singular value decomposition [15]:

\[
P' = [\,R \mid t\,] = [\,R(\omega, \phi, \kappa) \mid t\,] = [\,U W V^{\top} \mid U_3\,], \qquad
F = S R, \quad S = U Z U^{\top}, \quad R = U W V^{\top}, \quad Z = \mathrm{diag}(1, 1, 0)
\]

where U is constructed from a basis of the column space of F (U_3 denotes its third column), V is constructed from a basis of the null space of F, and W is a diagonal matrix.

ω: rotation around the x axis; φ: rotation around the y axis; κ: rotation around the z axis; R(ω, φ, κ): rotation matrix associated with the camera orientation.
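Purely for illustration, the following Python sketch recovers a relative rotation and a translation direction from the tracked correspondences using OpenCV. Note that it goes through the essential matrix and cv2.recoverPose, which assumes a known camera intrinsic matrix K; it is therefore a substitute for the direct decomposition of F described above, not the formulation we used, and the translation is recovered only up to scale.

```python
import cv2
import numpy as np

def estimate_relative_motion(pts_prev, pts_next, K):
    """Estimate relative camera rotation and translation direction from
    matched feature points.  Uses the essential matrix with known intrinsics
    K (an assumption of this sketch) rather than decomposing F directly.
    """
    E, inliers = cv2.findEssentialMat(
        pts_prev, pts_next, K, method=cv2.RANSAC, prob=0.999, threshold=1.0
    )
    _, R, t, _ = cv2.recoverPose(E, pts_prev, pts_next, K, mask=inliers)
    # For small inter-frame motions, the components of the rotation vector
    # approximate the rotations about the x, y, z axes (omega, phi, kappa).
    rvec, _ = cv2.Rodrigues(R)
    omega, phi, kappa = rvec.reshape(3)
    return (omega, phi, kappa), t.reshape(3)
```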

3.4 Special Case: Not enough feature points

3.4.1 Camera rotation with few tracked features

The above formulation works reasonably well with a sufficient number (>= 7) of corresponding feature points. However, fewer than that (< 7) may be found during relatively fast motion. In such a case, we use the approximation in Figure 4 to estimate the change in orientation as the average of the displacement vectors between corresponding features in the two images:

\[
\Delta\omega \approx FOV_y \times \frac{1}{n} \sum_{i=1}^{n} \frac{d_{y,i}}{height}, \qquad
\Delta\phi \approx FOV_x \times \frac{1}{n} \sum_{i=1}^{n} \frac{d_{x,i}}{width}
\]

As we have not found a way to estimate the roll angle or the translation component in this case, they are not used for interaction.

Figure 4: Estimation of the rotation around the x axis (Δω) and the y axis (Δφ) from the movement of the features in the viewing plane.
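A direct transcription of this averaging approximation into Python (the field-of-view angles are assumed to be given in radians; all names are illustrative):

```python
import numpy as np

def approximate_rotation(pts_prev, pts_next, width, height, fov_x, fov_y):
    """Approximate the change in orientation from the mean feature displacement.

    The mean vertical displacement, scaled by the vertical field of view,
    approximates the rotation about the x axis (delta_omega); likewise the
    mean horizontal displacement gives the rotation about the y axis
    (delta_phi), as in the formulas above.
    """
    d = np.asarray(pts_next, dtype=float) - np.asarray(pts_prev, dtype=float)
    delta_omega = fov_y * np.mean(d[:, 1]) / height  # rotation about x axis
    delta_phi = fov_x * np.mean(d[:, 0]) / width     # rotation about y axis
    return delta_omega, delta_phi
```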

3.4.2 Relative distance between two frames

It is more complex to estimate zooming information from the set of corresponding points, because feature movement in the image usually implies both rotation and translation, and because the rotational components affect the movement of the feature points in the images more strongly (making separation difficult). While it is possible to decompose the fundamental matrix (the matrix that relates two corresponding image points) to extract the scale, we have instead used the proximity between features as a measure of the zooming factor, because the proximity is affected little by rotational movement. While the connections between features (in-between-ness) do not change between the projected consecutive images, their inter-distances can change as the view moves (see Figure 5). We define the proximity between features in an image as follows:

\[
P = \frac{1}{n(n-1)} \sum_{j=1}^{n} \sum_{i=1}^{n} \sqrt{(u_i - u_j)^{2} + (v_i - v_j)^{2}}, \qquad (n > 1)
\]

The proximity is inversely proportional to the distance from the 3D feature points to the camera position. Therefore we can estimate the ratio between the z positions of the two camera poses. Let P_t be the proximity at time t and P_{t+1} the proximity at time t+1; then P_{t+1}/P_t represents the ratio of the feature distances. For example, if P_{t+1}/P_t = 2, the features have become twice as close to the camera. With this metric, we can estimate the relative distance between the frames along the z direction.

Figure 5: Connections between the features of the two images (with four tracked features).
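The proximity measure and the frame-to-frame zoom ratio can be computed as in the following sketch (plain NumPy; the function names are illustrative):

```python
import numpy as np

def proximity(points):
    """Mean pairwise image distance between tracked feature points (n > 1)."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    diffs = pts[:, None, :] - pts[None, :, :]   # all pairwise differences
    dists = np.sqrt((diffs ** 2).sum(axis=-1))  # n x n distance matrix
    return dists.sum() / (n * (n - 1))          # the zero diagonal adds nothing

def zoom_ratio(points_t, points_t1):
    """Ratio P_{t+1} / P_t used as the relative z-translation (zoom) estimate."""
    return proximity(points_t1) / proximity(points_t)
```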

4 INTERACTION USING RELATIVE MOTION TRACKING

4.1 Navigation

We have implemented two types of interfaces for virtual environment navigation using the relative motion tracking: gaze directed navigation and virtual steering. In gaze directed navigation, the relative camera orientation is added to the previous orientation and then converted to the gaze of the virtual viewpoint within the hand-held virtual environment. The forward and backward movement (i.e. velocity) is controlled with the approximated zoom scale. The user can physically wander (e.g. turn left, walk forward) in the real world, and a similar effect is seen in the virtual environment through the hand-held display.

In virtual steering, the relative orientation of the camera accumulates, and the amount of relative orientation from the initial orientation becomes the angular velocity of the rotating virtual camera. In Figure 6, (a) illustrates the posture of the hand-held device, and (b) and (c) the view directions (indicated with the arrow) in the virtual environment in gaze and steering mode, respectively. Note that the view direction in the gaze mode is the same (or qualitatively the same) as that of the real camera. In the steering mode, however, if the user rotates the hand-held device away from its initial orientation, the camera keeps rotating until the user returns it to the initial orientation. The steering mode is appropriate for controlling the viewpoint with small movements.
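The two navigation modes can be summarized by the per-frame update sketch below. The gain constants, time step, the zoom-to-velocity mapping, and the function names are illustrative assumptions, not values from our implementation.

```python
def update_gaze_mode(view_yaw, view_pitch, delta_phi, delta_omega, zoom,
                     speed_gain=1.0):
    """Gaze directed navigation: the relative orientation is accumulated
    directly into the virtual viewpoint, and the zoom factor drives the
    forward/backward velocity (zoom > 1 assumed to mean moving forward)."""
    view_yaw += delta_phi
    view_pitch += delta_omega
    forward_velocity = speed_gain * (zoom - 1.0)
    return view_yaw, view_pitch, forward_velocity

def update_steering_mode(view_yaw, view_pitch, yaw_offset, pitch_offset,
                         dt, rate_gain=1.0):
    """Virtual steering: the offset from the initial device orientation acts
    as an angular velocity, so the view keeps rotating until the device is
    returned to its initial orientation."""
    view_yaw += rate_gain * yaw_offset * dt
    view_pitch += rate_gain * pitch_offset * dt
    return view_yaw, view_pitch
```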

Figure 6: (a) Movement of the hand-held device; (b) camera control in the gaze mode; (c) camera control in the steering mode.

4.2 Selection and manipulation

For selection and manipulation, virtual ray casting [16] has been implemented. A virtual ray is shot from the center of the hand-held device in the direction normal to the hand-held display surface. Upon collision of the ray with a virtual object, the object is highlighted, and it can be selected by pressing a button on the hand-held device. Once selected, if the user holds down the manipulation button, the object sticks to the end of the ray and follows as the user moves and rotates the hand-held device, until the user releases the manipulation button. Interestingly, the image plane selection [17] and manipulation method is effectively equivalent to virtual ray casting here because the display is hand-held: if the hand-held device were used like a usual tracker in the hand, the user would not be able to see the display (see Figure 7).

Figure 7: Selection and manipulation with the hand-held device (virtual ray casting): 1. select; 2. rotate and translate; 3. release. The ray follows the normal vector from the hand-held display.
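A minimal geometric sketch of the ray-object test (objects approximated here by bounding spheres; all names and the sphere approximation are assumptions for illustration):

```python
import numpy as np

def pick_object(device_pos, display_normal, objects):
    """Return the nearest object hit by a ray shot from the device center
    along the display normal.  Each object is a dict with a bounding-sphere
    "center" (3-vector) and "radius"."""
    d = np.asarray(display_normal, dtype=float)
    d /= np.linalg.norm(d)
    best, best_t = None, np.inf
    for obj in objects:
        oc = np.asarray(obj["center"], dtype=float) - device_pos
        t_closest = float(np.dot(oc, d))        # closest approach along the ray
        if t_closest < 0:
            continue                            # object lies behind the device
        miss_sq = float(np.dot(oc, oc)) - t_closest ** 2
        if miss_sq <= obj["radius"] ** 2 and t_closest < best_t:
            best, best_t = obj, t_closest
    return best
```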

4.3 Implementation

The hand-held PC model VGN-U71P from SONY was used for this study. This hand-held PC has an Intel Pentium M 733 processor, 512 MB of main memory, and a 5-inch LCD display, and runs a desktop OS (Windows XP). An ordinary USB camera (QuickCam series by Logitech) was used, and OpenCV [18] and Coin3D [19] were used for image capture/processing and 3D graphics, respectively.

5 USER STUDY

In order to evaluate the effectiveness of the suggested interface, we carried out a small usability study comparing it with the conventional button based interface. With the button based interface, users can navigate or select and manipulate objects using four "arrow" buttons (for direction) and function buttons (for movement and selection) on the hand-held device. Thus there were two subject groups: one used the camera based relative motion tracking and the interfaces described in Section 4, and the other used the button based interface to carry out the experimental task. As for the experimental task, the test hand-held virtual environment contained 27 geometric objects (9 each of cylinders, spheres, and cones), and subjects were asked to locate and select specific objects in a certain order (e.g. cylinders, then spheres, then cones) with the given interface as fast as possible. In order to provide a spatial reference, we put a virtual house in the background (see Figure 1). Six participants volunteered for the test. They were all males, aged between 20 and 30, and had prior experience with 3D and hand-held games. Training sessions were held to have the participants familiarize themselves with the respective interfaces. In each treatment, the positions and colours of the objects were made different, and session and object type selection orders were counterbalanced (e.g. 3 subjects: camera first; 3 subjects: button first). In this small experiment, we measured the task completion time as a quantitative measure of the interfaces and asked about preference and usability using a subjective questionnaire.

5.1 Task performance

Figure 8 shows the results with regard to task performance. The time to accomplish the given task (selecting 27 objects in the order of cylinders, spheres, and cones) was measured. The camera based interface took less time in all of the tests: the average completion time with the camera interface was 2 min 37 sec, and with the button interface it was 3 min 52 sec (significantly different, p = 0.004, paired t-test).

Figure 8: Task completion time (sec) for each subject (S1-S6), camera based vs. button based interface.

The analysis of the subjective questions (asked on a 7-point Likert scale) revealed that the camera based interface was superior to the button based one in most categories: for example, ease of use (4.8 for camera based vs. 3 for button based, t = 2.8, p = 0.038) and ease of learning (5.5 for camera based vs. 3.5 for button based, t = 3.46, p = 0.018) (see Figure 9). The questions on "relative ease of use," "relative ease of learning," and preference exhibited similar results.

Figure 9: Results of the subjective questionnaire (mean scores): ease of use, 4.83 (camera) vs. 3 (button); ease of learning, 5.5 (camera) vs. 3.5 (button).

6 CONCLUSION AND FUTURE WORK

This paper has presented a camera based method for interacting in hand-held virtual environments. The method is based on selecting and tracking prominent corner features from the environment (without predefining them), and it generates relative orientation and zooming parameters. The generated orientation and zooming parameter values are converted to guide the position and orientation of the virtual camera in the hand-held virtual environment. Combined with buttons, the interface can also be used to select and manipulate virtual objects in the scene. The user study revealed that even with only approximate relative motion tracking, the interface exhibited at least a promising (considering the relatively low number of subjects) performance gain and preference over the standard button based interface. We believe that the performance gain (and preference) is due to the stimulation of the users' proprioceptive sense and its association with the virtual space. This association is particularly effective because of the relatively small visual display. There are many possible applications, such as browsing and interacting in relatively large virtual environments (e.g. games) or with large virtual objects (e.g. long hierarchical menus or large documents). Our system works reasonably well in a static environment but, due to the nature of the feature selection algorithm, not so well in very dynamic (fast moving) or plain (no corner features) environments (e.g. in a car with the camera facing outside). In addition, the zooming factor is only an approximation of the real world scale. We are continuing to improve the algorithm and would like to investigate how the inexact match between the amounts of movement in the real and virtual worlds affects user performance and presence. The user study described in this paper also lacks a sufficient number of subjects, and we are currently adding more subject data.

ACKNOWLEDGEMENT

This work has been supported by the ITRC program of the Ministry of Information and Communication of Korea.

REFERENCES

[1] M. Slater, "A Note on Presence Terminology," Presence-Connect, 3 (3).
[2] G. W. Fitzmaurice, S. Zhai, and M. H. Chignell, "Virtual Reality for Palmtop Computers," ACM Transactions on Information Systems, vol. 11, pp. 197-218, 1993.
[3] K. Watsen, R. P. Darken, and M. Capps, "A Handheld Computer as an Interaction Device to a Virtual Environment," Proceedings of the International Projection Technologies Workshop, 1999.
[4] N. Kukimoto, Y. Furusho, J. Nonaka, K. Koyamada, and M. Kanazawa, "PDA-based Visualization Control and Annotation Interface for Virtual Environment," Proceedings of the 3rd IASTED International Conference on Visualization, Imaging and Image Processing, 2003.
[5] D. Wagner and D. Schmalstieg, "First Steps Towards Handheld Augmented Reality," Proceedings of the 7th International Conference on Wearable Computers, 2003.
[6] D. Wagner, T. Pintaric, F. Ledermann, and D. Schmalstieg, "Towards Massively Multi-User Augmented Reality on Handheld Devices," Third International Conference on Pervasive Computing (Pervasive 2005), 2005.
[7] V. Paelke, C. Reimann, and D. Stichling, "Foot-based Mobile Interaction with Games," ACM SIGCHI International Conference on Advances in Computer Entertainment Technology (ACE), 2004.
[8] M. Hachet and Y. Kitamura, "3D Interaction With and From Handheld Computers," Proceedings of the IEEE VR 2005 Workshop: New Directions in 3D User Interfaces, 2005.
[9] M. Hachet, J. Pouderoux, and P. Guitton, "A Camera-Based Interface for Interaction with Mobile Handheld Computers," Proceedings of I3D'05 - ACM SIGGRAPH 2005 Symposium on Interactive 3D Graphics and Games, 2005.
[10] J. Shi and C. Tomasi, "Good Features to Track," IEEE Conference on Computer Vision and Pattern Recognition, Seattle, USA, 1994.
[11] J.-Y. Bouguet, "Pyramidal Implementation of the Lucas Kanade Feature Tracker: Description of the Algorithm," 2000.
[12] K. N. Kutulakos, "Calibration-Free Augmented Reality," IEEE Transactions on Visualization and Computer Graphics, vol. 4, 1998.
[13] M. I. A. Lourakis and A. A. Argyros, "Vision-based Camera Motion Recovery for Augmented Reality," Proceedings of Computer Graphics International, 2004.
[14] M. I. A. Lourakis and A. A. Argyros, "Efficient, Causal Camera Tracking in Unprepared Environments," Computer Vision and Image Understanding, vol. 99, 2005.
[15] R. M. Haralick and L. G. Shapiro, Computer and Robot Vision, Vol. II. Addison-Wesley, 1993.
[16] D. A. Bowman and L. F. Hodges, "An Evaluation of Techniques for Grabbing and Manipulating Remote Objects in Immersive Virtual Environments," Proceedings of the 1997 Symposium on Interactive 3D Graphics, Rhode Island, United States, 1997.
[17] J. S. Pierce, A. S. Forsberg, M. J. Conway, S. Hong, R. C. Zeleznik, and M. R. Mine, "Image Plane Interaction Techniques in 3D Immersive Environments," Proceedings of the 1997 Symposium on Interactive 3D Graphics, Rhode Island, United States, 1997.
[18] Open Source Computer Vision Library, http://www.intel.com/technology/computing/opencv
[19] Coin3D, http://www.coin3d.org
