3-D Live: Real Time Interaction for Mixed Reality

Simon Prince¹, Adrian David Cheok¹, Farzam Farbiz¹, Todd Williamson², Nik Johnson², Mark Billinghurst³, Hirokazu Kato⁴

¹ National University of Singapore, {elesp,eleadc,eleff}@nus.edu.sg
² Zaxel Systems, {toddw,nik}@zaxel.com
³ University of Washington, [email protected]
⁴ Hiroshima City University, [email protected]

ABSTRACT

We describe a real-time 3-D augmented reality video-conferencing system. With this technology, an observer sees the real world from his viewpoint, but modified so that the image of a remote collaborator is rendered into the scene. We register the image of the collaborator with the world by estimating the 3-D transformation between the camera and a fiducial marker. We describe a novel shape-from-silhouette algorithm, which generates the appropriate view of the collaborator and the associated depth map at 30 fps. When this view is superimposed upon the real world, it gives the strong impression that the collaborator is a real part of the scene. We also demonstrate interaction in virtual environments with a "live" fully 3-D collaborator. Finally, we consider interaction between users in the real world and collaborators in a virtual space, using a "tangible" AR interface.

Figure 1: Observers view the world via a head-mounted display (HMD) with a front-mounted camera. Our system detects markers in the scene and superimposes live video content rendered from the appropriate viewpoint in real time.

Keywords

Video-Conferencing, Augmented Reality, Image Based Rendering, Shape from Silhouette, Interaction

INTRODUCTION

Science fiction has presaged many of the great advances in computing and communication. In 2001: A Space Odyssey, Dr Floyd calls home using a videophone, an early on-screen appearance of 2-D video-conferencing. This technology is now commonplace. More recently, the Star Wars films depicted 3-D holographic communication. In this paper we apply computer graphics to create what may be the first real-time "holo-phone".

Existing conferencing technologies have a number of limitations. Audio-only conferencing removes visual cues vital for conversational turn-taking. This leads to increased interruptions and overlap [8], and difficulty in disambiguating between speakers and in determining willingness to interact [14]. Conventional 2-D video-conferencing improves matters, but large user movements and gestures cannot be captured [13], there are no spatial cues between participants [29], and participants cannot easily make eye contact [30]. Participants can only be viewed in front of a screen, and the number of participants is limited by monitor resolution. These limitations disrupt fidelity of communication [34] and turn taking [10], and increase interruptions and overlap [11]. Collaborative virtual environments restore spatial cues common in face-to-face conversation [4], but separate the user from the real world. Moreover, non-verbal communication is hard to convey using conventional avatars, resulting in reduced presence [29].

We define the "perfect video avatar" as one where the user cannot distinguish between a real human present in the scene and a remote collaborator. Perhaps closest to this goal of perfect tele-presence are the Office of the Future work [27], the Virtual Video Avatar of Ogi et al. [25], and the work of Mulligan and Daniilidis [23][24]. All of these systems use multiple cameras to construct a geometric model of the participant, and then use this model to generate the appropriate view for remote collaborators. Although impressive, these systems currently do not generate the whole 3-D model: one cannot move 360° around the virtual avatar. Moreover, since the output of these systems is mediated via projection screens, the display is not portable.

The goal of this paper is to present a solution to these problems by introducing an augmented reality (AR) video-conferencing system. Augmented reality refers to the real-time insertion of computer-generated three-dimensional content into a real scene (see [2], [3] for reviews). Typically, the observer views the world through a head-mounted display (HMD) with a camera attached to the front. The video is captured, modified and relayed to the observer in real time. Essentially, we create a live virtual video avatar and then use AR technology to display that avatar over the real world (see Figure 1). In addition to creating an extremely compelling sense of presence, this facilitates a wide range of teleconferencing and collaborative applications.

In the first part of the paper, we review previous work in AR conferencing. The enabling technology for our system is a novel method for generating arbitrary views of a collaborator at interactive speeds. In the second section, we sketch this algorithm and demonstrate a number of its advantages over competing technologies for real-time communication applications. In the third part of the paper, we introduce a number of other applications to which our system is particularly suited. These include visualisation of a collaborator in a virtual space, and a novel method for users in real space to interact with virtual collaborators, using tangible user interface techniques.

Prior Work

Billinghurst and Kato [6] first explored how AR could be used to support remote collaboration and to provide gaze and non-verbal communication cues. Users wore a lightweight HMD and were able to see a single remote user appear attached to a real card as a life-sized, live virtual video window. The overall effect was that the conference collaborator appeared projected into the local user's real workspace (see Figure 2). Since the cards are physical representations of remote participants, our collaborative interface can be viewed as a variant of Ishii's tangible interface metaphor [Ishii 97]. Users can arrange the cards about them in space to create a virtual spatial conferencing space, and the cards are also small enough to be easily carried, ensuring portability. The user is no longer tied to the desktop and can potentially conference from any location, so the remote collaborators become part of any real-world surroundings, potentially increasing the sense of social presence. There are a number of other significant differences between the AR conferencing interface and traditional desktop video conferencing. The remote user can appear as a life-sized image, and a potentially arbitrary number of remote users can be viewed at once. Since the virtual video windows can be placed about the user in space, spatial cues can be restored to the collaboration. Finally, the remote user's image is entirely virtual, so a real camera could be placed at the user's eye point, allowing support for natural gaze cues.

In a user study that compared AR conferencing to traditional audio and video conferencing, subjects reported a significantly higher sense of presence for the remote user in the AR conferencing condition, and found it easier to perceive non-verbal communication cues [6]. Indeed, the compelling nature of the AR conference was amply shown by one user who leaned in close to the monitor during the video conferencing condition, and moved back during the AR condition to give the virtual collaborator the same body space as in a face-to-face conversation. More recent work [7] has presented an AR conferencing interface that supports multiple remote users and applies alpha mapping techniques to extract video of the remote user from the background and create a more natural image (see Figure 2). In a user study with this interface, users felt that the AR condition provided significantly more co-presence and improved the understanding of the conversational relationship between participants.

3-D LIVE AUGMENTED REALITY

Overview

In this paper, we aim to insert a live image of a remote collaborator into the visual scene (see Figures 1 and 2). As the observer moves his head, this view of the collaborator changes appropriately. This results in the stable percept that the collaborator is three-dimensional and present in the space with the observer. In order to achieve this goal, we require that for each frame: (i) the pose of the head-mounted camera relative to the scene is estimated, (ii) the appropriate view of the collaborator is generated, and (iii) this view is rendered into the scene, possibly taking account of occlusions. We consider each of these problems in turn.

HMD Camera Pose Estimation

The scene was viewed through a Daeyang Cy-Visor DH-4400VP head-mounted display (HMD), which presented the same 640x480 pixel image to both eyes. A PremaCam SCM series color security camera was attached to the front of this HMD; it captures 25 images per second at a resolution of 640x480. We employ the marker tracking method of Kato and Billinghurst [18]. We simplify the pose estimation problem by inserting 2-D square black and white fiducial markers into the scene. Virtual content is associated with each marker.

Figure 2: Progression towards more natural augmented reality video-conferencing. Initial work by [6] associated 2-D images of single collaborators with card markers (left). Subsequent work increased the number of collaborators and introduced alpha-mapping to more realistically integrate the 2-D video stream into the world (centre). In this paper we introduce fully live 3-D video conferencing (right).


Since both the shape and pattern of these markers are known, it is easy to both locate the markers and calculate their position relative to the camera.

In brief, the camera image is thresholded and contiguous dark areas are identified using a connected-components algorithm. A contour-seeking technique identifies the outline of these regions. Contours that do not contain exactly four corners are discarded. We estimate the corner positions by fitting straight lines to each edge and determining the points of intersection. A projective transformation is used to map the enclosed region to a standard shape, which is then cross-correlated with stored patterns to establish the identity and orientation of the marker in the image. For a calibrated camera, the image positions of the marker corners uniquely identify the three-dimensional position and orientation of the marker in the world. This information is expressed as a Euclidean transformation matrix relating the camera and marker co-ordinate systems, and is used to render the appropriate view of the virtual content into the scene. Software for augmented reality marker tracking and calibration can be downloaded from [35].
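As a concrete illustration of this pipeline, the sketch below re-implements the main steps in Python with OpenCV. It is not the ARToolKit code referenced in [35]: the threshold value, 80x80 pattern size, acceptance score, marker size and corner ordering are all illustrative assumptions.

```python
# Illustrative sketch of the marker detection and pose estimation pipeline
# described above, written with OpenCV; the real system uses ARToolKit [35].
import cv2
import numpy as np

def detect_markers(gray, templates, thresh=100, patch=80):
    """Return (marker_id, corners) pairs found in a grayscale frame.
    templates maps marker ids to 80x80 uint8 reference patterns."""
    _, binary = cv2.threshold(gray, thresh, 255, cv2.THRESH_BINARY_INV)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    found = []
    for c in contours:
        # Keep only contours that are well approximated by four corners.
        quad = cv2.approxPolyDP(c, 0.02 * cv2.arcLength(c, True), True)
        if len(quad) != 4 or cv2.contourArea(quad) < 400:
            continue
        corners = quad.reshape(4, 2).astype(np.float32)
        # Map the enclosed region to a standard square ("unwarp" it).
        square = np.float32([[0, 0], [patch, 0], [patch, patch], [0, patch]])
        H = cv2.getPerspectiveTransform(corners, square)
        unwarped = cv2.warpPerspective(gray, H, (patch, patch))
        # Cross-correlate with stored patterns (4 rotations each) to find
        # the marker identity and orientation; ordering/mirroring handling
        # is simplified here.
        best_id, best_score = None, -1.0
        for marker_id, tmpl in templates.items():
            for k in range(4):
                rotated = np.ascontiguousarray(np.rot90(tmpl, k))
                score = cv2.matchTemplate(unwarped, rotated,
                                          cv2.TM_CCOEFF_NORMED)[0, 0]
                if score > best_score:
                    best_id, best_score = marker_id, score
        if best_score > 0.7:
            found.append((best_id, corners))
    return found

def marker_pose(corners, K, dist_coeffs, size=0.08):
    """Euclidean transform relating marker and camera co-ordinate systems,
    assuming the image corners are ordered to match object_pts below."""
    s = size / 2.0
    object_pts = np.float32([[-s, s, 0], [s, s, 0], [s, -s, 0], [-s, -s, 0]])
    _, rvec, tvec = cv2.solvePnP(object_pts, corners, K, dist_coeffs)
    T = np.eye(4)
    T[:3, :3], _ = cv2.Rodrigues(rvec)
    T[:3, 3] = tvec.ravel()
    return T  # 4x4 transform: marker frame -> camera frame
```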

Virtual Viewpoint Generation

Background

In order to integrate the virtual collaborator seamlessly into the real world, we need to generate the appropriate view for each video frame. To achieve this goal, we must generate a model of the 3-D shape of the collaborator on each frame. A novel view can easily be constructed given the shape and several known viewpoints. One approach is to gather depth information using stereo vision. Stereo reconstruction can now be achieved at interactive speeds [17][23][24]. However, the resulting dense depth map is not robust, and no existing system places cameras all around the subject. Related image-based rendering techniques [28][1] do not explicitly calculate depth, but still require dense matches between images, and are similarly prone to error.

A more attractive approach to fast 3-D model construction, which has been employed by [21] and [22], is to use shape-from-silhouette information. A number of cameras are placed around the subject. Each pixel in each camera is classified as either belonging to the subject (foreground) or the background. The resulting foreground mask is called a "silhouette". Each pixel in each camera collects light over a (very narrow) rectangular-based pyramid in 3D space, where the vertex of the pyramid is at the focal point of the camera and the pyramid extends infinitely away from this. For background pixels, this space can be assumed to be unoccupied. Shape-from-silhouette algorithms work by initially assuming that space is completely occupied, and using each background pixel from each camera to carve away pieces of the space to leave a representation of the foreground object. Clearly, the reconstructed model will improve with the addition of more cameras. However, the resulting depth reconstruction may not capture the true shape of the object, even with infinite cameras. The best possible reconstructed shape is termed the "visual hull" [20].

Despite this limitation, shape-from-silhouette has three significant advantages over competing technologies. First, it is more robust than stereo vision. Even if background pixels are misclassified as part of the object in one image, other silhouettes are likely to carve away the offending misclassified space. Second, it is significantly faster than either stereo, which requires vast computation to calculate cross-correlation, or laser range scanners, which generally have a slow update rate. Third, the technology is inexpensive and requires no specialized hardware. For these reasons, the system described in this paper is based on shape-from-silhouette information. We believe that this is the first system that is capable of capturing 3D models and textures from a large number of cameras (15) and displaying them from an arbitrary viewpoint at 25 fps (the capture camera frame rate). To the best of our knowledge, the closest comparable system employs only 5 cameras, and model quality suffers accordingly.
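To make the carving idea concrete, the following minimal sketch tests occupancy over a regular voxel grid. This illustrates the visual-hull concept only, with an assumed camera model and grid resolution; the system described in the following subsections instead evaluates occupancy per virtual pixel, without building a global volumetric model.

```python
# Minimal sketch of shape-from-silhouette carving over a voxel grid.
import numpy as np

def carve_visual_hull(silhouettes, projections, bounds, res=64):
    """silhouettes: list of HxW boolean foreground masks.
    projections: list of 3x4 camera projection matrices.
    bounds: ((xmin, xmax), (ymin, ymax), (zmin, zmax)) of the capture volume."""
    axes = [np.linspace(lo, hi, res) for lo, hi in bounds]
    X, Y, Z = np.meshgrid(*axes, indexing="ij")
    pts = np.stack([X, Y, Z, np.ones_like(X)], axis=-1).reshape(-1, 4)  # Nx4
    occupied = np.ones(len(pts), dtype=bool)     # start with space fully occupied
    for mask, P in zip(silhouettes, projections):
        h, w = mask.shape
        proj = pts @ P.T                         # project every voxel centre
        z = proj[:, 2]
        valid = z > 0
        u = np.zeros(len(pts)); v = np.zeros(len(pts))
        u[valid] = proj[valid, 0] / z[valid]
        v[valid] = proj[valid, 1] / z[valid]
        inside = valid & (u >= 0) & (u < w) & (v >= 0) & (v < h)
        fg = np.zeros(len(pts), dtype=bool)
        fg[inside] = mask[v[inside].astype(int), u[inside].astype(int)]
        # A voxel survives only if every camera sees it as foreground.
        occupied &= fg
    return occupied.reshape(res, res, res)
```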

Algorithm Overview

Given any standard 4x4 projection matrix representing the desired virtual camera, the center of each pixel of the virtual image is associated with a ray in space that starts at the camera center and extends outward. Any given distance along this ray corresponds to a point in 3D space. In order to determine what color to assign to a particular virtual pixel, we need to know the first (closest) potentially occupied point along this ray. This 3D point can be projected back into each of the real cameras to obtain samples of the color at that location. These samples are then combined to produce the final virtual pixel color. Thus the algorithm performs three operations at each virtual pixel:
• Determine the depth of the virtual pixel as seen by the virtual camera.
• Find the corresponding pixels in nearby real images.
• Determine the pixel color based on all these measurements.

Figure 3. Virtual viewpoint generation by shape from silhouette. Points which project into the background in any camera are rejected. The points from A to C have already been processed and project to background in both images, so are marked as unoccupied (magenta). The points yet to be processed are marked in yellow. Point D is in the background in the silhouette from camera 2, so it will be marked as unoccupied and the search will proceed outward along the line.

Determining Pixel Depth

The depth of each virtual pixel is determined by an explicit search. The search starts at the virtual camera projection center and proceeds outward along the ray corresponding to the pixel center (see Figure 3). Each candidate 3D point along this ray is evaluated for potential occupancy. A candidate point is unoccupied if its projection into any of the silhouettes is marked as background. When a point is found for which all of the silhouettes are marked as foreground, the point is considered potentially occupied, and the search stops.

To constrain the search for each virtual pixel, the corresponding ray is intersected with the boundaries of each real image. We project the ray into each image to form the corresponding epipolar line. The points where these epipolar lines meet the image boundaries are found, and these boundary points are projected back onto the ray. The intersections of these regions on the ray define a reduced search space. If the search reaches the furthest limit of this region without finding any potentially occupied point, the virtual pixel is marked as background.
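A minimal sketch of this search is given below, assuming that each real camera is represented by its silhouette and 3x4 projection matrix, and that the search interval [t_near, t_far] has already been clipped against the image boundaries as described above; the step size is an arbitrary illustrative choice.

```python
# Sketch of the explicit depth search along one virtual-pixel ray.
import numpy as np

def potentially_occupied(point, cameras):
    """A 3D point survives only if it projects to foreground in every
    silhouette; cameras is a list of (silhouette, P) pairs."""
    p = np.append(point, 1.0)
    for silhouette, P in cameras:
        x = P @ p
        if x[2] <= 0:
            return False
        u, v = int(x[0] / x[2]), int(x[1] / x[2])
        h, w = silhouette.shape
        if not (0 <= u < w and 0 <= v < h) or not silhouette[v, u]:
            return False          # background in this camera: point is carved away
    return True

def find_depth(center, ray, cameras, t_near, t_far, step=0.01):
    """March outward from the virtual camera centre along the ray; return the
    first potentially occupied depth, or None if the pixel is background."""
    t = t_near
    while t <= t_far:
        if potentially_occupied(center + t * ray, cameras):
            return t
        t += step
    return None
```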

Determining Pixel Color

In general, we would prefer to establish the pixel color based on data from cameras that are closely aligned with the desired novel view. We rank the cameras based on their proximity to the desired viewpoint and choose the three closest. We then compute where the 3D point lies in each candidate camera's image. Unfortunately, the real camera does not necessarily see this point in space: another object may lie between the real camera and the point. If the real pixel is occluded in this way, it cannot contribute its color to the virtual pixel. We therefore repeat the depth search algorithm on the corresponding pixel of the real camera. If the recovered depth lies close enough in space to the 3D point computed for the virtual camera pixel, we assume the real camera pixel is not occluded, and the color of this real pixel is allowed to contribute to the color of the virtual pixel. In practice, we increase system speed by immediately accepting points that are geometrically certain not to be occluded. We take a weighted average of the pixels from the non-occluded cameras, such that the closest camera is given the most weight.
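The sketch below illustrates this colour computation: it ranks the real cameras by proximity to the virtual viewpoint, rejects occluded samples by re-running the occupancy search along each real camera's ray, and blends the survivors with distance-based weights. The Camera container, tolerance, step size and weighting function are assumptions for illustration, not the published implementation.

```python
# Sketch of the pixel-colour step described above.
from dataclasses import dataclass
import numpy as np

@dataclass
class Camera:
    image: np.ndarray       # HxWx3 colour frame
    silhouette: np.ndarray  # HxW boolean foreground mask
    P: np.ndarray           # 3x4 projection matrix
    center: np.ndarray      # optical centre (3-vector)

def in_all_silhouettes(point, cams):
    """Potential occupancy test: foreground in every camera."""
    for cam in cams:
        x = cam.P @ np.append(point, 1.0)
        if x[2] <= 0:
            return False
        u, v = int(x[0] / x[2]), int(x[1] / x[2])
        h, w = cam.silhouette.shape
        if not (0 <= u < w and 0 <= v < h) or not cam.silhouette[v, u]:
            return False
    return True

def occluded(point, cam, cams, tol=0.02, step=0.01):
    """True if a potentially occupied point lies between `cam` and `point`."""
    to_point = point - cam.center
    dist = np.linalg.norm(to_point)
    direction = to_point / dist
    t = step
    while t < dist - tol:
        if in_all_silhouettes(cam.center + t * direction, cams):
            return True
        t += step
    return False

def virtual_pixel_color(point, virtual_center, cams):
    ranked = sorted(cams, key=lambda c: np.linalg.norm(c.center - virtual_center))
    samples, weights = [], []
    for cam in ranked[:3]:                        # three closest real cameras
        x = cam.P @ np.append(point, 1.0)
        if x[2] <= 0:
            continue
        u, v = int(x[0] / x[2]), int(x[1] / x[2])
        h, w = cam.silhouette.shape
        if not (0 <= u < w and 0 <= v < h):
            continue
        if occluded(point, cam, cams):            # something blocks this view
            continue
        samples.append(cam.image[v, u].astype(float))
        weights.append(1.0 / (np.linalg.norm(cam.center - virtual_center) + 1e-6))
    if not samples:
        return None
    wts = np.array(weights) / np.sum(weights)
    return wts @ np.array(samples)                # distance-weighted average colour
```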

System Hardware and Software

Fourteen Sony DCX-390 video cameras were equally spaced around the subject, and one viewed him/her from above (see Figure 4). Five dual 1 GHz Pentium III video-capture machines received data from three cameras each. The video-capture machines pre-process the video frames by determining the silhouettes and pass these to the rendering server via gigabit Ethernet links. The rendering server was based on a 1.7 GHz Pentium IV Xeon processor. The characteristics of our algorithm allow us to generate very high quality models based on 15 cameras very quickly. The figures in this paper were generated at 384x288 resolution at 25 fps with low latency. The cameras are fitted with wide-angle lenses; with this lens, it is possible to capture a space that is 2.5 m high and 3.3 m in diameter with cameras that are 1.25 meters away.

Our full system combines the virtual viewpoint and augmented reality software (see Figure 5). For each frame, the augmented reality system identifies the transformation matrix relating marker and camera positions. This is passed to the virtual viewpoint server, together with the estimated camera calibration matrix. The server responds by returning an RGBA image, and a depth estimate associated with each pixel. This simulated view of the remote collaborator is then superimposed on the original image and displayed to the user. In order to increase the system speed, we introduce a single frame delay into the presentation of the augmented reality video. Hence, the augmented reality system starts processing the next frame while the virtual view server generates the view for the previous one. A swap then occurs: the graphics are returned to the augmented reality system for display, and the new transformation matrix is sent to the virtual view renderer. The delay ensures that neither machine wastes significant processing time waiting for the other, and a high throughput is maintained. In practice, this means that there is no noticeable delay for users: when they move their head, the view of the collaborator appears to move simultaneously. One advantage of our client-server system is that the network requirements are relatively low. Since for each frame only one image is required, the system is no more demanding of network bandwidth than a conventional video stream.
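The single-frame-delay pipelining can be sketched as follows. All of the callables passed in (capture_frame, estimate_marker_pose, send_pose, recv_view, display) are hypothetical stand-ins for the real capture, tracking and networking code, images are assumed to be numpy-style arrays, and occlusion handling with the returned depth map is omitted.

```python
# Sketch of the client-side pipelining: the AR client requests the view for
# the current frame while compositing the view the server rendered for the
# previous one, so neither machine idles waiting for the other.
def ar_client_loop(capture_frame, estimate_marker_pose, send_pose, recv_view, display):
    pending_frame = None                      # frame whose pose was sent last iteration
    while True:
        frame = capture_frame()               # image from the HMD camera
        pose = estimate_marker_pose(frame)    # camera-from-marker transform, or None
        if pose is not None:
            send_pose(pose)                   # ask the server for this viewpoint
        if pending_frame is not None:
            rgba, _depth = recv_view()        # view rendered for the previous frame
            display(composite(pending_frame, rgba))
        else:
            display(frame)                    # no marker yet: show raw video
        pending_frame = frame if pose is not None else None

def composite(frame, rgba):
    """Overlay the rendered collaborator wherever its alpha is non-zero."""
    out = frame.copy()
    mask = rgba[..., 3] > 0
    out[mask] = rgba[..., :3][mask]
    return out
```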

Comparison with other methods

Our system is similar in spirit to the work of Matusik et al. [22], who also present an image-based novel view generation algorithm using silhouette information. The principal difference is that Matusik et al. generate the entire visual hull from the current camera angle, whereas we generate only the visible part. Lok [21] has proposed an alternative volume-based approach to reconstruction. Both of the above systems scale linearly with the number of cameras added. Our system scales much more slowly in practice, since the pixel color estimation (which takes the bulk of the rendering time) uses only a fixed number of camera images.

3-D INTERACTION FOR MIXED REALITY
